# Retrieval Augmented Generation for Wikipedia using LangChain

In [2]:
import dotenv
import os

from langchain.retrievers import WikipediaRetriever
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

## Setting up Wikipedia Retriever

In [3]:
retriever = WikipediaRetriever()

# Test the retriever: fetch the Wikipedia pages matching a query and inspect the second hit
docs = retriever.get_relevant_documents(query="ASCAT")
docs[1].metadata

{'title': 'Scatterometer',
 'summary': 'A scatterometer or diffusionmeter is a scientific instrument to measure the return of a beam of light or radar waves scattered by diffusion in a medium such as air. Diffusionmeters using visible light are found in airports or along roads to measure horizontal visibility. Radar scatterometers use radio or microwaves to determine the normalized radar cross section (σ0, "sigma zero" or "sigma naught") of a surface. They are often mounted on weather satellites to find wind speed and direction, and are used in industries to analyze the roughness of surfaces.',
 'source': 'https://en.wikipedia.org/wiki/Scatterometer'}

In [4]:
docs[1].page_content[:100]

'A scatterometer or diffusionmeter is a scientific instrument to measure the return of a beam of ligh'
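
Indexing `docs[1]` directly assumes the query returned at least two pages. As a minimal sketch of a safer inspection step, the helper below loops over whatever came back and returns the title, source, and a short snippet for each document. The `Doc` class and the `preview` function are hypothetical stand-ins introduced only for illustration (they mirror the `page_content` and `metadata` attributes used above), and the sample data is copied from the output above so the block runs without network access.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(docs, n_chars=100):
    """Return one (title, source, snippet) tuple per document."""
    return [
        (
            d.metadata.get("title", "<untitled>"),
            d.metadata.get("source", ""),
            d.page_content[:n_chars],
        )
        for d in docs
    ]


# Sample data copied from the retriever output above; a real run would pass `docs`.
sample = [
    Doc(
        page_content="A scatterometer or diffusionmeter is a scientific instrument",
        metadata={
            "title": "Scatterometer",
            "source": "https://en.wikipedia.org/wiki/Scatterometer",
        },
    )
]

for title, source, snippet in preview(sample, n_chars=30):
    print(f"{title} ({source}): {snippet}")
```

Because the helper iterates instead of indexing, an empty result list simply produces an empty preview rather than an `IndexError`.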

## Initializing LLM with Retriever

In [5]:
dotenv.load_dotenv()  # read OPENAI_API_KEY from a local .env file into os.environ, if the file exists
model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
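
To make the role of `chat_history` concrete, here is a toy sketch of the flow a conversational retrieval chain follows: fold the earlier turns into the new question, retrieve with the combined text, then answer from the best hit. This is not LangChain's implementation. In the real chain an LLM rewrites the question and writes the answer, whereas the hypothetical `condense`, `toy_retrieve`, and `toy_answer` functions below use plain keyword overlap over a two-sentence made-up corpus.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def condense(question, chat_history):
    """Fold earlier (question, answer) turns into the new question."""
    earlier = " ".join(f"{q} {a}" for q, a in chat_history)
    return f"{earlier} {question}".strip()


def toy_retrieve(query, corpus):
    """Return corpus texts sharing at least one word with the query, best first."""
    query_words = tokens(query)
    scored = [(len(query_words & tokens(text)), text) for text in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked if score > 0]


def toy_answer(question, chat_history, corpus):
    """Condense, retrieve, then 'answer' with the top hit (an LLM would write prose)."""
    hits = toy_retrieve(condense(question, chat_history), corpus)
    return hits[0] if hits else "I don't know."


# Made-up corpus for illustration only.
corpus = [
    "MetOp satellites fly in a polar orbit.",
    "Scatterometers measure radar backscatter.",
]
history = [("What is MetOp?", toy_answer("What is MetOp?", [], corpus))]
follow_up = "Which instruments does it carry?"

print(toy_answer(follow_up, [], corpus))       # no history: nothing matches "it"
print(toy_answer(follow_up, history, corpus))  # history supplies the word "MetOp"
```

The follow-up contains no searchable entity on its own, so only the call that receives the history finds a document. That is the reason the notebook appends every `(question, answer)` pair to `chat_history` before asking the next question.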

### Testing Chain

In [6]:
questions = [
    "What is Apify?",
    "When the Monument to the Martyrs of the 1830 Revolution was created?",
    "What is the Abhayagiri Vihāra?",
]
chat_history = []

for question in questions:
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is Apify? 

**Answer**: Apify is a web scraping and automation platform. It allows users to extract data from websites, automate workflows, and create custom web scrapers using a simple and intuitive interface. It provides tools and resources for developers to build, run, and scale web scraping tasks, as well as a marketplace for ready-to-use scraping solutions. 

-> **Question**: When the Monument to the Martyrs of the 1830 Revolution was created? 

**Answer**: I'm sorry, but I don't have any information about a specific monument called the "Monument to the Martyrs of the 1830 Revolution." It's possible that you may be referring to a different monument or historical event. 

-> **Question**: What is the Abhayagiri Vihāra? 

**Answer**: Abhayagiri Vihāra was a major monastery site of Theravada, Mahayana, and Vajrayana Buddhism located in Anuradhapura, Sri Lanka. It was one of the most sacred Buddhist pilgrimage cities in the country and a significant monastic cent
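
The question loop is repeated verbatim in each test cell of this notebook. One possible refactor, sketched below, wraps it in a helper that takes any callable mapping `{"question", "chat_history"}` to a dict with an `"answer"` key. The names `ask_all` and `echo_chain` are hypothetical; `echo_chain` is a stand-in so the block runs without an API key, and in the notebook the chain itself would be passed instead.

```python
def ask_all(chain, questions):
    """Ask each question in order, feeding earlier turns back as chat history."""
    chat_history = []
    for question in questions:
        # Pass a copy so the chain cannot mutate the running history.
        result = chain({"question": question, "chat_history": list(chat_history)})
        chat_history.append((question, result["answer"]))
    return chat_history


def echo_chain(inputs):
    """Hypothetical stand-in for the chain: reports how many earlier turns it saw."""
    return {"answer": f"{len(inputs['chat_history'])} earlier turn(s) seen"}


questions = ["What is Apify?", "What is the Abhayagiri Vihāra?"]
for question, answer in ask_all(echo_chain, questions):
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {answer} \n")
```

Returning the history rather than printing inside the helper keeps the formatting choice in the calling cell, so each test section can reuse the same function.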

### Testing Chain on Domain Specific Knowledge

In [7]:
questions = [
    "What is MetOp?",
    "What instruments are on MetOp-A?"
]
chat_history = []

for question in questions:
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is MetOp? 

**Answer**: MetOp is a satellite mission operated by the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT). It stands for "Meteorological Operational." The MetOp satellites are polar-orbiting satellites that provide detailed observations of atmospheric temperature and moisture profiles. These observations are crucial for numerical weather prediction and climate monitoring. The MetOp mission consists of a series of three satellites, MetOp-A, MetOp-B, and MetOp-C, which are flown successively for more than 14 years each. These satellites gather essential data for weather forecasting and climate studies. 

-> **Question**: What instruments are on MetOp-A? 

**Answer**: The following instruments are flown exclusively on the MetOp-A satellite:

- IASI – Infrared Atmospheric Sounding Interferometer
- GRAS – Global Navigation Satellite System Receiver for Atmospheric Sounding
- ASCAT – Advanced SCATterometer
- GOME-2 – Global 

### Testing Chain on Math Questions
Wikipedia has many good math pages. Writing the notation in LaTeX-style markdown (for example `$\Sigma$`) can also make a symbol easier to look up this way than through a Google search.

In [8]:
questions = [
    r"What is the symbol $\Sigma$ mean?",  # raw string keeps the LaTeX backslash literal
    "What is the formula for Multivariate Normal Function?",
    r"What does the symbol $\mathbb{I}(x=0)$ mean?"
]
chat_history = []

for question in questions:
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the symbol $\Sigma$ mean? 

**Answer**: The symbol $\Sigma$ represents the mathematical notation for summation. It is used to indicate that a series of terms should be added together. The uppercase Greek letter Sigma is often used to represent this concept in mathematics and statistics. 

-> **Question**: What is the formula for Multivariate Normal Function? 

**Answer**: The multivariate normal distribution is a generalization of the normal distribution to higher dimensions. It is defined by a mean vector μ and a covariance matrix Σ.

The probability density function (PDF) of the multivariate normal distribution is given by:

f(x) = (1 / (sqrt((2π)^k * det(Σ)))) * exp(-0.5 * (x - μ)^T * Σ^(-1) * (x - μ))

where:
- f(x) is the PDF of the multivariate normal distribution
- x is a k-dimensional vector representing a data point
- μ is a k-dimensional vector representing the mean of the distribution
- Σ is a k x k covariance matrix representing the variability of t
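
The density formula quoted in the answer above can be checked numerically. The sketch below evaluates it for the two-dimensional case (k = 2) in plain Python, using the closed-form inverse and determinant of a 2x2 matrix. The function name `mvn_pdf_2d` is a hypothetical helper introduced for illustration, and the code only checks that the determinant is positive rather than fully validating the covariance matrix.

```python
import math


def mvn_pdf_2d(x, mu, sigma):
    """Density of a 2-D normal at x, with mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via the 2x2 inverse
    # Sigma^{-1} = (1 / det) * [[d, -b], [-c, a]].
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    k = 2  # dimension of x
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** k * det)


identity = [[1.0, 0.0], [0.0, 1.0]]
# At the mean with identity covariance the density reduces to 1 / (2 * pi).
print(mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], identity))
print(1 / (2 * math.pi))
```

Both printed values should agree, which matches the formula: at x = μ the exponent is zero, and with Σ equal to the identity the normalizing constant is sqrt((2π)^2) = 2π.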