# Retrieval Augmented Generation for Wikipedia using LangChain

In [1]:
import os

import dotenv
from langchain.chains import ConversationalRetrievalChain
# WikipediaRetriever lives in langchain_community and needs the `wikipedia` package
from langchain_community.retrievers import WikipediaRetriever
from langchain_openai import ChatOpenAI

## Setting up Wikipedia Retriever

In [2]:
retriever = WikipediaRetriever()

# Test the retriever: fetch pages matching the query and inspect the second hit
docs = retriever.invoke("ASCAT")
docs[1].metadata

{'title': 'Scatterometer',
 'summary': 'A scatterometer or diffusionmeter is a scientific instrument to measure the return of a beam of light or radar waves scattered by diffusion in a medium such as air. Diffusionmeters using visible light are found in airports or along roads to measure horizontal visibility. Radar scatterometers use radio or microwaves to determine the normalized radar cross section (σ0, "sigma zero" or "sigma naught") of a surface. They are often mounted on weather satellites to find wind speed and direction, and are used in industries to analyze the roughness of surfaces.',
 'source': 'https://en.wikipedia.org/wiki/Scatterometer'}

In [3]:
docs[1].page_content[:100]

'A scatterometer or diffusionmeter is a scientific instrument to measure the return of a beam of ligh'
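
Rather than indexing one hit at a time, it can help to list every retrieved document at once. The following is a minimal sketch, assuming each document exposes `.metadata` (with `title` and `source` keys, as in the output above) and `.page_content`. `preview_docs` is a hypothetical helper name, not part of LangChain, and the stand-in object only mimics the fields shown above.

```python
from types import SimpleNamespace


def preview_docs(docs, width=60):
    """Return one summary line per document: index, title, source, text preview."""
    lines = []
    for i, doc in enumerate(docs):
        title = doc.metadata.get("title", "<untitled>")
        source = doc.metadata.get("source", "<no source>")
        preview = doc.page_content[:width].replace("\n", " ")
        lines.append(f"[{i}] {title} ({source}): {preview}")
    return lines


# Stand-in document (an assumption for illustration); with the real retriever,
# pass the `docs` list from the cell above instead.
example = SimpleNamespace(
    metadata={
        "title": "Scatterometer",
        "source": "https://en.wikipedia.org/wiki/Scatterometer",
    },
    page_content="A scatterometer or diffusionmeter is a scientific instrument",
)
print("\n".join(preview_docs([example])))
```

Calling `preview_docs(docs)` on the real results makes it easier to see which Wikipedia pages a query such as "ASCAT" actually pulled in.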

## Initializing LLM with Retriever

In [4]:
dotenv.load_dotenv()  # load OPENAI_API_KEY from a local .env file, if one exists
model = ChatOpenAI(model_name="gpt-3.5-turbo")  # or "gpt-4"
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

### Testing Chain

In [5]:
questions = [
    "What is Apify?",
    "When the Monument to the Martyrs of the 1830 Revolution was created?",
    "What is the Abhayagiri Vihāra?",
]
chat_history = []

for question in questions:
    # Pass the running history so later questions can build on earlier turns
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is Apify? 

**Answer**: Apify is a web scraping and automation platform that allows users to extract data from websites, automate workflows, and build applications using web technologies. It provides tools and resources for developers to create, deploy, and scale web scraping and automation projects. 

-> **Question**: When the Monument to the Martyrs of the 1830 Revolution was created? 

**Answer**: I don't have that information. 

-> **Question**: What is the Abhayagiri Vihāra? 

**Answer**: Abhayagiri Vihāra was a major monastery site located in Anuradhapura, Sri Lanka. It was one of the largest viharas and a significant religious and educational institution in ancient Sri Lanka. The complex consisted of monastic buildings and served as a seat for the Northern Monastery. It was founded in the 2nd century BC and grew into an international institution by the 1st century AD, attracting scholars from various parts of the world. Abhayagiri Vihāra played a crucial ro
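
The question loop above is reused for each test in this notebook, so it could be wrapped in a small function. This is only a sketch: `ask_all` and `EchoChain` are hypothetical names, and it assumes the chain accepts `.invoke(...)` with the same `question` / `chat_history` dictionary used above and returns a dict with an `answer` key. The stub chain stands in for `qa` so the example runs without an API key.

```python
def ask_all(chain, questions):
    """Ask each question in turn, feeding earlier turns back as chat history."""
    chat_history = []
    for question in questions:
        result = chain.invoke({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


class EchoChain:
    """Hypothetical stand-in for `qa`; it only echoes the question back."""

    def invoke(self, inputs):
        turns = len(inputs["chat_history"])
        return {"answer": f"(turn {turns + 1}) you asked: {inputs['question']}"}


for question, answer in ask_all(EchoChain(), ["What is MetOp?", "Who operates it?"]):
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {answer} \n")
```

With the real chain, `ask_all(qa, questions)` would return the same list of `(question, answer)` tuples that the loop builds up in `chat_history`.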

### Testing Chain on Domain Specific Knowledge

In [6]:
questions = [
    "What is MetOp?",
    "What instruments are on MetOp-A?"
]
chat_history = []

for question in questions:
    # Pass the running history so the second question can refer back to MetOp
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is MetOp? 

**Answer**: MetOp is a series of polar-orbiting weather satellites operated by the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT). The MetOp satellites are designed to provide more precise details about atmospheric temperature and moisture profiles, as well as other meteorological data. The mission consists of three satellites, MetOp-A, MetOp-B, and MetOp-C, which are flown successively for more than 14 years each. These satellites play a crucial role in numerical weather prediction and climate monitoring. 

-> **Question**: What instruments are on MetOp-A? 

**Answer**: The MetOp-A satellite is equipped with several instruments. These include:

1. Advanced Scatterometer (ASCAT): Measures wind speed and direction over the oceans.

2. Advanced Very High Resolution Radiometer (AVHRR): Provides high-resolution visible and infrared imagery of the Earth's surface.

3. Global Ozone Monitoring Experiment (GOME): Measures at

### Testing Chain on Math Questions
Wikipedia has many good math pages. Writing the notation in Markdown (LaTeX math such as `$\Sigma$`) can also make a symbol easier to look up than with a Google search.
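
One practical detail when putting LaTeX notation inside Python strings: in an ordinary string literal, some backslash sequences are interpreted as escapes, so the notation can be silently altered before it ever reaches the chain. A short sketch of the difference, using an illustrative question that is not one of the notebook's own:

```python
# In a normal literal, "\t" is read as a tab, so "\theta" loses its backslash.
plain = "What does $\theta$ mean?"
# A raw string (r"...") keeps every backslash exactly as typed.
raw = r"What does $\theta$ mean?"

print("\t" in plain)      # checks whether a tab character ended up in the string
print("\\theta" in raw)   # checks whether the literal backslash survived
print(repr(plain))
print(repr(raw))
```

Prefixing questions with `r` is therefore a safer default whenever they contain LaTeX commands.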

In [9]:
# Raw strings keep the LaTeX backslashes literal
questions = [
    r"What is the symbol $\Sigma$ mean?",
    "What is the formula for Multivariate Normal Function?",
    r"What does the symbol $\mathbb{I}(x=0)$ mean?",
]
chat_history = []

for question in questions:
    # Pass the running history so later questions can build on earlier turns
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the symbol $\Sigma$ mean? 

**Answer**: The symbol $\Sigma$ represents the mathematical notation for summation. It is derived from the Greek letter sigma. When used in a mathematical expression, the symbol $\Sigma$ is followed by an expression or equation that specifies the terms to be summed. The symbol is often used to represent the sum of a series of numbers or variables. 

-> **Question**: What is the formula for Multivariate Normal Function? 

**Answer**: The formula for the multivariate normal distribution is given by:

f(x) = (2π)^(−k/2) * det(Σ)^(-1/2) * exp(-1/2 * (x-μ)' * Σ^(-1) * (x-μ))

where:
- f(x) is the probability density function of the multivariate normal distribution
- k is the dimension of the random vector x
- μ is the mean vector of length k
- Σ is the covariance matrix of size k x k
- det(Σ) is the determinant of the covariance matrix
- (x-μ)' denotes the transpose of the vector (x-μ)
- Σ^(-1) is the inverse of the covariance matrix

Not