# Conversational Question Answering Agent Using SQuAD (Stanford Question Answering Dataset)
#### We are going to build a conversational agent using the SQuAD dataset, Pinecone, and LangChain

### We start by loading the SQuAD dataset

In [2]:
from datasets import load_dataset

In [3]:
data = load_dataset("squad",
                   split="train")
data

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [4]:
import pandas as pd
df = data.to_pandas()
df.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


In [5]:
df.drop_duplicates(subset="context", keep='first', inplace=True)
df.shape

(18891, 5)

In [11]:
len(df)

18891
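
SQuAD asks several questions about the same passage, so the `context` column repeats. The `drop_duplicates` call above keeps one row per passage, which is why 87599 rows shrink to 18891. The following is a minimal pure-Python sketch of that "keep the first occurrence" behaviour; the helper name `keep_first_by_key` and the toy rows are made up for illustration and are not part of pandas or the dataset.

```python
def keep_first_by_key(rows, key):
    """Return rows with duplicate `key` values removed, keeping the first one seen."""
    seen = set()
    unique_rows = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique_rows.append(row)
    return unique_rows


# Toy rows (hypothetical): two questions share one context passage.
toy_rows = [
    {"id": "a1", "context": "Passage one.", "question": "Q1?"},
    {"id": "a2", "context": "Passage one.", "question": "Q2?"},
    {"id": "b1", "context": "Passage two.", "question": "Q3?"},
]

unique_rows = keep_first_by_key(toy_rows, "context")
print([row["id"] for row in unique_rows])
```

Only the passages are embedded later, so dropping the repeated contexts avoids paying for the same embedding more than once.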

#### Next we will create embeddings and a vector database for the records, using OpenAI's `text-embedding-ada-002` embedding model

In [7]:
import os
from getpass import getpass
from langchain.embeddings.openai import OpenAIEmbeddings
from UDCUtils import UDCUtils
utils = UDCUtils()

In [8]:
openai_api_key = utils.get_openai_api_key()
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    api_key=openai_api_key,
    model=model_name
)

#### Using Pinecone to create a vector database

In [10]:
from pinecone import Pinecone, ServerlessSpec

pinecone_api_key = utils.get_pinecone_api_key()

pinecone = Pinecone(api_key=pinecone_api_key)

index_name = "idx-squad-dev-001"

existing_indexes = [index_info["name"] for index_info in pinecone.list_indexes()]

if index_name not in existing_indexes:
    pinecone.create_index(
        name=index_name,
        dimension=1536,  # dimensionality of ada 002
        metric="dotproduct",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

vector_db = pinecone.Index(index_name)
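
The index above uses the `dotproduct` metric. OpenAI documents its embeddings as normalized to length 1; assuming that holds for the vectors stored here, the dot product and cosine similarity give the same score, so either metric should rank results identically. The sketch below checks this on two made-up 3-dimensional vectors (real `text-embedding-ada-002` vectors have 1536 dimensions).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, which divides the dot product by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical toy vectors, normalized to unit length.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

dot_score = dot(u, v)
cosine_score = cosine(u, v)
print(math.isclose(dot_score, cosine_score))
```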


#### Now we are going to populate the vector database with vectors for our SQuAD dataset

In [12]:
from tqdm import tqdm

batch_size = 100
data_len = len(df)

for i in tqdm(range(0, data_len, batch_size)):
    i_end = min(data_len, i + batch_size)
    batch = df.iloc[i:i_end]
    # build the metadata dict for each record in the batch
    metadatas = [
        {"title": record["title"], "text": record["context"]}
        for _, record in batch.iterrows()
    ]
    # embed the context passages, passed as a plain list of strings
    documents = batch["context"].tolist()
    embeds = embed.embed_documents(documents)
    # use the SQuAD record ids as the vector ids
    ids = batch["id"].tolist()
    # upsert the (id, vector, metadata) tuples into the index
    vector_db.upsert(vectors=list(zip(ids, embeds, metadatas)))

100%|█████████████████████████████████████████| 189/189 [15:40<00:00,  4.98s/it]
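
The progress bar reports 189 iterations because 18891 records are processed in batches of 100, with a smaller final batch. The sketch below reproduces only the index arithmetic of the loop above; `batch_ranges` is a hypothetical helper name used for illustration.

```python
def batch_ranges(total, batch_size):
    """Return (start, end) index pairs covering `total` items in steps of `batch_size`."""
    return [
        (start, min(total, start + batch_size))
        for start in range(0, total, batch_size)
    ]


ranges = batch_ranges(18891, 100)
print(len(ranges), ranges[0], ranges[-1])
```

The last pair covers the 91 leftover records, which is what the `min(data_len, i + batch_size)` guard in the loop is for.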


In [13]:
vector_db.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}

#### We will create a Pinecone vectorstore using langchain to query our data

In [14]:
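# alias the import so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

content_field = "text"  # the metadata field that contains our content

vectorstore = PineconeVectorStore(
    vector_db, embed.embed_query, content_field
)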




#### Let us query our data using similarity search

In [16]:
query = "When was the college of engineering in the University of Notre Dame established?"
results = vectorstore.similarity_search(
    query,
    k=3,
)

In [19]:
import pprint
pprint.pprint(results)

[Document(metadata={'title': 'University_of_Notre_Dame'}, page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president."),
 Document(metadata={'title': 'University_of_Notre_Dame'}, page_content='The College of Engin
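
Conceptually, a similarity search with `k=3` embeds the query, scores every stored vector against it, and returns the best-scoring passages. The sketch below shows that idea with brute-force scoring over made-up 2-dimensional vectors. It is an illustration only: Pinecone does this server-side at scale, and the function name, passage labels, and vectors here are all hypothetical.

```python
def top_k_by_dot_product(query_vec, records, k=3):
    """Score each (text, vector) record against the query and return the k best."""
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), text)
        for text, vec in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy index: (passage label, embedding) pairs.
toy_records = [
    ("engineering passage", [0.9, 0.1]),
    ("sports passage", [0.1, 0.9]),
    ("history passage", [0.6, 0.4]),
]

top_two = top_k_by_dot_product([1.0, 0.0], toy_records, k=2)
print([text for _, text in top_two])
```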

### Now let us start building our conversational agent for QnA
##### Our conversational agent needs a chat LLM, a conversation memory to store the history, and a RetrievalQA chain

In [20]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

In [22]:
#initialize llm chat model
llm_chat_model = ChatOpenAI(
    api_key=openai_api_key,
    model="gpt-3.5-turbo",
    temperature=0
)

In [23]:
#conversational memory
conversation_mem = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True
)
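
With `k=5`, the window memory is expected to keep only the five most recent exchanges and drop older ones, so the prompt does not grow without bound. The following is a minimal sketch of that sliding-window idea built on `collections.deque`; the `WindowMemory` class is a hypothetical stand-in and not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (human, ai) exchanges."""

    def __init__(self, k):
        # a deque with maxlen silently discards the oldest item when full
        self.exchanges = deque(maxlen=k)

    def save(self, user_message, ai_message):
        self.exchanges.append((user_message, ai_message))

    def history(self):
        """Flatten the stored exchanges into an ordered list of messages."""
        messages = []
        for user_message, ai_message in self.exchanges:
            messages.append(("human", user_message))
            messages.append(("ai", ai_message))
        return messages


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")  # pushes the first exchange out of the window
print(memory.history())
```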

In [24]:
qa = RetrievalQA.from_chain_type(
    llm=llm_chat_model,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

#### With this chain we can generate answers using its `run` method

In [25]:
qa.run(query)

'The College of Engineering at the University of Notre Dame was established in 1920.'
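
The `"stuff"` chain type places all retrieved passages into a single prompt together with the question and sends that to the LLM in one call. The sketch below illustrates the idea only; the prompt wording and the `build_stuff_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, passages):
    """Concatenate every retrieved passage into one prompt ahead of the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages.
passages = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuff_prompt("When was the college established?", passages)
print(prompt)
```

Because every passage goes into one prompt, this approach is simple but only works while the retrieved text fits in the model's context window.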

### Now let's convert this retrieval chain into a tool

In [26]:
from langchain.agents import Tool

tools = [
    Tool(
        name="Knowledge Base",
        func=qa.run,
        description="Use this tool to answer questions about topics covered by the SQuAD passages stored in the knowledge base."
    )
]

In [27]:
from langchain.agents import initialize_agent

agent = initialize_agent(
    tools=tools,
    llm=llm_chat_model,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversation_mem,
    agent='chat-conversational-react-description',
)

### Now we will use this agent to answer our questions

In [28]:
agent(query)



> Entering new AgentExecutor chain...
```json
{
    "action": "Knowledge Base",
    "action_input": "Establishment date of the College of Engineering at the University of Notre Dame"
}
```
Observation: The College of Engineering at the University of Notre Dame was established in 1920.
Thought:
```json
{
    "action": "Final Answer",
    "action_input": "The College of Engineering at the University of Notre Dame was established in 1920."
}
```

> Finished chain.


{'input': 'When was the college of engineering in the University of Notre Dame established?',
 'chat_history': [],
 'output': 'The College of Engineering at the University of Notre Dame was established in 1920.'}
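
The verbose trace above shows the pattern the agent follows: the LLM emits a JSON blob naming an action, the executor runs the matching tool and feeds back the observation, and the loop ends when the action is `Final Answer` or the iteration cap is reached. The sketch below replays that loop with scripted LLM outputs and a stub tool. `parse_action`, `run_agent`, and the scripted strings are hypothetical and only illustrate the control flow, not LangChain's internals.

```python
import json

FENCE = "`" * 3  # built this way to avoid writing a literal fence in this example


def parse_action(llm_output):
    """Extract the JSON action from an LLM reply, with or without a json fence."""
    text = llm_output.strip()
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)


def run_agent(llm_outputs, tools, max_iterations=3):
    """Replay scripted LLM outputs, calling tools until a Final Answer appears."""
    observations = []
    for llm_output in llm_outputs[:max_iterations]:
        step = parse_action(llm_output)
        if step["action"] == "Final Answer":
            return step["action_input"], observations
        # look up the named tool and record what it returns
        observations.append(tools[step["action"]](step["action_input"]))
    return None, observations  # iteration cap reached without a final answer


# Scripted replies standing in for the chat model (made up for illustration).
scripted = [
    FENCE + 'json\n{"action": "Knowledge Base", "action_input": "founding year"}\n' + FENCE,
    FENCE + 'json\n{"action": "Final Answer", "action_input": "stub final answer"}\n' + FENCE,
]
stub_tools = {"Knowledge Base": lambda q: f"stub answer for: {q}"}

answer, observations = run_agent(scripted, stub_tools)
print(answer, observations)
```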

In [29]:
agent("can you tell me some facts about the University of Notre Dam?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Knowledge Base",
    "action_input": "Facts about the University of Notre Dame"
}
```
Observation: The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large undergraduate and graduate student body, with notable colleges such as Arts and Letters, Science, Engineering, Business, and the Architecture School. The university is known for its research institutes in various fields, including the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a strong alumni network and is recognized for its intramural sports programs. The campus covers 1,250 acres and features landmarks like the Golden Dome and the Basilica.
Thought:```json
{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large 

{'input': 'can you tell me some facts about the University of Notre Dam?',
 'chat_history': [HumanMessage(content='When was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.')],
 'output': 'The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large undergraduate and graduate student body, with notable colleges such as Arts and Letters, Science, Engineering, Business, and the Architecture School. The university is known for its research institutes in various fields, including the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a strong alumni network and is recognized for its intramural sports programs. The campus covers 1,250 acres and features landmarks like the Golden Dome and the Basilica.'}

In [30]:
agent("where is it located?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is located in South Bend, Indiana."
}
```

> Finished chain.


{'input': 'where is it located?',
 'chat_history': [HumanMessage(content='When was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='can you tell me some facts about the University of Notre Dam?'),
  AIMessage(content='The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large undergraduate and graduate student body, with notable colleges such as Arts and Letters, Science, Engineering, Business, and the Architecture School. The university is known for its research institutes in various fields, including the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a strong alumni network and is recognized for its intramural sports programs. The campus covers 1,250 acres and features landmarks like the Golden Dome and the Basilica.')],
 'output': 'The Un

### As you can see, the agent successfully used the conversation history to work out that "it" refers to the University of Notre Dame.

#### Now let us test the agent on questions outside the knowledge base

In [31]:
agent("what is 30+20?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "The sum of 30 and 20 is 50."
}
```

> Finished chain.


{'input': 'what is 30+20?',
 'chat_history': [HumanMessage(content='When was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='can you tell me some facts about the University of Notre Dam?'),
  AIMessage(content='The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large undergraduate and graduate student body, with notable colleges such as Arts and Letters, Science, Engineering, Business, and the Architecture School. The university is known for its research institutes in various fields, including the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a strong alumni network and is recognized for its intramural sports programs. The campus covers 1,250 acres and features landmarks like the Golden Dome and the Basilica.'),
  HumanMessage(content='w

In [32]:
agent("can you summarize these two facts in short?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana, established in 1842. The College of Engineering at the University of Notre Dame was established in 1920."
}
```

> Finished chain.


{'input': 'can you summarize these two facts in short?',
 'chat_history': [HumanMessage(content='When was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='can you tell me some facts about the University of Notre Dam?'),
  AIMessage(content='The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It has a large undergraduate and graduate student body, with notable colleges such as Arts and Letters, Science, Engineering, Business, and the Architecture School. The university is known for its research institutes in various fields, including the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a strong alumni network and is recognized for its intramural sports programs. The campus covers 1,250 acres and features landmarks like the Golden Dome and the Basilica.