# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for the notebook to run:

In [40]:
!pip install -qU python-dotenv



In [41]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
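
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later as an authentication error. A minimal sketch of a fail-fast lookup, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv`) and a placeholder variable name:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail early instead of passing None (or an empty string) to a client.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo with a placeholder variable so the example runs without a real key.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

In this notebook the same check could be applied to `PINECONE_API_KEY` before creating the client.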

# Install Dependencies

In [42]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep the title and context columns, and drop any duplicate context passages.

In [43]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [44]:
print(df.columns)


Index(['id', 'title', 'context', 'question', 'answers'], dtype='object')


In [45]:
# select only the title and context columns
df = df[["title", "context"]]

# Drop duplicate context passages
df = df.drop_duplicates(subset=["context"])

df

| | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 5 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 10 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 15 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 20 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
| ... | ... | ... |
| 87574 | Kathmandu | Institute of Medicine, the central college of ... |
| 87579 | Kathmandu | Football and Cricket are the most popular spor... |
| 87584 | Kathmandu | The total length of roads in Nepal is recorded... |
| 87589 | Kathmandu | The main international airport serving Kathman... |
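
The row labels jump (0, 5, 10, ...) because SQuAD attaches several questions to each context paragraph and `drop_duplicates` keeps only the first row per context. A plain-Python sketch of the same keep-first behavior on made-up rows; the sample data and the helper name `drop_duplicate_contexts` are illustrative and not part of the dataset or pandas:

```python
def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct context, preserving order."""
    seen = set()
    kept = []
    for label, row in enumerate(rows):
        if row["context"] not in seen:
            seen.add(row["context"])
            # Keep the original row label, as pandas does.
            kept.append((label, row))
    return kept


rows = [
    {"title": "A", "context": "first passage"},
    {"title": "A", "context": "first passage"},
    {"title": "A", "context": "second passage"},
    {"title": "B", "context": "third passage"},
    {"title": "B", "context": "third passage"},
]

kept = drop_duplicate_contexts(rows)
print([label for label, _ in kept])  # original labels: [0, 2, 3]
print(list(range(len(kept))))        # row positions:   [0, 1, 2]
```

Because the surviving labels are no longer contiguous, the upsert loop later in this notebook builds its IDs from row positions rather than from these labels.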


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages which we can retrieve using another vector (query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [46]:
!pip install -qU langchain-pinecone pinecone-notebooks


In [None]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone with the API key loaded from the environment
# (the serverless cloud and region are set in `spec` above)
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we create a new index called "extractive-question-answering" (we can name the index anything we want). We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.
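
The index dimension has to match the length of every vector sent to it. A small sketch of a check that could run before upserting, assuming a hypothetical helper `check_dimensions`; it only inspects plain Python lists and does not call Pinecone:

```python
INDEX_DIMENSION = 384  # must equal the retriever's embedding size


def check_dimensions(vectors, expected=INDEX_DIMENSION):
    """Return the positions of vectors whose length differs from `expected`."""
    return [pos for pos, vec in enumerate(vectors) if len(vec) != expected]


good = [0.0] * 384
bad = [0.0] * 768  # e.g. the output of a larger embedding model
print(check_dimensions([good, bad, good]))  # [1]
```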

In [48]:

#index_name = "extractive-question-answering"

#if index_name in pc.list_indexes().names():
 #   pc.delete_index(index_name)


In [65]:

# Define the index name you want to use
index_name = "extractive-question-answering"

# Check if the index exists
if index_name not in pc.list_indexes().names():
    # Create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,  # Change this based on your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Connect to the index
index = pc.Index(index_name)


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
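
To make the idea concrete, here is a toy sketch of cosine-similarity ranking with hand-written 3-dimensional vectors. The vectors, the passage labels, and the helper names `cosine` and `top_k` are made up for illustration; real embeddings come from the retriever, and the actual search runs inside Pinecone.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k):
    """Return the k passage labels most similar to the query vector."""
    ranked = sorted(
        passages,
        key=lambda label: cosine(query, passages[label]),
        reverse=True,
    )
    return ranked[:k]


passages = {
    "moon landing": [0.9, 0.1, 0.0],
    "oil production": [0.0, 0.8, 0.6],
    "apollo program": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of a question about the Moon

print(top_k(query, passages, k=2))  # ['moon landing', 'apollo program']
```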

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [66]:
import torch
from sentence_transformers import SentenceTransformer

# Set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the retriever model from Hugging Face model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)

retriever


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we generate embeddings for the context passages. We do this in batches, which speeds up both encoding and uploading to the Pinecone index. For each passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to the embedding, such as the article title and the context passage itself.

In [67]:
from tqdm.auto import tqdm

# use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))

    # extract batch
    batch = df.iloc[i:i_end]

    # generate embeddings for batch
    emb = retriever.encode(batch["context"].tolist(), convert_to_numpy=True)

    # get metadata
    meta = [{"context": c, "title": t}
            for c, t in zip(batch["context"], batch["title"])]

    # create unique IDs
    ids = [f"{i+j}" for j in range(i_end - i)]

    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))

    # upsert/insert these records to Pinecone
    index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()


  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}
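
The 296 progress-bar steps follow from the batch arithmetic: 18,891 passages in batches of 64. A sketch of the same slicing logic without the model or Pinecone, using a hypothetical generator named `batch_ranges`:

```python
import math


def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in batches."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


ranges = list(batch_ranges(18891, 64))
print(len(ranges), math.ceil(18891 / 64))  # 296 296
print(ranges[-1])                          # (18880, 18891), a partial batch

# IDs are row positions rendered as strings, as in the loop above.
start, end = ranges[-1]
ids = [str(pos) for pos in range(start, end)]
print(ids[0], ids[-1])                     # 18880 18890
```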

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the Hugging Face Hub as our reader model. We load this model into a "question-answering" pipeline from Hugging Face `transformers` and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [68]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7a96eae1bb10>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most similar to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [69]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # Step 1: embed the question and convert it to a plain list
    xq = retriever.encode([question])[0].tolist()

    # Step 2: query Pinecone for the top_k most similar contexts
    result = index.query(vector=xq, top_k=top_k, include_metadata=True)

    # Step 3: extract the context passages from the metadata
    contexts = [match['metadata']['context'] for match in result['matches']]

    return contexts





In [70]:
from pprint import pprint

# extracts answers from the context passages and prints them
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader's score (best first) and print them;
    # pprint returns None, so its result is not assigned or returned
    pprint(sorted(results, key=lambda x: x['score'], reverse=True))


In [55]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['Rajasthan is[when?] earning Rs. 150 million (approx. US$2.5 million) per day as revenue from the crude oil sector. This earning is expected to reach ₹250 million per day in 2013 (which is an increase of ₹100 million or more than 66 percent). The government of India has given permission to extract 300,000 barrels of crude per day from Barmer region which is now 175,000 barrels per day. Once this limit is achieved Rajasthan will become a leader in Crude extraction in Country. Bombay High leads with a production of 250,000 barrels crude per day. Once the limit if 300,000 barrels per day is reached, the overall production of the country will increase by 15 percent. Cairn India is doing the work of exploration and extraction of crude oil in Rajasthan.']

Note that in this run the retriever returned a passage about crude oil in Rajasthan, India, not Egypt. The passage is on topic (daily oil production) but does not contain the answer to our question. Let's see what the reader extracts from it anyway.

In [56]:
extract_answer(question, context)

[{'answer': '₹250 million per day in 2013',
  'context': 'Rajasthan is[when?] earning Rs. 150 million (approx. US$2.5 '
             'million) per day as revenue from the crude oil sector. This '
             'earning is expected to reach ₹250 million per day in 2013 (which '
             'is an increase of ₹100 million or more than 66 percent). The '
             'government of India has given permission to extract 300,000 '
             'barrels of crude per day from Barmer region which is now 175,000 '
             'barrels per day. Once this limit is achieved Rajasthan will '
             'become a leader in Crude extraction in Country. Bombay High '
             'leads with a production of 250,000 barrels crude per day. Once '
             'the limit if 300,000 barrels per day is reached, the overall '
             'production of the country will increase by 15 percent. Cairn '
             'India is doing the work of exploration and extraction of crude '
             'oil in Raja

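The reader can only pick a span from the passage it is given, so here it returns *₹250 million per day in 2013*, a revenue figure for Rajasthan rather than Egypt's oil production. This shows how much the reader depends on the retriever returning a relevant passage. Let's run a few more queries.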

In [57]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Hurley and Chen',
  'context': 'According to a story that has often been repeated in the media, '
             'Hurley and Chen developed the idea for YouTube during the early '
             'months of 2005, after they had experienced difficulty sharing '
             "videos that had been shot at a dinner party at Chen's apartment "
             'in San Francisco. Karim did not attend the party and denied that '
             'it had occurred, but Chen commented that the idea that YouTube '
             'was founded after a dinner party "was probably very strengthened '
             'by marketing ideas around creating a story that was very '
             'digestible".',
  'end': 79,
  'score': 0.9999276399612427,
  'start': 64}]


In [58]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 86,
  'score': 0.9500371217727661,
  'start': 29}]
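
The `start` and `end` fields in these outputs are character offsets into the context string, so slicing the context recovers the answer. A quick check using the opening of the two passages above:

```python
# Opening of each context, copied from the outputs above.
youtube_context = (
    "According to a story that has often been repeated in the media, "
    "Hurley and Chen developed the idea for YouTube during the early "
    "months of 2005"
)
einstein_context = (
    "Albert Einstein is known for his theories of special relativity "
    "and general relativity."
)

# Slice with the start/end values reported by the reader.
print(youtube_context[64:79])   # Hurley and Chen
print(einstein_context[29:86])  # his theories of special relativity and general relativity
```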


Let's run another question, this time retrieving the top 3 context passages from the retriever.

In [59]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

The result looks pretty good.

### Add a few more questions. What did you observe?

In [71]:
questions = [
    "What is the capital of Canada?",
    "When did the World War II end?",
    "Who is the CEO of Tesla?",
    "How many countries are in the European Union?",
    "What is the boiling point of water in Celsius?"
]

for q in questions:
    print(f"\n Question: {q}")
    context = get_context(q, top_k=3)
    extract_answer(q, context)





 Question: What is the capital of Canada?
[{'answer': '/ˌseɪntˈdʒɑːnz',
  'context': "St. John's (/ˌseɪntˈdʒɒnz/, local /ˌseɪntˈdʒɑːnz/) is the "
             'capital and largest city in Newfoundland and Labrador, Canada. '
             "St. John's was incorporated as a city in 1888, yet is considered "
             'by some to be the oldest English-founded city in North America. '
             'It is located on the eastern tip of the Avalon Peninsula on the '
             'island of Newfoundland. With a population of 214,285 as of July '
             "1, 2015, the St. John's Metropolitan Area is the second largest "
             'Census Metropolitan Area (CMA) in Atlantic Canada after Halifax '
             'and the 20th largest metropolitan area in Canada. It is one of '
             "the world's top ten oceanside destinations, according to "
             'National Geographic Magazine. Its name has been attributed to '
             'the feast day of John the Baptist, when John Cabo

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'answer': 'Alan Mulally',
  'context': 'Notable alumni include: Alan Mulally (BS/MS), former President '
             'and CEO of Ford Motor Company, Lou Montulli, co-founder of '
             'Netscape and author of the Lynx web browser, Brian McClendon '
             '(BSEE 1986), VP of Engineering at Google, Charles E. Spahr '
             '(1934), former CEO of Standard Oil of Ohio.',
  'end': 36,
  'score': 7.620492965543235e-07,
  'start': 24},
 {'answer': 'Andrew N. Liveris',
  'context': "The company's 14 member Board of Directors is responsible for "
             "overall corporate management. As of Cathie Black's resignation "
             'in November 2010 its membership (by affiliation and year of '
             "joining) included: Alain J. P. Belda '08 (Alcoa), William R. "
             "Brody '07 (Salk Institute / Johns Hopkins University), Kenneth "
             "Chenault '98 (American Express), Michael L. Eskew '05 (UPS), "
             "Shirley Ann Jackson '05 (Renss

# What did you observe?
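- Answers are correct when the retrieved context is relevant.
- Answers are wrong or empty when the context doesn't match the question. For "What is the capital of Canada?", the highest-scoring answer comes from a passage about St. John's and is a pronunciation string.
- Reader scores can be extremely low for mismatched contexts (about 7.6e-07 for the `Alan Mulally` answer), so the score is a useful warning signal.
- Answers repeat or become generic if the same context is returned for multiple questions.
- No answers are returned if the Pinecone index is empty, and a query whose embedding dimension doesn't match the index fails instead of returning matches.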
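
One way to act on these observations is to treat low reader scores as "no answer". A minimal sketch, assuming a hypothetical helper `best_answer` and an arbitrary cutoff of 0.5 that would need tuning on real data; the sample dictionaries reuse answers and scores printed earlier in this notebook:

```python
def best_answer(results, min_score=0.5):
    """Return the highest-scoring answer dict, or None if none clears min_score."""
    confident = [r for r in results if r["score"] >= min_score]
    if not confident:
        return None
    return max(confident, key=lambda r: r["score"])


good = [{"answer": "his theories of special relativity and general relativity",
         "score": 0.9500371217727661}]
poor = [{"answer": "Alan Mulally", "score": 7.620492965543235e-07}]

print(best_answer(good)["answer"])
print(best_answer(poor))  # None: the score is far below the cutoff
```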

In [72]:
pc.delete_index(index_name)