# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset would take some time, this demo uses only 20,000 passages whose `section_title` is "History". You can use the complete dataset if you want; Pinecone can manage millions of documents.

In [2]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True,trust_remote_code=True
).shuffle(seed=960)



We are loading the dataset in the streaming mode so that we don't have to wait for the whole dataset to download (which is over 9GB). Instead, we iteratively download records one at a time.
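
Because a streamed dataset yields records lazily, filtering it and taking a fixed number of matches only pulls as much of the stream as needed. A minimal sketch of that pattern follows. The `fake_stream` generator and its records are hypothetical stand-ins for the real dataset, so that the example runs without a download:

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records one at a time, forever."""
    n = 0
    while True:
        yield {
            "section_title": "History" if n % 3 == 0 else "Other",
            "passage_text": f"passage {n}",
        }
        n += 1

# Lazily keep only "History" records, then stop after the first five matches.
history_only = (r for r in fake_stream() if r["section_title"] == "History")
first_five = list(islice(history_only, 5))

print([r["passage_text"] for r in first_five])
```

The stream is infinite here, yet the code terminates because `islice` stops requesting records once it has five.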

In [3]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [4]:
# keep only documents whose section_title is exactly 'History'
history = wiki_data.filter(lambda x: x['section_title'] == 'History')

Let's iterate through the dataset and apply our filter to select 20,000 historical passages. From each document we keep `wiki_id`, `article_title`, `section_title`, and `passage_text`, plus the paragraph and character offsets.

In [5]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 20000

counter = 0
docs = []
# iterate through the filtered stream and collect the fields we need
for d in tqdm(history, total=total_doc_count):
    docs.append({
        "wiki_id": d["wiki_id"],  # used later as the vector ID in Pinecone
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"],
        # optional fields, stored as extra metadata in Pinecone
        "start_paragraph": d["start_paragraph"],
        "start_character": d["start_character"],
        "end_paragraph": d["end_paragraph"],
        "end_character": d["end_character"]
    })

    # increase the counter on every iteration
    counter += 1

    # stop once we have collected enough documents
    if counter >= total_doc_count:
        break

print(f"Collected {len(docs)} documents.")
# Optional: print the first document to verify
if docs:
    print("First collected document:", docs[0])

100%|█████████▉| 19999/20000 [06:32<00:00, 51.00it/s] 

Collected 20000 documents.
First collected document: {'wiki_id': 'Q2644349', 'article_title': 'Taupo District', 'section_title': 'History', 'passage_text': 'was not until the 1950s that the region started to develop, with forestry and the construction of the Wairakei geothermal power station.', 'start_paragraph': 10, 'start_character': 397, 'end_paragraph': 10, 'end_character': 534}





In [6]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | wiki_id | article_title | section_title | passage_text | start_paragraph | start_character | end_paragraph | end_character |
|---|---|---|---|---|---|---|---|---|
| 0 | Q2644349 | Taupo District | History | was not until the 1950s that the region starte... | 10 | 397 | 10 | 534 |
| 1 | Q7718184 | The Bishop Wand Church of England School | History | The Bishop Wand Church of England School Histo... | 2 | 0 | 6 | 553 |
| 2 | Q24090072 | Surface Hill Uniting Church | History | in perpetual reminder that work and worship go... | 6 | 6538 | 6 | 7119 |
| 3 | Q25351809 | The Electras (band) | History | as its B-side. However, copies of the single, ... | 6 | 3105 | 6 | 3722 |
| 4 | Q39054288 | Swanton House | History | it. Lane provided funds for restoration by the... | 6 | 1801 | 6 | 2393 |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free from the [Pinecone console](https://app.pinecone.io/). We then initialize the connection as follows:

In [7]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)
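
The cell above falls back to the literal placeholder string `'PINECONE_API_KEY'` when the environment variable is missing, so a missing key only surfaces later as an authentication error. A minimal sketch of a stricter alternative is shown below. The `get_api_key` helper is hypothetical and not part of the Pinecone client:

```python
import os

def get_api_key(var_name="PINECONE_API_KEY"):
    """Hypothetical helper: read the key from the environment or fail with a clear message."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return key
```

With this helper, the client could be created as `pc = Pinecone(api_key=get_api_key())`.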

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [8]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [9]:
index_name  = "abstractive-question-answering"

if index_name not in pc.list_indexes().names():
    print(f"Index '{index_name}' does not exist. Creating it now...")
    # Create the index
    pc.create_index(name=index_name, dimension=768, metric="cosine", spec=spec)
    print(f"Index '{index_name}' created successfully.")
else:
    print(f"Index '{index_name}' already exists.")

# Connect to the index
index = pc.Index(index_name)

print(f"Connected to index '{index_name}'.")

Index 'abstractive-question-answering' already exists.
Connected to index 'abstractive-question-answering'.


In [10]:
import time

# 1. Wait for the index to be ready
print(f"Waiting for index '{index_name}' to be ready...")
while True:
    try:
        # describe_index returns an IndexDescription object which has a status field
        index_description = pc.describe_index(index_name)
        if index_description.status.ready:
            print(f"Index '{index_name}' is ready.")
            break
        else:
            # Index is not ready yet, print current state and wait
            print(f"Index '{index_name}' is not ready yet. Status: {index_description.status.state}. Waiting...")
            time.sleep(10) # Wait for 10 seconds before checking again
    except Exception as e:
        # Catch potential errors if index is still being created/initialized by Pinecone
        print(f"Error checking index status: {e}. Retrying in 10 seconds...")
        time.sleep(10)

# 2. Ensure the stats are all zeros (for a freshly created/empty index)
# Use index.describe_index_stats() to get the current statistics
stats = index.describe_index_stats()

# Check if the total_vector_count is 0
if stats.total_vector_count == 0:
    print(f"Index '{index_name}' is empty (total_vector_count: {stats.total_vector_count}).")
    print(f"Dimension: {stats.dimension}")
else:
    print(f"WARNING: Index '{index_name}' is not empty. Current stats:")
    print(stats) # Print full stats if not empty

Waiting for index 'abstractive-question-answering' to be ready...
Index 'abstractive-question-answering' is ready.
{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 19811}},
 'total_vector_count': 19811}


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).
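
To make the similarity measure concrete, here is a small pure-Python sketch of cosine similarity. The 2-dimensional vectors are toy values chosen for illustration. The real retriever outputs 768-dimensional vectors, and Pinecone computes the score on its side:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # unrelated direction

print(cosine_similarity(query_vec, same_direction))  # 1.0
print(cosine_similarity(query_vec, orthogonal))      # 0.0
```

Cosine similarity depends only on direction, not vector length, which is why the first pair scores 1.0 even though the vectors differ in magnitude.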

In [11]:
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load the retriever model from HuggingFace model hub
# Use the flax-sentence-embeddings/all_datasets_v3_mpnet-base model
model_name = 'flax-sentence-embeddings/all_datasets_v3_mpnet-base'
retriever = SentenceTransformer(model_name)

# Move the model to the selected device
retriever.to(device)

print(f"Retriever model '{model_name}' loaded successfully on {device}.")

Using device: cpu
Retriever model 'flax-sentence-embeddings/all_datasets_v3_mpnet-base' loaded successfully on cpu.


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.
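
The batching and record layout described above can be sketched without a model or an index. In this minimal example, `batched` and `fake_embed` are hypothetical helpers, and the 3-dimensional "embeddings" are placeholders for the retriever's real output:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(text):
    """Hypothetical stand-in for retriever.encode: returns a tiny fixed-size vector."""
    return [float(len(text)), 0.0, 1.0]

passages = [f"passage {n}" for n in range(5)]

records = []
for batch in batched(passages, batch_size=2):
    for text in batch:
        records.append({
            "id": f"doc-{len(records)}",          # must be unique per record
            "values": fake_embed(text),           # the embedding vector
            "metadata": {"passage_text": text},   # data returned alongside matches
        })

print(len(list(batched(passages, 2))), "batches,", len(records), "records")
```

Each record carries the three parts the text lists: an ID, the embedding values, and a metadata dictionary.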

In [15]:
# import time

# batch_size = 64
# for i in tqdm(range(0, len(df), batch_size), desc="Upserting batches to Pinecone"):
#     # ... (batch_df, batch_texts, batch_embeddings, vectors_for_batch preparation) ...

#     MAX_RETRIES = 5
#     for attempt in range(MAX_RETRIES):
#         try:
#             index.upsert(vectors=vectors_for_batch)
#             break # If upsert is successful, break from retry loop
#         except Exception as e:
#             print(f"\nError upserting batch {i}-{i_end} (Attempt {attempt + 1}/{MAX_RETRIES}): {e}")
#             if attempt < MAX_RETRIES - 1:
#                 time.sleep(min(2 ** attempt, 60)) # Exponential backoff, max 60 seconds
#             else:
#                 print(f"Failed to upsert batch {i}-{i_end} after {MAX_RETRIES} attempts. Skipping.")
#                 # Optionally, log these failed batches to reprocess later
#                 # failed_batches.append((i, i_end)) # You'd need to define failed_batches list
# # ... (rest of the code) ...
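
The commented-out cell above outlines retrying a failed upsert with exponential backoff. A self-contained sketch of that idea as a reusable function is shown below. The `with_retries` helper and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be replaced in tests:

```python
import time

def with_retries(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (capped at max_delay) and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

In the upsert loop, this could wrap the call as `with_retries(lambda: index.upsert(vectors=vectors_for_batch))`.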

In [14]:
# We will use batches of 64
batch_size = 64

print(f"\nStarting embedding generation and upsert for {len(df)} documents into Pinecone index '{index_name}'...")

# Iterate through the DataFrame in batches
for i in tqdm(range(0, len(df), batch_size), desc="Upserting batches to Pinecone"):
    # Find the end index for the current batch
    i_end = min(i + batch_size, len(df))

    # Extract the current batch of documents from the DataFrame
    batch_df = df.iloc[i:i_end]

    # Extract the 'passage_text' for embedding generation
    batch_texts = batch_df['passage_text'].tolist()

    # Generate embeddings for the current batch of texts
    batch_embeddings = retriever.encode(batch_texts, show_progress_bar=False, device=device).tolist()

    # Prepare records for Pinecone upsert
    vectors_for_batch = []
    for j, row in enumerate(batch_df.itertuples(index=False)):
        # The unique ID for each passage comes from the 'wiki_id' column
        doc_id = str(row.wiki_id) # Ensure the ID is a string

        # Create the metadata dictionary
        metadata = {
            'article_title': row.article_title,
            'section_title': row.section_title,
            'passage_text': row.passage_text, # Common to include the original text in metadata for retrieval
            'start_paragraph': row.start_paragraph,
            'start_character': row.start_character,
            'end_paragraph': row.end_paragraph,
            'end_character': row.end_character
        }
        # Add any other relevant columns from your 'df' to metadata here if you wish

        vectors_for_batch.append({
            'id': doc_id,
            'values': batch_embeddings[j], # 'j' is the index within the current batch_embeddings list
            'metadata': metadata
        })
    
    # Upsert the prepared batch of vectors to the Pinecone index
    try:
        index.upsert(vectors=vectors_for_batch)
    except Exception as e:
        # Report the failed batch and continue with the next one
        print(f"\nError upserting batch {i}-{i_end}: {e}")

print("\nFinished generating embeddings and upserting all batches to Pinecone.")

# Final check: Describe index stats to verify the total number of vectors
print("\nFinal Pinecone Index Statistics:")
final_index_stats = index.describe_index_stats()
print(f"Total vectors in index: {final_index_stats.namespaces[''].vector_count}")
print(final_index_stats)


Starting embedding generation and upsert for 20000 documents into Pinecone index 'abstractive-question-answering'...


Upserting batches to Pinecone: 100%|██████████| 313/313 [1:32:51<00:00, 17.80s/it]



Finished generating embeddings and upserting all batches to Pinecone.

Final Pinecone Index Statistics:
Total vectors in index: 19811
{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 19811}},
 'total_vector_count': 19811}
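
The index reports 19,811 vectors although 20,000 documents were processed. A likely cause, stated here as an assumption, is that `wiki_id` identifies an article rather than a passage. Two passages from the same article would then share an ID, and the later upsert would overwrite the earlier one. The sketch below shows one way to derive a per-passage ID from fields already in the dataset. The `make_passage_id` helper and the sample records are hypothetical:

```python
def make_passage_id(doc):
    """Hypothetical helper: combine the article ID with the passage's start offsets."""
    return f"{doc['wiki_id']}-{doc['start_paragraph']}-{doc['start_character']}"

# Hypothetical sample: two passages from article Q1 and one from article Q2.
sample_docs = [
    {"wiki_id": "Q1", "start_paragraph": 2, "start_character": 0},
    {"wiki_id": "Q1", "start_paragraph": 6, "start_character": 120},
    {"wiki_id": "Q2", "start_paragraph": 2, "start_character": 0},
]

article_ids = {d["wiki_id"] for d in sample_docs}
passage_ids = {make_passage_id(d) for d in sample_docs}

print(len(article_ids), len(passage_ids))  # 2 3
```

Comparing `df['wiki_id'].nunique()` with `len(df)` before upserting would confirm whether duplicate IDs explain the gap.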


# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string will look as follows:

>question: What is a sonic boom? context: `<P>` A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. `<P>` Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. `<P>` Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [16]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [18]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode(query, device=device).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(
        vector=xq,
        top_k=top_k,
        include_metadata=True  # Crucial to get back your passage_text and other info
    )
    return xc
print("\n'query_pinecone' function defined.")


'query_pinecone' function defined.


In [19]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context_passages = [f"<P> {m['metadata']['passage_text']}" for m in context]

    # concatenate all context passages into a single string, separated by spaces
    context = " ".join(context_passages)

    # concatenate the query and context passages into the final input string
    query = f"question: {query} context: {context}"

    return query


Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages, and then pass them to the `format_query` function.

In [20]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': 'Q2388981',
              'metadata': {'article_title': 'Electric power system',
                           'end_character': 597.0,
                           'end_paragraph': 6.0,
                           'passage_text': 'Electric power system History In '
                                           '1881, two electricians built the '
                                           "world's first power system at "
                                           'Godalming in England. It was '
                                           'powered by two waterwheels and '
                                           'produced an alternating current '
                                           'that in turn supplied seven '
                                           'Siemens arc lamps at 250 volts and '
                                           '34 incandescent lamps at 40 volts. '
                                           'However, supply to the lamps was '
                    

In [22]:
from pprint import pprint

In [23]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('question: question: when was the first electric power system built? context: '
 '<P> Electric power system History In 1881, two electricians built the '
 "world's first power system at Godalming in England. It was powered by two "
 'waterwheels and produced an alternating current that in turn supplied seven '
 'Siemens arc lamps at 250 volts and 34 incandescent lamps at 40 volts. '
 'However, supply to the lamps was intermittent and in 1882 Thomas Edison and '
 'his company, The Edison Electric Light Company, developed the first '
 'steam-powered electric power station on Pearl Street in New York City. The '
 'Pearl Street Station initially powered around 3,000 lamps for 59 customers. '
 'The power station generated direct current and context: <P> Electric power '
 "system History In 1881, two electricians built the world's first power "
 'system at Godalming in England. It was powered by two waterwheels and '
 'produced an alternating current that in turn supplied seven Siemens arc 

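The output has the `question: ... context: <P> ...` structure the generator expects. Note that the `question:` prefix and the context appear twice in the printout above: the cell overwrites `query` with the formatted string, so running it a second time formats an already formatted query. Run it once on the raw question to avoid this. Now let's write a function to generate answers.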
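
One way to keep the formatting step safe to re-run is to hold the raw question and the formatted model input in separate variables. The sketch below uses a hypothetical `build_input` function that mirrors `format_query`, and a hard-coded `matches` list standing in for a Pinecone response:

```python
def build_input(question, matches):
    """Hypothetical mirror of format_query: join passages with <P> and prepend the question."""
    passages = [f"<P> {m['metadata']['passage_text']}" for m in matches]
    return f"question: {question} context: {' '.join(passages)}"

# Hard-coded stand-in for result["matches"] from Pinecone.
matches = [
    {"metadata": {"passage_text": "In 1881, two electricians built the world's first power system at Godalming in England."}},
]

question = "when was the first electric power system built?"

# `question` is never overwritten, so repeating this line always gives the same result.
model_input = build_input(question, matches)
model_input_again = build_input(question, matches)

print(model_input == model_input_again)  # True
```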

In [24]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer (pprint returns None, so the function returns None)
    return pprint(answer)

In [25]:
generate_answer(query)



('The first electric power system was built at Godalming in England in 1881. '
 'It was powered by two waterwheels and produced an alternating current that '
 'in turn supplied seven Siemens arc')


As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [26]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by the first radio transmitter. The '
 'first radio transmitter was built in 1894 by Guglielmo Marconi. The first '
 'radio transmitter was built in 1896')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [27]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

of its transmitter.
---
audio distance records, and were heard as far west as Hawaii. They were also received in Paris, France, which marked the first transmission of speech across the Atlantic.
With the entrance of the United States into World War I in April 1917 the federal government took over full control of the radio industry, and it became illegal for civilians to possess an operational radio receiver. However NAA continued to operate during the conflict. In addition to time signals and weather reports, it also broadcast news summaries received by troops on land and aboard ships in the Atlantic. Effective April 15, 1919
---
Message from space (science fiction) For other uses, see Message from Space (disambiguation).
"Message from space" is a type of "first contact" theme in science fiction . Stories of this type involve receiving an interstellar message which reveals the existence of other intelligent life in the universe. History An early short story, A Message from Space (Josep

In this case, the answer looks plausible, although it repeats itself and gives two different dates. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19:

In [28]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('COVID-19 is a virus that infects humans. It is not a bacterium, it is a '
 'virus that infects humans. It is not a bacterium, it is a')


In [29]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

plague is a zoonotic disease, as are salmonellosis, Rocky Mountain spotted fever, and Lyme disease.
A major factor contributing to the appearance of new zoonotic pathogens in human populations is increased contact between humans and wildlife. This can be caused either by encroachment of human activity into wilderness areas or by movement of wild animals into areas of human activity. An example of this is the outbreak of Nipah virus in peninsular Malaysia in 1999, when intensive pig farming began on the habitat of infected fruit bats. Unidentified infection of the pigs amplified the force of infection, eventually transmitting the virus
---
diseases when they occupy at certain times of the year natural habitat of a certain pathogen (plague, tularemia, leptospirosis, arboviruses, tick-borne relapsing fever. The WHO Expert Committee on Zoonoses listed over 100 such diseases. About natural focality of the diseases is known elsewhere. History Historically, Sanitary epidemiological reconnaiss
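
Since irrelevant contexts lead to unreliable answers, one possible safeguard is to drop matches whose similarity score is low and decline to answer when nothing remains. Each match returned by Pinecone carries a `score` field. The cutoff value, the `filter_matches` helper, and the sample matches below are all hypothetical and would need tuning on real queries:

```python
def filter_matches(matches, min_score):
    """Keep only matches whose similarity score reaches the cutoff."""
    return [m for m in matches if m["score"] >= min_score]

# Hypothetical matches standing in for context["matches"].
matches = [
    {"score": 0.71, "metadata": {"passage_text": "a relevant passage"}},
    {"score": 0.22, "metadata": {"passage_text": "an unrelated passage"}},
]

MIN_SCORE = 0.5  # hypothetical cutoff, not a recommended value

kept = filter_matches(matches, MIN_SCORE)
if kept:
    print(f"Answering with {len(kept)} context passage(s).")
else:
    print("No sufficiently relevant context found.")
```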

Let’s finish with a final few questions.

In [30]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The War of Currents was a series of naval battles between the British and '
 'French navies during the Napoleonic Wars. The British and French navies '
 'fought each other in a series of')


In [31]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first person to walk on the moon was Neil Armstrong, who walked on the '
 'moon in 1969.')


In [32]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $3.5 billion.')


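As we can see, the model can generate some decent answers, but it is not always right: the War of the Currents was a dispute over AC versus DC electric power distribution, not a series of naval battles.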

#### Add a few more questions

In [33]:
query = "who has invented Mobilephone?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure who invented it, but I think it's pretty much the same as the "
 'idea of a cell phone. The idea is that you can use a phone to communicate '
 'with someone')


In [34]:
query = "Indian history famous for?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not a historian, but I've read a lot of Indian history, and I think it's "
 'important to note that there are a lot of misconceptions about Indian '
 'history. There are a')
