<a href="https://colab.research.google.com/github/racheltlw/htx_qa_demo/blob/main/QA_Multi_Demo_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Demo for a Question Answering System (Multi-Document)

This demo walks you through building a question answering (QA) pipeline with Haystack for multi-document QA.

#### 1. Setup

In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving requirements.txt to requirements.txt
User uploaded file "requirements.txt" with length 93 bytes


In [2]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting django_haystack==3.2.1
  Downloading django-haystack-3.2.1.tar.gz (466 kB)
     |████████████████████████████████| 466 kB 26.0 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
ERROR: Could not find a version that satisfies the requirement numpy==1.23.1 (from versions: 1.3.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.10.0.post2, 1.10.1, 1.10.2, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.13.0rc1, 1.13.0rc2, 1.13.0, 1.13.1, 1.13.3, 1.14.0rc1, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4, 1.16.0rc1, 1.16.0rc2, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6, 1.17.0rc1, 1.17.0rc2, 1.17.0, 

In [3]:
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113


In [4]:
!pip install git+https://github.com/deepset-ai/haystack.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-69xulz3h
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-69xulz3h
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting quantulum3
  Downloading quantulum3-0.7.10-py3-none-any.whl (10.7 MB)
     |████████████████████████████████| 10.7 MB 24.2 MB/s 
Collecting rapidfuzz<3,>=2.0.15
  Downloading rapidfuzz-2.6.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
     |████████████████████████████████| 2.2 MB 59.6 MB/s 
Collecting elastic-apm
  Downloading elastic_apm-6.11.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (381 kB)


Import packages

In [5]:
import pandas as pd
import pprint
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, DensePassageRetriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import FARMReader

#### 2. Create a document store

Think of this as a database that stores your documents so the QA system can search them.

In [6]:
document_store = InMemoryDocumentStore()
#you can also use a FAISS document store, which is optimised for the vector search that DPR relies on; for simplicity we use the in-memory store here

#### 3. Load and format your data 
The data is a CSV file containing COVID-19-related information. We will see how a Haystack retriever can select the articles most relevant to a question before a QA model extracts the answer.

In [8]:
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving multi_demo_covid.csv to multi_demo_covid.csv
User uploaded file "multi_demo_covid.csv" with length 160284 bytes


In [9]:
#read the data into a pandas dataframe; it is converted to the haystack document format in the next cell
df = pd.read_csv('multi_demo_covid.csv')

In [10]:
#reformat the data so that the haystack framework can use it
def get_docs(input_df):
    docs = []
    #iterate over the rows directly so this also works when the index is not 0..n-1
    for _, row in input_df.iterrows():
        doc = {
            'content': row['text'],
            'meta': {'link': row['link'],
                     'source': row['source']}
        }
        docs.append(doc)
    return docs
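
To make the target format concrete, here is a minimal pandas-free sketch that builds the same kind of dictionaries from a tiny in-memory CSV. The sample row, its URL, and the helper name `rows_to_docs` are hypothetical; only the `content`/`meta` layout is taken from `get_docs` above.

```python
import csv
import io

# Hypothetical one-row CSV with the same columns the demo expects.
SAMPLE_CSV = """text,link,source
Coronaviruses are a large family of viruses.,https://example.org/faq,Example Health Agency
"""

def rows_to_docs(csv_text):
    """Build document dicts (content + meta) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "content": row["text"],
            "meta": {"link": row["link"], "source": row["source"]},
        }
        for row in reader
    ]

docs = rows_to_docs(SAMPLE_CSV)
print(docs[0]["content"])
print(docs[0]["meta"])
```

Everything under `meta` travels with the document, which is why the link and source can be shown next to each answer later on.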

In [11]:
#some articles are quite long, so we split them into smaller chunks
preprocessor = PreProcessor(split_by='word',
                            split_length=300, #each chunk is up to roughly 300 words long
                            split_overlap=30, #each chunk overlaps with the previous chunk by about 30 words
                            split_respect_sentence_boundary=True) #chunks end on sentence boundaries rather than mid-sentence



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
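
The effect of `split_length` and `split_overlap` is easier to see on a toy input. The sketch below is a simplified sliding window over words, not Haystack's implementation: the helper name `split_words` is hypothetical, and it ignores sentence boundaries, which `split_respect_sentence_boundary=True` would honour.

```python
def split_words(text, split_length=300, split_overlap=30):
    """Split text into word windows that overlap by split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# Ten toy "words" split into windows of 4 that share 1 word with their neighbour.
toy_text = " ".join(f"w{i}" for i in range(10))
chunks = split_words(toy_text, split_length=4, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.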


In [12]:
data = get_docs(df)
preprocessed_data = preprocessor.process(data)

Preprocessing:   0%|          | 0/213 [00:00<?, ?docs/s]

Write our data into the document store

In [13]:
document_store.write_documents(preprocessed_data)

#### 4. Load DPR and QA Model 

Load the Dense Passage Retriever (DPR)

In [14]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=128, 
    max_seq_len_passage=512,
    batch_size=16,
    use_gpu=True, #set this to False if you do not have a GPU; it will just take longer
)

#compute an embedding for every stored passage so the retriever can search them
document_store.update_embeddings(retriever)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


Documents Processed: 10000 docs [00:09, 1093.51 docs/s]
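
DPR encodes the question and each passage as vectors and ranks passages by how well the two vectors match, typically using the dot product. The sketch below illustrates that ranking step with hand-written toy vectors; the values, the passage ids, and the helper names `dot` and `retrieve` are hypothetical, and real embeddings from the encoders above have hundreds of dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, passage_vecs, top_k=2):
    """Rank passages by dot-product similarity to the query, best first."""
    scored = [(dot(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(reverse=True)
    return [(pid, score) for score, pid in scored[:top_k]]

# Toy 3-dimensional "embeddings" standing in for the encoder outputs.
passage_vecs = {
    "p1": [1.0, 0.0, 2.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.5, 1.0, 1.0],
}
query_vec = [1.0, 0.0, 1.0]

top = retrieve(query_vec, passage_vecs, top_k=2)
print(top)
```

Because passage embeddings are computed once and stored, only the question needs to be encoded at query time, which is what `update_embeddings` prepares for.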


Load the reader (the QA model, downloaded from Hugging Face)

In [15]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


Create a pipeline using both the retriever and reader

In [16]:
pipeline = ExtractiveQAPipeline(reader, retriever)
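
Conceptually, the pipeline chains two ranking stages: the retriever keeps the best passages, and the reader then keeps the best answers found in those passages. The sketch below mimics that funnel with made-up scores; the passage names, answers, scores, and the helper `top_k_by_score` are all hypothetical, and a real retriever and reader compute their scores with neural models.

```python
import heapq

def top_k_by_score(items, k):
    """Return the k highest-scoring items; each item is a (label, score) pair."""
    return heapq.nlargest(k, items, key=lambda item: item[1])

# Hypothetical retriever scores for three passages.
retrieved = [("passage A", 0.91), ("passage B", 0.40), ("passage C", 0.75)]
# Hypothetical reader scores for the answers found in each passage.
candidate_answers = {
    "passage A": [("answer 1", 0.97), ("answer 2", 0.20)],
    "passage B": [("answer 3", 0.85)],
    "passage C": [("answer 4", 0.70)],
}

# Stage 1: keep only the best passages (the retriever's top_k).
kept_passages = top_k_by_score(retrieved, k=2)
# Stage 2: pool the answers from the kept passages and keep the best (the reader's top_k).
pooled = [ans for name, _ in kept_passages for ans in candidate_answers[name]]
best_answers = top_k_by_score(pooled, k=2)

print(kept_passages)
print(best_answers)
```

Note that "answer 3" never reaches the reader stage because its passage was not retrieved: the reader can only be as good as the passages it is given.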

#### 5. Trying out our pipeline

Here is a simple helper that collects the results into a DataFrame so we can display them nicely

In [17]:
def print_preds_df(results):
    #keep only the fields we want to display for each answer
    rows = [
        {'answer': ans.answer, 'context': ans.context, 'score': ans.score,
         'link': ans.meta['link'], 'source': ans.meta['source']}
        for ans in results["answers"]
    ]

    df_res = pd.DataFrame(rows, columns=["answer", "context", "score", "link", "source"])

    #highest-scoring answers first, with a clean 0..n-1 index
    df_res = df_res.sort_values(by='score', ascending=False).reset_index(drop=True)
    df_res['score'] = df_res['score'].round(2)
    return df_res

Run the pipeline on a question
- The retriever selects the 20 most relevant passages
- The reader then extracts the 5 most probable answers from those passages

In [18]:
qn = 'Where did the coronavirus first appear?'
prediction = pipeline.run(query=qn, params={'Retriever': {'top_k': 20}, 'Reader': {'top_k': 5}})
prediction_df = print_preds_df(prediction)
prediction_df #shows the top 5 answers, sorted by score

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.97 Batches/s]

| | answer | context | score | link | source |
|---|---|---|---|---|---|
| 0 | Wuhan China | t is causing the 2019 novel coronavirus outbreak, first identified in Wuhan ... | 0.97 | \nhttps://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |
| 1 | China | The novel coronavirus detected in China is genetically closely related to th... | 0.85 | https://www.ecdc.europa.eu/en/novel-coronavirus-china/questions-answers | European Centre for Disease Prevention and Control (ECDC) |
| 2 | animals | Coronaviruses are a large family of viruses that are common in animals. Occa... | 0.77 | https://www.who.int/news-room/q-a-detail/q-a-coronaviruses | World Health Organization (WHO) |
| 3 | humans | ily of viruses. There are some coronaviruses that commonly circulate in huma... | 0.73 | https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/nCoV2019.aspx# | California Department of Public Health |
| 4 | Wuhan City | This virus was first detected in Wuhan City, Hubei Province, China. The firs... | 0.7 | \nhttps://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |
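
Lower-ranked answers such as "animals" and "humans" do not actually answer the question, so in practice you may want to drop answers below a score cut-off. A minimal sketch, using the answers and scores from the table above: the helper name `filter_by_score` and the threshold of 0.8 are hypothetical choices for illustration, and a suitable cut-off would need to be tuned on your own data.

```python
# Answers and scores copied from the results table above.
answers = [
    {"answer": "Wuhan China", "score": 0.97},
    {"answer": "China", "score": 0.85},
    {"answer": "animals", "score": 0.77},
    {"answer": "humans", "score": 0.73},
    {"answer": "Wuhan City", "score": 0.70},
]

def filter_by_score(rows, min_score):
    """Keep only the rows whose score is at least min_score."""
    return [row for row in rows if row["score"] >= min_score]

# 0.8 is an illustrative threshold, not a recommended value.
confident = filter_by_score(answers, min_score=0.8)
print([row["answer"] for row in confident])
```

A fixed cut-off is a blunt tool: here it also drops "Wuhan City", which is correct but scored lower than the two irrelevant answers.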
