## Demo for Question and Answering System (Multi) 

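This demo walks you through building a question answering (QA) pipeline with Haystack that answers questions over multiple documents. No model is trained here; the pipeline combines a pretrained retriever with a pretrained reader.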

#### 1. Setup

In [35]:
# Make sure you have a GPU running
!nvidia-smi

Wed Aug 10 15:39:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.36       Driver Version: 512.36       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 5000... WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P8     9W /  N/A |   3282MiB / 16384MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
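
The `!nvidia-smi` shell escape only works inside a notebook. As a rough sketch, the same check can be done from a plain Python script with the standard library. The helper name `gpu_visible` and the variable `has_gpu` are illustrative assumptions, not part of the original notebook.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # "-L" lists the GPUs the driver can see
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


has_gpu = gpu_visible()
print("GPU visible:", has_gpu)
```

A flag like `has_gpu` could then be passed as `use_gpu` when the retriever and reader are created later in the demo.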

In [3]:
# Install the latest master branch of Haystack
# (alternatively, run the same command in a terminal without the leading "!")
!pip install git+https://github.com/deepset-ai/haystack.git

Collecting pip
  Using cached pip-22.2.2-py3-none-any.whl (2.0 MB)


ERROR: To modify pip, please run the following command:
C:\Users\Rachel Tan\Documents\France Trip\paris_demo\Scripts\python.exe -m pip install --upgrade pip


Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to c:\users\rachel tan\appdata\local\temp\pip-req-build-p_a03c61
  Resolved https://github.com/deepset-ai/haystack.git to commit b685409c78663751ff5256b053f722cf1e08240b
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting quantulum3
  Using cached quantulum3-0.7.10-py3-none-any.whl (10.7 MB)
Collecting pydantic
  Downloading pydantic-1.9.1-cp39-cp39-win_amd64.whl (2.0 MB)
     ---------------------------------------- 2.0/2.0 MB 2.0 MB/s eta 0:00:00
Collecting tika
  Using cached tika-1.24-py3-none-any.whl
Collecting elasticsearch<7.11,>=7.7
  Downloading elasticsearch-7.10.1-py2.py3-none-any.w

  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git 'C:\Users\Rachel Tan\AppData\Local\Temp\pip-req-build-p_a03c61'
ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\Rachel Tan\\Documents\\France Trip\\paris_demo\\Lib\\site-packages\\~ywin32_system32\\pythoncom39.dll'
Check the permissions.



Import packages

In [36]:
import pandas as pd
import pprint
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, DensePassageRetriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import FARMReader

#### 2. Create a document store

Think of the document store as a database that holds your documents so the QA system can search them.

In [37]:
document_store = InMemoryDocumentStore() 
# You can also use a FAISS document store, which is optimised for the vector search that DPR relies on; for simplicity we use the in-memory store.

#### 3. Load and format your data 
The data is a mixture of news articles and COVID-19 related information. We will see how a Haystack retriever selects the articles most relevant to a question before a QA model extracts the answer.

In [38]:
# read the data into a pandas DataFrame (the code below uses its 'text', 'link' and 'source' columns)
df = pd.read_csv('multi_demo.csv')

In [39]:
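# reformat the data so that the Haystack framework can use it
def get_docs(input_df):
    """Convert each DataFrame row into a Haystack document dictionary."""
    docs = []
    # iterate over the rows directly so this also works when the index is not 0..n-1
    for _, row in input_df.iterrows():
        doc = {
            'content': row['text'],
            'meta': {'link': row['link'],
                     'source': row['source']}
        }
        docs.append(doc)
    return docs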
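
To make the target format concrete, here is a minimal sketch that builds the same kind of document dictionaries from a two-row inline sample, using only the standard library. The sample rows, the `example.com` links and the helper name `rows_to_docs` are hypothetical and are not taken from `multi_demo.csv`.

```python
import csv
import io

# Hypothetical two-row sample with the same column names that get_docs reads.
sample_csv = io.StringIO(
    "text,link,source\n"
    "First article body.,https://example.com/a,Example News\n"
    "Second article body.,https://example.com/b,Example Health\n"
)


def rows_to_docs(rows):
    """Build document dictionaries (content + meta) from an iterable of row mappings."""
    return [
        {"content": row["text"], "meta": {"link": row["link"], "source": row["source"]}}
        for row in rows
    ]


docs = rows_to_docs(csv.DictReader(sample_csv))
for doc in docs:
    print(doc)
```

Each dictionary keeps the text under `content` and everything else under `meta`, which is why the link and source can be shown next to each answer later on.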

In [40]:
# some articles are quite long, so we split them into smaller chunks
preprocessor = PreProcessor(split_by = 'word', 
                            split_length = 300, # each chunk holds roughly 300 words at most
                            split_overlap = 30, # each chunk overlaps with the previous chunk by about 30 words
                            split_respect_sentence_boundary= True) # avoid cutting a chunk in the middle of a sentence
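
To illustrate what `split_length` and `split_overlap` mean, here is a simplified sketch of word-window splitting in plain Python. It is not Haystack's implementation: it ignores sentence boundaries, and the function name `split_words` is an assumption made for this example.

```python
def split_words(text, split_length=300, split_overlap=30):
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


# A synthetic 700-word "article": w0 w1 ... w699
sample = " ".join(f"w{i}" for i in range(700))
chunks = split_words(sample)
chunk_sizes = [len(chunk.split()) for chunk in chunks]
print(chunk_sizes)
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing a few words twice.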



In [41]:
data = get_docs(df)
preprocessed_data = preprocessor.process(data)

Preprocessing: 100%|█████████████████████████████████████████████████████████████| 268/268 [00:00<00:00, 4786.41docs/s]


Write our data into the document store

In [None]:
document_store.write_documents(preprocessed_data)

#### 4. Load DPR and QA Model 

Load the Dense Passage Retriever (DPR)

In [23]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=128, 
    max_seq_len_passage=512,
    batch_size=16,
    use_gpu=True, #if you do not have a gpu you can turn this off, it will just take longer
)

document_store.update_embeddings(retriever)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
Updating Embedding:   0%|                                                                   | 0/322 [00:00<?, ? docs/s]
Create embeddings:   0%|                                                                    | 0/336 [00:00<?, ? Docs/s][A
Create embeddings:   5%|██▊                                                        | 16/336 [00:00<00:11, 27.05 Docs/s][A
Create embeddings:  10%|█████▌                                                     | 32/336 [00:00<00:08, 36.07 Docs/s][A
Create embeddings:  14%|████████▍                                                  | 48/336 [00:01<00:07, 40.22 Docs/s][A
Create embeddings:  19%|███████████▏                                  
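
Conceptually, DPR encodes the question and every passage as vectors and ranks passages by the dot product between them. The following toy sketch shows only that ranking step in plain Python. The three-dimensional vectors and passage names are made up for illustration; real DPR embeddings come from the two encoder models loaded above and are much larger.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(query_vec, passage_vecs, k):
    """Return the k (passage_id, score) pairs with the highest dot-product score."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, hand-written for this example.
passage_vecs = {
    "covid_faq": [0.9, 0.1, 0.0],
    "crime_news": [0.1, 0.8, 0.3],
    "traffic_news": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

top = top_k_passages(query_vec, passage_vecs, k=2)
print(top)
```

Because passage vectors do not depend on the question, they can be computed once up front, which is what `document_store.update_embeddings(retriever)` does.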

Load the Reader (this is the QA model from Hugging Face)

In [24]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

Create a pipeline using both the retriever and reader

In [25]:
pipeline = ExtractiveQAPipeline(reader, retriever)

#### 5. Trying out our pipeline

Here is a simple helper function that displays the results as a table.

In [27]:
def print_preds_df(results):
    """Return the predicted answers as a DataFrame sorted by score, highest first."""
    # keep only the fields we want to display
    rows = [
        {'answer': ans.answer, 'context': ans.context, 'score': ans.score,
         'link': ans.meta['link'], 'source': ans.meta['source']}
        for ans in results["answers"]
    ]
    df_res = pd.DataFrame(rows, columns=["answer", "context", "score", "link", "source"])

    # best answers first, with a clean 0..n-1 index and scores rounded for display
    df_res = df_res.sort_values(by='score', ascending=False).reset_index(drop=True)
    df_res['score'] = df_res['score'].round(2)
    return df_res

Run the pipeline on a question 
- The retriever selects the 20 most relevant passages (`top_k: 20`)
- The reader then extracts the 5 most probable answers from those passages (`top_k: 5`)
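
The two steps above can be sketched as a generic retrieve-then-read function. This is a toy illustration in plain Python, not Haystack code: `overlap_score` and `toy_reader` are hypothetical stand-ins for the DPR retriever and the RoBERTa reader, and the passages are invented for the example.

```python
def retrieve_then_read(query, passages, retriever_score, reader_extract,
                       retriever_top_k=20, reader_top_k=5):
    """Rank all passages cheaply, then run the reader only on the best ones."""
    # Stage 1: score every passage and keep the retriever_top_k best
    ranked = sorted(passages, key=lambda p: retriever_score(query, p), reverse=True)
    candidates = ranked[:retriever_top_k]
    # Stage 2: the (expensive) reader only sees the candidates
    answers = []
    for passage in candidates:
        answers.extend(reader_extract(query, passage))  # list of (answer, score)
    answers.sort(key=lambda pair: pair[1], reverse=True)
    return answers[:reader_top_k]


def overlap_score(query, passage):
    """Stand-in retriever: number of words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


reader_calls = []


def toy_reader(query, passage):
    """Stand-in reader: returns the last word of the passage as the 'answer'."""
    reader_calls.append(passage)
    return [(passage.split()[-1], overlap_score(query, passage) / 10)]


passages = [f"filler text number {i}" for i in range(40)] + [
    "the virus was first detected in Wuhan",
    "the first case of the virus was reported in December",
]
best = retrieve_then_read("where was the virus first detected", passages,
                          overlap_score, toy_reader)
print(best)
print("reader calls:", len(reader_calls))
```

The design point is that the reader runs on 20 passages rather than all 42, so a larger retriever `top_k` trades speed for a better chance that the passage containing the answer reaches the reader.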

In [48]:
qn = 'Where did the coronavirus first appear? '
prediction = pipeline.run(query=qn, params={'Retriever': {'top_k': 20}, 'Reader': {'top_k':5}})
prediction_df = print_preds_df(prediction)
prediction_df #shows the top 5 answers by score 

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.37 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.12 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.23 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.62 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.29 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.77 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.04 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.52 Batches/s]
Inferencing Samples: 100%|██████████████

| | answer | context | score | link | source |
|---|---|---|---|---|---|
| 0 | Wuhan China | t is causing the 2019 novel coronavirus outbreak, first identified in Wuhan ... | 0.97 | \nhttps://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |
| 1 | China | The novel coronavirus detected in China is genetically closely related to th... | 0.85 | https://www.ecdc.europa.eu/en/novel-coronavirus-china/questions-answers | European Centre for Disease Prevention and Control (ECDC) |
| 2 | animals | Coronaviruses are a large family of viruses that are common in animals. Occa... | 0.77 | https://www.who.int/news-room/q-a-detail/q-a-coronaviruses | World Health Organization (WHO) |
| 3 | humans | ily of viruses. There are some coronaviruses that commonly circulate in huma... | 0.73 | https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/nCoV2019.aspx# | California Department of Public Health |
| 4 | Wuhan City | This virus was first detected in Wuhan City, Hubei Province, China. The firs... | 0.7 | \nhttps://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |


In [50]:
qn = "What happened on Halloween at Marina Bay Sands?"
prediction = pipeline.run(query=qn, params={'Retriever': {'top_k': 20}, 'Reader': {'top_k':5}})
prediction_df = print_preds_df(prediction)
prediction_df

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.73 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.61 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.58 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.69 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.03 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.08 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46 Batches/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.31 Batches/s]
Inferencing Samples: 100%|██████████████

| | answer | context | score | link | source |
|---|---|---|---|---|---|
| 0 | an attack | One of the victims of an attack at Marina Bay Sands on Halloween sustained b... | 0.33 | https://www.asiaone.com/singapore/halloween-horror-attack-marina-bay-sands-v... | Asia One |
| 1 | accosted | 6am after a Halloween-themed party at nightclub Marquee when they were accos... | 0.21 | https://www.asiaone.com/singapore/halloween-horror-attack-marina-bay-sands-v... | Asia One |
| 2 | attacked | red Hello Kitty theme. Banquet waiter Joshua Koh Kian Yong (above) was attac... | 0.11 | https://www.asiaone.com/singapore/businessman-gets-6-years-jail-paying-hitme... | Asia One |
| 3 | disappearance from the restaurant | Singh, said her husband was behaving suspiciously and his disappearance fro... | 0.05 | https://www.asiaone.com/singapore/woman-jailed-7-months-smashing-beer-bottle... | Asia One |
| 4 | after the accident | Lee, who was seated in the dock, as the driver. DPP Koh said that after the ... | 0.05 | https://www.asiaone.com/singapore/man-denies-driving-maserati-dragged-traffi... | Asia One |
