## Demo for Question Answering System (Multi-Document)

This demo walks you through building a question answering (QA) pipeline with Haystack that answers questions over multiple documents.

#### 1. Setup

In [None]:
# Make sure a GPU is available
!nvidia-smi  # works on Windows and Linux when the NVIDIA driver is installed
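
If you would rather check for a GPU from Python than through a shell escape, a minimal sketch could look like the one below. The helper name `gpu_visible` is hypothetical. It only checks that the `nvidia-smi` tool is on the PATH and exits cleanly, which is a proxy for a working NVIDIA driver and not a guarantee that Haystack can use the GPU.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not found, so no NVIDIA driver is visible
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False`, the rest of the demo should still run with `use_gpu=False`, only more slowly.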

Import packages

In [1]:
import pandas as pd
import pprint
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor, DensePassageRetriever
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import FARMReader



#### 2. Create a document store

Think of this as a database that stores your documents so the QA system can search them.

In [2]:
document_store = InMemoryDocumentStore()
# You could also use a FAISS document store, which is optimised for vector search with DPR; for simplicity we use the in-memory store here

#### 3. Load and format your data 
The data is a CSV file containing COVID-19 related information. We will see how a Haystack retriever selects the passages most relevant to a question before a QA model extracts the answer from them.

In [4]:
# Read the data into a pandas DataFrame
df = pd.read_csv('multi_demo_covid.csv')

In [5]:
# Reformat the data into the list of dictionaries that Haystack expects
def get_docs(input_df):
    docs = []
    # iterrows() works for any index, not only the default 0..n-1 index
    for _, row in input_df.iterrows():
        doc = {
            'content': row['text'],
            'meta': {'link': row['link'],
                     'source': row['source']}
        }
        docs.append(doc)
    return docs
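
To make the target format concrete, here is a small sketch that builds the same kind of dictionaries from an in-memory CSV using only the standard library. The two rows and their URLs are made up for illustration; only the column names `text`, `link` and `source` come from the function above.

```python
import csv
import io

# Hypothetical two-row CSV with the same columns that get_docs reads.
sample_csv = io.StringIO(
    "text,link,source\n"
    "First example article.,https://example.com/a,Example Source A\n"
    "Second example article.,https://example.com/b,Example Source B\n"
)

# One dictionary per row: the text goes under 'content', everything else under 'meta'.
docs = [
    {"content": row["text"], "meta": {"link": row["link"], "source": row["source"]}}
    for row in csv.DictReader(sample_csv)
]

print(docs[0])
```

Keeping `link` and `source` under `meta` is what lets the answers be traced back to their origin later in the demo.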

In [6]:
# Some articles are long, so we split them into smaller chunks
preprocessor = PreProcessor(split_by='word',
                            split_length=300,  # each chunk holds at most about 300 words
                            split_overlap=30,  # consecutive chunks share about 30 words
                            split_respect_sentence_boundary=True)  # avoid cutting sentences in half
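
The effect of `split_length` and `split_overlap` is easier to see on a tiny input. The sketch below is a simplified sliding window over words, not Haystack's actual implementation: the function name `split_words` is hypothetical and it ignores sentence boundaries entirely.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into word chunks; consecutive chunks share `split_overlap` words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = split_words(text, split_length=4, split_overlap=1)
print(chunks)
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

The overlap means a sentence that straddles a chunk border still appears whole in at least one chunk, provided it is shorter than the overlap.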



In [7]:
data = get_docs(df)
preprocessed_data = preprocessor.process(data)

Preprocessing: 100%|█████████████████████████████████████████████████████████████| 213/213 [00:00<00:00, 4797.82docs/s]


Write our data into the document store

In [8]:
document_store.write_documents(preprocessed_data)

#### 4. Load DPR and QA Model 

Load the DPR
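
Dense Passage Retrieval (DPR) encodes the question and each passage into vectors with two separate encoders, then ranks passages by the dot product between the question vector and each passage vector. The sketch below shows only that ranking step, using made-up three-dimensional vectors in place of real embeddings; `rank_by_dot_product` is a hypothetical helper, not a Haystack function.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(
    query_vec: List[float], passage_vecs: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Return the top_k (passage_id, score) pairs, highest score first."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up toy "embeddings"; the DPR base encoders produce 768-dimensional vectors.
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.7, 0.3, 0.2],
}
query = [1.0, 0.2, 0.0]

top = rank_by_dot_product(query, passages, top_k=2)
print([pid for pid, _ in top])  # ['p1', 'p3']
```

In the real pipeline the passage vectors are computed once and stored in the document store, so only the question has to be encoded at query time.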

In [9]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=128, 
    max_seq_len_passage=512,
    batch_size=16,
    use_gpu=True,  # set to False if you have no GPU; embedding will just take longer
)

document_store.update_embeddings(retriever)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.

Load the reader (this is the QA model from Hugging Face)

In [10]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
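
An extractive reader like this one does not generate text. Roughly speaking, it scores every token as a possible answer start and as a possible answer end, then returns the best-scoring span. The sketch below illustrates that selection step with made-up scores on whole words; a real model works on subword tokens and also handles the "no answer" case, which is left out here. The function name `best_span` is hypothetical.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float], end_scores: List[float], max_len: int
) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best span with start <= end and at most max_len tokens."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Made-up tokens and scores for illustration only.
tokens = ["first", "identified", "in", "Wuhan", "China", "in", "2019"]
start_scores = [0.1, 0.0, 0.2, 3.0, 0.5, 0.1, 0.3]
end_scores = [0.0, 0.1, 0.0, 1.0, 2.5, 0.2, 0.4]

start, end, score = best_span(start_scores, end_scores, max_len=3)
print(" ".join(tokens[start:end + 1]))  # Wuhan China
```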

Create a pipeline using both the retriever and reader

In [11]:
pipeline = ExtractiveQAPipeline(reader, retriever)

#### 5. Trying out our pipeline

Here is a simple helper that collects the results into a DataFrame so they display nicely.

In [12]:
def print_preds_df(results):
    """Return the pipeline's answers as a DataFrame sorted by score, highest first."""
    columns = ["answer", "context", "score", "link", "source"]

    # Keep only the fields we want to display
    rows = [
        {
            "answer": ans.answer,
            "context": ans.context,
            "score": ans.score,
            "link": ans.meta["link"],
            "source": ans.meta["source"],
        }
        for ans in results["answers"]
    ]

    df_res = pd.DataFrame(rows, columns=columns)
    df_res = df_res.sort_values(by="score", ascending=False).reset_index(drop=True)
    df_res["score"] = df_res["score"].round(2)
    return df_res

Run the pipeline on a question:
- The retriever selects the 20 most relevant passages from the document store.
- The reader (QA model) then extracts the 5 highest-scoring answers from those passages.
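
The two `top_k` values act as a funnel: the first limits how many passages reach the reader, and the second limits how many answers are kept overall, not per passage. A simplified picture of that second step, with hypothetical candidate answers and scores, might look like this:

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical candidate answers, grouped by the retrieved passage they came from.
candidates_per_passage: Dict[str, List[Tuple[str, float]]] = {
    "passage_a": [("Wuhan", 0.91), ("China", 0.40)],
    "passage_b": [("animals", 0.77)],
    "passage_c": [("2019", 0.15), ("Hubei Province", 0.62)],
}


def top_answers(
    candidates: Dict[str, List[Tuple[str, float]]], top_k: int
) -> List[Tuple[str, float, str]]:
    """Pool candidates from all passages and keep the top_k by score."""
    flat = [
        (answer, score, passage_id)
        for passage_id, answers in candidates.items()
        for answer, score in answers
    ]
    return heapq.nlargest(top_k, flat, key=lambda item: item[1])


best = top_answers(candidates_per_passage, top_k=3)
print([answer for answer, _, _ in best])  # ['Wuhan', 'animals', 'Hubei Province']
```

A larger retriever `top_k` gives the reader more chances to find the answer but makes each query slower, since the reader has to process every retrieved passage.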

In [13]:
qn = 'Where did the coronavirus first appear?'
prediction = pipeline.run(query=qn, params={'Retriever': {'top_k': 20}, 'Reader': {'top_k': 5}})
prediction_df = print_preds_df(prediction)
prediction_df  # shows the top 5 answers, sorted by score

Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.36 Batches/s]

| | answer | context | score | link | source |
|---|---|---|---|---|---|
| 0 | Wuhan China | t is causing the 2019 novel coronavirus outbreak, first identified in Wuhan ... | 0.97 | https://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |
| 1 | China | The novel coronavirus detected in China is genetically closely related to th... | 0.85 | https://www.ecdc.europa.eu/en/novel-coronavirus-china/questions-answers | European Centre for Disease Prevention and Control (ECDC) |
| 2 | animals | Coronaviruses are a large family of viruses that are common in animals. Occa... | 0.77 | https://www.who.int/news-room/q-a-detail/q-a-coronaviruses | World Health Organization (WHO) |
| 3 | humans | ily of viruses. There are some coronaviruses that commonly circulate in huma... | 0.73 | https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/nCoV2019.aspx# | California Department of Public Health |
| 4 | Wuhan City | This virus was first detected in Wuhan City, Hubei Province, China. The firs... | 0.7 | https://www.cdc.gov/coronavirus/2019-ncov/faq.html | Center for Disease Control and Prevention (CDC) |
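
The lower-ranked answers here ("animals", "humans") do not really answer the question, so in practice you may want to drop answers below some score. The sketch below applies a cut-off to the answers and scores copied from the results above. The threshold of 0.8 is a hypothetical value chosen for illustration; a sensible cut-off depends on the model and the data.

```python
# (answer, score) pairs copied from the results above.
results = [
    ("Wuhan China", 0.97),
    ("China", 0.85),
    ("animals", 0.77),
    ("humans", 0.73),
    ("Wuhan City", 0.7),
]

MIN_SCORE = 0.8  # hypothetical cut-off, not a recommended value

confident = [(answer, score) for answer, score in results if score >= MIN_SCORE]
print(confident)  # [('Wuhan China', 0.97), ('China', 0.85)]
```

Note that a fixed threshold also discards "Wuhan City", which is a correct answer with a lower score, so it trades recall for precision.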
