## QA System with Haystack and Elasticsearch Backend

In [1]:
import warnings

warnings.filterwarnings("ignore")

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(return_embedding=True)

11/28/2021 02:33:24 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.013s]
11/28/2021 02:33:24 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.005s]
11/28/2021 02:33:24 - INFO - elasticsearch -   GET http://localhost:9200/document [status:200 request:0.004s]
11/28/2021 02:33:24 - INFO - elasticsearch -   PUT http://localhost:9200/document/_mapping [status:200 request:0.027s]
11/28/2021 02:33:24 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.007s]


> _By default, ElasticsearchDocumentStore creates two indices on Elasticsearch: one called document for (you guessed it) storing documents, and another called label for storing the annotated answer spans. For now, we’ll just populate the document index with the SubjQA reviews, and Haystack’s document stores expect a list of dictionaries with text and meta keys as follows:_
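
As a rough illustration of that document shape, here is a minimal sketch that uses plain Python instead of pandas. The rows and the `to_documents` helper are hypothetical stand-ins, not part of Haystack or SubjQA. The field names mirror the ones used in this notebook.

```python
# Toy rows standing in for SubjQA records (values are made up for illustration).
rows = [
    {"context": "The battery lasts all day.", "title": "ITEM_A", "id": "q1"},
    {"context": "The battery lasts all day.", "title": "ITEM_A", "id": "q2"},  # same review, second question
    {"context": "The screen is sharp.", "title": "ITEM_B", "id": "q3"},
]


def to_documents(rows, split):
    """Hypothetical helper: build one {"text", "meta"} dict per unique review text."""
    seen = set()
    docs = []
    for row in rows:
        if row["context"] in seen:
            continue  # keep only the first row for each review text
        seen.add(row["context"])
        docs.append(
            {
                "text": row["context"],
                "meta": {"item_id": row["title"], "qid": row["id"], "split": split},
            }
        )
    return docs


docs = to_documents(rows, split="train")
for doc in docs:
    print(doc)
```

A list in this shape is what gets passed to the document store's `write_documents` method later in the notebook.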

In [2]:
# Load each split of the SubjQA electronics dataset as a pandas DataFrame
from datasets import load_dataset

subjqa_dataset = load_dataset("subjqa", "electronics")
subjqa_dataset.set_format("pandas")

dfs = {split: ds[:] for split, ds in subjqa_dataset.flatten().items()}




### Load all reviews into Elasticsearch Index

In [3]:
def write_documents(split, df):
    # Build one document per unique review text and index it in Elasticsearch
    docs = [
        {
            "text": row["context"],
            "meta": {"item_id": row["title"], "qid": row["id"], "split": split},
        }
        for _, row in df.drop_duplicates(subset="context").iterrows()
    ]
    document_store.write_documents(docs, index="document")


for split, df in dfs.items():
    write_documents(split, df)

print(f"Loaded {document_store.get_document_count()} documents")

11/28/2021 02:33:35 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.004s]
11/28/2021 02:33:36 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.455s]
11/28/2021 02:33:37 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.041s]
11/28/2021 02:33:38 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.032s]
11/28/2021 02:33:38 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.003s]
11/28/2021 02:33:39 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.025s]
11/28/2021 02:33:39 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.004s]
11/28/2021 02:33:40 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.994s]
11/28/2021 02:33:40 - INFO - elasticsearch

Loaded 1615 documents


### Search for reviews from Elasticsearch Index using Retriever

> _The Elasticsearch document store can be paired with any of the Haystack retrievers. Here we use a sparse retriever based on BM25. BM25 is an improved version of the classic TF-IDF metric and represents the question and context as sparse vectors that can be searched efficiently on Elasticsearch. The BM25 score measures how well the matched text relates to a search query and improves on TF-IDF by saturating term-frequency (TF) values quickly and normalizing for document length, so that short documents are favoured over long ones._
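
To make the saturation and length-normalization effects concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not Elasticsearch's exact implementation. The whitespace tokenizer, the toy documents, and the `bm25_scores` helper are assumptions for illustration, and `k1=1.2`, `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents get a larger penalty in the denominator.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # The tf term saturates: it approaches (k1 + 1) as tf grows.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "the weight is just right",
    "the weight is fine but the screen the case and the charger are all very disappointing",
    "weight weight weight weight weight",
]
scores = bm25_scores("weight", documents)
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```

The first two documents each mention "weight" once, yet the shorter one scores higher. The third repeats the term five times but scores well under five times the first, which is the saturation effect.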

In [4]:
from haystack.retriever.sparse import ElasticsearchRetriever

es_retriever = ElasticsearchRetriever(document_store)

In [5]:
# Search for sample query from reviews and filter by Item ID and split
item_id = "B0074BW614"
query = "How is the weight?"
retrieval_results = es_retriever.retrieve(
    query=query, top_k=3, filters={"item_id": [item_id], "split": ["train"]}
)


retrieval_results

11/28/2021 02:33:40 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.013s]


[{'text': "From the limited time I've had to work with my Kindle, I'm extremely pleased.  The graphics are awesome, the size and weight is just right, speed is phenomenal.  My wife is jealous and she has an iPad 2.", 'score': 0.6429956517280498, 'question': None, 'meta': {'item_id': 'B0074BW614', 'qid': '7e55bbbe23fe233278352936e5af66a0', 'split': 'train'}, 'embedding': None, 'id': '90f04ba41b1884aeab29eb73d297733'},
 {'text': 'Cannot believe what we have been missing. We are a tech family, and have been Ipad fans from the get go, have owned all three versions. Never even considered the Kindle until I saw one with no glare to read. So we thought for the price and Amazons great return policy why not try the new HD out and see how it is for reading.OH MY WORD......(mouth open) it is awesome. Reading books on this is like night and day to our Pads. The screen is rich and clear, the abilities to adjust things is great and love the being able to download books through Amazon right to the Ki

> _We can now retrieve the relevant documents using the Retriever. Next, we load the QA model and use the Reader to extract answers to questions from those documents._

### Load deepset/minilm-uncased-squad2 using FARMReader

In [6]:
from haystack.reader.farm import FARMReader

model = "deepset/minilm-uncased-squad2"
max_seq_length = 384
doc_stride = 128

reader = FARMReader(
    model_name_or_path=model,
    max_seq_len=max_seq_length,
    doc_stride=doc_stride,
    return_no_answer=True,
)

11/28/2021 02:33:41 - INFO - farm.utils -   Using device: CUDA 
11/28/2021 02:33:41 - INFO - farm.utils -   Number of GPUs: 1
11/28/2021 02:33:41 - INFO - farm.utils -   Distributed Training: False
11/28/2021 02:33:41 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertModel: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
11/28/2021 02:33:57 - INFO - farm.utils -   Using device: CUDA 
11/28/2021 02:33:57 - INFO - farm.utils -   Number of G

In [7]:
query = "How much music can it hold?"
context = """An MP3 is about 1 MB/minute, so about 6000 hours depending on \
file size."""

reader.predict_on_texts(query, texts=[context], top_k=2)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.53 Batches/s]


{'query': 'How much music can it hold?',
 'no_ans_gap': 13.045684576034546,
 'answers': [{'answer': '1 MB/minute',
   'score': 0.680178701877594,
   'context': 'An MP3 is about 1 MB/minute, so about 6000 hours depending on file size.',
   'offset_start': 16,
   'offset_end': 27,
   'offset_start_in_doc': 16,
   'offset_end_in_doc': 27,
   'document_id': 'e344757014e804eff50faa3ecf1c9c75'},
  {'answer': None,
   'score': 0.45309837548893656,
   'context': None,
   'offset_start': 0,
   'offset_end': 0,
   'document_id': None,
   'meta': None}]}
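
The sample output above suggests that `offset_start` and `offset_end` are character offsets into `context`, with an exclusive end. As a small sketch under that assumption, the dictionary below is a trimmed copy of the output, and `span_text` is a hypothetical helper, not a Haystack function. It recovers each answer span by slicing and handles the "no answer" entry that `return_no_answer=True` adds.

```python
# Trimmed copy of the prediction shown above.
prediction = {
    "query": "How much music can it hold?",
    "answers": [
        {
            "answer": "1 MB/minute",
            "score": 0.680178701877594,
            "context": "An MP3 is about 1 MB/minute, so about 6000 hours depending on file size.",
            "offset_start": 16,
            "offset_end": 27,
        },
        {
            "answer": None,
            "score": 0.45309837548893656,
            "context": None,
            "offset_start": 0,
            "offset_end": 0,
        },
    ],
}


def span_text(answer):
    """Hypothetical helper: slice the answer span out of its context, or None for 'no answer'."""
    if answer["context"] is None:
        return None
    return answer["context"][answer["offset_start"]:answer["offset_end"]]


spans = [span_text(answer) for answer in prediction["answers"]]
for answer, span in zip(prediction["answers"], spans):
    label = span if span is not None else "<no answer>"
    print(f"{answer['score']:.3f}  {label}")
```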

### Using Haystack Pipelines for QA

In [8]:
from haystack.pipeline import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever=es_retriever)

> _Each Pipeline has a run function that specifies how the query flow should be executed. For ExtractiveQAPipeline we just need to pass the query, the number of documents to retrieve (the retriever's top_k), and the number of answers to extract from these documents (the reader's top_k), with both counts passed through the params dictionary. In our case, we also need to specify a filter over the item ID, which can be done using the filters argument as we did with the Retriever earlier._

In [9]:
item_id = "B0074BW614"
query = "Is it good for reading?"
n_answers = 3

preds = pipe.run(
    query=query,
    params={
        "retriever": {"top_k": 5},
        "reader": {"top_k": n_answers},
        "filters": {"item_id": [item_id], "split": ["train"]},
    },
)

print(f"Question: {query}")
# Slice so the loop is safe even if fewer than n_answers answers come back
for i, answer in enumerate(preds["answers"][:n_answers], start=1):
    print(f"Answer {i}: {answer['answer']}")
    print(f"Context: {answer['context']}")

11/28/2021 02:32:15 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.024s]
