## Introduction

This notebook shows how to build a fast and accurate CORD-19 search engine on top of the research articles, using Transformers 🤗.

In this notebook, we will:
 - First, structure the input data (`.json`) into an easy-to-use `pandas.DataFrame`.
 - Then, use a `RoBERTa` model to build a Q&A model.
 - Finally, use `Haystack` to scale the QA model to thousands of documents and build a search engine.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data

 - We will take 25,000 articles from the `pmc_json` directory and 25,000 articles from `pdf_json`, so that the data fits within Kaggle's compute limits.
 - We will extract `paper_id`, `title`, `abstract` and `full_text` and put them in an easy-to-use `pandas.DataFrame`.
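
To make the extraction step concrete, here is a minimal sketch that runs on a single in-memory record. The sample record is made up for illustration and only mimics the keys the notebook reads (`paper_id`, `metadata.title`, `abstract`, `body_text`); the helper name `extract_record` is hypothetical, and joining paragraphs with a space is an assumption rather than a requirement of the dataset.

```python
import json


def extract_record(parsed):
    """Pull paper_id, title, abstract and full_text out of one parsed JSON article."""
    # Keep only the last 7 characters of the id, as the notebook does.
    paper_id = parsed["paper_id"][-7:]
    title = parsed["metadata"]["title"]
    # The abstract may be missing or empty; fall back to an empty string.
    abstract_parts = parsed.get("abstract") or []
    abstract = abstract_parts[0]["text"] if abstract_parts else ""
    # Join body paragraphs with a space so sentences do not run together.
    full_text = " ".join(part["text"] for part in parsed.get("body_text", []))
    return {
        "paper_id": paper_id,
        "title": title,
        "abstract": abstract,
        "full_text": full_text,
    }


# Hypothetical sample record, made up for illustration only.
sample = json.loads("""
{
  "paper_id": "PMC1234567",
  "metadata": {"title": "A sample title"},
  "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}]
}
""")

record = extract_record(sample)
print(record)
```

A list of such dictionaries can be passed directly to `pd.DataFrame(...)` to get the same four columns.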

In [None]:
import numpy as np
import pandas as pd
import os
import json
import re
from tqdm import tqdm


dirs = ["pmc_json", "pdf_json"]
base_dir = "../input/CORD-19-research-challenge/document_parses"
max_files_per_dir = 25000  # set to None to consider all files
docs = []
for d in dirs:
    print(d)
    counts = 0
    for file in tqdm(os.listdir(f"{base_dir}/{d}")):
        file_path = f"{base_dir}/{d}/{file}"
        # Use a context manager so each file handle is closed after reading
        with open(file_path, "rb") as f:
            j = json.load(f)
        # Keep the last 7 characters: this drops the 'PMC' prefix of pmc_json ids.
        # paper_ids in pdf_json are long identifiers that are hard to plot, hence the substring.
        paper_id = j['paper_id'][-7:]
        title = j['metadata']['title']

        try:  # sometimes there is no abstract
            abstract = j['abstract'][0]['text']
        except (KeyError, IndexError):
            abstract = ""

        # Join all body paragraphs, separated by a space so sentences do not run together
        full_text = " ".join(text['text'] for text in j['body_text'])

        docs.append([paper_id, title, abstract, full_text])
        counts += 1
        if max_files_per_dir is not None and counts >= max_files_per_dir:
            break
df = pd.DataFrame(docs, columns=['paper_id', 'title', 'abstract', 'full_text'])


In [None]:
print(df.shape)
df.head()

We have 50,000 articles with the columns `paper_id`, `title`, `abstract` and `full_text`.

We are interested in the `title` and `full_text` columns, as these will be used to build the engine. Let's set up a search engine on top of `full_text`, which contains the full content of the research papers.

## Welcome Haystack!

![](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/sketched_concepts_white.png)

The secret sauce behind scaling up is `Haystack`. It lets you scale QA models to large collections of documents! You can read more about this amazing library here: https://github.com/deepset-ai/haystack

For installation: `! pip install git+https://github.com/deepset-ai/haystack.git`

To give some background, `Haystack` has three major components.

**Document Store**: Database storing the documents for our search. The Haystack authors recommend `Elasticsearch`, but there are also lighter-weight options for fast prototyping (SQL or In-Memory).

**Retriever**: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include `TF-IDF` or `BM25`, custom `Elasticsearch` queries, and embedding-based approaches. The Retriever helps to narrow down the scope for the Reader to smaller units of text where a given question could be answered.
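
To give a feel for what a `BM25` retriever computes, here is a minimal pure-Python sketch of a textbook BM25 score. It is not Elasticsearch's implementation: the whitespace tokenizer, the toy corpus and the function name `bm25_scores` are assumptions made for illustration, and `k1=1.2`, `b=0.75` are simply commonly used parameter values.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for illustration only.
corpus = [
    "coronavirus infection in pregnant women",
    "influenza vaccine trial results",
    "coronavirus transmission among children and infants",
]
scores = bm25_scores("coronavirus pregnant women", corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document that matches all three query terms ranks first, and the one that shares no term with the query gets a score of zero.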

**Reader**: Powerful neural model that reads through texts in detail to find an answer. You can use diverse models like `BERT`, `RoBERTa` or `XLNet`, trained via `FARM` or `Transformers` on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face's model hub or fine-tune one on your own domain data.

And then there is **Finder** which glues together a **Reader** and a **Retriever** as a pipeline to provide an easy-to-use question answering interface.
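
The retrieve-then-read idea can be sketched without any library. The toy retriever and reader below are hypothetical stand-ins based on plain word overlap, not Haystack's components; they only illustrate how a cheap first stage limits the work of a more expensive second stage.

```python
def retrieve(question, documents, top_k):
    """Toy retriever: rank documents by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def read(question, documents, top_k):
    """Toy reader: score every sentence by word overlap and return the best ones."""
    q_words = set(question.lower().split())
    answers = []
    for doc in documents:
        for sentence in doc["text"].split(". "):
            overlap = len(q_words & set(sentence.lower().split()))
            answers.append({"answer": sentence, "score": overlap, "name": doc["name"]})
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k]


def find_answers(question, documents, top_k_retriever=2, top_k_reader=1):
    """The retriever narrows the candidates; the reader runs only on those."""
    candidates = retrieve(question, documents, top_k_retriever)
    return read(question, candidates, top_k_reader)


# Made-up documents, for illustration only.
docs = [
    {"name": "Paper A", "text": "Coronavirus mainly affects the lungs. Symptoms vary"},
    {"name": "Paper B", "text": "Vaccines are under development. Trials are ongoing"},
    {"name": "Paper C", "text": "Masks reduce transmission. Distancing also helps"},
]
result = find_answers("which organ does coronavirus affect", docs)
print(result)
```

In the real pipeline the reader is a neural model, so running it on ten retrieved documents instead of tens of thousands is what makes the search practical.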

In [None]:
# installing haystack

! pip install git+https://github.com/deepset-ai/haystack.git

In [None]:
# importing necessary dependencies

from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Setting up DocumentStore
`Haystack` finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

The Haystack authors recommend `ElasticsearchDocumentStore` because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

So let's set up an `ElasticsearchDocumentStore`.

In [None]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
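
A fixed 30-second sleep may be too short on a slow machine and longer than needed on a fast one. As an alternative, here is a minimal sketch that polls the server until it responds. It assumes Elasticsearch listens on its default HTTP port 9200 on localhost without authentication; the helper name `wait_for_elasticsearch` is hypothetical.

```python
import time
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0, interval=1.0):
    """Poll the URL until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, HTTP error or timeout: the server is not ready yet.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_elasticsearch()` right after starting the server process returns `True` as soon as the server answers, and `False` if it does not come up within the timeout.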

In [None]:
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

Once the `ElasticsearchDocumentStore` is set up, we write our documents to it.

Writing documents to `ElasticsearchDocumentStore` requires a list of dictionaries. The default format is:
`[{"name": "<some-document-name>", "text": "<the-actual-text>"},`
`{"name": "<some-document-name>", "text": "<the-actual-text>"},`
`{"name": "<some-document-name>", "text": "<the-actual-text>"}]`

(Optionally, you can add more key-value pairs here; they will be indexed as fields in `Elasticsearch` and can be used later for filtering or shown in the responses of the Finder.)

We will pass the `title` column as `name` and the `full_text` column as `text`.
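
The target format can be sketched on a couple of made-up rows, without pandas. The helper name `to_document_dicts` is hypothetical, and both skipping articles with empty text and carrying `paper_id` along as an extra key are assumptions made for illustration, not steps the notebook performs.

```python
def to_document_dicts(rows):
    """Turn (paper_id, title, abstract, full_text) rows into name/text dictionaries."""
    documents = []
    for paper_id, title, abstract, full_text in rows:
        # Skip articles without body text: there is nothing to search in them.
        if not full_text.strip():
            continue
        documents.append({
            "name": title,
            "text": full_text,
            # Optional extra key-value pair, kept alongside name and text.
            "paper_id": paper_id,
        })
    return documents


# Made-up rows in the same column order as the DataFrame above.
rows = [
    ["1234567", "Sample paper", "", "Body of the sample paper."],
    ["7654321", "Empty paper", "", "   "],
]
documents = to_document_dicts(rows)
print(documents)
```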

In [None]:
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(df[['title', 'full_text']].rename(columns={'title':'name','full_text':'text'}).to_dict(orient='records'))

**Let's prepare Retriever, Reader, & Finder**

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

Here we use Elasticsearch's default BM25 algorithm.

In [None]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

A **Reader** scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.
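
Extracting the k best answers can be pictured as picking the highest-scoring text spans. The sketch below is a simplified view of extractive QA and not the `FARM` implementation: it assumes the model has already produced one start score and one end score per token, and the tokens and scores are made up for illustration.

```python
def top_k_spans(tokens, start_scores, end_scores, k=2, max_answer_len=5):
    """Return the k highest-scoring (score, text) answer spans."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        # Only consider spans that end at or after the start and stay short.
        last = min(start + max_answer_len, len(tokens))
        for end in range(start, last):
            score = start_score + end_scores[end]
            candidates.append((score, " ".join(tokens[start:end + 1])))
    candidates.sort(key=lambda item: item[0], reverse=True)
    return candidates[:k]


# Made-up tokens and scores, for illustration only.
tokens = ["the", "virus", "mainly", "affects", "the", "lungs"]
start_scores = [0.1, 0.2, 0.0, 0.3, 2.0, 3.0]
end_scores = [0.0, 0.5, 0.1, 0.2, 0.1, 4.0]

answers = top_k_spans(tokens, start_scores, end_scores)
print(answers)
```

Scoring every candidate span in every retrieved passage is what makes the Reader slow compared with the Retriever.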

`Haystack` currently supports Readers based on the `FARM` and `Transformers` frameworks. With both, you can load either a local model or one from Hugging Face's model hub (https://huggingface.co/models).

Here we use a medium-sized `RoBERTa` QA model with a `FARM`-based Reader (https://huggingface.co/deepset/roberta-base-squad2). Note that the cell below loads the `-covid` variant, `deepset/roberta-base-squad2-covid`.

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=True, context_window_size=1000)

And finally, the **Finder** ties the Reader and the Retriever together in a pipeline to answer our actual questions.

In [None]:
finder = Finder(reader, retriever)

## Voila! We're done!

Let's see how well our search engine works!

In response to our query, we get:
 - the `answer`
 - a 1,000-character `context` around the answer
 - the `name` of the research paper
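
Since every query prints the same fields, the formatting can be factored into one helper. This is a minimal sketch: the function name `format_answers` is hypothetical, and the mock prediction only mirrors the keys the notebook reads (`question`, `answers`, `answer`, `context`, `meta.name`); it is not real model output.

```python
def format_answers(prediction, max_answers=2):
    """Build a printable report from a prediction dictionary."""
    lines = [f"Question: {prediction['question']}", ""]
    # Slice instead of indexing so that fewer answers than requested is not an error.
    for rank, answer in enumerate(prediction["answers"][:max_answers], start=1):
        lines.append(f"#{rank}")
        lines.append(f"Answer: {answer['answer']}")
        lines.append(f"Research Paper: {answer['meta']['name']}")
        lines.append(f"Context: {answer['context']}")
        lines.append("")
    return "\n".join(lines)


# Made-up prediction with the same keys the notebook reads; not real model output.
mock_prediction = {
    "question": "Which organ does coronavirus impact?",
    "answers": [
        {
            "answer": "the lungs",
            "context": "... affects the lungs ...",
            "meta": {"name": "Sample paper"},
        },
    ],
}
report = format_answers(mock_prediction)
print(report)
```

With the real pipeline, the same helper could be called as `print(format_answers(prediction, number_of_answers_to_fetch))`.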

In [None]:
question = "What is the impact of coronavirus on babies?"
number_of_answers_to_fetch = 2

prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)
print(f"Question: {prediction['question']}")
print("\n")
# Iterate over the answers actually returned, which may be fewer than requested
for i, answer in enumerate(prediction['answers'][:number_of_answers_to_fetch], start=1):
    print(f"#{i}")
    print(f"Answer: {answer['answer']}")
    print(f"Research Paper: {answer['meta']['name']}")
    print(f"Context: {answer['context']}")
    print('\n\n')

In [None]:
question = "What is the impact of coronavirus on pregnant women?"
number_of_answers_to_fetch = 2

prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)
print(f"Question: {prediction['question']}")
print("\n")
# Iterate over the answers actually returned, which may be fewer than requested
for i, answer in enumerate(prediction['answers'][:number_of_answers_to_fetch], start=1):
    print(f"#{i}")
    print(f"Answer: {answer['answer']}")
    print(f"Research Paper: {answer['meta']['name']}")
    print(f"Context: {answer['context']}")
    print('\n\n')

In [None]:
question = "which organ does coronavirus impact?"
number_of_answers_to_fetch = 2

prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)
print(f"Question: {prediction['question']}")
print("\n")
# Iterate over the answers actually returned, which may be fewer than requested
for i, answer in enumerate(prediction['answers'][:number_of_answers_to_fetch], start=1):
    print(f"#{i}")
    print(f"Answer: {answer['answer']}")
    print(f"Research Paper: {answer['meta']['name']}")
    print(f"Context: {answer['context']}")
    print('\n\n')