<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/QuestionAnswering/OpenDomainQA_DeepPavlov.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DeepPavlov - Open Domain Question Answering

Adapted from: [link](https://colab.research.google.com/github/deepmipt/dp_notebooks/blob/master/DP_ODQA.ipynb)

The architecture of the DeepPavlov ODQA skill is modular and consists of two components: a **ranker** and a **reader**.

In order to answer any question, the **ranker** first retrieves a few relevant articles from the article collection, and then the **reader** scans them carefully to identify the answer. The **ranker** is based on DrQA [1] proposed by Facebook Research. Specifically, the DrQA approach uses unigram-bigram hashing and TF-IDF matching designed to efficiently return a subset of relevant articles based on a question. 
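To make the ranking step concrete, here is a minimal sketch of hashed unigram-bigram TF-IDF retrieval using only the standard library. It is not the DrQA or DeepPavlov implementation: the hash function (`zlib.crc32`), the bucket count `NUM_BUCKETS`, the weighting formula, and the three toy documents are all assumptions chosen for illustration.

```python
import math
import re
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # hypothetical hash-space size, chosen for illustration


def hashed_ngrams(text):
    """Return hashed bucket ids for the unigrams and bigrams of `text`."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in tokens + bigrams]


def build_index(docs):
    """Compute a sparse TF-IDF vector (bucket -> weight) for every document."""
    counts = {title: Counter(hashed_ngrams(text)) for title, text in docs.items()}
    doc_freq = Counter(bucket for c in counts.values() for bucket in c)
    n_docs = len(docs)
    # Smoothed inverse document frequency; always positive for this formula
    idf = {b: math.log((n_docs + 1) / (df + 0.5)) for b, df in doc_freq.items()}
    vectors = {
        title: {b: math.log1p(tf) * idf[b] for b, tf in c.items()}
        for title, c in counts.items()
    }
    return vectors, idf


def rank(question, vectors, idf, top_n=2):
    """Return the titles of the `top_n` documents with the highest dot product."""
    q_counts = Counter(hashed_ngrams(question))
    q_vec = {b: math.log1p(tf) * idf.get(b, 0.0) for b, tf in q_counts.items()}
    scores = {
        title: sum(w * vec.get(b, 0.0) for b, w in q_vec.items())
        for title, vec in vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy documents, made up for this example
docs = {
    "cerebellum.txt": "The cerebellum coordinates movement and balance.",
    "gene.txt": "A gene is a sequence of DNA that encodes a protein.",
    "network.txt": "A transcription network regulates gene expression.",
}
vectors, idf = build_index(docs)
print(rank("what is a gene?", vectors, idf, top_n=2))
```

Hashing the n-grams into a fixed number of buckets keeps the vocabulary size bounded regardless of how many distinct bigrams the collection contains, at the cost of occasional collisions.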

The **reader** is based on R-NET [2] proposed by Microsoft Research Asia and its implementation by Wenxuan Zhou. The R-NET architecture is an end-to-end neural network model that aims to answer questions based on a given article. R-NET first matches the question and the article via gated attention-based recurrent networks to obtain a question-aware article representation. Then the self-matching attention mechanism refines the representation by matching the article against itself, which effectively encodes information from the whole article. Finally, the pointer networks locate the positions of answers in the article. 
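The final step, locating the answer positions, can be illustrated with a small sketch. It assumes the reader has already produced one probability per token for "answer starts here" and one for "answer ends here", and picks the span with the highest product. The probabilities, the `max_len` cutoff, and the function name `best_span` are made up for illustration; this is a common decoding scheme for pointer-style readers, not necessarily the exact procedure used in DeepPavlov.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e], s <= e."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider spans of at most `max_len` tokens that end at or after s
        for e in range(s, min(s + max_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical reader outputs for a seven-token article
tokens = ["Tuberculosis", "is", "a", "disease", "of", "the", "lungs"]
start_probs = [0.05, 0.05, 0.60, 0.20, 0.04, 0.03, 0.03]
end_probs = [0.02, 0.03, 0.05, 0.50, 0.05, 0.05, 0.30]

start, end, score = best_span(start_probs, end_probs)
print(" ".join(tokens[start:end + 1]), round(score, 3))
```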

The diagram below shows the DeepPavlov ODQA system architecture.

<img src="https://github.com/deepmipt/dp_notebooks/blob/master/odqa.png?raw=1">

<center>Picture 1. The DeepPavlov-based ODQA system architecture</center>


DeepPavlov’s ODQA system has two Wikipedia-based models. The first one is based on the English Wikipedia dump from 2018-02-11 (5,180,368 articles) and the second one is based on the Russian Wikipedia dump from 2018-04-01 (1,463,888 articles).

[1] [Chen, Danqi, et al. "Reading wikipedia to answer open-domain questions." arXiv preprint arXiv:1704.00051 (2017)](https://arxiv.org/pdf/1704.00051.pdf)

[2] [R-NET: Machine reading comprehension with self-matching networks](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)

## Model Requirements

Install DeepPavlov and all the model's requirements.

In [3]:
!pip install -q deeppavlov
!python -m deeppavlov install en_odqa_infer_wiki

## Model Description

The architecture of the ODQA skill is modular and consists of two components, a **ranker** and a **reader**. In order to answer any question, the **ranker** first retrieves **top_n** relevant articles from the document collection, and then the **reader** scans them carefully to identify the answer. The detailed description of the ODQA models can be found in the [DeepPavlov documentation](http://docs.deeppavlov.ai/en/0.1.6/skills/odqa.html).

In [None]:
# %load inserts the contents of the URL into the current cell; use the raw file URL, not the GitHub HTML page
%load https://raw.githubusercontent.com/deepmipt/DeepPavlov/0.1.6/deeppavlov/configs/odqa/en_odqa_infer_wiki.json

In [11]:
# Download the same configuration file and pretty-print it
import urllib.request, json 
with urllib.request.urlopen("https://raw.githubusercontent.com/deepmipt/DeepPavlov/0.1.6/deeppavlov/configs/odqa/en_odqa_infer_wiki.json") as url:
    data = json.loads(url.read().decode())
    print(json.dumps(data, indent=4, sort_keys=True))

{
    "chainer": {
        "in": [
            "question_raw"
        ],
        "out": [
            "best_answer"
        ],
        "pipe": [
            {
                "config_path": "{CONFIGS_PATH}/doc_retrieval/en_ranker_tfidf_wiki.json",
                "in": [
                    "question_raw"
                ],
                "out": [
                    "tfidf_doc_ids"
                ]
            },
            {
                "class_name": "wiki_sqlite_vocab",
                "in": [
                    "tfidf_doc_ids"
                ],
                "join_docs": false,
                "load_path": "{DOWNLOADS_PATH}/odqa/enwiki.db",
                "out": [
                    "tfidf_doc_text"
                ],
                "shuffle": false
            },
            {
                "class_name": "document_chunker",
                "flatten_result": true,
                "in": [
                    "tfidf_doc_text"
                ],
                "out": 

## Training the model

You can train a model by running the framework with the **train** parameter, in which case the model is trained on the document collection defined in the **dataset_reader** section of the configuration file. The **dataset_reader** section of the ranker's configuration defines the source of the articles. The source can be in one of the following formats, selected with **dataset_format**:

* *wiki* - a Wikipedia dump
* *txt* - a path to a directory of text files, with each document in a separate file
* *json* - JSON files, each formatted as a list of dicts that contain the *title* and *doc* keys
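As an illustration of the *json* format, the sketch below writes a small collection as a list of dicts with `title` and `doc` keys and reads it back. The two documents and the file name `corpus.json` are made up, and whether `data_path` should point at a single file or at a directory of such files is not stated here, so check the DeepPavlov documentation before relying on it.

```python
import json
import tempfile
from pathlib import Path

# Made-up documents for illustration only
articles = [
    {"title": "doc_1", "doc": "The cerebellum coordinates movement."},
    {"title": "doc_2", "doc": "A gene is a sequence of DNA."},
]

# Hypothetical file name inside a fresh temporary directory
corpus_path = Path(tempfile.mkdtemp()) / "corpus.json"
with corpus_path.open("w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)

# Read the file back to confirm it is a list of {"title", "doc"} dicts
with corpus_path.open(encoding="utf-8") as f:
    loaded = json.load(f)
print([entry["title"] for entry in loaded])
```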

As a training corpus, we will use the PLoS sentence corpus. It consists of 300 computational biology articles, each stored in a separate *txt* file. For simplicity, we will use the same configuration files that are used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models.

In [None]:
!wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip
!unzip SentenceCorpus.zip

In [14]:
!ls SentenceCorpus

Instructions_for_SentenceAnnotation.pdf  README		     word_lists
labeled_articles			 unlabeled_articles


To fit a model on new data, first change the `data_path` parameter in the `dataset_reader` section of the configuration. Then change `dataset_format` to `txt`. Finally, train the model.

In [16]:
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Start from the Wikipedia ranker config and point it at the PLoS articles
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/content/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
# Fit the TF-IDF ranker on the new document collection
doc_retrieval = train_model(model_config)

2021-05-26 12:15:19.329 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 57: Reading files...
2021-05-26 12:15:19.335 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 134: Building the database...
  0%|          | 0/300 [00:00<?, ?it/s]
300it [00:00, 4450.32it/s]
100%|██████████| 300/300 [00:00<00:00, 3018.72it/s]
2021-05-26 12:15:19.595 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 57: Connecting to database, path: /root/.deeppavlov/downloads/odqa/enwiki.db
2021-05-26 12:15:19.598 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 112: SQLite iterator: The size of the database is 300 documents
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[

Examine the ranker output.

In [18]:
query_term = 'cerebellum'
print(f"Ranker of documents in the PloS sentence corpus for the term '{query_term}'")
for doc_title in doc_retrieval([query_term])[0]:
  print(f"\t - {doc_title}")

Ranker of documents in the PloS sentence corpus for the term 'cerebellum'
	 - 46.txt
	 - 453.txt
	 - 490.txt
	 - 485.txt
	 - 484.txt
	 - 478.txt
	 - 470.txt
	 - 466.txt
	 - 499.txt
	 - 430.txt
	 - 437.txt
	 - 429.txt


Everything is now in place to run the ODQA component. Make sure to pass `download=False`; otherwise the pretrained Wikipedia model will be downloaded and will overwrite your model.

In [19]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

In [None]:
# Download all the SQuAD models
squad = build_model(configs.squad.multi_squad_noans_infer, download = True)

In [21]:
# Do not download the ODQA models: the ranker has just been trained locally
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download = False)

2021-05-26 12:23:17.51 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 264: Loading tfidf matrix from /root/.deeppavlov/models/odqa/enwiki_tfidf_matrix.npz
2021-05-26 12:23:17.538 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 57: Connecting to database, path: /root/.deeppavlov/downloads/odqa/enwiki.db
2021-05-26 12:23:17.540 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 112: SQLite iterator: The size of the database is 300 documents
2021-05-26 12:23:17.550 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved tokens vocab from /root/.deeppavlov/models/multi_squad_model_noans/emb/vocab_embedder.pckl
2021-05-26 12:23:18.218 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved chars vocab from /root/.deeppavlov/mode

INFO:tensorflow:Restoring parameters from /root/.deeppavlov/models/multi_squad_model_noans/model


In [24]:
# Ask
answers = odqa(["what is tuberculosis?", "what is a gene?"])
answers



[['a disease for which a new drug is desperately needed', ''],
 [0.9975703954696655, 0.7835770845413208],
 [314, -1]]
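The result is three parallel lists, one entry per question. The sketch below regroups them into one tuple per question using the output copied from the cell above, so it runs without the model. Two points are assumptions: that an empty string means the reader found no answer, and the meaning of the third value, which this notebook only labels "Id".

```python
# Output copied from the cell above; in the notebook it comes from odqa([...])
answers = [
    ["a disease for which a new drug is desperately needed", ""],
    [0.9975703954696655, 0.7835770845413208],
    [314, -1],
]
questions = ["what is tuberculosis?", "what is a gene?"]

# zip(*answers) turns the three parallel lists into one tuple per question
rows = list(zip(*answers))

for question, (text, score, third) in zip(questions, rows):
    # Assumption: an empty string signals that no answer was found
    shown = text if text else "<no answer found>"
    print(f"{question} -> {shown} (score={score:.3f}, third value={third})")
```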

In [34]:
answers = odqa(["what is a cycling transcription network?"])
print(f"Answer: {answers[0][0]}")
print(f"Score: {answers[1][0]}")
print(f"Id: {answers[2][0]}")



Answer: Eukaryotes
Score: 0.9929509162902832
Id: 2402


## Interacting with DeepPavlov's Pretrained Models

The DeepPavlov ODQA system has two Wikipedia-based models. The English Wikipedia model requires 35 GB of local storage, whereas the Russian version takes up about 20 GB. The Wikipedia dumps can be rebuilt by following the steps described in the [documentation](http://docs.deeppavlov.ai/en/0.1.6/components/tfidf_ranking.html#available-data-and-pretrained-models). Both models require about 24 GB of RAM. It is possible to run them on a 16 GB machine, but the swap size should be at least 8 GB.
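The figures above can be turned into a quick pre-flight check. This is a rough sketch: the helper `can_host` is hypothetical, and treating RAM plus swap as interchangeable with RAM is a simplifying assumption based on the 16 GB + 8 GB example. Only the free disk space is measured; the RAM and swap values are passed in by hand.

```python
import shutil

# Figures taken from the paragraph above
REQUIRED_DISK_GB = {"en": 35, "ru": 20}
REQUIRED_MEMORY_GB = 24  # RAM, or RAM plus swap (assumption)


def can_host(lang, free_disk_gb, ram_gb, swap_gb=0):
    """Check the stated storage and memory requirements for one model."""
    enough_disk = free_disk_gb >= REQUIRED_DISK_GB[lang]
    enough_memory = ram_gb + swap_gb >= REQUIRED_MEMORY_GB
    return enough_disk and enough_memory


# Free space on the volume that holds the current directory
free_gb = shutil.disk_usage(".").free / 1024 ** 3
print(f"Free disk space here: {free_gb:.1f} GB")

print(can_host("en", free_disk_gb=100, ram_gb=16, swap_gb=8))  # 16 + 8 = 24 GB
print(can_host("en", free_disk_gb=100, ram_gb=12))
```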

**As mentioned above, the Wikipedia-based models have significant storage and RAM requirements, so it is not possible to interact with them on Colab. You can, however, do so locally, provided the requirements are satisfied.**

Alternatively, you can check out our [demo](http://demo.ipavlov.ai/).

## Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](https://demo.ipavlov.ai)

[DeepPavlov documentation](https://docs.deeppavlov.ai)