<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/QuestionAnswering/OpenDomainQA-Ktrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KTRAIN - Open Domain Question Answering

Open-Domain Question-Answering (QA) systems accept natural language questions as input and return exact answers from content buried within large text corpora. 

In this notebook, we will build a fully functional, end-to-end open-domain QA system in a few lines of code. To accomplish this, we will use [ktrain](https://github.com/amaiya/ktrain).

- Adapted from: [Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code](https://towardsdatascience.com/build-an-open-domain-question-answering-system-with-bert-in-3-lines-of-code-da0131bc516b)
- More detailed information about this code: [ktrain example notebook on question answering with BERT](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)

[ktrain](https://github.com/amaiya/ktrain) is a Python library that wraps TensorFlow.

In [None]:
!pip install ktrain

In [2]:
import ktrain
from ktrain import text

In this notebook, we will use the 20 Newsgroups dataset as the knowledge base. Because it is a collection of newsgroup postings full of opinions, debates, and arguments, the corpus is far from ideal as a knowledge base, but it is sufficient for this example.

In [1]:
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')

newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

docs = newsgroups_train.data + newsgroups_test.data

## Create a Search Index

The search index will allow us to quickly and easily retrieve documents that contain words present in the question. Such documents are likely to contain the answer and can be analyzed further to extract candidate answers.
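To make the retrieval idea concrete, here is a minimal sketch of an inverted index in plain Python. It is not how ktrain builds its index; the function names (`tokenize`, `build_index`, `search`) and the toy documents are made up for illustration, and the ranking simply counts how many distinct question words each document contains.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for token in tokenize(doc):
            index[token].add(doc_id)
    return index

def search(index, question, limit=3):
    """Rank documents by how many distinct question tokens they contain."""
    hits = defaultdict(int)
    for token in set(tokenize(question)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    # Most matching tokens first; ties are broken by document id.
    ranked = sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))
    return ranked[:limit]

# Toy corpus (hypothetical, not taken from 20 Newsgroups).
toy_docs = [
    "Cassini is scheduled for launch in October of 1997.",
    "Gamma correction affects how dark an image looks.",
    "The probe will reach Saturn after several gravity assists.",
]
toy_index = build_index(toy_docs)
print(search(toy_index, "When did the Cassini probe launch?"))
```

Only the documents returned by such a lookup need to be passed to the much slower answer-extraction model, which is what makes the approach practical on a large corpus.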

In [3]:
INDEXDIR = "./tmp/myindex"

In [4]:
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs))

## Create a QA Instance

Next, we will create a QA instance, which is largely a wrapper around a pretrained BertForQuestionAnswering model from the transformers library.
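An extractive QA model of this kind scores every token of a context as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The sketch below illustrates only that decoding step, using made-up scores and whitespace tokens; `best_span` is a hypothetical helper, and ktrain's actual decoding may differ.

```python
def best_span(tokens, start_scores, end_scores, max_len=10):
    """Return the (text, score) of the best span with start <= end."""
    best = (None, float("-inf"))
    for i, start in enumerate(start_scores):
        # Only consider spans of at most max_len tokens.
        for j in range(i, min(i + max_len, len(tokens))):
            score = start + end_scores[j]
            if score > best[1]:
                best = (" ".join(tokens[i:j + 1]), score)
    return best

# Hypothetical per-token scores; a real model would produce these.
tokens = "cassini is scheduled for launch in october of 1997".split()
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 2.5, 1.9, 0.2, 0.4]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.5, 0.1, 0.8, 0.2, 3.1]

answer, score = best_span(tokens, start_scores, end_scores)
print(answer, round(score, 2))
```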

In [5]:
qa = text.SimpleQA(INDEXDIR)

## Ask Questions

We will invoke the `qa.ask` method to issue questions to the text corpus we indexed and retrieve answers. The ask method performs the following steps:

1. Uses the search index to locate documents that contain words in the question

2. Extracts paragraphs from these documents for use as contexts and uses a BERT model pretrained on the SQuAD dataset to parse out candidate answers

3. Sorts and prunes candidate answers by confidence scores and returns results
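Step 3 can be sketched as follows. This is an illustration under stated assumptions, not ktrain's implementation: it assumes raw candidate scores are turned into confidences with a softmax (one common choice; the source does not say how ktrain computes confidence), and `rank_candidates` and the sample scores are hypothetical.

```python
import math

def rank_candidates(candidates, top_k=5):
    """Turn raw scores into confidences, then sort and keep the top_k."""
    if not candidates:
        return []
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(score for _, score in candidates)
    weights = [math.exp(score - peak) for _, score in candidates]
    total = sum(weights)
    scored = [
        {"answer": answer, "confidence": weight / total}
        for (answer, _), weight in zip(candidates, weights)
    ]
    scored.sort(key=lambda c: c["confidence"], reverse=True)
    return scored[:top_k]

# Hypothetical (answer, raw score) pairs gathered from several contexts.
raw = [
    ("the latter part of the 1990s", -4.0),
    ("in october of 1997", 5.6),
    ("on january 26,1962", 3.9),
]
for candidate in rank_candidates(raw, top_k=2):
    print(f"{candidate['confidence']:.3f}  {candidate['answer']}")
```

Normalizing across all candidates makes confidences comparable between contexts, so a weak match from one document does not outrank a strong match from another.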

We will also use the `qa.display_answers` method to format and display the top 5 results in our Jupyter notebook. Since the model is combing through paragraphs and sentences to find answers, it may take a few moments to return results.

In [6]:
answers = qa.ask("When did the Cassini probe launch?")
qa.display_answers(answers[:5])

| Rank | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.819033 | 59 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.151229 | 8525 |
| 2 | - 10 / 06 / 97 | key scheduled dates for the cassini mission (vvejga trajectory)-------------------------------------------------------------10 / 06 / 97-titan iv / centaur launch 04 / 21 / 98-venus 1 gravity assist 06 / 20 / 99-venus 2 gravity assist 08 / 16 / 99-earth gravity assist 12 / 30 / 00-jupiter gravity assist 06 / 25 / 04-saturn arrival 01 / 09 / 05-titan probe release 01 / 30 / 05-titan probe entry 06 / 25 / 08-end of primary mission (schedule last updated 7 / 22 / 92) - 10 / 06 / 97 | 0.029694 | 59 |
| 3 | * 98 | cassini * * * * * * * * * * * * * * * * * * 98 ,115 * * * * | 2.6e-05 | 5356 |
| 4 | the latter part of the 1990s | scheduled for launch in the latter part of the 1990s , the craf and cassini missions are a collaborative project of nasa, the european space agency and the federal space agencies of germany and italy, as well as the united states air force and the department of energy. | 1.7e-05 | 18684 |


In [10]:
answers = qa.ask("Who was Jesus Christ?")
qa.display_answers(answers[:5])

| Rank | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | the one who was anointed to preach, bind up, proclaim, and open | the one who was anointed to preach, bind up, proclaim, and open was jesus christ. | 0.433726 | 12218 |
| 1 | means the person who was scorned in these verses | " that means the person who was scorned in these verses was christ. | 0.335112 | 12218 |
| 2 | that david koresh | now i was not there in galilee back in the roman occupation, so i do not know for certain that david koresh was not jesus christ, but i strongly suspect that he was not (even aside from the fact of never having seen them in a photograph together). | 0.220499 | 14374 |
| 3 | was perfect | jesus was perfect . | 0.005831 | 5765 |
| 4 | born joseph, the husband of mary | and to jacob was born joseph, the husband of mary , by whom was born jesus, who is called christ. | 0.002594 | 10683 |


In [11]:
answers = qa.ask("What causes computer images to be too dark?")
qa.display_answers(answers[:5])

| Rank | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.93799 | 13873 |
| 1 | is gamma correction | this, is gamma correction (or the lack of it). | 0.045166 | 13873 |
| 2 | so if you just dump your nice linear image out to a crt | so if you just dump your nice linear image out to a crt , the image will look much too dark. | 0.010337 | 13873 |
| 3 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002115 | 6987 |
| 4 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 12344 |
