# **Task: Question Answering for Game of Thrones**

Question Answering can be used in a variety of use cases. A very common one is navigating complex knowledge bases or long documents (the "search setting").

A "knowledge base" could, for example, be your website, an internal wiki, or a collection of financial reports.
In this tutorial we will work on a slightly different domain: "Game of Thrones".

Let's see how we can use a set of Wikipedia articles to answer a variety of questions about the
marvellous Seven Kingdoms.

In [3]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack and install the version of torch that works with the colab GPUs
#!pip install git+https://github.com/deepset-ai/haystack.git
#!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [4]:
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## **Document Store**

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).


### **Start an Elasticsearch server**
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from the release archive.

In [5]:
# In Colab / no-Docker environments: start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
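
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As a rough alternative, the sketch below polls the server until it responds. It assumes Elasticsearch listens on `http://localhost:9200` (the address that appears in the logs of this notebook); `es_is_up` and `wait_until_ready` are hypothetical helper names, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET on `url` answers with status 200.

    The default URL is an assumption based on the logs in this notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet.
        return False


def wait_until_ready(probe, timeout=60.0, interval=1.0):
    """Call `probe()` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, `wait_until_ready(es_is_up, timeout=60)` could then replace the `! sleep 30` line; it returns `False` if the server never came up, which is worth checking before connecting.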

In [6]:
# Connect to Elasticsearch

from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

09/10/2020 10:49:15 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.411s]
09/10/2020 10:49:15 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.226s]


## **Cleaning & indexing documents**

Haystack provides a customizable cleaning and indexing pipeline for ingesting documents into Document Stores.

In this tutorial, we download Wikipedia articles on Game of Thrones, apply a basic cleaning function, and index them in Elasticsearch.

In [7]:
# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"text": "<the-actual-text>", "meta": {"name": "<some-document-name>"}}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

09/10/2020 10:49:27 - INFO - haystack.indexing.utils -   Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip to `data/article_txt_got`
100%|██████████| 1095120/1095120 [00:01<00:00, 699309.07B/s]


[{'text': "The soundtrack album of the seventh season of HBO series ''Game of Thrones'', titled '''''Game of Thrones: Season 7''''', was released digitally on August 25, 2017 on CD on September 29, 2017.", 'meta': {'name': '135_Game_of_Thrones__Season_7__soundtrack_.txt'}}, {'text': '\n==Credits and personnel==\nPersonnel adapted from the album liner notes.\n* Ramin Djawadi – composer, primary artist, producer', 'meta': {'name': '135_Game_of_Thrones__Season_7__soundtrack_.txt'}}, {'text': '\n==Charts==\n New Zealand Heatseekers Albums (RMNZ)', 'meta': {'name': '135_Game_of_Thrones__Season_7__soundtrack_.txt'}}]


09/10/2020 10:49:32 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.265s]
09/10/2020 10:49:33 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.059s]
09/10/2020 10:49:34 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.014s]
09/10/2020 10:49:35 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.006s]
09/10/2020 10:49:36 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.069s]
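
If your texts come from another source, the dictionaries can be built by hand. The following is a minimal sketch, not Haystack code: it mirrors the `{'text': ..., 'meta': {'name': ...}}` shape printed above, assumes paragraphs are separated by blank lines, and uses a made-up cleaning function (`strip_footer`) and made-up sample data. The only requirement taken from the tutorial is that a cleaning function takes a `str` and returns a `str`.

```python
import re


def strip_footer(text: str) -> str:
    """Hypothetical cleaning function: str in, str out."""
    # Drop a trailing "Retrieved from ..." footer, then collapse long blank-line runs.
    text = re.sub(r"\n*Retrieved from .*\Z", "", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def rows_to_dicts(rows, clean_func=None):
    """Turn (name, text) pairs into one dict per non-empty paragraph."""
    dicts = []
    for name, text in rows:
        if clean_func is not None:
            text = clean_func(text)
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                dicts.append({"text": paragraph, "meta": {"name": name}})
    return dicts


# Made-up sample row standing in for e.g. a database query result.
rows = [
    (
        "arya.txt",
        "Arya Stark is a daughter of Eddard Stark.\n\n\n\n"
        "She names her direwolf Nymeria.\nRetrieved from example.org",
    )
]
docs = rows_to_dicts(rows, clean_func=strip_footer)
print(docs)
```

A list built this way could then be passed to `document_store.write_documents(...)` in the same way as `dicts` above.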


## **Initialize Retriever, Reader & Finder**

### **Retriever**

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.
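
To give a feel for what BM25 scoring does, here is a small pure-Python sketch of the textbook Okapi BM25 formula. It is an illustration only and not how Elasticsearch implements it internally: the tokenizer is deliberately naive, the sample documents are made up, and `bm25_rank` is a hypothetical helper. The values `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberately naive analyzer)."""
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75, top_k=10):
    """Return up to `top_k` (doc_index, score) pairs, best match first."""
    tokenized = [tokenize(doc) for doc in docs]
    if not tokenized:
        return []
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized via `b`.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scored.append((idx, score))

    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


sample_docs = [
    "Arya Stark travels with her father Eddard to King's Landing.",
    "Tyrion Lannister suspects his father Tywin.",
    "Soundtrack album released digitally in 2017.",
]
ranked = bm25_rank("Who is the father of Arya Stark?", sample_docs)
for idx, score in ranked:
    print(round(score, 3), sample_docs[idx])
```

The first document ranks highest because it matches the rarer terms "arya" and "stark" in addition to "father"; the third shares no term with the query and scores zero.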


In [8]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: an in-memory TfidfRetriever based on Pandas DataFrames, useful for building quick prototypes with an SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### **Reader**

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
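
Extractive QA models typically assign every token a score for being the start of an answer and a score for being its end; the "k best answers" are then the highest-scoring spans. The sketch below illustrates that selection step with made-up tokens and scores. It is a simplified assumption about how such readers work in general, not the actual FARM or Transformers implementation, and `top_k_spans` is a hypothetical helper.

```python
def top_k_spans(tokens, start_scores, end_scores, k=3, max_answer_len=5):
    """Return the k best (answer_text, score) spans.

    A span (i, j) is valid if i <= j and it covers at most `max_answer_len`
    tokens; its score is start_scores[i] + end_scores[j].
    """
    candidates = []
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(tokens))
        for j in range(i, last):
            candidates.append((start + end_scores[j], i, j))
    # Highest combined score first.
    candidates.sort(key=lambda c: -c[0])
    return [(" ".join(tokens[i : j + 1]), score) for score, i, j in candidates[:k]]


# Made-up example: scores a model might produce for a short passage.
tokens = ["her", "father", "Lord", "Eddard", "Stark"]
start_scores = [0.1, 0.2, 2.0, 1.5, 0.1]
end_scores = [0.0, 0.1, 0.2, 1.0, 2.5]

answers = top_k_spans(tokens, start_scores, end_scores, k=3)
for text, score in answers:
    print(score, text)
```

Note that the candidates overlap heavily ("Lord Eddard Stark", "Eddard Stark", ...); real readers usually apply extra logic to deduplicate spans and to handle "no answer" cases.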

#### **FARMReader**

In [9]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

09/10/2020 10:50:14 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
09/10/2020 10:50:14 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/10/2020 10:50:46 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
09/10/2020 10:50:46 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
09/10/2020 10:50:46 - INFO - farm.infer -    0 
09/10/2020 10:50:46 - INFO - farm.infer -   /w\
09/10/2020 10:50:46 - INFO - farm.infer -   /'\
09/10/2020 10:50:46 - INFO - farm.infer -   


#### **TransformersReader**

In [None]:
# Alternative:
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### **Finder**

The Finder combines the Reader and the Retriever into a pipeline that answers our actual questions.

In [10]:
finder = Finder(reader, retriever)
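
Conceptually, the pipeline is "retrieve a few candidate documents, then let the reader extract answers from them". The toy sketch below shows only that control flow. All three classes are hypothetical stand-ins written for this illustration (they are not Haystack classes), and the shape of the returned dictionary is an assumption.

```python
import re


def words(text):
    return set(re.findall(r"\w+", text.lower()))


class KeywordRetriever:
    """Toy retriever: ranks documents by the number of words shared with the query."""

    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query, top_k=10):
        scored = [(len(words(query) & words(d["text"])), d) for d in self.docs]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: -pair[0])
        return [doc for _, doc in scored[:top_k]]


class FirstSentenceReader:
    """Toy reader: 'answers' with the first sentence of each candidate document."""

    def predict(self, question, documents, top_k=5):
        answers = [
            {"answer": d["text"].split(".")[0], "context": d["text"]}
            for d in documents
        ]
        return answers[:top_k]


class MiniFinder:
    """Chains retriever and reader, mirroring the Finder(reader, retriever) call above."""

    def __init__(self, reader, retriever):
        self.reader = reader
        self.retriever = retriever

    def get_answers(self, question, top_k_retriever=10, top_k_reader=5):
        # Step 1: cheap, fast narrowing of the search space.
        documents = self.retriever.retrieve(question, top_k=top_k_retriever)
        # Step 2: expensive, detailed reading of the few remaining candidates.
        answers = self.reader.predict(question, documents, top_k=top_k_reader)
        return {"question": question, "answers": answers}


toy_docs = [
    {"text": "Arya travels with her father Eddard. She owns a direwolf.", "meta": {"name": "arya.txt"}},
    {"text": "The soundtrack was released in 2017.", "meta": {"name": "music.txt"}},
]
mini_finder = MiniFinder(FirstSentenceReader(), KeywordRetriever(toy_docs))
result = mini_finder.get_answers("Who is the father of Arya?", top_k_retriever=1)
print(result["answers"])
```

The split matters for speed: the reader only ever sees the `top_k_retriever` documents, so the slow model runs on a handful of texts rather than on the whole collection.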

## **Ask a question!**

In [11]:
# You can configure how many candidates the reader and retriever should return.
# A higher top_k_retriever generally gives better answers, but makes the query slower.
prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

09/10/2020 10:51:49 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.167s]
09/10/2020 10:51:49 - INFO - haystack.retriever.sparse -   Got 10 candidates from retriever
09/10/2020 10:51:49 - INFO - haystack.finder -   Reader is looking for detailed answer in 12544 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.45s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.12 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.27s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.48s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.49s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.53s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.50s/ Batches]
Inferencing Samples: 100%|

In [None]:
# prediction = finder.get_answers(question="Who is known as the mother of dragons?", top_k_reader=5)
# prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_reader=5)

In [14]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Tywin',
        'context': 'Stark marrying two of his children.\n'
                   'Tyrion Lannister suspects his father Tywin, who decides '
                   'Tyrion and his barbarians will fight in the vanguard, '
                   'want'},
    {   'answer': 'Yoren',
        'context': " Baelor the Blessed. Ned notices Arya and alerts Night's "
                   'Watch recruiter Yoren. Before Sansa, Cersei Lannister, '
                   'Jof