<a href="https://colab.research.google.com/github/umbertoselva/tolkienQA/blob/main/tolkienQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#TOLKIEN Q&A

This is a small Google Colab project to acquaint myself with Haystack.

My goal is to build an NLP Q&A pipeline that answers questions based on J.R.R. Tolkien's works.

###TABLE OF CONTENTS

1) [Prepare GPU environment on Google Colab](#s01)

2) [Install Haystack](#s02)

3) [Set up Elasticsearch](#s03)

4) [Upload the text files to Google Colab](#s04)

5) [Initialize a DocumentStore as an ElasticsearchDocumentStore object](#s05)

6) [Preprocess the texts and save them in the DocumentStore](#s06)

7) [Retriever initialization](#s07)

  - [Dense Passage Retriever](#s07a)
  - [BM25 Retriever](#s07b)


8) [Reader initialization](#s08)

9) [Pipeline setup](#s09)

  - [DPR Pipe](#s09a)
  - [BM25 Pipe](#s09b)


10) [Let's try out the pipelines](#s10)

11) [A custom function to compare the results of different pipelines](#s11)

12) [A few observations](#s12)

13) [Final comments](#s13)

<a name="s01"></a>
###1) PREPARE GPU ENVIRONMENT ON GOOGLE COLAB

The first step is to make sure a GPU is running:

*Runtime -> Change runtime type -> Hardware accelerator -> GPU*

In [None]:
!nvidia-smi

Mon May 30 17:25:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

<a name="s02"></a>
###2) INSTALL HAYSTACK



In [None]:
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.1.1-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 30.4 MB/s 
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.1.1


In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-tzxntgkb/farm-haystack_c9c129bdda9942f1afe9a724c5ae2da5
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-tzxntgkb/farm-haystack_c9c129bdda9942f1afe9a724c5ae2da5
  Resolved https://github.com/deepset-ai/haystack.git to commit fc25adf959760c647b1a0dc3883fa1abb70734cf
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting mlflow
  Downloading mlflow-1.26.1-py3-none-any.whl (17.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.8/17.8 MB 71.3 MB/s eta 0:00:00
Collecting sentence-transformers>=2.2.0
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)

<a name="s03"></a>
###3) SET UP ELASTICSEARCH

In [None]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], 
    stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30

Test that the Elasticsearch server is running by querying the base endpoint (at localhost:9200)

In [None]:
!curl -X GET "localhost:9200/"

{
  "name" : "4e115108dd43",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "3M1vmv-RSX-nyoeArLvZ3w",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
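
The fixed `sleep 30` used when starting the server can be too short or needlessly long. As a minimal sketch, the base endpoint can be polled until it answers instead. The function name `wait_for_elasticsearch` and its `probe` parameter are my own assumptions (the latter exists only so that the function can be tried out without a running server); only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200/", timeout=60.0,
                           interval=2.0, probe=None):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    `probe` is a hypothetical hook (url -> bool) that replaces the real
    HTTP request, e.g. for testing without a server.
    """
    def http_probe(target):
        # A refused connection simply means the server is not up yet.
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or http_probe
    deadline = time.monotonic() + timeout
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, `wait_for_elasticsearch()` could be called right after the `Popen` call; it returns `False` if the server did not come up within the timeout.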


Check that the correct processes are running



In [None]:
!ps -ef | grep elasticsearch

daemon       317      71 99 17:26 ?        00:00:31 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-1749297391996564798 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/content/elasticsearch-7.9.2 -Des.path.c

Check the server's health

In [None]:
import requests

In [None]:
requests.get("http://localhost:9200/_cluster/health")

<Response [200]>

In [None]:
requests.get("http://localhost:9200/_cluster/health").json()

{'active_primary_shards': 0,
 'active_shards': 0,
 'active_shards_percent_as_number': 100.0,
 'cluster_name': 'elasticsearch',
 'delayed_unassigned_shards': 0,
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'task_max_waiting_in_queue_millis': 0,
 'timed_out': False,
 'unassigned_shards': 0}

Check the existing indices: currently there should be none

In [None]:
requests.get("http://localhost:9200/_cat/indices").text

''

<a name="s04"></a>
###4) UPLOAD THE TEXT FILES

My text database consists of three works by J.R.R. Tolkien, for a total of five *.txt* files:


*   The Hobbit
*   The Lord Of The Rings: The Fellowship Of The Ring
*   The Lord Of The Rings: The Two Towers
*   The Lord Of The Rings: The Return Of The King
*   The Silmarillion

Since these books are copyrighted, I am not sharing the raw files here. When developing this project, I manually uploaded them to Google Colab, into a folder called *texts* inside the *content* folder.





In [None]:
!pwd

/content


In [None]:
!ls /content/texts

hobbit.txt  lotr1.txt  lotr2.txt  lotr3.txt  silmarillion.txt


This is what the raw texts look like

In [None]:
!head /content/texts/hobbit.txt

Chapter I
An Unexpected Party
In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.

It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats - the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill - The Hill, as all the people for many miles round called it - and many little round doors opened out of it, first on one side and then on another. No going upstairs for the hobbit: bedrooms, bathrooms, cellars, pantries (lots of these), wardrobes (he had w

<a name="s05"></a>
###5) SET UP A DOCUMENTSTORE

Here I initialize a DocumentStore as an ElasticsearchDocumentStore object, which also creates an Elasticsearch index called "*tolkien*".

In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host="localhost", 
    username="", 
    password="", 
    index="tolkien"
    )

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.
INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


In [None]:
type(document_store)

haystack.document_stores.elasticsearch.ElasticsearchDocumentStore

Check that the index has been created

In [None]:
print(requests.get("http://localhost:9200/_cat/indices").text)

yellow open tolkien 5F5WT4JzQ2S1owiTwie75w 1 1 0 0 208b 208b
yellow open label   rhyGimVsTu6pVzLJ-B5YQQ 1 1 0 0 208b 208b
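
The `_cat/indices` response is plain text with whitespace-separated columns, which is awkward to inspect programmatically. Below is a small parsing sketch. The column names are an assumption based on the default column order of this endpoint (they are not printed in the output above), and `parse_cat_indices` is a hypothetical helper of my own.

```python
# Assumed default column order of the _cat/indices endpoint.
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices(text):
    """Turn the plain-text _cat/indices output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        rows.append(dict(zip(CAT_INDICES_COLUMNS, fields)))
    return rows


# The two lines returned by the server above.
sample = (
    "yellow open tolkien 5F5WT4JzQ2S1owiTwie75w 1 1 0 0 208b 208b\n"
    "yellow open label   rhyGimVsTu6pVzLJ-B5YQQ 1 1 0 0 208b 208b\n"
)
indices = parse_cat_indices(sample)
print([(row["index"], row["health"], row["docs.count"]) for row in indices])
```

The `yellow` health is expected here: each index asks for one replica, and a single-node cluster has no second node to place it on.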



In [None]:
requests.get("http://localhost:9200/tolkien").json()

{'tolkien': {'aliases': {},
  'mappings': {'dynamic_templates': [{'strings': {'mapping': {'type': 'keyword'},
      'match_mapping_type': 'string',
      'path_match': '*'}}],
   'properties': {'content': {'type': 'text'},
    'embedding': {'dims': 768, 'type': 'dense_vector'},
    'name': {'type': 'keyword'}}},
  'settings': {'index': {'analysis': {'analyzer': {'default': {'type': 'standard'}}},
    'creation_date': '1653931634194',
    'number_of_replicas': '1',
    'number_of_shards': '1',
    'provided_name': 'tolkien',
    'uuid': '5F5WT4JzQ2S1owiTwie75w',
    'version': {'created': '7090299'}}}}}

<a name="s06"></a>
###6) PREPROCESSING

Now it is time to preprocess the texts. Here I shall use the "*convert_files_to_docs*" function provided by Haystack without any further preprocessing to see how it fares just by itself.

This function accepts *.txt*, *.docx* and *.pdf* files (see https://github.com/deepset-ai/haystack/blob/master/haystack/utils/preprocessing.py). Since my files are in *.txt* format, it will make use of the TextConverter class (https://github.com/deepset-ai/haystack/blob/master/haystack/nodes/file_converter/txt.py).

With *split_paragraphs=True*, the function splits our texts into paragraphs, and each paragraph will become a document item within our DocumentStore. The function returns a *list* of Document objects, each of which is essentially a dict in the following format:

```
{
  'content': 'This is the content of the paragraph',
  'content_type': 'text',
  'score': None,
  'meta': {'name': 'filename.txt'},
  'embedding': None,
  'id': 'f766d354fddaf9b367ca8f50ac976a14'
}
```
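
To make the idea concrete, here is a rough sketch of what paragraph splitting amounts to. This is not Haystack's actual implementation: the helper `split_into_paragraph_docs` is hypothetical, splitting on blank lines is my assumption about how paragraphs are delimited, and the MD5-based `id` is only a stand-in for whatever hashing Haystack really uses.

```python
import hashlib
import re


def split_into_paragraph_docs(text, filename):
    """Split `text` on blank lines and wrap each paragraph in a dict."""
    docs = []
    for chunk in re.split(r"\n\s*\n", text):
        content = chunk.strip()
        if not content:
            continue  # ignore empty chunks
        docs.append({
            "content": content,
            "content_type": "text",
            "score": None,        # filled in later by a retriever
            "meta": {"name": filename},
            "embedding": None,    # filled in later for dense retrieval
            # Assumption: an MD5 hash of the content as a stand-in id.
            "id": hashlib.md5(content.encode("utf-8")).hexdigest(),
        })
    return docs


sample = (
    "Chapter I\nAn Unexpected Party\n"
    "In a hole in the ground there lived a hobbit.\n"
    "\n"
    "It had a perfectly round door like a porthole."
)
docs = split_into_paragraph_docs(sample, "hobbit.txt")
print(len(docs), docs[0]["content"].splitlines()[0])
```

Note that single line breaks do not start a new paragraph, so the chapter heading stays attached to the first paragraph.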



In [None]:
# set path to the text files
doc_dir = "/content/texts"

# preprocess
from haystack.utils import convert_files_to_docs

preprocessed_texts = convert_files_to_docs(
    dir_path=doc_dir, 
    split_paragraphs=True
    )

INFO - haystack.utils.preprocessing -  Converting /content/texts/hobbit.txt
INFO - haystack.utils.preprocessing -  Converting /content/texts/lotr1.txt
INFO - haystack.utils.preprocessing -  Converting /content/texts/lotr3.txt
INFO - haystack.utils.preprocessing -  Converting /content/texts/lotr2.txt
INFO - haystack.utils.preprocessing -  Converting /content/texts/silmarillion.txt


In [None]:
type(preprocessed_texts)

list

In [None]:
preprocessed_texts[:5]

[<Document: {'content': 'Chapter I\nAn Unexpected Party\nIn a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.', 'content_type': 'text', 'score': None, 'meta': {'name': 'hobbit.txt'}, 'embedding': None, 'id': 'f766d354fddaf9b367ca8f50ac976a14'}>,
 <Document: {'content': 'It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats - the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill - The Hill, as all the people for many miles round called it - and man

Finally I will save the preprocessed texts into the DocumentStore object.

In [None]:
document_store.write_documents(preprocessed_texts)

Let's count the number of documents (i.e. paragraphs into which the text was split) that we have:

In [None]:
requests.get("http://localhost:9200/tolkien/_count").json()

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'count': 11882}

<a name="s07"></a>
###7) RETRIEVER INITIALIZATION

The retriever queries our ElasticsearchDocumentStore and returns a list of the contexts that are most relevant to a given query. Each context is a schema.Document item, essentially a dict.



#### TYPES OF RETRIEVER

I shall try out a DPR (Dense Passage Retriever) and a BM25 Retriever (based on Elasticsearch's default BM25 similarity ranking algorithm) and compare their performance.

<a name="s07a"></a>
####A) THE DENSE PASSAGE RETRIEVER

The DPR (Dense Passage Retriever) embeds our documents as vectors and uses a similarity measure to match a query to similar contexts.

We need to provide a 
* question encoder (a pre-trained query embedding model), and a
* context encoder (a pre-trained passage embedding model). 

Pre-trained encoders are available on Hugging Face (search for "dpr").

The question encoder creates the embedding for the query, while the context encoder creates the embeddings for the contexts. Given a query, the retriever finds similar contexts by measuring the similarity between the query embedding and the context embeddings.
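
The matching step can be illustrated with a toy example. The sketch below assumes dot-product similarity and uses made-up three-dimensional vectors in place of real 768-dimensional encoder outputs. The passage labels and the helper `rank_by_similarity` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_vec, passage_vecs, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(dot(query_vec, vec), name) for name, vec in passage_vecs.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Made-up embeddings: not produced by any real encoder.
query = [0.9, 0.1, 0.0]
passages = {
    "ring-bearer passage": [0.8, 0.2, 0.1],
    "mithril coat passage": [0.1, 0.9, 0.3],
    "weather passage": [0.0, 0.1, 0.9],
}
ranking = rank_by_similarity(query, passages)
print(ranking)
```

The passage whose vector points in roughly the same direction as the query vector gets the highest score, regardless of whether the two share any words.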

In [None]:
from haystack.nodes import DensePassageRetriever

In [None]:
dpr_retriever = DensePassageRetriever(
    document_store=document_store, # here we pass in our texts
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/493 [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/492 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-ctx_encoder-single-nq-base


Now that we've initialized our retriever, we can update our documents with the embeddings created by the retriever

In [None]:
document_store.update_embeddings(dpr_retriever)

INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 11882 docs ...


Updating embeddings:   0%|          | 0/11882 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/10000 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/1888 [00:00<?, ? Docs/s]

<a name="s07b"></a>
#### B) THE BM25 RETRIEVER

The BM25 retriever needs no embedding models, so a single initialization call is enough.

In [None]:
from haystack.nodes import BM25Retriever

bm25_retriever = BM25Retriever(document_store=document_store)
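
For intuition about what BM25 rewards, here is a toy scorer following the textbook Okapi BM25 formula. It is only a sketch: Elasticsearch's own implementation differs in details such as tokenization and constant factors. The function `bm25_scores` and the three sample passages are hypothetical. The values `k1=1.2` and `b=0.75` are Elasticsearch's defaults.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


passages = [
    "the messengers who are sent with the ring",
    "frodo walked like one who carries a load",
    "the weather was fine in the shire",
]
scores = bm25_scores("who carries the ring", passages)
print(scores)
```

Unlike the dense retriever, this scoring depends purely on literal term overlap, so a short passage that happens to repeat the query words can outrank a passage that actually answers the question.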

#### LET'S TRY OUT THE RETRIEVERS

The retrievers return a list of contexts, each one a Document (essentially a dict) of the schema.Document type.

In [None]:
dpr_results = dpr_retriever.retrieve("Who carries the ring?")
dpr_results

[<Document: {'content': "With that he put on Bilbo a small coat of mail, wrought for some young elf-prince long ago. It was of silver-steel which the elves call mithril, and with it went a belt of pearls and crystals. A light helm of figured leather, strengthened beneath with hoops of steel, and studded about the bring with white gems, was set upon the hobbit's head.", 'content_type': 'text', 'score': 0.663955603194929, 'meta': {'name': 'hobbit.txt'}, 'embedding': None, 'id': '2a8d4539105876c72bb933207df45e97'}>,
 <Document: {'content': 'of silver was his habergeon,', 'content_type': 'text', 'score': 0.6626070824386119, 'meta': {'name': 'lotr1.txt'}, 'embedding': None, 'id': '79b43926481aafab67a0998527c78488'}>,
 <Document: {'content': "`Well, there are many reasons why they should,' said Gandalf, smiling. `I am one good reason. The Ring is another: you are the Ring-bearer. And you are the heir of Bilbo, the Ring-finder.'", 'content_type': 'text', 'score': 0.6615584644286779, 'meta': {

In [None]:
bm25_retriever.retrieve("Who carries the ring?")

[<Document: {'content': "`The messengers who are sent with the Ring.'", 'content_type': 'text', 'score': 0.7688756499028262, 'meta': {'name': 'lotr1.txt'}, 'embedding': None, 'id': '7ec6cb75585bc6b9f430ac71b43a0026'}>,
 <Document: {'content': 'From that time on Sam thought that he sensed a change in Gollum again. He was more fawning and would-be friendly; but Sam surprised some strange looks in his eyes at times, especially towards Frodo; and he went back more and more into his old manner of speaking. And Sam had another growing anxiety. Frodo seemed to be weary, weary to the point of exhaustion. He said nothing. indeed he hardly spoke at all; and he did not complain, but he walked like one who carries a load, the weight of which is ever increasing; and he dragged along, slower and slower, so that Sam had often to beg Gollum to wait and not to leave their master behind.', 'content_type': 'text', 'score': 0.73656660972089, 'meta': {'name': 'lotr2.txt'}, 'embedding': None, 'id': '8415726

In [None]:
type(dpr_results)

list

In [None]:
type(dpr_results[0])

haystack.schema.Document

<a name="s08"></a>
###8) READER INITIALIZATION

The Reader extracts the answers from the contexts returned by the Retriever.

Here I shall use the FARMReader. (On FARM see https://farm.deepset.ai/)

We need to provide a FARM model, e.g. the *RoBERTa-base-squad2* model fine-tuned using the SQuAD2.0 dataset of question-answer pairs available on Hugging Face (https://huggingface.co/deepset/roberta-base-squad2)
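
Conceptually, an extractive reader scores every token as a possible answer start and as a possible answer end, then picks the best-scoring span. The sketch below shows only that selection step, under my own simplifying assumptions. The scores are invented for illustration (they are not model output), and `best_span` is a hypothetical helper, not part of FARM or Haystack.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (score, start, end) of the best span with end >= start."""
    best = (float("-inf"), 0, 0)
    for start, start_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            total = start_score + end_scores[end]
            if total > best[0]:
                best = (total, start, end)
    return best


tokens = ["all", "turned", "their", "eyes", "on", "Frodo"]
# Invented per-token scores, for illustration only.
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 4.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.1, 3.5]

score, start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer, score)
```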

In [None]:
from haystack.nodes import FARMReader

In [None]:
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2", 
    use_gpu=True
    )

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


<a name="s09"></a>
###9) PIPELINE SETUP

I shall use the default ExtractiveQAPipeline provided by Haystack for Q&A systems. 

Let's set up a first pipeline with the DensePassageRetriever and a second one with the BM25Retriever.

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

<a name="s09a"></a>
#### DPR PIPELINE

In [None]:
dpr_pipe = ExtractiveQAPipeline(reader, dpr_retriever)

<a name="s09b"></a>
#### BM25 PIPELINE

In [None]:
bm25_pipe = ExtractiveQAPipeline(reader, bm25_retriever)

In [None]:
type(dpr_pipe)

haystack.pipelines.standard_pipelines.ExtractiveQAPipeline

<a name="s10"></a>
###10) LET'S TRY OUT OUR PIPELINES

First the DPR Pipe

In [None]:
dpr_prediction = dpr_pipe.run(
    query="Who carries the ring?", 
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.01 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 25.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 32.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.41 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 32.93 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 34.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 49.26 Batches/s]


In [None]:
dpr_prediction

{'answers': [<Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.9138280749320984, 'context': 'There was a hush, and all turned their eyes on Frodo. He was shaken by a sudden shame and fear; and he felt a great reluctance to reveal the Ring, and', 'offsets_in_document': [{'start': 47, 'end': 52}], 'offsets_in_context': [{'start': 47, 'end': 52}], 'document_id': 'c4916dbf477d3e5538fe8ab0c876f573', 'meta': {'name': 'lotr1.txt'}}>,
  <Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.8167659938335419, 'context': "'You are wise and fearless and fair, Lady Galadriel,' said Frodo. `I will give you the One Ring, if you ask for it. It is too great a matter for me.'", 'offsets_in_document': [{'start': 59, 'end': 64}], 'offsets_in_context': [{'start': 59, 'end': 64}], 'document_id': 'ab015b8a3620a9eccd87f539e09a828d', 'meta': {'name': 'lotr1.txt'}}>,
  <Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.6970351338386536, 'context': "`And you, Ring-bearer,' she said, t

The returned prediction is a dictionary with the following format

```
{
  'answers': [
    # a list of the top answers (in the schema.Answer format)
    <Answer {
      'answer': 'The answer', 
      'type': 'extractive', 
      'score': 0.9138280749320984, 
      'context': 'The text of the context retrieved', 
      'offsets_in_document': [{'start': 47, 'end': 52}], 'offsets_in_context': [{'start': 47, 'end': 52}], 
      'document_id': 'c4916dbf477d3e5538fe8ab0c876f573', 
      'meta': {'name': 'filename.txt'}
    }>,
    ...
  ],

  'documents': [
    # a list of the corresponding documents (in the schema.Document format)
    <Document: {
      'content': "The content of the document",
      'content_type': 'text', 
      'score': 0.663955603194929, 
      'meta': {'name': 'filename.txt'}, 
      'embedding': None, 
      'id': '2a8d4539105876c72bb933207df45e97'
    }>,
    ...
  ],

  'no_ans_gap': 8.631532192230225,
  'node_id': 'Reader',
  'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 10}},
  'query': 'The text of the question passed as argument to the "query=" parameter within pipe.run()',
  'root_node': 'Query'
}
```



In [None]:
type(dpr_prediction)

dict

In [None]:
dpr_prediction.get('answers')

[<Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.9138280749320984, 'context': 'There was a hush, and all turned their eyes on Frodo. He was shaken by a sudden shame and fear; and he felt a great reluctance to reveal the Ring, and', 'offsets_in_document': [{'start': 47, 'end': 52}], 'offsets_in_context': [{'start': 47, 'end': 52}], 'document_id': 'c4916dbf477d3e5538fe8ab0c876f573', 'meta': {'name': 'lotr1.txt'}}>,
 <Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.8167659938335419, 'context': "'You are wise and fearless and fair, Lady Galadriel,' said Frodo. `I will give you the One Ring, if you ask for it. It is too great a matter for me.'", 'offsets_in_document': [{'start': 59, 'end': 64}], 'offsets_in_context': [{'start': 59, 'end': 64}], 'document_id': 'ab015b8a3620a9eccd87f539e09a828d', 'meta': {'name': 'lotr1.txt'}}>,
 <Answer {'answer': 'Frodo', 'type': 'extractive', 'score': 0.6970351338386536, 'context': "`And you, Ring-bearer,' she said, turning to Frod

In [None]:
type(dpr_prediction.get('answers'))

list

In [None]:
type(dpr_prediction.get('answers')[0])

haystack.schema.Answer

In [None]:
dpr_prediction.get('documents')

[<Document: {'content': "With that he put on Bilbo a small coat of mail, wrought for some young elf-prince long ago. It was of silver-steel which the elves call mithril, and with it went a belt of pearls and crystals. A light helm of figured leather, strengthened beneath with hoops of steel, and studded about the bring with white gems, was set upon the hobbit's head.", 'content_type': 'text', 'score': 0.663955603194929, 'meta': {'name': 'hobbit.txt'}, 'embedding': None, 'id': '2a8d4539105876c72bb933207df45e97'}>,
 <Document: {'content': 'of silver was his habergeon,', 'content_type': 'text', 'score': 0.6626070824386119, 'meta': {'name': 'lotr1.txt'}, 'embedding': None, 'id': '79b43926481aafab67a0998527c78488'}>,
 <Document: {'content': "`Well, there are many reasons why they should,' said Gandalf, smiling. `I am one good reason. The Ring is another: you are the Ring-bearer. And you are the heir of Bilbo, the Ring-finder.'", 'content_type': 'text', 'score': 0.6615584644286779, 'meta': {

In [None]:
type(dpr_prediction.get('documents'))

list

In [None]:
type(dpr_prediction.get('documents')[0])

haystack.schema.Document
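
Since the answers are objects with attributes, the prediction can be flattened into plain tuples for quick inspection. This is a minimal sketch: `top_answers` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for `schema.Answer` items so that the example runs without Haystack (the values are copied from the output above).

```python
from types import SimpleNamespace


def top_answers(prediction, n=3):
    """Return up to n (answer, rounded score, source file) tuples."""
    rows = []
    for ans in prediction.get("answers", [])[:n]:
        meta = getattr(ans, "meta", None) or {}
        rows.append((ans.answer, round(ans.score, 3), meta.get("name")))
    return rows


# Stand-ins for schema.Answer objects.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="Frodo", score=0.9138280749320984,
                        meta={"name": "lotr1.txt"}),
        SimpleNamespace(answer="Frodo", score=0.8167659938335419,
                        meta={"name": "lotr1.txt"}),
    ]
}
rows = top_answers(fake_prediction)
print(rows)
```

In the notebook, the same helper should work on the real result, e.g. `top_answers(dpr_prediction)`, and it returns an empty list when the reader finds no answers.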

Haystack provides a function to pretty-print the predictions

In [None]:
from haystack.utils import print_answers

In [None]:
print_answers(dpr_prediction, details="minimum")


Query: Who carries the ring?
Answers:
[   {   'answer': 'Frodo',
        'context': 'There was a hush, and all turned their eyes on Frodo. He '
                   'was shaken by a sudden shame and fear; and he felt a great '
                   'reluctance to reveal the Ring, and'},
    {   'answer': 'Frodo',
        'context': "'You are wise and fearless and fair, Lady Galadriel,' said "
                   'Frodo. `I will give you the One Ring, if you ask for it. '
                   "It is too great a matter for me.'"},
    {   'answer': 'Frodo',
        'context': "`And you, Ring-bearer,' she said, turning to Frodo. `I "
                   'come to you last who are not last in my thoughts. For you '
                   "I have prepared this.' She held up a "},
    {   'answer': 'Tom',
        'context': ' in the midst of the story: and Frodo, to his own '
                   'astonishment, drew out the chain from his pocket, and '
                   'unfastening the Ring handed it at 

Then the BM25 Pipe

In [None]:
bm25_prediction = bm25_pipe.run(
    query="Who carries the ring?", 
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.43 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.35 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 37.33 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 33.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 35.32 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.34 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.97 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 46.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 45.36 Batches/s]


In [None]:
print_answers(bm25_prediction, details="minimum")


Query: Who carries the ring?
Answers:
[   {   'answer': 'Frodo',
        'context': 'Elrond summoned the hobbits to him. He looked gravely at '
                   "Frodo. 'The time has come,' he said. `If the Ring is to "
                   'set out, it must go soon. But those w'},
    {   'answer': 'Sauron',
        'context': 'We cannot use the Ruling Ring. That we now know too well. '
                   'It belongs to Sauron and was made by him alone, and is '
                   'altogether evil. Its strength, Boromi'},
    {   'answer': 'messengers',
        'context': "`The messengers who are sent with the Ring.'"},
    {   'answer': 'Frodo',
        'context': 'Frodo looked at it closely, and rather suspiciously (like '
                   'one who has lent a trinket to a juggler). It was the same '
                   'Ring, or looked the same and weigh'},
    {   'answer': 'Nenya',
        'context': 'erily it is in the land of Lórien upon the finger of '
                   'Gal

<a name="s11"></a>
###11) A CUSTOM FUNCTION TO COMPARE THE PREDICTIONS

In [None]:
def question(query, context=False, pipe01=dpr_pipe, pipe02=bm25_pipe):
  """
  Compare the top predictions of two pipelines for the same query.

  Args:
    query (str): the question;
    context (bool): if False, only the 'answer' and the 'score' are printed;
      if True, the 'context' from which the answer is extracted is printed too;
    pipe01, pipe02 (haystack.pipelines.standard_pipelines.ExtractiveQAPipeline):
      by default the Dense Passage Retriever pipe
      and the BM25 Retriever pipe that we created above.
  """

  # run the first pipe (DPR by default)
  prediction01 = pipe01.run(
    query=query,
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

  # run the second pipe (BM25 by default)
  prediction02 = pipe02.run(
    query=query,
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

  # top schema.Answer object returned by each pipe
  top01 = prediction01.get('answers')[0]
  top02 = prediction02.get('answers')[0]

  # output: answer and score, plus the context if requested
  print()
  print(f'DPR Pipe Best Answer: "{top01.answer}" ; Score: {top01.score}')
  if context:
    print(f'Context: {top01.context}')
  print(f'BM25 Pipe Best Answer: "{top02.answer}" ; Score: {top02.score}')
  if context:
    print(f'Context: {top02.context}')
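
One caveat: `question` takes element `[0]` of the list of answers, which would raise an `IndexError` if a pipeline returned no answers at all. A minimal sketch of a more defensive extraction step could look like the following. The helper name `top_answer` is hypothetical, and `SimpleNamespace` objects stand in for Haystack's `Answer` objects so that the snippet runs on its own; it only assumes the `answer`, `score` and `context` attributes already used above.

```python
from types import SimpleNamespace


def top_answer(prediction):
    """Return (answer, score, context) of the best answer, or None if there is none."""
    answers = prediction.get("answers") or []
    if not answers:
        return None
    best = answers[0]
    return best.answer, best.score, best.context


# Stand-in for a pipeline result (illustrative values, not real pipeline output)
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="Frodo", score=0.9, context="He looked gravely at Frodo.")
    ]
}

print(top_answer(fake_prediction))
print(top_answer({"answers": []}))
```

Inside `question`, a `None` result could then be reported as "no answer found" instead of crashing the cell.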


Let's try out some questions:

In [None]:
question("Who carries the ring?")



DPR Pipe Best Answer: "Frodo" ; Score: 0.9138280749320984
BM25 Pipe Best Answer: "Frodo" ; Score: 0.8505543172359467





In [None]:
question("What does Sauron want?")



DPR Pipe Best Answer: "news of the One" ; Score: 0.7152166962623596
BM25 Pipe Best Answer: "conquer" ; Score: 0.2585424482822418





In [None]:
question("Who is Bilbo Baggins?")



DPR Pipe Best Answer: "companion of Thorin" ; Score: 0.9290409684181213
BM25 Pipe Best Answer: "companion of Thorin" ; Score: 0.9290409684181213





In [None]:
question("What is an Elven cloak?")



DPR Pipe Best Answer: "magic cloaks" ; Score: 0.3496596962213516
BM25 Pipe Best Answer: "sheath" ; Score: 0.5922504514455795





<a name="s12"></a>
###12) A FEW OBSERVATIONS

At first sight our pipelines seem to work fairly well. However, as I shall show below, the answers to other questions make less sense, and a look at the contexts behind the seemingly correct answers above reveals some issues.

In [None]:
question("What does Sauron want?", context=True)



DPR Pipe Best Answer: "news of the One" ; Score: 0.7152166962623596
Context: and he is gathering again all the Rings to his hand; and he seeks ever for news of the One, and of the Heirs of Isildur, if they live still on earth.’
BM25 Pipe Best Answer: "conquer" ; Score: 0.2585424482822418
Context: 'And what if Sauron does not conquer? What will you do to him?' asked Pippin.





In the case above, everything seems to make reasonable sense.

In [None]:
question("Who carries the ring?", context=True)



DPR Pipe Best Answer: "Frodo" ; Score: 0.9138280749320984
Context: There was a hush, and all turned their eyes on Frodo. He was shaken by a sudden shame and fear; and he felt a great reluctance to reveal the Ring, and
BM25 Pipe Best Answer: "Frodo" ; Score: 0.8505543172359467
Context: Elrond summoned the hobbits to him. He looked gravely at Frodo. 'The time has come,' he said. `If the Ring is to set out, it must go soon. But those w





Above, although the answers are correct, it is not obvious how the retrieved contexts are relevant to the question.

In [None]:
question("Who is Bilbo Baggins?", context=True)



DPR Pipe Best Answer: "companion of Thorin" ; Score: 0.9290409684181213
Context: "It's me, Bilbo Baggins, companion of Thorin!" he cried, hurriedly taking off the ring.
BM25 Pipe Best Answer: "companion of Thorin" ; Score: 0.9290409684181213
Context: "It's me, Bilbo Baggins, companion of Thorin!" he cried, hurriedly taking off the ring.





Here the retrieved contexts make sense, but the answers are not very informative. Both pipelines also returned the same passage, hence the identical answer and score.

This gives us a glimpse of the underlying difficulty: our text data consists of novels rather than wiki-style articles, so it is not easy for our model to answer such wiki-style questions. It would probably fare much better if we scraped a Tolkien wiki website and used its content as the text data.

Let's ask a question from within the text then:

In [None]:
question("Is Rivendell safe?", context=True)



DPR Pipe Best Answer: "they are all safe and sound" ; Score: 0.6565729975700378
Context: 'Yes, they are all safe and sound,' answered Gandalf. `Sam was here until I sent him off to get some rest, about half an hour ago.'
BM25 Pipe Best Answer: "`What about Rivendell and the Elves" ; Score: 0.582397386431694
Context: `What about Rivendell and the Elves? Is Rivendell safe?'





Above I passed a question that appears verbatim in a dialogue in the first book. After having a vision of the Nazgûl, Frodo asks Gandalf:

*Frodo: 'What about Rivendell and the Elves? Is Rivendell safe?'*

*Gandalf: 'Yes, at present, until all else is conquered.'*

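So here the BM25 Retriever returned the very paragraph that contains Frodo's question. The problem is that the relevant answer ('Yes') is in the following paragraph, which is indexed as a separate document, so the Reader cannot extract it from the retrieved one.

Namely, the two paragraphs are stored as follows: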


```
[<Document: {'content': "`What about Rivendell and the Elves? Is Rivendell safe?'", 'content_type': 'text', 'score': None, 'meta': {'name': 'rivendell.txt'}, 'embedding': None, 'id': '6d8ce9649b189e7ca37a4d07811fde41'}>,
 <Document: {'content': "`Yes, at present, until all else is conquered. The Elves may fear the Dark Lord, and they may fly before him, but never again will they listen to him or serve him. And here in Rivendell there live still some of his chief foes: the Elven-wise, lords of the Eldar from beyond the furthest seas. They do not fear the Ringwraiths, for those who have dwelt in the Blessed Realm live at once in both worlds, and against both the Seen and the Unseen they have great power.'", 'content_type': 'text', 'score': None, 'meta': {'name': 'rivendell.txt'}, 'embedding': None, 'id': '4a9ecbe4c38e9570fb2d486da76e820b'}>]
```



The DPR Retriever instead surfaced a nearby passage from earlier in the text that contains the word 'safe', but that passage answers a different question: Frodo, waking up, asks Gandalf whether his companions are safe.
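
The BM25 result above is what one would expect from lexical matching: a paragraph that repeats the query terms scores highest, whether or not it contains the answer. A rough, self-contained sketch of Okapi BM25 scoring over the two Rivendell paragraphs (the second one abridged) illustrates this. It is not Elasticsearch's implementation; the crude tokenizer, the helper names, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "`What about Rivendell and the Elves? Is Rivendell safe?'",
    # abridged version of the paragraph that actually holds the answer
    "`Yes, at present, until all else is conquered. The Elves may fear the Dark Lord, "
    "and they may fly before him, but never again will they listen to him or serve him.",
]
scores = bm25_scores("Is Rivendell safe?", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc[:50]}")
```

The paragraph with the question shares all three query terms, while the paragraph with the answer shares only 'is', so the former ranks first.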

<a name="s13"></a>
###13) FINAL COMMENTS

In conclusion, the Haystack library offers very agile tools to build Q&A pipelines. 

Nevertheless, our particular goal of creating a Q&A system based on J.R.R. Tolkien's novels has proved challenging. 

It is possible that the nature of the data is affecting the pipeline's performance. Scraping a fandom wiki on J.R.R. Tolkien and using that data instead might have given better results, possibly because Reader models like RoBERTa base are typically fine-tuned for Q&A on datasets built from Wikipedia articles rather than from works of fiction.

Another improvement I would like to explore in the future concerns how dialogues from the novels are split into paragraphs: the model may perform better when a question and its answer from within a dialogue are indexed as part of the same paragraph.
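
As a starting point for that experiment, here is a minimal sketch of one possible merging rule: a paragraph that ends with a question mark (optionally followed by closing quotation marks) is joined with the paragraph that follows it. The function name `merge_question_paragraphs` and the rule itself are assumptions for illustration, not part of Haystack, and a real novel would need a more careful heuristic.

```python
import re

# A paragraph "ends with a question" if it finishes with '?' plus optional closing quotes
QUESTION_END = re.compile(r"""\?['"’”]*\s*$""")


def merge_question_paragraphs(paragraphs):
    """Join each paragraph ending in a question with the paragraph that follows it."""
    merged = []
    carry = None  # question paragraph(s) still waiting for a reply
    for para in paragraphs:
        current = para if carry is None else carry + " " + para
        if QUESTION_END.search(para):
            carry = current  # keep waiting: the reply may be in the next paragraph
        else:
            merged.append(current)
            carry = None
    if carry is not None:  # trailing question with no reply after it
        merged.append(carry)
    return merged


paragraphs = [
    "`What about Rivendell and the Elves? Is Rivendell safe?'",
    "`Yes, at present, until all else is conquered.",
]
merged = merge_question_paragraphs(paragraphs)
for doc in merged:
    print(doc)
```

The merged paragraphs could then be indexed in place of the original ones, so that Frodo's question and Gandalf's reply end up in the same document.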
