<a href="https://colab.research.google.com/github/superchargez/haystack/blob/tutorials/ESDS_GoT_LFQA_via_Haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Long-Form Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Thu Nov  4 09:07:52 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-0k4fkbmz
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-0k4fkbmz
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting mlflow<=1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting fastapi
  Downloading fastapi-0.70.0-py3-none-any.whl (51 kB)
Collecting uvicorn
  Downloading uvicorn-0.15.0-py3-none-any.whl (54 kB)
Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-no

In [15]:
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, clean_wiki_text
from haystack.nodes import Seq2SeqGenerator

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more info on which index suits your use case, see https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

Note that in this notebook the FAISS cell is commented out and an `ElasticsearchDocumentStore` is used instead.
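
To make the "Flat" idea concrete, here is a minimal pure-Python sketch of exhaustive search: the query is scored against every stored vector and the best matches are returned. This is only an illustration, not how FAISS is implemented; the function name `flat_search`, the toy vectors, and the choice of inner product as the similarity measure are all assumptions made for this example.

```python
from typing import List, Tuple


def flat_search(
    query: List[float], vectors: List[List[float]], top_k: int = 2
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and return the top_k hits.

    Hypothetical helper for illustration only; inner product is assumed
    as the similarity measure.
    """
    scores = [
        (idx, sum(q * v for q, v in zip(query, vec)))
        for idx, vec in enumerate(vectors)
    ]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Three toy 2-dimensional "embeddings"
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
results = flat_search([1.0, 0.2], vectors, top_k=2)
for idx, score in results:
    print(idx, round(score, 2))
```

Because every vector is compared, the result is exact but the cost grows linearly with the number of documents. Approximate indexes such as HNSW avoid scanning everything, which is where the speed/accuracy trade-off mentioned above comes from.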

In [2]:
# from haystack.document_stores import FAISSDocumentStore
# document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str="Flat")

<haystack.document_stores.faiss.FAISSDocumentStore at 0x7f4cfa33e350>

In [3]:
# Download and install Elasticsearch (not required on Windows)
# ! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
# ! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
# ! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

from haystack.utils import launch_es
launch_es()

Tried to start Elasticsearch through Docker but this failed. It is likely that there is already an existing Elasticsearch instance running. 


In [16]:
from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="",
                                            index="document", return_embedding=True)

### Cleaning & indexing documents

As in the previous tutorials, we download, convert, and index some Game of Thrones articles into our DocumentStore.

In [4]:
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
    
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)
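
To give a rough idea of what splitting files into paragraph-level documents means, here is a simplified stand-in written in plain Python. It is not the code behind `convert_files_to_dicts`; the helper name `split_into_paragraph_dicts` and the `content`/`meta` dictionary layout are assumptions for illustration, and the exact layout Haystack expects may differ.

```python
from typing import Dict, List


def split_into_paragraph_dicts(text: str, name: str) -> List[Dict[str, object]]:
    """Split raw text on blank lines and wrap each non-empty paragraph in a dict.

    Hypothetical helper; the dict layout is an assumption for this sketch.
    """
    docs: List[Dict[str, object]] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty chunks produced by runs of blank lines
        docs.append({"content": paragraph, "meta": {"name": name}})
    return docs


sample = (
    "=== Background ===\nArya is the third child.\n\n\n\n"
    "=== Storylines ===\nShe travels to Braavos."
)
paragraph_dicts = split_into_paragraph_dicts(sample, name="43_Arya_Stark.txt")
for doc in paragraph_dicts:
    print(doc["meta"]["name"], "->", doc["content"][:30])
```

Indexing paragraphs rather than whole articles keeps each retrieved passage short enough to be useful as context for the generator later on.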

### Initialize Retriever and Reader/Generator

#### Retriever

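**Here:** We use a `DensePassageRetriever` (DPR) with separate query and passage encoders; the original `RetribertRetriever` setup is kept commented out for reference. Invoking `update_embeddings` computes the document embeddings and indexes them in the document store (that call is commented out in the cell below).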



In [17]:
# from haystack.nodes import EmbeddingRetriever

# retriever = EmbeddingRetriever(document_store=document_store,
#                                embedding_model="yjernite/retribert-base-uncased",
#                                model_format="retribert")

# document_store.update_embeddings(retriever)

from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
# document_store.update_embeddings(retriever)

In [5]:
# document_store.save(index_path='FAISS_GoT')

In [6]:
# document_store.load(index_path='FAISS_GoT')

<haystack.document_stores.faiss.FAISSDocumentStore at 0x7f4cf7edd550>

Before we blindly use the retriever, let's empirically test it to make sure a simple search indeed finds the relevant documents.

In [18]:
from haystack.utils import print_documents
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(
    query="Tell me something about Arya Stark?",
    params={"Retriever": {"top_k": 10}}
)
print_documents(res, max_text_len=512)


Query: Tell me something about Arya Stark?

{   'content': '\n'
               '=== Background ===\n'
               'Arya is the third child and younger daughter of Eddard and '
               'Catelyn Stark and is nine years old at the beginning of the '
               'book series.  She has five siblings: an older brother Robb, an '
               'older sister Sansa, two younger brothers Bran and Rickon, and '
               'an older illegitimate half-brother, Jon Snow.',
    'name': '43_Arya_Stark.txt'}

{   'content': '\n'
               "===''A Feast for Crows''===\n"
               "After Lysa's death, Sansa becomes mistress of the Eyrie and "
               "still pretends to be Baelish's illegitimate daughter, Alayne "
               'Stone. Baelish successfully pacifies the lords of the Vale, '
               "who suspected Baelish's hand in Lysa's death. Afterwards, "
               'Baelish reveals to Sansa his plans to eventually marry her to '
               'the heir t

#### Reader/Generator

As in the previous tutorials, we now initialize our reader/generator.

Here we use a `Seq2SeqGenerator` with the *yjernite/bart_eli5* model (see: https://huggingface.co/yjernite/bart_eli5)
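
A sequence-to-sequence generator receives the question and the retrieved passages as a single input string. The sketch below shows one plausible way to build that string. The `question: ... context: <P> ...` layout is an assumption based on the convention commonly used with ELI5-style BART models, and `build_eli5_input` is a hypothetical helper, not part of Haystack; check the model card for the authoritative format.

```python
from typing import List


def build_eli5_input(query: str, passages: List[str]) -> str:
    """Concatenate a question and its supporting passages into one prompt.

    Assumed format: each passage is prefixed with a "<P>" separator and the
    whole context follows the question.
    """
    context = "<P> " + " <P> ".join(passages)
    return f"question: {query} context: {context}"


eli5_input = build_eli5_input(
    "What is Arya's sister's name?",
    [
        "Arya has an older sister Sansa.",
        "Sansa becomes mistress of the Eyrie.",
    ],
)
print(eli5_input)
```

The generator then produces a free-text answer conditioned on this combined input, which is why retrieval quality directly affects the quality of the generated answer.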



In [19]:
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")

### Pipeline

With a Haystack `Pipeline` you can stick together your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a retriever and a reader/generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
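
To illustrate the idea of chaining nodes and routing parameters by node name, here is a toy sketch in plain Python. It models only a linear chain (the simplest kind of DAG). All classes here (`ToyRetriever`, `ToyGenerator`, `ToyPipeline`) are hypothetical stand-ins and do not reflect Haystack's actual implementation.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyRetriever:
    """Hypothetical node: ranks documents by word overlap with the query."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def run(self, query: str, top_k: int = 10, **kwargs: Any) -> Dict[str, Any]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(query_words & set(doc.lower().split())),
            reverse=True,
        )
        return {"query": query, "documents": ranked[:top_k]}


class ToyGenerator:
    """Hypothetical node: 'generates' an answer by echoing the context."""

    def run(self, query: str, documents: List[str], **kwargs: Any) -> Dict[str, Any]:
        return {"answers": [" ".join(documents)]}


class ToyPipeline:
    """Runs nodes in order, passing each node's output on to the next."""

    def __init__(self) -> None:
        self.nodes: List[Tuple[str, Any]] = []

    def add_node(self, name: str, node: Any) -> None:
        self.nodes.append((name, node))

    def run(self, query: str, params: Optional[Dict[str, Dict[str, Any]]] = None) -> Dict[str, Any]:
        params = params or {}
        state: Dict[str, Any] = {"query": query}
        for name, node in self.nodes:
            # Parameters are looked up by node name, e.g. {"Retriever": {"top_k": 1}}
            state.update(node.run(**state, **params.get(name, {})))
        return state


docs = ["Arya has an older sister named Sansa", "Jon Snow joins the Night's Watch"]
toy_pipe = ToyPipeline()
toy_pipe.add_node("Retriever", ToyRetriever(docs))
toy_pipe.add_node("Generator", ToyGenerator())
result = toy_pipe.run(query="arya sister", params={"Retriever": {"top_k": 1}})
print(result["answers"])
```

The `params` dictionary mirrors the shape used in this notebook's `run` calls: the outer key names the node and the inner dictionary holds that node's arguments.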

In [20]:
from haystack.pipelines import GenerativeQAPipeline
pipe = GenerativeQAPipeline(generator, retriever)

## Voilà! Ask a question!

In [21]:
print_documents(pipe.run(
    query="What is Arya's sister's name?",
    params={"Retriever": {"top_k": 1}}
))

Query: What is Arya's sister's name?

{   'content': '\n'
               '=== Background ===\n'
               'Arya is the third child and younger daughter of Eddard and '
               'Catelyn Stark and is nine years old at the beginning of the '
               'book series.  She has five siblings: an older brother Robb, an '
               'older sister Sansa, two younger brothers Bran and Rickon, and '
               'an older illegitimate half-brother, Jon Snow.',
    'name': '43_Arya_Stark.txt'}


In [24]:
print_documents(pipe.run(query="Who was the luckiest character in the show?", params={"Retriever": {"top_k": 1}}))

Query: Who was the luckiest character in the show?

{   'content': '\n'
               '==Production==\n'
               "As with the previous episode, the show's opening title "
               'sequence is modified to depict the characters in their '
               'role-playing garb, while the soundtrack has been altered to '
               "include the penis-themed chorus singing to the ''Game of "
               "Thrones'' opening theme introduced in the previous episode. "
               'Series co-creators Trey Parker and Matt Stone said that they '
               'experimented with different styles of opening sequences before '
               'settling on the penis-themed chorus version; a Japanese '
               'Princess Kenny opening sequence was one of the original ideas.',
    'name': '101_Titties_and_Dragons.txt'}


## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)