# Long-Form Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Sun Nov  7 14:18:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-2v_7yv3d
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-2v_7yv3d
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting mlflow<=1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting fastapi
  Downloading fastapi-0.70.0-py3-none-any.whl (51 kB)
Collecting uvicorn
  Downloading uvicorn-0.15.0-py3-none-any.whl (54 kB)
Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-n

In [1]:
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, clean_wiki_text
from haystack.nodes import Seq2SeqGenerator



### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (in-memory SQLite by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more info on which index suits your use case: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
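
To make the "Flat" flavour more concrete, here is a minimal pure-Python sketch of an exhaustive search: every stored vector is scored against the query and the best matches are returned. It is not how FAISS is implemented. The function name `flat_search`, the toy vectors, and the choice of inner-product scoring are assumptions made only for illustration.

```python
from typing import List, Tuple


def flat_search(index: List[List[float]], query: List[float], top_k: int = 3) -> List[Tuple[int, float]]:
    """Score every stored vector against the query (inner product) and return the top_k hits."""
    scores = [
        (doc_id, sum(q * v for q, v in zip(query, vec)))
        for doc_id, vec in enumerate(index)
    ]
    # Highest score first; an exhaustive ("flat") search looks at every vector.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical 3-dimensional embeddings for three documents.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
hits = flat_search(vectors, query=[1.0, 0.2, 0.0], top_k=2)
print(hits)
```

An approximate index such as HNSW avoids scoring every vector, which is where the speed-up and the possible loss of accuracy come from.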

In [2]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str="Flat")

### Cleaning & indexing documents

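As in the previous tutorials, we download, convert and index a set of documents into our DocumentStore. The cell below still defines the Game of Thrones archive URL, but the second `s3_url` assignment overrides it, so the documents that actually get indexed come from the second archive.

In [6]:
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
# Game of Thrones archive used in earlier tutorials (overridden by the next assignment)
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
# Archive that is actually downloaded in this run
s3_url = "https://github.com/zseebrz/SR_EU_added_value/raw/master/52_document_corpus_30_sept.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)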

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
    
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)
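
As a rough illustration of what converting files with `split_paragraphs=True` amounts to, the sketch below splits raw text on blank lines and wraps each paragraph in a dict. This is not Haystack's implementation. The helper `text_to_paragraph_dicts` is hypothetical, and the `content` and `meta` key names are assumptions based on the output printed later in this tutorial.

```python
from typing import Dict, List


def text_to_paragraph_dicts(text: str, name: str) -> List[Dict]:
    """Split raw text on blank lines and wrap each non-empty paragraph in a dict."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    # Assumed layout: the paragraph text plus the source file name as metadata.
    return [{"content": p, "meta": {"name": name}} for p in paragraphs if p]


sample = "First paragraph about water supply.\n\nSecond paragraph about rail.\n\n\n"
docs = text_to_paragraph_dicts(sample, name="sample.txt")
for d in docs:
    print(d)
```

Splitting into paragraphs keeps each indexed unit short, so each embedding describes one passage rather than a whole report.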

### Initialize Retriever and Reader/Generator

#### Retriever

**Here:** We use an `EmbeddingRetriever` with the RetriBERT model (`yjernite/retribert-base-uncased`) and invoke `update_embeddings` to index the document embeddings in the `FAISSDocumentStore`.



In [7]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="yjernite/retribert-base-uncased",
                               model_format="retribert")

document_store.update_embeddings(retriever)

Some weights of RetriBertModel were not initialized from the model checkpoint at yjernite/retribert-base-uncased and are newly initialized: ['bert_query.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Updating Embedding:   0%|          | 0/507 [00:00<?, ? docs/s]

Creating Embeddings:   0%|          | 0/16 [00:00<?, ? Batches/s]

Documents Processed: 10000 docs [00:39, 254.11 docs/s]


Before we blindly use the RetriBERT-based retriever, let's empirically test it to make sure a simple search indeed finds the relevant documents.

In [10]:
from haystack.utils import print_documents
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(
    query="Tell me something about cybersecurity?",
    params={"Retriever": {"top_k": 10}}
)
print_documents(res, max_text_len=512)


Query: Tell me something about cybersecurity?

{   'content': '(pursuant to Article 287(4), second subparagraph, TFEU)\n'
               'The Common External Relations Information System (CRIS)\n'
               'together with the Commissions replies\n'
               'Despite some weakenesses, CRIS is now being developed to \n'
               'respond to the Commissions needs\t24-44\n'
               'System development projects now respond to well identified \n'
               'Need for an up-to-date definition of CRISs role\t30-35\n'
               'Persisting problems with user friendliness\t42-44\n'
               'CRIS management is not yet sufficiently effective in '
               'ensuring \n'
               'Insufficient security of the system and...',
    'name': 'DVC009318EN04-12PP-CH365-12APCFIN-RS-CRIS_final-ORAN.doc.txt'}

{   'content': '(pursuant to Article 287(4), second subparagraph, TFEU)\n'
               'The European Institute of Innovation and Technology must '


#### Reader/Generator

Similar to previous tutorials, we now initialize our reader/generator.

Here we use a `Seq2SeqGenerator` with the *yjernite/bart_eli5* model (see: https://huggingface.co/yjernite/bart_eli5)



In [11]:
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")

### Pipeline

With a Haystack `Pipeline` you can stick together your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a retriever and a reader/generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
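
To illustrate the idea of chaining a retriever and a generator, here is a toy sketch of a linear pipeline, the simplest special case of a DAG. It does not use Haystack. `TinyPipeline`, `toy_retriever`, `toy_generator`, and `CORPUS` are all hypothetical names invented for this example, and the "generator" only echoes the retrieved text.

```python
from typing import Callable, Dict, List, Tuple

CORPUS = [
    "High-speed rail lines are costly to build.",
    "Water supply projects need co-financing.",
]


def toy_retriever(state: Dict) -> Dict:
    """Rank documents by word overlap with the query and keep the best one."""
    words = set(state["query"].lower().split())
    ranked = sorted(CORPUS, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    return {**state, "documents": ranked[:1]}


def toy_generator(state: Dict) -> Dict:
    """Stand-in for a seq2seq model: builds an answer string from the retrieved text."""
    answer = "Based on the retrieved text: " + state["documents"][0]
    return {**state, "answers": [answer]}


class TinyPipeline:
    """Runs nodes in insertion order; each node receives and returns a state dict."""

    def __init__(self) -> None:
        self.nodes: List[Tuple[str, Callable[[Dict], Dict]]] = []

    def add_node(self, name: str, component: Callable[[Dict], Dict]) -> None:
        self.nodes.append((name, component))

    def run(self, query: str) -> Dict:
        state: Dict = {"query": query}
        for _name, component in self.nodes:
            state = component(state)
        return state


toy_pipe = TinyPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Generator", toy_generator)
result = toy_pipe.run("Is rail costly?")
print(result["answers"])
```

The output of each node becomes the input of the next, which is the same flow a retriever followed by a generator has in a generative QA setup.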

In [12]:
from haystack.pipelines import GenerativeQAPipeline
pipe = GenerativeQAPipeline(generator, retriever)

## Voilà! Ask a question!

In [13]:
pipe.run(
    query="What is the water quality situation in Romania?",
    params={"Retriever": {"top_k": 1}}
)

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


{'answers': [" It's not that the water quality is bad, it's that there's no money to be made to fix it."],
 'documents': [{'content': '(pursuant to Article 287(4), second subparagraph, TFEU)\nIs EU Structural Measures spending on the supply of water for domestic \ntogether with the Commission’s replies\nEU Structural Measures co-financing of water supply \nAudit scope and objectives\t11 - 15\nWere the most appropriate solutions adopted in order to meet \nthe needs of the areas concerned?\t16 - 35\nIn almost all cases, forecasts of needs did not take into \naccount downward trends in per capita water consumption, \nand in some cases, not all resources already available were \nThe focus is on building infrastructures to exploit new water \nsources and attention is rarely paid to other solutions, \nsuch as reducing water losses, …\t25 - 27\n… or using more accessible resources\t28\nLimited value was added by the grant applications’ \nappraisal by the Commission and the Member States’ \nHa

In [14]:
pipe.run(query="Is high-speed rail expensive in Spain?", params={"Retriever": {"top_k": 1}})

{'answers': [' High speed rail is expensive in Spain because it takes a long time to build and maintain a high speed rail network. It takes a lot of money to build a rail network, and it takes even more money to maintain it.'],
 'documents': [{'content': "(pursuant to Article 287(4), second subparagraph, TFEU)\nA European high-speed rail network: \nnot a reality but an ineffective patchwork\ntogether with the Commission’s replies\nHigh-speed rail in Europe\t1 - 2\nThe EU’s high-speed rail network is growing in size and in rate of utilisation\t3 - 4\nEU policies for high-speed rail\t5 - 9\nEU support for building high-speed lines: significant, but a fraction of total cost\t10 - 13\nAudit scope and approach\t14 - 20\nEU co-funded investments in high-speed rail can be beneficial, but there is no solid EU-wide strategic approach\t21 - 36\nHigh-speed rail is a beneficial mode of transport which contributes to the EU’s sustainable-mobility objectives\t21 - 22\nThe Commission's powers are lim
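
The raw result is hard to read because each retrieved document is printed in full. Below is a small sketch of a formatting helper, assuming the result is a dict with an `answers` list of strings and a `documents` list of dicts that each have a `content` key, as in the output above. `summarize_result` is a hypothetical helper and not part of Haystack. The sample dict is shortened and made up for the demonstration.

```python
from typing import Dict, List


def summarize_result(result: Dict, max_chars: int = 80) -> List[str]:
    """Return one line per answer and one truncated line per source document."""
    lines = []
    for answer in result.get("answers", []):
        lines.append("Answer: " + answer.strip())
    for doc in result.get("documents", []):
        # Collapse newlines so each source fits on a single line.
        snippet = doc["content"].replace("\n", " ")[:max_chars]
        lines.append("Source: " + snippet)
    return lines


# Shortened stand-in for a pipeline result (assumed shape, illustrative values).
sample_result = {
    "answers": [" High speed rail is expensive to build and maintain."],
    "documents": [{"content": "A European high-speed rail network:\nnot a reality but an ineffective patchwork"}],
}
for line in summarize_result(sample_result, max_chars=40):
    print(line)
```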

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)