<a href="https://colab.research.google.com/github/witold87/pub_haystack_unstructurized/blob/master/Retrieval_via_DPR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Better Retrieval via "Dense Passage Retrieval"


### Importance of Retrievers

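The Retriever has a huge impact on the performance of our overall search pipeline: the Reader only sees the documents that the Retriever returns, so an answer in a document that was not retrieved cannot be found.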


### Different types of Retrievers
#### Sparse
A family of algorithms based on counting word occurrences (bag-of-words), which yields very sparse vectors whose length equals the vocabulary size.

**Examples**: BM25, TF-IDF

**Pros**: Simple, fast, easy to explain

**Cons**: Relies on exact keyword matches between query and text
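To make the "sparse vector" idea concrete, here is a minimal sketch in plain Python. It is a toy TF-IDF, not BM25 and not what Haystack runs internally; the three documents, the naive whitespace tokenizer, and the helper names are all made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# Made-up mini corpus for illustration only.
docs = [
    "the car is parked in the garage",
    "an automobile needs regular service",
    "the garage door is open",
]
query = "car"

# The vocabulary fixes the vector length: one dimension per distinct word.
vocab = sorted({tok for d in docs for tok in tokenize(d)})

def tfidf_vector(text):
    """Return a TF-IDF vector over `vocab`; most entries stay zero (sparse)."""
    counts = Counter(tokenize(text))
    vec = []
    for term in vocab:
        df = sum(term in tokenize(d) for d in docs)  # document frequency
        idf = math.log(len(docs) / df)  # df >= 1 because vocab is built from docs
        vec.append(counts[term] * idf)
    return vec

def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

scores = [dot(tfidf_vector(query), tfidf_vector(d)) for d in docs]
print(len(vocab), scores)
```

Only the first document gets a non-zero score. The second one talks about an "automobile", but since it shares no token with the query "car", a purely keyword-based score cannot see the connection.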
 

#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two different approaches: 

a) Single encoder: Use a **single model** to embed both query and passage.  
b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...). 

**Examples**: REALM, DPR, Sentence-Transformers

**Pros**: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics ...)

**Cons**: Computationally heavier, requires initial training of the model
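The dual-encoder approach can be sketched as follows. The two "encoders" here are just lookup tables with hand-picked, hypothetical 3-dimensional vectors; a real dual encoder computes these vectors with two trained neural networks. The sketch only shows the scoring step: query and passages are embedded separately and compared with a dot product.

```python
# Hypothetical, hand-picked embeddings. A real dual encoder would compute
# these with two trained models (one for queries, one for passages).
QUERY_EMBEDDINGS = {
    "how do I maintain my car?": [0.9, 0.1, 0.0],
}
PASSAGE_EMBEDDINGS = {
    "an automobile needs regular service": [0.8, 0.2, 0.1],
    "the garage door is open": [0.1, 0.9, 0.2],
    "bake the bread for forty minutes": [0.0, 0.1, 0.9],
}

def encode_query(text):
    """Stand-in for the query encoder."""
    return QUERY_EMBEDDINGS[text]

def encode_passage(text):
    """Stand-in for the passage encoder."""
    return PASSAGE_EMBEDDINGS[text]

def dot(a, b):
    """Inner product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))

def rank(query, passages):
    """Score every passage against the query and sort best-first."""
    q = encode_query(query)
    scored = [(dot(q, encode_passage(p)), p) for p in passages]
    return sorted(scored, reverse=True)

ranked = rank("how do I maintain my car?", list(PASSAGE_EMBEDDINGS))
for score, passage in ranked:
    print(round(score, 2), passage)
```

With these toy vectors the "automobile" passage ranks first although it shares no keyword with the query, which is the behaviour a sparse retriever cannot provide.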


### "Dense Passage Retrieval"

In this tutorial, we want to highlight one dense dual-encoder called Dense Passage Retriever (DPR). 
It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). 

Original Abstract: 

_"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."_

Paper: https://arxiv.org/abs/2004.04906  
Original Code: https://fburl.com/qa-dpr 
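The abstract reports results as "top-20 passage retrieval accuracy". As a rough illustration of that kind of metric, the sketch below counts a query as a hit when at least one relevant passage appears among its top k results. This is a simplification under stated assumptions: relevance is given here as hypothetical gold passage ids, and the function name and sample data are made up.

```python
def top_k_accuracy(ranked_ids_per_query, gold_ids_per_query, k):
    """Fraction of queries with at least one gold passage among the top k results."""
    hits = 0
    for ranked, gold in zip(ranked_ids_per_query, gold_ids_per_query):
        if set(ranked[:k]) & set(gold):
            hits += 1
    return hits / len(ranked_ids_per_query)

# Made-up retrieval results for three queries (best match first).
ranked = [["p3", "p1", "p7"], ["p2", "p9", "p4"], ["p5", "p6", "p8"]]
# Made-up relevant passages for the same three queries.
gold = [{"p1"}, {"p4"}, {"p0"}]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, gold, k))
```

The score can only grow with k, which is why a larger retriever `top_k` gives the reader a better chance of seeing the right passage.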


*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb) *to open the notebook in Google Colab.*


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

Mon Jan 17 07:12:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

Collecting grpcio-tools==1.34.1
  Downloading grpcio_tools-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: grpcio-tools
Successfully installed grpcio-tools-1.34.1
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-1yu2xzju
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-1yu2xzju
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting mlflow<=1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
Collecting transformers==4.13.0
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting fastapi
  Downloading fastapi-0.72.0-py3-none-any.whl (52 k

In [None]:
from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader


### Document Store

#### Option 1: FAISS

FAISS is a library for efficient similarity search on a cluster of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more info on which index suits your use case, see https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
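The split described above (text and metadata in SQL, embeddings in a separate vector index) can be sketched with the standard library alone. This is not how `FAISSDocumentStore` is implemented: the table layout, the column names, the plain Python list standing in for a "Flat" index, and the sample documents are all assumptions made for illustration.

```python
import sqlite3

# Text and metadata live in SQL (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (vector_id INTEGER PRIMARY KEY, content TEXT, name TEXT)"
)

# Embeddings live in a separate index; the vector at position i belongs to vector_id i.
vector_index = []

def add_document(content, name, embedding):
    """Store the text in SQL and the embedding in the vector index."""
    vector_id = len(vector_index)
    vector_index.append(embedding)
    conn.execute("INSERT INTO document VALUES (?, ?, ?)", (vector_id, content, name))
    return vector_id

def search(query_embedding, top_k=1):
    """Exhaustive ("Flat") search: score every vector, then fetch the text from SQL."""
    scored = sorted(
        (
            (sum(q * v for q, v in zip(query_embedding, vec)), vid)
            for vid, vec in enumerate(vector_index)
        ),
        reverse=True,
    )
    results = []
    for score, vid in scored[:top_k]:
        content, name = conn.execute(
            "SELECT content, name FROM document WHERE vector_id = ?", (vid,)
        ).fetchone()
        results.append((content, name, score))
    return results

# Made-up documents with made-up 2-dimensional embeddings.
add_document("A first example passage.", "doc_a", [0.9, 0.1])
add_document("A second example passage.", "doc_b", [0.2, 0.8])

hits = search([1.0, 0.0], top_k=1)
print(hits)
```

A flat index compares the query with every stored vector, which is exact but scales linearly with the number of documents. Approximate indexes such as HNSW trade some accuracy for speed by avoiding that full scan.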

In [None]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

#### Option 2: Milvus

Milvus is an open source database library that is also optimized for vector similarity searches like FAISS.
Like FAISS it has both a "Flat" and "HNSW" mode but it outperforms FAISS when it comes to dynamic data management.
It does require a little more setup, however, as it is run through Docker and requires the setup of some config files.
See [their docs](https://milvus.io/docs/v1.0.0/milvus_docker-cpu.md) for more details.

In [None]:
# from haystack.utils import launch_milvus
# from haystack.document_stores import MilvusDocumentStore

# launch_milvus()
# document_store = MilvusDocumentStore()

Retry to connect server localhost:19530 failed.
ERROR - milvus.client.grpc_handler -  Retry to connect server localhost:19530 failed.


NotConnectError: ignored

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61102 sha256=999033440b08b37e71f766bf1e5dfd4f26dc913bec3c4354f420e40ce19f8ccb
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Successfu

In [None]:
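import os, shutil

# Empty the working folder so pages from an earlier run do not get indexed again.
folder = '/content/data/ar'
os.makedirs(folder, exist_ok=True)  # create the folder on the first run
for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    try:
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)  # remove files and symlinks
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)  # remove sub-directories recursively
    except Exception as e:
        print('Failed to delete %s. Reason: %s' % (file_path, e))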

In [None]:
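import PyPDF2

pdf_path = '/content/drive/MyDrive/Colab Notebooks/Leave and License Agreement.pdf'
path = "/content/data/ar/"
passages = []

# The camelCase calls below match PyPDF2 1.26.0, the version installed above.
# Keep the PDF open while its pages are read; the `with` block closes it afterwards.
with open(pdf_path, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_pages = pdfReader.numPages
    print("Page count:", number_pages)
    for i in range(number_pages):
        try:
            paragraph = pdfReader.getPage(i).extractText()
            # The file name uses the first 10 characters of the page text,
            # which may include spaces or line breaks.
            doc_name = "page_" + str(i) + "_" + paragraph[:10] + ".txt"
            with open(path + doc_name, "w") as page_file:
                page_file.write(paragraph)
            passages.append(paragraph)
        except Exception as e:
            # Report the page that failed instead of skipping it silently.
            print("Failed to extract page %d. Reason: %s" % (i, e))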

Page count: 3
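The cell above builds each file name from the first ten characters of the page text, so line breaks or path separators in that text end up in the name. A minimal sketch of one way to avoid this is shown below; `safe_name` is a hypothetical helper that is not part of the notebook, and the sample strings are made up.

```python
import re

def safe_name(page_number, text, max_chars=10):
    """Build a file name from a page number and the first characters of its text."""
    snippet = text.strip()[:max_chars]
    # Replace every run of characters other than letters, digits, '-' or '_' with '_'.
    snippet = re.sub(r"[^A-Za-z0-9_-]+", "_", snippet).strip("_")
    return f"page_{page_number}_{snippet or 'untitled'}.txt"

print(safe_name(0, " \n \nLEAVE AND LICENSE"))  # page_0_LEAVE_AND.txt
print(safe_name(1, "1.\n \nThis agreement"))    # page_1_1_This.txt
print(safe_name(2, "\n\n"))                     # page_2_untitled.txt
```

Using such a helper would change the file names that appear in the conversion log further down, so treat it as an optional clean-up rather than a drop-in replacement.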


### Cleaning & indexing documents

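Similarly to the previous tutorials, we download some Game of Thrones articles. The cell below then points `doc_dir` at `/content/data/ar`, so the documents that are actually converted and indexed in our DocumentStore are the PDF pages extracted above.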

In [None]:
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Point doc_dir at the PDF pages extracted above; these are the files that get indexed
doc_dir = "/content/data/ar"

# Start from an empty store (the log below reports delete_all_documents() as
# deprecated in favour of delete_documents())
document_store.delete_all_documents()

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=None, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

INFO - haystack.utils.import_utils -  Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.
                1. delete_all_documents() method is deprecated, please use delete_documents method
                For more details, please refer to the issue: https://github.com/deepset-ai/haystack/issues/1045
                
INFO - haystack.utils.preprocessing -  Converting /content/data/ar/page_1_1.
 
This .txt
INFO - haystack.utils.preprocessing -  Converting /content/data/ar/page_0_ 
 
LEAVE .txt
INFO - haystack.utils.preprocessing -  Converting /content/data/ar/page_2_premises o.txt


Writing Documents:   0%|          | 0/5 [00:00<?, ?it/s]
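The log shows three converted files but five documents being written, which is consistent with `split_paragraphs=True` turning each file into several documents. The sketch below shows one plausible way such a split could work; it assumes paragraphs are separated by blank lines and that each document is a dict with `content` and `meta` keys. Both are assumptions rather than a copy of Haystack's implementation, and the sample text is made up.

```python
def split_into_paragraphs(text, name):
    """Split text on blank lines and wrap each non-empty paragraph in a dict."""
    docs = []
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        docs.append({"content": para, "meta": {"name": name}})
    return docs

# Made-up page text with blank lines between clauses.
page = "First clause of the agreement.\n\nSecond clause.\n\n\n\nThird clause."
docs = split_into_paragraphs(page, "page_0.txt")

for doc in docs:
    print(doc)
```

Each paragraph becomes its own retrievable unit, which matters for DPR because the passage encoder truncates long inputs (see `max_seq_len_passage` in the retriever cell).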

### Initialize Retriever, Reader & Pipeline

#### Retriever

**Here:** We use a `DensePassageRetriever`

**Alternatives:**

- The `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging

In [None]:
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
# Important: 
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation. 
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once. 
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
IN

Updating Embedding:   0%|          | 0/5 [00:00<?, ? docs/s]

Create embeddings:   0%|          | 0/16 [00:00<?, ? Docs/s]
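The "embed once, query many times" pattern described in the code comments above can be sketched in plain Python. `toy_embed` is a stand-in encoder based on letter counts, not DPR, and `ToyStore` is a hypothetical class that only mimics the idea behind `update_embeddings()`. The documents and questions are made up.

```python
import math

def toy_embed(text):
    """Stand-in encoder (NOT DPR): L2-normalised letter counts, 26 dimensions."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class ToyStore:
    """Hypothetical store that caches one embedding per document."""

    def __init__(self, docs):
        self.docs = docs
        self.embeddings = None
        self.passage_encoder_calls = 0

    def update_embeddings(self):
        """Index time: embed every stored document once and cache the vectors."""
        self.embeddings = []
        for doc in self.docs:
            self.embeddings.append(toy_embed(doc))
            self.passage_encoder_calls += 1

    def query(self, question, top_k=1):
        """Query time: embed only the question and compare it with the cached vectors."""
        q = toy_embed(question)
        scored = [
            (sum(a * b for a, b in zip(q, e)), d)
            for e, d in zip(self.embeddings, self.docs)
        ]
        return sorted(scored, reverse=True)[:top_k]

store = ToyStore(["security deposit amount", "notice period of one month"])
store.update_embeddings()
for question in ["deposit", "notice", "month"]:
    store.query(question)

print(store.passage_encoder_calls)  # 2: one call per document, however many queries run
```

The expensive part (encoding every passage) happens once at index time, and each query afterwards costs one encoder call plus cheap vector comparisons.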

#### Reader

Similar to previous tutorials, we now initialize our reader.

Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)



##### FARMReader

In [None]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
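As a rough mental model, a retriever-reader pipeline is a chain of two nodes in which each node receives only the parameters filed under its own name. The sketch below uses made-up stand-ins (`toy_retriever`, `toy_reader`, `run_pipeline`) and made-up documents. It illustrates the data flow only and is not Haystack's `Pipeline` implementation.

```python
def toy_retriever(query, documents, top_k=10):
    """Keep the top_k documents that share the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def toy_reader(query, documents, top_k=5):
    """Stand-in for a reader: return the first sentence of each candidate document."""
    return [d.split(".")[0] for d in documents][:top_k]

def run_pipeline(query, documents, params):
    """Run retriever then reader, routing parameters by node name."""
    candidates = toy_retriever(query, documents, **params.get("Retriever", {}))
    return toy_reader(query, candidates, **params.get("Reader", {}))

# Made-up documents for illustration.
documents = [
    "The deposit is due on signing. It is refundable.",
    "The garden is shared. Pets are allowed.",
    "Notice must be given in writing.",
]
answers = run_pipeline(
    "when is the deposit due",
    documents,
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(answers)  # ['The deposit is due on signing']
```

The reader never sees documents that the retriever dropped, so the retriever's `top_k` caps what the reader can find.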

In [None]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [None]:
# You can configure how many candidates the Reader and Retriever shall return.
# A higher top_k for the Retriever tends to improve the answers, but it also slows down the query.
prediction = pipe.run(
    query="How much is the deposit?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
print_answers(prediction, details="minimum")

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.10 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.48 Batches/s]


Query: How much is the deposit?
Answers:
[   {   'answer': 'Rs. \n50000',
        'context': 'permissive user as license only.\n'
                   ' \n'
                   ' \n'
                   '3.\n'
                   ' \n'
                   'T\n'
                   'he licensee shall deposit Rs. \n'
                   '50000\n'
                   ' \n'
                   'and keep deposited the said amount as \n'
                   'security deposit /m\n'
                   'oney adv'},
    {   'answer': '1 \nmonth/years',
        'context': 'est or objection to the licensor on \n'
                   'expi\n'
                   'ry of the above period of \n'
                   '1 \n'
                   'month/years, from the date of executing this \n'
                   'present Agreement for Leav\n'
                   'e and'},
    {   'answer': '5.\n'
                  ' \n'
                  'T\n'
                  'he licensee during the subsistence of this present '
    




In [None]:
# Print the answers stored in `prediction` by the most recent pipe.run(...) call
print_answers(prediction, details="minimum")


Query: What are the key areas of business?
Answers:
[   {   'answer': 'Leave and License',
        'context': 'shall be payable within first five days of the concerned \n'
                   'month of Leave and License. Licensees shall also pay to '
                   'the Licensor Rs\n'
                   'the use of the said '},
    {   'answer': 'month of Leave and License',
        'context': 'es .\n'
                   'shall be payable within first five days of the concerned \n'
                   'month of Leave and License. Licensees shall also pay to '
                   'the Licensor Rs\n'
                   'the use of the '},
    {   'answer': 'named as also his respective heirs, successors, ass\n'
                  'igns, executors and administrators',
        'context': 'reement is made and executed on \n'
                   'named as also his respective heirs, successors, ass\n'
                   'igns, executors and administrators)\n'
                   '_____________
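The answers printed above contain stray line breaks left over from PDF text extraction (for example `'Rs. \n50000'`). A minimal sketch for tidying such strings before showing them to a user is given below; `normalize_whitespace` is a hypothetical helper and not part of Haystack.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (including newlines) into a single space."""
    return " ".join(text.split())

print(normalize_whitespace("Rs. \n50000"))      # Rs. 50000
print(normalize_whitespace("1 \nmonth/years"))  # 1 month/years
```

This only cleans up spacing. Words that the extraction split across lines (such as `'T\nhe licensee'`) would still come out as two tokens and need a different repair.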

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
