<a href="https://colab.research.google.com/github/rajni-arora/Question_Answering-Similarity_search/blob/main/Faiss_in_Haystack_step3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

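Elasticsearch - Treated here as the **brute-force** baseline: it scores documents against the query with term-matching functions such as TF-IDF and BM25. This is a **sparse** vector based approach. In a question-answering pipeline it retrieves candidate passages, which are then passed to a **reading comprehension** model.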
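To make the sparse scoring idea concrete, here is a minimal TF-IDF sketch in plain Python. It is a simplified illustration under stated assumptions, not how Elasticsearch scores documents internally; the function name `tfidf_scores` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# toy documents (hypothetical, for illustration only)
docs = [
    'the yuan dynasty ruled china',
    'a turing machine manipulates symbols',
    'the ming dynasty followed the yuan dynasty',
]
scores = tfidf_scores('turing machine', docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only documents that share words with the query receive a non-zero score, which is why this family of methods is called sparse.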

FAISS - A **cluster** based technique and **dense** vector based approach, used as the retriever in a **Reader-Retriever** pipeline. It has 3 steps:
1. Dimensionality reduction - PCA/L2 normalization
2. IVF (inverted file index)
3. Coarse and fine quantization
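The IVF step above can be sketched in plain Python. This is a toy illustration, not FAISS code: the centroids are hand-picked (FAISS would learn them by clustering), and the names `build_ivf` and `ivf_search` are hypothetical. Note that the 'Flat' index factory string used later in this notebook performs an exhaustive search, so this shows the IVF idea rather than what that particular store does.

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # assumption: hand-picked, not learned
lists = build_ivf(vectors, centroids)
result = ivf_search((4.9, 5.0), vectors, centroids, lists, nprobe=1, top_k=1)
print(lists, result)
```

The speed-up comes from skipping every cell that is not probed; the trade-off is that a true nearest neighbour sitting in an unprobed cell is missed.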

In [None]:
FAISS - an information retrieval tool: a similarity search index used to retrieve information from a large database.


FAISS is unfortunately not presently supported on Windows, so if you are on Windows then you will need to stick with Elasticsearch.

In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch,inference]

In [None]:
!pip install farm-haystack[faiss]

Import and initialize a FAISS document store using very similar logic to what we used before, but this time we will be storing the FAISS index locally.

Storing the index locally means that we will need two files, a SQLite database, and the FAISS index. We create the FAISS index later, but we create the SQLite database on initialization.

We will store both in the models directory, but adjust this to your own needs.

In [1]:
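import os

# directory that will hold both the SQLite database and the FAISS index
path = '/content/models/faiss'

# create the directory (and any parents) if it does not already exist
os.makedirs(path, exist_ok=True)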

And now we include this path within a SQLite database location string in the following document store initialization.

In [2]:
from haystack.document_stores import FAISSDocumentStore

# initialize FAISS
document_store = FAISSDocumentStore(
    faiss_index_factory_str='Flat',
    sql_url=f'sqlite:///{path}/squad_dev.db',
    return_embedding=True
)

Load the validation data from file, which we will be adding to the FAISS index.

In [3]:
import json

with open('/content/dev.json', 'r') as f:
    squad = json.load(f)


# Adding Data

As we saw with Elasticsearch, our current FAISS index has been initialized but contains nothing. Now we need to populate the index with our dev.json data.

This time, we'll be making use of the Haystack Document object, which we import further down. As a first pass, we write the contexts as plain dictionaries with a 'content' key:

In [4]:
squad_docs = []

for sample in squad:
    squad_docs.append({
        'content': sample['context']
    })

In [5]:
document_store.write_documents(squad_docs)

Writing Documents: 20000it [00:29, 682.02it/s]


In [6]:
from haystack import Document

This object prepares our data into the correct object format for our document stores - which in this case is FAISS.

As before, where we had a dictionary with the keys 'content' and 'meta', the Document object provides two corresponding arguments, content and meta. So rather than using the dictionary format, which looks like:

{
    'content': '<document text here>',
    'meta': {
        'other': '<other info here>'
    }
}

We will be using this Document object format instead:

Document(
    content='<document text here>',
    meta={
        'other': '<other info here>'
    }
)

Just like before, we will be feeding these Document objects into a list, which we will then feed into our FAISS write_documents method. Remember, our dataset contains duplicate contexts, so we must remove them first using list(set(...)).

In [7]:
# create list of contexts
contexts = [sample['context'] for sample in squad]

# remove duplicates
contexts = list(set(contexts))

# create list of Document objects
squad_docs = [Document(content=sample) for sample in contexts]
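One detail worth knowing: list(set(...)) removes duplicates but does not keep the original order, and the order of string sets can differ between runs. If a stable order matters, an order-preserving variant can be sketched as follows. The `samples` list here is a made-up stand-in for the loaded data, assuming the same 'context' key.

```python
# hypothetical stand-in for the loaded dataset, using the same 'context' key
samples = [
    {'context': 'passage A'},
    {'context': 'passage B'},
    {'context': 'passage A'},
]

# dict keys are unique and keep insertion order, so the first occurrence wins
contexts = list(dict.fromkeys(sample['context'] for sample in samples))
print(contexts)
```

Either approach yields the same set of unique contexts; only the ordering guarantee differs.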

In [8]:
squad_docs[:2]

[<Document: {'content': 'The Yuan dynasty is considered both a successor to the Mongol Empire and an imperial Chinese dynasty. It was the khanate ruled by the successors of Möngke Khan after the division of the Mongol Empire. In official Chinese histories, the Yuan dynasty bore the Mandate of Heaven, following the Song dynasty and preceding the Ming dynasty. The dynasty was established by Kublai Khan, yet he placed his grandfather Genghis Khan on the imperial records as the official founder of the dynasty as Taizu.[b] In the Proclamation of the Dynastic Name (《建國號詔》), Kublai announced the name of the new dynasty as Great Yuan and claimed the succession of former Chinese dynasties from the Three Sovereigns and Five Emperors to the Tang dynasty.', 'content_type': 'text', 'score': None, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'b4e5e91dccc17139f62f473e283b8ebc'}>,
 <Document: {'content': "In April 1191 Richard the Lion-hearted left Messina with a large fleet in or

Now, because we're storing our FAISS index on file, we may find (if running this script more than once) that we first need to delete any documents that already exist in the index.

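Below we delete every document currently in the store, which also clears out the duplicated contexts written earlier. As the warning in the output notes, delete_all_documents() is deprecated in favour of delete_documents().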

In [9]:
document_store.delete_all_documents()

                1. delete_all_documents() method is deprecated, please use delete_documents method
                For more details, please refer to the issue: https://github.com/deepset-ai/haystack/issues/1045
                


Then we write the de-duplicated data to the document store, just like before:

In [10]:
document_store.write_documents(squad_docs)

Writing Documents: 10000it [00:02, 3550.12it/s]


The way that our documents are indexed will depend on the embedding model being used by our retriever. So, we need to initialize our DPR model (the retriever), and then update_embeddings using this retriever.

In [11]:
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

document_store.update_embeddings(retriever=retriever)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/493 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]



Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/492 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Updating Embedding:   0%|          | 0/1204 [00:00<?, ? docs/s]
Create embeddings:   0%|          | 0/1216 [00:00<?, ? Docs/s]
Create embeddings:   1%|▏         | 16/1216 [00:15<19:13,  1.04 Docs/s]
Create embeddings:   3%|▎         | 32/1216 [00:34<21:29,  1.09s/ Docs]
Create embeddings:   4%|▍         | 48/1216 [00:47<19:09,  1.02 Docs/s]
Create embeddings:   5%|▌         | 64/1216 [01:01<17:43,  1.08 Docs/s]
Create embeddings:   7%|▋         | 80/1216 [01:14<16:57,  1.12 Docs/s]
Create embeddings:   8%|▊         | 96/1216 [01:28<16:34,  1.13 Docs/s]
Create embeddings:   9%|▉         | 112/1216 [01:42<16:10,  1.14 Docs/s]
Create embeddings:  11%|█         | 128/1216 [01:55<15:40,  1.16 Docs/s]
Create embeddings:  12%|█▏        | 144/1216 [02:09<15:18,  1.17 Docs/s]
Create embeddings:  13%|█▎        | 160/1216 [02:25<15:46,  1.12 Docs/s]
Create embeddings:  14%|█▍        | 176/1216 [02:40<15:56,  1.09 Docs/s]
Create embeddings:  16%|█▌        | 192/
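Once every passage has an embedding, dense retrieval comes down to comparing the query embedding with the passage embeddings. A minimal sketch of that comparison using dot-product similarity, which is the measure DPR is trained with. The three-dimensional vectors and the helper `rank_by_dot_product` are made up for illustration; real DPR embeddings are much larger and come from the encoder models.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(query_vec, passage_vecs, top_k=2):
    """Return the indices of the top_k passages with the highest dot product."""
    scored = sorted(
        ((dot(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:top_k]]


# toy embeddings (hypothetical values, for illustration only)
passage_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.7, 0.2, 0.1]]
query_vec = [1.0, 0.0, 0.0]
ranking = rank_by_dot_product(query_vec, passage_vecs)
print(ranking)
```

This exhaustive comparison is what a flat index does; the FAISS index structures exist to avoid scoring every passage when the collection is large.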

Now that we've fully prepared our document store, we can save it. We will save to the same location as our SQLite database, but this time using the .faiss file type. Saving also writes a config file next to the index (squad_dev.json here), which we will need when loading.

In [12]:
document_store.save(f'{path}/squad_dev.faiss')
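Before relying on the saved store, it can help to confirm that the expected files are present. A minimal sketch, assuming the three file names used in this notebook (squad_dev.db, squad_dev.faiss and squad_dev.json); `missing_store_files` is a hypothetical helper, not part of Haystack.

```python
from pathlib import Path


def missing_store_files(directory, name='squad_dev'):
    """Return the expected store files that are not present in `directory`."""
    expected = [f'{name}.db', f'{name}.faiss', f'{name}.json']
    return [fname for fname in expected if not (Path(directory) / fname).exists()]
```

Calling `missing_store_files(path)` after saving should return an empty list if all three files were written.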

Our FAISS index is now saved to file! We'll go ahead and delete the document_store and retriever, and try reinitializing both using the data we've saved to file.

In [13]:
del document_store, retriever

We do this to free up RAM before loading the document store and the model again.

And now we can reload the document store from file, then re-initialize our retriever using the same arguments as before.

In [14]:
document_store = FAISSDocumentStore.load(
    index_path=f'{path}/squad_dev.faiss',
    config_path=f'{path}/squad_dev.json'
)

In [15]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

Finally, we can begin retrieving contexts relevant to our questions using retriever.retrieve, which requires a single argument, query.

In [16]:
retriever.retrieve('What subject is most abstract?')

[<Document: {'content': "A Turing machine is a mathematical model of a general computing machine. It is a theoretical device that manipulates symbols contained on a strip of tape. Turing machines are not intended as a practical computing technology, but rather as a thought experiment representing a computing machine—anything from an advanced supercomputer to a mathematician with a pencil and paper. It is believed that if a problem can be solved by an algorithm, there exists a Turing machine that solves the problem. Indeed, this is the statement of the Church–Turing thesis. Furthermore, it is known that everything that can be computed on other models of computation known to us today, such as a RAM machine, Conway's Game of Life, cellular automata or any programming language can be computed on a Turing machine. Since Turing machines are easy to analyze mathematically, and are believed to be as powerful as any other model of computation, the Turing machine is the most commonly used model 

And now we've extracted a few contexts stored within FAISS that our DPR model believes answer our query.
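The raw output is long, so a compact preview can be easier to read. A minimal sketch, assuming each returned object exposes `content` and `score` attributes as the Document output shown earlier suggests; `preview` is a hypothetical helper, and a `SimpleNamespace` stands in for a real Document so that the snippet runs without Haystack.

```python
from types import SimpleNamespace


def preview(docs, width=60):
    """Return one short line per document: rank, score, truncated content."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        text = doc.content if len(doc.content) <= width else doc.content[:width] + '...'
        lines.append(f'{rank}. score={doc.score} {text}')
    return lines


# stand-in for a retrieved Document (assumption: same attribute names)
demo = [
    SimpleNamespace(
        content='A Turing machine is a mathematical model of a general computing machine.',
        score=None,
    )
]
for line in preview(demo, width=16):
    print(line)
```

With a real retriever, the same helper could be applied to the list returned by retriever.retrieve.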