### Uncomment and run the following cells if you work on GCP. Change runtime type to GPU.

In [1]:
pip install torch==1.8.1 transformers==3.3.1 sentence-transformers==0.3.8 pandas==1.1.2 faiss-cpu==1.6.1 numpy==1.19.2 folium==0.2.1 streamlit==0.62.0

Collecting torch==1.8.1
  Downloading torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
[K     |████████████████████████████████| 804.1 MB 2.6 kB/s 
[?25hCollecting transformers==3.3.1
  Downloading transformers-3.3.1-py3-none-any.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 65.5 MB/s 
[?25hCollecting sentence-transformers==0.3.8
  Downloading sentence-transformers-0.3.8.tar.gz (66 kB)
[K     |████████████████████████████████| 66 kB 6.7 MB/s 
[?25hCollecting pandas==1.1.2
  Downloading pandas-1.1.2-cp37-cp37m-manylinux1_x86_64.whl (10.5 MB)
[K     |████████████████████████████████| 10.5 MB 25.7 MB/s 
[?25hCollecting faiss-cpu==1.6.1
  Downloading faiss_cpu-1.6.1-cp37-cp37m-manylinux2010_x86_64.whl (7.1 MB)
[K     |████████████████████████████████| 7.1 MB 20.9 MB/s 
[?25hCollecting numpy==1.19.2
  Downloading numpy-1.19.2-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
[K     |████████████████████████████████| 14.5 MB 25.5 MB/s 
[?25hCollecting folium==0



---
This is mounting my (Kenza) drive to the collab notebook. I stored the wikidata there.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!bzip2 -d /content/drive/MyDrive/TTDS/enwiki-latest-pages-articles.xml.bz2

### Before we begin, make sure you restart (not factory reset) the runtime so that the relevant packages are used

In [2]:
%load_ext autoreload

In [3]:
%autoreload 2

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path
from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor

import xml

from nltk.stem import PorterStemmer
import re
ps = PorterStemmer()

In [4]:
def vector_search(query, model, index, num_results=10):
    """Tranforms query to vector using a pretrained, sentence-level 
    DistilBERT model and finds similar vectors using FAISS.
    Args:
        query (str): User query that should be more than a sentence long.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (`numpy.ndarray`): FAISS index that needs to be deserialized.
        num_results (int): Number of results to return.
    Returns:
        D (:obj:`numpy.array` of `float`): Distance between results and query.
        I (:obj:`numpy.array` of `int`): Paper ID of the results.
    
    """
    # query = ps.stem(query)
    vector = model.encode(list(query))
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I


def id2details(I):
    """Returns the paper titles based on the paper index."""
    return [worker.pids[str(idx)] for idx in I[0]]

The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) offers pretrained transformers that produce SOTA sentence embeddings. Checkout this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/) with all the available models.

In this tutorial, we will use the `distilbert-base-nli-stsb-mean-tokens` model which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Moreover, although it's slightly worse than BERT, it is quite faster thanks to having a smaller size.

In [5]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

100%|██████████| 245M/245M [00:37<00:00, 6.53MB/s]


cuda:0


In [6]:
class wikiHandler(xml.sax.ContentHandler):

    def __init__(self, searchClass):
        self.tag = ""
        self.pid = None
        self.title = ""
        self.text = ""
        self.searcher = searchClass
        #self.executor = ProcessPoolExecutor(max_workers=1)
        self.progress = tqdm(total=70000000)

    def ended(self):
        self.progress.close()
        self.executor.shutdown()

    def startElement(self, tag, argument):
        self.tag = tag

    def characters(self, content):
        if self.tag == "id" and not content.isspace() and self.pid == None:
            self.pid = content
        if self.tag == "title":
            self.title += content
        if self.tag == "text":
            self.text += content

    def endElement(self, tag):
        self.progress.update(1)
        if tag == "page":
            self.searcher.perpage({"pid":self.pid, "title":self.title, "text":self.text})
            self.pid = None
            self.title = ""
            self.text = ""

In [7]:
# Convert abstracts to vectors
class encoder():

  def __init__(self):
    self.embeddings = []
    self.pids = {}

  def perpage(self, text):
    self.pids[text["pid"]] = text["title"]
    self.embeddings.append(model.encode(text["text"]))

worker = encoder()
parser = xml.sax.make_parser()  
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
handler = wikiHandler(worker)
parser.setContentHandler(handler)

parser.parse("/content/drive/MyDrive/TTDS/wikidata_notsoshort.xml")

# embeddings = model.encode(df.text.to_list(), show_progress_bar=True)

  0%|          | 68740/70000000 [37:40<1986:14:21,  9.78it/s]

In [8]:
print(f'Shape of the vectorised abstract: {len(worker.embeddings)}')

Shape of the vectorised abstract: 4171


## Vector similarity search with Faiss
[Faiss](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM. 
    
Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). You can even create [composite indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)). Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s.

**Note**: Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.

To learn more about Faiss, you can read their paper on [arXiv](https://arxiv.org/abs/1702.08734).

To speed up the search, it is possible to segment the dataset into pieces. We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells. At search time, only the database vectors y contained in the cell the query x falls in and a few neighboring ones are compared against the query vector.

This is done via the IndexIVFFlat index. This type of index requires a training stage, that can be performed on any collection of vectors that has the same distribution as the database vectors. In this case we just use the database vectors themselves.

The IndexIVFFlat also requires another index, the quantizer, that assigns vectors to Voronoi cells. Each cell is defined by a centroid, and finding the Voronoi cell a vector falls in consists in finding the nearest neighbor of the vector in the set of centroids. This is the task of the other index, which is typically an IndexFlatL2.

There are two parameters to the search method: nlist, the number of cells, and nprobe, the number of cells (out of nlist) that are visited to perform a search. The search time roughly increases linearly with the number of probes plus some constant due to the quantization.

To create an index with the `wikidata` abstract vectors, we will:
1. Change the data type of the text vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their page IDs from MAG.

In [23]:
# Step 1: Change data type
worker.embeddings = np.array([embedding for embedding in worker.embeddings]).astype("float32")

# Step 2: Instantiate the index
quantizer = faiss.IndexFlatL2(worker.embeddings.shape[1])
nlist = 100
index = faiss.IndexIVFFlat(quantizer, worker.embeddings.shape[1], nlist)
index.train(worker.embeddings)
index = faiss.IndexIDMap(index)
index.add_with_ids(worker.embeddings, np.array(list(worker.pids.keys())).astype('int64'))


# # Step 3: Pass the index to IndexIDMap
# index = faiss.IndexIDMap(index)

# # Step 4: Add vectors and their IDs
# index.add_with_ids(embeddings, np.array(df.id.values))

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 4171


### Searching the index
The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with an abstract from our dataset and retrieve the 10 most relevant documents. **The first one must be our query!**


In [None]:
# Wikidata Text
df.iloc[300, 2]

'redirect lycom t55 redirect categori shell R from incorrect name R from move'

In [None]:
nprobe = 2  # find 2 most similar clusters 
k = 10  # return 10 nearest neighbours
D, I = index.search(np.array([embeddings[300]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nWiki IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 103.7332763671875, 106.08882141113281, 108.4593734741211, 108.8836669921875, 118.27703094482422, 118.73644256591797, 123.32868957519531, 126.17843627929688, 126.55046081542969]

Wiki IDs: [69171661, 69171548, 69171417, 69171555, 69171578, 69171644, 69171363, 69171299, 69171558, 69171463]


In [None]:
# Fetch the Wikipedia titles based on their index
id2details(df, I, 'title')

[['Honeywell T55 Turboshaft Engine'],
 ['Mediterranean (Battle honour)'],
 ['Draft:Penguinxs/sandbox/List of Feminist Periodicals in the United States'],
 ['Masdar Institute of Science and Technology'],
 ['Draft:Vasudha Dhagamwar'],
 ['Draft:Jeevan Bahadur Shahi'],
 ['List of accidents and incidents involving the DC-3 in 1975'],
 ['List of accidents and incidents involving the DC-3 in the 1970s'],
 ['Mediterranean 1901–02 (Battle honour)'],
 ['Henri Louit']]


## Putting all together

So far, we've built a Faiss index using the wikidata text vectors we encoded with a sentence-DistilBERT model. That's helpful but in a real case scenario, we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

1. Encode the stemmed query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
2. Change its data type to float32.
3. Search the index with the encoded query.

IDEA: Use the Answer of the Question Answering option as the input query for vector search or let the user write a query for vector search or both.


In [27]:
user_query = """Artificial Intelligence"""

In [28]:
# For convenience, I've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [183.08160400390625, 195.53298950195312, 221.16964721679688, 243.65826416015625, 245.3956756591797, 246.18502807617188, 250.58746337890625, 253.0164794921875, 255.68038940429688, 256.96875]

MAG paper IDs: [1164, 2862, 2142, 6596, 5783, 5309, 1456, 6216, 5311, 1349]


In [29]:
# Fetching the paper titles based on their index
id2details(I)

['Artificial intelligence\n    ',
 'AI-complete\n    ',
 'List of artificial intelligence projects\n    ',
 'Computer vision\n    ',
 'Computer program\n    ',
 'Software\n    ',
 'AWK\n    ',
 'Chinese room\n    ',
 'Computer programming\n    ',
 'Atanasoff–Berry computer\n    ']

In [30]:
# Define project base directory
# Change the index from 1 to 0 if you run this on Google Colab
project_dir = Path('notebooks').resolve().parents[0]
print(project_dir)

# Serialise index and store it as a pickle
with open("/content/drive/MyDrive/TTDS/faiss_index.pickle", "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)

/content
