> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.


# Week 3: Embedding-Based Retrieval

### What we are building
The goal of Embedding-Based Retrieval is to retrieve the top-k candidates for a query based on embedding similarity/distance. A common application: given a query (a sentence or a document), find the top-k candidates most similar to it. While this is usually solved using TF-IDF/Information Retrieval (IR) based approaches, it is becoming more and more common in industry to use an embedding-based approach: encode the query and the documents as embeddings and use approximate nearest neighbor search to find the top-k candidates in real time.

We will build a system to find duplicate questions on Quora using a [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). A very common problem for forums/QA websites is trying to determine whether a question has already been asked before a user posts it.

We will continue to apply our learning philosophy of repetition as we build multiple models of increasing complexity in the following order:

1. Retrieval based on WordVectors
1. Using BERT
1. Using Sentence BERT
1. Using Cohere Sentence Embeddings

###  Evaluation
We will evaluate our models along the following metrics: 

1. Recall@k: the proportion of relevant items found in the top-k matches
2. Mean Reciprocal Rank (MRR): the average, over all queries, of 1/rank of the first relevant item in the top-k (0 if no relevant item is retrieved).
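
To make the two metrics concrete, here is a minimal sketch in plain Python. The function names `recall_at_k` and `reciprocal_rank` and the toy ids are illustrative assumptions, not part of the project code. It assumes one known relevant item per query, which is how the evaluation pairs are used in this notebook.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant item appears in the top-k results, else 0."""
    return 1 if relevant_id in ranked_ids[:k] else 0


def reciprocal_rank(ranked_ids, relevant_id, k):
    """1 / rank of the relevant item within the top-k, or 0 if it is missing."""
    top_k = ranked_ids[:k]
    if relevant_id not in top_k:
        return 0.0
    return 1.0 / (top_k.index(relevant_id) + 1)


# Toy example: three queries, each with one known duplicate (hypothetical ids)
results = [
    ([7, 3, 9], 3),  # duplicate found at rank 2
    ([4, 1, 8], 4),  # duplicate found at rank 1
    ([5, 6, 2], 0),  # duplicate not retrieved
]
k = 3
mean_recall = sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)
mrr = sum(reciprocal_rank(r, rel, k) for r, rel in results) / len(results)
print(f"Recall@{k}: {mean_recall:.2f}, MRR: {mrr:.2f}")
```

Recall only asks whether the duplicate was retrieved at all, while MRR also rewards retrieving it near the top of the list.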

### Instructions

1. We have provided scaffolding for all the boilerplate Faiss code to get to our baseline model. This covers downloading and parsing the dataset, and training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. In the second model, we will use BERT embeddings. **Does this improve accuracy?**
1. In the third model, we will use Sentence BERT and see whether it can boost our results. **How do you think this model will perform?**
1. **Extension**: We have suggested a bunch of extensions to the project so go crazy! Tweak any parts of the pipeline, and see if you can beat all the current models.

### Code Overview

- Dependencies: Install and import python dependencies
- Project
  - Dataset: Download the Quora dataset
  - Indexer: Function to manage and create a Faiss Index
  - Model 1: Word Vectors
  - Model 2: BERT
  - Model 3: Sentence BERT
  - Model 4: Cohere Sentence Embeddings
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [1]:
# Install all the required dependencies for the project
!pip install pytorch-lightning==1.6.5
!pip install spacy==2.2.4
!python -m spacy download en_core_web_md
!apt install libopenblas-base libomp-dev
!pip install faiss==1.5.3
!pip install faiss-cpu
!pip install -U sentence-transformers
!pip install transformers==4.17.0
#!pip install sentence-transformers==2.2.0
!pip install cohere

You should consider upgrading via the '/Users/vitalii.mishchenko/Documents/experiments/2302-nlp-course/venv/bin/python -m pip install --upgrade pip' command.
Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
  Preparing metadata (setup.py) ... done
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com 

Import all the necessary libraries we need throughout the project.

In [1]:
# Import all the relevant libraries
import csv
import en_core_web_md
import faiss
import numpy as np
import pytorch_lightning as pl
import random
import spacy
import torch
import cohere

from tqdm import tqdm
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F
from transformers import BertTokenizer, BertModel, BertTokenizerFast, DistilBertTokenizer, DistilBertModel

Now let's load the spaCy model, which comes with pre-trained embeddings. This process is expensive, so only do it once.

In [3]:
# Really expensive operation to load the entire spaCy word-vector index in memory
# We'll only run it once 
loaded_spacy_model = en_core_web_md.load()

# Embedding Based Retrieval

✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

Download the duplicate questions [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs).


In [4]:
!wget 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
!mkdir qqp
!mv quora_duplicate_questions.tsv qqp/
!ls qqp/

--2023-03-02 15:04:31--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.65.2, 151.101.129.2, 151.101.193.2, ...
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.65.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2023-03-02 15:04:34 (56.7 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]

quora_duplicate_questions.tsv


Perfect. The file is in place. Let's poke at it before we start parsing our dataset.

In [3]:
DATA_FILE = "qqp/quora_duplicate_questions.tsv"

# The file is a 6-column tab-separated file.
# The first column is the row id, the second and third columns are the ids of
# the two questions, followed by the text of each question.
# The last column captures whether the two questions are duplicates.
with open(DATA_FILE, 'r', newline='\n') as file:
  reader = csv.reader(file, delimiter = '\t')
  # Read first 10 lines
  for i in range(10):
    print(next(reader))

['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
['0', '1', '2', 'What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?', '0']
['1', '3', '4', 'What is the story of Kohinoor (Koh-i-Noor) Diamond?', 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?', '0']
['2', '5', '6', 'How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?', '0']
['3', '7', '8', 'Why am I mentally very lonely? How can I solve it?', 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?', '0']
['4', '9', '10', 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?', 'Which fish would survive in salt water?', '0']
['5', '11', '12', 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', "I'm a triple Capricorn (Sun, Moon and ascendant in Capri

The dataset has more than 500k questions! We are going to parse the full dataset and create a sample of 10k questions to experiment with in our models since BERT training & inference can be really slow.

In [4]:
"""
Util function to parse the file
"""
def parse_sample_dataset(file_path, sample_max_id):
  """
  Inputs:
    file_path: Path to the raw data file
    sample_max_id: Max question id to be considered in the sampled dataset

  Returns 4 objects:
    1. QuestionMap: dict mapping each question id to its text
    2. DuplicatesMap: dict mapping each question id to the set of its duplicates
    3. SampleDataset: list of question ids in the sample
    4. SampleEvalDataset: list of (qid1, qid2) pairs of duplicate questions in the sample
  """
  question_map = {}
  duplicates_map = defaultdict(set)
  sample_dataset = set([])
  sample_eval_dataset = []

  with open(file_path, 'r', newline='\n') as file:
    reader = csv.reader(file, delimiter='\t')
    next(reader)  # Skip the header line

    for row in reader:
      if len(row) != 6: # Skip incomplete rows
        continue

      # Limit the sample size of the dataset at max_id
      # Make sure all 4 objects start at index 0
      qid1, qid2, label = int(row[1]) - 1, int(row[2]) - 1, int(row[5])
      if qid1 < sample_max_id and qid2 < sample_max_id:
        
        if qid1 not in question_map:
          question_map[qid1] = str(row[3])
        if qid2 not in question_map:
          question_map[qid2] = str(row[4])

        if label == 1:
          duplicates_map[qid1].add(qid2)
          duplicates_map[qid2].add(qid1)

          sample_eval_dataset.append((qid1, qid2))

        sample_dataset.add(qid1)
        sample_dataset.add(qid2)

  # sample dataset duplicates removed via set(), so turn back into list
  return question_map, duplicates_map, list(sample_dataset), sample_eval_dataset

sample_max_id = 10000 # original
# sample_max_id = 2000 # for quick development
question_map, duplicates_map, sample_dataset, sample_eval_dataset = parse_sample_dataset(DATA_FILE, sample_max_id)

# Complete file: ~537k unique questions across ~400k question pairs.
# To keep run time manageable, the sample is limited to 10,000 questions (sample_max_id)
print("Number of unique questions:", len(question_map)) # 10,000
print("Number of question with duplicates:", len(duplicates_map)) # ~3.8k
print("Number of questions in sample:", len(sample_dataset)) # 10,000
print("Number of duplicate pairs in sample:", len(sample_eval_dataset)) # ~3.6k

Number of unique questions: 10000
Number of question with duplicates: 3810
Number of questions in sample: 10000
Number of duplicate pairs in sample: 3589


# Retrieval using Faiss -- COMPLETED

You are now going to create an Indexer class that implements multiple functions for indexing, searching, and evaluating our retrieval model. Faiss documentation can be found in the wiki here: https://github.com/facebookresearch/faiss/wiki/Getting-started

Some helpful Faiss guides are:
- https://www.pinecone.io/learn/faiss-tutorial/
- https://www.pinecone.io/learn/vector-indexes/

You need to implement the following functions:

1. **search**: Implement a function that takes a question and a top_k variable and returns either the matched strings or the ids to the user as a list of tuples
    1. Call the search API on the faiss_index to look up similar sentences using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or [sentence, score] tuples based on the input parameter
    3. Sort the output by the score in descending order

1. **evaluate**: Sample eval_sample_size pairs from the evaluation dataset and then check if the qid2 is present in the top-k results
    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches.
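
Conceptually, a flat inner-product index such as `faiss.IndexFlatIP` performs an exact search: it scores the query against every stored vector and returns the positions with the highest scores. The plain-Python sketch below illustrates that idea on toy 2-D vectors. It is not how Faiss is implemented internally, and `flat_ip_search` is a hypothetical name used only for this illustration.

```python
def flat_ip_search(index_vectors, query, top_k):
    """Exact inner-product search: score every stored vector, keep the best top_k."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(index_vectors)
    ]
    # Highest inner product first; ties are broken by the lower position
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:top_k]]


# Four stored vectors: two on the X-axis, two on the Y-axis (toy values)
index_vectors = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.2]]
matches = flat_ip_search(index_vectors, query=[1.0, 0.0], top_k=2)
print(matches)
```

Note that the returned ids are positions in the index, so whoever builds the index needs a way to map positions back to question ids.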


In [6]:
class FaissIndexer:
  def __init__(self, dataset,
               question_map, 
               eval_dataset, 
               batch_size, 
               sentence_vector_dim, 
               vectorizer):
    self.dataset = dataset
    self.question_map = question_map
    self.eval_dataset = eval_dataset
    self.batch_size = batch_size
    self.faiss_index = faiss.IndexFlatIP(sentence_vector_dim) # FlatIP scores by inner product (equals cosine similarity for unit-length vectors)
    self.vectorizer = vectorizer


  def index(self):
    sentence_vectors = []

    print("Start indexing!")
    # tqdm - shows loop progress (https://tqdm.github.io)
    for sentence_ids in tqdm(self.split_list_(self.dataset, self.batch_size)):
      # Retrieve sentences based on qid
      sentences = [self.question_map[qid] for qid in sentence_ids]
      # Get embeddings of the sentences (Spacy, ..., Cohere)
      sentence_vectors_batch = self.vectorizer.vectorize(sentences)
      # Add batch to temporary list
      sentence_vectors.append(sentence_vectors_batch)

    # Add all batches from temporary list to index
    self.faiss_index.add(np.array(np.concatenate(sentence_vectors, axis=0)))
    print("\nDone indexing!")


  def split_list_(self, lst: list, sublist_size: int):
    sublists = []
    # Split list into even chunks/sublists/batches
    for i in range(0, len(lst), sublist_size):
      sublists.append(lst[i:i + sublist_size])
    return sublists


  def search(self, question: str, top_k: int, return_ids=False):
    """Given any sentence (typed by the user)
    We return a list of top_k(sentence, sim_score) or top_k(sentence_ids, sim_score)
    
    NOTE: The output type is controlled by the return_ids flag

    1. Call the search API on the faiss_index to look up similar sentences 
       using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or 
       [sentence, score] tuples based on return_ids being true/false
    3. Sort the output by the score in descending order
    """

    # NOTE: We converted the question to a list here to match the signature 
    # of the vectorize function
    question_vectors = self.vectorizer.vectorize([question])
    scores, indices = self.faiss_index.search(np.array(question_vectors), top_k)

    # Output is a List[(qid, score), (qid, score), (qid, score)] or
    # List[(q, score), (q, score), (q, score)] based on return_ids
    # Output is sorted in descending order of score
    # NOTE: Faiss returns positions in the index; they are used as question ids
    # here, which assumes the questions were indexed in id order starting at 0
    if return_ids:
      output = list(zip(indices[0], scores[0]))
    else:
      output = [(self.question_map[qid], score) for qid, score in zip(indices[0], scores[0])]
    output.sort(reverse=True, key=lambda pair: pair[1])
    return output


  def evaluate(self, top_k: int, eval_sample_size: int):
    """Sample eval_sample_size pairs from the evaluation dataset and then check 
    if the qid2 is present in the top-k results

    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches
      - Note: MRR is equivalent to mean([1/r or 0 for each sample])
    """
    # Sample from evaluation dataset as proxy for performance metrics
    eval_samples = random.sample(self.eval_dataset, eval_sample_size)

    # Retrieval metrics that only care about whether the searched-for
    # item is present among the results.
    recall_at_k = [] # 1 if the duplicate is in the top-k results, else 0
    mean_reciprocal_rank = [] # 1/rank of the duplicate in the results, else 0

    for eval_sample in eval_samples:
      first_qid = eval_sample[0]
      second_qid = eval_sample[1]
      first_question = self.question_map[first_qid]
      search_results = self.search(first_question, top_k, return_ids=True)

      result_qids = [qid for (qid, _) in search_results]
      if second_qid in result_qids:
        recall_at_k.append(1)

        second_q_position_in_results = result_qids.index(second_qid)
        reciprocal = 1 / (second_q_position_in_results + 1)
        mean_reciprocal_rank.append(reciprocal)
      else:
        recall_at_k.append(0)
        mean_reciprocal_rank.append(0)

    recall = np.mean(np.array(recall_at_k) * 100.0)
    reciprocal_rank = np.mean(np.array(mean_reciprocal_rank))
    print("\nRecall@{}:\t\t{:0.2f}%".format(top_k, recall))
    print("Mean Reciprocal Rank:\t{:0.2f}".format(reciprocal_rank))


  # Helper function to train, search and evaluate similar output from all the models created.
  def train_and_evaluate(self, 
                         question_example: str, 
                         top_k: int = 10, 
                         eval_sample_size: int = 1000
                         ):
    print("---- Indexing ----")
    self.index()
    print("\n---- Search ----")
    results = self.search(question_example, top_k, return_ids=False)
    print("Questions similar to:", question_example)
    for i, (q, s) in enumerate(results):
      print(f"{i} Question: {q} with score {s}")
    print("\n---- Evaluation ----")
    self.evaluate(top_k, eval_sample_size)

## Dummy Model Test

A really small sample of 4 sentences lets us check that our implementation of the Faiss search function works correctly. We project the 4 questions into a 2-D space: a question is placed on the X-axis if the word `invest` is present and on the Y-axis if `Kohinoor` is present.

In [11]:
dummy_ids = sample_dataset[:4]
print("Questions:")
for i in dummy_ids:
  print(i, ":", question_map[i])

Questions:
0 : What is the step by step guide to invest in share market in india?
1 : What is the step by step guide to invest in share market?
2 : What is the story of Kohinoor (Koh-i-Noor) Diamond?
3 : What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?


In [195]:
class DummyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return toy 2-D sentence vectors for the batch of sentences. 

    1. Sentences containing "invest" get a random point on the X-axis
    2. Sentences containing "Kohinoor" get a random point on the Y-axis
       (sentences containing neither word are skipped)
    3. Stack the sentence vectors into a numpy array using np.stack
    """
    vectors = []
    for sentence in sentences:
      if "invest" in sentence:
        # If "invest" is present place it on the X-Axis
        vectors.append(np.array([random.random(), 0], dtype=np.float32))
      elif "Kohinoor" in sentence:
        # If "Kohinoor" is present place it on the Y-Axis
        vectors.append(np.array([0, random.random()], dtype=np.float32))
    return np.stack(vectors)


di = FaissIndexer(dummy_ids, 
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=2, 
                  vectorizer=DummyVectorizer(2)
                  )

di.index()

results = di.search("invest", 4)
print("Questions similar to:", "invest")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

results = di.search("Kohinoor", 4)
print("\nQuestions similar to:", "Kohinoor")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

Start indexing!



100%|██████████| 1/1 [00:00<00:00, 2702.52it/s]


Done indexing!
Questions similar to: invest
0 Question: What is the step by step guide to invest in share market? with score 0.03926607966423035
1 Question: What is the step by step guide to invest in share market in india? with score 0.01793918013572693
2 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.0
3 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.0

Questions similar to: Kohinoor
0 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.6463944315910339
1 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.08856147527694702
2 Question: What is the step by step guide to invest in share market? with score 0.0
3 Question: What is the step by step guide to invest in share market in india? with score 0.0





# Models

You may be wondering, "When are we going to start building models?" And, the answer is NOW! Finally the time has come to build our baseline model, and then we'll work towards improving it. 


**NOTE**: We will be using the sample dataset since BERT is really slow and processing the full dataset will take a lot of time.

### Model 1: Averaging Word Vectors --- COMPLETED
##### <font color='red'>Expected recall@10: ~20%, MRR: ~0.07</font>

Complete the `vectorize` function using Spacy provided word embeddings. This is something we've done twice already :) 

Implementation:

1. Tokenize each sentence and get wordVectors for each token in the sentence using Spacy 
2. Sentence vector is the mean of word vectors of each token
3. Stack the sentence vectors into a numpy array using np.stack

In [230]:
class SpacyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Tokenize each sentence and create vectors for each token in the sentence
    2. Sentence vector is the mean of word vectors of each token
    3. Stack the sentence vectors into a numpy array using np.stack
    """
    vectors = []
    for sentence in sentences:
      tokens = loaded_spacy_model(sentence)
      token_vectors = []
      for token in tokens:
        token_vectors.append(token.vector)

      sentence_vector = np.mean(np.array(token_vectors), axis=0)
      vectors.append(sentence_vector)
    return np.stack(vectors)


spacyIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=300, 
                  vectorizer=SpacyVectorizer(300))

spacyIndex.index()
spacyIndex.search("how can i invest in stock market in india?", 10)

sample_size = 1000
top_k_for_each_sample = 10
spacyIndex.evaluate(top_k_for_each_sample, sample_size)

Start indexing!


100%|██████████| 10/10 [00:54<00:00,  5.48s/it]



Done indexing!

Recall@10:		20.70%
Mean Reciprocal Rank:	0.07


### Model 2: BERT Embeddings --- COMPLETED
##### <font color='red'>Expected recall@10: ~48%, MRR: ~0.19</font>

Compute the sentence embeddings using the BERT model and complete the `vectorize` function. Feel free to reference any documentation from https://huggingface.co/. 


Implementation:

1. Tokenize batch of sentences using `self.tokenizer`
2. Pipe the inputs through the BERT model to get the per-token hidden states
3. Mean-pool the hidden states over tokens and normalize the batch output
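
Before normalizing, the per-token outputs have to be pooled into one vector per sentence. When a batch is padded, the padded positions also produce output vectors. A common refinement (an assumption here, not something the scaffold requires) is to average only the positions where the attention mask is 1. The sketch below uses plain Python lists and toy numbers to show the idea; `masked_mean_pool` is a hypothetical helper, and with real tensors the same computation would be done with the model's `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding).

    Assumes at least one position has mask 1.
    """
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]


# One sentence with two real tokens and one padding token (toy values)
token_vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
# Naive mean over all positions, padding included
naive = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]
print("masked:", pooled, "naive:", naive)
```

The naive mean is pulled toward the padding vector, so sentences of different lengths in the same batch can end up with slightly different embeddings than they would get on their own.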

**NOTE: This model is really slow and will take about 20 mins to run**

In [17]:
class BertVectorizer:
  def __init__(self):
    model_name = 'distilbert-base-uncased'
    self.tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    self.model = DistilBertModel.from_pretrained(model_name)

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences.

    1. Tokenize batch of sentences using `self.tokenizer`
    2. Pipe the inputs through the BERT model to get the per-token hidden states
    3. Mean-pool over tokens and normalize the batch output
    """

    # Converts words to token ids (plus an attention mask), padded to the
    # longest sentence in the batch
    tokens = self.tokenizer(
      sentences,
      padding=True,
      return_tensors='pt'
    )

    # The model takes input_ids and attention_mask as keyword arguments
    # https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/bert#transformers.BertModel.forward
    # Inference only, so skip gradient tracking
    with torch.no_grad():
      outputs = self.model(**tokens)
    model_output = outputs['last_hidden_state']

    # Mean over the token dimension (padding positions included), then L2-normalize
    return F.normalize(torch.mean(model_output, dim=1), dim=1).numpy()


bertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32,
                  sentence_vector_dim=768,
                  vectorizer=BertVectorizer())

bertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


---- Indexing ----
Start indexing!


100%|██████████| 313/313 [14:18<00:00,  2.74s/it]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 0.8770731091499329
1 Question: What is the step by step guide to invest in share market in india? with score 0.8744895458221436
2 Question: What are mutual funds and which is the best one in India in which to invest? with score 0.8723897933959961
3 Question: What will be the effect of banning 500 and 1000 notes on stock markets in India? with score 0.8636164665222168
4 Question: What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India? Can we expect sharp fall in prices in short/long term? with score 0.8614913821220398
5 Question: What are your views on Modi governments decision to demonetize 500 and 1000 rupee notes? How will this affect economy? with score 0.8532259464263916
6 Question: What

### Model 3: Sentence Transformer --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~93%, MRR: ~0.34</font>

Compute the sentence embeddings using the Sentence BERT model and complete the `vectorize` function. Feel free to look up documentation on https://www.sbert.net/. 

Implementation:

1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
2. Normalize the batch output
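
Why normalize? With an inner-product index, the score between two unit-length vectors equals their cosine similarity, so it depends only on direction. Without normalization, longer vectors get larger scores regardless of direction, which is one plausible reason un-normalized embeddings can produce scores far above 1. The sketch below illustrates this in plain Python with toy vectors; `l2_normalize` and `dot` are illustrative helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw_score = dot(a, b)  # grows with vector length
cosine_score = dot(l2_normalize(a), l2_normalize(b))  # depends only on direction
print(raw_score, round(cosine_score, 6))
```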


In [18]:
class SentenceBertVectorizer:
  def __init__(self):
    self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
    2. Normalize the batch output
    """
    sentence_vectors = self.model.encode(sentences)

    return sentence_vectors / np.expand_dims(np.linalg.norm(sentence_vectors, axis=1), axis=1)


SBertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=384, 
                  vectorizer=SentenceBertVectorizer())

SBertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


100%|██████████| 10/10 [02:10<00:00, 13.09s/it]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.733176589012146
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 0.6957336664199829
2 Question: What are the ways to learn about stock market? with score 0.6243616342544556
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 0.6239825487136841
4 Question: What is the best way to learn about stock market? with score 0.6222878694534302
5 Question: What is the step by step guide to invest in share market? with score 0.6042821407318115
6 Question: What is the best way to learn about investing in the stock market and what stocks to buy? with score 0.6032654643058777
7 Question: What is the best way to learn about stock markets? with score 0.5846710205078125
8 Question: How do I buy stocks? with score 0.57780

### Model 4: Cohere Sentence Embeddings --- COMPLETED
##### <font color='red'>Expected recall@10: ~89%, MRR: ~0.34</font>

Make sure to create a Cohere account and generate an API key.
Compute the sentence embeddings using the cohere API and complete the `vectorize` function. Feel free to look up documentation on https://docs.cohere.ai/semantic-search. 

Implementation:

1. Pipe the input sentences through the Cohere API. Make sure to select the small model.


In [19]:
# https://dashboard.cohere.ai/api-keys
COHERE_API_KEY = ""
co = cohere.Client(COHERE_API_KEY)

In [25]:
import functools as _functools
import threading as _threading

def limit(limit, every=1):
  """This decorator factory creates a decorator that can be applied to
     functions in order to limit the rate at which they can be invoked.
     The rate is `limit` over `every`, where `limit` is the number of
     invocations allowed every `every` seconds.
     limit(4, 60) creates a decorator that limits the function calls
     to 4 per minute. If not specified, `every` defaults to 1 second."""

  def limitdecorator(fn):
    """This is the actual decorator that performs the rate-limiting."""
    semaphore = _threading.Semaphore(limit)

    @_functools.wraps(fn)
    def wrapper(*args, **kwargs):
      semaphore.acquire()

      try:
        return fn(*args, **kwargs)

      finally:                   # ensure semaphore release
        timer = _threading.Timer(every, semaphore.release)
        timer.daemon = True  # daemon timers do not block interpreter exit (setDaemon is deprecated)
        timer.start()

    return wrapper

  return limitdecorator

In [26]:
class CohereVectorizer:
  @limit(50, 60) # 50 calls in 60 seconds
  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Send the batch of sentences to the Cohere embed API (small model)
    2. Stack the returned embeddings into a float32 numpy array using np.stack
    """

    # Retrieve the embeddings of our sentences by calling
    # the API of Cohere.
    sentence_vectors = co.embed(texts = sentences,
                      model = "small",
                      truncate = "LEFT").embeddings


    # Convert from float64 to float32 to prevent bug:
    # https://github.com/facebookresearch/faiss/issues/461
    return np.float32(np.stack(sentence_vectors))


cohereIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32, 
                  sentence_vector_dim=1024, 
                  vectorizer=CohereVectorizer())

cohereIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


100%|██████████| 313/313 [06:06<00:00,  1.17s/it] 



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 2562.99462890625
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 2064.28125
2 Question: What is the step by step guide to invest in share market? with score 2049.330810546875
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 1887.562744140625
4 Question: Which is the best Mutual Fund in India? with score 1856.3857421875
5 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 1831.635986328125
6 Question: How do I buy stocks? with score 1825.7650146484375
7 Question: What are mutual funds and which is the best one in India in which to invest? with score 1824.3251953125
8 Question: Which Best S

🎉 CONGRATULATIONS on finishing the assignment!!! We built a real model on an actual dataset for a problem that comes up every time a new Quora question gets created!! 

As for why SentenceBERT & Cohere performed so well, we'll cover that with Siamese networks in Week 4.

# Extensions

Now that you've worked through the project there is a lot more for us to try:

- See if you can use BERT to improve the model you shipped in Week 1.
  - Improved the result by 2%. Take a look at "text-sentiment-bert.ipynb".
- Try out `SentenceBert` and `SpacyVectors` on the entire dataset rather than the sample and see what you get.
- Try different transformer models from hugging face