# Information Retrieval Using BERT

This example is inspired by [this towardsdatascience.com article](https://towardsdatascience.com/a-sub-50ms-neural-search-with-distilbert-and-weaviate-4857ae390154).

## Define Credentials
Head over to the [Upstash Console](https://console.upstash.com) and create an Index under the Vector tab. Set the dimension to 768, which matches the size of the vectors produced by the model used below.

Then, paste the `UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN` values from the console into the following block.

In [1]:
UPSTASH_VECTOR_REST_URL="https://humorous-toad-39859-us1-vector.upstash.io"
UPSTASH_VECTOR_REST_TOKEN="ABcFMGh1bW9yb3VzLXRvYWQtMzk4NTktdXMxYWRtaW5OVGhpTURNeU9HVXRZak01TWkwME0ySXhMV0kyWVdVdFkyTTRPR1kzTlRWaE1qTXk="
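
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the same two values could instead be read from environment variables. The helper name `load_upstash_credentials` is hypothetical and not part of the Upstash SDK.

```python
import os

def load_upstash_credentials(env=os.environ):
    """Return (url, token) read from the environment, failing early if one is missing."""
    names = ("UPSTASH_VECTOR_REST_URL", "UPSTASH_VECTOR_REST_TOKEN")
    # collect every variable that is unset or empty so the error names all of them
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

With both variables exported before the notebook starts, `UPSTASH_VECTOR_REST_URL, UPSTASH_VECTOR_REST_TOKEN = load_upstash_credentials()` would replace the two assignments above.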

## Install the Dependencies

In [17]:
# !pip3 install transformers
# !pip3 install nltk
# !pip3 install torch
!pip3 install upstash_vector



## Download and Extract the Dataset

In [3]:
import shutil
import urllib.request as request
from contextlib import closing

with closing(request.urlopen('http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz')) as r:
    with open('20news-bydate.tar.gz', 'wb') as f:
        shutil.copyfileobj(r, f)

import tarfile

# extract the archive into the current directory and close it afterwards
with tarfile.open('20news-bydate.tar.gz', "r:gz") as tar:
    tar.extractall()
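
Re-running the cell above downloads the archive again each time. A minimal sketch of one way to avoid that is to skip the download when the file is already on disk. The `download_once` name is an assumption for illustration.

```python
import os
import shutil
import urllib.request as request
from contextlib import closing

def download_once(url, dest):
    """Download url to dest unless dest already exists. Returns True if a download happened."""
    if os.path.exists(dest):
        # a previous run already fetched the file, so do nothing
        return False
    with closing(request.urlopen(url)) as r, open(dest, 'wb') as f:
        shutil.copyfileobj(r, f)
    return True
```

Calling `download_once('http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz', '20news-bydate.tar.gz')` would then fetch the dataset only on the first run.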

In [4]:
import torch
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import sent_tokenize
from upstash_vector import Index

torch.set_grad_enabled(False)

# update to use a different model if desired
MODEL_NAME = "distilbert-base-uncased"
model = AutoModel.from_pretrained(MODEL_NAME)
# model.to('cuda') # remove if working without GPUs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# initialize nltk (for tokenizing sentences)
import nltk
nltk.download('punkt')


# initialize upstash vector index for upserting and querying
index = Index(UPSTASH_VECTOR_REST_URL, UPSTASH_VECTOR_REST_TOKEN)


## Load Dataset from Disk
Here, we define helper functions that pick a random sample of post files and read them into cleaned-up strings, ready to be converted to vectors.

In [5]:
import os
import random

def get_post_filenames(limit_objects=100):
    # collect the path of every post in the test split
    file_names = []
    for root, dirs, files in os.walk("./20news-bydate-test"):
        for filename in files:
            file_names.append(os.path.join(root, filename))

    # pick a random sample of at most limit_objects posts
    random.shuffle(file_names)
    return file_names[:limit_objects]

def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        with open(filename, encoding="utf-8", errors='ignore') as f:
            post = f.read()

        # strip the headers (everything before the first occurrence of two newlines)
        post = post[post.find('\n\n'):]

        # skip posts with fewer than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
            continue

        # flatten whitespace and keep at most the first 1000 characters
        post = post.replace('\n', ' ').replace('\t', ' ')
        posts.append(post[:1000])

    return posts
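
To make the cleaning steps in `read_posts` concrete, the sketch below applies the same header stripping, whitespace flattening, and truncation to a made-up post held in memory. The `clean_post` name and the sample text are illustrative assumptions, not part of the original notebook.

```python
def clean_post(raw, max_chars=1000, min_words=10):
    """Mirror the per-post steps of read_posts on an in-memory string."""
    # drop the headers: keep everything from the first blank line onward
    body = raw[raw.find('\n\n'):]
    # posts with too few space-separated words are treated as noise
    if len(body.split(' ')) < min_words:
        return None
    body = body.replace('\n', ' ').replace('\t', ' ')
    return body[:max_chars]

# invented sample post: two header lines, a blank line, then the body
sample = (
    "From: someone@example.com\n"
    "Subject: lens question\n"
    "\n"
    "Which lens would you pick for low light portraits on a small budget today?\n"
)
print(repr(clean_post(sample)))
```

The result keeps the two newlines that separate the headers from the body, turned into spaces, which is why the posts printed in the search results start with whitespace.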

## Vectorize the Posts
The functions below vectorize the given posts with a DistilBERT transformer model.
Vectorizing on CPU alone can take some time, so running on a GPU is recommended but not required.

In [6]:
import time

def text2vec(text):
    tokens_pt = tokenizer(text, padding=True, truncation=True, max_length=500, add_special_tokens = True, return_tensors="pt")
    # tokens_pt.to('cuda') # remove if working without GPUs
    outputs = model(**tokens_pt)
    # average over sentences, then over tokens, to get a single vector
    return outputs[0].mean(0).mean(0).detach()

def vectorize_posts(posts=[]):
    post_vectors=[]
    before=time.time()
    for i, post in enumerate(posts):
        vec=text2vec(sent_tokenize(post))
        post_vectors += [vec]
        if i % 25 == 0 and i != 0:
            print("So far {} objects vectorized in {}s".format(i, time.time()-before))
    after=time.time()

    print("Vectorized {} items in {}s".format(len(posts), after-before))

    return post_vectors
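
`outputs[0]` has shape (sentences, tokens, hidden size), so the two `mean(0)` calls in `text2vec` average first over the sentences of a post and then over token positions, leaving one vector of the hidden size. The dependency-free sketch below reproduces that arithmetic on a tiny hand-made nested list. `mean_pool` is a hypothetical name and the numbers are invented for illustration.

```python
def mean_pool(hidden_states):
    """Average a [sentences][tokens][hidden] nested list down to one [hidden] vector."""
    n_sentences = len(hidden_states)
    n_tokens = len(hidden_states[0])
    hidden_size = len(hidden_states[0][0])
    pooled = [0.0] * hidden_size
    # sum every token vector of every sentence, dimension by dimension
    for sentence in hidden_states:
        for token in sentence:
            for k, value in enumerate(token):
                pooled[k] += value
    # all sentences have the same (padded) length, so one division gives the overall mean
    return [total / (n_sentences * n_tokens) for total in pooled]

# 2 sentences, 2 token positions each, hidden size 3
toy = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]],
]
print(mean_pool(toy))
```

Because the tokenizer is called with `padding=True`, shorter sentences are padded to the longest one in the post, and this plain average includes those padding positions as well.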

## Reset the Index
Here, we reset the index to make sure it is empty.

In [7]:
index.reset()

'Success'

## Upserter Helper Function

In [8]:
def import_posts_with_vectors(posts, vectors, index):
    if len(posts) != len(vectors):
        raise ValueError("number of posts ({}) and vectors ({}) do not match".format(len(posts), len(vectors)))

    for i, post in enumerate(posts):
        try:
            index.upsert([(i, vectors[i].tolist(), {"content": post})])
        except Exception as e:
            print(f"Exception: {e}")
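
`import_posts_with_vectors` sends one request per post. Since `index.upsert` takes a list of tuples, several posts could be grouped into each call. The sketch below shows one way to do that. The `upsert_in_batches` name, the batch size of 50, and the `as_list` helper are assumptions for illustration, and `index` only needs an `upsert` method that accepts a list.

```python
def upsert_in_batches(index, posts, vectors, batch_size=50):
    """Upsert (id, vector, metadata) tuples in groups of batch_size. Returns the number of calls made."""
    if len(posts) != len(vectors):
        raise ValueError("posts and vectors must have the same length")

    def as_list(vec):
        # accept both tensors (which have .tolist()) and plain sequences
        return vec.tolist() if hasattr(vec, "tolist") else list(vec)

    items = [(i, as_list(vec), {"content": post})
             for i, (post, vec) in enumerate(zip(posts, vectors))]

    calls = 0
    for start in range(0, len(items), batch_size):
        index.upsert(items[start:start + batch_size])
        calls += 1
    return calls
```

Unlike the helper above, this sketch lets exceptions propagate, so a failed batch stops the import instead of being printed and skipped.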

## Querying Helper Function

In [9]:
def search(query="", limit=3):
    before = time.time()
    vec = text2vec(query).tolist()
    vec_took = time.time() - before

    before = time.time()
    res = index.query(vector=vec, top_k=limit, include_metadata=True)
    search_took = time.time() - before

    print("\nQuery \"{}\" with {} results took {:.3f}s ({:.3f}s to vectorize and {:.3f}s to search)" \
          .format(query, limit, vec_took+search_took, vec_took, search_took))
    for post in res:
        print("{:.4f}: {}".format(post.score, post.metadata["content"]))
        print('---')
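
Conceptually, `index.query` with `top_k=limit` returns the stored vectors most similar to the query vector. The brute-force sketch below does the same thing locally over a handful of made-up vectors. It assumes cosine similarity, which may differ from the metric the index was created with, and the scores it prints are raw cosine values that are not necessarily on the same scale as `post.score`. `cosine` and `top_k_local` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_local(query_vec, stored, k=3):
    """stored maps an id to its vector; return the k (score, id) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), key) for key, vec in stored.items()]
    scored.sort(reverse=True)
    return scored[:k]

# invented two-dimensional vectors standing in for stored post embeddings
stored = {
    "cameras": [1.0, 0.0],
    "motorcycles": [0.0, 1.0],
    "photo software": [0.8, 0.6],
}
for score, key in top_k_local([1.0, 0.1], stored, k=2):
    print("{:.4f}: {}".format(score, key))
```

A hosted index avoids scanning every vector like this, but the ranking idea is the same: the closer a stored vector points in the direction of the query vector, the higher it ranks.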

## Run the main flow
Now, run the helper functions defined above: read the posts, vectorize them, upsert them into the index, and query it.

In [14]:
posts = read_posts(get_post_filenames(100)) # vectorizing on CPU is slow, so sample at most 100 posts
vectors = vectorize_posts(posts)

So far 25 objects vectorized in 28.810916662216187s
So far 50 objects vectorized in 52.01945471763611s
So far 75 objects vectorized in 80.60468935966492s
Vectorized 98 items in 103.79845881462097s


In [15]:
import_posts_with_vectors(posts, vectors, index)

In [18]:
search("the best camera lens", 1)
search("motorcycle trip", 1)
search("which software do i need to view jpeg files", 1)
search("windows vs mac", 1)


Query "the best camera lens" with 1 results took 0.178s (0.057s to vectorize and 0.122s to search)
0.8115:    I want All-Star Tickets does anyone know how I can get some?  Are they for public sale or are they sold out?  Or do you just have to work for a company with some  Anyway any answers would be appreciated.  Please E-mail me.  Thanks, Andrew 
---

Query "motorcycle trip" with 1 results took 0.073s (0.048s to vectorize and 0.025s to search)
0.7975:    I want All-Star Tickets does anyone know how I can get some?  Are they for public sale or are they sold out?  Or do you just have to work for a company with some  Anyway any answers would be appreciated.  Please E-mail me.  Thanks, Andrew 
---

Query "which software do i need to view jpeg files" with 1 results took 0.086s (0.061s to vectorize and 0.025s to search)
0.9266:    I need a graphics display program that can take as a parameter the name of the file to be displayed, then just display that image and then quit.  All of the othe