<a href="https://colab.research.google.com/github/xtchen64/virtual-doctor-chatbot/blob/main/notebooks/embedding_based_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Embedding Based Retrieval

Written By: Qingyang Xu  
Date Created: 11/24/2023  
Last Modified: 11/24/2023  

### Overview

- Section 1. Test embedding-based retrieval tool `DONE`

  - Faiss: https://github.com/facebookresearch/faiss

- Section 2. Test Word2Vec using BERT `DONE`

  - BERT: https://huggingface.co/transformers/v3.0.2/installation.html

- Section 3. Test embedding-based retrieval of symptoms `IN PROGRESS`

  - ChatGPT: https://chat.openai.com/c/0d8e22b6-d86c-476f-8c02-2663f6eb0444

### Section 1. Test embedding-based retrieval tool `DONE`

- Reproduce sanity check results in https://github.com/facebookresearch/faiss/wiki/Getting-started

In [5]:
### Faiss
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [6]:
import numpy as np

d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

In [7]:
import faiss                   # make faiss available

index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

True
100000


In [10]:
k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print("neighbors of the 5 first queries")
print(I[:5])                   # neighbors of the 5 first queries

print("neighbors of the 5 last queries")
print(I[-5:])                  # neighbors of the 5 last queries

[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[0.        7.175174  7.2076287 7.251163 ]
 [0.        6.323565  6.684582  6.799944 ]
 [0.        5.7964087 6.3917365 7.2815127]
 [0.        7.277905  7.5279875 7.6628447]
 [0.        6.763804  7.295122  7.368814 ]]
neighbors of the 5 first queries
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
neighbors of the 5 last queries
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
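The sanity check above returns each database vector as its own nearest neighbor at distance 0. As a cross-check that does not depend on Faiss, the same exact search can be sketched in plain NumPy. `IndexFlatL2` reports squared Euclidean distances, so the sketch does the same. The helper name `brute_force_l2_search` and the smaller random database `xb_small` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def brute_force_l2_search(xb, xq, k):
    """Exact k-NN by squared L2 distance; returns (D, I) shaped like index.search."""
    # Pairwise differences, shape (nq, nb, d). Memory grows as nq * nb * d,
    # so this is only meant for small sanity checks.
    diffs = xq[:, None, :] - xb[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)                # (nq, nb) squared distances
    I = np.argsort(dists, axis=1)[:, :k]             # indices of the k closest vectors
    D = np.take_along_axis(dists, I, axis=1)         # their distances, ascending
    return D, I

# Hypothetical small database so the pairwise array stays tiny
rng = np.random.default_rng(1234)
xb_small = rng.random((1000, 64)).astype('float32')

D_ref, I_ref = brute_force_l2_search(xb_small, xb_small[:5], k=4)
print(I_ref[:, 0])   # nearest neighbor of each query
print(D_ref[:, 0])   # distance to that neighbor
```

Because the queries are taken from the database itself, the first column of `I_ref` should be the query's own index and the first column of `D_ref` should be zero, which mirrors the pattern in the Faiss output above.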


### Section 2. Test Word2Vec using BERT `DONE`

In [11]:
!pip install transformers



In [12]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Input text for inference
input_text = "I have a high fever."

# Tokenize and encode the input text
tokens = tokenizer.encode(input_text, add_special_tokens=True)
input_ids = torch.tensor(tokens).unsqueeze(0)  # Add batch dimension


In [13]:
print(f"tokens: {tokens}")

tokens: [101, 1045, 2031, 1037, 2152, 9016, 1012, 102]


In [14]:
# Forward pass through the BERT model
with torch.no_grad():
    outputs = model(input_ids)

In [17]:
# Per-token embeddings from the final BERT layer (BertModel returns hidden states, not logits)
last_hidden_states = outputs.last_hidden_state

# For classification tasks, you might use the pooler output
pooler_output = outputs.pooler_output

# Convert PyTorch tensor to numpy array for further processing if needed
numpy_output = last_hidden_states.numpy()

# Print or use the results as needed
print("Embeddings shape:", last_hidden_states.shape)
print("Pooler output shape:", pooler_output.shape)

Embeddings shape: torch.Size([1, 8, 768])
Pooler output shape: torch.Size([1, 768])
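The last hidden state holds one 768-dimensional vector per token (8 tokens here), while retrieval needs a single vector per sentence. One common option, which is an assumption here and not something the notebook prescribes, is to average the token vectors while ignoring padded positions. The sketch below works on NumPy arrays such as `numpy_output`. The helper `mean_pool` and the stand-in arrays `demo_hidden` and `demo_mask` are hypothetical.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, skipping masked (padding) positions.

    hidden_states: (batch, tokens, hidden) array
    attention_mask: (batch, tokens) array of 0/1
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, tokens, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

# Stand-in data with the same shape as the BERT output above (not real embeddings)
demo_hidden = np.random.default_rng(0).random((1, 8, 768)).astype('float32')
demo_mask = np.ones((1, 8), dtype='int64')

sentence_vec = mean_pool(demo_hidden, demo_mask)
print(sentence_vec.shape)  # (1, 768)
```

With a single unpadded sentence the mask is all ones, so this reduces to a plain mean over the token axis. The mask only matters once several sentences of different lengths are batched together.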


### Section 3. Test embedding-based retrieval of symptoms `IN PROGRESS`
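As a placeholder for this section, here is a minimal sketch of how symptom retrieval might be wired up, under stated assumptions. The symptom phrases are hypothetical, and `toy_embed` is a bag-of-words stand-in for a BERT sentence encoder so that the example runs without downloading a model. Ranking uses cosine similarity on L2-normalized vectors.

```python
import numpy as np

# Hypothetical symptom phrases, for illustration only
symptoms = ["high fever", "dry cough", "sore throat", "severe headache"]

# Vocabulary built from the symptom phrases themselves
vocab = {w: i for i, w in enumerate(sorted({w for s in symptoms for w in s.split()}))}

def toy_embed(texts, vocab):
    """Stand-in for a BERT encoder: L2-normalized bag-of-words counts."""
    vecs = np.zeros((len(texts), len(vocab)), dtype="float32")
    for row, text in enumerate(texts):
        for word in text.lower().replace(".", "").split():
            if word in vocab:
                vecs[row, vocab[word]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)   # guard against all-zero rows

symptom_vecs = toy_embed(symptoms, vocab)

def retrieve(query, k=2):
    """Return the k symptoms most similar to the query, with their scores."""
    q = toy_embed([query], vocab)
    scores = (q @ symptom_vecs.T)[0]        # cosine similarity, since rows are unit length
    top = np.argsort(-scores)[:k]
    return [(symptoms[i], float(scores[i])) for i in top]

results = retrieve("I have a high fever.")
print(results[0])
```

In a fuller version, `toy_embed` would presumably be replaced by BERT sentence vectors, and the matrix product by a Faiss index such as `faiss.IndexFlatIP` over the normalized vectors. Both substitutions are assumptions about where this section is headed.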