<a href="https://colab.research.google.com/github/qianyu-berkeley/NLP_study/blob/main/nlp_tasks/sementic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search

* NLP Concepts:
    * Embedding
    * Similarity metric

* Libraries
    * Hugging Face (`datasets`, `transformers`)
    * FAISS
    * PyTorch

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

Collecting datasets
  Using cached datasets-2.14.0-py3-none-any.whl (492 kB)
Collecting evaluate
  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

## Load the GitHub issues dataset

* The dataset contains many columns; we are interested in `comments`, `title` and `body`.
* We filter out pull requests, since we aim to build a semantic search over issues.
* We filter out issues that have no comments.

In [None]:
from datasets import load_dataset, Dataset

In [None]:
git_issue_dataset = load_dataset(path="lewtun/github-issues", split="train")
git_issue_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [None]:
# Keep only real issues (not pull requests) that have at least one comment
git_issue_dataset = git_issue_dataset.filter(
    lambda x: not x["is_pull_request"] and len(x["comments"]) > 0
)
git_issue_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

In [None]:
columns_to_keep = ["title", "body", "html_url", "comments"]
# Every column that is not explicitly kept gets removed
columns_to_remove = set(git_issue_dataset.column_names) - set(columns_to_keep)
print(columns_to_remove)
git_issue_dataset = git_issue_dataset.remove_columns(columns_to_remove)
git_issue_dataset

{'labels_url', 'user', 'node_id', 'author_association', 'assignee', 'closed_at', 'state', 'url', 'events_url', 'repository_url', 'active_lock_reason', 'created_at', 'pull_request', 'updated_at', 'assignees', 'labels', 'milestone', 'number', 'performed_via_github_app', 'comments_url', 'id', 'is_pull_request', 'locked', 'timeline_url'}


Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

## Process text fields before creating embeddings

* Unpack the `comments` field so that each comment gets its own row
* Filter out short comments
* Concatenate the text fields into a single `text` field
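
The three steps above can be sketched on a single toy record in plain Python, without pandas or `datasets`. The record contents, the `MIN_WORDS` constant and the helper names are hypothetical and only illustrate the idea; the cells below do the same work with `DataFrame.explode`, `Dataset.map` and `Dataset.filter`.

```python
# Toy issue record; the field names mirror the dataset columns kept above.
issue = {
    "title": "Loading fails offline",
    "body": "load_dataset raises a ConnectionError without network access.",
    "comments": [
        "Thanks!",
        "You can call save_to_disk on a machine with network access, copy the "
        "directory over, and then call load_from_disk on the offline machine.",
    ],
}

MIN_WORDS = 15  # hypothetical name; the notebook keeps comments longer than 15 words


def explode(record):
    """Yield one row per comment, repeating the other fields."""
    for comment in record["comments"]:
        yield {**record, "comments": comment}


def is_long_enough(row):
    """Keep rows whose comment has more than MIN_WORDS whitespace-separated words."""
    return len(row["comments"].split()) > MIN_WORDS


def concat_fields(row):
    """Join title, body and comment into a single text field."""
    return {**row, "text": "\n".join([row["title"], row["body"], row["comments"]])}


rows = [concat_fields(r) for r in explode(issue) if is_long_enough(r)]
print(len(rows))
print(rows[0]["text"])
```

Only the long comment survives the filter, and its `text` field carries the issue title and body along with the comment, which gives the embedding model more context than the comment alone.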

In [None]:
git_issue_dataset.set_format("pandas")
df = git_issue_dataset[:]
df.head()

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"[Cool, I think we can do both :), @lhoestq now...",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,[Hi ! I guess the caching mechanism should hav...,## Describe the bug\r\nAfter upgrading to data...
2,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,[I tried `unshuffled_original_da` and it is al...,## Describe the bug\r\n\r\nCannot download OSC...
3,https://github.com/huggingface/datasets/issues...,load_dataset using default cache on Windows ca...,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfo...",## Describe the bug\r\nStandard process to dow...
4,https://github.com/huggingface/datasets/issues...,to_tf_dataset keeps a reference to the open da...,"[I did some investigation and, as it seems, th...",To reproduce:\r\n```python\r\nimport datasets ...


In [None]:
df['comments'][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

In [None]:
# Turn each list of comments into one row per comment
comments_df = df.explode("comments", ignore_index=True)
print(comments_df.shape)

(2964, 4)


In [None]:
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [None]:
comments_dataset = comments_dataset.map(lambda x: {"comment_length": len(x["comments"].split())})
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

In [None]:
def concat_text(txt):
    """Join the title, body and comment of a row into a single text field."""
    return {"text": txt["title"]
            + "\n"
            + txt["body"]
            + "\n"
            + txt["comments"]
            }
comments_dataset = comments_dataset.map(concat_text)

## Create Embeddings

* Load the model and tokenizer checkpoints with `AutoModel` and `AutoTokenizer`
    * We use the sentence-transformers model `multi-qa-mpnet-base-dot-v1`, which is designed for semantic search.
    * It maps sentences and paragraphs to a 768-dimensional dense vector space and has been trained on 215M (question, answer) pairs from diverse sources.
* A transformer model produces one embedding vector per token, so to get a single sentence embedding we need to pool the token embeddings. Common pooling methods are:
    * `cls pooling`: a special classification token (`[CLS]` in BERT) is added to the beginning of every sequence to capture sentence-level information. The pooling layer simply selects that token's embedding as the sentence embedding.
    * `mean pooling`: average all of the contextualized token embeddings produced by the model (e.g. BERT)
    * `max pooling`: take the element-wise maximum over the token embeddings, i.e. the largest value across time steps for each dimension
    * `mean sqrt-length pooling`: sum the contextualized token embeddings and divide by the square root of the sequence length (the `mean_sqrt_len_tokens` mode in sentence-transformers)
* We use cls pooling in this notebook
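
To make the pooling options concrete, here is a minimal sketch on a tiny hand-written batch. It assumes NumPy arrays standing in for the model's `last_hidden_state` and the tokenizer's `attention_mask`; the numbers and the helper names `cls_pool`, `mean_pool` and `max_pool` are made up for illustration and are not part of any library.

```python
import numpy as np

# Stand-in for model_output.last_hidden_state: (batch=2, tokens=3, hidden=2).
hidden = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last token is padding
    ]
)
# 1 marks a real token, 0 marks padding, as in the tokenizer's attention_mask.
mask = np.array([[1, 1, 1], [1, 1, 0]])


def cls_pool(hidden):
    """Take the embedding of the first token of each sequence."""
    return hidden[:, 0]


def mean_pool(hidden, mask):
    """Average the embeddings of real tokens only."""
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)


def max_pool(hidden, mask):
    """Element-wise maximum over real tokens; padding is set to -inf first."""
    m = mask[:, :, None].astype(bool)
    return np.where(m, hidden, -np.inf).max(axis=1)


cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, mask)
max_vec = max_pool(hidden, mask)
print(cls_vec.shape, mean_vec.shape, max_vec.shape)
```

Each method reduces the token axis, so a `(batch, tokens, hidden)` input becomes `(batch, hidden)`. Mean and max pooling need the attention mask so that padding tokens do not leak into the sentence embedding; cls pooling does not, because the first token is never padding.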

In [None]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# We want the model output to be the hidden states, so we use "AutoModel"
model = AutoModel.from_pretrained(model_ckpt)

In [None]:
import torch

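# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")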
print(f"device: {device}")
model.to(device)

device: cuda


MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [None]:
def cls_pooling(model_output):
    # The sentence embedding is the hidden state of the first token
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Tokenize, padding and truncating so the batch forms a rectangular tensor
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move the input tensors to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Run the pretrained model without tracking gradients (inference only)
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [None]:
embedding_example = get_embeddings(comments_dataset["text"][0])
embedding_example.shape

torch.Size([1, 768])

In [None]:
# Detach each embedding from the graph, move it back to the CPU and convert it to NumPy, which is what FAISS works with
embedding_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

## Use FAISS for efficient similarity search

* FAISS builds a special data structure, called an index, that makes it efficient to find which stored embeddings are most similar to an input embedding.
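
As a rough illustration of what an exhaustive ("flat") index computes, the sketch below does a brute-force nearest-neighbour search in NumPy. It assumes inner-product similarity and made-up 2-D vectors; the metric a real FAISS index uses depends on how it is configured, and `top_k_inner_product` is a hypothetical helper, not a FAISS function.

```python
import numpy as np

# Hypothetical 2-D embeddings for four documents (the real ones are 768-D).
doc_embeddings = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]], dtype=np.float32
)
query = np.array([1.0, 0.0], dtype=np.float32)


def top_k_inner_product(query, docs, k):
    """Return (scores, indices) of the k rows of docs with the largest dot product."""
    scores = docs @ query
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return scores[order], order


scores, indices = top_k_inner_product(query, doc_embeddings, k=2)
print(indices, scores)
```

This scans every stored vector for each query, which is fine for a few thousand comments. FAISS provides optimized and approximate index types for the case where the collection is too large for a full scan.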

In [None]:
embedding_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

In [None]:
question_example = "how can I import a dataset offline?"
question_example_embedding = get_embeddings([question_example]).cpu().detach().numpy()
question_example_embedding.shape

(1, 768)

In [None]:
scores, samples = embedding_dataset.get_nearest_examples(
    "embeddings", question_example_embedding, k=3
)

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [None]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 23.65715217590332
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
> 
> 1. (online machine)
> 
> ```
> 
> import datasets
> 
> data = datasets.load_dataset(...)
> 
> data.save_to_disk(/YOUR/DATASET/DIR)
> 
> ```
> 
> 2. copy the dir from online to the offline machine
> 
> 3. (offline machine)
> 
> ```
> 
> import datasets
> 
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
> 
> ```
> 
> 
> 
> HTH.


SCORE: 22.730363845825195
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: here is my way to load a dataset offline, but i