<a href="https://colab.research.google.com/github/srvmishra/Language-Models/blob/main/GitHub_Issues_Semantic_Search_with_FAISS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


### Imports

In [2]:
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModel

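# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')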

### Load Dataset

In [3]:
issues_dataset = load_dataset('lewtun/github-issues', split='train')
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

### Filter out pull requests and issues without comments

In [4]:
issues_dataset = issues_dataset.filter(lambda x: len(x['comments']) > 0 and not x['is_pull_request'])
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

### Remove unnecessary columns

In [5]:
columns_to_keep = ['html_url', 'title', 'comments', 'body']
all_columns = issues_dataset.column_names
# Plain set difference: every existing column that is not in columns_to_keep
columns_to_remove = list(set(all_columns) - set(columns_to_keep))
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
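
A small sketch of why set difference is the safer operation for building the removal list. The column names below are toy values chosen for illustration, and `not_a_column` is a deliberately hypothetical name that does not exist in the dataset.

```python
# Toy column lists; "not_a_column" is a hypothetical name absent from all_columns.
all_columns = ["html_url", "title", "comments", "body", "id"]
columns_to_keep = ["html_url", "title", "comments", "body", "not_a_column"]

# Difference: columns that exist and are not in the keep list.
difference = set(all_columns) - set(columns_to_keep)

# Symmetric difference: also includes keep-list names missing from the dataset.
symmetric = set(all_columns).symmetric_difference(columns_to_keep)

print(sorted(difference))
print(sorted(symmetric))
```

The two results agree only when every name in the keep list is an existing column, which happens to be the case in this notebook.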

### Create one row for each comment and filter out the smaller comments

1. Each issue can have several comments, so we create one row per comment and replicate the other column values for each of them.
2. We filter out comments that are 15 words or fewer in length.
3. We prepare the text for tokenization by concatenating the `title`, `body`, and `comments` fields in each row.

Note that `explode` does not operate in place, so its result must be assigned back to a variable.
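
The three steps above can be sketched on a toy DataFrame. The data, the `toy_df` name, and the 3-word threshold are assumptions made for this illustration only; the notebook itself uses the GitHub issues data and a 15-word threshold.

```python
import pandas as pd

# Hypothetical toy data: one row per issue, with a list of comments per row.
toy_df = pd.DataFrame({
    "title": ["Issue A", "Issue B"],
    "body": ["Body of A", "Body of B"],
    "comments": [["short reply", "a much longer reply " * 5], ["only comment here"]],
})

# Step 1: explode returns a new DataFrame; toy_df itself is left unchanged.
exploded = toy_df.explode("comments", ignore_index=True)

# Step 2: keep only comments with more than min_words words.
min_words = 3  # shrunk threshold for the toy data
word_counts = exploded["comments"].str.split().str.len()
kept = exploded[word_counts > min_words].reset_index(drop=True)

# Step 3: concatenate title, body and comment into a single text field.
kept["text"] = kept["title"] + "\n" + kept["body"] + "\n" + kept["comments"]

print(len(toy_df), len(exploded), len(kept))
```

The original frame keeps its two rows, which is why the result of `explode` has to be assigned to a new name.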

In [6]:
issues_dataset.set_format('pandas')
issues_df = issues_dataset[:]
issues_df = issues_df.explode('comments', ignore_index=True)
issues_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
| 2 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | Hi ! I guess the caching mechanism should have... | ## Describe the bug\r\nAfter upgrading to data... |
| 3 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | If it's easy enough to implement, then yes ple... | ## Describe the bug\r\nAfter upgrading to data... |


In [7]:
issues_with_comments_dataset = Dataset.from_pandas(issues_df)
issues_with_comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [8]:
issues_with_comments_dataset = issues_with_comments_dataset.map(lambda x: {'length': len(x['comments'].split())})
issues_with_comments_dataset = issues_with_comments_dataset.filter(lambda x: x['length'] > 15)
issues_with_comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'length'],
    num_rows: 2175
})

In [9]:
issues_with_comments_dataset = issues_with_comments_dataset.map(lambda x: {'text': x['title'] + '\n' +
                                                                                   x['body'] + '\n' +
                                                                                   x['comments']})
issues_with_comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'length', 'text'],
    num_rows: 2175
})

### Load Model and Tokenizer

In [10]:
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt).to(device).eval()

### Create Embeddings

In [11]:
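def get_embedding(example):
  # Tokenize the text, truncating it to the model's maximum sequence length
  model_inputs = tokenizer(example['text'], truncation=True, padding=True, return_tensors='pt')
  model_inputs = {k: v.to(device) for k, v in model_inputs.items()}
  with torch.no_grad():
    # CLS pooling: take the hidden state of the first token as the embedding
    cls_embedding = model(**model_inputs).last_hidden_state[:, 0]
  # Return the embedding of the first (and only) sequence in the batch
  return cls_embedding.cpu().numpy()[0]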

issues_with_comments_dataset = issues_with_comments_dataset.map(lambda x: {'embedding': get_embedding(x)})
issues_with_comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'length', 'text', 'embedding'],
    num_rows: 2175
})

### Create and query with FAISS index

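Nearest-neighbour search over the embedding vectors. `add_faiss_index` is called here without a `metric_type`, so, assuming the flat index falls back to FAISS's default L2 metric, the returned scores are distances and lower values indicate closer matches.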
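
As a rough illustration of what an exact (flat) index computes, here is a brute-force NumPy sketch of nearest-neighbour search by squared L2 distance. The function name `nearest_neighbours` and the toy vectors are made up for this example, and FAISS itself is not used.

```python
import numpy as np

def nearest_neighbours(query, vectors, k=2):
    """Return (distances, indices) of the k rows of `vectors` closest to `query`."""
    # Squared Euclidean distance from the query to every stored vector.
    dists = np.sum((vectors - query) ** 2, axis=1)
    # Indices of the k smallest distances, closest first.
    top = np.argsort(dists)[:k]
    return dists[top], top

# Toy 2-dimensional "embeddings" and a query vector.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
query = np.array([0.9, 0.1])

dists, idx = nearest_neighbours(query, vectors, k=2)
print(idx, dists)
```

A real index avoids recomputing all distances in Python and scales to far more vectors, but the ranking logic is the same idea.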

In [12]:
issues_with_comments_dataset.add_faiss_index(column='embedding')

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'length', 'text', 'embedding'],
    num_rows: 2175
})

In [13]:
question = "How can I load a dataset offline?"
example = {'text': [question]}
query_embedding = get_embedding(example)
print(query_embedding.shape)

(768,)


In [14]:
scores, samples = issues_with_comments_dataset.get_nearest_examples('embedding', query_embedding, k=5)
samples_df = pd.DataFrame.from_dict(samples)
samples_df['scores'] = scores
samples_df.sort_values('scores', ascending=False, inplace=True)
samples_df.head()

| | html_url | title | comments | body | length | text | embedding | scores |
|---|---|---|---|---|---|---|---|---|
| 4 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | Requiring online connection is a deal breaker ... | `datasets.load_dataset("csv", ...)` breaks if ... | 57 | Discussion using datasets in offline mode\n`da... | [-0.4731806814670563, 0.24578382074832916, -0.... | 25.505016 |
| 3 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | The local dataset builders (csv, text , json a... | `datasets.load_dataset("csv", ...)` breaks if ... | 38 | Discussion using datasets in offline mode\n`da... | [-0.4490852952003479, 0.20950652658939362, -0.... | 24.55554 |
| 2 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | I opened a PR that allows to reload modules th... | `datasets.load_dataset("csv", ...)` breaks if ... | 179 | Discussion using datasets in offline mode\n`da... | [-0.4716479778289795, 0.2902272641658783, -0.0... | 24.148987 |
| 1 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | > here is my way to load a dataset offline, bu... | `datasets.load_dataset("csv", ...)` breaks if ... | 76 | Discussion using datasets in offline mode\n`da... | [-0.4992601275444031, 0.22699788212776184, -0.... | 22.894003 |
| 0 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | here is my way to load a dataset offline, but ... | `datasets.load_dataset("csv", ...)` breaks if ... | 47 | Discussion using datasets in offline mode\n`da... | [-0.4902574121952057, 0.22889623045921326, -0.... | 22.406656 |


In [15]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505016326904297
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555540084838867
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's n