In [None]:
!pip install datasets transformers faiss-gpu

# Prepare Dataset

In [32]:
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd

In [33]:
issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [34]:
issues_dataset = issues_dataset.filter(lambda x: x['is_pull_request'] == False and len(x['comments'])>0)
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
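
The cell above uses `symmetric_difference` to work out which columns to drop. That gives the intended result only because every name in `columns_to_keep` really is a column of the dataset. A minimal sketch with made-up column names (assumptions for illustration, not the real schema) compares it with a plain set difference:

```python
# Hypothetical column names, for illustration only.
columns = ["url", "title", "body", "comments", "html_url", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Plain set difference: every dataset column that is not in the keep list.
to_remove = set(columns) - set(columns_to_keep)
print(sorted(to_remove))

# If the keep list contains a name that is not a real column (here a typo),
# symmetric_difference also returns that name, while set difference ignores it.
keep_with_typo = columns_to_keep + ["titel"]
sym = set(keep_with_typo).symmetric_difference(columns)
diff = set(columns) - set(keep_with_typo)
print("titel" in sym, "titel" in diff)
```

When the keep list is a subset of the columns, both expressions return the same set, so the notebook's result is unaffected.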

In [35]:
# Augment each comment with the issue's title and body, since these fields often include useful contextual information.
# `comments` is a list, so explode it to get one comment per row.
issues_dataset.set_format('pandas')
df = issues_dataset[:]
exploded_df = df.explode('comments', ignore_index=True)
exploded_df.head(2)

|   | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
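
To make the explode step concrete, here is a minimal pure-Python sketch of the same idea: each element of a list-valued field becomes its own row, and the other fields are copied. The `explode` helper and the sample rows are hypothetical, written only for illustration.

```python
def explode(rows, column):
    """Return one row per element of row[column], copying the other fields."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # shallow copy of the other fields
            new_row[column] = item   # replace the list with a single element
            exploded.append(new_row)
    return exploded

# Made-up rows, for illustration only.
issues = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]
exploded = explode(issues, "comments")
for row in exploded:
    print(row)
```

One difference from pandas: this sketch drops rows whose list is empty, whereas `DataFrame.explode` keeps such a row with a missing value. The notebook filters out issues without comments beforehand, so the distinction does not matter here.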


In [36]:
# Filter out very short comments (15 words or fewer), which are rarely meaningful
issues_dataset = Dataset.from_pandas(exploded_df)
issues_dataset = issues_dataset.map(lambda x: {"comment_len": len(x['comments'].split())})
issues_dataset = issues_dataset.filter(lambda x: x['comment_len']>15)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_len'],
    num_rows: 2175
})

In [37]:
def concatenate(sample):
  return {"text": sample['title'] + "\n" + sample['body'] + "\n" + sample['comments']}

issues_dataset = issues_dataset.map(concatenate)

# Create Embeddings

* The sentence-transformers library is dedicated to creating embeddings.
* Our use case is an example of asymmetric semantic search: we have a short query whose answer we’d like to find in a longer document, such as an issue comment.

1.  **Symmetric semantic search**: your query and the entries in your corpus are about the same length and carry about the same amount of content. An example is searching for similar questions: your query could be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”.

2.  **Asymmetric semantic search**: you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query.

In [38]:
# Load model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt).to(device)

In [39]:
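def get_embeddings(text_list):
    # Tokenize, padding to the longest input and truncating to the model's maximum length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move the input tensors to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)

    # CLS pooling: use the first token's hidden state as the embedding of each input
    return model_output.last_hidden_state[:, 0]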
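
`get_embeddings` keeps only the first token's vector from `last_hidden_state`, which is often called CLS pooling. The sketch below reproduces that selection on a toy nested list and, for contrast, shows masked mean pooling, a common alternative. The numbers, the `attention_mask` values, and the helper names `cls_pool` and `mean_pool` are assumptions for illustration; nothing here calls the real model.

```python
# Toy "last hidden state": batch of 2 sequences, 3 tokens each, hidden size 2.
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [0.0, 0.0]],   # last token is padding
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

def cls_pool(hidden):
    """Take the first token's vector, like last_hidden_state[:, 0]."""
    return [seq[0] for seq in hidden]

def mean_pool(hidden, mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    pooled = []
    for seq, seq_mask in zip(hidden, mask):
        n = sum(seq_mask)
        dim = len(seq[0])
        pooled.append([
            sum(tok[d] for tok, m in zip(seq, seq_mask) if m) / n
            for d in range(dim)
        ])
    return pooled

cls_vectors = cls_pool(last_hidden_state)
mean_vectors = mean_pool(last_hidden_state, attention_mask)
print(cls_vectors)
print(mean_vectors)
```

Which pooling is appropriate depends on how a given checkpoint was trained, so it is worth checking the model card before swapping one for the other.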

In [40]:
embeddings_dataset = issues_dataset.map(lambda x: {"embedding": get_embeddings(x["text"]).detach().cpu().numpy()[0]})

# FAISS Semantic Search

FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding.
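
As a rough mental model, the simplest kind of index compares the query against every stored vector and returns the closest ones. The sketch below does this by brute force in pure Python with squared L2 distance. It does not use FAISS, and the helper names and toy vectors are made up for illustration.

```python
import heapq

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k):
    """Return (distance, index) pairs for the k vectors closest to query."""
    scored = ((l2_squared(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy corpus of 2-dimensional "embeddings".
corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = nearest([1.0, 1.0], corpus, k=2)
print(hits)
```

With a distance-based score like this one, a smaller value means a closer match. An exhaustive scan costs time proportional to the corpus size, which is the cost that more elaborate index structures aim to reduce.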

In [41]:
# Compute FAISS index
embeddings_dataset.add_faiss_index(column = "embedding")

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_len', 'text', 'embedding'],
    num_rows: 2175
})

In [42]:
# get the nearest neighbors to the embedding of our question
question = "How can I load dataset with streaming?"
embedding = get_embeddings(question).detach().cpu().numpy()
scores, samples = embeddings_dataset.get_nearest_examples("embedding", embedding, k=5)

In [43]:
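df = pd.DataFrame.from_dict(samples)
df["score"] = scores
# Scores come from the FAISS index. With the default index built by add_faiss_index they
# should be L2 distances, where smaller means closer, so this descending sort lists the
# farthest of the k hits first; use ascending=True to see the closest match first.
df.sort_values("score", ascending=False, inplace=True)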
for _, row in df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.score}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Hi @SBrandeis, thanks for reporting! ^^

I think this is an issue with `fsspec`: https://github.com/intake/filesystem_spec/issues/389

I will ask them if they are planning to fix it...
SCORE: 34.46894454956055
TITLE: `Can not decode content-encoding: gzip` when loading `scitldr` dataset with streaming
URL: https://github.com/huggingface/datasets/issues/2918

COMMENT: Code to reproduce the bug: `ClientPayloadError: 400, message='Can not decode content-encoding: gzip'`
```python
In [1]: import fsspec

In [2]: import json

In [3]: with fsspec.open('https://raw.githubusercontent.com/allenai/scitldr/master/SciTLDR-Data/SciTLDR-FullText/test.jsonl', encoding="utf-8") as f:
   ...:     for row in f:
   ...:         data = json.loads(row)
   ...:
---------------------------------------------------------------------------
ClientPayloadError                        Traceback (most recent call last)
```
SCORE: 34.46894454956055
TITLE: `Can not decode content-encoding: gzip

# The End 🤗