STEPS TO CREATE A NEW DATASET

We need to download the data from the issues endpoint, which returns a list of JSON objects. Each object contains a large number of fields, including the title and description as well as metadata about the status of the issue. A convenient way to download it is with the requests library, the de facto standard for making HTTP requests in Python.

!pip install requests

In [None]:
# Retrieve the first issue on the first page

import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

response.status_code
# OUTPUT: 200 --> the request was successful
response.json()  # Access the JSON payload

Unauthenticated requests are limited to 60 per hour, so you should follow GitHub's instructions for creating a personal access token, which raises the rate limit to 5,000 requests per hour. Once you have the token, you can include it in the request header:

GITHUB_TOKEN = "xxx"  # Paste your GitHub token here (it must be a string)
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
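GitHub reports the current rate-limit state in the response headers X-RateLimit-Remaining and X-RateLimit-Reset (a Unix timestamp). As a minimal sketch, the hypothetical helper below (seconds_until_reset is not part of the original notebook) computes how long to wait from a plain dictionary of headers, so it can be tried without making a request. With requests, response.headers could be passed in instead.

```python
import time


def seconds_until_reset(response_headers, now=None):
    """Return how many seconds to wait before the rate limit resets (0 if requests remain)."""
    # Assume requests remain if the header is absent
    remaining = int(response_headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    reset_at = int(response_headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(0, reset_at - int(now))


# Made-up header values for illustration
example_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700003600"}
wait = seconds_until_reset(example_headers, now=1700000000)
print(wait)  # 3600
```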

In [None]:
# A function that downloads all the issues from a GitHub repository

import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm


def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    # GitHub page numbers start at 1, so iterate from 1 to num_pages inclusive
    for page in tqdm(range(1, num_pages + 1)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print("Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )
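To make the pagination arithmetic concrete, here is a small sketch that only builds the page URLs without sending any request. The helper name build_page_urls is hypothetical, and the sketch assumes page numbers start at 1, which is the API's default first page.

```python
import math
from urllib.parse import urlencode


def build_page_urls(owner, repo, num_issues, per_page=100):
    """Build one issues URL per page needed to cover num_issues."""
    num_pages = math.ceil(num_issues / per_page)
    base = f"https://api.github.com/repos/{owner}/{repo}/issues"
    urls = []
    for page in range(1, num_pages + 1):  # pages are assumed to be 1-indexed
        query = urlencode({"page": page, "per_page": per_page, "state": "all"})
        urls.append(f"{base}?{query}")
    return urls


urls = build_page_urls("huggingface", "datasets", num_issues=250)
print(len(urls))  # 3
print(urls[0])
```

With 250 issues and 100 issues per page, three pages are needed, and the last one is only partially filled.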

# Now, when we call fetch_issues() it will download all the issues in batches to avoid exceeding GitHub's limit on the number of requests per hour
# The result will be stored in a repository_name-issues.jsonl file, where each line is a JSON object that represents an issue

fetch_issues()
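The JSON Lines layout mentioned above (one JSON object per line) can be illustrated with the standard library alone. This is a sketch with two made-up issues written to a temporary file, not the real output of fetch_issues():

```python
import json
import tempfile
from pathlib import Path

# Two made-up issues standing in for real API results
issues = [
    {"number": 1, "title": "First issue", "state": "open"},
    {"number": 2, "title": "Second issue", "state": "closed"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example-issues.jsonl"

    # Write one JSON object per line
    with path.open("w", encoding="utf-8") as f:
        for issue in issues:
            f.write(json.dumps(issue) + "\n")

    # Read the file back, decoding each non-empty line separately
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == issues)  # True
```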

# We'll load them locally
from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
issues_dataset

# OUTPUT:
# Dataset({
#    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app'],
#    num_rows: 3019
#})

By inspecting the comments endpoint and what it returns, we can see that the text of each comment is stored in its body field. We'll write a function that returns all the comments associated with an issue by picking out the body of each element in response.json():

def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]


# Test our function works as expected
get_comments(2792)
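The list comprehension in get_comments() assumes the payload is a list of comment objects. When a request fails (for example, when the rate limit is exceeded), the API is assumed here to return a single JSON object with a message field instead. The sketch below shows a defensive variant on hand-written payloads, and the helper name extract_comment_bodies is hypothetical.

```python
def extract_comment_bodies(payload):
    """Return the comment bodies from a decoded comments payload."""
    if not isinstance(payload, list):
        # Error responses are assumed to be a JSON object rather than a list
        return []
    # Skip comments whose body is missing or empty
    return [comment["body"] for comment in payload if comment.get("body")]


# Hand-written payloads for illustration
ok_payload = [
    {"id": 1, "body": "Thanks for reporting!"},
    {"id": 2, "body": "Fixed in main."},
]
error_payload = {"message": "API rate limit exceeded"}

print(extract_comment_bodies(ok_payload))
print(extract_comment_bodies(error_payload))
```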

Then, we'll use Dataset.map() to add a new "comments" column to each issue in our dataset:

# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

USING EMBEDDINGS FOR SEMANTIC SEARCH

In [None]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

# OUTPUT:
# Dataset({
#    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
#    num_rows: 2855
#})

In [None]:
# We should filter out the pull requests, as these tend to be rarely used for answering user queries and will introduce noise in our
# search engine. We can use the Dataset.filter() function to exclude these rows, along with issues that have no comments

issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

# OUTPUT:
# Dataset({
#    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
#    num_rows: 771
#})

# We have a lot of columns that we don't actually need, so we can drop them and keep only the important ones:
# "title", "body", "comments", "html_url"

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

# OUTPUT:
# Dataset({
#    features: ['html_url', 'title', 'comments', 'body'],
#    num_rows: 771
#})
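The symmetric_difference() call above gives the right answer because every name in columns_to_keep is an actual column. As a small sketch on made-up column names, the comparison below shows where it would differ from a plain set difference:

```python
columns = ["url", "title", "body", "html_url", "comments", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# When columns_to_keep is a subset of columns, both approaches agree
sym = set(columns_to_keep).symmetric_difference(columns)
diff = set(columns) - set(columns_to_keep)
print(sym == diff)  # True

# With a misspelled name ("comment"), the symmetric difference also contains
# the misspelled name itself, which is not a real column
columns_to_keep_typo = ["title", "body", "html_url", "comment"]
sym_typo = set(columns_to_keep_typo).symmetric_difference(columns)
diff_typo = set(columns) - set(columns_to_keep_typo)
print(sorted(sym_typo))   # ['comment', 'comments', 'state', 'url']
print(sorted(diff_typo))  # ['comments', 'state', 'url']
```

A plain set difference only ever yields existing column names, which makes it the more forgiving choice if the list of names to keep might contain a typo.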

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values.

In [None]:
# Switch to Pandas DataFrame format
issues_dataset.set_format("pandas")
df = issues_dataset[:]

# If we inspect the first row in this DataFrame, we can see there are four comments associated with this issue
df["comments"][0].tolist()

# OUTPUT:
#['the bug code locate in ：\r\n    if data_args.task_name is not None:\r\n        # Downloading and loading a dataset from the hub.\r\n        datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)',
# 'Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com\r\n\r\nNormally, it should work if you wait a little and then retry.\r\n\r\nCould you please confirm if the problem persists?',
# 'cannot connect，even by Web browser，please check that  there is some  problems。',
# 'I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...']

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

# OUTPUT:
# Displayed as a table with the columns html_url, title, comments and body
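To show what "exploding" a list column means, here is a plain-Python sketch of the same idea on two made-up rows. The explode() helper is hypothetical and simplified: unlike the Pandas function, it drops rows whose list is empty.

```python
def explode(rows, column):
    """Create one output row per element of rows[i][column], copying the other fields."""
    exploded = []
    for row in rows:
        for value in row[column]:
            new_row = dict(row)       # replicate all the other column values
            new_row[column] = value   # replace the list with a single element
            exploded.append(new_row)
    return exploded


rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]

for row in explode(rows, "comments"):
    print(row)
```

Two issues with three comments in total become three rows, each carrying the title of the issue it came from.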

In [None]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

# OUTPUT:
# Dataset({
#    features: ['html_url', 'title', 'comments', 'body'],
#    num_rows: 2842
#})

# Create a comment_length column that contains the number of words per comment

comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

# Filter out short comments, which are not useful for our search engine

comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

# Concatenate the issue title, description and comments together in a new text column
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)
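The word-count filter and the concatenation can be tried on a single made-up example without loading the dataset. The sketch below also treats a missing field (None) as an empty string. That is an added assumption, since the original concatenate_text() would raise an error on None.

```python
def comment_length(comment):
    """Number of whitespace-separated words in a comment."""
    return len(comment.split())


def concatenate_fields(title, body, comment):
    """Join title, body and comment with the same ' \\n ' separator used above."""
    # Treat a missing (None) field as an empty string; this is an added assumption
    parts = [title or "", body or "", comment or ""]
    return " \n ".join(parts)


comment = (
    "Could you confirm which version of the library you are using "
    "and share the full traceback please?"
)
print(comment_length(comment))       # 17
print(comment_length(comment) > 15)  # True, so this comment would be kept
print(concatenate_fields("Loading fails offline", None, comment))
```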

NOW WE ARE READY TO CREATE SOME EMBEDDINGS

In [None]:
# First, we need to pick a suitable checkpoint to load the model from. Fortunately, there’s a library called sentence-transformers that is dedicated to creating embeddings.
# The multi-qa-mpnet-base-dot-v1 checkpoint is reported to perform well for semantic search, so we’ll use that for our application. We’ll also load the tokenizer using the same checkpoint:

from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# To speed up the embedding process, place the model on a GPU if one is available:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

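# We need to pool or average our token embeddings in some way to represent each entry in our GitHub issues corpus as a single vector.
# Here we use CLS pooling, which takes the last hidden state of the first (special) token of each sequence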
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# We'll create a helper function that will tokenize a list of documents, move the tensors to the chosen device, feed them to the model and finally apply CLS pooling to the outputs
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

# We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let’s create a new embeddings column as follows
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)
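To make the pooling step in cls_pooling() concrete, here is a sketch on a toy "last hidden state" built from nested lists, with shape (batch=2, tokens=3, hidden=4). The values are made up, and mean_pool is included only as a simplified contrast that ignores padding. It is not what the notebook uses.

```python
# Toy hidden states: 2 sequences, 3 tokens each, 4 dimensions per token
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]


def cls_pool(hidden_states):
    """Keep only the first token's vector per sequence, like tensor[:, 0]."""
    return [sequence[0] for sequence in hidden_states]


def mean_pool(hidden_states):
    """Average the token vectors of each sequence (padding is ignored for simplicity)."""
    return [
        [sum(dim_values) / len(sequence) for dim_values in zip(*sequence)]
        for sequence in hidden_states
    ]


print(cls_pool(last_hidden_state))
print(mean_pool(last_hidden_state))
```

Either way, each sequence ends up as a single vector of length 4, which is the property the search index relies on.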

FAISS is a library that provides efficient algorithms to quickly search and cluster embedding vectors. The idea is to create an index that allows one to find which embeddings are similar to an input embedding.

embeddings_dataset.add_faiss_index(column="embeddings")

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. First, we embed the question:

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

# OUTPUT: (1, 768)

The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples.
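The shape of that return value can be sketched with a brute-force search in plain Python. This is only an illustration of nearest-neighbor lookup, not how FAISS works internally. The nearest_examples helper is hypothetical, and squared L2 distance is an assumed scoring choice (the scores produced by the real index depend on how it is configured).

```python
def nearest_examples(embeddings, samples, query, k=2):
    """Brute-force search: return (scores, samples) for the k closest embeddings."""

    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Score every stored embedding against the query; lower means closer
    ranked = sorted(
        ((squared_l2(vector, query), sample) for vector, sample in zip(embeddings, samples)),
        key=lambda pair: pair[0],
    )[:k]
    scores = [score for score, _ in ranked]
    nearest = [sample for _, sample in ranked]
    return scores, nearest


# Made-up 2-dimensional embeddings for three documents
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
samples = ["doc A", "doc B", "doc C"]

scores, nearest = nearest_examples(embeddings, samples, query=[1.0, 0.0], k=2)
print(nearest)  # ['doc A', 'doc C']
```

With the notebook's objects, the corresponding call would be along the lines of scores, samples = embeddings_dataset.get_nearest_examples("embeddings", question_embedding, k=5), where the value of k is an illustrative choice.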