<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/09-dealing-with-few-to-no-labels/01_working_with_no_labeled_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Working with No Labeled Data

Fortunately, there are several methods that are well suited for dealing with few to no
labels! You may already be familiar with some of them, such as zero-shot or few-shot
learning, as witnessed by GPT-3’s impressive ability to perform a diverse range of
tasks with just a few dozen examples.

In general, the best-performing method will depend on the task, the amount of available
data, and what fraction of that data is labeled.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/09-dealing-with-few-to-no-labels/images/decision-tree.png?raw=1' width='600'/>

##Setup

In [None]:
!pip -q install transformers[sentencepiece]
!pip -q install datasets

In [2]:
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline)
from transformers import TrainingArguments, Trainer
from transformers import AutoConfig
from datasets import load_dataset, load_metric

import torch
import torch.nn as nn
import torch.nn.functional as F

from pathlib import Path
from time import perf_counter

import sys
import time
import math
import requests
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

If you want to download the issues yourself, you can query the GitHub REST API's Issues endpoint using the Requests library, a widely used package for making HTTP requests in Python.

In [3]:
def fetch_issues(owner="huggingface", repo="transformers", num_issues=10_000, rate_limit=5_000):    
  batch = []
  all_issues = []
  per_page = 100 # Number of issues to return per page
  num_pages = math.ceil(num_issues / per_page)
  base_url = "https://api.github.com/repos"
  
  # GitHub numbers result pages starting from 1
  for page in tqdm(range(1, num_pages + 1)):
    # Query with state=all to get both open and closed issues
    query = f"issues?page={page}&per_page={per_page}&state=all"
    issues = requests.get(f"{base_url}/{owner}/{repo}/{query}")
    batch.extend(issues.json())

    if len(batch) > rate_limit and len(all_issues) < num_issues:
      all_issues.extend(batch)
      batch = [] # Flush batch for next time period
      print("Reached GitHub rate limit. Sleeping for one hour ...")
      time.sleep(60 * 60 + 1)
          
  all_issues.extend(batch)
  df = pd.DataFrame.from_records(all_issues)
  df.to_json(f"github-issues-{repo}.jsonl", orient="records", lines=True)

Now when you call `fetch_issues()`, it will download all the issues in batches to avoid exceeding GitHub's limit on the number of requests per hour. The results will be stored in a `github-issues-transformers.jsonl` file, where each line is a JSON object that represents one issue.
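
To make the file format concrete, here is a minimal sketch of reading JSON Lines with only the standard library. The two records and their field values are hypothetical, and the in-memory buffer merely stands in for the real file.

```python
import io
import json

# Hypothetical stand-in for github-issues-transformers.jsonl:
# one JSON object per line, with no enclosing list and no separating commas.
fake_file = io.StringIO(
    '{"number": 1, "title": "First issue", "state": "open"}\n'
    '{"number": 2, "title": "Second issue", "state": "closed"}\n'
)

# Each line is parsed independently, so the file can be processed
# one record at a time instead of being parsed as a single document.
issues = [json.loads(line) for line in fake_file if line.strip()]

print(len(issues))
print(issues[1]["title"])
```

Because every line is a complete JSON document, new issues can be appended to the file without rewriting what is already there.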

In [4]:
# fetch_issues()

Alternatively, you can download a prepared copy of the dataset directly.

In [None]:
!wget https://github.com/nlp-with-transformers/notebooks/raw/main/data/github-issues-transformers.jsonl

##Dataset

###Preparing the Data

In [6]:
df_issues = pd.read_json("github-issues-transformers.jsonl", lines=True)
print(f"DataFrame shape: {df_issues.shape}")

DataFrame shape: (9930, 26)


In [7]:
df_issues.head(1)

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,body,performed_via_github_app,pull_request
0,https://api.github.com/repos/huggingface/trans...,https://api.github.com/repos/huggingface/trans...,https://api.github.com/repos/huggingface/trans...,https://api.github.com/repos/huggingface/trans...,https://api.github.com/repos/huggingface/trans...,https://github.com/huggingface/transformers/is...,849568459,MDU6SXNzdWU4NDk1Njg0NTk=,11046,Potential incorrect application of layer norm ...,...,,0,2021-04-03 03:37:32,2021-04-03 03:37:32,NaT,NONE,,"In BlenderbotSmallDecoder, layer norm is appl...",,


By looking at a single row, we can see that the information retrieved from the GitHub API contains many fields.

Let's restrict the view to the columns we are interested in.

In [8]:
cols = ["url", "id", "title", "user", "labels", "state", "created_at", "body"]
df_issues.loc[2, cols].to_frame()

Unnamed: 0,2
url,https://api.github.com/repos/huggingface/trans...
id,849529761
title,[DeepSpeed] ZeRO stage 3 integration: getting ...
user,"{'login': 'stas00', 'id': 10676103, 'node_id':..."
labels,"[{'id': 2659267025, 'node_id': 'MDU6TGFiZWwyNj..."
state,open
created_at,2021-04-02 23:40:42
body,"**[This is not yet alive, preparing for the re..."


In [9]:
# The labels column is the thing that we’re interested in
df_issues.loc[2, cols]["labels"]

[{'color': '4D34F7',
  'default': False,
  'description': '',
  'id': 2659267025,
  'name': 'DeepSpeed',
  'node_id': 'MDU6TGFiZWwyNjU5MjY3MDI1',
  'url': 'https://api.github.com/repos/huggingface/transformers/labels/DeepSpeed'}]

In [10]:
# For our purposes, we’re only interested in the name field of each label object
df_issues["labels"] = (df_issues["labels"].apply(lambda x: [meta["name"] for meta in x]))
df_issues["labels"].head()

0             []
1             []
2    [DeepSpeed]
3             []
4             []
Name: labels, dtype: object
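
As a small illustration of what the lambda above does, the sketch below applies the same list comprehension to the label object shown earlier (abbreviated to a few fields) and to an issue without labels. The helper name `extract_names` is hypothetical and used only for this example.

```python
# One issue's raw `labels` value, abbreviated from the output shown above
raw_labels = [
    {
        "id": 2659267025,
        "name": "DeepSpeed",
        "color": "4D34F7",
        "default": False,
    }
]

def extract_names(label_objects):
    # Keep only the human-readable name of each label object
    return [meta["name"] for meta in label_objects]

names = extract_names(raw_labels)
no_names = extract_names([])  # an issue without labels stays an empty list

print(names)
print(no_names)
```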

In [11]:
# let's compute the length of each row to find the number of labels per issue
df_issues["labels"].apply(lambda x: len(x)).value_counts().to_frame().T

Unnamed: 0,0,1,2,3,4,5
labels,6440,3057,305,100,25,3


Next, let’s take a look at the most frequent labels in the dataset.

In [12]:
# we can do this by “exploding” the labels column so that each label in the list becomes a row
df_counts = df_issues["labels"].explode().value_counts()
print(f"Number of labels: {len(df_counts)}")

Number of labels: 65
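
If the effect of `explode()` followed by `value_counts()` is unclear, the following plain-Python sketch computes the same kind of tally with `collections.Counter`. The toy label lists are made up for illustration; flattening the nested lists plays the role of `explode()`, and issues with an empty list simply contribute nothing to the counts.

```python
from collections import Counter
from itertools import chain

# Toy per-issue label lists (hypothetical, not taken from the dataset)
labels_per_issue = [
    ["wontfix"],
    [],
    ["model card", "wontfix"],
    ["Usage"],
    ["wontfix", "Usage"],
]

# Flatten the nested lists so each label becomes its own item,
# then count how often each label occurs.
label_counts = Counter(chain.from_iterable(labels_per_issue))

print(len(label_counts))            # number of distinct labels
print(label_counts.most_common(2))  # most frequent labels first
```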


In [13]:
# Display the top-8 label categories
df_counts.to_frame().head(8).T

Unnamed: 0,wontfix,model card,Core: Tokenization,New model,Core: Modeling,Help wanted,Good First Issue,Usage
labels,2284,649,106,98,64,52,50,46


Let's filter the dataset down to the subset of labels that we’ll work with,
and standardize their names to make them easier to read.

In [14]:
label_map = {
  "Core: Tokenization": "tokenization",
  "New model": "new model",
  "Core: Modeling": "model training",
  "Usage": "usage",
  "Core: Pipeline": "pipeline",
  "TensorFlow": "tensorflow or tf",
  "PyTorch": "pytorch",
  "Examples": "examples",
  "Documentation": "documentation"
}

def filter_labels(x):
  return [label_map[label] for label in x if label in label_map]

In [15]:
df_issues["labels"] = df_issues["labels"].apply(filter_labels)
all_labels = list(label_map.values())
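
To see how the filter behaves, the sketch below applies the same logic as `filter_labels` to a few hand-written label lists. It uses a shortened copy of the mapping under the hypothetical names `demo_label_map` and `demo_filter_labels`, so it does not overwrite the objects defined above. Labels outside the map, such as `wontfix`, are dropped, so an issue carrying only unmapped labels ends up with an empty list.

```python
# Shortened copy of the mapping above, for illustration only
demo_label_map = {
    "Core: Tokenization": "tokenization",
    "New model": "new model",
    "Usage": "usage",
}

def demo_filter_labels(x):
    # Keep only mapped labels and replace them with their standardized names
    return [demo_label_map[label] for label in x if label in demo_label_map]

examples = [
    ["Core: Tokenization", "wontfix"],  # one mapped label, one unmapped
    ["wontfix"],                        # only unmapped labels
    ["New model", "Usage"],             # two mapped labels
]
filtered = [demo_filter_labels(labels) for labels in examples]

print(filtered)
```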

Now let’s look at the distribution of the new labels.

In [16]:
df_counts = df_issues["labels"].explode().value_counts()
df_counts.to_frame().T

Unnamed: 0,tokenization,new model,model training,usage,pipeline,tensorflow or tf,pytorch,documentation,examples
labels,106,98,64,46,42,41,37,28,24


Next, let’s create a new column that indicates whether the issue is unlabeled
or not.

In [17]:
df_issues["split"] = "unlabeled"
mask = df_issues["labels"].apply(lambda x: len(x)) > 0
df_issues.loc[mask, "split"] = "labeled"
df_issues["split"].value_counts().to_frame()

Unnamed: 0,split
unlabeled,9489
labeled,441
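
The counts above imply that only a small share of the issues carries one of our labels. A quick check of the arithmetic, using the two numbers from the output:

```python
# Counts copied from the value_counts() output above
n_unlabeled = 9489
n_labeled = 441

n_total = n_labeled + n_unlabeled
labeled_fraction = n_labeled / n_total

print(f"Total issues: {n_total}")
print(f"Labeled fraction: {labeled_fraction:.1%}")
```

The total matches the 9,930 rows of the DataFrame, and roughly 4.4% of the issues are labeled.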


Let’s now take a look at an example.

In [18]:
for column in ["title", "body", "labels"]:
  print(f"{column}: {df_issues[column].iloc[26][:500]}\n")

title: Add new CANINE model

body: # 🌟 New model addition

## Model description

Google recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:

> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually en

labels: ['new model']

