# Help BOBAI: More classification in an unknown language

<img src="https://drive.google.com/uc?id=19s4dROQxF9VBNyX9X77EdtChAV7GzYv-" width="750">

## Background
Last time you heard from Bob, he asked you to help him by building a classifier for a new unknown language. The client, Amoira, was happy with your solution, so Bob instructed his team to deploy the new model. After some heavy optimization and careful unit testing, the service was deployed and has been running smoothly since.

## Task

This very morning, Amoira returned with a request to extend the number of classes which the classifier can handle from 5 to 7. And this has to be done *today*!

Amoira has provided labeled data for the new classes. With more time, Bob could just use your earlier solution to train a new model on the union of the old and new data, right? The trouble is that the deployment of a new model is a complex process and cannot be done in a day, so the solution has to be built entirely around the model already deployed. Bob has once more come to you for help, as you know the task best.

What's more, Amoira's security concerns have grown even further with the addition of the new data, so they have requested that Bob not release the text in any form. What if someone managed to decrypt it! So Bob has provided you with a precomputed and cached encoding of all available data: the train and dev sets previously used for the 5-way classification, and the new data Amoira provided for the 2 additional classes. The encoding is the output of the pooling layer in mBERT, so it fits right into the classifier previously trained.

Your task is to build a solution for 7-way classification, while operating within the following constraints:

*   The solution can use the 5-way classifier, but cannot change the parameters of the classifier or add any new learned parameters.

*   You are allowed to compute averages and distances between the data encodings.

*   The solution should be reproducible in under 1 hour on an L4 GPU card.

*   The classifier has to perform inference on any random 500 data samples in under 2 minutes on an L4 GPU card.
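
One way to stay within these constraints is a nearest-centroid rule, which uses only averages and distances over the encodings. The block below is a minimal sketch under stated assumptions, not the reference solution: the function names `class_centroids` and `nearest_centroid_predict` are hypothetical, the data is synthetic, and NumPy is used only to keep the example dependency-light (the same operations exist in PyTorch).

```python
import numpy as np

def class_centroids(encodings, labels):
    """Average the encodings of each class; returns (sorted class ids, centroid matrix)."""
    classes = np.unique(labels)
    centroids = np.stack([encodings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(encodings, classes, centroids):
    """Assign each encoding to the class whose centroid is closest (Euclidean distance)."""
    # Pairwise distances, shape (n_samples, n_classes)
    dists = np.linalg.norm(encodings[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Synthetic stand-in for the 768-d encodings: 7 well-separated clusters (assumption)
rng = np.random.default_rng(0)
true_means = rng.normal(size=(7, 768)) * 5.0
toy_labels = np.repeat(np.arange(7), 20)
toy_encodings = true_means[toy_labels] + rng.normal(size=(140, 768))

classes, centroids = class_centroids(toy_encodings, toy_labels)
toy_preds = nearest_centroid_predict(toy_encodings, classes, centroids)
```

On the real data, the centroids could be restricted to the two new classes and combined with the frozen 5-way classifier. Whether that beats other distance-based schemes would need to be checked on the dev split.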

## Deliverables

You need to submit:

*   Working code that can be used to reproduce and test your best model.
  * In this Colab notebook.
  * Reproducing your best model means that starting from the baseline classifier, we should be able to arrive at your final best model by executing the cells of the notebook.
*   The predictions on the test data (released two hours before the end of the competition).

**You absolutely need to ensure that:**

(1) your notebook is executable from top to bottom

(2) that the `team_email_address` variable is set correctly

(3) that the notebook contains the full code needed to reproduce your model

(4) that it can run on an L4 GPU



## Prerequisites


In [None]:
# enter your team's official IOAI email address here, e.g. animal@ioai-official.org
team_email_address = "redhead.vulture@ioai-official.org"

# Data

In [None]:
!wget  --header="Authorization: Bearer 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"  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/train-dev_dataset_with_labels.pt

--2024-08-11 15:36:55--  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/train-dev_dataset_with_labels.pt
Resolving huggingface.co (huggingface.co)... 13.35.7.38, 13.35.7.81, 13.35.7.57, ...
Connecting to huggingface.co (huggingface.co)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/68/c8/68c8cc89a6a5a1c62c455ca968a354a75d060e2def0451e29ec774b11e8fb237/ffb352fac3c6eeccbfe6aad7edf2765d1b77a954f1db8266b67ff8e19385ccdb?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27train-dev_dataset_with_labels.pt%3B+filename%3D%22train-dev_dataset_with_labels.pt%22%3B&Expires=1723649353&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMzY0OTM1M319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzY4L2M4LzY4YzhjYzg5YTZhNWExYzYyYzQ1NWNhOTY4YTM1NGE3NWQwNjBlMmRlZjA0NTFlMjllYzc3NGIxMWU4ZmIyMzcvZmZiMzUyZmFjM

In [None]:
import torch

dataset = torch.load('train-dev_dataset_with_labels.pt')

inputs = dataset[:,:,:-1]
labels = dataset[:, :, -1]


In [None]:
inputs.shape

torch.Size([2473, 1, 768])

In [None]:
# prompt: count number of different labels and count occurrence of each of them

import torch

unique_labels, counts = torch.unique(labels, return_counts=True)
num_labels = unique_labels.numel()

print("Number of different labels:", num_labels)
print("Count of each label:")
for label, count in zip(unique_labels, counts):
    print(f"Label {label}: {count}")


Number of different labels: 7
Count of each label:
Label 0.0: 319
Label 1.0: 232
Label 2.0: 397
Label 3.0: 400
Label 4.0: 394
Label 5.0: 364
Label 6.0: 367


In [None]:
# prompt: split train test with inputs and labels

from sklearn.model_selection import train_test_split

train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    inputs, labels, test_size=0.2, random_state=42
)

# Solution

Below you will find a very naive baseline solution: given an input vector, we either randomly assign one of the new labels (5 and 6) with uniform probability over a 7-way classification, or use the base classifier to make a prediction.

You can replace the code below with your solution.
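
The naive baseline described above can be sketched as follows. This is one reading of the description, not the organizers' code: a label is drawn uniformly from the 7 classes, kept if it is 5 or 6, and otherwise replaced by the frozen 5-way prediction. The function name `naive_seven_way` and the toy weights are hypothetical stand-ins. In the notebook, the weights would come from `base_classifier.pth`.

```python
import random
import numpy as np

def naive_seven_way(x, weight, bias, rng):
    """Sketch of the naive baseline: random new label or frozen 5-way prediction."""
    # Draw a label uniformly from the 7 classes; keep it only if it is a new class
    draw = rng.randrange(7)
    if draw >= 5:
        return draw
    # Otherwise defer to the frozen 5-way linear classifier (argmax of the logits)
    logits = weight @ x + bias
    return int(np.argmax(logits))

# Toy stand-ins for the real parameters and encodings (assumptions, not the real data)
np_rng = np.random.default_rng(0)
toy_weight = np_rng.normal(size=(5, 768))
toy_bias = np_rng.normal(size=5)
toy_inputs = np_rng.normal(size=(200, 768))

py_rng = random.Random(0)
toy_preds = [naive_seven_way(x, toy_weight, toy_bias, py_rng) for x in toy_inputs]
```

Because the new labels are assigned without looking at the input, this baseline can only be right about classes 5 and 6 by chance. Any scheme that uses the encodings of the new data should improve on it.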

In [None]:
# download the base 5-way classifier
!wget --header="Authorization: Bearer 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" https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/base_classifier.pth

--2024-08-11 15:37:00--  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/base_classifier.pth
Resolving huggingface.co (huggingface.co)... 13.35.7.38, 13.35.7.81, 13.35.7.57, ...
Connecting to huggingface.co (huggingface.co)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/68/c8/68c8cc89a6a5a1c62c455ca968a354a75d060e2def0451e29ec774b11e8fb237/6144fa2a448b19fceb864cdf203e72eb785540a68541ed571d48dd00d7d161f7?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27base_classifier.pth%3B+filename%3D%22base_classifier.pth%22%3B&Expires=1723649502&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMzY0OTUwMn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzY4L2M4LzY4YzhjYzg5YTZhNWExYzYyYzQ1NWNhOTY4YTM1NGE3NWQwNjBlMmRlZjA0NTFlMjllYzc3NGIxMWU4ZmIyMzcvNjE0NGZhMmE0NDhiMTlmY2ViODY0Y2RmMjAzZTcyZWI3ODU1NDBh

In [None]:
# prompt: get the ones with label bigger than 4 into another array and train a KNN on it

# Get the indices of samples with labels greater than 4
indices_greater_than_4 = (train_labels > 4).nonzero()

# Extract the corresponding inputs and labels
train_inputs_greater_than_4 = train_inputs[indices_greater_than_4[:, 0], indices_greater_than_4[:, 1]]
train_labels_greater_than_4 = train_labels[indices_greater_than_4[:, 0], indices_greater_than_4[:, 1]] - 5  # Adjust labels to start from 0

In [None]:
train_inputs_greater_than_4.shape, train_inputs.shape

(torch.Size([582, 768]), torch.Size([1978, 1, 768]))

In [None]:
# prompt: at first create new labels where labels lower than 5 are equal to 0 and labels equal to 5 are 1 and labels equal to 6 are 2
# then build a KNN from new labels
# and remember to use all these inputs and labels from train

# Create new labels
new_train_labels = torch.where(train_labels < 5, 0, torch.where(train_labels == 5, 1, 2))

In [None]:
# prompt: use standard scaler from sklearn to normalize train_input_reshape

import torch
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Flatten the (N, 1, 768) inputs to (N, 768) for KNN
train_inputs_reshaped = train_inputs.reshape(train_inputs.shape[0] * train_inputs.shape[1], train_inputs.shape[2])
new_train_labels_reshaped = new_train_labels.reshape(-1)

# Optional normalization (disabled: the KNN below is fit on the raw encodings)
#scaler = StandardScaler()
#train_inputs_reshaped_normalized = scaler.fit_transform(train_inputs_reshaped)

# Fit a distance-weighted KNN on the 3-way labels (0: original classes, 1: class 5, 2: class 6)
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')  # Adjust n_neighbors as needed
knn.fit(train_inputs_reshaped, new_train_labels_reshaped)

In [None]:
train_inputs_reshaped.shape, train_inputs_reshaped.shape

(torch.Size([1978, 768]), torch.Size([1978, 768]))

In [None]:
# .float() casts an existing tensor without the copy warning raised by torch.tensor(tensor)
train_inputs_reshaped = train_inputs_reshaped.float()
train_inputs_reshaped = train_inputs_reshaped.unsqueeze(1)


In [None]:
train_inputs_reshaped.shape

torch.Size([1978, 1, 768])

In [None]:
import torch
import random

class SevenWayClassifier():
  def __init__(self):
    base_clf = torch.nn.Linear(in_features=768, out_features=5, bias=True)
    base_clf.load_state_dict(torch.load("base_classifier.pth"))
    self.base_clf = base_clf

  def base_classification(self, input_vector):

    with torch.no_grad():
      logits = self.base_clf(input_vector)
      preds = torch.softmax(logits, 1)
      predicted_class = preds.argmax(dim=1).numpy()[0]

    return predicted_class

  def get_preds(self, input_vector):
    with torch.no_grad():
      logits = self.base_clf(input_vector)
      preds = torch.softmax(logits, 1)
    return preds

  def __call__(self, input_vector):
    # The KNN routes the sample: 0 = one of the original classes, 1 = class 5, 2 = class 6
    which = knn.predict(input_vector)[0]
    if which:
      return which + 4
    # Original classes are handled by the frozen 5-way classifier
    return self.base_classification(input_vector)

clf = SevenWayClassifier()

# Inference and Evaluation

In [None]:
from sklearn.metrics import f1_score

def compute_f1(labels, predictions):
  return f1_score(labels, predictions, average='macro')

In [None]:
from tqdm import tqdm

def inference(clf, input_vectors):
  predictions = []
  # Flatten to (N, 768), cast to float32, then restore the (N, 1, 768) layout
  input_vectors = input_vectors.reshape(input_vectors.shape[0] * input_vectors.shape[1], input_vectors.shape[2])
  input_vectors = input_vectors.float()
  input_vectors = input_vectors.unsqueeze(1)
  for sample in tqdm(input_vectors):
    predictions.append(clf(sample))
  return predictions

In [None]:
train_inputs.shape, train_inputs_reshaped.shape

(torch.Size([1978, 1, 768]), torch.Size([1978, 1, 768]))

In [None]:
predictions = inference(clf, test_inputs)

f1 = compute_f1(test_labels, predictions)
print('\nNaive solution F1', f1)

100%|██████████| 495/495 [00:02<00:00, 204.54it/s]


Naive solution F1 0.8547467107724406





# Leader board

In [None]:
# The leaderboard may or may not work. If it doesn't, forgive us; we will try to get it running.

import pandas as pd
import numpy as np

# 30% of the test data
!wget  --header="Authorization: Bearer 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"  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/eval_dataset.pt

def submission_to_csv(pred: np.ndarray, output_fpath: str = "submission.csv"):
    pred = np.array(pred).flatten()
    data_size = pred.size
    df = pd.DataFrame({
        "ID": np.arange(data_size),
        "class": pred
    })

    df.to_csv(output_fpath, index=False)

eval_inputs = torch.load('eval_dataset.pt')

eval_predictions = inference(clf, eval_inputs)

submission_to_csv(eval_predictions)

--2024-08-11 15:37:04--  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site/resolve/main/eval_dataset.pt
Resolving huggingface.co (huggingface.co)... 13.35.7.38, 13.35.7.81, 13.35.7.57, ...
Connecting to huggingface.co (huggingface.co)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/68/c8/68c8cc89a6a5a1c62c455ca968a354a75d060e2def0451e29ec774b11e8fb237/36464fd7dad081a5aa6a08ea4ab26977d79c85d30eaea74722a618f7b1cf1917?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27eval_dataset.pt%3B+filename%3D%22eval_dataset.pt%22%3B&Expires=1723649052&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMzY0OTA1Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzY4L2M4LzY4YzhjYzg5YTZhNWExYzYyYzQ1NWNhOTY4YTM1NGE3NWQwNjBlMmRlZjA0NTFlMjllYzc3NGIxMWU4ZmIyMzcvMzY0NjRmZDdkYWQwODFhNWFhNmEwOGVhNGFiMjY5NzdkNzljODVkMzBlYWVhNzQ3

100%|██████████| 200/200 [00:01<00:00, 103.97it/s]


# Testing

In [None]:
# DO NOT CHANGE THIS CELL

# this download link will not work until two hours before the end of the competition
!wget https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site_test/resolve/main/test_dataset.pt

test_inputs = torch.load('test_dataset.pt')

split='test'

test_predictions = inference(clf, test_inputs)

with open('{}_predictions.txt'.format(team_email_address), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in test_predictions]))

--2024-08-11 15:37:06--  https://huggingface.co/datasets/InternationalOlympiadAI/NLP_problem_on-site_test/resolve/main/test_dataset.pt
Resolving huggingface.co (huggingface.co)... 13.35.7.38, 13.35.7.81, 13.35.7.57, ...
Connecting to huggingface.co (huggingface.co)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/da/93/da93fdbe6f0874d8a4a7afec04acc74c14ed81f9d6a1f1c7340d4848e15fa3c6/8d766a4a5c3a570eec5ccdbe755d872c724c5fffc81a3d90323efc54ea2889eb?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27test_dataset.pt%3B+filename%3D%22test_dataset.pt%22%3B&Expires=1723647644&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMzY0NzY0NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2RhLzkzL2RhOTNmZGJlNmYwODc0ZDhhNGE3YWZlYzA0YWNjNzRjMTRlZDgxZjlkNmExZjFjNzM0MGQ0ODQ4ZTE1ZmEzYzYvOGQ3NjZhNGE1YzNhNTcwZWVjNWNjZGJlNzU1ZDg3MmM3MjRjNWZmZmM4MWE

100%|██████████| 700/700 [00:05<00:00, 119.90it/s]
