# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies that have to process large volumes of text in their daily tasks. Bobai serves companies from all over the world and prides itself on its ability to handle a variety of languages, from English through Arabic to Mandarin. The secret to Bobai's success is that all of its products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is highly optimized for this specific language encoder, which makes its products fast and efficient, and therefore very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus of raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targeting.

Bob has found out that his team currently has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already supports, so you can start by checking how well this solution does and then modify it to obtain better results. You should not waste any effort on trying to decrypt the data: this will not help you build a better classifier, and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours on an L4 GPU, as the company's compute resources are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).
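
The inference constraint above can be checked with a simple wall-clock timer. The following is a minimal sketch, not part of the baseline: `within_time_budget` and `dummy_predict` are hypothetical names, and the stand-in predictor would be replaced by a real function that maps a list of texts to a list of class ids.

```python
import time

def within_time_budget(predict_fn, samples, limit_seconds=300.0):
    """Run predict_fn on samples and report whether it finished within the limit."""
    start = time.perf_counter()
    outputs = predict_fn(samples)
    elapsed = time.perf_counter() - start
    return outputs, elapsed, elapsed <= limit_seconds

# Hypothetical stand-in predictor: always predicts class 0.
def dummy_predict(texts):
    return [0 for _ in texts]

# 500 samples and a 300-second limit mirror the 5-minute constraint.
samples = ["placeholder text"] * 500
outputs, elapsed, ok = within_time_budget(dummy_predict, samples)
print(f"{len(outputs)} predictions in {elapsed:.3f}s, within budget: {ok}")
```

Timing should be measured on the same hardware class as the target (an L4 GPU here), since the result on another machine says little about the real budget.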

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
    *   Saved as a text file in the format shown at the bottom of the notebook.
*   Your best trained model.
    *   As a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
    *   In this Colab notebook.


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [1]:
from google.colab import userdata

# Read the tokens from Colab Secrets instead of hard-coding them in the notebook
read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

### Dependencies

In [2]:
import importlib
import torch, transformers


!pip install datasets==2.18.0
!pip install evaluate==0.4.2 tokenizers
!pip install accelerate -U


If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

# Data

In [4]:
from transformers import set_seed

set_seed(42)
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch

    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
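    torch.backends.cudnn.deterministic = True
    # benchmark mode lets cuDNN auto-tune kernels, which can vary between runs; keep it off for reproducibility
    torch.backends.cudnn.benchmark = False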

seed_everything(42)

In [5]:
# load the data
from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)
test = load_dataset("InternationalOlympiadAI/NLP_problem_test")['test']['text']

Downloading readme:   0%|          | 0.00/397 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 126k/126k [00:00<00:00, 458kB/s]
Downloading data: 100%|██████████| 19.4k/19.4k [00:00<00:00, 177kB/s]


Generating train split:   0%|          | 0/1524 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/218 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/281 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 90.6M/90.6M [00:00<00:00, 239MB/s]


Generating train split:   0%|          | 0/611245 [00:00<?, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/264 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 36.0k/36.0k [00:00<00:00, 150kB/s]


Generating test split:   0%|          | 0/438 [00:00<?, ? examples/s]

# Unsupervised Training

In [6]:
from tokenizers import SentencePieceUnigramTokenizer

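batch_size = 1000  # number of raw-text rows fed to the tokenizer trainer at a time

tokenizer = SentencePieceUnigramTokenizer()


def batch_iterator():
    # Yield the raw corpus in slices instead of loading the whole text column at once
    for i in range(0, len(raw_text["train"]), batch_size):
        yield raw_text["train"][i : i + batch_size]["text"]

# Train a unigram vocabulary of 32,768 pieces on language X, reserving the BERT special tokens
tokenizer.train_from_iterator(batch_iterator(), vocab_size=32768, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"], unk_token="[UNK]")
# Save the trained vocabulary to ./esperberto-unigram.json
tokenizer.save_model(".", "esperberto")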

['./esperberto-unigram.json']

In [7]:
import json
from tokenizers import SentencePieceUnigramTokenizer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

with open('esperberto-unigram.json', encoding='utf-8-sig') as f:
    json_file = json.load(f)
    vocab = json_file['vocab']
    for idx, v in enumerate(vocab):
        vocab[idx] = tuple(v)

tokenizer = SentencePieceUnigramTokenizer(vocab)

# '[CLS] SENTENCE [SEP]' format
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ("[MASK]", tokenizer.token_to_id("[MASK]"))
    ],
)

tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

In [8]:
tokenizer.mask_token = "[MASK]"
tokenizer.pad_token = "[PAD]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.unk_token = "[UNK]"

In [9]:
tokenizer.push_to_hub("NLP_tokeniser2", token=write_access_token)

CommitInfo(commit_url='https://huggingface.co/ntuteama/NLP_tokeniser2/commit/7f29219442acec80c8e885fe104a4b56fafac26f', commit_message='Upload tokenizer', commit_description='', oid='7f29219442acec80c8e885fe104a4b56fafac26f', pr_url=None, pr_revision=None, pr_num=None)

In [10]:
from transformers import BertForMaskedLM, TrainingArguments, Trainer, AutoModelForSequenceClassification
from transformers import BertConfig

model = BertForMaskedLM.from_pretrained(
    "google-bert/bert-base-multilingual-uncased"
)

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding, DataCollatorForLanguageModeling

def preprocess_function(examples):
    result = tokenizer(examples["text"], truncation=True)
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/1524 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/218 [00:00<?, ? examples/s]

In [12]:
tokenized_raw = raw_text.map(
    preprocess_function, batched=True, remove_columns=["text"]
)

chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

data_collator2 = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.2)

Map:   0%|          | 0/611245 [00:00<?, ? examples/s]
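
The `group_texts` function above concatenates the tokenized texts and cuts them into fixed-length chunks, discarding whatever does not fill a final chunk. A toy illustration of that logic on plain lists (`chunk_token_ids` is a hypothetical helper, and the token ids are made up):

```python
def chunk_token_ids(sequences, chunk_size):
    """Concatenate token-id lists and cut them into fixed-size chunks, dropping the remainder."""
    flat = [tok for seq in sequences for tok in seq]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]

# Three made-up "texts" with 10 token ids in total.
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
chunks = chunk_token_ids(toy, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because the notebook applies `group_texts` with `batched=True`, the remainder is dropped once per batch of examples passed to `map`, not once for the whole corpus.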

In [13]:
grouped_raw = tokenized_raw.map(group_texts, batched=True)

Map:   0%|          | 0/611245 [00:00<?, ? examples/s]
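
The masked-language-modeling collator configured above with `mlm_probability=0.2` selects roughly 20% of the non-special tokens as prediction targets. The sketch below imitates that idea in plain Python. It assumes the commonly used 80/10/10 split (mask token, random token, unchanged) and made-up ids (`[CLS]`=2, `[SEP]`=3, `[MASK]`=4); `mask_tokens` is a hypothetical helper, not the library's implementation.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.2, rng=None):
    """Return (masked_ids, labels); labels are -100 wherever no prediction is required."""
    rng = rng or random.Random(0)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never select special tokens; select other tokens with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                              # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                      # assumed 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)    # assumed 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

# Made-up ids: 2 = [CLS], 3 = [SEP], 4 = [MASK] (assumptions for illustration only).
ids = [2, 11, 12, 13, 14, 15, 16, 17, 18, 3]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=32768, special_ids={2, 3})
print(masked)
print(labels)
```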

In [None]:
epochs = 25
batch_size = 64

model_name = "masked_model"
training_args = TrainingArguments(
    output_dir=f"{model_name}",
    overwrite_output_dir=True,
    eval_strategy="steps",
    eval_steps=round(len(grouped_raw['train'])/batch_size),
    learning_rate=0.00025,
    warmup_steps=800,
    weight_decay=0.01,
    max_grad_norm=1.0,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    save_strategy="steps",
    save_steps=round(len(grouped_raw['train'])/batch_size) * epochs,
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_strategy="every_save",
    hub_token=write_access_token,
    hub_model_id='masked_model',
    report_to="none",
    num_train_epochs=epochs,
    lr_scheduler_type="linear",
    bf16=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=grouped_raw["train"],
    # Placeholder eval set: a single training chunk, so the reported eval loss is only a smoke test
    eval_dataset=grouped_raw["train"].select([0]),
    data_collator=data_collator2,
    tokenizer=tokenizer,
)
trainer.train()
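
The training arguments above combine `warmup_steps=800` with `lr_scheduler_type="linear"` and a peak learning rate of 0.00025. A minimal sketch of how such a schedule behaves is shown below. The total step count is hypothetical, since it depends on the size of the grouped corpus, and `linear_warmup_lr` is an illustrative helper rather than the library's scheduler.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 10_000  # hypothetical; the real value is steps per epoch * number of epochs
for step in (0, 400, 800, 5_400, 10_000):
    print(step, linear_warmup_lr(step, base_lr=0.00025, warmup_steps=800, total_steps=total_steps))
```

The learning rate reaches its peak exactly at the end of warmup and falls to half the peak midway through the decay phase.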


# Supervised

In [15]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]
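
The metric above is macro-averaged F1: an F1 score is computed per class and the per-class scores are averaged with equal weight, so rare classes count as much as frequent ones. A small self-contained sketch of the computation follows (`macro_f1` is a hypothetical helper and the labels are made up):

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over all classes seen in predictions or references."""
    classes = sorted(set(predictions) | set(references))
    scores = []
    for c in classes:
        tp = sum(p == c and r == c for p, r in zip(predictions, references))
        fp = sum(p == c and r != c for p, r in zip(predictions, references))
        fn = sum(p != c and r == c for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

references = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 2, 2, 2]
score = macro_f1(predictions, references)
print(round(score, 3))  # 0.822 (per-class F1: 1.0, 0.667, 0.8)
```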

In [17]:
from transformers import BertForMaskedLM, TrainingArguments, Trainer, AutoModelForSequenceClassification
from transformers import BertConfig, BertForSequenceClassification
import torch.nn as nn

def model_init():
  model = AutoModelForSequenceClassification.from_pretrained(
    "ntuteama/masked_model", num_labels=5, token=read_access_token
  )
  return model

In [None]:
training_args = TrainingArguments(
    output_dir="NLP_final",
    learning_rate=4.676339096688447e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=40,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=96,
    save_strategy="steps",
    save_steps=2400,
    save_total_limit=2,
    metric_for_best_model='f1',
    load_best_model_at_end=False,
    push_to_hub=True,
    hub_strategy="checkpoint",
    hub_token=write_access_token,
    hub_private_repo=True,
    hub_model_id='NLP_final2',
    report_to="none",
    lr_scheduler_type='cosine',
    # warmup_steps=100
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

In [20]:
model = AutoModelForSequenceClassification.from_pretrained("./NLP_final/checkpoint-2400")
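
To see which point in training `checkpoint-2400` corresponds to, the step arithmetic can be worked out from the values in this notebook. The sketch assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_examples = 1524   # size of the labeled train split, from the loading output
batch_size = 16         # per_device_train_batch_size
epochs = 40             # num_train_epochs
save_steps = 2400       # save_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
checkpoint_epoch = save_steps / steps_per_epoch

print(steps_per_epoch)   # 96
print(total_steps)       # 3840
print(checkpoint_epoch)  # 25.0
```

Under these assumptions, `eval_steps=96` amounts to one evaluation per epoch, and the checkpoint loaded above holds the weights after 25 of the 40 epochs.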

In [21]:
model.push_to_hub('NLP_final2', token=write_access_token)

model.safetensors:   0%|          | 0.00/669M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/ntuteama/NLP_final2/commit/490a4f86e6d5a47c7d26612db6f0202baacbaf5f', commit_message='Upload BertForSequenceClassification', commit_description='', oid='490a4f86e6d5a47c7d26612db6f0202baacbaf5f', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [22]:
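# run the trained model on a dev/test split
# note: `trainer` still holds the weights from the last training step,
# not the checkpoint-2400 model that was loaded and pushed above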
data_split = "dev"
eval_out = trainer.predict(tokenized_data[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')

In [None]:
dev_f1

# Testing

In [24]:
# UPDATE THIS CELL ACCORDINGLY

# define a function to load your tokenizer and model from a HF path
# the path variables can be strings or lists of strings (for ensemble solutions)
def load_model(path_to_tokenizer, path_to_model, token):
  # Example:
  tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer, token=token)
  model = AutoModelForSequenceClassification.from_pretrained(path_to_model, token=token)
  model.eval()

  return tokenizer, model

# define a "predict" function that takes the model and a list of input strings
# and returns the outputs as a list of integer classes
def predict(tokenizer, model, input_texts):
  #Example:
  predictions = []
  for input_text in input_texts:

    input_ids = tokenizer(input_text, return_tensors="pt")

    with torch.no_grad():
      logits = model(**input_ids).logits

    predictions.append(logits.argmax().item())

  return predictions

# set variables
path_to_model = "ntuteama/NLP_final" # can be a list instead
path_to_tokenizer = "ntuteama/NLP_tokeniser" # can be a list instead
model_access_token = read_access_token # a fine-grained token with read rights for your model repository


In [25]:
# DO NOT CHANGE THIS CELL!!!

tokenizer, model = load_model(path_to_tokenizer, path_to_model, token=model_access_token)

test_data = load_dataset("InternationalOlympiadAI/NLP_problem_test")['test']['text']

predictions = predict(tokenizer, model, test_data)

with open('{}_predictions.txt'.format('test'), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions]))

tokenizer_config.json:   0%|          | 0.00/298 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.43M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/669M [00:00<?, ?B/s]