## 1. Introduction

In this notebook we will work on the presidents dataset using embeddings from a pre-trained mode.

Our first approach will be to use the pre-trained model only for the embeddings, then use a classifier from sklearn for the actual task. We will be able to run this on CPU.

Our second approach will be to fine-tune the pre-trained model by integrating it into a larger architecture and training that.

**NOTE:** The first approach should be easily implemented. Need to think about how to treat class imbalance. We have 2 approaches as seen in the previous notebook: 1. ignore it; 2. under-sample by choosing representative entries (using k-means for example). We saw that with our previous encodings the under-sampling wasn't good. We would hope that the embeddings given by a pre-trained model will give good cluster centroids.

In [None]:
import codecs
import re

def load_pres(fname):
    alltxts = []
    alllabs = []
    s=codecs.open(fname, 'r','utf-8') # pour régler le codage
    while True:
        txt = s.readline()
        if(len(txt))<5:
            break
        #
        lab = re.sub(r"<[0-9]*:[0-9]*:(.)>.*","\\1",txt)
        txt = re.sub(r"<[0-9]*:[0-9]*:.>(.*)","\\1",txt)
        if lab.count('M') >0:
            alllabs.append(-1)
        else:
            alllabs.append(1)
        alltxts.append(txt)
    return alltxts,alllabs

fname = "./drive/MyDrive/Colab_Projects/RITAL/datasets/AFDpresidentutf8/corpus.tache1.learn.utf8.txt"
pres_alltxts, pres_alllabs = load_pres(fname)

## 2. First approach: embedding + simple classifier

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np

model_name = "camembert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [3]:
# create a train-test split before embedding
# otherwise we will have data leakage
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

x_train, x_test, y_train, y_test = train_test_split(pres_alltxts, pres_alllabs, test_size=0.2, stratify=pres_alllabs, random_state=42)

In [4]:
def get_batch_embeddings(batch_texts):
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {key: value.to("cuda") for key, value in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].cpu().numpy()

batch_size = 32
dataloader_train = DataLoader(x_train, batch_size=batch_size, shuffle=False)
dataloader_test = DataLoader(x_test, batch_size=batch_size, shuffle=False)

embeddings_train = []
for batch in tqdm(dataloader_train, desc="Generating Train Embeddings with GPU"):
    batch_embeddings = get_batch_embeddings(batch)
    embeddings_train.append(batch_embeddings)

x_train_embedded = np.vstack(embeddings_train)
y_train = np.array(y_train)

embeddings_test = []
for batch in tqdm(dataloader_test, desc="Generating Test Embeddings with GPU"):
    batch_embeddings = get_batch_embeddings(batch)
    embeddings_test.append(batch_embeddings)

x_test_embedded = np.vstack(embeddings_test)
y_test = np.array(y_test)

logreg = LogisticRegression(max_iter=10000)
logreg.fit(x_train_embedded, y_train)
y_pred = logreg.predict(x_test_embedded)
print(classification_report(y_test, y_pred))

Generating Train Embeddings with GPU: 100%|██████████| 1436/1436 [00:52<00:00, 27.27it/s]
Generating Test Embeddings with GPU: 100%|██████████| 359/359 [00:13<00:00, 27.12it/s]


              precision    recall  f1-score   support

          -1       0.74      0.39      0.51      1505
           1       0.91      0.98      0.95      9978

    accuracy                           0.90     11483
   macro avg       0.83      0.68      0.73     11483
weighted avg       0.89      0.90      0.89     11483



## 3. Running on test data / creating submission

## 4. Under-sampling

We propose two approaches to under-sampling the majority class:
1. random
2. clustering and identifying representatives

In [5]:
# 1st approach
# use logistic regression as a classifier
m = np.where(y_train == -1)[0]
c = np.where(y_train == 1)[0]

c_undersampled = np.random.choice(c, size=len(m), replace=False)

x_train_undersampled = np.vstack((x_train_embedded[m], x_train_embedded[c_undersampled]))
y_train_undersampled = np.hstack((y_train[m], y_train[c_undersampled]))

logreg_under = LogisticRegression(max_iter=10000)
logreg_under.fit(x_train_undersampled, y_train_undersampled)
y_pred_under = logreg_under.predict(x_test_embedded)
print(classification_report(y_test, y_pred_under))

              precision    recall  f1-score   support

          -1       0.38      0.78      0.51      1505
           1       0.96      0.81      0.88      9978

    accuracy                           0.80     11483
   macro avg       0.67      0.79      0.69     11483
weighted avg       0.88      0.80      0.83     11483



In [6]:
# 2nd approach
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

n_cluster = 7523
pca = PCA(n_components=25)
embeddings_pca = pca.fit_transform(x_train_embedded[c])
kmeans = KMeans(n_clusters=n_cluster, random_state=42).fit(embeddings_pca)
cluster_centers = kmeans.cluster_centers_

# find dataset points closest to cluster centers
distances = np.linalg.norm(embeddings_pca[:, None, :] - cluster_centers[None, :, :], axis=-1)
closest_indices = np.argmin(distances, axis=1)

c_x_train_undersampled = x_train_embedded[c][closest_indices]
m_x_train_undersampled = x_train_embedded[m]

x_train_undersampled = np.vstack((m_x_train_undersampled, c_x_train_undersampled))
y_train_undersampled = np.hstack((y_train[m], y_train[c][closest_indices]))

logreg_cluster = LogisticRegression(max_iter=10000)
logreg_cluster.fit(x_train_undersampled, y_train_undersampled)
y_pred_cluster = logreg_cluster.predict(x_test_embedded)
print(classification_report(y_test, y_pred_cluster))

              precision    recall  f1-score   support

          -1       0.70      0.41      0.51      1505
           1       0.92      0.97      0.94      9978

    accuracy                           0.90     11483
   macro avg       0.81      0.69      0.73     11483
weighted avg       0.89      0.90      0.89     11483



## 5. Fine-tuning BERT

Fine-tuning a model is a fairly automated process (I hope) as there are library interfaces for all necessary functions.

The one thing we will pay close attention to is our train-test split. In a separate notebook we remarked on the structure present in the training dataset. We will try to keep this structure in how we feed our data.

More specifically we will:
1. split at the chunk level
2. feed chunks to the model
3. over-sample the minority class
4. alternate between speakers in training

In [7]:
! pip install datasets



In [8]:
# load the chunks
import json

with open("./drive/MyDrive/Colab_Projects/RITAL/chunks/presidents_M.json", "r") as f:
    chunks_M = json.load(f)

with open("./drive/MyDrive/Colab_Projects/RITAL/chunks/presidents_C.json", "r") as f:
    chunks_C = json.load(f)

# 1. do a train-test split on the chunks
# 2. over-sample M by tripling it
# 3. organize training dataset as: half M, half C, half M, half C
# 4. train with this version of the corpus
# IMPORTANT: check in the beginning how many chunks of each class we have, and how many sentences this corresponds to

m_train, m_test = train_test_split(list(chunks_M.keys()), test_size=0.2, random_state=42)
c_train, c_test = train_test_split(list(chunks_C.keys()), test_size=0.2, random_state=42)

m_train = m_train * 3
m_test = m_test * 3

print("Number of M chunks: ", len(m_train))
print("Number of C chunks: ", len(c_train))

sentences_m = 0
sentences_c = 0
for chunk in m_train:
  sentences_m += chunks_M[chunk]
for chunk in c_train:
  sentences_c += chunks_C[chunk]

print("Number of M sentences: ", sentences_m)
print("Number of C sentences: ", sentences_c)

m = len(m_train)
c = len(c_train)
chunks_train = m_train[:m // 2] + c_train[:c // 2] + m_train[m // 2:] + c_train[c // 2:]
chunks_test = m_test + c_test

x_train = []
y_train = []
x_test = []
y_test = []

for chunk in chunks_train:
  if chunk in chunks_M:
    length = chunks_M[chunk]
  else:
    length = chunks_C[chunk]
  chunk = int(chunk)
  for i in range(chunk, chunk + length):
    x_train.append(pres_alltxts[i])
    y_train.append((pres_alllabs[i] + 1) // 2)

for chunk in chunks_test:
  if chunk in chunks_M:
    length = chunks_M[chunk]
  else:
    length = chunks_C[chunk]
  chunk = int(chunk)
  for i in range(chunk, chunk + length):
    x_test.append(pres_alltxts[i])
    y_test.append((pres_alllabs[i] + 1) // 2)

x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

from datasets import Dataset

train_dataset = Dataset.from_dict({"text": x_train, "label": y_train})
test_dataset = Dataset.from_dict({"text": x_test, "label": y_test})

Number of M chunks:  960
Number of C chunks:  789
Number of M sentences:  18117
Number of C sentences:  39760
(57877,)
(57877,)
(14582,)
(14582,)


In [9]:
# now let's fine-tune a model
from transformers import CamembertTokenizer, CamembertForSequenceClassification
import torch

model_name = "camembert-base"
tokenizer = CamembertTokenizer.from_pretrained(model_name)
model = CamembertForSequenceClassification.from_pretrained(model_name, num_labels=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: cuda


CamembertForSequenceClassification(
  (roberta): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0-11): 12 x CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias

In [10]:
! pip install evaluate



In [11]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
from transformers import TrainingArguments, Trainer
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return {"accuracy": accuracy.compute(predictions=predictions, references=labels),
          "f1": f1.compute(predictions=predictions, references=labels)}

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    no_cuda=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=None,
)

Map:   0%|          | 0/57877 [00:00<?, ? examples/s]

Map:   0%|          | 0/14582 [00:00<?, ? examples/s]

  trainer = Trainer(


In [None]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mtudor-enache7[0m ([33mtudor-enache7-sorbonne-universit-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss


In [None]:
# run model on testing dataset
from sklearn.metrics import classification_report

import torch
import torch.nn.functional as F
from torch import autocast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
batch_size = 8

x_test = list(x_test)
y_pred = []
for i in range(0, len(x_test), batch_size):
    batch_texts = x_test[i:i+batch_size]
    inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = F.softmax(logits, dim=-1).cpu().numpy()
    y_pred.extend(probs)

y_pred = np.array(y_pred)
y_pred = np.argmax(y_pred, axis=1)
print(classification_report(y_test, y_pred))

In [None]:
# train model on whole dataset following the same training pattern
m = list(chunks_M.keys())
c = list(chunks_C.keys())
m = m * 3

len_m = len(m)
len_c = len(c)
chunks = m[:len_m // 2] + c[:len_c // 2] + m[len_m // 2:] + c[len_c // 2:]

x = []
y = []
for chunk in chunks:
  if chunk in chunks_M:
    length = chunks_M[chunk]
  else:
    length = chunks_C[chunk]
  chunk = int(chunk)
  for i in range(chunk, chunk + length):
    x.append(pres_alltxts[i])
    y.append((pres_alllabs[i] + 1) // 2)
x = np.array(x)
y = np.array(y)

dataset = Dataset.from_dict({"text": x, "label": y})
tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=None,
)

trainer.train()

In [None]:
# use this model to predict classes for the testing dataset
fname = "./drive/MyDrive/Colab_Projects/RITAL/datasets/AFDpresidentutf8/corpus.tache1.test.utf8.txt"
test, _ = load_pres(fname)

tokenized_test = tokenizer(test, padding=True, truncation=True, return_tensors="pt")
tokenized_test = {k: v.to(device) for k, v in tokenized_test.items()}

model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
batch_size = 8

y_pred = []
for i in range(0, len(test), batch_size):
    batch_texts = test[i:i+batch_size]
    inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = F.softmax(logits, dim=-1).cpu().numpy()
    y_pred.extend(probs)

y_pred = np.array(y_pred)

In [None]:
print(y_pred.shape)
print(y_pred[:, 0].shape)
np.save("./drive/MyDrive/Colab_Projects/RITAL/predictions/final.npy", y_pred[:, 0])

## 6. Comparison of the three models

Comparing the three models would have required creating the same train-test split for all of them. This would have required a bit more foresight on our part.

Part of the reason we did not structure it this way is that the first BERT models where built when we still had not spent time understanding the paragraph structure of the dataset, while the BERT fine-tune was built with this in mind.

Still, we keep the same split ration, so we can comment on our results.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
from matplotlib.pyplot import plt

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)

plt.figure()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve of BERT')
plt.legend()
plt.savefig("drive/MyDrive/Colab_Projects/RITAL/plots/roc_curve_bert.png")
plt.show()