# AI Article Pre-Training

This notebook continues pre-training a BERT model on unlabelled articles about AI, using next-sentence prediction (NSP) as the training objective.

In [2]:
!ls .. | grep notebook

notebooks


## Imports and Setup

In [3]:
%pip install transformers -Uqq
%pip install scikit-learn -Uqq
%pip install datasets -Uqq
%pip install torch -Uqq
%pip install numpy -Uqq
%pip install evaluate -Uqq
!sudo apt install git-lfs

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible.
torchaudio 0.12.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible.

In [4]:
import evaluate
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from transformers import (
    BertForNextSentencePrediction,
    BertTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
import gc
import os

In [5]:
MODEL_NAME = "aihype_article_bert_fine_tune"

## Loading Dataset

In [7]:
dataset = load_dataset("json", data_files="data/sanitized_pairs_unlabelled.json", field="data")
dataset

DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 127470
    })
})
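
The notebook does not show how the sentence pairs in the JSON file were built. As a rough sketch only, one common way to create next-sentence-prediction examples is to pair each sentence with its true successor half of the time and with a random sentence otherwise. The helper `build_nsp_pairs`, its arguments, and the label convention (0 = real continuation, 1 = random) are assumptions for illustration; the column order and label encoding of the actual file may differ.

```python
import random


def build_nsp_pairs(sentences, distractors, seed=0):
    """Hypothetical helper: turn consecutive sentences into NSP examples.

    Label 0 means `sen2` really follows `sen1`; label 1 means `sen2` was
    drawn at random from `distractors`.
    """
    rng = random.Random(seed)
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            # Positive example: keep the true next sentence.
            pairs.append({"sen1": first, "sen2": second, "ans": 0})
        else:
            # Negative example: swap in an unrelated sentence.
            pairs.append({"sen1": first, "sen2": rng.choice(distractors), "ans": 1})
    return pairs


article = ["Sentence one.", "Sentence two.", "Sentence three."]
other = ["An unrelated sentence."]
pairs = build_nsp_pairs(article, other)
```

Each consecutive sentence pair in `article` yields exactly one example, so a document with `n` sentences gives `n - 1` rows.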

In [8]:
dataset["train"][0:3]

{'sen1': ['The tech titans posted earnings as shares in Meta skyrocketed a day after it reported better results than expected and signaled spending and job cuts .',
  'The results follow weeks of unprecedented layoff rounds in the usually unassailable tech sector amid pessimism about the economic outlook .',
  'The souring mood followed a long spell of outsized growth during the peak Covid-19 period when consumers went online for work , shopping and entertainment .'],
 'sen2': ["By Afp Published : 16:33 , 2 February 2023 | Updated : 02:33 , 3 February 2023 The world 's biggest tech companies posted their latest earnings Google and Apple on Thursday reported downbeat results for the last quarter of 2022 as Amazon beat expectations , but warned that the coming months would be uncertain in a difficult moment for Big Tech .",
  'The tech titans posted earnings as shares in Meta skyrocketed a day after it reported better results than expected and signaled spending and job cuts .',
  'The re

In [9]:
num_epochs = 30  # overridden by num_epochs = 4 in the "Train the Model" section

## Preprocess Data, Create Train/Test Split

In [10]:
dataset = dataset.class_encode_column('ans')
processed_dataset = dataset["train"].train_test_split(test_size=0.2, stratify_by_column='ans')
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 101976
    })
    test: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 25494
    })
})
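
To illustrate what stratifying by `ans` achieves, here is a toy stratified split in plain Python. It is not the `datasets` implementation, only a sketch of the idea: the split is made separately inside each label group, so both parts keep the original label proportions. The function name `stratified_split` and the toy labels are made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Toy stratified split: returns (train_indices, test_indices)."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        # Shuffle within the group, then cut off the test share.
        rng.shuffle(indices)
        cut = round(len(indices) * test_size)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test


labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

With 80 rows of class 0 and 20 rows of class 1, the test part receives 16 and 4 rows respectively, so the 80/20 class ratio is preserved on both sides.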

In [11]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [12]:
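def preprocess_data(examples):
    # Encode each (sen1, sen2) pair as one sequence, padded or truncated to the model's maximum length.
    return tokenizer(examples["sen1"], examples["sen2"], padding="max_length", truncation=True)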
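
The layout the tokenizer produces for a sentence pair can be sketched with a toy function. This is not the real `BertTokenizer` (there is no WordPiece splitting and no truncation); `encode_pair` and the word lists are assumptions used only to show how `[CLS]`, `[SEP]`, the segment ids, the attention mask and the padding fit together.

```python
def encode_pair(tokens_a, tokens_b, max_length=16):
    """Toy illustration of BERT's sentence-pair layout (not the real tokenizer)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Segment ids: 0 for [CLS], the first sentence and its [SEP];
    # 1 for the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Pad everything to a fixed length, as padding="max_length" does.
    pad = max_length - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_length; a real tokenizer would truncate")
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


encoded = encode_pair(
    ["OpenAI", "did", "not", "respond"],
    ["ChatGPT", "is", "not", "blocked"],
)
```

Because every example is padded to the same fixed length, short pairs such as the ones in this dataset consist mostly of `[PAD]` positions, which the attention mask tells the model to ignore.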

In [13]:
tokenized_dataset = processed_dataset.map(
    preprocess_data,
    remove_columns=("sen1", "sen2"),
    batched=True,
).rename_column('ans', 'next_sentence_label')

tokenized_dataset

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

DatasetDict({
    train: Dataset({
        features: ['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 101976
    })
    test: Dataset({
        features: ['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25494
    })
})

### Verify dataset

In [14]:
example = tokenized_dataset['train'][3]
example.keys()

dict_keys(['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [15]:
tokenizer.decode(example["input_ids"])

"[CLS] OpenAI has never publicly explained those restrictions and did not respond to Reuters'request for comments. [SEP] OpenAI or ChatGPT itself is not blocked by Chinese authorities but OpenAI does not allow users in mainland China, Hong Kong, Iran, Russia and parts of Africa to sign up. [SEP] [PAD] [PAD] [PAD] ... (remaining [PAD] tokens omitted)

## Load Pre-Trained Model

In [16]:
# Load the pre-trained BERT weights together with the next-sentence prediction head
model = BertForNextSentencePrediction.from_pretrained(
    "bert-base-cased",
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Verify data-model interaction

In [17]:
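# Forward pass on a single training example to check that the inputs
# and the label reach the model in the expected format.
sample = tokenized_dataset["train"][0]
with torch.no_grad():
    outputs = model(
        input_ids=torch.tensor([sample["input_ids"]]),
        token_type_ids=torch.tensor([sample["token_type_ids"]]),
        attention_mask=torch.tensor([sample["attention_mask"]]),
        labels=torch.tensor([sample["next_sentence_label"]]),
    )
outputs.loss, outputs.logits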
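
A forward pass through a next-sentence prediction head yields two logits per sentence pair. As a minimal sketch, assuming the usual convention that index 0 stands for "is the next sentence" and index 1 for "random sentence", the logits can be turned into probabilities with a softmax. The logit values below are invented, and the label encoding produced by `class_encode_column` in this notebook may map the classes differently.

```python
import math


def nsp_probabilities(logits):
    """Softmax over the two NSP logits of one sentence pair."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for one pair (index 0 = "is next", index 1 = "random").
probs = nsp_probabilities([2.0, -1.0])
predicted_class = probs.index(max(probs))
```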

## Define Metrics

In [18]:
metrics = {
    "accuracy": evaluate.load("accuracy"),
    "precision": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

In [19]:
def compute_metrics(eval_pred):
    """Compute every metric in `metrics` from the evaluation logits."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    values = {}

    for name, metric in metrics.items():
        # Each metric returns a single-entry dict, e.g. {"f1": 0.87};
        # store that value under our own metric name.
        result = metric.compute(predictions=predictions, references=labels)
        values[name] = next(iter(result.values()))

    return values
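
To make explicit what these four metrics measure, here is a dependency-free sketch that computes them by hand for a handful of made-up logits. It assumes the binary setting with class 1 as the positive class, which is the default for the precision, recall and F1 metrics as far as I know; `binary_metrics` and the toy data are illustrative only.

```python
def binary_metrics(predictions, references, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up logits for five sentence pairs and their true labels.
logits = [[2.0, 0.1], [0.3, 1.2], [0.2, 0.9], [1.5, 0.4], [0.1, 0.8]]
labels = [0, 1, 0, 1, 1]
predictions = [row.index(max(row)) for row in logits]  # argmax per row
scores = binary_metrics(predictions, labels)
```

Here three of five predictions are correct, with two true positives, one false positive and one false negative, so accuracy is 0.6 and precision, recall and F1 are all 2/3.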

## Train the Model

In [20]:
batch_size = 16 # TODO: increase if we have more data
num_epochs = 4

In [21]:
training_args = TrainingArguments(
    MODEL_NAME,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.00,
    report_to="none",
    label_names=['next_sentence_label'],
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    push_to_hub=True,
    hub_token=os.environ.get("HF_TOKEN"),  # assumed env var name; keep access tokens out of the notebook
)

In [22]:
# Fixed-seed subsets for quicker experiments; the Trainer below is given the full splits instead.
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(100000))
small_eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(20000))

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)


Cloning https://huggingface.co/xt0r3/aihype_article_bert_fine_tune into local empty directory.


In [24]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2331 | 0.206217 | 0.91598 | 0.898055 | 0.851368 | 0.874089 |
| 2 | 0.1388 | 0.284947 | 0.91853 | 0.864273 | 0.904157 | 0.883765 |
| 3 | 0.0645 | 0.391923 | 0.924453 | 0.891432 | 0.887553 | 0.889488 |
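
As a quick consistency check on the log above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The numbers below are copied from the three logged epochs; the variable names are only for this sketch.

```python
# (epoch, precision, recall, reported F1) copied from the training log.
rows = [
    (1, 0.898055, 0.851368, 0.874089),
    (2, 0.864273, 0.904157, 0.883765),
    (3, 0.891432, 0.887553, 0.889488),
]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {epoch: f1_from(p, r) for epoch, p, r, _ in rows}

# Epoch with the highest reported F1, the metric used for model selection.
best_epoch = max(rows, key=lambda row: row[3])[0]
```

The recomputed values agree with the reported F1 column to within rounding, and the third epoch has the highest F1 even though its validation loss is the largest of the three.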



## Upload the Model

In [None]:
# Free the memory: drop the references first, then collect garbage
# and release the CUDA blocks cached by PyTorch.
model = None
trainer = None
training_args = None
gc.collect()
torch.cuda.empty_cache()

In [None]:
# agency-vs-rest/checkpoint-263: 0.75 precision, 0.85 recall
#

In [6]:
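# Step 19122 = 3 epochs x 6374 steps (101976 rows / batch size 16, rounded up), assuming a single device.
CHECKPOINT_NUMBER = 19122
model = BertForNextSentencePrediction.from_pretrained(f'{MODEL_NAME}/checkpoint-{CHECKPOINT_NUMBER}', local_files_only=True)
# The token is read from an environment variable (name assumed) instead of being hard-coded.
model.push_to_hub(MODEL_NAME, use_auth_token=os.environ.get("HF_TOKEN"))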

CommitInfo(commit_url='https://huggingface.co/xt0r3/aihype_article_bert_fine_tune/commit/f77d86d05db534f13f5369d55df707e549baea8b', commit_message='Upload BertForNextSentencePrediction', commit_description='', oid='f77d86d05db534f13f5369d55df707e549baea8b', pr_url=None, pr_revision=None, pr_num=None)
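
The checkpoint number used for the upload can be cross-checked against the training setup. Checkpoints saved once per epoch are named after the global optimizer step, so the sketch below derives the step counts from the training-set size and batch size reported earlier. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these is stated explicitly in the notebook.

```python
import math

train_rows = 101976  # size of the training split
batch_size = 16      # per-device batch size used for training

# One optimizer step per batch; the final, smaller batch still counts.
steps_per_epoch = math.ceil(train_rows / batch_size)

# Global step at the end of each of the three logged epochs.
checkpoint_steps = [steps_per_epoch * epoch for epoch in (1, 2, 3)]
```

Under these assumptions an epoch takes 6374 steps, and the third epoch ends at step 19122, which matches `CHECKPOINT_NUMBER`.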