# AI Article Pre Training  

This notebook implements a pre-training of the BERT model by performing next-sentence classification on unlabelled articles about AI.

In [1]:
!ls .. | grep notebook

notebooks


## Imports and Setup

In [2]:
%pip install transformers -U
%pip install sklearn -U
%pip install datasets -U
%pip install torch -U
%pip install numpy -U
%pip install evaluate -U
!sudo apt install git-lfs

Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m32.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.21.3
    Uninstalling transformers-4.21.3:
      Successfully uninstalled transformers-4.21.3
Successfully installed transformers-4.26.1
[0mNote: you may need to restart the kernel to use updated packages.
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hBuilding wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... [?25ldone
[?25h  Created wheel for sklearn: filename=sklearn-0.0.post1-py3-none-any.whl size=2936 sha256=ce921abba37f59f0b020b11a9026fe702c7f5d443342f73f855e2d18e9130c84
  Stored in directory: /root/.cache/pip/wheels/03/8b/6f/9f1

In [3]:
import evaluate
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from transformers import (
    BertForNextSentencePrediction,
    BertTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
import gc
import os

In [5]:
MODEL_NAME = "aihype_article_bert_fine_tune"

## Loading Dataset

In [6]:
dataset = load_dataset("json", data_files="data/sanitized_pairs_unlabelled.json", field="data")
dataset

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-d3256731e43b5dec/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-d3256731e43b5dec/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 251461
    })
})

In [7]:
dataset["train"][0:3]

{'sen1': ['Not one savings account is currently able to keep anywhere near the pace of rising costs .',
  'People can be forgiven for losing interest and whether there is a point of tucking money away for interest paying in some circumstances 5 per cent below the rate of inflation - and this gap could grow bigger in the coming months .',
  'However , firstly for those without a savings pot whatsoever , it is important to have a rainy day fund to fall back on .'],
 'sen2': ['By Ed Magnus For Thisismoney.co.uk Published : 07:50 , 13 January 2022 | Updated : 19:04 , 13 January 2022 8 View comments Surging inflation means the outlook for savers hunting returns is bleaker than bleak .',
  'Not one savings account is currently able to keep anywhere near the pace of rising costs .',
  'People can be forgiven for losing interest and whether there is a point of tucking money away for interest paying in some circumstances 5 per cent below the rate of inflation - and this gap could grow bigger in

In [8]:
num_epochs = 30

## Preprocess Data, Create Train/Test Split

In [9]:
dataset = dataset.class_encode_column('ans')
processed_dataset = dataset["train"].train_test_split(test_size=0.2, stratify_by_column='ans')
processed_dataset

Stringifying the column:   0%|          | 0/251461 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/251461 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 201168
    })
    test: Dataset({
        features: ['sen1', 'sen2', 'ans'],
        num_rows: 50293
    })
})

In [10]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [11]:
def preprocess_data(examples):
    return tokenizer(examples["sen1"], examples['sen2'], padding='max_length', truncation=True)

In [12]:
tokenized_dataset = processed_dataset.map(
    preprocess_data,
    remove_columns=("sen1", "sen2"),
    batched=True,
).rename_column('ans', 'next_sentence_label')

tokenized_dataset

Map:   0%|          | 0/201168 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Map:   0%|          | 0/50293 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

### Verify dataset

In [None]:
example = tokenized_dataset['train'][3]
example.keys()

dict_keys(['next_sentence_label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [None]:
tokenizer.decode(example["input_ids"])

"[CLS] The gangster film cost around $ 159 million, but most of the funds went to making Robert De Niro look in his 20s - he was 76 years old when the movie was shot. [SEP] By Reuters Published : 04 : 00, 30 March 2020 | Updated : 10 : 20, 30 March 2020 SHANGHAI, March 30 ( Reuters ) - China's liquefied petroleum gas ( LPG ) futures fell on their debut on the Dalian Commodity Exchange on Monday, dropping as much as 10 % after oil prices hit an 18 - year low on fears lockdowns to curb the coronavirus will further hurt demand. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

## Load Pre-Trained Model

In [None]:
# use_fast uses fast tokenizers backed by rust. Remove it if it causes errors
model = BertForNextSentencePrediction.from_pretrained(
    "bert-base-cased",
)

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Verify data-model interaction

In [None]:
# forward pass
# outputs = model(
# input_ids=tokenized_dataset[labels[0]]["train"]["input_ids"][0],
# labels=tokenized_dataset[labels[0]]["train"][0]["labels"],
# )
# outputs

## Define Metrics

In [None]:
metrics = {
    "accuracy": evaluate.load("accuracy"),
    "presicion": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    values = {}
    
    for name, metric in metrics.items():
        result = metric.compute(predictions=predictions, references=labels)
        for val in result.values() if isinstance(result, dict) else [result]:
            values[name] = val

    return values

## Train the Model

In [None]:
batch_size = 16 # TODO: increase if we have more data
num_epochs = 4

In [None]:
training_args = TrainingArguments(
    MODEL_NAME,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.00,
    report_to="none",
    label_names=['next_sentence_label'],
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    push_to_hub=True,
    hub_token='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',
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(100000))
small_eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(20000))

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,  # compute_metrics,
)


Cloning https://huggingface.co/xt0r3/aihype_bert_fine_tune into local empty directory.


Download file pytorch_model.bin:   0%|          | 16.5k/413M [00:00<?, ?B/s]

Download file runs/Feb24_12-39-27_na85irh4fn/1677242489.3347664/events.out.tfevents.1677242489.na85irh4fn.59.1…

Download file training_args.bin: 100%|##########| 3.50k/3.50k [00:00<?, ?B/s]

Clean file runs/Feb24_12-39-27_na85irh4fn/1677242489.3347664/events.out.tfevents.1677242489.na85irh4fn.59.1:  …

Clean file training_args.bin:  29%|##8       | 1.00k/3.50k [00:00<?, ?B/s]

Download file runs/Feb24_12-39-27_na85irh4fn/events.out.tfevents.1677242489.na85irh4fn.59.0: 100%|##########| …

Clean file runs/Feb24_12-39-27_na85irh4fn/events.out.tfevents.1677242489.na85irh4fn.59.0:  14%|#4        | 1.0…

Clean file pytorch_model.bin:   0%|          | 1.00k/413M [00:00<?, ?B/s]

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

## Upload the Model

In [None]:
# Free the memory
gc.collect()

with torch.no_grad():
    torch.cuda.empty_cache()
    
model = None
trainer = None
training_args = None
gc.collect()

In [None]:
# agency-vs-rest/checkpoint-263: 0.75 precision, 0.85 recall
#

In [None]:
model = BertForNextSentencePrediction.from_pretrained(f'{MODEL_NAME}/checkpoint-1875', local_files_only=True)
model.push_to_hub(MODEL_NAME, use_auth_token='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')

loading configuration file aihype_bert_fine_tune/checkpoint-1875/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForNextSentencePrediction"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file aihype_bert_fine_tune/checkpoint-1875/pytorch_model.bin
All model checkpoint weights were used when initializing BertForNextSentencePrediction.

All the weights of BertForNextSentencePrediction were in

pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/xt0r3/aihype_bert_fine_tune/commit/85c6ccfc091d43b1428ebbde46a0854407f155f3', commit_message='Upload BertForNextSentencePrediction', commit_description='', oid='85c6ccfc091d43b1428ebbde46a0854407f155f3', pr_url=None, pr_revision=None, pr_num=None)