### WSBERTs - Fine-Tuned BERT - DistilBERT Transformers
Out-of-Sample Performance (Macro Avg F1-Score)
* Fine-Tuned BERT - 0.6400
* Fine-Tuned DistilBERT - 0.6300

In [None]:
! pip install transformers
! pip install datasets


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

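df = pd.read_csv('tweets_comments_combined_df.csv')
# Keep comments of at most 512 characters (a character count, not a token count)
df = df[df['comment'].map(len) <= 512]
X = list(df['comment'])
y = list(df['sentiment'])

# Hold out 25% of the rows as the out-of-sample test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)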

pd.DataFrame({'sentiment':y_train, 'text':X_train}).to_csv('train.csv', index=False)
pd.DataFrame({'sentiment':y_test, 'text':X_test}).to_csv('test.csv', index=False)

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', 
                       data_files={'train': 'train.csv', 
                                   'test': 'test.csv'}, 
                       )


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentiment', 'text'],
        num_rows: 3209
    })
    test: Dataset({
        features: ['sentiment', 'text'],
        num_rows: 1070
    })
})
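
The row counts above follow from `test_size=0.25`. A quick arithmetic check, assuming scikit-learn's behaviour of rounding the test share up and assigning the remainder to the training set:

```python
import math

n_total = 3209 + 1070          # rows left after the 512-character filter
test_size = 0.25               # value passed to train_test_split

n_test = math.ceil(n_total * test_size)   # test share is rounded up
n_train = n_total - n_test                # the remainder is used for training

print(n_total, n_train, n_test)
```

With 4,279 rows this gives 1,070 test rows and 3,209 training rows, matching the `DatasetDict` summary.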

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')


In [None]:
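def transform_labels(label):
    # Map the sentiment string to an integer class id.
    # Note: any unrecognised value falls through to 0 (negative).
    label = label['sentiment']
    num = 0
    if label == 'negative':
        num = 0
    elif label == 'neutral':
        num = 1
    elif label == 'positive':
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad every example to the model's maximum length (512 tokens for bert-base-cased)
    # and truncate anything longer so the model never receives an over-length sequence.
    return tokenizer(example['text'], padding='max_length', truncation=True)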

dataset = dataset.map(tokenize_data, batched=True)

remove_columns = ['sentiment']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
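
The label encoding can also be written as a dictionary lookup. The sketch below is an alternative, not the notebook's own code: the names `LABEL2ID`, `ID2LABEL` and `encode_label` are hypothetical, and unlike the `if`/`elif` chain it raises on an unexpected label instead of silently mapping it to 0.

```python
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

def encode_label(example):
    """Map a row's 'sentiment' string to an integer class id."""
    sentiment = example["sentiment"]
    if sentiment not in LABEL2ID:
        # Fail loudly rather than defaulting to class 0
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")
    return {"labels": LABEL2ID[sentiment]}

print(encode_label({"sentiment": "neutral", "text": "ok"}))
print(ID2LABEL[2])
```

A function of this shape could be passed to `dataset.map` in the same way as `transform_labels`, and `ID2LABEL` can be reused later to turn predicted class ids back into strings.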


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", num_train_epochs=3)

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
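# Shuffle with a fixed seed, then take the first 3,000 rows for training and the next 200
# for validation. The same seed yields the same order, so the two subsets do not overlap.
train_dataset = dataset['train'].shuffle(seed=10).select(range(3000))
eval_dataset = dataset['train'].shuffle(seed=10).select(range(3000, 3200))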



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1125


Step,Training Loss
500,0.9414
1000,0.4933


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1125, training_loss=0.6688673739963108, metrics={'train_runtime': 530.1856, 'train_samples_per_second': 16.975, 'train_steps_per_second': 2.122, 'total_flos': 2368020759552000.0, 'train_loss': 0.6688673739963108, 'epoch': 3.0})
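
The step count and checkpoint numbers in the log can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation (as the log reports) and a checkpoint interval of 500 steps, which is the `Trainer` default and was not set explicitly in this notebook.

```python
import math

num_examples = 3000      # "Num examples" in the log
batch_size = 8           # "Instantaneous batch size per device"
num_epochs = 3           # num_train_epochs passed to TrainingArguments
save_steps = 500         # assumed: the default checkpoint interval

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoints)
```

This yields 375 steps per epoch and 1,125 steps in total, with checkpoints at steps 500 and 1000, consistent with the `checkpoint-500` and `checkpoint-1000` directories above.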

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average='macro')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.evaluate()


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'eval_f1': 0.610551859632742,
 'eval_loss': 1.4478225708007812,
 'eval_runtime': 4.0064,
 'eval_samples_per_second': 49.921,
 'eval_steps_per_second': 6.24}
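
To make the `eval_f1` number concrete, here is a minimal pure-Python sketch of what `compute_metrics` does: take the arg-max of each row of logits, then average the per-class F1 scores without weighting. The helper names and the toy logits are made up for illustration; the notebook itself relies on the `f1` metric from `datasets`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def macro_f1(y_true, y_pred, num_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy logits and labels (illustrative values only)
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.0, 0.4, 2.2], [1.1, 0.9, -0.5]]
labels = [0, 1, 2, 1]

preds = [argmax(row) for row in logits]
print(preds, round(macro_f1(labels, preds), 4))
```

Because every class contributes equally, a weak minority class pulls the macro average down more than it would a weighted average.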

In [None]:
import torch

torch.save(model.state_dict(),'finetuned_BERT')

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained('finetuned_BERT_OUTPUT_DIR')

Configuration saved in finetuned_BERT_OUTPUT_DIR/config.json
Model weights saved in finetuned_BERT_OUTPUT_DIR/pytorch_model.bin


In [None]:
loaded_model = AutoModelForSequenceClassification.from_pretrained('finetuned_BERT_OUTPUT_DIR')

loading configuration file finetuned_BERT_OUTPUT_DIR/config.json
Model config BertConfig {
  "_name_or_path": "finetuned_BERT_OUTPUT_DIR",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file 

In [None]:
# arguments for Trainer
test_args = TrainingArguments(
    output_dir = 'finetuned_BERT_OUTPUT_DIR',
    do_train = False,
    do_predict = True,
    per_device_eval_batch_size = 16,   
    dataloader_drop_last = False    
)

# init trainer with the model reloaded from disk
trainer = Trainer(
              model = loaded_model, 
              args = test_args, 
              compute_metrics = compute_metrics)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
test_results = trainer.predict(dataset['test'])

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1070
  Batch size = 16


In [None]:
sentiment_labels = []
for label in test_results.predictions.argmax(axis=1):
  if label == 0:
    sentiment_labels.append('negative')
  elif label == 1:
    sentiment_labels.append('neutral')
  elif label == 2:
    sentiment_labels.append('positive')

from sklearn.metrics import classification_report
print(classification_report(y_test, sentiment_labels))

              precision    recall  f1-score   support

    negative       0.65      0.63      0.64       352
     neutral       0.59      0.64      0.61       318
    positive       0.70      0.67      0.68       400

    accuracy                           0.65      1070
   macro avg       0.64      0.64      0.64      1070
weighted avg       0.65      0.65      0.65      1070
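
The two summary rows of the report differ only in how the per-class scores are averaged. A short check, using the rounded per-class F1 values and supports printed above (so the results are approximate):

```python
# Per-class F1 and support, copied from the report above
f1 = {"negative": 0.64, "neutral": 0.61, "positive": 0.68}
support = {"negative": 352, "neutral": 318, "positive": 400}

# Macro average: every class counts equally
macro_avg = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_avg = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_avg, 2), round(weighted_avg, 2))
```

The macro average comes out at about 0.64 and the weighted average at about 0.65, matching the report.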



In [None]:
inference = pd.read_csv('inference_1month_comments.csv')
inference[inference['comment'].map(len) <= 512].to_csv('inference512.csv', index=False)

inference_dataset = load_dataset('csv', 
                       data_files={'test': 'inference512.csv', 
                                   }, 
                       )


In [None]:
def tokenize_data(example):
    # Same tokenization as for training, applied to the 'comment' column
    return tokenizer(example['comment'], padding='max_length', truncation=True)

inference_dataset = inference_dataset.map(tokenize_data, batched=True)

# Drop the unused 'date' column
remove_columns = ['date']
inference_dataset = inference_dataset.remove_columns(remove_columns)


In [None]:
import time
t1 = time.perf_counter()


test_results = trainer.predict(inference_dataset['test'])


t2 = time.perf_counter()
print('time taken to run:',t2-t1)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: comment. If comment are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 18628
  Batch size = 16


time taken to run: 357.196422341
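
The wall-clock time translates into throughput as follows. This is only arithmetic on the numbers logged above; the variable names are illustrative.

```python
import math

num_examples = 18628          # "Num examples" in the prediction log
batch_size = 16               # per_device_eval_batch_size
elapsed_s = 357.196422341     # "time taken to run"

num_batches = math.ceil(num_examples / batch_size)
samples_per_s = num_examples / elapsed_s
batches_per_s = num_batches / elapsed_s

print(num_batches, round(samples_per_s, 2), round(batches_per_s, 2))
```

That is 1,165 batches at roughly 52 comments per second on the hardware used for this run.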


In [None]:
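# test_results is a PredictionOutput named tuple (predictions, label_ids, metrics),
# so slicing it returns all three fields and len() returns 3, not the number of rows.
# Use test_results.predictions for the array of logits.
print(test_results[:5])
len(test_results)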

(array([[-2.7021134 ,  2.7301712 ,  0.20170519],
       [ 3.0096207 , -1.8921918 , -1.2329382 ],
       [ 1.1454486 ,  0.33529317, -2.0748148 ],
       ...,
       [-2.7815766 ,  3.2730255 , -0.4104389 ],
       [-2.9399521 ,  3.6325374 , -1.2962633 ],
       [-1.8776834 ,  2.5289416 , -1.2009126 ]], dtype=float32), None, {'test_runtime': 357.1904, 'test_samples_per_second': 52.151, 'test_steps_per_second': 3.262})


3
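
The prediction array holds raw logits, one row per comment and one column per class. A minimal sketch of turning them into probabilities and sentiment labels, using the first three rows printed above; the `softmax` helper and the `ID2LABEL` mapping are illustrative and mirror the 0/1/2 encoding used during training.

```python
import math

ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Convert a row of logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First three rows of the prediction array printed above
rows = [
    [-2.7021134, 2.7301712, 0.20170519],
    [3.0096207, -1.8921918, -1.2329382],
    [1.1454486, 0.33529317, -2.0748148],
]

for row in rows:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(ID2LABEL[best], [round(p, 3) for p in probs])
```

The first row is classified as neutral and the next two as negative; the third is a much closer call than the second, which the arg-max alone would hide.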

### DistilBERT

In [None]:
from transformers import AutoModelForSequenceClassification
distilbert_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=3)

loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "ini

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'voc

In [None]:
from transformers import Trainer
import numpy as np
from datasets import load_metric

distilbert_trainer = Trainer(
    model=distilbert_model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

distilbert_trainer.train()


metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average='macro')

distilbert_trainer = Trainer(
    model=distilbert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

distilbert_trainer.evaluate()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids. If text, token_type_ids are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1125


Step,Training Loss
500,0.9338
1000,0.5451


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids. If text, token_type_ids are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'eval_f1': 0.6131055666786455,
 'eval_loss': 1.1576679944992065,
 'eval_runtime': 2.0646,
 'eval_samples_per_second': 96.869,
 'eval_steps_per_second': 12.109}

In [None]:
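# Save the fine-tuned DistilBERT model. It must come from distilbert_trainer: taking it from
# `trainer` saves the BERT model trained earlier instead, which is what the BertConfig and the
# BERT-identical scores in the recorded output below reflect.
distilbert_model_to_save = distilbert_trainer.model.module if hasattr(distilbert_trainer.model, 'module') else distilbert_trainer.model  # Take care of distributed/parallel training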
distilbert_model_to_save.save_pretrained('finetuned_distilBERT_OUTPUT_DIR')
loaded_distilbert_model = AutoModelForSequenceClassification.from_pretrained('finetuned_distilBERT_OUTPUT_DIR')

Configuration saved in finetuned_distilBERT_OUTPUT_DIR/config.json
Model weights saved in finetuned_distilBERT_OUTPUT_DIR/pytorch_model.bin
loading configuration file finetuned_distilBERT_OUTPUT_DIR/config.json
Model config BertConfig {
  "_name_or_path": "finetuned_distilBERT_OUTPUT_DIR",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "

In [None]:
# arguments for Trainer
test_args = TrainingArguments(
    output_dir = 'finetuned_distilBERT_OUTPUT_DIR',
    do_train = False,
    do_predict = True,
    per_device_eval_batch_size = 16,   
    dataloader_drop_last = False    
)

# init trainer
loaded_distilbert_trainer = Trainer(
              model = loaded_distilbert_model, 
              args = test_args, 
              compute_metrics = compute_metrics)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
distilbert_test_results = loaded_distilbert_trainer.predict(dataset['test'])

distilbert_sentiment_labels = []
for label in distilbert_test_results.predictions.argmax(axis=1):
  if label == 0:
    distilbert_sentiment_labels.append('negative')
  elif label == 1:
    distilbert_sentiment_labels.append('neutral')
  elif label == 2:
    distilbert_sentiment_labels.append('positive')

from sklearn.metrics import classification_report
print(classification_report(y_test, distilbert_sentiment_labels))

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1070
  Batch size = 16


              precision    recall  f1-score   support

    negative       0.65      0.63      0.64       352
     neutral       0.59      0.64      0.61       318
    positive       0.70      0.67      0.68       400

    accuracy                           0.65      1070
   macro avg       0.64      0.64      0.64      1070
weighted avg       0.65      0.65      0.65      1070



In [None]:
import time
t1 = time.perf_counter()


test_results = loaded_distilbert_trainer.predict(inference_dataset['test'])


t2 = time.perf_counter()
print('time taken to run:',t2-t1)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: comment. If comment are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 18628
  Batch size = 16


time taken to run: 354.2760375150001


In [None]:
print(test_results[:5])
len(test_results.predictions)

(array([[-2.7021134 ,  2.7301712 ,  0.20170519],
       [ 3.0096207 , -1.8921918 , -1.2329382 ],
       [ 1.1454486 ,  0.33529317, -2.0748148 ],
       ...,
       [-2.7815766 ,  3.2730255 , -0.4104389 ],
       [-2.9399521 ,  3.6325374 , -1.2962633 ],
       [-1.8776834 ,  2.5289416 , -1.2009126 ]], dtype=float32), None, {'test_runtime': 354.2701, 'test_samples_per_second': 52.581, 'test_steps_per_second': 3.288})


18628