The following is taken from:


https://codistro.medium.com/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf

https://towardsdatascience.com/does-bert-need-clean-data-part-2-classification-d29adf9f745a

### NLP Classification of Corona Virus Tweets (Kaggle)

This notebook walks through the full BERT fine-tuning process. The model is fine-tuned for a classification task: sentiment classification of tweets about the coronavirus. The notebook ends by saving the fine-tuned BERT model.

The next notebook loads the saved model and makes predictions.

#### Import Packages

In [1]:
import os
import numpy as np
import pandas as pd

from datasets import load_dataset
from datasets import load_metric

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


#### Load the data

In [1]:
# Local folder holding the Kaggle CSV files; adjust for your machine
data_dir = 'C:\\Users\\hlmq\\OneDrive - Chevron\\Data\\DSDP\\CoronaTweets-Kaggle\\'
dataset = load_dataset('csv',
                       data_files={'train': data_dir + 'Corona_NLP_train.csv',
                                   'test': data_dir + 'Corona_NLP_test.csv'},
                       encoding='ISO-8859-1')

Using custom data configuration default-b6f120a392a2bbd5
Reusing dataset csv (C:\Users\hlmq\.cache\huggingface\datasets\csv\default-b6f120a392a2bbd5\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)



#### Access the Tokenizer

In [2]:
# Load tokenizer from transformers library
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [3]:
# Check to be sure the tokenizer is working
tokenizer("Attention is all you need")

{'input_ids': [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
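
The output above can be read field by field. The short sketch below copies that dictionary by hand (it does not call the tokenizer) and checks two things: the sequence starts with id 101 and ends with id 102, which are the `[CLS]` and `[SEP]` special tokens in the BERT vocabulary, and the five-word sentence produced seven content tokens, so at least one word was split into several WordPiece tokens. The variable names are illustrative only.

```python
# Tokenizer output copied from the cell above (not recomputed here)
encoding = {
    'input_ids': [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],
}

CLS_ID, SEP_ID = 101, 102  # special token ids in the BERT vocabulary

ids = encoding['input_ids']
content_ids = ids[1:-1]  # everything between [CLS] and [SEP]
n_words = len("Attention is all you need".split())

print("starts with [CLS]:", ids[0] == CLS_ID)
print("ends with [SEP]:", ids[-1] == SEP_ID)
print("content tokens:", len(content_ids), "words:", n_words)
```

All three lists have the same length because `token_type_ids` and `attention_mask` carry one entry per token id.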

#### Perform some data set cleanup

In [4]:
# Define utility functions

# Map each sentiment string to an integer class id
LABEL2ID = {
    'Extremely Positive': 0,
    'Positive': 1,
    'Neutral': 2,
    'Negative': 3,
    'Extremely Negative': 4,
}

def transform_labels(example):
    # Any unrecognized sentiment value falls back to class 0
    return {'labels': LABEL2ID.get(example['Sentiment'], 0)}

def tokenize_data(example):
    # Pad every tweet to the model's maximum length; truncate as a safeguard
    return tokenizer(example['OriginalTweet'], padding='max_length', truncation=True)
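
To make the effect of `padding='max_length'` concrete, here is a minimal pure-Python sketch of what happens to a single example. It assumes a maximum length of 512 tokens and a `[PAD]` id of 0, which are the usual values for BERT base; `pad_to_max_length` is a hypothetical helper written for illustration, not part of the `transformers` API.

```python
PAD_ID = 0  # assumed [PAD] id in the BERT vocabulary

def pad_to_max_length(input_ids, max_length=512):
    """Hypothetical helper mimicking padding='max_length' for one example."""
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence longer than max_length; truncation would be needed")
    return {
        # Real token ids first, then [PAD] ids up to max_length
        'input_ids': input_ids + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        'attention_mask': [1] * len(input_ids) + [0] * n_pad,
    }

ids = [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102]
padded = pad_to_max_length(ids)
print(len(padded['input_ids']), sum(padded['attention_mask']))
```

Every tweet ends up the same length regardless of how short it is, which keeps batching simple but means the model processes many padding positions per example.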

In [5]:
dataset = dataset.map(tokenize_data, batched=True)

remove_columns = ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment']

dataset = dataset.map(transform_labels, remove_columns=remove_columns)

Loading cached processed dataset at C:\Users\hlmq\.cache\huggingface\datasets\csv\default-b6f120a392a2bbd5\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-14a270d0c171a9d4.arrow



Loading cached processed dataset at C:\Users\hlmq\.cache\huggingface\datasets\csv\default-b6f120a392a2bbd5\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-50fcbfa3bc29c00f.arrow



#### Access Pretrained BERT Base

##### Instantiate some training parameters

In [6]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [7]:
training_args = TrainingArguments("test_trainer", num_train_epochs=1)

In [8]:
train_dataset = dataset['train'].shuffle(seed=10).select(range(4000))
eval_dataset = dataset['train'].shuffle(seed=10).select(range(4000, 4100))

Loading cached shuffled indices for dataset at C:\Users\hlmq\.cache\huggingface\datasets\csv\default-b6f120a392a2bbd5\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-867a2464020b7f06.arrow
Loading cached shuffled indices for dataset at C:\Users\hlmq\.cache\huggingface\datasets\csv\default-b6f120a392a2bbd5\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-867a2464020b7f06.arrow
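
The train and evaluation subsets above do not overlap because both calls shuffle with the same seed, so they slice the same permutation at different positions. The sketch below illustrates the idea with Python's `random` module; it is not how `datasets` shuffles internally, and `shuffled_indices` and the row count of 5000 are made up for the example.

```python
import random

def shuffled_indices(n, seed):
    """Return a reproducible permutation of range(n) for the given seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx

n_rows = 5000  # hypothetical dataset size, for illustration only

# Same seed means the same permutation, so disjoint slices give disjoint rows
train_idx = shuffled_indices(n_rows, seed=10)[:4000]
eval_idx = shuffled_indices(n_rows, seed=10)[4000:4100]

overlap = set(train_idx) & set(eval_idx)
print(len(train_idx), len(eval_idx), len(overlap))
```

If the two calls used different seeds, the slices would come from different permutations and some evaluation rows could also appear in the training subset.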


#### Fine-tune the model

In [9]:
# Instantiate the trainer
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)

In [10]:
# This step is slow: the run logged below took about 5.2 hours. Go grab a coffee.
trainer.train()

***** Running training *****
  Num examples = 4000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 500


Step,Training Loss
500,1.2488


Saving model checkpoint to test_trainer\checkpoint-500
Configuration saved in test_trainer\checkpoint-500\config.json
Model weights saved in test_trainer\checkpoint-500\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=500, training_loss=1.2487789306640624, metrics={'train_runtime': 18715.4515, 'train_samples_per_second': 0.214, 'train_steps_per_second': 0.027, 'total_flos': 1052472569856000.0, 'train_loss': 1.2487789306640624, 'epoch': 1.0})
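
The numbers in the training log are related by simple arithmetic. The sketch below recomputes the step count and throughput from the example count, batch size, and runtime reported above; the variable names are illustrative.

```python
import math

# Values copied from the training log above
num_examples = 4000
batch_size = 8
num_epochs = 1
train_runtime_s = 18715.4515

# One optimization step per batch (no gradient accumulation in this run)
steps = math.ceil(num_examples / batch_size) * num_epochs
hours = train_runtime_s / 3600
samples_per_s = num_examples * num_epochs / train_runtime_s
steps_per_s = steps / train_runtime_s

print("steps:", steps)
print("hours:", round(hours, 2))
print("samples/s:", round(samples_per_s, 3))
print("steps/s:", round(steps_per_s, 3))
```

The rounded throughput figures agree with `train_samples_per_second` and `train_steps_per_second` in the `TrainOutput` above.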

#### Evaluate the accuracy of the model

In [11]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
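
As a quick illustration of what `compute_metrics` calculates, the sketch below applies the same argmax step to a handful of made-up logits and computes accuracy directly with NumPy instead of `load_metric`. The logits, labels, and the helper name `accuracy_from_logits` are invented for the example.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == labels).mean())

# Four made-up examples with five class scores each
logits = np.array([
    [2.0, 0.1, 0.0, -1.0, -2.0],   # predicts class 0
    [0.0, 0.2, 1.5, 0.3, 0.1],     # predicts class 2
    [-1.0, 0.0, 0.5, 2.5, 0.4],    # predicts class 3
    [0.1, 0.2, 0.3, 0.4, 3.0],     # predicts class 4
])
labels = np.array([0, 2, 1, 4])    # third example is misclassified

print(accuracy_from_logits(logits, labels))  # 0.75
```

For the `Trainer` to report accuracy during evaluation, the function has to be handed to it, typically with `compute_metrics=compute_metrics` when the `Trainer` is constructed.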

In [12]:
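# Evaluate model predictions
# Note: compute_metrics was never passed to the Trainer, so this reports loss and timing only
# This step is slow too (about 190 seconds for 100 examples in the run below)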
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


{'eval_loss': 1.1104215383529663,
 'eval_runtime': 190.1123,
 'eval_samples_per_second': 0.526,
 'eval_steps_per_second': 0.068,
 'epoch': 1.0}
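
One way to put the evaluation loss in context is to compare it with the loss of a model that guesses uniformly over the five classes. The sketch below assumes the reported `eval_loss` is a mean cross-entropy in natural-log units, which is the usual loss for single-label sequence classification; under that assumption, `exp(-loss)` is the geometric-mean probability assigned to the correct class.

```python
import math

num_classes = 5
eval_loss = 1.1104215383529663  # copied from the evaluation output above

# Cross-entropy of a uniform guess over five classes
chance_loss = math.log(num_classes)

# Geometric-mean probability given to the true class (assumes cross-entropy loss)
geo_mean_prob = math.exp(-eval_loss)

print("chance loss:", round(chance_loss, 4))
print("geometric-mean probability of the true class:", round(geo_mean_prob, 4))
```

A loss below the uniform-guess baseline suggests the model learned something in one epoch, but loss alone does not say how often the top prediction is correct.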

#### Save the model

In [39]:
trainer.save_model('corona_tweet_sentiment.model')

Saving model checkpoint to corona_tweet_sentiment.model
Configuration saved in corona_tweet_sentiment.model\config.json
Model weights saved in corona_tweet_sentiment.model\pytorch_model.bin
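
When the saved model is used for prediction, its raw output is a vector of five logits per tweet, which still has to be turned back into a sentiment label. The sketch below shows that post-processing step in plain Python, using the inverse of the label mapping defined earlier in this notebook. The example logits are made up, and `softmax` and `decode` are illustrative helpers rather than library functions.

```python
import math

# Inverse of the label mapping used in this notebook
ID2LABEL = {
    0: 'Extremely Positive',
    1: 'Positive',
    2: 'Neutral',
    3: 'Negative',
    4: 'Extremely Negative',
}

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

example_logits = [-0.8, 0.3, 0.1, 1.9, 0.4]  # made-up model output for one tweet
label, prob = decode(example_logits)
print(label, round(prob, 3))
```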


#### Future Additions

##### Compute metrics, adapted from the following source:

https://github.com/acoadmarmon/united-nations-ner/blob/master/train_un_ner.py

##### Plot accuracy and loss over the training epochs to show at which epoch training should stop

In [None]:
# Path where model is located
checkpoint_path = "../models/light_tf_bert.ckpt"  # TODO: fix this to the correct path

In [None]:
import tensorflow as tf
from transformers import TFBertForSequenceClassification

num_classes = 5  # five sentiment classes, matching num_labels above

# Grab the BERT model you want from HuggingFace
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)

# Save the model weights to checkpoint_path during training
checkpoint_dir = os.path.dirname(checkpoint_path)
model_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                    save_weights_only=True,
                                                    verbose=1)

# summary() prints the architecture itself and returns None, so call it directly
print('\nBert Model')
bert_model.summary()

# Instantiate the loss, accuracy metric, and optimizer
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)

bert_model.compile(loss=loss, optimizer=optimizer, metrics=[metric])

In [None]:
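# Fit the model
# X_train, y_train, X_test, y_test are placeholders for tokenized inputs and integer labels
history = bert_model.fit(X_train,
                         y_train,
                         batch_size=32,
                         epochs=4,  # Keras fit() defaults to 1 epoch
                         validation_data=(X_test, y_test),
                         callbacks=[model_callback])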

In [None]:
### Accuracy Graph over Epochs
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 10))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
### Loss Graph over Epochs
plt.figure(figsize=(16, 10))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
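
Besides reading the stopping point off the plots, it can be picked programmatically from the same history dictionary. The sketch below uses made-up loss values in place of `history.history` and selects the epoch with the lowest validation loss; all numbers are hypothetical and only show a typical pattern where training loss keeps falling while validation loss turns upward.

```python
# Hypothetical values standing in for history.history after a 4-epoch run
history_dict = {
    'loss': [1.20, 0.85, 0.60, 0.42],
    'val_loss': [1.05, 0.90, 0.93, 1.02],
}

val_loss = history_dict['val_loss']

# Index of the smallest validation loss, converted to a 1-based epoch number
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print("best epoch:", best_epoch)
print("val_loss at best epoch:", val_loss[best_epoch - 1])
```

Keras also provides `tf.keras.callbacks.EarlyStopping`, which can monitor `val_loss` during `fit` and stop training once it stops improving.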