This notebook details my process for working towards a fine-tuned language model for sentiment classification on the [IMDB dataset](https://huggingface.co/datasets/imdb), using a pre-trained DistilBERT classifier as a reference point and then fine-tuning `bert-base-uncased`. First I log in to Hugging Face, install the required packages, load the dataset directly from Hugging Face, and import various useful tools/metrics:

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
!pip install transformers[torch]
!pip install datasets
!apt-get install git-lfs

In [None]:
import torch

# Check if a GPU is available and select device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} for training.")

In [None]:
import transformers
from transformers import pipeline, DistilBertTokenizer, AutoTokenizer, \
 TrainingArguments, Trainer, AutoModelForSequenceClassification
from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score, \
 classification_report

In [None]:
ds = load_dataset('imdb')
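# Download the NLTK stopword list (only needed once per environment)
nltk.download('stopwords')
# Note: this rebinds `stopwords` from the NLTK corpus reader to a set of words
stopwords = set(stopwords.words('english'))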

In [None]:
# Keep the dataset's own split names: 'train' for fitting, 'test' for evaluation
train_ds = ds['train']
test_ds = ds['test']

In [None]:
test_ds.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='imdb', config_name='plain_text', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=33435948, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32653810, num_examples=25000, shard_lengths=None, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67113044, num_examples=50000, shard_lengths=None, dataset_name='imdb')}, download_checksums={'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/train-00000-of-00001.parquet': {'num_bytes': 20979968, 'checksum': None}, 'hf://datasets/imdb@e6281661ce1c48d982bc483cf8a173c1bbeb5d31/plain_text/test-00000-of-00001.parquet': {'num_bytes': 20470363, 'checksum': None}, 'hf:

In [None]:
ds['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

# Preprocessing and Representation

I want to establish a baseline against which to compare the more computationally heavy transformer models. While Hugging Face has tokenizers for use with most models, the baseline model is a simple bag-of-words representation with a random forest applied to it, so I wrote some quick code of my own to handle cleaning and tokenization.

In [None]:
# Convert train and test datasets to dataframes for SKLearn model
train_df = pd.DataFrame(train_ds)
test_df = pd.DataFrame(test_ds)

In [None]:
# Define preprocess_text function
def preprocess_text(text):
  # Removing all characters that are not alphanumeric or whitespace from text
  text = re.sub('[^0-9a-zA-Z\\s]+', '', text).lower()
  # Converting text to list of words
  text_list = text.split()
  # Removing stopwords
  text_list = [word for word in text_list if word not in stopwords]
  return text_list

In [None]:
# Apply preprocess_text and rejoin
for df in (train_df, test_df):
  df['text'] = df['text'].apply(preprocess_text).apply(lambda x: ' '.join(x))

The `preprocess_text` function removes every character that is not alphanumeric or whitespace, lowercases the text, and drops stopwords. I apply it to the text in both train_df and test_df, which are just DataFrame copies of the dataset splits loaded earlier, and then join the lists it returns back into strings.
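To make the cleaning step concrete, here is a minimal sketch of the same logic applied to a single review. The sentence is invented for illustration, and the small `demo_stopwords` set is a hypothetical stand-in for NLTK's English list, so the exact tokens removed in the real notebook may differ.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption)
demo_stopwords = {"a", "i", "the", "was"}

def preprocess_text_demo(text, stop_words):
    # Drop everything that is not alphanumeric or whitespace, then lowercase
    text = re.sub(r'[^0-9a-zA-Z\s]+', '', text).lower()
    # Split on whitespace and filter out stopwords
    return [word for word in text.split() if word not in stop_words]

# Invented review text containing HTML line breaks, as IMDB reviews often do
sample = "A great film!<br /><br />I wasn't bored."
tokens = preprocess_text_demo(sample, demo_stopwords)
print(tokens)
print(' '.join(tokens))
```

In this sketch the `<br />` tags survive as stray `br` tokens, and one of them is glued onto the preceding word (`filmbr`), so stripping HTML before removing punctuation may be worth considering.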

For tokenization/representation: TF-IDF strikes me as good for topic classification, but not ideal for sentiment analysis, because it down-weights words that appear in many documents. No matter how many times the word "bad" is used in the dataset, it is likely a good predictor of whether a review is positive or negative. Hence, I am going to try a simple BOW representation.
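The difference between the two representations can be sketched on a toy corpus. The documents below are invented, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is an assumed variant (it matches scikit-learn's default smoothing, without the final normalization step), so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "bad plot bad acting",
    "bad pacing",
    "bad ending great cast",
    "great soundtrack",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF, ln((1 + n) / (1 + df)) + 1 (assumed variant)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# Compare raw counts with TF-IDF weights for the first document
first_doc_counts = Counter(tokenized[0])
for term in ("bad", "plot"):
    count = first_doc_counts[term]
    weight = count * smoothed_idf(term)
    print(term, "count:", count, "tf-idf:", round(weight, 3))
```

With raw counts, "bad" carries twice the weight of "plot" in the first document. After IDF weighting the gap narrows, because "bad" appears in three of the four documents.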

In [None]:
# Define count_vectorizer to convert text to BOW and RandomForestClassifier
# for baseline classification
count_vectorizer = CountVectorizer()
rfc = RandomForestClassifier()

In [None]:
# Define X, y and X_test, and y_test for training and evaluation
X = count_vectorizer.fit_transform(train_df['text'])
y = train_df['label']

# Testing on a limited subset in the interest of time
test_df = test_df.sample(1000)
X_test = count_vectorizer.transform(test_df['text'])
y_test = test_df['label']

In [None]:
# Fit the rfc on the training data
rfc.fit(X, y)

In [None]:
y_pred = rfc.predict(X_test)

In [None]:
# Assess the accuracy of the model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83       489
           1       0.84      0.83      0.84       511

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000



The baseline model is fairly accurate, likely due to the simplicity of the task. However, there is still room for improvement.

# Pre-trained model

The next step is to download a pre-trained model from Hugging Face and use it for classification as-is. To do this I simply create a pipeline using the popular 'distilbert-base-uncased-finetuned-sst-2-english' model, keeping the tokenizer specified for the model.

In [None]:
# Define a pipeline for pretrained model
pipe = pipeline(
    task='text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    max_length=512,
    truncation=True,
)

A note: the pretrained models would not work unless max_length was set, and would return a warning if truncation wasn't set to True. The limit applies to tokens, not words: almost no reviews exceed 512 *words* in length, but some do exceed 512 tokens once the tokenizer has split words into subword pieces.
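The gap between word count and token count can be illustrated with a toy version of greedy longest-match-first subword splitting, the general approach WordPiece tokenizers take. This is only a sketch: `toy_vocab` is a hypothetical six-entry vocabulary, the real tokenizer uses a vocabulary of tens of thousands of entries and adds special tokens, and `max_length = 5` stands in for the 512-token limit.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary (assumption)
toy_vocab = {"the", "film", "was", "un", "##watch", "##able"}

words = "the film was unwatchable".split()
tokens = [piece for word in words for piece in wordpiece_tokenize(word, toy_vocab)]

max_length = 5  # stand-in for the 512-token limit
truncated = tokens[:max_length]
print(len(words), len(tokens), truncated)
```

Four words become six tokens here, so a review that fits within a word budget can still overflow the token budget and lose its tail to truncation.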

In [None]:
# Sampling 1000 datapoints from the test dataset. Similar to how the baseline
# model was evaluated.
test_ds_sample = test_ds.shuffle()[:1000]

In [None]:
# Use pipeline to obtain y_pred for the test data
y_pred = pipe(test_ds_sample['text'])

In [None]:
# Convert the y_pred to a form that SKLearn metrics functions can evaluate
y_pred = [0 if data['label'] == 'NEGATIVE' else 1 for data in y_pred]

In [None]:
# Evaluate model
print(classification_report(test_ds_sample['label'], y_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.91       514
           1       0.94      0.86      0.90       486

    accuracy                           0.91      1000
   macro avg       0.91      0.90      0.91      1000
weighted avg       0.91      0.91      0.91      1000



The pretrained model outperforms the simple baseline random forest with BOW tokenization, but not by as much as I would hope. This is, again, likely because most reviews are easy to classify and the remainder are genuinely difficult. In order to gain some traction on the reviews that are harder to parse, I fine-tuned the "bert-base-uncased" model for this particular task.
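Since both models were scored on samples of only 1000 reviews, it helps to check how much of the gap could be sampling noise. The sketch below uses the accuracies reported above and assumes a simple binomial model, with the two samples treated as independent. Both are simplifying assumptions rather than part of the original analysis.

```python
import math

n = 1000  # size of each evaluation sample
accuracies = {"random forest baseline": 0.84, "pretrained DistilBERT": 0.91}

def standard_error(p, n):
    # Standard error of a proportion under a binomial model
    return math.sqrt(p * (1 - p) / n)

# Approximate 95% interval for each accuracy
for name, acc in accuracies.items():
    half_width = 1.96 * standard_error(acc, n)
    print(f"{name}: {acc:.2f} +/- {half_width:.3f}")

# Treating the samples as independent (assumption), the variances add
gap = accuracies["pretrained DistilBERT"] - accuracies["random forest baseline"]
gap_se = math.sqrt(sum(standard_error(a, n) ** 2 for a in accuracies.values()))
print(f"gap: {gap:.2f}, standard error of gap: {gap_se:.3f}")
```

Under these assumptions the gap is several standard errors wide, so it is unlikely to be an artifact of which 1000 reviews were sampled.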

# Optimization and Deployment
I will use the Trainer class from Hugging Face for both optimization and deployment.

For deployment, I add the push_to_hub=True argument to my training_args to "deploy" my model to my Hugging Face account.

For optimization, I can modify the hyperparameters in the training_args to try to achieve a better fit to the data. I tried three different learning rates (1e-5, 2e-5, and 1e-4), since the learning rate seemed the most likely hyperparameter to make a difference. I simply ran the following:


In [None]:
# training_args = TrainingArguments(
#     output_dir='my_model',
#     learning_rate=2e-5,
#     per_device_train_batch_size=16,
#     per_device_eval_batch_size=16,
#     num_train_epochs=1,
#     push_to_hub=True,
# )

# model = AutoModelForSequenceClassification.from_pretrained(
#           "bert-base-uncased", num_labels=2)

# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# def tokenize_function(examples):
#     return tokenizer(examples['text'], padding="max_length", truncation=True)

# tokenized_datasets = ds.map(tokenize_function, batched=True)

# train_valid_tokenized = tokenized_datasets["train"].train_test_split(
#     test_size=0.2)

# trainer = Trainer(
#     model,
#     training_args,
#     train_dataset=train_valid_tokenized["train"],
#     eval_dataset=train_valid_tokenized["test"]
# )
# trainer.train()

with the training_args modified appropriately, resetting the model to its pretrained weights before each training run. In the end, I found that modifying the learning rate only seemed to affect runtime, not accuracy. It is likely that with more time/resources I could see notable gains from a proper hyperparameter search.
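The manual procedure amounts to a small sweep loop, sketched below. Both function names are hypothetical, and `fake_train_and_evaluate` is only a placeholder: a real version would reload the pretrained weights, build the training arguments with the given learning rate, train, and return a metric.

```python
def sweep_learning_rates(learning_rates, train_and_evaluate):
    """Run one independent training run per learning rate."""
    results = {}
    for lr in learning_rates:
        # train_and_evaluate is expected to start from fresh pretrained
        # weights on every call, so runs do not build on each other
        results[lr] = train_and_evaluate(lr)
    return results

def fake_train_and_evaluate(lr):
    # Placeholder (assumption): no training happens here, so no metric
    # is returned
    return None

results = sweep_learning_rates([1e-5, 2e-5, 1e-4], fake_train_and_evaluate)
print(results)
```

Keeping model creation inside the per-run function is what guarantees that each learning rate is evaluated from the same starting point.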

The final model was pushed to my Hugging Face hub, so I will import it from there.

In [None]:
# Import the pretrained model from my Hugging Face repo, tokenizer has to be
# respecified because it isn't in repo
pipe = pipeline(
    task='text-classification',
    model='toflyn/my_model',
    tokenizer='bert-base-uncased',
    max_length=512,
    truncation=True,
)

In [None]:
# Apply pipeline to the test_ds
y_pred = pipe(test_ds_sample['text'])

In [None]:
# Convert the y_pred labels to integers (0 for 'LABEL_0', 1 otherwise)
y_pred = [0 if data['label'] == 'LABEL_0' else 1 for data in y_pred]

In [None]:
# Create classification report
report = classification_report(test_ds_sample['label'], y_pred)

In [None]:
# Print report
print(report)

              precision    recall  f1-score   support

           0       0.96      0.95      0.96       527
           1       0.95      0.95      0.95       473

    accuracy                           0.95      1000
   macro avg       0.95      0.95      0.95      1000
weighted avg       0.95      0.95      0.95      1000



Even without much of a hyperparameter search, this model, fine-tuned on the IMDB dataset via transfer learning, gained a great deal of accuracy over the other two. It might be interesting to see an example of text that remains misclassified in the test data.
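One way to locate such examples is to compare the true and predicted labels position by position. The sketch below uses short made-up lists (`y_true_demo`, `y_pred_demo`) in place of `test_ds_sample['label']` and `y_pred`, so the indices it prints say nothing about the real sample.

```python
# Made-up labels standing in for test_ds_sample['label'] and y_pred
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 0]

# Indices where the prediction disagrees with the true label
misclassified = [
    i for i, (true, pred) in enumerate(zip(y_true_demo, y_pred_demo))
    if true != pred
]
print(misclassified)
```

Any index from such a list can then be used to print the corresponding label and review text.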

In [None]:
print(test_ds_sample['label'][41])
print(test_ds_sample['text'][41])

0
I'm not going to waste my time writing an essay and waste your time. I would like to say, however, that all those who are uptight about this movie even being made are being totally ridiculous. Johnnymacbest, you can't play that card, and I mean the nationalism patriotic card, to make people not watch this movie. this is a movie that expresses its free will in this country (oh yea i played that card)even though the director is German and deserves a slap on the wrist every time he reaches for a camera, but the world is full of controversy, and its the same thing you've heard before, like the GTA controversy. It's a comedy, laugh and be disgusted, not disgusted all together, its dark humor and obviously you don't have the heart to take it, the past is past, yes people remember those who were lost but its time to move on, that was !!!7!!! years ago. You can still watch this movie and have good moral values. <br /><br />Besides I'm insane, and who needs a soul?


We will have to forgive the model for misclassifying that one.