# Sami Abdelazim - JC Foster

This notebook fine-tunes the GPT-2 model provided by Hugging Face on a labeled Twitter sentiment analysis dataset from Kaggle. The fine-tuned classifier predicts label 1 for positive tweets and label 0 for negative tweets.

## I used the following articles to help me code this:
https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

https://www.kaggle.com/code/paoloripamonti/twitter-sentiment-analysis

Kaggle Dataset Used: https://www.kaggle.com/datasets/kazanova/sentiment140

In [None]:
! pip install ftfy
! pip install transformers
! pip install kaggle
# download dataset (assumes the Kaggle API token file kaggle.json is in the working directory)
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download kazanova/sentiment140
! unzip sentiment140.zip

import io
import os
import re
import torch
import pandas as pd
import nltk
import sklearn
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm.notebook import tqdm
from ftfy import fix_text
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          GPT2Config,
                          GPT2Tokenizer,
                          AdamW, 
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)

Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Collecting transformers
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.wh

In [None]:
# load data
columns = ["target", "ids", "date", "flag", "user", "text"]
data_encoding = "ISO-8859-1"
raw_df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding=data_encoding, names=columns)
# sample 100,000 tweets per class to save on training time
new_neg = raw_df[raw_df['target']==0].sample(100000)
new_pos = raw_df[raw_df['target']==4].sample(100000)
df = pd.concat([new_neg,new_pos])

# positive tweets are labeled with 4 in the raw data; this maps the labels {0, 4} to {0, 1}
decode_map = {0 : 0 , 4 : 1}
def decode_sentiment(label):
    return decode_map[int(label)]

# rest is taken from https://www.kaggle.com/code/paoloripamonti/twitter-sentiment-analysis
df.target = df.target.apply(lambda x: decode_sentiment(x))
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")

# raw string, so that `\S` reaches the regex engine without an invalid-escape warning
symbols_to_remove = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

def preprocess(text, stem=False):
    # Remove link,user and special characters
    text = re.sub(symbols_to_remove, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

# remove stopwords and special characters
df.text = df.text.apply(lambda x: preprocess(x))

# split data into training (80%) and validation (20%) sets
df_train, df_valid = train_test_split(df[['target','text']], test_size=0.2, random_state=42)
print("TRAIN size:", len(df_train))
print("VALID size:", len(df_valid))

TRAIN size: 160000
VALID size: 40000
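
To make the cleaning step concrete, here is a minimal sketch of what `preprocess` does to one made-up tweet. It reuses the notebook's regular expression, but `DEMO_STOP_WORDS` and `preprocess_demo` are hypothetical names, and the tiny stop-word set is only a stand-in for NLTK's English list so that the example runs without downloads.

```python
import re

# Same pattern as the notebook's `symbols_to_remove`, written as a raw string.
PATTERN = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

# Tiny stand-in for NLTK's English stop-word list (assumption, illustration only).
DEMO_STOP_WORDS = {"i", "the", "is", "a", "my", "so", "to", "at"}

def preprocess_demo(text):
    # Lowercase, then replace mentions, links and non-alphanumeric runs with spaces.
    cleaned = re.sub(PATTERN, " ", str(text).lower()).strip()
    # Drop stop words and rejoin the remaining tokens.
    return " ".join(t for t in cleaned.split() if t not in DEMO_STOP_WORDS)

sample = "@bob I LOVE the new phone!!! http://t.co/xyz so happy :)"
result = preprocess_demo(sample)
print(result)  # love new phone happy
```

The mention, the link, the punctuation and the emoticon are all gone, which also means that signals such as `:)` never reach the model.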


In [None]:
## define the Dataset class to pass to the DataLoader

class TwitterData(Dataset):
  def __init__(self, df):
    self.texts = df['text'].values
    self.labels = df['target'].values
    self.n_examples = len(self.labels)

  def __len__(self):
    return self.n_examples

  def __getitem__(self, item):
    return {'text':self.texts[item],
            'label':self.labels[item]}

In [None]:
## taken from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

class Gpt2ClassificationCollator(object):
    r"""
    Data collator used for GPT-2 in a classification task.

    It uses a given tokenizer to convert a batch of texts into padded tensors
    and attaches the integer labels, so the result can go straight into a
    GPT-2 model.

    This class is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

      use_tokenizer (:obj:`transformers.tokenization_?`):
          Transformer type tokenizer used to process raw text into numbers.

      max_sequence_len (:obj:`int`, `optional`)
          Maximum sequence length; longer texts are truncated to it.
          If no value is passed, the tokenizer's `model_max_length` is used.

    """

    def __init__(self, use_tokenizer, max_sequence_len=None):

        # Tokenizer to be used inside the class.
        self.use_tokenizer = use_tokenizer
        # Check max sequence length.
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
        return

    def __call__(self, sequences):
        r"""
        This function allows the class object to be used as a function call.
        Since the PyTorch DataLoader needs a collator function, I can use this
        class as a function.

        Arguments:

          sequences (:obj:`list`):
              List of dictionaries, each holding a `text` and a `label` entry.

        Returns:
          :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model.
          It can be passed to the model as `model(**returned_dictionary)`.
        """

        # Get all texts from sequences list.
        #print(sequences)
        texts = [sequence['text'] for sequence in sequences]
        # Get all labels from sequences list.
        labels = [sequence['label'] for sequence in sequences]
        # Call tokenizer on all texts to convert into tensors of numbers with 
        # appropriate padding.
        inputs = self.use_tokenizer(text=texts, return_tensors="pt", padding=True, truncation=True,  max_length=self.max_sequence_len)
        # Update the inputs with the associated encoded labels as tensor.
        inputs.update({'labels':torch.tensor(labels)})

        return inputs
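
The padding and truncation that the tokenizer performs inside the collator can be pictured with a small pure-Python sketch. This is not the tokenizer's actual implementation: `left_pad` is a hypothetical helper, the token ids are made up, and only the left-padding convention and the pad id 50256 come from the notebook.

```python
def left_pad(batch_ids, pad_id, max_len=None):
    """Pad every sequence on the left to the longest length in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    target = longest if max_len is None else min(longest, max_len)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:target]                      # truncate sequences that are too long
        n_pad = target - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = left_pad([[11, 12, 13], [21]], pad_id=50256)
print(batch["input_ids"])       # [[11, 12, 13], [50256, 50256, 21]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]
```

The attention mask is what tells the model to ignore the padded positions, so every sequence in a batch can share one tensor shape.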

In [None]:
# Set seed for reproducibility.
set_seed(123)
epochs = 6
batch_size = 32
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_name_or_path = 'gpt2'

In [None]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

# Get model configuration.
print('Loading configuration...')
model_config = GPT2Config.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=2)

# Get model's tokenizer.
print('Loading tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# default to left padding
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token

# Get the actual model.
print('Loading model...')
model = GPT2ForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_name_or_path, config=model_config)

# resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))

# fix model padding token id
model.config.pad_token_id = model.config.eos_token_id

# Load model to defined device.
model.to(device)
print('Model loaded to `%s`'%device)

Loading configuration...
Loading tokenizer...
Loading model...
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded to `cuda`


In [None]:
# amended from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

gpt2_classificaiton_collator = Gpt2ClassificationCollator(use_tokenizer=tokenizer)

print('Dealing with Train...')
# Create pytorch dataset.
train_dataset = TwitterData(df_train)
print('Created `train_dataset` with %d examples!'%len(train_dataset))
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              collate_fn=gpt2_classificaiton_collator)
print('Created `train_dataloader` with %d batches!'%len(train_dataloader))

print()


print('Dealing with Validation...')
# Create pytorch dataset.
valid_dataset = TwitterData(df_valid)
print('Created `valid_dataset` with %d examples!'%len(valid_dataset))
valid_dataloader = DataLoader(valid_dataset,
                              batch_size=batch_size,
                              shuffle=False,
                              collate_fn=gpt2_classificaiton_collator)
print('Created `valid_dataloader` with %d batches!'%len(valid_dataloader))
print()

Dealing with Train...
Created `train_dataset` with 160000 examples!
Created `train_dataloader` with 5000 batches!

Dealing with Validation...
Created `valid_dataset` with 40000 examples!
Created `valid_dataloader` with 1250 batches!
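
The batch counts printed above follow directly from the split sizes and `batch_size = 32`. A quick arithmetic check, as a sketch (`num_batches` is a hypothetical helper, not part of the notebook):

```python
import math

def num_batches(n_examples, batch_size):
    # A DataLoader keeps the final partial batch by default, hence the ceiling.
    return math.ceil(n_examples / batch_size)

train_batches = num_batches(160_000, 32)
valid_batches = num_batches(40_000, 32)
# With epochs = 6, this is the number of optimizer steps in the whole run.
total_steps = train_batches * 6
print(train_batches, valid_batches, total_steps)  # 5000 1250 30000
```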



In [None]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

def train(dataloader, optimizer_, scheduler_, device_):
  # Use global variable for model.
  global model

  # Tracking variables.
  predictions_labels = []
  true_labels = []
  # Total loss for this epoch.
  total_loss = 0

  # Put the model into training mode.
  model.train()

  for batch in tqdm(dataloader, total=len(dataloader)):
    # Add original labels - use later for evaluation.
    true_labels += batch['labels'].numpy().flatten().tolist()

    # move batch to device
    batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

    # Always clear any previously calculated gradients before performing a
    # backward pass.
    model.zero_grad()

    # Perform a forward pass (evaluate the model on this training batch).
    # Because the batch includes `labels`, the output contains the loss
    # as well as the logits.
    outputs = model(**batch)

    # The output can be indexed like a tuple: its first two entries are the
    # loss and the logits. We will use the logits later to calculate
    # training accuracy.
    loss, logits = outputs[:2]

    # Accumulate the training loss over all of the batches so that we can
    # calculate the average loss at the end. `loss` is a Tensor containing a
    # single value; the `.item()` function just returns the Python value 
    # from the tensor.
    total_loss += loss.item()

    # Perform a backward pass to calculate the gradients.
    loss.backward()

    # Clip the norm of the gradients to 1.0.
    # This is to help prevent the "exploding gradients" problem.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # Update parameters and take a step using the computed gradient.
    # The optimizer dictates the "update rule"--how the parameters are
    # modified based on their gradients, the learning rate, etc.
    optimizer_.step()

    # Update the learning rate.
    scheduler_.step()

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()

    # Convert these logits to list of predicted labels values.
    predictions_labels += logits.argmax(axis=-1).flatten().tolist()

  # Calculate the average loss over the training data.
  avg_epoch_loss = total_loss / len(dataloader)

  # Return all true labels and prediction for future evaluations.
  return true_labels, predictions_labels, avg_epoch_loss
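
The gradient clipping step in `train` rescales all gradients together whenever their combined norm exceeds the limit. The sketch below reproduces that idea in pure Python on made-up numbers; `clip_by_global_norm` is a hypothetical helper written for illustration, not PyTorch's own code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients already inside the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two "parameters" whose gradients have a combined norm of 5.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))  # 5.0 1.0
```

The direction of the update is preserved; only its length is capped, which is why clipping helps with occasional very large gradients without changing ordinary steps.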


In [None]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

def validation(dataloader, device_):
  global model

  # Tracking variables
  predictions_labels = []
  true_labels = []
  #total loss for this epoch.
  total_loss = 0

  # Put the model in evaluation mode--the dropout layers behave differently
  # during evaluation.
  model.eval()

  # Evaluate data for one epoch
  for batch in tqdm(dataloader, total=len(dataloader)):
    # add original labels
    true_labels += batch['labels'].numpy().flatten().tolist()

    # move batch to device
    batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

    # Tell PyTorch not to compute or store gradients, saving memory and
    # speeding up validation.
    with torch.no_grad():
      # Forward pass. The batch still includes `labels`, so the output
      # contains both the validation loss and the logits.
      outputs = model(**batch)
      loss, logits = outputs[:2]
      logits = logits.detach().cpu().numpy()
      total_loss += loss.item()
      predict_content = logits.argmax(axis=-1).flatten().tolist()
      predictions_labels += predict_content

  # Calculate the average loss over the validation data.
  avg_epoch_loss = total_loss / len(dataloader)
  # Return all true labels and predictions for future evaluations.
  return true_labels, predictions_labels, avg_epoch_loss
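
Both `train` and `validation` turn logits into predicted labels with an argmax over the last axis, and accuracy is then the fraction of matching labels. A small pure-Python sketch of that step, with made-up logits and hypothetical helper names:

```python
def predict_labels(logits):
    # For each row, pick the index (class) with the largest logit.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(true_labels, predicted_labels):
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# One row per tweet: [logit for negative (0), logit for positive (1)].
demo_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4]]
demo_true = [1, 0, 0]

demo_pred = predict_labels(demo_logits)
demo_acc = accuracy(demo_true, demo_pred)
print(demo_pred, round(demo_acc, 3))  # [1, 0, 1] 0.667
```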

In [None]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # learning rate used in this notebook
                  eps = 1e-8 # small constant for numerical stability
                  )

# Total number of training steps is number of batches * number of epochs.
# `train_dataloader` contains batched data so `len(train_dataloader)` gives 
# us the number of batches.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# `num_warmup_steps` is a required argument; 0 means no warmup phase.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)


# Loop through each epoch.
print('Epoch')
for epoch in tqdm(range(epochs)):
  print()
  print('Training on batches...')
  # Perform one full pass over the training set.
  train_labels, train_predict, train_loss = train(train_dataloader, optimizer, scheduler, device)
  train_acc = accuracy_score(train_labels, train_predict)
  # Get predictions from the model on validation data.
  print('Validation on batches...')
  valid_labels, valid_predict, val_loss = validation(valid_dataloader, device)
  val_acc = accuracy_score(valid_labels, valid_predict)

  # Print loss and accuracy values to see how training evolves.
  print("  train_loss: %.5f - val_loss: %.5f - train_acc: %.5f - valid_acc: %.5f"%(train_loss, val_loss, train_acc, val_acc))
  print()

Epoch

Training on batches...
Validation on batches...
  train_loss: 0.56503 - val_loss: 0.52433 - train_acc: 0.69828 - valid_acc: 0.73628

Training on batches...
Validation on batches...
  train_loss: 0.51147 - val_loss: 0.51447 - train_acc: 0.74271 - valid_acc: 0.74530

Training on batches...
Validation on batches...
  train_loss: 0.49046 - val_loss: 0.52039 - train_acc: 0.75686 - valid_acc: 0.74928

Training on batches...
Validation on batches...
  train_loss: 0.47341 - val_loss: 0.51010 - train_acc: 0.76708 - valid_acc: 0.75043

Training on batches...
Validation on batches...
  train_loss: 0.46046 - val_loss: 0.51222 - train_acc: 0.77400 - valid_acc: 0.75353

Training on batches...
Validation on batches...
  train_loss: 0.44971 - val_loss: 0.51416 - train_acc: 0.78126 - valid_acc: 0.75270
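
During this run the learning rate is stepped after every batch. The sketch below shows how a linear schedule with warmup evolves over the run, assuming zero warmup steps and the notebook's values (base rate 2e-5, 5000 batches, 6 epochs). `linear_schedule_lr` is a hypothetical helper that mirrors the usual rule, not the library function itself.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

total_steps = 5000 * 6  # batches per epoch * epochs
lr_start = linear_schedule_lr(0, 2e-5, total_steps)
lr_middle = linear_schedule_lr(total_steps // 2, 2e-5, total_steps)
lr_end = linear_schedule_lr(total_steps, 2e-5, total_steps)
print(lr_start, lr_middle, lr_end)  # 2e-05 1e-05 0.0
```

Under these assumptions the last epochs are trained with a much smaller learning rate than the first, which is one reason the training loss improves more slowly towards the end.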



In [None]:
# save model to google drive
output_model = 'drive/MyDrive/DS-301_PROJECT/twitter_SA_lw.pth'

# save the model, optimizer and scheduler state dicts in one checkpoint file
def save(model, optimizer, scheduler):
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict()
    }, output_model)

save(model, optimizer, scheduler)
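
A checkpoint written this way can later be restored by rebuilding the same objects and loading each state dict back. The following is a minimal sketch of that round trip. It uses a small `torch.nn.Linear` layer and a temporary file as stand-ins (assumptions made to keep the example small and runnable), not the fine-tuned GPT-2 model or the Google Drive path.

```python
import os
import tempfile

import torch

# Hypothetical stand-in for the fine-tuned model, to keep the sketch small.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Save the states under the same dictionary keys as in the notebook.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pth")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, checkpoint_path)

# Rebuild the same architecture, then load the saved states into it.
restored_model = torch.nn.Linear(4, 2)
restored_optimizer = torch.optim.AdamW(restored_model.parameters(), lr=2e-5)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

weights_match = torch.equal(model.weight, restored_model.weight)
print(weights_match)  # True
```

For the notebook's model, the same pattern would apply after recreating `GPT2ForSequenceClassification` with the same configuration; the scheduler state can be restored the same way through its own `load_state_dict`.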