## Sentiment Classification using HuggingFace Transformers

## Introduction
This notebook trains a binary sentiment classifier by fine-tuning a pre-trained transformer model (`roberta-base`) from HuggingFace. Two datasets, IMDB movie reviews and a tweet sentiment dataset, are merged so that the resulting model handles general sentiment rather than a single domain.



# Necessary downloads of packages/libraries

In [None]:
!pip install datasets evaluate accelerate

In [None]:
!pip install transformers[torch] accelerate -U


In [None]:
!pip install datasets transformers accelerate adafactor

In [None]:
pip install datasets scikit-learn transformers torch pandas numpy accelerate


# Google Drive Setup

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Libraries Import, Data Loading and Preprocessing

In [6]:
# Import necessary libraries
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import DataLoader
import torch
import pandas as pd
import numpy as np
from accelerate import Accelerator

# Define function for preprocessing (uses the global `tokenizer`, which is loaded before this function is called)
def preprocess_function(examples):
    # Placeholder for text augmentation (e.g., synonym replacement);
    # at present the texts are passed through unchanged.
    augmented_examples = list(examples["text"])
    tokenized_inputs = tokenizer(
        augmented_examples,
        truncation=True,       # cut texts longer than max_length
        max_length=128,        # experiment with different max lengths
        padding="max_length",  # pad every sequence to exactly max_length
        return_tensors="pt"
    )
    # Printed once per batch; always 128 because of padding="max_length"
    print("Tokenized sequence lengths:", tokenized_inputs['input_ids'].size(1))
    return tokenized_inputs

# Load datasets
imdb_dataset = load_dataset("stanfordnlp/imdb")
tweet_dataset = load_dataset("mteb/tweet_sentiment_extraction")

# Filter tweet dataset to remove neutral sentiment
tweet_dataset = tweet_dataset.filter(lambda example: example['label_text'] != 'neutral')

# Map sentiment labels to numerical values
label_map_tweet = {'negative': 0, 'positive': 1}
tweet_dataset = tweet_dataset.map(lambda example: {'label': label_map_tweet[example['label_text']], 'text': example['text']})

# Convert datasets to pandas dataframe and merge them
tweet_df = pd.DataFrame(tweet_dataset['train'])
imdb_df = pd.DataFrame(imdb_dataset['train'])
merged_df = pd.concat([imdb_df, tweet_df], ignore_index=True)

# Split merged dataset into train, validation, and test sets
train_data, test_data = train_test_split(merged_df, test_size=0.1, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)








Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]


Generating train split:   0%|          | 0/27481 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3534 [00:00<?, ? examples/s]

Filter:   0%|          | 0/27481 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3534 [00:00<?, ? examples/s]

Map:   0%|          | 0/16363 [00:00<?, ? examples/s]

Map:   0%|          | 0/2104 [00:00<?, ? examples/s]

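* Loads two datasets from the HuggingFace Hub:

  1.   `stanfordnlp/imdb` (imdb_dataset)
  2.   `mteb/tweet_sentiment_extraction` (tweet_dataset)

* Filters the tweet dataset to remove neutral examples and maps the remaining labels to 0 (negative) and 1 (positive).
* Merges the two training splits and divides the result into train, validation, and test sets using two successive 90/10 splits.
* Defines a preprocessing function to tokenize the input text using a pre-trained tokenizer (RobertaTokenizerFast).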
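
The filtering, label mapping, and two-stage split described above can be illustrated without downloading anything. This is a minimal, dependency-free sketch: the toy records are made up, and `split_sizes` is a hypothetical helper that assumes the held-out part is rounded up, which reproduces the 33503 / 3723 / 4137 example counts shown by the `Map` lines in the next cell's output.

```python
import math

# Toy records standing in for the tweet dataset (made-up examples).
tweets = [
    {"text": "great day", "label_text": "positive"},
    {"text": "it is a phone", "label_text": "neutral"},
    {"text": "worst service", "label_text": "negative"},
]

# Drop neutral rows, then map the remaining label names to integers.
label_map = {"negative": 0, "positive": 1}
binary_tweets = [
    {"text": row["text"], "label": label_map[row["label_text"]]}
    for row in tweets
    if row["label_text"] != "neutral"
]
print(binary_tweets)


def split_sizes(n_rows, held_out_fraction=0.1):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - held_out, held_out


# 25000 IMDB training reviews + 16363 non-neutral tweets.
merged = 25000 + 16363
train_val, test = split_sizes(merged)  # first split: carve off the test set
train, val = split_sizes(train_val)    # second split: carve off validation
print(train, val, test)  # 33503 3723 4137
```

Because the second split is taken from the remaining 90%, the final proportions are roughly 81% train, 9% validation, and 10% test.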




## DataLoader Definition, Tokenization and Model Setup

In [7]:
# Load tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Apply preprocessing function to datasets
train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)
test_dataset = Dataset.from_pandas(test_data)

train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Define DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Define Accelerator
accelerator = Accelerator()

# Define model configuration
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    id2label=id2label,
    label2id=label2id,
)
model.resize_token_embeddings(len(tokenizer))



Map:   0%|          | 0/33503 [00:00<?, ? examples/s]

Tokenized sequence lengths: 128

Map:   0%|          | 0/3723 [00:00<?, ? examples/s]



Map:   0%|          | 0/4137 [00:00<?, ? examples/s]




Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(50265, 768, padding_idx=1)

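* Initializes a tokenizer (RobertaTokenizerFast) for tokenizing text data.

* Applies the preprocessing function to convert text data into tokenized sequences, each padded or truncated to 128 tokens.

* Defines DataLoaders that yield batches of 16 examples for training, validation, and testing. The Trainer used in the next section builds its own data loaders from the datasets, so these loaders are not used in the rest of the notebook.

* Initializes an Accelerator object, which handles device placement and distributed training if available. It is likewise not used by the Trainer-based training below.

* Initializes a pre-trained model for sequence classification (RobertaForSequenceClassification). As the warning in the output notes, the classification head is newly initialized and is learned during fine-tuning.

* Uses the roberta-base variant of the RoBERTa model, which is a widely used pre-trained model for NLP tasks.
* The RoBERTa base model (roberta-base) was preferred over BERT base and DistilBERT because its pre-training recipe (dynamic masking, no next-sentence-prediction objective, and more varied pre-training data) generally yields stronger downstream accuracy.


* Configures the model to map class indices to sentiment labels (0: "negative", 1: "positive").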
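
To make the effect of `truncation=True`, `max_length=128`, and `padding="max_length"` concrete, here is a minimal sketch of what happens to a list of token ids. It does not call the real tokenizer: `pad_or_truncate` is a hypothetical helper, the content ids are made up, and a short `max_length` of 8 is used only to keep the output readable. The special-token ids (0, 1, 2) are RoBERTa's `<s>`, `<pad>`, and `</s>`; the pad id agrees with the `padding_idx=1` in the embedding output above.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # RoBERTa's <s>, <pad>, </s> ids


def pad_or_truncate(content_ids, max_length=128):
    """Wrap ids in <s> ... </s>, then cut or pad to exactly max_length."""
    # Leave room for the two special tokens when truncating.
    content_ids = content_ids[: max_length - 2]
    input_ids = [BOS_ID] + content_ids + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions get attention mask 0.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short = pad_or_truncate([100, 200, 300], max_length=8)
long = pad_or_truncate(list(range(10, 30)), max_length=8)
print(short["input_ids"])       # [0, 100, 200, 300, 2, 1, 1, 1]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long["input_ids"])        # [0, 10, 11, 12, 13, 14, 15, 2]
```

Padding everything to the full length keeps tensor shapes uniform, at the cost of extra computation on short texts such as tweets.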

# Training Configuration and Trainer

In [9]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/NLP',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=400,  # almost 10% of total steps
    weight_decay=0.01,  # regularization to discourage overfitting
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    load_best_model_at_end=True,  # reload the checkpoint with the lowest validation loss
)

# Compute accuracy from the logits and labels passed in by the Trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # highest-scoring class per example
    return {"accuracy": (predictions == labels).mean()}


# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


# Train the model and save the model after training
trainer.train()
trainer.save_model('Best_model_3')



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1741        | 0.339022        | 0.897932 |
| 2     | 0.1676        | 0.329351        | 0.911093 |


The model reached a validation accuracy of 91.1% after two epochs. Both the training loss (0.1741 to 0.1676) and the validation loss (0.3390 to 0.3294) decreased from the first epoch to the second, so there is no sign of overfitting within these two epochs.
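
The accuracy reported above is the fraction of validation examples whose highest-scoring logit matches the true label. A minimal, dependency-free sketch of that computation follows; the logits and labels are made up, and `accuracy_from_logits` is a hypothetical stand-in for the NumPy-based `compute_metrics` used in the notebook.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the true label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four made-up examples with two logits each (negative, positive).
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```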

* Sets up training arguments (e.g., number of epochs, batch size, logging directory) using TrainingArguments.
* Defines an instance of the Trainer class to manage the training process.
* Trains the model using the defined Trainer object.
* Reloads the checkpoint with the lowest validation loss (`load_best_model_at_end=True`) and saves it to `Best_model_3`.
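
The "almost 10% of total steps" comment on `warmup_steps=400` can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so each optimizer step processes one batch of 16. `lr_at_step` is a hypothetical helper that mimics linear warmup followed by linear decay, and the peak learning rate of 5e-5 is the `TrainingArguments` default rather than a value set in the notebook.

```python
import math

train_examples = 33503  # training split size shown in the Map output
batch_size = 16
epochs = 2
warmup_steps = 400

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)          # 2094 4188
print(round(warmup_steps / total_steps, 3))  # 0.096


def lr_at_step(step, peak_lr=5e-5):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(lr_at_step(200))          # halfway through warmup
print(lr_at_step(400))          # peak learning rate
print(lr_at_step(total_steps))  # end of training
```

With these numbers the warmup covers about 9.6% of the 4188 optimizer steps, which agrees with the comment in the training arguments.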

# Evaluation of the Model

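On the ten hand-written sentences below, the model's predictions match the expected sentiment for every clearly positive or negative sentence. The mixed and neutral sentences (the first and third examples) have no single correct label in a binary setup; the model assigns both to the positive class.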

In [11]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the trained model and tokenizer
model_path = "/content/Best_model_3"  # Update the path to the directory where your model is saved
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Example strings for inference
example_strings = [
    "I loved the movie but hated the ending",
    "The film was not good",
    "This film was normal not too bad or too good!",
    "I'm disappointed",
    "The book was good",
    "The restaurant had terrible food.",
    "i hate you.",
    "i love the final product!",
    "The customer support was very good",
    "I hate to waste money"
]

# Tokenize example strings
tokenized_examples = tokenizer(example_strings, truncation=True, padding=True, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**tokenized_examples)
    predictions = torch.argmax(outputs.logits, dim=1).cpu().numpy()

# Define label mapping if needed
id2label = {0: "negative", 1: "positive"}  # Update with your label mapping if different

# Map predictions to labels
predicted_labels = [id2label[pred] for pred in predictions]

# Print predicted sentiments for each example
for example, sentiment in zip(example_strings, predicted_labels):
    print(f"Example: {example} - Predicted Sentiment: {sentiment}")


Example: I loved the movie but hated the ending - Predicted Sentiment: positive
Example: The film was not good - Predicted Sentiment: negative
Example: This film was normal not too bad or too good! - Predicted Sentiment: positive
Example: I'm disappointed - Predicted Sentiment: negative
Example: The book was good - Predicted Sentiment: positive
Example: The restaurant had terrible food. - Predicted Sentiment: negative
Example: i hate you. - Predicted Sentiment: negative
Example: i love the final product! - Predicted Sentiment: positive
Example: The customer support was very good - Predicted Sentiment: positive
Example: I hate to waste money - Predicted Sentiment: negative
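
The inference cell takes the arg-max of the logits, which discards how confident the model is. Applying a softmax to the same logits gives class probabilities, which is one way to spot borderline cases such as mixed or neutral sentences. The sketch is dependency-free; the logits are made-up values, not outputs of the trained model, and `softmax` is a hypothetical helper.

```python
import math

id2label = {0: "negative", 1: "positive"}


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two sentences: one clear-cut, one borderline.
for logits in ([-2.0, 2.0], [0.1, 0.3]):
    probs = softmax(logits)
    best = probs.index(max(probs))
    print(id2label[best], round(probs[best], 3))
```

Both rows are labelled positive, but the second probability is close to 0.5, which suggests treating that prediction with caution.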
