<a href="https://colab.research.google.com/github/shaaranii12/emotion-analyzer/blob/main/Emotion_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation & Preprocessing

In [None]:
#Installing the Transformers library
!pip install transformers

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#from google.colab import files
#uploaded = files.upload()

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

#Loading fine tuned model from Google Drive
Roberta = "/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/finetuned_model"

#Roberta = "j-hartmann/emotion-english-distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(Roberta)
model = AutoModelForSequenceClassification.from_pretrained(Roberta)

Initially, the pre-trained checkpoint j-hartmann/emotion-english-distilroberta-base was used as a baseline. After fine-tuning on the dataset, the model and tokenizer were saved to Google Drive and are now loaded directly to save training time and ensure consistent results.
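
On a fresh session where the Drive folder does not exist yet, loading from the local path would fail. A minimal sketch of a fallback, assuming a hypothetical helper named `resolve_checkpoint` that is not part of the original notebook:

```python
import os

def resolve_checkpoint(local_dir, hub_id):
    """Return local_dir if it exists on disk, otherwise fall back to the Hub ID."""
    return local_dir if os.path.isdir(local_dir) else hub_id

# Path and checkpoint ID are taken from the cell above; the helper is hypothetical.
checkpoint = resolve_checkpoint(
    "/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/finetuned_model",
    "j-hartmann/emotion-english-distilroberta-base",
)
print(checkpoint)
```

The returned string could then be passed to `from_pretrained` in place of `Roberta`.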

In [None]:
model.config.id2label

**So, now we know the model supports 7 emotions:**

0: anger 🤬

1: disgust 🤢

2: fear 😨

3: joy 😀

4: neutral 😐

5: sadness 😭

6: surprise 😲
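
To illustrate how these IDs are used, the sketch below turns a vector of raw model scores (logits) into probabilities and looks up the winning emotion. The logits are made up for illustration; real values would come from the model's output.

```python
import math

# Label order as listed above (copied from model.config.id2label).
id2label = {0: "anger", 1: "disgust", 2: "fear", 3: "joy",
            4: "neutral", 5: "sadness", 6: "surprise"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence.
logits = [-1.2, -2.0, -0.5, 3.1, 0.4, -0.9, 0.2]
probs = softmax(logits)

# The index of the largest probability is the predicted label ID.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id, id2label[predicted_id])
```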

### Dataset 1: Emotions by Nidula Elgiriyewithana (2023)

In [None]:
import pandas as pd
#Loading of dataset
dataset1 = pd.read_csv('/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/Emotions.csv', encoding='latin-1', on_bad_lines='skip', quoting=3)
dataset1.shape

In [None]:
#Drop unnecessary columns
dataset1 = dataset1.drop(columns=["Unnamed: 0"])

In [None]:
print(dataset1.head())

Drop all rows labeled love (label 2): the model doesn’t support this emotion, and it roughly corresponds to joy anyway.

In [None]:
dataset1 = dataset1[dataset1["label"] != 2].reset_index(drop=True)

In [None]:
#Map the labels to their corresponding emotions.
label_mapping = {0: 'sadness', 1: 'joy', 3: 'anger', 4: 'fear', 5: 'surprise'}
dataset1['emotion'] = dataset1['label'].map(label_mapping)

In [None]:
print(dataset1.head())

The dataset’s label IDs don’t match the model’s expected IDs: for example, sadness is 0 in the dataset but 5 in the model. To avoid confusion, we remap the labels to align with the model’s configuration.

In [None]:
#Dataset to model label mapping
remap = {
    0: 5, #Sadness
    1: 3, #Joy
    3: 0, #Anger
    4: 2, #Fear
    5: 6  #Surprise
}

#Remapping the labels
dataset1["label"] = dataset1["label"].map(remap)
dataset1 = dataset1.reset_index(drop=True)
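
A hand-written remap like this is easy to get wrong, so it can be worth deriving it from the emotion names and comparing. The sketch below is a sanity check under the assumption that the two name tables match the cells above; it is not part of the original notebook.

```python
# Names for each ID on both sides (taken from the cells above).
dataset_id2label = {0: "sadness", 1: "joy", 3: "anger", 4: "fear", 5: "surprise"}
model_id2label = {0: "anger", 1: "disgust", 2: "fear", 3: "joy",
                  4: "neutral", 5: "sadness", 6: "surprise"}
remap = {0: 5, 1: 3, 3: 0, 4: 2, 5: 6}

# Derive the mapping from the names and compare it with the hand-written one.
model_label2id = {name: idx for idx, name in model_id2label.items()}
derived = {ds_id: model_label2id[name] for ds_id, name in dataset_id2label.items()}
assert derived == remap, f"remap disagrees with label names: {derived}"
print(derived)
```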

In [None]:
print(dataset1.head())

### Dataset 2: Go Emotions by Shivam Bansal (2021)

In [None]:
#Loading of dataset
dataset2 = pd.read_csv('/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/GoEmotions.csv', encoding='latin-1', on_bad_lines='skip', quoting=3, low_memory=False)
dataset2.shape

In [None]:
print(dataset2.head())

The original dataset has multiple columns indicating different emotions with 0/1 values. A single label column is created by taking, for each text, the first of the model’s seven emotions (in the model’s label order) that is flagged with 1.

In [None]:
#The emotions supported by the model, in its label order
main_emotions = ['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']

#Assigning label
def assign_label(row):
    for i, col in enumerate(main_emotions):
        if row[col] == 1:
            return i
    #Assign 7 for emotions other than the 7 supported by the model
    return 7

dataset2['label'] = dataset2.apply(assign_label, axis=1)
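
Because `assign_label` returns on the first match, a row flagged for several emotions gets the one that comes earliest in `main_emotions`. The sketch below reproduces that rule in vectorized form on made-up rows (the helper `toy_row` and the texts are hypothetical), which may also run faster than a row-wise `apply` on a large frame.

```python
import pandas as pd

main_emotions = ['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']

def toy_row(text, *active):
    """Build one made-up row with a 0/1 flag per emotion."""
    return {"text": text, **{e: int(e in active) for e in main_emotions}}

toy = pd.DataFrame([
    toy_row("how dare you", "anger"),
    toy_row("wow, I love it", "joy", "surprise"),  # two flags set
    toy_row("thanks for sharing"),                 # none of the seven
    toy_row("okay", "neutral"),
])

flags = toy[main_emotions].eq(1)
first_match = flags.to_numpy().argmax(axis=1)  # position of the first True (0 if none)
# Keep the first match where any flag is set, otherwise use 7 ("other").
toy["label"] = pd.Series(first_match, index=toy.index).where(flags.any(axis=1), 7)
print(toy[["text", "label"]])
```

The second row shows the tie-breaking: it is flagged for both joy and surprise and ends up as joy (3).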

In [None]:
#mapping the labels (easier referencing)
label_map = {0: 'anger', 1: 'disgust', 2: 'fear', 3: 'joy', 4: 'neutral', 5: 'sadness', 6: 'surprise'}
dataset2['emotion'] = dataset2['label'].map(label_map)

In [None]:
#Drop all rows that have the label 7 (other emotions)
dataset2 = dataset2[dataset2['label'] != 7].reset_index(drop=True)

#Drop all other unnecessary columns
dataset2 = dataset2[['text', 'label', 'emotion']]

In [None]:
print(dataset2.head())

In [None]:
import re

def clean_text(text):
    #Lowercase
    text = text.lower()

    #Remove placeholders like [NAME], [RELIGION], etc.
    text = re.sub(r'\[.*?\]', '', text)

    #Remove common social media tokens like /s, /jk, <3
    text = re.sub(r'/s|/jk|<3', '', text)

    #Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    #Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

#Apply to the dataframe
dataset2['text'] = dataset2['text'].apply(clean_text)
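
To see what the cleaning steps do, the sketch below repeats `clean_text` so it runs on its own and applies it to made-up sentences. It also shows one side effect: the `/s` pattern matches inside words such as "he/she". The variant `clean_text_strict` is a hypothetical tightening, not something the original notebook uses.

```python
import re

def clean_text(text):
    # Same steps as the cell above, repeated so this example is self-contained.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'/s|/jk|<3', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_strict(text):
    # Hypothetical variant: strip /s, /jk and <3 only when they stand alone as tokens.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'(?<!\S)(?:/s|/jk|<3)(?!\S)', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("[NAME] is SO funny!! /s <3"))  # is so funny
print(clean_text("he/she said yes"))             # hehe said yes
print(clean_text_strict("he/she said yes"))      # heshe said yes
```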

### Train-Test Split

We first combine Dataset 1 and Dataset 2.

In [None]:
data = pd.concat([dataset1, dataset2], ignore_index=True)
data.shape

In [None]:
#Deduplicating data before splitting to avoid overlapping texts
data = data.drop_duplicates(subset=['text']).reset_index(drop=True)

The combined dataset has 450K+ rows, which is too large for quick fine-tuning. To speed things up, we downsample it to roughly 10,000 rows (the total used in the code below) with an equal number of samples per label.

In [None]:
#Sample an equal number of rows from each label (balanced sampling).
n_per_class = 10000 // data["label"].nunique()
data = data.groupby("label", group_keys = False).apply(lambda x: x.sample(n = n_per_class, random_state = 42))

print(data["label"].value_counts())
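
Sampling a fixed `n` per class raises an error if any class has fewer rows than `n`. The sketch below, on made-up data, caps the per-class count at the size of the smallest class and uses `groupby(...).sample`, which keeps the `label` column in the result. The target total and the toy frame are assumptions for illustration only.

```python
import pandas as pd

# Made-up, imbalanced data: 6 rows of label 0, 4 of label 1, 2 of label 2.
toy = pd.DataFrame({
    "text": [f"text {i}" for i in range(12)],
    "label": [0] * 6 + [1] * 4 + [2] * 2,
})

target_total = 9  # hypothetical budget
n_per_class = target_total // toy["label"].nunique()
# Never ask for more rows than the smallest class can provide.
n_per_class = min(n_per_class, int(toy["label"].value_counts().min()))

balanced = toy.groupby("label").sample(n=n_per_class, random_state=42)
print(balanced["label"].value_counts())
```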

In [None]:
#Get the number of unique emotions in the dataset
num_labels = data["label"].nunique()
print(f"Number of unique emotions: {num_labels}")

We split the dataset into training and test sets (80/20, a common convention) so the model can learn from one portion and be fairly evaluated on unseen data.


In [None]:
from sklearn.model_selection import train_test_split

#Stratified splitting of dataset
train_data, test_data = train_test_split(data, test_size = 0.2, stratify = data["label"], random_state = 42)

In [None]:
print("Train data distribution:\n", train_data["label"].value_counts())
print("\nTest data distribution:\n", test_data["label"].value_counts())

**Data leakage** is when information from the test set “leaks” into the training set, meaning the model accidentally sees data it shouldn’t during training. This makes the test accuracy look higher than it really is, because the model didn’t have to generalize; it simply memorized.

In [None]:
#Check to see if there's any overlapping texts (data leaks)
overlapping_texts = set(train_data["text"]).intersection(set(test_data["text"]))
print("Number of overlapping texts:", len(overlapping_texts))

#Drop overlaps directly from train_data and test_data to avoid data leakage
#train_data = train_data[~train_data["text"].isin(overlapping_texts)].reset_index(drop=True)
#test_data = test_data[~test_data["text"].isin(overlapping_texts)].reset_index(drop=True)
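
Because duplicates were dropped before the split, the overlap count is expected to be zero. If it were not, one option is to remove the shared texts from the training split only, so the test split stays untouched. A minimal sketch on made-up frames:

```python
import pandas as pd

# Made-up splits that share one text.
train_toy = pd.DataFrame({"text": ["i am happy", "so sad", "what a shock"],
                          "label": [3, 5, 6]})
test_toy = pd.DataFrame({"text": ["so sad", "i am scared"],
                         "label": [5, 2]})

overlap = set(train_toy["text"]) & set(test_toy["text"])
print("Overlapping texts before:", len(overlap))

# Remove the shared texts from the training split only.
train_toy = train_toy[~train_toy["text"].isin(overlap)].reset_index(drop=True)

remaining = set(train_toy["text"]) & set(test_toy["text"])
print("Overlapping texts after:", len(remaining))
```

Note that this check only catches exact string matches; near-duplicates would need a separate approach.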

## Model Fine-Tuning

In [None]:
#Converting pandas DataFrame into a Hugging Face Dataset
from datasets import Dataset, DatasetDict

train_hf = Dataset.from_pandas(train_data)
test_hf = Dataset.from_pandas(test_data)

data = DatasetDict({
    "train": train_hf,
    "test": test_hf
})

**Tokenization** turns text into numbers the model understands. This is important because models like RoBERTa cannot read raw text and only work with numerical tokens to learn patterns and make predictions.

In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=128)

data = data.map(tokenize, batched=True)
train_hf = train_hf.map(tokenize, batched=True)
test_hf = test_hf.map(tokenize, batched=True)
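
The `padding`, `truncation` and `max_length` arguments shape the output so every sequence in a batch has the same length. The sketch below imitates that behavior with a toy whitespace tokenizer. It is not RoBERTa's actual subword tokenizer, and `toy_tokenize` and its vocabulary are hypothetical; only the idea of `input_ids` plus `attention_mask` carries over.

```python
def toy_tokenize(texts, max_length=4, pad_id=1):
    """Whitespace tokenizer that truncates, then pads to the longest sequence."""
    vocab = {}

    def token_id(word):
        # Assign the next free ID to unseen words (0 and 1 are kept as reserved IDs).
        return vocab.setdefault(word, len(vocab) + 2)

    # Truncation: keep at most max_length tokens per text.
    encoded = [[token_id(w) for w in t.lower().split()][:max_length] for t in texts]
    longest = max(len(ids) for ids in encoded)

    # Padding: fill shorter sequences with pad_id and mark real tokens with 1.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_tokenize(["i am so happy today", "so sad"])
print(batch["input_ids"])       # [[2, 3, 4, 5], [4, 7, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask tells the model which positions are real tokens and which are padding to be ignored.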

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_function(prediction):
    labels = prediction.label_ids
    predictions = prediction.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average = 'weighted')
    accuracy = accuracy_score(labels, predictions)
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall,
    }
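
With `average = 'weighted'`, each class's score is weighted by its share of true samples. The sketch below computes the same quantities by hand on made-up labels; the helper `weighted_scores` is hypothetical and only mirrors the definition, it does not replace scikit-learn.

```python
def weighted_scores(labels, predictions):
    """Support-weighted precision, recall and F1, computed by hand."""
    classes = sorted(set(labels))
    total = len(labels)
    precision = recall = f1 = 0.0
    for c in classes:
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # share of true samples belonging to class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Made-up labels and predictions for three classes.
toy_labels = [0, 0, 1, 1, 1, 2]
toy_predictions = [0, 1, 1, 1, 2, 2]
w_precision, w_recall, w_f1 = weighted_scores(toy_labels, toy_predictions)
print(round(w_precision, 4), round(w_recall, 4), round(w_f1, 4))
```

For single-label classification, weighted recall always equals accuracy (here 4 correct out of 6).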

In [None]:
def model_init():
  return AutoModelForSequenceClassification.from_pretrained(Roberta)

In [None]:
from transformers import Trainer, TrainingArguments

#Set the training arguments
training_arguments = TrainingArguments(
    output_dir = "./results",
    eval_strategy = "epoch",
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 3,
    learning_rate = 2e-5,
    weight_decay = 0.01
)

#Construct the trainer
trainer = Trainer (
    model_init = model_init,
    args = training_arguments,
    train_dataset = data["train"],
    eval_dataset = data["test"],
    tokenizer = tokenizer,
    compute_metrics = compute_metrics_function
)

trainer.train()
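
As a rough guide to how long training runs, the number of optimizer steps follows from the dataset size, batch size and epoch count. The arithmetic below rests on assumptions not stated in the notebook: all 7 labels survive the sampling step with enough rows, training uses a single device, and gradient accumulation is left at 1.

```python
import math

n_labels = 7                                     # assumption: all 7 labels are present
total_rows = (10000 // n_labels) * n_labels      # rows after balanced sampling
test_rows = math.ceil(0.2 * total_rows)          # scikit-learn rounds the test split up
train_rows = total_rows - test_rows

batch_size = 16
epochs = 3
steps_per_epoch = math.ceil(train_rows / batch_size)  # the last partial batch still counts
total_steps = steps_per_epoch * epochs
print(train_rows, steps_per_epoch, total_steps)
```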

In [None]:
trainer.evaluate()

In [None]:
# Save into Google Drive
# The trainer built its own model via model_init, so the fine-tuned weights live in trainer.model
trainer.model.save_pretrained("/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/finetuned_model")
tokenizer.save_pretrained("/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/finetuned_model")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Get the model predictions on the test set
results = trainer.predict(data["test"])

#Extract predicted and true labels
predicted = results.predictions.argmax(-1) #argmax picks the class with the highest score (logit)
labels = results.label_ids

In [None]:
# Define label names for better clarity in the confusion matrix
label_names = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

# Generate the confusion matrix
cmx = confusion_matrix(labels, predicted)

# Plot out the confusion matrix
plt.figure(figsize = (10, 10))
sns.heatmap(cmx, annot = True, fmt = 'd', xticklabels = label_names, yticklabels = label_names, cmap = "pink")

#Add labels
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Initial Training Confusion Matrix')
plt.show()

print(cmx)
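
Per-class recall and precision can be read directly off a confusion matrix. The sketch below uses a small made-up 3-class matrix rather than the notebook's actual results.

```python
import numpy as np

label_names_toy = ["anger", "joy", "sadness"]  # shortened, made-up example
# Rows are true labels, columns are predicted labels (scikit-learn's convention).
cmx_toy = np.array([
    [8, 1, 1],
    [0, 9, 1],
    [2, 2, 6],
])

recall_toy = cmx_toy.diagonal() / cmx_toy.sum(axis=1)     # correct / all true samples
precision_toy = cmx_toy.diagonal() / cmx_toy.sum(axis=0)  # correct / all predictions
for name, r, p in zip(label_names_toy, recall_toy, precision_toy):
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
```

Large off-diagonal values in a row point to the emotions that a given true class is most often mistaken for.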

## Deployment to Gradio

The fine-tuned emotion analysis model was deployed on Hugging Face by uploading it to the Model Hub and linking it with a Space.

In [None]:
!pip install gradio

In [None]:
!pip install -U huggingface_hub

In [None]:
#Import Hugging Face Hub functions for authentication and uploading
from huggingface_hub import login, create_repo, upload_folder
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [None]:
#Log in to Hugging Face Hub and authenticate with tokens
login()

In [None]:
#Define the repository ID
repo_id = "Shaaranii12/emotion-analysis-model"

#Create a new repository on Hugging Face Hub
create_repo(repo_id=repo_id, repo_type="model", private=False, exist_ok=True)
print("Repo created at:", "https://huggingface.co/" + repo_id)

1. repo_type="model" means this is a model repo, not a dataset or Space.
2. private=False makes the repo public so the Space can access it.
3. exist_ok=True creates the repo if it doesn't exist; if it already exists, nothing happens and no error is raised.

In [None]:
#Uploading the fine-tuned model
local_model_path = "/content/drive/MyDrive/Colab/Project1_Emotion_Analysis/finetuned_model"

upload_folder(
    repo_id="Shaaranii12/emotion-analysis-model",  # your repo
    folder_path=local_model_path,
    path_in_repo=".",   # upload into root of repo
    commit_message="Upload fine-tuned RoBERTa emotion model"
)

#https://huggingface.co/spaces/Shaaranii12/emotion-analyzer

The model was successfully uploaded and deployed on Hugging Face Spaces, creating a simple web app where users can input text and see the predicted emotion.

Click the link to explore the app: [Emotions Analyzer](https://huggingface.co/spaces/Shaaranii12/emotion-analyzer)