# Emotion Detection Data Preparation

This notebook demonstrates the steps to prepare the GoEmotions dataset for a multi-label classification task.

We will perform the following steps:
1. Group and sum emotion labels for each unique text.
2. Clean the dataset by removing conflicting neutral labels.
3. Retain only the top 3 emotion labels per text.
4. Stratify the dataset based on the labels.
5. Visualize the data before and after each step.


## Step 1: Load the Data
First, we load the GoEmotions dataset and inspect its structure.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from datasets import load_dataset

dataFrame = load_dataset('go_emotions', 'raw')

print("Total data size: ", len(dataFrame["train"]))
df = dataFrame["train"].to_pandas()
df.to_parquet("go_emotions.parquet")
df.head(2)

## Inspecting the Text Column

* We inspect the text column to determine the `max_sequence_length` our model will have to handle.
* To avoid data leakage between the train and test sets, we also want to deduplicate the text column.
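
As a minimal sketch of these two checks, the snippet below uses a small hypothetical DataFrame (`toy`) instead of GoEmotions. The 3 to 200 character bounds mirror the filter this notebook applies; character length is only a proxy for the tokenized sequence length, which depends on the tokenizer.

```python
import pandas as pd

# Hypothetical toy data standing in for the GoEmotions `text` column.
toy = pd.DataFrame(
    {"text": ["ok", "I love this!", "I love this!", "x" * 250, "Not great."]}
)

# Character length of each text.
toy["length_text"] = toy["text"].str.len()

# Keep texts within the assumed bounds (inclusive on both ends).
MIN_LEN, MAX_LEN = 3, 200
filtered = toy[toy["length_text"].between(MIN_LEN, MAX_LEN)]

# Count repeated texts; these could end up in both splits if not grouped.
n_duplicates = int(filtered["text"].duplicated().sum())
print("Rows kept:", len(filtered), "| duplicated texts:", n_duplicates)
```

In this notebook the deduplication itself happens in the grouping step, where each unique text is collapsed into a single row.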

In [None]:
# Character length of each text
df["length_text"] = df["text"].apply(len)
df["length_text"].describe()


In [None]:

plt.figure(figsize=(12, 6))
sns.histplot(df["length_text"], bins=150, kde=True)
plt.title("Text Length Distribution")
plt.show()

In [None]:
# display unique texts shorter than 3 characters (first 100)
df[df["length_text"] < 3]["text"].unique()[:100]

In [None]:
# display unique texts longer than 200 characters
df[df["length_text"] > 200]["text"].unique()

In [None]:

# keep only texts between 3 and 200 characters (inclusive)
df = df[(df["length_text"] >= 3) & (df["length_text"] <= 200)]
# plot
plt.figure(figsize=(12, 6))
sns.histplot(df["length_text"], bins=50, kde=True)
plt.title("Text Length Distribution")
plt.show()


## Step 2: Group and Sum Emotion Labels
Next, we group the dataset by text and sum the emotion labels for each unique text.
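
To illustrate what grouping and summing does, here is a small sketch on hypothetical data (`toy` and `label_cols` are made up for the example). Each row is one rater's annotation; after grouping, every unique text has one row whose values count how many raters chose each emotion.

```python
import pandas as pd

# Hypothetical annotations: three raters labeled "great job", one labeled "oh no".
toy = pd.DataFrame(
    {
        "text": ["great job", "great job", "great job", "oh no"],
        "joy": [1, 1, 0, 0],
        "admiration": [0, 1, 1, 0],
        "fear": [0, 0, 0, 1],
    }
)
label_cols = ["joy", "admiration", "fear"]

# One row per unique text; each label column becomes a vote count.
grouped = toy.groupby("text", as_index=False)[label_cols].sum()
print(grouped)
```

The summed values are vote counts rather than binary indicators, which is why a later step converts them back to 0/1.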

In [None]:
df["text"].describe()

In [None]:
# Columns containing emotion labels.
# The positional slice assumes 9 leading metadata columns and `length_text` as the last column.
emotions = df.columns[9:-1]

# Group by text and sum the emotion columns
df_grouped = df.groupby('text', as_index=False)[emotions].sum()

# Display the first few rows of the grouped dataset
print("Grouped Dataset:")
print(df_grouped.head())
df_grouped

In [None]:
# plot a correlation matrix of the summed emotion labels
plt.figure(figsize=(20, 6))
sns.heatmap(df_grouped[emotions].corr(), annot=True, fmt=".1f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

In [None]:

# Create a DataFrame from the labels
labels_df = df_grouped[emotions]

# Calculate the co-occurrence matrix
co_occurrence_matrix = labels_df.T.dot(labels_df)

# Convert to a DataFrame for better readability
co_occurrence_df = pd.DataFrame(co_occurrence_matrix, columns=labels_df.columns, index=labels_df.columns)

# Plot the heatmap
plt.figure(figsize=(20, 10))
sns.heatmap(co_occurrence_df, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Label Co-Occurrence Matrix')
plt.xlabel('Labels')
plt.ylabel('Labels')
plt.show()

## Step 3: Remove Conflicting Neutral Labels
We remove the neutral label if it is set amongst other labels.
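
The rule can also be written as a boolean mask instead of a row-wise function. The sketch below uses hypothetical data (`toy`, `other_cols`) and is an alternative formulation, not the implementation used in this notebook.

```python
import pandas as pd

# Hypothetical vote counts per text.
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "joy": [2, 0, 0],
        "anger": [0, 0, 1],
        "neutral": [1, 3, 1],
    }
)
other_cols = ["joy", "anger"]

# A row conflicts when neutral is set together with at least one other emotion.
has_other = toy[other_cols].sum(axis=1) > 0
conflict = (toy["neutral"] > 0) & has_other

# Zero out neutral only on conflicting rows; neutral-only rows are untouched.
cleaned = toy.copy()
cleaned.loc[conflict, "neutral"] = 0
print(cleaned)
```

Row "b" keeps its neutral votes because no other emotion was selected for it.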

In [None]:
all_except_neutral = emotions[:-1]
all_except_neutral

In [None]:
# count texts labeled neutral only vs. neutral together with other emotions
other_sum = df_grouped[all_except_neutral].sum(axis=1)
df_grouped["neutral_only"] = (df_grouped["neutral"] > 0) & (other_sum == 0)
df_grouped["neutral_and_other"] = (df_grouped["neutral"] > 0) & (other_sum > 0)
print("Neutral Only:", df_grouped["neutral_only"].sum())
print("Neutral and Other:", df_grouped["neutral_and_other"].sum())
# number of texts that received more than one neutral vote
df_grouped["neutral_above_1"] = df_grouped["neutral"] > 1
print("Neutral above 1:", df_grouped["neutral_above_1"].sum())

In [None]:
# Clean up the dataset by removing the neutral label if it conflicts with other labels
def remove_conflicting_neutral(row):
    if row['neutral'] > 0 and row[all_except_neutral].sum() > 0:
        row['neutral'] = 0
    return row

df_grouped = df_grouped.apply(remove_conflicting_neutral, axis=1)

# Recompute the neutral counts after removing conflicting neutral labels
print("Dataset after Removing Conflicting Neutral Labels:")
other_sum = df_grouped[all_except_neutral].sum(axis=1)
df_grouped["neutral_only"] = (df_grouped["neutral"] > 0) & (other_sum == 0)
df_grouped["neutral_and_other"] = (df_grouped["neutral"] > 0) & (other_sum > 0)
print("Neutral Only:", df_grouped["neutral_only"].sum())
print("Neutral and Other:", df_grouped["neutral_and_other"].sum())
# number of texts that received more than one neutral vote
df_grouped["neutral_above_1"] = df_grouped["neutral"] > 1
print("Neutral above 1:", df_grouped["neutral_above_1"].sum())

## Step 4: Retain Only Top 3 Labels
For each row, we keep only the top 3 emotion labels with the highest scores and set those to 1, while setting all other labels to 0.
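
A small worked example may help here. The sketch below uses hypothetical vote counts and a helper named `top_k_binary` (an illustrative name, not part of this notebook). Ties are broken by column order, which is the default `keep='first'` behavior of pandas `Series.nlargest`.

```python
import pandas as pd

def top_k_binary(row: pd.Series, k: int = 3) -> pd.Series:
    """Return a 0/1 row marking the k largest strictly positive values."""
    top = row.nlargest(k)
    keep = top[top > 0].index  # drop zero-vote labels from the top k
    return pd.Series(row.index.isin(keep).astype(int), index=row.index)

# Hypothetical vote counts for two texts.
toy = pd.DataFrame(
    {
        "joy": [3, 0],
        "admiration": [2, 0],
        "love": [1, 1],
        "fear": [1, 0],
        "neutral": [0, 0],
    }
)

binary = toy.apply(top_k_binary, axis=1)
print(binary)
```

In the first row, `love` and `fear` tie for third place and `love` wins because it comes first; in the second row only `love` has a positive count, so it is the only label kept.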


In [None]:
import numpy as np

# Function to keep only the top 3 labels
def keep_top_labels(row, top_labels=3):
    top_indices = row.nlargest(top_labels).index
    # only keep those labels where the value is greater than 0
    top_indices = top_indices[row[top_indices] > 0]
    row[:] = 0
    row[top_indices] = 1
    return row

# Apply the function to each row
df_grouped[emotions] = df_grouped[emotions].apply(lambda l: keep_top_labels(l, 3), axis=1)

# Display the first few rows after keeping only the top 3 labels
print("Dataset after Keeping Only Top Labels:")
print(df_grouped.head())

## Step 5: Stratify the Dataset
We stratify the dataset based on the labels to ensure the distribution of labels is maintained.
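
The following sketch shows the idea on hypothetical data: each row's label vector is turned into a single key (`combo`, an illustrative name), combinations that occur only once are dropped because a stratified split needs at least two samples per class, and the key is passed to `train_test_split` via `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical binary labels: 10 "joy" rows, 10 "anger" rows, 1 row with both.
toy = pd.DataFrame(
    {
        "text": [f"t{i}" for i in range(21)],
        "joy": [1] * 10 + [0] * 10 + [1],
        "anger": [0] * 10 + [1] * 10 + [1],
    }
)
label_cols = ["joy", "anger"]

# Encode each label combination as a string key, e.g. "10" or "01".
toy["combo"] = toy[label_cols].astype(str).agg("".join, axis=1)

# Drop combinations that appear only once.
counts = toy["combo"].value_counts()
kept = toy[toy["combo"].map(counts) >= 2]

train_toy, test_toy = train_test_split(
    kept, test_size=0.2, random_state=0, stratify=kept["combo"]
)
print(len(train_toy), len(test_toy))
```

Stratifying on the full combination preserves the joint distribution of labels, at the cost of discarding the rarest combinations.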

In [None]:
from sklearn.model_selection import train_test_split

# shuffle the dataset
df_grouped = df_grouped.sample(frac=1, random_state=42)

# Create a multi-label indicator for stratification
df_grouped['labels'] = df_grouped[emotions].apply(lambda x: tuple(x), axis=1)

# Count the occurrences of each label combination
label_counts = df_grouped['labels'].value_counts()
label_counts

In [None]:
# count of label combinations occurring only once
single_label_counts = label_counts[label_counts == 1].count()
print(f"Label combinations occurring only once: {single_label_counts}")

In [None]:

# Identify rare combinations (those that appear only once); a stratified split needs at least two samples per class
rare_combinations = label_counts[label_counts < 2].index
print("Rare Combinations:", len(rare_combinations))

# Filter out rows with rare combinations
df_filtered = df_grouped[~df_grouped['labels'].isin(rare_combinations)]

# Stratify the dataset
train, test = train_test_split(df_filtered, test_size=0.1, random_state=42, shuffle=True,  stratify=df_filtered['labels'])

# Drop the 'labels' column as it was only needed for stratification
train = train.drop(columns=['labels'])
test = test.drop(columns=['labels'])

# Display the resulting datasets
print("Training set size:", len(train))
print("Test set size:", len(test))


In [None]:
%pip install -q imbalanced-learn

In [None]:
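# Balance the dataset by undersampling the majority classes
from imblearn.under_sampling import RandomUnderSampler
# Initialize the RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)

# Fit and apply the sampler.
# Note: the target passed here is `text`, which is unique per row after grouping,
# so each "class" holds a single sample and the dataset size is not expected to change.
X_resampled, y_resampled = undersampler.fit_resample(df_grouped[emotions], df_grouped['text'])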


In [None]:

print("Resampled dataset size:", len(X_resampled), len(y_resampled))


## Visualizations
Let's visualize the distribution of labels before and after each step.

### Original Dataset


In [None]:
# Plot the original distribution of emotion labels
import matplotlib.pyplot as plt
import seaborn as sns

original_emotion_counts = df[emotions].sum().sort_values(ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x=original_emotion_counts.index, y=original_emotion_counts.values, palette="viridis")
plt.title('Original Counts per Emotion in Go Emotions Dataset')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


### Cleaned Dataset

In [None]:
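# Plot the distribution of emotion labels in the grouped dataset
# (at this point conflicting neutral labels are removed and only the top 3 labels are kept)
cleaned_emotion_counts = df_grouped[emotions].sum().sort_values(ascending=False)

# plot bar chart
plt.figure(figsize=(12, 8))
cleaned_emotion_counts.plot(kind='bar', color='skyblue')
plt.title('Counts per Emotion in the Cleaned Dataset')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.show()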


### Dataset with Top 3 Labels


### Train and Test Sets


In [None]:
# Calculate emotion counts for the training set
train_emotion_counts = train[emotions].sum()

# Calculate emotion counts for the test set
test_emotion_counts = test[emotions].sum()

# Combine the counts into a DataFrame
emotion_counts_df = pd.DataFrame({
    'Train': train_emotion_counts,
    'Test': test_emotion_counts
})

# Display the combined DataFrame
print("Emotion Counts in Train and Test Sets:")
print(emotion_counts_df)


In [None]:
import matplotlib.pyplot as plt

# Plot the stacked bar chart
emotion_counts_df.plot(kind='bar', stacked=True, figsize=(12, 8), color=['skyblue', 'lightgreen'])

# Add titles and labels
plt.title('Stacked Distribution of Emotions in Train and Test Sets')
plt.xlabel('Emotions')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Dataset')
plt.show()


## Step 6: Stacked Distribution of Emotions in Train and Test Sets
Finally, let's visualize the distribution of emotions in the training and test sets using a stacked bar chart.

In [None]:
# Calculate emotion counts for the training set
train_emotion_counts = train[emotions].sum()

# Calculate emotion counts for the test set
test_emotion_counts = test[emotions].sum()

# Combine the counts into a DataFrame
emotion_counts_df = pd.DataFrame({
    'Train': train_emotion_counts,
    'Test': test_emotion_counts
})

# Calculate total counts for sorting
emotion_counts_df['Total'] = emotion_counts_df['Train'] + emotion_counts_df['Test']

# Sort the DataFrame by total counts
emotion_counts_df = emotion_counts_df.sort_values(by='Total', ascending=False)

# Drop the total column for plotting
emotion_counts_df = emotion_counts_df.drop(columns=['Total'])

# Display the sorted combined DataFrame
print("Sorted Emotion Counts in Train and Test Sets:")
print(emotion_counts_df)



In [None]:
import matplotlib.pyplot as plt

# Plot the stacked bar chart
emotion_counts_df.plot(kind='bar', stacked=True, figsize=(12, 8), color=['skyblue', 'lightgreen'])

# Add titles and labels
plt.title('Stacked Distribution of Emotions in Train and Test Sets (Sorted)')
plt.xlabel('Emotions')
plt.ylabel('Counts')
plt.xticks(rotation=45)
plt.legend(title='Dataset')
plt.show()


## Save the Datasets to Parquet Files

In [None]:
train.to_parquet("go_emotions_train.parquet")
test.to_parquet("go_emotions_test.parquet")