# Data Preparation

This notebook prepares the train and validation datasets for the emotion classification experiments.

**Dataset source**: [Kaggle Emotion Dataset](https://www.kaggle.com/datasets/parulpandey/emotion-dataset/data)

The notebook uses five emotion labels (rows with any other label value are filtered out during subsampling):
- 0: sadness
- 1: joy
- 2: love
- 3: anger
- 4: fear

We subsample 150 rows from the validation set and split them into train (100) and val (50) sets.

In [26]:
import pandas as pd
import os

In [27]:
LABEL_MAPPING = {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear"
}

EMOTION_TYPES = list(LABEL_MAPPING.values())
EMOTION_TYPES

['sadness', 'joy', 'love', 'anger', 'fear']
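
Because `LABEL_MAPPING` uses string keys, integer labels have to be cast to `str` before the lookup. The following is a minimal sketch of that lookup on a hypothetical toy series (the values are made up for illustration, not taken from the dataset). A label that is missing from the mapping comes back as `NaN`, which is one reason to filter labels before relying on the mapped column.

```python
import pandas as pd

LABEL_MAPPING = {"0": "sadness", "1": "joy", "2": "love", "3": "anger", "4": "fear"}

# Hypothetical toy labels; 5 is deliberately absent from LABEL_MAPPING
toy_labels = pd.Series([0, 2, 5])

# Cast to str so the integer labels match the mapping's string keys
toy_emotions = toy_labels.astype(str).map(LABEL_MAPPING)

# Labels without a mapping entry become NaN
print(toy_emotions.tolist())
print(toy_emotions.isna().sum(), "unmapped label(s)")
```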

In [28]:
# Kaggle dataset reference:
# https://www.kaggle.com/datasets/parulpandey/emotion-dataset/data

# Load the full validation dataset
df_full = pd.read_csv("validation.csv")

print(f"Total samples in validation.csv: {len(df_full)}")
df_full.head()

Total samples in validation.csv: 2000


                                                text  label
0  im feeling quite sad and sorry for myself but ...      0
1  i feel like i am still looking at a blank canv...      0
2                     i feel like a faithful servant      2
3                  i am just feeling cranky and blue      3
4  i can have for a treat or if i am feeling festive      1


In [29]:
# Set random seed for reproducibility
RANDOM_SEED = 42
N_SAMPLES = 150

# Subsample N_SAMPLES rows reproducibly, keeping only labels 0-4
df_sampled = (
    df_full
    .assign(
        emotion = lambda df_: df_["label"].astype(str).map(LABEL_MAPPING)
    )
    .query("label.isin([0,1,2,3,4])")
    .sample(n=N_SAMPLES, random_state=RANDOM_SEED)
    .rename(columns={"text": "user_message"})
    [["user_message", "emotion"]]
    .reset_index(drop=True)
)

print(f"Sampled {len(df_sampled)} samples")
df_sampled["emotion"].value_counts().reset_index()

Sampled 150 samples


   emotion  count
0      joy     55
1  sadness     42
2     fear     18
3     love     18
4    anger     17
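
The subsample above is reproducible because `DataFrame.sample` receives a fixed `random_state`. As a small sketch of what that guarantees, the snippet below draws twice from a hypothetical toy frame (made-up rows, not the real dataset) with the same seed and compares the results.

```python
import pandas as pd

# Hypothetical toy frame; the rows are made up for illustration
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(20)],
    "label": list(range(5)) * 4,
})

RANDOM_SEED = 42

# Two draws with the same seed select the same rows in the same order
first_draw = toy.sample(n=5, random_state=RANDOM_SEED)
second_draw = toy.sample(n=5, random_state=RANDOM_SEED)
print(first_draw.equals(second_draw))

# A different seed draws independently, so the selected rows will usually differ
other_draw = toy.sample(n=5, random_state=RANDOM_SEED + 1)
print(first_draw.index.tolist(), other_draw.index.tolist())
```

Note that the guarantee only holds for the same input frame: if `validation.csv` changes, the same seed may pick different rows.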


In [30]:
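# Simple train/val split: df_sampled is already shuffled by .sample(),
# so the first TRAIN_SIZE rows go to train and the rest to val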
TRAIN_SIZE = 100

df_train = df_sampled[:TRAIN_SIZE]
df_val = df_sampled[TRAIN_SIZE:]

print(f"Train size: {len(df_train)}")
print(f"Val size: {len(df_val)}")

Train size: 100
Val size: 50
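
The positional slice above preserves class proportions only approximately, since it relies on the random order produced by `.sample()`. If exact per-class proportions matter, one alternative (not what this notebook does) is a stratified split. The sketch below uses `groupby(...).sample(frac=...)` on a hypothetical toy frame; `TRAIN_FRAC` and the toy rows are assumptions chosen to mirror the 100/50 ratio.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_sampled (not real dataset rows)
toy = pd.DataFrame({
    "user_message": [f"message {i}" for i in range(30)],
    "emotion": ["joy"] * 15 + ["sadness"] * 9 + ["anger"] * 6,
})

TRAIN_FRAC = 2 / 3  # assumed ratio, mirroring the 100/50 split
RANDOM_SEED = 42

# Draw the same fraction of rows from every emotion group
toy_train = toy.groupby("emotion").sample(frac=TRAIN_FRAC, random_state=RANDOM_SEED)

# Every row not drawn for training goes to validation
toy_val = toy.drop(toy_train.index)

print(toy_train["emotion"].value_counts().to_dict())
print(toy_val["emotion"].value_counts().to_dict())
```

With small classes, per-group rounding can shift the totals by a row or two, so the resulting sizes are worth printing before saving.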


In [31]:
# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Save to CSV files
df_train.to_csv("data/train.csv", index=False)
df_val.to_csv("data/val.csv", index=False)

In [32]:
# Verify the saved files
print("\nTrain dataset:")
print(pd.read_csv("data/train.csv").head())

print("\nValidation dataset:")
print(pd.read_csv("data/val.csv").head())


Train dataset:
                                        user_message emotion
0  i must say that i do feel better in myself and...     joy
1  i feel so privileged to have spent so much tim...     joy
2  i went to see the entrance examination results...     joy
3  i wish there was something i could do sitting ...    fear
4  i still can t shake the feeling of him loving ...    love

Validation dataset:
                                        user_message  emotion
0  i only find out that they are looking and feel...      joy
1  i can feel that my hopes have not been in vain...  sadness
2  i know how much work goes into the creation an...      joy
3  i feel shamed that i hoped for one last christ...  sadness
4  i was feeling very bah humbugish coming out of...      joy
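
Printing the head of each file is a quick visual check. A slightly stricter option is to assert a few properties of the split. The helper below, `validate_split`, is a hypothetical name introduced for this sketch and is not part of the notebook; it is run here on made-up toy frames rather than on the saved CSV files.

```python
import pandas as pd

EMOTION_TYPES = ["sadness", "joy", "love", "anger", "fear"]


def validate_split(train: pd.DataFrame, val: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if a split looks malformed."""
    for name, df in [("train", train), ("val", val)]:
        # Both files are expected to hold exactly these two columns
        assert list(df.columns) == ["user_message", "emotion"], f"{name}: unexpected columns"
        # isin() is False for NaN, so missing emotions are caught here too
        assert df["emotion"].isin(EMOTION_TYPES).all(), f"{name}: unknown or missing emotion"
    # A message present in both splits would leak validation data into training
    overlap = set(train["user_message"]) & set(val["user_message"])
    assert not overlap, f"{len(overlap)} message(s) appear in both splits"


# Toy frames (made-up rows) standing in for data/train.csv and data/val.csv
toy_train = pd.DataFrame({"user_message": ["a", "b"], "emotion": ["joy", "fear"]})
toy_val = pd.DataFrame({"user_message": ["c"], "emotion": ["love"]})

validate_split(toy_train, toy_val)
print("split looks consistent")
```

To apply it to the real output, the same function could be called on `pd.read_csv("data/train.csv")` and `pd.read_csv("data/val.csv")`.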
