<h3>Emotion Classification of Natural Language</h3>

Team Members:

<h3>Project Overview:</h3>

<p>This project builds and evaluates machine learning models that classify text by the emotion it conveys. The task is to predict one of 28 emotion classes for a given text. The notebook covers text preprocessing, training a bag-of-words neural network and a gradient boosted decision tree model, fine-tuning DistilBERT, and generating predictions for submission.</p>


<h2>Part 0: Setup and Imports</h2>

<h3>0.1 Import Libraries:</h3>
<p>Import all the necessary packages for text processing and machine learning.</p>
<p>Useful resources:</p>
<ul>
  <li><a href="https://scikit-learn.org/stable/tutorial/basic/tutorial.html">scikit-learn tutorial</a></li>
  <li><a href="https://pytorch.org/tutorials/">PyTorch tutorials</a></li>
  <li><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html">pandas quickstart</a></li>
</ul>

In [9]:
# !pip install spacy



In [10]:
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [66]:
import os

import numpy as np
import pandas as pd
import spacy
import torch
import xgboost as xgb
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset, DataLoader, random_split
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

<h3>0.2 Accuracy Example:</h3>
<p>Below is an example of using accuracy_score to measure performance.</p>

In [67]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
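
<p>To make the metric concrete, the same value can be reproduced by hand: accuracy is the fraction of positions where the prediction equals the true label. The following is a minimal sketch; the helper name <code>manual_accuracy</code> is introduced here for illustration and is not part of scikit-learn.</p>

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
# Only the last three positions match, so the result is 3/7.
acc = manual_accuracy(y_true, y_pred)
print(acc)
```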

<h2>Part 1: Basic Modeling</h2>
<p>This section loads the data, preprocesses the text, and builds two machine learning models.</p>

<h3>1.1 Load and Preprocess the Data:</h3>
<p>Load the training and test datasets and extract the text and labels.</p>

In [68]:
# Load training data
train = pd.read_csv("train.csv")
train_text = train["text"]
train_label = train["label"]

# Load test data
test = pd.read_csv("test.csv")
test_id = test["id"]
test_text = test["text"]

In [69]:
# Preview the training data
train.head()

<h3>1.2 Implement Two Training Algorithms:</h3>
<p>This section demonstrates two approaches: a neural network using bag-of-words and a gradient boosted decision tree model.</p>

<strong>Bag of Words Vectorization</strong>

In [70]:
# Load spaCy model for text preprocessing
nlp = spacy.load("en_core_web_sm")

# Preprocess text: lemmatize and remove stop words
cleaned = []
for text in train['text']:
    doc = nlp(text)
    filtered = [word.lemma_ for word in doc if not word.is_stop]
    cleaned.append(filtered)

# Build vocabulary of unique words
vocab = {}
for sentence in cleaned:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)
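
<p>The vocabulary construction above assigns each new word the next free integer index. A minimal sketch on toy data may help show the resulting mapping; the function name <code>build_vocab</code> and the toy sentences are hypothetical stand-ins for the spaCy output, not part of the original notebook.</p>

```python
def build_vocab(tokenized_sentences):
    """Map each unique token to the next free integer index."""
    vocab = {}
    for sentence in tokenized_sentences:
        for token in sentence:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

# Hypothetical, already-lemmatized sentences (stand-ins for spaCy output).
toy_sentences = [["love", "movie"], ["hate", "movie"], ["love", "love"]]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)  # {'love': 0, 'movie': 1, 'hate': 2}
```

<p>Because indices are assigned in order of first appearance, the length of the vocabulary is also the width of every bag-of-words vector built from it.</p>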



In [71]:
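# Function to vectorize text using a bag-of-words representation.
# Note: the text is lowercased and split on whitespace here, whereas the
# vocabulary was built from spaCy lemmas, so only tokens that match a
# vocabulary entry exactly are counted.
def vectorize(text, vocab):
    vector = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector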

# Convert sentences to vectors and store corresponding labels
vectors = []
labels = []
for text, label in zip(train['text'], train['label']):
    vectors.append(vectorize(text, vocab))
    labels.append(label)

vectors = torch.tensor(np.array(vectors), dtype=torch.float32)
# CrossEntropyLoss expects integer class indices, so store labels as long.
labels = torch.tensor(labels, dtype=torch.long)

# Define a neural network model using bag-of-words input
class BoWModel(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, output_size):
        super(BoWModel, self).__init__()
        self.fc1 = torch.nn.Linear(vocab_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, hidden_size)
        self.fc3 = torch.nn.Linear(hidden_size, output_size)
        self.bn1 = torch.nn.BatchNorm1d(hidden_size)
        self.dropout = torch.nn.Dropout(0.5)

    def forward(self, x):
        leaky = torch.nn.LeakyReLU(negative_slope=0.1)
        x = leaky(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = leaky(self.fc2(x))
        x = self.fc3(x)
        return x

model = BoWModel(len(vocab), 100, 28)
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-4)
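
<p>The bag-of-words step in this cell can be illustrated on a toy vocabulary. The sketch below uses plain Python lists instead of NumPy, and the name <code>vectorize_counts</code> and the toy vocabulary are assumptions made for illustration. It also shows one consequence of splitting on whitespace: a token with attached punctuation is not found in the vocabulary.</p>

```python
def vectorize_counts(text, vocab):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

toy_vocab = {"love": 0, "movie": 1, "hate": 2}  # hypothetical vocabulary

# "movie!" keeps its punctuation after split(), so it is not counted.
vec_a = vectorize_counts("Love love this movie!", toy_vocab)
vec_b = vectorize_counts("I hate this movie", toy_vocab)
print(vec_a)  # [2, 0, 0]
print(vec_b)  # [0, 1, 1]
```

<p>One possible way to reduce such misses, not tested here, would be to tokenize and lemmatize with the same spaCy pipeline at vectorization time as was used to build the vocabulary.</p>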

<h3>1.3 Train, Validate, and Select Model:</h3>
<p>Split the data into training and validation sets and train the neural network.</p>
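<p>The split in this section takes the first 80% of rows in file order. If the rows of the CSV happen to be ordered in some way, a shuffled split may give a more representative validation set. The following is a minimal sketch under that assumption; the function name <code>shuffled_split_indices</code> and the seed value are hypothetical.</p>

```python
import random

def shuffled_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, valid_idx) as two disjoint, shuffled index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG; global state untouched
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = shuffled_split_indices(10)
print(len(train_idx), len(valid_idx))  # 8 2
```

<p>Indexing a tensor with these lists, for example <code>vectors[train_idx]</code>, should select the corresponding rows.</p>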

In [72]:
# Split the data into training and validation sets (80% train, 20% validation)
split = int(len(vectors) * 0.8)
train_vectors = vectors[:split]
train_labels = labels[:split]
valid_vectors = vectors[split:]
valid_labels = labels[split:]

# Train the neural network model
num_epochs = 200
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    train_preds = model(train_vectors)
    train_loss = loss(train_preds, train_labels.long())
    train_loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        valid_preds = model(valid_vectors)
        valid_loss = loss(valid_preds, valid_labels.long())
        valid_preds_arg = torch.argmax(valid_preds, dim=1)
        valid_accuracy = (valid_preds_arg == valid_labels.long()).sum().item() / valid_labels.size(0)

    if epoch % 25 == 0 or epoch == num_epochs - 1:
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"  Train Loss: {train_loss.item():.4f}")
        print(f"  Validation Loss: {valid_loss.item():.4f}")
        print(f"  Validation Accuracy: {valid_accuracy * 100:.2f}%")

Epoch 1/200
  Train Loss: 3.3998
  Validation Loss: 3.3438
  Validation Accuracy: 0.70%
Epoch 26/200
  Train Loss: 3.0209
  Validation Loss: 3.3145
  Validation Accuracy: 2.35%
Epoch 51/200
  Train Loss: 2.6353
  Validation Loss: 3.1966
  Validation Accuracy: 41.90%
Epoch 76/200
  Train Loss: 2.2344
  Validation Loss: 2.7724
  Validation Accuracy: 58.80%
Epoch 101/200
  Train Loss: 1.8446
  Validation Loss: 2.2464
  Validation Accuracy: 63.05%
Epoch 126/200
  Train Loss: 1.5118
  Validation Loss: 1.9260
  Validation Accuracy: 65.45%
Epoch 151/200
  Train Loss: 1.2496
  Validation Loss: 1.6988
  Validation Accuracy: 66.65%
Epoch 176/200
  Train Loss: 1.0507
  Validation Loss: 1.5349
  Validation Accuracy: 67.95%
Epoch 200/200
  Train Loss: 0.9174
  Validation Loss: 1.4326
  Validation Accuracy: 68.35%
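
<p>A quick sanity check on these numbers: a model that assigns equal probability to all 28 classes has a cross-entropy of ln(28), and guessing uniformly at random gives an expected accuracy of 1/28. The first logged losses (3.3998 train, 3.3438 validation) are close to that baseline, which is what one would expect from a freshly initialized network. The arithmetic can be sketched as:</p>

```python
import math

num_classes = 28
uniform_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
chance_accuracy = 1 / num_classes      # expected accuracy of a uniform random guess

print(f"{uniform_loss:.4f}")            # 3.3322
print(f"{chance_accuracy * 100:.2f}%")  # 3.57%
```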


In [73]:
# Vectorize test set and generate predictions using the neural network model
test_vectors = []
for text in test['text']:
    test_vectors.append(vectorize(text, vocab))
test_vectors = torch.tensor(np.array(test_vectors), dtype=torch.float32)

model.eval()
with torch.no_grad():
    test_preds = model(test_vectors)
    test_preds_arg = torch.argmax(test_preds, dim=1).numpy()

predictions_df = pd.DataFrame({
    'id': test['id'],
    'label': test_preds_arg
})

predictions_df.to_csv('submission.csv', index=False)

In [20]:
print(train_vectors.shape)

torch.Size([8000, 10035])


<p>The second approach is a gradient boosted decision tree model, trained on the same bag-of-words vectors.</p>

In [74]:
# Train and evaluate the gradient boosted decision tree model.
# Note: no dimensionality reduction is applied here; the "_reduced" names
# refer to the bag-of-words matrices converted to NumPy arrays.
train_vectors_reduced = train_vectors.numpy()
valid_vectors_reduced = valid_vectors.numpy()
test_vectors_reduced = test_vectors.numpy()

train_labels = train_labels.long()
valid_labels = valid_labels.long()

eval_set = [(train_vectors_reduced, train_labels), (valid_vectors_reduced, valid_labels)]

model = xgb.XGBClassifier(
    max_depth=4,
    learning_rate=0.1,
    n_estimators=100,
    objective="multi:softmax",
    num_class=28
)
model.fit(train_vectors_reduced, train_labels, eval_set=eval_set, verbose=True)

val_predictions = model.predict(valid_vectors_reduced)
val_accuracy = accuracy_score(valid_labels.numpy(), val_predictions)
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")

test_predictions = model.predict(test_vectors_reduced)

predictions_df = pd.DataFrame({'id': test['id'], 'label': test_predictions})
predictions_df.to_csv('submission2.csv', index=False)

[0]	validation_0-mlogloss:3.05115	validation_1-mlogloss:3.06361
[1]	validation_0-mlogloss:2.86536	validation_1-mlogloss:2.88056
[2]	validation_0-mlogloss:2.72320	validation_1-mlogloss:2.74551
[3]	validation_0-mlogloss:2.60764	validation_1-mlogloss:2.63338
[4]	validation_0-mlogloss:2.51228	validation_1-mlogloss:2.54337
[5]	validation_0-mlogloss:2.42938	validation_1-mlogloss:2.46322
[6]	validation_0-mlogloss:2.35822	validation_1-mlogloss:2.39629
[7]	validation_0-mlogloss:2.29524	validation_1-mlogloss:2.33612
[8]	validation_0-mlogloss:2.23948	validation_1-mlogloss:2.28323
[9]	validation_0-mlogloss:2.18893	validation_1-mlogloss:2.23567
[10]	validation_0-mlogloss:2.14308	validation_1-mlogloss:2.19147
[11]	validation_0-mlogloss:2.10106	validation_1-mlogloss:2.14954
[12]	validation_0-mlogloss:2.06260	validation_1-mlogloss:2.11459
[13]	validation_0-mlogloss:2.02728	validation_1-mlogloss:2.08037
[14]	validation_0-mlogloss:1.99463	validation_1-mlogloss:2.05003
[15]	validation_0-mlogloss:1.96439	
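
<p>The <code>mlogloss</code> values in this log are multiclass log loss: the mean negative log-probability that the model assigns to the true class. A minimal sketch of that computation follows; the function name and the toy probabilities are hypothetical, and no clipping of near-zero probabilities is applied.</p>

```python
import math

def multiclass_log_loss(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for row, label in zip(probabilities, labels):
        total += -math.log(row[label])
    return total / len(labels)

# Hypothetical predicted probabilities for two samples over three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
true_labels = [0, 1]
loss_value = multiclass_log_loss(probs, true_labels)
print(round(loss_value, 4))  # 0.2899
```

<p>Lower is better, and the value for a uniform prediction over 28 classes would be ln(28), roughly 3.33, which gives a reference point for the first rounds of the log.</p>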

<h3>1.4 Summary:</h3>
<p>Summarize your approach, model design, and any performance insights here.</p>

<h2>Part 2: Advanced Modeling and Experimentation</h2>
<p>Explore additional techniques and innovative changes. Use new training algorithms or preprocessing methods as desired.</p>

<h3>2.1 Additional Experimentation:</h3>
<p>Load the data and explore alternative modeling approaches.</p>

In [75]:
data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [76]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=28).to('cpu')

inputs = tokenizer(data["text"].to_list(), padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor(data["label"])

test_inputs = tokenizer(test_data["text"].to_list(), padding=True, truncation=True, return_tensors='pt')

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

dataset = TextDataset(inputs, labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a downstream task to be able to use it for predictions and inference.


In [29]:
from sklearn.metrics import f1_score, accuracy_score
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  score = f1_score(labels, preds, average='weighted')
  acc = accuracy_score(labels, preds)
  return {
      'f1': score,
      'accuracy': acc
  }
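
<p>The <code>average='weighted'</code> option computes an F1 score per class and averages them with weights proportional to each class's number of true samples. The sketch below reimplements that idea in plain Python for illustration only; the function name <code>weighted_f1</code> and the toy labels are assumptions, and scikit-learn remains the reference implementation.</p>

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of this class
        total += f1 * support
    return total / len(y_true)

toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_f1 = weighted_f1(toy_true, toy_pred)
toy_acc = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(round(toy_f1, 4), round(toy_acc, 4))  # 0.6222 0.6667
```

<p>In this toy case class 2 is never predicted, which pulls the weighted F1 below the accuracy. A similar gap between the two metrics can appear when some classes are rarely predicted correctly.</p>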

In [30]:
training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='steps',
    output_dir='./results',
    run_name='my_experiment',
    report_to='none',
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

trainer.evaluate()

Step,Training Loss,Validation Loss,F1,Accuracy
100,2.7098,1.861923,0.365795,0.495
200,1.6237,1.317019,0.565912,0.633
300,1.1473,1.099909,0.652605,0.6965
400,0.9922,1.020955,0.693295,0.723
500,1.0065,0.938475,0.719269,0.7435
600,0.8264,0.906927,0.724391,0.744
700,0.7678,0.92069,0.72853,0.7525
800,0.8343,0.872855,0.727389,0.7535
900,0.7299,0.844276,0.73957,0.7565
1000,0.7177,0.831046,0.739738,0.7605


{'eval_loss': 0.8156575560569763,
 'eval_f1': 0.7462436776875906,
 'eval_accuracy': 0.7665,
 'eval_runtime': 1.1076,
 'eval_samples_per_second': 1805.708,
 'eval_steps_per_second': 90.285,
 'epoch': 3.0}
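
<p>The training arguments set <code>learning_rate=5e-5</code> and <code>warmup_steps=100</code>. With the Trainer's default linear scheduler, the learning rate ramps up linearly during warmup and then decays linearly to zero. The sketch below illustrates that shape; the function name is hypothetical, and the total of 1500 steps is an assumption (8000 training rows, batch size 16, a single device, 3 epochs).</p>

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=100):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed: 8000 training rows / batch size 16 = 500 steps per epoch, 3 epochs.
total_steps = 500 * 3
for step in (0, 50, 100, 800, 1500):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

<p>Under these assumptions the rate peaks at step 100 and falls back to half of its peak by step 800.</p>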

In [44]:
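# Alias the Hugging Face class so it does not shadow torch.utils.data.Dataset,
# which TextDataset above subclasses.
from datasets import Dataset as HFDataset

test_dataset = HFDataset.from_dict({
    "input_ids": test_inputs["input_ids"],
    "attention_mask": test_inputs["attention_mask"]
})
test_results = trainer.predict(test_dataset)
print(test_results)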

PredictionOutput(predictions=array([[-3.2994623e+00, -4.1913283e-01, -3.1433434e+00, ...,
        -3.4308878e-01, -2.6346073e+00,  6.8985143e+00],
       [-3.4019110e+00, -1.0683243e+00, -2.2352924e+00, ...,
        -5.0666088e-01, -3.2107604e+00, -4.8267762e-03],
       [-3.1755946e+00,  4.3650618e-01, -2.8848526e+00, ...,
        -9.5433182e-01, -2.5741169e+00,  7.8771424e-01],
       ...,
       [-3.3728209e+00, -1.9593272e+00, -1.9619123e+00, ...,
         2.9653367e-01, -2.8483663e+00, -8.4335697e-01],
       [-2.2238469e+00,  8.2302198e+00, -2.1392069e+00, ...,
        -1.7103633e+00, -3.0423563e+00,  6.4389908e-01],
       [-3.4659858e+00,  1.3115994e+00, -2.6390691e+00, ...,
        -1.9886749e+00, -3.1556966e+00,  6.4458926e-03]], dtype=float32), label_ids=None, metrics={'test_runtime': 34.8425, 'test_samples_per_second': 430.509, 'test_steps_per_second': 21.525})


In [52]:
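# Each row of `predictions` holds one logit per class; argmax picks the class.
predictions = test_results.predictions
predicted_classes = predictions.argmax(axis=1)

# Note: ids are generated as 0..N-1 and the prediction column is named
# 'predictions'; Part 3 asks for the columns 'id' and 'label'.
df = pd.DataFrame({'id': list(range(len(test_data))), 'predictions': predicted_classes})

df.to_csv('submission3.csv', index=False)

print(df)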

          id  predictions
0          0           27
1          1           16
2          2           21
3          3           21
4          4           21
...      ...          ...
14995  14995            9
14996  14996            9
14997  14997           12
14998  14998            1
14999  14999            4

[15000 rows x 2 columns]
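
<p>The predicted class is the index of the largest logit in each row. Applying a softmax first would turn the logits into probabilities but would not change which index is largest. A minimal sketch in plain Python, with hypothetical logits for four classes:</p>

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_logits = [-3.3, -0.4, 6.9, -2.6]  # hypothetical logits for four classes
toy_probs = softmax(toy_logits)
print(argmax(toy_logits), argmax(toy_probs))  # 2 2
```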


<h3>2.2 Project Insights:</h3>
<p>Discuss performance improvements, challenges, and the results of your experiments.</p>

<h2>Part 3: Generating Final Predictions for Deployment</h2>
<p>Generate a CSV file with two columns: 'id' and 'label' for deployment.</p>

In [None]:
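# Placeholder values; replace `prediction` with the model's predicted labels.
# The name `ids` avoids shadowing Python's built-in id().
ids = range(15000)
prediction = range(15000)
submission = pd.DataFrame({'id': ids, 'label': prediction})
submission.to_csv('/kaggle/working/submission.csv', index=False)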
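
<p>Before uploading, it can be useful to check that a submission file really has the two required columns and that every label is a valid class index. The sketch below does this with the standard library only; the function name <code>validate_submission</code> and the assumption that labels run from 0 to 27 are illustrative and not taken from any competition specification.</p>

```python
import csv
import io

def validate_submission(csv_text, expected_rows, num_classes=28):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "label"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        # Assumed: labels are non-negative integers below num_classes.
        if not row["label"].isdigit() or int(row["label"]) >= num_classes:
            problems.append(f"invalid label for id {row['id']}: {row['label']}")
    return problems

good = "id,label\n0,27\n1,16\n"
bad = "id,predictions\n0,27\n1,16\n"
print(validate_submission(good, expected_rows=2))  # []
print(validate_submission(bad, expected_rows=2))
```

<p>For a file on disk, the text could be read with <code>open(path).read()</code> and passed to the same function.</p>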

In [None]:
# Additional code to generate a CSV file from your predictions using pandas

<h2>Part 4: References and Resources</h2>
<p>Cite any papers or online resources you used.</p>

Please list your references here.