<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members:

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is classifying texts by the human emotion they express. Through words, we express feelings, articulate thoughts, and communicate our deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Are you able to train an ML model that recognizes the human emotions expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook.</b> </p>
<p>


<h2>Part 0: Basics</h2><p>

<h3>0.1 Import:</h3><p>
Please import the packages you need. Note that learning and using these packages is recommended but not required for this project. Official tutorials for some suggested packages:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [9]:
# !pip install spacy



In [10]:
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [66]:
import os
import pandas as pd
import numpy as np
import torch
import spacy
import xgboost as xgb
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset, DataLoader, random_split
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
# TODO

<h3>0.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we are using accuracy. As a recap, accuracy is the percent of labels you predict correctly. To measure this, you can use library functions from sklearn. A simple example is shown below.
<p>

In [67]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
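
As a quick sanity check on the number above, accuracy can also be computed by hand as the fraction of positions where the prediction and the label agree. This is a minimal sketch, not part of the provided scaffolding; the helper name `manual_accuracy` is hypothetical.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
# Only the last three positions match, so the result is 3/7.
acc = manual_accuracy(y_true, y_pred)
```

Three of seven labels match, which agrees with the value returned by `accuracy_score`.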

<h2>Part 1: Basic</h2><p>
Note that your code should be commented well and in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
The cell below shows how to load the data, either in a Kaggle Notebook (commented out) or from local files.
<p>

In [68]:
#train = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/train.csv")
train = pd.read_csv("train.csv")
train_text = train["text"]
train_label = train["label"]

#test = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/test.csv")
test = pd.read_csv("test.csv")
test_id = test["id"]
test_text = test["text"]

In [69]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO
train.head()

Unnamed: 0,text,label
0,i interact with on a daily basis either in rea...,1
1,Stranger than fiction. Can't even begin to com...,1
2,i sit here with the aftermath feeling so damn ...,1
3,Great job! Hats off to you.,25
4,i hate you threads posted by people just whini...,9


<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

**Bag of Words Vectorization**

In [70]:
#load spaCy's small English pipeline, used to drop uninformative tokens
nlp = spacy.load("en_core_web_sm")

#extract the useful words from each text and collect them in the cleaned list
cleaned = []
for text in train['text']:
    doc = nlp(text)
    #filter out stop words such as "and", which carry little information
    #word.lemma_ reduces a word to its base form, e.g. "running" becomes "run"
    filtered = [word.lemma_ for word in doc if not word.is_stop]
    cleaned.append(filtered)

#create a dict that maps each unique word to its column index
vocab = {}

#add each unseen word to vocab; its value is the next free index
for sentence in cleaned:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)



In [71]:
#transform a text into a count vector with one entry per vocab word,
#holding the number of occurrences of that word in the text
#NOTE: lookup uses lower-cased whitespace tokens, while vocab holds spaCy lemmas,
#so words that differ from their lemma (or carry punctuation) are not counted
def vectorize(text, vocab):
    vector = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

#vectors is the list of all sentences transformed into vectors
#labels is the list of all labels corresponding to the vectors
vectors = []
labels = []

#iterate over the sentences and labels together, vectorize each sentence,
#and populate the vectors and labels lists
for text, label in zip(train['text'], train['label']):
    vectors.append(vectorize(text, vocab))
    labels.append(label)

#convert to tensors: float32 features, and integer class labels (required by CrossEntropyLoss)
vectors = torch.tensor(np.array(vectors), dtype=torch.float32)
labels = torch.tensor(labels, dtype=torch.long)

#create bag of words model using a deep neural net in pytorch
class BoWModel(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, output_size):
        super(BoWModel, self).__init__()
        self.fc1 = torch.nn.Linear(vocab_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, hidden_size)
        self.fc3 = torch.nn.Linear(hidden_size, output_size)

        #batch norm after the first layer to make training faster and more stable
        self.bn1 = torch.nn.BatchNorm1d(hidden_size)

        #dropout to counter the heavy overfitting we observed
        self.dropout = torch.nn.Dropout(0.5)

    def forward(self, x):
        #leaky relu keeps a small gradient for negative inputs, which avoids "dead" units
        leaky = torch.nn.LeakyReLU(negative_slope=0.1)
        x = leaky(self.bn1(self.fc1(x)))
        #apply dropout to output from previous step
        x = self.dropout(x)
        x = leaky(self.fc2(x))
        x = self.fc3(x)
        return x

#create model with the right input/output sizes: one input per vocab word, 28 output classes
#the hidden size of 100 is an arbitrary choice
model = BoWModel(len(vocab), 100, 28)

#cross entropy loss, since the target is one of 28 discrete classes
loss = torch.nn.CrossEntropyLoss()

#Adam is a widely used adaptive optimizer that tends to work well with little tuning
#weight_decay adds L2 regularization that penalizes large weights to reduce overfitting
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-4)

<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to split your data into a training set and a validation set, or perform cross-validation, for model selection.
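
The two options named above can be sketched with index lists alone. This is an illustrative sketch in plain Python, not the provided scaffolding; the helper names `train_valid_split` and `k_fold_indices`, the seed, and the fold count are all assumptions.

```python
import random

def train_valid_split(n_examples, valid_fraction=0.2, seed=0):
    """Return shuffled, disjoint index lists for training and validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_examples * valid_fraction)
    return indices[n_valid:], indices[:n_valid]

def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train, valid) index lists for each of k roughly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# Toy usage with 10 examples.
train_idx, valid_idx = train_valid_split(10)
folds = list(k_fold_indices(10, k=5))
```

Shuffling before splitting matters when the file is ordered in some way; a plain "first 80%" slice is only safe if the rows are already in random order.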

In [72]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO

#index at which to split the data: the first 80% is for training, the rest for validation
split = int(len(vectors) * 0.8)

#slice the feature vectors and labels into training and validation tensors
train_vectors = vectors[:split]
train_labels = labels[:split]
valid_vectors = vectors[split:]
valid_labels = labels[split:]

#training loop for the bag-of-words neural net
#each epoch is a single full-batch gradient step on the whole training set
num_epochs = 200
for epoch in range(num_epochs):

    model.train()
    optimizer.zero_grad()
    train_preds = model(train_vectors)
    train_loss = loss(train_preds, train_labels.long())
    train_loss.backward()
    optimizer.step()

    #get validation predictions: raw logits, then the argmax class indices
    model.eval()
    with torch.no_grad():
        valid_preds = model(valid_vectors)
        valid_loss = loss(valid_preds, valid_labels.long())
        valid_preds_arg = torch.argmax(valid_preds, dim=1)
        valid_accuracy = (valid_preds_arg == valid_labels.long()).sum().item() / valid_labels.size(0)

    #print train loss, validation loss, validation accuracy every 25 epochs
    if epoch % 25 == 0 or epoch == num_epochs - 1:
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"  Train Loss: {train_loss.item():.4f}")
        print(f"  Validation Loss: {valid_loss.item():.4f}")
        print(f"  Validation Accuracy: {valid_accuracy * 100:.2f}%")

Epoch 1/200
  Train Loss: 3.3998
  Validation Loss: 3.3438
  Validation Accuracy: 0.70%
Epoch 26/200
  Train Loss: 3.0209
  Validation Loss: 3.3145
  Validation Accuracy: 2.35%
Epoch 51/200
  Train Loss: 2.6353
  Validation Loss: 3.1966
  Validation Accuracy: 41.90%
Epoch 76/200
  Train Loss: 2.2344
  Validation Loss: 2.7724
  Validation Accuracy: 58.80%
Epoch 101/200
  Train Loss: 1.8446
  Validation Loss: 2.2464
  Validation Accuracy: 63.05%
Epoch 126/200
  Train Loss: 1.5118
  Validation Loss: 1.9260
  Validation Accuracy: 65.45%
Epoch 151/200
  Train Loss: 1.2496
  Validation Loss: 1.6988
  Validation Accuracy: 66.65%
Epoch 176/200
  Train Loss: 1.0507
  Validation Loss: 1.5349
  Validation Accuracy: 67.95%
Epoch 200/200
  Train Loss: 0.9174
  Validation Loss: 1.4326
  Validation Accuracy: 68.35%


In [73]:
#create list to hold vectors in test set, populate list, convert list into tensor of float32
test_vectors = []
for text in test['text']:
    test_vectors.append(vectorize(text, vocab))
test_vectors = torch.tensor(np.array(test_vectors), dtype=torch.float32)

#get logits for the test set and take the argmax as the predicted class
model.eval()
with torch.no_grad():
    test_preds = model(test_vectors)
    test_preds_arg = torch.argmax(test_preds, dim=1).numpy()

#create pandas df in correct format
predictions_df = pd.DataFrame({
    'id': test['id'],
    'label': test_preds_arg
})

#write predictions to csv
predictions_df.to_csv('submission.csv', index=False)

In [20]:
print(train_vectors.shape)

torch.Size([8000, 10035])


Now we will use a gradient-boosted decision tree for our second model.

In [74]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.2
# TODO

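from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.decomposition import PCA

#PCA was tried for dimensionality reduction but is not applied below:
#the model did best on the full feature set (see 1.4.3)
pca = PCA(n_components=1000)  # keep 1000 components (unused)

#despite the "_reduced" suffix, these are the full bag-of-words features as numpy arrays
train_vectors_reduced = train_vectors.numpy()
valid_vectors_reduced = valid_vectors.numpy()
test_vectors_reduced = test_vectors.numpy()

#class labels as integer tensors
train_labels = train_labels.long()
valid_labels = valid_labels.long()

#evaluation sets whose log loss is reported after every boosting round
eval_set = [(train_vectors_reduced, train_labels), (valid_vectors_reduced, valid_labels)]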

# Train XGBoost model
model = xgb.XGBClassifier(
    max_depth=4,
    learning_rate=0.1,
    n_estimators=100,
    objective="multi:softmax",
    num_class=28  # Number of classes in the dataset
)
model.fit(train_vectors_reduced, train_labels,eval_set = eval_set, verbose =True)

# Evaluate on validation set
val_predictions = model.predict(valid_vectors_reduced)
val_accuracy = accuracy_score(valid_labels.numpy(), val_predictions)
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")

# Predict on test set
test_predictions = model.predict(test_vectors_reduced)

# Save predictions
predictions_df = pd.DataFrame({'id': test['id'], 'label': test_predictions})
predictions_df.to_csv('submission2.csv', index=False)

[0]	validation_0-mlogloss:3.05115	validation_1-mlogloss:3.06361
[1]	validation_0-mlogloss:2.86536	validation_1-mlogloss:2.88056
[2]	validation_0-mlogloss:2.72320	validation_1-mlogloss:2.74551
[3]	validation_0-mlogloss:2.60764	validation_1-mlogloss:2.63338
[4]	validation_0-mlogloss:2.51228	validation_1-mlogloss:2.54337
[5]	validation_0-mlogloss:2.42938	validation_1-mlogloss:2.46322
[6]	validation_0-mlogloss:2.35822	validation_1-mlogloss:2.39629
[7]	validation_0-mlogloss:2.29524	validation_1-mlogloss:2.33612
[8]	validation_0-mlogloss:2.23948	validation_1-mlogloss:2.28323
[9]	validation_0-mlogloss:2.18893	validation_1-mlogloss:2.23567
[10]	validation_0-mlogloss:2.14308	validation_1-mlogloss:2.19147
[11]	validation_0-mlogloss:2.10106	validation_1-mlogloss:2.14954
[12]	validation_0-mlogloss:2.06260	validation_1-mlogloss:2.11459
[13]	validation_0-mlogloss:2.02728	validation_1-mlogloss:2.08037
[14]	validation_0-mlogloss:1.99463	validation_1-mlogloss:2.05003
[15]	validation_0-mlogloss:1.96439	

<h3>1.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

To begin with, we needed to convert the text of each sentence into a numeric representation that a machine learning algorithm can use, and we needed a way for the model to choose among the 28 emotion classes. We framed this as multi-class classification: the model outputs an integer label from 0 to 27, one per class. The predicted class can come from a probability distribution over all classes, by selecting the most likely label, or from other rules that compare feature-vector similarity. We evaluate and improve our models using accuracy, the fraction of examples labeled correctly, which is also the metric used in the Kaggle competition.


After converting the text into vectors, we train a model on those vectors to learn the patterns that link them to their emotion labels.



1.4.2 Which two learning methods from class did you choose and why did you make these choices?

We chose a feed-forward neural network with a leaky ReLU activation function that outputs a 28-dimensional vector, one entry per class. Each entry is a score for the respective class, which a softmax turns into a probability, and the class with the highest score is selected as the model's output. For our second model, we used gradient-boosted decision trees. We chose this method because we felt decision trees generally work well for sorting text into multiple “buckets”, and that boosting would help us avoid some pitfalls of single trees, such as overfitting, especially since our feature vectors were large.
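
The step from output scores to a predicted class can be sketched in plain Python. The logits below are illustrative values for 4 classes rather than 28, and the helper names `softmax` and `predict_class` are hypothetical; in the notebook the same step is done with `torch.argmax`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Index of the largest logit, which is also the most probable class."""
    return max(range(len(logits)), key=lambda i: logits[i])

toy_logits = [-1.0, 2.0, 0.5, 0.0]   # illustrative values, not model output
probs = softmax(toy_logits)
predicted = predict_class(toy_logits)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.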


1.4.3 How did you do the model selection?


In order to learn a model on natural language, we wrote a bag-of-words vectorization function to convert text into numbers. Bag of words offered a good balance of cheap computation and easy implementation while being fairly effective. We created a dictionary called vocab that maps each unique word to an index. To cut down on features and unimportant words, we used the spaCy library to filter out “stop words” such as “and” or “to”, which contribute less to sentiment than other words. Additionally, we lemmatized words, for example converting “running” into “run”, to further cut down on unnecessary features.
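
The bag-of-words idea described above can be illustrated on a toy example. This sketch assumes the sentences are already tokenized and stop-word filtered (spaCy is left out to keep it self-contained); the toy sentences and the helper name `build_vocab` are made up for illustration.

```python
from collections import Counter

def build_vocab(tokenized_sentences):
    """Map each unique token to a column index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def vectorize(tokens, vocab):
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vector[vocab[token]] = n
    return vector

# Toy, already-tokenized sentences (stop words assumed removed upstream).
toy_sentences = [["feel", "happy", "happy"], ["feel", "sad"]]
toy_vocab = build_vocab(toy_sentences)   # {'feel': 0, 'happy': 1, 'sad': 2}
toy_vector = vectorize(["happy", "feel", "happy", "joy"], toy_vocab)   # [1, 2, 0]
```

The word "joy" never appeared when the vocabulary was built, so it is dropped; this is the main limitation of a fixed vocabulary.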


We did model selection for the feed-forward network by hyperparameter tuning. We tried adjusting the number of layers and found that with additional hidden layers the model overfit, as shown by a near-zero training loss alongside a validation loss above 2.0. However, we also found that removing the third layer decreased performance. Hence, we kept three layers, reduced the number of epochs, and added regularization (weight decay in the optimizer and dropout), which preserved the improved performance while reducing overfitting.
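
One way to make "overfitting" concrete is the gap between validation loss and training loss. The sketch below uses the values printed by the training loop in part 1.3; the helper names `generalization_gaps` and `best_epoch` are hypothetical.

```python
# (epoch, train_loss, valid_loss) copied from the training log in part 1.3
log = [
    (1, 3.3998, 3.3438), (26, 3.0209, 3.3145), (51, 2.6353, 3.1966),
    (76, 2.2344, 2.7724), (101, 1.8446, 2.2464), (126, 1.5118, 1.9260),
    (151, 1.2496, 1.6988), (176, 1.0507, 1.5349), (200, 0.9174, 1.4326),
]

def generalization_gaps(log):
    """Validation loss minus training loss at each logged epoch."""
    return [(epoch, valid - train) for epoch, train, valid in log]

def best_epoch(log):
    """Logged epoch with the lowest validation loss."""
    return min(log, key=lambda row: row[2])[0]

gaps = generalization_gaps(log)
best = best_epoch(log)
```

Note that the training loss is measured with dropout active while the validation loss is not, so the gap is only a rough indicator. In this log the validation loss is still falling at the last epoch, so the lowest logged value is at epoch 200.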


For the gradient-boosted decision trees, we first tried PCA to reduce the number of features, but found that the model performed best when trained on the full feature set. We also experimented with the tree depth and found that shallower trees gave slightly better performance: we went from a maximum depth of 6 to a maximum depth of 4.



1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

Yes, our basic solution exceeded Tiny Piney.






<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again while making innovative changes, such as using new training algorithms. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [75]:
data = pd.read_csv('train.csv')

test_data = pd.read_csv('test.csv')



In [76]:
#load the pretrained DistilBERT tokenizer and a classification head with 28 output classes
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=28).to('cpu')

#tokenize all training texts at once, padding/truncating to a common length
inputs = tokenizer(data["text"].to_list(), padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor(data["label"])

#tokenize the test texts the same way
test_inputs = tokenizer(test_data["text"].to_list(), padding=True, truncation=True, return_tensors='pt')


#wrap the encodings and labels so the Trainer can index individual examples
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        #one example: its input_ids and attention_mask plus its label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

dataset = TextDataset(inputs, labels)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
from sklearn.metrics import f1_score, accuracy_score

#hold out a random 20% of the training data for validation
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])


#metrics reported by the Trainer at every evaluation step
def compute_metrics(pred):
    labels = pred.label_ids
    #predicted class = index of the largest logit
    preds = pred.predictions.argmax(-1)
    #weighted F1 averages the per-class F1 scores by class frequency
    score = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'f1': score,
        'accuracy': acc,
    }
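
To show what the weighted F1 in `compute_metrics` measures and why it can differ from accuracy, here is a from-scratch sketch in plain Python. The helper name `weighted_f1` and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 0, 0]   # always predicts the majority class
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_f1 = weighted_f1(toy_true, toy_pred)
```

A classifier that ignores the rare class scores 0.75 accuracy here but only 9/14 (about 0.64) weighted F1, which is one plausible reason F1 trails accuracy in the early evaluation steps.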

In [30]:
training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='steps',
    output_dir='./results',
    run_name='my_experiment',
    report_to='none',
    seed=42,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics = compute_metrics
)


trainer.train()


trainer.evaluate()

Step,Training Loss,Validation Loss,F1,Accuracy
100,2.7098,1.861923,0.365795,0.495
200,1.6237,1.317019,0.565912,0.633
300,1.1473,1.099909,0.652605,0.6965
400,0.9922,1.020955,0.693295,0.723
500,1.0065,0.938475,0.719269,0.7435
600,0.8264,0.906927,0.724391,0.744
700,0.7678,0.92069,0.72853,0.7525
800,0.8343,0.872855,0.727389,0.7535
900,0.7299,0.844276,0.73957,0.7565
1000,0.7177,0.831046,0.739738,0.7605


{'eval_loss': 0.8156575560569763,
 'eval_f1': 0.7462436776875906,
 'eval_accuracy': 0.7665,
 'eval_runtime': 1.1076,
 'eval_samples_per_second': 1805.708,
 'eval_steps_per_second': 90.285,
 'epoch': 3.0}

In [44]:

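#Hugging Face dataset class, aliased so it does not shadow torch's Dataset imported in 0.1
from datasets import Dataset as HFDataset
#wrap the tokenized test inputs; the test set has no labels
test_dataset = HFDataset.from_dict({
    "input_ids": test_inputs["input_ids"],
    "attention_mask": test_inputs["attention_mask"],
})
test_results = trainer.predict(test_dataset)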
print(test_results)


PredictionOutput(predictions=array([[-3.2994623e+00, -4.1913283e-01, -3.1433434e+00, ...,
        -3.4308878e-01, -2.6346073e+00,  6.8985143e+00],
       [-3.4019110e+00, -1.0683243e+00, -2.2352924e+00, ...,
        -5.0666088e-01, -3.2107604e+00, -4.8267762e-03],
       [-3.1755946e+00,  4.3650618e-01, -2.8848526e+00, ...,
        -9.5433182e-01, -2.5741169e+00,  7.8771424e-01],
       ...,
       [-3.3728209e+00, -1.9593272e+00, -1.9619123e+00, ...,
         2.9653367e-01, -2.8483663e+00, -8.4335697e-01],
       [-2.2238469e+00,  8.2302198e+00, -2.1392069e+00, ...,
        -1.7103633e+00, -3.0423563e+00,  6.4389908e-01],
       [-3.4659858e+00,  1.3115994e+00, -2.6390691e+00, ...,
        -1.9886749e+00, -3.1556966e+00,  6.4458926e-03]], dtype=float32), label_ids=None, metrics={'test_runtime': 34.8425, 'test_samples_per_second': 430.509, 'test_steps_per_second': 21.525})


In [52]:
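#raw logits for each test example, shape (n_examples, 28)
predictions = test_results.predictions
#predicted class = index of the largest logit
predicted_classes = predictions.argmax(axis=1)

#NOTE: Part 3 requires the second column to be named 'label';
#rename 'predictions' before uploading this file to Kaggle
df = pd.DataFrame({'id': list(range(len(test_data))), 'predictions': predicted_classes})

df.to_csv('submission3.csv', index=False)

print(df)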

          id  predictions
0          0           27
1          1           16
2          2           21
3          3           21
4          4           21
...      ...          ...
14995  14995            9
14996  14996            9
14997  14997           12
14998  14998            1
14999  14999            4

[15000 rows x 2 columns]


<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The results should be presented in two columns in CSV format: the first column is the data id (0-14999) and the second column contains the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample prediction file can be downloaded from Kaggle for each problem.
The cell below shows how to save a CSV file if you are running the Notebook on Kaggle.
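
The format rules above can be checked before uploading. This is a minimal sketch using only the standard library; the helper name `check_submission` and the toy CSV strings are hypothetical, and Kaggle's own validation may check more than this.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty list means OK)."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["id", "label"]:
        problems.append("header must be exactly: id,label")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    ids = {row[0] for row in body if row}
    if ids != {str(i) for i in range(expected_rows)}:
        problems.append(f"ids must cover 0..{expected_rows - 1}")
    return problems

# Toy files with 3 rows instead of 15000.
good = "id,label\n0,27\n1,16\n2,21\n"
bad = "id,predictions\n0,27\n1,16\n2,21\n"
good_problems = check_submission(good, expected_rows=3)
bad_problems = check_submission(bad, expected_rows=3)
```

For a real file, the text could be read with `open(path).read()` and checked with `expected_rows=15000`.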

In [None]:
id = range(15000)
prediction = range(15000)
submission = pd.DataFrame({'id': id, 'label': prediction})
submission.to_csv('/kaggle/working/submission.csv', index=False)

In [None]:
# TODO

# You may use pandas to generate a dataframe with the id and your predictions first
# and then use to_csv to generate a CSV file.

<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.