<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members:

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

<p> The task is to classify text by the human emotion it expresses. Through words, humans express feelings, articulate thoughts, and communicate their deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Can you train an ML model that recognizes the emotion expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook.</b> </p>
<p>


<h2>Part 0: Preliminaries</h2><p>

<h3>0.1 Import:</h3><p>
Please import the packages you need. Note that learning and using these packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [6]:
import os
import pandas as pd
import numpy as np
import torch
# TODO

# student-imported libraries are listed below

from sklearn.feature_extraction.text import CountVectorizer # for preprocessing
import re # imports the regular expression module, which provides support for working w/ text
from sklearn.model_selection import train_test_split # allows us to split our bag of words and labels for training

# algorithm 1: SVM
from sklearn.svm import SVC
import random

# algorithm 2: MLP
# torch submodules (torch itself is imported at the top of this cell)
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# new vision-dataset-related torch imports
import torchvision
import torchvision.datasets as dset
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

import time
import math
from sklearn.model_selection import GridSearchCV




<h3>0.2 Accuracy:</h3><p>
To measure your performance in the Kaggle competition, we use accuracy. As a recap, accuracy is the fraction of labels you predict correctly. You can compute it with library functions from sklearn; a simple example is shown below.
<p>

In [7]:
from sklearn.metrics import accuracy_score, classification_report
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
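
As a rough check of the definition, accuracy can also be computed by hand as the number of positions where prediction and label agree, divided by the number of labels. The sketch below reuses the two lists from the cell above; the names `n_correct` and `manual_accuracy` are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]

# Count the positions where the prediction matches the true label
n_correct = int(np.sum(np.array(y_pred) == np.array(y_true)))
manual_accuracy = n_correct / len(y_true)

print(n_correct, len(y_true))  # 3 7
print(np.isclose(manual_accuracy, accuracy_score(y_true, y_pred)))  # True
```

Only the last three positions match, so the result is 3/7, which agrees with the value printed above.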

<h2>Part 1: Basics</h2><p>
Note that your code should be well commented; in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
The cell below shows how to load the data in a Kaggle Notebook.
<p>

In [8]:
#train = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/train.csv")
#train_text = train["text"]
#train_label = train["label"]

#test = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/test.csv")
#test_id = test["id"]
#test_text = test["text"]

In [9]:
### Load the data when testing on a local machine. Delete or comment out this cell when submitting.
train = pd.read_csv("train.csv")
train_text = train["text"]
train_label = train["label"]


test = pd.read_csv("test.csv")
test_id = test["id"]
test_text = test["text"]


In [33]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# train 

def preprocess_text(text):
    """
    Normalize a string by:
    1. lowercasing all text for consistency
    2. removing punctuation (word characters, including digits and
       underscores, and whitespace are kept)
    
    Returns the cleaned text.
    """
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

# preprocess the text 
train_text_cleaned = train_text.apply(preprocess_text)

# converting the processed text into a bag-of-words vector

# Initialize CountVectorizer
vectorizer = CountVectorizer(max_features=1000)

# Fit and transform the text data
X = vectorizer.fit_transform(train_text_cleaned)
y = train_label
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Densify the sparse bag-of-words matrices and convert them to float32 tensors
X_train_tensor = torch.FloatTensor(X_train.toarray())
X_test_tensor = torch.FloatTensor(X_test.toarray())

# Convert y_train and y_test from Pandas Series to PyTorch tensors
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32).to(X_train_tensor.device)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32).to(X_test_tensor.device)

# Reshape the labels to (n, 1) to match the single-output MLP in part 1.2,
# which treats the label as a regression target
y_train_tensor = y_train_tensor.view(-1, 1)
y_test_tensor = y_test_tensor.view(-1, 1)


print(f"X_train_tensor shape: {X_train_tensor.shape}")
print(f"y_train_tensor shape: {y_train_tensor.shape}")



#print(X_train_tensor.shape[1])
# TODO

X_train_tensor shape: torch.Size([8000, 1000])
y_train_tensor shape: torch.Size([8000, 1])


In [35]:
### Preprocess the test set with the same steps as the training set
test_text_cleaned = test_text.apply(preprocess_text)

# Convert the cleaned text into bag-of-words vectors.
# Reuse the vectorizer fitted on the training text (transform only, no refit)
# so that the test columns match the training vocabulary.
test_x = vectorizer.transform(test_text_cleaned)
print(test_x.shape)


(15000, 1000)
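
The reason for calling `transform` rather than `fit_transform` on the test text can be seen in a small sketch. The documents below are hypothetical, and the names `train_docs`, `test_docs`, `vec`, `train_bow`, and `test_bow` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; "but" and "furious" never appear in training
train_docs = ["happy happy joy", "sad and angry"]
test_docs = ["happy but furious"]

vec = CountVectorizer()
train_bow = vec.fit_transform(train_docs)  # learns the vocabulary
test_bow = vec.transform(test_docs)        # reuses it, no refit

print(train_bow.shape, test_bow.shape)
print(test_bow.toarray())
```

Both matrices have the same number of columns, and words unseen during fitting are silently ignored. Refitting on the test text would instead produce a different vocabulary whose columns no longer line up with what the model was trained on.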


<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

In [12]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
### SVM
svm = SVC(kernel = "linear", C = 2)
svm.fit(X_train, y_train)
    
print("SVM Ready")
# TODO

SVM Ready


In [None]:
def gen_nonlinear_data(num_samples=10000):
    # generate random x samples for training and test sets
    xTr = torch.rand(num_samples, 1) * 2 * np.pi
    xTe = torch.rand(int(num_samples * 0.1), 1) * 2 * np.pi
    
    # uniform noise in [0, 0.2) for non-linear regression
    noise = torch.rand(num_samples, 1) * 0.2
    test_noise = torch.rand(int(num_samples * 0.1), 1) * 0.2
    
    # add noise to the labels of the training and test sets
    yTr = torch.sin(xTr) + noise
    yTe = torch.sin(xTe) + test_noise
    
    #print(xTr.shape)
    #print("PING")
    return xTr, xTe, yTr, yTe

In [None]:
### MLP

def mse_loss(y_pred, y_true):
    square_diff = torch.pow((y_pred-y_true), 2)
    mean_error = 0.5 * torch.mean(square_diff)
    return mean_error

class CustomSGD(optim.Optimizer):
    def __init__(self, params, lr=0.01, momentum=0.9):
        defaults = dict(lr=lr, momentum=momentum)
        super(CustomSGD, self).__init__(params, defaults)
        self.velocities = [torch.zeros_like(param.data) for param in self.param_groups[0]['params']]
    
    def step(self):
        """Update the parameters with velocity and gradient.
        There is nothing needed to return from the function.
        Please update the param.data directly.
        """
        for group in self.param_groups:
            for param, velocity in zip(group['params'], self.velocities):
                if param.grad is None:
                    continue
                
                lr = group['lr'] # learning rate
                momentum = group['momentum'] # momentum coefficient
                gradient = param.grad.data # gradient
                
                # update the velocity as a moving average of the gradient;
                # [:] makes the update in place
                velocity[:] = momentum * velocity + (1 - momentum) * gradient
                
                # step the parameters along the velocity
                param.data = param.data - lr * velocity
                
                
class MLPNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=1):
        super(MLPNet, self).__init__()
        """ pytorch optimizer checks for the properties of the model, and if
            the torch.nn.Parameter requires gradient, then the model will update
            the parameters automatically.
        """
        self.input_dim = input_dim
        
        # Initialize the fully connected layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Forward pass: flatten, hidden layer with ReLU, then linear output
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

def train_regression_model(xTr, yTr, model, num_epochs, lr=1e-2, momentum=0.9, print_freq=100, display_loss=True):
    """Train loop for a neural network model.
    
    Input:
        xTr:     (n, d) matrix of regression input data
        yTr:     n-dimensional vector of regression labels
        model:   nn.Module to be trained
        num_epochs: number of epochs to train the model for
        lr:      learning rate for the optimizer
        momentum: momentum coefficient for the optimizer
        print_freq: frequency to display the loss
        display_loss: boolean, if we print the loss
    
    Output:
        model:   nn.Module trained model
    """
    optimizer = CustomSGD(model.parameters(), lr=lr, momentum=momentum)  # create a momentum-SGD optimizer for the model parameters
    
    for epoch in range(num_epochs):
        # need to zero the gradients in the optimizer so we don't
        # use the gradients from previous iterations
        optimizer.zero_grad()  
        pred = model(xTr)  # run the forward pass through the model to compute predictions
        loss = mse_loss(pred, yTr)
        loss.backward()  # compute the gradient wrt loss
        optimizer.step()  # performs a step of gradient descent
        if display_loss and (epoch + 1) % print_freq == 0:
            print('epoch {} loss {}'.format(epoch+1, loss.item()))
    
    return model  # return trained model


hdims = 69
num_epochs = 5000
lr = 1e-1
momentum = 0.9

start_time = time.time()


#X_train, X_test, y_train, y_test = gen_nonlinear_data(num_samples=500)

size = X_train_tensor.shape[1]
#mlp_model = MLPNet(input_dim=size, hidden_dim=hdims, output_dim=1)
#mlp_model = train_regression_model(X_train_tensor, y_train_tensor, mlp_model, num_epochs=num_epochs, lr=lr, momentum=momentum)
#mlp_model = train_regression_model(X_train, y_train, mlp_model, num_epochs=num_epochs, lr=lr, momentum=momentum)


#fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,3))

#X_train_tensor_1D = X_train_tensor[:, 0]  # Use the first feature
#X_test_tensor_1D = X_test_tensor[:, 0]

# Plot the visualizations from our MLP Model
#ax1.scatter(X_train_tensor, y_train_tensor, label="Train Points")
#ax1.scatter(X_test_tensor, y_test_tensor, label="Test Points")
#ax1.scatter(X_train_tensor, mlp_model(X_train_tensor).detach(), color="red", marker='o', label="Prediction")

#ax1.scatter(X_train, y_train, label="Train Points")
#ax1.scatter(X_test, y_test, label="Test Points")
#ax1.scatter(X_train, mlp_model(X_train).detach(), color="red", marker='o', label="Prediction")

#ax1.scatter(X_train_tensor_1D.cpu().numpy(), y_train_tensor.squeeze().cpu().numpy(), label="Train Points")
#ax1.scatter(X_test_tensor_1D.cpu().numpy(), y_test_tensor.squeeze().cpu().numpy(), label="Test Points")
#ax1.scatter(X_train_tensor_1D.cpu().numpy(), mlp_model(X_train_tensor).detach().squeeze().cpu().numpy(), color="red", marker='o', label="Prediction")

#ax1.legend()
#ax1.set_title('MLP Net')

#end_time = time.time()
#elapsed_time = end_time - start_time

#print(f"Program started at: {time.ctime(start_time)}")
#print(f"Program ended at: {time.ctime(end_time)}")
#print(f"Total elapsed time: {elapsed_time:.5f} seconds")
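
Because the emotion labels are class ids rather than quantities, one common alternative to the regression setup above is to give the network one output per class and train it with cross-entropy. The following is a minimal sketch under stated assumptions, not the method used in this notebook: it assumes 28 classes (the prediction counts in part 1.3 show labels up to 27), runs on synthetic stand-in data, and swaps in `torch.optim.Adam` for `CustomSGD` to keep the example short. All names (`X_demo`, `y_demo`, `clf`, `pred_labels`) are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes: 1000 bag-of-words features, 28 emotion classes
num_features, num_classes, hidden_dim = 1000, 28, 64

# Synthetic stand-in data so the sketch runs on its own
X_demo = torch.rand(256, num_features)
y_demo = torch.randint(0, num_classes, (256,))  # int64 class ids, shape (n,)

clf = nn.Sequential(
    nn.Linear(num_features, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),  # one logit per class
)
loss_fn = nn.CrossEntropyLoss()  # takes raw logits and integer labels
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)

losses = []
for epoch in range(5):
    optimizer.zero_grad()           # clear gradients from the previous step
    logits = clf(X_demo)            # shape (n, num_classes)
    loss = loss_fn(logits, y_demo)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Predicted class id per row: index of the largest logit
pred_labels = clf(X_demo).argmax(dim=1)
print(logits.shape, pred_labels.shape)
```

Note that the labels stay as a 1-D integer tensor here, rather than the float `(n, 1)` tensor prepared in part 1.1 for the regression loss.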
    



<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to split your data into a training set and a validation set, or perform cross-validation, for model selection.

In [38]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
### SVM


svm = SVC(kernel = "rbf", C = 5)
svm.fit(X_train, y_train)
y_preds = svm.predict(test_x)
#y_preds = svm.predict(X_test)
#print("SVM Accuracy: ", accuracy_score(y_test, y_preds))
#print("Classification Report:\n", classification_report(y_test, y_preds))
# print(y_preds)
print("Done")
print()
# share of test predictions per label (Series.value_counts replaces the deprecated pd.value_counts)
value_frequency = pd.Series(y_preds).value_counts(sort=True, normalize=True)
print(value_frequency)

# TODO

Done

21    0.342400
1     0.244200
12    0.214600
9     0.058933
4     0.045333
27    0.035733
16    0.010400
25    0.010133
22    0.008200
17    0.008133
10    0.007067
18    0.005533
8     0.002733
20    0.001467
23    0.001133
11    0.000800
24    0.000733
3     0.000533
15    0.000400
6     0.000400
2     0.000400
14    0.000267
19    0.000200
5     0.000133
13    0.000067
7     0.000067
Name: proportion, dtype: float64
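
One way to carry out the model selection this part asks for is a cross-validated grid search over the SVM hyperparameters, using the `GridSearchCV` class imported in part 0.1. The sketch below is an illustration under assumptions, not the procedure used above: it runs on synthetic data from `make_classification` in place of the bag-of-words matrix, and the parameter grid is a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the bag-of-words features and emotion labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hold out a validation set that the search never sees
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical grid: every kernel/C combination is scored by 3-fold CV
param_grid = {"kernel": ["linear", "rbf"], "C": [0.5, 2, 5]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# The best setting is refit on all of X_tr, then checked on the held-out split
val_acc = accuracy_score(y_val, search.predict(X_val))
print(search.best_params_)
print(round(search.best_score_, 3), round(val_acc, 3))
```

Comparing the cross-validated score with the held-out accuracy gives a rough sense of how well the selected setting generalizes before spending a Kaggle submission.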


<h3>1.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

1.4.2 Which two learning methods from class did you choose, and why did you make these choices?

1.4.3 How did you do the model selection?

1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again while making innovative changes, such as using new training algorithms. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a portion of this part. Any creative ideas will receive most of the points as long as they are reasonable and clearly explained.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.2
# TODO

<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV from your trained model using the following cell and submit the direct output of your code to Kaggle. The results should be presented in two columns in CSV format: the first column is the data id (0-14999) and the second column contains the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample prediction file can be downloaded from Kaggle for each problem. 
The cell below shows how to save a CSV file if you are running the Notebook on Kaggle.

In [None]:
id = range(15000)
prediction = range(15000)
submission = pd.DataFrame({'id': id, 'label': prediction})
submission.to_csv('/kaggle/working/submission.csv', index=False)

In [39]:
# TODO

# You may use pandas to generate a dataframe with the ids and your predictions first 
# and then use to_csv to generate a CSV file.


id = test_id
prediction = y_preds
submission = pd.DataFrame({'id': id, 'label': prediction})
submission.to_csv('submission.csv', index=False)
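
Since a malformed file causes the submission to fail, it may help to check the DataFrame against the stated format before uploading. The helper below is a hypothetical sketch, not part of the course template: `check_submission` and its `n_rows` parameter are illustrative names, and the checks cover only the requirements described in this part (columns named id and label, 15000 rows).

```python
import pandas as pd

def check_submission(df, n_rows=15000):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(df.columns) != ["id", "label"]:
        problems.append(f"columns are {list(df.columns)}, expected ['id', 'label']")
    if len(df) != n_rows:
        problems.append(f"{len(df)} rows, expected {n_rows}")
    if "id" in df.columns and df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df.isna().any().any():
        problems.append("missing values")
    return problems

# Well-formed example with placeholder predictions
demo = pd.DataFrame({"id": range(15000), "label": [0] * 15000})
print(check_submission(demo))  # []

# Malformed example: wrong column name, wrong row count, repeated id
bad = pd.DataFrame({"id": [0, 0], "prediction": [1, 2]})
print(check_submission(bad))
```

In the notebook, the same check could be applied to `submission` just before the `to_csv` call.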

<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.