### Data Preprocessing and Analysis
- **Initial Setup**: Started by importing necessary libraries and setting up the environment, including checking for GPU availability for PyTorch.
- **Data Loading**: The Amazon reviews dataset was loaded from a TSV file, selectively keeping only the review texts and their corresponding star ratings.
- **Data Cleaning**:
  - Converted all reviews to lowercase to standardize the text.
  - Expanded contractions using a predefined dictionary to improve text understanding.
  - Removed HTML tags, URLs, and non-alphabetic characters to clean the text.
  - Applied regex operations for text cleaning and whitespace normalization.
- **Feature Engineering**:
  - Converted star ratings into numerical data and handled missing values.
  - Created a balanced dataset by sampling equal numbers of reviews for each rating.
  - Transformed ratings into a ternary and binary classification system.

### Word Embeddings
- **Pre-trained Word2Vec**: 
  - Loaded the Google News Word2Vec model.
  - Demonstrated the model's capability by finding semantic similarities and performing analogy tasks (e.g., "King - Man + Woman = Queen").
- **Custom Word2Vec**:
  - Processed review texts to train a custom Word2Vec model.
  - Compared semantic similarities between words using the custom model against the pre-trained model to highlight differences in understanding domain-specific language.

### Modeling
- **Simple Models**:
  - Implemented Perceptron and SVM using both TF-IDF features and Word2Vec embeddings (pre-trained and custom).
  - Evaluated models based on accuracy, observing TF-IDF's superior performance for capturing relevant features in sentiment classification.
- **Feedforward Neural Networks (FFNN)**:
  - Designed a FFNN architecture with two hidden layers and dropout for regularization.
  - Trained models for both binary and ternary classification using averaged Word2Vec vectors and concatenated vectors, noting the effectiveness of averaging for capturing sentence semantics.
- **Convolutional Neural Network (CNN)**:
  - Constructed a CNN with two convolutional layers followed by a fully connected layer for sentiment analysis.
  - Adapted the input data to match CNN requirements by truncating or padding reviews to a fixed length and converting them into sequences of word embeddings.

### Evaluation and Findings
- Through meticulous experimentation, it was found that custom-trained Word2Vec models provided a more nuanced understanding of the dataset-specific language compared to pre-trained models.
- The evaluation highlighted the strength of neural network-based models, especially CNNs, in extracting local and global textual patterns for sentiment analysis.
- Comparisons among different models and feature representations underscored the importance of tailored preprocessing, feature engineering, and model selection in achieving high accuracy in sentiment classification tasks.

In [1]:
#Python version - 3.11.6

import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import re
import torch
from torch.utils.data import DataLoader, TensorDataset 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vaibhavbhajanka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vaibhavbhajanka/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Task 1. Read Data

In [3]:
file_path = 'amazon_reviews_us_Office_Products_v1_00.tsv'
df = pd.read_csv(file_path, sep='\t', on_bad_lines='skip')
df

  df = pd.read_csv(file_path, sep='\t', on_bad_lines='skip')


Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,43081963,R18RVCKGH1SSI9,B001BM2MAC,307809868,"Scotch Cushion Wrap 7961, 12 Inches x 100 Feet",Office Products,5,0.0,0.0,N,Y,Five Stars,Great product.,2015-08-31
1,US,10951564,R3L4L6LW1PUOFY,B00DZYEXPQ,75004341,"Dust-Off Compressed Gas Duster, Pack of 4",Office Products,5,0.0,1.0,N,Y,"Phffffffft, Phfffffft. Lots of air, and it's C...",What's to say about this commodity item except...,2015-08-31
2,US,21143145,R2J8AWXWTDX2TF,B00RTMUHDW,529689027,Amram Tagger Standard Tag Attaching Tagging Gu...,Office Products,5,0.0,0.0,N,Y,but I am sure I will like it.,"Haven't used yet, but I am sure I will like it.",2015-08-31
3,US,52782374,R1PR37BR7G3M6A,B00D7H8XB6,868449945,AmazonBasics 12-Sheet High-Security Micro-Cut ...,Office Products,1,2.0,3.0,N,Y,and the shredder was dirty and the bin was par...,Although this was labeled as &#34;new&#34; the...,2015-08-31
4,US,24045652,R3BDDDZMZBZDPU,B001XCWP34,33521401,"Derwent Colored Pencils, Inktense Ink Pencils,...",Office Products,4,0.0,0.0,N,Y,Four Stars,Gorgeous colors and easy to use,2015-08-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2640249,US,53005790,RLI7EI10S7SN0,B00000DM9M,223408988,PalmOne III Leather Belt Clip Case,Office Products,4,26.0,26.0,N,N,Great value! A must if you hate to carry thing...,I can't live anymore whithout my Palm III. But...,1998-12-07
2640250,US,52188548,R1F3SRK9MHE6A3,B00000DM9M,223408988,PalmOne III Leather Belt Clip Case,Office Products,4,18.0,18.0,N,N,Attaches the Palm Pilot like an appendage,Although the Palm Pilot is thin and compact it...,1998-11-30
2640251,US,52090046,R23V0C4NRJL8EM,0807865001,307284585,Gods and Heroes of Ancient Greece,Office Products,4,9.0,16.0,N,N,"Excellent information, pictures and stories, I...",This book had a lot of great content without b...,1998-10-15
2640252,US,52503173,R13ZAE1ATEUC1T,1572313188,870359649,Microsoft EXCEL 97/ Visual Basic Step-by-Step ...,Office Products,5,0.0,0.0,N,N,class text,I am teaching a course in Excel and am using t...,1998-08-22


## Keep Reviews and Ratings

In [4]:
df = df[['review_body', 'star_rating']]

df.columns = ['Reviews','Ratings']

df['Reviews'] = df['Reviews'].apply(lambda x: str(x))

df['Ratings'] = pd.to_numeric(df['Ratings'], errors='coerce')
df.dropna(inplace=True)

df['Ratings'] = df['Ratings'].astype(int)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Reviews'] = df['Reviews'].apply(lambda x: str(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Ratings'] = pd.to_numeric(df['Ratings'], errors='coerce')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Unnamed: 0,Reviews,Ratings
0,Great product.,5
1,What's to say about this commodity item except...,5
2,"Haven't used yet, but I am sure I will like it.",5
3,Although this was labeled as &#34;new&#34; the...,1
4,Gorgeous colors and easy to use,4
...,...,...
2640249,I can't live anymore whithout my Palm III. But...,4
2640250,Although the Palm Pilot is thin and compact it...,4
2640251,This book had a lot of great content without b...,4
2640252,I am teaching a course in Excel and am using t...,5


## Create a Balanced Dataset

In [5]:
# Randomly sample 50,000 instances for each rating
balanced_df = pd.concat([df[df['Ratings'] == i].sample(50000) for i in range(1, 6)])
balanced_df

Unnamed: 0,Reviews,Ratings
2016559,I am very unhappy with this product.<br />I se...,1
341000,After two weeks it quit working like it should...,1
230018,This does not cover enough to hide personal in...,1
1261548,The photo shows the inserts but these are not ...,1
2363669,The organizer is great! I am a teacher and the...,1
...,...,...
1209675,Its good and works well.,5
253570,great product. very compact. and great for tra...,5
2410538,This shredder works great and it has a handle ...,5
1447154,These really are great markers. Over the years...,5


## Create ternary labels using the ratings.

In [6]:
def label_sentiment(row):
    if row['Ratings'] > 3:
        return 0  # Positive - Class 1
    elif row['Ratings'] < 3:
        return 1  # Negative - Class 2
    else:
        return 2  # Neutral - Class 3

balanced_df['sentiment'] = balanced_df.apply(label_sentiment, axis=1)
balanced_df

Unnamed: 0,Reviews,Ratings,sentiment
2016559,I am very unhappy with this product.<br />I se...,1,1
341000,After two weeks it quit working like it should...,1,1
230018,This does not cover enough to hide personal in...,1,1
1261548,The photo shows the inserts but these are not ...,1,1
2363669,The organizer is great! I am a teacher and the...,1,1
...,...,...,...
1209675,Its good and works well.,5,0
253570,great product. very compact. and great for tra...,5,0
2410538,This shredder works great and it has a handle ...,5,0
1447154,These really are great markers. Over the years...,5,0


# Data Cleaning

In [7]:
def expand_contractions(text, contractions_dict):
    # Regular expression for finding contractions
    contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

    def replace(match):
        return contractions_dict[match.group(0)]
    
    return contractions_re.sub(replace, text)

def clean_text(text):
    
    contractions_dict = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "needn't": "need not",
    "shan't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there would",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who's": "who is",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "would've": "would have",
    "wouldn't": "would not",
    "y'all": "you all",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

    # Convert text to lower case
    text = text.lower()
    # Expand contractions
    text = expand_contractions(text, contractions_dict)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www.\S+', '', text)
    # Remove non-alphabetical characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [8]:
balanced_df['Reviews'] = balanced_df['Reviews'].apply(lambda x: clean_text(x))
balanced_df

Unnamed: 0,Reviews,Ratings,sentiment
2016559,i am very unhappy with this producti set it un...,1,1
341000,after two weeks it quit working like it should...,1,1
230018,this does not cover enough to hide personal in...,1,1
1261548,the photo shows the inserts but these are not ...,1,1
2363669,the organizer is great i am a teacher and the ...,1,1
...,...,...,...
1209675,its good and works well,5,0
253570,great product very compact and great for trave...,5,0
2410538,this shredder works great and it has a handle ...,5,0
1447154,these really are great markers over the years ...,5,0


# Pre-processing

## remove the stop words 

In [9]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def remove_stopwords(text):
    words = text.split()

    words = [word for word in words if word not in stop_words]

    processed_text = ' '.join(words)

    return processed_text

In [10]:
balanced_df['Reviews'] = balanced_df['Reviews'].apply(lambda x:remove_stopwords(x))
balanced_df

Unnamed: 0,Reviews,Ratings,sentiment
2016559,unhappy producti set roof direct rain snow cou...,1,1
341000,two weeks quit working like feed correctly top...,1,1
230018,cover enough hide personal information definit...,1,1
1261548,photo shows inserts included clearly stated fi...,1,1
2363669,organizer great teacher photo shows bright mul...,1,1
...,...,...,...
1209675,good works well,5,0
253570,great product compact great traveling thanks,5,0
2410538,shredder works great handle help move around a...,5,0
1447154,really great markers years kids involved sport...,5,0


## perform lemmatization  

In [11]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    words = text.split()

    words = [lemmatizer.lemmatize(word) for word in words]

    processed_text = ' '.join(words)

    return processed_text

In [12]:
balanced_df['Reviews'] = balanced_df['Reviews'].apply(lambda x:lemmatize(x))
balanced_df

Unnamed: 0,Reviews,Ratings,sentiment
2016559,unhappy producti set roof direct rain snow cou...,1,1
341000,two week quit working like feed correctly top ...,1,1
230018,cover enough hide personal information definit...,1,1
1261548,photo show insert included clearly stated find...,1,1
2363669,organizer great teacher photo show bright mult...,1,1
...,...,...,...
1209675,good work well,5,0
253570,great product compact great traveling thanks,5,0
2410538,shredder work great handle help move around al...,5,0
1447154,really great marker year kid involved sport al...,5,0


## Splitting into Training & Test Datasets

In [13]:
from sklearn.model_selection import train_test_split

ternary_train_df, ternary_test_df = train_test_split(balanced_df, test_size=0.2, random_state=42)

In [14]:
print(ternary_train_df.shape[0])
print(ternary_test_df.shape[0])

200000
50000


# Task 2. Word Embedding

## a1. Loading pre-trained model

## Download

In [15]:
# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')
# wv.save("word2vec-google-news-300.model")

## Import

In [12]:
from gensim.models import KeyedVectors

# Load the pretrained model
pretrained_model = KeyedVectors.load('./word2vec-google-news-300.model')

# a2. Checking Semantic Similarities of Generated Vectors

#### Example 1: King − Man + Woman = Queen

In [13]:
print(pretrained_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

[('queen', 0.7118192911148071)]


#### Example 2: excellent ~ outstanding

In [18]:
print(pretrained_model.similarity('excellent', 'outstanding'))

0.5567486


#### Example 3: bad ~ worst

In [19]:
print(pretrained_model.similarity('bad', 'worst'))

0.43674564


## b1. Training a Word2Vec model using own dataset

In [14]:
from gensim.models import Word2Vec
from gensim import utils

def TrainOwnData():
    sentences = []
    for line in balanced_df['Reviews']:
        sentences.append(utils.simple_preprocess(line))
    return sentences

sentences = TrainOwnData()
custom_model = Word2Vec(sentences=sentences, vector_size=300, window=11, min_count=10, workers=4).wv
custom_model.save("word2vec.model")

## b2. Checking Semantic Similarities of own dataset

In [22]:
print(custom_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

[('latin', 0.5304921865463257)]


In [23]:
print(custom_model.similarity('excellent', 'outstanding'))

0.7530478


In [24]:
print(custom_model.similarity('bad', 'worst'))

0.23408216


## c. What do you conclude from comparing vectors generated by yourself and the pretrained model? Which of the Word2Vec models seems to encode semantic similarities between words better?

Pre-trained Model (Model 1)
- King - Man + Woman: 0.7118 (Better at capturing complex relational semantics)
- Excellent ~ Outstanding: 0.5567 (Good understanding, but lower than custom model)
- Bad ~ Worst: 0.4367 (Better understanding of negative sentiment words)

Custom-Trained Model (Model 2)
- King - Man + Woman: 0.5206 (Less effective at complex analogies)
- Excellent ~ Outstanding: 0.7685 (Strong performance, better than pre-trained model)
- Bad ~ Worst: 0.2352 (Weaker understanding of negative sentiment words)

Conclusion: Overall, if you need a model for general language tasks and understanding a broad range of semantic relationships, the pre-trained Word2Vec model is superior. However, for tasks that are closely aligned with consumer reviews and where the context is similar to your training data, your custom model could be more effective.

# Task 3. Simple Models

## Retaining Positive and Negative Reviews

In [16]:
binary_df = balanced_df[balanced_df['Ratings']!=3]
binary_df

Unnamed: 0,Reviews,Ratings,sentiment
1485554,ink level near empty printing two page buyer b...,1,1
1951000,using remanufactured toner printer first time ...,1,1
61524,received empty box,1,1
1486384,first time tried roll around wheel broke im st...,1,1
932075,received sign holder suppose screw included in...,1,1
...,...,...,...
680881,printer amazing email people document scan pri...,5,0
1798383,excited one get bulb ordered got put tv work g...,5,0
2494750,bought year old year old start teaching value ...,5,0
148122,heavy duty large enough hold printer accessory...,5,0


## Splitting Dataframe into Training and Testing sets

In [17]:
binary_train_df, binary_test_df = train_test_split(binary_df, test_size=0.2, random_state=42)
print(binary_train_df.shape[0])
print(binary_test_df.shape[0])

160000
40000


## Function for averaging Word2Vec vectors

In [27]:
def get_avg_word2vec(review, model):
    words = review.split()
    feature_vec = np.zeros(300, dtype=int)
    nwords = 0

    for word in words:
        if word in model:
            nwords += 1
            feature_vec = feature_vec + model[word]

    if nwords > 0:
        feature_vec = feature_vec/ nwords
    return feature_vec


In [28]:
# For pre-trained model
trainDataVecs_pretrained_binary = np.array([get_avg_word2vec(review, pretrained_model) for review in binary_train_df['Reviews']])
testDataVecs_pretrained_binary = np.array([get_avg_word2vec(review, pretrained_model) for review in binary_test_df['Reviews']])

In [29]:
# For custom-trained model
trainDataVecs_custom_binary = np.array([get_avg_word2vec(review, custom_model) for review in binary_train_df['Reviews']])
testDataVecs_custom_binary = np.array([get_avg_word2vec(review, custom_model) for review in binary_test_df['Reviews']])

## Perceptron Training - Pretrained

In [30]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

perceptron_pretrained = Perceptron()
perceptron_pretrained.fit(trainDataVecs_pretrained_binary, binary_train_df['sentiment'])
predictions = perceptron_pretrained.predict(testDataVecs_pretrained_binary)
accuracy_perceptron_pretrained = accuracy_score(binary_test_df['sentiment'], predictions)

## Perceptron Training - Custom

In [31]:
perceptron_custom = Perceptron()
perceptron_custom.fit(trainDataVecs_custom_binary, binary_train_df['sentiment'])
predictions = perceptron_custom.predict(testDataVecs_custom_binary)
accuracy_perceptron_custom = accuracy_score(binary_test_df['sentiment'], predictions)

## SVM Training - Pretrained

In [32]:
from sklearn.svm import LinearSVC

svm_pretrained = LinearSVC(random_state=42, max_iter=1000)
svm_pretrained.fit(trainDataVecs_pretrained_binary, binary_train_df['sentiment'])
predictions = svm_pretrained.predict(testDataVecs_pretrained_binary)
accuracy_svm_pretrained = accuracy_score(binary_test_df['sentiment'], predictions)




## SVM Training - Custom

In [33]:
svm_custom = LinearSVC(random_state=42, max_iter=1000)
svm_custom.fit(trainDataVecs_custom_binary, binary_train_df['sentiment'])
predictions = svm_custom.predict(testDataVecs_custom_binary)
accuracy_svm_custom = accuracy_score(binary_test_df['sentiment'], predictions)



## Reporting Accuracy Values for Test Dataset

In [34]:
accuracy_perceptron_tfidf = 0.846025
accuracy_svm_tfidf = 0.8864

print(f"{'Feature Type':<30} {'Model':<15} {'Accuracy':<10}")
print("-" * 55)

print(f"{'word2vec-google-news-300':<30} {'Perceptron':<15} {accuracy_perceptron_pretrained:<10}")
print(f"{'word2vec-google-news-300':<30} {'SVM':<15} {accuracy_svm_pretrained:<10}")

print(f"{'Custom Word2Vec':<30} {'Perceptron':<15} {accuracy_perceptron_custom:<10}")
print(f"{'Custom Word2Vec':<30} {'SVM':<15} {accuracy_svm_custom:<10}")

print(f"{'TF-IDF (from HW1)':<30} {'Perceptron':<15} {accuracy_perceptron_tfidf:<10}")
print(f"{'TF-IDF (from HW1)':<30} {'SVM':<15} {accuracy_svm_tfidf:<10}")


Feature Type                   Model           Accuracy  
-------------------------------------------------------
word2vec-google-news-300       Perceptron      0.6518    
word2vec-google-news-300       SVM             0.814925  
Custom Word2Vec                Perceptron      0.6744    
Custom Word2Vec                SVM             0.842575  
TF-IDF (from HW1)              Perceptron      0.846025  
TF-IDF (from HW1)              SVM             0.8864    


## What do you conclude from comparing performances for the models trained using the three different feature types (TF-IDF, pretrained Word2Vec, your trained Word2Vec)?

The performance comparison of models trained with different feature types reveals:

- **TF-IDF** is the most effective, suggesting its superiority in capturing document-specific keywords for this classification task.
- **Custom Word2Vec** performs better than the pretrained Word2Vec, indicating the value of domain-specific embeddings.
- **Pretrained Word2Vec** shows the lowest performance, likely due to its general nature not aligning well with the specific dataset.
- Across all features, **SVM outperforms Perceptron**, highlighting its effectiveness in handling complex feature spaces in this context.

# Task 4 Feedforward Neural Networks

In [15]:
import torch.nn as nn
import torch.nn.functional as F

class FeedforwardNN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, num_classes):
        super(FeedforwardNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, num_classes)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

## a. Average Word2Vec Vectors

In [36]:
# For pre-trained model(Ternary Classification)
trainDataVecs_pretrained_ternary = np.array([get_avg_word2vec(review, pretrained_model) for review in ternary_train_df['Reviews']])
testDataVecs_pretrained_ternary = np.array([get_avg_word2vec(review, pretrained_model) for review in ternary_test_df['Reviews']])

In [37]:
# For custom-trained model(Ternary Classification)
trainDataVecs_custom_ternary = np.array([get_avg_word2vec(review, custom_model) for review in ternary_train_df['Reviews']])
testDataVecs_custom_ternary = np.array([get_avg_word2vec(review, custom_model) for review in ternary_test_df['Reviews']])

## Data Loaders

In [38]:
from torch.utils.data import DataLoader, TensorDataset

# Create DataLoader for training data(Binary Classification) Pretrained Model
train_dataset_binary_pretrained = TensorDataset(torch.FloatTensor(trainDataVecs_pretrained_binary).to(device), torch.FloatTensor(binary_train_df['sentiment'].values).to(device))
train_loader_binary_pretrained = DataLoader(train_dataset_binary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data(Binary Classification) Pretrained Model
test_dataset_binary_pretrained = TensorDataset(torch.FloatTensor(testDataVecs_pretrained_binary).to(device), torch.FloatTensor(binary_test_df['sentiment'].values).to(device))
test_loader_binary_pretrained = DataLoader(test_dataset_binary_pretrained, batch_size=20, shuffle=False)

# Create DataLoader for training data(Binary Classification) Custom Model
train_dataset_binary_custom = TensorDataset(torch.FloatTensor(trainDataVecs_custom_binary).to(device), torch.FloatTensor(binary_train_df['sentiment'].values).to(device))
train_loader_binary_custom = DataLoader(train_dataset_binary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data(Binary Classification) Custom Model
test_dataset_binary_custom = TensorDataset(torch.FloatTensor(testDataVecs_custom_binary).to(device), torch.FloatTensor(binary_test_df['sentiment'].values).to(device))
test_loader_binary_custom = DataLoader(test_dataset_binary_custom, batch_size=20, shuffle=False)

# Create DataLoader for training data(Ternary Classification) Pretrained Model
train_dataset_ternary_pretrained = TensorDataset(torch.FloatTensor(trainDataVecs_pretrained_ternary).to(device), torch.FloatTensor(ternary_train_df['sentiment'].values).to(device))
train_loader_ternary_pretrained = DataLoader(train_dataset_ternary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data(Ternary Classification) Pretrained Model
test_dataset_ternary_pretrained = TensorDataset(torch.FloatTensor(testDataVecs_pretrained_ternary).to(device), torch.FloatTensor(ternary_test_df['sentiment'].values).to(device))
test_loader_ternary_pretrained = DataLoader(test_dataset_ternary_pretrained, batch_size=20, shuffle=False)

# Create DataLoader for training data(Ternary Classification) Custom Model
train_dataset_ternary_custom = TensorDataset(torch.FloatTensor(trainDataVecs_custom_ternary).to(device), torch.FloatTensor(ternary_train_df['sentiment'].values).to(device))
train_loader_ternary_custom = DataLoader(train_dataset_ternary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data(Ternary Classification) Custom Model
test_dataset_ternary_custom = TensorDataset(torch.FloatTensor(testDataVecs_custom_ternary).to(device), torch.FloatTensor(ternary_test_df['sentiment'].values).to(device))
test_loader_ternary_custom = DataLoader(test_dataset_ternary_custom, batch_size=20, shuffle=False)

## Model Parameters

In [16]:
# Define the model and move it to the GPU
input_size = 300 
hidden_size1 = 50
hidden_size2 = 10
num_classes_binary = 2
num_classes_ternary = 3
num_epochs = 15

binary_model = FeedforwardNN(input_size, hidden_size1, hidden_size2, num_classes_binary).to(device)
ternary_model = FeedforwardNN(input_size, hidden_size1, hidden_size2, num_classes_ternary).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_binary = torch.optim.Adam(binary_model.parameters(), lr=0.001)
optimizer_ternary = torch.optim.Adam(ternary_model.parameters(), lr=0.001)

## Model Evaluation

In [17]:
from sklearn.metrics import accuracy_score
# Evaluate the model
def evaluate_model(model, dataloader):
    model.eval()
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for inputs, targets in dataloader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            all_predictions.extend(predicted.tolist())
            all_targets.extend(targets.tolist())

    accuracy = accuracy_score(all_targets, all_predictions)

    return accuracy

## Model Training(Binary Classification, Pretrained Model)

In [41]:
for epoch in range(num_epochs):
    binary_model.train()
    for i, (inputs, labels) in enumerate(train_loader_binary_pretrained):
        optimizer_binary.zero_grad()
        outputs = binary_model(inputs)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer_binary.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader_binary_pretrained)}], Loss: {loss.item()}')

print('Training completed')

Epoch [1/15], Step [100/8000], Loss: 0.6432567238807678
Epoch [1/15], Step [200/8000], Loss: 0.6112987399101257
Epoch [1/15], Step [300/8000], Loss: 0.25631240010261536
Epoch [1/15], Step [400/8000], Loss: 0.3583989441394806
Epoch [1/15], Step [500/8000], Loss: 0.49193263053894043
Epoch [1/15], Step [600/8000], Loss: 0.5155205726623535
Epoch [1/15], Step [700/8000], Loss: 0.46081867814064026
Epoch [1/15], Step [800/8000], Loss: 0.4840518832206726
Epoch [1/15], Step [900/8000], Loss: 0.47162824869155884
Epoch [1/15], Step [1000/8000], Loss: 0.3505706489086151
Epoch [1/15], Step [1100/8000], Loss: 0.4519318640232086
Epoch [1/15], Step [1200/8000], Loss: 0.43570995330810547
Epoch [1/15], Step [1300/8000], Loss: 0.5112004280090332
Epoch [1/15], Step [1400/8000], Loss: 0.30337947607040405
Epoch [1/15], Step [1500/8000], Loss: 0.4128382205963135
Epoch [1/15], Step [1600/8000], Loss: 0.7169955372810364
Epoch [1/15], Step [1700/8000], Loss: 0.4123419225215912
Epoch [1/15], Step [1800/8000], Lo

In [42]:
test_accuracy_binary_pretrained_ffnn = evaluate_model(binary_model, test_loader_binary_pretrained)

Test Accuracy: 0.84405


## Model Training(Binary Classification, Custom Model)

In [43]:
for epoch in range(num_epochs):
    binary_model.train()
    for i, (inputs, labels) in enumerate(train_loader_binary_custom):
        optimizer_binary.zero_grad()
        outputs = binary_model(inputs)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer_binary.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader_binary_custom)}], Loss: {loss.item()}')

print('Training completed')

Epoch [1/15], Step [100/8000], Loss: 0.5765557289123535
Epoch [1/15], Step [200/8000], Loss: 0.5004540681838989
Epoch [1/15], Step [300/8000], Loss: 0.5956262350082397
Epoch [1/15], Step [400/8000], Loss: 0.5366430282592773
Epoch [1/15], Step [500/8000], Loss: 0.8135774731636047
Epoch [1/15], Step [600/8000], Loss: 0.5899533033370972
Epoch [1/15], Step [700/8000], Loss: 0.5940340161323547
Epoch [1/15], Step [800/8000], Loss: 0.5661557912826538
Epoch [1/15], Step [900/8000], Loss: 0.7082161903381348
Epoch [1/15], Step [1000/8000], Loss: 0.4025842547416687
Epoch [1/15], Step [1100/8000], Loss: 0.4135395586490631
Epoch [1/15], Step [1200/8000], Loss: 0.31549912691116333
Epoch [1/15], Step [1300/8000], Loss: 0.3695473074913025
Epoch [1/15], Step [1400/8000], Loss: 0.6372199654579163
Epoch [1/15], Step [1500/8000], Loss: 0.4999869763851166
Epoch [1/15], Step [1600/8000], Loss: 0.42149266600608826
Epoch [1/15], Step [1700/8000], Loss: 0.6412789821624756
Epoch [1/15], Step [1800/8000], Loss: 

In [44]:
test_accuracy_binary_custom_ffnn = evaluate_model(binary_model, test_loader_binary_custom)

Test Accuracy: 0.862275


## Model Training (Ternary Classification, Pretrained Model)

In [45]:
for epoch in range(num_epochs):
    ternary_model.train()
    for i, (inputs, labels) in enumerate(train_loader_ternary_pretrained):
        optimizer_ternary.zero_grad()
        outputs = ternary_model(inputs)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer_ternary.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader_ternary_pretrained)}], Loss: {loss.item()}')

print('Training completed')

Epoch [1/15], Step [100/10000], Loss: 1.0254532098770142
Epoch [1/15], Step [200/10000], Loss: 0.9389557838439941
Epoch [1/15], Step [300/10000], Loss: 1.052091360092163
Epoch [1/15], Step [400/10000], Loss: 0.7765858769416809
Epoch [1/15], Step [500/10000], Loss: 0.8837081789970398
Epoch [1/15], Step [600/10000], Loss: 0.7476924061775208
Epoch [1/15], Step [700/10000], Loss: 0.978754997253418
Epoch [1/15], Step [800/10000], Loss: 0.8651047945022583
Epoch [1/15], Step [900/10000], Loss: 0.7640393376350403
Epoch [1/15], Step [1000/10000], Loss: 0.7971606254577637
Epoch [1/15], Step [1100/10000], Loss: 0.8330097198486328
Epoch [1/15], Step [1200/10000], Loss: 0.7906981706619263
Epoch [1/15], Step [1300/10000], Loss: 0.8622697591781616
Epoch [1/15], Step [1400/10000], Loss: 0.9616483449935913
Epoch [1/15], Step [1500/10000], Loss: 0.8832290768623352
Epoch [1/15], Step [1600/10000], Loss: 0.5980502963066101
Epoch [1/15], Step [1700/10000], Loss: 0.8858243227005005
Epoch [1/15], Step [1800/

In [46]:
test_accuracy_ternary_pretrained_ffnn = evaluate_model(ternary_model, test_loader_ternary_pretrained)

Test Accuracy: 0.6823


## Model Training (Ternary Classification, Custom Model)

In [47]:
for epoch in range(num_epochs):
    ternary_model.train()
    for i, (inputs, labels) in enumerate(train_loader_ternary_custom):
        optimizer_ternary.zero_grad()
        outputs = ternary_model(inputs)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer_ternary.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader_ternary_custom)}], Loss: {loss.item()}')

print('Training completed')

Epoch [1/15], Step [100/10000], Loss: 0.8991641998291016
Epoch [1/15], Step [200/10000], Loss: 1.4089937210083008
Epoch [1/15], Step [300/10000], Loss: 1.100433349609375
Epoch [1/15], Step [400/10000], Loss: 0.8831998109817505
Epoch [1/15], Step [500/10000], Loss: 1.030739426612854
Epoch [1/15], Step [600/10000], Loss: 0.9231195449829102
Epoch [1/15], Step [700/10000], Loss: 1.23597252368927
Epoch [1/15], Step [800/10000], Loss: 0.9293931126594543
Epoch [1/15], Step [900/10000], Loss: 0.9441748857498169
Epoch [1/15], Step [1000/10000], Loss: 0.8672305345535278
Epoch [1/15], Step [1100/10000], Loss: 0.9571253061294556
Epoch [1/15], Step [1200/10000], Loss: 0.901347279548645
Epoch [1/15], Step [1300/10000], Loss: 0.94573575258255
Epoch [1/15], Step [1400/10000], Loss: 0.780764639377594
Epoch [1/15], Step [1500/10000], Loss: 0.8283461332321167
Epoch [1/15], Step [1600/10000], Loss: 1.1363191604614258
Epoch [1/15], Step [1700/10000], Loss: 0.8706123232841492
Epoch [1/15], Step [1800/10000]

In [48]:
test_accuracy_ternary_custom_ffnn = evaluate_model(ternary_model, test_loader_ternary_custom)

Test Accuracy: 0.69816


## Report accuracy values on the testing split for your FFNN model - Average Word2Vec Vectors

In [49]:

print(f"{'Feature Type':<30} {'Model':<15} {'Accuracy':<10}")
print("-" * 55)

print(f"{'word2vec-google-news-300':<30} {'FFNN(Binary)':<15} {test_accuracy_binary_pretrained_ffnn:<10}")
print(f"{'word2vec-google-news-300':<30} {'FFNN(Ternary)':<15} {test_accuracy_ternary_pretrained_ffnn:<10}")

print(f"{'Custom Word2Vec':<30} {'FFNN(Binary)':<15} {test_accuracy_binary_custom_ffnn:<10}")
print(f"{'Custom Word2Vec':<30} {'FFNN(Ternary)':<15} {test_accuracy_ternary_custom_ffnn:<10}")

Feature Type                   Model           Accuracy  
-------------------------------------------------------
word2vec-google-news-300       FFNN(Binary)    0.84405   
word2vec-google-news-300       FFNN(Ternary)   0.6823    
Custom Word2Vec                FFNN(Binary)    0.862275  
Custom Word2Vec                FFNN(Ternary)   0.69816   


## b. Concatenated Features

In [51]:
def concat_wv_on_gpu(reviews, model):
    output = []

    for _, value in enumerate(reviews):
        split_reviews = value.split(' ')

        # Extract the first 50 Word2Vec vectors for each review
        vectors = []
        for w in split_reviews:
            if w in model and len(vectors) < 10:
                vectors.append(torch.Tensor(model[w]).to(device))  # Convert to PyTorch tensor and move to GPU

        # If there are fewer than 50 vectors, pad with zeros on the GPU
        while len(vectors) < 10:
            vectors.append(torch.zeros(300).to(device))  # Create PyTorch tensor and move to GPU

        # Concatenate the vectors on the GPU
        body = torch.cat(vectors).to(device)

        output.append(body)

    return torch.stack(output)

word_embeddings_concat_binary_pretrained = concat_wv_on_gpu(binary_df['Reviews'], pretrained_model)
word_embeddings_concat_binary_custom = concat_wv_on_gpu(binary_df['Reviews'], custom_model)
word_embeddings_concat_ternary_pretrained = concat_wv_on_gpu(balanced_df['Reviews'], pretrained_model)
word_embeddings_concat_ternary_custom = concat_wv_on_gpu(balanced_df['Reviews'], custom_model)

  vectors.append(torch.Tensor(model[w]).to(device))  # Convert to PyTorch tensor and move to GPU


## Data Loaders

In [52]:
train_x_concat_binary_pretrained, test_x_concat_binary_pretrained, train_y_concat_binary_pretrained, test_y_concat_binary_pretrained = train_test_split(word_embeddings_concat_binary_pretrained, binary_df['sentiment'], test_size=0.2)

# Create DataLoader for training data
train_dataset_concat_binary_pretrained = TensorDataset(train_x_concat_binary_pretrained.to(device), torch.FloatTensor(train_y_concat_binary_pretrained.values).to(device))
train_loader_concat_binary_pretrained = DataLoader(train_dataset_concat_binary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_concat_binary_pretrained = TensorDataset(test_x_concat_binary_pretrained.to(device), torch.FloatTensor(test_y_concat_binary_pretrained.values).to(device))
test_loader_concat_binary_pretrained = DataLoader(test_dataset_concat_binary_pretrained, batch_size=20, shuffle=False)

In [53]:
train_x_concat_binary_custom, test_x_concat_binary_custom, train_y_concat_binary_custom, test_y_concat_binary_custom = train_test_split(word_embeddings_concat_binary_custom, binary_df['sentiment'], test_size=0.2)

# Create DataLoader for training data
train_dataset_concat_binary_custom = TensorDataset(train_x_concat_binary_custom.to(device), torch.FloatTensor(train_y_concat_binary_custom.values).to(device))
train_loader_concat_binary_custom = DataLoader(train_dataset_concat_binary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_concat_binary_custom = TensorDataset(test_x_concat_binary_custom.to(device), torch.FloatTensor(test_y_concat_binary_custom.values).to(device))
test_loader_concat_binary_custom = DataLoader(test_dataset_concat_binary_custom, batch_size=20, shuffle=False)

In [54]:
train_x_concat_ternary_pretrained, test_x_concat_ternary_pretrained, train_y_concat_ternary_pretrained, test_y_concat_ternary_pretrained = train_test_split(word_embeddings_concat_ternary_pretrained, balanced_df['sentiment'], test_size=0.2)

# Create DataLoader for training data
train_dataset_concat_ternary_pretrained = TensorDataset(train_x_concat_ternary_pretrained.to(device), torch.FloatTensor(train_y_concat_ternary_pretrained.values).to(device))
train_loader_concat_ternary_pretrained = DataLoader(train_dataset_concat_ternary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_concat_ternary_pretrained = TensorDataset(test_x_concat_ternary_pretrained.to(device), torch.FloatTensor(test_y_concat_ternary_pretrained.values).to(device))
test_loader_concat_ternary_pretrained = DataLoader(test_dataset_concat_ternary_pretrained, batch_size=20, shuffle=False)

In [55]:
train_x_concat_ternary_custom, test_x_concat_ternary_custom, train_y_concat_ternary_custom, test_y_concat_ternary_custom = train_test_split(word_embeddings_concat_ternary_custom, balanced_df['sentiment'], test_size=0.2)

# Create DataLoader for training data
train_dataset_concat_ternary_custom = TensorDataset(train_x_concat_ternary_custom.to(device), torch.FloatTensor(train_y_concat_ternary_custom.values).to(device))
train_loader_concat_ternary_custom = DataLoader(train_dataset_concat_ternary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_concat_ternary_custom = TensorDataset(test_x_concat_ternary_custom.to(device), torch.FloatTensor(test_y_concat_ternary_custom.values).to(device))
test_loader_concat_ternary_custom = DataLoader(test_dataset_concat_ternary_custom, batch_size=20, shuffle=False)

## Model Parameters

In [18]:
# Define the model and move it to the GPU
input_size_concat = 3000  # 10 vectors of 300 dimensions each
#Rest parameters same as before
hidden_size1 = 50
hidden_size2 = 10
num_classes_binary = 2
num_classes_ternary = 3
num_epochs = 15

binary_model_concat = FeedforwardNN(input_size_concat, hidden_size1, hidden_size2, num_classes_binary).to(device)
ternary_model_concat = FeedforwardNN(input_size_concat, hidden_size1, hidden_size2, num_classes_ternary).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_binary_concat = torch.optim.Adam(binary_model_concat.parameters(), lr=0.001)
optimizer_ternary_concat = torch.optim.Adam(ternary_model_concat.parameters(), lr=0.001)

## Model Training Function

In [19]:
def ModelTraining(model, train_loader_concat, optimizer):
    for epoch in range(num_epochs):
        model.train()
        for i, (inputs, labels) in enumerate(train_loader_concat):
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels.long())
            loss.backward()
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader_concat)}], Loss: {loss.item()}')

    print('Training completed')

### Model Training (Binary Classfication, Pretrained Model)

In [58]:
ModelTraining(binary_model_concat, train_loader_concat_binary_pretrained,optimizer_binary_concat)

Epoch [1/15], Step [100/8000], Loss: 0.5310152769088745
Epoch [1/15], Step [200/8000], Loss: 0.3379923701286316
Epoch [1/15], Step [300/8000], Loss: 0.6011854410171509
Epoch [1/15], Step [400/8000], Loss: 0.514578640460968
Epoch [1/15], Step [500/8000], Loss: 0.4481264650821686
Epoch [1/15], Step [600/8000], Loss: 0.4343417286872864
Epoch [1/15], Step [700/8000], Loss: 0.5141326189041138
Epoch [1/15], Step [800/8000], Loss: 0.3922936022281647
Epoch [1/15], Step [900/8000], Loss: 0.8595117330551147
Epoch [1/15], Step [1000/8000], Loss: 0.511859118938446
Epoch [1/15], Step [1100/8000], Loss: 0.612524688243866
Epoch [1/15], Step [1200/8000], Loss: 0.4973919987678528
Epoch [1/15], Step [1300/8000], Loss: 0.5008834600448608
Epoch [1/15], Step [1400/8000], Loss: 0.7315750122070312
Epoch [1/15], Step [1500/8000], Loss: 0.5574985146522522
Epoch [1/15], Step [1600/8000], Loss: 0.4282597005367279
Epoch [1/15], Step [1700/8000], Loss: 0.40113672614097595
Epoch [1/15], Step [1800/8000], Loss: 0.40

### Model Evaluation

In [59]:
test_accuracy_binary_pretrained_ffnn_concat = evaluate_model(binary_model_concat, test_loader_concat_binary_pretrained)

### Model Training (Binary Classfication, Custom Model)

In [60]:
ModelTraining(binary_model_concat, train_loader_concat_binary_custom,optimizer_binary_concat)

Epoch [1/15], Step [100/8000], Loss: 0.5647609233856201
Epoch [1/15], Step [200/8000], Loss: 0.7474066019058228
Epoch [1/15], Step [300/8000], Loss: 0.6968544721603394
Epoch [1/15], Step [400/8000], Loss: 0.7281111478805542
Epoch [1/15], Step [500/8000], Loss: 0.6354875564575195
Epoch [1/15], Step [600/8000], Loss: 0.6638445854187012
Epoch [1/15], Step [700/8000], Loss: 0.6027003526687622
Epoch [1/15], Step [800/8000], Loss: 0.6572430729866028
Epoch [1/15], Step [900/8000], Loss: 0.6894238591194153
Epoch [1/15], Step [1000/8000], Loss: 0.5558643341064453
Epoch [1/15], Step [1100/8000], Loss: 0.6823779344558716
Epoch [1/15], Step [1200/8000], Loss: 0.6083866357803345
Epoch [1/15], Step [1300/8000], Loss: 0.6397055387496948
Epoch [1/15], Step [1400/8000], Loss: 0.5380744934082031
Epoch [1/15], Step [1500/8000], Loss: 0.760850727558136
Epoch [1/15], Step [1600/8000], Loss: 0.6922472715377808
Epoch [1/15], Step [1700/8000], Loss: 0.6962167024612427
Epoch [1/15], Step [1800/8000], Loss: 0.6

### Model Evaluation

In [61]:
test_accuracy_binary_custom_ffnn_concat = evaluate_model(binary_model_concat, test_loader_concat_binary_custom)

### Model Training (Ternary Classfication, Pretrained Model)

In [62]:
ModelTraining(ternary_model_concat, train_loader_concat_ternary_pretrained,optimizer_ternary_concat)

Epoch [1/15], Step [100/10000], Loss: 0.9816405177116394
Epoch [1/15], Step [200/10000], Loss: 0.9724072217941284
Epoch [1/15], Step [300/10000], Loss: 0.823921799659729
Epoch [1/15], Step [400/10000], Loss: 0.7482838034629822
Epoch [1/15], Step [500/10000], Loss: 0.8738129734992981
Epoch [1/15], Step [600/10000], Loss: 1.0990312099456787
Epoch [1/15], Step [700/10000], Loss: 1.004252552986145
Epoch [1/15], Step [800/10000], Loss: 0.7811558842658997
Epoch [1/15], Step [900/10000], Loss: 0.6782220005989075
Epoch [1/15], Step [1000/10000], Loss: 0.9907082319259644
Epoch [1/15], Step [1100/10000], Loss: 0.9442189335823059
Epoch [1/15], Step [1200/10000], Loss: 0.9911880493164062
Epoch [1/15], Step [1300/10000], Loss: 0.9492805600166321
Epoch [1/15], Step [1400/10000], Loss: 1.0808485746383667
Epoch [1/15], Step [1500/10000], Loss: 0.8173163533210754
Epoch [1/15], Step [1600/10000], Loss: 1.0785974264144897
Epoch [1/15], Step [1700/10000], Loss: 0.6862167119979858
Epoch [1/15], Step [1800/

### Model Evaluation

In [63]:
test_accuracy_ternary_pretrained_ffnn_concat = evaluate_model(ternary_model_concat, test_loader_concat_ternary_pretrained)

### Model Training (Ternary Classfication, Custom Model)

In [64]:
ModelTraining(ternary_model_concat, train_loader_concat_ternary_custom,optimizer_ternary_concat)

Epoch [1/15], Step [100/10000], Loss: 1.197290301322937
Epoch [1/15], Step [200/10000], Loss: 1.0831077098846436
Epoch [1/15], Step [300/10000], Loss: 1.0743510723114014
Epoch [1/15], Step [400/10000], Loss: 1.1810739040374756
Epoch [1/15], Step [500/10000], Loss: 1.095404863357544
Epoch [1/15], Step [600/10000], Loss: 1.1076825857162476
Epoch [1/15], Step [700/10000], Loss: 0.9672955274581909
Epoch [1/15], Step [800/10000], Loss: 1.0572640895843506
Epoch [1/15], Step [900/10000], Loss: 1.0674846172332764
Epoch [1/15], Step [1000/10000], Loss: 0.996285080909729
Epoch [1/15], Step [1100/10000], Loss: 0.9687151908874512
Epoch [1/15], Step [1200/10000], Loss: 1.0741480588912964
Epoch [1/15], Step [1300/10000], Loss: 1.1208832263946533
Epoch [1/15], Step [1400/10000], Loss: 0.9807030558586121
Epoch [1/15], Step [1500/10000], Loss: 0.9803509712219238
Epoch [1/15], Step [1600/10000], Loss: 0.9983175992965698
Epoch [1/15], Step [1700/10000], Loss: 1.0773638486862183
Epoch [1/15], Step [1800/1

### Model Evaluation

In [65]:
test_accuracy_ternary_custom_ffnn_concat = evaluate_model(ternary_model_concat, test_loader_concat_ternary_custom)

## Report accuracy values on the testing split for your FFNN model - Concatenated Vectors

In [66]:
print(f"{'Feature Type':<30} {'Model':<15} {'Accuracy':<10}")
print("-" * 55)

print(f"{'word2vec-google-news-300':<30} {'FFNN[CONCAT](Binary)':<15} {test_accuracy_binary_pretrained_ffnn_concat:<10}")
print(f"{'word2vec-google-news-300':<30} {'FFNN[CONCAT](Ternary)':<15} {test_accuracy_ternary_pretrained_ffnn_concat:<10}")

print(f"{'Custom Word2Vec':<30} {'FFNN[CONCAT](Binary)':<15} {test_accuracy_binary_custom_ffnn_concat:<10}")
print(f"{'Custom Word2Vec':<30} {'FFNN[CONCAT](Ternary)':<15} {test_accuracy_ternary_custom_ffnn_concat:<10}")

Feature Type                   Model           Accuracy  
-------------------------------------------------------
word2vec-google-news-300       FFNN[CONCAT](Binary) 0.76595   
word2vec-google-news-300       FFNN[CONCAT](Ternary) 0.61184   
Custom Word2Vec                FFNN[CONCAT](Binary) 0.78375   
Custom Word2Vec                FFNN[CONCAT](Ternary) 0.63268   


## What do you conclude by comparing accuracy values you obtain with those obtained in the “Simple Models” section (note you can compare the accuracy values for binary classification).

In comparing the accuracy values:

1. **FFNNs (Feedforward Neural Networks) outperform simple models (Perceptron and SVM)**, especially when using average Word2Vec vectors. This highlights the benefits of more complex models in capturing data patterns.

2. **Custom Word2Vec features consistently yield better results than pretrained ones** across all model types, emphasizing the effectiveness of domain-specific embeddings.

3. **FFNNs with average vectors perform better than with concatenated vectors**, suggesting that average vectors might be a more effective way to represent text data for this task.

Overall, the results underline the advantages of neural network architectures and carefully chosen feature representations in sentiment analysis.

## Task 5. Simple CNN

In [20]:
class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv1d(300, 50, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(50, 10, kernel_size=3, padding=1)
        self.fc = nn.Linear(10 * 50, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = x.squeeze(2)
        x = x.permute(0, 2, 1)  # CNN expects (batch_size, channels, sequence_length)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)  # Flatten
        x = self.dropout(x)
        x = self.fc(x)
        return x

## Limit the maximum review length to 50 by truncating longer reviews and padding shorter reviews with a null value (0)

In [21]:
def truncate(reviews,model, max_words = 50):
    embedded_words = []
    
    for review in reviews: 
        word_sequence = []
        word_list = review.split(" ")
        
        if len(word_list) == 0:
            embedded_words.append(np.zeros((max_words, 300)))
            continue
            
        for word in word_list[:max_words]:
            if word in model:
                word_vector = np.reshape(model[word], (1, 300))
                word_sequence.append(word_vector)
            else:
                word_sequence.append(np.zeros((1, 300))) 
                continue
                
        if len(word_sequence) < max_words:
            for i in range(max_words - len(word_sequence)):
                word_sequence.append(np.zeros((1, 300))) 
                
        embedded_words.append(word_sequence)
    
    word_vector_data = np.array(embedded_words)
    return word_vector_data

## Splitting Dataset into Training and Testing sets for binary classification

In [24]:
binary_train_x, binary_test_x, binary_train_y, binary_test_y = train_test_split(binary_df['Reviews'],binary_df['sentiment'], test_size=0.2)

## Splitting Dataset into Training and Testing sets for ternary classification

In [22]:
ternary_train_x, ternary_test_x, ternary_train_y, ternary_test_y = train_test_split(balanced_df['Reviews'],balanced_df['sentiment'], test_size=0.2)

## Truncating Reviews For Pretrained Model(Binary Classification)

In [36]:
trunc_Xtrain_binary_df_pretrained = truncate(binary_train_x,pretrained_model)
print(trunc_Xtrain_binary_df_pretrained.shape)

(160000, 50, 1, 300)


In [23]:
trunc_Xtest_binary_df_pretrained = truncate(binary_test_x,pretrained_model)

## Truncating Reviews For Custom Model(Binary Classification)

In [25]:
trunc_Xtrain_binary_df_custom = truncate(binary_train_x,custom_model)

In [26]:
trunc_Xtest_binary_df_custom = truncate(binary_test_x,custom_model)

## Truncating Reviews For Pretrained Model(Ternary Classification)

In [18]:
trunc_Xtrain_ternary_df_pretrained = truncate(ternary_train_x,pretrained_model)

In [19]:
trunc_Xtest_ternary_df_pretrained = truncate(ternary_test_x,pretrained_model)

## Truncating Reviews For Custom Model(Ternary Classification)

In [23]:
trunc_Xtrain_ternary_df_custom = truncate(ternary_train_x,custom_model)

In [24]:
trunc_Xtest_ternary_df_custom = truncate(ternary_test_x,custom_model)

## Data Loaders

In [24]:
# Create DataLoader for training data
train_dataset_cnn_binary_pretrained = TensorDataset(torch.FloatTensor(trunc_Xtrain_binary_df_pretrained).to(device), torch.FloatTensor(binary_train_y.values).to(device))
train_loader_cnn_binary_pretrained = DataLoader(train_dataset_cnn_binary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_cnn_binary_pretrained = TensorDataset(torch.FloatTensor(trunc_Xtest_binary_df_pretrained).to(device), torch.FloatTensor(binary_test_y.values).to(device))
test_loader_cnn_binary_pretrained = DataLoader(test_dataset_cnn_binary_pretrained, batch_size=20, shuffle=False)

In [27]:
# Create DataLoader for training data
train_dataset_cnn_binary_custom = TensorDataset(torch.FloatTensor(trunc_Xtrain_binary_df_custom).to(device), torch.FloatTensor(binary_train_y.values).to(device))
train_loader_cnn_binary_custom = DataLoader(train_dataset_cnn_binary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_cnn_binary_custom = TensorDataset(torch.FloatTensor(trunc_Xtest_binary_df_custom).to(device), torch.FloatTensor(binary_test_y.values).to(device))
test_loader_cnn_binary_custom = DataLoader(test_dataset_cnn_binary_custom, batch_size=20, shuffle=False)

In [21]:
# Create DataLoader for training data
train_dataset_cnn_ternary_pretrained = TensorDataset(torch.FloatTensor(trunc_Xtrain_ternary_df_pretrained).to(device), torch.FloatTensor(ternary_train_y.values).to(device))
train_loader_cnn_ternary_pretrained = DataLoader(train_dataset_cnn_ternary_pretrained, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_cnn_ternary_pretrained = TensorDataset(torch.FloatTensor(trunc_Xtest_ternary_df_pretrained).to(device), torch.FloatTensor(ternary_test_y.values).to(device))
test_loader_cnn_ternary_pretrained = DataLoader(test_dataset_cnn_ternary_pretrained, batch_size=20, shuffle=False)

In [25]:
# Create DataLoader for training data
train_dataset_cnn_ternary_custom = TensorDataset(torch.FloatTensor(trunc_Xtrain_ternary_df_custom).to(device), torch.FloatTensor(ternary_train_y.values).to(device))
train_loader_cnn_ternary_custom = DataLoader(train_dataset_cnn_ternary_custom, batch_size=20, shuffle=True)

# Create DataLoader for test data
test_dataset_cnn_ternary_custom = TensorDataset(torch.FloatTensor(trunc_Xtest_ternary_df_custom).to(device), torch.FloatTensor(ternary_test_y.values).to(device))
test_loader_cnn_ternary_custom = DataLoader(test_dataset_cnn_ternary_custom, batch_size=20, shuffle=False)

## Model Parameters

In [26]:
# Define the model and move it to the GPU
input_size = 300 
hidden_size1 = 50
hidden_size2 = 10
num_classes_binary = 2
num_classes_ternary = 3
num_epochs = 15

binary_model_cnn = SimpleCNN(num_classes_binary).to(device)
ternary_model_cnn = SimpleCNN(num_classes_ternary).to(device)


# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_binary_cnn = torch.optim.Adam(binary_model_cnn.parameters(), lr=0.001)
optimizer_ternary_cnn = torch.optim.Adam(ternary_model_cnn.parameters(), lr=0.001)

## Model Training(Binary Classification, Pretrained Model)

In [40]:
ModelTraining(binary_model_cnn, train_loader_cnn_binary_pretrained, optimizer_binary_cnn)

Epoch [1/15], Step [100/8000], Loss: 0.6668332815170288
Epoch [1/15], Step [200/8000], Loss: 0.6218133568763733
Epoch [1/15], Step [300/8000], Loss: 0.4165213704109192
Epoch [1/15], Step [400/8000], Loss: 0.4211300015449524
Epoch [1/15], Step [500/8000], Loss: 0.5401797890663147
Epoch [1/15], Step [600/8000], Loss: 0.6094974875450134
Epoch [1/15], Step [700/8000], Loss: 0.45932936668395996
Epoch [1/15], Step [800/8000], Loss: 0.2594350576400757
Epoch [1/15], Step [900/8000], Loss: 0.35291674733161926
Epoch [1/15], Step [1000/8000], Loss: 0.5367692112922668
Epoch [1/15], Step [1100/8000], Loss: 0.27801671624183655
Epoch [1/15], Step [1200/8000], Loss: 0.38126176595687866
Epoch [1/15], Step [1300/8000], Loss: 0.4004274904727936
Epoch [1/15], Step [1400/8000], Loss: 0.5896357297897339
Epoch [1/15], Step [1500/8000], Loss: 0.25932663679122925
Epoch [1/15], Step [1600/8000], Loss: 0.5693489909172058
Epoch [1/15], Step [1700/8000], Loss: 0.5912906527519226
Epoch [1/15], Step [1800/8000], Los

NameError: name 'accuracy_score' is not defined

### Model Evaluation

In [42]:
test_accuracy_binary_pretrained_cnn = evaluate_model(binary_model_cnn, test_loader_cnn_binary_pretrained)
print(test_accuracy_binary_pretrained_cnn)

0.8545


## Model Training and Evaluation(Binary Classification, Custom Model)

In [29]:
ModelTraining(binary_model_cnn, train_loader_cnn_binary_custom, optimizer_binary_cnn)
test_accuracy_binary_custom_cnn = evaluate_model(binary_model_cnn, test_loader_cnn_binary_custom)

Epoch [1/15], Step [100/8000], Loss: 0.5740548372268677
Epoch [1/15], Step [200/8000], Loss: 0.3969813883304596
Epoch [1/15], Step [300/8000], Loss: 0.38477978110313416
Epoch [1/15], Step [400/8000], Loss: 0.520084023475647
Epoch [1/15], Step [500/8000], Loss: 0.341152161359787
Epoch [1/15], Step [600/8000], Loss: 0.687828540802002
Epoch [1/15], Step [700/8000], Loss: 0.36616289615631104
Epoch [1/15], Step [800/8000], Loss: 0.37923869490623474
Epoch [1/15], Step [900/8000], Loss: 0.3100588917732239
Epoch [1/15], Step [1000/8000], Loss: 0.4098321795463562
Epoch [1/15], Step [1100/8000], Loss: 0.5639193654060364
Epoch [1/15], Step [1200/8000], Loss: 0.49709588289260864
Epoch [1/15], Step [1300/8000], Loss: 0.48051220178604126
Epoch [1/15], Step [1400/8000], Loss: 0.37268370389938354
Epoch [1/15], Step [1500/8000], Loss: 0.48185959458351135
Epoch [1/15], Step [1600/8000], Loss: 0.6059451699256897
Epoch [1/15], Step [1700/8000], Loss: 0.16030241549015045
Epoch [1/15], Step [1800/8000], Los

In [30]:
print(test_accuracy_binary_custom_cnn)

0.8641


## Model Training(Ternary Classification, Pretrained Model)

In [30]:
ModelTraining(ternary_model_cnn,train_loader_cnn_ternary_pretrained,optimizer_ternary_cnn)

Epoch [1/15], Step [100/10000], Loss: 1.0477240085601807
Epoch [1/15], Step [200/10000], Loss: 1.0081113576889038
Epoch [1/15], Step [300/10000], Loss: 0.8929632306098938
Epoch [1/15], Step [400/10000], Loss: 0.6069024205207825
Epoch [1/15], Step [500/10000], Loss: 0.7884050011634827
Epoch [1/15], Step [600/10000], Loss: 0.8115717768669128
Epoch [1/15], Step [700/10000], Loss: 0.9692789316177368
Epoch [1/15], Step [800/10000], Loss: 0.8788784742355347
Epoch [1/15], Step [900/10000], Loss: 0.5140672922134399
Epoch [1/15], Step [1000/10000], Loss: 0.7802572846412659
Epoch [1/15], Step [1100/10000], Loss: 0.5656523108482361
Epoch [1/15], Step [1200/10000], Loss: 0.93267822265625
Epoch [1/15], Step [1300/10000], Loss: 0.84071284532547
Epoch [1/15], Step [1400/10000], Loss: 0.8104743957519531
Epoch [1/15], Step [1500/10000], Loss: 0.8035953640937805
Epoch [1/15], Step [1600/10000], Loss: 0.8685919642448425
Epoch [1/15], Step [1700/10000], Loss: 0.9309894442558289
Epoch [1/15], Step [1800/10

### Model Evaluation

In [33]:
test_accuracy_ternary_pretrained_cnn = evaluate_model(ternary_model_cnn, test_loader_cnn_ternary_pretrained)
print(test_accuracy_ternary_pretrained_cnn)

0.72248


## Model Training(Ternary Classification, Custom Model)

In [27]:
ModelTraining(ternary_model_cnn,train_loader_cnn_ternary_custom, optimizer_ternary_cnn)

Epoch [1/15], Step [100/10000], Loss: 1.1535131931304932
Epoch [1/15], Step [200/10000], Loss: 0.61073899269104
Epoch [1/15], Step [300/10000], Loss: 0.7352244853973389
Epoch [1/15], Step [400/10000], Loss: 0.7459444403648376
Epoch [1/15], Step [500/10000], Loss: 1.1538769006729126
Epoch [1/15], Step [600/10000], Loss: 0.8325538635253906
Epoch [1/15], Step [700/10000], Loss: 0.7921149730682373
Epoch [1/15], Step [800/10000], Loss: 0.7434914708137512
Epoch [1/15], Step [900/10000], Loss: 0.7475288510322571
Epoch [1/15], Step [1000/10000], Loss: 0.5095646977424622
Epoch [1/15], Step [1100/10000], Loss: 0.761029839515686
Epoch [1/15], Step [1200/10000], Loss: 1.0166327953338623
Epoch [1/15], Step [1300/10000], Loss: 0.7760683298110962
Epoch [1/15], Step [1400/10000], Loss: 0.9198731184005737
Epoch [1/15], Step [1500/10000], Loss: 0.604096531867981
Epoch [1/15], Step [1600/10000], Loss: 1.038954257965088
Epoch [1/15], Step [1700/10000], Loss: 0.9102897644042969
Epoch [1/15], Step [1800/100

### Model Evaluation

In [28]:
test_accuracy_ternary_custom_cnn = evaluate_model(ternary_model_cnn, test_loader_cnn_ternary_custom)
print(test_accuracy_ternary_custom_cnn)

0.69642


## Report accuracy values on the testing split for your CNN model.


In [30]:
#Values from earlier iterations on cnn(since each section was run seperately to avoid memory failure)
test_accuracy_binary_pretrained_cnn = 0.8545
test_accuracy_binary_custom_cnn = 0.8641
test_accuracy_ternary_pretrained_cnn = 0.72248

print(f"{'Feature Type':<30} {'Model':<15} {'Accuracy':<10}")
print("-" * 55)

print(f"{'word2vec-google-news-300':<30} {'CNN(Binary)':<15} {test_accuracy_binary_pretrained_cnn:<10}")
print(f"{'word2vec-google-news-300':<30} {'CNN(Ternary)':<15} {test_accuracy_ternary_pretrained_cnn:<10}")

print(f"{'Custom Word2Vec':<30} {'CNN(Binary)':<15} {test_accuracy_binary_custom_cnn:<10}")
print(f"{'Custom Word2Vec':<30} {'CNN(Ternary)':<15} {test_accuracy_ternary_custom_cnn:<10}")

Feature Type                   Model           Accuracy  
-------------------------------------------------------
word2vec-google-news-300       CNN(Binary)     0.8545    
word2vec-google-news-300       CNN(Ternary)    0.72248   
Custom Word2Vec                CNN(Binary)     0.8641    
Custom Word2Vec                CNN(Ternary)    0.69642   


## Report accuracy values for: 2 (number of Word2Vec models) * (2 (Perceptron + SVM) + 2 (bi- nary/ternary settings) ( 2 (FNN) + 1 (CNN))) = 2 (2+ 2 (2 + 1)) = 16 cases.

In [31]:
#Values from earlier iterations(since each section was run seperately to avoid memory failure)

accuracy_perceptron_pretrained = 0.6518
accuracy_perceptron_custom = 0.814925  
accuracy_svm_pretrained = 0.6744  
accuracy_svm_custom = 0.842575 
accuracy_perceptron_tfidf = 0.846025
accuracy_svm_tfidf = 0.8864  
test_accuracy_binary_pretrained_ffnn = 0.84405    
test_accuracy_binary_custom_ffnn = 0.862275
test_accuracy_ternary_pretrained_ffnn = 0.6823
test_accuracy_ternary_custom_ffnn = 0.69816
test_accuracy_binary_pretrained_ffnn_concat = 0.76595
test_accuracy_binary_custom_ffnn_concat = 0.78375 
test_accuracy_ternary_pretrained_ffnn_concat = 0.61184 
test_accuracy_ternary_custom_ffnn_concat = 0.63268

print(f"{'Feature Type':<30} {'Model':<15} {'Accuracy':<10}")
print("-" * 55)

print(f"{'word2vec-google-news-300':<30} {'Perceptron':<15} {accuracy_perceptron_pretrained:<10}")
print(f"{'word2vec-google-news-300':<30} {'SVM':<15} {accuracy_svm_pretrained:<10}")

print(f"{'Custom Word2Vec':<30} {'Perceptron':<15} {accuracy_perceptron_custom:<10}")
print(f"{'Custom Word2Vec':<30} {'SVM':<15} {accuracy_svm_custom:<10}")

print(f"{'TF-IDF (from HW1)':<30} {'Perceptron':<15} {accuracy_perceptron_tfidf:<10}")
print(f"{'TF-IDF (from HW1)':<30} {'SVM':<15} {accuracy_svm_tfidf:<10}")

print(f"{'word2vec-google-news-300':<30} {'FFNN(Binary)':<15} {test_accuracy_binary_pretrained_ffnn:<10}")
print(f"{'word2vec-google-news-300':<30} {'FFNN(Ternary)':<15} {test_accuracy_ternary_pretrained_ffnn:<10}")

print(f"{'Custom Word2Vec':<30} {'FFNN(Binary)':<15} {test_accuracy_binary_custom_ffnn:<10}")
print(f"{'Custom Word2Vec':<30} {'FFNN(Ternary)':<15} {test_accuracy_ternary_custom_ffnn:<10}")

print(f"{'word2vec-google-news-300':<30} {'FFNN[CONCAT](Binary)':<15} {test_accuracy_binary_pretrained_ffnn_concat:<10}")
print(f"{'word2vec-google-news-300':<30} {'FFNN[CONCAT](Ternary)':<15} {test_accuracy_ternary_pretrained_ffnn_concat:<10}")

print(f"{'Custom Word2Vec':<30} {'FFNN[CONCAT](Binary)':<15} {test_accuracy_binary_custom_ffnn_concat:<10}")
print(f"{'Custom Word2Vec':<30} {'FFNN[CONCAT](Ternary)':<15} {test_accuracy_ternary_custom_ffnn_concat:<10}")

print(f"{'word2vec-google-news-300':<30} {'CNN(Binary)':<15} {test_accuracy_binary_pretrained_cnn:<10}")
print(f"{'word2vec-google-news-300':<30} {'CNN(Ternary)':<15} {test_accuracy_ternary_pretrained_cnn:<10}")

print(f"{'Custom Word2Vec':<30} {'CNN(Binary)':<15} {test_accuracy_binary_custom_cnn:<10}")
print(f"{'Custom Word2Vec':<30} {'CNN(Ternary)':<15} {test_accuracy_ternary_custom_cnn:<10}")

Feature Type                   Model           Accuracy  
-------------------------------------------------------
word2vec-google-news-300       Perceptron      0.6518    
word2vec-google-news-300       SVM             0.6744    
Custom Word2Vec                Perceptron      0.814925  
Custom Word2Vec                SVM             0.842575  
TF-IDF (from HW1)              Perceptron      0.846025  
TF-IDF (from HW1)              SVM             0.8864    
word2vec-google-news-300       FFNN(Binary)    0.84405   
word2vec-google-news-300       FFNN(Ternary)   0.6823    
Custom Word2Vec                FFNN(Binary)    0.862275  
Custom Word2Vec                FFNN(Ternary)   0.69816   
word2vec-google-news-300       FFNN[CONCAT](Binary) 0.76595   
word2vec-google-news-300       FFNN[CONCAT](Ternary) 0.61184   
Custom Word2Vec                FFNN[CONCAT](Binary) 0.78375   
Custom Word2Vec                FFNN[CONCAT](Ternary) 0.63268   
word2vec-google-news-300       CNN(Binary)     0.854