# EECE 571T Project - NLP with Emotion Dataset

Last updated:
* Date: February 17, 2022
* Time: 5:50pm

## References:

* Making our own word2vec model: https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
* https://medium.com/@adriensieg/text-similarities-da019229c894
* Text Classification tutorial: https://github.com/adsieg/Multi_Text_Classification
* From same author:
  * [**Feb.17**] https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794 (Try following these instructions next)
  * [**Feb.17**] https://towardsdatascience.com/text-analysis-feature-engineering-with-nlp-502d6ea9225d


## Get data from GitHub repo

In [12]:
!wget https://github.com/tkjsung/EECE571T_Dataset/archive/refs/heads/master.zip
!unzip /content/master.zip

--2022-02-18 00:45:12--  https://github.com/tkjsung/EECE571T_Dataset/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/tkjsung/EECE571T_Dataset/zip/refs/heads/master [following]
--2022-02-18 00:45:12--  https://codeload.github.com/tkjsung/EECE571T_Dataset/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip.1’

master.zip.1            [  <=>               ] 798.87K  3.23MB/s    in 0.2s    

2022-02-18 00:45:13 (3.23 MB/s) - ‘master.zip.1’ saved [818042]

Archive:  /content/master.zip
f84fef58c648047c03c671498e0375bf224f000e
replace EECE571T_Dataset-master/.gitignore? [y]

## Import Data

In [48]:
# Import libraries for data import
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [49]:
# Read CSV
data_train = pd.read_csv('/content/EECE571T_Dataset-master/Project/train.txt',sep=';', header=None)
data_test = pd.read_csv('/content/EECE571T_Dataset-master/Project/test.txt',sep=';', header=None)
data_val = pd.read_csv('/content/EECE571T_Dataset-master/Project/val.txt',sep=';', header=None)

col_names = ["sentence","emotion"]
data_train.columns = col_names
data_test.columns = col_names
data_val.columns = col_names

In [50]:
# See the data head to make sure data is imported correctly.
data_train.head()
# data_test.head()
# data_val.head()

| | sentence | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


## Encode the emotion labels with unique identifiers

In [51]:
from sklearn.preprocessing import LabelEncoder
# Encode the emotion labels with unique identifiers
data_train['emotion'].unique()
labelencoder = LabelEncoder()
data_train['label_enc'] = labelencoder.fit_transform(data_train['emotion'])
# Reuse the encoder fitted on the training set (transform, not fit_transform)
# so that all three splits share the same label mapping.
data_test['label_enc'] = labelencoder.transform(data_test['emotion'])
data_val['label_enc'] = labelencoder.transform(data_val['emotion'])
# Confirm the mapping is identical across splits with the display code below.

Display the encoded emotion labels

In [52]:
data_train[['emotion','label_enc']].drop_duplicates(keep='first')
# data_test[['emotion','label_enc']].drop_duplicates(keep='first')
# data_val[['emotion','label_enc']].drop_duplicates(keep='first')

| | emotion | label_enc |
|---|---|---|
| 0 | sadness | 4 |
| 2 | anger | 0 |
| 3 | love | 3 |
| 6 | surprise | 5 |
| 7 | fear | 1 |
| 8 | joy | 2 |
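
The mapping in the table above can also be checked without eyeballing three separate outputs. The following is a minimal sketch, assuming two small stand-in frames (`toy_train` and `toy_test` are hypothetical names and data, not the project files): the encoder is fitted once on the training labels, reused with `transform`, and its `classes_` attribute is turned into an explicit label-to-code dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data; the notebook itself uses data_train / data_test.
toy_train = pd.DataFrame(
    {"emotion": ["sadness", "anger", "love", "surprise", "fear", "joy"]}
)
toy_test = pd.DataFrame({"emotion": ["joy", "sadness", "fear"]})

encoder = LabelEncoder()
# Fit once, on the training labels only.
toy_train["label_enc"] = encoder.fit_transform(toy_train["emotion"])
# Reuse the fitted mapping; transform() raises ValueError on unseen labels.
toy_test["label_enc"] = encoder.transform(toy_test["emotion"])

# classes_ is sorted, so the position of a label is its integer code.
label_map = {label: code for code, label in enumerate(encoder.classes_)}
print(label_map)
```

Because `classes_` is sorted alphabetically, separate `fit_transform` calls only agree when every split contains all six emotions. Fitting once removes that dependency.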


Add a `length` column to each data set holding the number of characters in each sentence, including spaces and punctuation.

In [53]:
data_train['length'] = [len(x) for x in data_train['sentence']]
data_test['length'] = [len(x) for x in data_test['sentence']]
data_val['length'] = [len(x) for x in data_val['sentence']]
# data_train.head()
# data_test.head()
# data_val.head()

In [54]:
data_train.head()
# data_test.head()
# data_val.head()

| | sentence | emotion | label_enc | length |
|---|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 4 | 23 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 4 | 108 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 0 | 48 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 3 | 92 |
| 4 | i am feeling grouchy | anger | 0 | 20 |


Find the maximum sentence length in characters. For the training set it is 300; for the testing and validation sets it is 296 and 295, respectively.

In [55]:
max_len = data_train['length'].max()
print(max_len)

300
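
The per-split maxima quoted above can be computed in one pass. This is a small sketch, assuming a hypothetical dictionary `toy_splits` of stand-in frames rather than the real `data_train`, `data_test`, and `data_val`; it uses the pandas `.str.len()` accessor, which counts characters the same way as `len(x)`.

```python
import pandas as pd

# Hypothetical stand-in frames; substitute the real splits in the notebook.
toy_splits = {
    "train": pd.DataFrame(
        {"sentence": ["i am feeling grouchy", "i didnt feel humiliated"]}
    ),
    "test": pd.DataFrame({"sentence": ["im fine"]}),
}

# Longest sentence (in characters) for each split.
max_lengths = {
    name: int(frame["sentence"].str.len().max())
    for name, frame in toy_splits.items()
}
print(max_lengths)
```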


## Data Cleaning & Pre-Processing

We need to clean the data first; otherwise pre-processing would be a nightmare with a vocabulary of at least 15212 words...

<!-- Tokenize the words. This uses `keras.preprocessing` library. We get a tokenizer that fits onto our training set's sentences. Then a dictionary of words is created from the tokenizer. -->

*Feb.16: Instructions in this code block are commented out*

First, data cleaning.<br>
**Feb.17:** I think we should replace stemming with lemmatization, which looks to be better and would probably work better for word2vec.
Source: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

In [None]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to preprocess()
stopword_set = set(stopwords.words("english"))
lem = WordNetLemmatizer()

# Attempting data cleaning here
def preprocess(raw_text):
    # keep only letters; everything else becomes a space
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split on whitespace
    words = letters_only_text.lower().split()

    # remove stopwords
    meaningful_words = [w for w in words if w not in stopword_set]

    # Stemming (PorterStemmer) was tried first but made some words look weird,
    # so lemmatization is used instead.
    # Note: lemmatize() treats every word as a noun unless a pos argument is given.
    lemmed_words = [lem.lemmatize(word) for word in meaningful_words]

    # join the cleaned words into a single string
    cleaned_word_list = " ".join(lemmed_words)

    return cleaned_word_list

In [57]:
data_train['sentence'] = data_train['sentence'].apply(preprocess)
data_test['sentence'] = data_test['sentence'].apply(preprocess)
data_val['sentence'] = data_val['sentence'].apply(preprocess)

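Tokenize the text and convert each sentence to a sequence of integer word indices. (This is not TF-IDF: `texts_to_sequences` only maps words to indices. The Keras `Tokenizer` produces TF-IDF weights through `texts_to_matrix` with `mode='tfidf'`.)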

In [25]:
from keras.preprocessing import text
token = text.Tokenizer() # uses keras.preprocessing I believe

In [26]:
token.fit_on_texts(data_train['sentence'])
word_index = token.word_index

In [27]:
# Text to sequence
x_train_token = token.texts_to_sequences(data_train['sentence'])
x_test_token = token.texts_to_sequences(data_test['sentence'])
x_val_token = token.texts_to_sequences(data_val['sentence'])
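
To make explicit what `fit_on_texts` and `texts_to_sequences` produce, here is a simplified pure-Python sketch of the same idea. It is an illustration, not the Keras implementation: the helper names `fit_word_index` and `to_sequences` and the toy sentences are hypothetical, and the sketch assumes the text is already lower-cased and stripped of punctuation (the Keras `Tokenizer` does both by default).

```python
from collections import Counter

def fit_word_index(sentences):
    """Map each word to an integer rank, 1 = most frequent (0 is left for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(sentences, word_index):
    """Replace known words by their index; words not in the index are dropped."""
    return [[word_index[w] for w in s.split() if w in word_index] for s in sentences]

# Hypothetical cleaned sentences standing in for the real data.
toy_train = ["feel humiliated", "feel hopeless", "feel greedy wrong"]
toy_test = ["feel nostalgic wrong"]

toy_index = fit_word_index(toy_train)
print(toy_index)
print(to_sequences(toy_test, toy_index))
```

Note that a word seen only in the test or validation set ("nostalgic" here) silently disappears from its sequence, which is also what the Keras `Tokenizer` does when no `oov_token` is set.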

Pad the data sets to be of the same length

In [28]:
def checkLength(listArr):
  # Length of the longest token sequence (0 if the list is empty).
  # Avoids shadowing the built-in max() with a local variable.
  return max((len(seq) for seq in listArr), default=0)
print(checkLength(x_train_token))
print(checkLength(x_test_token))
print(checkLength(x_val_token))

35
30
29


Max length is 35. Pad all arrays to be of size 35.

In [29]:
# Need to add padding code here
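
One way the missing padding step could look is sketched below. This is an assumption about the intended behaviour, not the author's code: `pad_to_length` is a hypothetical NumPy helper that right-pads with zeros (and truncates anything longer), shown on toy sequences. In the notebook the library route would be Keras's `pad_sequences(x_train_token, maxlen=35, padding='post')`; note that `pad_sequences` pads on the left (`'pre'`) by default.

```python
import numpy as np

def pad_to_length(sequences, maxlen, value=0):
    """Right-pad (or truncate) each integer sequence to exactly maxlen entries."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Hypothetical token sequences; the notebook would pass x_train_token etc.
toy_sequences = [[1, 2], [1, 3, 4, 5], []]
toy_padded = pad_to_length(toy_sequences, maxlen=5)
print(toy_padded.shape)
```

Using the training maximum (35) for all three splits keeps the input shape fixed for any model trained on them. The empty sequence in the toy data stands for a sentence that consisted only of stopwords.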

### Using word2vec

I did pre-processing, word stemming, and so on above. The simplest way to avoid words not being found in the embedding vocabulary is to skip stemming altogether, or, as I just found out, to use lemmatization instead. It is more computationally expensive but should work better with word embedding techniques (I think).

~~**February 16:**~~ Find words in the Word2VecKeyedVector (using section 2.3 in source https://github.com/adsieg/Multi_Text_Classification/blob/master/%5BIntroduction%5D%20-%20Big%20tutorial%20-%20Text%20Classification.ipynb) by using `Word2VecKeyedVector.index2word`. This returns the model's vocabulary as a list of words. (In gensim 4.x the attribute is named `index_to_key`.)

In [30]:
# DO NOT RUN THIS BLOCK MORE THAN ONCE IN ONE SESSION
# Import gensim data
import gensim.downloader as api
import gensim
# Load pretrained word vectors (GloVe trained on Twitter, 25 dimensions; not word2vec)
# Gensim data obtained from https://github.com/RaRe-Technologies/gensim-data (official source)
model = api.load('glove-twitter-25')
# model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Check dimension of word vectors
# model.vector_size



In [31]:
# Testing: get the index of a word in the embedding model's vocabulary
# model.vocab["whatever"].index  # gensim 3.x; gensim 4.x uses model.key_to_index["whatever"]
# Next step: follow section 2.3 of the source above.
# (And write it in the section below)

In [32]:
# Build a word -> vector dictionary from the loaded model
# embeddings_index = {}
# for word in model.index2word:  # gensim 4.x: model.index_to_key
#   embeddings_index[word] = model[word]
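
A possible next step is to turn the tokenizer's `word_index` and the loaded vectors into an embedding matrix. The following is a sketch under assumptions: `toy_vectors` is a plain dictionary standing in for the gensim model, and `toy_word_index` stands in for `token.word_index`; both are hypothetical. The loaded `KeyedVectors` object supports the same `word in model` and `model[word]` operations used here.

```python
import numpy as np

# Hypothetical stand-ins: a 3-dimensional "model" and a tiny tokenizer index.
toy_vectors = {
    "feel": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "wrong": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"feel": 1, "grouchy": 2, "wrong": 3}
vector_size = 3

# Row 0 stays zero because index 0 is reserved for padding.
embedding_matrix = np.zeros((len(toy_word_index) + 1, vector_size), dtype="float32")
missing = []
for word, idx in toy_word_index.items():
    if word in toy_vectors:
        embedding_matrix[idx] = toy_vectors[word]
    else:
        # Out-of-vocabulary word: its row is left as all zeros.
        missing.append(word)

print(embedding_matrix.shape, missing)
```

Tracking `missing` gives a quick measure of how many cleaned words the pretrained vectors do not cover, which is the concern raised about stemming above.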

## IGNORE THINGS IN THIS SECTION.

**Ignore code blocks below this one please.**

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

# Filter out stopwords
stop_words = set(stopwords.words('english'))

# words = [word for word in words if not word in stop_words]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# from keras_preprocessing.text import Tokenizer
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(pd.concat(data_train, axis=0))

In [None]:
vocabSize = 15000

Padding will require the text to be already in numbers... so I can't run this yet.

In [None]:
# from nltk.stem.porter import PorterStemmer
# from tensorflow.keras.preprocessing.sequence import pad_sequences
# import re

# def text_cleaning(df, column):
#   stemmer = PorterStemmer()
#   corpus = []

#   for text in df[column]:
#     text = re.sub("[^a-zA-Z]", " ", text)
#     text = text.lower()
#     text = text.split()
#     text = [stemmer.stem(word) for word in text if word not in stop_words]
#     text = " ".join(text)
#     corpus.append(text)
  
#   # pad = pad_sequences(sequences=corpus, maxlen=max_len, padding='pre')
#   # return pad
#   return corpus

In [None]:
# data_train_clean = text_cleaning(data_train, 'sentence')
# data_test_clean = text_cleaning(data_test, 'sentence')
# data_val_clean = text_cleaning(data_val, 'sentence')

### Pre-Processing: Method 1

Source: https://towardsdatascience.com/using-word2vec-to-analyze-news-headlines-and-predict-article-success-cdeda5f14751

In [None]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

# train_seq_x = sequence.pad_sequences(token.texts_to_sequences(data_train['sentence']), maxlen=300)
# test_seq_x = sequence.pad_sequences(token.texts_to_sequences(data_test['sentence']), maxlen=300)
# valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(data_val['sentence']), maxlen=300)

In [None]:
# Create list of strings into a single long string for processing
# title_list = [title for title in data_train['sentence']]

# We definitely are not doing this.
# Collapse the list of strings into a single long string for processing
# big_title_string = ' '.join(title_list)

In [None]:
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')

In [None]:
# from nltk.tokenize import word_tokenize
# Tokenize the string into words
# tokens = word_tokenize(big_title_string)

# Filter out stopwords
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))

# words = [word for word in words if not word in stop_words]

### Pre-Processing: Method 2

Sources:

* https://github.com/adsieg/Multi_Text_Classification/blob/master/%5BIntroduction%5D%20-%20Big%20tutorial%20-%20Text%20Classification.ipynb
* https://www.tensorflow.org/text/guide/word_embeddings
* Only BOW and TF-IDF: https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/

In [None]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

In [None]:
# data_train['length'] = [len(x) for x in token]
# data_train.head()
# max_len = data_train['length'].max()
# print(max_len)

### Word Vectorization

We can use the `gensim` library to train our own word2vec model on a custom corpus either with CBOW or Skip Gram.

word2vec cannot produce a vector for a word that is not in its vocabulary, so we need to check `if word in model.vocab` when building the full list of word vectors (in gensim 4.x the vocabulary mapping is `model.key_to_index`). Source: https://towardsdatascience.com/using-word2vec-to-analyze-news-headlines-and-predict-article-success-cdeda5f14751
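
The vocabulary check can be illustrated with a sentence-level average, one common way to turn word vectors into a single feature vector. This is a sketch, not necessarily the approach this project will settle on: `sentence_vector` is a hypothetical helper and `toy_vectors` is a plain dictionary standing in for the trained or pretrained model.

```python
import numpy as np

def sentence_vector(sentence, vectors, vector_size):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return np.zeros(vector_size, dtype="float32")
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional vectors; "greedy" is deliberately missing.
toy_vectors = {"feel": np.array([1.0, 3.0]), "wrong": np.array([3.0, 5.0])}

averaged = sentence_vector("feel greedy wrong", toy_vectors, 2)
empty = sentence_vector("greedy", toy_vectors, 2)
print(averaged, empty)
```

Without the `if w in vectors` filter, the lookup for "greedy" would raise a `KeyError`. The zero-vector fallback is one possible choice for sentences with no known words.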

In [None]:
# Relevant Libraries for Word Vectorization
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
from sklearn.preprocessing import LabelEncoder

# !pip install nltk
# !pip install gensim
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

In [None]:
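# Word2Vec expects an iterable of token lists, not a DataFrame
# sentences = [s.split() for s in data_train['sentence']]
# CBOW model (the size argument is named vector_size in gensim 4.x)
# model1 = gensim.models.Word2Vec(sentences, min_count=1, size=100, window=5)
# Skip Gram Model
# model2 = gensim.models.Word2Vec(sentences, min_count=1, size=100, window=5, sg=1)
# No separate build_vocab() call is needed when sentences are passed to the constructor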