## 95-891 Introduction to AI  
## Homework 4: Natural Language Processing - Sentiment Classification

In [411]:
import collections
from itertools import compress
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
import os
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gensim
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from adjustText import adjust_text
from sklearn.metrics import accuracy_score
import sklearn.metrics as metrics

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ruome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ruome\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Data ETL  

In [188]:
# Importing the datasets
fields = ['utterance', 'context']
train = pd.read_csv('train.csv', usecols=fields)
valid = pd.read_csv('valid.csv', usecols=fields)
test = pd.read_csv('test.csv', usecols=fields)

# Concat the training data
# Filter out the four target sentiments for both the training and testing data
sentiments = ['sad', 'jealous', 'joyful', 'terrified']
train_data = pd.concat([train, valid])
train_data = train_data.loc[train_data['context'].apply(lambda x: x in sentiments)]
test_data = test.loc[test['context'].apply(lambda x: x in sentiments)]

# Clean the index
train_data.index = pd.RangeIndex(len(train_data.index))
test_data.index = pd.RangeIndex(len(test_data.index))
print(train_data)
print(test_data)

         context                                          utterance
0      terrified  Job interviews always make me sweat bullets_co...
1      terrified                Don't be nervous. Just be prepared.
2      terrified  I feel like getting prepared and then having a...
3      terrified            Yes but if you stay calm it will be ok.
4      terrified         It's hard to stay clam. How do you do it? 
...          ...                                                ...
11346    jealous  Yeah_comma_ I can understand that. Sometimes w...
11347  terrified  During this past spring during a bad storm_com...
11348  terrified           Oh my gosh!  That had to be super scary!
11349  terrified  It sounded like a train going by the house fro...
11350  terrified  Yeah_comma_ thank god you are safe.  I don't k...

[11351 rows x 2 columns]
        context                                          utterance
0           sad  I'm so sad because i've read an article about ...
1           sad  Ugh_com

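The row filter in the cell above can also be written with `Series.isin`. Taking an explicit `.copy()` of the filtered frame is one way to avoid pandas' "copy of a slice" warning when a column is overwritten later. A minimal sketch on a made-up frame (the `toy` data is hypothetical; only the column names follow the notebook):

```python
import pandas as pd

# Hypothetical stand-in for test.csv; only the column names match the notebook.
toy = pd.DataFrame({
    "context": ["sad", "proud", "joyful", "angry", "terrified"],
    "utterance": ["a", "b", "c", "d", "e"],
})
sentiments = ["sad", "jealous", "joyful", "terrified"]

# isin() builds the boolean mask in one vectorized call, copy() makes the
# result an independent frame, and reset_index(drop=True) renumbers rows from 0.
toy_filtered = toy.loc[toy["context"].isin(sentiments)].copy().reset_index(drop=True)

# Assigning to a column of the copy leaves the original frame untouched.
toy_filtered["utterance"] = toy_filtered["utterance"].str.upper()
print(toy_filtered)
```

Applied to the notebook, the same pattern would replace the `apply(lambda x: x in sentiments)` mask and the manual `RangeIndex` assignment; the selected rows would be the same.
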
In [299]:
# Encode the sentiment labels as integers
# Map each sentiment to an integer, in order of first appearance in the training data
train_labels_unique = list(train_data['context'].unique())
label_mapper = {label: num for num, label in enumerate(train_labels_unique)}

train_labels = list(train_data['context'])
train_labels_encoded = [label_mapper[label] for label in train_labels]

# Encode the test labels with the same mapping
labels_test = list(test_data['context'])
test_labels_encoded = np.array([label_mapper[label] for label in labels_test])

# Data preprocessing
# Replace every non-word character (punctuation) with a space. The "_comma_"
# placeholder survives this step because "_" is a word character; it is
# stripped in a later cell, before stop-word removal.
train_data['utterance'] = train_data['utterance'].apply(lambda row: re.sub(r'\W', ' ', str(row)))

test_data['utterance'] = test_data['utterance'].apply(lambda row: re.sub(r'\W', ' ', str(row)))

train_data_list_cleaned = train_data['utterance']
test_data_list_cleaned = test_data['utterance']

print("Train Data cleaned: ")
print(train_data_list_cleaned)

print("------------------------------------------------------")
print("Test Data cleaned: ")
print(test_data_list_cleaned)

Train Data cleaned: 
0        Job interviews always make me sweat bullets_co...
1                      Don t be nervous  Just be prepared 
2        I feel like getting prepared and then having a...
3                  Yes but if you stay calm it will be ok 
4               It s hard to stay clam  How do you do it  
                               ...                        
11346    Yeah_comma_ I can understand that  Sometimes w...
11347    During this past spring during a bad storm_com...
11348             Oh my gosh   That had to be super scary 
11349    It sounded like a train going by the house fro...
11350    Yeah_comma_ thank god you are safe   I don t k...
Name: utterance, Length: 11351, dtype: object
------------------------------------------------------
Test Data cleaned: 
0       I m so sad because i ve read an article about ...
1       Ugh_comma_ those articles always get me too   ...
2       she was born premature at home_comma_ she had ...
3           Jeez  Its so unfortunat

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data['utterance'] = test_data['utterance'].apply(lambda row: re.sub(r'\W', ' ', str(row)))

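The cleaned output above still contains `_comma_` because `\W` treats the underscore as a word character. A small sketch with a made-up sentence (not taken from the dataset) shows the difference between removing punctuation only and removing the placeholder first. Replacing the placeholder with a space is a choice made here for illustration; the notebook later replaces it with an empty string.

```python
import re

raw = "Oh no_comma_ that's scary!"  # made-up example sentence

# \W matches anything that is not a letter, digit, or underscore,
# so the underscores in "_comma_" are left untouched.
only_punct = re.sub(r"\W", " ", raw)

# Removing the placeholder first, then the punctuation, cleans both.
cleaned = re.sub(r"\W", " ", raw.replace("_comma_", " "))

print(repr(only_punct))
print(repr(cleaned))
```
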

## 2. Converting the utterances into a sparse bag-of-words (BOW) representation

In [300]:
# Using sklearn
train_count_vectorizer = CountVectorizer()
X = train_count_vectorizer.fit_transform(train_data_list_cleaned)
encoding = X.toarray()

# Converting counts to binary result
present_count = 0
for arr in encoding:
    arr_present_count = (arr > 0).sum()
    present_count += arr_present_count

print("The count of non-zero cells is: {0}".format(present_count))
print("The size of encoding array is: {0}".format(encoding.size))
print("The percentage of words present is: {0}%".format(present_count / encoding.size * 100))

The count of non-zero cells is: 136631
The size of encoding array is: 108299891
The percentage of words present is: 0.12615986843421662%

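The same sparsity figure can be read off the sparse matrix directly, without materializing the dense array with `toarray()`. The sketch below uses a three-sentence toy corpus (hypothetical data) and the `nnz` attribute of the SciPy sparse matrix that `CountVectorizer` returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["I am sad", "I am so so happy", "sad news"]  # made-up sentences
X_toy = CountVectorizer().fit_transform(toy_corpus)

n_rows, n_cols = X_toy.shape
nonzero = X_toy.nnz                      # number of stored non-zero entries
density = nonzero / (n_rows * n_cols)    # share of cells that are non-zero
print(f"{nonzero} of {n_rows * n_cols} cells are non-zero ({density:.2%})")
```

On the real training matrix this would avoid allocating the roughly 108 million dense cells reported above.
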

## 3. Shortcoming of the previous representation of utterance features

In my view, the shortcomings of the BOW representation of utterance features are:

1. Many uninformative words are counted in the BOW matrix, for example stop words such as "the", "is", and "and", and words like "_comma_", which do little to help train the classifier.

2. Only about 0.126% of the cells in the encoding array are non-zero. The matrix is extremely sparse, so almost all of the memory and computation spent on the dense array goes to zeros.

3. Grammar, word order, and sentence context are not captured by the representation.

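The third point can be illustrated with two made-up sentences that carry opposite sentiment but produce identical bags of words. This sketch uses a plain `Counter` over whitespace tokens rather than `CountVectorizer`, which is a simplification.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lower-cased whitespace tokens, ignoring their order."""
    return Counter(sentence.lower().split())

a = bag_of_words("I am not sad I am happy")
b = bag_of_words("I am not happy I am sad")

# Same counts for every word, so a BOW model cannot tell the two apart.
print(a == b)
```
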
In [301]:
# Remove stop words
# Get the list of English stop words from NLTK
stopwords_list = list(set(stopwords.words('english')))

# Strip "_comma_" from the training data, then drop stop words (matched case-insensitively; the remaining words keep their case)
train_data_stop_removed = train_data_list_cleaned.apply(lambda row: row.replace('_comma_', ''))
train_data_stop_removed = train_data_stop_removed.apply(lambda row: [word for word in word_tokenize(row) if not word.lower() in stopwords_list])
train_data_stop_removed = train_data_stop_removed.str.join(' ')
print("Train Data processed: ")
print(train_data_stop_removed)

print("------------------------------------------------------")

# Strip "_comma_" from the test data, then drop stop words (matched case-insensitively; the remaining words keep their case)
test_data_stop_removed = test_data_list_cleaned.apply(lambda row: row.replace('_comma_', ''))
test_data_stop_removed = test_data_stop_removed.apply(lambda row: [word for word in word_tokenize(row) if not word.lower() in stopwords_list])
test_data_stop_removed = test_data_stop_removed.str.join(' ')
print("Test Data processed: ")
print(test_data_stop_removed)

Train Data processed: 
0        Job interviews always make sweat bullets makes...
1                                         nervous prepared
2        feel like getting prepared curve ball thrown t...
3                                         Yes stay calm ok
4                                           hard stay clam
                               ...                        
11346    Yeah understand Sometimes position enjoy busy ...
11347    past spring bad storm tornado touch neighbors ...
11348                                  Oh gosh super scary
11349    sounded like train going house basement Thankf...
11350         Yeah thank god safe know would done happened
Name: utterance, Length: 11351, dtype: object
------------------------------------------------------
Test Data processed: 
0       sad read article newborn girl died parents bel...
1                           Ugh articles always get wrong
2       born premature home hard time breathing instea...
3                             J

In [302]:
# Creating the bag of words encoding again  
train_count_vectorizer = CountVectorizer()
X_stop_removed = train_count_vectorizer.fit_transform(train_data_stop_removed)

train_one_hot_encoding = X_stop_removed.toarray()

train_one_hot_encoding_present_count = 0
for arr in train_one_hot_encoding:
    train_one_hot_encoding_arr_present_count = (arr > 0).sum()
    train_one_hot_encoding_present_count += train_one_hot_encoding_arr_present_count

print("The count of non-zero cells is: {0}".format(train_one_hot_encoding_present_count))
print("The size of encoding array is: {0}".format(train_one_hot_encoding.size))
print("The percentage of words present is: {0}%".format(train_one_hot_encoding_present_count / train_one_hot_encoding.size * 100))

The count of non-zero cells is: 72450
The size of encoding array is: 94939764
The percentage of words present is: 0.07631154423345733%


## 4. Normalization using TF-IDF

In [303]:
# Normalizing the training data using the TF-IDF transformer

train_tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
train_embedding_tfidf_transformer = train_tfidf_transformer.fit_transform(X_stop_removed)
train_embedding_tfidf = train_embedding_tfidf_transformer.toarray()

print(train_embedding_tfidf)
print(train_embedding_tfidf.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(11351, 8364)

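For reference, the weighting applied by `TfidfTransformer(smooth_idf=True, use_idf=True)` with its default settings, as documented for scikit-learn, can be written as follows. Here $n$ is the number of training documents (11351 in this notebook), $\mathrm{tf}(t, d)$ is the raw count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$; the symbols are notation introduced here, not taken from the notebook.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t),
\qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

The last step is the default L2 normalization of each row, so every non-empty document vector has unit length. A term that appears in every document gets $\mathrm{idf}(t) = \ln 1 + 1 = 1$, the smallest possible weight.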

## 5. Build an SGD classifier and perform error analysis

In [307]:
X_train = train_embedding_tfidf_transformer
y_train = np.array(train_labels_encoded)

clf = SGDClassifier()
clf.fit(X_train, y_train)

# Using training data vocabulary on test data so that the features are consistent    
test_count_vectorizer = CountVectorizer(vocabulary = train_count_vectorizer.get_feature_names())
X_test = test_count_vectorizer.fit_transform(test_data_stop_removed)

test_one_hot_encoding = X_test.toarray()

for arr in test_one_hot_encoding:
    arr[arr > 0] = 1

# Normalizing the test data  
test_tfidf_transformer = TfidfTransformer(smooth_idf=False, use_idf=True)
test_embedding_tfidf_transformer = test_tfidf_transformer.fit_transform(test_one_hot_encoding)

# Getting predictions on test data
test_predicted_labels = clf.predict(test_embedding_tfidf_transformer)


# Evaluate on the test set
print('Test accuracy:', np.mean(test_labels_encoded == test_predicted_labels))

# Per-class F1 scores; their unweighted mean (printed last) is the macro-averaged F1
f1_score_vector = f1_score(test_labels_encoded, test_predicted_labels, average=None)

print('Confusion matrix: \n', confusion_matrix(test_labels_encoded, test_predicted_labels))

print('f1 score using SGD classifier is:', np.mean(f1_score_vector))

  idf = np.log(n_samples / df) + 1


Test accuracy: 0.6201888162672476
Confusion matrix: 
 [[215  27  30  26]
 [ 31 228  54  42]
 [ 53  54 229  38]
 [ 36  69  63 182]]
f1 score using SGD classifier is: 0.6206513585802701

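The cell above fits a second `TfidfTransformer` on the test counts, so the IDF weights used at test time differ from those the classifier was trained on. The `idf = np.log(n_samples / df) + 1` warning line is likely a division by zero for vocabulary terms that never occur in the test set. One common alternative, sketched here on made-up data and not the method used in this notebook, is to learn the vocabulary and IDF weights from the training text only and reuse them for the test text through a `Pipeline`. Applying it would change the scores reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up miniature data; in the notebook the inputs would be
# train_data_stop_removed / test_data_stop_removed and the encoded labels.
toy_train_texts = ["so sad today", "very sad news", "feeling joyful", "joyful and happy"]
toy_train_labels = [0, 0, 1, 1]
toy_test_texts = ["sad story", "happy day"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),             # vocabulary and IDF learned from training text only
    ("clf", SGDClassifier(random_state=0)),   # linear classifier, as in the notebook
])
pipeline.fit(toy_train_texts, toy_train_labels)

# predict() calls transform() on the fitted vectorizer; nothing is re-fitted.
toy_pred = pipeline.predict(toy_test_texts)

# Training and test features share the same columns.
train_features = pipeline.named_steps["tfidf"].transform(toy_train_texts)
test_features = pipeline.named_steps["tfidf"].transform(toy_test_texts)
print(train_features.shape, test_features.shape)
```
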

In [331]:
# Misclassified examples
misclassified_exhibits = list(np.where(test_labels_encoded != test_predicted_labels)[0])[:4]
print("Misclassified exhibits: {0}\n".format(misclassified_exhibits))

# Print the prediction, true label, and utterance for each of the first four errors
for example_number, index in enumerate(misclassified_exhibits, start=1):
    print("--------------------------------------")
    print("Misclassified Example {0}".format(example_number))
    print("Index:", index)
    print("Predicted Sentiment:", train_labels_unique[test_predicted_labels[index]])
    print("Actual Sentiment:", test_data['context'].iloc[index])
    print("Utterance:", test_data['utterance'].iloc[index])

Misclassified exhibits: [1, 2, 6, 8]

--------------------------------------
Misclassified Example 1
Index: 1
Predicted Sentiment: jealous
Actual Sentiment: sad
Utterance: Ugh_comma_ those articles always get me too       What was wrong with her  
--------------------------------------
Misclassified Example 2
Index: 2
Predicted Sentiment: terrified
Actual Sentiment: sad
Utterance: she was born premature at home_comma_ she had hard time breathing on her own but instead of taking her to the doctor parents were just praying
--------------------------------------
Misclassified Example 3
Index: 6
Predicted Sentiment: sad
Actual Sentiment: joyful
Utterance: 3 years is a long time  How come 
--------------------------------------
Misclassified Example 4
Index: 8
Predicted Sentiment: sad
Actual Sentiment: joyful
Utterance: Oh I see  They must miss you_comma_ too 


##### My thoughts
Among these misclassified examples, examples 1 and 2 are replies to the same prompt, a sad story about a premature newborn baby girl who died because her parents refused to take her to the hospital. Both utterances express sadness at the baby girl's misfortune.

However, the first example was predicted as "jealous" and the second as "terrified". In my view, the first may have been misclassified because "ugh" also carries some anger towards the irresponsible parents, and "what was wrong with her" often has a negative tone in everyday conversation. In the second example, "premature", "hard time breathing", and "praying" could mislead the classifier because they do suggest a terrified state, so this utterance also has a mixed sentiment.

Misclassified examples 3 and 4 are also replies to the same prompt, in which someone announces that they are going to visit their parents soon. Both examples express joy at this good news.
However, examples 3 and 4 were both predicted as "sad". My guess is that "a long time", "how come", and "miss" tend to convey negative sentiment in everyday use, so the classifier grouped these utterances as "sad".

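The confusion matrix from Section 5 also shows which classes are hardest overall. The sketch below recomputes accuracy, per-class precision and recall, and macro F1 from the printed matrix alone, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The class order follows `label_mapper`, which is not printed in the notebook, so the classes are left unnamed here.

```python
import numpy as np

# Confusion matrix printed in Section 5 (rows: true class, columns: predicted class)
cm = np.array([
    [215,  27,  30,  26],
    [ 31, 228,  54,  42],
    [ 53,  54, 229,  38],
    [ 36,  69,  63, 182],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # share of each true class that was recovered
precision = true_positives / cm.sum(axis=0)   # share of each predicted class that is correct
f1_per_class = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / cm.sum()
macro_f1 = f1_per_class.mean()

print("accuracy:", accuracy)
print("macro F1:", macro_f1)
print("per-class recall:", recall.round(3))
print("per-class precision:", precision.round(3))
```

The accuracy and macro F1 computed this way agree with the values printed by the SGD cell.
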
## 6. Build a classifier using pre-trained word embeddings 

In [336]:
# Tokenizing the data
train_tokens = [nltk.word_tokenize(sentences) for sentences in train_data_stop_removed]
train_y = np.array(train_labels_encoded)

test_tokens = [nltk.word_tokenize(sentences) for sentences in test_data_stop_removed]
test_y = np.array(test_labels_encoded)

# Loading the pretrained word2vec model from Google
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model.vector_size)

300


In [347]:
# Keep only the vectors of words that Word2Vec has an entry for
# (`words` is not defined in the cells shown here; it is assumed to be a list of tokens built in an earlier cell)
vector_list = [model[word] for word in words if word in model]

# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model]

# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)

# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
print(df)

                 0         1         2         3         4         5    \
0           0.152344 -0.121094  0.102051 -0.083984 -0.184570  0.015320   
Job         0.206055 -0.053467 -0.318359 -0.265625  0.287109  0.065918   
interviews  0.005646  0.114746 -0.177734 -0.162109  0.298828  0.097168   
always      0.055908  0.057617  0.015198  0.251953 -0.041260  0.074219   
make       -0.113281 -0.036865  0.094238  0.007996  0.024902 -0.166992   
...              ...       ...       ...       ...       ...       ...   
k           0.040771 -0.178711  0.226562  0.326172  0.441406 -0.175781   
Name        0.039307 -0.155273  0.052490 -0.171875 -0.112305 -0.062988   
utterance   0.109863 -0.228516  0.095215  0.201172 -0.542969  0.023315   
Length     -0.010925 -0.051758 -0.003403  0.287109  0.171875 -0.097168   
object      0.292969 -0.062988  0.083496  0.043457 -0.562500  0.154297   

                 6         7         8         9    ...       290       291  \
0           0.238281 -0.478516  

In [369]:
from sklearn.manifold import TSNE

# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

tsne_df = tsne.fit_transform(df)

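To turn per-word vectors into features for a classifier, one simple option is to average the vectors of the words in each utterance. The sketch below is an assumption about how this section could continue, not code from the notebook. `average_vector` and `toy_vectors` are hypothetical names, and the three-dimensional toy embeddings stand in for the 300-dimensional Word2Vec model, which supports the same `word in model` and `model[word]` operations used above.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of the in-vocabulary tokens; zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional embeddings standing in for the pre-trained model
toy_vectors = {
    "sad":   np.array([1.0, 0.0, 0.0]),
    "news":  np.array([0.0, 1.0, 0.0]),
    "happy": np.array([0.0, 0.0, 1.0]),
}

toy_tokens = [["sad", "news"], ["happy"], ["unseen", "words"]]
features = np.vstack([average_vector(tokens, toy_vectors, dim=3) for tokens in toy_tokens])
print(features)
```

With the real model, `train_tokens` and `test_tokens` would take the place of `toy_tokens`, `dim` would be `model.vector_size`, and the resulting matrices could be passed to a classifier such as `MLPClassifier`.
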


## 7. Build a classifier based on BERT and MLP

In [400]:
from transformers import DistilBertTokenizer, DistilBertModel
import torch

# load the tokenizer and the model of distilbert-base-uncased
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

def sentenceVec(doc):
    sentenceVec.i = sentenceVec.i + 1
    
    if (sentenceVec.i % 1000 == 0):
        print("Classifier Trainining Iterations: {0}".format(sentenceVec.i))
        
    inputs = tokenizer(doc, return_tensors="pt", truncation = True)
    outputs = model(**inputs)
    
    last_hidden_states = outputs.last_hidden_state
    sentence_embedding = torch.mean(last_hidden_states[0], dim=0).detach().numpy()
    
    return sentence_embedding

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

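`sentenceVec` embeds one text at a time, so a plain mean over the token vectors is enough. If texts were instead embedded in padded batches (an assumption; the notebook does not batch), the padding positions would need to be excluded from the mean using the attention mask. The NumPy sketch below shows only that pooling step on made-up arrays; `masked_mean` is a hypothetical helper, not part of the `transformers` library.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # guard against all-padding rows
    return summed / counts

# Two made-up "sentences" of up to three tokens with 2-dimensional vectors
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = masked_mean(hidden, mask)
print(sentence_vectors)
```
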

In [382]:
# Transform raw train data
sentenceVec.i = 0;
train_data_raw = pd.DataFrame(train_data)
train_sentence_vectors =  train_data_raw.applymap(lambda doc: sentenceVec(doc))

BERT Classifier Trainining Iterations: 1000
BERT Classifier Trainining Iterations: 2000
BERT Classifier Trainining Iterations: 3000
BERT Classifier Trainining Iterations: 4000
BERT Classifier Trainining Iterations: 5000
BERT Classifier Trainining Iterations: 6000
BERT Classifier Trainining Iterations: 7000
BERT Classifier Trainining Iterations: 8000
BERT Classifier Trainining Iterations: 9000
BERT Classifier Trainining Iterations: 10000
BERT Classifier Trainining Iterations: 11000
BERT Classifier Trainining Iterations: 12000
BERT Classifier Trainining Iterations: 13000
BERT Classifier Trainining Iterations: 14000
BERT Classifier Trainining Iterations: 15000
BERT Classifier Trainining Iterations: 16000
BERT Classifier Trainining Iterations: 17000
BERT Classifier Trainining Iterations: 18000
BERT Classifier Trainining Iterations: 19000
BERT Classifier Trainining Iterations: 20000
BERT Classifier Trainining Iterations: 21000
BERT Classifier Trainining Iterations: 22000


In [383]:
# Training the MLP classifier
mlp_X_train = pd.DataFrame(train_sentence_vectors['utterance'].tolist()).to_numpy()
mlp = MLPClassifier(hidden_layer_sizes=150, random_state = 42, max_iter=300)
mlp.fit(mlp_X_train, y_train)

MLPClassifier(hidden_layer_sizes=150, max_iter=300, random_state=42)

In [401]:
# Transform raw test data
sentenceVec.i = 0;
test_data_raw = pd.DataFrame(test_data)
test_sentence_vectors =  test_data_raw.applymap(lambda doc: sentenceVec(doc))

Classifier Trainining Iterations: 1000
Classifier Trainining Iterations: 2000


In [416]:
# Predict
X_test = pd.DataFrame(test_sentence_vectors['utterance'].tolist()).to_numpy()
y_pred = mlp.predict(X_test)

accuracy = accuracy_score(test_labels_encoded, y_pred)
f1 = metrics.f1_score(test_labels_encoded, y_pred, average="weighted")
cf_matrix = confusion_matrix(test_labels_encoded, y_pred)

print("Accuracy:", accuracy)
print("f1-score:", f1)
print("Confusion Matrix:\n", cf_matrix)

Accuracy: 0.6122004357298475
f1-score: 0.6109740351403672
Confusion Matrix:
 [[213  32  38  15]
 [ 40 219  46  50]
 [ 54  54 218  48]
 [ 29  67  61 193]]


## 8. Read the paper at https://arxiv.org/pdf/1811.00207.pdf  and answer the following questions: 
###### 1)  What does this paper mean by "fine-tuning" results? How might you use such fine-tuning in building an empathetic chatbot? 
In this paper, the "fine-tuning" results are those obtained after the pre-trained conversation models are further trained on the EmpatheticDialogues (ED) utterance data. Fine-tuning in this way improves human ratings on the ED task in both the retrieval and generative set-ups. When building an empathetic chatbot, we could therefore start from a general pre-trained dialogue model and fine-tune it on empathetic conversations so that its responses become more empathetic.
###### 2)  What properties of the transformer architecture make it well suited for this application? 
The transformer includes five layers instead of four, which gives it more capacity to learn and to recognize more complex representations of the input data. It also features a BERT-based architecture that can classify the next sentence, which greatly increases its accuracy. Finally, it supports both retrieval-based and generative set-ups: the generative set-up predicts a likely sentence, while the retrieval-based set-up locates the utterance with the highest likelihood from the batch. Combining the two provides the best possible conversations.
###### 3)  Explain the metrics used to evaluate performance in Table 1 (P@1,100, AVG-BLEU, and PPL).   
P@1,100 is the precision of retrieving the correct test candidate out of 100 test candidates. It reflects how often the model chooses the correct response from 100 randomly chosen samples in the test set. Unlike inference for all the other metrics, where the retrieval systems use only training utterances as candidates, the computation of P@1,100 includes the actual response among the candidates.

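A toy version of this metric, with 4 candidates per example instead of 100 and made-up scores, could look like the sketch below. `precision_at_1` is a hypothetical helper, and placing the true response at position 0 is a convention chosen here.

```python
def precision_at_1(score_lists, true_index=0):
    """Fraction of examples whose true response receives the single highest score."""
    hits = 0
    for scores in score_lists:
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == true_index)
    return hits / len(score_lists)

# Three made-up examples; position 0 holds the score of the true response.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],   # true response ranked first: hit
    [0.2, 0.8, 0.1, 0.4],   # a distractor ranked first: miss
    [0.7, 0.6, 0.5, 0.1],   # hit
]
p_at_1 = precision_at_1(toy_scores)
print(p_at_1)
```
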
AVG-BLEU is the average of BLEU-1, -2, -3, and -4. BLEU is a metric originally designed to assess how similar machine-translated text is to a set of high-quality reference translations. It compares the n-grams of the candidate with the n-grams of the reference and counts the number of matches. These matches are position-independent. The more matches there are, the better the candidate is.

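The n-gram matching at the core of BLEU can be sketched as a clipped (modified) n-gram precision. This is a simplification: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped
    to how often each n-gram occurs in the reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat"
p1 = modified_precision("the the the the", reference, 1)          # repeats are clipped
p2 = modified_precision("the cat sat on the mat", reference, 2)   # bigram overlap
print(p1, p2)
```

Clipping is what stops a degenerate candidate such as "the the the the" from scoring perfectly just because "the" appears in the reference.
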
PPL is perplexity, one of the most common metrics for evaluating language models. It measures how well the model predicts the actual response: it is the exponential of the average negative log-likelihood that the model assigns to the reference tokens. The lower the perplexity, the less "surprised" the model is by the human response, which is used here as a proxy for how sensible its generated responses are.
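
As a worked illustration of the definition of perplexity, the sketch below computes it from the probabilities a model assigns to each reference token. The probabilities are made up and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_probabilities):
    """Exponential of the average negative log-probability of the reference tokens."""
    n = len(token_probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probabilities) / n
    return math.exp(avg_neg_log_prob)

# A model that spreads probability evenly over 4 choices at every step
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly confident about each reference token
confident = perplexity([0.9, 0.8, 0.95])

print(uniform_over_four, confident)
```

A model that guesses uniformly among four options has a perplexity of about 4, which is why perplexity is often read as the effective number of choices the model is weighing at each token.
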
###### 4)  Which of the metrics do you think provides the best measure of performance of empathic systems and why?   
When evaluating the performance of empathic systems, AVG-BLEU provides the best measure. P@1,100 measures the precision of retrieving the correct test candidate, and PPL measures how well the model predicts the reference response. Only AVG-BLEU measures how closely the machine-produced response resembles the human response, which helps ensure that the sentiment of the response is preserved.
###### 5)  Based on Tables 1 and 2, and your reading of the paper, what do you think would help the system get to human-level performance? 
Based on the paper, providing retrieval candidates from the dataset and fine-tuning the conversation models both lead to responses that are closer to human-level performance.