This notebook is for the major project submission, on the **language** dataset and task. It contains the following sections:

*   Loading the Data
*   Exploratory Analysis of the data
*   Feature Engineering on the data
*   Conventional Machine Learning Model
*   A description of the selected conventional ML Model
*   Deep Learning Model
*   A description of the selected Deep Learning Model
*   Discussion between the performances of the two models

# ***1) Loading the Data***

In [1]:
import numpy as np
import pandas as pd
from os.path import join
from google.colab import drive
import pickle

drive.mount('/content/drive/')

def load_pickle(path):
    # Load a pickled Python object from disk and report which file was read
    with open(path, 'rb') as f:
        obj = pickle.load(f)
    print('Loaded %s..' % path)
    return obj

dataset_directory = '/content/drive/My Drive/ML Project/tweet-emotion-detection' 

emotions = ['anger', 'fear', 'joy', 'sadness']

tweets_train = np.load(join(dataset_directory, 'text_train_tweets.npy'))
labels_train = np.load(join(dataset_directory, 'text_train_labels.npy'))
vocabulary = load_pickle(join(dataset_directory, 'text_word_to_idx.pkl'))

tweets_val = np.load(join(dataset_directory, 'text_val_tweets.npy'))
labels_val = np.load(join(dataset_directory, 'text_val_labels.npy'))

tweets_test_private = np.load(join(dataset_directory, 'text_test_private_tweets.npy'))

print(len(vocabulary))
idx_to_word = {i: w for w, i in vocabulary.items()}
for i in range(7):
  print(i, idx_to_word[i])

sample = 1  

print('sample tweet, stored form:')
print(tweets_train[sample])
print(labels_train[sample])

print('sample tweet, readable form:')

decode = []
for i in range(50):
  decode.append(idx_to_word[tweets_train[sample][i]])
print(decode)
print(emotions[labels_train[sample]])


print(tweets_train.shape)
print(labels_train.shape)
print(tweets_val.shape)
print(labels_val.shape)
print(tweets_test_private.shape)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
Loaded /content/drive/My Drive/ML Project/tweet-emotion-detection/text_word_to_idx.pkl..
13978
0 <NULL>
1 <START>
2 <END>
3 it
4 makes
5 me
6 so
sample tweet, stored form:
[ 1 23 24 20 25 19 26 27 28  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0]
0
sample tweet, readable form:
['<START>', 'lol', 'adam', 'the', 'bull', 'with', 'his', 'fake', 'outrage', '<END>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>', '<NULL>']
anger
(7098, 52)
(709

# ***2) Exploratory Analysis of the data***

In [2]:
arr = np.array(tweets_train)
X_train = pd.DataFrame(data=arr.flatten())
print(X_train)

        0
0       1
1       3
2       4
3       5
4       6
...    ..
369091  0
369092  0
369093  0
369094  0
369095  0

[369096 rows x 1 columns]


In [3]:
arr1 = np.array(labels_train)
y_train = pd.DataFrame(data=arr1.flatten())
print(y_train)

      0
0     0
1     0
2     0
3     0
4     0
...  ..
7093  3
7094  3
7095  3
7096  3
7097  3

[7098 rows x 1 columns]
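
Two summaries that are often useful at this stage are the share of each emotion class and the number of real (non-padding) tokens per tweet. The following is a minimal sketch under stated assumptions: `toy_labels` and `toy_tweets` are hypothetical stand-ins for `labels_train` and `tweets_train`, and index 0 is taken to be the `<NULL>` padding token, as in the vocabulary printed above.

```python
from collections import Counter

# Hypothetical stand-ins for labels_train and tweets_train (not the real data).
emotions = ['anger', 'fear', 'joy', 'sadness']
toy_labels = [0, 0, 1, 2, 2, 2, 3]
toy_tweets = [
    [1, 3, 4, 5, 2, 0, 0, 0],
    [1, 6, 2, 0, 0, 0, 0, 0],
]

def class_distribution(labels, names):
    """Return {class name: fraction of samples} for integer class labels."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {names[k]: counts.get(k, 0) / total for k in range(len(names))}

def unpadded_length(tweet, null_index=0):
    """Count the tokens that are not the <NULL> padding index."""
    return sum(1 for token in tweet if token != null_index)

distribution = class_distribution(toy_labels, emotions)
lengths = [unpadded_length(tweet) for tweet in toy_tweets]
print(distribution)
print(lengths)
```

The same two functions should also accept the NumPy arrays loaded earlier, since they only iterate over rows and elements.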


# ***3) Feature Engineering on the data***

***Converting index to words***

In [4]:
# Preprocessing X_train (training data): map each word index back to its word

df = pd.DataFrame(tweets_train)
for i in range(len(idx_to_word)):  # one pass per vocabulary entry (13978 in total)
  df.replace(i, idx_to_word[i], inplace=True)
X_train = df.values
X_train

array([['<START>', 'it', 'makes', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'lol', 'adam', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', '<user>', 'passed', ..., '<NULL>', '<NULL>', '<NULL>'],
       ...,
       ['<START>', '#vinb', 'i', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'overwhelming', 'sadness', ..., '<NULL>', '<NULL>',
        '<NULL>'],
       ['<START>', 'idk', 'why', ..., '<NULL>', '<NULL>', '<NULL>']],
      dtype=object)
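
The index-to-word conversion can also be written as a single dictionary lookup per token, instead of one `replace` pass per vocabulary entry. This is a hedged sketch rather than the code used for the submission: `toy_idx_to_word` and `toy_tweets` are hypothetical miniatures of `idx_to_word` and `tweets_train`.

```python
# Hypothetical miniature vocabulary and tweets; the notebook's real
# idx_to_word has 13978 entries.
toy_idx_to_word = {0: '<NULL>', 1: '<START>', 2: '<END>', 3: 'it', 4: 'makes'}
toy_tweets = [[1, 3, 4, 2, 0], [1, 4, 2, 0, 0]]

def decode_tweets(tweets, idx_to_word):
    """Map every index to its word with one dictionary lookup per token."""
    return [[idx_to_word[int(index)] for index in row] for row in tweets]

decoded = decode_tweets(toy_tweets, toy_idx_to_word)
print(decoded[0])
```

This visits each token once, so the work scales with the number of tokens rather than with the vocabulary size times the number of tokens.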

In [5]:
print(X_train.shape)

(7098, 52)


In [6]:
# Preprocessing x_val (validation data): map each word index back to its word

df2 = pd.DataFrame(tweets_val)
for i in range(len(idx_to_word)):  # one pass per vocabulary entry (13978 in total)
  df2.replace(i, idx_to_word[i], inplace=True)
x_val = df2.values
x_val

array([['<START>', '<user>', '<user>', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'had', 'a', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', '#cnn', 'really', ..., '<NULL>', '<NULL>', '<NULL>'],
       ...,
       ['<START>', 'i', 'wont', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'and', 'after', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'hit', 'by', ..., '<NULL>', '<NULL>', '<NULL>']],
      dtype=object)

In [7]:
print(x_val.shape)

(1460, 52)


In [8]:
# Preprocessing x_test (test data): map each word index back to its word

df3 = pd.DataFrame(tweets_test_private)
for i in range(len(idx_to_word)):  # one pass per vocabulary entry (13978 in total)
  df3.replace(i, idx_to_word[i], inplace=True)
x_test = df3.values
x_test

array([['<START>', 'whatever', 'you', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'accept', 'the', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'my', 'roommate', ..., '<NULL>', '<NULL>', '<NULL>'],
       ...,
       ['<START>', '<user>', ':', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', 'you', 'have', ..., '<NULL>', '<NULL>', '<NULL>'],
       ['<START>', '<user>', '<user>', ..., '<NULL>', '<NULL>', '<NULL>']],
      dtype=object)

In [9]:
print(x_test.shape)

(4257, 52)


**TF-IDF Vectors as Features**

In [10]:
# Feature extraction code

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# The tweets are already tokenized (each row is a sequence of word tokens),
# so the preprocessor and tokenizer are identity functions.
vectorizer = TfidfVectorizer(preprocessor=lambda x: x,
                             tokenizer=lambda x: x,
                             stop_words=stopwords.words('english'),
                             max_features=5000,
                             min_df=3, max_df=0.9)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
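
To make the weighting concrete, the sketch below computes TF-IDF by hand for pre-tokenized documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what `TfidfVectorizer` applies with its default settings. It deliberately leaves out stop-word removal, `min_df`, `max_df` and `max_features`, and `toy_docs` is a hypothetical two-document corpus.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF for pre-tokenized documents: raw counts x smoothed IDF, L2-normalised per row."""
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Hypothetical toy corpus of already-tokenized tweets.
toy_docs = [['fake', 'outrage'], ['fake', 'news', 'news']]
vocab, rows = tfidf(toy_docs)
print(vocab)
print(rows[0])
```

In the first document, `outrage` receives a larger weight than `fake` because it appears in fewer documents.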


In [11]:
# Transformation code

X_train_idf = vectorizer.fit_transform(X_train)

X_val_idf = vectorizer.transform(x_val)

x_test = vectorizer.transform(x_test)

y_train = labels_train
y_val = labels_val



In [12]:
print('x_train shape:', X_train_idf.shape)
print('x_val shape:', X_val_idf.shape)
print('y_train shape:', y_train.shape)
print('y_val shape:', y_val.shape)
print('x_test shape:', x_test.shape)

x_train shape: (7098, 3903)
x_val shape: (1460, 3903)
y_train shape: (7098,)
y_val shape: (1460,)
x_test shape: (4257, 3903)


# ***4) Conventional Machine Learning Model***

***The final model that produced the best-performing predictions for the Kaggle submission (accuracy of 52% on public test set and 63% on private test set) was an SVM with a linear kernel.***

In [0]:
# Conventional ML model definition code

from sklearn import svm

clf = svm.SVC(kernel='linear')  # SVM with a linear kernel

# Train the model on the TF-IDF features of the training set

clf.fit(X_train_idf, y_train)

# Predict the labels for the validation set

y_pred = clf.predict(X_val_idf)

In [14]:
#Checked the accuracy for SVM Linear Kernel on the validation set

from sklearn.metrics import accuracy_score
print("Conventional ML Model Accuracy Score:", accuracy_score(y_val, y_pred))

Conventional ML Model Accuracy Score: 0.4321917808219178


In [0]:
#Predicted the response for the test dataset

predsvm = clf.predict(x_test)

In [0]:
#Saving file as CSV

import numpy as np
import pandas as pd
prediction_svm = pd.DataFrame(predsvm, columns=['Prediction'])
prediction_svm.to_csv('/content/drive/My Drive/ML Project/tweet-emotion-detection/prediction_svm.csv')

# ***5) A Description of the Selected Conventional ML Model***

The final conventional ML model, an **SVM Linear Classifier**, reached an accuracy of 0.432 (43.2%) on the validation set and 0.52 (52%) on the public test set on Kaggle. Linear SVMs are generally a strong choice for text classification because TF-IDF features are high-dimensional and sparse, which may explain why this model performed best among the conventional models I tried. The validation accuracy of 43.2% shows, however, that a linear decision boundary generalizes only modestly on this feature set. The kernel type was tuned as a hyperparameter: a polynomial kernel did not work well, and the linear kernel worked better.

In addition to the final model, I also tried a ***KNN Model*** and a ***Random Forest Model***. The KNN Classifier performed clearly worse: with 3 neighbours it reached an accuracy of 0.36 on the validation set and 0.30 on the public test set, and the other values tried during tuning (n = 1, 2) performed worse still. The Random Forest Classifier came close to the SVM, with an accuracy of 0.429 on the validation set and 0.51 on the public test set (n_estimators was tuned using values such as 100 and 200). KNN may have struggled because it is sensitive to outliers and noisy data, and because distances between high-dimensional, sparse TF-IDF vectors are less informative; its prediction cost also grows with the size of the dataset, although that affects runtime rather than accuracy. For Random Forest, each feature needs some predictive power of its own for the splits to be useful, which may not hold for many of our TF-IDF features; training a large number of trees is also computationally expensive.

Hence, I decided to choose SVM Linear Classifier as the best conventional ML Model. 

# ***6) Deep Learning Model***

The final model that produced the best-performing predictions for the Kaggle submission (accuracy of 54.65% on the public test set and 64% on the private test set) was a fully connected neural network built from dense layers. Its input was the tweet data after feature extraction with TF-IDF.

In [17]:
pip install keras



In [18]:
# pre-processing training labels and validation labels to categorical data

import tensorflow as tf
import keras
from keras.utils import to_categorical
y_train = keras.utils.to_categorical(y_train)
y_val = keras.utils.to_categorical(y_val)

Using TensorFlow backend.
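
As an illustration of what the label conversion does, the sketch below one-hot encodes integer labels and then recovers them again by taking the index of the largest value in each row. `one_hot` and `argmax_rows` are hypothetical helper names written for this example; they are not Keras functions.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows of length num_classes."""
    return [[1.0 if k == int(label) else 0.0 for k in range(num_classes)]
            for label in labels]

def argmax_rows(rows):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

toy_labels = [0, 3, 2]  # anger, sadness, joy in the notebook's emotions list
encoded = one_hot(toy_labels, num_classes=4)
decoded = argmax_rows(encoded)
print(encoded)
print(decoded)
```

The same argmax step turns the four softmax probabilities that the network outputs for a tweet into a single predicted class index.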


In [0]:
# Deep Model definition code

from keras.models import Sequential
from keras import layers

input_dim = X_train_idf.shape[1]  # Number of features

model = Sequential()

model.add(layers.Dense(12, input_dim=input_dim, activation='relu')) #Hidden dense layer (12 units, relu activation) taking the 3903 TF-IDF features as input

model.add(layers.Dropout(0.5)) #Dropout layer for regularization

model.add(layers.Dropout(0.25)) #Dropout layer for regularization

model.add(layers.Dense(4, activation='softmax')) #Output dense layer using softmax activation

#Referred to Lecture Notes (Week 9) for the idea of the code

In [20]:
#Compile the Deep Learning Model with its loss function, optimizer and metrics

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                46848     
_________________________________________________________________
dropout_1 (Dropout)          (None, 12)                0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 12)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 52        
Total params: 46,900
Trainable params: 46,900
Non-trainable params: 0
_________________________________________________________________
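
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. A short sketch, where `dense_params` is a hypothetical helper written for this check:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_params(3903, 12)  # TF-IDF features -> 12 hidden units
output_params = dense_params(12, 4)     # 12 hidden units -> 4 emotion classes
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```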


In [21]:
#Fitted the deep learning model on the training set, using the validation set to monitor performance

history = model.fit(X_train_idf, y_train, epochs=10, verbose=1, 
                    validation_data=(X_val_idf, y_val),batch_size=10)

Train on 7098 samples, validate on 1460 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
#Accuracy on the training set and on the validation set

loss, accuracy = model.evaluate(X_train_idf, y_train, verbose=False)
print("Deep Learning Model Training Accuracy:   {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val_idf, y_val, verbose=False)
print("Deep Learning Model Validation Accuracy: {:.4f}".format(accuracy))

Deep Learning Model Training Accuracy:   0.9649
Deep Learning Model Validation Accuracy: 0.4479


In [0]:
#Predicting on the test set

# argmax over the softmax outputs gives the predicted class index for each tweet
y_pred_deep = np.argmax(model.predict(x_test), axis=1)

In [0]:
#Saving file as CSV

import numpy as np
import pandas as pd
deepmodel = pd.DataFrame(y_pred_deep, columns=['Prediction'])
deepmodel.to_csv('/content/drive/My Drive/ML Project/tweet-emotion-detection/deepmodel.csv')

# ***7) A Description of the Selected Deep Learning Model***

For the final model, I used 2 dense layers and 2 dropout layers, as this gave the best accuracy I could achieve: an average of 44% on the validation set, 54.65% on the public test set and 64.059% on the private test set. The hyperparameters were chosen by varying the number of epochs and the batch size, and the best results came from epochs=10 and batch_size=10. The ReLU activation function is used in the hidden layer because it outputs zero for negative inputs, so not all neurons are active at the same time, and the softmax output layer produces a probability for each of the four classes. The loss function is categorical crossentropy, which suits single-label, multi-class classification with one-hot encoded labels. The Adam optimizer is used as it is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments. The gap between the training accuracy (0.9649) and the validation accuracy (0.4479) indicates that the model overfits the training set despite the dropout layers.
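
To show how the softmax output and the categorical crossentropy loss fit together, here is a small pure-Python sketch. The logits are made-up numbers and the function names are hypothetical; it illustrates the formulas only and is not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))

probs = softmax([2.0, 1.0, 0.1, -1.0])  # made-up scores for the 4 emotions
loss = categorical_crossentropy([1, 0, 0, 0], probs)
uniform_loss = categorical_crossentropy([0, 1, 0, 0], softmax([0.0, 0.0, 0.0, 0.0]))
print(probs, loss, uniform_loss)
```

A model that spreads its probability evenly over the four classes has a loss of ln(4), roughly 1.386, which gives a reference point for reading the loss values during training.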

In addition to the final model, I also tried an ANN with a global average pooling layer. This performed almost as well as the final model (accuracy of 53% on the public test set). The gap in performance may be because average pooling assumes a single mode with a single centroid, while our distribution has more than one mode as well as outliers, which makes the average a less accurate summary.

# ***8) Discussion Between the Performances of the Two Models***

Comparing my final conventional ML and deep learning models, the deep learning model performed better by roughly 2.5 percentage points on the public test set (54.65% versus 52%). The deep learning model ranked #36 out of 57 submissions on the public test set; the top-performing system had 78.86% accuracy, and the majority of the accuracies lay between 53% and 59%.

On the private test set, my best model had an accuracy of 64.059%, which ranked #29 out of 49 submissions. The top-performing system had 100% accuracy and a majority class baseline had 64% accuracy, so my best model scored only marginally above that baseline.
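
A majority class baseline always predicts the most frequent class, so its accuracy equals the share of that class in the evaluated set. The sketch below shows the calculation on hypothetical labels; the private test labels are not available in this notebook, so `toy_true` and `toy_pred` are illustrative only.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels and predictions, for illustration only.
toy_true = [0, 0, 0, 1, 2]
toy_pred = [0, 0, 1, 1, 2]
baseline = majority_baseline_accuracy(toy_true)
model_accuracy = accuracy(toy_true, toy_pred)
print(baseline, model_accuracy)
```

Comparing a model's accuracy with this baseline on the same set shows how much it has learned beyond the class frequencies.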

******
***Validation Set vs Public Test Set***

For both the best Conventional ML Model and the Deep Learning Model, accuracy was higher on the public test set than on the validation set. Overfitting to the validation data would normally produce the opposite pattern, so the gap more likely reflects a difference between the two sets (for example in class balance or difficulty), or simply the variance expected from evaluating on fairly small sets. For the Deep Model, dropout is sometimes given as a reason for higher accuracy at test time: during training the Dropout layer randomly switches off a subset of units, acting as regularization, while at test time it is disabled so that all of the ‘weak classifiers’ in the network contribute. That effect applies equally to the validation and public test sets, though, so it cannot explain the gap between them, and in this notebook the training accuracy (0.9649) was in fact far higher than the validation accuracy (0.4479). The results also indicate that our dataset might be slightly small.
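
The difference between dropout at training time and at evaluation time can be sketched in a few lines. This is a simplified illustration of inverted dropout, in which the kept units are scaled by 1 / (1 - rate) during training; the `dropout` function is a hypothetical stand-in written for this example, not the Keras layer itself.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero units at random in training, pass through otherwise."""
    if not training:
        return list(values)  # evaluation: every unit is kept unchanged
    keep_prob = 1.0 - rate
    # Each unit is dropped with probability `rate`; kept units are scaled up
    # so that the expected activation stays the same.
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

activations = [1.0] * 8
train_out = dropout(activations, rate=0.5, training=True, rng=random.Random(0))
eval_out = dropout(activations, rate=0.5, training=False, rng=random.Random(0))
print(train_out)
print(eval_out)
```

Because evaluation on the validation set and on the test sets both use the pass-through branch, dropout behaves identically for all of them.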


******
***Public Test Set vs Private Test Set***

While the deep model gave an accuracy of 54.65% on the public set, it gave an accuracy of 64% on the private set. Since the private-set score is the higher one, this difference is unlikely to come from overfitting to the public set, which would lower the private-set score instead. A more likely explanation is that the two sets differ in composition: the majority class baseline alone reaches 64% on the private set. The model's generalization could still be improved by increasing the amount of training data and using normalization techniques. The difference in accuracies can also be related to the bias of the predictor when applied to the training data versus the testing data.

******

I believe that in this competition, the models with very high accuracies on the public test set did not perform as well on the private test set because competitors overfitted to the validation set and the public test set, tuning their models to fit the public test set in particular. This led to lower accuracy on the unseen private test set, which was different from the public test set. Adding more training data can help with this. It is therefore necessary to balance bias and variance in order to obtain a model that generalizes well.