<a href="https://colab.research.google.com/github/vikramrajan28/MachineLearning_TweetEmotionDetection/blob/main/COMP8220_ML_MajorProject_45763054.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>COMP7220/8220 Machine Learning</center>
## <center> Major Project</center>
##### <center> Vikram Rajan : 45763054</center>

### INTRODUCTION

In this project, I use a language-domain dataset to classify tweets by emotion. The data is explored and preprocessed with techniques such as stop-word removal, lemmatisation and POS tagging. The project covers statistical approaches from both conventional machine learning and deep learning.
 The dataset combines the data provided for the Kaggle competition with external data from <a href="https://competitions.codalab.org/competitions/17751#learn_the_details">the Affect in Tweets task at SemEval in 2018</a>.  
 
 **Conventional Model**  
 The SVC model performed much better than the other conventional models, reaching an accuracy of 68.8% in the Kaggle submission. An *SVC (Support Vector Classifier)* fits the training data by finding the "best fit" hyperplane that separates the classes; the features of a new tweet are then fed to the classifier to obtain its predicted class. The other conventional models evaluated are *Logistic Regression, Naive Bayes and SGD Classifier*.
 - A TF-IDF vectorizer is used for feature extraction. It transforms the words of each tweet into a feature vector, which is then provided as input to the model. It is similar to CountVectorizer but weights each term by how rare it is across tweets, and it performed well in this case.
  
**Deep Learning Model**  
 Keras supports two main ways of building a model: *Sequential and Functional*. Here we use the Sequential model, which is a linear stack of layers. The dense neural network with a *Keras embedding layer and a GlobalMaxPooling1D layer (a way to downsample the incoming feature vectors)* delivers the highest accuracy, 72.2%, in Kaggle submissions on the public test data. The Keras embedding layer takes the previously calculated integer word indices and maps each one to a dense word-embedding vector.
 - The loss function used to compile the Sequential model is *categorical_crossentropy*, because there are four classes (anger, fear, joy and sadness).
 - The metric used is *Categorical Accuracy*.
 - A CNN model was also implemented, but it gave lower accuracy on the validation data than the model above.

#### Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
import pickle
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from collections import defaultdict
from textblob import TextBlob, Word
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('wordnet')
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from nltk import PorterStemmer
from nltk.corpus import wordnet as wn
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras import layers
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import LSTM
from keras.layers import Dropout
from keras.utils import to_categorical


nltk.download('averaged_perceptron_tagger')
pd.set_option('max_colwidth', 400)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Using TensorFlow backend.


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


#### Importing Datasets

In [None]:
# Preprocessed Dataset provided in the Kaggle Competition
train_tweets = np.load('text_train_tweets.npy') # Train data
train_labels = np.load('text_train_labels.npy') # Train Labels
val_tweets = np.load('text_val_tweets.npy') # Validation data 
val_labels = np.load('text_val_labels.npy') # Validation Labels
public_data = np.load('text_test_public_tweets_rand.npy') # Test public data
print("Train Labels Y : ",len(train_labels),"\nTrain Tweets X : ",len(train_tweets))

# Loading the word-to-index pickle file and inverting it into an index-to-word lookup
with open("text_word_to_idx.pkl",'rb') as infile:
    old_dict = pickle.load(infile)
new_dict = {value: key for key, value in old_dict.items()}

# Function to concatenate three dataframes
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
def emMerge(x,y,z):
    return pd.concat([x, y, z], ignore_index=True, sort=True)

# Label Classification:

# Loading/Importing External Datasets
anger = pd.read_excel('anger_train.xls')
joy = pd.read_excel('joy.xls')
fear = pd.read_excel('fear.xls')
sad = pd.read_excel('sadness.xls')

angert = pd.read_excel('anger-test-gold.xls')
joyt = pd.read_excel('joy-test-gold.xls')
feart = pd.read_excel('fear-test-gold.xls')
sadt = pd.read_excel('sadness-test-gold.xls')

angerd = pd.read_excel('anger-dev.xls')
joyd = pd.read_excel('joy-dev.xls')
feard = pd.read_excel('fear-dev.xls')
sadd = pd.read_excel('sadness-dev.xls')


angerSet = emMerge(anger,angert,angerd)
joySet = emMerge(joy,joyt,joyd)
fearSet = emMerge(fear,feart,feard)
sadSet = emMerge(sad,sadt,sadd)


# Combining the four emotion sets into one external dataframe
extdata = pd.concat([angerSet, fearSet, joySet, sadSet], ignore_index=True, sort=True)
extdata['Content']=extdata['Tweet']
extdata = extdata.drop(columns=['ID', 'Intensity Class', 'Tweet'])
print("External data shape: ",extdata.shape)

Train Labels Y :  7098 
Train Tweets X :  7098
External data shape:  (12634, 2)


**Label Encoding** for the external datasets, whose labels are the strings anger, fear, joy and sadness.
- anger = 0 , fear = 1, joy = 2, sadness =3
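
As a dependency-free sketch of why the mapping comes out this way: `LabelEncoder` assigns integer codes to the distinct labels in sorted order, which for these four strings is alphabetical. The helper name `encode_labels` is hypothetical and only mimics that behaviour.

```python
def encode_labels(labels):
    """Map each distinct label to its index in sorted order (hypothetical helper)."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

codes, mapping = encode_labels(["joy", "anger", "sadness", "fear", "anger"])
print(mapping)
print(codes)
```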

In [None]:
lbl_enc = preprocessing.LabelEncoder()
extdata_y = lbl_enc.fit_transform(extdata['Affect Dimension'].values)
extdata_y = pd.DataFrame(extdata_y)
extdata['Sentiment']=extdata_y[0]
extdata = extdata.drop(columns=['Affect Dimension'])
extdata['Content']=extdata['Content'].astype(str)
extdata.head()

Unnamed: 0,Content,Sentiment
0,@xandraaa5 @amayaallyn6 shut up hashtags are cool #offended,0
1,it makes me so fucking irate jesus. nobody is calling ppl who like hajime abusive stop with the strawmen lmao,0
2,Lol Adam the Bull with his fake outrage...,0
3,@THATSSHAWTYLO passed away early this morning in a fast and furious styled car crash as he was leaving an ATL strip club. That's rough stuff,0
4,@Kristiann1125 lol wow i was gonna say really?! haha have you seen chris or nah? you dont even snap me anymore dude!,0


### Pre Processing 

Preprocessing function for external datasets.

In [None]:
def preproccessingfnExt(dataset):
    train_tweets_data=dataset
    # making all the tweets lower case
    train_tweets_data['Content'] = train_tweets_data['Content'].apply(lambda x: x.lower())
    # Tokenizing the tweets
    train_tweets_data['Content'] = train_tweets_data['Content'].apply(lambda x: word_tokenize(x))
    
    # Initialising the tag map for pos_tagging
    tag_map = defaultdict(lambda : wn.NOUN) 
    tag_map['N'] = wn.NOUN
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB 
    tag_map['R'] = wn.ADV

    stop_words = set(stopwords.words('english')) # Built once instead of once per word
    word_lemmatized = WordNetLemmatizer() # Initialising Lemmatizer
    for index, entry in enumerate(train_tweets_data['Content']):
        final_words = [] 
        for word, tag in pos_tag(entry):
            if word not in stop_words and word.isalpha():# Removal of stop words and special characters
                word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]])
                final_words.append(word_final) 
        train_tweets_data.loc[index, 'Content'] = str(final_words)
    
    # Reforming the sentence/tweet
    train_tweets_data['Content'] = train_tweets_data['Content'].str.replace(',',' ')
    # regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
    train_tweets_data['Content'] = train_tweets_data['Content'].str.replace(r'[^\w\s]','',regex=True)
    
    # Stemming
    tokenized_tweet = train_tweets_data['Content'].apply(lambda x: x.split())
    ps = PorterStemmer()
    tokenized_tweet = tokenized_tweet.apply(lambda x: [ps.stem(i) for i in x])
    # Reforming the final tweets for training
    for i in range(len(tokenized_tweet)):
        tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
    train_tweets_data['Content'] = tokenized_tweet
    
    
    return train_tweets_data
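
The POS-tag lookup in the function above relies on a `defaultdict`: the first letter of the tagger's tag selects the WordNet part of speech, and any other prefix falls back to noun. The sketch below writes the WordNet codes as string literals (`'n'`, `'a'`, `'v'`, `'r'`, the values behind `wn.NOUN`, `wn.ADJ`, `wn.VERB`, `wn.ADV`) so that it runs without NLTK; the example tags are illustrative only.

```python
from collections import defaultdict

# WordNet part-of-speech codes, written as literals so NLTK is not needed here.
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

tag_map = defaultdict(lambda: NOUN)  # unknown tag prefixes default to noun
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Penn Treebank style tags, chosen for illustration only.
example_tags = ["NN", "JJ", "VBD", "RB", "PRP"]
wordnet_pos = [tag_map[tag[0]] for tag in example_tags]

for tag, pos in zip(example_tags, wordnet_pos):
    print(tag, "->", pos)
```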

Preprocessing function for the given training dataset.

In [None]:
def preproccessingfn(dataset):
    content =[]
    # Regeneration of the tweets from the text_word_to_idx pickle file.
    for j in dataset:
      tw = contenttext(j)
      content.append(tw)
    
    # Converting the data to a dataframe.
    train_tweets_df = pd.DataFrame(content)
    # removal of null and other unnecessary words/values from the tweets.
    train_tweets_df = train_tweets_df.replace( "<NULL>",np.nan)
    train_tweets_df = train_tweets_df.replace( "<START>",np.nan)
    train_tweets_df = train_tweets_df.replace( "<END>",np.nan)
    train_tweets_df = train_tweets_df.replace( "<user>",np.nan)

    train_tweets_df[0] = train_tweets_df[train_tweets_df.columns[:]].apply( lambda x: ','.join(x.dropna().astype(str)),axis=1)
    train_tweets_data = pd.DataFrame()
    train_tweets_data['Content1']=train_tweets_df[0]
    # making all the tweets lower case
    train_tweets_data['Content1'] = train_tweets_data['Content1'].apply(lambda x: x.lower())
    # Tokenizing the tweets
    train_tweets_data['Content1'] = train_tweets_data['Content1'].apply(lambda x: word_tokenize(x))
    
    # Initialising the tag map for pos_tagging
    tag_map = defaultdict(lambda : wn.NOUN) 
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB 
    tag_map['R'] = wn.ADV

    stop_words = set(stopwords.words('english')) # Built once instead of once per word
    word_lemmatized = WordNetLemmatizer() # Initialising Lemmatizer
    for index, entry in enumerate(train_tweets_data['Content1']):
        final_words = [] 
        for word, tag in pos_tag(entry):
            if word not in stop_words and word.isalpha():# Removal of stop words and special characters
                word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]])
                final_words.append(word_final) 
        train_tweets_data.loc[index, 'Content'] = str(final_words)
    
    # Reforming the sentence/tweet
    train_tweets_data['Content'] = train_tweets_data['Content'].str.replace(',',' ')
    # regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
    train_tweets_data['Content'] = train_tweets_data['Content'].str.replace(r'[^\w\s]','',regex=True)
    train_tweets_data = train_tweets_data.drop(columns=['Content1'])
    
    
    # Stemming
    tokenized_tweet = train_tweets_data['Content'].apply(lambda x: x.split())
    ps = PorterStemmer()
    tokenized_tweet = tokenized_tweet.apply(lambda x: [ps.stem(i) for i in x])
    # Reforming the final tweets for training
    for i in range(len(tokenized_tweet)):
        tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
    train_tweets_data['Content'] = tokenized_tweet
    
    return train_tweets_data

def contenttext(x):
  xyz=[]
  for i in x:
    xyz.append(new_dict[i])
  return xyz
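
To illustrate the regeneration step, here is a small sketch of decoding a row of word indices back into text and dropping the special tokens. The toy vocabulary and its index values are assumptions made for the example; the real mapping comes from `text_word_to_idx.pkl`.

```python
# Toy vocabulary (assumed); the real one is loaded from text_word_to_idx.pkl.
word_to_idx = {"<NULL>": 0, "<START>": 1, "<END>": 2, "happy": 3, "birthday": 4}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

SPECIAL_TOKENS = {"<NULL>", "<START>", "<END>", "<user>"}

def decode(indices):
    """Turn a row of word indices back into text, skipping special tokens."""
    words = [idx_to_word[i] for i in indices]
    return " ".join(w for w in words if w not in SPECIAL_TOKENS)

decoded = decode([1, 3, 4, 2, 0, 0])
print(decoded)
```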

Processing external data:

In [None]:
external_data = preproccessingfnExt(extdata)
print("Shape of External data: ",external_data.shape)
external_data.head()

Shape of External data:  (12634, 2)


Unnamed: 0,Content,Sentiment
0,shut hashtag cool offend,0
1,make fuck irat jesu nobodi call ppl like hajim abus stop strawman lmao,0
2,lol adam bull fake outrag,0
3,thatsshawtylo pass away earli morn fast furiou style car crash leav atl strip club rough stuff,0
4,lol wow gon na say realli haha see chri nah dont even snap anymor dude,0


Processing the given train data

In [None]:
train_labels_Y = pd.DataFrame(train_labels, columns=['Sentiment'])
train_data = preproccessingfn(train_tweets)

# Joining the processed train data  and its labels into a dataframe.
train_set = pd.concat([train_data, train_labels_Y], axis=1)
print("Shape of given Train data: ",train_set.shape)
train_set.head()

Shape of given Train data:  (7098, 2)


Unnamed: 0,Content,Sentiment
0,make fuck irat jesu nobodi call ppl like hajim abus stop strawman lmao,0
1,lol adam bull fake outrag,0
2,pass away earli morn fast furiou style car crash leav atl strip club rough stuff,0
3,lol wow gon na say realli haha see chri nah dont even snap anymor dude,0
4,need sushi date oliv guard date rocki date,0


Processing the given validation data(same as train data)

In [None]:
val_lab = pd.DataFrame(val_labels, columns=['Sentiment'])
val_data = preproccessingfn(val_tweets)

# Joining the processed validation data  and its labels into a dataframe.
val_set = pd.concat([val_data, val_lab], axis=1)
print("Shape of given Validation data: ",val_set.shape)
val_set.head()

Shape of given Validation data:  (1460, 2)


Unnamed: 0,Content,Sentiment
0,fume hijack money move full back,0
1,nightmar dream freedom,0
2,cnn realli need get busi number second fall soldier tragedi right back hatr potu,0
3,kikm horni kik nude girl horni snap,0
4,fuck tag pictur famili first cut number year ago one,0


Joining the provided train dataframe and the external dataframe to form a final large training dataframe

In [None]:
exttrain_set=pd.concat([train_set, external_data],ignore_index=True,sort=True)
print("Shape of  Final Training data: ",exttrain_set.shape)
exttrain_set.head()

Shape of  Final Training data:  (19732, 2)


Unnamed: 0,Content,Sentiment
0,make fuck irat jesu nobodi call ppl like hajim abus stop strawman lmao,0
1,lol adam bull fake outrag,0
2,pass away earli morn fast furiou style car crash leav atl strip club rough stuff,0
3,lol wow gon na say realli haha see chri nah dont even snap anymor dude,0
4,need sushi date oliv guard date rocki date,0


### Conventional Machine Learning Model

The SVC model produced the best conventional-model accuracy, 68.8%, in the Kaggle submission.

TF-IDF feature extraction is used to convert each input tweet into a feature vector.

In [None]:
# TF IDF Feature Extraction Function
def featureExtract(para, host):
    tf_idf = TfidfVectorizer(max_features=20000)
    tf_idf.fit(host)
    para_transformed = tf_idf.transform(para)
    return para_transformed
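
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenisation, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which to my understanding matches the `TfidfVectorizer` defaults. The helper `tfidf` and the three toy tweets are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation, on whitespace tokens."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in tokenised for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(["lol fake outrag", "lol wow haha", "fake fake news"])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first toy tweet, "outrag" occurs in only one document, so it receives a larger weight than "lol" or "fake", which each occur in two.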
    

Different conventional models were trained, and the one with the highest validation accuracy was selected.
The models are: Naive Bayes, SVC, SGDClassifier and Logistic Regression.

**Naive Bayes Model**

In [None]:
def naivesBayesModel(comb,X):    
    t_tf = featureExtract(X['Content'],comb['Content']) # Feature extraction for the train tweets
    nb = MultinomialNB()
    # Fit the training dataset.
    nb.fit(t_tf, X['Sentiment'])
    return nb


**SVC Model**

In [None]:
def SVM_Model(comb,X):    
    t_tf = featureExtract(X['Content'],comb['Content']) # Feature extraction for the train tweets
    nb = SVC(C=10.0, kernel='linear', degree=3, gamma='scale')
    # Fit the training dataset.
    nb.fit(t_tf, X['Sentiment'])
    return nb

**SGD Classifier Model**

In [None]:
def SGDClassifierModel(comb,X):    
    t_tf = featureExtract(X['Content'],comb['Content']) # Feature extraction for the train tweets
    nb = SGDClassifier()
    # Fit the training dataset.
    nb.fit(t_tf, X['Sentiment'])
    return nb

**Logistic Regression Model**

In [None]:
def LogisticRegressionModel(comb,X):    
    t_tf = featureExtract(X['Content'],comb['Content']) # Feature extraction for the train tweets
    nb = LogisticRegression(max_iter=400)
    # Fit the training dataset.
    nb.fit(t_tf, X['Sentiment'])
    return nb

Function for fitting the model, evaluating accuracy over validation dataset and predicting the labels for test or private data.

In [None]:
def predictionSets( X,dataset,ml,acScore ):
    comb = pd.concat([X, dataset],ignore_index=True,sort=True)
    model=None
    # To identify which model is invoked
    if ml=='NaivesBayes':
        model=naivesBayesModel(comb,X)
    elif ml=='SVM':
        model=SVM_Model(comb,X)
    elif ml=='SGD':
        model=SGDClassifierModel(comb,X)
    elif ml=='LogR':
        model=LogisticRegressionModel(comb,X)
    print("Processing Model: ",ml)
    # Predict the labels on the validation dataset.
    test = featureExtract(dataset['Content'],comb['Content'])
    y_predict = model.predict(test)
    # Use the accuracy_score function to get the accuracy,
    if(acScore):
        print(ml," Accuracy Score for ",ml," --> ", accuracy_score( dataset['Sentiment'],y_predict)*100)
        target_names = ['anger', 'fear', 'joy', 'sadness']
        print(ml," Classification_report ",ml,"-->\n",classification_report(dataset['Sentiment'], y_predict, target_names=target_names))
    return y_predict

Evaluating Models with respect to validation accuracy

In [None]:
# Evaluating Naives Bayes Model:
nbResult = predictionSets(exttrain_set,val_set,'NaivesBayes',True)

# Evaluating SVC Model:
svcResult = predictionSets(exttrain_set,val_set,'SVM',True)

# Evaluating SGD CLASSIFIER Model:
sgdResult = predictionSets(exttrain_set,val_set,'SGD',True)

# Evaluating Logistic regression Model:
logRResult = predictionSets(exttrain_set,val_set,'LogR',True)

Processing Model:  NaivesBayes
NaivesBayes  Accuracy Score for  NaivesBayes  -->  47.26027397260274
NaivesBayes  Classification_report  NaivesBayes -->
               precision    recall  f1-score   support

       anger       0.48      0.42      0.45       387
        fear       0.37      0.64      0.47       388
         joy       0.76      0.56      0.65       289
     sadness       0.48      0.30      0.37       396

    accuracy                           0.47      1460
   macro avg       0.53      0.48      0.48      1460
weighted avg       0.51      0.47      0.47      1460

Processing Model:  SVM
SVM  Accuracy Score for  SVM  -->  54.31506849315069
SVM  Classification_report  SVM -->
               precision    recall  f1-score   support

       anger       0.51      0.48      0.49       387
        fear       0.45      0.57      0.50       388
         joy       0.81      0.78      0.79       289
     sadness       0.50      0.41      0.45       396

    accuracy               

Determining the best model: comparing the accuracy of the four models above, it is evident that the SVC model performs best, with an accuracy of ~54%. Thus, the SVC model is selected as the best model.

**Best parameter estimation**: GridSearchCV is used to estimate the best parameters for the best-performing model, the SVC model selected above for its validation accuracy.

In [None]:
# Estimation of best parameters for the SVC model
def gridSearchCV(X,dataset):   
    comb = pd.concat([X, dataset],ignore_index=True,sort=True)
    t_tf = featureExtract(X['Content'],comb['Content'])
    lr_params = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'linear']}
    # Define the gridsearchCV
    lr_grid = GridSearchCV(SVC(), param_grid=lr_params, cv=3, refit=True,verbose=1)
    lr_grid.fit(t_tf, X['Sentiment'])
    print ('Best Score:', lr_grid.best_score_)
    print (lr_grid.best_estimator_)
    print ('Best Params:', lr_grid.best_params_)

In [None]:
gridSearchCV(exttrain_set,val_set)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed: 55.2min finished


Best Score: 0.8385805117134982
SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
Best Params: {'C': 10, 'gamma': 1, 'kernel': 'rbf'}
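
The fit count in the log above follows directly from the grid: 3 values of `C`, 4 of `gamma` and 3 kernels give 36 candidates, and 3-fold cross-validation fits each one three times. A short sketch that enumerates the same grid with the standard library (the variable names here are my own):

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf", "poly", "linear"],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", total_fits, "fits")
```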


**Final model with improved parameters**

In [None]:
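# Overwriting the SVM model function with the RBF kernel and C=10 found by the grid search
def SVM_Model(comb,X):    
    t_tf = featureExtract(X['Content'],comb['Content']) # Feature extraction for the train tweets
    nb = SVC(C=10.0, kernel='rbf', degree=3, gamma='scale') # Note: gamma='scale' is kept here; the grid search reported gamma=1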
    # Fit the training dataset.
    nb.fit(t_tf, X['Sentiment'])
    return nb

In [None]:
# Evaluating the SVC model with the improved parameters:
svcResult = predictionSets(exttrain_set,val_set,'SVM',True)

Processing Model:  SVM
SVM  Accuracy Score for  SVM  -->  56.91780821917808
SVM  Classification_report  SVM -->
               precision    recall  f1-score   support

       anger       0.54      0.48      0.51       387
        fear       0.46      0.58      0.51       388
         joy       0.88      0.83      0.86       289
     sadness       0.51      0.45      0.48       396

    accuracy                           0.57      1460
   macro avg       0.60      0.59      0.59      1460
weighted avg       0.58      0.57      0.57      1460



With the best parameters, the validation accuracy increases from 54.32% to 56.92%.

**Predicting the labels for public test dataset.**

In [None]:
# Preprocessing the test data similarly as the train data was performed
pub_test_data = preproccessingfn(public_data)
print("Shape of  Final public data: ",pub_test_data.shape)
pub_test_data.head()

Shape of  Final public data:  (4064, 1)


Unnamed: 0,Content
0,omg mother daughter dull ni move dad worri
1,happi birthday repeat miss excit back florida hear turkey date repeat
2,ever cri middl bomb rest someon wake emerg sleep cri
3,mental suffer worthless pain
4,courag driver shot bu show courag natur scare death threat make flee fast


In [None]:
# Function call to predict the labels for public dataset
finTestPredit = predictionSets(exttrain_set,pub_test_data,'SVM',False)


Processing Model:  SVM


In [None]:
# Functions to write the predictions to a csv file in order to upload the output to the Kaggle competition.
def writeToCSV(result):
    output = pd.DataFrame(result,columns=['Prediction'])
    output.index += 1 
    output.index.name='ID'
    output.to_csv('public_45763054-conv.csv')
    print(output.head())
def writeToCSVpriv(result):
    output = pd.DataFrame(result,columns=['Prediction'])
    #output.index += 1 
    output.index.name='ID'
    output.to_csv('private_45763054-conv.csv')
    print(output.head())

In [None]:
writeToCSV(finTestPredit)

    Prediction
ID            
1            3
2            2
3            3
4            3
5            1


**Predicting Private Data**

In [None]:
private_data = np.load('text_test_private_tweets.npy')
priv_data = preproccessingfn(private_data)
print("Shape of  Final Private data: ",priv_data.shape)
priv_data.head()

Shape of  Final Private data:  (4257, 1)


Unnamed: 0,Content
0,whatev decid make sure make happi
1,accept challeng liter even feel exhilar victori georg patton
2,roommat okay spell autocorrect terribl firstworldprob
3,cute atsu probabl shi photo cherri help uwu
4,rooney fuck untouch fuck dread depay look decentishtonight


In [None]:
finPrivatePredit = predictionSets(exttrain_set,priv_data,'SVM',False)

Processing Model:  SVM


In [None]:
writeToCSVpriv(finPrivatePredit)

    Prediction
ID            
0            2
1            2
2            1
3            1
4            1


### Deep Learning Model

The final model that produced the best-performing predictions for the Kaggle submission (accuracy 72.2%) was a fully connected dense model with 4 NN layers. The input is the same preprocessed data used for the conventional model, combining the external data and the given train data.
Unlike the conventional model, here the Keras tokenizer converts each word in a tweet to an integer index. The input labels are also converted into a one-hot matrix with one column per class (anger, fear, joy and sadness) using the *to_categorical* function.

In [None]:
# Copying the tweet contents and sentiment(labels) into  arrays.
contents = exttrain_set["Content"].values
labels = exttrain_set["Sentiment"].values
labels = to_categorical(labels) # Conversion of training labels for input to the NN model

# Train Validation split of the final dataset
xtrain, xVal, y_train, y_val = train_test_split(contents, labels, test_size=0.20, random_state=1000)
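
For reference, a small sketch of the one-hot encoding that `to_categorical` produces for integer class labels. The helper `one_hot` is hypothetical and returns plain Python lists rather than an array.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column (hypothetical helper)."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]

# 0 = anger, 1 = fear, 2 = joy, 3 = sadness
encoded = one_hot([0, 2, 3, 1], num_classes=4)
for row in encoded:
    print(row)
```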

- **Here, for the initial accuracy calculation used to find the best model, I treat the given validation data as test data and split the final training data into training and validation parts.**  

In [None]:
# Performing the same tasks in above step for validation data
y_test = val_set["Sentiment"].values
Xtest = val_set["Content"].values
y_test = to_categorical(y_test)

In [None]:
# Use of Tokenizer for feature  extraction
tokeniz = Tokenizer(num_words=5000)
tokeniz.fit_on_texts(xtrain)

X_train = tokeniz.texts_to_sequences(xtrain)
X_test = tokeniz.texts_to_sequences(Xtest)
X_val = tokeniz.texts_to_sequences(xVal)

vocab_size = len(tokeniz.word_index) + 1  # Adding 1 because of reserved 0 index

The sequences produced by the tokenizer in the step above have different lengths. Therefore, every sequence is padded to a maximum length of 50.

In [None]:
maxlen = 50

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_val = pad_sequences(X_val, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
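
The effect of `padding='post'` with a fixed `maxlen` can be sketched without Keras. The helper `pad_post` below is hypothetical; it appends zeros to short rows and, for rows that are too long, keeps the last `maxlen` items, which is how `pad_sequences` behaves when `truncating` is left at its default of `'pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; keep the last `maxlen` items of longer rows."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

padded = pad_post([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
print(padded)
```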

#### Creating the NN Model
For the deep learning implementation, I evaluated a dense neural network and a CNN. The simple dense neural network below has 4 layers: an embedding layer, a GlobalMaxPool layer, a dense ReLU layer and finally a softmax output layer.
This model is compiled with the *categorical crossentropy* loss function because the output dense layer has four classes (anger, sadness, joy and fear). A softmax output layer is used because it produces a probability distribution over more than two classes.
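
As a rough illustration of how the softmax output and the categorical cross-entropy loss fit together, here is a pure-Python sketch for a single sample. The logits are made-up numbers, not values from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_prob):
    """Negative log-probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

probs = softmax([2.0, 0.5, 0.1, -1.0])  # made-up scores for the four classes
loss = categorical_crossentropy([1.0, 0.0, 0.0, 0.0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward zero as the probability of the true class approaches 1, which is what training pushes the network toward.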

In [None]:
embedding_dim = 100

mod = Sequential()

mod.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen,trainable=True))

mod.add(layers.GlobalMaxPool1D())
mod.add(layers.Dense(10, activation='relu'))
mod.add(layers.Dense(4, activation='softmax'))
mod.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
mod.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 100)           1528700   
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1010      
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 44        
Total params: 1,529,754
Trainable params: 1,529,754
Non-trainable params: 0
_________________________________________________________________
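
The parameter counts in the summary can be checked by hand. In the sketch below, the vocabulary size of 15,287 is inferred from the summary (1,528,700 embedding parameters divided by 100 dimensions) rather than printed anywhere in the notebook.

```python
embedding_dim = 100
vocab_size = 15287  # inferred: 1,528,700 embedding params / 100 dimensions

embedding_params = vocab_size * embedding_dim  # one vector per vocabulary entry
pooling_params = 0                             # global max pooling has no weights
dense1_params = embedding_dim * 10 + 10        # weights + biases
dense2_params = 10 * 4 + 4                     # weights + biases
total = embedding_params + pooling_params + dense1_params + dense2_params
print(embedding_params, dense1_params, dense2_params, total)
```

Almost all of the trainable parameters sit in the embedding layer, which may help explain the gap between training and testing accuracy reported below.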


In [None]:
# Fitting the model with training data
history = mod.fit(X_train, y_train,
                    epochs=10,
                    verbose=1,
                    validation_data=(X_val, y_val),
                    batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 15785 samples, validate on 3947 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Evaluation of the model for training and validation accuracy**

In [None]:
loss, accuracy = mod.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy:  ",accuracy)
loss, accuracy = mod.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: ",accuracy)

Training Accuracy:   0.9095343947410583
Testing Accuracy:  0.5554794669151306


- Testing Accuracy shown above is the accuracy of the model over the given validation data.

#### Convolutional Neural Networks (CNN)  
The loss function and metric are the same as in the previous model, and the output layer is again a softmax layer.  
For the CNN model, an extra 1-D convolution layer is introduced. (Only one dimension is needed because text is a one-dimensional sequence of tokens.)


In [None]:
embedding_dim = 50

modelCNN = Sequential()
modelCNN.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
modelCNN.add(layers.Conv1D(128, 5, activation='relu'))
modelCNN.add(layers.GlobalMaxPooling1D())
modelCNN.add(layers.Dense(10, activation='relu'))
modelCNN.add(layers.Dense(4, activation='softmax'))
modelCNN.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
modelCNN.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 50)            764350    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 46, 128)           32128     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290      
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 44        
Total params: 797,812
Trainable params: 797,812
Non-trainable params: 0
_________________________________________________________________
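
To show what the `Conv1D` and global max pooling layers compute, here is a single-filter sketch in pure Python on toy data. The helper `conv1d_valid`, the input values and the kernel weights are all made up for illustration. The last two lines reproduce the output length and parameter count from the summary above.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Single-filter 1-D convolution (valid padding, stride 1) followed by ReLU.

    sequence: list of time steps, each a list of channel values
    kernel:   list of kernel_size rows, each with one weight per channel
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias + sum(
            w * x
            for k_row, s_row in zip(kernel, window)
            for w, x in zip(k_row, s_row)
        )
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs

# Toy input: 6 time steps with 2 channels each (stands in for 50 steps x 50 dims).
sequence = [[1, 0], [0, 1], [2, 1], [0, 0], [1, 3], [1, 1]]
kernel = [[1, -1], [0, 1], [1, 0]]  # kernel_size 3, made-up weights

feature_map = conv1d_valid(sequence, kernel)
pooled = max(feature_map)  # global max pooling over the time axis
print(feature_map, pooled)

# Shape and parameter arithmetic for the Conv1D layer in the summary above.
output_length = 50 - 5 + 1        # maxlen - kernel_size + 1
conv_params = 5 * 50 * 128 + 128  # kernel_size * input dims * filters + biases
print(output_length, conv_params)
```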


In [None]:
history = modelCNN.fit(X_train, y_train,
                    epochs=20,
                    verbose=1,
                    validation_data=(X_val, y_val),
                    batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 15785 samples, validate on 3947 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


**Evaluation of the model for training and validation accuracy**

In [None]:
loss, accuracy = modelCNN.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy:  ",accuracy)
loss, accuracy = modelCNN.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  ",accuracy)


Training Accuracy:   0.9109280705451965
Testing Accuracy:   0.5472602844238281


- Testing Accuracy shown above is the accuracy of the model over the given validation data.


**Best Deep Learning Model**  
Comparing the accuracy on the validation data for the dense and CNN models above, it is clear that the first dense model performs better than the CNN model by approximately 0.82 percentage points. Therefore, the dense neural network is selected as the best model.

**Predicting sentiments for the best NN model**

In [None]:
# Public Data set
test = pub_test_data["Content"].values
XTest = tokeniz.texts_to_sequences(test) # feature extraction for public test data set
XTest = pad_sequences(XTest, padding='post', maxlen=maxlen) # padding of maxlen =50

# Private Data set
private = priv_data["Content"].values
Xprivate = tokeniz.texts_to_sequences(private)# feature extraction for private test data set
Xprivate = pad_sequences(Xprivate, padding='post', maxlen=maxlen) # padding of maxlen =50

In [None]:
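# Predicting for public data
# (predict_classes was removed from recent Keras releases; argmax over the softmax output gives the same labels)
ypred_public = np.argmax(mod.predict(XTest,verbose=0), axis=1)
writeToCSV(ypred_public)

# Predicting for private data
ypred_private = np.argmax(mod.predict(Xprivate,verbose=0), axis=1)
writeToCSVpriv(ypred_private)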

    Prediction
ID            
1            3
2            2
3            0
4            3
5            1
    Prediction
ID            
0            2
1            2
2            1
3            1
4            1


## Discussion of Model Performance and Implementation


Comparing the results of the conventional model and the deep learning model, the deep learning model achieved the better accuracy on the public data set, 3.4 percentage points higher than the conventional model (72.2% versus 68.8%). The deep learning model ranked 2nd in Kaggle submissions on the public data set with an accuracy of 72.2%.

Both the conventional and the deep learning model performed much better on the public data set than on the validation data set, with a difference of ~(12-18)%, and of the two the deep learning model gave the better predictions.
Similarly, both models performed better on the private data set than on the validation set, with a gain in accuracy of ~(8-10)%, but this is less than the accuracy achieved on the public data set.
Thus, both models performed best on the public data set.

In conclusion, both conventional and deep learning models can deliver better performance with large training data sets, careful preprocessing and efficient optimization of the models (such as hyperparameter selection).