<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#EmoInt-dataset-for-emotion-detection-in-lyrics" data-toc-modified-id="EmoInt-dataset-for-emotion-detection-in-lyrics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>EmoInt dataset for emotion detection in lyrics</a></span><ul class="toc-item"><li><span><a href="#EmoInt-statistics" data-toc-modified-id="EmoInt-statistics-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>EmoInt statistics</a></span></li><li><span><a href="#Merge-with-MoodyLyrics" data-toc-modified-id="Merge-with-MoodyLyrics-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Merge with MoodyLyrics</a></span></li></ul></li><li><span><a href="#Features-Selection" data-toc-modified-id="Features-Selection-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Features Selection</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#k-Nearest-Neighbour" data-toc-modified-id="k-Nearest-Neighbour-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>k-Nearest Neighbour</a></span></li><li><span><a href="#SVM" data-toc-modified-id="SVM-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>SVM</a></span></li><li><span><a href="#Gradient-Boost" data-toc-modified-id="Gradient-Boost-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Gradient Boost</a></span></li><li><span><a href="#Artificial-Neural-Network" data-toc-modified-id="Artificial-Neural-Network-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Artificial Neural Network</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>References</a></span></li></ul></div>

In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV, cross_val_score

# EmoInt dataset for emotion detection in lyrics

Existing emotion datasets are mostly annotated categorically, with no indication of how intense the emotion is. EmoInt instead provides tweets annotated both with an emotion (anger, fear, joy, sadness) and with the degree to which that emotion is expressed in the text.

It is important to mention that EmoInt was manually annotated, using [Best-Worst Scaling](https://nparc.nrc-cnrc.gc.ca/eng/view/fulltext/?id=b132b0af-2ae0-4964-ac3a-493e7292a37a) (BWS), an annotation scheme shown to obtain very reliable scores.

For our purposes, we will treat each tweet as if it were a lyric and run the same feature engineering on it, using spaCy and the other tools we have used so far.

Our original dataset, MoodyLyrics, uses the labels "happy", "sad", "angry" and "relaxed". To obtain an intersection with EmoInt, we will therefore use only the tweets labelled with the anger, joy and sadness emotions.
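
The label alignment described above can be sketched as a simple lookup. The mapping is an assumption inferred from the label names (the actual conversion lives in `src/emoint_parser.py`, which is not shown here), and `EMOINT_TO_MOODYLYRICS` and `map_emotion` are hypothetical names used only for illustration.

```python
# Assumed mapping from EmoInt emotions to MoodyLyrics labels.
# "fear" has no MoodyLyrics counterpart, so it is left out on purpose.
EMOINT_TO_MOODYLYRICS = {
    "anger": "angry",
    "joy": "happy",
    "sadness": "sad",
}

def map_emotion(emoint_label):
    """Return the MoodyLyrics label, or None if the tweet should be skipped."""
    return EMOINT_TO_MOODYLYRICS.get(emoint_label)

labels = ["joy", "fear", "anger", "sadness"]
kept = [map_emotion(label) for label in labels if map_emotion(label) is not None]
print(kept)  # ['happy', 'angry', 'sad']
```

Note that "relaxed" never appears on the EmoInt side, so after merging, that class is represented by MoodyLyrics songs only.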

The rest of this notebook assumes that the EmoInt dataset has already been parsed into a .csv file, which we can use to train machine learning models just as we did after the feature engineering on lyrics. For more information about how this .csv was generated, please refer to the `src/emoint_parser.py` script.

## EmoInt statistics

Since EmoInt provides intensity levels together with emotion labels, we kept only those tweets whose intensity was greater than 0.50 (50%). We also dropped hashtags and removed the tag characters (e.g. "Hey @MrTwitter how are you? #cool" became "Hey MrTwitter how are you?"), because these tweets will be compared with songs, and lyrics contain no such markup. This preprocessing should also improve the chances that spaCy's POS tagger handles the text correctly.
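
A minimal sketch of this filtering and cleaning step is shown below. It is not the code from `src/emoint_parser.py`; the helper names, the regular expressions and the sample tweets are assumptions made for illustration.

```python
import re

INTENSITY_THRESHOLD = 0.50  # keep only tweets strictly above this intensity

def clean_tweet(text):
    """Drop hashtags and strip the '@' tag character from mentions."""
    text = re.sub(r"#\w+", "", text)          # remove hashtags entirely
    text = text.replace("@", "")              # keep the handle, drop the '@'
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

def keep_tweet(intensity):
    """True when the annotated intensity is high enough to keep the tweet."""
    return intensity > INTENSITY_THRESHOLD

# Hypothetical (text, intensity) pairs, for illustration only
tweets = [("Hey @MrTwitter how are you? #cool", 0.73), ("meh #bored", 0.31)]
cleaned = [clean_tweet(text) for text, score in tweets if keep_tweet(score)]
print(cleaned)  # ['Hey MrTwitter how are you?']
```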

In [6]:
emoint = pd.read_csv('datasets/emoint_featurized.csv')

In [7]:
useless_columns = [ 'ID','ARTIST', 'SONG_TITLE', 'X_FREQUENCIES', 'SPACE_FREQUENCIES']
emoint.drop(useless_columns, axis=1, inplace=True)

In [8]:
emoint.head(5)

| | LYRICS_VECTOR | TITLE_VECTOR | LINE_COUNT | WORD_COUNT | ECHOISMS | SELFISH_DEGREE | DUPLICATE_LINES | IS_TITLE_IN_LYRICS | RHYMES | VERB_PRESENT | ... | NOUN_FREQUENCIES | NUM_FREQUENCIES | PART_FREQUENCIES | PRON_FREQUENCIES | PROPN_FREQUENCIES | PUNCT_FREQUENCIES | SCONJ_FREQUENCIES | SYM_FREQUENCIES | VERB_FREQUENCIES | EMOTION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [-1.26710683e-01 1.60194725e-01 -1.36762261e-... | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... | 1 | 16 | 0.0 | 0.0 | 0.0 | False | 0.0 | 0.75 | ... | 0.0625 | 0.0 | 0.0 | 0.0 | 0.125 | 0.1875 | 0.0 | 0.0 | 0.25 | happy |
| 1 | [-6.28133714e-02 1.90393195e-01 -1.95530921e-... | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... | 1 | 19 | 0.0 | 1.0 | 0.0 | False | 0.0 | 0.666667 | ... | 0.157895 | 0.0 | 0.0 | 0.105263 | 0.0 | 0.052632 | 0.0 | 0.0 | 0.263158 | happy |
| 2 | [ 9.66307223e-02 2.91245524e-02 -1.42218113e-... | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... | 1 | 11 | 0.0 | 0.0 | 0.0 | False | 0.0 | 0.5 | ... | 0.545455 | 0.0 | 0.0 | 0.0 | 0.363636 | 0.0 | 0.0 | 0.0 | 0.181818 | happy |
| 3 | [-1.13483094e-01 3.13860744e-01 -2.05740720e-... | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... | 1 | 23 | 0.0 | 0.2 | 0.0 | False | 0.0 | 0.75 | ... | 0.086957 | 0.0 | 0.0 | 0.217391 | 0.043478 | 0.304348 | 0.0 | 0.0 | 0.173913 | happy |
| 4 | [ 3.85632203e-03 2.41273686e-01 -1.58885673e-... | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... | 1 | 22 | 0.0 | 0.666667 | 0.0 | False | 0.0 | 1.0 | ... | 0.272727 | 0.0 | 0.045455 | 0.136364 | 0.045455 | 0.136364 | 0.0 | 0.0 | 0.227273 | happy |


We kept the column naming convention used in the previous MoodyLyrics notebooks purely for compatibility, since the two datasets will be merged. Because tweets have no title, `TITLE_VECTOR` was left as a vector of zeros with the same shape as `LYRICS_VECTOR`.

## Merge with MoodyLyrics
Let's now merge the featurized EmoInt and MoodyLyrics datasets so that we can proceed with further analysis.

In [9]:
path = 'datasets/moodylyrics_featurized.csv'

In [10]:
moodylyrics = pd.read_csv(path)
moodylyrics.columns = ['ID', 'ARTIST', 'SONG_TITLE', 'LYRICS_VECTOR', 'TITLE_VECTOR', 
                   'LINE_COUNT', 'WORD_COUNT', 'ECHOISMS', 'SELFISH_DEGREE', 
                   'DUPLICATE_LINES', 'IS_TITLE_IN_LYRICS', 'RHYMES', 'VERB_PRESENT', 
                   'VERB_PAST', 'VERB_FUTURE', 'ADJ_FREQUENCIES', 'CONJUCTION_FREQUENCIES', 
                   'ADV_FREQUENCIES', 'AUX_FREQUENCIES', 'CONJ_FREQUENCIES', 'CCONJ_FREQUENCIES', 
                   'DETERMINER_FREQUENCIES', 'INTERJECTION_FREQUENCIES', 'NOUN_FREQUENCIES', 
                   'NUM_FREQUENCIES', 'PART_FREQUENCIES', 'PRON_FREQUENCIES', 'PROPN_FREQUENCIES', 
                   'PUNCT_FREQUENCIES', 'SCONJ_FREQUENCIES', 'SYM_FREQUENCIES', 'VERB_FREQUENCIES', 
                   'X_FREQUENCIES', 'SPACE_FREQUENCIES', 'EMOTION']
moodylyrics.drop(useless_columns, axis=1, inplace=True)

In [11]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
dataset = pd.concat([emoint, moodylyrics])

In [12]:
dataset.describe()

| | LINE_COUNT | WORD_COUNT | ECHOISMS | SELFISH_DEGREE | DUPLICATE_LINES | RHYMES | VERB_PRESENT | VERB_PAST | VERB_FUTURE | ADJ_FREQUENCIES | ... | INTERJECTION_FREQUENCIES | NOUN_FREQUENCIES | NUM_FREQUENCIES | PART_FREQUENCIES | PRON_FREQUENCIES | PROPN_FREQUENCIES | PUNCT_FREQUENCIES | SCONJ_FREQUENCIES | SYM_FREQUENCIES | VERB_FREQUENCIES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | ... | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 | 4706.0 |
| mean | 18.771143 | 115.487675 | 0.001933 | 0.25716 | 0.046589 | 0.034445 | 0.681493 | 0.223661 | 0.032797 | 0.103892 | ... | 0.014189 | 0.190628 | 0.007121 | 0.025846 | 0.118723 | 0.046162 | 0.097045 | 0.0 | 0.001023 | 0.235702 |
| std | 21.636621 | 123.785535 | 0.014999 | 0.30792 | 0.062358 | 0.076869 | 0.327669 | 0.276698 | 0.102555 | 0.080279 | ... | 0.031877 | 0.105343 | 0.023027 | 0.035716 | 0.085248 | 0.092595 | 0.128123 | 0.0 | 0.013569 | 0.104105 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.051323 | ... | 0.0 | 0.127972 | 0.0 | 0.0 | 0.055556 | 0.0 | 0.016854 | 0.0 | 0.0 | 0.178571 |
| 50% | 11.0 | 64.0 | 0.0 | 0.142857 | 0.0 | 0.0 | 0.769231 | 0.125 | 0.0 | 0.090909 | ... | 0.0 | 0.178983 | 0.0 | 0.011205 | 0.114544 | 0.00266 | 0.068182 | 0.0 | 0.0 | 0.241379 |
| 75% | 33.0 | 198.0 | 0.0 | 0.451455 | 0.086786 | 0.03125 | 1.0 | 0.333333 | 0.0 | 0.142857 | ... | 0.013502 | 0.24 | 0.0 | 0.041301 | 0.173913 | 0.060647 | 0.135135 | 0.0 | 0.0 | 0.295567 |
| max | 188.0 | 1149.0 | 0.341463 | 1.0 | 1.0 | 0.735294 | 1.0 | 1.0 | 1.0 | 0.666667 | ... | 0.4 | 1.0 | 0.444444 | 0.333333 | 0.5 | 1.5 | 2.777778 | 0.0 | 0.714286 | 1.0 |


# Features Selection

Based on our experience with previous models and feature engineering strategies, we believe that building models on all the available features is not worth the effort: the models that achieved the best results were those using only the content of the lyrics. We will therefore work on the content of our input texts (either lyrics or tweets) plus a few additional features. Among the available features, we picked the following:
- SELFISH_DEGREE
- VERB_PRESENT
- VERB_PAST
- VERB_FUTURE

We believe that other features which may look useful for our purpose, e.g. `RHYMES`, cannot be used here: now that we are working with a broader dataset, such features are not general enough (they are not suitable for tweets).

In [13]:
dataset = dataset[['LYRICS_VECTOR', 'SELFISH_DEGREE', 'VERB_PRESENT', 'VERB_PAST', 'VERB_FUTURE', 'EMOTION']]

In [14]:
dataset.head(5)

| | LYRICS_VECTOR | SELFISH_DEGREE | VERB_PRESENT | VERB_PAST | VERB_FUTURE | EMOTION |
|---|---|---|---|---|---|---|
| 0 | [-1.26710683e-01 1.60194725e-01 -1.36762261e-... | 0.0 | 0.75 | 0.25 | 0.0 | happy |
| 1 | [-6.28133714e-02 1.90393195e-01 -1.95530921e-... | 1.0 | 0.666667 | 0.333333 | 0.0 | happy |
| 2 | [ 9.66307223e-02 2.91245524e-02 -1.42218113e-... | 0.0 | 0.5 | 0.5 | 0.0 | happy |
| 3 | [-1.13483094e-01 3.13860744e-01 -2.05740720e-... | 0.2 | 0.75 | 0.25 | 0.0 | happy |
| 4 | [ 3.85632203e-03 2.41273686e-01 -1.58885673e-... | 0.666667 | 1.0 | 0.0 | 0.0 | happy |


# Modeling

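Before starting, we need to flatten the one feature that is still stored as a vector: `LYRICS_VECTOR` (the title vector was dropped during feature selection). Each vector was read from the .csv as a string, so it has to be parsed back into numbers. Let's do that.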

In [15]:
X_vect = []
for _, row in dataset.drop('EMOTION', axis=1).iterrows():
    sub_list = []
    for field in row:
        if isinstance(field, str):
            # LYRICS_VECTOR is stored as a string like "[0.1 -0.2 ...]":
            # strip the brackets and split on whitespace (newlines included)
            sub_list += [float(x) for x in field[1:-1].split()]
        else:
            # Scalar features are appended unchanged
            sub_list.append(field)
    X_vect.append(sub_list)
X_vect = np.array(X_vect)

In [16]:
y = dataset.EMOTION.astype("category").cat.codes
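
`astype("category").cat.codes` replaces each label with an integer. When the categories are inferred from strings, pandas sorts them, so the codes follow alphabetical order. The pure-Python sketch below mimics that behaviour on a toy list; `encode_labels` is a hypothetical helper, not part of pandas.

```python
def encode_labels(labels):
    """Map each label to its position among the sorted unique labels."""
    categories = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[label] for label in labels], categories

codes, categories = encode_labels(["happy", "sad", "angry", "relaxed", "happy"])
print(categories)  # ['angry', 'happy', 'relaxed', 'sad']
print(codes)       # [1, 3, 0, 2, 1]
```

Keeping the category list around makes it possible to translate predicted codes back into emotion names.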

In [17]:
print(X_vect.shape)
print(y.shape)

(4706, 304)
(4706,)


As we can see, the dataset has 4706 entries, each with 304 features: the 300 components of `LYRICS_VECTOR` plus the four scalar features selected above.
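
To see how a single row is expanded into one flat feature list, here is a small self-contained sketch of the flattening step. It uses a toy 3-component vector rather than a real embedding, and `parse_vector` and `flatten_row` are hypothetical helper names.

```python
def parse_vector(field):
    """Turn a bracketed, whitespace-separated string into a list of floats."""
    return [float(token) for token in field[1:-1].split()]

def flatten_row(row):
    """Expand string-encoded vectors and keep scalar fields as they are."""
    flat = []
    for field in row:
        if isinstance(field, str):
            flat.extend(parse_vector(field))
        else:
            flat.append(field)
    return flat

# Toy row: a 3-component vector (with an embedded newline, as NumPy's
# string representation may contain) followed by the four scalar features
row = ["[-1.5e-01 2.0e-01\n 3.0e-02]", 0.0, 0.75, 0.25, 0.0]
features = flatten_row(row)
print(len(features))  # 7
print(features[:3])   # [-0.15, 0.2, 0.03]
```

With a real 300-component vector, the same procedure yields 300 + 4 = 304 values per row.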

## k-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier

ks = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

for k in ks:
    # Build model
    clf = KNeighborsClassifier(n_neighbors=k, algorithm='auto', 
                           metric='euclidean', n_jobs=-1)
    # Evaluate accuracy
    scores = cross_val_score(clf, X_vect, y, cv=10)
    print('Accuracy for k=%d: %0.2f (+/- %0.2f)' % (k, scores.mean(), scores.std() * 1.96))
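
The printed figure is the mean accuracy over the folds, followed by 1.96 times the standard deviation of the fold scores as a rough indication of their spread. The sketch below reproduces that summary with the standard library only. The fold accuracies are made up for illustration, and `pstdev` is used because NumPy's `std()` computes the population standard deviation by default.

```python
from statistics import fmean, pstdev

def summarize_scores(scores):
    """Return (mean, half_width) where half_width = 1.96 * population std."""
    return fmean(scores), 1.96 * pstdev(scores)

# Hypothetical fold accuracies, for illustration only
fold_scores = [0.60, 0.62, 0.58, 0.61, 0.59]
mean_acc, half_width = summarize_scores(fold_scores)
print('Accuracy: %0.2f (+/- %0.2f)' % (mean_acc, half_width))
```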

## SVM

In [None]:
def parameters_grid_search(classifier, params, x, y, cv=10, verbose=0):
    """
    Run a grid search over `params` for the given classifier class,
    scoring each parameter combination with `cv`-fold cross-validation.

    Returns the best estimator, its parameters and its mean CV score.
    """
    gs = GridSearchCV(classifier(), params, cv=cv, n_jobs=-1, verbose=verbose)
    gs.fit(x, y)
    return gs.best_estimator_, gs.best_params_, gs.best_score_
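
A grid search evaluates every combination of the listed parameter values. As a rough illustration of what that enumeration looks like, the sketch below expands a list of parameter grids by hand. `expand_grid` is a hypothetical helper and the grid is an example only; scikit-learn performs this expansion internally.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield every parameter combination described by a list of grids."""
    for grid in param_grid:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            yield dict(zip(keys, values))

# Hypothetical grid, for illustration only
params = [
    {'kernel': ['rbf'], 'C': [1, 10]},
    {'kernel': ['linear'], 'C': [1]},
]
candidates = list(expand_grid(params))
print(len(candidates))  # 3
print(candidates[0])    # {'C': 1, 'kernel': 'rbf'}
```

Each candidate is then cross-validated, so the total cost grows with the number of combinations times the number of folds.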

In [None]:
from sklearn.svm import SVC

# Define the set of parameters we want to test
# (a single combination for now; add values to the lists to widen the search)
params = [
    {'kernel': ['rbf'], 'C': [1]}
]

# Perform grid search (the helper instantiates SVC itself)
svm_best, best_params, best_score = parameters_grid_search(SVC, params, X_vect, y, verbose=1)
print('Parameters:', best_params)

In [None]:
scores = cross_val_score(svm_best, X_vect, y, cv=10)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 1.96))

## Gradient Boost

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Build model
clf = GradientBoostingClassifier(learning_rate=0.7, n_estimators=200)
# Evaluate accuracy
scores = cross_val_score(clf, X_vect, y, cv=10)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 1.96))

## Artificial Neural Network

# References
[EmoInt](http://saifmohammad.com/WebDocs/TweetEmotionIntensities-starsem2017.pdf)