<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#EmoInt-dataset-for-emotion-detection-in-lyrics" data-toc-modified-id="EmoInt-dataset-for-emotion-detection-in-lyrics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>EmoInt dataset for emotion detection in lyrics</a></span><ul class="toc-item"><li><span><a href="#EmoInt-statistics" data-toc-modified-id="EmoInt-statistics-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>EmoInt statistics</a></span></li><li><span><a href="#Merge-with-MoodyLyrics" data-toc-modified-id="Merge-with-MoodyLyrics-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Merge with MoodyLyrics</a></span></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#SVM" data-toc-modified-id="SVM-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SVM</a></span></li><li><span><a href="#Gradient-Boost" data-toc-modified-id="Gradient-Boost-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Gradient Boost</a></span></li><li><span><a href="#Artificial-Neural-Network" data-toc-modified-id="Artificial-Neural-Network-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Artificial Neural Network</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>References</a></span></li></ul></div>

In [46]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV, cross_val_score

# EmoInt dataset for emotion detection in lyrics

Existing emotion datasets are mainly annotated categorically, without any indication of the degree of emotion. EmoInt, instead, provides tweets annotated with both an emotion (anger, fear, joy, sadness) and the degree to which that emotion is expressed in the text.

It is important to mention that EmoInt was manually annotated, using [Best-Worst Scaling](https://nparc.nrc-cnrc.gc.ca/eng/view/fulltext/?id=b132b0af-2ae0-4964-ac3a-493e7292a37a) (BWS), an annotation scheme shown to obtain very reliable scores.

For our purpose, we will treat each tweet as if it were a lyric and run the same feature engineering on it, using spaCy and the other tools we have used so far.

Our original dataset, MoodyLyrics, contains "happy", "sad", "angry" and "relaxed" as labels. Therefore, in order to take a sort of intersection with EmoInt, we will use only the tweets corresponding to the anger, joy and sadness emotions.
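The label alignment described above can be sketched as a simple mapping. The dictionary and function names are hypothetical and only illustrate the idea; the actual parsing code is not shown in this notebook.

```python
# Hypothetical mapping from EmoInt emotions to MoodyLyrics labels.
# "fear" has no MoodyLyrics counterpart, so it is left out on purpose.
EMOINT_TO_MOODYLYRICS = {
    "anger": "angry",
    "joy": "happy",
    "sadness": "sad",
}

def to_moodylyrics_label(emoint_emotion):
    """Return the matching MoodyLyrics label, or None if the emotion is dropped."""
    return EMOINT_TO_MOODYLYRICS.get(emoint_emotion)

# Made-up (tweet, emotion) pairs, only to show the filtering step.
tweets = [
    ("so happy today", "joy"),
    ("this is scary", "fear"),
    ("I miss you", "sadness"),
]
kept = [
    (text, to_moodylyrics_label(emotion))
    for text, emotion in tweets
    if to_moodylyrics_label(emotion) is not None
]
print(kept)
```

Note that "relaxed" has no EmoInt counterpart, so in a merged dataset every "relaxed" example would come from MoodyLyrics only.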

The rest of this notebook assumes that the EmoInt dataset has already been parsed into a .csv file, which we can use to train some machine learning models just as we did after the feature engineering on lyrics. For more information about how this .csv was generated, please refer to the `src/emoint_parser.py` script.

## EmoInt statistics

As EmoInt provides intensity levels together with emotion labels, we decided to take into account only those tweets for which the intensity was greater than 0.50 (50%). We also dropped mentions and hashtags (e.g. "Hey @MrTwitter how are you? #cool" became "Hey how are you?"), because these tweets will be compared with songs, and song lyrics do not contain such tokens.
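A minimal sketch of the two preprocessing steps above, assuming a simple regular expression for mentions and hashtags. The pattern, the function names and the sample rows are illustrative assumptions and are not taken from `src/emoint_parser.py`.

```python
import re

# Assumed pattern: an "@" or "#" followed by word characters.
TAG_PATTERN = re.compile(r"[@#]\w+")

def strip_tags(text):
    """Remove @mentions and #hashtags, then collapse the leftover whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())

def keep_intense(rows, threshold=0.50):
    """Keep (text, intensity) rows whose intensity is strictly above the threshold."""
    return [
        (strip_tags(text), intensity)
        for text, intensity in rows
        if intensity > threshold
    ]

# Made-up rows: the first one is the example from the text.
rows = [
    ("Hey @MrTwitter how are you? #cool", 0.80),
    ("meh #whatever", 0.35),
]
cleaned = keep_intense(rows)
print(cleaned)  # [('Hey how are you?', 0.8)]
```

A pattern this simple would also strip the part after "@" in an e-mail address, so a real parser may need something stricter.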

In [29]:
emoint = pd.read_csv('datasets/emoint_featurized.csv')

In [30]:
useless_columns = ['ID', 'ARTIST', 'SONG_TITLE', 'X_FREQUENCIES', 'SPACE_FREQUENCIES']
emoint.drop(useless_columns, axis=1, inplace=True)

In [31]:
emoint.head(5)

Unnamed: 0,LYRICS_VECTOR,TITLE_VECTOR,LINE_COUNT,WORD_COUNT,ECHOISMS,SELFISH_DEGREE,DUPLICATE_LINES,IS_TITLE_IN_LYRICS,RHYMES,VERB_PRESENT,...,NOUN_FREQUENCIES,NUM_FREQUENCIES,PART_FREQUENCIES,PRON_FREQUENCIES,PROPN_FREQUENCIES,PUNCT_FREQUENCIES,SCONJ_FREQUENCIES,SYM_FREQUENCIES,VERB_FREQUENCIES,EMOTION
0,[-1.58756495e-01 1.27405643e-01 -1.87897816e-...,[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,1,16,0.0,0.0,0.0,False,0,0.75,...,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0625,happy
1,[-5.92244938e-02 1.76795915e-01 -1.86805829e-...,[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,1,20,0.0,1.0,0.0,False,0,0.6,...,0.03,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.05,happy
2,[ 5.27095608e-03 1.50894225e-01 -9.51976478e-...,[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,1,9,0.0,0.0,0.0,False,0,0.5,...,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,happy
3,[-8.42130557e-02 2.85120517e-01 -2.86448717e-...,[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,1,23,0.0,0.2,0.0,False,0,0.75,...,0.021739,0.0,0.0,0.054348,0.01087,0.0,0.0,0.0,0.043478,happy
4,[-2.13992037e-03 2.37986907e-01 -1.79613903e-...,[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,1,21,0.0,0.666667,0.0,False,0,0.5,...,0.071429,0.0,0.011905,0.035714,0.0,0.0,0.0,0.0,0.047619,happy


We used the same column naming convention as in the previous MoodyLyrics notebooks purely for compatibility reasons (the two datasets will have to be put together). Since tweets do not have a title, `TITLE_VECTOR` was simply left as a vector of 0s with the same shape as `LYRICS_VECTOR`.

## Merge with MoodyLyrics
Let's now merge EmoInt and MoodyLyrics featurized datasets in order to be able to proceed with further analysis.

In [32]:
path = 'datasets/emotion_detection_dataset.csv'

In [33]:
moodylyrics = pd.read_csv(path)
moodylyrics.columns = ['ID', 'ARTIST', 'SONG_TITLE', 'LYRICS_VECTOR', 'TITLE_VECTOR', 
                   'LINE_COUNT', 'WORD_COUNT', 'ECHOISMS', 'SELFISH_DEGREE', 
                   'DUPLICATE_LINES', 'IS_TITLE_IN_LYRICS', 'RHYMES', 'VERB_PRESENT', 
                   'VERB_PAST', 'VERB_FUTURE', 'ADJ_FREQUENCIES', 'CONJUCTION_FREQUENCIES', 
                   'ADV_FREQUENCIES', 'AUX_FREQUENCIES', 'CONJ_FREQUENCIES', 'CCONJ_FREQUENCIES', 
                   'DETERMINER_FREQUENCIES', 'INTERJECTION_FREQUENCIES', 'NOUN_FREQUENCIES', 
                   'NUM_FREQUENCIES', 'PART_FREQUENCIES', 'PRON_FREQUENCIES', 'PROPN_FREQUENCIES', 
                   'PUNCT_FREQUENCIES', 'SCONJ_FREQUENCIES', 'SYM_FREQUENCIES', 'VERB_FREQUENCIES', 
                   'X_FREQUENCIES', 'SPACE_FREQUENCIES', 'EMOTION']
moodylyrics.drop(useless_columns, axis=1, inplace=True)

In [34]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
dataset = pd.concat([emoint, moodylyrics])
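Stacking two frames only works cleanly when both expose the same columns: with the default outer join, a column present in only one frame is filled with NaN for the rows of the other. The check below is a hedged sketch with made-up data; `check_same_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def check_same_columns(left, right):
    """Raise if the two frames do not share exactly the same column names."""
    only_left = set(left.columns) - set(right.columns)
    only_right = set(right.columns) - set(left.columns)
    if only_left or only_right:
        raise ValueError(
            f"Column mismatch: {sorted(only_left)} vs {sorted(only_right)}"
        )

# Tiny stand-ins for the two featurized datasets (values are made up).
a = pd.DataFrame({"WORD_COUNT": [16, 20], "EMOTION": ["happy", "happy"]})
b = pd.DataFrame({"WORD_COUNT": [118], "EMOTION": ["relaxed"]})

check_same_columns(a, b)
merged = pd.concat([a, b])
print(merged.shape)
```

The original row labels are kept, so index values repeat in the merged frame; passing `ignore_index=True` to `pd.concat` would renumber them.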

In [35]:
dataset.describe()

Unnamed: 0,LINE_COUNT,WORD_COUNT,ECHOISMS,SELFISH_DEGREE,DUPLICATE_LINES,RHYMES,VERB_PRESENT,VERB_PAST,VERB_FUTURE,ADJ_FREQUENCIES,...,INTERJECTION_FREQUENCIES,NOUN_FREQUENCIES,NUM_FREQUENCIES,PART_FREQUENCIES,PRON_FREQUENCIES,PROPN_FREQUENCIES,PUNCT_FREQUENCIES,SCONJ_FREQUENCIES,SYM_FREQUENCIES,VERB_FREQUENCIES
count,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,...,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0,4706.0
mean,19.634722,118.501062,0.00304,0.257467,0.046397,0.0,12.353133,3.47867,0.0,0.034587,...,0.003442,0.056307,0.002523,0.00381,0.017707,0.016783,0.002929,0.0,3.1e-05,0.043912
std,22.537418,127.585458,0.018688,0.308315,0.062092,0.0,17.179584,6.630785,0.0,0.075121,...,0.022703,0.103561,0.015115,0.012375,0.039051,0.061004,0.021154,0.0,0.001528,0.066435
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,17.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.001356,...,0.0,0.003262,0.0,0.0,0.001751,0.0,0.0,0.0,0.0,0.005114
50%,13.0,64.0,0.0,0.142857,0.0,0.0,1.0,0.5,0.0,0.004329,...,0.0,0.0134,0.0,0.000217,0.004305,0.0,0.0,0.0,0.0,0.012512
75%,34.0,202.0,0.0,0.451613,0.086462,0.0,21.0,4.0,0.0,0.03125,...,0.000296,0.062307,0.0,0.000991,0.018519,0.005952,0.001526,0.0,0.0,0.059881
max,189.0,1162.0,0.5,1.0,1.0,0.0,142.0,97.0,0.0,1.0,...,0.5,1.0,0.444444,0.2,0.666667,1.0,0.875,0.0,0.1,1.0


# Modeling

Before starting, we need to flatten the features that are currently stored as vectors (the title vector and the lyrics vector), so that each row becomes a single flat array of numbers:

In [37]:
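X_vect = []
for _, row in dataset.drop('EMOTION', axis=1).iterrows():
    sub_list = []
    for field in row:
        if isinstance(field, str):
            # Vector columns are stored as strings like "[0.1 0.2 ...]":
            # drop the brackets and split on whitespace (newlines included)
            sub_list += [float(x) for x in field[1:-1].split()]
        else:
            # Scalar features (counts, frequencies, booleans) are kept as they are
            sub_list.append(field)
    X_vect.append(np.array(sub_list))
X_vect = np.array(X_vect)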
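To make the flattening step concrete, here is a small worked example on a made-up row. It assumes the vectors were written to the CSV as the string form of a NumPy array, which is what the `head()` output above suggests; `parse_vector` is a hypothetical helper name.

```python
import numpy as np

def parse_vector(text):
    """Turn a string such as '[0.1 0.2\n 0.3]' back into a 1-D float array."""
    # Drop the surrounding brackets; split() handles spaces and newlines alike.
    return np.array([float(token) for token in text.strip()[1:-1].split()])

# A made-up row: two stringified vectors followed by scalar features.
row = ["[0.5 -1.0\n 2.0]", "[0. 0. 0.]", 16, 0.75, False]

flat = []
for field in row:
    if isinstance(field, str):
        flat.extend(parse_vector(field))
    else:
        flat.append(field)
flat = np.array(flat, dtype=float)
print(flat.shape)  # (9,)
```

The boolean feature ends up as `0.0` or `1.0`, so the resulting array has a single numeric dtype.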

In [38]:
y = dataset.EMOTION.astype("category").cat.codes
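Category codes are plain integers, so it helps to keep the mapping back to the label names for reading predictions later. A minimal sketch on made-up labels; the variable names are illustrative.

```python
import pandas as pd

# Made-up labels using the four MoodyLyrics classes.
labels = pd.Series(["happy", "sad", "angry", "relaxed", "happy"])
categorical = labels.astype("category")

# Integer code for each row, and the lookup table from code to label.
codes = categorical.cat.codes
code_to_label = dict(enumerate(categorical.cat.categories))
print(code_to_label)  # {0: 'angry', 1: 'happy', 2: 'relaxed', 3: 'sad'}
```

For string labels pandas sorts the categories alphabetically, so the codes do not depend on the order in which the labels appear in the data.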

In [39]:
print(X_vect.shape)
print(y.shape)

(4706, 627)
(4706,)


## SVM

In [43]:
def parameters_grid_search(classifier, params, x, y, cv=10, verbose=0):
    """
    Run a grid search over `params` for the given classifier class,
    scoring each combination with `cv`-fold cross-validation.
    Returns the best estimator (refit on all the data) and its parameters.
    """
    gs = GridSearchCV(classifier(), params, cv=cv, n_jobs=-1, verbose=verbose)
    gs.fit(x, y)
    return gs.best_estimator_, gs.best_params_

In [None]:
from sklearn.svm import SVC

# The SVC class itself is passed to the grid search, which instantiates it
# Define the set of parameters we want to test on
params = [
    { 'kernel': ['rbf'], 'C': [ 0.1, 1 ] }
]

# Perform grid search
svm_best, best_params = parameters_grid_search(SVC, params, X_vect, y, verbose=1)
print('Parameters:', best_params)

In [47]:
scores = cross_val_score(svm_best, X_vect, y, cv=10)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 1.96))

Accuracy: 0.42 (+/- 0.10)
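The `+/-` value printed above is 1.96 times the standard deviation of the ten fold accuracies, so it describes how much the score varies from fold to fold rather than a confidence interval on the mean. The sketch below uses made-up fold scores to show the difference; none of these numbers come from the actual run.

```python
import numpy as np

# Made-up per-fold accuracies, standing in for the output of cross_val_score.
fold_scores = np.array([0.40, 0.45, 0.38, 0.47, 0.41, 0.36, 0.44, 0.50, 0.39, 0.40])

mean = fold_scores.mean()

# Fold-to-fold spread, computed as in the cell above (population std, ddof=0).
spread = 1.96 * fold_scores.std()

# Standard error of the mean, which shrinks as the number of folds grows.
sem = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))

print('Accuracy: %0.2f (+/- %0.2f), standard error: %0.3f' % (mean, spread, sem))
```

Even the standard error is only approximate here, because the folds share training data and their scores are therefore not independent.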


## Gradient Boost

## Artificial Neural Network

# References
[EmoInt paper](http://saifmohammad.com/WebDocs/TweetEmotionIntensities-starsem2017.pdf)