# Opistocks - Model selection
In this notebook we evaluate several classifiers on several datasets to find the best model for sentiment analysis on tweets.

The different classifiers tested are the following:
- SVM linear kernel
- SVM RBF kernel
- Naive Bayes

## Data Analysis
There are 4 datasets of tweets, each labelled by hand with sentiment scores. The range and nature of the scores differ from one dataset to another. Here is a first glance at the datasets:
- **self-drive** : 7156 tweets (213 not relevant), scores [1, 5] where 3 is neutral
- **text-emotion** : 40k tweets, scores are 13 different emotions {empty, sadness, enthusiasm, neutral, worry, surprise, love, fun, hate, happiness, boredom, relief, anger}; we treat empty as not relevant (827 tweets)
- **apple** : 3886 tweets (82 not relevant), scores {1, 3, 5} where 3 is neutral
- **airline** : 14641 tweets, scores are 3 words {negative, neutral, positive}

In total, we have (7156-213) + (40000-827) + (3886-82) + 14641 = 64561 tweets.

Since we have to use the least precise scale, we will have 3 different classes: **positive**, **neutral**, **negative**. We will normalize the classes across all the datasets to perform the model selection. For example, for "text-emotion", we will map the 13 emotion words to the 3 classes.
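
As a rough illustration of this normalization, the sketch below maps a few labels onto {-1, 0, 1} with pandas ``Series.map``. The toy labels, the partial mapping and the helper name ``normalize`` are assumptions made for the example. It shows that any label missing from the mapping silently becomes ``NaN``, which is worth checking before training.

```python
import pandas as pd

# Hypothetical toy labels; 'confused' is deliberately absent from the mapping.
labels = pd.Series(['sadness', 'neutral', 'love', 'confused'])

# Partial emotion -> class mapping (-1 negative, 0 neutral, 1 positive).
EMOTION_TO_CLASS = {'sadness': -1, 'neutral': 0, 'love': 1}

def normalize(series, mapping):
    """Map raw labels to classes and list the labels the mapping does not cover."""
    mapped = series.map(mapping)
    unmapped = sorted(series[mapped.isnull()].unique())
    return mapped, unmapped

mapped, unmapped = normalize(labels, EMOTION_TO_CLASS)
print(unmapped)
```

Running the same check on each real dataset would reveal, for instance, a numeric column being mapped with string keys.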

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import svm
import pickle

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

from sklearn.utils import shuffle

``SEED``: the seed used in every random generator.

``TRAIN_TEST_RATIO``: a number between 0 and 1 giving the fraction of the data allocated to the testing set (e.g. 0.2 = 20% for the testing set and 80% for the training set)
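
To make the meaning of these two constants concrete, here is a minimal sketch using scikit-learn's ``train_test_split`` on made-up data. The toy arrays are assumptions, and this is only one possible way to apply the ratio, not necessarily the split used further down in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
TRAIN_TEST_RATIO = 0.2

# Hypothetical toy data: 10 samples, 2 balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size is the fraction held out for testing; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TRAIN_TEST_RATIO, random_state=SEED, stratify=y)

print(len(X_train), len(X_test))
```

Fixing ``random_state`` makes the split repeatable across runs.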

In [2]:
SEED = 42
TRAIN_TEST_RATIO = 0.2

In [3]:
# ============================== self-drive ===================================
df_sd = pd.read_csv('../data/Twitter-sentiment-self-drive-DFE.csv', encoding='latin1')

# remove not relevant tweets
df_sd = df_sd[~df_sd['sentiment'].isin(['not_relevant'])]

# create the main dataframe by extracting the relevant columns
df = pd.concat([df_sd['text'], df_sd['sentiment']], axis=1, keys=['text', 'sentiment'])

# normalize the sentiment values ({1; 2; 3; 4; 5} => {-1; 0; 1})
df['sentiment'] = df['sentiment'].map({'1':-1, '2':-1, '3':0, '4':1, '5':1})

# shuffle the data
df = shuffle(df, random_state=SEED)

# keep the normalized self-drive data under its own name (df stays the main dataframe)
df_sd = df

In [4]:
# ============================== text-emotion =================================
df_te = pd.read_csv('../data/text_emotion.csv')
df_te = pd.concat([df_te['content'], df_te['sentiment']], axis=1, keys=['text', 'sentiment'])

# remove not relevant tweets
df_te = df_te[~df_te['sentiment'].isin(['empty'])]

# map words into integer values ({...} => {-1; 0; 1})
df_te['sentiment'] = df_te['sentiment'].map({
    'sadness':-1,
    'enthusiasm':1,
    'neutral':0,
    'worry':-1,
    'surprise':1,
    'love':1,
    'fun':1,
    'hate':-1,
    'happiness':1,
    'boredom':-1,
    'relief':1,
    'anger':-1})

# shuffle the data
df_te = shuffle(df_te, random_state=SEED)

# append the rows to the main dataframe (pd.concat also works where DataFrame.append is unavailable)
df = pd.concat([df, df_te])

In [5]:
# ================================= apple =====================================
df_ap = pd.read_csv('../data/Apple-Twitter-Sentiment-DFE.csv', encoding='latin1')
df_ap = pd.concat([df_ap['text'], df_ap['sentiment']], axis=1, keys=['text', 'sentiment'])

# remove not relevant tweets
df_ap = df_ap[~df_ap['sentiment'].isin(['not_relevant'])]

# normalize the sentiment values ({1; 3; 5} => {-1; 0; 1})
df_ap['sentiment'] = df_ap['sentiment'].map({'1':-1, '3':0, '5':1})

# shuffle the data
df_ap = shuffle(df_ap, random_state=SEED)

# append the rows to the main dataframe (pd.concat also works where DataFrame.append is unavailable)
df = pd.concat([df, df_ap])

In [6]:
# =============================== airline =====================================
df_ai = pd.read_csv('../data/Airline-Sentiment-2-w-AA.csv', encoding='latin1')
df_ai = pd.concat([df_ai['text'], df_ai['airline_sentiment']], axis=1, keys=['text', 'sentiment'])

# normalize the sentiment values ({'negative'; 'neutral'; 'positive'} => {-1; 0; 1})
df_ai['sentiment'] = df_ai['sentiment'].map({'negative':-1, 'neutral':0, 'positive':1})

# shuffle the data
df_ai = shuffle(df_ai, random_state=SEED)

# append the rows to the main dataframe (pd.concat also works where DataFrame.append is unavailable)
df = pd.concat([df, df_ai])

Some classifiers cannot be trained on negative class labels (because of their implementation and/or mathematical background). We therefore map the initial classes to new, positive labels so that every classifier can be trained correctly.

In [7]:
# map the 3 classes to another representation
df['sentiment'] = df['sentiment'].map({-1:1, 0:3, 1:5})

Unbalanced classes in the training set might degrade performance. To avoid this risk, we draw the same number of samples from each class, equal to the size of the smallest class.
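
The idea can be sketched on a toy dataframe as follows. The data and the names ``df_toy``, ``sample_size`` and ``balanced`` are assumptions for illustration. Unlike the cell below, this variant computes the size of the smallest class instead of hard-coding it.

```python
import pandas as pd

SEED = 42

# Hypothetical toy dataframe with unbalanced classes (1: 4 rows, 3: 2 rows, 5: 3 rows).
df_toy = pd.DataFrame({
    'text': ['t{}'.format(i) for i in range(9)],
    'sentiment': [1, 1, 1, 1, 3, 3, 5, 5, 5],
})

# Size of the smallest class, computed rather than hard-coded.
sample_size = int(df_toy['sentiment'].value_counts().min())

# Draw the same number of rows from every class, with a fixed seed.
balanced = pd.concat([
    group.sample(n=sample_size, random_state=SEED)
    for _, group in df_toy.groupby('sentiment')
])

print(balanced['sentiment'].value_counts().to_dict())
```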

In [8]:
SAMPLE_SIZE = 18144 # Number of neutral occurrences (the smallest class)

# draw SAMPLE_SIZE tweets per class with a fixed seed so the sampling is reproducible
df = pd.concat([df.loc[df['sentiment'] == c].sample(SAMPLE_SIZE, random_state=SEED) for c in (5, 3, 1)])
df = shuffle(df, random_state=SEED)

## Features extraction
Now that we have a dataframe containing all the normalized data, we can perform our model selection. In summary, before balancing, the combined data contained 64560 tweets (26428 negatives, 18144 neutrals, 19988 positives), and we want to train a model based on this dataset.

However, to train the models, we need to choose which features the classification will be based on. Here are the different feature extraction strategies that are going to be tested:
- Count (Bag-of-Words) VS TF-IDF
- Stopwords from predefined list VS stopwords from dataset occurrences
- Unigram VS bigram tokenization
- Lowercase VS not lowercase conversion

Combining these four binary choices gives 16 different strategies. The bigram variants are commented out in the cell below, which leaves 8 active vectorizers.
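
The count of 16 follows from combining four two-way choices. A minimal sketch of that enumeration, using plain dictionaries of settings rather than actual vectorizer objects (the key names and the ``strategies`` list are assumptions for illustration):

```python
from itertools import product

vectorizers = ['CountVectorizer', 'TfidfVectorizer']
stop_words = ['english', None]
ngram_ranges = [(1, 1), (2, 2)]
lowercase = [True, False]

# One dictionary of settings per combination of the four choices.
strategies = [
    {'vectorizer': v, 'stop_words': s, 'ngram_range': n, 'lowercase': l}
    for v, s, n, l in product(vectorizers, stop_words, ngram_ranges, lowercase)
]

# Combinations restricted to unigram tokenization.
unigram_only = [s for s in strategies if s['ngram_range'] == (1, 1)]

print(len(strategies), len(unigram_only))
```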

In [9]:
# Array of vectorizers for the feature extraction step
vecs = [
    CountVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=True),
    CountVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=False),
#     CountVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=True),
#     CountVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=False),
    CountVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=True),
    CountVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=False),
#     CountVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=True),
#     CountVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=False),
    TfidfVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=True),
    TfidfVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=False),
#     TfidfVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=True),
#     TfidfVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=False),
    TfidfVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=True),
    TfidfVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=False),
#     TfidfVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=True),
#     TfidfVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=False)
]

## Models training and evaluation
Apart from the 16 preprocessing strategies, we also test different classifiers. However, since the course is more about IR and data management than machine learning techniques, we decided to train and evaluate 3 different classifiers without changing their default parameters (i.e. the parameters set by the developers of the scikit-learn implementation). These parameters are the following:

- SVM with Linear Kernel: ``C=1.0, loss='squared_hinge', max_iter=1000, multi_class=ovr (One-vs-Rest)``
- SVM with RBF Kernel: ``C=1.0, degree=3, gamma='auto', multi_class=ovo (One-vs-One)``
- NB for multinomial models: ``alpha=1.0``

Note the choice of the Naive Bayes model: this specific implementation is designed for multinomially distributed data, such as word counts.
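
As a hedged sketch of how the three default classifiers can be combined with a vectorizer, the example below wraps each one in a scikit-learn pipeline and fits it on a tiny made-up corpus. The texts, labels and variable names are assumptions. Using a pipeline is an alternative to the loop in this notebook, chosen here because it fits the vectorizer on the training texts only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Hypothetical toy corpus, only for illustration (1 = negative, 5 = positive).
train_texts = ['great flight', 'love this phone', 'awful delay', 'hate the service']
train_labels = [5, 5, 1, 1]
test_texts = ['love the flight', 'awful service']

# The three classifiers with their default parameters.
classifiers = {'linear': LinearSVC(), 'rbf': SVC(), 'nb': MultinomialNB()}

predictions = {}
for name, clf in classifiers.items():
    # The vectorizer vocabulary is learned from the training texts only.
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts)

print(sorted(predictions))
```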

In [10]:
# dfs = [df_sd, df_te, df_ap, df_ai, df]
dfs = [df_sd, df_ap, df_ai]
# dfs = [df_sd[:200]]

In [11]:
results = {'linear': [], 'rbf': [], 'nb': []}
target_names = ['negative', 'neutral', 'positive']

for i in range(len(dfs)):
    results_df_svm = []
    results_df_rbf = []
    results_df_nb = []

    # the first (1 - TRAIN_TEST_RATIO) of the rows train the model, the rest test it
    threshold = int((1 - TRAIN_TEST_RATIO)*len(dfs[i]))

    for idx, vec in enumerate(vecs):
        print('======== LOOP NUMBER {} ========'.format(i*len(vecs) + idx))
        current_vec = vec
        # .values replaces as_matrix(), which is not available in recent pandas releases
        X, y = current_vec.fit_transform(dfs[i]['text'].values).toarray(), dfs[i]['sentiment'].values

        # SVM Linear
        svm_linear = svm.LinearSVC()
        y_pred = svm_linear.fit(X[:threshold], y[:threshold]).predict(X[threshold:])
        results_df_svm.append(classification_report(y[threshold:], y_pred, target_names=target_names))

        # SVM RBF
        svm_rbf = svm.SVC()
        y_pred = svm_rbf.fit(X[:threshold], y[:threshold]).predict(X[threshold:])
        results_df_rbf.append(classification_report(y[threshold:], y_pred, target_names=target_names))

        # Naive Bayes
        gnb = MultinomialNB()
        y_pred = gnb.fit(X[:threshold], y[:threshold]).predict(X[threshold:])
        results_df_nb.append(classification_report(y[threshold:], y_pred, target_names=target_names))

    results['linear'].append(results_df_svm)
    results['rbf'].append(results_df_rbf)
    results['nb'].append(results_df_nb)



In [12]:
# pickle the results in order to visualize them in another notebook
with open("results.p", "wb") as f:
    pickle.dump(results, f)
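
For reference, reading the results back in the other notebook could look like the round-trip sketch below. The dictionary ``results_demo``, the temporary file path and the placeholder report strings are assumptions. The nesting mirrors the ``results`` structure built above: classifier name, then dataset index, then vectorizer index.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real results: one dataset, one vectorizer per classifier.
results_demo = {'linear': [['report a']], 'rbf': [['report b']], 'nb': [['report c']]}

# Write to a temporary location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'results_demo.p')
with open(path, 'wb') as f:
    pickle.dump(results_demo, f)

# Load the dictionary back, as the visualization notebook would.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# loaded[classifier][dataset index][vectorizer index] holds one report.
print(loaded['nb'][0][0])
```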

In [13]:
# Information about software version
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas, sklearn, pickle

Software,Version
Python,3.6.0 64bit [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]
IPython,6.0.0
OS,Darwin 16.5.0 x86_64 i386 64bit
numpy,1.12.0
pandas,0.19.2
sklearn,0.18.1
pickle,The 'pickle' distribution was not found and is required by the application
Fri May 19 17:40:46 2017 CEST,Fri May 19 17:40:46 2017 CEST
