# Opistocks - Model selection
In this notebook we evaluate several classifiers on several datasets to find the best model for sentiment analysis on tweets.

The classifiers tested are the following:
- SVM linear kernel
- SVM RBF kernel
- Naive Bayes

## Data Analysis
There are 4 datasets of tweets, each labeled by hand with sentiment scores. The score range and nature differ from one dataset to another. Here is a first glance at the datasets:
- **self-drive** : 7156 tweets (213 not relevant), scores [1, 5] where 3 is neutral
- **text-emotion** : 40k tweets, scores are 13 different emotions {empty, sadness, enthusiasm, neutral, worry, surprise, love, fun, hate, happiness, boredom, relief, anger}; we treat empty as not relevant (827 tweets)
- **apple** : 3886 tweets (82 not relevant), scores {1, 3, 5} where 3 is neutral
- **airline** : 14641 tweets, scores are 3 words {negative, neutral, positive}

In total, we have (7156-213) + (40000-827) + (3886-82) + 14641 = 64561 tweets.

Since we have to use the least precise scale, we will have 3 classes: **positive**, **neutral**, **negative**. We will normalize the classes across all the datasets to perform the model selection. For example, the 13 emotion words of "text-emotion" are mapped to these 3 classes.
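
As a rough illustration of the normalization described above, the sketch below gathers the per-dataset label mappings into plain dictionaries and fails loudly on an unknown label. The names `LABEL_MAPS` and `normalize_label` are hypothetical helpers, not part of the notebook; the mappings themselves mirror the ones used in the loading cells of this notebook.

```python
# Hypothetical helper (not part of the notebook): one mapping per dataset,
# each sending the raw label to -1 (negative), 0 (neutral) or 1 (positive).
LABEL_MAPS = {
    "self-drive": {"1": -1, "2": -1, "3": 0, "4": 1, "5": 1},
    "apple": {"1": -1, "3": 0, "5": 1},
    "airline": {"negative": -1, "neutral": 0, "positive": 1},
    "text-emotion": {
        "sadness": -1, "worry": -1, "hate": -1, "boredom": -1, "anger": -1,
        "neutral": 0,
        "enthusiasm": 1, "surprise": 1, "love": 1, "fun": 1,
        "happiness": 1, "relief": 1,
    },
}


def normalize_label(dataset, raw_label):
    """Return the 3-class label, raising KeyError on an unmapped raw label."""
    return LABEL_MAPS[dataset][str(raw_label)]


print(normalize_label("text-emotion", "worry"))
print(normalize_label("apple", 5))
```

Raising on an unknown label is a deliberate choice in this sketch: `Series.map` with a dictionary silently yields `NaN` for values it does not find, which can hide a mislabeled row.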

In [1]:
import pandas as pd
import numpy as np

In [2]:
# self-drive
df_sd = pd.read_csv('../data/Twitter-sentiment-self-drive-DFE.csv', encoding='latin1')

# remove not relevant tweets
df_sd = df_sd[~df_sd['sentiment'].isin(['not_relevant'])]

# create the main dataframe by extracting the relevant rows
df = pd.concat([df_sd['text'], df_sd['sentiment']], axis=1, keys=['text', 'sentiment'])

# normalize the sentiment values ({1; 2; 3; 4; 5} => {-1; 0; 1})
df['sentiment'] = df['sentiment'].map({'1':-1, '2':-1, '3':0, '4':1, '5':1})
del df_sd

In [3]:
# text-emotion
df_te = pd.read_csv('../data/text_emotion.csv')
df_te = pd.concat([df_te['content'], df_te['sentiment']], axis=1, keys=['text', 'sentiment'])

# remove not relevant tweets
df_te = df_te[~df_te['sentiment'].isin(['empty'])]

# map words into integer values ({...} => {-1; 0; 1})
df_te['sentiment'] = df_te['sentiment'].map({
    'sadness':-1, 
    'enthusiasm':1, 
    'neutral':0, 
    'worry':-1, 
    'surprise':1, 
    'love':1, 
    'fun':1, 
    'hate':-1, 
    'happiness':1, 
    'boredom':-1, 
    'relief':1, 
    'anger':-1})

# append the rows to the main dataframe
# df = pd.concat([df, df_te])
del df_te

In [4]:
# apple
df_ap = pd.read_csv('../data/Apple-Twitter-Sentiment-DFE.csv', encoding='latin1')
df_ap = pd.concat([df_ap['text'], df_ap['sentiment']], axis=1, keys=['text', 'sentiment'])

# remove not relevant tweets
df_ap = df_ap[~df_ap['sentiment'].isin(['not_relevant'])]

# normalize the sentiment values ({1; 3; 5} => {-1; 0; 1})
df_ap['sentiment'] = df_ap['sentiment'].map({'1':-1, '3':0, '5':1})

# append the rows to the main dataframe
# df = pd.concat([df, df_ap])
del df_ap

In [5]:
# airline
df_ai = pd.read_csv('../data/Airline-Sentiment-2-w-AA.csv', encoding='latin1')
df_ai = pd.concat([df_ai['text'], df_ai['airline_sentiment']], axis=1, keys=['text', 'sentiment'])

# normalize the sentiment values ({'negative'; 'neutral'; 'positive'} => {-1; 0; 1})
df_ai['sentiment'] = df_ai['sentiment'].map({'negative':-1, 'neutral':0, 'positive':1})

# append the rows to the main dataframe
# df = pd.concat([df, df_ai])
del df_ai

In [6]:
# Statistical metrics about the classes
df.describe()

| | sentiment |
|---|---|
| count | 6943.0 |
| mean | 0.159585 |
| std | 0.602642 |
| min | -1.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


In [7]:
# Occurrences of the different classes
df['sentiment'].value_counts()

 0    4245
 1    1903
-1     795
Name: sentiment, dtype: int64

## Features extraction
With all the data normalized into a single dataframe, we can perform our model selection. In summary, the four datasets combined give 64560 tweets (26428 negative, 18144 neutral, 19988 positive), and we want to train a model on this dataset. Note that the append steps in the cells above are commented out, so the outputs shown there cover only the 6943 self-drive tweets.

However, to train the models, we first need to choose the features on which the classification is based. Here are the different feature extraction strategies that will be tested:
- Count (Bag-of-Words) VS TF-IDF
- Stopwords from a predefined list VS stopwords from dataset occurrences
- Unigram VS bigram tokenization
- Lowercase VS no lowercase conversion

With 4 binary choices, there are 2^4 = 16 different strategies.
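
The count of 16 can be checked by enumerating the combinations rather than writing them out by hand. The sketch below does this with `itertools.product`; the `strategies` list and its (class name, keyword arguments) layout are assumptions made for illustration, and the option values are taken from the vectorizer parameters used in this notebook.

```python
from itertools import product

# The four binary choices; option values follow the scikit-learn
# vectorizer parameters used in this notebook.
vectorizer_names = ["CountVectorizer", "TfidfVectorizer"]
stop_word_options = ["english", None]
ngram_ranges = [(1, 1), (2, 2)]
lowercase_options = [True, False]

# Hypothetical structure: one (class name, keyword arguments) pair per strategy.
strategies = [
    (name, {"stop_words": sw, "ngram_range": ng, "lowercase": lc})
    for name, sw, ng, lc in product(
        vectorizer_names, stop_word_options, ngram_ranges, lowercase_options
    )
]

print(len(strategies))  # 2 * 2 * 2 * 2 = 16
```

Each pair could then be turned into a vectorizer, for instance with `CountVectorizer(**kwargs)` or `TfidfVectorizer(**kwargs)`, depending on the class name.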

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Array of vectorizers for the feature extraction step
vecs = [
    CountVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=True),
    CountVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=False),
    CountVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=True),
    CountVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=False),
    CountVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=True),
    CountVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=False),
    CountVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=True),
    CountVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=False),
    TfidfVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=True),
    TfidfVectorizer(stop_words='english', ngram_range=(1, 1), lowercase=False),
    TfidfVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=True),
    TfidfVectorizer(stop_words='english', ngram_range=(2, 2), lowercase=False),
    TfidfVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=True),
    TfidfVectorizer(stop_words=None, ngram_range=(1, 1), lowercase=False),
    TfidfVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=True),
    TfidfVectorizer(stop_words=None, ngram_range=(2, 2), lowercase=False)
]

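def feat(df, vec):
    # Return the dense feature matrix and the label vector for a given vectorizer
    # (`.values` replaces `.as_matrix()`, which was removed in pandas 1.0)
    return vec.fit_transform(df['text'].values).toarray(), df['sentiment'].values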

## Grid search model selection
The GridSearchCV class from scikit-learn performs an exhaustive search over specified parameter values for an estimator (e.g. the C parameter of an SVM). For every batch of features extracted by a specific vectorizer, we split the data into two distinct sets, the training set and the testing set (with a 75%-25% size ratio). With these two sets, the grid search finds the best model and parameters using cross-validation.
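
The procedure described above can be sketched on a tiny example. This is only an illustration under stated assumptions: the toy corpus is made up, `cv=3` is chosen to suit its size, and a `Pipeline` is swapped in so the vectorizer is refit inside each cross-validation fold, which differs from the notebook's approach of vectorizing the whole dataset before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus (made up for illustration): 4 tweets per class, labels in {-1, 0, 1}.
texts = [
    "great flight loved it", "awesome phone love it",
    "really happy with the service", "best car ever so good",
    "terrible flight hated it", "awful phone hate it",
    "really angry with the service", "worst car ever so bad",
    "flight leaves at noon", "the phone is on the table",
    "the service opens on monday", "the car is parked outside",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

# 75%-25% split, stratified so each class appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=17, stratify=labels
)

# Vectorizer and classifier chained, so the grid search refits both per fold.
pipeline = Pipeline([("vec", TfidfVectorizer()), ("svm", SVC())])
param_grid = {"svm__kernel": ["linear", "rbf"], "svm__C": [0.1, 1, 5]}

# Exhaustive search over the 2 x 3 parameter grid, fitted on the training set only.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out testing set
```

Fitting the search on the training set alone keeps the testing set unseen, so the final score is a fairer estimate of how the selected model generalizes.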

In [None]:
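# Number of tweets to draw per class; -1 keeps the whole dataframe
SAMPLE_SIZE = -1

# df = df.sample(SAMPLE_SIZE)
# relabel the classes ({-1; 0; 1} => {1; 3; 5})
df['sentiment'] = df['sentiment'].map({-1:1, 0:3, 1:5})
if SAMPLE_SIZE == -1:
    df_sample = df
else:
    # balanced sample: SAMPLE_SIZE tweets from each class
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df_sample = pd.concat([df.loc[df['sentiment'] == label].sample(SAMPLE_SIZE) for label in (5, 3, 1)])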

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import svm

SEED = 17
results = []

for vec in vecs:
    # extract dense features and labels with the current vectorizer
    X, y = vec.fit_transform(df_sample['text'].values).toarray(), df_sample['sentiment'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
#     parameters = {'kernel':('linear', 'rbf'), 'C':[0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.5, 0.7, 1, 2, 5, 7, 10]}
    parameters = {'kernel':('linear', 'rbf'), 'C':[0.01, 0.05, 0.1, 0.5, 1, 5]}
    svr = svm.SVC()
    clf = GridSearchCV(svr, parameters)
    # fit on the training split only, so the testing split stays unseen
    clf.fit(X_train, y_train)
    results.append(clf.cv_results_)

In [None]:
clf.cv_results_

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# X and y still hold the features from the last vectorizer of the loop above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
target_names = ['negative', 'neutral', 'positive']
print(classification_report(y_test, y_pred, target_names=target_names))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
# For each vectorizer, keep the best mean CV score, its index and its parameters
best_scores = []
for r in results:
    im = np.argmax(r['mean_test_score'])
    best_scores.append([r['mean_test_score'][im], im, r['params'][im]])
best_scores