In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})
sns.set_context("paper")    
y_labels = ["love", "haha", "wow", "angry", "sad"]

dataset = pd.read_json("data/preprocessed.json")
dataset = dataset.reset_index(drop=True)
dataset.shape

(9072, 29)

## Notes

- This notebook aims to get a predictive model quickly; it will serve as a benchmark for future work.

- `GridSearchCV` may be the easiest way to do hyperparameter optimization. The scripts that search for the optimal parameters are in the `grid_search` folder.

- We use `StratifiedKFold` to deal with the **imbalanced** dataset.
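To illustrate why stratification matters for an imbalanced dataset, here is a minimal sketch on made-up labels (the 80/20 split below is hypothetical, not the real class distribution). Each held-out fold keeps the same class proportions as the full label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_ratios = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of the minority class in the held-out fold.
    fold_ratios.append(float(y[test_idx].mean()))

print(fold_ratios)
```

A plain `KFold` without shuffling would put all the minority samples in the last fold here, which is the situation stratification avoids.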

## Score

The classification score for this multiclass task is the accuracy, defined by
$$
accuracy(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} 1(\hat{y}_i = y_i)
$$
where $N$ is the number of samples, $\hat{y}_i$ is the predicted value of the $i$-th sample, $y_i$ is the corresponding true value, and $1(\cdot)$ is the indicator function.
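As a quick check of the formula, the sketch below computes it by hand with NumPy and compares it against `sklearn.metrics.accuracy_score`. The label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Hypothetical true and predicted labels.
y_true = ["love", "haha", "wow", "angry", "sad", "love"]
y_pred = ["love", "haha", "sad", "angry", "sad", "wow"]

score = accuracy(y_true, y_pred)           # 4 of the 6 predictions match
sk_score = accuracy_score(y_true, y_pred)  # scikit-learn's implementation
print(score, sk_score)
```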

## Model Selection

### 1. Predictive Model: TF-IDF + Linear SVC

It is very common to follow the bag-of-words approach when extracting features from documents. One of the simplest feature models is TF-IDF.

In scikit-learn, this corresponds to `CountVectorizer` followed by `TfidfTransformer`.
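A minimal sketch of the two feature-extraction steps on a made-up corpus (the three documents are hypothetical, and the default settings are used rather than the tuned ones):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus.
docs = ["I love this", "this is so sad", "love love love"]

# Step 1: bag-of-words counts. The default tokenizer drops
# single-character tokens such as "I".
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Step 2: reweight the counts with TF-IDF; rows are L2-normalized by default.
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)

print(sorted(vect.vocabulary_))
print(counts.shape, tfidf.shape)
```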

Now we can choose a learning algorithm (classifier). According to the scikit-learn algorithm map, http://scikit-learn.org/stable/_static/ml_map.png, the first one we should try is a linear SVC.

**Linear SVM (One vs Rest):** This is the first multiclass classifier to try according to the scikit-learn guide, and we evaluate its accuracy using K-fold cross-validation.

The `C` parameter controls how strongly the model is penalized for misclassifying each training sample: a larger `C` tolerates fewer misclassifications at the cost of a narrower margin.

The next step is to try a range of values of $C > 0$. It is common to use an exponentially growing sequence of powers of 2, like $C = 2^{-5},2^{-3},\ldots,2^{15}$.
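The search over `C` can be sketched as follows. This is not the content of the `grid_search` scripts, which are not shown here: the toy texts and labels are hypothetical, and the step names `vect`, `tfidf` and `clf` are chosen to match the parameter names reported in the scores.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Powers of 2 from 2^-5 to 2^15 in steps of 2 in the exponent.
C_grid = 2.0 ** np.arange(-5, 16, 2)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),  # one-vs-rest for multiclass by default
])

# Hypothetical toy data standing in for the real dataset.
texts = [
    "love this post", "so much love here", "love love love", "really love it",
    "this is so sad", "sad sad news", "very sad story", "such a sad day",
]
labels = ["love"] * 4 + ["sad"] * 4

search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": C_grid},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=2),
)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the real data the grid would also cover the vectorizer parameters (`vect__ngram_range`, `vect__min_df`, `vect__max_df`) and `tfidf__use_idf`, which is what makes the full search slow.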

#### Hyperparameter Optimization

I'd suggest running the file `tfidf_linearsvc.py`, but it takes about 1 hour, so I'll write down the best parameters and their score here.

#### Score

```
best_params_: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__C': 0.29999999999999999, 'vect__min_df': 2, 'vect__max_df': 1.0}
best_score_: 0.724586549063
```

This is not the final model; I'll try other feature models first.

### 2. Predictive Model: TF-IDF + Naive Bayes

Now we should try Multinomial Naive Bayes. A `CountVectorizer` provides the word counts this algorithm needs.
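A minimal sketch of such a pipeline, assuming the same `vect` / `tfidf` / `clf` step names that appear in the reported parameters. The training texts are hypothetical, and the vectorizer keeps its defaults because the tuned `min_df` would discard every term of such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nb = Pipeline([
    ("vect", CountVectorizer()),
    # use_idf=False keeps plain (normalized) term frequencies.
    ("tfidf", TfidfTransformer(use_idf=False)),
    ("clf", MultinomialNB()),
])

# Hypothetical toy data.
texts = ["love it", "so much love", "so sad", "sad news"]
labels = ["love", "love", "sad", "sad"]

nb.fit(texts, labels)
prediction = nb.predict(["love love"])[0]
proba = nb.predict_proba(["love love"])[0]
print(prediction, proba)
```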

#### Hyperparameter Optimization

The same applies to the file `tfidf_naivebayes.py`. I'll write down the best parameters and their score.

#### Score

```
best_params_: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False, 'vect__min_df': 4, 'vect__max_df': 0.040000000000000001}
best_score_: 0.680044101433
```

### 3. Predictive Model: TF-IDF + Stochastic Gradient Descent

The next one is SGD. The dataset has fewer than 10K samples, and SGD did not give a better score than the linear SVC.

#### Hyperparameter Optimization

The same applies to the file `tfidf_sgd.py`.

#### Score

```
best_params_: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'clf__n_iter': 10000, 'clf__loss': 'modified_huber', 'tfidf__use_idf': True, 'vect__min_df': 1, 'clf__penalty': 'l2', 'clf__alpha': 1e-05}
best_score_: 0.692570546737
```

**Apparently the Linear SVC gives the best score, so I'll keep using it with other feature models. I won't try any other learning algorithm either, since I just want a benchmark.**

I will keep this SGD model so that I can get the probability of belonging to each class.
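The sketch below shows how class probabilities can be read from an SGD model trained with the `modified_huber` loss, which supports `predict_proba`. It is written against a recent scikit-learn, where the old `n_iter` parameter has been replaced by `max_iter`; the toy data and the smaller iteration count are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

sgd = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", SGDClassifier(
        loss="modified_huber",  # this loss enables predict_proba
        penalty="l2",
        alpha=1e-05,
        max_iter=1000,          # assumption: far fewer iterations than the tuned run
        random_state=0,
    )),
])

# Hypothetical toy data with three of the reaction labels.
texts = [
    "love this so much", "love love it",
    "haha so funny", "haha what a joke",
    "so sad today", "sad sad news",
]
labels = ["love", "love", "haha", "haha", "sad", "sad"]

sgd.fit(texts, labels)

# One row per document, one column per class (ordered as in classes_).
proba = sgd.predict_proba(["love this", "sad news"])
classes = list(sgd.named_steps["clf"].classes_)
print(classes)
print(proba)
```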

## Feature set in a 2D space


![Image](notebook_figures/tfidf_2d.png)

This image was created by running `visualization/tfidf_2d.py`.

It takes a few minutes to save the scatter plot in `notebook_figures`, and the result looks messy, so I'll try LDA next.
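The script itself is not shown here, so the following is only one possible way to obtain 2D coordinates from TF-IDF features: a truncated SVD down to two components. The choice of `TruncatedSVD` and the toy documents are assumptions, not necessarily what `visualization/tfidf_2d.py` does. `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
docs = [
    "love this so much", "love love it",
    "haha so funny", "haha what a joke",
    "so sad today", "sad sad news",
]

# Sparse TF-IDF matrix of shape (n_documents, n_terms).
tfidf = TfidfVectorizer().fit_transform(docs)

# Project to two dimensions; TruncatedSVD works directly on sparse input.
svd = TruncatedSVD(n_components=2, random_state=0)
points_2d = svd.fit_transform(tfidf)

print(points_2d.shape)
```

The two columns of `points_2d` can then be passed to a scatter plot, colored by class label.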