# Goals of this Notebook
* Use scikit-learn's logistic regression on a multi-class problem: judging how favorable a movie review is
* Classify the sentiment of phrases taken from movie reviews in the Rotten Tomatoes dataset

## Phrase Classifications
* Phrases can be classified as one of the following sentiments:
  * Negative  (0) 
  * Somewhat negative (1)
  * Neutral (2)
  * Somewhat positive (3)
  * Positive (4)
* Factors that make the sentiment harder to classify include:
  * Sarcasm
  * Negation
  * Other turns-of-phrase
  
The data is available as `train.tsv` on [kaggle.com](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)
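
The five integer labels listed above are easier to read in later output when paired with their names. A minimal sketch of such a lookup follows; the names `SENTIMENT_NAMES` and `sentiment_name` are illustrative and not part of the dataset or of scikit-learn.

```python
# Mapping from the integer labels in the Sentiment column to the names
# listed above. The dictionary and helper names are hypothetical.
SENTIMENT_NAMES = {
    0: 'negative',
    1: 'somewhat negative',
    2: 'neutral',
    3: 'somewhat positive',
    4: 'positive',
}


def sentiment_name(label):
    """Return the readable name for an integer sentiment label."""
    return SENTIMENT_NAMES[int(label)]


print(sentiment_name(1))
```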

## Multi-class classification

In [1]:
import pandas as pd
df = pd.read_csv('./train.tsv', header=0, delimiter='\t')
print(df.count())

PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64


In [2]:
print(df.head())

   PhraseId  SentenceId                                             Phrase  \
0         1           1  A series of escapades demonstrating the adage ...   
1         2           1  A series of escapades demonstrating the adage ...   
2         3           1                                           A series   
3         4           1                                                  A   
4         5           1                                             series   

   Sentiment  
0          1  
1          2  
2          2  
3          2  
4          2  


In [3]:
# Print the first ten phrases
print(df['Phrase'].head(10))

0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
5    of escapades demonstrating the adage that what...
6                                                   of
7    escapades demonstrating the adage that what is...
8                                            escapades
9    demonstrating the adage that what is good for ...
Name: Phrase, dtype: object


In [4]:
print(df['Sentiment'].describe())

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64


In [5]:
# Count how many phrases carry each sentiment label
print(df['Sentiment'].value_counts())

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64
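
The counts above show that the classes are imbalanced: about half of the phrases are neutral. A rough sketch of what that implies, using only the counts printed above, is the accuracy of a trivial classifier that always predicts the most frequent label. This baseline is added here for context and is not part of the original notebook.

```python
# Label counts copied from the value_counts() output above.
counts = {2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072}
total = sum(counts.values())

# Share of phrases per label.
for label in sorted(counts):
    print('%d: %.3f' % (label, counts[label] / total))

# A classifier that always predicts the most frequent label is right
# exactly as often as that label occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print('Majority label: %d, baseline accuracy: %.3f' % (majority_label, baseline_accuracy))
```

Any accuracy reported for a trained model is best read against this baseline rather than against zero.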


In [12]:
# Training a classifier with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load the file, separate the phrases (features) from the sentiments (labels),
# and split the data in half. Then create the pipeline and the parameter grid.
df = pd.read_csv('./train.tsv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1,1), (1,2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
# Apply grid search algorithm, fit the model, and print the output
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 24 candidates, totalling 72 fits

[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  5.1min finished

Best score: 0.620
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
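
The log line "Fitting 3 folds for each of 24 candidates, totalling 72 fits" follows directly from the size of the parameter grid. As a small sketch, the candidates can be enumerated with the standard library alone. The fold count of 3 is taken from the log; it is not set explicitly in the code, and newer scikit-learn versions default to 5 folds.

```python
from itertools import product

# Parameter grid copied from the cell above.
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

# Each combination of one value per parameter is one candidate model.
candidates = [
    dict(zip(parameters, values))
    for values in product(*parameters.values())
]

n_folds = 3  # assumption: read off the log output, not set in the code
print('%d candidates, %d fits' % (len(candidates), len(candidates) * n_folds))
```

Because the number of fits grows multiplicatively with every parameter added, each extra value in the grid has a direct cost in training time.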


## Multi-Class Classification Performance Metrics

* Multi-class classification shares many metrics with binary classification, including:
    * Confusion matrices
    * Precision
    * Recall
    * F1 score
    * Accuracy

In [13]:
# Multi-class classification performance metrics
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))

Accuracy: 0.6356016916570549
Confusion Matrix:
[[ 1155  1668   622    77     6]
 [  961  5985  6250   535    33]
 [  223  3116 32588  3650   157]
 [   18   370  6466  8221  1309]
 [    5    26   488  2454  1647]]
Classification Report:
             precision    recall  f1-score   support

          0       0.49      0.33      0.39      3528
          1       0.54      0.43      0.48     13764
          2       0.70      0.82      0.76     39734
          3       0.55      0.50      0.52     16384
          4       0.52      0.36      0.42      4620

avg / total       0.62      0.64      0.62     78030
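
The accuracy and the per-class figures in the report can all be recovered from the confusion matrix, which helps when reading it. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. It uses plain Python rather than the scikit-learn functions.

```python
# Confusion matrix copied from the output above:
# rows are true labels 0-4, columns are predicted labels 0-4.
cm = [
    [1155, 1668,   622,   77,    6],
    [ 961, 5985,  6250,  535,   33],
    [ 223, 3116, 32588, 3650,  157],
    [  18,  370,  6466, 8221, 1309],
    [   5,   26,   488, 2454, 1647],
]
n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified phrases (the diagonal) over all phrases.
accuracy = sum(cm[i][i] for i in range(n)) / total

# Precision: diagonal entry over its column sum (everything predicted as i).
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

# Recall: diagonal entry over its row sum (everything that truly is i).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# F1: harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print('Accuracy: %.4f' % accuracy)
for i in range(n):
    print('%d  precision %.2f  recall %.2f  f1 %.2f' % (i, precision[i], recall[i], f1[i]))
```

Read row by row, the matrix also shows that most errors fall into an adjacent class: for example, phrases labelled 1 are misclassified as 2 more often than they are classified correctly.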

