## Lighthouse Labs
### W05D04 Comparing Classifiers
Instructor: Socorro Dominguez  
October 15, 2020

**Extra**

In [35]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import tree 

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB

# For tokenization
import nltk
# For converting words into frequency counts
from sklearn.feature_extraction.text import CountVectorizer

# ignore warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Demo
The learning objectives are to:

1. learn how to use classification methods with `scikit-learn`
2. compare and contrast different classification methods

## Data and preprocessing

We will focus on a task called sentiment analysis: assigning a positive or negative label to a text based on the sentiment or attitude expressed in it.

We will use a subset of 3,000 rows from the [IMDB movie review data set](https://www.kaggle.com/utathya/imdb-review-dataset) (the original data has 50,000 examples).

In [3]:
# Read IMDB movie reviews into a pandas DataFrame
imdb_df = pd.read_csv('data/imdb_master.csv', encoding = "ISO-8859-1")
imdb_df.head()

|   | Unnamed: 0 | type | review | label | file |
|---|---|---|---|---|---|
| 0 | 0 | test | Once again Mr. Costner has dragged out a movie... | neg | 0_2.txt |
| 1 | 1 | test | This is an example of why the majority of acti... | neg | 10000_4.txt |
| 2 | 2 | test | First of all I hate those moronic rappers, who... | neg | 10001_1.txt |
| 3 | 3 | test | Not even the Beatles could write songs everyon... | neg | 10002_3.txt |
| 4 | 4 | test | Brass pictures (movies is not a fitting word f... | neg | 10003_3.txt |


In [4]:
# Keep pos and neg reviews
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]

# Sample 3000 rows from the dataframe. 
imdb_df_subset = imdb_df.sample(n = 3000)

# Convert a collection of text documents to a matrix of token presence or absence. 
# We are using only 5000 words, English stopwords, and tokenization is done using nltk
movie_vec = CountVectorizer(max_features=5000, 
                            tokenizer=nltk.word_tokenize, 
                            stop_words='english', 
                            binary = True)

# Create X and y
X = movie_vec.fit_transform(imdb_df_subset['review'])
y = imdb_df_subset.label
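
To make the binary bag-of-words encoding above concrete, here is a minimal sketch on a two-review toy corpus. The names `toy_reviews`, `toy_vec`, and `toy_X` are made up for illustration, and the sketch uses `CountVectorizer`'s default tokenizer rather than `nltk.word_tokenize` so that it runs without extra downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not part of the IMDB data)
toy_reviews = ["great great movie", "terrible movie"]

# binary=True records presence/absence, so the repeated "great" still maps to 1
toy_vec = CountVectorizer(binary=True)
toy_X = toy_vec.fit_transform(toy_reviews)

# Recover the column order from the fitted vocabulary (word -> column index)
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocab)            # ['great', 'movie', 'terrible']
print(toy_X.toarray())  # [[1 1 0]
                        #  [0 1 1]]
```

Each row is a review and each column is a word; with `binary=False` the first entry would be 2 instead of 1.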

## Comparison of classifiers

We will compare different classifiers covered so far in your Lighthouse Journey:
  
  * $k$-nearest neighbours  
  * decision trees
  * random forests
  * SVM
  * Logistic Regression
  * naive Bayes  
   
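For each classifier, we are going to use the scikit-learn implementation with **default** hyperparameters (the code below sets only `n_estimators=10` for the random forest and `gamma='scale'` for the SVM explicitly).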

For an in-depth analysis, you will have to choose a model and do more work.

### Empirical comparison

Split the dataset into train and test sets. Report the results for all classifiers in a table with 3 columns: Classifier, Train Accuracy, and Test Accuracy.

If time allows, let's discuss the following results:
  - Are certain classifiers better for certain problems?
  - Would you change any of the hyperparameters? 
  - What about speed? What would you do for a huge data set? 
  - Any other considerations?
  
**Note:** because all sklearn classifiers use the same fit/predict structure, you can put your classifiers in a list/dict and iterate over them with a loop.

This will probably be easier than writing separate calls to fit and predict for every classifier.

In [32]:
results_dict = {'Classifier':[],
                'Train Accuracy':[], 
                'Test Accuracy':[]
               }

models = {
    'knn'           : KNeighborsClassifier(),
    'decision tree' : DecisionTreeClassifier(),
    'random forest' : RandomForestClassifier(n_estimators=10),
    'SVM'           : SVC(gamma='scale'),
    'logistic regression': LogisticRegression(),
    'Naive Bayes' : MultinomialNB()
}                                                    

In [33]:
# Divide training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20,
                                                    random_state = 15)

print('Working on IMDB dataset')

# Looping through models
for model_name, model in models.items():
    print("Fitting %s..." % model_name)
    model.fit(X_train, y_train)
    train_accuracy = model.score(X_train, y_train)*100
    test_accuracy = model.score(X_test, y_test)*100
    results_dict['Classifier'].append(model_name)
    results_dict['Train Accuracy'].append(train_accuracy)
    results_dict['Test Accuracy'].append(test_accuracy)

results_df = pd.DataFrame(results_dict)

Working on IMDB dataset
Fitting knn...
Fitting decision tree...
Fitting random forest...
Fitting SVM...
Fitting logistic regression...
Fitting Naive Bayes...


In [34]:
results_df.round(2)

|   | Classifier | Train Accuracy | Test Accuracy |
|---|---|---|---|
| 0 | knn | 68.96 | 58.67 |
| 1 | decision tree | 100.0 | 67.5 |
| 2 | random forest | 99.29 | 76.17 |
| 3 | SVM | 98.54 | 84.83 |
| 4 | logistic regression | 100.0 | 84.5 |
| 5 | Naive Bayes | 92.38 | 84.83 |
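
One of the discussion questions above asks about speed. A minimal sketch of how fit and predict times could be measured is shown here, assuming synthetic data from `make_classification` in place of the IMDB matrix. The names `X_toy`, `y_toy`, `timing_models`, and `timings` are made up for this example, and the timings depend on the machine, so no expected values are given.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; swap in X_train / y_train in the notebook)
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

timing_models = {
    'knn': KNeighborsClassifier(),
    'SVM': SVC(gamma='scale'),
    'logistic regression': LogisticRegression(max_iter=1000),
}

timings = {}
for name, clf in timing_models.items():
    # Time training
    start = time.perf_counter()
    clf.fit(X_toy, y_toy)
    fit_seconds = time.perf_counter() - start

    # Time prediction separately: some models are cheap to fit but slow to query
    start = time.perf_counter()
    clf.predict(X_toy)
    predict_seconds = time.perf_counter() - start

    timings[name] = {'fit (s)': fit_seconds, 'predict (s)': predict_seconds}

for name, t in timings.items():
    print(f"{name:20s} fit={t['fit (s)']:.4f}s  predict={t['predict (s)']:.4f}s")
```

Measuring fit and predict separately matters because $k$-nearest neighbours does most of its work at prediction time, whereas an SVM spends most of its time in fitting.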


## Support Vector Machines (SVM)

### Kernels in SVM classification

1. Let's try three different kernels: linear, polynomial (with `degree=2`), and RBF. 
2. Report the train and test accuracies in each case.
3. Why do you think `scikit-learn` uses an RBF kernel by default? 

In [26]:
kernel_experiments = {'kernel':[], 'train_accuracy %':[], 'test_accuracy %':[]}
kernels = ['linear', 'poly', 'rbf']

for kernel in kernels:
    model = SVC(kernel = kernel, degree=2, gamma='scale') 
    model.fit(X_train, y_train)
    train_accuracy = model.score(X_train, y_train)*100
    test_accuracy = model.score(X_test, y_test)*100
    
    kernel_experiments['kernel'].append(kernel)
    kernel_experiments['train_accuracy %'].append(train_accuracy)
    kernel_experiments['test_accuracy %'].append(test_accuracy)

In [27]:
kernel_df = pd.DataFrame(kernel_experiments)
kernel_df.round(2)

|   | kernel | train_accuracy % | test_accuracy % |
|---|---|---|---|
| 0 | linear | 100.0 | 84.0 |
| 1 | poly | 96.46 | 78.83 |
| 2 | rbf | 98.54 | 84.83 |
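
As background for the kernel questions above: the RBF kernel scores the similarity of two points as $\exp(-\gamma \lVert a - b \rVert^2)$, so identical points score 1 and distant points score close to 0. The sketch below computes this by hand for two made-up vectors and checks it against `sklearn.metrics.pairwise.rbf_kernel`. The vectors and the value `gamma = 0.5` are arbitrary choices for illustration, not values used in the experiments.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical binary feature vectors (e.g. word presence/absence)
a = np.array([[1.0, 0.0, 1.0]])
b = np.array([[1.0, 1.0, 0.0]])
gamma = 0.5  # arbitrary illustrative value

# Manual computation: exp(-gamma * squared Euclidean distance)
squared_distance = np.sum((a - b) ** 2)      # the vectors differ in 2 positions
manual = np.exp(-gamma * squared_distance)   # exp(-1)

# Same quantity from scikit-learn
library = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual, library)
```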


### `C` hyperparameter

1. Play around with the `C` hyperparameter. Try 5 different values. 
2. What effect does it have? Can you relate it to the fundamental tradeoff of ML?

In [30]:
C_experiments = {}
C = [10.0**(i-1) for i in range(5)]

for c in C:
    model = SVC(C = c, gamma='scale')        
    model.fit(X_train, y_train)
    train_accuracy = model.score(X_train, y_train)*100
    test_accuracy = model.score(X_test, y_test)*100
    C_experiments[c] = {'train_accuracy':train_accuracy, 'test_accuracy':test_accuracy}        

In [31]:
C_df = pd.DataFrame(C_experiments)
C_df

|   | 0.1 | 1.0 | 10.0 | 100.0 | 1000.0 |
|---|---|---|---|---|---|
| train_accuracy | 65.0 | 98.541667 | 100.0 | 100.0 | 100.0 |
| test_accuracy | 58.166667 | 84.833333 | 84.166667 | 84.166667 | 84.166667 |


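A very small `C` (0.1) underfits: both train (65.0%) and test (58.17%) accuracy are low. From `C=10` upward the model fits the training set perfectly (100%) while test accuracy slips slightly, from 84.83% to 84.17%. A larger `C` means weaker regularization, so increasing `C` pushes the model toward overfitting.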
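
The trade-off can be explored on any dataset by comparing training accuracy with cross-validated accuracy as `C` grows. The following is a minimal sketch, assuming noisy synthetic data from `make_classification`. The names `X_toy`, `y_toy`, and `c_summary` are made up here, and the three `C` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data with 10% label noise (an assumption, not the IMDB data)
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   flip_y=0.1, random_state=0)

c_summary = {}
for c in [0.01, 1.0, 1000.0]:
    clf = SVC(C=c, gamma='scale')
    # Cross-validated accuracy estimates performance on unseen data
    cv_accuracy = cross_val_score(clf, X_toy, y_toy, cv=5).mean()
    # Training accuracy shows how closely the model fits the data it has seen
    train_accuracy = clf.fit(X_toy, y_toy).score(X_toy, y_toy)
    c_summary[c] = {'train': train_accuracy, 'cv': cv_accuracy}
    print(f"C={c:<8} train={train_accuracy:.3f}  cv={cv_accuracy:.3f}")
```

A widening gap between the two columns as `C` increases is the usual symptom of overfitting; in practice `C` would be chosen by the cross-validated score, not the training score.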

## Being Naive with Naive Bayes

In [40]:
nb_dict = {'Classifier':[],
           'Train Accuracy':[], 
           'Test Accuracy':[]
               }

nb_models = {'Bernoulli' : BernoulliNB(),
             'Multinomial' : MultinomialNB()
}                                                    

In [44]:
# Divide training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20,
                                                    random_state = 15)

print('Working on IMDB dataset')

# Looping through models
for model_name, model in nb_models.items():
    print("Fitting %s..." % model_name)
    model.fit(X_train, y_train)
    train_accuracy = model.score(X_train, y_train)*100
    test_accuracy = model.score(X_test, y_test)*100
    nb_dict['Classifier'].append(model_name)
    nb_dict['Train Accuracy'].append(train_accuracy)
    nb_dict['Test Accuracy'].append(test_accuracy)

nb_results_df = pd.DataFrame(nb_dict)

Working on IMDB dataset
Fitting Bernoulli...
Fitting Multinomial...


In [45]:
nb_results_df.round(2)

|   | Classifier | Train Accuracy | Test Accuracy |
|---|---|---|---|
| 0 | Bernoulli | 92.54 | 85.0 |
| 1 | Multinomial | 92.38 | 84.83 |


In [54]:
fake_reviews = ['This movie was excellent! The performances were oscar-worthy!',
               'Unbelievably disappointing.', 
               'Full of zany characters and richly applied satire, and some great plot twists',
               'This is the greatest screwball comedy ever filmed',
               'It was pathetic. The worst part about it was the boxing scenes.', 
               '''It could have been a great movie. It could have been excellent, 
                and to all the people who have forgotten about the older, 
                greater movies before it, will think that as well. 
                It does have beautiful scenery, some of the best since Lord of the Rings. 
                The acting is well done, and I really liked the son of the leader of the Samurai.
                He was a likeable chap, and I hated to see him die...
                But, other than all that, this movie is nothing more than hidden rip-offs.
                '''
              ]
realfake_labels = ['pos', 'neg', 'pos', 'pos', 'neg', 'neg']

In [55]:
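# Encode the fake reviews with the vectorizer fitted on the IMDB subset.
# movie_vec was created with binary=True, so the entries are already 0/1;
# the comparison below just makes the presence/absence encoding explicit.
fake_reviews_counts = movie_vec.transform(fake_reviews)
fake_reviews_binary = fake_reviews_counts > 0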

In [56]:
model = MultinomialNB()
model.fit(X_train, y_train)

MultinomialNB()

In [57]:
# Predict using the Naive Bayes classifier
predictions = model.predict(fake_reviews_binary)

In [59]:
pd.set_option('display.max_colwidth', 0)
d = {'Review':fake_reviews, 'Real(Fake) labels':realfake_labels, 'NB labels':predictions}
df = pd.DataFrame(d)
df

|   | Review | Real(Fake) labels | NB labels |
|---|---|---|---|
| 0 | This movie was excellent! The performances were oscar-worthy! | pos | pos |
| 1 | Unbelievably disappointing. | neg | neg |
| 2 | Full of zany characters and richly applied satire, and some great plot twists | pos | pos |
| 3 | This is the greatest screwball comedy ever filmed | pos | pos |
| 4 | It was pathetic. The worst part about it was the boxing scenes. | neg | neg |
| 5 | It could have been a great movie. It could have been excellent, and to all the people who have forgotten about the older, greater movies before it, will think that as well. It does have beautiful scenery, some of the best since Lord of the Rings. The acting is well done, and I really liked the son of the leader of the Samurai. He was a likeable chap, and I hated to see him die... But, other than all that, this movie is nothing more than hidden rip-offs. | neg | pos |
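
One plausible reading of the last row (an interpretation, not something the notebook states) is that naive Bayes treats every word as independent evidence, so a review packed with positive words can be labelled `pos` even when its final sentence reverses the verdict. The sketch below reproduces that effect on a made-up four-review training set; all data and names (`toy_train`, `toy_nb`, `mixed_review`) are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set
toy_train = ["great excellent beautiful",
             "great acting excellent",
             "worst pathetic boring",
             "boring dull worst"]
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_vec = CountVectorizer(binary=True)
toy_nb = MultinomialNB().fit(toy_vec.fit_transform(toy_train), toy_labels)

# Four positive words followed by a negative twist; "but" is unseen and ignored
mixed_review = ["great excellent beautiful acting but boring"]
mixed_X = toy_vec.transform(mixed_review)

toy_prediction = toy_nb.predict(mixed_X)[0]
toy_proba = dict(zip(toy_nb.classes_, toy_nb.predict_proba(mixed_X)[0]))

print(toy_prediction)  # 'pos'
print(toy_proba)
```

The single negative word is outvoted by the four positive ones because the model has no notion of word order or of "but" signalling a reversal.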
