<img src="figures/NWEA_logo.png" width="200">

# <span style='color:#ffd800' > Step-by-Step Scikit-learn: Working with Text Data </span>

## <span style='color:#ffd800'> A Machine Learning Classification Problem </span>

<table><tr>
<td> <img src="figures/fig8_imbd.png" width="300"> </td>
</tr></table>

In this session, we will build a model that automatically classifies IMDB movie reviews as having either a positive or a negative sentiment.

The IMDB dataset contains 50K movie reviews for natural language processing and text analytics.
It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

## <span style='color:#ffd800'> How to Get the Data </span>

http://ai.stanford.edu/~amaas/data/sentiment/

https://github.com/nas5w/imdb-data

## <span style='color:#ffd800'> Load the Data </span>

### <span style='color:#ffd800'>  Step 1. Load the JSON file into Python </span>

In [173]:
import json

file_name = './Data/reviews.json'

with open(file_name) as json_file:
    reviews = json.load(json_file)

#### <span style='color:#ffd800'> Sample Size </span>

In [174]:
len(reviews)

50000

#### <span style='color:#ffd800'> Raw data </span>

In [175]:
reviews[0:5]

[{'t': "Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",
  's': 0},
 {'t': "This is an example of why the majority of action films are the same. G

### <span style='color:#ffd800'>  Step 2. Convert to a DataFrame </span>

In [65]:
import pandas as pd
reviews = pd.DataFrame(reviews)
reviews.columns = ['text', 'sentiment']
reviews

Unnamed: 0,text,sentiment
0,Once again Mr. Costner has dragged out a movie...,0
1,This is an example of why the majority of acti...,0
2,"First of all I hate those moronic rappers, who...",0
3,Not even the Beatles could write songs everyon...,0
4,Brass pictures (movies is not a fitting word f...,0
...,...,...
49995,"Seeing as the vote average was pretty low, and...",1
49996,"The plot had some wretched, unbelievable twist...",1
49997,I am amazed at how this movie(and most others ...,1
49998,A Christmas Together actually came before my t...,1


In [70]:
reviews.sentiment.value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In [135]:
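import numpy as np
np.random.seed(123)
# Draw 500 row positions per class (negative reviews come first, positive reviews second).
# Note: randint excludes its upper bound and samples with replacement,
# so the last row of each class is never drawn and a row may be drawn more than once.
index_neg = np.random.randint(0, 24999, size = 500)
index_pos = np.random.randint(25000, 49999, size = 500)
reviews_neg = reviews.iloc[index_neg, :]
reviews_pos = reviews.iloc[index_pos, :]
# Stack the two samples and renumber the rows from 0
reviews_sub = pd.concat([reviews_neg, reviews_pos], axis=0).reset_index(drop=True)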

In [151]:
reviews_sub

Unnamed: 0,text,sentiment
0,Tromaville High has become an amoral wasteland...,0
1,It just seems bizarre that someone read this s...,0
2,Normally I'm not motivated to write reviews. B...,0
3,This movie is deeply idiotic. A man wants reve...,0
4,Well the reason for seeing it in the cinema wa...,0
...,...,...
995,"Well, after long anticipation after seeing a f...",1
996,"In April 1947, New York City faced an epidemic...",1
997,"In the early 1970s, many of us who had embrace...",1
998,"While the ""date doctor"" concept is the one thi...",1


In [86]:
reviews_sub.shape

(1000, 2)

In [87]:
reviews_sub.sentiment.value_counts()

1    500
0    500
Name: sentiment, dtype: int64
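
The `np.random.randint` call above draws row positions with replacement, so the 1,000-row subset may contain the same review more than once. A minimal sketch of a duplicate-free alternative follows, assuming a DataFrame with a `sentiment` column like `reviews`. The toy frame `demo` and the names `parts` and `demo_sub` are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for `reviews`: 20 negative rows followed by 20 positive rows.
demo = pd.DataFrame({
    'text': [f'review {i}' for i in range(40)],
    'sentiment': [0] * 20 + [1] * 20,
})

# DataFrame.sample draws without replacement by default;
# random_state makes the draw repeatable.
parts = [demo[demo.sentiment == label].sample(n=5, random_state=123)
         for label in (0, 1)]
demo_sub = pd.concat(parts).reset_index(drop=True)

print(demo_sub.shape)
print(demo_sub.text.is_unique)
```

With the real data, `n=500` per class would give the same balanced 1,000-row subset without repeated rows.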

## <span style='color:#ffd800'> Data Preprocessing </span>

### <span style='color:#ffd800'> Split the data into training and testing sets </span>

In [155]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(reviews_sub, test_size=0.2, random_state=123)
train.sentiment.value_counts()

1    401
0    399
Name: sentiment, dtype: int64

In [156]:
train, test = train_test_split(reviews_sub, test_size=0.2, 
                                       random_state=123, 
                                       stratify=reviews_sub['sentiment'])
train.sentiment.value_counts()

1    400
0    400
Name: sentiment, dtype: int64

In [172]:
# Pull the text and label columns out as NumPy arrays
train_x = train['text'].to_numpy()
train_y = train['sentiment'].to_numpy()

test_x = test['text'].to_numpy()
test_y = test['sentiment'].to_numpy()

train_x[0]

'It was praised to be a fast paced screwball comedy and the best German movie of the year, so I gave it a try, even though I\'ve already seen some films by Dani Levy - or at least parts of them.<br /><br />I got what I had expected: no comedy at all, unless you think that heart attacks are funny. It\'s a fine example of sloppy screen writing, with an implausible plot and characters, loaded with clichés that might be true, but surely are not funny either.<br /><br />The most annoying character is that of Zucker\'s wife, played by Hannelore Elsner. She has to behave incredibly strange to keep the plot moving. For example: She doesn\'t know a single thing about Judaism, but by reasons most likely unknown to even herself she gets the idea to play the charade that she and her family are Jewish laws obeying Jews for her husband\'s family, who really are, and of the very orthodox and self-righteous variety. To make it a bit more complicated, she invites the four of them to stay at her city fl

### <span style='color:#ffd800'> Turn text content into numerical feature vector </span>

#### <span style='color:#ffd800'> The most intuitive method: Bag of words </span>
https://en.wikipedia.org/wiki/Bag-of-words_model
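
To make the idea concrete, here is a small sketch of a bag-of-words encoding on two made-up sentences. The documents and the names `docs`, `toy_vectorizer`, `counts` and `vocab` are hypothetical and not taken from the IMDB data. Each column corresponds to one word of the vocabulary and each cell holds how often that word occurs in the document; word order is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, not taken from the IMDB data).
docs = ['good movie good plot', 'bad movie']

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)             # ['bad', 'good', 'movie', 'plot']
print(counts.toarray())  # [[0 2 1 1]
                         #  [1 0 1 0]]
```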

#### <span style='color:#ffd800'> Text preprocessing </span>

In [217]:
from sklearn.feature_extraction.text import CountVectorizer
vectorize = CountVectorizer()

vectorize.fit(train_x)
train_x_vectors = vectorize.transform(train_x)

# Equivalent one-step form: train_x_vectors = vectorize.fit_transform(train_x)

test_x_vectors = vectorize.transform(test_x)
print(train_x_vectors.shape, test_x_vectors.shape)

(800, 15553) (200, 15553)


In [210]:
train_x[1]

'One of my favorite villains, the Evil Princess is just the perfect villain for this movie. Full of space travel, horses, diamonds, mystical characters, colorful backgrounds, evil characters, etc etc. Very bright, full of action, you will not get bored. Great movie!'

In [213]:
print(train_x_vectors[1])
# print(train_x_vectors[1].toarray())

  (0, 327)	1
  (0, 1178)	1
  (0, 1732)	1
  (0, 1852)	1
  (0, 2363)	2
  (0, 2743)	1
  (0, 3895)	1
  (0, 4822)	2
  (0, 4864)	2
  (0, 5183)	1
  (0, 5512)	1
  (0, 5695)	2
  (0, 5867)	1
  (0, 6093)	1
  (0, 6712)	1
  (0, 7383)	1
  (0, 7621)	1
  (0, 9137)	2
  (0, 9222)	1
  (0, 9233)	1
  (0, 9493)	1
  (0, 9612)	3
  (0, 9664)	1
  (0, 10124)	1
  (0, 10666)	1
  (0, 12899)	1
  (0, 13886)	2
  (0, 13937)	1
  (0, 14233)	1
  (0, 14830)	1
  (0, 14882)	1
  (0, 14884)	1
  (0, 15257)	1
  (0, 15491)	1


In [203]:
vectorize.vocabulary_.get('favorite')

5183

In [208]:
# In scikit-learn 1.0 and later, use get_feature_names_out() instead
len(vectorize.get_feature_names())

15553

In [214]:
# train_x_vectors
# train_y

## <span style='color:#ffd800'> Classification </span>

### <span style='color:#ffd800'> SVM </span>

In [253]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)



SVC(kernel='linear')

In [250]:
test_y[0]

0

In [251]:
clf_svm.predict(test_x_vectors[0])

array([1])

### <span style='color:#ffd800'> Random Forest </span>

In [265]:
from sklearn.ensemble import RandomForestClassifier 

clf_rfc = RandomForestClassifier()
clf_rfc.fit(train_x_vectors, train_y)
clf_rfc.predict(test_x_vectors[0])


array([0])

### <span style='color:#ffd800'> Logistic Regression </span>

In [225]:
from sklearn.linear_model import LogisticRegression

clf_lgr = LogisticRegression(max_iter=10000)
clf_lgr.fit(train_x_vectors, train_y)
clf_lgr.predict(test_x_vectors[0])

array([0])

### <span style='color:#ffd800'> Naive Bayes </span>

In [239]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
clf_gnb.predict(test_x_vectors[0].toarray())

array([0])
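
`GaussianNB` needs a dense array, which is why the cell above calls `.toarray()`. As a hedged alternative (not the method used in this notebook), `MultinomialNB` is designed for count features and accepts the sparse matrix directly. The sketch below uses made-up reviews; `toy_text`, `toy_y`, `toy_vec`, `clf_mnb` and `toy_pred` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews; 1 = positive, 0 = negative.
toy_text = ['great fun great cast', 'loved it great',
            'boring and awful', 'awful plot boring']
toy_y = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_text)   # stays a SciPy sparse matrix

clf_mnb = MultinomialNB()
clf_mnb.fit(toy_X, toy_y)                 # no .toarray() needed

toy_pred = clf_mnb.predict(toy_vec.transform(['great cast', 'boring plot']))
print(toy_pred)
```

Avoiding the dense conversion matters more as the vocabulary grows, since a dense matrix stores every zero explicitly.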

### <span style='color:#ffd800'> Nearest Neighbors Classification </span> 

In [243]:
from sklearn.neighbors import KNeighborsClassifier

clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn.fit(train_x_vectors, train_y)
clf_knn.predict(test_x_vectors[0])

array([1])

## <span style='color:#ffd800'> Model Evaluation </span>

### <span style='color:#ffd800'> Simple score </span>

In [282]:
print('SVM mean accuracy: ' + str(clf_svm.score(test_x_vectors, test_y)))
print('Random Forest mean accuracy: ' + str(clf_rfc.score(test_x_vectors, test_y)))
print('Logistic Regression mean accuracy: ' + str(clf_lgr.score(test_x_vectors, test_y)))
print('Naive Bayes mean accuracy: ' + str(clf_gnb.score(test_x_vectors.toarray(), test_y)))
print('kNN mean accuracy: ' + str(clf_knn.score(test_x_vectors, test_y)))

SVM mean accuracy: 0.77
Random Forest mean accuracy: 0.795
Logistic Regression mean accuracy: 0.81
Naive Bayes mean accuracy: 0.62
kNN mean accuracy: 0.59


### <span style='color:#ffd800'> F1 score </span>

In [283]:
from sklearn.metrics import f1_score

print('SVM f1 score: ' + str(f1_score(test_y, clf_svm.predict(test_x_vectors))))
print('Random Forest f1 score: ' + str(f1_score(test_y, clf_rfc.predict(test_x_vectors))))
print('Logistic Regression f1 score: ' + str(f1_score(test_y, clf_lgr.predict(test_x_vectors))))
print('Naive Bayes f1 score: ' + str(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()))))
print('kNN f1 score: ' + str(f1_score(test_y, clf_knn.predict(test_x_vectors))))


SVM f1 score: 0.780952380952381
Random Forest f1 score: 0.8110599078341014
Logistic Regression f1 score: 0.819047619047619
Naive Bayes f1 score: 0.6082474226804124
kNN f1 score: 0.6272727272727272
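
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it by hand from a confusion matrix and compares the result with `f1_score`. The labels in `y_true` and `y_pred` are made up purely to illustrate the arithmetic and are not results from the models above.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions, only to illustrate the arithmetic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```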


### <span style='color:#ffd800'> Tuning our models using Grid Search </span>

In [298]:
clf_rfc.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [292]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [50, 100, 200], 'bootstrap': [True, False], 'criterion': ['gini', 'entropy']},
]

clf_rfc_GS = RandomForestClassifier()
grid_search = GridSearchCV(clf_rfc_GS, param_grid, cv=5,  return_train_score=True)

grid_search.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid=[{'bootstrap': [True, False],
                          'criterion': ['gini', 'entropy'],
                          'n_estimators': [50, 100, 200]}],
             return_train_score=True)

In [296]:
grid_search.best_estimator_

RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=200)

In [297]:
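clf_rfc_GS_best = grid_search.best_estimator_
# With the default refit=True, best_estimator_ is already fitted on the full training set;
# this second fit retrains it from scratch with the same hyperparameters.
clf_rfc_GS_best.fit(train_x_vectors, train_y)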
print('Random Forest mean accuracy: ' + str(clf_rfc.score(test_x_vectors, test_y)))
print('Tuned Random Forest mean accuracy: ' + str(clf_rfc_GS_best.score(test_x_vectors, test_y)))
print('Random Forest f1 score: ' + str(f1_score(test_y, clf_rfc.predict(test_x_vectors))))
print('Tuned Random Forest f1 score: ' + str(f1_score(test_y, clf_rfc_GS_best.predict(test_x_vectors))))

Random Forest mean accuracy: 0.795
Tuned Random Forest mean accuracy: 0.825
Random Forest f1 score: 0.8110599078341014
Tuned Random Forest f1 score: 0.8372093023255814


### <span style='color:#ffd800'> Saving and Loading our Model using Pickle</span>

In [302]:
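## save our best model using the pickle package
import pickle, os

filename = 'finalized_model.sav'
os.makedirs('./models', exist_ok=True)  # create the target folder if it is missing
# the with-block closes the file even if pickling fails
with open(os.path.join('./models', filename), 'wb') as f:
    pickle.dump(clf_rfc_GS_best, f)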

In [304]:
## load the model from disk
with open(os.path.join('./models', filename), 'rb') as f:
    loaded_clf_rfc = pickle.load(f)
print(f1_score(test_y, loaded_clf_rfc.predict(test_x_vectors)))

0.8372093023255814
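
Note that the pickled classifier only works on inputs transformed by the same fitted `CountVectorizer`, so the vocabulary has to be available wherever the model is reloaded. One possible approach, sketched below on made-up data, is to wrap the vectorizer and the classifier in a `Pipeline` and pickle that single object. The names `toy_text`, `toy_y`, `toy_model`, `model_path` and `restored` are hypothetical, and a temporary directory is used in place of `./models`.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train_x / train_y.
toy_text = ['great film', 'great cast', 'awful film', 'awful plot']
toy_y = [1, 1, 0, 0]

# Vectorizer and classifier travel together, so the vocabulary is saved too.
toy_model = Pipeline([('vectorizer', CountVectorizer()),
                      ('classifier', RandomForestClassifier(n_estimators=10,
                                                            random_state=0))])
toy_model.fit(toy_text, toy_y)

# A temporary directory is used so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw text directly.
new_text = ['great plot', 'awful cast']
print(restored.predict(new_text))
```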


## <span style='color:#ffd800'> A Multiclass Classification Problem </span> 

### <span style='color:#ffd800'> Data: 'Twenty Newsgroups' </span> 

http://qwone.com/~jason/20Newsgroups/

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

### <span style='color:#ffd800'> Load Data </span> 

In [324]:
# load data using sklearn
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.autos', 'sci.med', 'comp.graphics', 'misc.forsale']

train_twenty = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=123)
test_twenty = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=123)

In [325]:
# show the sample size
print(len(train_twenty.data), len(train_twenty.target_names))
print(len(test_twenty.data), len(test_twenty.target_names))

2357 4
1571 4


In [323]:
# showing the first 10 samples' labels 
[train_twenty.target_names[t] for t in train_twenty.target[:10]]

['comp.graphics',
 'rec.autos',
 'rec.autos',
 'comp.graphics',
 'sci.med',
 'misc.forsale',
 'comp.graphics',
 'rec.autos',
 'misc.forsale',
 'misc.forsale']

### <span style='color:#ffd800'> Data Preprocessing </span> 

In [365]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_twenty.data)
train_y = train_twenty.target

test_x_vectors = vectorizer.transform(test_twenty.data)
test_y = test_twenty.target

In [370]:
# from raw occurrence counts to TF-IDF weighted frequencies
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=True).fit(train_x_vectors)

train_x_tf_idf = tf_transformer.transform(train_x_vectors)
test_x_tf_idf = tf_transformer.transform(test_x_vectors)
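
To show what `TfidfTransformer` does with its default settings (`smooth_idf=True`, `norm='l2'`), the sketch below recomputes the weights by hand on a tiny made-up count matrix and checks them against scikit-learn. The matrix `toy_counts` and the names `doc_freq`, `idf`, `weighted`, `manual` and `from_sklearn` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns).
toy_counts = np.array([[2, 1, 0],
                       [1, 0, 1],
                       [1, 0, 0]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm.
weighted = toy_counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfTransformer(use_idf=True).fit_transform(
    csr_matrix(toy_counts)).toarray()
print(np.allclose(manual, from_sklearn))  # True
```

Terms that occur in every document (the first column here) get the smallest idf, so they contribute less to the final vector than rarer terms.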

### <span style='color:#ffd800'> Train a Random Forest Classifier </span> 

In [378]:
from sklearn.ensemble import RandomForestClassifier

clf_rfc = RandomForestClassifier().fit(train_x_tf_idf, train_y)
pred_y = clf_rfc.predict(test_x_tf_idf)

In [380]:
# Evaluate the classifier
from sklearn import metrics

print(metrics.classification_report(test_y, pred_y, target_names=test_twenty.target_names))
print(metrics.confusion_matrix(test_y, pred_y))

               precision    recall  f1-score   support

comp.graphics       0.71      0.88      0.79       389
 misc.forsale       0.88      0.93      0.90       390
    rec.autos       0.83      0.84      0.83       396
      sci.med       0.94      0.65      0.77       396

     accuracy                           0.83      1571
    macro avg       0.84      0.83      0.82      1571
 weighted avg       0.84      0.83      0.82      1571

[[343  17  18  11]
 [ 12 364  12   2]
 [ 42  18 333   3]
 [ 83  16  39 258]]


### <span style='color:#ffd800'> Tuning our Model</span> 

In [373]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [50, 100, 200], 'bootstrap': [True, False], 'criterion': ['gini', 'entropy']},
]

clf_rfc_GS = RandomForestClassifier()
grid_search = GridSearchCV(clf_rfc_GS, param_grid, cv=5,  return_train_score=True)
grid_search.fit(train_x_tf_idf, train_y)
grid_search.best_params_

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid=[{'bootstrap': [True, False],
                          'criterion': ['gini', 'entropy'],
                          'n_estimators': [50, 100, 200]}],
             return_train_score=True)

In [382]:
clf_rfc_best = grid_search.best_estimator_
pred_y = clf_rfc_best.predict(test_x_tf_idf)

print(metrics.classification_report(test_y, pred_y, target_names=test_twenty.target_names))
print(metrics.confusion_matrix(test_y, pred_y))

               precision    recall  f1-score   support

comp.graphics       0.78      0.93      0.85       389
 misc.forsale       0.90      0.94      0.92       390
    rec.autos       0.89      0.88      0.88       396
      sci.med       0.94      0.73      0.83       396

     accuracy                           0.87      1571
    macro avg       0.88      0.87      0.87      1571
 weighted avg       0.88      0.87      0.87      1571

[[360   8   9  12]
 [ 14 365   9   2]
 [ 28  15 350   3]
 [ 61  17  27 291]]


## <span style='color:#ffd800'> Pipeline in Sklearn </span>

In [1]:
# load data
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.autos', 'sci.med', 'comp.graphics', 'misc.forsale']

train_twenty = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=123)
test_twenty = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=123)

### <span style='color:#ffd800'> First pipeline for data preprocessing </span>

In [23]:
# define the preprocessing step by combining two transformers in a pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline

preprocessing = Pipeline([('vectorizer', CountVectorizer()),
                          ('tfidf',  TfidfTransformer(use_idf=True))])

train_x_tf_idf = preprocessing.fit_transform(train_twenty.data)
train_y = train_twenty.target

test_x_tf_idf = preprocessing.transform(test_twenty.data)
test_y = test_twenty.target

### <span style='color:#ffd800'> A pipeline can be nested inside another pipeline </span>

In [24]:
# combine data preprocessing with classifier
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('features', preprocessing), 
                     ('classifier', RandomForestClassifier())])

pipeline.fit(train_twenty.data, train_twenty.target)
pred_y = pipeline.predict(test_twenty.data)

### <span style='color:#ffd800'> Cross Validation To Find The Best Pipeline </span>

In [25]:
# list the tunable hyperparameters of the pipeline
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'features', 'classifier', 'features__memory', 'features__steps', 'features__verbose', 'features__vectorizer', 'features__tfidf', 'features__vectorizer__analyzer', 'features__vectorizer__binary', 'features__vectorizer__decode_error', 'features__vectorizer__dtype', 'features__vectorizer__encoding', 'features__vectorizer__input', 'features__vectorizer__lowercase', 'features__vectorizer__max_df', 'features__vectorizer__max_features', 'features__vectorizer__min_df', 'features__vectorizer__ngram_range', 'features__vectorizer__preprocessor', 'features__vectorizer__stop_words', 'features__vectorizer__strip_accents', 'features__vectorizer__token_pattern', 'features__vectorizer__tokenizer', 'features__vectorizer__vocabulary', 'features__tfidf__norm', 'features__tfidf__smooth_idf', 'features__tfidf__sublinear_tf', 'features__tfidf__use_idf', 'classifier__bootstrap', 'classifier__ccp_alpha', 'classifier__class_weight', 'classifier__criterion', 'classifier__

In [26]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {'classifier__n_estimators': [50, 100, 200],
                   'classifier__bootstrap': [True, False],
                   'classifier__criterion': ['gini', 'entropy']}

grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(train_twenty.data, train_twenty.target)
grid_search.best_params_

{'classifier__bootstrap': False,
 'classifier__criterion': 'gini',
 'classifier__n_estimators': 100}

In [18]:
from sklearn import metrics
clf_rfc_best = grid_search.best_estimator_

pred_y = clf_rfc_best.predict(test_twenty.data)
print(metrics.classification_report(test_twenty.target, pred_y, target_names=test_twenty.target_names))
print(metrics.confusion_matrix(test_twenty.target, pred_y))

               precision    recall  f1-score   support

comp.graphics       0.76      0.92      0.83       389
 misc.forsale       0.91      0.94      0.93       390
    rec.autos       0.89      0.86      0.88       396
      sci.med       0.93      0.73      0.82       396

     accuracy                           0.86      1571
    macro avg       0.87      0.86      0.86      1571
 weighted avg       0.87      0.86      0.86      1571

[[359   7  11  12]
 [ 14 367   5   4]
 [ 34  14 341   7]
 [ 66  15  25 290]]
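
The grid above only varies the classifier, but the same `step__parameter` naming reaches into the nested preprocessing pipeline as well. The sketch below searches over vectorizer and TF-IDF settings together with the forest size on a tiny made-up corpus so that it runs in seconds. The data, the grid values, and the names `toy_features`, `toy_pipeline`, `toy_grid` and `toy_search` are hypothetical and are not tuned for the 20 Newsgroups data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = autos, 0 = medicine.
toy_text = ['engine oil change', 'new car engine', 'car tires and brakes',
            'brakes and engine noise', 'doctor prescribed medicine',
            'flu medicine dose', 'doctor and patient', 'patient flu symptoms']
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_features = Pipeline([('vectorizer', CountVectorizer()),
                         ('tfidf', TfidfTransformer())])
toy_pipeline = Pipeline([('features', toy_features),
                         ('classifier', RandomForestClassifier(random_state=123))])

# Keys follow the step names: outer step, inner step, then the parameter.
toy_grid = {'features__vectorizer__ngram_range': [(1, 1), (1, 2)],
            'features__tfidf__use_idf': [True, False],
            'classifier__n_estimators': [10, 20]}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_text, toy_y)
print(toy_search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so no vocabulary information leaks from the validation fold into training.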
