## Category 2: Supervised Machine Learning on Unstructured Data
> The dataset consists of customer reviews, each paired with a label indicating whether the customer 'Liked' the experience. The objective is to predict the 'Liked' label from the review text, which makes this a binary text classification problem.
> This version uses sklearn pipelines.

** Step 1 - Import relevant libraries **

In [1]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
from time import time

import string
import re,nltk
from nltk.corpus import stopwords
#from nltk.stem.porter import PorterStemmer
from nltk.stem import PorterStemmer,WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier

stopWords = set(stopwords.words('english'))
ps = PorterStemmer()
lem = WordNetLemmatizer()

# Download just once
#nltk.download('stopwords')
#nltk.download('wordnet')

** Step 2 - Reading the dataset into pandas dataframe **

In [2]:
# Importing the dataset
reviews_original = pd.read_csv('./0.datasets/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
print(reviews_original.shape)

(1000, 2)


In [3]:
reviews_original.head(5)

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [4]:
num_reviews = reviews_original.shape[0]
print(num_reviews)

1000


** Step 3 - Text Preprocessing **
> Remove special characters

> Make lowercase

> Stem or lemmatize the words

> Remove stopwords

In [5]:
#creating a function to encapsulate preprocessing, to make it easy to replicate on submission data
def processing(df,col='text',remove_stop_words='Yes',treatment='lemmatize'):
    num_reviews = df.shape[0]
    #lowering and removing punctuation
    df['processed'] = df[col].apply(lambda x: re.sub(r'[^\w\s]','', x.lower()))
    
    #numerical feature engineering
    #total length of sentence
    df['length'] = df['processed'].apply(lambda x: len(x))
    #get number of words
    df['words'] = df['processed'].apply(lambda x: len(x.split(' ')))
    df['words_not_stopword'] = df['processed'].apply(lambda x: len([t for t in x.split(' ') if t not in stopWords]))
    #get the average word length
    df['avg_word_length'] = df['processed'].apply(lambda x: np.mean([len(t) for t in x.split(' ') if t not in stopWords]) if len([len(t) for t in x.split(' ') if t not in stopWords]) > 0 else 0)
    #count the commas in the original (unprocessed) review
    df['commas'] = df[col].apply(lambda x: x.count(','))
   
    if remove_stop_words=="Yes":
        df['processed'] = df['processed'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopWords)]))

    #stop words are already handled above, so the remove_stop_words flag is honored here
    if treatment=="stemming":
        df['processed'] = df['processed'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
    elif treatment=="lemmatize":
        df['processed'] = df['processed'].apply(lambda x: ' '.join([lem.lemmatize(word) for word in x.split()]))
    
    return(df)

In [6]:
reviews_processed = processing(reviews_original,col='Review',remove_stop_words='Yes',treatment='lemmatize')
reviews_processed.head()

| | Review | Liked | processed | length | words | words_not_stopword | avg_word_length | commas |
|---|---|---|---|---|---|---|---|---|
| 0 | Wow... Loved this place. | 1 | wow loved place | 20 | 4 | 3 | 4.333333 | 0 |
| 1 | Crust is not good. | 0 | crust good | 17 | 4 | 2 | 4.5 | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 | tasty texture nasty | 40 | 8 | 3 | 5.666667 | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 | stopped late may bank holiday rick steve recom... | 86 | 15 | 9 | 5.888889 | 0 |
| 4 | The selection on the menu was great and so wer... | 1 | selection menu great price | 58 | 12 | 4 | 6.0 | 0 |


In [7]:
review_corpus = reviews_processed['processed']
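The cleaning steps in `processing` can be illustrated on a single string. The sketch below is a simplified stand-in that uses only the standard library: `clean_review` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-picked subset rather than the NLTK list, and no stemming or lemmatizing is done.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the NLTK English list is much longer.
TOY_STOP_WORDS = {"this", "is", "the", "and", "was", "just", "not"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from one review."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    kept = [word for word in no_punct.split() if word not in stop_words]
    return " ".join(kept)

first = clean_review("Wow... Loved this place.")
second = clean_review("Crust is not good.")
print(first)
print(second)
```

Note that treating "not" as a stop word turns "Crust is not good." into "crust good", which is also what the `processed` column shows for that review. Negations are lost at this step, so a custom stop-word list may be worth considering.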

** Step 4: Initial Analysis - Bag of Words using CountVectorizer & Tfidf transformer **

In [8]:
cv = CountVectorizer()
reviews_bow = cv.fit_transform(review_corpus)
vocab_bow = cv.get_feature_names()  # on scikit-learn >= 1.2 use cv.get_feature_names_out()

In [9]:
print('Shape of Sparse Matrix: ', reviews_bow.shape)
print('Amount of Non-Zero occurrences: ', reviews_bow.nnz)

# Percentage of non-zero cells (i.e. the density); all remaining cells are zero
sparsity = (100.0 * reviews_bow.nnz / (reviews_bow.shape[0] * reviews_bow.shape[1]))
print('sparsity: {}'.format(sparsity))  

Shape of Sparse Matrix:  (1000, 1829)
Amount of Non-Zero occurrences:  5532
sparsity: 0.3024603608529251
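The printed value is a percentage of non-zero cells: 100 × 5532 / (1000 × 1829) ≈ 0.30, so roughly 99.7% of the matrix is zero. The sketch below rebuilds a tiny bag-of-words matrix by hand to show where those numbers come from. It is a simplified illustration, not `CountVectorizer` itself (which, for example, drops one-character tokens by default); the three documents are toy inputs.

```python
from collections import Counter

docs = ["wow loved place", "crust good", "tasty texture nasty"]

# Vocabulary: sorted unique tokens (CountVectorizer also orders its features alphabetically).
vocab = sorted({token for doc in docs for token in doc.split()})

# Dense count matrix: one row per document, one column per vocabulary term.
counts = []
for doc in docs:
    tally = Counter(doc.split())
    counts.append([tally.get(term, 0) for term in vocab])

# Number of non-zero cells and the percentage of cells they occupy.
nnz = sum(1 for row in counts for value in row if value != 0)
density_pct = 100.0 * nnz / (len(docs) * len(vocab))
print(vocab)
print(nnz, density_pct)
```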


In [10]:
tfidf = TfidfTransformer()
reviews_tfidf = tfidf.fit_transform(reviews_bow)
print(reviews_tfidf.shape) 

(1000, 1829)
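To make the TF-IDF step less of a black box, here is a hand-rolled sketch of the weighting that `TfidfTransformer` applies with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw count and followed by L2 normalization of each row. The toy documents and helper names are assumptions for illustration only.

```python
import math

docs = ["good food good service", "bad food", "good place"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed idf, matching the default smooth_idf=True behaviour.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Raw count times idf for every vocabulary term, then L2-normalize the row."""
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]
print(idf)
```

Rare terms such as "bad" receive a larger idf than terms shared by several documents such as "good", and every row ends up with unit length.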


** Note: You can continue to fit classifiers on Bag of Words & TFIDF values as explained in the previous version of the same notebook. In this notebook, I am going to show how to use pipelines for fitting the classifier models **

### Step 5: Creating the Pipeline

** Step 5.1: Creating custom transformers (as required) **

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
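class DenseTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer to convert a sparse matrix into a dense array
    Needed for estimators that do not accept sparse input (e.g. GaussianNB)
    """
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() returns np.matrix,
        # which recent scikit-learn versions do not accept
        return X.toarray()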
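The two selectors differ only in the shape of what they return: `TextSelector` hands back a 1-D column because the text vectorizers expect an iterable of strings, while `NumberSelector` hands back a single-column 2-D frame because `StandardScaler` expects 2-D input. The sketch below mimics that contract without sklearn or pandas; `ColumnSelector` and its `as_2d` flag are hypothetical, and a dict of lists stands in for the data frame.

```python
class ColumnSelector:
    """Duck-typed stand-in for TextSelector / NumberSelector (no sklearn needed)."""

    def __init__(self, key, as_2d=False):
        self.key = key
        self.as_2d = as_2d

    def fit(self, X, y=None):
        # Nothing is learned from the data; returning self lets calls be chained.
        return self

    def transform(self, X):
        column = X[self.key]
        # as_2d mimics X[[key]]: one row per sample, one column per feature.
        return [[value] for value in column] if self.as_2d else list(column)

table = {"processed": ["wow loved place", "crust good"], "length": [20, 17]}
text_out = ColumnSelector("processed").fit(table).transform(table)
num_out = ColumnSelector("length", as_2d=True).fit(table).transform(table)
print(text_out)
print(num_out)
```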

** Step 5.2: Split into train & validation set **

In [12]:
# Setting the features & target
features = [c for c in reviews_processed.columns.values if c not in ['Review','Liked']]
numeric_features= [c for c in reviews_processed.columns.values if c  not in ['Review','Liked','processed']]
target = 'Liked'

X = reviews_processed[features]
y = reviews_processed[target]

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_val.head()

| | processed | length | words | words_not_stopword | avg_word_length | commas |
|---|---|---|---|---|---|---|
| 521 | havent gone go | 30 | 7 | 3 | 4.0 | 0 |
| 737 | try airport experience tasty food speedy frien... | 81 | 14 | 8 | 6.25 | 1 |
| 740 | restaurant clean family restaurant feel | 67 | 13 | 5 | 7.0 | 0 |
| 660 | personally love hummus pita baklava falafel ba... | 106 | 18 | 10 | 6.5 | 3 |
| 411 | come hungry leave happy stuffed | 35 | 6 | 5 | 5.4 | 1 |
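Conceptually, the split shuffles the row indices with a fixed seed and cuts off 20% for validation, which is why the validation rows above carry scattered index values. A minimal standard-library sketch follows; `simple_split` is a hypothetical helper and will not reproduce the exact rows that `train_test_split` selects, since the two use different random number generators.

```python
import random

def simple_split(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # a fixed seed keeps the split reproducible
    n_val = int(round(n_samples * test_size))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = simple_split(1000)
print(len(train_idx), len(val_idx))
```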


** Step 5.3: Create separate pipeline for each feature **

In [14]:
text_tfidf = Pipeline([
                ('selector', TextSelector(key = 'processed')),
                ('tfidf', TfidfVectorizer(stop_words='english'))
            ])

text_cv = Pipeline([
                ('selector', TextSelector(key='processed')),
                ('cv', CountVectorizer())
            ])

length =  Pipeline([
                ('selector', NumberSelector(key='length')),
                ('standard', StandardScaler())
            ])

words =  Pipeline([
                ('selector', NumberSelector(key='words')),
                ('standard', StandardScaler())
            ])

words_not_stopword =  Pipeline([
                ('selector', NumberSelector(key='words_not_stopword')),
                ('standard', StandardScaler())
            ])
avg_word_length =  Pipeline([
                ('selector', NumberSelector(key='avg_word_length')),
                ('standard', StandardScaler())
            ])

commas =  Pipeline([
                ('selector', NumberSelector(key='commas')),
                ('standard', StandardScaler()),
            ])
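Each numeric pipeline standardizes its column: the mean and the population standard deviation are learned from the training data during `fit`, and the same statistics are reused when transforming validation data. The sketch below shows that arithmetic with plain Python; the helper names are hypothetical and the numbers are illustrative `length` values, not the actual training split.

```python
import math

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def apply_scaler(values, mean, std):
    # Guard against constant columns, where the standard deviation is zero.
    return [(v - mean) / std if std else 0.0 for v in values]

train_lengths = [20, 17, 40, 86, 58]   # illustrative training values
val_lengths = [30, 81]                 # scaled with the training statistics, never refit

mean, std = fit_scaler(train_lengths)
train_scaled = apply_scaler(train_lengths, mean, std)
val_scaled = apply_scaler(val_lengths, mean, std)
print(mean, std)
print(val_scaled)
```

Reusing the training statistics for the validation rows is what keeps information from the validation set out of the fitted pipeline.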

** Step 5.4: Combine all features using FeatureUnion & check whether preprocessing makes sense **

In [15]:
feats = FeatureUnion([('text_tfidf', text_tfidf),
                      ('text_cv', text_cv),
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)
                     ])

feature_processing = Pipeline([('feats', feats)])

In [16]:
text_df = pd.DataFrame(feature_processing.fit_transform(X_train,y_train).todense())
text_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3079,3080,3081,3082,3083,3084,3085,3086,3087,3088
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.774591,-0.791911,-0.844648,0.255891,-0.577461
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.052362,-0.472511,-0.237806,2.256122,-0.577461
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.774591,-0.632211,-0.237806,-1.410968,-0.577461
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.073243,-0.153112,0.065615,0.533701,-0.577461
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.334973,-0.312812,-0.844648,0.811511,-0.577461


In [17]:
text_df_val = pd.DataFrame(feature_processing.transform(X_val).todense())
text_df_val.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3079,3080,3081,3082,3083,3084,3085,3086,3087,3088
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.837394,-0.632211,-0.844648,-1.410968,-0.577461
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.764072,0.485687,0.672456,0.464248,0.988535
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.324454,0.325987,-0.237806,1.08932,-0.577461
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.549104,1.124485,1.279298,0.672606,4.120526
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.680387,-0.791911,-0.237806,-0.244167,0.988535
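The transformed frames have 3089 columns (0 to 3088). The last five hold non-zero, roughly unit-scale values, which is consistent with the five standardized numeric features being appended after the columns produced by the two text vectorizers. `FeatureUnion` simply places the output blocks side by side in the order listed. A small sketch of that column-wise concatenation, using a hypothetical `hstack_rows` helper and toy blocks:

```python
def hstack_rows(*blocks):
    """Concatenate feature blocks column-wise; every block must have the same row count."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("all blocks need the same number of rows")
    return [sum((block[i] for block in blocks), []) for i in range(n_rows)]

tfidf_block = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]   # toy TF-IDF columns
count_block = [[0, 1, 1], [1, 0, 0]]               # toy count columns
length_block = [[-0.77], [0.76]]                   # toy scaled numeric column

combined = hstack_rows(tfidf_block, count_block, length_block)
print(len(combined), len(combined[0]))
```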


** Step 5.5: Compare different classifiers using stratified K-fold cross-validation **

In [18]:
classifiers = [
    GaussianNB(),
    LogisticRegression(),
    SGDClassifier(),
    XGBClassifier()]

In [19]:
model_pipeline = Pipeline([
    ('features', feats),
    ('todense',DenseTransformer()),
    ('classifier', classifiers[0])
])

In [20]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

scoring = 'accuracy'
#scoring = 'roc_auc'
#scoring = "f1_macro"
#scoring = 'neg_log_loss'

n_splits=5
seed = 10

# Stratified folds keep the ratio of the two 'Liked' classes similar in every fold
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

# Note: scores are computed on the full X, y rather than on X_train, y_train
for clf in classifiers:
    model_pipeline.set_params(classifier=clf)
    scores = cross_val_score(model_pipeline, X, y, scoring=scoring, cv=kfold)
    print('----------------------')
    print(str(clf))
    print('----------------------')
    print(scores)
    print(scores.mean())

----------------------
GaussianNB(priors=None)
----------------------
[ 0.66   0.71   0.685  0.75   0.68 ]
0.697
----------------------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
----------------------
[ 0.77   0.78   0.785  0.785  0.8  ]
0.784
----------------------
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
----------------------
[ 0.735  0.74   0.755  0.785  0.765]
0.756
----------------------
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, m
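Stratified K-fold splits the data so that every fold keeps roughly the same class balance, and the reported figure per classifier is the mean of the five fold accuracies. The sketch below is a simplified, hypothetical version of that fold assignment (it is not sklearn's algorithm) on balanced toy labels, followed by a check of the mean for the logistic regression scores printed above.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=10):
    """Assign each sample index to a fold so class proportions stay similar per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

labels = [1] * 50 + [0] * 50            # toy, perfectly balanced labels
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

lr_scores = [0.77, 0.78, 0.785, 0.785, 0.8]   # fold accuracies from the output above
lr_mean = sum(lr_scores) / len(lr_scores)
print(positives_per_fold, lr_mean)
```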

** Step 5.6: Select the best classifier for hyperparameter tuning using GridSearch **

In [21]:
lr_model_pipeline = Pipeline([
    ('features', feats),
    ('todense',DenseTransformer()),
    ('lr_classifier', LogisticRegression())
])

In [22]:
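# use the pipeline object as you would a regular classifier
lr_model_pipeline.fit(X_train,y_train)
# Note: this prints model_pipeline from Step 5.5 (its classifier was last set to XGBClassifier),
# not the lr_model_pipeline fitted above; use print(lr_model_pipeline) to inspect the fitted one
print(model_pipeline)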

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_tfidf', Pipeline(steps=[('selector', TextSelector(key='processed')), ('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
    ...logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1))])


In [23]:
y_preds = lr_model_pipeline.predict(X_val)
print("Performance before hyperparameter tuning: %0.2f" %accuracy_score(y_val,y_preds))

Performance before hyperparameter tuning: 0.78


In [24]:
# Get all hyper-parameters for the pipeline object
sorted(lr_model_pipeline.get_params().keys())

['features',
 'features__avg_word_length',
 'features__avg_word_length__selector',
 'features__avg_word_length__selector__key',
 'features__avg_word_length__standard',
 'features__avg_word_length__standard__copy',
 'features__avg_word_length__standard__with_mean',
 'features__avg_word_length__standard__with_std',
 'features__avg_word_length__steps',
 'features__commas',
 'features__commas__selector',
 'features__commas__selector__key',
 'features__commas__standard',
 'features__commas__standard__copy',
 'features__commas__standard__with_mean',
 'features__commas__standard__with_std',
 'features__commas__steps',
 'features__length',
 'features__length__selector',
 'features__length__selector__key',
 'features__length__standard',
 'features__length__standard__copy',
 'features__length__standard__with_mean',
 'features__length__standard__with_std',
 'features__length__steps',
 'features__n_jobs',
 'features__text_cv',
 'features__text_cv__cv',
 'features__text_cv__cv__analyzer',
 'feature
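The parameter names follow a simple convention: each `__` descends one level, from the pipeline step, through any nested pipeline or union member, down to the estimator's own argument. The sketch below imitates that lookup on a plain nested dictionary. `resolve` and `params_tree` are hypothetical illustrations of the naming scheme, not how sklearn stores parameters internally.

```python
def resolve(params_tree, dotted_name):
    """Follow a 'step__substep__param' name through nested dictionaries."""
    node = params_tree
    for part in dotted_name.split("__"):
        node = node[part]
    return node

# Hypothetical nested view of a small part of the pipeline's parameters (default values).
params_tree = {
    "features": {
        "text_tfidf": {"tfidf": {"ngram_range": (1, 1)}},
        "commas": {"standard": {"with_mean": True}},
    },
    "lr_classifier": {"C": 1.0, "penalty": "l2"},
}

ngram = resolve(params_tree, "features__text_tfidf__tfidf__ngram_range")
penalty = resolve(params_tree, "lr_classifier__penalty")
print(ngram, penalty)
```

The same double-underscore names are what a grid search expects as keys when tuning parameters inside a pipeline.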

In [25]:
from sklearn.model_selection import GridSearchCV

hyperparameters = { #'features__text_tfidf__tfidf__max_df': [0.5, 0.75, 1.0],
                    'features__text_tfidf__tfidf__ngram_range': [(1,1), (1,2)],
                    #'features__text_tfidf__tfidf__use_idf': [True, False],
                    #'features__text_tfidf__tfidf__norm': ['l1','l2'],
                    #'features__text_cv__cv__max_features': [None, 1000, 3000],
                    'lr_classifier__C': [50, 70],
                    'lr_classifier__penalty' : ['l1','l2']  # 'l1' needs solver='liblinear' or 'saga' (liblinear was the default before scikit-learn 0.22)
                  }

grid_search = GridSearchCV(lr_model_pipeline, hyperparameters, cv=5)

print("Performing grid search...")
print("pipeline:", [name for name, _ in lr_model_pipeline.steps])
print("parameters:")
pprint(hyperparameters)
t0 = time()
grid_search.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(hyperparameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


# Fit and tune model
#clf.fit(X_train, y_train)

Performing grid search...
pipeline: ['features', 'todense', 'lr_classifier']
parameters:
{'features__text_tfidf__tfidf__ngram_range': [(1, 1), (1, 2)],
 'lr_classifier__C': [50, 70],
 'lr_classifier__penalty': ['l1', 'l2']}
done in 6.935s

Best score: 0.780
Best parameters set:
	features__text_tfidf__tfidf__ngram_range: (1, 2)
	lr_classifier__C: 50
	lr_classifier__penalty: 'l2'


In [26]:
print("Best parameters are:")
grid_search.best_params_

Best parameters are:


{'features__text_tfidf__tfidf__ngram_range': (1, 2),
 'lr_classifier__C': 50,
 'lr_classifier__penalty': 'l2'}
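The size of the search follows directly from the grid: two n-gram ranges × two values of `C` × two penalties gives 8 candidate settings, each scored with 5-fold cross-validation, plus one final refit of the winner (the `refit=True` default). A short sketch of that enumeration with `itertools.product`; the variable names are illustrative.

```python
from itertools import product

grid = {
    "features__text_tfidf__tfidf__ngram_range": [(1, 1), (1, 2)],
    "lr_classifier__C": [50, 70],
    "lr_classifier__penalty": ["l1", "l2"],
}

# Every combination of one value per parameter is one candidate setting.
names = sorted(grid)
candidates = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

n_folds = 5
n_cv_fits = len(candidates) * n_folds   # fits made while scoring the candidates
n_total_fits = n_cv_fits + 1            # plus one refit of the best candidate
print(len(candidates), n_cv_fits, n_total_fits)
```

Because the count multiplies with every added parameter, uncommenting the other entries in the grid would increase the run time quickly.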

** Step 5.7: Refit on entire training set using best parameters obtained using GridSearch **

In [27]:
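# refit=True is the GridSearchCV default, so the best settings were already refit on all of
# X_train during grid_search.fit(); the tuned pipeline is grid_search.best_estimator_
# Note: clf below is the leftover loop variable from Step 5.5 (XGBClassifier), not the tuned model
grid_search.refit
print(clf)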

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)


** Step 5.8: Predict on Validation Set **

In [28]:
preds = grid_search.predict(X_val)
probs = grid_search.predict_proba(X_val)
print("Performance after hyperparameter tuning: %0.2f" %accuracy_score(y_val,preds))

Performance after hyperparameter tuning: 0.80
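Accuracy is the fraction of validation reviews whose predicted label matches the true one. With 200 validation reviews (20% of 1000), moving from 0.78 to 0.80 corresponds to roughly four additional correct predictions, keeping in mind that both figures are rounded to two decimals. The sketch below shows the calculation on toy labels; `accuracy` is a hand-written stand-in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]   # toy labels, not the notebook's data
y_pred = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1]
toy_accuracy = accuracy(y_true, y_pred)

n_val = 200                               # 20% of the 1000 reviews
extra_correct = round((0.80 - 0.78) * n_val)
print(toy_accuracy, extra_correct)
```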


** Conclusion: Validation accuracy improved from 0.78 to 0.80 after tuning the hyper-parameters, a modest gain of roughly four reviews out of the 200 in the validation set **