## Category 2 - Supervised Machine Learning on Unstructured Data
> The dataset consists of reviews written by customers, each with a label indicating whether the customer 'Liked' the experience or not. The objective is to predict the 'Liked' label from the review text, which makes this a text classification problem.

** Step 1 - Import relevant libraries **

In [1]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
import re,nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

** Step 2 - Read the dataset into a pandas DataFrame **

In [2]:
# Importing the dataset (tab-separated; quoting = 3 is csv.QUOTE_NONE, so quote characters are read as plain text)
reviews_original = pd.read_csv('./0.datasets/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
print(reviews_original.shape)

(1000, 2)


In [3]:
reviews_original.head(5)

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [4]:
num_reviews = reviews_original.shape[0]

** Step 3 - Text Preprocessing **
> Remove special characters

> Make lowercase

> Remove stopwords

> Stem the words

In [5]:
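review_corpus = []
ps = PorterStemmer()
# Build the stop-word set once instead of once per word
english_stopwords = set(stopwords.words('english'))
for i in range(0, num_reviews):
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', reviews_original['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words, then stem what remains
    review = [ps.stem(word) for word in review if word not in english_stopwords]
    review = ' '.join(review)
    review_corpus.append(review)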
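
To make the effect of the cleaning steps concrete, here is a minimal sketch that applies them to a single review. It assumes a tiny hand-written stop-word set (`TOY_STOPWORDS`, hypothetical) in place of NLTK's English list and leaves out stemming, so it runs without any downloads. The loop above would additionally stem each surviving word and would use NLTK's much longer list.

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
TOY_STOPWORDS = {"this", "the", "is", "and", "was", "just"}

def clean_review(text):
    """Strip non-letters, lowercase, and drop stop words (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in TOY_STOPWORDS)

print(clean_review("Wow... Loved this place."))   # wow loved place
print(clean_review("Crust is not good."))         # crust not good
```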

** Step 4: Create the bag of words by using Count Vectorizer **

In [6]:
cv = CountVectorizer()
reviews_bow = cv.fit_transform(review_corpus)
vocab_bow = cv.get_feature_names()
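
As a rough illustration of what the bag-of-words matrix contains, the sketch below builds one by hand for three made-up, already-cleaned reviews. It is a simplification rather than a reimplementation of `CountVectorizer`: it splits on whitespace only, whereas `CountVectorizer`'s default tokenizer also ignores single-character tokens. The names `docs`, `vocab` and `bow` are illustrative.

```python
from collections import Counter

# Hypothetical cleaned reviews
docs = ["wow love place", "crust good", "love crust love"]

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, cell = term count
bow = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Each row is mostly zeros because a review uses only a few of the vocabulary terms, which is why scikit-learn stores the result as a sparse matrix.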

In [7]:
print('Shape of Sparse Matrix: ', reviews_bow.shape)
print('Amount of Non-Zero occurrences: ', reviews_bow.nnz)

# Percentage of matrix cells that are non-zero (strictly a density; a lower value means a sparser matrix)
sparsity = (100.0 * reviews_bow.nnz / (reviews_bow.shape[0] * reviews_bow.shape[1]))
print('sparsity: {}'.format(sparsity))

Shape of Sparse Matrix:  (1000, 1565)
Amount of Non-Zero occurrences:  5372
sparsity: 0.34325878594249204
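
The printed figure can be checked by hand from the shape and the non-zero count. The short sketch below redoes the arithmetic with the numbers reported above; the variable names are illustrative only.

```python
n_rows, n_cols, nnz = 1000, 1565, 5372   # values printed above

density_pct = 100.0 * nnz / (n_rows * n_cols)   # share of cells that are non-zero
zero_pct = 100.0 - density_pct                   # share of cells that are zero
avg_terms_per_review = nnz / n_rows              # distinct terms per review, on average

print(density_pct, zero_pct, avg_terms_per_review)
```

So roughly 0.34% of the cells hold a count, more than 99.6% are zero, and an average review contributes a little over five distinct terms after cleaning.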


** Step 5: Create training & validation datasets and run machine learning models on bag of words **

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

In [9]:
X_bow = reviews_bow.toarray()
y_bow = reviews_original.iloc[:, 1].values

In [10]:
# Splitting the dataset into the training set and validation set
X_train, X_val, y_train, y_val = train_test_split(X_bow, y_bow, test_size = 0.20, random_state = 0)
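
With `test_size = 0.20` and 1000 reviews, the split leaves 800 rows for training and 200 for validation. The sketch below shows the idea with the standard library only. `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects, since scikit-learn uses its own random number generator.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut them into train/validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * test_size))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(1000, 0.20, seed=0)
print(len(train_idx), len(val_idx))   # 800 200
```

Fixing the seed, as `random_state = 0` does above, means the same rows land in each part on every run.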

In [11]:
# Fitting 3 Algorithms to the Training set
review_gaussian_nb_bow = GaussianNB()
review_log_bow = LogisticRegression()
review_sgd_bow = SGDClassifier()

review_gaussian_nb_bow.fit(X_train, y_train)
review_log_bow.fit(X_train, y_train)
review_sgd_bow.fit(X_train, y_train)

# Predicting the Validation set results
predictions_gaussian_nb_bow = review_gaussian_nb_bow.predict(X_val)
predictions_log_bow = review_log_bow.predict(X_val)
predictions_sgd_bow = review_sgd_bow.predict(X_val)

** Step 6: Compare results from different algorithms **

In [12]:
from sklearn.metrics import accuracy_score
print("Naive Bayes Accuracy - BOW:",accuracy_score(y_val,predictions_gaussian_nb_bow))
print("Logistic Regression Accuracy - BOW:",accuracy_score(y_val,predictions_log_bow))
print("SGD Accuracy - BOW:",accuracy_score(y_val,predictions_sgd_bow))

Naive Bayes Accuracy - BOW: 0.73
Logistic Regression Accuracy - BOW: 0.71
SGD Accuracy - BOW: 0.725
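
Accuracy is simply the fraction of validation reviews whose predicted label equals the true label. The sketch below spells that out with a hypothetical `accuracy` helper and then converts the scores above into counts, assuming the 200-review validation set implied by the 80/20 split of 1000 reviews.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

print(accuracy([1, 0, 0, 1], [1, 0, 1, 1]))   # 0.75

# With 200 validation reviews, each score maps to a whole number of hits.
n_val = 200
hits = {}
for name, score in [("Naive Bayes", 0.73), ("Logistic Regression", 0.71), ("SGD", 0.725)]:
    hits[name] = round(score * n_val)
    print(name, hits[name], "of", n_val, "correct")
```

The three models are separated by only four reviews (146, 142 and 145 correct), so the ranking on a single split should be read with caution.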


** Step 7: Create TFIDF scores for each token in the bag of words ** 

In [13]:
tfidf = TfidfTransformer()
reviews_tfidf = tfidf.fit_transform(reviews_bow)
print(reviews_tfidf.shape) 

(1000, 1565)
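
The shape is unchanged because TFIDF only reweights the existing counts. As a sketch of the reweighting, the code below applies the formula that scikit-learn documents for `TfidfTransformer` with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) to a made-up 3 x 3 count matrix. The matrix and variable names are illustrative only.

```python
import math

# Toy count matrix: 3 documents x 3 terms (columns: "good", "love", "crust")
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]

tfidf = []
for row in counts:
    weighted = [row[j] * idf[j] for j in range(n_terms)]
    norm = math.sqrt(sum(w * w for w in weighted))   # L2 norm of the row
    tfidf.append([w / norm for w in weighted])

for row in tfidf:
    print([round(w, 3) for w in row])
```

A term that appears in every document ("good" here) gets the minimum idf of 1, while rarer terms get a larger weight, so within a review the rarer words dominate the feature vector.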


** Step 8: Create training & validation datasets and run machine learning models on TFIDF **

In [14]:
X_tfidf = reviews_tfidf.toarray()
y_tfidf = reviews_original.iloc[:, 1].values

In [15]:
# Splitting the dataset into the training set and validation set
X_train, X_val, y_train, y_val = train_test_split(X_tfidf, y_tfidf, test_size = 0.20, random_state = 0)

In [16]:
# Fitting 3 Algorithms to the Training set
review_gaussian_nb_tfidf = GaussianNB()
review_log_tfidf = LogisticRegression()
review_sgd_tfidf = SGDClassifier()

review_gaussian_nb_tfidf.fit(X_train, y_train)
review_log_tfidf.fit(X_train, y_train)
review_sgd_tfidf.fit(X_train, y_train)

# Predicting the Validation set results
predictions_gaussian_nb_tfidf = review_gaussian_nb_tfidf.predict(X_val)
predictions_log_tfidf = review_log_tfidf.predict(X_val)
predictions_sgd_tfidf = review_sgd_tfidf.predict(X_val)

** Step 9: Compare results from different algorithms for TFIDF **

In [17]:
from sklearn.metrics import accuracy_score
print("Naive Bayes Accuracy - TFIDF:",accuracy_score(y_val,predictions_gaussian_nb_tfidf))
print("Logistic Regression Accuracy - TFIDF:",accuracy_score(y_val,predictions_log_tfidf))
print("SGD Accuracy - TFIDF:",accuracy_score(y_val,predictions_sgd_tfidf))

Naive Bayes Accuracy - TFIDF: 0.72
Logistic Regression Accuracy - TFIDF: 0.755
SGD Accuracy - TFIDF: 0.74
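
To see the effect of switching from raw counts to TFIDF at a glance, the sketch below tabulates the change per model using the accuracies printed in Steps 6 and 9. The dictionaries are typed in by hand from those outputs.

```python
bow = {"Naive Bayes": 0.73, "Logistic Regression": 0.71, "SGD": 0.725}
tfidf = {"Naive Bayes": 0.72, "Logistic Regression": 0.755, "SGD": 0.74}

# Change in validation accuracy when moving from bag of words to TFIDF
deltas = {name: round(tfidf[name] - bow[name], 3) for name in bow}
for name, delta in deltas.items():
    print(f"{name}: {bow[name]} -> {tfidf[name]} ({delta:+.3f})")
```

Logistic Regression gains the most, SGD gains slightly, and Naive Bayes loses a point. Because `SGDClassifier()` is created without a fixed `random_state` and the validation set has 200 reviews, differences of a point or two may not be stable across runs.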


** Step 10: Implement Grid Search to find the right hyper-parameters **

In [18]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from pprint import pprint
from time import time

In [19]:
# Define a pipeline combining a text feature extractor with a simple classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# adding more parameters or values gives better exploring power but
# increases processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),  # n_iter was removed in scikit-learn 0.21; newer releases use max_iter
}
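
The size of the search follows directly from this grid: the number of candidates is the product of the number of values per parameter. The sketch below recomputes it; the fold count of 3 is taken from the search log in this notebook rather than from any setting in the code.

```python
from itertools import product
from math import prod

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),
}

n_candidates = prod(len(values) for values in parameters.values())
n_folds = 3   # assumption: the fold count reported in the search log
print(n_candidates, n_candidates * n_folds)   # 1152 3456

# Enumerating the grid explicitly gives the same count
all_combos = list(product(*parameters.values()))
print(len(all_combos))
```

Adding one more parameter with two values would double both numbers, which is the combinatorial growth the comment in the cell warns about.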

In [20]:
X_for_param_tuning = review_corpus
y_for_param_tuning = reviews_original.Liked.values

In [21]:
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(X_for_param_tuning, y_for_param_tuning)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__n_iter': (10, 50, 80),
 'clf__penalty': ('l2', 'elasticnet'),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 1152 candidates, totalling 3456 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done 217 tasks      | elapsed:   13.3s
[Parallel(n_jobs=-1)]: Done 717 tasks      | elapsed:   25.9s
[Parallel(n_jobs=-1)]: Done 1417 tasks      | elapsed:   44.7s
[Parallel(n_jobs=-1)]: Done 2317 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 3417 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 3449 out of 3456 | elapsed:  1.7min remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 3456 out of 3456 | elapsed:  1.7min finished


done in 103.727s

Best score: 0.773
Best parameters set:
	clf__alpha: 1e-06
	clf__n_iter: 10
	clf__penalty: 'elasticnet'
	tfidf__norm: 'l1'
	tfidf__use_idf: True
	vect__max_df: 0.5
	vect__max_features: 50000
	vect__ngram_range: (1, 2)


** Step 11 - Run machine learning algorithms using the hyper-parameters found in the previous step **

In [22]:
# Note: the grid search selected max_df=0.5; this cell uses 0.75
cv = CountVectorizer(max_features = 50000,max_df=0.75,ngram_range=(1,2))
reviews_bow = cv.fit_transform(review_corpus)
vocab_bow = cv.get_feature_names()

In [23]:
tfidf = TfidfTransformer(norm = 'l1',use_idf=True)
reviews_tfidf = tfidf.fit_transform(reviews_bow)
print(reviews_tfidf.shape) 

(1000, 5634)


In [24]:
X = reviews_tfidf.toarray()
y = reviews_original.iloc[:, 1].values

In [25]:
# Splitting the dataset into the training set and validation set
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 0)



In [26]:
# Note: the grid search selected alpha=1e-06; this cell uses 1e-05
review_sgd_model = SGDClassifier(penalty='elasticnet',alpha=0.00001,n_iter=10)
review_sgd_model = review_sgd_model.fit(X=X_train, y=y_train)
predictions_sgd_tfidf = review_sgd_model.predict(X_val)
print("SGD Accuracy after tuning:",accuracy_score(y_val,predictions_sgd_tfidf))

SGD Accuracy after tuning: 0.75
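
An alternative to retyping the winning values by hand is to reuse the fitted search object, which keeps the tuned settings and the model in one place. The sketch below shows the pattern on a made-up eight-review corpus with a much smaller grid. It assumes a recent scikit-learn release, and `toy_reviews`, `toy_labels`, `toy_pipeline` and `toy_grid` are hypothetical names, not part of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, already cleaned; 1 = liked, 0 = not liked
toy_reviews = ["love place", "great food", "tasty great", "love food",
               "bad crust", "nasty food", "bad service", "nasty place"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),   # fixed seed so reruns agree
])
toy_grid = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__alpha': (1e-5, 1e-6),
}

search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
search.fit(toy_reviews, toy_labels)

# best_estimator_ is the pipeline refit on all the data with the winning settings
tuned_model = search.best_estimator_
predictions = tuned_model.predict(["great place", "bad food"])
print(search.best_params_)
print(predictions)
```

Because the pipeline takes raw (cleaned) text, the same object can score new reviews directly without a separate vectorizing step. To report an unbiased score, the search would need to be fitted on the training split only and evaluated on the held-out validation split.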


** Conclusion: After tuning the hyper-parameters, the SGD classifier's validation accuracy on TFIDF features rose from 0.74 to 0.75 **