## DataCamp Tutorial
#### Detecting True and Deceptive Hotel Reviews using Machine Learning

https://www.datacamp.com/community/tutorials/machine-learning-hotel-reviews

http://myleott.com/op-spam.html

_Before getting anything going, I set up a virtual environment to use with ML notebooks run locally._

In [272]:
import os
import fnmatch
from textblob import TextBlob
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import pos_tag,pos_tag_sents
import regex as re
import operator
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

from sklearn.model_selection import train_test_split, GridSearchCV
# in scikit-learn < 0.18 these lived in sklearn.cross_validation and sklearn.grid_search

from sklearn import metrics
from sklearn import svm
import pickle

## Fetch text files & join text into one
The path contains a number of .txt files downloaded from http://myleott.com/op-spam.html

In [273]:
path = '../tutorials/HotelReviews_tutorial/op_spam_train' # edit path for your local dir
label = []
configfiles = [os.path.join(subdir,f)
              for subdir, dirs, files in os.walk(path)
                for f in fnmatch.filter(files, '*.txt')]

_There should be 1600 paths each representing a new text file._



In [274]:
len(configfiles)

1600

ex: _print one of the text file paths to see_

In [275]:
configfiles[1]

'../tutorials/HotelReviews_tutorial/op_spam_train/positive_polarity/deceptive_from_MTurk/fold2/d_talbott_8.txt'

## Extract Labels into a dataframe
Next, use a regex to pull the label ("truth" or "deceptive") out of each .txt file path.


In [276]:
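for f in configfiles:
    # raw string so \w reaches the regex engine unchanged;
    # matches "truth" or "deceptive" in the file path
    c = re.search(r'(trut|deceptiv)\w', f)
    label.append(c.group())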
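
As a small self-contained illustration of the label extraction, here is the same pattern applied to two paths. The first path is the one printed above; the second is a hypothetical truthful-review path made up for this sketch, and `label_from_path` is a helper name introduced here, not part of the tutorial.

```python
import re

# Same pattern as in the notebook: "trut" or "deceptiv" plus one more word character.
LABEL_PATTERN = re.compile(r"(trut|deceptiv)\w")

def label_from_path(path):
    """Return 'truth' or 'deceptive' based on the directory names in the path."""
    match = LABEL_PATTERN.search(path)
    if match is None:
        # Fail loudly instead of crashing later on None.group()
        raise ValueError(f"no label found in path: {path}")
    return match.group()

paths = [
    # path printed earlier in this notebook
    "../tutorials/HotelReviews_tutorial/op_spam_train/positive_polarity/"
    "deceptive_from_MTurk/fold2/d_talbott_8.txt",
    # hypothetical path, assumed for illustration only
    "op_spam_train/positive_polarity/truthful_example_dir/fold1/t_example_1.txt",
]

extracted = [label_from_path(p) for p in paths]
print(extracted)
```

Raising on a missing match is a design choice for this sketch; the notebook code assumes every path contains one of the two labels.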

Create a dataframe from the `label` list created above.

In [277]:
labels = pd.DataFrame(label, columns = ['Labels'])

In [278]:
labels.head(5)

|   | Labels |
|---|--------|
| 0 | deceptive |
| 1 | deceptive |
| 2 | deceptive |
| 3 | deceptive |
| 4 | deceptive |


## Create a list of the reviews
Once you have extracted all the labels, it's time to extract the reviews from the text files!


In [279]:
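review = []
directory = os.path.join('../tutorials/HotelReviews_tutorial/op_spam_train')

for subdir, dirs, files in os.walk(directory):
    # keep only the .txt files, mirroring how configfiles was built,
    # so that reviews and labels stay in the same order
    for file in fnmatch.filter(files, '*.txt'):
        # the with-block closes each file after reading
        with open(os.path.join(subdir, file), 'r') as f:
            review.append(f.read())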


Create a dataframe of the reviews 

In [280]:
reviews = pd.DataFrame(review, columns = ['HotelReviews'])
reviews.head(5)

|   | HotelReviews |
|---|--------------|
| 0 | excellent staff and customer service, very cle... |
| 1 | My stay at this hotel was one of the best I ha... |
| 2 | We just got back from a trip to Chicago for my... |
| 3 | I have to say that the Hard Rock Hotel in Chic... |
| 4 | My husband and I recently stayed at the Hard R... |


### Merge Labels and Reviews DataFrames together!
Create a new Dataframe called "result" 

In [281]:
result = pd.merge(reviews, labels, right_index=True, left_index=True)
# merge the two DataFrames on their index into a new DataFrame called result

result['HotelReviews'] = result['HotelReviews'].map(lambda x: x.lower())
# map over the "HotelReviews" column and convert the text to lowercase

result.head()

|   | HotelReviews | Labels |
|---|--------------|--------|
| 0 | excellent staff and customer service, very cle... | deceptive |
| 1 | my stay at this hotel was one of the best i ha... | deceptive |
| 2 | we just got back from a trip to chicago for my... | deceptive |
| 3 | i have to say that the hard rock hotel in chic... | deceptive |
| 4 | my husband and i recently stayed at the hard r... | deceptive |


## Removing stopwords
Stopwords (very common words such as "and", "the", "is") carry little meaning on their own and are usually not helpful when training a model.
* Create a new column called "review_without_stopwords"

In [282]:
import nltk
# nltk.download() # this opens a gui python downloader.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/siggy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

_**Note:** I had some issues getting the nltk above working. I got an error about SSL certificates and found a fix here:_ https://stackoverflow.com/questions/41348621/ssl-error-downloading-nltk-data

In [283]:
stop = stopwords.words('english')

_Create a new column in the result DataFrame that holds each hotel review with the stopwords removed._

In [284]:
result['review_without_stopwords'] = result['HotelReviews'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
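
To make the filtering step concrete, here is a minimal sketch of the same idea on a single sentence. It uses a tiny hand-written stopword set (an assumption for illustration; the notebook uses the full NLTK English list), and `remove_stopwords` is a helper name introduced here.

```python
# Hypothetical mini stopword set; the notebook uses stopwords.words('english').
stop_words = {"and", "very", "the", "is", "with", "a", "i", "it"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

original = "excellent staff and customer service, very clean and spotless."
cleaned = remove_stopwords(original, stop_words)
print(cleaned)

# Tokens are split on whitespace only, so punctuation stays attached:
# "it." is not the same token as "it" and therefore survives the filter.
with_punctuation = remove_stopwords("loved it.", stop_words)
print(with_punctuation)
```

Using a `set` rather than a list makes each membership check constant time, which may matter when the filter runs over all 1600 reviews.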

In [285]:
result.head()

|   | HotelReviews | Labels | review_without_stopwords |
|---|--------------|--------|--------------------------|
| 0 | excellent staff and customer service, very cle... | deceptive | excellent staff customer service, clean spotle... |
| 1 | my stay at this hotel was one of the best i ha... | deceptive | stay hotel one best ever had! location, servic... |
| 2 | we just got back from a trip to chicago for my... | deceptive | got back trip chicago 30th birthday could impr... |
| 3 | i have to say that the hard rock hotel in chic... | deceptive | say hard rock hotel chicago cool place stay. f... |
| 4 | my husband and i recently stayed at the hard r... | deceptive | husband recently stayed hard rock hotel chicag... |


In [286]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
    
printmd('__Original:__')
print(result['HotelReviews'][0])
printmd("__Without Stopwords:__")
print(result['review_without_stopwords'][0])

__Original:__

excellent staff and customer service, very clean and spotless. elegant and luxurious with a beautiful ocean view. the bed is very comfortable and relaxing. i give it a five star.



__Without Stopwords:__

excellent staff customer service, clean spotless. elegant luxurious beautiful ocean view. bed comfortable relaxing. give five star.


In [287]:
# t = result[result.Labels == 'truth']
# print(t)

# print(t['HotelReviews'][400])

<hr>

## Extra: Save the DF to a CSV
_**Note:**_ You can save the result DF to a CSV. This is useful if you want to use Colaboratory rather than a local notebook. Do your data wrangling on your machine, then upload the compiled dataframe to Google Drive and import it into Colaboratory. :)  <br>
For some reason, writing the CSV works here but errors out further down in the doc. Some of the data the code further down returns is probably not a real string. FYI: The result.csv file will be saved in the same dir as this notebook.

In [288]:
# result.to_csv('result.csv', sep='\t', index=False)
result.to_csv('result.csv', index=False)


<hr>

## Extract parts of speech to be used as Feature Input
"TextBlob is a Python library for processing textual data. <br> 
It provides a simple API for diving into ordinary natural language processing (NLP) <br>
tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, <br>
classification, translation, and more."

In [289]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/siggy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/siggy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [290]:
def pos(review_without_stopwords):
    return TextBlob(review_without_stopwords).tags

In [291]:
# This takes a pretty long time to run
# named "tagged" rather than "os" so the imported os module is not shadowed
tagged = result.review_without_stopwords.apply(pos)
os1 = pd.DataFrame(tagged)

In [292]:
os1.head()

|   | review_without_stopwords |
|---|--------------------------|
| 0 | [(excellent, JJ), (staff, NN), (customer, NN),... |
| 1 | [(stay, JJ), (hotel, NN), (one, CD), (best, JJ... |
| 2 | [(got, VBD), (back, RB), (trip, NN), (chicago,... |
| 3 | [(say, VB), (hard, JJ), (rock, NN), (hotel, NN... |
| 4 | [(husband, NN), (recently, RB), (stayed, VBD),... |


* "You will not be able to vectorize a list which you will be feeding into the model. So, you will have to convert these rows of lists into string."

* "Let's convert each row into a string, where each word will be joined with its corresponding pos using a forward slash, and a single space will separate the words."

In [293]:
# each row is a list of (word, tag) tuples; join them as "word/TAG word/TAG ..."
os1['pos'] = os1['review_without_stopwords'].map(
    lambda tags: " ".join("/".join(pair) for pair in tags))
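
The conversion described above can be sketched on one hand-written row. The tag list below is copied from the first row of the output shown earlier (truncated to three pairs), and `tags_to_string` is a helper name made up for this example.

```python
def tags_to_string(tags):
    """Join (word, tag) pairs as 'word/TAG', separated by single spaces."""
    return " ".join("/".join(pair) for pair in tags)

# First three (word, tag) pairs of row 0, as shown in the output above.
row = [("excellent", "JJ"), ("staff", "NN"), ("customer", "NN")]

pos_string = tags_to_string(row)
print(pos_string)  # excellent/JJ staff/NN customer/NN
```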


* Finally, let's merge the pos column with the main result dataframe and print first few rows of it!

In [294]:
result = pd.merge(result, os1, right_index = True, left_index = True)
result.head()

|   | HotelReviews | Labels | review_without_stopwords_x | review_without_stopwords_y | pos |
|---|--------------|--------|----------------------------|----------------------------|-----|
| 0 | excellent staff and customer service, very cle... | deceptive | excellent staff customer service, clean spotle... | [(excellent, JJ), (staff, NN), (customer, NN),... | excellent/JJ staff/NN customer/NN service/NN c... |
| 1 | my stay at this hotel was one of the best i ha... | deceptive | stay hotel one best ever had! location, servic... | [(stay, JJ), (hotel, NN), (one, CD), (best, JJ... | stay/JJ hotel/NN one/CD best/JJS ever/RB had/V... |
| 2 | we just got back from a trip to chicago for my... | deceptive | got back trip chicago 30th birthday could impr... | [(got, VBD), (back, RB), (trip, NN), (chicago,... | got/VBD back/RB trip/NN chicago/NN 30th/CD bir... |
| 3 | i have to say that the hard rock hotel in chic... | deceptive | say hard rock hotel chicago cool place stay. f... | [(say, VB), (hard, JJ), (rock, NN), (hotel, NN... | say/VB hard/JJ rock/NN hotel/NN chicago/NN coo... |
| 4 | my husband and i recently stayed at the hard r... | deceptive | husband recently stayed hard rock hotel chicag... | [(husband, NN), (recently, RB), (stayed, VBD),... | husband/NN recently/RB stayed/VBD hard/JJ rock... |


# Training 
https://www.datacamp.com/community/tutorials/machine-learning-hotel-reviews#training

* Split the data into two parts for training and testing (80/20)
* Use `random_state = 13` so the split is reproducible

In [310]:
review_train, review_test, label_train, label_test = train_test_split(
    result['pos'], result['Labels'], test_size=0.2, random_state=13)


#### Vectorize the Training and Testing data using TfidfVectorizer
* *Tf* (term frequency): how many times a term appears in a document.
* *idf* (inverse document frequency): down-weights terms that appear in many documents and up-weights rare ones. It is based on the log of the number of docs divided by the number of docs where the term appears.

The last step is multiplying tf x idf to get a _weight_ for each term in each document.
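
As a worked example of the weighting, here is a minimal pure-Python sketch on three toy documents (made up for illustration). It assumes the smoothed variant that scikit-learn documents for `smooth_idf=True`, namely idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. It is meant to show the arithmetic, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already split into tokens.
docs = [["clean", "hotel"], ["noisy", "hotel"], ["friendly", "staff"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Return L2-normalised tf * idf weights for one tokenised document."""
    term_counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc_weights = tfidf(docs[0])
print(first_doc_weights)
```

In the first document, "clean" (which appears in one document) ends up with a larger weight than "hotel" (which appears in two), and that is the effect the idf term is meant to produce.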

In [311]:
tf_vect = TfidfVectorizer(
    lowercase = True, use_idf=True, smooth_idf=True, sublinear_tf=False)

X_train_tf = tf_vect.fit_transform(review_train)

X_test_tf = tf_vect.transform(review_test)

## Implementing the model
* Using a machine learning model known as Support Vector Machines (SVM)
http://scikit-learn.org/stable/modules/svm.html
* "To select the best hyperparameters for your ML algorithm, you will use GridSearchCV, which based on your training data and labels suggests you the best hyperparameter values out of the values that you specify as a list. You will choose five different values for Cs and gammas and based on your data; you will get the best hyperparameter values."

In [312]:
def svc_param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    # note: gamma is ignored by the linear kernel, so only C affects the fit here;
    # the gamma grid is kept to match the tutorial
    gammas = [0.001, 0.01, 0.1, 1]
    param_grid = {'C': Cs, 'gamma' : gammas}
    grid_search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_

In [313]:
svc_param_selection(X_train_tf,label_train,5)

{'C': 10, 'gamma': 0.001}

clf = svm.SVC(C=10,gamma=0.001,kernel='linear')
clf.fit(X_train_tf,label_train)
pred = clf.predict(X_test_tf)
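
One possible variation on the search above, sketched on a tiny made-up corpus: put the vectorizer and the classifier in a scikit-learn `Pipeline` and search only over `C`, since `C` is the regularisation parameter a linear-kernel `SVC` actually uses. The documents, labels and the three-value grid are assumptions for illustration; only `Pipeline`, `GridSearchCV`, `TfidfVectorizer` and `SVC` come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: six documents per class, enough for 3-fold CV.
docs = [
    "amazing luxurious stay perfect", "perfect amazing hotel luxurious",
    "luxurious perfect amazing room", "amazing perfect luxurious staff",
    "perfect luxurious amazing view", "amazing luxurious perfect bed",
    "room small parking expensive", "parking expensive small lobby",
    "small room expensive wifi", "expensive parking small bathroom",
    "small expensive parking street", "parking small expensive elevator",
]
labels = ["deceptive"] * 6 + ["truth"] * 6

# The vectorizer is refitted inside every CV fold, so no fold sees
# vocabulary or idf statistics from its own validation data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(kernel="linear")),
])

# "svc__C" addresses the C parameter of the pipeline step named "svc".
param_grid = {"svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
best_params = search.best_params_
print(best_params)
```

Keeping the vectorizer inside the pipeline is a design choice of this sketch; the notebook fits `tf_vect` once on the full training split before the search.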

#### Save the model you just trained and the tfidf vectorizer
* "Let's save the *model* that you just trained along with the *Tfidf vectorizer* using the _pickle library_ that you had imported in the beginning, so that later on you can just simply load the data, vectorize it and predict using the ML model." 

In [314]:
with open('vectorizer.pickle', 'wb') as fin:
    pickle.dump(tf_vect, fin)

In [315]:
with open('mlmodel.pickle', 'wb') as f: 
    pickle.dump(clf,f)

#### Load the tfidf vectorizer and the ML model

In [316]:
# with-blocks close the files once the objects are loaded
with open('mlmodel.pickle', 'rb') as pkl:
    clf = pickle.load(pkl)
with open('vectorizer.pickle', 'rb') as vec:
    tf_vect = pickle.load(vec)

#### Predict on the test data

In [317]:
X_test_tf = tf_vect.transform(review_test)
# predict with the reloaded model
pred = clf.predict(X_test_tf)

#### Analyse the performance of the model.
#### Print:
* accuracy score
* confusion matrix
* and the classification report

In [318]:
printmd("**Accuracy**")
print(metrics.accuracy_score(label_test, pred))

printmd("**Confusion Matrix**")
print(confusion_matrix(label_test, pred))

printmd("**Classification report**")
print (classification_report(label_test, pred))

**Accuracy**

0.865625


**Confusion Matrix**

[[139  16]
 [ 27 138]]


**Classification report**

             precision    recall  f1-score   support

  deceptive       0.84      0.90      0.87       155
      truth       0.90      0.84      0.87       165

avg / total       0.87      0.87      0.87       320



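#### Explanation of the above:
_Confusion Matrix:_ <br>
Rows are the true labels and columns are the predicted labels, in sorted label order (`deceptive`, `truth`). Treating `deceptive` as the positive class: <br>
TP: True positive,  FN: False negative, <br>
FP: False positive, TN: True negative

[[ TP FN ] <br>
 [ FP TN ]]
 
With the result: <br>
[[139  16] <br>
 [ 27 138]]<br>
**16 + 27 = 43 out of a total of 320 were misclassified**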
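
The figures in the classification report can be recomputed by hand from the confusion matrix. A minimal sketch, taking `deceptive` as the positive class and the rows as true labels (the layout scikit-learn's `confusion_matrix` uses):

```python
# Confusion matrix from the run above: rows = true label, columns = predicted label,
# both in the order (deceptive, truth).
matrix = [[139, 16],
          [27, 138]]

tp, fn = matrix[0]  # deceptive reviews predicted as deceptive / as truth
fp, tn = matrix[1]  # truthful reviews predicted as deceptive / as truth

total = tp + fn + fp + tn
accuracy = (tp + tn) / total                         # share of correct predictions
precision = tp / (tp + fp)                           # predicted deceptive that really are
recall = tp / (tp + fn)                              # deceptive reviews that were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy  = {accuracy:.6f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1:.2f}")
```

These values match the `deceptive` row of the classification report and the accuracy printed above.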

### Test the model using random states
#### random_state = 1

In [319]:
review_train, review_test, label_train, label_test = train_test_split(
    result['pos'],result['Labels'], test_size=0.2,random_state=1)

X_test_tf = tf_vect.transform(review_test)
pred = clf.predict(X_test_tf)

printmd("**Accuracy**")
print(metrics.accuracy_score(label_test, pred))

printmd("**Confusion Matrix**")
print (confusion_matrix(label_test, pred))

printmd("**Classification report**")
print (classification_report(label_test, pred))

**Accuracy**

0.959375


**Confusion Matrix**

[[145   6]
 [  7 162]]


**Classification report**

             precision    recall  f1-score   support

  deceptive       0.95      0.96      0.96       151
      truth       0.96      0.96      0.96       169

avg / total       0.96      0.96      0.96       320



#### random_state = 10

In [320]:
review_train, review_test, label_train, label_test = train_test_split(
    result['pos'],result['Labels'], test_size=0.2,random_state=10)

X_test_tf = tf_vect.transform(review_test)
pred = clf.predict(X_test_tf)

printmd("**Accuracy**")
print(metrics.accuracy_score(label_test, pred))
printmd("**Confusion Matrix**")
print (confusion_matrix(label_test, pred))
printmd("**Classification report**")
print (classification_report(label_test, pred))


**Accuracy**

0.978125


**Confusion Matrix**

[[156   4]
 [  3 157]]


**Classification report**

             precision    recall  f1-score   support

  deceptive       0.98      0.97      0.98       160
      truth       0.98      0.98      0.98       160

avg / total       0.98      0.98      0.98       320



#### random_state =  42

In [321]:
review_train, review_test, label_train, label_test = train_test_split(
    result['pos'],result['Labels'], test_size=0.2,random_state=42)

X_test_tf = tf_vect.transform(review_test)
pred = clf.predict(X_test_tf)

printmd("**Accuracy**")
print(metrics.accuracy_score(label_test, pred))

printmd("**Confusion Matrix**")
print (confusion_matrix(label_test, pred))

printmd("**Classification report**")
print (classification_report(label_test, pred))

**Accuracy**

0.96875


**Confusion Matrix**

[[163   5]
 [  5 147]]


**Classification report**

             precision    recall  f1-score   support

  deceptive       0.97      0.97      0.97       168
      truth       0.97      0.97      0.97       152

avg / total       0.97      0.97      0.97       320



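* "From the above predictions, you can observe that the Model did a Fantastic Job and is not overfitting _since you tested the model several times by splitting the data differently everytime_"

_**Note:** `clf` and `tf_vect` were fitted once on the `random_state=13` training split and are not refitted in the cells above. Each new split therefore draws most of its test reviews from data the model has already been trained on, which explains why these scores (0.96 to 0.98) are much higher than the 0.87 measured on the original held-out test set. The 0.87 figure is the more reliable estimate; re-splitting without retraining cannot rule out overfitting._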
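
A rough way to gauge how much a re-split test set overlaps with the original training set is sketched below. It uses only the standard library and shuffles indices with `random.Random`, which is an assumption made for illustration: `train_test_split` uses a different random generator, so the exact indices will differ, but the proportions should be similar. `split_indices` is a helper name made up for this example.

```python
import random

n_reviews = 1600   # number of reviews in the dataset
test_size = 320    # 20% held out, as in the notebook

def split_indices(seed):
    """Shuffle the row indices with the given seed and cut off a test set."""
    rng = random.Random(seed)
    indices = list(range(n_reviews))
    rng.shuffle(indices)
    test = set(indices[:test_size])
    train = set(indices[test_size:])
    return train, test

train_13, _ = split_indices(13)   # split the model was trained on
_, test_1 = split_indices(1)      # a later re-split used only for testing

# Test rows of the new split that the model already saw during training
seen_during_training = len(test_1 & train_13)
fraction_seen = seen_during_training / test_size
print(f"{seen_during_training} of {test_size} test rows "
      f"({fraction_seen:.0%}) were in the original training set")
```

With an 80/20 split, roughly four out of five rows in any fresh test set would be expected to come from the earlier training set, unless the model is refitted on the new training split.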

## Test the model with two Yelp reviews

In [322]:
def test_string(s):
    X_test_tf = tf_vect.transform([s])
    y_predict = clf.predict(X_test_tf)
    return y_predict

In [323]:
test_string("The hotel was good.The room had a 27-inch Samsung led tv, a microwave.The room had a double bed")


array(['truth'], dtype=object)

In [324]:
test_string("My family and I are huge fans of this place. The staff is super nice, and the food is great. The chicken is very good, and the garlic sauce is perfect. Ice cream topped with fruit is delicious too. Highly recommended!")

array(['truth'], dtype=object)