# Overview

Sometimes we make decisions beyond the rating of a restaurant. For example, if a restaurant has a high rating but it often fails to pass hygiene inspections, then this information can dissuade many people to eat there. Using this hygiene information could lead to a more informative system; however, it is often the case where we don’t have such information for all the restaurants, and we are left to make predictions based on the small sample of data points.

In this task, we are going to predict whether a set of restaurants will pass the public health inspection tests given the corresponding Yelp text reviews along with some additional information such as the locations and cuisines offered in these restaurants. Making a prediction about an unobserved attribute using data mining techniques represents a wide range of important applications of data mining.

In [1]:
import pickle
import os
import pandas as pd
import numpy as np
import sys  
import sklearn
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
import ast
from xgboost import plot_importance
import matplotlib.pyplot as plt
import spacy  
import re  
import multiprocessing
from gensim.models import Word2Vec
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical
from keras import optimizers

Using TensorFlow backend.


# About Dataset
This dataset can be obtained from: https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz

The dataset is composed of a training subset containing 546 restaurants used for training your classifier, in addition to a testing subset of 12753 restaurants used for evaluating the performance of the classifier. In the training subset, we have a binary label for each restaurant, which indicates whether the restaurant has passed the latest public health inspection test or not, whereas for the testing subset, we will not have access to any labels. The dataset is spread across three files such that the first 546 lines in each file corresponding to the training subset, and the rest are part of the testing subset. Below is a description of each file:

- hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
- hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have "[None]" in their label field implying that they are part of the testing subset.
- hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).



Other Notes:
- The training data is perfectly balanced, whereas the testing data is skewed, which creates a new challenge since the training and testing data have different distributions.



### About test data and scoring test data labels

- Students cannot access labels of test dataset. In order to check f1 score of a model, call the scoring API. Please see submit.py file for more info. To get F1 score in leaderboard, call **python submit.py [ucid] [filename of file that contains test labels]**

In [2]:
# This data is provided by course instructor. Please see above note on how to access it.
hygiene_text_path= "../data/Hygiene/hygiene.dat"
hygiene_labels_path= "../data/Hygiene/hygiene.dat.labels"
hygiene_others_path= "../data/Hygiene/hygiene.dat.additional"

In [3]:
with open(hygiene_text_path) as f:
    arrText = [l.rstrip() for l in f]
with open(hygiene_labels_path) as f:
    arrLabels = [l.rstrip() for l in f]
df = pd.DataFrame({'text':arrText, 'labels':arrLabels})
hygiene_others = pd.read_csv(hygiene_others_path, names=["cuisines", "zipcode", "reviews", "avg_ratings"])
df = df.join(hygiene_others)

In [4]:
# create OHE of different cuisine types. MultiLabelBinarizer is used to transform arrays into encodings
df.cuisines = [ast.literal_eval(x) for x in df.cuisines]
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(df.cuisines),
                   columns=mlb.classes_,
                   index=df.cuisines.index)
df = df.drop("cuisines", axis =1)
df = df.join(res)

In [5]:
# Check if there is any NAs
df.columns[df.isna().any()].tolist()

[]

### Baseline model without using NLP 
Results of this base model will help to compare other advance models

In [6]:
train_df = df[df["labels"] != "[None]" ]
test_df = df[df["labels"] == "[None]" ]
X_train, y_train =train_df.drop(['text', 'labels', "zipcode"], axis=1), train_df["labels"]
X_test, y_test =test_df.drop(['text', 'labels', "zipcode"], axis=1), test_df["labels"]

In [None]:
model = XGBClassifier()
model.fit(np.array(X_train), np.array(y_train))
y_pred = model.predict(np.array(X_test))

In [None]:
np.savetxt('./baseline_predictions.out', y_pred, fmt='%s')
with open('./baseline_predictions.out', 'r') as original: data = original.read()
with open('./baseline_predictions.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)

- F1: 0.6659

### Raw text preprocesing of reviews for more advanced model
- lemmatization, lower case, removing stop words, regex processing,etc

In [None]:
nlp = spacy.load('en', disable=['ner', 'parser']) # disabling Named Entity Recognition for speed

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return ' '.join(txt)
    
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['text'])

In [None]:
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]

In [None]:
from UtilWordEmbedding import DocPreprocess
nlp = spacy.load('en', disable=['ner', 'parser']) # disabling Named Entity Recognition for speed
stop_words = spacy.lang.en.stop_words.STOP_WORDS
all_docs = DocPreprocess(nlp, stop_words, df['text'], df['labels'])

In [8]:

dir_path = "./"
# # Save all_docs as pickle.
with open(os.path.join(dir_path, 'all_docs.pickle'), 'wb') as f:
    pickle.dump(all_docs, f, pickle.HIGHEST_PROTOCOL)
# Read pickle.
with open(os.path.join(dir_path, 'all_docs.pickle'), 'rb') as f:
    all_docs = pickle.load(f)

In [9]:
len(all_docs.tagdocs), df.shape # check whether dimension is correct

(13299, (13299, 104))

### Build word embedding using Word2vec

In [None]:
workers = multiprocessing.cpu_count()
word_model = Word2Vec(all_docs.doc_words,
                      min_count=2,
                      size=100,
                      window=5,
                      workers=workers,
                      iter=100)

In [None]:
word_model.wv.syn0.shape

In [None]:
word_model.wv.syn0[1]

### Averaging word embedding for each restuarant's reviews(doc)

In [None]:
from UtilWordEmbedding import MeanEmbeddingVectorizer
mean_vec_tr = MeanEmbeddingVectorizer(word_model)
doc_vec = mean_vec_tr.transform(all_docs.doc_words)

In [None]:
doc_vec.shape

In [None]:
np.savetxt(os.path.join(dir_path,'doc_vec.csv'), doc_vec, delimiter=',')

In [109]:
doc_vec = pd.read_csv("./doc_vec.csv").values

In [110]:
mean_embedding_df = df.join(pd.DataFrame(doc_vec))

In [111]:
mean_embedding_df.shape

(13299, 204)

**XGBOOST**

In [112]:
train_df = mean_embedding_df[mean_embedding_df["labels"] != "[None]" ]
test_df = mean_embedding_df[mean_embedding_df["labels"] == "[None]" ]
X_train, y_train =train_df.drop(['text', 'labels', 'zipcode'], axis=1), train_df["labels"]
X_test, y_test =test_df.drop(['text', 'labels', 'zipcode'], axis=1), test_df["labels"]

In [None]:
dtrain = xgb.DMatrix(np.array(X_train), label=np.array(y_train))
dtest = xgb.DMatrix(np.array(X_test))

In [None]:
model = XGBClassifier(n_estimators=100, subsample=1, colsample_bytree=1, colsample_bylevel=1)
model.fit(np.array(X_train), np.array(y_train))
y_pred = model.predict(np.array(X_test))

In [None]:
param = {'max_depth': 6, 'eta': 0.3, 'objective': 'binary:logistic', 'subsample':0.8, "n_estimators":200}
param['nthread'] = 4
param['eval_metric'] = 'auc'
bst = xgb.train(param, dtrain)
y_pred = bst.predict(dtest)
y_pred = np.where(y_pred > 0.95, 1, 0)

In [None]:
np.savetxt('./average_word2vec_predictions.out', y_pred, fmt='%s')
with open('./average_word2vec_predictions.out', 'r') as original: data = original.read()
with open('./average_word2vec_predictions.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)

In [None]:
X_train.shape

- F1: 0.7027

**Keras Feed Forward Neural Net**

In [None]:
# custom supporting functions for keras model
def recall_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

def precision_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

In [None]:
model = Sequential()
model.add(Dense(150, input_dim=201, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(100, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(50, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(10, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid', kernel_initializer= "random_uniform", bias_initializer='zeros'))

model.compile(optimizer=optimizers.Adam(lr=0.0001),
              loss='binary_crossentropy',
              metrics=["binary_accuracy"])


model.fit(np.array(X_train, dtype=np.float32),np.array(y_train, dtype=np.float32) , epochs=100, batch_size=32)
y_pred = model.predict(np.array(X_test, dtype=np.float32))
y_pred = np.where(y_pred > 0.5, 1, 0)
np.savetxt('./average_word2vec_predictions_dl.out', y_pred, fmt='%s')
with open('./average_word2vec_predictions_dl.out', 'r') as original: data = original.read()
with open('./average_word2vec_predictions_dl.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)

**Logistic Regression** - results are not that great

In [113]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(np.array(X_test.fillna(0), dtype=np.float32))



In [125]:
np.savetxt('./average_word2vec_predictions_lr.out', y_pred, fmt='%s')
with open('./average_word2vec_predictions_lr.out', 'r') as original: data = original.read()
with open('./average_word2vec_predictions_lr.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)

### Term Frequency- Inverse Document Frequency & Averaging Word2vec

In [None]:
from UtilWordEmbedding import TfidfEmbeddingVectorizer
tfidf_vec_tr = TfidfEmbeddingVectorizer(word_model)

tfidf_vec_tr.fit(all_docs.doc_words)  # fit tfidf model first
tfidf_doc_vec = tfidf_vec_tr.transform(all_docs.doc_words)
np.savetxt(os.path.join(dir_path, './tfidf_doc_vec.csv'), tfidf_doc_vec, delimiter=',')


In [None]:
tfidf_mean_embedding_df = df.join(pd.DataFrame(tfidf_doc_vec))

In [None]:
train_df = tfidf_mean_embedding_df[tfidf_mean_embedding_df["labels"] != "[None]" ]
test_df = tfidf_mean_embedding_df[tfidf_mean_embedding_df["labels"] == "[None]" ]
X_train, y_train =train_df.drop(['text', 'labels', 'zipcode'], axis=1), train_df["labels"]
X_test, y_test =test_df.drop(['text', 'labels', 'zipcode'], axis=1), test_df["labels"]

In [None]:
model = Sequential()
model.add(Dense(150, input_dim=201, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(100, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(50, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))
model.add(Dense(10, activation='linear', kernel_initializer= "random_uniform"))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid', kernel_initializer= "random_uniform", bias_initializer='zeros'))

model.compile(optimizer=optimizers.Adam(lr=0.0001),
              loss='binary_crossentropy',
              metrics=["binary_accuracy"])


model.fit(np.array(X_train, dtype=np.float32),np.array(y_train, dtype=np.float32) , epochs=100, batch_size=64)
y_pred = model.predict(np.array(X_test, dtype=np.float32))

y_pred = np.where(y_pred > 0.5, 1, 0)
np.savetxt('./average_word2vec_predictions_dl.out', y_pred, fmt='%s')
with open('./average_word2vec_predictions_dl.out', 'r') as original: data = original.read()
with open('./average_word2vec_predictions_dl.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)

### Count Vectorizor using sklearn 
Results are very poor with logistic regression compare to models that uses word2vec features


In [102]:
from nltk.stem import WordNetLemmatizer
lemmer=nltk.stem.WordNetLemmatizer()
new_corpus=[' '.join([lemmer.lemmatize(word) for word in text.split(' ')])
          for text in df[df["labels"] != "[None]" ]['text']]



In [103]:
vect = CountVectorizer(stop_words=set(nltk.corpus.stopwords.words('english')) ,
#                   ngram_range=(2, 5),
#                   token_pattern = ' \b[^\d\W]+\b' #tokenizer overwrites this so see above lemma class. This match all words excepts ones that have numbers in them. This also ignores any other chars like punctuations
                      )  
X = vect.fit_transform(new_corpus)



In [105]:
countVec = df.join(pd.DataFrame(X.toarray()))

In [126]:
train_df = countVec[countVec["labels"] != "[None]" ]
test_df = countVec[countVec["labels"] == "[None]" ]
X_train, y_train =train_df.drop(['text', 'labels', 'zipcode'], axis=1), train_df["labels"]
X_test, y_test =test_df.drop(['text', 'labels', 'zipcode'], axis=1), test_df["labels"]

In [127]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(np.array(X_test.fillna(0), dtype=np.float32))
np.savetxt('./bow_predictions_lr.out', y_pred, fmt='%s')
with open('./bow_predictions_lr.out', 'r') as original: data = original.read()
with open('./bow_predictions_lr.out', 'w') as modified: modified.write("Viraj Bhalala(vbb2)\n" + data)



# Results


Baseline model
![Image ](./img/baselineModel.png )

word2vev doc average - xgboost
![Image ](./img/word2vecAvgxgboost.png )


word2vev doc average - Feed Forward Neural Net
![Image ](./img/word2vecAvgDL.png )


word2vev doc average & TF-IDF  - Feed Forward Neural Net
![Image ](./img/word2vecTFIDFDL.png )

# Conclusion:

After running various experiments, we can see that using doc average of word2vec embedding of all words in a doc provided us decent F1 score. In addition, we also found out that Feed Forward Neural Network performed quite well compare to other models in above experiments. One of the reason Neural Net did well is because we were able to customize it based on our requirements. In particular, increasing dropout layer percentage after each layer generated quite higher F1 score in test set. Moreover, we found out that dropout percentage of around 50% was optimal. We tested smaller and large percentage but resutls were not that great. One of the most challenging part of this project was to avoid overfitting training set. As you can see above, we have very small training set compare to test set, moreover training set consist only 4% of overall dataset. It is very important that we train model on training set very carefully to avoid any overfitting. Most of my early modelling attempts overfit the training set. These attempts included training xgb classifier and neural net with no or very small percentage of drop out layer(results of these models are not shown above as they were pretty bad). We also found out from these experiments that word2vec and word2vec+tf idf did quite well to generalize data and find great relationship between different review documents compare to count vectorizor(bag of words). Results from BOW on training logistic regression was very bad(not show above), this is because BOW generated very sparse feature matrix which again overfits the training data.




How can we improve this in future?

There are couple of ways we can improve the results. First of all, we can put more effort to create better modelling pipeline that can include grid search to find best paramenters of different models. One the main challenge in this part is that we have to incorporate API that instructor created in our code to automatically get F1 score of test set after each attempts. The second challenge is that it doesnt make sense to further split train set into train and validation set since train set is such as small set. In addition, we are not allowed to acces test set. Finally, after having a good modelling pipleine, we can test other text processing methods such as doc2vec, LDA, tf-idf, etc.