# Yelp Dataset Challenge

![Yelp Data Challenge](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/6d323fc75cb1/assets/img/dataset/960x225_dataset@2x.png)

## Natural Language Processing

In [1]:
import pandas as pd

df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [2]:
df.head()

| |business_id|name|categories|avg_stars|cool|date|funny|review_id|stars|text|useful|user_id|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|kgffcoxT6BQp-gJ-UQ7Czw|Subway|Fast Food, Restaurants, Sandwiches|2.5|0|2016-07-03|0|c6iTbCMMYWnOd79ZiWwobg|1|I ordered a few 12 inch sandwiches , a turkey ...|1|ih7Dmu7wZpKVwlBRbakJOQ|
|1|kgffcoxT6BQp-gJ-UQ7Czw|Subway|Fast Food, Restaurants, Sandwiches|2.5|0|2018-03-10|0|5iDdZvpK4jOv2w5kZ15TUA|1|Worst subway of any I have visited. I have man...|1|m3WBc9bGxn1q1ikAFq8PaA|
|2|kgffcoxT6BQp-gJ-UQ7Czw|Subway|Fast Food, Restaurants, Sandwiches|2.5|0|2016-12-26|0|oCUrLS4T-paZBr6WnrXg_A|2|Good luck trying to get the order right. The c...|0|H7bJDtGzhdg1fsmBL4KZWg|
|3|kgffcoxT6BQp-gJ-UQ7Czw|Subway|Fast Food, Restaurants, Sandwiches|2.5|0|2016-12-16|0|qXHvWYgL-8yfcGvP_ydKGA|2|Here to get my pick up order at the moment it ...|0|58sXi_0oTgVlM3aUuFYHUA|
|4|0jtRI7hVMpQHpUVtUy4ITw|Omelet House Summerlin|Beer, Wine & Spirits, Italian, Food, American ...|4.0|1|2016-12-29|0|j9l7IMJX9bvWjkJ18EWGpg|5|My husband & I were visiting the area, found t...|0|ZS7V0uC4kVrJR_4Yi3oTHA|


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398037 entries, 0 to 398036
Data columns (total 12 columns):
business_id    398037 non-null object
name           398037 non-null object
categories     398037 non-null object
avg_stars      398037 non-null float64
cool           398037 non-null int64
date           398037 non-null object
funny          398037 non-null int64
review_id      398037 non-null object
stars          398037 non-null int64
text           398037 non-null object
useful         398037 non-null int64
user_id        398037 non-null object
dtypes: float64(1), int64(4), object(7)
memory usage: 36.4+ MB


## Define features

In [4]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text']

# Inspect the documents, e.g. check the size and peek at the first few elements of the pandas Series
print(documents.shape)
print('1:')
print(documents[0])
print('2:')
print(documents[1])
print('3:')
print(documents[2])

(398037,)
1:
I ordered a few 12 inch sandwiches , a turkey and a chicken... wanted avocado...they were out!
Really?? This store is in the same parking lot as SMITHS Grocery and these lazy and UNEMPOWERED employees can't take a five'r over there and buy a few avocado's for peoples sandwiches???  Subway needs to look at their policies, training, and allow their people to be able to satisfy their customer's needs....Adios!!
2:
Worst subway of any I have visited. I have many years of eating at subway all over the country and even in England once. I've had 2 bad experiences here at this location. About a year ago I went in and they had no bread. They claimed that they ran out. How the F does subway run out of bread. It's literally their main item. My 2nd bad experience was 2 days ago. I got stale bread. Again, bread is their main item.  How can they serve stale bread? They really suck at their job!  I would avoid this location and go to any other location.
3:
Good luck trying to get the ord

## Define target

* The target should be a categorical variable
* Candidate target columns are `avg_stars`, `cool`, `funny`, `stars`, `useful`
  * Only `avg_stars` and `stars` turn out to be suitable; the others take too many distinct values

In [5]:
# Check the unique values
for col in ['avg_stars', 'cool', 'funny', 'stars', 'useful']:
    print(col + ':')
    print(df[col].unique())

avg_stars:
[2.5 4.  1.5 4.5 3.  2.  3.5 5.  1. ]
cool:
[  0   1   3   2   9  16  11   7   4  30   6  31  17  28   5   8  37  46
  10  36  14  24  20  13  12  18  34  15  39  21  62  43  19  22  27  83
  26  80  23  60  54  25  49  29  96  81  71  35  92  38  68  48  47  32
  33  52  50  41  70  53  69  65  51  42 172  63  91  82  40  44 170  59
  61  58 131  77  88  66  64  56  45  84  75 104  72  73  97  98  78  57
 127  76  79  74  55 107 111  67 200 195  87  85 102  89 198 181 101 121
 208  90 114  86 227  93 123 207 119 113]
funny:
[  0   1   2   3  11   7   5   4  19  17   6  10  24  20  26  25   9  12
   8  35  14  18  47  41  13  16  62  38  55  39  33  15  53  23  28  29
  63  71  22  49  34  27  21  56  32  37  60  46  98  31  44  65  48 102
  77  42  68  43  40  59  45  30  88  67  72  84  73  51  99  80 183 160
  75  50  36  52  61 181  64  87  74  69  89 182  78  57  70 202  92 105
  79]
stars:
[1 2 5 4 3]
useful:
[  1   0   2   4   5   3  10  16  12   7  29  31  17   6  20

### 5 stars and non-5 stars rating

In [6]:
# Make a column and take the values, save to a variable named "target"
df['five stars'] = (df['stars'] > 4)
target = df['five stars']

#### Summary statistic

In [7]:
target.describe()

count     398037
unique         2
top        False
freq      202056
Name: five stars, dtype: object

In [8]:
target.mean(), target.std(), target.shape

(0.4923687998854378, 0.4999423894034161, (398037,))
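
Because `target` is boolean, its mean is simply the share of five-star reviews, and its standard deviation follows from that share. The sketch below reproduces both numbers from the counts in the `describe()` output. It assumes pandas' default sample standard deviation (`ddof=1`); the variable names are illustrative only.

```python
import math

n_total = 398037            # count reported by target.describe()
n_false = 202056            # freq of the top value (False)
n_true = n_total - n_false  # number of five-star reviews

# Mean of a boolean column = share of True values
p = n_true / n_total

# Sample standard deviation (ddof=1) of a 0/1 variable
std = math.sqrt(p * (1 - p) * n_total / (n_total - 1))

print(n_true, p, std)
```

With roughly 49% of reviews at five stars, the two classes are nearly balanced, so accuracy is a reasonable first metric for the classifiers below.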

## Train and test splitting

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(documents, target, test_size=0.4, random_state=42)

## NLP representation

1. Create `TfidfVectorizer`, and name it `vectorizer`
2. Train the model with your training data
3. Use the trained model to transform your test data
4. Get the vocab of your tfidf
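
To make the fit/transform distinction in these steps concrete, here is a simplified from-scratch sketch of TF-IDF. It assumes whitespace tokenization and the smoothed IDF formula that scikit-learn documents as its default, followed by L2 normalization; it leaves out stop-word removal and `max_features`. The helpers `tokenize`, `fit_idf`, and `transform` are hypothetical names, not part of scikit-learn.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplification: lowercase and split on whitespace
    return text.lower().split()

def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Map one document to an L2-normalized TF-IDF dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["great food great service", "slow service", "great place"]
idf = fit_idf(train_docs)

# "sushi" never appeared in training, so it contributes nothing to the test vector
test_vec = transform("great sushi", idf)
print(test_vec)
```

The key point is that the test document is weighted with statistics learned from the training documents, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.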

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', max_features=5000)
vec_train = vectorizer.fit_transform(X_train)
vec_test = vectorizer.transform(X_test)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()

In [11]:
print(vocab)



In [12]:
vec_train_array = vec_train.toarray()
vec_test_array = vec_test.toarray()

In [13]:
print(documents.shape[0] * 0.6, documents.shape[0] * 0.4)
print(X_train.shape, X_test.shape)
print(vec_train.shape, vec_test.shape, vec_train_array.shape, vec_test_array.shape)

238822.19999999998 159214.80000000002
(238822,) (159215,)
(238822, 5000) (159215, 5000) (238822, 5000) (159215, 5000)
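
The split is not exactly 60/40 because the sizes must be whole numbers. A small sketch of the arithmetic, consistent with the shapes printed above, assuming the fractional test size is rounded up and the training set receives the remainder:

```python
import math

n_samples = 398037
test_size = 0.4

n_test = math.ceil(test_size * n_samples)  # fractional test size rounded up
n_train = n_samples - n_test               # training set gets the remainder

print(n_train, n_test)
```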


`vec_train_array` and `vec_test_array` are dense copies of the sparse TF-IDF matrices, and almost all of their elements are 0.
They are hard to read in matrix form, so there is no need to print and check them.
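
The dense copies are also expensive to hold in memory. A rough estimate, assuming 8-byte `float64` entries (the helper `dense_bytes` is a hypothetical name used only for this illustration):

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to store a dense n_rows x n_cols array of 8-byte floats."""
    return n_rows * n_cols * itemsize

train_gb = dense_bytes(238822, 5000) / 1e9
test_gb = dense_bytes(159215, 5000) / 1e9
print(f"train: {train_gb:.2f} GB, test: {test_gb:.2f} GB")
```

The classifiers later in this notebook are fit on the sparse `vec_train` directly, so the dense arrays are only needed for inspection.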

In [14]:
# print(vec_train_array) # sparse matrix

In [15]:
# print(vec_test_array) # sparse matrix

## Similar review search engine

In [16]:
import numpy as np

def get_top_values(lst, n, labels):
    '''
    Input: list, integer, list
    Output: list
    
    Given a list of values, find the indices with the highest n values.
    Return the labels for each of these indices
    
    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ['cat', 'dog', 'mouse', 'pig', 'rabbit']
    output: ['cat', 'pig']
    '''
    # np.argsort() sorts in ascending order by default, so reverse it with [::-1]
    return [labels[i] for i in np.argsort(lst)[::-1][:n]]

In [17]:
def get_bottom_values(lst, n, labels):
    '''
    Input: list, integer, list
    Output: list
    
    Given a list of values, find the indices with the lowest n values.
    Return the labels for each of these indices.
    
    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ['cat', 'dog', 'mouse', 'pig', 'rabbit']
    output: ['mouse', 'rabbit']
    '''
    return [labels[i] for i in np.argsort(lst)[:n]]

Pick a random review from the test sample, compare it with the training samples using cosine similarity, and select the five most similar training reviews.
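
As a reference for what the similarity search computes, here is a minimal pure-Python sketch of cosine similarity and top-n ranking. The toy vectors are made-up numbers over a three-word vocabulary, and `cosine` and `top_n_similar` are hypothetical helper names; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on sparse matrices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def top_n_similar(query, candidates, n):
    """Indices of the n candidates most similar to the query, best first."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:n]

# Toy "training" vectors and one query vector
train_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vector = [0.5, 0.5, 0.0]

print(top_n_similar(query_vector, train_vectors, 2))
```

A score of 0 means the two reviews share no vocabulary terms, which is why reviews without overlapping words never appear among the top matches.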

In [18]:
# cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Draw an arbitrary review from test (unseen in training) documents
# random_number = np.random.randint(0, documents.shape[0]) # only used for verification
random_number = np.random.randint(0, vec_test.shape[0])
# random_review = documents[random_number]
# .iloc selects by position; X_test keeps the original (shuffled) index labels
random_review = X_test.iloc[random_number] # use the test sample
# print(vec_test.shape[0], X_test.shape)
print(random_number)
print(random_review)

# Transform the drawn review(s) to vector(s)
vec_random_review = vectorizer.transform([random_review])
print(vec_random_review.shape)
print(vec_random_review.toarray()) # dense view of a sparse vector; mostly zeros

# Calculate the similarity score(s) between vector(s) and training vectors
similarity_scores = cosine_similarity(vec_random_review, vec_train) # vec_random_review: 1 x 5000, vec_train: 238822 x 5000
print(similarity_scores.shape)
print(similarity_scores) # cosine similarity between vec_random_review and every row of vec_train; vec_train has 238822 rows, so there are 238822 values

40790
I'm from California and was here visiting my parents who live here. We went to Bass Pro and where looking for a place to grab a bite. This place did not disappoint. I had the Mommas sandwich. Kind of like a club but better. It was fantastic. I got it with hash browns and they were good as well. My mom got the french dip and my dad got a caesar salad and they loved it. Service was fantastic to. I cant recall the waitress name but she was on spot. Highly recommend this place and will definitely be coming back next time I visit.
(1, 5000)
[[0. 0. 0. ... 0. 0. 0.]]
(1, 238822)
[[0.05936968 0.05803406 0.07272053 ... 0.10705818 0.08946764 0.        ]]


In [19]:
# Let's find top 5 similar reviews
n = 5
top_reviews = get_top_values(similarity_scores[0], n, X_train.values) # compare against the training sample
print(len(top_reviews))
print(top_reviews)

5
["This place is my favorite and they never disappoint. Every time I've come in the food is fantastic, nyzah is fantastic and i just love the atmosphere. I will always recommend this place to others!!!", 'I loved this place! It was my first time here and everything was fantastic. It felt very Italian, great vibe from the service, live music and fantastic food. I definitely will be back next time I am in Vegas!', 'Had lunch here for the first time. The food and service was fantastic! ! Highly recommend! !!!', 'This place is fantastic! A little pricy for a sandwich but worth it. Every bite is delicious. I highly recommend this place if you are hungry and in need of a quick bite.', 'This place is great. Food was delicious and the customer service was fantastic! Definitely coming back here.']


In [20]:
# print('Our search query:')
# print(random_review) # To be added

In [21]:
print('Most %s similar reviews:' % n)
# print()  # To be added
for index, review in enumerate(top_reviews):
    print(str(index) + ':\t' + review + '\n')

Most 5 similar reviews:
0:	This place is my favorite and they never disappoint. Every time I've come in the food is fantastic, nyzah is fantastic and i just love the atmosphere. I will always recommend this place to others!!!

1:	I loved this place! It was my first time here and everything was fantastic. It felt very Italian, great vibe from the service, live music and fantastic food. I definitely will be back next time I am in Vegas!

2:	Had lunch here for the first time. The food and service was fantastic! ! Highly recommend! !!!

3:	This place is fantastic! A little pricy for a sandwich but worth it. Every bite is delicious. I highly recommend this place if you are hungry and in need of a quick bite.

4:	This place is great. Food was delicious and the customer service was fantastic! Definitely coming back here.



#### Q: Does the result make sense to you?

A: Yes. The randomly picked review from the test sample is positive, and the word "fantastic" appears several times in both the test review and the retrieved training reviews.

## Classifying positive and negative review

### Naive-Bayes Classifier

In [22]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(vec_train, y_train)

# Get score for training set and test set
nb_score_train = nb.score(vec_train, y_train)
nb_score_test = nb.score(vec_test, y_test)

print('train score: ', nb_score_train)
print('test score: ', nb_score_test)

train score:  0.8153603939335572
test score:  0.8109097760889363


### Logistic Regression Classifier

In [23]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(vec_train, y_train)

# Get score for training set and test set
lr_score_train = lr.score(vec_train, y_train)
lr_score_test = lr.score(vec_test, y_test)

print('train score: ', lr_score_train)
print('test score: ', lr_score_test)

print(lr.coef_.shape)
print(lr.coef_)



train score:  0.8460820192444582
test score:  0.8366548377979461
(1, 5000)
[[-0.84688229  0.33964894 -0.07805271 ... -2.63234671  0.45618508
  -0.31004826]]


#### Q: What are the key features (words) that make the positive prediction?

In [24]:
# Let's find it out by ranking
n = 20
features_with_positive_predictions = get_top_values(lr.coef_[0], n, vocab)
print(features_with_positive_predictions)

['amazing', 'best', 'incredible', 'thank', 'awesome', 'delicious', 'perfection', 'highly', 'phenomenal', 'perfect', 'heaven', 'fantastic', 'great', 'excellent', 'favorite', 'love', 'outstanding', 'impeccable', 'omg', 'exceeded']


#### Q: What are the key features(words) that make the negative prediction?

In [25]:
# Let's find it out by ranking
n = 20
features_with_negative_predictions = get_bottom_values(lr.coef_[0], n, vocab)
print(features_with_negative_predictions)

['worst', 'horrible', 'ok', 'disappointing', 'terrible', 'mediocre', 'rude', 'bland', 'okay', 'slow', 'average', 'poor', 'lacking', 'meh', 'awful', 'disgusting', 'reason', 'worse', 'lacked', 'overpriced']
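
These rankings work because a logistic regression turns a weighted sum of the TF-IDF features into a probability: words with large positive coefficients push the prediction toward five stars, and words with large negative coefficients push it away. A minimal sketch of that mechanism, using hypothetical coefficients rather than the fitted values in `lr.coef_`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coef, intercept=0.0):
    """P(five stars) for one review given as {word: tfidf_weight}."""
    z = intercept + sum(coef.get(word, 0.0) * w for word, w in features.items())
    return sigmoid(z)

# Hypothetical coefficients for illustration only
coef = {"amazing": 5.0, "worst": -6.0, "food": 0.1}

p_positive = predict_proba({"amazing": 0.7, "food": 0.3}, coef)
p_negative = predict_proba({"worst": 0.7, "food": 0.3}, coef)
print(p_positive, p_negative)
```

Note that words such as "ok", "okay", and "average" rank as negative here because the target is five stars versus everything else, so lukewarm language counts against the positive class.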


### Random Forest Classifier

In [26]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth = None, n_estimators = 5, min_samples_leaf = 10)
rf.fit(vec_train, y_train)

# Get score for training set and test set
rf_score_train = rf.score(vec_train, y_train)
rf_score_test = rf.score(vec_test, y_test)

print('train score: ', rf_score_train)
print('test score: ', rf_score_test)

print(rf.feature_importances_.shape)
print(rf.feature_importances_)

train score:  0.8255897697867031
test score:  0.787199698520868
(5000,)
[6.87191564e-04 0.00000000e+00 0.00000000e+00 ... 7.00319274e-05
 0.00000000e+00 0.00000000e+00]


#### Q: What do you see from the training score and the test score?

A: The training score is higher than the test score, which suggests the model is overfitting the training data.
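
To put that gap in context, the sketch below compares the train/test difference for the three classifiers, using the scores printed earlier in this notebook:

```python
# (train score, test score) as printed by each model's cell
scores = {
    "Naive Bayes": (0.8153603939335572, 0.8109097760889363),
    "Logistic Regression": (0.8460820192444582, 0.8366548377979461),
    "Random Forest": (0.8255897697867031, 0.787199698520868),
}

gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")
```

The random forest shows by far the largest gap, and it also has the lowest test score of the three.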

#### Q: Can you tell what features (words) are important by inspecting the RFC model?

In [27]:
n = 20
top_features_importances = get_top_values(rf.feature_importances_, n, vocab)
print(top_features_importances)

['amazing', 'best', 'minutes', 'delicious', 'love', 'worst', 'wasn', 'great', 'definitely', 'ok', 'friendly', 'horrible', 'place', 'didn', 'awesome', 'bad', 'told', 'order', 'good', 'favorite']


Compare the top 10 words in the Logistic Regression model (ranked by most positive coefficient) and the Random Forest model (ranked by feature importance). An `x` marks a word outside that model's top 10.

|words     |Logistic Regression|Random Forest|
|:--------:|:-----------------:|:-----------:|
|amazing   |1                  |1            |
|best      |2                  |2            |
|incredible|3                  |x            |
|thank     |4                  |x            |
|awesome   |5                  |x            |
|delicious |6                  |4            |
|perfection|7                  |x            |
|highly    |8                  |x            |
|phenomenal|9                  |x            |
|perfect   |10                 |x            |
|minutes   |x                  |3            |
|love      |x                  |5            |
|worst     |x                  |6            |
|wasn      |x                  |7            |
|great     |x                  |8            |
|definitely|x                  |9            |
|ok        |x                  |10           |


## Extra Credit

### 1. Use cross validation to evaluate classifiers

[sklearn cross validation](https://scikit-learn.org/stable/modules/cross_validation.html)

In [28]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, vec_train, y_train, cv=5, scoring='accuracy')
print(scores)



[0.83423356 0.8369902  0.83774391 0.83742986 0.83822544]
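
The five fold scores are usually summarized as a mean and a spread. A small sketch using the values printed above (the population standard deviation is used here, which is an assumption about how one wants to report the spread):

```python
import statistics

fold_scores = [0.83423356, 0.8369902, 0.83774391, 0.83742986, 0.83822544]

mean_score = statistics.mean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population standard deviation

print(f"{mean_score:.4f} +/- {std_score:.4f}")
```

The cross-validated mean is close to the held-out test score of the logistic regression reported earlier, which suggests the single train/test split was not unusually lucky or unlucky.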


### 2. Use grid search to find the best-performing classifier

[sklearn grid search tutorial (with cross validation)](https://scikit-learn.org/stable/modules/grid_search.html#grid-search)

[sklearn grid search documentation (with cross validation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

[Model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

[classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
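
A grid search simply tries every combination of the listed hyper-parameter values and cross-validates each one. The sketch below enumerates the combinations for the grid used in the next cell with `itertools.product`; it only illustrates the bookkeeping and does not fit any model.

```python
from itertools import product

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]}
n_folds = 5

# Iterate over parameter names in sorted order and take the Cartesian product
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[k] for k in names))]

for combo in combinations:
    print(combo)
print(len(combinations), "settings x", n_folds, "folds =",
      len(combinations) * n_folds, "fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best setting once on the whole training set, which is the model used by `clf.predict`.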

In [29]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

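# Note: the 'l1' penalty needs a solver that supports it (e.g. 'liblinear' or 'saga')
params = {'penalty':['l1', 'l2'], 'C':[0.1, 1, 10, 100]}
scorings = ['accuracy']

for score in scorings:
    print("Tuning hyper-parameters for {}".format(score))
    # return_train_score=True is needed for the 'mean_train_score' entries read below
    clf = GridSearchCV(lr, params, cv=5, scoring=score, return_train_score=True)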
    clf.fit(vec_train, y_train)
    
    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print('\n')
    
    cv_result = clf.cv_results_
    means_train = cv_result['mean_train_score']
    stds_train = cv_result['std_train_score']
    means_test = cv_result['mean_test_score']
    stds_test = cv_result['std_test_score']
    
    # Use a distinct loop variable so the `params` grid defined above is not overwritten
    for param_set, mean_train, std_train, mean_test, std_test in zip(cv_result['params'], means_train, stds_train, means_test, stds_test):
        print('For params:', param_set)
        print('Train: {:.3f} +/- {:.3f}'.format(mean_train, std_train))
        print('Test: {:.3f} +/- {:.3f}'.format(mean_test, std_test))
    

    print('Classification report:')
    print('The model is trained on the full development set.\n\
           The scores are computed on the full evaluation set.')
    y_pred = clf.predict(vec_test)
    print(classification_report(y_test, y_pred))

Tuning hyper-parameters for accuracy




Best parameters set found on development set:

{'C': 1, 'penalty': 'l1'}


For params: {'C': 0.1, 'penalty': 'l1'}
Train: 0.829 +/- 0.000
Test: 0.827 +/- 0.001
For params: {'C': 0.1, 'penalty': 'l2'}
Train: 0.837 +/- 0.000
Test: 0.833 +/- 0.001
For params: {'C': 1, 'penalty': 'l1'}
Train: 0.847 +/- 0.000
Test: 0.838 +/- 0.001
For params: {'C': 1, 'penalty': 'l2'}
Train: 0.847 +/- 0.000
Test: 0.837 +/- 0.001
For params: {'C': 10, 'penalty': 'l1'}
Train: 0.849 +/- 0.000
Test: 0.834 +/- 0.001
For params: {'C': 10, 'penalty': 'l2'}
Train: 0.849 +/- 0.000
Test: 0.835 +/- 0.001
For params: {'C': 100, 'penalty': 'l1'}
Train: 0.849 +/- 0.000
Test: 0.834 +/- 0.001
For params: {'C': 100, 'penalty': 'l2'}
Train: 0.849 +/- 0.000
Test: 0.834 +/- 0.001
Classification report:
The model is trained on the full development set.
           The scores are computed on the full evaluation set.
              precision    recall  f1-score   support

       False       0.85      0.82      0.84     80785
      

