# Yelp Dataset Challenge

![Yelp Data Challenge](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/6d323fc75cb1/assets/img/dataset/960x225_dataset@2x.png)

## Load Data

Use the processed data `last_2_years_restaurant_reviews.csv`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [3]:
# Inspect
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398037 entries, 0 to 398036
Data columns (total 12 columns):
business_id    398037 non-null object
name           398037 non-null object
categories     398037 non-null object
avg_stars      398037 non-null float64
cool           398037 non-null int64
date           398037 non-null object
funny          398037 non-null int64
review_id      398037 non-null object
stars          398037 non-null int64
text           398037 non-null object
useful         398037 non-null int64
user_id        398037 non-null object
dtypes: float64(1), int64(4), object(7)
memory usage: 36.4+ MB


In [4]:
df.head()

| | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kgffcoxT6BQp-gJ-UQ7Czw | Subway | Fast Food, Restaurants, Sandwiches | 2.5 | 0 | 2016-07-03 | 0 | c6iTbCMMYWnOd79ZiWwobg | 1 | I ordered a few 12 inch sandwiches , a turkey ... | 1 | ih7Dmu7wZpKVwlBRbakJOQ |
| 1 | kgffcoxT6BQp-gJ-UQ7Czw | Subway | Fast Food, Restaurants, Sandwiches | 2.5 | 0 | 2018-03-10 | 0 | 5iDdZvpK4jOv2w5kZ15TUA | 1 | Worst subway of any I have visited. I have man... | 1 | m3WBc9bGxn1q1ikAFq8PaA |
| 2 | kgffcoxT6BQp-gJ-UQ7Czw | Subway | Fast Food, Restaurants, Sandwiches | 2.5 | 0 | 2016-12-26 | 0 | oCUrLS4T-paZBr6WnrXg_A | 2 | Good luck trying to get the order right. The c... | 0 | H7bJDtGzhdg1fsmBL4KZWg |
| 3 | kgffcoxT6BQp-gJ-UQ7Czw | Subway | Fast Food, Restaurants, Sandwiches | 2.5 | 0 | 2016-12-16 | 0 | qXHvWYgL-8yfcGvP_ydKGA | 2 | Here to get my pick up order at the moment it ... | 0 | 58sXi_0oTgVlM3aUuFYHUA |
| 4 | 0jtRI7hVMpQHpUVtUy4ITw | Omelet House Summerlin | Beer, Wine & Spirits, Italian, Food, American ... | 4.0 | 1 | 2016-12-29 | 0 | j9l7IMJX9bvWjkJ18EWGpg | 5 | My husband & I were visiting the area, found t... | 0 | ZS7V0uC4kVrJR_4Yi3oTHA |


## 1. Cluster the review text data for all the restaurants

### Define feature variables
* Use `text` of review as predictor and `avg_stars` as target

In [5]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text']
# Alternative binary target (not used here):
# df['favorable'] = df['stars'] > 4
# target = df['favorable']
# Take the restaurant's average rating and save it to a variable named "stars"
stars = df['avg_stars']

In [6]:
documents.shape

(398037,)

In [7]:
stars.shape

(398037,)

In [8]:
documents.head(3)

0    I ordered a few 12 inch sandwiches , a turkey ...
1    Worst subway of any I have visited. I have man...
2    Good luck trying to get the order right. The c...
Name: text, dtype: object

In [9]:
stars.mean()

3.8557659212585764

### Create training and test dataset
Use a larger test size (50%) so the training set stays small enough to avoid a crash during training.

In [10]:
from sklearn.model_selection import train_test_split
X = documents.values
y = stars.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

### Get NLP representation of the documents

#### Fit TfidfVectorizer with training data only, then transform all the data to tf-idf

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# Create TfidfVectorizer, and name it vectorizer, choose a reasonable max_features, e.g. 1000
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

In [13]:
# Train the model with training data
vec_trained = vectorizer.fit_transform(X_train)  # vec_trained holds the tf-idf matrix
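
To illustrate why the vectorizer is fitted on the training split only, here is a minimal sketch on a made-up corpus (the texts and variable names are illustrative and not taken from the dataset). The vocabulary and idf weights are learned from the training texts, so a word that appears only in unseen text is ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, for illustration only)
train_texts = ["great pizza great crust", "slow service cold pizza"]
new_texts = ["great sushi"]  # "sushi" never appears in the training texts

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_texts)  # learns vocabulary and idf weights
toy_new = toy_vectorizer.transform(new_texts)          # reuses them without refitting

print(sorted(toy_vectorizer.vocabulary_))  # words learned from the training texts
print(toy_train.shape, toy_new.shape)      # same number of columns in both matrices
print(toy_new.nnz)                         # only "great" is known to the vocabulary
```

Both matrices share the same columns, which is what allows a model fitted on the training vectors to be applied to all documents later.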

In [14]:
print(type(vec_trained)) # it is a sparse matrix in compressed sparse row (csr) format
# print(vec_trained)
# print(vec_trained.toarray())
# print(vec_trained.todense())

<class 'scipy.sparse.csr.csr_matrix'>


In [15]:
vec_arr = vec_trained.toarray() # This is a dense numpy.ndarray
vec_den = vec_trained.todense() # This is a dense numpy.matrix
print(type(vec_arr), type(vec_den))
print(vec_arr)
print(vec_den)

<class 'numpy.ndarray'> <class 'numpy.matrixlib.defmatrix.matrix'>
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [16]:
vec_trained = vec_trained.toarray() # use the dense ndarray form from here on; note that KMeans also accepts the sparse matrix directly

In [17]:
vec_trained.shape # 199018 x 1000: 398037 records * 0.5 for training = 199018.5, which gives 199018 training rows; 1000 words

(199018, 1000)
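
A rough estimate shows why the dense conversion is memory hungry. The sketch assumes `float64` entries (8 bytes each), which matches the zeros printed as `0.` above.

```python
# Back-of-the-envelope size of the dense training matrix
rows, cols = 199018, 1000      # shape reported above
bytes_per_value = 8            # float64 (assumed)
dense_gib = rows * cols * bytes_per_value / 1024**3
print(round(dense_gib, 2))     # size in GiB
```

That is roughly 1.5 GiB for the training half alone, and about twice as much for all 398037 reviews, which is presumably why a 50% split was chosen.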

In [18]:
# Get the vocab of your tfidf
vocab = vectorizer.get_feature_names() # the features are the words (newer scikit-learn versions provide get_feature_names_out() instead)

In [19]:
# vocab

In [20]:
print(type(vocab))
print(len(vocab)) # 1000 words

<class 'list'>
1000


In [21]:
# Use the trained model to transform all the reviews
vec_documents = vectorizer.transform(documents) # compute the tf-idf matrix for all documents and store it in vec_documents

In [22]:
print(type(vec_documents)) # sparse matrix in csr format
vec_doc_arr = vec_documents.toarray() # use ndarray
print(vec_doc_arr)
print(vec_doc_arr.shape)

<class 'scipy.sparse.csr.csr_matrix'>
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(398037, 1000)


### Cluster reviews with KMeans

#### Fit k-means clustering with the training vectors and apply it on all the data

In [23]:
from sklearn.cluster import KMeans

In [24]:
kmeans = KMeans().fit(vec_trained)

In [25]:
print(kmeans.labels_.size)
kmeans.labels_ # which cluster each row belongs to

199018


array([7, 0, 3, ..., 2, 2, 7], dtype=int32)

In [26]:
np.unique(kmeans.labels_) # there are 8 clusters

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int32)
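
`KMeans()` was called without arguments above, so the eight clusters come from the scikit-learn default `n_clusters=8`. The following sketch, on made-up 2-D points, shows the same fit-then-predict pattern with an explicit `n_clusters` and a fixed `random_state` for repeatability; all data and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups (hypothetical data)
toy_train = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
toy_all = np.vstack([toy_train, [[0.2, 0.2], [4.9, 4.9]]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_train)
toy_labels = toy_kmeans.predict(toy_all)  # assign every point to its nearest centroid

print(toy_kmeans.cluster_centers_.shape)  # (n_clusters, n_features)
print(toy_labels)
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful.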

In [27]:
kmeans.cluster_centers_ # shape is 8 x 1000

array([[0.00428003, 0.01990024, 0.00285123, ..., 0.00386236, 0.00072805,
        0.00159136],
       [0.00260117, 0.00825471, 0.00153353, ..., 0.0012559 , 0.00489136,
        0.00830148],
       [0.00090411, 0.00382081, 0.00126668, ..., 0.00120306, 0.00228962,
        0.00621649],
       ...,
       [0.0030302 , 0.00767996, 0.00152819, ..., 0.00159134, 0.00203535,
        0.00573765],
       [0.00074661, 0.00597622, 0.00201548, ..., 0.00100808, 0.00271493,
        0.00621402],
       [0.00230933, 0.0060532 , 0.00219558, ..., 0.00118722, 0.00356352,
        0.00710311]])

In [28]:
print(kmeans.labels_.shape)
print(kmeans.cluster_centers_.shape) # there are 8 clusters and 1000 words

(199018,)
(8, 1000)


#### Make predictions on all your data

In [29]:
doc_cluster = kmeans.predict(vec_doc_arr) # assign every document to its nearest cluster center

#### Inspect the centroids

`kmeans.cluster_centers_` is an ndarray holding the tf-idf values of each cluster center; each column corresponds to one word (feature).

In [30]:
kmeans.cluster_centers_

array([[0.00428003, 0.01990024, 0.00285123, ..., 0.00386236, 0.00072805,
        0.00159136],
       [0.00260117, 0.00825471, 0.00153353, ..., 0.0012559 , 0.00489136,
        0.00830148],
       [0.00090411, 0.00382081, 0.00126668, ..., 0.00120306, 0.00228962,
        0.00621649],
       ...,
       [0.0030302 , 0.00767996, 0.00152819, ..., 0.00159134, 0.00203535,
        0.00573765],
       [0.00074661, 0.00597622, 0.00201548, ..., 0.00100808, 0.00271493,
        0.00621402],
       [0.00230933, 0.0060532 , 0.00219558, ..., 0.00118722, 0.00356352,
        0.00710311]])

In [31]:
kmeans.cluster_centers_.shape

(8, 1000)

#### Find the top 10 features for each cluster.

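For each row of the cluster centers, select the ten columns with the largest tf-idf values, then look up the word that each of those columns corresponds to.

By default `np.argsort()` sorts along each row in ascending order from left to right, but it returns the indices of the elements in the original ndarray rather than the sorted values.

For example, `array([3, 1, 2])` sorted in ascending order is `array([1, 2, 3])`, and the corresponding indices into the original ndarray are `[1, 2, 0]`.

See https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.argsort.html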
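
The `[3, 1, 2]` example can be checked directly. The second array is a made-up 2-D case showing that, by default, each row is sorted independently.

```python
import numpy as np

a = np.array([3, 1, 2])
order = np.argsort(a)   # indices that would sort `a` in ascending order
print(order)            # [1 2 0]
print(a[order])         # [1 2 3]

# For a 2-D array the default axis=-1 sorts each row independently
m = np.array([[0.2, 0.9, 0.1],
              [0.7, 0.3, 0.5]])
print(np.argsort(m))    # [[2 0 1]
                        #  [1 2 0]]
```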


In [32]:
print(type(kmeans.cluster_centers_))

<class 'numpy.ndarray'>


In [33]:
sorted_index_centroids = np.argsort(kmeans.cluster_centers_)
print(sorted_index_centroids) # the last ten numbers in each row are the indices of the columns with the largest tf-idf values

[[ 22 921 199 ... 887 597 327]
 [836  54 366 ... 337 365 150]
 [243 149 984 ... 774 327 373]
 ...
 [541 921  90 ... 365 203 639]
 [243 853 984 ... 641 327  27]
 [642 783 538 ... 641 327 365]]


In [34]:
top_ten = sorted_index_centroids[:, -10:] # take the last ten columns
print(top_ten) # from left to right: 10th place up to 1st place

[[367  40 230 124 540 450 774 887 597 327]
 [478 373 746 641 598 720 327 337 365 150]
 [222  43  52 824 339 365 641 774 327 373]
 [321 335 774 365 373  54 726 727 641 857]
 [598 373 777 327 641 145 365 111 341 110]
 [598  79 597 794 145 373 641 365 203 639]
 [223 503 934 222  79 373 774 641 327  27]
 [450 694 223  79 478 774 934 641 327 365]]


In [35]:
top_ten = top_ten[:, ::-1] # reverse the ranking order: from left to right, 1st place down to 10th place
print(top_ten)

[[327 597 887 774 450 540 124 230  40 367]
 [150 365 337 327 720 598 641 746 373 478]
 [373 327 774 641 365 339 824  52  43 222]
 [857 641 727 726  54 373 365 774 335 321]
 [110 341 111 365 145 641 327 777 373 598]
 [639 203 365 641 373 145 794 597  79 598]
 [ 27 327 641 774 373  79 222 934 503 223]
 [365 327 641 934 774 478  79 223 694 450]]


In [36]:
print(sorted_index_centroids[:, -1:-10:-1]) # one-line version; note that this slice returns only 9 columns, [:, :-11:-1] would return all ten

[[327 597 887 774 450 540 124 230  40]
 [150 365 337 327 720 598 641 746 373]
 [373 327 774 641 365 339 824  52  43]
 [857 641 727 726  54 373 365 774 335]
 [110 341 111 365 145 641 327 777 373]
 [639 203 365 641 373 145 794 597  79]
 [ 27 327 641 774 373  79 222 934 503]
 [365 327 641 934 774 478  79 223 694]]
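
The one-line slice returns nine columns rather than ten because the stop index `-10` is exclusive. A small sketch on a made-up array shows the difference, together with a slice that does return the last ten entries in reverse order.

```python
import numpy as np

row = np.arange(20).reshape(1, 20)   # toy stand-in for one row of sorted indices

nine = row[:, -1:-10:-1]   # stops before index -10, so 9 columns
ten = row[:, :-11:-1]      # stops before index -11, so 10 columns

print(nine.shape, ten.shape)
print(ten)                 # same as row[:, -10:][:, ::-1]
```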


In [37]:
# Then look up in vocab the word that corresponds to each column index
# print(vocab[639])
for i, row in enumerate(top_ten):
#     print(i, row)
    print('%d: %s' % (i, ', '.join([vocab[j] for j in row])))

0: food, order, time, service, just, minutes, came, didn, asked, got
1: chicken, good, fried, food, rice, ordered, place, sauce, great, like
2: great, food, service, place, good, friendly, staff, awesome, atmosphere, definitely
3: sushi, place, rolls, roll, ayce, great, good, service, fresh, fish
4: burger, fries, burgers, good, cheese, place, food, shake, great, ordered
5: pizza, crust, good, place, great, cheese, slice, order, best, ordered
6: amazing, food, place, service, great, best, definitely, vegas, love, delicious
7: good, food, place, vegas, service, like, best, delicious, really, just


#### Try different k

Wrap the steps above into a function.

In [38]:
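def top_ten_words(training_data, n):
    """Fit KMeans with n clusters and print the top features of each centroid (uses the global vocab)."""
    kmeans = KMeans(n_clusters=n, random_state=0)
    kmeans.fit(training_data)
    # NOTE: the slice -1:-10:-1 keeps only 9 indices per row (hence 9 words per cluster in the output);
    # use [:, :-11:-1] to get all ten.
    index_array = kmeans.cluster_centers_.argsort()[:, -1:-10:-1]
    print('Top 10 features for each cluster:')
    for i, row in enumerate(index_array):
        # j instead of i, so the inner variable does not shadow the cluster index
        print('{}: {}'.format(i, ', '.join([vocab[j] for j in row])))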

In [39]:
# 6 clusters
top_ten_words(vec_trained, 6)

Top 10 features for each cluster:
0: good, chicken, food, really, ordered, like, place, just, burger
1: food, place, best, vegas, amazing, service, love, delicious, good
2: sushi, place, rolls, roll, ayce, great, good, service, fresh
3: food, order, time, just, service, minutes, came, like, didn
4: pizza, crust, good, place, great, cheese, slice, order, best
5: great, food, service, place, good, amazing, friendly, staff, atmosphere


In [40]:
# 4 clusters
top_ten_words(vec_trained, 4)

Top 10 features for each cluster:
0: pizza, good, crust, place, great, cheese, slice, order, best
1: great, food, service, place, amazing, good, friendly, staff, definitely
2: good, place, food, chicken, best, vegas, delicious, like, really
3: food, order, time, just, service, minutes, like, came, didn


#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [41]:
print(doc_cluster.shape)
print(doc_cluster) # which cluster each row belongs to
print(np.unique(doc_cluster)) # 8 clusters in total, numbered 0 to 7

(398037,)
[7 7 0 ... 7 7 0]
[0 1 2 3 4 5 6 7]


In [42]:
print(vec_doc_arr.shape)
cluster = np.arange(0, vec_doc_arr.shape[0])
print(cluster)
print([doc_cluster==1])
print(cluster[doc_cluster==1])

(398037, 1000)
[     0      1      2 ... 398034 398035 398036]
[array([False, False, False, ..., False, False, False])]
[    24     76     92 ... 397979 397994 398026]
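
The boolean mask `doc_cluster == 1` selects the positions of all documents assigned to cluster 1. The sketch below repeats the idea on a short made-up label array and shows `np.flatnonzero` as an equivalent shortcut that needs no helper array.

```python
import numpy as np

toy_labels = np.array([7, 1, 0, 1, 7, 1])    # hypothetical cluster labels
positions = np.arange(toy_labels.shape[0])   # 0 .. n-1

mask = toy_labels == 1                        # True where the label is 1
print(positions[mask])                        # [1 3 5]
print(np.flatnonzero(mask))                   # [1 3 5]
```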


In [43]:
for i in range(kmeans.n_clusters):
    records = np.arange(0, vec_doc_arr.shape[0]) # positions of all rows
    records = records[doc_cluster == i] # apply a mask to keep only the rows that belong to cluster i
    random_reviews = np.random.choice(records, 2, replace=False) # randomly pick 2 indices
#     print(random_reviews)
    print('Cluster {}:'.format(i))
    for review in random_reviews:
        star = df.iloc[review].loc['stars']
        text = df.iloc[review].loc['text']
        print('Stars={}: {}\n'.format(star, text))
    print('='* 20)

Cluster 0:
Stars=1: Employees behind the counter need more training and are not helpful. I would expect more from a place in Aria. Their attitude was apathetic to customers and you could easily tell they all didn't want to be there. Food was mediocre and all three types of salad dressing were terrible. Disappointed.

Stars=4: We visited today for the first time, and we're seated within a short wait time. The waiter was friendly and immediately took drink orders. I ordered the chili relleno with shrimp and cream sauce. It was divine! Everyone else ordered tacos and no one was disappointed. The puffy tacos with carnitas got very high praise. Looking forward to a return visit.

Cluster 1:
Stars=5: Two words: BOMB DIGGITY. 
Me and my girlfriend needed some desperate food after drinking all night and we tried the original fried chicken and waffle and it sure did the job!!! Nothing like the egg running through your fingers and crunching on that crispy chicken wrapped with bacon and cheese...

## 2. Cluster all the reviews of the most reviewed restaurant

Each review is one record in the dataframe and carries the `business_id` of the restaurant it refers to. The `business_id` with the highest count therefore identifies the most reviewed restaurant.

In [44]:
df['business_id'].value_counts() # sorted in descending order

RESDUcs7fIiihp38-d6_6g    2895
4JNXUYY8wbaaDmk3BPzlWw    1994
faPVqws-x-5k2CQKDNtHxw    1961
f4x1YBxkLrZg652xt2KR5g    1960
QXV3L_QFGj8r6nWX2kS2hA    1633
K7lWdNUhCbcnEvI0NhGewg    1628
77h11eWv6HKJAgojLx8G4w    1500
RwMLuOkImBIqqYj4SSKSPg    1443
IWN2heYitkg-D4UdqfxcMA    1409
xfWdUmrz2ha3rcigyITV0g    1396
hihud--QRriCYZw1zZvW4g    1344
HhVmDybpU7L50Kb5A0jXTg    1267
mU3vlAVzTxgmZUu6F4XixA    1239
ysv6yhVYOoH9Pf7PlMyD0g    1224
mDR12Hafvr84ctpsV6YLag    1204
iCQpiavjjPzJ5_3gPD5Ebg    1164
vHz2RLtfUMVRPFmd7VBEHA    1138
OETh78qcgDltvHULowwhJg    1089
3kdSl5mo9dWC4clrQjEDGg    1075
YJ8ljUhLsz6CtT_2ORNFmg    1065
0d0i0FaJq1GIeW1rS2D-5w    1064
El4FC8jcawUVgw_0EIcbaQ    1053
QJatAcxYgK1Zp9BRZMAx7g    1046
cYwJA2A6I12KNkm2rtXd5g    1039
XXW_OFaYQkkGOGniujZFHg    1037
igHYkXZMLAc9UdV5VnR_AA    1015
q3oJ6bNRV3OoJrwc95GOwg     990
3BCsAgo_1i4xMuTyLKMLRQ     951
KskYqH1Bi7Z_61pH6Om8pg     940
CVKOPzBVOj3_apFUmZ9ZWw     937
                          ... 
z3abN49dQaUftXb60y9LtA       1
RO1h2JPb

In [45]:
df['business_id'].value_counts().index[0]

'RESDUcs7fIiihp38-d6_6g'
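
`value_counts()` sorts by count in descending order, so the first index entry is the most frequent id. A minimal sketch with made-up ids follows; `idxmax()` is shown as an alternative that does not rely on the sort order.

```python
import pandas as pd

toy = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'], name='business_id')  # hypothetical ids
counts = toy.value_counts()        # descending by default

print(counts.index[0])             # most frequent id
print(counts.idxmax())             # same id, independent of ordering
print(counts.iloc[0])              # its number of reviews
```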

In [46]:
most_reviewed_restaurand_id = df['business_id'].value_counts().index[0]
print(most_reviewed_restaurand_id) # double-check

most_reviewed = (df['business_id'] == most_reviewed_restaurand_id)

RESDUcs7fIiihp38-d6_6g


In [47]:
# Find the business who got most reviews, get your filtered df, name it df_top_restaurant
df_top_restaurant = df[most_reviewed].copy() # duplicate dataframe so we don't change the original dataframe

In [48]:
df_top_restaurant.head()

| | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382207 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-09-09 | 0 | mQfl6ci46mu0xaZrkRUhlA | 5 | This buffet is amazing. Yes, it is expensive,... | 0 | f638AHA_GoHbyDB7VFMz7A |
| 382208 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-02-08 | 0 | lMarDJDg4-e_0YoJOKJoWA | 2 | This place....lol our server was nice. But fo... | 0 | A21zMqdN76ueLZFpmbue0Q |
| 382209 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-12-22 | 0 | 30xmXTzJwHPcqt0uvSLQhQ | 3 | One star knocked off for the cold air conditio... | 0 | uNHEnP28MMmVy96ZSJKaMA |
| 382210 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-09-22 | 0 | SOUuNn4f1fHKxFHntYzonw | 3 | Was torn between 2 and 3. Caught the last of ... | 0 | WvVqnHU_eVBUfL-CI9efdw |
| 382211 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2016-12-14 | 0 | 1mAf8vTO6TGTrQ3WSfTB3g | 4 | This place was one of those once in a lifetime... | 0 | aYLS5lhdCp5HSPOtkMvapw |


In [49]:
df_top_restaurant = df_top_restaurant.reset_index() # keep the original index as an "index" column
df_top_restaurant.head()

| | index | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 382207 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-09-09 | 0 | mQfl6ci46mu0xaZrkRUhlA | 5 | This buffet is amazing. Yes, it is expensive,... | 0 | f638AHA_GoHbyDB7VFMz7A |
| 1 | 382208 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-02-08 | 0 | lMarDJDg4-e_0YoJOKJoWA | 2 | This place....lol our server was nice. But fo... | 0 | A21zMqdN76ueLZFpmbue0Q |
| 2 | 382209 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-12-22 | 0 | 30xmXTzJwHPcqt0uvSLQhQ | 3 | One star knocked off for the cold air conditio... | 0 | uNHEnP28MMmVy96ZSJKaMA |
| 3 | 382210 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2017-09-22 | 0 | SOUuNn4f1fHKxFHntYzonw | 3 | Was torn between 2 and 3. Caught the last of ... | 0 | WvVqnHU_eVBUfL-CI9efdw |
| 4 | 382211 | RESDUcs7fIiihp38-d6_6g | Bacchanal Buffet | Sandwiches, Buffets, Breakfast & Brunch, Food,... | 4.0 | 0 | 2016-12-14 | 0 | 1mAf8vTO6TGTrQ3WSfTB3g | 4 | This place was one of those once in a lifetime... | 0 | aYLS5lhdCp5HSPOtkMvapw |


In [50]:
# Load business dataset (optional)
import json

# Loading a single file works, wrap in function
def read_json_file(input_file):
    with open(input_file) as fin:
        df = pd.DataFrame(json.loads(line) for line in fin)
    return df
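
The function parses the file line by line, which implies one JSON object per line. As a sketch of an alternative, `pd.read_json` with `lines=True` reads the same layout; the two records below are made up, and an in-memory buffer stands in for the file.

```python
import io
import json
import pandas as pd

# Two hypothetical records in JSON Lines form
raw = '{"business_id": "x1", "stars": 4.0}\n{"business_id": "x2", "stars": 3.5}\n'

# Line-by-line parsing, as in read_json_file above
df_a = pd.DataFrame(json.loads(line) for line in io.StringIO(raw))

# Built-in JSON Lines reader
df_b = pd.read_json(io.StringIO(raw), lines=True)

print(df_a.shape, df_b.shape)
print(list(df_b.columns))
```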

In [51]:
df_business = read_json_file('yelp_academic_dataset_business.json')
# df_review = read_json_file('yelp_academic_dataset_business.json')

In [52]:
# Take a look at the most reviewed restaurant's profile (optional)
df_business[df_business['business_id'] ==  most_reviewed_restaurand_id]

| | address | attributes | business_id | categories | city | hours | is_open | latitude | longitude | name | neighborhood | postal_code | review_count | stars | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 185167 | 3570 S Las Vegas Blvd | {'Alcohol': 'full_bar', 'Ambience': '{'romanti... | RESDUcs7fIiihp38-d6_6g | Sandwiches, Buffets, Breakfast & Brunch, Food,... | Las Vegas | {'Monday': '7:30-22:0', 'Tuesday': '7:30-22:0'... | 1 | 36.116113 | -115.176222 | Bacchanal Buffet | The Strip | 89109 | 7866 | 4.0 | NV |


In [53]:
top_business = df_business[df_business['business_id'] == most_reviewed_restaurand_id]  # filter once, reuse below
address = top_business['address']
attributes = top_business['attributes']
category = top_business['categories']
hours = top_business['hours']

In [54]:
print(address.values)
print(attributes.values)
print(category.values)
print(hours.values)

['3570 S Las Vegas Blvd']
[{'Alcohol': 'full_bar', 'Ambience': "{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", 'BikeParking': 'False', 'BusinessAcceptsCreditCards': 'True', 'BusinessParking': "{'garage': True, 'street': False, 'validated': False, 'lot': False, 'valet': True}", 'Caters': 'False', 'GoodForKids': 'True', 'GoodForMeal': "{'dessert': True, 'latenight': False, 'lunch': True, 'dinner': True, 'breakfast': False, 'brunch': True}", 'HasTV': 'False', 'NoiseLevel': 'average', 'OutdoorSeating': 'False', 'RestaurantsAttire': 'casual', 'RestaurantsDelivery': 'False', 'RestaurantsGoodForGroups': 'True', 'RestaurantsPriceRange2': '3', 'RestaurantsReservations': 'False', 'RestaurantsTableService': 'True', 'RestaurantsTakeOut': 'False', 'WheelchairAccessible': 'True', 'WiFi': 'no'}]
['Sandwiches, Buffets, Breakfast & Brunch, Food, Restaurants']
[{'Monday': '7:30-22:0', 'Tues

### Vectorize the text feature

In [55]:
# Take the values of the column that contains review text data, save to a variable named "documents_top_restaurant"
documents_top_restaurant = df_top_restaurant['text']

In [56]:
documents_top_restaurant.shape

(2895,)

In [57]:
documents_top_restaurant[:3].values # show first 3 reviews

array(["This buffet is amazing.  Yes, it is expensive, but it is worth the splurge.  I recommend that you look at everything first and then decide what to get, because you can't possibly try everything.  I missed an entire corner of great food that I didn't see at first, and then I was too full to eat more. I like how everything is on little plates, bowls, or baskets, so everything doesn't get mixed together.  Lines are long, but you can check in and then they text when your time is almost up.  The wait time was less than they had said it would be, so don't go far away to wait.",
       'This place....lol our server was nice.  But for 50 something dollars for dinner was not worth it....Sorry but if I could choose another place to spend that much I know I would have been much happier. Not to mention they took a photo of us when we came in then brought us like 3-4 printouts of them, I mean nice ones...then they told us it was 15 dollars for one of the pictures...we honestly thought out o

### Define your target variable (for later classification use)

#### Again, we look at perfect (5 stars) and imperfect (1-4 stars) ratings

In [58]:
df_top_restaurant['perfect'] = (df_top_restaurant['stars'] > 4)
df_top_restaurant['perfect'].head(10) # show first 10 rows

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: perfect, dtype: bool

In [59]:
df_top_restaurant.shape

(2895, 14)

In [60]:
target_top_restaurant = df_top_restaurant['perfect'].values.astype(int)
target_top_restaurant

array([1, 0, 0, ..., 0, 0, 1])

In [61]:
target_top_restaurant.shape

(2895,)

#### Check the statistics of the target variable

In [62]:
target_top_restaurant.mean()

0.3727115716753022
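
Because the target is 0/1, its mean is the share of 5-star reviews. A quick check of the arithmetic, using only the numbers printed above:

```python
n_reviews = 2895
share_perfect = 0.3727115716753022   # mean of the binary target

n_perfect = round(share_perfect * n_reviews)
print(n_perfect)                      # number of 5-star reviews
print(n_reviews - n_perfect)          # number of 1-4 star reviews
```

Roughly 37% of the reviews are positive examples, so a classifier that always predicts "imperfect" would already reach about 63% accuracy. That baseline is worth keeping in mind when judging later results.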

### Create training dataset and test dataset

In [63]:
X = documents_top_restaurant.values
y = target_top_restaurant # already an ndarray
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [64]:
print(documents_top_restaurant.shape)
print(documents_top_restaurant.shape[0] * 0.7)
print(documents_top_restaurant.shape[0] * 0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2895,)
2026.4999999999998
868.5
(2026,) (869,) (2026,) (869,)
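
The split sizes are whole numbers even though 70% and 30% of 2895 are not. The sketch below assumes that `train_test_split` rounds the test size up and gives the remainder to the training set; under that assumption it reproduces the shapes printed in both sections. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Assumed rounding rule: test size rounded up, remainder goes to train."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(2895, 0.3))     # (2026, 869)
print(split_sizes(398037, 0.5))   # (199018, 199019)
```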


### Get NLP representation of the documents

In [65]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

In [66]:
# Train the model with your training data
vec_train = vectorizer.fit_transform(X_train)

In [67]:
# Get the vocab of your tfidf
words = vectorizer.get_feature_names() # newer scikit-learn versions provide get_feature_names_out() instead

In [68]:
# Use the trained model to transform the test data
vec_test = vectorizer.transform(X_test)

In [69]:
# Use the trained model to transform all the data
vec_documents = vectorizer.transform(X)

In [70]:
print(vec_train.shape, vec_test.shape, vec_documents.shape)
print(len(words))

(2026, 1000) (869, 1000) (2895, 1000)
1000


### Cluster reviews with KMeans

#### Fit k-means clustering on the training vectors and make predictions on all data

In [71]:
kmeans = KMeans(n_clusters=5) # use 5 clusters because there are 5 star levels
kmeans.fit(vec_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### Make predictions on all your data

In [72]:
all_doc_clusters = kmeans.predict(vec_documents)

In [73]:
print(all_doc_clusters.shape)
all_doc_clusters

(2895,)


array([3, 0, 0, ..., 4, 0, 2], dtype=int32)

In [74]:
all_doc_clusters[:10] # show the cluster of the first 10 rows

array([3, 0, 0, 0, 4, 3, 1, 4, 3, 3], dtype=int32)

In [75]:
df_top_restaurant.iloc[:10]['stars'] # compare the star ratings with the cluster assignments

0    5
1    2
2    3
3    3
4    4
5    1
6    1
7    5
8    1
9    4
Name: stars, dtype: int64
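
A cross-tabulation makes this comparison easier to read than two separate printouts. The sketch uses only the ten cluster labels and star ratings shown above; on the full data one would pass `all_doc_clusters` and `df_top_restaurant['stars']` instead.

```python
import pandas as pd

# First ten cluster assignments and star ratings, copied from the outputs above
clusters = [3, 0, 0, 0, 4, 3, 1, 4, 3, 3]
stars = [5, 2, 3, 3, 4, 1, 1, 5, 1, 4]

table = pd.crosstab(pd.Series(clusters, name='cluster'),
                    pd.Series(stars, name='stars'))
print(table)   # rows: clusters, columns: star ratings, cells: review counts
```

Ten rows are far too few to conclude anything, but cluster 3 already mixes 1-, 4- and 5-star reviews, a hint that the clusters need not line up with star levels.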

#### Inspect the centroids

In [76]:
print(kmeans.cluster_centers_.shape)
kmeans.cluster_centers_

(5, 1000)


array([[0.00402107, 0.0067377 , 0.00243767, ..., 0.0054031 , 0.00124449,
        0.00255439],
       [0.00131056, 0.00727753, 0.00049908, ..., 0.0078129 , 0.        ,
        0.00099472],
       [0.00101777, 0.00577627, 0.004025  , ..., 0.00723879, 0.00168687,
        0.00272017],
       [0.0044884 , 0.01636911, 0.00403772, ..., 0.00785816, 0.00092318,
        0.00619948],
       [0.00386003, 0.010437  , 0.00241871, ..., 0.00862592, 0.00285825,
        0.00387736]])

#### Find the top 10 features for each cluster.

In [81]:
sorted_index_centroids = np.argsort(kmeans.cluster_centers_)
sorted_index_centroids

array([[666, 113, 612, ..., 109, 361, 325],
       [499, 177, 612, ..., 767, 325, 367],
       [498, 689, 831, ..., 939,  83, 109],
       [738, 666, 659, ..., 888, 949, 471],
       [372, 506, 740, ..., 109, 463, 195]])

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.
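
No code was recorded for this step. One possible sketch, mirroring the sampling loop from section 1, is given below; the helper name `sample_reviews_per_cluster`, the toy dataframe, and the seed are all illustrative assumptions. On the notebook's data it would be called with `df_top_restaurant` and `all_doc_clusters`.

```python
import numpy as np
import pandas as pd

def sample_reviews_per_cluster(df, labels, n=2, seed=0):
    """Return {cluster: [(stars, text), ...]} with up to n random reviews each."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    samples = {}
    for c in np.unique(labels):
        rows = np.flatnonzero(labels == c)   # positions of this cluster's reviews
        picked = rng.choice(rows, size=min(n, rows.size), replace=False)
        samples[int(c)] = [(int(df.iloc[r]['stars']), df.iloc[r]['text']) for r in picked]
    return samples

# Hypothetical toy data
toy_df = pd.DataFrame({'stars': [5, 1, 4, 2],
                       'text': ['loved it', 'too pricey', 'good crab legs', 'long line']})
toy_clusters = [0, 1, 0, 1]

for c, reviews in sample_reviews_per_cluster(toy_df, toy_clusters).items():
    print('Cluster {}:'.format(c))
    for s, t in reviews:
        print('  Stars={}: {}'.format(s, t))
```

Capping the sample size at the cluster size avoids the error that `choice(..., replace=False)` raises when a cluster has fewer members than requested.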

## 3. Use PCA to reduce dimensionality

### Standardize features
Your X_train and X_test

In [None]:
from sklearn.preprocessing import StandardScaler

### Use PCA to transform data (train and test) and get principal components

In [None]:
from sklearn.

### See how much variance (in absolute terms and as a percentage) the principal components explain

### Viz: plot proportion of variance explained with top principal components

For clear display, you may start with plotting <=20 principal components
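
The cells for this section are incomplete, so the following is only a sketch of the usual scikit-learn workflow on random stand-in data: standardize with statistics from the training split, fit `PCA` on the training split, and read `explained_variance_` and `explained_variance_ratio_`. The array shapes and `n_components=20` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_train = rng.random((200, 50))   # stand-in for dense tf-idf training vectors
toy_test = rng.random((80, 50))

scaler = StandardScaler().fit(toy_train)   # mean/std from training data only
train_std = scaler.transform(toy_train)
test_std = scaler.transform(toy_test)

pca = PCA(n_components=20).fit(train_std)
train_pca = pca.transform(train_std)
test_pca = pca.transform(test_std)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(train_pca.shape, test_pca.shape)
print(pca.explained_variance_[:5])         # variance per component
print(pca.explained_variance_ratio_[:5])   # share of total variance per component
print(cumulative[-1])                      # share kept by all 20 components
```

For the plot, `plt.bar(range(1, 21), pca.explained_variance_ratio_)` is one straightforward option. If the tf-idf vectors are kept sparse, `StandardScaler` needs `with_mean=False`, since centering would destroy sparsity.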

## Classifying positive/negative review with PCA preprocessing

### Logistic Regression Classifier
#### Use standardized tf-idf vectors as features
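
No classifier code was recorded here. As a hedged sketch of what this step could look like, the pipeline below chains standardization, PCA, and logistic regression on synthetic data from `make_classification`; the data, component count, and split are placeholders rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (tf-idf vectors, perfect/imperfect labels)
X_toy, y_toy = make_classification(n_samples=400, n_features=50,
                                   n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)             # every step is fitted on the training data only

print(clf.score(X_tr, y_tr))    # training accuracy
print(clf.score(X_te, y_te))    # test accuracy
```

Wrapping the steps in one pipeline ensures the scaler and PCA never see the test data during fitting.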