Goal:

The goal of this project is to create a personalized recommender system using the Yelp dataset. 

In [1]:
import pandas as pd 

# configure file path
user_filename = 'yelp_data/yelp_user.csv'
business_filename = 'yelp_data/yelp_business.csv'
review_filename = 'yelp_data/yelp_review.csv'

business = pd.read_csv(business_filename)
user = pd.read_csv(user_filename)
review = pd.read_csv(review_filename)

In [2]:
business.head()

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,4.0,22,1,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,3.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,1.5,18,1,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,3.5,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...


In [3]:
business.shape

(174567, 13)

In [4]:
user.shape

(1326100, 22)

In [5]:
review.shape

(5261668, 9)

## Data Preprocessing

1. Choose businesses that have the "Restaurants" or "Food" category

In [6]:
# create a mask for restaurants
restaurants = business['categories'].str.contains('Restaurants')

# create a mask for food
food = business['categories'].str.contains('Food')

# combine the masks: keep businesses tagged "Restaurants" or "Food"
restaurants_and_food = business[restaurants | food]

# number of businesses with "Restaurants" or "Food" in their categories
restaurants_and_food['categories'].count()

69070
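
The substring match above also selects any category whose name merely contains "Food" (for example "Fast Food"), and it yields NaN rather than False for a business with no categories. A minimal sketch, assuming exact matching on the `;`-separated category tokens is preferred; the helper `has_category` and the toy rows are hypothetical and not part of the original notebook:

```python
import pandas as pd

def has_category(categories: pd.Series, wanted: set) -> pd.Series:
    """True where the ';'-separated string contains an exact token from `wanted`."""
    # Missing category strings become '' so they simply match nothing.
    tokens = categories.fillna('').str.split(';')
    return tokens.apply(lambda ts: any(t.strip() in wanted for t in ts))

# Toy data (invented values) in the same layout as the `categories` column.
toy = pd.DataFrame({
    'business_id': ['a', 'b', 'c', 'd'],
    'categories': ['Restaurants;Pizza', 'Pet Food;Pet Stores', 'Food;Bakeries', None],
})

mask = has_category(toy['categories'], {'Restaurants', 'Food'})
selected = toy[mask]
print(selected['business_id'].tolist())
```

Whether exact matching is actually desirable depends on which categories should count as food businesses, so the resulting count may differ from the 69070 reported above.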

In [7]:
list_restaurants = list(restaurants_and_food['business_id'])
list_review_restaurants = list(review['business_id'])

In [8]:
review[review['business_id'].isin(list_restaurants)]

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool
0,vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28,Super simple place but amazing nonetheless. It...,0,0,0
1,n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28,Small unassuming place that changes their menu...,0,0,0
2,MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28,Lester's is located in a beautiful neighborhoo...,0,0,0
3,IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28,Love coming here. Yes the place always needs t...,0,0,0
4,L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28,Had their chocolate almond croissant and it wa...,0,0,0
6,ymAUG8DZfQcFTBSOiaNN4w,u0LXt3Uea_GidxRW1xcsfg,9_CGhHMz8698M9-PkVf0CQ,4,2012-05-11,Who would have guess that you would be able to...,0,0,2
7,8UIishPUD92hXtScSga_gw,u0LXt3Uea_GidxRW1xcsfg,gkCorLgPyQLsptTHalL61g,4,2015-10-27,Always drove past this coffee house and wonder...,1,0,0
8,w41ZS9shepfO3uEyhXEWuQ,u0LXt3Uea_GidxRW1xcsfg,5r6-G9C4YLbC7Ziz57l3rQ,3,2013-02-09,"Not bad!! Love that there is a gluten-free, ve...",1,0,0
10,PIsUSmvaUWB00qv5KTF1xA,u0LXt3Uea_GidxRW1xcsfg,z8oIoCT1cXz7gZP5GeU5OA,4,2013-05-01,This is currently my parents new favourite res...,1,0,0
11,PdZ_uFjbbkjtm3SCY_KrZw,u0LXt3Uea_GidxRW1xcsfg,XWTPNfskXoUL-Lf32wSk0Q,3,2011-09-28,Server was a little rude.\n\nOrdered the calam...,5,0,1


In [9]:
print("# of reviews for restaurant/food businesses: ", len(review[review['business_id'].isin(list_restaurants)]))

# of reviews for restaurant/food businesses:  3540133


In [10]:
business = business[business['business_id'].isin(list_restaurants)]

In [11]:
review = review[review['business_id'].isin(list_restaurants)]

The review table is now reduced to 3,540,133 reviews.

In [12]:
len(review)

3540133

2. Remove reviews written by users with fewer than 3 reviews

In [13]:
user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,friends,useful,funny,cool,fans,elite,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,JJ-aSuM4pCFPdkfoZ34q0Q,Chris,10,2013-09-24,"0njfJmB-7n84DlIgUByCNw, rFn3Xe3RqHxRSxWOU19Gpg...",0,0,0,0,,...,0,0,0,0,0,0,0,0,0,0
1,uUzsFQn_6cXDh6rPNGbIFA,Tiffy,1,2017-03-02,,0,0,0,0,,...,0,0,0,0,0,0,0,0,0,0
2,mBneaEEH5EMyxaVyqS-72A,Mark,6,2015-03-13,,0,0,0,0,,...,0,0,0,0,0,0,0,0,0,0
3,W5mJGs-dcDWRGEhAzUYtoA,Evelyn,3,2016-09-08,,0,0,0,0,,...,0,0,0,0,0,0,0,0,0,0
4,4E8--zUZO1Rr1IBK4_83fg,Lisa,11,2012-07-16,,4,0,0,0,,...,0,0,0,0,0,0,0,0,1,0


In [14]:
user.review_count.value_counts()

1       244232
2       166170
3       125947
4        89844
5        68334
6        54448
7        44497
8        37465
9        32321
10       28455
11       25273
12       22297
13       20032
14       18290
15       16529
16       15134
17       13832
18       12481
19       11723
20       10844
21        9879
22        9173
23        8434
24        8027
25        7650
26        7084
27        6589
28        6152
29        5855
30        5601
         ...  
1272         1
1797         1
1271         1
1799         1
1270         1
1269         1
1801         1
1268         1
2293         1
753          1
1281         1
5868         1
1288         1
1753         1
1297         1
1755         1
1294         1
1760         1
1761         1
1289         1
2313         1
1282         1
1770         1
1771         1
2310         1
1285         1
1284         1
1283         1
2306         1
1455         1
Name: review_count, Length: 1719, dtype: int64

In [15]:
review_counts = pd.DataFrame(user.groupby('review_count').size(), columns=['count'])
review_counts

Unnamed: 0_level_0,count
review_count,Unnamed: 1_level_1
0,1272
1,244232
2,166170
3,125947
4,89844
5,68334
6,54448
7,44497
8,37465
9,32321


In [16]:
original_user = len(user)

In [17]:
print("fraction of users with fewer than 3 reviews: ", len(user[user.review_count < 3]) / original_user)

fraction of users with fewer than 3 reviews:  0.3104396350199834


In [18]:
avg = sum(user['review_count'])/(original_user)
print("average counts: ", avg)

average counts:  23.117172913053313


In [19]:
list_more_rating_userID = user[user.review_count>=3]['user_id']

In [20]:
review = review[review['user_id'].isin(list_more_rating_userID)]
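
The filter above relies on `review_count` from the user table, which is a per-user total rather than a count over the already filtered `review` table, so a user who passes it may still have fewer than 3 restaurant reviews left. A minimal sketch, assuming one prefers to count directly in the filtered reviews; the helper `keep_active_users` and the toy data are hypothetical:

```python
import pandas as pd

def keep_active_users(review: pd.DataFrame, min_reviews: int = 3) -> pd.DataFrame:
    """Keep only reviews whose author has at least `min_reviews` rows in `review`."""
    # value_counts gives reviews per user; map broadcasts that count to each row.
    per_row_count = review['user_id'].map(review['user_id'].value_counts())
    return review[per_row_count >= min_reviews]

# Toy data (invented values): u1 has 3 reviews, u2 has 1, u3 has 2.
toy_review = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u3', 'u3'],
    'stars': [5, 4, 3, 2, 5, 1],
})

filtered = keep_active_users(toy_review, min_reviews=3)
print(len(filtered))
```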

3. Remove restaurants with few reviews

The smallest review_count among the remaining businesses is 3 (see the table below), so no restaurants are removed.

In [21]:
review_counts = pd.DataFrame(business.groupby('review_count').size(), columns=['count'])
review_counts

Unnamed: 0_level_0,count
review_count,Unnamed: 1_level_1
3,7270
4,4789
5,3794
6,3249
7,2750
8,2217
9,2043
10,1851
11,1676
12,1520
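
The decision not to remove restaurants can be backed with a single number instead of reading the table. A small sketch, assuming the same `review_count` column; the helper `share_with_at_least` and the toy values are hypothetical:

```python
import pandas as pd

def share_with_at_least(review_count: pd.Series, threshold: int) -> float:
    """Fraction of businesses whose review_count is >= threshold."""
    return float((review_count >= threshold).mean())

# Toy data (invented values) standing in for business['review_count'].
toy_business = pd.DataFrame({'review_count': [3, 3, 4, 10, 25]})

share_3 = share_with_at_least(toy_business['review_count'], 3)
share_5 = share_with_at_least(toy_business['review_count'], 5)
print(share_3, share_5)
```

On the real data this would be called as `share_with_at_least(business['review_count'], 3)`.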


In [22]:
review.shape

(3288978, 9)

In [23]:
from sklearn.preprocessing import LabelEncoder  
le = LabelEncoder()
review['userID'] = le.fit_transform(review['user_id'])

In [24]:
review_encode = LabelEncoder()
review['businessID'] = review_encode.fit_transform(review['business_id'])

In [25]:
review.drop(columns=['user_id','business_id'])

Unnamed: 0,review_id,stars,date,text,useful,funny,cool,userID,businessID
0,vkVSCC7xljjrAI4UGfnKEQ,5,2016-05-28,Super simple place but amazing nonetheless. It...,0,0,0,450222,12060
1,n6QzIUObkYshz4dz2QRJTw,5,2016-05-28,Small unassuming place that changes their menu...,0,0,0,450222,34995
2,MV3CcKScW05u5LVfF6ok0g,5,2016-05-28,Lester's is located in a beautiful neighborhoo...,0,0,0,450222,14291
3,IXvOzsEMYtiJI0CARmj77Q,4,2016-05-28,Love coming here. Yes the place always needs t...,0,0,0,450222,12011
4,L_9BTb55X0GDtThi6GlZ6w,4,2016-05-28,Had their chocolate almond croissant and it wa...,0,0,0,450222,60410
6,ymAUG8DZfQcFTBSOiaNN4w,4,2012-05-11,Who would have guess that you would be able to...,0,0,2,654370,11298
7,8UIishPUD92hXtScSga_gw,4,2015-10-27,Always drove past this coffee house and wonder...,1,0,0,654370,48237
8,w41ZS9shepfO3uEyhXEWuQ,3,2013-02-09,"Not bad!! Love that there is a gluten-free, ve...",1,0,0,654370,7308
10,PIsUSmvaUWB00qv5KTF1xA,4,2013-05-01,This is currently my parents new favourite res...,1,0,0,654370,68068
11,PdZ_uFjbbkjtm3SCY_KrZw,3,2011-09-28,Server was a little rude.\n\nOrdered the calam...,5,0,1,654370,37262


In [26]:
pd.DataFrame(review.groupby('date').size(), columns=['count'])

Unnamed: 0_level_0,count
date,Unnamed: 1_level_1
2004-10-12,1
2004-10-19,5
2004-12-19,2
2005-01-24,1
2005-01-26,3
2005-03-03,4
2005-03-04,2
2005-03-08,1
2005-03-09,1
2005-03-10,2


Sort the reviews by date and split them into train and test sets (roughly 70:30), using 2016-07-01 as the cutoff date.

In [27]:
review_sort = review.sort_values('date')

In [29]:
len(review_sort[review_sort.date < "2016-07-01"])/len(review_sort)

0.7057754718943088

In [30]:
# time-based split: reviews before the cutoff go to train, the rest to test
# (sklearn's train_test_split is not needed for this)
train = review_sort[review_sort.date < "2016-07-01"]
test = review_sort[review_sort.date >= "2016-07-01"]
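
The cutoff date here was found by checking the resulting fraction by hand. A sketch of deriving it from a target train fraction instead, assuming dates are ISO-formatted `YYYY-MM-DD` strings so that string order equals chronological order; the helper `time_split` and the toy frame are hypothetical:

```python
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = 'date', train_frac: float = 0.7):
    """Split chronologically: rows dated before the cutoff form train, the rest test."""
    ordered = df.sort_values(date_col, kind='stable')
    # Position of the first test row; clamp so the index is always valid.
    pos = min(int(round(len(ordered) * train_frac)), len(ordered) - 1)
    cutoff = ordered[date_col].iloc[pos]
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test, cutoff

# Toy data (invented values) with ten distinct dates.
toy = pd.DataFrame({
    'date': ['2016-01-05', '2015-03-01', '2017-02-11', '2016-07-01', '2014-12-30',
             '2016-09-09', '2015-08-20', '2016-03-03', '2017-06-01', '2013-01-01'],
    'stars': [5, 4, 3, 5, 2, 4, 1, 5, 3, 4],
})

toy_train, toy_test, cutoff = time_split(toy, train_frac=0.7)
print(cutoff, len(toy_train), len(toy_test))
```

Because all rows sharing the cutoff date go to the test side, the realized train fraction can fall slightly below the target when many reviews share one date.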

In [None]:
#pd.DataFrame(train.groupby('date').size(), columns=['count'])

In [None]:
#pd.DataFrame(test.groupby('date').size(), columns=['count'])

In [31]:
print('train size: ', len(train))
print('test size: ', len(test))

train size:  2321280
test size:  967698


In [32]:
train.drop(columns=['user_id','business_id'])

Unnamed: 0,review_id,stars,date,text,useful,funny,cool,userID,businessID
5205788,_CRpX4FGBkxie_1q0-DbjQ,5,2004-10-12,Hole in the wall burrito joint with the BEST b...,1,0,1,634199,28250
4665147,0QHCY_55TFHHvyumEMpDew,4,2004-10-19,Good stuff. Pricey by normal pizza standards.,0,1,0,584070,62649
4665130,1Iobyi_7BkFON25Oegs0aw,4,2004-10-19,Love their subs. Cheap and top shelf ingredients.,0,0,0,584070,16740
2772361,JDBubAcRw4FXfg1c5xk-dA,5,2004-10-19,Best pizza I've ever had. My favorite is the ...,2,0,0,452344,62649
4665146,pho1XNCTeRxQVzWR_5vacg,4,2004-10-19,Pokey Sticks are the best!,0,0,0,584070,12759
4665143,2F5J51OYtD49eyIUKJKVgg,4,2004-10-19,Love their pizza. They used to have a great ta...,0,0,0,584070,56662
1303380,Ef1skKLKZ9izwBmreb_-qw,4,2004-12-19,"Frequently busy due to their great food, but t...",1,1,1,79576,67520
938191,6POnAs_4MijROSKeOevXHQ,3,2004-12-19,Not the best part of town. Not particularly g...,0,0,0,79576,25988
917340,Dm7Jh7tVp_97sOxEYH4DnA,4,2005-01-24,"This is a little bit off the strip, but the vi...",0,0,0,357039,24575
3383314,gfPLT7BTqd2mJQ449BTrpw,4,2005-01-26,"Fresh shrimp, crab legs to enjoy. Prime Ribs &...",0,0,0,14195,39666


In [33]:
test.drop(columns=['user_id','business_id'])

Unnamed: 0,review_id,stars,date,text,useful,funny,cool,userID,businessID
1180491,DDa-dpkuE_LbIFdHIdoHJw,5,2016-07-01,This is by far the best Pho I've had! The brot...,1,0,0,33599,61285
1098449,lmOZ83aVHU9UTFSzvVOtPw,2,2016-07-01,I made a point to visit one of the best macaro...,0,0,0,338324,17223
734750,lXbTrDHAspvIDNPeERcEEA,5,2016-07-01,I'm a huge fan of trying different foods off m...,1,1,2,551002,44402
2708053,tDt7XDKljeKbBqRXO7gtLQ,5,2016-07-01,Was visiting from out of town. Love this place...,1,0,0,165211,19733
3355833,yCdQily0gmZgFrN_FwFa8Q,1,2016-07-01,One thing I can't stand is poor service. First...,3,0,0,633853,55110
118838,XZprMf4oZ-TI_MbBvuFEiw,3,2016-07-01,"Taco Tuesday, this is the place to be... The ...",0,0,0,217556,20075
3614410,hn4VaUaOdDx9wjoej42l1A,4,2016-07-01,Finally had a chance to visit this place.\nI h...,4,2,4,325348,15597
1965040,6jLdT1NWrMn9nxY2xVMRNg,4,2016-07-01,This restaurant is literally one of a kind. Ch...,3,1,1,244522,40973
1891419,tbMMl-kyk_u470TGRB5roQ,2,2016-07-01,Tried this place out for the first time last w...,1,0,0,156227,32628
2708054,6Ea-t19Zhr4o123z87PElA,5,2016-07-01,Just amazing! Visiting from out of town and we...,0,0,0,165211,44264


In [34]:
list_train_business_id = list(train['businessID'])
t = test[test['businessID'].isin(list_train_business_id)]

In [35]:
new_test = t

In [36]:
list_train_user_id = list(train['userID'])
ut = new_test[new_test['userID'].isin(list_train_user_id)]

In [37]:
new_test2 = ut

In [38]:
new_test2.shape

(430409, 11)

In [39]:
train.shape

(2321280, 11)

In [40]:
train_user_review = pd.DataFrame(train.groupby('userID').size(), columns=['num_reviews'])
train_user_review

Unnamed: 0_level_0,num_reviews
userID,Unnamed: 1_level_1
0,68
1,2
4,1
5,1
8,5
9,6
10,1
11,3
13,5
14,14


In [41]:
train_userID = train_user_review[train_user_review.num_reviews >= 2].index

In [42]:
train = train[train['userID'].isin(train_userID)]

In [43]:
train_bus_review = pd.DataFrame(train.groupby('businessID').size(), columns=['num_reviews'])
train_bus_review

Unnamed: 0_level_0,num_reviews
businessID,Unnamed: 1_level_1
0,21
1,20
2,946
3,22
4,65
5,3
6,29
7,6
8,10
9,4


In [44]:
train_businessID = train_bus_review[train_bus_review.num_reviews >= 2].index

In [45]:
train = train[train['businessID'].isin(train_businessID)]

In [46]:
train.shape

(2079120, 11)

In [47]:
new_test2.shape

(430409, 11)
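
The test set was filtered against `train` before users and businesses with fewer than 2 training reviews were dropped, so it may still contain IDs that no longer appear in the pruned `train`. A minimal sketch of re-applying the filter after pruning; the helper `restrict_to_known` and the toy frames are hypothetical:

```python
import pandas as pd

def restrict_to_known(test: pd.DataFrame, train: pd.DataFrame,
                      keys=('userID', 'businessID')) -> pd.DataFrame:
    """Keep only test rows whose value for every key also occurs in train."""
    mask = pd.Series(True, index=test.index)
    for key in keys:
        mask &= test[key].isin(train[key].unique())
    return test[mask]

# Toy data (invented values): user 3 and business 12 never occur in train.
toy_train = pd.DataFrame({'userID': [1, 1, 2, 2], 'businessID': [10, 11, 10, 11]})
toy_test = pd.DataFrame({'userID': [1, 3, 2], 'businessID': [10, 10, 12]})

known = restrict_to_known(toy_test, toy_train)
print(len(known))
```

In the notebook this would correspond to something like `restrict_to_known(new_test2, train)` run after the last pruning step.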

In [48]:
train = train.drop(columns=['user_id', 'business_id'])
test = new_test2.drop(columns=['user_id', 'business_id'])

In [49]:
train.shape

(2079120, 9)

In [50]:
test.shape

(430409, 9)

In [51]:
train.head()

Unnamed: 0,review_id,stars,date,text,useful,funny,cool,userID,businessID
4665147,0QHCY_55TFHHvyumEMpDew,4,2004-10-19,Good stuff. Pricey by normal pizza standards.,0,1,0,584070,62649
4665130,1Iobyi_7BkFON25Oegs0aw,4,2004-10-19,Love their subs. Cheap and top shelf ingredients.,0,0,0,584070,16740
4665146,pho1XNCTeRxQVzWR_5vacg,4,2004-10-19,Pokey Sticks are the best!,0,0,0,584070,12759
4665143,2F5J51OYtD49eyIUKJKVgg,4,2004-10-19,Love their pizza. They used to have a great ta...,0,0,0,584070,56662
1303380,Ef1skKLKZ9izwBmreb_-qw,4,2004-12-19,"Frequently busy due to their great food, but t...",1,1,1,79576,67520


In [52]:
test.head()

Unnamed: 0,review_id,stars,date,text,useful,funny,cool,userID,businessID
1180491,DDa-dpkuE_LbIFdHIdoHJw,5,2016-07-01,This is by far the best Pho I've had! The brot...,1,0,0,33599,61285
734750,lXbTrDHAspvIDNPeERcEEA,5,2016-07-01,I'm a huge fan of trying different foods off m...,1,1,2,551002,44402
3355833,yCdQily0gmZgFrN_FwFa8Q,1,2016-07-01,One thing I can't stand is poor service. First...,3,0,0,633853,55110
3614410,hn4VaUaOdDx9wjoej42l1A,4,2016-07-01,Finally had a chance to visit this place.\nI h...,4,2,4,325348,15597
1965040,6jLdT1NWrMn9nxY2xVMRNg,4,2016-07-01,This restaurant is literally one of a kind. Ch...,3,1,1,244522,40973


In [54]:
len(train)/(len(test)+len(train))

0.8284901270318056

In [55]:
# to_csv returns None when a path is given, so the result is not assigned
train.to_csv(r'yelp_data/train.csv', index=False, header=True)
test.to_csv(r'yelp_data/test.csv', index=False, header=True)