### Content-Based Filtering (text)


This notebook contains part 2 of the content-based filtering, which uses the review text. First, we extract information from the text and map it into features for each review. Then we use cosine similarity to measure how close reviews are to one another and make recommendations to users based on the most similar reviews.
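
As a rough illustration of the idea before working with the real data, the sketch below scores a few made-up reviews against each other using cosine similarity on raw word counts. The toy reviews and the helper name `cosine` are assumptions for illustration only; the notebook itself uses scikit-learn and TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (dict-like)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy reviews, already lower-cased and stripped of stop words.
toy_reviews = [
    "cioppino tasty staff attentive",
    "cioppino great service good",
    "burger fries cold slow",
]
vectors = [Counter(r.split()) for r in toy_reviews]

# Similarity of the first review to every review, itself included.
scores = [cosine(vectors[0], v) for v in vectors]

# Most similar review other than the query itself (index 0).
best = max(range(1, len(scores)), key=lambda i: scores[i])
print("most similar other review: %d (score %.2f)" % (best, scores[best]))
```

The two reviews that share the word "cioppino" score higher than the unrelated one, which is the signal the rest of the notebook relies on.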

In [43]:
import pandas as pd
import numpy as np
import re 
from nltk.corpus import stopwords
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models, similarities
from sklearn.feature_extraction.text import TfidfTransformer

%matplotlib inline

In [3]:
review = pd.read_csv("review.csv")

In [4]:
review.head(2)

Unnamed: 0,user_id,review_id,text,votes.cool,business_id,votes.funny,stars,date,type,votes.useful
0,Xqd0DzHaiyRqVH3WRG7hzg,15SdjuK7DmYqUAj6rjGowg,dr. goldberg offers everything i look for in a...,1,vcNAWiLM4dR7D2nwwJ7nCA,0,5,2007-05-17,review,2
1,H1kH6QZV7Le4zqTRNxoZow,RF6UnRTtG7tWMcrO2GEoAg,"Unfortunately, the frustration of being Dr. Go...",0,vcNAWiLM4dR7D2nwwJ7nCA,0,2,2010-03-22,review,2


In [5]:
review.text[:5]

0    dr. goldberg offers everything i look for in a...
1    Unfortunately, the frustration of being Dr. Go...
2    Dr. Goldberg has been my doctor for years and ...
3    Been going to Dr. Goldberg for over 10 years. ...
4    Got a letter in the mail last week that said D...
Name: text, dtype: object

In [6]:
rest = pd.read_pickle('rest.csv')
rest_text = pd.merge(rest, review, on='business_id')

In [7]:
rest_text.head(2)

Unnamed: 0,city,categories,business_id,name,user_id,review_id,text,votes.cool,votes.funny,stars,date,type,votes.useful
0,Las Vegas,"['Wine Bars', 'Bars', 'Restaurants', 'Nightlif...",_SM8UKIwBNbmj1r629ipoQ,Chianti Cafe,RZwkUvViHYEh5Z65--cVZw,q63uf05O8LJjjPiv8YqMLg,"I like Chianti, the outdoor seating area is ni...",3,1,4,2008-01-27,review,3
1,Las Vegas,"['Wine Bars', 'Bars', 'Restaurants', 'Nightlif...",_SM8UKIwBNbmj1r629ipoQ,Chianti Cafe,mW0l2ZhDeAAgjXPz_x2qRQ,FDTrDJbM-MSzVNsx2d5b7A,My wife and I went to Chianti for our annivers...,1,0,4,2008-08-13,review,0


In [8]:
text = rest_text.text

In [9]:
len(text)

190789

In [10]:
def review_to_words(review):
    """
    Convert a raw review to a string of words.
    Input: a single raw review string
    Output: a single string holding the preprocessed review
    """
    
    # 1. Replace everything that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", review) 
    
    # 2. Convert to lower case and split into individual words
    words = letters_only.lower().split()
    
    # 3. Remove stop words (a set makes the membership test fast)
    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if w not in stops]
    
    # Join the remaining words back into one string separated by spaces
    return " ".join(meaningful_words)

In [11]:
review_to_words(text[0])

'like chianti outdoor seating area nice spring fall inside typifies average las vegas dinner spot staff attentive entrees tasty liked beef carpaccio appetizer well nice simple pasta dishes pizzas cioppino surprisingly pretty good although tomato broth little overwhelming seafood obviously coastal fresh bread crostini horrible ala white wonder bread decent wine list less helpful staff regard overall would recommend spot anyone stranded vegas craving bowl cioppino p know better spot cioppino vegas please let know'

In [13]:
# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; the index i goes from 0 to the length
# of the review list
for i in xrange( 0, len(text)):
    # Print a progress message after every 50000 reviews
    if( (i+1)%50000 == 0 ):
        print "Cleaning review %d of %d\n" % ( i+1, len(text))
  
    # Call our function for each one, and add the result to the list of
    # clean reviews
    try:
        clean_train_reviews.append( review_to_words( text[i] ) )
    except Exception as e:
        # Keep the list aligned with the row index by adding a placeholder
        clean_train_reviews.append( review_to_words("I'm a placeholder sentence."))
        print "Exception raised:", e

Cleaning review 50000 of 190789

Cleaning review 100000 of 190789

Cleaning review 150000 of 190789



In [14]:
clean_train_reviews[:1]

['like chianti outdoor seating area nice spring fall inside typifies average las vegas dinner spot staff attentive entrees tasty liked beef carpaccio appetizer well nice simple pasta dishes pizzas cioppino surprisingly pretty good although tomato broth little overwhelming seafood obviously coastal fresh bread crostini horrible ala white wonder bread decent wine list less helpful staff regard overall would recommend spot anyone stranded vegas craving bowl cioppino p know better spot cioppino vegas please let know']

In [15]:
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 
train_data_features = vectorizer.fit_transform(clean_train_reviews)

In [16]:
vocab = vectorizer.get_feature_names()
print vocab[:50]

[u'abc', u'ability', u'able', u'absolute', u'absolutely', u'abundance', u'accent', u'accept', u'acceptable', u'accepted', u'access', u'accessible', u'accident', u'accidentally', u'accommodate', u'accommodated', u'accommodating', u'accomodating', u'accompanied', u'accompaniment', u'accompany', u'accompanying', u'according', u'accordingly', u'account', u'accurate', u'acknowledge', u'acknowledged', u'across', u'act', u'acted', u'action', u'actual', u'actually', u'ad', u'add', u'added', u'addicted', u'addicting', u'addictive', u'adding', u'addition', u'additional', u'additionally', u'address', u'adds', u'adequate', u'adjacent', u'admit', u'admittedly']


In [17]:
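# Sum the counts per word directly on the sparse matrix (avoids building a dense 190789 x 5000 array)
dist = np.asarray(train_data_features.sum(axis=0)).ravel()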

for tag, count in zip(vocab, dist)[:10]:
    print tag, count

abc 190
ability 189
able 5594
absolute 1210
absolutely 6995
abundance 236
accent 370
accept 483
acceptable 576
accepted 211


In [19]:
tfidf = TfidfTransformer(norm=u'l2')
tfidf.fit(train_data_features)

print "Inverse Document Freq:", tfidf.idf_

Inverse Document Freq: [ 8.23167494  7.92782001  4.60174579 ...,  8.10268282  7.92248666
  6.79935476]
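
The values above can be read with the help of a small sketch. Assuming the transformer is left at scikit-learn's default `smooth_idf=True` (the call above does not override it), each weight is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The toy corpus and the helper name `smooth_idf` below are made up for illustration.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Hypothetical toy corpus: each document is the set of its distinct terms.
docs = [
    {"cioppino", "good"},
    {"cioppino", "great"},
    {"good", "service"},
    {"good", "pizza"},
]
vocab = sorted(set().union(*docs))

# Document frequency: in how many documents does each term occur?
doc_freq = dict((t, sum(1 for d in docs if t in d)) for t in vocab)
idf = dict((t, smooth_idf(len(docs), doc_freq[t])) for t in vocab)

for term in vocab:
    print("%-10s df=%d idf=%.4f" % (term, doc_freq[term], idf[term]))
```

Rare terms get larger weights than common ones, which is why infrequent words such as `abc` sit near 8 in the output above while a common word like `able` is closer to 4.6.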


In [20]:
tf_idf_matrix = tfidf.transform(train_data_features)
print tf_idf_matrix.todense()

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


In [21]:
from sklearn.metrics.pairwise import cosine_similarity
#print tf_idf_matrix
cosine_similarity(tf_idf_matrix[0:1], tf_idf_matrix)

array([[ 1.        ,  0.05036309,  0.06800091, ...,  0.00873643,
         0.11325085,  0.04328023]])

In [22]:
tf_idf_matrix.shape

(190789, 5000)

In [23]:
from sklearn.metrics.pairwise import linear_kernel
cosine_similarities = linear_kernel(tf_idf_matrix[0:1], tf_idf_matrix).flatten()
cosine_similarities 

array([ 1.        ,  0.05036309,  0.06800091, ...,  0.00873643,
        0.11325085,  0.04328023])
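
`linear_kernel` returns the same numbers as `cosine_similarity` here because the TF-IDF rows were L2-normalized (`norm='l2'`), so a plain dot product between two rows already equals their cosine. The sketch below checks that on two made-up vectors; the helper names and the values are assumptions for illustration.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical unnormalized TF-IDF rows.
u = [0.0, 2.0, 1.0, 0.0]
v = [1.0, 1.0, 0.0, 3.0]

dot_after_norm = dot(l2_normalize(u), l2_normalize(v))
full_cosine = cosine(u, v)
print("%.6f %.6f" % (dot_after_norm, full_cosine))
```

Skipping the normalization step inside the similarity call is what makes `linear_kernel` the cheaper choice on rows that are already unit length.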

In [24]:
related_docs_indices = cosine_similarities.argsort()[:-10:-1]
related_docs_indices

array([     0, 134085,  36686,  41471,  65589,  50718,  50241, 173645,
        65724])

In [25]:
cosine_similarities[related_docs_indices]

array([ 1.        ,  0.44495272,  0.43169436,  0.3601242 ,  0.34221723,
        0.33880715,  0.32793478,  0.32531392,  0.31616053])
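
The slice `[:-10:-1]` walks backward from the last element of the ascending `argsort` result and stops before position -10, so it yields nine indices, highest score first, and the first of them is the query review itself (similarity 1.0). The sketch below reproduces the slicing on a short made-up score list, using `sorted` as a pure-Python stand-in for `argsort`.

```python
# Hypothetical similarities of six reviews to review 0.
scores = [1.0, 0.05, 0.44, 0.07, 0.43, 0.01]

# Stand-in for numpy's argsort: indices ordered by ascending score.
ascending = sorted(range(len(scores)), key=lambda i: scores[i])

k = 3
top_k = ascending[:-(k + 1):-1]        # the k best indices, best first
top_k_minus_1 = ascending[:-k:-1]      # note: [:-k:-1] returns only k - 1 indices

# Drop the query review itself before using the neighbours.
neighbours = [i for i in top_k if i != 0]
print(top_k, top_k_minus_1, neighbours)
```

To get exactly `k` neighbours excluding the query, one option is to slice `k + 1` entries and then remove the query index, as the last line does.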

In [26]:
text[0]

'I like Chianti, the outdoor seating area is nice during the spring and fall, while the inside typifies an average Las Vegas dinner spot. The staff is attentive. Entrees are tasty, I liked the beef carpaccio appetizer, as well as the nice, simple pasta dishes and pizzas, and the cioppino is (surprisingly) pretty good too (although the tomato broth is a little overwhelming and the seafood is obviously not coastal fresh). The bread/crostini is horrible (ala white Wonder Bread). Decent wine list, but less than helpful staff in that regard. Overall, I would recommend this spot to anyone stranded in Vegas and craving a bowl of cioppino (p.s., if you know of a better spot for cioppino in Vegas please let me know).'

In [27]:
text[134085]

'Had the Cioppino and it was great!  Service was very good too.  I would go back again.'

In [28]:
text[36686]

"We went to a Christmas Eve Dinner with the family and I was let down hard! The Cioppino that I ordered for $50 was far from average. I've had a great cioppino and was not even worthy being called cioppino. The broth was like it was out of a can and the seafood was fishy (not fresh), more like frozen crap. Let's just say the entire dinner was a 1 star rating. The bill was over $500 and for what? I expected so much better. A big let down!!!"

In [29]:
text[41471]

"Can't say enough about this place.  Had dinner here on 5/11/13, and it was wonderful. Try the Cioppino, it is Excellent."

In [30]:
text[52209]

'Nobu is always perfect.'

In [31]:
text[65724]

'I purchased a Groupon ($45 for two dinner entrees and a 12oz carafe of wine) for this place and decided it give it a try.  We arrived on a Monday night without reservations, and were promptly seated.  The staff was very friendly and accommodating. I ordered the Cioppino, and my sweetie had the Diver Scallops. My Cioppino was delicious!  Lots of seafood, and the pasta was perfectly al dente. The Diver Scallops were wonderful and super tender. Overall: good food, good service. Thanks Groupon!'

In [32]:
user = "RZwkUvViHYEh5Z65--cVZw"
rvs_ind = rest_text.ix[(rest_text.user_id == user) & (rest_text.stars >= 4)].index.tolist()

In [33]:
ind = []
for i in range(len(rvs_ind)):
    cosine_similarities = linear_kernel(tf_idf_matrix[rvs_ind[i]], tf_idf_matrix).flatten()
    related_docs_indices = cosine_similarities.argsort()[:-20:-1].tolist()
    inds = filter(lambda x: x != rvs_ind[i], related_docs_indices)
    ind = ind + inds
print ind

[134085, 36686, 41471, 65589, 50718, 50241, 173645, 65724, 65635, 84327, 84072, 156948, 65758, 65661, 134291, 75323, 92454, 16023, 31022, 40784, 97364, 97410, 169473, 31083, 40845, 27177, 30762, 169644, 30882, 40557, 169812, 40770, 186374, 40432, 3000, 30828, 186343, 27182, 154472, 30762, 31115, 40780, 27119, 169419, 169752, 97410, 40631, 169838, 169473, 2935, 186355, 31083, 40756, 27132, 3963, 4602, 4001, 4513, 4218, 107522, 3649, 4201, 4088, 4054, 4092, 3899, 4019, 4214, 124897, 4085, 3969, 3746, 138001, 8636, 35272, 60249, 60254, 148570, 8359, 7049, 8424, 8459, 148702, 11743, 8556, 33677, 8024, 8590, 108616, 8449, 189447, 13424, 33059, 24714, 123662, 81559, 25417, 24572, 32797, 117658, 125817, 176495, 24432, 143815, 24717, 22792, 111046, 58616, 115443, 41604, 42964, 150558, 41602, 115965, 7538, 7520, 129435, 115971, 114822, 105567, 141328, 151156, 7517, 4694, 119813, 118604, 59792, 9421, 59249, 59780, 59520, 59625, 59767, 59519, 59506, 9441, 59764, 59195, 9426, 59811, 59732, 179992,

In [34]:
rest = rest_text.ix[ind].name.unique()
rest

array(['Buzios Seafood Restaurant', 'Panevino Restaurant', 'Casa Di Amore',
       'RM Seafood', 'Mesa Grill', 'Pasta Pirate', 'Triple George Grill',
       'Mezzo Bistro and Wine', "Don Vito's", "Maggiano's Little Italy",
       'Vintner Grill', 'Fiesta Filipina', 'India Palace',
       'Mint Indian Bistro', 'Mantra Masala',
       "Mount Everest India's Cuisine", "Gandhi India's Cuisine",
       'India Masala', 'Samosa Factory', 'Bollywood Grill Indian Cusine',
       'Ichiza', 'Del Taco', "Dick's Last Resort", "Joe's New York Pizza",
       'Luv-It Frozen Custard', 'The Hush Puppy', "Tiffany's Cafe",
       'Lemongrass Caf\xc3\xa9', 'Blueberry Hill Family Restaurant',
       "Roberto's Taco Shop", 'Maverick Truck Stop', 'White Cross Drugs',
       "Angelina's Pizzeria", 'Godiva Chocolatier #909',
       "Smith's Food & Drug Centers Inc", 'Eiffel Tower Restaurant',
       'Spago', "Binion's Ranch Steak House", 'Il Mulino New York',
       'Circo', "Ethel's Chocolate Lounge", "Fellini

In [35]:
rest_went = rest_text.ix[rvs_ind].name.tolist()
rest_went

['Chianti Cafe',
 'Samosa Factory',
 'India Oven',
 'Ichiza',
 'White Cross Drugs',
 'Spago',
 'Albinas Italian American Bakery',
 'Swiss Cafe Restaurant',
 'Baladie Caf\xc3\xa9',
 'Go Raw Cafe',
 'Pamplemousse Le Restaurant']

In [36]:
rest = rest_text.ix[ind].name.unique()
rest_went = rest_text.ix[rvs_ind].name.tolist()
rec_rest = filter(lambda x: x not in rest_went, rest)
rec_rest

['Buzios Seafood Restaurant',
 'Panevino Restaurant',
 'Casa Di Amore',
 'RM Seafood',
 'Mesa Grill',
 'Pasta Pirate',
 'Triple George Grill',
 'Mezzo Bistro and Wine',
 "Don Vito's",
 "Maggiano's Little Italy",
 'Vintner Grill',
 'Fiesta Filipina',
 'India Palace',
 'Mint Indian Bistro',
 'Mantra Masala',
 "Mount Everest India's Cuisine",
 "Gandhi India's Cuisine",
 'India Masala',
 'Bollywood Grill Indian Cusine',
 'Del Taco',
 "Dick's Last Resort",
 "Joe's New York Pizza",
 'Luv-It Frozen Custard',
 'The Hush Puppy',
 "Tiffany's Cafe",
 'Lemongrass Caf\xc3\xa9',
 'Blueberry Hill Family Restaurant',
 "Roberto's Taco Shop",
 'Maverick Truck Stop',
 "Angelina's Pizzeria",
 'Godiva Chocolatier #909',
 "Smith's Food & Drug Centers Inc",
 'Eiffel Tower Restaurant',
 "Binion's Ranch Steak House",
 'Il Mulino New York',
 'Circo',
 "Ethel's Chocolate Lounge",
 "Fellini's Ristorante",
 'Khotan',
 'Twin Creeks',
 'Sushi 21',
 'Vosges Haut Chocolat',
 'Shibuya',
 'Panaderia Y Pasteria Latina',


In [44]:
def try_dif_rest(ind,rvs_ind):
    rest = rest_text.ix[ind].name.unique()
    rest_went = rest_text.ix[rvs_ind].name.tolist()
    rec_rest = filter(lambda x: x not in rest_went, rest)
    return rec_rest
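
The filtering in `try_dif_rest` can be sketched without pandas as follows: map each similar review to its restaurant, keep the first occurrence of every name, and drop restaurants the user has already reviewed. The function name `recommend_names`, the dictionary `review_to_restaurant`, and the toy data are hypothetical and only mirror the logic above.

```python
def recommend_names(similar_review_ids, visited_review_ids, review_to_restaurant):
    """Return restaurant names from similar reviews, in order of first
    appearance, excluding restaurants the user has already reviewed."""
    visited = set(review_to_restaurant[r] for r in visited_review_ids)
    seen = set()
    recommended = []
    for r in similar_review_ids:
        name = review_to_restaurant[r]
        if name not in visited and name not in seen:
            seen.add(name)
            recommended.append(name)
    return recommended

# Hypothetical toy data: review index -> restaurant name.
review_to_restaurant = {
    0: "Restaurant A",
    1: "Restaurant B",
    2: "Restaurant C",
    3: "Restaurant B",
    4: "Restaurant A",
}
similar = [1, 4, 2, 3]   # indices of the most similar reviews
visited = [0]            # indices of the user's own reviews

result = recommend_names(similar, visited, review_to_restaurant)
print(result)
```

Using sets for `visited` and `seen` keeps each membership test constant time, which matters once the candidate list grows to hundreds of reviews per user.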

In [45]:
def make_recommedation(user):
    """
    Function: recommend restaurants to a user based on the reviews most
              similar to the ones that user rated 4 stars or higher
    Input: user_id
    Output: names of the restaurants from the most similar reviews
    """
    
    rvs_ind = rest_text.ix[(rest_text.user_id == user) & (rest_text.stars >= 4)].index.tolist()
    went_ind = rest_text.ix[(rest_text.user_id == user)].index.tolist()
   
    if len(rvs_ind) == 0:
        return "Not enough information!!"
    else:
        ind = []
        for i in range(len(rvs_ind)):
            cosine_similarities = linear_kernel(tf_idf_matrix[rvs_ind[i]], tf_idf_matrix).flatten()
            related_docs_indices = cosine_similarities.argsort()[:-15:-1].tolist()
            # Drop the user's own reviews; comparing x to the whole list
            # with != would always be True and filter nothing
            inds = filter(lambda x: x not in went_ind, related_docs_indices)
            ind = ind + inds
            
        return try_dif_rest(ind,went_ind)

In [46]:
n_users = 5

for user in np.random.choice(rest_text.user_id, n_users, replace=False):
    print "User %s" % user
    print "Already Liked:", ", ".join(rest_text.ix[(rest_text.user_id == user) & (rest_text.stars >= 4)].name.tolist())
    print "Recommended:", ", ".join(make_recommedation(user))
    print
 

User YnKPH9_dPUsHUA99-Y1mwQ
Already Liked: Havana Grill
Recommended: Cuba Cafe Restaurant, Kona Grill, RA Sushi Bar Restaurant, Rincon Criollo, Metro Pizza

User SE0Sckp7UwlS6SlGuj5FAw
Already Liked: Ronald's Donuts, Lotus of Siam, Tom Colicchio's Craftsteak, Bouchon Bistro, Bouchon Bistro, Canter's Delicatessen, Carnevino, Scarpetta
Recommended: Dessert Avenue, Lemongrass, Ocha Cuisine, Komol Restaurant, Pin Kaow Thai Restaurant, Thai Room, Island Flavor, Thai Style Noodle House, Daniel Boulud Brasserie, Pinot Brasserie, Sweet Water Prime Seafood, Montana Meat Company, Delmonico Steakhouse, Serendipity 3, Planet Dailies, Blueberry Hill Family Restaurant, The Cracked Egg, Pho Kim Long, Olive Garden Italian Restaurant, Wichcraft, Earl of Sandwich, Greenberg's Deli, Mr Tofu, Deli Den, Quiznos, Society Cafe Encore, Fleur by Hubert Keller, STRIPSTEAK, Del Frisco's Double Eagle Steak House, Gallagher's Steakhouse, The Range Steakhouse, Wendy's Noodle Cafe, T-Bones Chophouse, Charlie Palmer 