This report applies SGD Regression to Yelp reviews of businesses in Nevada and builds a model that predicts the rating a consumer would give a business they have not yet visited. 

It is split into two parts: 

Part 1 creates the sample of Nevada reviews used in the analysis. 

Part 2 creates an SGD Regression model for that data. 

# Part 1: Nevada DataFrame

This first section will create a DataFrame for the most commonly rated businesses in Nevada. This will be used in Part 2 to create the model. 

To save memory, the DataFrame will be stored as a CSV file for easy import. Part 1 only needs to be run once in order to create this file. From then on, the CSV file can be imported directly. 

In [1]:
import pandas as pd
import json
from textblob import TextBlob
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns

import datetime
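# Note: vincenty was removed in geopy 2.0; on newer versions, geopy.distance.geodesic is the replacement
from geopy.distance import vincenty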

from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

In [2]:
# pd.DataFrame.from_csv was removed in pandas 1.0; pd.read_csv is its replacement.
# Forward slashes avoid backslash-escape problems in the paths and work on every platform.
business = pd.read_csv('data/yelp_academic_dataset_business.csv')
review = pd.read_csv('data/yelp_academic_dataset_review.csv')
user = pd.read_csv('data/yelp_academic_dataset_user.csv')

In [3]:
# These set the number of businesses and reviewers used to create the sample space: the 1,000 most frequently reviewed businesses are taken, followed by the 20,000 most prolific reviewers of those businesses.
business_taking = 1000
reviewers_taking = 20000

In [4]:
# nv stores all the businesses in Nevada
nv = business[business['state'] == 'NV']
# nv_reviews stores all the reviews of all the businesses in Nevada
nv_reviews = review[review['business_id'].isin(nv.business_id.tolist())]
# nv_businesses stores the most frequently reviewed businesses in Nevada (the number of businesses being selected is business_taking)
nv_businesses = nv_reviews['business_id'].value_counts().head(business_taking)
# Determines all the reviews for the top selected businesses and now stores that as nv_reviews
nv_reviews = nv_reviews[nv_reviews['business_id'].isin(nv_businesses.index.tolist())]
# nv_reviewers stores the most frequent reviewers among these reviews (the number of reviewers being selected is reviewers_taking)
nv_reviewers = nv_reviews['user_id'].value_counts().head(reviewers_taking)
# nv_reviewers_list stores these reviewers as a list
nv_reviewers_list = nv_reviewers.index.tolist()
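# Keeps only the reviews written by the selected reviewers from nv_reviewers and saves this as nv_reviews.
# .copy() makes nv_reviews an independent DataFrame, so adding columns to it later does not raise a SettingWithCopyWarning.
nv_reviews = nv_reviews[nv_reviews['user_id'].isin(nv_reviewers_list)].copy()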
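
The two-stage sampling above can be illustrated on a toy table. This is a minimal sketch, not the notebook's code: the helper name `keep_most_frequent` and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy reviews: business "a" has 3 reviews, "b" has 2 and "c" has 1.
reviews = pd.DataFrame({
    'business_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'user_id':     ['u1', 'u2', 'u3', 'u1', 'u2', 'u3'],
})

def keep_most_frequent(df, column, n):
    """Hypothetical helper: keep only rows whose value in `column` is among the n most frequent."""
    top_values = df[column].value_counts().head(n).index
    return df[df[column].isin(top_values)]

# Stage 1: keep the 2 most reviewed businesses ("a" and "b").
sample = keep_most_frequent(reviews, 'business_id', 2)
# Stage 2: among the remaining reviews, keep the 2 most active reviewers ("u1" and "u2").
sample = keep_most_frequent(sample, 'user_id', 2)
```

Because the second stage runs on the already filtered reviews, a reviewer's count only includes reviews of the selected businesses, not their activity across all of Nevada.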

In [5]:
'''
The section below conducts sentiment analysis on the sampled reviews. 
This analysis determines how favorable (polarity) and how subjective (subjectivity) the language in each review is. 
It then stores the polarity and subjectivity as columns in the nv_reviews DataFrame. 
'''

sentiment = lambda x: TextBlob(str(x)).sentiment

nv_reviews['sentiment'] = nv_reviews['text'].apply(sentiment)
# TextBlob's sentiment is a (polarity, subjectivity) pair; these helpers pick out each part
polarity = lambda x: x[0]
subjectivity = lambda x: x[1]
nv_reviews['polarity'] = nv_reviews['sentiment'].apply(polarity)
nv_reviews['subjectivity'] = nv_reviews['sentiment'].apply(subjectivity)

# The intermediate sentiment column is no longer needed
nv_reviews = nv_reviews.drop('sentiment', axis=1)
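
TextBlob reports polarity on a scale from -1 to 1 and subjectivity on a scale from 0 to 1. As a rough sketch of the column-splitting step, the example below expands a Series of (polarity, subjectivity) pairs into two columns in one pass. To stay runnable without TextBlob, it uses `toy_sentiment`, a hypothetical stand-in scorer invented for this illustration; its word lists and formula are assumptions and have nothing to do with how TextBlob actually scores text.

```python
import pandas as pd

def toy_sentiment(text):
    """Hypothetical stand-in for TextBlob(text).sentiment: returns (polarity, subjectivity)."""
    words = str(text).lower().split()
    positive = sum(w in {'love', 'great'} for w in words)
    negative = sum(w in {'bad', 'awful'} for w in words)
    total = max(len(words), 1)
    polarity = (positive - negative) / total
    subjectivity = (positive + negative) / total
    return polarity, subjectivity

reviews = pd.DataFrame({'text': ['love this great place', 'awful service', 'open on sundays']})

# Score each review once, then expand the pairs into two named columns.
scores = reviews['text'].apply(toy_sentiment)
expanded = pd.DataFrame(scores.tolist(), index=reviews.index,
                        columns=['polarity', 'subjectivity'])
reviews = reviews.join(expanded)
```

Building both columns from one list of pairs avoids a temporary `sentiment` column that has to be dropped afterwards.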


In [6]:
# Adds a column to nv_reviews which is the business_id and user_id separated by a comma. This will later be used to locate specific reviews by business and reviewer.
nv_reviews['pairing'] = nv_reviews['business_id'] + ',' + nv_reviews['user_id']

In [7]:
# To see the sampled reviews as currently created
nv_reviews

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,type,useful,user_id,polarity,subjectivity,pairing
126774,A0X1baHPgw9IiBRivu0G9g,1,2015-02-27,1,PBuL2xZYXI_RyhQYaWlmJg,4,I love this place been coming here since last ...,review,1,h_c7PjnpUnF_-LAegBv7CA,0.185714,0.366667,"A0X1baHPgw9IiBRivu0G9g,h_c7PjnpUnF_-LAegBv7CA"
126775,A0X1baHPgw9IiBRivu0G9g,3,2014-05-04,4,1wT35ZGPVZaOSq0TjCI-hA,5,"Besides getting a ticket to Paris, this French...",review,6,tw1YfWwGJffPlUNTl_w1fw,0.207500,0.456157,"A0X1baHPgw9IiBRivu0G9g,tw1YfWwGJffPlUNTl_w1fw"
126777,A0X1baHPgw9IiBRivu0G9g,0,2015-01-29,0,hT9m4yh20Ao38Tu2eD2NaQ,4,Saw all the great yelp reviews and wanted to t...,review,0,mAqrLyouqff2P4lBzakDPg,0.296429,0.652778,"A0X1baHPgw9IiBRivu0G9g,mAqrLyouqff2P4lBzakDPg"
126778,A0X1baHPgw9IiBRivu0G9g,0,2014-12-04,0,2LUuhvDtyvgBrbq9hFAZug,4,I love the sprawling casualness of this place....,review,1,ZKwO9lkpAnOnQlBL1fXf1w,0.209444,0.629074,"A0X1baHPgw9IiBRivu0G9g,ZKwO9lkpAnOnQlBL1fXf1w"
126779,A0X1baHPgw9IiBRivu0G9g,3,2015-06-05,3,OR61T4qYV6OY-gUtL_CqWQ,5,I love love love Patisserie Manon!\n\nI was ex...,review,4,-594af_E7Z9VVjQc9pJK3g,0.427083,0.712500,"A0X1baHPgw9IiBRivu0G9g,-594af_E7Z9VVjQc9pJK3g"
126781,A0X1baHPgw9IiBRivu0G9g,0,2012-03-04,0,GG6lGiH48-bFQfZhQP7j4w,5,What a gem! I absolutely love their sandwiches...,review,0,pHp0UVnYiRZWm1mSqPiS5g,0.194643,0.442857,"A0X1baHPgw9IiBRivu0G9g,pHp0UVnYiRZWm1mSqPiS5g"
126783,A0X1baHPgw9IiBRivu0G9g,1,2015-11-02,0,oAMLLzxiDBoeNNWSON3RvA,5,Just went there again for breakfast and the se...,review,1,S_7OkmN0BicgWEt2oMmzIQ,0.462500,0.575000,"A0X1baHPgw9IiBRivu0G9g,S_7OkmN0BicgWEt2oMmzIQ"
126785,A0X1baHPgw9IiBRivu0G9g,2,2011-03-03,2,25lVvhF2ps60OpLilh79yQ,2,How excited was I when I spotted the sign guy ...,review,5,btUugfufQAe-QD6gC_Ckmw,0.134065,0.501360,"A0X1baHPgw9IiBRivu0G9g,btUugfufQAe-QD6gC_Ckmw"
126786,A0X1baHPgw9IiBRivu0G9g,2,2016-03-02,2,qilDOjZ1c0KtniNXCsYWOg,4,Real French bread and pastry all made in house...,review,2,ptqOIs3ZwYPMU466assCQg,0.000000,0.000000,"A0X1baHPgw9IiBRivu0G9g,ptqOIs3ZwYPMU466assCQg"
126788,A0X1baHPgw9IiBRivu0G9g,2,2010-11-12,0,fwXC3VohV-L0TtagJQn5Rw,4,The hubby and I stopped in after a yummie meal...,review,3,MGRNCiPHnzBhcDULy8OSuQ,0.162564,0.431923,"A0X1baHPgw9IiBRivu0G9g,MGRNCiPHnzBhcDULy8OSuQ"


In [8]:
# X will be the input matrix for the SGD Regression model. It is built here and stored as a CSV file at the end of Part 1. 
X = nv_reviews[['pairing',  'business_id', 'stars', 'polarity', 'subjectivity']]
X = X.merge(business[['business_id', 'city', 'postal_code', 'state', 'latitude', 'longitude']])
del X['business_id']

In [9]:
# Adds relevant user and review information to X
nv_reviews = nv_reviews.merge(user[['user_id', 'average_stars']])
X = X.merge(nv_reviews[['pairing', 'date', 'average_stars']])
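
When `merge` is called without `on` or `how`, pandas joins on the columns the two frames share and performs an inner join, so rows without a match are dropped silently. The sketch below, on invented toy data, spells out both arguments and uses `indicator=True` to check for unmatched rows.

```python
import pandas as pd

reviews = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2', 'b9'],
    'stars': [4, 5, 2, 3],
})
businesses = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'city': ['Las Vegas', 'Henderson', 'Reno'],
})

# Same result as reviews.merge(businesses), with the key and join type made explicit.
merged = reviews.merge(businesses, on='business_id', how='inner')

# how='left' keeps unmatched reviews; indicator=True adds a '_merge' column that flags them.
checked = reviews.merge(businesses, on='business_id', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```

Here the review of `b9` disappears from `merged` because no business row matches it, and it shows up in `unmatched` instead.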

In [10]:
# Calculates and stores the distance in miles from each business to the Las Vegas Strip. The Strip is used as the central reference point for measuring how close each business is to the center of the area. 
strip = (36.1147, -115.1728)
X['distance_from_center'] = list(zip(X.latitude, X.longitude))
calculate_distance_from_center = lambda x: vincenty(strip, x).miles
X['distance_from_center'] = X['distance_from_center'].apply(calculate_distance_from_center)
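
`vincenty` is not available in geopy 2.0 and later, where `geopy.distance.geodesic` takes its place. For readers without geopy, the sketch below approximates the same distance with the haversine formula in plain Python. This is a swapped-in technique, not the notebook's method: it treats the Earth as a sphere with an assumed mean radius, so its results differ slightly from the ellipsoidal distance used above. The function name `haversine_miles` is an assumption made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for the spherical model

def haversine_miles(point_a, point_b):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

strip = (36.1147, -115.1728)
business_location = (36.157591, -115.285658)  # coordinates of the first business in X
approx_miles = haversine_miles(strip, business_location)
```

For this business the spherical approximation lands close to the `distance_from_center` value of 6.970403 miles shown in X.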

In [11]:
# Uses the date of each review to determine its day of the week and month, so that differences between days of the week and times of the year can be analyzed

# to_datetime converts a 'YYYY-MM-DD' date string to a datetime.date object
# (the parameter is named date_string so that it does not shadow the built-in str)
def to_datetime(date_string):
    year, month, day = (int(x) for x in date_string.split('-'))
    return datetime.date(year, month, day)

day_of_week = lambda x: x.weekday()  # Monday is 0 and Sunday is 6
month = lambda x: x.month

# Parse each date once, then derive both columns from the parsed dates
review_dates = X['date'].apply(to_datetime)
X['day_of_the_week'] = review_dates.apply(day_of_week)
X['month'] = review_dates.apply(month)
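
The same two columns can also be derived with pandas' own datetime tools. The sketch below is an alternative, not the notebook's code, and uses three dates taken from the sampled reviews; `dt.dayofweek` follows the same Monday=0 to Sunday=6 convention as `date.weekday()`.

```python
import pandas as pd

dates = pd.Series(['2015-02-27', '2014-05-04', '2015-11-02'])

# Parse all date strings at once instead of one Python call per row.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

day_of_the_week = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
month = parsed.dt.month
```

On a sample of this size the vectorized version mainly saves code, but it also avoids a Python-level function call for every review.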

In [12]:
# Provides a full view of all the variables in X
X

Unnamed: 0,pairing,stars,polarity,subjectivity,city,postal_code,state,latitude,longitude,date,average_stars,distance_from_center,day_of_the_week,month
0,"A0X1baHPgw9IiBRivu0G9g,h_c7PjnpUnF_-LAegBv7CA",4,0.185714,0.366667,Las Vegas,89117,NV,36.157591,-115.285658,2015-02-27,3.78,6.970403,4,2
1,"A0X1baHPgw9IiBRivu0G9g,tw1YfWwGJffPlUNTl_w1fw",5,0.207500,0.456157,Las Vegas,89117,NV,36.157591,-115.285658,2014-05-04,3.74,6.970403,6,5
2,"A0X1baHPgw9IiBRivu0G9g,mAqrLyouqff2P4lBzakDPg",4,0.296429,0.652778,Las Vegas,89117,NV,36.157591,-115.285658,2015-01-29,4.55,6.970403,3,1
3,"A0X1baHPgw9IiBRivu0G9g,ZKwO9lkpAnOnQlBL1fXf1w",4,0.209444,0.629074,Las Vegas,89117,NV,36.157591,-115.285658,2014-12-04,3.04,6.970403,3,12
4,"A0X1baHPgw9IiBRivu0G9g,-594af_E7Z9VVjQc9pJK3g",5,0.427083,0.712500,Las Vegas,89117,NV,36.157591,-115.285658,2015-06-05,3.83,6.970403,4,6
5,"A0X1baHPgw9IiBRivu0G9g,pHp0UVnYiRZWm1mSqPiS5g",5,0.194643,0.442857,Las Vegas,89117,NV,36.157591,-115.285658,2012-03-04,4.77,6.970403,6,3
6,"A0X1baHPgw9IiBRivu0G9g,S_7OkmN0BicgWEt2oMmzIQ",5,0.462500,0.575000,Las Vegas,89117,NV,36.157591,-115.285658,2015-11-02,4.67,6.970403,0,11
7,"A0X1baHPgw9IiBRivu0G9g,btUugfufQAe-QD6gC_Ckmw",2,0.134065,0.501360,Las Vegas,89117,NV,36.157591,-115.285658,2011-03-03,3.89,6.970403,3,3
8,"A0X1baHPgw9IiBRivu0G9g,ptqOIs3ZwYPMU466assCQg",4,0.000000,0.000000,Las Vegas,89117,NV,36.157591,-115.285658,2016-03-02,3.56,6.970403,2,3
9,"A0X1baHPgw9IiBRivu0G9g,MGRNCiPHnzBhcDULy8OSuQ",4,0.162564,0.431923,Las Vegas,89117,NV,36.157591,-115.285658,2010-11-12,3.45,6.970403,4,11


In [13]:
# Stores X as a CSV file. To save memory, Part 2 reads this file directly instead of rebuilding X. 
# The path uses the same data folder and separator style as the input files.
X.to_csv('data/nv_master_sample_set.csv', sep=',')
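
One detail matters when the file is read back: `to_csv` writes the DataFrame index as a first column with no header, which `read_csv` then names `Unnamed: 0` unless told otherwise. The sketch below shows the round trip on a small invented frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

X_small = pd.DataFrame({'stars': [4, 5], 'polarity': [0.185714, 0.2075]}, index=[10, 11])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
X_small.to_csv(buffer)
csv_text = buffer.getvalue()

# A plain read turns the saved index into an extra 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the first column as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

If the index carries no information, passing `index=False` to `to_csv` leaves it out of the file altogether.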