# Part 4: Model - Hypothesis Testing
Now that we have explored our data (in notebook #1), let's begin to answer our key questions using hypothesis testing. Hypothesis testing is often used in order to determine whether or not an outcome is statistically significant. The key questions we will be evaluating are:  

1. Does a restaurant's Yelp rating influence how many Yelp reviews the restaurant will receive?
2. Does a restaurant's inspection grade influence how many Yelp reviews the restaurant will receive?
3. Does the type of cuisine influence how many Yelp reviews the restaurant will receive?
4. Is there a relationship between the Inspection Grade and the Neighborhood, Price, or Cuisine Type?


We are looking into these questions in order to help restaurateurs better understand what they can do to gain more Yelp reviews or better inspection grades. Having more reviews (if positive) and a better inspection grade could help elevate a restaurant and encourage more people to come in and try it, therefore boosting revenue for that restaurant.

First, we will call all needed libraries and import the dataset that we scrubbed in our EDA notebook.

In [1]:
# Import Libraries:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing


# Import relevant libraries for hypothesis testing:
from scipy import stats # for significance levels and normality
import statsmodels.api as sm # for statistical exploration/testing
from statsmodels.formula.api import ols # for hypothesis testing
from statsmodels.stats.multicomp import pairwise_tukeyhsd # for pairwise comparisons
from statsmodels.stats.multicomp import MultiComparison # for multiple comparisons testing

In [2]:
# Import Data:
with open ('scrubbed_data.pickle','rb') as f:
    df_merged = pickle.load(f)

print(len(df_merged))
df_merged.head()

3930


| | CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | display_phone | CUISINE DESCRIPTION | INSPECTION DATE | ACTION | ... | wine_bars | womenscloth | wraps | delivery | pickup | restaurant_reservation | num_of_cat | mainstream_category | rare_category | price_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41322152 | 1 2 3 BURGER SHOT BEER | Manhattan | 738 | 10 AVENUE | 10019.0 | (212) 315-0123 | American | 2019-12-20 | Violations were cited in the following area(s). | ... | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 1 | 0 | 1 |
| 1 | 41430594 | 1 STOP PATTY SHOP | Manhattan | 1708 | AMSTERDAM AVENUE | 10031.0 | (212) 491-7466 | Bakery | 2019-03-27 | Violations were cited in the following area(s). | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 |
| 2 | 50059935 | 108 FOOD DRIED HOT POT | Manhattan | 2794 | BROADWAY | 10025.0 | (917) 675-6878 | Chinese | 2019-05-23 | Violations were cited in the following area(s). | ... | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 2 |
| 3 | 41092609 | 10TH AVENUE COOKSHOP | Manhattan | 156 | 10 AVENUE | 10011.0 | (212) 924-4440 | American | 2019-04-11 | Violations were cited in the following area(s). | ... | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 1 | 0 | 2 |
| 4 | 50057272 | 11 HANOVER GREEK | Manhattan | 11 | HANOVER SQ | 10005.0 | (212) 785-4000 | Greek | 2019-02-28 | Violations were cited in the following area(s). | ... | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 1 | 0 | 3 |


## 4.1 Yelp Rating vs. # of Yelp Reviews

I will begin by setting my hypotheses. The Null Hypothesis ( 𝐻0 ) is typically that there is no difference between the samples, while the Alternative Hypothesis ( 𝐻𝐴 ) is our educated guess about the relationship between our two samples. Below are the hypotheses I will be using to answer this first question:  

**𝐻0  = There is no difference in the number of Yelp reviews received by restaurants with a high (4.0 or above) Yelp rating vs. a low (3.5 or below) Yelp rating.  
𝐻𝐴  = The number of Yelp reviews received by restaurants with high ratings is statistically greater than the number of Yelp reviews received by restaurants with low Yelp ratings.**

We will be using an alpha value of .05 as our significance threshold. This means that we will reject the null hypothesis in favor of the alternative only if, assuming the null hypothesis is true, there is less than a 5% chance of seeing a difference at least as large as the one in our data.
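To make the meaning of alpha concrete, here is a minimal simulation sketch. It uses synthetic data (not our restaurant dataset), and the sample size and number of simulations are arbitrary choices made for illustration. Both groups are drawn from the same distribution, so every rejection is a false positive, and the share of rejections should land close to .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 2000  # arbitrary, for illustration only
sample_size = 50      # arbitrary, for illustration only

# Both groups come from the SAME distribution, so the null hypothesis is true by construction
group_a = rng.normal(loc=0, scale=1, size=(sample_size, n_simulations))
group_b = rng.normal(loc=0, scale=1, size=(sample_size, n_simulations))

# One Welch's t-test per column (i.e. per simulated experiment)
p_values = stats.ttest_ind(group_a, group_b, equal_var=False, axis=0).pvalue

# Share of experiments where we would (wrongly) reject the null at alpha = .05
false_positive_rate = np.mean(p_values < .05)
print('Share of simulations that rejected a true null:', false_positive_rate)
```

In other words, alpha is the false positive rate we are willing to tolerate when there is truly no difference.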

To begin our hypothesis testing, let's first group the data to identify high Yelp ratings (4+) and low Yelp ratings (3.5 and below). I then will take a look at the mean scores for the number of reviews  for high Yelp rating vs. low Yelp rating.

In [3]:
# Update rating to be grouped by high (4+) and low (3.5 and below) ratings:
score = {1: 0, 1.5: 0, 2:0, 2.5: 0, 3: 0, 3.5:0,4:1,4.5:1,5:1}
df_merged['rating_score'] = df_merged['rating'].map(score)

# Select data needed for analysis:
high_star = df_merged[df_merged['rating_score']==1]['review_count']
low_star = df_merged[df_merged['rating_score']==0]['review_count']

# Compare mean scores of number of review for high Yelp ratings vs. low Yelp ratings:
print('High Score Mean:',high_star.mean())
print('Low Score Mean:',low_star.mean())

High Score Mean: 352.6193628465039
Low Score Mean: 314.0522141440846


Based on the mean values, it does seem like having a high rating may indicate more reviews have been written for a restaurant.  Next I will need to choose the appropriate testing method. We typically use a t-test for hypothesis testing, which tells us if there is a statistical difference between the means of two populations. If our sample sizes and/or sample variances are equal, then we would use a standard Student's t-test. However, if sample sizes and variances are unequal between our 2 populations, then we should use an adaptation of the Student's t-test known as Welch's t-test.

In [4]:
# Test whether variances and sample size are equal:
print('Are variances equal?:',np.var(high_star) == np.var(low_star))
print('Are sample sizes equal?:',len(high_star) == len(low_star))

Are variances equal?: False
Are sample sizes equal?: False
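Comparing two sample variances with `==` will almost always return False for real-valued data, even when the underlying population variances are the same. A more informative check is a formal test such as Levene's test, available as `scipy.stats.levene`. The sketch below uses synthetic stand-in groups (`group_a` and `group_b` are made-up names, not our restaurant data) just to show the call.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed stand-ins for two review-count groups (not the real data)
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=350, size=500)
group_b = rng.exponential(scale=300, size=400)

# Levene's test: the null hypothesis is that the groups have equal variances.
# center='median' is the variant that is more robust to skewed data.
levene_stat, levene_p = stats.levene(group_a, group_b, center='median')

print('Levene statistic:', levene_stat, 'p-value:', levene_p)
print('Evidence of unequal variances' if levene_p < .05 else 'No evidence of unequal variances')
```

Welch's t-test remains a safe choice either way, since it does not assume equal variances.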


Since our sample sizes and variances are not equal, we will proceed with Welch's t-test. We will use a 1-tailed t-test since we are just looking to see if a high rating leads to a greater number of reviews. I will use the ttest_ind function to determine if there is any difference in the number of reviews, and will pass the 'equal_var = False' argument to indicate that our variances are unequal. Because ttest_ind returns a 2-sided p-value by default, I will divide it by 2 to get the 1-tailed p-value.

In [5]:
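# Run 1-sided Welch's t-test:
result = stats.ttest_ind(high_star, low_star, equal_var = False) # Welch's t-test (returns a 2-sided p-value)
# Halve the 2-sided p-value for the 1-tailed test; this is only valid when the t-statistic is positive (the direction of our alternative hypothesis):
p_one_tailed = result[1]/2
print('Reject Null Hypothesis' if (result[0]>0 and p_one_tailed<.05) else 'Failed to Reject Null Hypothesis')
print('t-statistic:',result[0],'p-value:',p_one_tailed)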

Reject Null Hypothesis
t-statistic: 3.931521379542655 p-value: 4.302440410486111e-05


We end up with a very low p-value that is less than our alpha value of .05. This value is statistically significant and gives a strong case against our null hypothesis. Therefore, we reject our null hypothesis and say that restaurants with a high Yelp rating do receive a greater number of reviews, on average, than restaurants with low ratings.  
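As a side note, newer versions of SciPy (1.6 and later, to the best of my knowledge) let `ttest_ind` run the 1-tailed test directly through the `alternative` argument, so the p-value does not have to be halved by hand. The sketch below uses synthetic stand-in samples (`high` and `low` are made-up data, not our dataset) to show that both routes agree.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (not the real data)
rng = np.random.default_rng(1)
high = rng.normal(loc=360, scale=300, size=800)
low = rng.normal(loc=310, scale=280, size=600)

# Default call: 2-sided Welch's t-test
two_sided = stats.ttest_ind(high, low, equal_var=False)

# Direct 1-tailed test of "mean(high) > mean(low)"
one_sided = stats.ttest_ind(high, low, equal_var=False, alternative='greater')

# Manual conversion: halve the 2-sided p-value if t > 0, otherwise use 1 - p/2
if two_sided.statistic > 0:
    manual_p = two_sided.pvalue / 2
else:
    manual_p = 1 - two_sided.pvalue / 2

print('1-tailed p-value (alternative="greater"):', one_sided.pvalue)
print('1-tailed p-value (manual conversion):', manual_p)
```

The manual conversion also shows why the sign of the t-statistic matters: simply halving the p-value would be wrong if the difference pointed in the opposite direction.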

Let's now quantify the size of the difference between our 2 means by looking at the effect size.

### Effect Size:
Effect size will help us understand the practical significance of our results. In other words, how meaningful is the statistical difference between our two groups? To understand the effect size, I will use Cohen's d, which represents the magnitude of differences between 2 groups on a given variable. Larger values for Cohen's d indicate greater differentiation between the two groups. A Cohen's d effect size around .2 is considered 'small', around .5 is considered 'medium', and around .8 is considered 'large'.

The formula for Cohen's d is:
𝑑 = (mean of group 1 − mean of group 2) / pooled standard deviation

In [6]:
# Cohen's d formula:
def Cohen_d(group1, group2):
    '''This function takes in two groups of data and calculates the Cohen's d value between them.'''

    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()

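    # Calculate the pooled variance (weighted by n here; the textbook formula weights by n - 1, which gives nearly the same result for large samples)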
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    
    return d

In [7]:
# Run Cohen's d for our data:
Cohen_d(high_star,low_star)

0.1254212679098783

With a low Cohen's d value of .125, we can say having a high Yelp rating has only a small effect on the number of reviews.  
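For reference, the textbook pooled variance weights each group's sample variance by its degrees of freedom (n - 1) rather than by n. Here is a minimal sketch of that variant; `cohen_d_pooled` is a hypothetical name used only for this illustration and is not the function used elsewhere in this notebook. With samples as large as ours the two versions should give almost identical values.

```python
import numpy as np

def cohen_d_pooled(group1, group2):
    '''Cohen's d using the pooled standard deviation weighted by degrees of freedom.'''
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)

    # Sample variances (ddof=1 matches the default of pandas' .var())
    var1 = g1.var(ddof=1)
    var2 = g2.var(ddof=1)

    # Pooled variance: each group's variance weighted by (n - 1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Tiny worked example: means of 5 and 4, both sample variances equal to 20/3
example = cohen_d_pooled([2, 4, 6, 8], [1, 3, 5, 7])
print(round(example, 4))  # 0.3873
```

In the worked example the difference of means is 1 and the pooled standard deviation is sqrt(20/3), so d = sqrt(3/20), or about 0.387.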

## 4.2 Inspection Grade vs. # of Yelp Reviews
Let's now run the same analysis looking at the relationship between inspection grades and # of Yelp reviews.
 
**𝐻0  = There is no difference in the number of Yelp reviews received by restaurants with a high inspection grade (A)  vs. a low inspection grade (B or C).  
𝐻𝐴  = The number of Yelp reviews received by restaurants with high inspection grades is statistically greater than the number of Yelp reviews received by restaurants with low inspection grades.**

Let's first identify the data we will be using as either a high grade (A) or not a high grade (not an A).

In [8]:
# Select only data that has an actual inspection grade of A, B, or C.
df_merged_temp = df_merged.loc[df_merged['GRADE'].isin(['A','B','C'])].copy()

# Select data needed for analysis:
high_grade = df_merged_temp[df_merged_temp['GRADE']=='A']['review_count']
low_grade = df_merged_temp[df_merged_temp['GRADE']!='A']['review_count']

# Compare mean number of reviews for high inspection grades vs. low inspection grades:
print('High Grade Mean:',high_grade.mean())
print('Low Grade Mean:',low_grade.mean())

High Grade Mean: 345.80331125827814
Low Grade Mean: 309.01020408163265


The mean for having an A grade seems to be a bit higher than the mean for lower inspection grades. Let's move forward with the rest of our hypothesis testing by checking if the variances and sample sizes are equal or different.

In [9]:
# Test whether variances and sample size are equal:
print('Are variances equal?:',np.var(high_grade) == np.var(low_grade))
print('Are sample sizes equal?:',len(high_grade) == len(low_grade))

Are variances equal?: False
Are sample sizes equal?: False


Since our sample sizes and variances are not equal, we will proceed with a 1 tailed Welch's t-test using an alpha value of .05.

In [10]:
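# Run 1-sided Welch's t-test:
result2 = stats.ttest_ind(high_grade, low_grade, equal_var = False) # Welch's t-test (returns a 2-sided p-value)
# Halve the 2-sided p-value for the 1-tailed test; this is only valid when the t-statistic is positive (the direction of our alternative hypothesis):
p_one_tailed2 = result2[1]/2
print('Reject Null Hypothesis' if (result2[0]>0 and p_one_tailed2<.05) else 'Failed to Reject Null Hypothesis')
print('t-statistic:',result2[0],'p-value:',p_one_tailed2)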

Reject Null Hypothesis
t-statistic: 1.6560018608590783 p-value: 0.049565266142070984


Our p-value is just under the alpha value threshold of .05, so we can reject the null hypothesis here. Therefore, we can say that there is a statistically significant relationship between the inspection grade given and the number of Yelp reviews present for a restaurant, although with a p-value this close to the threshold the evidence is fairly weak.

Let's take a look at the effect size here.

In [11]:
# Run Cohen's d for our data:
Cohen_d(high_grade,low_grade)

0.11917228221276982

With a low Cohen's d score of .119, we can say there is a small effect size.

## 4.3 Cuisine Type vs. # of Yelp Reviews

Let's take a look at the relationship between the type of cuisine and the number of Yelp reviews.

**𝐻0  = There is no relationship between the cuisine type and the # of Yelp reviews.
<br>𝐻𝐴  = There is at least one cuisine type with a significantly different number of Yelp reviews.<br>**

We will use an ANOVA test here rather than a t-test since we are looking at multiple different groups (i.e. each cuisine is considered a different group). Similar to the t-tests used earlier, ANOVA will allow us to compare the means of each cuisine type to see if there are any statistical differences.

In [12]:
# Re-name the cuisine description column to have no spaces so that we can run it through the anova test.
df_merged.rename(columns={'CUISINE DESCRIPTION': 'CUISINE_DESCRIPTION'}, inplace=True)

# Perform ANOVA test (typ=2 requests Type II sums of squares). Use C() when working with categorical data
formula1 = 'review_count ~ C(CUISINE_DESCRIPTION)'
lm1 = ols(formula1, df_merged).fit()
result1 = sm.stats.anova_lm(lm1, typ=2)
print(result1)

                              sum_sq      df         F        PR(>F)
C(CUISINE_DESCRIPTION)  2.588763e+07    74.0  3.887362  1.035454e-25
Residual                3.469205e+08  3855.0       NaN           NaN


We get a very low p-value that is much smaller than our alpha of .05. Therefore, we can reject our null hypothesis.  
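For a model with a single categorical factor like this one, a one-way ANOVA can also be run with `scipy.stats.f_oneway`, which takes one array of values per group. The sketch below is only an illustration on synthetic data: the three cuisine labels and their values are made up, with one group deliberately given a higher mean.

```python
import numpy as np
from scipy import stats

# Synthetic review counts for three made-up cuisine groups (not the real data)
rng = np.random.default_rng(2)
groups = {
    'cuisine_a': rng.normal(loc=300, scale=50, size=40),
    'cuisine_b': rng.normal(loc=300, scale=50, size=40),
    'cuisine_c': rng.normal(loc=400, scale=50, size=40),  # deliberately different mean
}

# One-way ANOVA: the null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(*groups.values())

print('F-statistic:', f_stat, 'p-value:', p_value)
print('Reject Null Hypothesis' if p_value < .05 else 'Failed to Reject Null Hypothesis')
```

With our real data, the equivalent call would pass one `review_count` series per cuisine, for example built from a `groupby` on the cuisine column.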

This ANOVA test only tells us that at least one of the cuisines significantly differs from one (or multiple) other cuisines. While this is helpful to know, it would be even more helpful to know which cuisines are generating a significantly higher number of reviews, and which (if any) are similar to each other. To do this, we can perform a multiple comparisons analysis, which will compare all possible pairwise groups of means, and use Tukey's HSD test to determine the statistical significance of these comparisons. Since there are 75 cuisine types (and therefore 2,775 possible pairs), I will just show the rows of the Tukey results that reject the null hypothesis.

In [13]:
# Perform Tukey's HSD Test:
mc1 = MultiComparison(df_merged['review_count'], df_merged['CUISINE_DESCRIPTION'])
mc1_results = mc1.tukeyhsd()

# Convert results to a dataframe:
tukey_data1 = pd.DataFrame(data=mc1_results._results_table.data[1:], columns = mc1_results._results_table.data[0])

# Select only rows where we are rejecting the null hypothesis:
tukey_data1 = tukey_data1.loc[tukey_data1['reject']==True]
tukey_data1

| | group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|---|
| 160 | American | Chicken | -190.956 | 0.0457 | -380.7624 | -1.1496 | True |
| 161 | American | Chinese | -100.9642 | 0.0011 | -184.4225 | -17.5058 | True |
| 195 | American | Other | -193.606 | 0.001 | -340.2303 | -46.9818 | True |
| 198 | American | Pizza | -126.8779 | 0.0166 | -245.7389 | -8.0169 | True |
| 636 | Barbecue | Chicken | -389.7667 | 0.0241 | -762.6087 | -16.9247 | True |
| 671 | Barbecue | Other | -392.4167 | 0.0074 | -745.2386 | -39.5947 | True |
| 679 | Barbecue | Salads | -410.4542 | 0.0238 | -802.8395 | -18.0688 | True |
| 680 | Barbecue | Sandwiches | -430.3101 | 0.0292 | -846.4762 | -14.1441 | True |
| 898 | Café/Coffee/Tea | French | 190.2045 | 0.0419 | 2.2071 | 378.2019 | True |
| 909 | Café/Coffee/Tea | Italian | 177.7321 | 0.0192 | 9.9171 | 345.547 | True |


We have a number of statistically significant pairs here. We can use the meandiff column above to determine which of the two groups has a higher mean for each combination of cuisines: meandiff is the mean of group2 minus the mean of group1, so a negative value means group1 has the higher mean. Next, I will create a function that allows us to take a look at the effect size of each of those pairings that are statistically significant.
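The same kind of table can also be built without touching the private `_results_table` attribute, by calling `pairwise_tukeyhsd` and reading the rows from `summary().data`. The sketch below is an illustration on synthetic data: the `toy` dataframe and its three groups are made up, with group `c` deliberately given a higher mean.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data: groups 'a' and 'b' share a mean, group 'c' is clearly higher
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'value': np.concatenate([rng.normal(100, 10, 30),
                             rng.normal(100, 10, 30),
                             rng.normal(130, 10, 30)]),
    'group': ['a'] * 30 + ['b'] * 30 + ['c'] * 30,
})

# Run Tukey's HSD across all pairs of groups
tukey = pairwise_tukeyhsd(endog=toy['value'], groups=toy['group'], alpha=0.05)

# summary() returns a table whose .data is a list of rows, with the header first
table = tukey.summary().data
tukey_df = pd.DataFrame(table[1:], columns=table[0])

# Keep only the pairs where the null hypothesis is rejected
significant = tukey_df[tukey_df['reject'].astype(bool)]
print(significant)
```

Here both pairs involving `c` should be flagged, and their positive meandiff values reflect that `c` appears as group2 with the higher mean.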

In [14]:
# Define a function to iterate through all of our desired combinations and find the cohen's d of each combination:
def multi_cohen_d(values_list, data_column, value, column_label1, column_label2):
    '''This function will evaluate Cohen's d between multiple identified pairs of data.
    Inputs:
        - values_list: list containing lists of each pair of values to evaluate
        - data_column: column from full dataset that is used to match with the data in values_list 
        - value: column from full dataset that includes the actual data values you want to extract
        - column_label1: label for the column of the dataframe we are creating that includes group1
        - column_label2: label for the column of the dataframe we are creating that includes group2
    
    Returns:
        - A dataframe listing the components of each group and the corresponding Cohen's d'''
    
    rows = []
    for x in values_list:
        cohen = Cohen_d(df_merged[df_merged[data_column]==x[0]][value],
                        df_merged[df_merged[data_column]==x[1]][value])
        # Collect each result in a list (DataFrame.append was removed in pandas 2.0)
        rows.append({column_label1:x[0],column_label2:x[1], "Cohen's d":cohen})
    return pd.DataFrame(rows, columns = [column_label1,column_label2,"Cohen's d"])

# Identify the values we want to utilize to find Cohen's d:
cuisine_list = tukey_data1.loc[tukey_data1['reject']==True].iloc[:,:2].values.tolist()

# Run function to see all Cohen's d values:
print("Cohen's d Chart:")
multi_cohen_d(cuisine_list, 'CUISINE_DESCRIPTION', 'review_count','Cuisine1', 'Cuisine2')

Cohen's d Chart:


| | Cuisine1 | Cuisine2 | Cohen's d |
|---|---|---|---|
| 0 | American | Chicken | 0.615966 |
| 1 | American | Chinese | 0.326166 |
| 2 | American | Other | 0.630111 |
| 3 | American | Pizza | 0.409269 |
| 4 | Barbecue | Chicken | 1.441044 |
| 5 | Barbecue | Other | 1.543021 |
| 6 | Barbecue | Salads | 1.79114 |
| 7 | Barbecue | Sandwiches | 1.61254 |
| 8 | Café/Coffee/Tea | French | -0.66325 |
| 9 | Café/Coffee/Tea | Italian | -0.632607 |


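It looks like we have a variety of effect sizes here, from fairly small (American vs. Chinese, .33) through medium (American vs. Chicken, .62) to large (all of the Barbecue pairings, above 1.4). The sign of Cohen's d only reflects which cuisine is listed first: a positive value means Cuisine1 has the higher mean.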
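To label a whole column of Cohen's d values at once, a small helper can turn each value into a category. This is only a sketch: `effect_size_label` is a hypothetical helper, and the hard cut-offs at .5 and .8 are my assumption, since the rule of thumb above only says "around" .2, .5, and .8.

```python
def effect_size_label(d):
    '''Label a Cohen's d value using the common small/medium/large rule of thumb.'''
    magnitude = abs(d)  # the sign only indicates direction
    if magnitude < 0.5:
        return 'small'
    if magnitude < 0.8:
        return 'medium'
    return 'large'

# A few of the values from the chart above
for pair, d in [('American vs. Chinese', 0.326166),
                ('Café/Coffee/Tea vs. French', -0.66325),
                ('Barbecue vs. Salads', 1.79114)]:
    print(pair, '->', effect_size_label(d))
```

Applied to a results dataframe, this could be used as `df["Cohen's d"].apply(effect_size_label)`.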

## 4.4 Inspection Grade vs. Neighborhood, Price, and Cuisine Type

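Let's now run an analysis looking at the relationship between inspection results and the neighborhood, price, and type of cuisine for restaurants. Here we will use the SCORE column, which holds the numeric inspection score that each letter grade is based on. Note that in NYC inspections a lower score is better, since the score counts violation points.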
 
**𝐻0  = There is no relationship between the inspection grade and the neighborhood, price, and cuisine type.
<br>𝐻𝐴  = There is a relationship between the inspection grade and the neighborhood, price, and cuisine type.<br>**


The first thing we need to do is remove any values in the SCORE column that are not numerical. We do this because our analysis will only run with numerical data for our target variable (the SCORE column). We'll also update the data type of the SCORE column to ensure it is numerical.

In [15]:
# Remove rows with 'PEND' values so we only have restaurants with a numerical score:
df_merged.drop(df_merged.loc[df_merged['SCORE']=='PEND'].index, inplace=True)

# Update the SCORE column type to be integers so it is a numerical column:
df_merged['SCORE'] = df_merged['SCORE'].astype(int)
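If the SCORE column could contain other non-numeric entries besides 'PEND', a more defensive option is `pd.to_numeric` with `errors='coerce'`, which turns anything that is not a number into NaN so it can be dropped in one step. The sketch below runs on a small made-up dataframe (`toy`), not on `df_merged`.

```python
import pandas as pd

# Made-up SCORE values, including a 'PEND' entry and a missing value
toy = pd.DataFrame({'SCORE': ['12', 'PEND', '7', None, '27']})

# Convert to numbers; anything that cannot be parsed becomes NaN
toy['SCORE'] = pd.to_numeric(toy['SCORE'], errors='coerce')

# Drop the rows without a numeric score, then store the column as integers
toy = toy.dropna(subset=['SCORE']).astype({'SCORE': int})

print(toy['SCORE'].tolist())  # [12, 7, 27]
```

This avoids having to list every possible placeholder value ahead of time.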

Now that our data is ready, we can move on with our hypothesis test. We will use an ANOVA test here since we are looking at multiple different groups.

In [16]:
# Perform ANOVA test (typ=2 requests Type II sums of squares). Use C() when working with categorical data
formula2 = 'SCORE ~ C(neighborhood)+C(price_value)+C(CUISINE_DESCRIPTION)'
lm2 = ols(formula2, df_merged).fit()
result2 = sm.stats.anova_lm(lm2, typ=2)
print(result2)

                               sum_sq      df         F    PR(>F)
C(neighborhood)           1934.925964    20.0  1.397902  0.111294
C(price_value)            1333.055664     4.0  4.815383  0.000713
C(CUISINE_DESCRIPTION)    6326.960270    74.0  1.235395  0.085385
Residual                255516.637277  3692.0       NaN       NaN


Only the price has a p-value lower than .05, so for price we can reject the null hypothesis and say that it has a relationship with the inspection score. The neighborhood and cuisine type both have p-values higher than .05, so for these two factors we fail to reject the null hypothesis.

To understand which prices are statistically significant, we will run a Tukey HSD test.

In [17]:
# Perform Tukey's HSD Test:
mc2 = MultiComparison(df_merged['SCORE'], df_merged['price_value'])
mc2_results = mc2.tukeyhsd()

# Convert results to a dataframe:
tukey_data2 = pd.DataFrame(data=mc2_results._results_table.data[1:], columns = mc2_results._results_table.data[0])
print("Price vs. Score:",'\n',tukey_data2)

Price vs. Score: 
    group1  group2  meandiff   p-adj   lower   upper  reject
0       0       1   -1.0846  0.3290 -2.6602  0.4910   False
1       0       2    0.2255  0.9000 -1.1455  1.5966   False
2       0       3   -0.0002  0.9000 -1.6869  1.6865   False
3       0       4   -2.0171  0.1299 -4.3613  0.3272   False
4       1       2    1.3102  0.0047  0.2818  2.3385    True
5       1       3    1.0844  0.2286 -0.3379  2.5067   False
6       1       4   -0.9324  0.7373 -3.0943  1.2294   False
7       2       3   -0.2257  0.9000 -1.4174  0.9659   False
8       2       4   -2.2426  0.0206 -4.2602 -0.2250    True
9       3       4   -2.0169  0.1019 -4.2610  0.2273   False


From these Tukey results we see that most prices do not generate statistically different inspection scores. However, the price range of 2 dollar signs does show statistical differences to both 1 dollar sign and 4 dollar signs. Next, I will calculate the effect size of each combination of prices that lead to us rejecting the null hypothesis.

In [18]:
# Identify the values we want to utilize to find Cohen's d:
price_list = tukey_data2.loc[tukey_data2['reject']==True].iloc[:,:2].values.tolist()

# Run function to see all Cohen's d values:
print("Cohen's d Chart:")
multi_cohen_d(price_list, 'price_value', 'SCORE','Price1', 'Price2')

Cohen's d Chart:


| | Price1 | Price2 | Cohen's d |
|---|---|---|---|
| 0 | 1.0 | 2.0 | -0.156453 |
| 1 | 2.0 | 4.0 | 0.263009 |


In both cases the effect size is on the small side.

# Conclusion:

In summary, here are the results of our hypothesis test for each of our four questions:

**1. Does a restaurant's Yelp rating influence how many Yelp reviews the restaurant will receive?**
    <br>--> Reject Null Hypothesis, indicating a higher Yelp rating does lead to having a greater number of Yelp reviews received. However, the effect of this is very small.<br><br>
**2. Does a restaurant's inspection grade influence how many Yelp reviews the restaurant will receive?**
    <br>--> Reject Null Hypothesis, indicating having a higher inspection grade does lead to having a higher number of Yelp reviews received, though the effect size is small.<br><br>
**3. Does the type of cuisine influence how many Yelp reviews the restaurant will receive?**
    <br>--> Reject Null Hypothesis, indicating some cuisine types do impact the number of Yelp reviews received.<br><br>
**4. Is there a relationship between the Inspection Grade and the Neighborhood, Price, or Cuisine Type?**
    <br>--> Reject Null Hypothesis, as some price levels can have an impact on the inspection grade, such as when we look at 1 dollar sign vs. 2 dollar signs or 2 dollar signs vs. 4 dollar signs. However, we fail to reject the null hypothesis when looking specifically at the impact of the neighborhood or the cuisine type on inspection grade. <br><br>

Based on these results, there are a few recommendations I would give to current or prospective restaurant owners:
- Consider a 4 dollar sign price level rather than a 2 dollar sign price level as these types of restaurants often receive better inspection grades. 
- Ensure your restaurant is up to code and has minimal violations so that you are more likely to receive a better inspection grade, which likely will lead to a greater number of Yelp reviews, which in turn can draw more customers into your restaurant.
- Ensure customers have an enjoyable experience at your restaurant so that they will not only give a high Yelp rating, but will also leave a positive review which can encourage other potential customers to try your restaurant.
- When trying to ensure a strong inspection grade, cuisine type and neighborhood do not play a significant factor, so no limitations need to be considered in respect to these two aspects.

# Next Steps:
A few things to look into next:
- Does the price impact the number of Yelp reviews?
- Does having a critical violation impact the number of Yelp reviews or the inspection grade?
- Do certain neighborhoods have significantly higher Yelp ratings?
- Do certain neighborhoods have significantly higher inspection grades?