# Analyzing Hotel Ratings on Tripadvisor

In this homework, we will analyze the data we scraped in Part 1 by fitting a regression model on the data.

** Task 1 (20 pts) **

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating.

For example, the average rating of a hotel is calculated as follows:

![Information to be scraped](traveler_ratings.png)

$$ \text{AVG_SCORE} = \frac{1*15 + 2*21 + 3*55 + 4*228 + 5*1271}{1590}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.

In [27]:
import pandas as pd
import numpy as np
data = pd.read_csv('part1.csv')

In [28]:
data

Unnamed: 0.1,Unnamed: 0,hotel_name,rating,count
0,0,"Marriott Vacation Club Pulse at Custom House, ...",Excellent,509
1,1,"Marriott Vacation Club Pulse at Custom House, ...",Very good,175
2,2,"Marriott Vacation Club Pulse at Custom House, ...",Average,13
3,3,"Marriott Vacation Club Pulse at Custom House, ...",Poor,13
4,4,"Marriott Vacation Club Pulse at Custom House, ...",Terrible,12
5,5,Boston Harbor Hotel,Excellent,1281
6,6,Boston Harbor Hotel,Very good,230
7,7,Boston Harbor Hotel,Average,55
8,8,Boston Harbor Hotel,Poor,20
9,9,Boston Harbor Hotel,Terrible,16


In [29]:
data['count'] = data['count'].apply(lambda x: x.replace(",", ""))

In [30]:
data = data[['hotel_name', 'rating', 'count']]
hotels = list(set(data['hotel_name'].values.tolist()))
avg_score = []
i = 0
while i < len(hotels):
    group = data.groupby('hotel_name').get_group(hotels[i])
    count = group['count'].values.tolist()
    count = list(map(int, count))
    score = (5*count[0] + 4*count[1] + 3*count[2] + 2*count[3] + 1*count[1])/sum(count)
    avg_score.append((hotels[i],score))
    i = i + 1

In [31]:
group = data.groupby('hotel_name').get_group(hotels[0])
group

Unnamed: 0,hotel_name,rating,count
230,Club Quarters Hotel in Boston,Excellent,687
231,Club Quarters Hotel in Boston,Very good,500
232,Club Quarters Hotel in Boston,Average,182
233,Club Quarters Hotel in Boston,Poor,66
234,Club Quarters Hotel in Boston,Terrible,50


In [32]:
feature = pd.read_csv('part2.csv')

In [33]:
groups = []
i = 0
while i < len(hotels):
    group = feature.groupby('hotel_name').get_group(hotels[i])
    groups.append((group['value_star'].mean(), group['location_star'].mean(), group['cleanliness_star'].mean()
                ,group['service_star'].mean(), group['rooms_star'].mean(), group['sleep_quality'].mean()))
    i = i + 1

In [34]:
feature.groupby('hotel_name').get_group(hotels[11])

Unnamed: 0,hotel_name,review_id,value_star,location_star,cleanliness_star,service_star,rooms_star,sleep_quality
41788,The Bostonian Boston,review_434155158,,,,3.0,4.0,4.0
41789,The Bostonian Boston,review_434154028,,5.0,5.0,5.0,,
41790,The Bostonian Boston,review_434027819,5.0,,,5.0,,5.0
41791,The Bostonian Boston,review_433169881,3.0,5.0,,4.0,,
41792,The Bostonian Boston,review_432950031,,,,,,
41793,The Bostonian Boston,review_432413090,,,3.0,3.0,3.0,
41794,The Bostonian Boston,review_431371996,,,,,,
41795,The Bostonian Boston,review_431348069,,,,5.0,5.0,5.0
41796,The Bostonian Boston,review_431027224,,,,,,
41797,The Bostonian Boston,review_430531691,5.0,5.0,,4.0,,


In [35]:
for i in range(len(groups)):
    groups[i] = avg_score[i] + groups[i]

In [36]:
df_feature = pd.DataFrame(groups, columns = ['hotel_name', 'avg_score','value_star','location_star','cleanliness_star','service_star','rooms_star','sleep_quality'])

In [37]:
df_feature = df_feature.dropna(how = 'any')

In [38]:
from sklearn import linear_model
attributes = ['value_star','location_star','cleanliness_star','service_star','rooms_star','sleep_quality']
X = df_feature[['value_star','location_star','cleanliness_star','service_star','rooms_star','sleep_quality']]
Y = df_feature['avg_score']

In [39]:
from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

In [40]:
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
y_train_pred = regr.predict(X_train)
y_test_pred = regr.predict(X_test)
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

print('MSE train: %.3f, test: %.3f' %(mean_squared_error(Y_train, y_train_pred),
                                      mean_squared_error(Y_test, y_test_pred)))
 
print('R^2 train: %.3f, test: %.3f' %(r2_score(Y_train, y_train_pred),
                                      r2_score(Y_test, y_test_pred)))

MSE train: 0.007, test: 0.016
R^2 train: 0.951, test: 0.932


In [57]:
import statsmodels.formula.api as sm
model = sm.OLS(Y_train, X_train)
results = model.fit()
print(results.summary())
#notice that there are significant evidence that location_star, rooms_star and cleaniness have an effect on avg_score

                            OLS Regression Results                            
Dep. Variable:              avg_score   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.345e+04
Date:                Tue, 15 Nov 2016   Prob (F-statistic):           2.03e-84
Time:                        20:38:51   Log-Likelihood:                 59.103
No. Observations:                  56   AIC:                            -106.2
Df Residuals:                      50   BIC:                            -94.05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
value_star           0.1004      0.070  

In [42]:
model1 = sm.OLS(Y_train, X_train[['value_star', 'service_star', 'sleep_quality']])
results = model1.fit()
print(results.summary())
#regress avg_score on insignificant features above, and find service_star and sleep_quality
#are significant now

                            OLS Regression Results                            
Dep. Variable:              avg_score   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 2.775e+04
Date:                Tue, 15 Nov 2016   Prob (F-statistic):           1.16e-84
Time:                        20:37:50   Log-Likelihood:                 42.786
No. Observations:                  56   AIC:                            -79.57
Df Residuals:                      53   BIC:                            -73.50
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
value_star        0.1713      0.088      1.947

In [43]:
model2 = sm.OLS(Y_train, X_train['value_star'])
results = model2.fit()
print(results.summary())
#if we only regress on value_star, then it is also significant, and itself explains 99.5% variation

                            OLS Regression Results                            
Dep. Variable:              avg_score   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 2.340e+04
Date:                Tue, 15 Nov 2016   Prob (F-statistic):           5.13e-74
Time:                        20:37:50   Log-Likelihood:                 6.2567
No. Observations:                  56   AIC:                            -10.51
Df Residuals:                      55   BIC:                            -8.488
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
value_star     1.1227      0.007    152.961      0.0

In conclusion, all six features have an effect on the average score, but the features themselves are highly correlated, so if we regress avg_score on all of them, only some of them are significant, because the rest features have nothing to explain. However, if we regress avg_score on any feature individually, every feature will have a significantly contribution to avg_score. 

-------

** Task 3 (30 pts) **

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.

In [67]:
label = []
i = 0
while i < len(hotels):
    group = data.groupby('hotel_name').get_group(hotels[i])
    count = group['count'].values.tolist()
    count = list(map(int, count))
    pct_excellent = count[0]/sum(count)
    if pct_excellent > 0.6:
        label.append((hotels[i],'Excellent'))
    else:
        label.append((hotels[i],'Not_excellent'))
    i = i + 1

In [68]:
for i in range(len(groups)):
    groups[i] = label[i] + groups[i][2:]

In [69]:
df_part2 = pd.DataFrame(groups, columns = ['hotel_name', 'label','value_star','location_star','cleanliness_star','service_star','rooms_star','sleep_quality'])

In [70]:
df_part2 = df_part2.dropna(how = 'any')

In [71]:
X2 = df_part2[['value_star','location_star','cleanliness_star','service_star','rooms_star','sleep_quality']]
Y2 = df_part2['label']

In [49]:
from sklearn.linear_model import LogisticRegression
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y2, test_size=0.3, random_state=1)

In [97]:
lr = LogisticRegression(C=1000, random_state=0)
lr.fit(X_train2, y_train2)

LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [98]:
y_pred = lr.predict(X_test2)
print('Number of misclassified samples: %d' % (y_test2 != y_pred).sum())
from sklearn.metrics import accuracy_score

print('Accuracy: %.2f' % accuracy_score(y_test2, y_pred))    

print('Coefficients: \n', lr.coef_)
#Works very well! 

Number of misclassified samples: 0
Accuracy: 1.00
Coefficients: 
 [[-1.97668908 -1.24161023  4.70915864 -6.65381942 -2.61542549 -7.71429434]]


In [77]:
for i in range(len(Y2.values)):
    if Y2.values[i] == 'Not_excellent':
        Y2.values[i] = 0
    else:
        Y2.values[i] = 1
Y2 = Y2.astype('int64')

In [88]:
import statsmodels.api as sm
#Look at the detailed summary of the model
logit = sm.OLS(Y2,X2)

In [92]:
result1 = logit.fit() 
print (result1.summary())
# only room_star and cleanliness star are significant!

                            OLS Regression Results                            
Dep. Variable:                  label   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.479
Method:                 Least Squares   F-statistic:                     13.43
Date:                Tue, 15 Nov 2016   Prob (F-statistic):           2.79e-10
Time:                        20:46:54   Log-Likelihood:                -39.355
No. Observations:                  81   AIC:                             90.71
Df Residuals:                      75   BIC:                             105.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
value_star          -0.0422      0.263  

In [100]:
logit2 = sm.OLS(Y2,X2[['value_star', 'location_star', 'service_star', 'sleep_quality']])
result2 = logit2.fit() 
print (result2.summary())
#None of them are significant!

                            OLS Regression Results                            
Dep. Variable:                  label   R-squared:                       0.388
Model:                            OLS   Adj. R-squared:                  0.357
Method:                 Least Squares   F-statistic:                     12.23
Date:                Tue, 15 Nov 2016   Prob (F-statistic):           9.58e-08
Time:                        20:53:47   Log-Likelihood:                -48.997
No. Observations:                  81   AIC:                             106.0
Df Residuals:                      77   BIC:                             115.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
value_star       -0.2204      0.288     -0.765

Therefore, we can conclude that only room_star and cleanliness_star have a significant influence on whether a hotel is excellent or not! 

-------