# Regression on review scores

## Introduction

In this notebook, I will fit a simple regression model on the reviewers' scores. The purpose of the model is to find the words/phrases that are indicative of a guest's review score. Can we do it? I am not sure right now, but we will find out later!

## Preparing data

In [1]:
import pandas as pd
df = pd.read_csv('../input/Hotel_Reviews.csv')
df.head()

In [2]:
print(df.shape)

In [3]:
# Concatenate both review texts; column-wise string addition avoids a slow row-wise apply
df['all_review'] = df['Positive_Review'] + ' ' + df['Negative_Review']

The dataset is fairly large, and we want the code to run quickly because kernels have a time limit. So I decided to train the model on 20% of the data and validate it on the remaining 80%. The validation set (80%) is split into three parts, and we will compare the validation statistics of each part separately. This is my usual validation strategy when the dataset is large or I do not have enough computational resources.

In [4]:
from sklearn.model_selection import train_test_split
train,test1 = train_test_split(df,test_size=0.8,random_state=42)
test1,test2 = train_test_split(test1,test_size=0.67,random_state=42)
test2,test3 = train_test_split(test2,test_size=0.5,random_state=42)
print(train.shape);print(test1.shape);print(test2.shape);print(test3.shape)
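
As a quick sanity check on the split above, the fractions implied by the three `train_test_split` calls can be worked out by hand. This is only a sketch of the arithmetic; the actual row counts can differ by a row or so because scikit-learn rounds the sizes.

```python
# Fractions of the full dataset implied by the three successive splits.
held_out = 0.8                             # first split: 80% held out for validation
train_frac = 1 - held_out                  # 20% kept for training
test1_frac = held_out * (1 - 0.67)         # second split keeps 33% of the held-out data
test2_frac = held_out * 0.67 * (1 - 0.5)   # the remaining 67% is halved
test3_frac = held_out * 0.67 * 0.5

fractions = {
    'train': train_frac,
    'test1': test1_frac,
    'test2': test2_frac,
    'test3': test3_frac,
}
for name, frac in fractions.items():
    print(f'{name}: {frac:.3f}')
```

So the three validation parts are roughly equal in size, each a bit larger than the training set.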

I plan to fit a TF-IDF vectorizer on the training set only and then apply it to both the training and test sets, in order to provide the features for the sklearn model.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
t = TfidfVectorizer(max_features=10000)
train_feats = t.fit_transform(train['all_review'])
test_feats1 = t.transform(test1['all_review'])
test_feats2 = t.transform(test2['all_review'])
test_feats3 = t.transform(test3['all_review'])
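
To make the fit/transform distinction concrete, here is a minimal sketch on made-up sentences (they are not taken from `Hotel_Reviews.csv`). The vocabulary is learned from the training texts only, so words that appear only in the test text are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
toy_train = ['great location friendly staff', 'rude staff small room']
toy_test = ['friendly staff but noisy street']

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(toy_train)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(toy_test)        # reuses them; nothing is re-learned

print(sorted(vec.vocabulary_))               # only words seen in toy_train
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)                       # number of non-zero entries in the test row
```

Both matrices share the same columns, which is what lets a model trained on one be applied to the other.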

After this step, the feature preparation is done, and we can start to think about a regressor.

## Model Fitting

In this part, we will try to fit a Gradient Boosting Regressor on the data set. The ultimate objective is to have a decent model that can tell us the importance of different words.

In [6]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [7]:
# Relatively large learning rate with fewer estimators, to keep training fast
gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.1, n_estimators=150)
gbdt.fit(train_feats,train['Reviewer_Score'])

This is a simple GBDT model without any careful parameter tuning. Let's evaluate the performance of the model.

In [8]:
pred_inbag = gbdt.predict(train_feats)
pred_test1 = gbdt.predict(test_feats1)
pred_test2 = gbdt.predict(test_feats2)
pred_test3 = gbdt.predict(test_feats3)

Let's first compare the mean absolute error on the in-bag (training) data and on the three out-of-bag (validation) samples.

In [9]:
MAEs = pd.DataFrame({
    'data': ['in_bag', 'out_bag1', 'out_bag2', 'out_bag3'],
    'MAE': [
        mean_absolute_error(train['Reviewer_Score'], pred_inbag),
        mean_absolute_error(test1['Reviewer_Score'], pred_test1),
        mean_absolute_error(test2['Reviewer_Score'], pred_test2),
        mean_absolute_error(test3['Reviewer_Score'], pred_test3),
    ],
})

In [10]:
MAEs

In [13]:
from ggplot import *
p = ggplot(MAEs,aes(x='data',weight='MAE')) + geom_bar()+theme_bw()+ggtitle('Mean Absolute Error of GBDT models')
print(p)

How about the Root Mean Squared Error?

In [16]:
RMSEs = pd.DataFrame({
    'data': ['in_bag', 'out_bag1', 'out_bag2', 'out_bag3'],
    'RMSE': [
        mean_squared_error(train['Reviewer_Score'], pred_inbag) ** 0.5,
        mean_squared_error(test1['Reviewer_Score'], pred_test1) ** 0.5,
        mean_squared_error(test2['Reviewer_Score'], pred_test2) ** 0.5,
        mean_squared_error(test3['Reviewer_Score'], pred_test3) ** 0.5,
    ],
})

In [17]:
RMSEs
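
For readers who want to see what the two metrics compute, here is a small sketch on made-up numbers (not the notebook's predictions). MAE averages the absolute errors, while RMSE takes the square root of the averaged squared errors, so it punishes large misses more and is never smaller than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy scores and predictions, for illustration only.
y_true = np.array([10.0, 8.0, 6.0, 9.0])
y_pred = np.array([9.0, 8.5, 7.0, 7.0])

errors = y_true - y_pred                 # [1.0, -0.5, -1.0, 2.0]
mae = np.mean(np.abs(errors))            # (1 + 0.5 + 1 + 2) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((1 + 0.25 + 1 + 4) / 4)

# The hand-computed values agree with the sklearn functions used above.
sk_mae = mean_absolute_error(y_true, y_pred)
sk_rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(mae, sk_mae)
print(rmse, sk_rmse)
```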

In [18]:
p = ggplot(RMSEs,aes(x='data',weight='RMSE')) + geom_bar()+theme_bw()+ggtitle('Root Mean Squared Error of GBDT models')
print(p)

We can see that the errors on the in-bag data and on the three out-of-bag samples are very close to each other. This doesn't mean the model is good enough, but at least it generalizes well. The errors are also within a tolerable range. Therefore, the feature importances of this model should be indicative of the guests' review scores.
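
One way to put a number on "close to each other" is the relative gap between the training error and each validation error. The helper below, `relative_gaps`, is a hypothetical name of my own, and the values fed to it are placeholders rather than this notebook's results; substitute the `MAE` or `RMSE` column computed earlier.

```python
def relative_gaps(train_error, test_errors):
    """Return (test - train) / train for each held-out error."""
    return [(err - train_error) / train_error for err in test_errors]

# Placeholder values for illustration only, not results from this notebook.
example_train_mae = 1.00
example_test_maes = [1.04, 1.05, 1.03]

gaps = relative_gaps(example_train_mae, example_test_maes)
print([f'{g:.1%}' for g in gaps])
```

Small and similar gaps across the three parts suggest the model is not overfitting badly, which is the point being made here.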

## Most important words

In [20]:
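# get_feature_names() was removed in scikit-learn 1.2; use it instead on versions older than 1.0
words = t.get_feature_names_out()
importance = gbdt.feature_importances_
impordf = pd.DataFrame({'Word': words,
                        'Importance': importance})
# Sort by importance (descending), breaking ties alphabetically
impordf = impordf.sort_values(['Importance', 'Word'], ascending=[False, True])
# Check the top 30 most important words
impordf.head(30)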

Words with strong emotional implications (like "not", "rude", etc.) get higher scores in the feature importance table.
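
Keep in mind that GBDT feature importances are unsigned: they say a word matters, not whether it pushes the score up or down. A rough way to check the direction, sketched here on invented toy data, is to compare the mean score of reviews that contain a word with the mean score of those that do not. The helper `score_shift` is a hypothetical name, not part of any library.

```python
import re
import pandas as pd

# Toy reviews and scores, for illustration only.
toy = pd.DataFrame({
    'all_review': ['staff was rude', 'lovely staff', 'rude and dirty', 'lovely quiet room'],
    'Reviewer_Score': [4.0, 9.0, 3.0, 10.0],
})

def score_shift(frame, word):
    """Mean score of reviews containing `word` minus the mean score of the rest."""
    pattern = rf'\b{re.escape(word)}\b'
    has_word = frame['all_review'].str.contains(pattern, case=False, regex=True)
    with_word = frame.loc[has_word, 'Reviewer_Score'].mean()
    without_word = frame.loc[~has_word, 'Reviewer_Score'].mean()
    return with_word - without_word

print(score_shift(toy, 'rude'))    # negative: reviews with "rude" score lower here
print(score_shift(toy, 'lovely'))  # positive: reviews with "lovely" score higher here
```

On the real data this could be applied to `train` for the top words in `impordf`, though it is only a crude comparison and ignores interactions between words.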

In [21]:
impordf.to_csv('Most_important_words.csv',index=False)

## Concluding Remark

With some really simple tricks, one can get a better model than this one. Here are some tips:
* Perform stemming and remove stopwords in preprocessing
* Set the n-gram range to (1,2) when fitting TF-IDF
* Use a smaller learning rate with a larger number of trees
* Fine-tune the model and try other algorithms
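
Some of these tips can be sketched as a single scikit-learn pipeline. The specific values (`learning_rate=0.05`, `n_estimators=300`) are assumptions picked for illustration, not tuned settings, and stemming is left out because it needs an extra library such as NLTK.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

improved = make_pipeline(
    # Drop English stopwords and add bigrams alongside single words
    TfidfVectorizer(max_features=10000, stop_words='english', ngram_range=(1, 2)),
    # Smaller learning rate, more trees (illustrative values, not tuned)
    GradientBoostingRegressor(max_depth=5, learning_rate=0.05, n_estimators=300),
)

# Usage with the notebook's data would look like:
# improved.fit(train['all_review'], train['Reviewer_Score'])
# pred = improved.predict(test1['all_review'])
print(improved)
```

Bigrams let the model see phrases such as "not clean" as one feature, at the cost of a larger vocabulary and longer training time.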

Please have fun with this dataset and be creative! Thanks!