**Happy to stay here or not? – Hotel reviews**

**Introduction**
Here I will use the hotel review data published by Anurag Sharma, which contains reviews written by customers.  
The data comes in two files, a training set and a test set. 
* *train.csv* – the training data. Each entry has a unique **User_ID**, the review entered by a customer, and the browser and device used. The target variable is **Is_Response**, which states whether the customer was **happy** or **not_happy** while staying in the hotel. Because the target takes one of two categories, this is a binary classification problem. 
* *test.csv* – the test data. It has the same columns as the training data, without the target variable. 


**Helper functions and libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pal = sns.color_palette()
from wordcloud import WordCloud, STOPWORDS

#text preprocessing
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))
import string
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.sparse import hstack, csr_matrix

#ML model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score


from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


**Load data**

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

**Overview of train data**

In [None]:
df_train.head()

**Overview of test data**

In [None]:
df_test.head()

In [None]:
print('Total number of reviews for training: {}'.format(len(df_train)))
print('Total number of reviews for testing: {}'.format(len(df_test)))

**Check for missing values in test and train**

In [None]:
df_train.isnull().sum().sum()

In [None]:
df_test.isnull().sum().sum()

**Preprocessing the train and test sets**

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df_train["Is_Response"] = labelencoder.fit_transform(df_train["Is_Response"])
#1 not happy, 0 happy

df_train["Device_Used"] = labelencoder.fit_transform(df_train["Device_Used"])
df_test["Device_Used"] = labelencoder.transform(df_test["Device_Used"])

df_train["Browser_Used"] = labelencoder.fit_transform(df_train["Browser_Used"])
df_test["Browser_Used"] = labelencoder.transform(df_test["Browser_Used"])
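As a quick check of the `#1 not happy, 0 happy` comment: `LabelEncoder` assigns integer codes in sorted order of the labels, so the mapping can be read from `classes_`. The sketch below uses made-up values (the real ones are in *train.csv*) and, as an assumption about what is convenient, one encoder per column so that each fitted mapping stays available for inspection.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Is_Response": ["not_happy", "happy", "happy", "not_happy"],
    "Device_Used": ["Mobile", "Desktop", "Tablet", "Mobile"],
})

# One encoder per column, so each fitted mapping can be inspected later.
encoders = {}
for col in ["Is_Response", "Device_Used"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ is sorted, and the position of a label is its integer code.
target_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Is_Response"].classes_)
}
print(target_mapping)
```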

**Overview after preprocessing**

In [None]:
df_train.head()

In [None]:
df_test.head()

**The target feature**

Is the target feature balanced?

In [None]:
ax = df_train['Is_Response'].value_counts().plot(kind='bar')

# collect the bar heights (class counts)
totals = []
for i in ax.patches:
    totals.append(i.get_height())

# total number of reviews, used to turn counts into percentages
total = sum(totals)

# label each bar with its share of the total
for i in ax.patches:
    ax.text(i.get_x()+0.1, i.get_height(),
            str(round((i.get_height()/total)*100, 1))+'%', fontsize=13,
            color='black')

The data is clearly imbalanced: about 68% of the reviews come from happy customers and approximately 32% from unhappy ones. This imbalance needs to be taken into account when evaluating the predictions later in the project. 
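One way to account for the imbalance during evaluation, sketched here on synthetic labels rather than the real data, is to stratify the cross-validation folds so that every fold keeps roughly the same 68/32 split. `StratifiedKFold` is scikit-learn's splitter for this; the array `y`, the placeholder `X`, and the fold count are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with about the same imbalance as the training set:
# 680 happy (0) and 320 not happy (1).
y = np.array([0] * 680 + [1] * 320)
X = np.zeros((len(y), 1))  # placeholder features, only the labels matter here

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Share of the minority class in each validation fold.
fold_shares = [float(y[val_idx].mean()) for _, val_idx in skf.split(X, y)]
print([round(share, 2) for share in fold_shares])
```

Such a splitter can be passed as `cv=` to `cross_val_score`. Setting `class_weight='balanced'` on the classifiers is another option that may be worth comparing.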

**Text preprocessing**

Some of the text in the Description column contains contractions, so the text needs to be expanded. Here I will use the function *decontracted* to expand it. 

In [None]:
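import re
def decontracted(phrase):
    """Expand common English contractions in a review text."""
    # specific cases first, before the generic n't rule
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # generic suffix rules
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # note: 's also marks possessives, which this rule turns into " is"
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'cause", " because", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"\'em", " them", phrase)
    # these two can no longer match: the 've rule above has already
    # rewritten every 've (e.g. couldn't've is already "could not have")
    phrase = re.sub(r"\'t've", " not have", phrase)
    phrase = re.sub(r"\'d've", " would have", phrase)
    # o'clock -> of the clock
    phrase = re.sub(r"\'clock", "f the clock", phrase)
    return phrase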

print("finished  decontracted")

Example of the function *decontracted*:

In [None]:
text = "very good hotel in the midst of it all.best:you can't starve:carnegie-deli next doordel frisco's and ruths chris some blocks awaygordon ramsay with - michelin-stars downstairs. park-view from vista-suites looking north"

In [None]:
decontracted(text)
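A caveat worth keeping in mind: the `'s` rule cannot tell a contraction from a possessive, so "frisco's" in the example text becomes "frisco is". The reduced copy of the function below (only two of the rules, under the hypothetical name `decontracted_mini`) shows the intended expansion and this side effect together on a made-up sentence.

```python
import re

def decontracted_mini(phrase):
    """Reduced, illustrative copy of decontracted with two of its rules."""
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    return phrase

sample = "you can't miss del frisco's steakhouse"
expanded = decontracted_mini(sample)
print(expanded)  # you can not miss del frisco is steakhouse
```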

Let's apply the function *decontracted* to the Description column in test and train:

In [None]:
df_train["Description"] = df_train["Description"].apply(decontracted)
df_test["Description"] = df_test["Description"].apply(decontracted)
df_train.head()

**Most frequent Description words**

In [None]:
train_desc = pd.Series(df_train['Description'].tolist()).astype(str)
cloud = WordCloud(width=1440, height=1080,stopwords=STOPWORDS).generate(" ".join(train_desc.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(cloud)
plt.title("Most frequent words in the Description column")
plt.axis('off')

Oh WOW! The most frequent term in the hotel reviews is "front desk"! This is very interesting, because my first thought was that the most frequent term would be something like "comfortable bed" or "breakfast". It suggests that people who write reviews about hotels, positive or negative, refer to the front desk as a central part of their experience.

Now let's split the *Description* column in two, happy reviews and not happy reviews, and see which words appear most often in each.

In [None]:
happy=df_train[df_train["Is_Response"]==0]
not_happy=df_train[df_train["Is_Response"]==1]

train_happy = pd.Series(happy['Description'].tolist()).astype(str)
train_not_happy = pd.Series(not_happy['Description'].tolist()).astype(str)

In [None]:
cloud_happy = WordCloud(background_color="white",max_words=50,width=300, height=300,stopwords=STOPWORDS).generate(" ".join(train_happy.astype(str)))
cloud_not_happy = WordCloud(background_color="white",max_words=50,width=300, height=300,stopwords=STOPWORDS).generate(" ".join(train_not_happy.astype(str)))

fig, axes = plt.subplots(ncols=2, figsize=(10, 5))
ax = axes[0]
ax.imshow(cloud_happy)
ax.set_title("Happy")
ax.axis('off')

ax = axes[1]
ax.imshow(cloud_not_happy)
ax.set_title("Not Happy")
ax.axis('off')

plt.show()

Top 5 words for **happy** reviews:
1. *hotel*  
2. *one*
3. *front desk*
4. *room* 
5. *even*

Top 5 words for **not happy** reviews:
1. *room*  
2. *hotel*
3. *one*
4. *front desk*
5. *stay*

The top 4 words for happy and not happy reviews are the same, only in a different order. The term "front desk" pops up again as one of the most common in both groups.  
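"front desk" shows up as a single entry because `WordCloud` also counts frequent two-word collocations by default. As a rough cross-check that does not depend on reading font sizes, single words and adjacent word pairs can be counted directly. The sketch below does this with `collections.Counter` on three invented reviews; the stop-word set is a tiny hand-picked stand-in, not the `STOPWORDS` set used above.

```python
import re
from collections import Counter

# Invented reviews, used only to demonstrate the counting.
reviews = [
    "The front desk was helpful and the room was clean",
    "Front desk staff ignored us and the room was noisy",
    "Great room, friendly front desk",
]
stop = {"the", "was", "and", "us"}  # minimal stand-in stop-word list

unigrams, bigrams = Counter(), Counter()
for review in reviews:
    tokens = [t for t in re.findall(r"[a-z']+", review.lower()) if t not in stop]
    unigrams.update(tokens)
    # Adjacent token pairs after stop-word removal.
    bigrams.update(zip(tokens, tokens[1:]))

print(unigrams.most_common(3))
print(bigrams.most_common(1))
```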

**Length of a review** 

Let's look at the length of each hotel review by its characters and words in the text: 

In [None]:
fig, ax = plt.subplots(1,2,figsize=(10,5))

dist_happy_char = train_happy.apply(len)
dist_not_happy_char = train_not_happy.apply(len)

# density=True replaces the normed=True argument, which newer matplotlib versions no longer accept
ax[0].hist(dist_happy_char, bins=100, range=[0, 15000], color=pal[1], density=True, label='happy')
ax[0].hist(dist_not_happy_char, bins=100, range=[0, 15000], color=pal[2], density=True, alpha=0.5, label='not_happy')
ax[0].set_title('Normalised histogram of '+ r"$\bf{" + 'character' + "}$"+ ' \n count in review description')
ax[0].legend()
ax[0].set_xlabel('Number of characters')
ax[0].set_ylabel('Probability')

print(' for number of characters: \n mean-happy {:.2f} std-happy {:.2f} \n mean-not_happy {:.2f} std-not_happy {:.2f} \n max-happy {:.2f} \n max-not_happy {:.2f}'.format(
        dist_happy_char.mean(), dist_happy_char.std(), dist_not_happy_char.mean(), dist_not_happy_char.std(), dist_happy_char.max(), dist_not_happy_char.max()))

dist_happy_word = train_happy.apply(lambda x: len(x.split(' ')))
dist_not_happy_word = train_not_happy.apply(lambda x: len(x.split(' ')))

ax[1].hist(dist_happy_word, bins=100, range=[0, 2400], color=pal[1], density=True, label='happy')
ax[1].hist(dist_not_happy_word, bins=100, range=[0, 2400], color=pal[2], density=True, alpha=0.5, label='not_happy')
ax[1].set_title('Normalised histogram of '+ r"$\bf{" + 'word' + "}$"+ ' \n count in review description')
ax[1].legend()
ax[1].set_xlabel('Number of words')
ax[1].set_ylabel('Probability')

print('for number of words: \n mean-happy {:.2f} std-happy {:.2f} \n mean-not_happy {:.2f} std-not_happy {:.2f} \n max-happy {:.2f} \n max-not_happy {:.2f}'.format(dist_happy_word.mean(), 
                          dist_happy_word.std(), dist_not_happy_word.mean(), dist_not_happy_word.std(), dist_happy_word.max(), dist_not_happy_word.max()))

plt.show()

Both graphs look very similar, but a closer look suggests that *happy* reviews tend to be shorter and *not_happy* reviews tend to be longer. This might help the classification model, so let's add the review lengths as features to the data.

In [None]:
#for words
df_train["num_words"] = df_train["Description"].apply(lambda x: len(str(x).split()))
df_test["num_words"] = df_test["Description"].apply(lambda x: len(str(x).split()))
#for chars
df_train["num_chars"] = df_train["Description"].apply(lambda x: len(str(x)))
df_test["num_chars"] = df_test["Description"].apply(lambda x: len(str(x)))

df_train.head()
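Note that the histograms count words with `split(' ')`, while the `num_words` feature uses `split()`. The two agree only when words are separated by exactly one space, as this small check on a made-up string shows.

```python
text = "great  location,\nfriendly staff "  # double space, newline, trailing space

# split(' ') cuts on every single space, so runs of spaces yield empty
# strings and the newline does not separate words.
tokens_single_space = text.split(' ')

# split() cuts on any run of whitespace and drops empty strings.
tokens_whitespace = text.split()

print(len(tokens_single_space), tokens_single_space)
print(len(tokens_whitespace), tokens_whitespace)
```

On this string the first variant reports 5 "words" and the second 4, so the statistics printed with the histograms can differ slightly from the feature values.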

In [None]:
'''
#extracting more features from the text

import string
def unique_word_fraction(row):
    """function to calculate the fraction of unique words on total words of the text"""
    text = row['Description']
    text_splited = text.split(' ')
    text_splited = [''.join(c for c in s if c not in string.punctuation) for s in text_splited]
    text_splited = [s for s in text_splited if s]
    word_count = text_splited.__len__()
    unique_count = list(set(text_splited)).__len__()
    return (unique_count/word_count)


eng_stopwords = set(stopwords.words("english"))
def stopwords_count(row):
    """ Number of stopwords fraction in a text"""
    text = row['Description'].lower()
    text_splited = text.split(' ')
    text_splited = [''.join(c for c in s if c not in string.punctuation) for s in text_splited]
    text_splited = [s for s in text_splited if s]
    word_count = text_splited.__len__()
    stopwords_count = len([w for w in text_splited if w in eng_stopwords])
    return (stopwords_count/word_count)


def punctuations_fraction(row):
    """function to calculate the fraction of punctuation characters over the total number of characters for a given text """
    text = row['Description']
    char_count = len(text)
    punctuation_count = len([c for c in text if c in string.punctuation])
    return (punctuation_count/char_count)


def fraction_noun(row):
    """function to give us fraction of noun over total words """
    text = row['Description']
    text_splited = text.split(' ')
    text_splited = [''.join(c for c in s if c not in string.punctuation) for s in text_splited]
    text_splited = [s for s in text_splited if s]
    word_count = text_splited.__len__()
    pos_list = nltk.pos_tag(text_splited)
    noun_count = len([w for w in pos_list if w[1] in ('NN','NNP','NNPS','NNS')])
    return (noun_count/word_count)

def fraction_adj(row):
    """function to give us fraction of adjectives over total words in given text"""
    text = row['Description']
    text_splited = text.split(' ')
    text_splited = [''.join(c for c in s if c not in string.punctuation) for s in text_splited]
    text_splited = [s for s in text_splited if s]
    word_count = text_splited.__len__()
    pos_list = nltk.pos_tag(text_splited)
    adj_count = len([w for w in pos_list if w[1] in ('JJ','JJR','JJS')])
    return (adj_count/word_count)

def fraction_verbs(row):
    """function to give us fraction of verbs over total words in given text"""
    text = row['Description']
    text_splited = text.split(' ')
    text_splited = [''.join(c for c in s if c not in string.punctuation) for s in text_splited]
    text_splited = [s for s in text_splited if s]
    word_count = text_splited.__len__()
    pos_list = nltk.pos_tag(text_splited)
    verbs_count = len([w for w in pos_list if w[1] in ('VB','VBD','VBG','VBN','VBP','VBZ')])
    return (verbs_count/word_count)


df_train['unique_word_fraction'] = df_train.apply(lambda row: unique_word_fraction(row), axis =1)
df_train['stopwords_count'] = df_train.apply(lambda row: stopwords_count(row), axis =1)
df_train['punctuations_fraction'] = df_train.apply(lambda row: punctuations_fraction(row), axis =1)
df_train['fraction_noun'] = df_train.apply(lambda row: fraction_noun(row), axis =1)
df_train['fraction_adj'] = df_train.apply(lambda row: fraction_adj(row), axis =1)
df_train['fraction_verbs'] = df_train.apply(lambda row: fraction_verbs(row), axis =1)
df_train.head()

#did not improve the classifier results
'''

**Sentiment analysis**

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
def sentiment_nltk(text):
    res = sia.polarity_scores(text)
    return res['compound']

In [None]:
df_train["sentiment"] = df_train["Description"].apply(sentiment_nltk)
#df_test["sentiment"] = df_test["Description"].apply(sentiment_nltk)

In [None]:
happy_sent=df_train[df_train["Is_Response"]==0]
not_happy_sent=df_train[df_train["Is_Response"]==1]

In [None]:
plt.figure()
plt.hist(happy_sent['sentiment'], bins=100, range=[-1, 1], color=pal[1], density=True, label='happy')
plt.hist(not_happy_sent['sentiment'], bins=100, range=[-1, 1], color=pal[2], density=True, alpha=0.5, label='not_happy')
plt.title('Normalised histogram from sentiment analysis')
plt.legend()
plt.xlabel('Sentiment analysis polarity score')
plt.ylabel('Probability')

Preparing for feature extraction from text

In [None]:
X_description = df_train['Description']

Preparing the data for classification

In [None]:
X = df_train.drop(['User_ID','Description','Is_Response'], axis=1)
Y = df_train['Is_Response']

Preparing CountVectorizer for classification

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english', max_features = 400)

X_ctv =  ctv.fit_transform(X_description)

In [None]:
from scipy.sparse import hstack, csr_matrix

feat_train = csr_matrix(X.values)

X_train_stack_ctv = hstack([feat_train, X_ctv[0:feat_train.shape[0]]])

print('Train shape: ', X_train_stack_ctv.shape)
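To make the stacking step concrete, here is a small self-contained sketch on invented reviews: two numeric columns are converted to a sparse matrix and placed to the left of the bag-of-words counts, so each row holds the handcrafted features followed by the n-gram counts. The texts and numeric values are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Invented reviews and two numeric features per review (e.g. device, length).
docs = ["clean room friendly staff", "dirty room rude staff", "friendly front desk"]
numeric = np.array([[0, 4], [1, 4], [2, 3]])

vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)  # sparse, one row per review

# Both blocks must have the same number of rows; columns are concatenated.
stacked = hstack([csr_matrix(numeric), counts]).tocsr()

print(numeric.shape, counts.shape, stacked.shape)
```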

Preparing TfidfVectorizer for classification

In [None]:
tfv = TfidfVectorizer(min_df=3,
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
            stop_words = 'english', max_features=400)

X_tfv =  tfv.fit_transform(X_description) 

In [None]:
X_train_stack_tfidf = hstack([feat_train, X_tfv[0:feat_train.shape[0]]])

print('Train shape: ', X_train_stack_tfidf.shape)
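The vectorizers above are fitted on the training descriptions only. If predictions on *test.csv* are needed later, the test descriptions would have to be mapped with `transform` (not `fit_transform`), so that both matrices share one vocabulary and column order. A minimal sketch with invented texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for the train and test Description columns.
train_texts = ["clean room and friendly staff", "rude staff and dirty room"]
test_texts = ["friendly staff but tiny bathroom"]

tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)

X_train_tfidf = tfidf.fit_transform(train_texts)  # learns vocabulary and idf weights
X_test_tfidf = tfidf.transform(test_texts)        # reuses them; unseen words are ignored

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

The same applies to the handcrafted columns: the `sentiment` column is currently computed for the training set only, so it would also have to be added to `df_test` before stacking the test features.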

Helper functions for cross-validated scoring

In [None]:
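from sklearn.metrics import precision_score
def classification_report_with_precision_score(y_true, y_pred):
    # collect the labels of every fold in the global lists so that one
    # classification report can be printed over all folds afterwards
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return precision_score(y_true, y_pred) # return precision score for this fold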

In [None]:
def model_cv(model,X,Y):
    # 10-fold cross-validation with shuffling
    outer_cv = KFold(n_splits=10, shuffle=True)
    clf = model
    # one precision score per fold; the scorer also fills the global lists
    nested_score = cross_val_score(clf, X=X, y=Y, cv=outer_cv, scoring = make_scorer(classification_report_with_precision_score))
    # report over the pooled predictions of all folds
    print(classification_report(originalclass, predictedclass)) 
    print ("mean precision score: " + str(model)+ ": %0.3f std: (%0.3f)" % (np.mean(nested_score),np.std(nested_score)))
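`model_cv` relies on the module-level lists `originalclass` and `predictedclass`, which have to be reset before every call. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the hotel data), is `cross_val_predict`, which returns one out-of-fold prediction per row so the pooled report can be built without globals. The function name `model_cv_pooled` and its parameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def model_cv_pooled(model, X, y, n_splits=10, seed=0):
    """Print a pooled classification report and return pooled precision."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each row is predicted by a model trained on the folds that exclude it.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    print(classification_report(y, y_pred))
    return precision_score(y, y_pred)

# Synthetic, imbalanced two-class problem standing in for the review features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.68, 0.32], random_state=0)
pooled_precision = model_cv_pooled(LogisticRegression(max_iter=1000), X_demo, y_demo)
print("pooled precision: %0.3f" % pooled_precision)
```

The pooled precision is computed once over all out-of-fold predictions, so it can differ slightly from the mean of the per-fold scores that `model_cv` prints.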

**Predictions with CountVectorizer**

In [None]:
from sklearn.linear_model import LogisticRegression
originalclass = []
predictedclass = []
model_cv(LogisticRegression(),X_train_stack_ctv,Y)

In [None]:
from sklearn.tree import DecisionTreeClassifier
originalclass = []
predictedclass = []
model_cv(DecisionTreeClassifier(),X_train_stack_ctv,Y)

In [None]:
from sklearn.ensemble import RandomForestClassifier
originalclass = []
predictedclass = []
model_cv(RandomForestClassifier(n_estimators = 40),X_train_stack_ctv,Y)

**Predictions with TfidfVectorizer**

In [None]:
originalclass = []
predictedclass = []
model_cv(LogisticRegression(),X_train_stack_tfidf,Y)

In [None]:
originalclass = []
predictedclass = []
model_cv(DecisionTreeClassifier(),X_train_stack_tfidf,Y)

In [None]:
originalclass = []
predictedclass = []
model_cv(RandomForestClassifier(n_estimators = 40),X_train_stack_tfidf,Y)