**Happy to stay here or not? – Hotel reviews**

**Introduction**
Here I will use the data published by Anurag Sharma about hotel reviews that were given by customers.  
The data is given in two files, a train file and a test file. 
* *train.csv* – the training data. Each entry has a unique **User_ID**, the review entered by a customer, and the browser and device used. The target variable is **Is_Response**, which states whether the customer was **happy** or **not_happy** while staying in the hotel. Because the target is categorical, this is a classification problem. 
* *test.csv* – the testing data. It has the same columns as the training data, without the target variable. 


**Helper functions and libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pal = sns.color_palette()
from wordcloud import WordCloud, STOPWORDS

#text preprocessing
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))
import string
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.sparse import hstack, csr_matrix

#ML model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score


from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


**Load data**

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

**Overview of train data**

In [None]:
df_train.head()

**Overview of test data**

In [None]:
df_test.head()

In [None]:
print('Total number of reviews for training: {}'.format(len(df_train)))
print('Total number of reviews for testing: {}'.format(len(df_test)))

**Check for missing values in test and train**

In [None]:
df_train.isnull().sum().sum()

In [None]:
df_test.isnull().sum().sum()

**Preprocessing the train and test sets**

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df_train["Is_Response"] = labelencoder.fit_transform(df_train["Is_Response"])
#1 not happy, 0 happy

df_train["Device_Used"] = labelencoder.fit_transform(df_train["Device_Used"])
df_test["Device_Used"] = labelencoder.transform(df_test["Device_Used"])

df_train["Browser_Used"] = labelencoder.fit_transform(df_train["Browser_Used"])
df_test["Browser_Used"] = labelencoder.transform(df_test["Browser_Used"])
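
To make the encoding easier to check, here is a minimal sketch on toy values (the lists below are illustrative stand-ins, not the real columns). It uses one `LabelEncoder` per column, so each fitted mapping stays available, and reads the label-to-code mapping from `classes_`, which scikit-learn stores in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real columns (values are illustrative only)
train_target = ["happy", "not happy", "happy"]
train_browser = ["Chrome", "Firefox", "Edge"]
test_browser = ["Edge", "Chrome", "Chrome"]

# Encode the target and recover which label got which integer code
target_encoder = LabelEncoder()
encoded_target = target_encoder.fit_transform(train_target)
target_mapping = {label: code for code, label in enumerate(target_encoder.classes_)}

# Fit on train, then reuse the same fitted encoder on test
browser_encoder = LabelEncoder()
encoded_train_browser = browser_encoder.fit_transform(train_browser)
encoded_test_browser = browser_encoder.transform(test_browser)

print(target_mapping)
print(list(encoded_test_browser))
```

Note that `transform` raises an error if the test column contains a value that was not seen during `fit`, so this pattern assumes both files share the same set of categories.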

**Overview after preprocessing**

In [None]:
df_train.head()

In [None]:
df_test.head()

**The target feature**

Is the target feature balanced?

In [None]:
ax = df_train['Is_Response'].value_counts().plot(kind='bar')
totals = []

# collect the height of each bar
for i in ax.patches:
    totals.append(i.get_height())

# total number of reviews
total = sum(totals)

# annotate each bar with its share of the total, in percent
for i in ax.patches:
    ax.text(i.get_x()+0.1, i.get_height(),
            str(round((i.get_height()/total)*100, 1))+'%', fontsize = 13,
            color = 'black')

The data is clearly imbalanced: about 68% of the reviews come from happy customers and approximately 32% from customers who were not happy. This imbalance in the target variable requires careful consideration in the prediction stage of this project. 
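
One possible way to account for the imbalance later is to weight the classes inversely to their frequency. The sketch below is only an illustration on toy labels that mimic the 68/32 split; it computes weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`.

```python
from collections import Counter

# Toy labels mirroring the roughly 68/32 split (0 = happy, 1 = not happy)
labels = [0] * 68 + [1] * 32

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Rare classes receive a larger weight than frequent ones
class_weight = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of `LogisticRegression`; whether weighting actually helps here would need to be checked with cross-validation.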

**Text preprocessing**

Some of the text in the Description column contains contractions, so it needs to be expanded. Here I will use the function *decontracted* to expand the text. 

In [None]:
import re
def decontracted(phrase):
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'cause", " because", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"\'em", " them", phrase)
    phrase = re.sub(r"\'t've", " not have", phrase)
    phrase = re.sub(r"\'d've", " would have", phrase)
    phrase = re.sub(r"\'clock", "f the clock", phrase)
    return phrase

print("finished  decontracted")

Example for the function *decontracted* :

In [None]:
text = "very good hotel in the midst of it all.best:you can't starve:carnegie-deli next doordel frisco's and ruths chris some blocks awaygordon ramsay with - michelin-stars downstairs. park-view from vista-suites looking north"

In [None]:
decontracted(text)
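
One thing to keep in mind about the rule-based approach: the `'s` rule cannot tell a contraction from a possessive. The small sketch below isolates that single rule in a hypothetical helper (`expand_apostrophe_s` is not part of the notebook) to show both cases.

```python
import re

def expand_apostrophe_s(phrase):
    # Same substitution as the 's rule in decontracted: every 's becomes " is"
    return re.sub(r"\'s", " is", phrase)

possessive = expand_apostrophe_s("del frisco's steakhouse")
contraction = expand_apostrophe_s("it's a good hotel")

print(possessive)
print(contraction)
```

The contraction is expanded as intended, while the possessive also gains an "is". Since stopwords such as "is" are removed later in the notebook, the effect on the final tokens is probably small.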

Let's apply the function *decontracted* to the Description column in train and test:

In [None]:
df_train["Description"] = df_train["Description"].apply(decontracted)
df_test["Description"] = df_test["Description"].apply(decontracted)
df_train.head()

**Most frequent Description words**

In [None]:
train_desc = pd.Series(df_train['Description'].tolist()).astype(str)
cloud = WordCloud(width=1440, height=1080,stopwords=STOPWORDS).generate(" ".join(train_desc.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(cloud)
plt.title("Most frequent words in the Description column")
plt.axis('off')

Oh WOW! The most frequent term in the hotel reviews is "front desk"! This is very interesting, because my first thought was that the most frequent terms would be something like "comfortable bed" or "breakfast". It suggests that people who write positive or negative reviews about hotels treat the front desk as a main aspect of their stay.

In [None]:
# function to plot most frequent terms
def freq_words(x, terms = 30):
  all_words = ' '.join([text for text in x])
  all_words = all_words.split()

  fdist = FreqDist(all_words)
  words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})

  # selecting the `terms` most frequent words (30 by default)
  d = words_df.nlargest(columns="count", n = terms) 
  plt.figure(figsize=(20,5))
  ax = sns.barplot(data=d, x= "word", y = "count")
  ax.set(ylabel = 'Count')
  plt.show()

In [None]:
import nltk
from nltk import FreqDist
freq_words(df_train['Description'])

The most common words are ‘the’, ‘and’, ‘to’, and so on. These words are not important for our task and do not tell any story, so we have to get rid of them. Before that, let’s remove the punctuation and numbers from our text data.

In [None]:
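# remove unwanted characters, numbers and symbols
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df_train['Description'] = df_train['Description'].str.replace("[^a-zA-Z#]", " ", regex=True)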

In [None]:
#Let’s try to remove the stopwords and short words (fewer than 3 letters) from the reviews.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new

# remove short words (length < 3)
df_train['Description'] = df_train['Description'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

# make entire text lowercase first, because the NLTK stopword list is lowercase
reviews = [r.lower() for r in df_train['Description']]

# remove stopwords from the text
reviews = [remove_stopwords(r.split()) for r in reviews]
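
The cleaning steps can be tried on a single sentence with the sketch below. It is a simplified stand-in, not the notebook's exact code: `clean_review` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set replaces NLTK's English list so the example runs without downloads. Lowercasing comes first so that capitalised stopwords such as "The" are matched.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "was", "and", "at", "very"}

def clean_review(text, stopwords=TOY_STOPWORDS, min_len=3):
    # 1. lowercase, 2. keep only letters and '#', 3. drop short words and stopwords
    text = text.lower()
    text = re.sub(r"[^a-z#]", " ", text)
    tokens = [w for w in text.split() if len(w) >= min_len and w not in stopwords]
    return " ".join(tokens)

cleaned = clean_review("The front desk was VERY helpful at 3 am!")
print(cleaned)
```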

In [None]:
#Let’s again plot the most frequent words and see if the more significant words have come out.

freq_words(reviews, 35)

In [None]:
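import spacy
# The 'en' shortcut was removed in spaCy 3; this assumes the en_core_web_sm model is installed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, tags=['NOUN', 'ADJ']): # filter noun and adjective
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # keep the lemma of every token whose part of speech is in tags
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output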

In [None]:
#Let’s tokenize the reviews and then lemmatize the tokenized reviews.

tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) # print lemmatized review

In [None]:
#As you can see, we have not just lemmatized the words but also filtered only nouns and adjectives. Let’s de-tokenize the lemmatized reviews and plot the most common words.

reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

df_train['Description'] = reviews_3

freq_words(df_train['Description'], 40)

It seems that the most frequent terms in our data are now relevant. We can go ahead and start building our topic model.

In [None]:
# Building an LDA model
# We will start by creating the term dictionary of our corpus, where every unique term is assigned an index
import gensim
from gensim import corpora
dictionary = corpora.Dictionary(reviews_2)

In [None]:
#Then we will convert the list of reviews (reviews_2) into a Document Term Matrix using the dictionary prepared above.

doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]
# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=7, random_state=100,
                chunksize=1000, passes=50)
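
To see what the term dictionary and the document-term matrix contain, here is a plain-Python sketch of the same idea on two toy reviews. It does not use gensim: `token2id` and `to_bow` are hypothetical names, and gensim may assign ids in a different order. Each document becomes a list of `(term_id, count)` pairs, which is the bag-of-words form that `doc2bow` produces.

```python
from collections import Counter

# Two toy tokenized reviews (illustrative only)
docs = [["room", "clean", "staff"], ["staff", "friendly", "staff"]]

# Assign each unique term an integer id in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count how often each term id occurs in one document
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```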

In [None]:
# The code above will take a while. Please note that I have specified the number of topics as 7 for this model using the num_topics parameter. You can specify any number of topics using the same parameter.

# Let’s print out the topics that our LDA model has learned.

lda_model.print_topics()

In [None]:
# Topics Visualization
# To visualize our topics in a 2-dimensional space we will use the pyLDAvis library. This visualization is interactive in nature and displays topics along with the most relevant words.
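# libraries for visualization
import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in pyLDAvis 3.x
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older pyLDAvis releases
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Visualize the topics
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)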
vis