# Introduction

This project extracts patterns from guest comments and then uses those patterns to predict the behavior of new guests. To do that, I used several Python libraries: NLTK, Gensim, and scikit-learn. The data comes from booking.com and contains 515,000 guest reviews and scores for 1,493 luxury hotels across Europe.

# Read the data

In [None]:
# libraries
import pandas as pd
# read the data
reviews = pd.read_csv("/home/han/jnotebooks/data/Hotel_Reviews.csv")
# reviews.head() # see the data
# concatenate the positive and negative reviews into one text column
# (the space keeps the last and first words from merging)
reviews["review"] = reviews["Positive_Review"] + " " + reviews["Negative_Review"]
# create the label as good or bad 
reviews["good_bad"] = reviews["Reviewer_Score"].apply(lambda x: 1 if x > 5 else 0)
# select review and good_bad label
reviews = reviews[["review", "good_bad"]]
reviews.head()

# Sample data
**DataFrame.sample** returns a random sample of items from an axis of the object. Sampling is used here to speed up computations.

In [None]:
# frac: Fraction of axis items to return. 
# replace: Allow or disallow sampling of the same row more than once.
reviews = reviews.sample(frac = 0.1, replace = False, random_state=0)
reviews.head()

# Clean data
**Positive_Review:** the positive review the reviewer gave to the hotel. If the reviewer did not give a positive review, the field contains 'No Positive'.

**Negative_Review:** the negative review the reviewer gave to the hotel. If the reviewer did not give a negative review, the field contains 'No Negative'.

In [None]:
# 'No Positive' or 'No Negative' has no meaning in the reviews, so these words are removed. 
reviews["review"] = reviews["review"].apply(lambda x: x.replace("No Negative", "").replace("No Positive", ""))
reviews.head()

**WordNet** is a lexical database (i.e. a dictionary) for the English language, specifically designed for natural language processing.

**Token:** each “entity” produced when text is split up according to some rules. For example, each word is a token when a sentence is “tokenized” into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token *saw*, stemming might return just *s*, whereas lemmatization would attempt to return either *see* or *saw* depending on whether the token was used as a verb or a noun.

**Example:** lemmatization reduces the different forms of a word to one single form, for example "builds", "building", or "built" to the lemma "build".

*In lemmatization, "n't" is treated as "not".*

**Stop Words:** A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
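
The token and stop-word definitions above can be illustrated with a minimal pure-Python sketch. The `STOP_WORDS` set and the helper names `tokenize` and `remove_stop_words` are hypothetical and exist only for this illustration; the notebook itself relies on NLTK's English stop-word list.

```python
import string

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "in", "was", "and", "but"}


def tokenize(text):
    """Lowercase the text, split on whitespace, strip punctuation, drop empty tokens."""
    tokens = [word.strip(string.punctuation) for word in text.lower().split()]
    return [token for token in tokens if token]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The room was clean, but the breakfast was cold!")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Only the content-bearing words survive the filter, which is the same effect the stop-word step has on the hotel reviews.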

**Part-of-speech tagging** (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

In [None]:
   
import string
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

# build the stop-word set and the lemmatizer once, not once per token
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def getCleanText(text):
    # lower text
    text = text.lower()
    # tokenize the text and remove punctuation.
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove the words that contain numbers
    text = [word for word in text if not any(char.isdigit() for char in word)]
    # remove stop words
    text = [x for x in text if x not in stop_words]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text. e.g. [('now', 'RB'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
    pos_tags = pos_tag(text)
    # lemmatize text
    text = [lemmatizer.lemmatize(t[0], getWordnetPos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all
    text = " ".join(text)
    return text


from nltk.corpus import wordnet

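def getWordnetPos(tag):
    # map a Penn Treebank POS tag to the matching WordNet POS constant
    # (the parameter is named tag so it does not shadow nltk's pos_tag function)
    if tag.startswith('JJ'):
        return wordnet.ADJ
    elif tag.startswith('VB'):
        return wordnet.VERB
    elif tag.startswith('NN'):
        return wordnet.NOUN
    elif tag.startswith('RB'):
        return wordnet.ADV
    else:
        # every other tag defaults to noun
        return wordnet.NOUN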
    
# clean text data
reviews["clean_review"] = reviews["review"].apply(lambda x: getCleanText(x))
reviews.head()

# Feature engineering

First, sentiment analysis features are added, because sentiment is directly related to guests' reviews. For the sentiment analysis, VADER from NLTK is used.

**VADER (Valence Aware Dictionary and sEntiment Reasoner)** is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. A sentiment lexicon is a list of lexical features (e.g., words) that are generally labelled according to their semantic orientation as either positive or negative. VADER combines such a lexicon with rules that take the context of the sentence into account. For each text, VADER returns a neutrality score, a positivity score, a negativity score, and an overall (compound) score.
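
To make the lexicon idea concrete, here is a deliberately simplified sketch of a lexicon-based scorer. It is not VADER: the `LEXICON` values, the function name `toy_sentiment`, and the normalization constant `alpha` are assumptions made for this illustration, and none of VADER's context rules (negation, intensifiers, punctuation) are implemented.

```python
import math

# Hypothetical valence scores; a real sentiment lexicon is far larger.
LEXICON = {"great": 3.0, "good": 2.0, "clean": 1.5,
           "bad": -2.5, "dirty": -2.0, "noisy": -1.5}


def toy_sentiment(text, alpha=15.0):
    """Score a text from per-word valences (toy version, no context rules)."""
    tokens = text.lower().split()
    valences = [LEXICON.get(token, 0.0) for token in tokens]
    total = sum(valences)
    # Squash the summed valence into the open interval (-1, 1).
    compound = total / math.sqrt(total * total + alpha)
    n = len(tokens) or 1  # avoid dividing by zero on empty text
    return {
        "pos": sum(1 for v in valences if v > 0) / n,
        "neg": sum(1 for v in valences if v < 0) / n,
        "neu": sum(1 for v in valences if v == 0) / n,
        "compound": compound,
    }


print(toy_sentiment("great clean room"))
print(toy_sentiment("dirty noisy room"))
```

The returned dictionary has the same four keys that the notebook later expands into columns, which is why a sentiment dictionary per review translates directly into four numeric features.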

In [None]:
# The polarity_scores method of a SentimentIntensityAnalyzer
# object returns a sentiment dictionary
# that contains the pos, neg, neu, and compound scores.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# create the analyzer once instead of once per review
sentiment_analyzer = SentimentIntensityAnalyzer()
reviews["sentiments"] = reviews["review"].apply(lambda x: sentiment_analyzer.polarity_scores(x))
reviews.head()

In [None]:
# Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.
reviews = pd.concat([reviews.drop(["sentiments"], axis=1), reviews["sentiments"].apply(pd.Series)], axis=1)
reviews.head()

In [None]:
# number of characters feature
reviews["number_chars"] = reviews["review"].apply(lambda review: len(review))
# number of words feature (count the space-separated tokens, not the characters)
reviews["number_words"] = reviews["review"].apply(lambda review: len(review.split(" ")))
reviews.head()

Next, Gensim is going to be used to extract vector representations of reviews.

**Gensim** is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

**Word2vec** is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

**The purpose** and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words.
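
"Close to one another" is usually measured with cosine similarity. The sketch below shows that measure on hypothetical 3-dimensional vectors; the words and numbers are made up for illustration and do not come from a trained Word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real word vectors have many more dimensions.
vectors = {
    "hotel": [0.9, 0.1, 0.3],
    "hostel": [0.8, 0.2, 0.35],
    "breakfast": [0.1, 0.9, 0.2],
}

similar = cosine_similarity(vectors["hotel"], vectors["hostel"])
dissimilar = cosine_similarity(vectors["hotel"], vectors["breakfast"])
print(similar, dissimilar)
```

With these toy vectors, "hotel" and "hostel" point in nearly the same direction, while "hotel" and "breakfast" do not.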

**Doc2Vec** is a model that represents each document as a vector.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Initialize & train a model
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(reviews["clean_review"].apply(lambda x: x.split(" ")))]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
# transform each document into a vector
# infer_vector infers a vector for a given document after bulk training.
dataframe_doc2vec = reviews["clean_review"].apply(lambda x: model.infer_vector(x.split(" "))).apply(pd.Series)
dataframe_doc2vec.head()

In [None]:
dataframe_doc2vec.columns = ["doc2vec_" + str(x) for x in dataframe_doc2vec.columns]
dataframe_doc2vec.head()

In [None]:
reviews = pd.concat([reviews, dataframe_doc2vec], axis=1)
reviews.head()

Lastly, TF-IDF values are added for every word and every document.

In information retrieval, **tf–idf** or **TFIDF**, short for **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

**The reason for using TF-IDF:** it accounts for the importance of words in a text; rare words may carry more importance than common words.
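
The intuition that rare words weigh more can be checked with a small pure-Python sketch of the textbook formula, tf × log(N / df). This is an illustrative variant only: the function name `tf_idf` and the toy documents are assumptions, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.split() for doc in documents]
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical cleaned reviews
docs = [
    "clean room friendly staff",
    "clean room noisy street",
    "clean bed great breakfast",
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "clean" appears in every document and scores zero, while "staff", which appears only once, gets the highest weight in the first review.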

TF-IDF columns are added for each word that appears in at least 10 different texts.

In [None]:
# TF-IDF features
# fit_transform: Learn vocabulary and idf, return document-term matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=10)
tfidf_result = tfidf.fit_transform(reviews["clean_review"]).toarray()
# get_feature_names_out requires scikit-learn >= 1.0; older releases used get_feature_names
tfidf_df = pd.DataFrame(tfidf_result, columns = tfidf.get_feature_names_out())
tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = reviews.index
tfidf_df.head()

In [None]:
# join the TF-IDF columns side by side (axis=1) instead of appending them as extra rows
reviews = pd.concat([reviews, tfidf_df], axis=1)
reviews.head()

In [None]:
reviews.shape

# Explore data

In [None]:
reviews["good_bad"].value_counts(normalize = True)

# Modeling

**Random forest** is a supervised learning algorithm that can be used for both classification and regression. It is flexible and easy to use. A forest is made up of trees, and adding more trees generally makes the forest more robust. A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting. It also provides a fairly good indicator of feature importance.
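
The voting step can be sketched in a few lines. The "trees" below are hypothetical hand-written rules standing in for trained decision trees, and the feature names in `sample` are only illustrative. Note also that scikit-learn's `RandomForestClassifier` combines the trees by averaging their class probabilities rather than by counting hard votes, so this is a simplification.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a prediction and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-ins for trained trees: each maps a sample to class 0 or 1.
trees = [
    lambda s: 1 if s["compound"] > 0 else 0,
    lambda s: 1 if s["pos"] > s["neg"] else 0,
    lambda s: 1 if s["number_words"] < 50 else 0,
]

sample = {"compound": 0.6, "pos": 0.3, "neg": 0.1, "number_words": 80}
prediction = forest_predict(trees, sample)
print(prediction)
```

Two of the three rules vote for class 1, so the forest predicts 1 even though one tree disagrees, which is the source of the robustness mentioned above.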

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

ignored_columns = ["good_bad", "review", "clean_review"]
labels = [column for column in reviews.columns if column not in ignored_columns]

x_train, x_test, y_train, y_test = train_test_split(reviews[labels], reviews["good_bad"], test_size=0.3, random_state=0)

# train
random_forest = RandomForestClassifier(n_estimators = 100, random_state=0).fit(x_train, y_train)

# predict the probability of the positive class
predicted_probability = [r[1] for r in random_forest.predict_proba(x_test)]

# feature importance
feature_importances = pd.DataFrame({"feature": labels, "importance": random_forest.feature_importances_}).sort_values("importance", ascending = False)
feature_importances.head()

RandomForestClassifier is used here to inspect the feature importances. As the head of the table shows, sentiment analysis provides the most important features.

# Metrics

In [None]:
# ROC curve
from sklearn.metrics import auc, roc_curve
import matplotlib.pyplot as plt

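# roc_curve returns false positive rates, true positive rates, and thresholds, in that order
fpr, tpr, thresholds = roc_curve(y_test, predicted_probability, pos_label=1)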

plt.figure(1, figsize=(16, 9))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='Area under the ROC curve = %0.2f' % auc(fpr, tpr))
plt.plot([0,1], [0,1], lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('Receiver operating characteristic example')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc="lower right")
plt.show()

The ROC curve is a common way to show the quality of a classifier. In general, an AUC of 0.5 suggests no discrimination (i.e., the classifier cannot separate good reviews from bad ones better than chance), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding. As the curve shows, the model performs very well.
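
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; the function name `auc_by_ranking` and the data are assumptions for illustration, and the quadratic loop is meant for clarity, not for large data sets.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc_value = auc_by_ranking(y_true, scores)
print(auc_value)
```

Five of the six positive-negative pairs are ordered correctly here, so the AUC is 5/6; a classifier that ordered pairs at random would land near 0.5.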

In [None]:
# PR curve
from sklearn.metrics import precision_recall_curve, average_precision_score
# use the standard-library signature (the sklearn.utils.fixes helper is missing from recent scikit-learn)
from inspect import signature

average_precision = average_precision_score(y_test, predicted_probability)
precision, recall, _ = precision_recall_curve(y_test, predicted_probability)

plt.figure(1, figsize = (16, 9))
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {}))

plt.ylabel('Precision')
plt.xlabel('Recall')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.show()

As mentioned in the 'Explore data' section, the data set is imbalanced. The PR curve is therefore a more informative metric for this data set than the ROC curve. As the graph shows, precision decreases as recall increases, so the prediction threshold has to be chosen to suit the goal. Reaching high recall requires a low prediction threshold; in contrast, reaching high precision requires a high prediction threshold. Overall, the graph indicates that the model has good predictive capacity.
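
The threshold trade-off can be seen directly by computing precision and recall at a few thresholds. The labels, probabilities, and the helper name `precision_recall_at` below are hypothetical; returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall_at(y_true, probabilities, threshold):
    """Precision and recall when predicting class 1 for probability >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 0, 1, 0, 1]
probabilities = [0.95, 0.85, 0.75, 0.6, 0.4, 0.3]

for threshold in (0.9, 0.5, 0.2):
    precision, recall = precision_recall_at(y_true, probabilities, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

Lowering the threshold from 0.9 to 0.2 raises recall from 0.25 to 1.0 while precision drops from 1.0 to about 0.67, which is the pattern the PR curve summarizes.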

# Conclusion

In conclusion, guest reviews were used to predict guest behaviour via sentiment analysis, natural language processing features, Gensim document vectors (Doc2Vec), and term frequency–inverse document frequency values. It is therefore possible to perform sentiment analysis and make predictions from raw guest reviews.