<h1>Trip Advisor Hotel Reviews - EDA and NLP</h1>
<p>Exploratory Data Analysis and Language Modelling of the Trip Advisor Hotel Reviews Dataset.</p>
<img src="https://static.tacdn.com/img2/brand_refresh/application_icons/post-image-550x370.png" style="margin : auto;">

In [None]:
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

<h1>Loading the Data and Preparing it</h1>
<p>Load the dataset, which is stored in CSV format. It consists of two columns:
<ul>
    <li><i>Review</i> - Text of the user's review</li>
    <li><i>Rating</i> - Rating given by the user, out of 5</li>
</ul>
</p>
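<p>A minimal sketch of a few sanity checks that fit this two-column layout. The small <code>sample</code> frame is a hypothetical stand-in for the real CSV, so the values are illustrative only.</p>

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, with the same two columns.
sample = pd.DataFrame({
    "Review": ["nice hotel, great location", "dirty room, rude staff", "okay stay"],
    "Rating": [5, 1, 3],
})

# Ratings are expected to lie between 1 and 5.
ratings_valid = sample["Rating"].between(1, 5).all()
# Count missing values per column.
missing = sample.isna().sum()
# Review length in whitespace-separated tokens, a quick first look at the text.
word_counts = sample["Review"].str.split().str.len()

print(ratings_valid, missing.to_dict(), word_counts.tolist())
```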

In [None]:
data = pd.read_csv("../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv")
data.head(5)

In [None]:
data.isna().sum()

<h1>Exploratory Analysis</h1>
<p>Visualize the dataset. Compute a sentiment label for each review and add it to the dataset as a new column.
</p>

<h3>Sentiment Analysis of each Review</h3>

In [None]:
sentiments = []
for review in data['Review']:
    # Compute the polarity once per review; TextBlob returns a score in [-1, 1]
    polarity = TextBlob(review).sentiment.polarity
    if polarity < 0:
        sentiments.append("Negative")
    elif polarity == 0:
        sentiments.append("Neutral")
    else:
        sentiments.append("Positive")
data["Sentiment"] = np.array(sentiments)
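<p>The labelling rule itself (negative below zero, neutral at exactly zero, positive above zero) can also be sketched in vectorized form. The <code>polarity</code> values here are hypothetical and stand in for scores that would come from TextBlob.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical polarity scores in [-1, 1], standing in for TextBlob output.
polarity = pd.Series([-0.4, 0.0, 0.25, 0.8, -0.05])

# np.select uses the first matching condition; "Neutral" is the fallback,
# which here covers polarity == 0.
labels = np.select(
    [polarity < 0, polarity > 0],
    ["Negative", "Positive"],
    default="Neutral",
)
print(labels.tolist())
```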

In [None]:
del sentiments

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
x_axis,counts = np.unique(data['Rating'],return_counts=True)
plt.bar([str(i) for i in x_axis],counts)
plt.title("Rating vs Counts")
plt.xlabel("Rating")
plt.ylabel("Count")

plt.subplot(1,2,2)
x_axis,counts = np.unique(data['Sentiment'],return_counts=True)
plt.bar(x_axis,counts)
plt.title("Sentiment vs Counts")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.tight_layout()

<p>The majority of the reviews are rated 4 or 5, and the number of positive reviews is high compared with negative and neutral reviews.</p>
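<p>The imbalance visible in the bar charts can be quantified as class shares. The sketch below uses a hypothetical, hand-made <code>ratings</code> series skewed towards 4 and 5; the same two calls would apply to <code>data['Rating']</code>.</p>

```python
import pandas as pd

# Hypothetical ratings, skewed towards 4 and 5 for illustration only.
ratings = pd.Series([5, 5, 4, 5, 4, 3, 2, 5, 1, 4])

# Share of each rating, ordered by rating value.
shares = ratings.value_counts(normalize=True).sort_index()
# Accuracy obtained by always predicting the most common rating.
majority_baseline = shares.max()

print(shares.round(2).to_dict(), majority_baseline)
```

<p>The majority share is a useful reference point: a classifier is only informative if its accuracy exceeds it.</p>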

<h1>Natural Language Processing</h1>
<p>We will try to predict the approximate rating of each review from its text, assuming that the rating depends only on the content of the review and is not subject to any bias. The reviews are converted to TF-IDF features and used to train scikit-learn classifiers: logistic regression, a decision tree and a random forest.</p>
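<p>A minimal sketch of the TF-IDF plus classifier workflow on a hypothetical toy corpus. It wraps both steps in a scikit-learn <code>Pipeline</code>, which is an alternative arrangement rather than the one used in this notebook: the vectorizer then learns its vocabulary only from the data passed to <code>fit</code>, and the feature matrix stays sparse.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and ratings, for illustration only.
reviews = [
    "great hotel lovely staff",
    "terrible room dirty bathroom",
    "lovely view great breakfast",
    "dirty sheets terrible service",
]
ratings = [5, 1, 5, 1]

# The pipeline learns vocabulary and IDF weights from the data passed to fit().
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(solver="liblinear", random_state=0),
)
model.fit(reviews, ratings)

# Sparse matrix with one row per document and one column per vocabulary term.
features = model.named_steps["tfidfvectorizer"].transform(reviews)
print(features.shape)
print(model.predict(["great lovely staff"]))
```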

In [None]:
vectorizer = TfidfVectorizer(stop_words=text.ENGLISH_STOP_WORDS)
X_train,X_test,Y_train,Y_test = train_test_split(vectorizer.fit_transform(data['Review']).toarray(),
                                                 data['Rating'].values,
                                                 test_size = 0.2,
                                                 random_state=42)

In [None]:
print("Train Features : ",X_train.shape)
print("Train Labels   : ",Y_train.shape)
print("Test Features  : ",X_test.shape)
print("Test Labels    : ",Y_test.shape)

In [None]:
clf = LogisticRegression(solver='liblinear',random_state=0)
clf.fit(X_train,Y_train)
print("Train Accuracy : {:.2f} %".format(accuracy_score(clf.predict(X_train),Y_train)*100))
print("Test Accuracy  : {:.2f} %".format(accuracy_score(clf.predict(X_test),Y_test)*100))

In [None]:
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train,Y_train)
print("Train Accuracy : {:.2f} %".format(accuracy_score(clf.predict(X_train),Y_train)*100))
print("Test Accuracy  : {:.2f} %".format(accuracy_score(clf.predict(X_test),Y_test)*100))

In [None]:
clf = RandomForestClassifier(n_estimators=100,random_state=0)  # fixed seed for reproducible results
clf.fit(X_train,Y_train)
print("Train Accuracy : {:.2f} %".format(accuracy_score(clf.predict(X_train),Y_train)*100))
print("Test Accuracy  : {:.2f} %".format(accuracy_score(clf.predict(X_test),Y_test)*100))

<p>As seen from the results above, the accuracy is quite poor. A possible explanation is inconsistency between the review text and the rating: reviews with adjacent ratings (for example 4 and 5) can read almost the same. One way to address this is to group the reviews into two categories depending on their rating: 1, 2 and 3 for class 0, and 4 and 5 for class 1. This way, the reviews in each group will be more consistent with their label.</p>

In [None]:
# Ratings 1, 2 and 3 map to class 0; every other rating maps to class 1
data['Group'] = (~data['Rating'].isin([1,2,3])).astype(int)

In [None]:
data

<p>Let us perform the train/test split again, this time with Group as the target.</p>
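<p>Because the two groups are unlikely to be the same size, it may be worth keeping their proportions equal in both splits. A minimal sketch, assuming hypothetical labels with 20% in class 0, using the <code>stratify</code> argument of <code>train_test_split</code> (not used elsewhere in this notebook):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical binary labels: 20 samples of class 0 and 80 of class 1.
y = np.array([0] * 20 + [1] * 80)
X = np.arange(100).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```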

In [None]:
vectorizer = TfidfVectorizer(stop_words=text.ENGLISH_STOP_WORDS)
X_train,X_test,Y_train,Y_test = train_test_split(vectorizer.fit_transform(data['Review']).toarray(),
                                                 data['Group'].values,
                                                 test_size = 0.2,
                                                 random_state=42)

In [None]:
clf = LogisticRegression(solver='liblinear',random_state=0)
clf.fit(X_train,Y_train)
print("Train Accuracy : {:.2f} %".format(accuracy_score(clf.predict(X_train),Y_train)*100))
print("Test Accuracy  : {:.2f} %".format(accuracy_score(clf.predict(X_test),Y_test)*100))

<p>As seen above, simple logistic regression by itself performs well when Group is used as the target, compared with using Rating as the target.</p>
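<p>When reading the binary accuracy, it can help to compare it against a majority-class baseline, since an imbalanced target makes plain accuracy look high. The sketch below uses hypothetical labels (75% in class 1) and scikit-learn's <code>DummyClassifier</code>; the numbers are illustrative and are not results from this dataset.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical binary labels with 75% in class 1, for illustration only.
y_true = np.array([1, 1, 1, 0, 1, 1, 0, 1])
X_dummy = np.zeros((len(y_true), 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

# Plain accuracy rewards always predicting the majority class,
# while balanced accuracy averages recall over both classes.
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(acc, bal_acc)
```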