<img src="https://images.unsplash.com/photo-1603076534270-364861eac82d?ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=800&q=80" alt="resort_img">

<span>Photo by <a href="https://unsplash.com/@uvastrishamarie?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Trisha Marie Uvas</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

# Mission

In this notebook I try to figure out <br>
which key aspects a hotel/villa management should consider,<br>
and I also do a little bit of sentiment analysis.

In [None]:
# start Our Engine
import pandas as pd
import numpy as np
import seaborn as sns

# Data Overview

In [None]:
# import Our Data
raw_data = pd.read_csv('../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')

In [None]:
# See Our Data
raw_data.head()

# Clean Data

## Missing Value

In [None]:
# check missing data
raw_data.isnull().sum()

In [None]:
# Check for whitespace strings

# Empty list for Our index 
blanks = []  

for i,rv,lb in raw_data.itertuples(): # iterate over rows as (index, review, rating) tuples
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # find reviews that contain only whitespace
            blanks.append(i)     # add matching index numbers to the list
        
len(blanks) # number of rows whose review contains only whitespace

There are no missing values and no whitespace-only strings, so we don't need to clean the data.
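
The loop above can also be expressed with pandas' vectorized string methods. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the real review data) to show the idea. `str.isspace()` does not return `True` for missing values, so comparing with `True` leaves only the whitespace-only rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews
toy = pd.DataFrame({
    'Review': ['nice hotel', '   ', None, 'great pool'],
    'Rating': [4, 3, 5, 5],
})

# True only where the review consists entirely of whitespace;
# missing values never compare equal to True, so they are excluded
blank_mask = toy['Review'].str.isspace().eq(True)

# Index labels of the whitespace-only rows
blanks = toy.index[blank_mask].tolist()

print(len(blanks))
```

Missing values are deliberately not counted here, since `isnull().sum()` already reports them separately.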

## Rating Categories

I think we need to add a column that shows whether a rating is positive, negative, or neutral, <br>
because that is more meaningful than 1, 2, 3, 4, 5. <br>
In this case, there are 3 categories:<br>
1. Positive, which includes ratings 4 and 5
2. Neutral, which includes rating 3
3. Negative, which includes ratings 1 and 2

In [None]:
# create a list of our conditions
conditions = [
    (raw_data['Rating'] > 3),
    (raw_data['Rating'] < 3),
    (raw_data['Rating'] == 3)
    ]

# create a list of the values we want to assign for each condition
values = ['Positive','Negative','Neutral']

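# create a new column and use np.select to assign values to it using our lists as arguments
# a string default keeps the output dtype consistent (newer NumPy versions reject mixing string choices with the integer default 0)
raw_data['Sentiment'] = np.select(conditions, values, default='Unknown')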

In [None]:
raw_data.head()
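
As an alternative sketch of the same rating-to-category mapping, `pd.cut` can bin the ratings directly. The `ratings` Series below is hypothetical toy data, and the bin edges are my reading of the three categories listed above, not something taken from the original notebook.

```python
import pandas as pd

# Hypothetical ratings covering every value from 1 to 5
ratings = pd.Series([1, 2, 3, 4, 5], name='Rating')

# Bins are right-inclusive: (0, 2] -> Negative, (2, 3] -> Neutral, (3, 5] -> Positive
sentiment = pd.cut(
    ratings,
    bins=[0, 2, 3, 5],
    labels=['Negative', 'Neutral', 'Positive'],
)

print(sentiment.tolist())
```

Note that `pd.cut` returns a categorical column; calling `.astype(str)` on it would give plain strings like the `np.select` version.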

## Tokenizer

In [None]:
# import Our nlp machine
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:
# create token list
token_list = []

for rv in raw_data['Review']:
    doc = nlp(rv, disable=['parser', 'tagger', 'ner']) # disable for speed
    tokens = [n.lemma_ for n in doc if not n.is_punct and not n.is_stop] # remove punctuation and stopwords
    x = " ".join(tokens) # turn list into string
    token_list.append(x) 

In [None]:
# add column tokens
raw_data['Tokens'] = token_list

In [None]:
# checkpoint
df = raw_data.copy()

In [None]:
# the cleaned text is stored in 'Tokens', so we don't need the 'Review' column anymore
df.drop('Review', axis=1,inplace=True)

In [None]:
df.head()

# Key Aspect in Hotel

I count how often each word occurs to figure out the key aspects of a hotel.

In [None]:
# Invite Our assistant
from collections import Counter

In [None]:
# words counter
collect_words = Counter([word for token in df['Tokens'] for word in token.split()])

In [None]:
# make pandas table, top 25
freq_words = pd.DataFrame(collect_words.most_common(25))

In [None]:
# columns name
freq_words.columns = ['word','count']

In [None]:
freq_words

Based on the result above, we can conclude that there are some **Key Aspects** that hotel management should consider, such as:
1. **Hotel Room**
2. **Staff/Service**
3. **Cleanliness/Hygiene**
4. **Restaurant/Food**
5. **Pool/Beach**
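
To make the step from single words to aspects a bit more explicit, the word counts can be summed per aspect. This is only a sketch: `toy_counts` stands in for `collect_words` with made-up numbers, and the keyword groups in `aspect_keywords` are hypothetical; which words belong to which aspect is a judgment call.

```python
from collections import Counter

# Hypothetical word counts standing in for collect_words (numbers are made up)
toy_counts = Counter({
    'room': 10, 'staff': 6, 'service': 4, 'clean': 5,
    'food': 4, 'restaurant': 3, 'pool': 3, 'beach': 2, 'day': 7,
})

# Hypothetical keyword groups, one per aspect
aspect_keywords = {
    'Hotel Room': ['room'],
    'Staff/Service': ['staff', 'service'],
    'Cleanliness/Hygiene': ['clean'],
    'Restaurant/Food': ['restaurant', 'food'],
    'Pool/Beach': ['pool', 'beach'],
}

# Sum the counts of all keywords that belong to an aspect
# (a Counter returns 0 for words it has never seen)
aspect_counts = {
    aspect: sum(toy_counts[word] for word in words)
    for aspect, words in aspect_keywords.items()
}

# Show the aspects from most to least mentioned
for aspect, total in sorted(aspect_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(aspect, total)
```

Words that do not map to any aspect, such as `day` in the toy data, are simply ignored.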

# Sentiment Analysis

## Balancing Data

In [None]:
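# Class count (select by label so the order does not depend on how value_counts sorts)
count_positive, count_negative, count_neutral = df['Sentiment'].value_counts()[['Positive', 'Negative', 'Neutral']]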

In [None]:
# Divide by class
df_positive = df[df['Sentiment'] == 'Positive']
df_negative = df[df['Sentiment'] == 'Negative']
df_neutral = df[df['Sentiment'] == 'Neutral']

In [None]:
print(count_positive)
print(count_negative)
print(count_neutral)

In [None]:
# under sampling
df_positive_under = df_positive.sample(count_neutral)
df_negative_under = df_negative.sample(count_neutral)

df_test_under = pd.concat([df_positive_under, df_negative_under, df_neutral], axis=0)

In [None]:
df_test_under['Sentiment'].value_counts()

The classes are balanced now.
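
The undersampling above relies on the neutral class being the smallest one. A more general sketch, assuming pandas 1.1 or newer (where `groupby(...).sample` is available), looks up the smallest class size first. The `toy` DataFrame is hypothetical, and `random_state=0` is an arbitrary choice to make the draw repeatable.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 6 positive, 3 negative, 2 neutral rows
toy = pd.DataFrame({
    'Sentiment': ['Positive'] * 6 + ['Negative'] * 3 + ['Neutral'] * 2,
    'Tokens': [f'review {i}' for i in range(11)],
})

# Size of the smallest class, whichever class that is
n_min = toy['Sentiment'].value_counts().min()

# Draw n_min rows from every class without replacement
balanced = toy.groupby('Sentiment').sample(n=n_min, random_state=0)

print(balanced['Sentiment'].value_counts())
```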

## Make a Model

In [None]:
# split the data
from sklearn.model_selection import train_test_split

X = df_test_under['Tokens']
y = df_test_under['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X,y)

In [None]:
# Start the Engine
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)

In [None]:
predictions = text_clf.predict(X_test)

## Accuracy

In [None]:
from sklearn.metrics import classification_report

# Print a classification report
print(classification_report(y_test,predictions))

Because we have balanced the data, accuracy is a meaningful metric for our model.<br>
As you can see, our model reaches **70%** accuracy.
Yeeaaa :)
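
Why balancing matters when reading accuracy: on an imbalanced set, a model that always guesses the most frequent class already scores high, while on three balanced classes that baseline is only about one third. The sketch below illustrates this with hypothetical label lists (not the real data); `majority_baseline_accuracy` is a helper name I made up for the illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical label lists: one imbalanced, one balanced
imbalanced = ['Positive'] * 8 + ['Negative'] + ['Neutral']
balanced = ['Positive'] * 3 + ['Negative'] * 3 + ['Neutral'] * 3

print(majority_baseline_accuracy(imbalanced))
print(majority_baseline_accuracy(balanced))
```

Against a baseline of roughly one third, the accuracy reported above is easier to interpret than it would be on the original, imbalanced classes.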