# ASHEVILLE AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Asheville, North Carolina, United States**. It serves as a stepping stone **toward understanding what customers think of the service offered by Asheville's Airbnb hosts, and whether those hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **TEXT PREPROCESSING** and **TOPIC MODELLING** parts.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November 2020**. The data come from the **Inside Airbnb site**, which sources them from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [4]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np
import spacy

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
import en_core_web_sm
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [5]:
# load data

df = pd.read_csv("C:/Users/lizab/halew-reviews-clean.csv")

In [6]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
0,198480,87255046.0,2016-07-19,40872502,Nour,Belle appartement rÃ©cent situÃ© Ã 15 minutes...,belle appartement r cent situ minutes pied du ...
1,198480,91814194.0,2016-08-06,40494412,Vitor,"Morada excelente, com limpeza Ã³tima, muito be...",morada excelente com limpeza tima muito bem lo...
2,198480,94780243.0,2016-08-17,70116792,Ricardo,"Boa localizaÃ§Ã£o, casa cÃ´moda e simpÃ¡tica ....",boa localiza casa c moda e simp tica propriet ...
3,198480,96934467.0,2016-08-25,72247207,Victor,En fin ganska nybyggd lÃ¤genhet med all utrus...,en fin ganska nybyggd l genhet med utrustning ...
4,198480,111434378.0,2016-10-31,96738915,Elbert Takeshi,"Ã“timo apartamento, mtu aconchegante e espaÃ§o...",timo apartamento mtu aconchegante e espa oso g...


In [7]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499 entries, 0 to 498
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   listing_id        499 non-null    int64  
 1   id                499 non-null    float64
 2   date              499 non-null    object 
 3   reviewer_id       499 non-null    int64  
 4   reviewer_name     499 non-null    object 
 5   comments          499 non-null    object 
 6   comments_cleaned  498 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.4+ KB


In [8]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [9]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,comments_cleaned,object,496,1,0.2,499
1,listing_id,int64,7,0,0.0,499
2,id,float64,499,0,0.0,499
3,date,object,459,0,0.0,499
4,reviewer_id,int64,495,0,0.0,499
5,reviewer_name,object,420,0,0.0,499
6,comments,object,499,0,0.0,499


> Although these were fixed in the previous step, some `dtypes` are still not appropriate, and there is one missing value in the *comments_cleaned* feature. I'll therefore clean the data once more before moving on to text cleaning.

## PREPROCESSING

In [10]:
# check the missing values

df[df['comments_cleaned'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
69,393699,16277694.0,2014-07-24,12313219,Timur,ÐžÑ‚Ð»Ð¸Ñ‡Ð½Ð¾Ðµ Ð¼ÐµÑÑ‚Ð¾! ÐžÑ‡ÐµÐ½ÑŒ ÑƒÐ´Ð¾...,


> The missing value appears to be caused by a comment written in another language (or otherwise unusable text), as shown above, so I'll fill it with *No Description* and then fix the data types.

In [11]:
# fill missing values

df['comments_cleaned'] = df['comments_cleaned'].fillna('No Description')

In [12]:
# fix column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # np.object was removed in NumPy 1.24; the builtin object is equivalent
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])

In [14]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,object,7,0,0.0,499
1,id,object,499,0,0.0,499
2,date,datetime64[ns],459,0,0.0,499
3,reviewer_id,object,495,0,0.0,499
4,reviewer_name,object,420,0,0.0,499
5,comments,object,499,0,0.0,499
6,comments_cleaned,object,497,0,0.0,499


> Everything seems to be in order. I'll move on to text processing, followed by topic modelling.

## TEXT PROCESSING

### LEMMATIZATION

> In this part, I'll lemmatize the text in the *comments_cleaned* feature, keeping only nouns and adjectives, to get the tokens for the modelling part.

In [15]:
# load spaCy model for lemmatization

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

In [16]:
# function to lemmatize text

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ')):
    # keep only the lemmas of tokens whose part-of-speech tag is allowed
    output = []
    for text in texts:
        doc = nlp(text)
        output.append(' '.join([word.lemma_ for word in doc if word.pos_ in allowed_postags]))
    return output

In [17]:
# apply lemmatization

comment_list = df['comments_cleaned'].tolist()
comment_tokenized = lemmatization(comment_list)

In [18]:
# check lemmatized comment

print(comment_tokenized[0])

appartement r cent situ minute tro minute bus durant


In [20]:
# create new feature to store tokenized text

df['comments_tokenized'] = comment_tokenized

In [21]:
# show dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized
0,198480,87255046.0,2016-07-19,40872502,Nour,Belle appartement rÃ©cent situÃ© Ã 15 minutes...,belle appartement r cent situ minutes pied du ...,appartement r cent situ minute tro minute bus ...
1,198480,91814194.0,2016-08-06,40494412,Vitor,"Morada excelente, com limpeza Ã³tima, muito be...",morada excelente com limpeza tima muito bem lo...,excelente com
2,198480,94780243.0,2016-08-17,70116792,Ricardo,"Boa localizaÃ§Ã£o, casa cÃ´moda e simpÃ¡tica ....",boa localiza casa c moda e simp tica propriet ...,recomendo
3,198480,96934467.0,2016-08-25,72247207,Victor,En fin ganska nybyggd lÃ¤genhet med all utrus...,en fin ganska nybyggd l genhet med utrustning ...,med man beh inte all detta boende
4,198480,111434378.0,2016-10-31,96738915,Elbert Takeshi,"Ã“timo apartamento, mtu aconchegante e espaÃ§o...",timo apartamento mtu aconchegante e espa oso g...,com hospedado lugar era


### SENTIMENT ANALYSIS - VADER

> Now I'll analyze the sentiment using **VADER (Valence Aware Dictionary and sEntiment Reasoner)**. It is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. VADER relies on a dictionary that maps lexical features to emotion intensities, known as sentiment scores. The compound score of a text is obtained by summing the rule-adjusted intensity of each word in the text and normalizing that sum to the range -1 to 1.
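
> To make the "sum, then normalize" step concrete, here is a minimal sketch. It assumes the normalization used by the reference VADER code, `score / sqrt(score**2 + alpha)` with `alpha = 15`, and it uses a toy lexicon whose values are made up for illustration. The real analyzer also applies rules for negation, intensifiers, and punctuation, which are left out here.

```python
import math

# Toy valence lexicon: these numbers are illustrative assumptions,
# not values taken from the real VADER lexicon.
TOY_LEXICON = {"great": 3.0, "nice": 1.8, "dirty": -1.9}


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


def toy_compound(text):
    """Sum the valence of every known word, then normalize the sum."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return round(normalize(total), 4)


print(toy_compound("great nice place"))        # only positive lexicon hits
print(toy_compound("dirty place"))             # one negative lexicon hit
print(toy_compound("appartement minute bus"))  # no lexicon hits at all
```

> A text with no words in the lexicon sums to zero, so its compound score is exactly 0.0 and it lands in the *neutral* class. This is one plausible reason why non-English comments tend to be scored as neutral.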

In [22]:
# function to assign sentiment class based on compound score

def sentiment(comp):
    if comp >= 0.05:
        return 'positive'
    elif (comp > -0.05) and (comp < 0.05):
        return 'neutral'
    elif comp <= -0.05 :
        return 'negative'

In [23]:
# initialize sentiment analyzer

analyzer = SentimentIntensityAnalyzer()

# calculate compound score

compound_score = []
for i in df['comments_tokenized']:
    compound_score.append(analyzer.polarity_scores(i)['compound'])

In [24]:
# create new feature to store compound score

df['compound_score'] = compound_score

In [25]:
# check sentiment on first data

sentiment(df['compound_score'][0])

'neutral'

In [26]:
# calculate sentiment based on compound score

sent = []
for i in range(0, len(df)):
    sent.append(sentiment(df['compound_score'][i]))

In [27]:
# create new feature to store sentiment

df['sentiment'] = sent

In [28]:
# check final dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment
0,198480,87255046.0,2016-07-19,40872502,Nour,Belle appartement rÃ©cent situÃ© Ã 15 minutes...,belle appartement r cent situ minutes pied du ...,appartement r cent situ minute tro minute bus ...,0.0,neutral
1,198480,91814194.0,2016-08-06,40494412,Vitor,"Morada excelente, com limpeza Ã³tima, muito be...",morada excelente com limpeza tima muito bem lo...,excelente com,0.0,neutral
2,198480,94780243.0,2016-08-17,70116792,Ricardo,"Boa localizaÃ§Ã£o, casa cÃ´moda e simpÃ¡tica ....",boa localiza casa c moda e simp tica propriet ...,recomendo,0.0,neutral
3,198480,96934467.0,2016-08-25,72247207,Victor,En fin ganska nybyggd lÃ¤genhet med all utrus...,en fin ganska nybyggd l genhet med utrustning ...,med man beh inte all detta boende,0.0,neutral
4,198480,111434378.0,2016-10-31,96738915,Elbert Takeshi,"Ã“timo apartamento, mtu aconchegante e espaÃ§o...",timo apartamento mtu aconchegante e espa oso g...,com hospedado lugar era,0.0,neutral


> We now have everything we need for modelling. First, I'll do the topic modelling.

## TOPIC MODELLING

> **Topic modeling** is an unsupervised machine learning task in which documents are processed to discover the topics they relate to. It is an important concept in traditional natural language processing because it can surface semantic relationships between words across clusters of documents.

> We will use one of **sklearn's topic modeling methods**, NMF, which is based on Non-negative Matrix Factorization. The NMF model will be trained on tf-idf feature vectors.
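
> As a rough illustration of what a tf-idf feature vector is, the sketch below computes one by hand for a toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is my understanding of `TfidfVectorizer`'s defaults. The corpus and the function name `tfidf` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not taken from the dataset)
docs = ["great location great host", "clean place", "great place"]


def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized tf-idf row per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for doc in tokenized for w in doc})
    # document frequency: in how many documents each word appears
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    # smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

> In the second document, *clean* gets a higher weight than *place* because it appears in fewer documents overall. That is the effect tf-idf is meant to produce.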

### NON NEGATIVE MATRIX FACTORIZATION (NMF)

> **Non-Negative Matrix Factorization** is a statistical method for reducing the dimension of the input corpus. It uses a factor-analysis approach that gives comparatively less weight to words with less coherence. NMF is often reported to produce more coherent topics than LDA, and it tends to produce sparse representations: most entries are close to zero and only a few have significant values. This is useful when we strictly require fewer topics.

> In short, **the goal of NMF is to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X**. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction. We will be using sklearn’s implementation of NMF.
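
> The idea of approximating X by W times H can be sketched with the classic multiplicative update rules. Note that this is not what the notebook runs: sklearn's `NMF` uses a coordinate descent solver by default, and the function `nmf_multiplicative`, the toy matrix, and the iteration count below are assumptions chosen only to show the principle.

```python
import numpy as np


def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative X into W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    # random non-negative starting point
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # each update multiplies by a non-negative ratio, so W and H stay >= 0
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy document-term matrix with two obvious "topics" (columns 0-1 and 2-3)
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 3.0, 3.0]])

W, H = nmf_multiplicative(X, n_components=2)
error = np.linalg.norm(X - W @ H)
print(W.shape, H.shape)
print(round(float(error), 4))
```

> Each row of W holds the topic weights of one document, and each row of H holds the word weights of one topic, which matches the shapes `(n_documents, n_topics)` and `(n_topics, n_words)`.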

In [29]:
# convert the text to a tf-idf document-term matrix

vectorizer = TfidfVectorizer(max_features=100)
tf_idf = vectorizer.fit_transform(df['comments_tokenized'])

In [30]:
# show feature names

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['able',
 'amazing',
 'apartment',
 'appartement',
 'appartment',
 'area',
 'arrival',
 'available',
 'avon',
 'bar',
 'bathroom',
 'beautiful',
 'bed',
 'bus',
 'center',
 'central',
 'check',
 'city',
 'clean',
 'close',
 'comfortable',
 'communication',
 'day',
 'distance',
 'easy',
 'ellie',
 'enough',
 'et',
 'excellent',
 'experience',
 'fantastic',
 'first',
 'flat',
 'friend',
 'friendly',
 'good',
 'great',
 'heart',
 'helpful',
 'home',
 'host',
 'house',
 'information',
 'kind',
 'kitchen',
 'lisboa',
 'lisbon',
 'little',
 'living',
 'local',
 'location',
 'lot',
 'lovely',
 'main',
 'manager',
 'many',
 'minute',
 'much',
 'neighborhood',
 'neighbourhood',
 'nice',
 'night',
 'nous',
 'old',
 'park',
 'part',
 'people',
 'perfect',
 'person',
 'picture',
 'place',
 'problem',
 'quarti',
 'quiet',
 'real',
 'recommendation',
 'restaurant',
 'right',
 'room',
 'shop',
 'short',
 'small',
 'spacious',
 'station',
 'stay',
 'street',
 'super',
 'thank',
 'thing',
 'time',
 'ti

In [31]:
# applying NMF factorization

nmf_model = NMF(n_components=5, init='nndsvd', random_state=42)
W = nmf_model.fit_transform(tf_idf)
H = nmf_model.components_
print(W.shape, H.shape)

(499, 5) (5, 100)


In [32]:
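# get topics (top 5 words per topic, in ascending order of weight)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
get_topics = []
for topic in H:
    get_topics.append(' '.join([feature_names[i] for i in topic.argsort()[-5:]]))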

In [33]:
# show topics

get_topics

['host flat place good nice',
 'et quarti avon nous appartement',
 'beautiful comfortable ellie lisbon apartment',
 'day communication location host great',
 'wonderful place clean location perfect']

> It seems that the first, third, and fifth topics are about how nice the place and the host are. The second and fourth topics are more specifically about the location, though the second consists largely of non-English tokens.
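
> When reading the topic strings, keep in mind that `argsort()` sorts in ascending order, so `argsort()[-5:]` lists the strongest word last. The sketch below shows this on a made-up weight vector; the words and weights are hypothetical and only stand in for one row of `H`.

```python
import numpy as np

# Hypothetical vocabulary and one hypothetical row of H (word weights of a topic)
feature_names = np.array(["clean", "host", "location", "nice", "place", "quiet"])
topic_weights = np.array([0.10, 0.90, 0.30, 0.70, 0.50, 0.05])

# Same slicing as in the notebook: top words, weakest of them first
ascending = feature_names[topic_weights.argsort()[-3:]]

# Reversed variant: top words, strongest first
descending = feature_names[topic_weights.argsort()[::-1][:3]]

print(" ".join(ascending))   # place nice host
print(" ".join(descending))  # host nice place
```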

In [34]:
# predict the topics based on tokenized comments

topics = []

for i in df['comments_tokenized']:
    text_to_vector = vectorizer.transform([i])
    prob_score = nmf_model.transform(text_to_vector)
    topics.append(get_topics[np.argmax(prob_score)])

In [35]:
# create new feature to store the topics

df['topics'] = topics
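
> One caveat of assigning topics with `np.argmax`: on ties it returns the first index. A comment with no tokens in the 100-word vocabulary has an all-zero tf-idf vector, its topic scores are then typically all zero as well, and it would silently receive the first topic. The sketch below shows the behaviour and one possible guard. The helper `assign_topic`, the labels, and the fallback value are hypothetical.

```python
import numpy as np

topic_labels = ["topic A", "topic B", "topic C"]  # hypothetical labels


def assign_topic(scores, labels, fallback="no topic"):
    """Return the label with the highest score, or a fallback if no score is positive."""
    scores = np.asarray(scores, dtype=float)
    if not np.any(scores > 0):
        return fallback
    return labels[int(np.argmax(scores))]


zero_row = [0.0, 0.0, 0.0]

# Plain argmax picks index 0 when every score is equal
print(topic_labels[int(np.argmax(zero_row))])
# The guarded version reports that no topic matched
print(assign_topic(zero_row, topic_labels))
print(assign_topic([0.0, 0.2, 0.1], topic_labels))
```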

In [37]:
# show top 5 data

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
0,198480,113367752.0,2016-11-12,98837232,Aleksandra,It's amazing place to stay for a longer time!!...,amazing place stay longer time br lovely comfo...,amazing place long time comfortable modern gre...,0.9846,positive,area room flat place nice
1,198480,229173000.0,2018-01-22,111501869,Alberto,Exactly what I needed. I was in Lisbon for wor...,exactly needed lisbon work day home night plac...,lisbon work day home night place less minute r...,0.9618,positive,day lisbon home host good
2,198480,270488735.0,2018-05-29,20963483,Carlos,The host canceled this reservation 102 days be...,host canceled reservation days arrival automat...,host reservation day arrival posting,0.0,neutral,day lisbon home host good
3,393699,1069644.0,2012-04-01,1974225,Cidalia,We are a family with a 6 year old and normally...,family year old normally book entire apartment...,family year old book entire apartment time dif...,0.994,positive,thank comfortable ellie lisbon apartment
4,198480,298408477.0,2018-07-29,85843062,Fabia,It's a very functional and spacious apartment....,functional spacious apartment issues non worki...,functional spacious apartment issue non workin...,0.6369,positive,thank comfortable ellie lisbon apartment


In [36]:
# show example on topic 1

df['comments'][0]

"Belle appartement rÃ©cent situÃ© Ã\xa0 15 minutes Ã\xa0 pied du mÃ©tro et Ã\xa0 5 minutes du bus qui est desservi tout au long de la nuit.<br/>Ana s'est rendu disponible durant le sÃ©jour."

In [37]:
# show example on topic 2

df['comments'][1]

'Morada excelente, com limpeza Ã³tima, muito bem localizada, uma regiÃ£o com um espaÃ§o verde muito bom e prÃ³ximo a transportes de qualidade (metro e autocarros)'

In [38]:
# show example on topic 5

df['comments'][3]

'En fin ganska nybyggd lÃ¤genhet med  all utrustning man behÃ¶ver. Det Ã¤r inte alls svÃ¥rt att ta sig till stan om det sÃ¥ Ã¤r med tunnelbana buss eller taxi. Kommer att rekommendera detta boende till andra.'

> The model seems to predict the topics reasonably well. In the next phase, I'll dump the cleaned data and build models with various algorithms in a separate notebook. One thing to address first, though: many rows whose *comments* and *comments_cleaned* are not in English produce misleading sentiments and topics. For the sake of modelling, I would rather drop these rows later if possible. An example is shown below.

In [39]:
# dump to a new csv file

df.to_csv('halew-reviews-tokenized.csv', index=False)

In [40]:
# show the anomaly

df[df['comments']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics


In [42]:
# show the anomaly 2

df[df['comments_cleaned']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
69,393699,16277694.0,2014-07-24,12313219,Timur,ÐžÑ‚Ð»Ð¸Ñ‡Ð½Ð¾Ðµ Ð¼ÐµÑÑ‚Ð¾! ÐžÑ‡ÐµÐ½ÑŒ ÑƒÐ´Ð¾...,No Description,description,0.0,neutral,host flat place good nice
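
> One possible way to flag rows like this before modelling is a rough language check on the raw *comments*. The sketch below is only a heuristic, not something the notebook does: it measures the share of tokens that are common English function words. The word list `ENGLISH_HINTS`, the helper names, and the 0.1 threshold are all assumptions. Very short English reviews such as "Great place!" would be misclassified, so a dedicated language detector would be more reliable.

```python
import re

# Hypothetical, deliberately tiny list of English function words
# (an assumption for illustration; nltk.corpus.stopwords would be a fuller source).
ENGLISH_HINTS = {"the", "a", "an", "and", "is", "was", "to", "of", "in", "for",
                 "it", "we", "i", "very", "with", "this", "that", "were", "at", "on"}


def english_ratio(text):
    """Share of alphabetic tokens that are common English function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in ENGLISH_HINTS for tok in tokens) / len(tokens)


def looks_english(text, threshold=0.1):
    """Heuristic flag: True if enough tokens are English function words."""
    return english_ratio(text) >= threshold


samples = [
    "It's a very functional and spacious apartment.",
    "Morada excelente, com limpeza muito bem localizada",
]
flags = [looks_english(s) for s in samples]
print(flags)  # [True, False]
```

> Applied to the dataframe, this could look like `df[df['comments'].apply(looks_english)]`, keeping only the rows that pass the check.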


## REFERENCES

>- https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
>- https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45