## AirBnB Sentiment Analysis - Text Preprocessing and Topic Modelling - Step 2

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Bangkok**. It acts as a stepping stone **to understand what customers think of the service offered by Bangkok's Airbnb hosts, and whether those hosts are providing good customer service**. The analysis is split across several notebooks, covering *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **TEXT PREPROCESSING** and **TOPIC MODELLING** parts.

> The dataset contains the **detailed review data for listings in Bangkok**, compiled on **21 Sep, 2022**. The data come from the **Inside Airbnb site** and are sourced from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [1]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np
import spacy

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
import en_core_web_sm
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('airbnb-bkk-reviews-clean.csv')

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en,stayed apartment week enjoyed much nuttee nice...
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en,br husband daughter months stayed one month pe...
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en,girlfriend recently stayed nuttee condo month ...
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en,honestly thank raewyn enough fiance looking qu...
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en,first time bangkok could picked better place s...


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177914 entries, 0 to 177913
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   listing_id        177914 non-null  int64 
 1   id                177914 non-null  int64 
 2   date              177914 non-null  object
 3   reviewer_id       177914 non-null  int64 
 4   reviewer_name     177914 non-null  object
 5   comments          177914 non-null  object
 6   lang              177914 non-null  object
 7   comments_cleaned  177908 non-null  object
dtypes: int64(3), object(5)
memory usage: 10.9+ MB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,int64,9367,0,0.0,177914
1,id,int64,177914,0,0.0,177914
2,date,object,3676,0,0.0,177914
3,reviewer_id,int64,149093,0,0.0,177914
4,reviewer_name,object,47405,0,0.0,177914
5,comments,object,172964,0,0.0,177914
6,lang,object,1,0,0.0,177914
7,comments_cleaned,object,169170,6,0.0,177914


> Although these were fixed in the previous step, some `dtypes` are still not appropriate (the ID columns are stored as `int64` and *date* as `object`), and there are 6 missing values in the *comments_cleaned* feature. I'll therefore clean the data once more before moving on to text cleaning.

## PREPROCESSING

In [7]:
# check the missing values

df[df['comments_cleaned'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned
25649,5744888,593809298,2020-01-17,153408062,동우,Up to you,en,
40957,8957682,274330541,2018-06-09,88053449,Joseph,You'll be on your own.,en,
57124,11853172,369035913,2019-01-09,232332280,祥,Do yourself!,en,
82429,17047141,224411058,2018-01-03,114758757,Tony,It's,en,
92736,18864066,227122287,2018-01-14,38215411,Thomas,This,en,
169775,45643491,629298679010874005,2022-05-18,442951156,Rita,Not at all,en,


> It seems the missing values are caused by comments in other languages or comments that are not proper reviews, as shown above. I'll fill these values with *No Description* and then move on to fixing the datatypes.

In [8]:
# fill missing values

df['comments_cleaned'] = df['comments_cleaned'].fillna('No Description')

In [9]:
# fixing column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # np.object was removed in NumPy 1.24; the builtin object is equivalent
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])

In [10]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,object,9367,0,0.0,177914
1,id,object,177914,0,0.0,177914
2,date,datetime64[ns],3676,0,0.0,177914
3,reviewer_id,object,149093,0,0.0,177914
4,reviewer_name,object,47405,0,0.0,177914
5,comments,object,172964,0,0.0,177914
6,lang,object,1,0,0.0,177914
7,comments_cleaned,object,169171,0,0.0,177914


> Everything seems to be in order. I'll move on to text processing, followed by topic modelling.

## TEXT PROCESSING

### LEMMATIZATION

> For this part, we will lemmatize the text in the *comments_cleaned* feature, keeping only nouns and adjectives, to get the tokenized result for the modelling part.
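
> As a rough illustration of the part-of-speech filter, the sketch below applies the same keep-nouns-and-adjectives rule to a handful of pre-tagged tokens. The `(lemma, tag)` pairs and the helper name `keep_allowed` are hypothetical stand-ins for spaCy's `token.lemma_` and `token.pos_`, so no language model is needed to run it.

```python
# Hypothetical pre-tagged tokens: (lemma, coarse POS tag) pairs standing in
# for spaCy's token.lemma_ and token.pos_ attributes.
tagged_review = [
    ("stay", "VERB"),
    ("apartment", "NOUN"),
    ("week", "NOUN"),
    ("enjoy", "VERB"),
    ("much", "ADJ"),
    ("nice", "ADJ"),
    ("host", "NOUN"),
]


def keep_allowed(tagged_tokens, allowed_postags=("NOUN", "ADJ")):
    """Join the lemmas whose POS tag is in allowed_postags."""
    return " ".join(lemma for lemma, pos in tagged_tokens if pos in allowed_postags)


filtered = keep_allowed(tagged_review)
print(filtered)
```

> Verbs such as *stay* and *enjoy* are dropped, which is why the lemmatized comments read as lists of nouns and adjectives.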

In [11]:
# load spaCy model for lemmatization

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

In [12]:
# function to lemmatize text

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ')):
    # keep only the lemmas whose part-of-speech tag is in allowed_postags
    output = []
    for text in texts:
        doc = nlp(text)
        output.append(' '.join([word.lemma_ for word in doc if word.pos_ in allowed_postags]))
    return output

In [13]:
# apply lemmatization

comment_list = df['comments_cleaned'].tolist()
comment_tokenized = lemmatization(comment_list)

In [14]:
# check lemmatized comment

print(comment_tokenized[0])

apartment week much nuttee nice host perfect apartment love view balcony apartment modern spacious location central min bt station supermarket bus taxi central world shopping mall food stall massage next visit


In [15]:
# create new feature to store tokenized text

df['comments_tokenized'] = comment_tokenized

In [16]:
# show dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en,stayed apartment week enjoyed much nuttee nice...,apartment week much nuttee nice host perfect a...
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en,br husband daughter months stayed one month pe...,daughter month month peace host friendly know ...
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en,girlfriend recently stayed nuttee condo month ...,girlfriend nuttee month beautiful great view g...
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en,honestly thank raewyn enough fiance looking qu...,enough fiance quiet place hectic life property...
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en,first time bangkok could picked better place s...,first time well place well people complexity c...


In [17]:
# show dataframe

df.tail()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized
177909,706107713188647022,707676595298286010,2022-09-03,338572947,Dieuvy,It was a nice apartment and very clean. A nice...,en,nice apartment clean nice rooftop pool short s...,nice apartment clean rooftop pool short nice
177910,710588929620049969,717799228870862725,2022-09-17,996481,Wilson,Angelia is such a great host! Everything as de...,en,angelia great host everything description plac...,great host description place clean thank listing
177911,710710056332826161,716267676026310287,2022-09-15,9555706,Lino,We really enjoyed our stay at Brians flat. Bri...,en,really enjoyed stay brians flat brian helpful ...,brian flat brian helpful tip orientation stuff...
177912,717796047487387803,719207337995482279,2022-09-19,26461010,Jiajia,The owner’s attitude was so nice and I had bee...,en,owner attitude nice served well certainly come...,owner attitude nice next time good condition
177913,707439535910302353,712750445698916344,2022-09-10,2781667,Alberto,"Super hotel, just a bit confusing to get there...",en,super hotel bit confusing get overall great st...,super hotel bit confusing overall great structure


### SENTIMENT ANALYSIS - VADER

> Now we will analyze the sentiment using **VADER (Valence Aware Dictionary and sEntiment Reasoner)**. It is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion. VADER relies on a dictionary that maps lexical features to emotion intensities known as sentiment scores. The compound score of a text is obtained by summing the intensity of each word in the text, adjusting for rules such as negation and intensifiers, and normalizing the result to the range -1 to 1.
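
> To make the scoring idea concrete, here is a simplified sketch of how a compound score can be derived from word valences. It is not the NLTK implementation: the three-word `toy_lexicon` and its values are made up for illustration, the rule-based adjustments (negation, intensifiers, punctuation, capitalization) are left out, and the normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` is assumed to match the one VADER applies. The labels use the common ±0.05 cut-offs.

```python
import math

# Hypothetical valence values for illustration only; the real VADER lexicon
# ships with NLTK and contains thousands of human-rated entries.
toy_lexicon = {"nice": 1.8, "perfect": 2.7, "dirty": -1.9}


def compound_score(tokens, lexicon, alpha=15):
    """Sum the valences of known tokens and squash the total into (-1, 1)."""
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    return total / math.sqrt(total * total + alpha)


def label(compound):
    """Map a compound score to a sentiment class using +/-0.05 cut-offs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


for text in ["nice host perfect apartment", "dirty room", "description"]:
    score = compound_score(text.split(), toy_lexicon)
    print(text, "->", round(score, 4), label(score))
```

> A text with no words in the lexicon sums to zero and is therefore labelled neutral, regardless of what the reviewer actually meant.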

In [18]:
# function to assign sentiment class based on compound score

def sentiment(comp):
    if comp >= 0.05:
        return 'positive'
    elif (comp > -0.05) and (comp < 0.05):
        return 'neutral'
    elif comp <= -0.05 :
        return 'negative'

In [19]:
# initialize sentiment analyzer

analyzer = SentimentIntensityAnalyzer()

# calculate compound score

compound_score = []
for i in df['comments_tokenized']:
    compound_score.append(analyzer.polarity_scores(i)['compound'])

In [20]:
# create new feature to store compound score

df['compound_score'] = compound_score

In [21]:
# check sentiment on first data

sentiment(df['compound_score'][0])

'positive'

In [22]:
# calculate sentiment based on compound score

sent = []
for i in range(0, len(df)):
    sent.append(sentiment(df['compound_score'][i]))

In [23]:
# create new feature to store sentiment

df['sentiment'] = sent

In [24]:
# check final dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized,compound_score,sentiment
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en,stayed apartment week enjoyed much nuttee nice...,apartment week much nuttee nice host perfect a...,0.872,positive
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en,br husband daughter months stayed one month pe...,daughter month month peace host friendly know ...,0.9794,positive
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en,girlfriend recently stayed nuttee condo month ...,girlfriend nuttee month beautiful great view g...,0.9885,positive
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en,honestly thank raewyn enough fiance looking qu...,enough fiance quiet place hectic life property...,0.9871,positive
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en,first time bangkok could picked better place s...,first time well place well people complexity c...,0.9552,positive


> We now have everything we need for modelling. First, I will do topic modelling.

## TOPIC MODELLING

> **Topic Modelling** falls under unsupervised machine learning, where documents are processed to extract the topics they discuss. It is an important technique in traditional Natural Language Processing because it can surface semantic relationships between words across clusters of documents.

> We will use one of **sklearn's topic modelling methods**, NMF, which is based on Non-negative Matrix Factorization. The NMF model is trained on the tf-idf feature vectors.
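
> For readers unfamiliar with tf-idf, the sketch below computes the weights by hand on three made-up reviews. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how sklearn's `TfidfVectorizer` behaves with default settings as far as I know. Tokenization is simplified to whitespace splitting, and the function name `tfidf` is just a placeholder.

```python
import math
from collections import Counter

# Made-up reviews for illustration
docs = [
    "clean room nice host",
    "great location great host",
    "nice pool clean room",
]


def tfidf(documents):
    """Return the sorted vocabulary and one L2-normalized tf-idf row per document."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        rows.append([value / norm for value in raw])
    return vocab, rows


vocab, matrix = tfidf(docs)
for term, weight in zip(vocab, matrix[1]):
    print(f"{term:10s}{weight:.3f}")
```

> In the second review, *great* outweighs *host* because it occurs twice in that review and in no other, while *host* appears in two of the three reviews.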

### NON NEGATIVE MATRIX FACTORIZATION (NMF)

> **Non-Negative Matrix Factorization** is a linear-algebra method that reduces the dimensionality of the input corpus. It factorizes the document-term matrix so that words contributing little to any topic receive comparatively low weights. NMF is often reported to produce more coherent topics than LDA, and it tends to produce sparse representations: most entries are close to zero and only a few have significant values. This makes it a reasonable choice when only a small number of topics is required.

> In short, **the goal of NMF is to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X**. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction. We will be using sklearn's implementation of NMF.
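
> The factorization can be sketched in a few lines of plain Python. The example below uses the classic multiplicative update rules rather than sklearn's default solver, so it shows the idea and is not a reproduction of what `NMF` does internally. The toy matrix `x` (four documents, four terms) and all function names are made up for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for xr, whr in zip(x, wh) for a, b in zip(xr, whr))


def nmf(x, n_components, n_iter=200, seed=42, eps=1e-9):
    """Approximate X by WH with non-negative W and H (multiplicative updates)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[v * n / (d + eps) for v, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, numer, denom)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[v * n / (d + eps) for v, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, numer, denom)]
    return w, h


# Toy document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
x = [
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.7],
    [0.0, 0.0, 0.9, 1.0],
]

w, h = nmf(x, n_components=2)
print("reconstruction error:", round(frobenius_error(x, w, h), 4))
# the dominant topic of a document is the column of W with the largest weight
print("dominant topic per document:", [max(range(len(r)), key=r.__getitem__) for r in w])
```

> Each row of `w` holds a document's weight on every topic and each row of `h` holds a topic's weight on every term, which mirrors the roles of `W` and `H` in the notebook.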

In [25]:
# convert the text to a tf-idf document-term matrix

vectorizer = TfidfVectorizer(max_features=100)
tf_idf = vectorizer.fit_transform(df['comments_tokenized'])

In [26]:
# show feature names

vectorizer.get_feature_names_out().tolist()

['airbnb',
 'airport',
 'amazing',
 'amenity',
 'apartment',
 'area',
 'bathroom',
 'beautiful',
 'bed',
 'big',
 'bit',
 'br',
 'breakfast',
 'bt',
 'building',
 'check',
 'city',
 'clean',
 'close',
 'comfortable',
 'communication',
 'condo',
 'convenient',
 'cozy',
 'day',
 'distance',
 'easy',
 'excellent',
 'experience',
 'family',
 'first',
 'floor',
 'food',
 'friend',
 'friendly',
 'good',
 'great',
 'gym',
 'helpful',
 'home',
 'host',
 'hotel',
 'house',
 'kitchen',
 'little',
 'local',
 'location',
 'lot',
 'lovely',
 'mall',
 'many',
 'market',
 'min',
 'minute',
 'modern',
 'much',
 'new',
 'next',
 'nice',
 'night',
 'overall',
 'people',
 'perfect',
 'picture',
 'place',
 'pool',
 'price',
 'problem',
 'question',
 'quick',
 'quiet',
 'responsive',
 'restaurant',
 'room',
 'safe',
 'service',
 'shop',
 'shopping',
 'short',
 'small',
 'space',
 'spacious',
 'staff',
 'station',
 'stay',
 'store',
 'street',
 'super',
 'taxi',
 'thank',
 'thing',
 'time',
 'train',
 'trip

In [27]:
# applying NMF factorization

nmf_model = NMF(n_components=5, init='nndsvd', random_state=42)
W = nmf_model.fit_transform(tf_idf)
H = nmf_model.components_
print(W.shape, H.shape)

(177914, 5) (5, 100)


In [28]:
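# get topics

get_topics = []
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(H):
    # join the five highest-weighted terms of each topic, in ascending order of weight
    get_topics.append(' '.join([feature_names[i] for i in topic.argsort()[-5:]]))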

In [29]:
# show topics

get_topics

['comfortable host room clean apartment',
 'value stay host location great',
 'close perfect clean amazing place',
 'room price value location good',
 'location pool clean room nice']

> It seems that the first, third, and fifth topics are about how nice the place and the host are. The second and fourth topics are specifically about the location.

In [30]:
# predict the topics based on tokenized comments

topics = []

for i in df['comments_tokenized']:
    text_to_vector = vectorizer.transform([i])
    prob_score = nmf_model.transform(text_to_vector)
    topics.append(get_topics[np.argmax(prob_score)])

In [31]:
# create new feature to store the topics

df['topics'] = topics

In [32]:
# show top 5 data

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en,stayed apartment week enjoyed much nuttee nice...,apartment week much nuttee nice host perfect a...,0.872,positive,comfortable host room clean apartment
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en,br husband daughter months stayed one month pe...,daughter month month peace host friendly know ...,0.9794,positive,comfortable host room clean apartment
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en,girlfriend recently stayed nuttee condo month ...,girlfriend nuttee month beautiful great view g...,0.9885,positive,value stay host location great
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en,honestly thank raewyn enough fiance looking qu...,enough fiance quiet place hectic life property...,0.9871,positive,close perfect clean amazing place
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en,first time bangkok could picked better place s...,first time well place well people complexity c...,0.9552,positive,close perfect clean amazing place


In [33]:
# show example on topic 1

df['comments'][0]

'We stayed in the apartment for a week and we enjoyed it very much. Nuttee is a very nice host, and she did her best to accommodate us. Everything is perfect in the apartment, and we love the view from the balcony. The apartment is very modern and spacious, and the location is very central, 10 mins walk to BTS station (and supermarket), or 10 mins by bus/taxi to central world shopping mall. There are also a lot of food stalls and massage nearby. We will definitely stay there again for our next visit to Bangkok. Highly recommended.'

In [34]:
# show another example on topic 1

df['comments'][1]

'\r<br/>We, my husband, my daughter (15 months) and I, have stayed one month in this haven of peace. \r<br/>Reawyn and Charlie are great hosts, they are available, really friendly but know how to remain discreet.\r<br/>The chalet is spacious and spotless. The huge garden is sublimely decorated , the swimming pool is perfect (with a jacuzzi corner). There are only few cottages, so the pool area is NOT overcrowding like is the case in most of the places in Bangkok.\r<br/>The aera is quiet and fast of access in taxi, not too far from MTR or BTS, shops are just around the corner. \r<br/>A delicious fruits basket and some cold waters were stored in the room for us. \r<br/>Everything was for best, they took us in charge since our arrival at the airport. \r<br/>Everything was at our disposal, open kitchen, a private patio, a washing machine, clean linen, food ordering (yummy) …\r<br/>Raewyn and Charlie even thought about everything needed for our baby : bed, swing, mosquitos net … \r<br/>They

In [35]:
# show example on topic 3

df['comments'][3]

'I honestly can\'t thank Raewyn enough. Myself and my fiance were looking for a quiet place to relax after the hectic Bangkok life. Raewyn\'s property is an oasis. Tucked away from the real world. My fiance loved it and didn\'t want to do anything but stay in our cabin or lie by the pool. It\'s very surprising they aren\'t overrun. My fiance was recovering for three days and Raewyn assured she was ok while I was away and when I had to fly out early. \r<br/>\r<br/>The property is a bit hard to find but Raewyn picked us up from the metro station. Once you figure it out you\'ll have no problem. Took me about 30 minutes to get down to central Bangkok on the train system. \r<br/>\r<br/>We were there during Songkran so it was a bit noisy outside the property as it\'s a five minute walk to the local temple. This didn\'t bother us much as it was over by 10. \r<br/>\r<br/>When it\'s said that this place is off the beaten path, it\'s the truth. If you\'re looking for a party don\'t stay here. It

> The model seems to assign reasonable topics. In the next phase, I'll dump the cleaned data and build models with various algorithms in a separate notebook. One issue is worth flagging first: rows whose *comments* or *comments_cleaned* value is *No Description* still receive a sentiment and a topic, and those assignments are not meaningful.

In [36]:
# dump to new CSV file

df.to_csv('airbnb-bkk-reviews-tokenized.csv', index=False)

In [37]:
# show the anomaly

df[df['comments']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
18667,3963238,107557973,2016-10-11,89172667,Aubrey,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
41485,8637376,689545407425155927,2022-08-09,42662144,Jack,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
49317,9917169,280311793,2018-06-23,96809532,Luna,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
49627,10158510,326771807,2018-09-23,156904066,Nick,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
62438,13286234,126708978,2017-01-13,50779794,Yeung,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
71582,15320642,236754628,2018-02-20,3007755,Gerhard,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
74543,15976663,602846948,2020-02-08,222656274,Sai Kiran,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
78024,16203889,556169882415852752,2022-02-06,129938815,Marcel,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
87942,17700557,149802100,2017-05-06,113753396,Melisa,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment
94280,19127428,574993078,2019-12-09,104127081,Thu Sang,No Description,en,description,description,0.0,neutral,comfortable host room clean apartment


In [38]:
# show the anomaly 2

df[df['comments_cleaned']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
25649,5744888,593809298,2020-01-17,153408062,동우,Up to you,en,No Description,description,0.0,neutral,comfortable host room clean apartment
40957,8957682,274330541,2018-06-09,88053449,Joseph,You'll be on your own.,en,No Description,description,0.0,neutral,comfortable host room clean apartment
57124,11853172,369035913,2019-01-09,232332280,祥,Do yourself!,en,No Description,description,0.0,neutral,comfortable host room clean apartment
82429,17047141,224411058,2018-01-03,114758757,Tony,It's,en,No Description,description,0.0,neutral,comfortable host room clean apartment
92736,18864066,227122287,2018-01-14,38215411,Thomas,This,en,No Description,description,0.0,neutral,comfortable host room clean apartment
169775,45643491,629298679010874005,2022-05-18,442951156,Rita,Not at all,en,No Description,description,0.0,neutral,comfortable host room clean apartment
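
> A likely explanation for the rows above, not verified here: *description* is not among the 100 tf-idf features, so its vector is all zeros, the topic weights come out as all zeros, and `np.argmax` returns index 0 on ties, which is the first topic. One possible guard is sketched below. The helper `dominant_topic`, the `unassigned` label, and the example weights are hypothetical and not part of the notebook.

```python
def dominant_topic(weights, labels, fallback="unassigned"):
    """Return the label with the largest weight, or fallback if no weight is positive."""
    if not any(weight > 0 for weight in weights):
        return fallback
    best = max(range(len(weights)), key=weights.__getitem__)
    return labels[best]


topic_labels = [
    "comfortable host room clean apartment",
    "value stay host location great",
    "close perfect clean amazing place",
    "room price value location good",
    "location pool clean room nice",
]

# all-zero weights, as expected for a comment with no in-vocabulary terms
print(dominant_topic([0.0, 0.0, 0.0, 0.0, 0.0], topic_labels))
# made-up weights for an ordinary review
print(dominant_topic([0.01, 0.20, 0.00, 0.05, 0.00], topic_labels))
```

> With a guard like this, placeholder comments could be filtered out or labelled separately before the modelling notebook consumes the data.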
