## Airbnb Sentiment Analysis - Data Preprocessing - Part 1

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Bangkok**. It acts as a stepping stone **to understand what customers think of the service offered by Bangkok's Airbnb hosts, which could help show whether the hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **DATA PREPROCESSING** part.

> The dataset contains the **detailed review data for listings in Bangkok** compiled on **21 Sep, 2022**. The data comes from the **Inside Airbnb site**, which sources it from publicly available information on the Airbnb site. The data has been analyzed, cleansed and aggregated where appropriate to facilitate public discussion. For more on this and similar datasets, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [1]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('reviews.csv')

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)..."
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255675 entries, 0 to 255674
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     255675 non-null  int64 
 1   id             255675 non-null  int64 
 2   date           255675 non-null  object
 3   reviewer_id    255675 non-null  int64 
 4   reviewer_name  255673 non-null  object
 5   comments       255650 non-null  object
dtypes: int64(3), object(3)
memory usage: 11.7+ MB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,comments,object,245048,25,0.01,255675
1,listing_id,int64,10067,0,0.0,255675
2,id,int64,255675,0,0.0,255675
3,date,object,3683,0,0.0,255675
4,reviewer_id,int64,217084,0,0.0,255675
5,reviewer_name,object,73051,2,0.0,255675


> Some of the `dtypes` are not appropriate (the ID columns are stored as integers and *date* as a plain string), and there are missing values in the *comments* feature. I'll handle both below, cleaning the data first before moving on to text cleaning.

## PREPROCESSING

In [7]:
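# fixing column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # identifiers are labels, not quantities
        # (np.object was removed in NumPy 1.24, so use the builtin object)
        df[i] = df[i].astype(object)
    elif i == 'date':
        # parse the date strings into datetime64 values
        df[i] = pd.to_datetime(df[i])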
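
The same conversion can also be written without a loop. This is a minimal sketch rather than the notebook's own code. It assumes a toy frame (`toy`, hypothetical) with the same column names as `reviews.csv`; the values are copied from the first rows shown earlier.

```python
import pandas as pd

# Toy stand-in for the reviews data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "listing_id": [27934, 172332],
    "id": [1094339, 3367236],
    "date": ["2012-04-07", "2013-01-18"],
    "reviewer_id": [1368195, 4406839],
})

id_cols = ["listing_id", "id", "reviewer_id"]

# Identifiers are labels, not quantities, so store them as Python objects
toy = toy.astype({col: object for col in id_cols})

# Parse the ISO-formatted date strings into datetime values
toy["date"] = pd.to_datetime(toy["date"])

print(toy.dtypes)
```

Passing a dictionary to `astype` keeps the list of ID columns in one place, which makes it easier to reuse in later notebooks.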

In [8]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255675 entries, 0 to 255674
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   listing_id     255675 non-null  object        
 1   id             255675 non-null  object        
 2   date           255675 non-null  datetime64[ns]
 3   reviewer_id    255675 non-null  object        
 4   reviewer_name  255673 non-null  object        
 5   comments       255650 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 11.7+ MB


In [9]:
# check missing values

df[df['comments'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
22513,3963238,107557973,2016-10-11,89172667,Aubrey,
53083,8637376,689545407425155927,2022-08-09,42662144,Jack,
64142,9917169,280311793,2018-06-23,96809532,Luna,
64531,10158510,326771807,2018-09-23,156904066,Nick,
83437,13286234,126708978,2017-01-13,50779794,Yeung,
97035,15320642,236754628,2018-02-20,3007755,Gerhard,
101178,15976663,602846948,2020-02-08,222656274,Sai Kiran,
106424,16203889,556169882415852752,2022-02-06,129938815,Marcel,
122197,17700557,149802100,2017-05-06,113753396,Melisa,
131306,19127428,574993078,2019-12-09,104127081,Thu Sang,


In [10]:
# fill missing values

# assign the result back instead of calling fillna(inplace=True) on a column selection
df['comments'] = df['comments'].fillna('No Description')

In [11]:
# check missing values

df.isna().sum()

listing_id       0
id               0
date             0
reviewer_id      0
reviewer_name    2
comments         0
dtype: int64

> The missing comments are now filled. Two missing values remain in *reviewer_name*, which this notebook leaves as they are. We will continue to text processing.
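
Filling with a placeholder keeps the row count, but the placeholder string then passes through language detection and text cleaning like any real review. An alternative is to drop the reviews that have no text. The sketch below compares the two options on a toy frame (hypothetical data, not taken from the dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "comments": ["Great stay", None, "Clean room"],
})

# Option used in this notebook: keep the row and insert a placeholder
filled = toy.assign(comments=toy["comments"].fillna("No Description"))

# Alternative: drop reviews that have no text at all
dropped = toy.dropna(subset=["comments"]).reset_index(drop=True)

print(len(filled), len(dropped))
```

Which option is preferable depends on whether later steps should count these reviews. The notebook keeps them.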

## TEXT PROCESSING

> First, we identify the language used in each customer review, since only English reviews are kept for the analysis.

In [12]:
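# create a function to detect the language used for the review. We will only be considering English reviews

from langdetect import detect, DetectorFactory

# fix the seed so that detection results are reproducible between runs
DetectorFactory.seed = 0

def language(comments):
    review = str(comments)
    try:
        return detect(review)
    except Exception:
        # detection can fail on text without usable features (e.g. only digits or symbols)
        return 'Unknown Language'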

In [13]:
# create a new column lang to identify comments language

df['lang'] = df['comments'].apply(language)
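
Very short or non-alphabetic comments give a language detector little to work with. Below is a minimal sketch of a guard around the detector. The names `safe_detect`, `min_letters` and `stub_detector` are hypothetical and introduced only for illustration. In the notebook, the `detector` argument would be `langdetect.detect`.

```python
def safe_detect(text, detector, min_letters=3, fallback="Unknown Language"):
    """Return the detected language code, or `fallback` when detection is not possible."""
    text = str(text)
    # Count alphabetic characters (this also counts letters in non-Latin scripts)
    n_letters = sum(ch.isalpha() for ch in text)
    if n_letters < min_letters:
        return fallback
    try:
        return detector(text)
    except Exception:
        # The detector failed on this input; mark the language as unknown
        return fallback


# Stub detector used only to demonstrate the control flow (not a real model)
def stub_detector(text):
    if "!" in text:
        raise ValueError("cannot detect")
    return "en"


print(safe_detect("Great location and host", stub_detector))  # en
print(safe_detect("10/10", stub_detector))                    # Unknown Language
print(safe_detect("Wow!!!", stub_detector))                   # Unknown Language
```

The threshold is a judgment call: a value that is too high would also discard legitimate two-character reviews in languages such as Chinese.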

In [14]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en


In [15]:
# show last 5

df.tail()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang
255670,715116819229582647,717102522740483216,2022-09-16,477946386,Le,位置便利，步行约五分钟可到达BTS站。房间干净､房东热心帮忙。推荐入住！！,zh-cn
255671,706896876692173997,715634827622465913,2022-09-14,285141894,八一镰刀,房东非常好，下次还会住,zh-cn
255672,717796047487387803,719207337995482279,2022-09-19,26461010,Jiajia,The owner’s attitude was so nice and I had bee...,en
255673,706911691898125330,717062807939818506,2022-09-16,473434095,J,"매우 깨끗하고, ekkamai 역에서 가까워요.",ko
255674,707439535910302353,712750445698916344,2022-09-10,2781667,Alberto,"Super hotel, just a bit confusing to get there...",en
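
The tail of the frame shows that not all reviews are in English. A frequency count on the `lang` column is a quick way to see how many reviews each language accounts for, and how many would survive an English-only filter. This is a minimal sketch on toy labels (hypothetical values, not the real distribution).

```python
import pandas as pd

# Toy language labels (hypothetical), in the format stored in the `lang` column
lang = pd.Series(["en", "en", "zh-cn", "ko", "en", "Unknown Language"], name="lang")

# Number of reviews per detected language, most frequent first
counts = lang.value_counts()

# Fraction of reviews that an English-only filter would keep
share_en = (lang == "en").mean()

print(counts)
print(f"English share: {share_en:.0%}")
```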


In [16]:
# filter and make a new data frame to consider only english reviews

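# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df_en = df.loc[df['lang'] == 'en'].copy()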

In [17]:
# show top 5

df_en.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en


### Text Cleaning

> To start, we clean the text in the **comments** feature by **case folding**, stripping mentions, URLs, digits and punctuation, **tokenizing**, and **removing stopwords**.

In [18]:
# function to clean text

def clean_text(data, stopword):

    # case folding
    data = [i.lower() for i in data]

    # replace mentions, non-alphanumeric characters, URLs and digits with a space,
    # then collapse repeated whitespace
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|\d+"
    data = [' '.join(re.sub(pattern, " ", i).split()) for i in data]
    res = ' '.join(data)

    # tokenizing and stopword removal
    word_tokens = word_tokenize(res)
    res = ' '.join([i for i in word_tokens if i not in stopword])

    return res
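
To see what each part of the regular expression removes, here is a small self-contained sketch. It assumes a tiny hand-written stopword set (`STOPWORDS`, hypothetical) and uses plain `str.split` instead of NLTK's `word_tokenize`, so it runs without downloading any corpora. Results on real reviews may differ slightly from those of `clean_text`.

```python
import re

# Same alternation as in clean_text, split up for readability
# (the input is lowercased first, so the uppercase ranges are not needed here)
PATTERN = re.compile(
    r"(@[a-z0-9]+)"       # @mentions
    r"|([^0-9a-z \t])"    # any character that is not a letter, digit, space or tab
    r"|(\w+://\S+)"       # URLs such as http://...
    r"|\d+"               # runs of digits
)

# Tiny illustrative stopword list (assumption; the notebook uses NLTK's English list)
STOPWORDS = {"the", "was", "and", "a", "to", "from"}


def clean_sketch(text):
    text = text.lower()              # case folding
    text = PATTERN.sub(" ", text)    # replace unwanted spans with a space
    tokens = text.split()            # whitespace tokenization
    return " ".join(t for t in tokens if t not in STOPWORDS)


print(clean_sketch("The room was GREAT, and 5 mins from the BTS!"))
# room great mins bts
```

Because every non-alphanumeric character is replaced rather than parsed, HTML remnants such as `<br/>` leave the tag name behind as a token. That is why `br` appears at the start of one of the cleaned comments in this notebook.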

In [19]:
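# set stopword
# (needs the NLTK 'stopwords' corpus and the tokenizer data, 'punkt' or 'punkt_tab' depending on the NLTK version; fetch them with nltk.download)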

stop_words = set(stopwords.words('english'))

# text cleaning

comment_filtered = [clean_text([i], stop_words) for i in df_en['comments']]

In [20]:
# check filtered comment

comment_filtered[0]

'stayed apartment week enjoyed much nuttee nice host best accommodate us everything perfect apartment love view balcony apartment modern spacious location central mins walk bts station supermarket mins bus taxi central world shopping mall also lot food stalls massage nearby definitely stay next visit bangkok highly recommended'

In [21]:
# create new feature to store cleaned text

df_en['comments_cleaned'] = comment_filtered

In [22]:
# show dataframe

df_en.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,comments_cleaned
0,27934,1094339,2012-04-07,1368195,Michael,We stayed in the apartment for a week and we e...,en,stayed apartment week enjoyed much nuttee nice...
1,172332,3367236,2013-01-18,4406839,Sophie,"\r<br/>We, my husband, my daughter (15 months)...",en,br husband daughter months stayed one month pe...
2,27934,1241042,2012-05-07,2007324,Scott,My girlfriend and I recently stayed in Nuttee'...,en,girlfriend recently stayed nuttee condo month ...
3,172332,4227455,2013-04-20,5857160,Erick,I honestly can't thank Raewyn enough. Myself a...,en,honestly thank raewyn enough fiance looking qu...
4,172332,7747052,2013-10-01,7431791,Peter,This was my first time in Bangkok and I could ...,en,first time bangkok could picked better place s...


> Next, we will save the cleaned data to a new CSV file to be used in the next part.

In [23]:
# save to a new CSV file
df_en.to_csv('airbnb-bkk-reviews-clean.csv', index=False)
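
One thing to watch in the next notebook: CSV does not store dtypes, so the conversions made earlier are lost on reload unless they are requested again. The sketch below shows the round trip using an in-memory buffer instead of a file. The toy rows are hypothetical apart from the column names.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "listing_id": pd.Series([27934, 172332], dtype=object),
    "date": pd.to_datetime(["2012-04-07", "2013-01-18"]),
    "comments_cleaned": ["stayed apartment week", "first time bangkok"],
})

# Write to an in-memory buffer (stands in for the CSV file on disk)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: identifiers come back as integers
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload while restoring the intended dtypes
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"listing_id": str}, parse_dates=["date"])

print(plain.dtypes)
print(restored.dtypes)
```

Cleaned comments that ended up as empty strings (for example, a review made up only of stopwords) are also read back as missing values by default, so a missing-value check after loading is worthwhile.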