# LISBON AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews for Airbnb listings in Asheville, North Carolina, United States**. It acts as a stepping stone **toward understanding what customers think of the service offered by Asheville's Airbnb hosts, which could help show whether the hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **DATA PREPROCESSING** part.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November 2020**. The data come from the **Inside Airbnb site**, which sources them from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [24]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data
df = pd.read_csv("C:/Users/lizab/Downloads/halew.csv")

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,198480,87255046.0,7/19/2016,40872502,Nour,Belle appartement rÃ©cent situÃ© Ã 15 minutes...
1,198480,91814194.0,8/6/2016,40494412,Vitor,"Morada excelente, com limpeza Ã³tima, muito be..."
2,198480,94780243.0,8/17/2016,70116792,Ricardo,"Boa localizaÃ§Ã£o, casa cÃ´moda e simpÃ¡tica ...."
3,198480,96934467.0,8/25/2016,72247207,Victor,En fin ganska nybyggd lÃ¤genhet med all utrus...
4,198480,111434378.0,10/31/2016,96738915,Elbert Takeshi,"Ã“timo apartamento, mtu aconchegante e espaÃ§o..."
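
> The accented characters in *comments* show up as sequences such as `Ã©`, a pattern that usually means UTF-8 text was decoded as Windows-1252 at some point. Below is a minimal sketch of one possible repair, assuming that is the cause here. `fix_mojibake` is a hypothetical helper, not part of the original notebook, and it returns the string unchanged when the round trip fails.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was mis-decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the assumed kind of garbling; return the input unchanged
        return text

print(fix_mojibake("rÃ©cent"))
print(fix_mojibake("Ã“timo apartamento"))
```

> If the assumption holds, the helper could be applied with `df['comments'].map(fix_mojibake)` before any text cleaning. Strings where part of the byte sequence was already lost will not round-trip and are left as they are.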


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499 entries, 0 to 498
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   listing_id     499 non-null    int64  
 1   id             499 non-null    float64
 2   date           499 non-null    object 
 3   reviewer_id    499 non-null    int64  
 4   reviewer_name  499 non-null    object 
 5   comments       499 non-null    object 
dtypes: float64(1), int64(2), object(3)
memory usage: 23.5+ KB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,int64,7,0,0.0,499
1,id,float64,499,0,0.0,499
2,date,object,459,0,0.0,499
3,reviewer_id,int64,495,0,0.0,499
4,reviewer_name,object,420,0,0.0,499
5,comments,object,499,0,0.0,499


> Some `dtypes` are not appropriate: the ID columns are numeric and *date* is stored as `object`. The summary reports no missing values, but I'll still check the *comments* feature for them later. I'll clean the data in preprocessing first before moving on to text cleaning.

## PREPROCESSING

In [7]:
# fixing column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # the np.object alias was removed in NumPy 1.24; use the builtin object
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])

In [8]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499 entries, 0 to 498
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   listing_id     499 non-null    object        
 1   id             499 non-null    object        
 2   date           499 non-null    datetime64[ns]
 3   reviewer_id    499 non-null    object        
 4   reviewer_name  499 non-null    object        
 5   comments       499 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 23.5+ KB


In [9]:
# check missing values

df[df['comments'].isna()]


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments


In [10]:
# fill missing values (assign back rather than relying on inplace on a column selection)

df['comments'] = df['comments'].fillna('No Description')

In [40]:
# check missing values

df.isna().sum()

listing_id       0
id               0
date             0
reviewer_id      0
reviewer_name    0
comments         0
dtype: int64

> Now that everything is properly cleaned, I'll continue to text processing.

## TEXT PROCESSING

> To start with, I'll clean the text in the *comments* feature by applying *case folding* and *tokenizing*, as well as *removing stopwords*.
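
> To make the case folding and character filtering concrete, here is a small standalone sketch that applies the same regular expression used in the cleaning function to two sample strings. `fold_and_strip` and the sample inputs are illustrative assumptions. Because the pattern keeps only ASCII letters, spaces, and tabs, accented characters are dropped as well.

```python
import re

# Same pattern as the notebook's cleaning step:
# @mentions, non-alphanumeric characters, URLs, and digits become spaces
NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|\d+")

def fold_and_strip(text):
    """Lowercase the text, blank out noise, and collapse repeated whitespace."""
    lowered = text.lower()
    return " ".join(NOISE.sub(" ", lowered).split())

print(fold_and_strip("Great stay @host123! See http://example.com, 15 min walk"))
print(fold_and_strip("Limpeza ótima"))
```

> The second call shows the side effect on non-English reviews: `ótima` becomes `tima`, so words with accents are split or truncated before tokenizing.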

In [25]:
# function to clean text

def clean_text(data, stopword):
    
    # case folding
    data = [i.lower() for i in data]
    # replace @mentions, non-alphanumeric characters, URLs, and digits with spaces
    data = [' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|\d+", " ", i).split()) for i in data]
    res = ' '.join(data) 

    # tokenizing and removing stopwords
    word_tokens = word_tokenize(res)    
    res = ' '.join([i for i in word_tokens if i not in stopword])
    
    return res

In [30]:
# set stopwords
# English stopwords; concatenate stopwords.words('portuguese') to filter Portuguese as well

stop_words = set(stopwords.words('english'))

# text cleaning

comment_filtered = []
for i in df['comments']:
    comment_filtered.append(clean_text([i], stop_words))
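
> To filter stopwords from more than one language, the per-language lists can be merged into a single set. Below is a minimal sketch of that merge. `build_stopwords` and the two short word lists are hypothetical stand-ins so the example runs without NLTK data; with NLTK, `stopwords.words('english')` and `stopwords.words('portuguese')` could be passed instead.

```python
def build_stopwords(*word_lists):
    """Merge any number of stopword lists into one lowercase set."""
    combined = set()
    for words in word_lists:
        combined.update(w.lower() for w in words)
    return combined

# Illustrative stand-ins, not the full NLTK lists
english_sample = ["the", "and", "all"]
portuguese_sample = ["com", "muito", "e"]

stop_words_multi = build_stopwords(english_sample, portuguese_sample)

tokens = "morada excelente com limpeza tima muito bem".split()
filtered = [t for t in tokens if t not in stop_words_multi]
print(filtered)
```

> A set is used so that each membership test stays fast regardless of how many languages are merged.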

In [31]:
# check filtered comment

comment_filtered[0]

'belle appartement r cent situ minutes pied du tro et minutes du bus qui est desservi tout au long de la nuit br ana est rendu disponible durant le jour'

In [32]:
# create new feature to store cleaned text

df['comments_cleaned'] = comment_filtered

In [33]:
# show dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
0,198480,87255046.0,2016-07-19,40872502,Nour,Belle appartement rÃ©cent situÃ© Ã 15 minutes...,belle appartement r cent situ minutes pied du ...
1,198480,91814194.0,2016-08-06,40494412,Vitor,"Morada excelente, com limpeza Ã³tima, muito be...",morada excelente com limpeza tima muito bem lo...
2,198480,94780243.0,2016-08-17,70116792,Ricardo,"Boa localizaÃ§Ã£o, casa cÃ´moda e simpÃ¡tica ....",boa localiza casa c moda e simp tica propriet ...
3,198480,96934467.0,2016-08-25,72247207,Victor,En fin ganska nybyggd lÃ¤genhet med all utrus...,en fin ganska nybyggd l genhet med utrustning ...
4,198480,111434378.0,2016-10-31,96738915,Elbert Takeshi,"Ã“timo apartamento, mtu aconchegante e espaÃ§o...",timo apartamento mtu aconchegante e espa oso g...


> Next, I'll save this cleaned data to a new CSV file to be used in the next part.

In [34]:
# save to a new CSV file

df.to_csv('halew-reviews-clean.csv', index=False)
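
> CSV files do not store dtypes, so the conversions made during preprocessing are lost when the file is read back. Below is a minimal sketch of restoring them on load, using an in-memory buffer and a tiny stand-in frame instead of the real file. The sample values and the choice to read `listing_id` as a string are assumptions for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the cleaned frame (values are illustrative only)
sample = pd.DataFrame({
    "listing_id": ["198480"],
    "date": pd.to_datetime(["2016-07-19"]),
    "comments_cleaned": ["belle appartement"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended dtypes explicitly when reading the data back
reloaded = pd.read_csv(buffer, dtype={"listing_id": str}, parse_dates=["date"])
print(reloaded.dtypes)
```

> The same `dtype` and `parse_dates` arguments could be passed when the next notebook loads `halew-reviews-clean.csv`.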
