# <font color='BLUE'> DAB402 - CAPSTONE PROJECT </font>
# <font color='BLUE'> SENTIMENT ANALYSIS IN SHORT-TERM RENTAL INDUSTRY  </font>

#### <font color='BLUE'>Group Name: Snowflake   </font>  
   - Trang Bui- W0753523  

### <font color='Green'> Description   </font>
- The raw review data come from Inside Airbnb (http://insideairbnb.com/get-the-data.html). This project uses review data for Montreal, Ottawa, Toronto, and Vancouver. The raw data are downloaded and saved in the folder "data/input_output_csv/raw_data".   

### <font color='Green'> Purpose of this notebook   </font>
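This notebook is the first step in preparing data for training the sentiment classification model:   
        + Instead of labelling the raw data manually, I apply three pre-trained sentiment models: Vader, TextBlob, and Flair. The labelling rule is: "A POSITIVE or NEGATIVE label is kept only if all three pre-trained models predict it." In the final labelling cell, a review is labelled NEUTRAL when Vader and TextBlob both score it as neutral, since Flair's `en-sentiment` model predicts only POSITIVE or NEGATIVE; every other review is dropped.  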
#### Input:   
        - raw data (CSV files) in the folder "data/input_output_csv/raw_data", named city + "_reviews.csv"    
#### Output:   
        - labelled data (CSV files) in the folder "data/input_output_csv/buffer", named city + '_reviews_labelled_full.csv'     
Here, <font color='blue'>city</font> is one of "vancouver", "toronto", "montreal", or "ottawa".

### <font color='Green'> Steps to execute   </font>
- Set the variable "city" to one city name at a time, then run the entire notebook; repeat for each of the four cities
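
The file-naming convention described above can be sketched as a small helper. This is only an illustration: `CITIES`, `RAW_DIR`, `BUFFER_DIR`, and `city_paths` are hypothetical names that do not appear in the notebook; the folder and file patterns are the ones listed under Input and Output.

```python
from pathlib import Path

# Hypothetical names for illustration; the folders match the notebook's description
CITIES = ["vancouver", "toronto", "montreal", "ottawa"]
RAW_DIR = Path("data/input_output_csv/raw_data")
BUFFER_DIR = Path("data/input_output_csv/buffer")

def city_paths(city):
    """Return the (input, output) CSV paths the notebook uses for one city."""
    raw_path = RAW_DIR / f"{city}_reviews.csv"
    labelled_path = BUFFER_DIR / f"{city}_reviews_labelled_full.csv"
    return raw_path, labelled_path

# List what each run of the notebook would read and write
for city in CITIES:
    raw_path, labelled_path = city_paths(city)
    status = "found" if raw_path.exists() else "missing"
    print(f"{city}: {raw_path} ({status}) -> {labelled_path}")
```

Checking that every input file exists before starting is worthwhile here, because a full run for one city can take hours.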
        
<img src="images/image1.jpg">

### <font color='purple'>Import libraries</font> 

In [1]:
#pip install nltk==3.3
#pip install vaderSentiment
#pip install pycorenlp
#pip install flair
#pip install BeautifulSoup4
#pip install imblearn
#pip install pycontractions

#pip install --user vaderSentiment
#pip install --user flair
#pip install --user textblob
#pip install --user langdetect
#pip install --user plotly
#pip install --user seaborn
#pip install --user matplotlib
#pip install --user statsmodels
#pip install --user tensorflow
#pip install --user --upgrade scikit-learn

In [2]:
# Import the appropriate Libraries
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

import time
import os, sys    
from datetime import datetime, date
import numpy as np
import pandas as pd # for dataframes
import tensorflow as tf
import nltk
import re
import string
from statsmodels.tsa.arima_model import ARIMA

import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import plotly.graph_objs as go #visualization
import plotly.offline as py#visualization
import itertools

In [3]:
city = 'vancouver'

## <font color='purple'>Load Raw Data</font> 

In [4]:
# Read the data
input_folder = './data/input_output_csv/raw_data/'
buffer_folder = './data/input_output_csv/buffer/'
df = pd.read_csv(input_folder + city + '_reviews.csv')
df_orig = df.copy()
print('df.shape: ',df.shape)
df.head()
print('DONE Load raw data')

df.shape:  (147936, 6)
DONE Load raw data


## <font color='purple'>Remove Non-English Reviews from Raw Data </font> 

In [5]:
#from pycontractions import Contractions
from langdetect import detect

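# Detect the language of a review so that non-English reviews can be removed
def language_detection(text):
    try:
        return detect(text)
    except Exception:
        # detect() raises when it cannot process the input (for example, text with no letters)
        return "Non-Language"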

In [6]:
df['lang'] = df['comments'].apply(language_detection)
df_en = df[(df['lang']== "en")]
print('Df shape after removing non-English: ',df_en.shape)

Df shape after removing non-English:  (138551, 7)
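
The `language_detection` helper above relies on an exception to handle unusable input. A slightly more explicit variant, sketched below, checks for missing or blank text before calling the detector. The function name `safe_language` and its `detect_fn` parameter are assumptions made for illustration; passing the detector as an argument lets the sketch run without langdetect installed.

```python
def safe_language(text, detect_fn):
    """Return detect_fn(text), or "Non-Language" for missing, blank, or undetectable text.

    detect_fn is a hypothetical parameter: any callable that maps a string to a
    language code, such as langdetect's detect.
    """
    # NaN cells from pandas are floats, not strings, so reject them up front
    if not isinstance(text, str) or not text.strip():
        return "Non-Language"
    try:
        return detect_fn(text)
    except Exception:
        return "Non-Language"

# Stand-in detector so the example runs on its own
def always_english(text):
    return "en"

print(safe_language("Great place to stay", always_english))
print(safe_language(float("nan"), always_english))
```

With langdetect installed, this could be applied as `df['comments'].apply(lambda t: safe_language(t, detect))`. langdetect's algorithm is non-deterministic by default, so setting `DetectorFactory.seed = 0` (imported from `langdetect`) before the first detection should make the filtered row count reproducible between runs.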


In [7]:
df_en.to_csv(buffer_folder + city + '_reviews_en.csv', index = False)

In [8]:
df=df_en.copy()
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang
0,10080,720466,2011-11-15,143771,Anthony,this accommodation was excellent. beautiful sp...,en
1,10080,786114,2011-12-14,1472653,Nilesh,The host canceled my reservation 13 days befor...,en
2,10080,989885,2012-03-12,1433564,Avril,"This apartment is fantastic, just what I and m...",en
3,10080,1419559,2012-06-05,725806,Dennis,Very nice apartment and great view. Close to S...,en
4,10080,3354964,2013-01-15,3641867,Jude,Both Rami and Mauricio made our family of 5 fe...,en


## <font color='purple'>Labelling Raw Data Using the Pre-trained Models Vader, TextBlob, and Flair</font> 

In [9]:
df = pd.read_csv(buffer_folder + city + '_reviews_en.csv')
print('df.shape: ',df.shape)
df.head()

df.shape:  (138551, 7)


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang
0,10080,720466,2011-11-15,143771,Anthony,this accommodation was excellent. beautiful sp...,en
1,10080,786114,2011-12-14,1472653,Nilesh,The host canceled my reservation 13 days befor...,en
2,10080,989885,2012-03-12,1433564,Avril,"This apartment is fantastic, just what I and m...",en
3,10080,1419559,2012-06-05,725806,Dennis,Very nice apartment and great view. Close to S...,en
4,10080,3354964,2013-01-15,3641867,Jude,Both Rami and Mauricio made our family of 5 fe...,en


In [10]:
from nltk.sentiment.util import *
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from textblob import TextBlob

import flair
from flair.models import TextClassifier
from flair.data import Sentence



def txtblob_get_polarity(text): 
    return TextBlob(text).sentiment.polarity 

def do_labelling(model, df, buffer_folder, file_name):
    """Label df['comments'] with one pre-trained model ('vader', 'textblob' or 'flair'),
    save the result to buffer_folder + file_name, and return the DataFrame."""
    start_process = time.time()
    if model == 'vader':
        # Vader: use the compound score, which lies in [-1, 1]
        sid = SentimentIntensityAnalyzer()
        df['vader'] = df['comments'].apply(sid.polarity_scores)
        df['vader_score'] = df['vader'].apply(lambda score_dict: score_dict['compound'])
        df['vader_target'] = ''
        df.loc[df.vader_score > 0, 'vader_target'] = 'POSITIVE'
        df.loc[df.vader_score == 0, 'vader_target'] = 'NEUTRAL'
        df.loc[df.vader_score < 0, 'vader_target'] = 'NEGATIVE'

    elif model == 'textblob':
        # TextBlob: use the polarity score, which lies in [-1, 1]
        df['txtblob_Polarity'] = df['comments'].apply(txtblob_get_polarity)
        df['txtblob_target'] = ''
        df.loc[df.txtblob_Polarity > 0, 'txtblob_target'] = 'POSITIVE'
        df.loc[df.txtblob_Polarity == 0, 'txtblob_target'] = 'NEUTRAL'
        df.loc[df.txtblob_Polarity < 0, 'txtblob_target'] = 'NEGATIVE'

    elif model == 'flair':
        flair_score = []
        flair_target = []
        # Show full review text when DataFrames are displayed
        pd.set_option('display.max_colwidth', None)
        # Load Flair's pre-trained English sentiment classifier
        flair_model = TextClassifier.load('en-sentiment')

        # Predict one review at a time; missing or blank reviews get empty placeholders
        for text in df['comments']:
            if not isinstance(text, str) or text.strip() == "":
                flair_score.append("")
                flair_target.append("")
            else:
                sentence = Sentence(text)
                flair_model.predict(sentence)
                flair_score.append(sentence.labels[0].score)
                flair_target.append(sentence.labels[0].value)

        df['flair_score'] = flair_score
        df['flair_target'] = flair_target

    # Save after every model so a long run can be resumed from the buffer folder
    df.to_csv(buffer_folder + file_name, index = False)
    print("+++++ labelling ", model, " duration--- %s seconds ---" % (time.time() - start_process))
    return df

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\teres\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
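
In `do_labelling`, both Vader and TextBlob scores are converted to labels by comparing them with exactly zero, so only a score of precisely 0 counts as NEUTRAL. The sketch below isolates that mapping and adds an optional dead band around zero. The function `score_to_label` and its `eps` parameter are hypothetical and not part of the notebook; `eps=0.0` reproduces the notebook's behaviour, while a value around 0.05 is in line with the thresholds commonly cited for Vader's compound score.

```python
def score_to_label(score, eps=0.0):
    """Map a polarity score in [-1, 1] to a label.

    eps is a hypothetical dead-band half-width: scores with |score| <= eps
    are treated as NEUTRAL. eps=0.0 matches the notebook's rule.
    """
    if score > eps:
        return "POSITIVE"
    if score < -eps:
        return "NEGATIVE"
    return "NEUTRAL"

# Compare the notebook's rule (eps=0.0) with a wider neutral band (eps=0.05)
for s in (0.9824, 0.0, -0.3, 0.03):
    print(s, score_to_label(s), score_to_label(s, eps=0.05))
```

A wider band would move weakly scored reviews into NEUTRAL, which changes how many reviews survive the later agreement step, so it is a design choice to evaluate rather than a drop-in improvement.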


In [11]:
file_name= city + '_reviews_labelled_VADER.csv'
df = do_labelling('vader', df, buffer_folder,file_name )

+++++ labelling  vader  duration--- 87.4802348613739 seconds ---


In [12]:
df = pd.read_csv(buffer_folder + city + '_reviews_labelled_VADER.csv')

file_name = city + '_reviews_labelled_TEXTBLOB.csv'
df = do_labelling('textblob', df, buffer_folder,file_name )

+++++ labelling  textblob  duration--- 73.40068054199219 seconds ---


In [13]:
df = pd.read_csv(buffer_folder + city + '_reviews_labelled_TEXTBLOB.csv')

file_name = city + '_reviews_labelled_FLAIR.csv'
df = do_labelling('flair', df, buffer_folder,file_name )

2021-04-21 21:15:28,386 loading file C:\Users\teres\.flair\models\sentiment-en-mix-distillbert_3.1.pt
+++++ labelling  flair  duration--- 12713.397769927979 seconds ---
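
The Flair step took about 12,713 seconds (roughly 3.5 hours) for 138,551 reviews, partly because each review is predicted on its own. One possible way to reduce this is to predict in batches. The sketch below shows only the batching logic; `batched`, `label_in_batches`, and the `predict_batch` callable are hypothetical names, and a stand-in predictor is used so the code runs without Flair.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def label_in_batches(texts, predict_batch, batch_size=64):
    """Apply predict_batch chunk by chunk and collect the results in order.

    predict_batch is a hypothetical callable: list of str -> list of (label, score).
    """
    results = []
    for chunk in batched(texts, batch_size):
        results.extend(predict_batch(chunk))
    return results

# Stand-in predictor (not Flair) so the sketch is runnable on its own
def fake_predict(chunk):
    return [("POSITIVE", 1.0) if "great" in text else ("NEGATIVE", 1.0) for text in chunk]

labels = label_in_batches(["great view", "dirty room", "great host"], fake_predict, batch_size=2)
print(labels)
```

With Flair, `predict_batch` could build a `Sentence` for each text in the chunk, pass the whole list to `flair_model.predict` (which accepts a list of sentences and a `mini_batch_size` argument), and then read `sentence.labels[0]` from each. Whether this shortens the run noticeably depends on the hardware, so it should be timed on a sample first.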


In [14]:
print('df.shape',df.shape)
df.head(3)

df.shape (138551, 14)


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,lang,vader,vader_score,vader_target,txtblob_Polarity,txtblob_target,flair_score,flair_target
0,10080,720466,2011-11-15,143771,Anthony,"this accommodation was excellent. beautiful space, nicely appointed, clean, amazing view, location is great and there is a parking space for a small fee. our check in went on time and smoothly, simon was very pleasant to deal with. all around happy experience.",en,"{'neg': 0.0, 'neu': 0.538, 'pos': 0.462, 'compound': 0.9824}",0.9824,POSITIVE,0.612,POSITIVE,0.986393,POSITIVE
1,10080,786114,2011-12-14,1472653,Nilesh,The host canceled my reservation 13 days before arrival.,en,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0,NEUTRAL,0.0,NEUTRAL,0.999539,NEGATIVE
2,10080,989885,2012-03-12,1433564,Avril,"This apartment is fantastic, just what I and my 5 graduate students were looking for. The beds are very comfortable, the kitchen is well-equipped, and the apartment is only a few blocks from the convention center. We paid much less for housing than our colleagues who were staying in hotels, and we saved a lot on food because of the kitchen. Has a dishwasher, and a washer & dryer. I was initially confused about how many people could be accommodated because the website said it sleeps 6 but ""extra guests are free"", but Rami clarified that there are 2 queen beds and 2 single beds can be brought in, so that it accommodates a total of 6. The apartment is a corporate suite on the 36th floor of the Marriott, 2 floors from the top with great views of Stanley Park, the seaplane port, and the city. I'd like to bring my family and come back for a vacation. Rami was very responsive and his staff met us at the entrance to the building to get us oriented and settled in. Very professional.",en,"{'neg': 0.017, 'neu': 0.867, 'pos': 0.116, 'compound': 0.9595}",0.9595,POSITIVE,0.13955,POSITIVE,0.999791,POSITIVE


In [15]:
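# Count, for each review, how many of the three models predicted each label
df_tmp=pd.DataFrame()
df_tmp['tmp']=df[['vader_target','txtblob_target','flair_target']].values.tolist()
df_tmp=df_tmp['tmp'].apply(pd.Series.value_counts)

# POSITIVE and NEGATIVE need all three models to agree; Flair never predicts NEUTRAL,
# so NEUTRAL needs agreement between Vader and TextBlob only
conditions=[(df_tmp['POSITIVE']==3),(df_tmp['NEUTRAL']==2),(df_tmp['NEGATIVE']==3)]
values=['POSITIVE','NEUTRAL','NEGATIVE']
# default='' keeps the label column a string column for reviews without agreement
df_tmp['label']=np.select(conditions,values,default='')

# Keep only the reviews that received an agreed label
df_tmp=df_tmp.query('label=="POSITIVE" | label=="NEUTRAL" | label=="NEGATIVE"')

# The inner join on the index drops the reviews that were filtered out above
df_final=pd.concat([df['listing_id'],df['comments'], df_tmp['label']], axis=1, join="inner")
df_final.to_csv(buffer_folder + city + '_reviews_labelled_full.csv', index = False)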
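
The row-wise `apply(pd.Series.value_counts)` in the cell above builds a small Series for every review, which can be slow on more than 100,000 rows. The same agreement rule can be written with column comparisons instead. This is a sketch under the assumption that Flair only ever outputs POSITIVE or NEGATIVE, so "two NEUTRAL votes" means Vader and TextBlob agree; the function name `consensus_labels` and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def consensus_labels(df):
    """Return the agreed label per row, or '' when the models disagree."""
    v, t, f = df["vader_target"], df["txtblob_target"], df["flair_target"]
    conditions = [
        (v == "POSITIVE") & (t == "POSITIVE") & (f == "POSITIVE"),
        (v == "NEUTRAL") & (t == "NEUTRAL"),  # Flair has no NEUTRAL class
        (v == "NEGATIVE") & (t == "NEGATIVE") & (f == "NEGATIVE"),
    ]
    labels = np.select(conditions, ["POSITIVE", "NEUTRAL", "NEGATIVE"], default="")
    return pd.Series(labels, index=df.index, name="label")

# Made-up sample: rows 0, 1 and 3 reach agreement, row 2 does not
sample = pd.DataFrame({
    "vader_target":   ["POSITIVE", "NEUTRAL",  "POSITIVE", "NEGATIVE"],
    "txtblob_target": ["POSITIVE", "NEUTRAL",  "NEGATIVE", "NEGATIVE"],
    "flair_target":   ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
})
sample["label"] = consensus_labels(sample)
kept = sample[sample["label"] != ""]
print(kept)
```

Because the comparisons name the columns directly, this version also avoids a `KeyError` if one of the three labels never occurs in a city's data.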

In [16]:
df_final

Unnamed: 0,listing_id,comments,label
0,10080,"this accommodation was excellent. beautiful space, nicely appointed, clean, amazing view, location is great and there is a parking space for a small fee. our check in went on time and smoothly, simon was very pleasant to deal with. all around happy experience.",POSITIVE
1,10080,The host canceled my reservation 13 days before arrival.,NEUTRAL
2,10080,"This apartment is fantastic, just what I and my 5 graduate students were looking for. The beds are very comfortable, the kitchen is well-equipped, and the apartment is only a few blocks from the convention center. We paid much less for housing than our colleagues who were staying in hotels, and we saved a lot on food because of the kitchen. Has a dishwasher, and a washer & dryer. I was initially confused about how many people could be accommodated because the website said it sleeps 6 but ""extra guests are free"", but Rami clarified that there are 2 queen beds and 2 single beds can be brought in, so that it accommodates a total of 6. The apartment is a corporate suite on the 36th floor of the Marriott, 2 floors from the top with great views of Stanley Park, the seaplane port, and the city. I'd like to bring my family and come back for a vacation. Rami was very responsive and his staff met us at the entrance to the building to get us oriented and settled in. Very professional.",POSITIVE
3,10080,"Very nice apartment and great view. Close to Stanley Park and Robson street. Convenient underground parking space. Two extra single electric air beds are included, although the air beds will leak some air overnight.\r\n",POSITIVE
4,10080,"Both Rami and Mauricio made our family of 5 feel very welcome and were extremely helpful with our booking and questions. The apartment and facilities were fantastic - we would definitely stay here again. Great spot right down town, close to all of Vancouvers sites and transport, and to be treated at the end of the day with a soak in the hot tub was truly luxurious. The views are stunning too.",POSITIVE
...,...,...,...
138546,47929231,The unit is as described and the view is even better than the photos! The cleanliness is satisfactory and I feel comfy staying there. Convenient location with skytrain and Costco just across street. I will come back again!,POSITIVE
138547,48015009,"We had wonderful stay at May's apartment. The location could not be better, very central, short walking distance from all the main attractions. The apartment was very clean and spacious.",POSITIVE
138548,48015009,"Ideal location, spacious apartment and well equipped – perfect for a few days or long term in Vancouver. Communication with May was very easy, and the photo directions for finding the place were a great help. Highly recommended!",POSITIVE
138549,48015009,"Clean, modern, comfortable, well equipped and well laid out. What else can I say? I definitely recommend May's place!",POSITIVE
