# Simple data exploration of the Airbnb dataset for Lisbon (July 2019)

Authors:
Ricardo Clemente

Contact: ricardomiguelrosaclemente@gmail.com

Github: https://github.com/ric-clemente

# Contents
* [1. Loading libraries and data](#1.-Loading-libraries-and-data)
  * [1.1 Loading libraries](#1.1-Loading-libraries)
  * [1.2 Loading data](#1.2-Loading-data)
* [2. Data exploration by Neighbourhood](#2.-Data-exploration-by-Neighbourhood)
  * [2.1 Frequency](#2.1-Frequency)
  * [2.2 Price Average](#2.2-Price-Average)
  * [2.3 Room types](#2.3-Room-types)
  * [2.4 Review Rating Average](#2.4-Review-Rating-Average)
  * [2.5 Average Prices in the future](#2.5-Average-Prices-in-the-future)
  * [2.6 Prices vs Review Ratings](#2.6-Prices-vs-Review-Ratings)
  * [2.7 Find the best Neighbourhood](#2.7-Find-the-best-Neighbourhood)



# 1. Loading libraries and data

## 1.1 Loading libraries

In [None]:
## Importing packages

# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

library(tidyverse) # metapackage; attaches dplyr and ggplot2, among others
library(dplyr)     # data summaries (group_by, summarise)
library(ggplot2)   # plotting
library(sqldf)     # SQL queries on data frames
#library(ggpubr)


## Reading in files

# You can access files from datasets you've added to this kernel in the "../input/" directory.
# You can see the files added to this kernel by running the code below. 

list.files(path = "../input")


## 1.2 Loading data


In [None]:
#load datasets
listings<-read.csv("../input/lisboa-airbnb-data/listings.csv")
listings_details<-read.csv("../input/lisboa-airbnb-data/listings_details.csv")
calendar<-read.csv("../input/lisboa-airbnb-data/calendar.csv")





In [None]:
# Combine listings with selected columns from listings_details.
# cbind() pairs rows by position, so this assumes both files list the same listings in the same order.
detail_cols <- c("property_type", "accommodates", "review_scores_rating", "maximum_nights",
                 "listing_url", "host_is_superhost", "host_about", "host_response_time",
                 "host_response_rate", "street", "weekly_price", "monthly_price", "market")
listings_df <- cbind(listings, listings_details[, detail_cols])

In [None]:
# show the first rows of the merged dataset
head(listings_df)
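
Because `cbind()` matches rows purely by position, a key-based join is a safer alternative when the two files might be ordered differently. The sketch below uses two small made-up data frames (not the real Airbnb files) and assumes that an `id` column identifies a listing in both tables, as the join in section 2.5 does for `listings_details`.

```r
# Toy stand-ins for listings and listings_details (hypothetical values)
listings_toy <- data.frame(id = c(1, 2, 3), price = c(50, 80, 65))
details_toy  <- data.frame(id = c(3, 1, 2), accommodates = c(4, 2, 2))

# cbind() would pair row 1 with row 1 and attach the wrong details to listing 1.
# merge() pairs rows by the shared key instead.
merged_toy <- merge(listings_toy, details_toy, by = "id")
print(merged_toy)
```

With `merge()`, each listing keeps its own `accommodates` value regardless of the row order in either input.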

# 2. Data exploration by Neighbourhood

## 2.1 Frequency
Count the number of listings in each neighbourhood group.


In [None]:
#compute frequency
Neighbourhood_group_freq <- listings_df %>%
  group_by(neighbourhood_group) %>%
  summarise(counts = n())

#Show results
head(Neighbourhood_group_freq)


#plot results
options(repr.plot.width = 30, repr.plot.height = 10)


ggplot(Neighbourhood_group_freq, aes(x = neighbourhood_group, y = counts)) +
  # one bar per neighbourhood group; geom_col() already uses the y values as bar heights
  geom_col(fill = "#0073C2FF") +
  geom_text(aes(label = counts), hjust = 1, color = "white", size = 6) +
  coord_flip(expand = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 20))

## 2.2 Price Average
Calculate the average price by neighbourhood group, for listings that accommodate two guests.

In [None]:
#Calculate price average 
Neighbourhood_group_price_avg <- listings_df %>%
  filter(accommodates==2) %>%
  group_by(neighbourhood_group) %>%
  summarise(avg_price = mean(price,na.rm=TRUE))

head(Neighbourhood_group_price_avg)


options(repr.plot.width = 30, repr.plot.height = 10)


ggplot(Neighbourhood_group_price_avg, aes(x = neighbourhood_group, y = avg_price)) +
  # one bar per neighbourhood group; geom_col() already uses the y values as bar heights
  geom_col(fill = "#0073C2FF") +
  # round the label so it fits inside the bar
  geom_text(aes(label = round(avg_price, 1)), hjust = 1, color = "white", size = 6) +
  coord_flip(expand = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 20))

## 2.3 Room types
Count the listings of each room type to see which types are most common.

In [None]:
Neighbourhood_group_type_freq <- listings_df %>%
  group_by(room_type) %>%
  summarise(counts = n())

head(Neighbourhood_group_type_freq)


options(repr.plot.width = 30, repr.plot.height = 10)


ggplot(Neighbourhood_group_type_freq, aes(x = room_type, y = counts)) +
  # one bar per room type; geom_col() already uses the y values as bar heights
  geom_col(fill = "#0073C2FF") +
  geom_text(aes(label = counts), hjust = 1, color = "white", size = 6) +
  coord_flip(expand = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 20))

## 2.4 Review Rating Average
Calculate the average review rating by neighbourhood group, using only listings with more than 10 reviews.

In [None]:
Neighbourhood_group_rating_avg <- listings_df %>%
  filter(number_of_reviews>10) %>%
  group_by(neighbourhood_group) %>%
  summarise(avg_score = mean(review_scores_rating,na.rm=TRUE))

head(Neighbourhood_group_rating_avg)


options(repr.plot.width = 30, repr.plot.height = 10)


ggplot(Neighbourhood_group_rating_avg, aes(x = neighbourhood_group, y = avg_score)) +
  # one bar per neighbourhood group; geom_col() already uses the y values as bar heights
  geom_col(fill = "#0073C2FF") +
  # round the label so it fits inside the bar
  geom_text(aes(label = round(avg_score, 1)), hjust = 1, color = "white", size = 6) +
  coord_flip(expand = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 20))

## 2.5 Average Prices in the future
See how prices evolve over the coming months, based on the dates that are already marked as unavailable (booked) in the calendar.

In [None]:
# Use SQL to join calendar with listings_details, keeping listings that accommodate two guests
mydatajoin <- sqldf("
  select calendar.listing_id, calendar.date, calendar.price, calendar.available,
         listings_details.accommodates
  from calendar
  join listings_details on calendar.listing_id = listings_details.id
  where listings_details.accommodates = 2
")

# Convert the price column to numeric by stripping the dollar sign and commas
mydatajoin$price <- as.numeric(gsub("[\\$,]", "", mydatajoin$price))

head(mydatajoin)
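
The price conversion above can be checked on a few sample strings. This is a minimal sketch; the input values are made up and assume prices are stored as text with a dollar sign and comma thousands separators, which is what the `gsub()` pattern implies.

```r
# Hypothetical price strings; NA stands in for a missing price
raw_price <- c("$1,250.00", "$85.00", NA)

# Remove "$" and "," and convert the remaining text to a number
clean_price <- as.numeric(gsub("[\\$,]", "", raw_price))
print(clean_price)

# Missing prices stay NA, so averages need na.rm = TRUE
print(mean(clean_price, na.rm = TRUE))
```

Missing values pass through as `NA`, which is why the averages in this notebook are computed with `na.rm=TRUE`.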





In [None]:
# Calculate the average price per date
# Filters used:
# available == 'f' (date not available, treated here as booked) and accommodates == 2 (applied in the join above)
avg_price_future <- mydatajoin %>%
  filter(available=='f') %>%
  group_by(date) %>%
  summarise(avg = mean(price,na.rm=TRUE))

#convert to date format
avg_price_future$date <- as.Date(avg_price_future$date, format="%Y-%m-%d")

head(avg_price_future)

ggplot(avg_price_future, aes(x = date, y = avg)) +
  geom_line(color = "steelblue") +
  scale_x_date(date_labels = "%b/%Y")

The plot shows average prices starting to rise after April 2020.
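
Daily averages can be noisy, so one way to check a trend like this is to aggregate them by month. The sketch below uses a made-up `avg_price_toy` data frame with the same `date` and `avg` columns as `avg_price_future`; the values are illustrative, not results from the dataset. Note that it averages the daily averages, so days with few booked listings weigh as much as busy days.

```r
# Toy daily averages (hypothetical values)
avg_price_toy <- data.frame(
  date = as.Date(c("2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02")),
  avg  = c(60, 62, 70, 74)
)

# Label each date with its year-month, then average within each month
avg_price_toy$month <- format(avg_price_toy$date, "%Y-%m")
monthly_avg <- aggregate(avg ~ month, data = avg_price_toy, FUN = mean)
print(monthly_avg)
```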

## 2.6 Prices vs Review Ratings
Check whether prices and review ratings are correlated.

In [None]:
options(repr.plot.width = 30, repr.plot.height = 10)



ggplot(listings_df, aes(x = review_scores_rating, y = price, color = neighbourhood_group)) +
  # size is a fixed value, so it is set on the layer rather than mapped inside aes()
  geom_point(size = 2)

In some neighbourhood groups, such as Lisboa and Sintra, guests appear to pay more for listings with better reviews.
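
A scatter plot gives only a visual impression; a correlation coefficient per neighbourhood group puts a number on it. The sketch below is one possible check, using base R and a made-up data frame whose column names mirror `listings_df`. The groups "A" and "B" and all values are hypothetical.

```r
# Toy data with the same column names as listings_df (hypothetical values)
toy <- data.frame(
  neighbourhood_group  = c("A", "A", "A", "B", "B", "B"),
  review_scores_rating = c(80, 90, 100, 80, 90, 100),
  price                = c(40, 50, 60, 70, 60, NA)
)

# Pearson correlation between rating and price within each group,
# dropping rows where either value is missing
cor_by_group <- sapply(split(toy, toy$neighbourhood_group), function(d) {
  cor(d$review_scores_rating, d$price, use = "complete.obs")
})
print(cor_by_group)
```

A value near 1 means higher-rated listings tend to cost more in that group, a value near -1 means the opposite, and a value near 0 means no linear relationship.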

## 2.7 Find the best Neighbourhood
Find which neighbourhood groups combine the best review ratings with the lowest prices.

In [None]:
# merge the two summary tables (average rating and average price per neighbourhood group)
avg_price_vs_rating_avg<-merge(x = Neighbourhood_group_rating_avg, y = Neighbourhood_group_price_avg, by = "neighbourhood_group", all = TRUE)



options(repr.plot.width = 30, repr.plot.height = 10)



ggplot(avg_price_vs_rating_avg, aes(x = avg_price, y = avg_score, color = neighbourhood_group)) +
  # size is a fixed value, so it is set on the layer rather than mapped inside aes()
  geom_point(size = 3)

Based on the graph above, Arruda dos Vinhos and Azambuja offer the best value for money: a high average rating at a low average price.
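
One way to make "best value" concrete is to rank the groups by average rating per unit of price. This ratio is an assumption introduced here for illustration, not a metric used elsewhere in this notebook, and the data frame below is made up (it only mirrors the columns of `avg_price_vs_rating_avg`).

```r
# Toy summary table (hypothetical groups and values)
toy_avg <- data.frame(
  neighbourhood_group = c("A", "B", "C"),
  avg_score = c(95, 90, 92),
  avg_price = c(50, 30, 80)
)

# Rating points per unit of price: higher means better value
toy_avg$value <- toy_avg$avg_score / toy_avg$avg_price

# Sort from best to worst value
ranked <- toy_avg[order(-toy_avg$value), ]
print(ranked)
```

A simple ratio like this favours cheap groups strongly, so it is best read alongside the scatter plot rather than as a replacement for it.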