# <center>Boston Airbnb Analysis</center>

<img src="https://cdn-images-1.medium.com/max/2400/1*BfcMRSGnD5pMG61KfPqjVA.png" style="width:900px;"/>

Original photo by [Anthony Delanoix](https://unsplash.com/@anthonydelanoix) on [Unsplash](https://unsplash.com)

In [91]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
cd = os.getcwd()
import re
import statsmodels.api as sm

# magic word for producing visualizations in notebook
%matplotlib inline

import plotly.plotly as py #for creating interactive data visualizations
import plotly.graph_objs as go
from plotly import tools
import plotly.tools as tls
py.sign_in('salitr', 'YOUR_PLOTLY_API_KEY') #placeholder: the real API key has been removed for security
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot #to work with data visualization offline
init_notebook_mode(connected=True)
import cufflinks as cf #connects Plotly with pandas to produce the interactive data visualizations
cf.go_offline()

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import metrics

from IPython.display import Image

import gmaps
gmaps.configure(api_key='YOUR_GOOGLE_MAPS_API_KEY') #placeholder: the real API key has been removed for security
from ipywidgets.embed import embed_minimal_html

---
# <center>STAGE ONE</center>
---

## 1. Business Understanding

### 1.1 Business Questions

<font color=blue> 
    Q1. How do prices for Boston's Airbnb listings fluctuate throughout 2019?

<font color=blue> 
    Q2. What are the peak and off-peak times during the year for Airbnb rental prices in Boston?

<font color=blue> 
    Q3. Which hosts have the largest number of Airbnb listings?

<font color=blue> 
    Q4. Which neighborhoods have the largest number of listings in Boston?

<font color=blue> 
    Q5. Which are the most expensive neighborhoods in Boston? 
    
<font color=blue> 
    Q6. Which neighborhoods are the most popular, based on the average number of reviews?

<font color=blue> 
    Q7. Which room type accounts for the majority of Boston Airbnb listings?

<font color=blue> 
    Q8. What are the features that influence the price in Boston Airbnb? Moreover, can we predict the rental price of new listings based on a predictive model?

## 2. Data Understanding

In [92]:
#loading the three datasets
listings = pd.read_csv(cd+'/listings.csv')
reviews = pd.read_csv(cd+'/reviews.csv')
calendar = pd.read_csv(cd+'/calendar.csv')

### 2.1 listings data

In [93]:
#displaying the dataset
print(listings.shape)
listings.sample(10)

(6155, 106)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
963,7825327,https://www.airbnb.com/rooms/7825327,20190200000000.0,2/9/19,Historical South End Brownstone T-4,Flash Sale For this Month Only Stays Up To 40%...,*We offer you the most competitive rates. Bef...,Flash Sale For this Month Only Stays Up To 40%...,none,"One of the trendiest neighborhoods in Boston, ...",...,t,f,strict_14_with_grace_period,f,f,12,12,0,0,0.52
602,5048406,https://www.airbnb.com/rooms/5048406,20190200000000.0,2/9/19,3rd floor room in Victorian House,"Huge 8 rooms, 5 bdrms. house. Kitchen, living...","We offer full kitchen access: dishes, oven, co...","Huge 8 rooms, 5 bdrms. house. Kitchen, living...",none,Dorchester is the largest and oldest residenti...,...,f,f,strict_14_with_grace_period,f,f,2,0,2,0,0.57
5978,31341983,https://www.airbnb.com/rooms/31341983,20190200000000.0,2/9/19,Boston市区Boston College附近地铁沿线优质房源,"交通极其便利,出门1mins到地铁C线,4mins到地铁D线,5mins到地铁B线,步行15...",可以使用卫生间厨房客厅等公用空间,"交通极其便利,出门1mins到地铁C线,4mins到地铁D线,5mins到地铁B线,步行15...",none,靠近Boston College和Boston University,...,f,f,strict_14_with_grace_period,f,f,1,0,1,0,
3930,23090526,https://www.airbnb.com/rooms/23090526,20190200000000.0,2/9/19,Great 2 bedrooms near Airport East Boston,This two cozy and comfortable bedroom apartmen...,"We have everything in this place, and the gues...",This two cozy and comfortable bedroom apartmen...,none,"A very nice, quiet, and safe neighborhood. The...",...,t,f,moderate,f,f,2,2,0,0,0.91
227,1494726,https://www.airbnb.com/rooms/1494726,20190200000000.0,2/9/19,Cozy in Rozzie!,"Spacious & quiet room w. private bath, for 1+ ...",Spacious room with private bathroom in really ...,"Spacious & quiet room w. private bath, for 1+ ...",none,"If you like quaint villages, come to Rozzie Sq...",...,f,f,strict_14_with_grace_period,f,f,1,0,1,0,0.26
4538,25416313,https://www.airbnb.com/rooms/25416313,20190200000000.0,2/9/19,"Newly renovated, Best Location CENTER of Boston",1 bedroom 1 and half bathrooms available in a ...,Access to the Red Line Shawmut train station t...,1 bedroom 1 and half bathrooms available in a ...,none,"I've lived here for over 15 years, it is consi...",...,f,f,moderate,f,f,7,1,6,0,0.4
696,5927267,https://www.airbnb.com/rooms/5927267,20190200000000.0,2/9/19,"Large room, wood floor, 2 windows",It is a great location and room is very nice t...,,It is a great location and room is very nice t...,none,,...,t,f,moderate,f,f,4,0,4,0,2.05
4812,27125054,https://www.airbnb.com/rooms/27125054,20190200000000.0,2/9/19,Cute & Spacious Home in the Heart of Local Boston,This is an entire one-floor single home with a...,This single ADA family home features lots of o...,This is an entire one-floor single home with a...,none,Dorchester is Boston's largest neighborhood wi...,...,f,f,moderate,f,f,1,1,0,0,
1696,13652374,https://www.airbnb.com/rooms/13652374,20190200000000.0,2/9/19,Private Room near Forest Hills T Stop,Lugares de interés: Arnold Arboretum Walter St...,Private room in the beautiful neighborhood of ...,Lugares de interés: Arnold Arboretum Walter St...,none,,...,f,f,flexible,f,f,1,0,1,0,
3258,21076367,https://www.airbnb.com/rooms/21076367,20190200000000.0,2/9/19,Charming 1 Bedroom in Beacon Hill (Downtown BOS),A one bedroom apartment at the heart of Boston...,,A one bedroom apartment at the heart of Boston...,none,,...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.12


In [94]:
#summary of each feature
listings.describe()

Unnamed: 0,id,scrape_id,thumbnail_url,medium_url,xl_picture_url,host_id,host_acceptance_rate,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,...,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,6155.0,6155.0,0.0,0.0,0.0,6155.0,0.0,6153.0,6153.0,0.0,...,4854.0,4849.0,4849.0,0.0,0.0,6155.0,6155.0,6155.0,6155.0,4911.0
mean,18791440.0,20190200000000.0,,,,59980430.0,,152.358524,152.358524,,...,9.681706,9.545886,9.300474,,,35.479285,33.577417,1.84078,0.061089,1.976685
std,8802758.0,0.0,,,,64811610.0,,372.054375,372.054375,,...,0.807092,0.77795,0.891518,,,73.147833,73.626063,4.079204,0.54038,2.100526
min,3781.0,20190200000000.0,,,,4804.0,,0.0,0.0,,...,2.0,2.0,2.0,,,1.0,0.0,0.0,0.0,0.01
25%,12786800.0,20190200000000.0,,,,11807870.0,,1.0,1.0,,...,10.0,9.0,9.0,,,1.0,0.0,0.0,0.0,0.38
50%,20431830.0,20190200000000.0,,,,30283590.0,,4.0,4.0,,...,10.0,10.0,9.0,,,3.0,1.0,0.0,0.0,1.19
75%,26013020.0,20190200000000.0,,,,95459400.0,,38.0,38.0,,...,10.0,10.0,10.0,,,30.0,21.0,1.0,0.0,3.0
max,32243380.0,20190200000000.0,,,,241878000.0,,1480.0,1480.0,,...,10.0,10.0,10.0,,,306.0,306.0,30.0,8.0,13.71


In [95]:
#listing all the features
list(listings.columns)

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


In [96]:
#checking the nulls for each feature
print(listings.isnull().sum().any())
listings.isnull().sum().sort_values(ascending=False)

True


xl_picture_url                                  6155
neighbourhood_group_cleansed                    6155
host_acceptance_rate                            6155
jurisdiction_names                              6155
license                                         6155
medium_url                                      6155
thumbnail_url                                   6155
square_feet                                     6057
weekly_price                                    5628
monthly_price                                   5626
access                                          2463
notes                                           2445
interaction                                     2064
host_about                                      2025
house_rules                                     1974
neighborhood_overview                           1952
security_deposit                                1875
transit                                         1790
space                                         

As the output of the cell above shows, many features in the listings dataset have missing values, and some have no entries at all. This issue has to be addressed before the analysis.
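
One way to tell the two cases apart is to look at the share of missing values per column. The sketch below uses a small made-up frame in place of `listings`; the `toy` data and the names `missing_share`, `empty_columns`, and `partial_columns` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `listings`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "license": [np.nan, np.nan, np.nan, np.nan],      # no entries at all
    "square_feet": [np.nan, np.nan, np.nan, 800.0],   # mostly missing
    "price": ["$100.00", "$85.00", "$120.00", "$60.00"],
})

# Share of missing values per column, between 0.0 and 1.0.
missing_share = toy.isnull().mean()

# Columns with no entries at all versus columns that are only partly missing.
empty_columns = sorted(missing_share[missing_share == 1.0].index)
partial_columns = sorted(missing_share[(missing_share > 0) & (missing_share < 1)].index)

print(empty_columns)
print(partial_columns)
```

Columns that are completely empty carry no information and can be removed outright, while partly missing columns need a case-by-case decision.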

In [97]:
#checking the number of duplicates
listings.duplicated().sum()

0

In [98]:
#checking the number of unique values for each feature
listings.nunique().sort_values()

neighbourhood_group_cleansed                      0
xl_picture_url                                    0
medium_url                                        0
thumbnail_url                                     0
license                                           0
jurisdiction_names                                0
host_acceptance_rate                              0
state                                             1
has_availability                                  1
country_code                                      1
scrape_id                                         1
requires_license                                  1
calendar_last_scraped                             1
last_scraped                                      1
is_business_travel_ready                          1
country                                           1
experiences_offered                               1
instant_bookable                                  2
is_location_exact                                 2
market      

In [99]:
#Providing a set of columns with 0 missing values.
listings_no_nulls = set(listings.columns[listings.isnull().mean()==0]) 
print(listings_no_nulls)

{'latitude', 'minimum_maximum_nights', 'id', 'calculated_host_listings_count_entire_homes', 'availability_30', 'minimum_nights_avg_ntm', 'extra_people', 'picture_url', 'maximum_nights', 'scrape_id', 'country', 'availability_365', 'availability_60', 'minimum_nights', 'maximum_maximum_nights', 'listing_url', 'calculated_host_listings_count', 'maximum_minimum_nights', 'host_id', 'street', 'accommodates', 'requires_license', 'guests_included', 'number_of_reviews', 'longitude', 'host_url', 'experiences_offered', 'bed_type', 'availability_90', 'number_of_reviews_ltm', 'host_verifications', 'amenities', 'require_guest_phone_verification', 'country_code', 'require_guest_profile_picture', 'is_location_exact', 'has_availability', 'calendar_updated', 'maximum_nights_avg_ntm', 'is_business_travel_ready', 'cancellation_policy', 'property_type', 'neighbourhood_cleansed', 'instant_bookable', 'calendar_last_scraped', 'minimum_minimum_nights', 'last_scraped', 'price', 'room_type', 'smart_location', 'ca

In [100]:
#Providing a set of columns with more than 75% of the values missing
listings_most_nulls = set(listings.columns[listings.isnull().mean()> 0.75]) 
print(listings_most_nulls)

{'xl_picture_url', 'square_feet', 'weekly_price', 'jurisdiction_names', 'thumbnail_url', 'license', 'monthly_price', 'neighbourhood_group_cleansed', 'host_acceptance_rate', 'medium_url'}


Based on the previous data exploration, the next cell keeps only the features that are useful for our analysis. The features **'xl_picture_url', 'square_feet', 'weekly_price', 'jurisdiction_names', 'thumbnail_url', 'license', 'monthly_price', 'neighbourhood_group_cleansed', 'host_acceptance_rate', and 'medium_url'** were dropped because more than 75% of their values are missing.

In [101]:
#setting the chosen 22 features for the analysis
listings_columns = ['id', 'host_name', 'host_response_rate', 
               'host_is_superhost', 'neighbourhood_cleansed',
               'zipcode', 'property_type', 'room_type',
               'accommodates', 'bathrooms', 'bedrooms',
               'beds', 'amenities', 'price', 'cleaning_fee',
               'number_of_reviews', 'review_scores_rating', 'instant_bookable',
               'cancellation_policy', 'reviews_per_month', 'latitude', 'longitude']

In [102]:
#confirming the final dataset has only the chosen 22 features
df_listings = listings[listings_columns]
print(df_listings.shape)
df_listings.sample(10)

(6155, 22)


Unnamed: 0,id,host_name,host_response_rate,host_is_superhost,neighbourhood_cleansed,zipcode,property_type,room_type,accommodates,bathrooms,...,amenities,price,cleaning_fee,number_of_reviews,review_scores_rating,instant_bookable,cancellation_policy,reviews_per_month,latitude,longitude
1057,8381472,Nick,,f,South End,2118,Apartment,Entire home/apt,4,2.0,...,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",$400.00,,0,,f,flexible,,42.343095,-71.079798
5166,28586005,Domio,100%,f,North End,2113,Apartment,Entire home/apt,6,1.0,...,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",$129.00,$120.00,13,97.0,f,strict_14_with_grace_period,3.28,42.36555,-71.056143
5997,31397068,Steve,94%,f,Jamaica Plain,2130,Apartment,Entire home/apt,8,1.0,...,"{Wifi,""Air conditioning"",Kitchen,Heating,""Smok...",$300.00,$165.00,1,100.0,t,strict_14_with_grace_period,1.0,42.32179,-71.109635
291,2021483,Shannon,100%,t,South Boston,2127,Condominium,Entire home/apt,6,2.0,...,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",$500.00,$100.00,5,96.0,f,strict_14_with_grace_period,0.16,42.335352,-71.047413
1227,9853959,Margaret,,f,South End,2118,Apartment,Entire home/apt,4,1.0,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$195.00,$25.00,0,,f,strict_14_with_grace_period,,42.341212,-71.079551
4908,27549840,Sean,100%,f,North End,2109,Apartment,Entire home/apt,3,1.0,...,"{TV,Wifi,""Air conditioning"",Kitchen,Heating,""S...",$160.00,,6,96.0,t,flexible,1.01,42.365607,-71.053421
5674,30180140,Evon,98%,f,Downtown,2116,Apartment,Entire home/apt,4,1.0,...,"{TV,Wifi,""Air conditioning"",Kitchen,Breakfast,...",$149.00,$69.00,15,95.0,t,flexible,5.77,42.352685,-71.063872
4964,27733061,Aziz,100%,t,Back Bay,2116,Apartment,Private room,2,1.0,...,"{TV,Wifi,""Air conditioning"",Kitchen,Breakfast,...",$125.00,$50.00,38,97.0,t,moderate,7.08,42.350248,-71.085469
3991,23272168,Bluebird,76%,f,South Boston,2127,Apartment,Entire home/apt,5,2.0,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$279.00,$100.00,1,100.0,t,strict_14_with_grace_period,1.0,42.336815,-71.036718
103,377474,Hermina,100%,f,Dorchester,2122,House,Entire home/apt,10,1.5,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$495.00,$150.00,22,87.0,f,strict_14_with_grace_period,0.28,42.29839,-71.059312


In [103]:
#investigating the data type for each feature
df_listings.dtypes

id                          int64
host_name                  object
host_response_rate         object
host_is_superhost          object
neighbourhood_cleansed     object
zipcode                    object
property_type              object
room_type                  object
accommodates                int64
bathrooms                 float64
bedrooms                  float64
beds                      float64
amenities                  object
price                      object
cleaning_fee               object
number_of_reviews           int64
review_scores_rating      float64
instant_bookable           object
cancellation_policy        object
reviews_per_month         float64
latitude                  float64
longitude                 float64
dtype: object

Some of the final features need to be converted to the correct data type, which will be addressed later.
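
As a rough sketch of what such conversions could look like, the example below cleans three of the string formats visible in the sample output: dollar amounts, percentages, and `t`/`f` flags. The `toy` frame and the `converted` name are assumptions made for illustration; the notebook's own conversion may differ.

```python
import pandas as pd

# Toy rows mimicking the string formats seen in the listings sample.
toy = pd.DataFrame({
    "price": ["$400.00", "$1,250.00", "$129.00"],
    "host_response_rate": ["100%", None, "94%"],
    "host_is_superhost": ["t", "f", "f"],
})

converted = pd.DataFrame({
    # Strip "$" and thousands separators, then parse as numbers.
    "price": pd.to_numeric(
        toy["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    ),
    # "94%" becomes 94.0; missing rates stay missing (NaN).
    "host_response_rate": pd.to_numeric(
        toy["host_response_rate"].str.replace("%", "", regex=False)
    ),
    # "t"/"f" flags become booleans.
    "host_is_superhost": toy["host_is_superhost"].map({"t": True, "f": False}),
})

print(converted.dtypes)
```

Passing `regex=False` matters for the dollar sign, because `$` is otherwise a regular-expression anchor rather than a literal character.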

In [104]:
#checking the number of nulls for each chosen feature
print(df_listings.isnull().sum().any())
df_listings.isnull().sum().sort_values(ascending=False)

True


review_scores_rating      1299
reviews_per_month         1244
host_response_rate        1238
cleaning_fee               979
zipcode                     42
bedrooms                     6
bathrooms                    5
beds                         3
host_name                    2
host_is_superhost            2
room_type                    0
neighbourhood_cleansed       0
property_type                0
longitude                    0
accommodates                 0
latitude                     0
amenities                    0
price                        0
number_of_reviews            0
instant_bookable             0
cancellation_policy          0
id                           0
dtype: int64

Again, some of the features have missing values. Further investigation is needed to decide whether to remove or impute the affected rows.
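
To make the remove-or-impute choice concrete, here is a minimal sketch of one possible rule: drop rows when only a few are affected, and fill with the median otherwise. The cutoff `DROP_ROWS_BELOW`, the `toy` data, and the rule itself are hypothetical and are not the decision taken later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values; column names borrowed from the listings data.
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 3.0, 1.0],
    "cleaning_fee": [50.0, np.nan, np.nan, 100.0, 25.0],
})

# Hypothetical cutoff: at or below this share of missing rows, drop the rows;
# above it, impute instead so that too much data is not lost.
DROP_ROWS_BELOW = 0.25

result = toy.copy()
for column in toy.columns:
    share = toy[column].isnull().mean()
    if share == 0:
        continue
    if share <= DROP_ROWS_BELOW:
        result = result.dropna(subset=[column])
    else:
        # Median of the observed values in the original column.
        result[column] = result[column].fillna(toy[column].median())

print(result)
```

Here `bedrooms` is missing in 20% of rows, so the row is dropped, while `cleaning_fee` is missing in 40% of rows and is filled with its median instead.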

In [105]:
#checking the number of unique values for each chosen feature
df_listings.nunique().sort_values()

instant_bookable             2
host_is_superhost            2
room_type                    3
cancellation_policy          6
bedrooms                     9
bathrooms                   12
accommodates                16
beds                        16
property_type               21
neighbourhood_cleansed      25
zipcode                     41
host_response_rate          43
review_scores_rating        45
cleaning_fee               148
number_of_reviews          300
price                      398
reviews_per_month          777
host_name                 1575
amenities                 4980
longitude                 6150
latitude                  6150
id                        6155
dtype: int64

#### 2.1.1 Quality Issues

- Missing data in some of the final 22 chosen features
- Erroneous datatypes in the price and date features
- Raw source formatting in amenities, where the values are intertwined with symbols (braces and quotation marks)
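
The amenities strings in the sample output look like `{TV,"Cable TV",Wifi,...}`. A small sketch of how one such string could be split into clean names is shown below; `parse_amenities` is a hypothetical helper, not a function from the notebook.

```python
import csv

def parse_amenities(raw):
    """Turn a string like '{TV,"Cable TV",Wifi}' into a list of amenity names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader copes with the double-quoted entries that contain spaces.
    return next(csv.reader([inner]))

amenities = parse_amenities('{TV,"Cable TV",Wifi,"Air conditioning",Kitchen}')
print(amenities)
```

Using the `csv` module avoids splitting on commas by hand, which would leave stray quotation marks around multi-word entries.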


#### 2.1.2 Tidiness Issues
- Features with more than one representation: room_type, amenities, and neighbourhood_cleansed. 
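
If this tidiness issue is read as categorical values that need to be spread into indicator columns before modeling (an interpretation, since the step is not shown at this point), a minimal sketch with `pd.get_dummies` could look as follows. The `toy` frame is made up; the three room types match the three unique values reported above.

```python
import pandas as pd

# Made-up rows using the room types that appear in the listings sample.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

# One indicator column per category; each row has exactly one indicator set.
indicators = pd.get_dummies(toy["room_type"], prefix="room_type")

print(indicators.columns.tolist())
```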

### 2.2 reviews data

In [106]:
#displaying the dataset
print(reviews.shape)
reviews.sample(10)

(199330, 6)


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
121665,15582062,209963762,2017-11-06,7078806,Lori,"If you know Boston Back Bay, you'll really app..."
98580,13296735,104962035,2016-09-29,56434819,Marianne,"Ron and Ignacio are adorable hosts, they did t..."
69803,7692933,95225584,2016-08-18,2157137,Alis,Had a very serene and comfortable stay at Phyl...
174463,22327139,258125311,2018-04-28,182377265,Chris,All and all this is a must stay if your visiti...
114202,15044173,340020707,2018-10-22,218638155,Joshua,"This apartment rocks. It is a beautiful space,..."
13277,815639,209572469,2017-11-05,113911928,Heather,This place was perfect. It's in a great locati...
111627,14868157,332523712,2018-10-05,93661842,Gill,Lori is a great host ...I enjoyed chatting wit...
199318,31690021,409724312,2019-02-08,67259979,Yerzhan,It’s perfect place to stay. Very clean and nic...
107846,14483758,151604995,2017-05-13,21468084,Bryan,Matthew is a responsive host and the apartment...
178609,23170412,248438357,2018-03-31,2441749,Eric And Susan,"Such a cool 1 br apartment, and the location w..."


In [107]:
#summary of each column
reviews.describe()

Unnamed: 0,listing_id,id,reviewer_id
count,199330.0,199330.0,199330.0
mean,12586140.0,208316200.0,72688360.0
std,8224075.0,107663300.0,61926930.0
min,3781.0,1021.0,1.0
25%,4924009.0,115410500.0,20731390.0
50%,13393420.0,213515000.0,54099880.0
75%,19309430.0,302404200.0,116709900.0
max,32163610.0,409756700.0,241460500.0


In [108]:
#information about each feature
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199330 entries, 0 to 199329
Data columns (total 6 columns):
listing_id       199330 non-null int64
id               199330 non-null int64
date             199330 non-null object
reviewer_id      199330 non-null int64
reviewer_name    199330 non-null object
comments         199242 non-null object
dtypes: int64(3), object(3)
memory usage: 9.1+ MB


In [109]:
#checking the number of nulls for each column
print(reviews.isnull().sum().any())
reviews.isnull().sum().sort_values(ascending=False)

True


comments         88
reviewer_name     0
reviewer_id       0
date              0
id                0
listing_id        0
dtype: int64

#### 2.2.1 Quality Issues

- Missing data in comments
- Erroneous datatypes in ids and dates
- Raw source formatting in comments and reviewer_name, where some fields are intertwined with symbols and some are written in different languages.
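
A possible way to handle the first two issues is sketched below on a made-up frame: rows without a comment are dropped, dates are parsed, and IDs are stored as strings. Treating IDs as strings is an assumption about what "erroneous datatypes in ids" is meant to fix.

```python
import pandas as pd

# Made-up review rows in the same shape as the reviews data.
toy = pd.DataFrame({
    "listing_id": [15582062, 13296735, 7692933],
    "date": ["2017-11-06", "2016-09-29", "2016-08-18"],
    "comments": ["Great location.", None, "Very serene stay."],
})

# Drop reviews that have no comment text.
cleaned = toy.dropna(subset=["comments"]).copy()

# Parse the date strings into proper datetimes.
cleaned["date"] = pd.to_datetime(cleaned["date"])

# IDs are labels rather than quantities, so keep them as strings.
cleaned["listing_id"] = cleaned["listing_id"].astype(str)

print(cleaned.dtypes)
```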

### 2.3 calendar data

In [110]:
#displaying the dataset
print(calendar.shape)
calendar.sample(10)

(2246575, 7)


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
816327,16030858,2019-08-09,f,$45.00,$45.00,3,365
1744053,26453232,2019-07-09,t,$150.00,$150.00,32,1125
1670522,25871107,2019-11-07,f,$283.00,$283.00,3,1125
383168,8336364,2019-10-17,f,$100.00,$100.00,2,1125
1879390,28546173,2019-09-21,t,$533.00,$533.00,3,1125
100768,916123,2019-11-01,t,$199.00,$199.00,1,1125
695254,14950476,2019-03-02,f,$307.00,$307.00,2,1125
1967017,29379824,2019-04-19,f,$50.00,$50.00,2,7
1891809,28386218,2019-07-16,f,$60.00,$60.00,2,1125
2176426,30153933,2019-10-31,f,$135.00,$135.00,1,1125


In [111]:
#summary of each column
calendar.describe()

Unnamed: 0,listing_id,minimum_nights,maximum_nights
count,2246575.0,2246575.0,2246575.0
mean,18791440.0,6.885056,16985.45
std,8802044.0,40.36776,1274523.0
min,3781.0,1.0,1.0
25%,12784060.0,1.0,105.0
50%,20431830.0,2.0,1125.0
75%,26013460.0,3.0,1125.0
max,32243380.0,999.0,100000000.0


In [112]:
#checking the number of nulls for each column
print(calendar.isnull().sum().any())
calendar.isnull().sum().sort_values(ascending=False)

False


maximum_nights    0
minimum_nights    0
adjusted_price    0
price             0
available         0
date              0
listing_id        0
dtype: int64

In [113]:
#information about each feature
calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2246575 entries, 0 to 2246574
Data columns (total 7 columns):
listing_id        int64
date              object
available         object
price             object
adjusted_price    object
minimum_nights    int64
maximum_nights    int64
dtypes: int64(3), object(4)
memory usage: 120.0+ MB


#### 2.3.1 Quality Issues

- Erroneous datatypes in price and date

---
### <font color=blue> Q1. How do prices for Boston's Airbnb listings fluctuate throughout 2019?

In [114]:
calendar_clean = calendar.copy()

In [115]:
#removing the $ sign and thousands separator from price, and converting the date's type to datetime
calendar_clean[['year', 'month', 'day']] = calendar_clean['date'].str.split('-', n=2, expand=True)
calendar_clean['date'] = pd.to_datetime(calendar_clean['date'])
#regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor)
calendar_clean['price'] = calendar_clean['price'].str.replace(',', '', regex=False)
calendar_clean['price'] = calendar_clean['price'].str.replace('$', '', regex=False)
calendar_clean['price'] = calendar_clean['price'].astype(float)

#removing the columns that will not be used for this question
calendar_clean = calendar_clean.drop(['available', 'adjusted_price', 'minimum_nights', 'maximum_nights'], axis=1)

In [116]:
#checking the quality of the dataframe
calendar_clean.sample(5)

Unnamed: 0,listing_id,date,price,year,month,day
1448601,23170412,2019-10-22,283.0,2019,10,22
1448961,23272728,2019-12-20,99.0,2019,12,20
2144385,31102316,2020-01-08,131.0,2020,1,8
11022,1183032,2020-01-08,175.0,2020,1,8
1647863,25126538,2019-06-21,525.0,2019,6,21


In [176]:
#grouping the calendar rows by date and then finding the average price per date
prices = calendar_clean.groupby('date')['price'].mean().reset_index()
prices

Unnamed: 0,date,price
0,2019-02-01,143.140812
1,2019-03-01,177.284107
2,2019-04-01,208.852136
3,2019-05-01,235.68754
4,2019-06-01,229.730376
5,2019-07-01,237.662488
6,2019-08-01,234.053924
7,2019-09-01,232.110815
8,2019-10-01,232.060769
9,2019-11-01,203.867051


In [118]:
#finding the average of the average daily prices
prices.price.mean()

215.08826725126008

In [119]:
trace1 = go.Scatter(x=prices.date, y=prices.price, line = dict(color = '#7F7F7F'))

data = [trace1]


layout = {
    'title': 'Daily Average Price Change (Use the slider for more detailed trends)',
    'xaxis': {'title': 'Day'},
    'yaxis': {'title': 'Average Price ($)'},
               
    'shapes': [
        # Line Horizontal, average
        {
            'type': 'line',
            'x0': '2019-02-09',
            'y0': 215,
            'x1': '2020-02-08',
            'y1': 215,
            'line': {
                'color': 'black',
                'width': 1,
                'dash': 'dashdot',
            }
        },
        
        # 1st highlight during Apr 9 - Apr 21
        {
            'type': 'rect',
            # x-reference is assigned to the x-values
            'xref': 'x',
            # y-reference is assigned to the plot [0,1]
            'yref': 'paper',
            'x0': '2019-04-09',
            'y0': 0,
            'x1': '2019-04-21',
            'y1': 1,
            'fillcolor': '#d3d3d3',
            'opacity': 0.3,
            'line': {
                'width': 0,
            }
        },
        
        # 2nd highlight during Dec 25 - Feb 7
        {
            'type': 'rect',
            'xref': 'x',
            'yref': 'paper',
            'x0': '2019-12-25',
            'y0': 0,
            'x1': '2020-02-07',
            'y1': 1,
            'fillcolor': '#d3d3d3',
            'opacity': 0.3,
            'line': {
                'width': 0,
            }
        }
    ]
}

layout.update(dict(
    annotations=[go.layout.Annotation(text="Overall Average Price ($215)", x='2019-03-15', y=215)]),
             xaxis=dict(
        title='Day',  # restate the title, since this update replaces the earlier xaxis dict
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1m',
                     step='month',
                     stepmode='backward'),
                dict(count=6,
                     label='6m',
                     step='month',
                     stepmode='backward'),
                dict(count=1,
                    label='YTD',
                    step='year',
                    stepmode='todate'),
                dict(count=1,
                    label='1y',
                    step='year',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True
        ),
        type='date'
    )
             )
              

py.iplot({'data': data, 'layout': layout}, filename='Daily Average Price Change')





The time series plot shows two periods in which Airbnb listing prices rose sharply. The shaded areas mark those periods, and the horizontal line represents the average nightly price over the whole year.

    1. The period between April 9, 2019, and April 21, 2019, shows average prices rising above the overall average price. On further investigation, the increase is attributable to the Boston Marathon, one of the world's oldest and most challenging races, which took place on Monday, April 15.
    
    2. The period between December 25, 2019, and February 7, 2020, shows an increase in prices, which is expected because it includes Christmas Day, New Year's Eve, and New Year's Day.
    
    3. There is a significant difference in average prices between February 2019 and February 2020. February 2019 was the lowest period, with rates ranging from around $136 to $200 per night, whereas rates start from $234 per night in February 2020.
    
    4. Overall, the off-peak period for Boston's Airbnb listings falls in February and March 2019, and the peak period runs from April to October 2019. Prices go down in November, and then the trend rises again in December.
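
The comparison against the overall average that underlies these observations can be sketched as follows. The daily prices in `daily` are made up for illustration and are not taken from the calendar data; only the $215 reference value comes from the notebook.

```python
import pandas as pd

# Made-up daily average prices; the real series would come from `prices`.
daily = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-04-07", "2019-04-08", "2019-04-14", "2019-04-15", "2019-04-22"]
    ),
    "price": [205.0, 212.0, 260.0, 275.0, 210.0],
})

OVERALL_AVERAGE = 215.0  # overall average reported above, rounded

# Flag the days whose average price is above the overall average.
daily["above_average"] = daily["price"] > OVERALL_AVERAGE
peak_days = daily.loc[daily["above_average"], "date"]

print(peak_days.min().date(), peak_days.max().date())
```

The first and last flagged dates give a rough boundary for a shaded period like the ones drawn in the plot.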

---
### <font color=blue> Q2. What are the peak and off-peak months of the year for Airbnb rental prices in Boston?

In [120]:
#combining the year and month strings to create a year-month column
calendar_clean['month'] = calendar_clean['year'].map(str) + "-" + calendar_clean['month'].map(str)

#converting the year-month string to a date (the first day of each month)
calendar_clean.date = pd.to_datetime(calendar_clean['month'])

In [121]:
month_trend = calendar_clean.groupby('month')['price'].describe().reset_index()
month_trend = month_trend.drop(['count', 'std', '25%', '50%', '75%'], axis=1)
month_trend

Unnamed: 0,month,mean,min,max
0,2019-02,143.140812,10.0,5000.0
1,2019-03,177.284107,10.0,7132.0
2,2019-04,208.852136,10.0,5000.0
3,2019-05,235.68754,10.0,9999.0
4,2019-06,229.730376,10.0,5000.0
5,2019-07,237.662488,10.0,5000.0
6,2019-08,234.053924,10.0,5000.0
7,2019-09,232.110815,10.0,6414.0
8,2019-10,232.060769,10.0,5407.0
9,2019-11,203.867051,10.0,5000.0


In [122]:

trace_mean = go.Scatter(
    x=month_trend.month,
    y=month_trend['mean'],
    name = "Average Price Per month",
    line = dict(color = '#7F7F7F'),
    opacity = 0.8)

data = [trace_mean]

layout = {
    'title': 'Monthly Price Trend',
    'xaxis': {'title': 'Month'},
    'yaxis': {'title': 'Average Price ($)'},
               
    'shapes': [
        # Line Horizontal, average
        {
            'type': 'line',
            'x0': '2019-02-01',
            'y0': 215,
            'x1': '2020-02-01',
            'y1': 215,
            'line': {
                'color': 'black',
                'width': 1,
                'dash': 'dashdot',
            }
        },
        
        # 1st highlight above average months
        {
            'type': 'rect',
            # x-reference is assigned to the x-values
            'xref': 'x',
            # y-reference is assigned to the y-values
            'yref': 'y',
            'x0': '2019-02',
            'y0': 215,
            'x1': '2020-02',
            'y1': 240,
            'fillcolor': 'tomato',
            'opacity': 0.1,
            'line': {
                'width': 0,
            }
        },
        
        # 2nd highlight below average months
        {
            'type': 'rect',
            'xref': 'x',
            'yref': 'y',
            'x0': '2019-02',
            'y0': 215,
            'x1': '2020-02',
            'y1': 140,
            'fillcolor': 'olive',
            'opacity': 0.1,
            'line': {
                'width': 0,
            }
        }
    ]
}

layout.update(dict(annotations=[go.Annotation(text="Overall Average Price ($215)", x='2019-03-15', y=215)]))
        
fig = dict(data=data, layout=layout)
py.iplot(fig, filename = "Monthly Price Trend")



The monthly price trend plot shows the peak and off-peak months for Boston Airbnb listing prices. The red-shaded area marks where the average price per night is above the overall average for all listings, while the green-shaded area marks the off-peak months, where the average price per night is below the overall average.

    1. February, March, the beginning of April, November, and most of December 2019 are the off-peak months, when prices are below $215 per night. 
    
    2. The second half of April, May, June, July, August, September, the first half of October, and the end of December 2019, along with January and February 2020, are the peak months, when prices are above $215 per night.
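
The peak/off-peak reading above can also be reproduced numerically. The following is a minimal sketch, assuming the monthly means from the `month_trend` table (only the ten months displayed there) and the $215 overall average quoted in the text; the names `monthly_mean`, `OVERALL_AVERAGE`, and `season` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Monthly mean prices copied (rounded) from the month_trend output above.
monthly_mean = pd.Series(
    {
        "2019-02": 143.14, "2019-03": 177.28, "2019-04": 208.85,
        "2019-05": 235.69, "2019-06": 229.73, "2019-07": 237.66,
        "2019-08": 234.05, "2019-09": 232.11, "2019-10": 232.06,
        "2019-11": 203.87,
    },
    name="mean_price",
)

# Threshold quoted in the text; it is taken as given here, not recomputed.
OVERALL_AVERAGE = 215

# Label each month by comparing its mean price with the overall average.
season = monthly_mean.gt(OVERALL_AVERAGE).map({True: "peak", False: "off-peak"})

peak_months = season[season == "peak"].index.tolist()
off_peak_months = season[season == "off-peak"].index.tolist()

print(pd.DataFrame({"mean_price": monthly_mean, "season": season}))
```

Because this works on whole-month means, it cannot show within-month transitions such as the one described for April.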

---
# <center>STAGE TWO</center>
---

## 3. Data Preparation

In [123]:
#copying the latest dataset
df_listings_prep1 = df_listings.copy()

In [124]:
#dropping the features with a large number of missing values
df_listings_prep1 = df_listings_prep1.drop(['review_scores_rating', 'host_response_rate', 'reviews_per_month'], axis=1)

As found earlier, the features **'review_scores_rating', 'reviews_per_month', and 'host_response_rate'** each have more than 1,200 missing values out of 6,155 listings. In my opinion, these columns do not carry information that would affect our analysis. Therefore, instead of dropping the listings with these missing values or imputing them, I decided to drop these features, as they will not be used in answering the business questions or building the model.
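
The decision above was made by inspecting the missing-value counts by hand. As a rough sketch of how the same rule could be applied automatically, the hypothetical helper `drop_sparse_columns` below drops every column whose share of missing values exceeds a threshold. The helper name, the threshold, and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_share=0.15):
    """Drop columns whose share of missing values exceeds max_missing_share.

    Returns the reduced frame and the list of dropped column names.
    """
    missing_share = df.isnull().mean()  # fraction of NaN per column
    sparse = missing_share[missing_share > max_missing_share].index.tolist()
    return df.drop(columns=sparse), sparse


# Toy data, not taken from the Boston dataset.
toy = pd.DataFrame({
    "price": [100.0, 150.0, 90.0, 200.0],
    "reviews_per_month": [np.nan, 1.2, np.nan, 0.4],  # 50% missing
    "beds": [1.0, 2.0, 1.0, np.nan],                  # 25% missing
})

cleaned, dropped = drop_sparse_columns(toy, max_missing_share=0.3)
print(dropped)
print(cleaned.columns.tolist())
```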

In [125]:
#investigating the data type for each feature
df_listings_prep1.dtypes

id                          int64
host_name                  object
host_is_superhost          object
neighbourhood_cleansed     object
zipcode                    object
property_type              object
room_type                  object
accommodates                int64
bathrooms                 float64
bedrooms                  float64
beds                      float64
amenities                  object
price                      object
cleaning_fee               object
number_of_reviews           int64
instant_bookable           object
cancellation_policy        object
latitude                  float64
longitude                 float64
dtype: object

In [126]:
#converting accommodates from int to float
df_listings_prep1.accommodates = df_listings_prep1.accommodates.astype(float)

#converting number_of_reviews from int to float
df_listings_prep1.number_of_reviews = df_listings_prep1.number_of_reviews.astype(float)

#left-padding zip codes with zeros so that each has at least 5 characters
df_listings_prep1.zipcode = df_listings_prep1.zipcode.astype(str).str.zfill(5)

#removing the $ sign and thousands separators from price and cleaning_fee, then converting the types to float
#regex=False treats '$' as a literal character rather than the end-of-string anchor
df_listings_prep1.price = df_listings_prep1.price.str.replace('$','', regex=False)
df_listings_prep1.price = df_listings_prep1.price.str.replace(',','', regex=False)
df_listings_prep1.cleaning_fee = df_listings_prep1.cleaning_fee.str.replace('$','', regex=False)
df_listings_prep1.cleaning_fee = df_listings_prep1.cleaning_fee.str.replace(',','', regex=False)
df_listings_prep1[['price', 'cleaning_fee']] = df_listings_prep1[['price', 'cleaning_fee']].astype(float)
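
The currency cleanup in the cell above repeats the same two replacements for each column. One possible way to fold them into a single step is sketched below; `parse_currency` is a hypothetical helper, not part of the original notebook, and the sample strings are made up.

```python
import pandas as pd


def parse_currency(series):
    """Convert strings such as '$1,250.00' to floats; missing values become NaN."""
    # The character class [$,] matches a literal dollar sign or a comma.
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


raw = pd.Series(["$125.00", "$1,250.00", None])
parsed = parse_currency(raw)
print(parsed)
```

Inside a character class the `$` is literal, so a single regular expression can handle both symbols without the end-of-string ambiguity.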

In [127]:
#assessing the number of each data type
df_listings_prep1.dtypes.value_counts()

object     9
float64    9
int64      1
dtype: int64

In [128]:
#setting the numerical features
features_num = df_listings_prep1.select_dtypes(include=['float', 'int'])
features_num

Unnamed: 0,id,accommodates,bathrooms,bedrooms,beds,price,cleaning_fee,number_of_reviews,latitude,longitude
0,3781,2.0,1.0,1.0,1.0,125.0,75.0,14.0,42.365241,-71.029361
1,5506,2.0,1.0,1.0,1.0,145.0,40.0,80.0,42.329808,-71.095595
2,6695,4.0,1.0,1.0,2.0,169.0,70.0,85.0,42.329941,-71.093505
3,6976,2.0,1.0,1.0,1.0,65.0,0.0,75.0,42.292438,-71.135765
4,8789,2.0,1.0,1.0,1.0,99.0,250.0,22.0,42.359187,-71.062651
5,8792,2.0,1.0,1.0,1.0,154.0,250.0,24.0,42.358497,-71.062011
6,9765,2.0,1.0,,1.0,229.0,75.0,9.0,42.342594,-71.079421
7,9824,2.0,1.0,,1.0,209.0,,23.0,42.349496,-71.085954
8,9827,2.0,1.0,1.0,1.0,389.0,150.0,8.0,42.352149,-71.063301
9,9855,3.0,1.0,1.0,1.0,259.0,150.0,3.0,42.343371,-71.098708


In [129]:
#setting the categorical features
features_cat = df_listings_prep1.select_dtypes(include='object')
features_cat

Unnamed: 0,host_name,host_is_superhost,neighbourhood_cleansed,zipcode,property_type,room_type,amenities,instant_bookable,cancellation_policy
0,Frank,t,East Boston,02128,Apartment,Entire home/apt,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",f,super_strict_30
1,Terry,t,Roxbury,02119,Guest suite,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",t,strict_14_with_grace_period
2,Terry,t,Roxbury,02119,Condominium,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",t,strict_14_with_grace_period
3,Phil,t,Roslindale,02131,Apartment,Private room,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",f,moderate
4,Anne,f,Downtown,02108,Apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,strict_14_with_grace_period
5,Anne,f,Downtown,02108,Apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,strict_14_with_grace_period
6,Seamless,f,South End,02118,Apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,super_strict_30
7,Seamless,f,Back Bay,02115,Serviced apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,super_strict_30
8,Seamless,f,Downtown,02111,Serviced apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,super_strict_30
9,Seamless,f,Fenway,02215,Apartment,Entire home/apt,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",f,super_strict_30


In [130]:
#checking the number of nulls for each feature
print(df_listings_prep1.shape)
df_listings_prep1.isnull().sum().sort_values()

(6155, 19)


id                          0
cancellation_policy         0
instant_bookable            0
number_of_reviews           0
price                       0
amenities                   0
latitude                    0
longitude                   0
room_type                   0
property_type               0
zipcode                     0
neighbourhood_cleansed      0
accommodates                0
host_is_superhost           2
host_name                   2
beds                        3
bathrooms                   5
bedrooms                    6
cleaning_fee              979
dtype: int64

Here, we drop the listings that have no values for **'host_name', 'beds', 'bathrooms', or 'bedrooms'**. A model cannot be fitted on data with missing values, so we clean the data before creating dummy variables for the categorical features.

In [131]:
#dropping listings with missing values for host names, beds, bathrooms, and bedrooms
df_listings_prep1 = df_listings_prep1.dropna(axis=0, subset=['host_name', 'beds', 'bathrooms', 'bedrooms'])
print(df_listings_prep1.shape)
df_listings_prep1.isnull().sum().sort_values()

(6140, 19)


id                          0
cancellation_policy         0
instant_bookable            0
number_of_reviews           0
price                       0
amenities                   0
beds                        0
latitude                    0
bedrooms                    0
accommodates                0
room_type                   0
property_type               0
zipcode                     0
neighbourhood_cleansed      0
host_is_superhost           0
host_name                   0
bathrooms                   0
longitude                   0
cleaning_fee              974
dtype: int64

The missing values of **cleaning_fee** are imputed rather than dropped because the feature could be useful in a future analysis: some guests like to consider the **cleaning_fee** together with the rental price. Here, we focus only on the **price** variable, but imputing the feature, instead of dropping the listings or the feature itself, keeps it available for our analysis if needed.
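
As an illustration of this kind of imputation, here is a minimal sketch that fills missing `cleaning_fee` values with the most frequent value (the mode) and touches only that one column. The toy frame is invented, and using `Series.mode()` is one possible choice rather than the notebook's exact call.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the Boston dataset.
toy = pd.DataFrame({
    "price": [125.0, 145.0, 169.0, 65.0, 209.0],
    "cleaning_fee": [75.0, 40.0, 75.0, np.nan, np.nan],
})

# mode() returns the most frequent value(s); take the first one.
most_common_fee = toy["cleaning_fee"].mode().iloc[0]

# Fill only the cleaning_fee column, leaving every other column untouched.
toy["cleaning_fee"] = toy["cleaning_fee"].fillna(most_common_fee)
print(toy)
```

Mode imputation keeps the column usable but concentrates extra mass on one value, which is worth remembering if `cleaning_fee` is analyzed later.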

In [132]:
#filling nan values with each column's most frequent value (only cleaning_fee still has missing values)
df_listings_prep1 = df_listings_prep1.apply(lambda x:x.fillna(x.value_counts().index[0]))
df_listings_prep1.isnull().sum()

id                        0
host_name                 0
host_is_superhost         0
neighbourhood_cleansed    0
zipcode                   0
property_type             0
room_type                 0
accommodates              0
bathrooms                 0
bedrooms                  0
beds                      0
amenities                 0
price                     0
cleaning_fee              0
number_of_reviews         0
instant_bookable          0
cancellation_policy       0
latitude                  0
longitude                 0
dtype: int64

In [133]:
#assessing the categorical features: which are binary, which are multi-level

#binary
binary_list=[]
#multi-level
multi_level_list=[]

for f in features_cat:
    if (len(df_listings_prep1[f].unique())==2):
        binary_list.append(f)
    elif (len(df_listings_prep1[f].unique())>2):
        multi_level_list.append(f)

In [134]:
#listing the binary features with their unique values
for b in binary_list:
    print(b, df_listings_prep1[b].unique())

host_is_superhost ['t' 'f']
instant_bookable ['f' 't']


In [135]:
#Re-encoding the binary features as 1/0
df_listings_prep1['host_is_superhost'] = df_listings_prep1['host_is_superhost'].map({'t': 1, 'f': 0})
df_listings_prep1['instant_bookable'] = df_listings_prep1['instant_bookable'].map({'t': 1, 'f': 0})
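
`Series.map` returns NaN for any value that is missing from the mapping, so an unexpected flag value would silently become a missing value. A hedged sketch of a defensive version of the re-encoding above follows; `encode_flag` and `FLAG_MAP` are hypothetical names, and the sketch assumes the column has no missing values (as is the case after the cleaning steps above).

```python
import pandas as pd

FLAG_MAP = {"t": 1, "f": 0}


def encode_flag(series):
    """Map 't'/'f' to 1/0 and fail loudly on any other non-null value."""
    encoded = series.map(FLAG_MAP)
    # Values that were present in the input but did not match the mapping.
    unmapped = series[encoded.isnull() & series.notnull()].unique().tolist()
    if unmapped:
        raise ValueError(f"unexpected flag values: {unmapped}")
    return encoded.astype(int)


flags = pd.Series(["t", "f", "f"])
print(encode_flag(flags).tolist())
```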

In [136]:
#listing the multilevel features with their unique values
for m in multi_level_list:
    print(m, df_listings_prep1[m].unique())

host_name ['Frank' 'Terry' 'Phil' ... 'Yaling' 'Kashyap' 'Gabi']
neighbourhood_cleansed ['East Boston' 'Roxbury' 'Roslindale' 'Downtown' 'Fenway' 'Back Bay'
 'South End' 'North End' 'Dorchester' 'West End' 'Jamaica Plain'
 'Charlestown' 'Beacon Hill' 'Mission Hill' 'Allston' 'South Boston'
 'Brighton' 'West Roxbury' 'Bay Village' 'South Boston Waterfront'
 'Longwood Medical Area' 'Chinatown' 'Mattapan' 'Hyde Park'
 'Leather District']
zipcode ['02128' '02119' '02131' '02108' '02111' '02215' '02116' '02115' '02109'
 '02125' '02114' '02118' '02122' '02130' '02129' '02120' '02134' '02127'
 '02124' '02135' '02113' '02132' '02121' '02108 02111' '00nan' '02110'
 '02126' '02136' '02467' '02163' '02145' '02445' '02210' '33131' '02446'
 '02143' '02141' '02149' '01217' '02421' '02139' '02026']
property_type ['Apartment' 'Guest suite' 'Condominium' 'Serviced apartment' 'Boat'
 'House' 'Guesthouse' 'Bed and breakfast' 'Townhouse' 'Loft' 'Bungalow'
 'Other' 'Villa' 'Boutique hotel' 'Resort' 'Hotel'

In [137]:
#viewing the dataset after re-encoding the binary list
df_listings_prep1

Unnamed: 0,id,host_name,host_is_superhost,neighbourhood_cleansed,zipcode,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,price,cleaning_fee,number_of_reviews,instant_bookable,cancellation_policy,latitude,longitude
0,3781,Frank,1,East Boston,02128,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",125.0,75.0,14.0,0,super_strict_30,42.365241,-71.029361
1,5506,Terry,1,Roxbury,02119,Guest suite,Entire home/apt,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",145.0,40.0,80.0,1,strict_14_with_grace_period,42.329808,-71.095595
2,6695,Terry,1,Roxbury,02119,Condominium,Entire home/apt,4.0,1.0,1.0,2.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",169.0,70.0,85.0,1,strict_14_with_grace_period,42.329941,-71.093505
3,6976,Phil,1,Roslindale,02131,Apartment,Private room,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Wifi,""Air conditioning"",Kitchen...",65.0,0.0,75.0,0,moderate,42.292438,-71.135765
4,8789,Anne,0,Downtown,02108,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",99.0,250.0,22.0,0,strict_14_with_grace_period,42.359187,-71.062651
5,8792,Anne,0,Downtown,02108,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",154.0,250.0,24.0,0,strict_14_with_grace_period,42.358497,-71.062011
8,9827,Seamless,0,Downtown,02111,Serviced apartment,Entire home/apt,2.0,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",389.0,150.0,8.0,0,super_strict_30,42.352149,-71.063301
9,9855,Seamless,0,Fenway,02215,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",259.0,150.0,3.0,0,super_strict_30,42.343371,-71.098708
10,9857,Seamless,0,Back Bay,02116,Apartment,Entire home/apt,4.0,1.0,2.0,2.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",342.0,100.0,25.0,0,super_strict_30,42.354290,-71.072772
11,9858,Seamless,0,Back Bay,02116,Apartment,Entire home/apt,6.0,1.0,2.0,2.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",489.0,100.0,1.0,0,super_strict_30,42.344471,-71.081786


---
### <font color=blue> Q3. Who are the hosts with the most number of Airbnb listings? 


In [138]:
hosts_count = df_listings_prep1.groupby('host_name').count()['id'].sort_values(ascending=False)
hosts_count[hosts_count >= 20]

host_name
Sonder                306
Kara                  155
Bluebird              153
Mike                  142
Brent                  89
Stay Alfred            85
Jen                    80
Corp Condos & Apts     70
Sonder (Boston)        58
Maverick               56
Marie                  54
Matthew                49
Will                   48
Domio                  44
Taylor                 43
Alex                   41
Paige                  40
Blueground             39
Huggy                  36
Inn Boston             36
Michelle               34
Michael                33
Ken                    33
David                  32
Mario                  31
Chris                  30
Jason                  29
Luxurious              28
Seamless               28
Jonathan               27
Jennifer               26
Kevin                  26
Robert                 25
Susan                  24
Anne                   24
Kiki                   24
Lance                  23
Nav                    22
Ad

In [139]:
h_count = df_listings_prep1.groupby('host_name').count()['id']
h_count = h_count[h_count >= 10]
h_count = h_count.sort_values(ascending=False)
x = h_count.index
y = h_count
y_cum = np.cumsum(y)
y_perc = 100*y_cum/y.sum()

trace1 = dict(type='bar',
    x=x,
    y=y,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5)),
    name='Count of Listings',
    opacity=0.6
)

trace2 = dict(type='scatter',
    x=x,
    y=y_perc,
    marker=dict(
        color='#7F7F7F'
    ),
    line=dict(color= '#7F7F7F', width= 1.5),
    name='Cumulative Listings Percent of Total (%)',
    xaxis='x1', 
    yaxis='y2' 
)
    
data = [trace1, trace2]
layout = go.Layout(
    title='Number of Listings by Host',
    legend= dict(x=-.1, y=1.2),
    yaxis=dict(
        title='Count of Listings'
    ),
    yaxis2=dict(
        title='Percent of Total Listings (%)',
        range=[0,100],
        overlaying='y',
        anchor='x',
        side='right'
        )
    )

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Percent of listings by Host')



As the bar chart shows, Sonder has the most listings, 306 of the 6,140 (about 5%), followed by Kara with 155, Bluebird with 153, and Mike with 142. Together these four host names account for 756 listings, or about 12% of the total. The cumulative line in the chart reads higher because it is computed only over hosts with at least 10 listings. All other hosts have between 1 and 89 listings each. Because listings are grouped by `host_name`, different hosts who share a name may be counted together, while "Sonder" and "Sonder (Boston)" are counted separately.

According to Inside Airbnb, it is worth noting that "A host may list separate rooms in the same apartment, or multiple apartments or homes available in their entirety. Hosts with multiple listings are more likely to be running a business, are unlikely to be living in the property, and in violation of most short term rental laws designed to protect residential housing."
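
The share of listings held by the largest hosts depends on the denominator. The short sketch below uses the four largest counts from the output above and the 6,140 listings that remain after cleaning; the variable names are illustrative.

```python
import pandas as pd

TOTAL_LISTINGS = 6140  # rows remaining after dropping listings with missing values

# Listing counts for the four largest host names, from the output above.
top_hosts = pd.Series({"Sonder": 306, "Kara": 155, "Bluebird": 153, "Mike": 142})

# Share of all listings, and the running total of that share.
share_pct = 100 * top_hosts / TOTAL_LISTINGS
cumulative_pct = share_pct.cumsum()

print(pd.DataFrame({"listings": top_hosts,
                    "share_pct": share_pct.round(2),
                    "cumulative_pct": cumulative_pct.round(2)}))
```

Dividing by the total number of listings, rather than by the listings of hosts that pass a minimum-count filter, gives the share of the whole market.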

---
### <font color=blue> Q4. Which neighborhoods have the highest number of listings in Boston?

In [140]:
df_listings_prep1['neighbourhood_cleansed'].value_counts()

Dorchester                 536
Jamaica Plain              513
Back Bay                   492
Downtown                   452
South End                  439
Fenway                     438
Brighton                   358
South Boston               337
Allston                    329
Roxbury                    319
East Boston                311
Beacon Hill                255
North End                  244
Mission Hill               214
Charlestown                156
West End                   154
Chinatown                  139
Roslindale                 117
South Boston Waterfront     76
Mattapan                    75
West Roxbury                70
Hyde Park                   59
Bay Village                 39
Longwood Medical Area       11
Leather District             7
Name: neighbourhood_cleansed, dtype: int64

In [141]:
n_count = df_listings_prep1.groupby('neighbourhood_cleansed').count()['id']
n_count = n_count.sort_values(ascending=False)
x = n_count.index
y = n_count
y_cum = np.cumsum(y)
y_perc = 100*y_cum/y.sum()

trace1 = dict(type='bar',
    x=x,
    y=y,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5)),
    name='Count of Listings',
    opacity=0.6
)

trace2 = dict(type='scatter',
    x=x,
    y=y_perc,
    marker=dict(
        color='#7F7F7F'
    ),
    line=dict(color= '#7F7F7F', width= 1.5),
    name='Cumulative Listings Percent of Total (%)',
    xaxis='x1', 
    yaxis='y2' 
)
    
data = [trace1, trace2]
layout = go.Layout(
    title='Number of Listings by neighbourhood',
    legend= dict(x=-.1, y=1.2),
    yaxis=dict(
        title='Count of Listings'
    ),
    yaxis2=dict(
        title='Percent of Total Listings (%)',
        range=[0,100],
        overlaying='y',
        anchor='x',
        side='right'
        )
    )

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Percent of listings by neighbourhood')



Next, we investigate the price distribution to identify outliers, that is, listings priced far above a reasonable nightly rate.

In [142]:
x = df_listings_prep1.price

annotations={}

trace1 = go.Histogram(
    x=x,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5)),
    name='Price',
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title = "Price Distributions",
    xaxis=dict(
        title='Price'),
    yaxis=dict(
        title='Count'))

layout.update(dict(annotations=[go.Annotation(text="Outliers all after", x=1000, y=0)]))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, annotations=annotations, filename='Price Distribution')



In [143]:
x = df_listings_prep1.price[df_listings_prep1.price <=1000]

trace1 = go.Histogram(
    x=x,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5)),
    name='Price',
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title = "Price Distributions <1000",
    xaxis=dict(
        title='Price'),
    yaxis=dict(
        title='Count'))

layout.update(dict(annotations=[go.Annotation(text="Outliers all after", x=500, y=0)]))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Price Distribution <1000')




In [144]:
x = df_listings_prep1.price[df_listings_prep1.price <=500]

trace1 = go.Histogram(
    x=x,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5)),
    name='Price',
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title = "Price Distributions <500",
    xaxis=dict(
        title='Price'),
    yaxis=dict(
        title='Count'))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Price Distribution <500')




Based on the histograms of the price distribution across all listings, a cap of 500 dollars per night is a reasonable cut-off compared with the higher rates, which are treated as outliers. It is still a high ceiling, since most listings fall between 50 and 200 dollars per night. We will nevertheless use it as the maximum nightly rate in the rest of the analysis.
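
One way to sanity-check a cut-off like this is to look at how many listings each candidate cap would keep. The sketch below is an assumption-laden illustration: `share_at_or_below` is a hypothetical helper, and the toy prices mix a few values seen in the tables above with two invented high values (750 and 5000).

```python
import pandas as pd


def share_at_or_below(prices, cap):
    """Return the percentage of listings priced at or below cap."""
    return 100 * (prices <= cap).mean()


# Toy prices; 750 and 5000 are invented to stand in for outliers.
toy_prices = pd.Series([65, 99, 125, 145, 169, 209, 259, 389, 750, 5000])

for cap in (1000, 500, 200):
    print(cap, round(share_at_or_below(toy_prices, cap), 1))
```

In the notebook itself, the $500 cap reduces the data from 6,140 to 5,907 listings, so roughly 96% of listings are kept.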

In [145]:
#setting the final dataset to include only listings within the price up to $500
df_listings_prep1 = df_listings_prep1[df_listings_prep1.price <=500]
print(df_listings_prep1.price.max())

500.0


---
### <font color=blue> Q5. Which neighborhoods are considered as the most expensive in Boston?


In [146]:
n_exp = df_listings_prep1.groupby('neighbourhood_cleansed')
n_exp = n_exp['price'].describe().drop(['std', '25%', '50%', '75%'], axis=1).sort_values(by='mean', ascending=False)
n_exp

Unnamed: 0_level_0,count,mean,min,max
neighbourhood_cleansed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Chinatown,130.0,226.430769,54.0,500.0
South Boston Waterfront,70.0,223.385714,49.0,450.0
West End,136.0,218.867647,49.0,499.0
Downtown,428.0,215.432243,49.0,500.0
Back Bay,427.0,215.009368,40.0,500.0
Leather District,6.0,210.5,65.0,375.0
Fenway,429.0,200.037296,20.0,500.0
Charlestown,147.0,192.360544,10.0,500.0
South Boston,322.0,191.214286,48.0,500.0
South End,429.0,187.114219,0.0,500.0


In [147]:
#creating numpy arrays for the top six neighbourhoods' prices: 'Chinatown', 'South Boston Waterfront', 'West End', 'Downtown', 'Back Bay', and 'Fenway'
#'Leather District' was skipped because it has only 6 listings, a very small number compared to other neighbourhoods' listings.

chinatown_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'Chinatown']
chinatown_p = np.array(chinatown_p.price)

back_bay_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'Back Bay']
back_bay_p = np.array(back_bay_p.price)

west_end_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'West End']
west_end_p = np.array(west_end_p.price)

waterfront_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'South Boston Waterfront']
waterfront_p = np.array(waterfront_p.price)

downtown_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'Downtown']
downtown_p = np.array(downtown_p.price)

fenway_p = df_listings_prep1[df_listings_prep1.neighbourhood_cleansed == 'Fenway']
fenway_p = np.array(fenway_p.price)

In [148]:
x_data = ['Chinatown', 'West End', 'Downtown', 'South Boston Waterfront', 'Back Bay', 'Fenway']

y0 = chinatown_p
y1 = west_end_p
y2 = downtown_p
y3 = waterfront_p
y4 = back_bay_p
y5 = fenway_p

y_data = [y0,y1,y2,y3,y4,y5]

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']

traces = []

for xd, yd, cls in zip(x_data, y_data, colors):
        traces.append(go.Box(
            y=yd,
            name=xd,
            boxmean=True,
            boxpoints='all',
            jitter=0.5,
            whiskerwidth=0.2,
            fillcolor=cls,
            marker=dict(
                size=2,
            ),
            line=dict(width=1),
        ))

layout = go.Layout(
    title='Comparison between the most expensive neighborhoods',
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=False,
        dtick=40,
        gridcolor='rgb(255, 255, 255)',
        gridwidth=1,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
    showlegend=False
)

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig, filename='Most expensive neighbourhood')



While the bar plot shows that Dorchester and Jamaica Plain are the top two neighborhoods by number of listings, neither of them is among the most expensive neighborhoods compared in the box plots. The box plots visualize the price statistics from the table above for each of the six neighborhoods.

      1. Chinatown, which is part of Downtown, has the highest average price, around $226 per night, while Fenway has the lowest of the six, around $200 per night.
      2. Most listings in these neighborhoods fall between $131 and $289 per night, which indicates a reasonable price range for them.
      
*The heatmap below elaborates on these numbers: it shows not only where nightly rates are high but also where most of the listings are located.*
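
The per-neighborhood arrays used for the box plots were built one neighborhood at a time. A compact alternative is sketched below under stated assumptions: the toy prices are invented, `MIN_LISTINGS` is a hypothetical threshold standing in for the manual exclusion of Leather District, and `price_arrays` is an illustrative name.

```python
import pandas as pd

# Toy data, not taken from the Boston dataset.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Chinatown"] * 3 + ["Fenway"] * 3 + ["Leather District"],
    "price": [150.0, 220.0, 310.0, 90.0, 200.0, 320.0, 210.0],
})

MIN_LISTINGS = 2  # hypothetical threshold for excluding very small neighborhoods

# Summary statistics per neighborhood, keeping only those with enough listings.
stats = toy.groupby("neighbourhood_cleansed")["price"].agg(["count", "mean", "median"])
stats = stats[stats["count"] >= MIN_LISTINGS].sort_values("mean", ascending=False)

# One price array per retained neighborhood, ready to pass to a box-plot trace.
price_arrays = {
    name: group.to_numpy()
    for name, group in toy.groupby("neighbourhood_cleansed")["price"]
    if name in stats.index
}

print(stats)
```

Filtering by a minimum count makes the exclusion rule explicit instead of relying on a hard-coded list of names.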

In [149]:
#Get the locations from the data set
locations = df_listings_prep1[['latitude', 'longitude']]
#Get the price from the data
prices = df_listings_prep1['price']
#Set up your map
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations, weights=prices))
fig

Figure(layout=FigureLayout(height='420px'))

        
<center> <img src="https://i.etsystatic.com/5206469/r/il/4273de/597430359/il_fullxfull.597430359_a122.jpg" style="width:400px;"/>

Art by [Carrie Wagner](https://www.etsy.com/shop/SepiaLepus?ref=simple-shop-header-name&listing_id=208367563) on [Etsy](https://www.etsy.com/?ref=lgo)

---
### <font color=blue> Q6. Which are the popular neighbourhoods based on the average number of reviews?

In [150]:
popularity = df_listings_prep1.groupby('neighbourhood_cleansed')
popularity = popularity['number_of_reviews'].describe().drop(['std', '25%', '50%', 'min', '75%'], axis=1).sort_values(by='mean', ascending=False)
popularity = popularity[popularity['count'] > 120]
popularity

Unnamed: 0_level_0,count,mean,max
neighbourhood_cleansed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
North End,238.0,68.184874,491.0
East Boston,310.0,56.748387,444.0
Dorchester,528.0,42.458333,513.0
Beacon Hill,247.0,41.17004,290.0
Roxbury,307.0,38.068404,371.0
South End,429.0,37.079254,358.0
Jamaica Plain,507.0,33.781065,424.0
South Boston,322.0,32.73913,413.0
Charlestown,147.0,31.346939,321.0
Back Bay,427.0,29.29274,507.0


*The heatmap below supports these results: the North End has the highest average number of reviews per listing, although it does not have the highest number of listings.*

In [151]:
#Get the locations from the data set
locations = df_listings_prep1[['latitude', 'longitude']]
#Get the number of reviews from the data
review_counts = df_listings_prep1['number_of_reviews']
#Set up your map
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations, weights=review_counts))
fig

Figure(layout=FigureLayout(height='420px'))

---
### <font color=blue> Q7. Which type of room has the majority of listings in Boston Airbnb?

In [152]:
rooms_df = df_listings_prep1['room_type'].value_counts()
rooms_df

Entire home/apt    3775
Private room       2061
Shared room          71
Name: room_type, dtype: int64

In [153]:
rooms = df_listings_prep1['room_type'].value_counts()
labels = rooms_df.index
values = rooms
colors = ['slateblue', 'aquamarine', 'mediumblue']


trace1 = go.Pie(labels=labels, values=round(100*values/values.sum(), 2),
               hoverinfo='label+percent', textinfo='value', 
               textfont=dict(size=20),
               hole=0.9,
               showlegend=False,
               opacity=0.3,
               marker=dict(colors=colors, 
                           line=dict(color='#000000', width=2)))


trace2 = go.Bar(
    x=[rooms_df.index[0]],
    y=[rooms_df.iloc[0]],
    name=rooms_df.index[0],
    marker=dict(
        color='slateblue',
        line=dict(
            color='slateblue',
            width=1.5,
        )
    ),
    opacity=0.4
)

trace3 = go.Bar(
    x=[rooms_df.index[1]],
    y=[rooms_df.iloc[1]],
    name=rooms_df.index[1],
    marker=dict(
        color='aquamarine',
        line=dict(
            color='aquamarine',
            width=1.5,
        )
    ),
    opacity=0.4
)

trace4 = go.Bar(
    x=[rooms_df.index[2]],
    y=[rooms_df.iloc[2]],
    name=rooms_df.index[2],
    marker=dict(
        color='mediumblue',
        line=dict(
            color='mediumblue',
            width=1.5,
        )
    ),
    opacity=0.3
)

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    title='Count of room type',
    xaxis=dict(
        title='Room Type',
        domain=[0.4, 0.6]
    ),
    yaxis=dict(
        title='Count',
        domain=[0.4, 0.7]
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='percent of room types')


Airbnb hosts can list entire homes/apartments, private rooms, or shared rooms. The donut plot shows that entire homes/apartments account for 63.9% of the listings (3,775 listings across Boston), followed by private rooms with 2,061 listings (34.9%) and shared rooms with only 71 listings (1.2%).
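
The percentages above can be computed directly with `value_counts(normalize=True)`. The sketch below rebuilds a column with the same counts as the `room_type` output (3,775 / 2,061 / 71); the variable names are illustrative.

```python
import pandas as pd

# Rebuild a room_type column with the counts reported above.
room_type = pd.Series(
    ["Entire home/apt"] * 3775 + ["Private room"] * 2061 + ["Shared room"] * 71,
    name="room_type",
)

# normalize=True returns fractions instead of counts; convert to percent.
room_share_pct = room_type.value_counts(normalize=True).mul(100).round(1)
print(room_share_pct)
```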

---

In [154]:
#copying the latest dataset
df_listings_prep2 = df_listings_prep1.copy()

In [155]:
df_listings_prep2.columns

Index(['id', 'host_name', 'host_is_superhost', 'neighbourhood_cleansed',
       'zipcode', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bedrooms', 'beds', 'amenities', 'price', 'cleaning_fee',
       'number_of_reviews', 'instant_bookable', 'cancellation_policy',
       'latitude', 'longitude'],
      dtype='object')

In [156]:
#dropping four features not needed for further analysis
df_listings_prep2 = df_listings_prep2.drop(['host_name', 'zipcode', 'latitude', 'longitude'], axis=1)
df_listings_prep2.sample(5)

Unnamed: 0,id,host_is_superhost,neighbourhood_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,price,cleaning_fee,number_of_reviews,instant_bookable,cancellation_policy
5066,28218323,0,Chinatown,Apartment,Entire home/apt,4.0,1.0,1.0,2.0,"{TV,Wifi,""Air conditioning"",Kitchen,Elevator,H...",249.0,59.0,6.0,1,moderate
536,4568116,1,Roxbury,Townhouse,Entire home/apt,2.0,1.0,0.0,1.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",130.0,30.0,371.0,0,moderate
235,1545185,0,Brighton,Guest suite,Entire home/apt,4.0,1.0,1.0,2.0,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",140.0,100.0,92.0,1,moderate
2075,15451419,0,South End,Apartment,Entire home/apt,3.0,2.0,2.0,2.0,"{TV,Wifi,""Air conditioning"",Kitchen,Elevator,H...",320.0,20.0,9.0,1,moderate
3441,21648941,0,Brighton,House,Private room,2.0,1.5,1.0,1.0,"{TV,Wifi,""Air conditioning"",Kitchen,""Free stre...",48.0,30.0,1.0,1,moderate


In [157]:
#assessing the amenities feature
df_listings_prep2.amenities.sample(10)

5576                                                   {}
1596    {TV,"Cable TV",Wifi,"Air conditioning","Paid p...
1117    {TV,Internet,Wifi,"Air conditioning",Kitchen,"...
3570    {TV,Wifi,"Air conditioning",Kitchen,Elevator,H...
680     {TV,"Cable TV",Wifi,"Air conditioning","Free s...
315     {TV,"Cable TV",Internet,Wifi,"Air conditioning...
4685    {TV,"Cable TV",Wifi,"Air conditioning",Kitchen...
2944    {TV,Internet,Wifi,"Air conditioning",Kitchen,H...
3161    {TV,Wifi,Kitchen,"Free street parking",Heating...
120     {TV,"Cable TV",Internet,Wifi,"Air conditioning...
Name: amenities, dtype: object

In [158]:
#splitting the amenities and then creating dummies
df_listings_prep2['amenities'] = df_listings_prep2['amenities'].str.replace('[{}" ]', '', regex=True)
df_amenities = df_listings_prep2.amenities.str.get_dummies(sep = ",")
print(df_amenities.shape)
df_amenities.sample(2)

(5907, 120)


Unnamed: 0,24-hourcheck-in,Accessible-heightbed,Accessible-heighttoilet,Airconditioning,BBQgrill,Babybath,Babymonitor,Babysitterrecommendations,Bathtub,Bathtubwithbathchair,...,Wideclearancetobed,Wideclearancetoshower,Widedoorway,Wideentryway,Widehallwayclearance,Wifi,Windowguards,toilet,translationmissing:en.hosting_amenity_49,translationmissing:en.hosting_amenity_50
3623,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
209,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
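To make the amenities step concrete, here is a minimal sketch on three made-up amenity strings written in the same `{...}` format as the raw column. The sample values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Three made-up amenity strings in the same format as the raw column
raw = pd.Series(['{TV,Wifi,"Air conditioning"}', '{Wifi,Kitchen}', '{}'])

# Strip braces, quotes, and spaces so that only comma-separated names remain
cleaned = raw.str.replace(r'[{}" ]', '', regex=True)

# One 0/1 indicator column per distinct amenity; an empty string yields all zeros
dummies = cleaned.str.get_dummies(sep=',')
print(dummies)
```

Because spaces are stripped as well, a name such as "Air conditioning" becomes the column `Airconditioning`, which matches the column names in the output above.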


In [159]:
#preparing the categorical features that will be used for modeling
features_cat = df_listings_prep2.select_dtypes(include=['object'])
features_cat = features_cat.drop(['amenities'], axis=1) #dropping the amenities feature as it has already been split into 120 dummies

#getting dummies for each categorical feature except amenities
features_cat = pd.get_dummies(features_cat)

#viewing the dataset after creating the dummies
print(features_cat.shape)
features_cat.sample(2)

(5907, 55)


Unnamed: 0,neighbourhood_cleansed_Allston,neighbourhood_cleansed_Back Bay,neighbourhood_cleansed_Bay Village,neighbourhood_cleansed_Beacon Hill,neighbourhood_cleansed_Brighton,neighbourhood_cleansed_Charlestown,neighbourhood_cleansed_Chinatown,neighbourhood_cleansed_Dorchester,neighbourhood_cleansed_Downtown,neighbourhood_cleansed_East Boston,...,property_type_Villa,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,cancellation_policy_flexible,cancellation_policy_moderate,cancellation_policy_strict,cancellation_policy_strict_14_with_grace_period,cancellation_policy_super_strict_30,cancellation_policy_super_strict_60
4958,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
692,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
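One point that may be worth keeping in mind: when every level of a categorical feature gets its own dummy and a constant is added later, the dummies of each feature sum to 1 in every row and so duplicate the constant column. This is possibly why the OLS summary further down reports fewer model degrees of freedom (177) than estimated coefficients (183), although other redundant columns could contribute too. The sketch below uses a made-up `toy` frame to show the effect and the `drop_first=True` option of `pd.get_dummies`; it is an illustration, not the encoding used in this notebook.

```python
import pandas as pd

# Made-up categorical column with three levels
toy = pd.DataFrame({'room_type': ['Entire home/apt', 'Private room',
                                  'Shared room', 'Private room']})

# Full encoding: one column per level, so each row sums to exactly 1
full = pd.get_dummies(toy)
row_sums = full.sum(axis=1).tolist()

# Reduced encoding: the first level becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(row_sums)
print(list(reduced.columns))
```

With the reduced encoding, each remaining coefficient reads as a difference from the dropped baseline level.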


In [160]:
#preparing the numerical features that will be used for modeling
features_num = df_listings_prep2.select_dtypes(include=['float', 'int'])
features_num_lst = features_num.columns
print(features_num.shape)
list(features_num_lst)
features_num.sample(2)

(5907, 10)


Unnamed: 0,id,host_is_superhost,accommodates,bathrooms,bedrooms,beds,price,cleaning_fee,number_of_reviews,instant_bookable
5615,29901892,1,2.0,1.0,1.0,1.0,52.0,14.0,10.0,1
1600,13140867,0,2.0,1.0,1.0,1.0,250.0,125.0,133.0,1


In [161]:
#merging all the prepared dataframes created: df_amenities, features_cat, and features_num
df = pd.concat([features_num, features_cat, df_amenities], axis=1, join='inner')
df.sample(10)

Unnamed: 0,id,host_is_superhost,accommodates,bathrooms,bedrooms,beds,price,cleaning_fee,number_of_reviews,instant_bookable,...,Wideclearancetobed,Wideclearancetoshower,Widedoorway,Wideentryway,Widehallwayclearance,Wifi,Windowguards,toilet,translationmissing:en.hosting_amenity_49,translationmissing:en.hosting_amenity_50
3032,20228320,1,3.0,1.0,1.0,1.0,195.0,80.0,86.0,1,...,0,0,0,0,0,1,0,0,0,0
1717,13685207,0,6.0,2.0,3.0,3.0,184.0,154.0,27.0,1,...,0,0,0,0,0,1,0,0,0,0
2848,19455818,0,5.0,2.0,2.0,1.0,469.0,100.0,0.0,0,...,0,0,0,0,0,1,0,0,0,0
618,5259996,0,3.0,1.5,2.0,2.0,245.0,175.0,5.0,0,...,0,0,0,0,0,1,0,0,0,0
6020,31571124,0,5.0,1.0,2.0,3.0,125.0,69.0,0.0,1,...,0,0,0,0,0,1,0,0,0,0
5469,29539009,0,4.0,1.0,0.0,1.0,133.0,100.0,0.0,0,...,0,0,0,0,0,1,0,0,0,0
4075,23467213,0,6.0,3.0,3.0,3.0,405.0,147.0,17.0,1,...,0,0,0,0,0,1,0,0,0,0
5922,31073977,1,2.0,2.0,1.0,1.0,44.0,26.0,2.0,0,...,0,0,0,0,0,1,0,0,0,0
5288,29014832,0,2.0,1.0,1.0,1.0,70.0,15.0,0.0,1,...,0,0,0,0,0,1,0,0,0,0
1520,12581300,0,2.0,1.0,1.0,1.0,65.0,5.0,11.0,0,...,0,0,0,0,0,1,0,0,0,1
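The merge relies on `pd.concat` aligning rows by index label, with `join='inner'` keeping only the labels present in every frame. A small sketch with two made-up frames (`left` and `right` are hypothetical names) shows what that means in practice:

```python
import pandas as pd

# Two made-up frames whose indexes only partly overlap
left = pd.DataFrame({'price': [100.0, 80.0, 60.0]}, index=[0, 1, 2])
right = pd.DataFrame({'Wifi': [1, 0]}, index=[0, 2])

# axis=1 places the frames side by side; join='inner' keeps shared labels only
merged = pd.concat([left, right], axis=1, join='inner')
print(merged)
```

In the notebook all three frames come from the same `df_listings_prep2` index, so no rows are lost: the merged shape keeps all 5907 rows.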


In [162]:
#confirming the entire dataframes have been merged correctly with the right list of numerical features, amenities, and dummies for categorical features
print(df.shape)
list(df.columns)

(5907, 185)


['id',
 'host_is_superhost',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'price',
 'cleaning_fee',
 'number_of_reviews',
 'instant_bookable',
 'neighbourhood_cleansed_Allston',
 'neighbourhood_cleansed_Back Bay',
 'neighbourhood_cleansed_Bay Village',
 'neighbourhood_cleansed_Beacon Hill',
 'neighbourhood_cleansed_Brighton',
 'neighbourhood_cleansed_Charlestown',
 'neighbourhood_cleansed_Chinatown',
 'neighbourhood_cleansed_Dorchester',
 'neighbourhood_cleansed_Downtown',
 'neighbourhood_cleansed_East Boston',
 'neighbourhood_cleansed_Fenway',
 'neighbourhood_cleansed_Hyde Park',
 'neighbourhood_cleansed_Jamaica Plain',
 'neighbourhood_cleansed_Leather District',
 'neighbourhood_cleansed_Longwood Medical Area',
 'neighbourhood_cleansed_Mattapan',
 'neighbourhood_cleansed_Mission Hill',
 'neighbourhood_cleansed_North End',
 'neighbourhood_cleansed_Roslindale',
 'neighbourhood_cleansed_Roxbury',
 'neighbourhood_cleansed_South Boston',
 'neighbourhood_cleansed_South Boston Water

In [163]:
#checking if we have any missing value for any feature
df.isnull().sum().any()

False

---
# <center>STAGE THREE</center>
---

## 4. Data Modeling

In [164]:
#copying the latest dataset
df_model = df.copy()

In [165]:
#listing the final dataset's features
print(df_model.shape)
list(df_model.columns)

(5907, 185)


['id',
 'host_is_superhost',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'price',
 'cleaning_fee',
 'number_of_reviews',
 'instant_bookable',
 'neighbourhood_cleansed_Allston',
 'neighbourhood_cleansed_Back Bay',
 'neighbourhood_cleansed_Bay Village',
 'neighbourhood_cleansed_Beacon Hill',
 'neighbourhood_cleansed_Brighton',
 'neighbourhood_cleansed_Charlestown',
 'neighbourhood_cleansed_Chinatown',
 'neighbourhood_cleansed_Dorchester',
 'neighbourhood_cleansed_Downtown',
 'neighbourhood_cleansed_East Boston',
 'neighbourhood_cleansed_Fenway',
 'neighbourhood_cleansed_Hyde Park',
 'neighbourhood_cleansed_Jamaica Plain',
 'neighbourhood_cleansed_Leather District',
 'neighbourhood_cleansed_Longwood Medical Area',
 'neighbourhood_cleansed_Mattapan',
 'neighbourhood_cleansed_Mission Hill',
 'neighbourhood_cleansed_North End',
 'neighbourhood_cleansed_Roslindale',
 'neighbourhood_cleansed_Roxbury',
 'neighbourhood_cleansed_South Boston',
 'neighbourhood_cleansed_South Boston Water

In [166]:
#splitting into explanatory and response variables
X = df_model.drop(['price', 'id', 'cleaning_fee'], axis=1)
X = sm.add_constant(X)
y = df_model['price']

#splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, 
                                                    random_state = 42)

#showing the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 4725 samples.
Testing set has 1182 samples.



Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
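The split only pays off if the model is fit on the training part and scored on the held-out part; the OLS fit in the Evaluation stage uses the full `X` and `y`, so its R<sup>2</sup> is an in-sample figure. As a rough sketch of an out-of-sample check, the block below uses synthetic data (the arrays `X_demo` and `y_demo` and their coefficients are invented for illustration) together with scikit-learn's `LinearRegression`, which is imported at the top of the notebook but is not the estimator used for the reported results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings data (assumption: 3 features, linear signal)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 100 + X_demo @ np.array([30.0, -10.0, 5.0]) + rng.normal(scale=20.0, size=500)

# Same split settings as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=42)

# Fit on the training rows only, then score on rows the model has not seen
reg = LinearRegression().fit(X_tr, y_tr)
pred = reg.predict(X_te)

test_r2 = r2_score(y_te, pred)
test_rmse = np.sqrt(mean_squared_error(y_te, pred))
print("Test R^2: {:.3f}, test RMSE: {:.2f}".format(test_r2, test_rmse))
```

Applied to the real data, the same pattern would use `X_train`, `y_train`, `X_test`, and `y_test` from the cell above.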



---
# <center>STAGE FOUR</center>
---

## 5. Evaluation

*Reference for statsmodels (sm): Seabold, Skipper, and Josef Perktold. “Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.*


In [167]:
#fitting an ordinary least squares (OLS) model on the full dataset and obtaining its summary
model = sm.OLS(y, X).fit()
predictions = model.predict(X) 

#printing out the statistics
print(type(model))
print(model.summary())

<class 'statsmodels.regression.linear_model.RegressionResultsWrapper'>
                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.620
Model:                            OLS   Adj. R-squared:                  0.609
Method:                 Least Squares   F-statistic:                     52.86
Date:                Thu, 09 May 2019   Prob (F-statistic):               0.00
Time:                        22:51:02   Log-Likelihood:                -32648.
No. Observations:                5907   AIC:                         6.565e+04
Df Residuals:                    5729   BIC:                         6.684e+04
Df Model:                         177                                         
Covariance Type:            nonrobust                                         
                                                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------

Based on the regression results, the adjusted R<sup>2</sup> indicates that the listed features explain about 61% of the variability in Boston Airbnb prices. This is an in-sample figure, since the model is fit on all 5,907 listings. Adjusted R<sup>2</sup> was chosen over R<sup>2</sup> because it penalizes the number of features in the model. Next, we look at which features carry more weight than others, and then check whether each feature has a statistically significant relationship with the price. If a feature's p-value is less than 0.05, we reject the null hypothesis that its coefficient is zero at the 5% significance level; that is, there is sufficient evidence of a linear relationship between the price and that feature, holding the other features fixed.
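As a quick check on how the adjusted figure relates to the plain one, the usual formula is adjusted R<sup>2</sup> = 1 - (1 - R<sup>2</sup>)(n - 1)/(n - p - 1), where n is the number of observations and p the model degrees of freedom. The sketch below plugs in the rounded values printed in the summary (R<sup>2</sup> = 0.620, n = 5907, p = 177); the helper name `adjusted_r2` is made up for this example.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the OLS summary above (R^2 is already rounded there)
adj = adjusted_r2(0.620, 5907, 177)
print(round(adj, 3))
```

The result lands very close to the reported 0.609; the small gap comes from starting with an R<sup>2</sup> that has already been rounded to three decimals.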

---
### <font color=blue> Q8. What are the features that influence the price of Boston Airbnb listings? And can we predict the price of new listings with a predictive model?

In [168]:
#understanding the most influential coefficients in the model
np.abs(model.params).sort_values(ascending=False)

Electricprofilingbed                               80.502081
property_type_Bungalow                             72.998709
cancellation_policy_strict                         64.558547
property_type_Barn                                 63.443751
room_type_Entire home/apt                          57.825315
neighbourhood_cleansed_Back Bay                    57.080997
Washer/Dryer                                       56.774705
Fixedgrabbarsfortoilet                             56.256382
neighbourhood_cleansed_Leather District            51.050006
Showerchair                                        48.782778
property_type_Serviced apartment                   47.726852
neighbourhood_cleansed_West Roxbury                45.355507
neighbourhood_cleansed_Hyde Park                   44.812446
neighbourhood_cleansed_Mattapan                    44.705022
Ski-in/Ski-out                                     43.061954
const                                              41.120013
neighbourhood_cleansed_R

Ranked by the absolute value of their coefficients, the following features have the largest estimated effect on the price of Boston Airbnb listings, either increasing or decreasing it:

1. With respect to amenities:

    - Electric Profiling Bed
    - Stair gates
    - Washer/Dryer
    - Hot Water Kettle
    - Fixed Grab Bars For Toilet
    - Private bathroom
    - Room-darkening shades

2. With respect to property type:

    - Bungalow
    - Serviced apartment
    - Tiny house
    - Hotel
    - House

3. With respect to the neighborhood:

    - Back Bay
    - Leather District
    - South Boston Waterfront
    - Chinatown
    - Downtown
    - Mattapan
    - Hyde Park

4. With respect to other rules and features:

    - cancellation policy: strict
    - room type: Entire home/apt
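Sorting `np.abs(model.params)` ranks the features by magnitude but discards the sign, so the output alone does not say whether a feature raises or lowers the price. One way to keep both is to order the signed coefficients by their absolute values. The sketch below uses a small made-up Series: the first value is the coefficient reported above for `room_type_Entire home/apt`, and the two `hypothetical_feature_*` entries are invented to show a negative case.

```python
import pandas as pd

# Signed coefficients; the hypothetical entries are illustrative only
params = pd.Series({
    'room_type_Entire home/apt': 57.825315,
    'hypothetical_feature_a': -45.0,
    'hypothetical_feature_b': 3.2,
})

# Rank by magnitude, but keep the original signed values
order = params.abs().sort_values(ascending=False).index
ranked = params.reindex(order)
print(ranked)
```

With a fitted statsmodels result, `model.params` could take the place of `params` here.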

In [169]:
#looking at a summary of the coefficients
np.abs(model.params).describe()

count    183.000000
mean      15.719676
std       15.629477
min        0.090662
25%        4.633837
50%       11.619952
75%       22.833570
max       80.502081
dtype: float64

In [170]:
#listing the coefficients with an absolute value of at least 20 (roughly the top quartile; the 75th percentile is 22.83)
coeffs = model.params[np.abs(model.params) >= 20]
print(coeffs.sort_values(ascending=False))

Electricprofilingbed                              80.502081
property_type_Bungalow                            72.998709
cancellation_policy_strict                        64.558547
property_type_Barn                                63.443751
room_type_Entire home/apt                         57.825315
neighbourhood_cleansed_Back Bay                   57.080997
Washer/Dryer                                      56.774705
Fixedgrabbarsfortoilet                            56.256382
neighbourhood_cleansed_Leather District           51.050006
property_type_Serviced apartment                  47.726852
const                                             41.120013
neighbourhood_cleansed_South Boston Waterfront    35.016654
neighbourhood_cleansed_Chinatown                  31.666661
neighbourhood_cleansed_Downtown                   31.492873
property_type_Tiny house                          29.567973
property_type_Aparthotel                          29.146802
neighbourhood_cleansed_South End        
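The cutoff of 20 is a round number slightly below the 75th percentile shown in the summary above (22.83), so it keeps a little more than the top quarter of coefficients. If an exact top 25% is wanted, the threshold can be computed instead of hard-coded. This is a sketch on eight made-up coefficients, not the model's values:

```python
import pandas as pd

# Eight made-up signed coefficients
params = pd.Series([80.5, -45.0, 22.0, -12.0, 5.0, 3.0, -1.0, 0.5],
                   index=['feature_{}'.format(i) for i in range(8)])

# 75th percentile of the absolute values marks the start of the top quartile
threshold = params.abs().quantile(0.75)

# Keep signed coefficients whose magnitude reaches the threshold
top = params[params.abs() >= threshold]
print(threshold)
print(top.sort_values(ascending=False))
```

Here two of the eight coefficients pass the cutoff, which is exactly 25% of them.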

In [171]:
features_coeffs = coeffs.sort_values(ascending=True)


trace1 = go.Bar(
    y=features_coeffs.index,
    x=features_coeffs,
    name='Most Influential Features',
    orientation = 'h',
    marker = dict(
        color = 'rgb(158,202,225)',
        line = dict(
            color = 'rgb(8,48,107)',
            width = 1.5),
        opacity=0.6
    ))

data = [trace1]
 
layout = go.Layout(
    title = "Most Influential Features",
    xaxis=dict(
        title='Coefficients',
        autorange=True),
    yaxis=dict(automargin=True,
               autorange=True))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Most Influential Features')


Consider using IPython.display.IFrame instead



In [172]:
#listing the features that are significant at the 95% confidence level, i.e. with p-values less than 0.05
pvalues = model.pvalues[model.pvalues < 0.05]
pvalues

const                                             7.621847e-08
host_is_superhost                                 3.897856e-04
accommodates                                      4.282992e-25
bathrooms                                         7.039712e-19
bedrooms                                          1.313636e-29
number_of_reviews                                 2.699818e-09
neighbourhood_cleansed_Allston                    1.087438e-09
neighbourhood_cleansed_Back Bay                   9.484528e-55
neighbourhood_cleansed_Bay Village                2.823893e-02
neighbourhood_cleansed_Beacon Hill                1.078504e-09
neighbourhood_cleansed_Brighton                   3.843618e-11
neighbourhood_cleansed_Charlestown                1.595068e-05
neighbourhood_cleansed_Chinatown                  3.779635e-08
neighbourhood_cleansed_Dorchester                 1.809144e-14
neighbourhood_cleansed_Downtown                   8.553095e-17
neighbourhood_cleansed_East Boston                7.784
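Reading the p-values next to the coefficients makes it easier to tell which large effects are also statistically significant. The sketch below builds such a table from a few values printed earlier in this section (`const`, Back Bay, and Chinatown) plus one invented row, `hypothetical_feature`, that is not significant.

```python
import pandas as pd

# Coefficients and p-values copied from the outputs above,
# except hypothetical_feature, which is made up for illustration
coef = pd.Series({
    'const': 41.120013,
    'neighbourhood_cleansed_Back Bay': 57.080997,
    'neighbourhood_cleansed_Chinatown': 31.666661,
    'hypothetical_feature': 12.0,
})
pval = pd.Series({
    'const': 7.621847e-08,
    'neighbourhood_cleansed_Back Bay': 9.484528e-55,
    'neighbourhood_cleansed_Chinatown': 3.779635e-08,
    'hypothetical_feature': 0.40,
})

# With a fitted model this could be built from model.params and model.pvalues
table = pd.DataFrame({'coef': coef, 'p_value': pval})

# Keep rows significant at the 5% level, largest coefficient first
significant = table[table['p_value'] < 0.05].sort_values('coef', ascending=False)
print(significant)
```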

In [175]:
!!jupyter nbconvert *.ipynb

['[NbConvertApp] Converting notebook boston_airbnb_analysis.ipynb to html',
 '[NbConvertApp] Writing 660642 bytes to boston_airbnb_analysis.html']