![image.png](attachment:image.png)

## MSCA 37014 - Python for Analytics

### Airbnb Project

Airbnb wants to better understand the data related to the price of listings on its website, and how useful that data is in the listing assessment process. The dataset consists of a random sample of homes that have been booked in Amsterdam during December 2020.

In [1]:
# import packages 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

np.set_printoptions(precision=2)
pd.set_option('display.max_columns', None)
%matplotlib inline
plt.rc('figure', figsize=(10,5))
figsize_with_subplots = (10,10)
bin_size=10

In [2]:
df = pd.read_csv('listings.csv.gz')

## Exploratory Data Analysis

Three main components of exploring data:
1. Understanding variables 
2. Cleaning your dataset
3. Analyzing relationships between variables

In [3]:
df.shape

(16116, 74)

In [4]:
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2818,https://www.airbnb.com/rooms/2818,20210907032724,2021-09-07,Quiet Garden View Room & Super Fast WiFi,Quiet Garden View Room & Super Fast WiFi<br />...,"Indische Buurt (""Indies Neighborhood"") is a ne...",https://a0.muscache.com/pictures/10272854/8dcc...,3159,https://www.airbnb.com/users/show/3159,Daniel,2008-09-24,"Amsterdam, Noord-Holland, The Netherlands","Upon arriving in Amsterdam, one can imagine as...",within an hour,100%,100%,t,https://a0.muscache.com/im/users/3159/profile_...,https://a0.muscache.com/im/users/3159/profile_...,Indische Buurt,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Amsterdam, North Holland, Netherlands",Oostelijk Havengebied - Indische Buurt,,52.36435,4.94358,Private room in rental unit,Private room,2,,1.5 shared baths,1.0,2.0,"[""Single level home"", ""Coffee maker"", ""Long te...",$59.00,3,28,3.0,3.0,1125.0,1125.0,3.0,1125.0,,t,3,28,55,124,2021-09-07,280,2,0,2013-08-25,2019-11-21,4.89,4.93,5.0,4.97,4.97,4.68,4.81,0363 5F3A 5684 6750 D14D,t,1,0,1,0,2.86
1,20168,https://www.airbnb.com/rooms/20168,20210907032724,2021-09-07,Studio with private bathroom in the centre 1,17th century Dutch townhouse in the heart of t...,Located just in between famous central canals....,https://a0.muscache.com/pictures/69979628/fd6a...,59484,https://www.airbnb.com/users/show/59484,Alexander,2009-12-02,"Amsterdam, Noord-Holland, The Netherlands",+ (Phone number hidden by Airbnb),within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/65092...,https://a0.muscache.com/im/pictures/user/65092...,Grachtengordel,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Amsterdam, North Holland, Netherlands",Centrum-Oost,,52.36407,4.89393,Private room in townhouse,Private room,2,,1 private bath,1.0,1.0,"[""Hot water"", ""TV"", ""Hangers"", ""Essentials"", ""...",$106.00,1,365,1.0,1.0,1125.0,1125.0,1.0,1125.0,,t,0,0,0,0,2021-09-07,339,0,0,2014-01-17,2020-03-27,4.44,4.69,4.79,4.63,4.62,4.87,4.49,0363 CBB3 2C10 0C2A 1E29,t,2,0,2,0,3.64
2,25428,https://www.airbnb.com/rooms/25428,20210907032724,2021-09-07,"Lovely, sunny 1 bed apt in Ctr (w.lift) & firepl.",Lovely apt in Centre ( lift & fireplace) near ...,,https://a0.muscache.com/pictures/138431/7079a9...,56142,https://www.airbnb.com/users/show/56142,Joan,2009-11-20,"New York, New York, United States","We are a retired couple who live in NYC, and h...",,,0%,t,https://a0.muscache.com/im/users/56142/profile...,https://a0.muscache.com/im/users/56142/profile...,Grachtengordel,2.0,2.0,"['email', 'phone', 'reviews']",t,f,,Centrum-West,,52.3749,4.88487,Entire rental unit,Entire home/apt,3,,1 bath,1.0,1.0,"[""Cable TV"", ""Coffee maker"", ""Long term stays ...",$125.00,14,120,7.0,14.0,120.0,120.0,13.8,120.0,,t,1,1,3,57,2021-09-07,5,0,0,2018-01-21,2020-01-02,5.0,5.0,5.0,5.0,5.0,5.0,4.8,,f,1,1,0,0,0.11
3,27886,https://www.airbnb.com/rooms/27886,20210907032724,2021-09-07,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,"Central, quiet, safe, clean and beautiful.",https://a0.muscache.com/pictures/02c2da9d-660e...,97647,https://www.airbnb.com/users/show/97647,Flip,2010-03-23,"Amsterdam, Noord-Holland, The Netherlands","Marjan works in ""eye"" the dutch filmmuseum, an...",within an hour,86%,100%,t,https://a0.muscache.com/im/users/97647/profile...,https://a0.muscache.com/im/users/97647/profile...,Westelijke Eilanden,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Amsterdam, North Holland, Netherlands",Centrum-West,,52.38761,4.89188,Private room in houseboat,Private room,2,,1.5 baths,1.0,1.0,"[""Coffee maker"", ""Long term stays allowed"", ""P...",$141.00,2,730,2.0,2.0,1125.0,1125.0,2.0,1125.0,,t,9,20,47,66,2021-09-07,223,4,2,2013-02-17,2021-08-21,4.95,4.93,4.96,4.95,4.92,4.9,4.8,0363 974D 4986 7411 88D8,t,1,0,1,0,2.14
4,28871,https://www.airbnb.com/rooms/28871,20210907032724,2021-09-08,Comfortable double room,<b>The space</b><br />In a monumental house ri...,"Flower market , Leidseplein , Rembrantsplein",https://a0.muscache.com/pictures/160889/362340...,124245,https://www.airbnb.com/users/show/124245,Edwin,2010-05-13,"Amsterdam, Noord-Holland, The Netherlands",Hi,within an hour,100%,98%,t,https://a0.muscache.com/im/pictures/user/9986b...,https://a0.muscache.com/im/pictures/user/9986b...,Amsterdam Centrum,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Amsterdam, North Holland, Netherlands",Centrum-West,,52.36775,4.89092,Private room in rental unit,Private room,2,,1 shared bath,1.0,1.0,"[""Hot water"", ""Shampoo"", ""Dryer"", ""Hangers"", ""...",$75.00,2,1825,2.0,2.0,1825.0,1825.0,2.0,1825.0,,t,11,27,50,298,2021-09-08,353,19,8,2015-05-18,2021-08-27,4.87,4.94,4.89,4.97,4.94,4.97,4.82,0363 607B EA74 0BD8 2F6F,f,2,0,2,0,4.59


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16116 entries, 0 to 16115
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            16116 non-null  int64  
 1   listing_url                                   16116 non-null  object 
 2   scrape_id                                     16116 non-null  int64  
 3   last_scraped                                  16116 non-null  object 
 4   name                                          16086 non-null  object 
 5   description                                   15893 non-null  object 
 6   neighborhood_overview                         10405 non-null  object 
 7   picture_url                                   16116 non-null  object 
 8   host_id                                       16116 non-null  int64  
 9   host_url                                      16116 non-null 

In [5]:
df.describe()

Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,16116.0,16116.0,16116.0,16111.0,16111.0,0.0,16116.0,16116.0,16116.0,0.0,15218.0,16019.0,16116.0,16116.0,16113.0,16113.0,16113.0,16113.0,16113.0,16113.0,0.0,16116.0,16116.0,16116.0,16116.0,16116.0,16116.0,16116.0,14029.0,13815.0,13816.0,13807.0,13812.0,13807.0,13807.0,16116.0,16116.0,16116.0,16116.0,14029.0
mean,21181840.0,20210910000000.0,69760520.0,2.023338,2.023338,,52.36551,4.889434,2.836684,,1.530096,1.7581,3.991189,608.146811,3.953578,4.149134,685.520449,267246.5,4.025489,266640.0,,4.026309,8.839787,14.32173,55.31689,24.645383,1.386262,0.325453,4.691878,4.8112,4.701267,4.8478,4.868875,4.728831,4.600638,1.636883,1.096674,0.489514,0.008067,0.677467
std,13520630.0,1.683646,90271640.0,23.344729,23.344729,,0.016563,0.036151,1.312016,,0.951085,1.467627,20.987452,540.665822,20.988709,21.034236,532.480917,23924510.0,20.99616,23870380.0,,8.702405,18.1402,28.508009,107.907731,56.707709,7.477153,2.025282,0.668502,0.345724,0.440648,0.307706,0.299283,0.328007,0.39052,2.402076,1.870279,1.531128,0.134829,1.720379
min,2818.0,20210910000000.0,3159.0,0.0,0.0,,52.29034,4.75571,0.0,,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01
25%,10180200.0,20210910000000.0,9735558.0,1.0,1.0,,52.35513,4.86369,2.0,,1.0,1.0,2.0,21.0,2.0,2.0,28.0,28.0,2.0,28.0,,0.0,0.0,0.0,0.0,2.0,0.0,0.0,4.67,4.75,4.6,4.82,4.85,4.6,4.5,1.0,1.0,0.0,0.0,0.11
50%,19265930.0,20210910000000.0,29741340.0,1.0,1.0,,52.36488,4.8869,2.0,,1.0,1.0,2.0,1125.0,2.0,3.0,1125.0,1125.0,2.0,1125.0,,0.0,0.0,0.0,0.0,8.0,0.0,0.0,4.86,4.91,4.83,4.95,4.98,4.81,4.67,1.0,1.0,0.0,0.0,0.27
75%,31075480.0,20210910000000.0,89883210.0,1.0,1.0,,52.37544,4.90916,4.0,,2.0,2.0,3.0,1125.0,3.0,3.0,1125.0,1125.0,3.0,1125.0,,0.0,2.0,5.0,47.0,22.0,0.0,0.0,5.0,5.0,5.0,5.0,5.0,5.0,4.83,1.0,1.0,0.0,0.0,0.62
max,52082800.0,20210910000000.0,421003700.0,1992.0,1992.0,,52.42534,5.066508,16.0,,50.0,33.0,1100.0,1825.0,1100.0,1100.0,1825.0,2147484000.0,1100.0,2142625000.0,,30.0,60.0,90.0,365.0,877.0,422.0,137.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,30.0,30.0,21.0,4.0,107.84


Some quick observations:

- host_listings_count is highly skewed and probably contains an outlier: the max is 1992, while the 75th percentile is 1.
- The same holds for host_total_listings_count.
- neighbourhood_group_cleansed has no values.
- accommodates also seems to have a few really large values (max 16, against a 75th percentile of 4).
- bathrooms has no values.
- bedrooms seems to have an outlier: the max is 50. The same goes for beds (33), minimum_nights (1100), maximum_nights (1825), minimum_minimum_nights (1100), maximum_minimum_nights (1100), minimum_maximum_nights (1825), and minimum_nights_avg_ntm (1100).
- Not sure what the column maximum_maximum_nights represents. It shows really large values (max of about 2.1 billion).
- calendar_updated has no values.
- The availability columns also seem to be highly skewed.
- The number-of-reviews columns also seem to be highly skewed.
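
The observations above were read off the `describe()` table by eye; they can also be checked programmatically. The following is a minimal sketch, not the notebook's own method. It assumes a small toy frame (`toy`, with made-up values) standing in for `df`, lists the columns that contain no values at all, and compares each column's max with its 75th percentile as a rough outlier flag.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "host_listings_count": [1, 1, 1, 1, 2, 1992],
    "bedrooms": [1, 1, 2, 1, 3, 50],
    "bathrooms": [np.nan] * 6,         # entirely empty column
    "calendar_updated": [np.nan] * 6,  # entirely empty column
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# For the remaining columns, compare the max with the 75th percentile
# and compute the sample skewness as a second indicator.
numeric = toy.drop(columns=empty_cols)
summary = pd.DataFrame({
    "q75": numeric.quantile(0.75),
    "max": numeric.max(),
    "skew": numeric.skew(),
})
summary["max_over_q75"] = summary["max"] / summary["q75"]

print(empty_cols)
print(summary)
```

A large `max_over_q75` ratio or a strongly positive `skew` points at the same columns flagged in the list above; the thresholds for "large" are a judgment call and are not fixed here.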

In [19]:
# Separate the categorical and numerical variables.
# URL columns are kept out of the categorical list and *_id columns out of the
# numerical list; both end up in `extra`. Note that the bare `id` column does
# not contain '_id', so it stays in num_cols.
cat_cols = []
num_cols = []
extra = []

for col in df.columns:
    if df[col].dtype == 'object' and 'url' not in col:
        cat_cols.append(col)
    elif df[col].dtype != 'object' and '_id' not in col:
        num_cols.append(col)
    else:
        extra.append(col)

In [22]:
# %pprint is a toggle and takes no argument; this call switches pretty printing off
%pprint
cat_cols

Pretty printing has been turned OFF


['last_scraped', 'name', 'description', 'neighborhood_overview', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'property_type', 'room_type', 'bathrooms_text', 'amenities', 'price', 'has_availability', 'calendar_last_scraped', 'first_review', 'last_review', 'license', 'instant_bookable']

In [23]:
num_cols

['id', 'host_listings_count', 'host_total_listings_count', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']

In [24]:
extra

['listing_url', 'scrape_id', 'picture_url', 'host_id', 'host_url', 'host_thumbnail_url', 'host_picture_url']

In [27]:
# Look at the unique values of each categorical column

for col in cat_cols:
    print(df[col].value_counts(normalize=True).round(decimals=4))
    print('\n')

2021-09-07    0.821
2021-09-08    0.179
Name: last_scraped, dtype: float64


Amsterdam                                             0.0018
Residences | 2-Bedrooms | Serviced Apartment          0.0008
Lovely apartment near Vondelpark                      0.0004
Spacious apartment near Vondelpark                    0.0004
Amsterdam Appartement                                 0.0003
                                                       ...  
A room with a canalview - Jordaan, Amsterdam          0.0001
Guesthouse Canal Pride, fijn logeren aan de gracht    0.0001
Unique studio in houseboat in the urban greenery      0.0001
Sun-Room-Shine                                        0.0001
Private guestfloor with roofterrace                   0.0001
Name: name, Length: 15766, dtype: float64


Hotel Jansen is a new Short Stay hotel in Amsterdam. We offer great and affordable accommodation for students, graduates, interns & young professionals from all over the world. Hotel Jansen is a place you can

Looking at the categorical variables:

- host_response_time: {within an hour, within a day, within a few hours, a few days or more}
- host_response_rate, host_acceptance_rate: should be a numerical data point 
- host_is_superhost: can be turned into a 1,0 column {13% are superhosts}
- host_neighbourhood: there are 68 different neighborhoods. Maybe there is a way to group some?
- host_verifications: a list of different ways to verify. We can maybe split it into columns of 1 and 0?
- host_has_profile_pic: 1,0 column {99.8% have a profile pic}
- host_identity_verified: 1,0 column {67% have been verified} 
- neighbourhood: seems redundant 
- property_type: 67 different values. Maybe group them somehow?
- room_type: {Entire home/apt, Private room, Hotel room, Shared room}
- bathrooms_text: should be a numerical value
- amenities: split them and group them into smaller categories?
- price: should be a numerical column 
- has_availability: 1,0 column {96% have availability}
- instant_bookable: 1,0 column {77% are not instantly bookable}
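
Several of the conversions listed above can be sketched as small parsing helpers. This is a minimal sketch under stated assumptions, not a finished cleaning step: the function names (`parse_price`, `parse_percent`, `parse_flag`, `parse_bathrooms`, `parse_list`) are hypothetical, the input formats are taken from the rows shown in `df.head()`, and the handling of number-free bathroom labels containing "half" is an assumption about values not shown in this notebook.

```python
import ast
import re


def parse_price(value):
    """Turn a price string such as '$1,250.00' into a float; non-strings give None."""
    if not isinstance(value, str):
        return None
    return float(value.replace("$", "").replace(",", ""))


def parse_percent(value):
    """Turn '86%' into 0.86; non-strings (e.g. NaN) give None."""
    if not isinstance(value, str):
        return None
    return float(value.rstrip("%")) / 100


def parse_flag(value):
    """Map 't' to 1 and 'f' to 0; anything else gives None."""
    return {"t": 1, "f": 0}.get(value)


def parse_bathrooms(value):
    """Pull the leading number out of text like '1.5 shared baths'."""
    if not isinstance(value, str):
        return None
    match = re.search(r"\d+(?:\.\d+)?", value)
    if match:
        return float(match.group())
    # Assumption: a label with no number but containing 'half' means 0.5.
    return 0.5 if "half" in value.lower() else None


def parse_list(value):
    """Parse a stringified list such as "['email', 'phone']" into a Python list."""
    if not isinstance(value, str):
        return []
    return ast.literal_eval(value)


print(parse_price("$59.00"))
print(parse_percent("86%"))
print(parse_flag("t"), parse_flag("f"))
print(parse_bathrooms("1.5 shared baths"))
print(parse_list("['email', 'phone', 'reviews']"))
```

On the real frame, each helper could be applied column by column with `Series.map`, for example `df['price'].map(parse_price)` or `df['host_is_superhost'].map(parse_flag)`. Keeping the parsers as plain functions makes them easy to test on a handful of strings before touching the full dataset.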