### A Look at the Data

To get a feel for the dataset, we start by importing the required libraries and reading in the data, then look at a few of its characteristics.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display
import os
import json

# show up to 200 characters of long text columns in DataFrame previews
pd.options.display.max_colwidth = 200
%matplotlib inline

In [2]:
# data directory
data_dir = "../data"
raw_data = os.path.join(data_dir, "raw")

In [3]:
# data description
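# use a context manager so the file handle is closed, and reuse raw_data for the path
with open(os.path.join(raw_data, "data_description.json")) as f:
    data_desc = pd.DataFrame([json.load(f)]).T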
data_desc

Unnamed: 0,0
detailed_listings.csv,Detailed Listings data for New York City
detailed_calendar.csv,Detailed Calendar Data for listings in New York City
detailed_reviews.csv,Detailed Review Data for listings in New York City
summary_listings.csv,Summary information and metrics for listings in New York City (good for visualisations).
summary_reviews.csv,Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).
neighbourhoods.csv,Neighbourhood list for geo filter. Sourced from city or open source GIS files.
neighbourhoods.geojson,GeoJSON file of neighbourhoods of the city.


In [4]:
df = pd.read_csv(os.path.join(raw_data, "detailed_listings.csv"), low_memory=False)

In [5]:
df.shape

(48864, 106)

In [6]:
# number of distinct values, counting NaN as one value (same result as len(unique()))
df.neighbourhood.nunique(dropna=False)

198

In [7]:
# number of distinct values, counting NaN as one value (same result as len(unique()))
df.neighbourhood_cleansed.nunique(dropna=False)

222

In [8]:
df.columns[df.columns.str.contains("neigh")]

Index(['neighborhood_overview', 'host_neighbourhood', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed'],
      dtype='object')

In [9]:
df.neighbourhood_group_cleansed.value_counts()

Manhattan        21456
Brooklyn         20114
Queens            5811
Bronx             1105
Staten Island      378
Name: neighbourhood_group_cleansed, dtype: int64

In [67]:
df.transit.isnull().sum()

16975
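
The 16,975 missing `transit` values are roughly 35% of the 48,864 listings. The same check could be run across every column at once. The following is a minimal sketch on a small made-up frame standing in for `df`; the helper name `missing_share` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of null values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Made-up stand-in for the listings frame (values are illustrative only).
sample = pd.DataFrame({
    "transit": ["Subway nearby", None, None, "Bus stop"],
    "room_type": ["Private room", "Shared room", "Private room", "Entire home/apt"],
    "beds": [1.0, np.nan, 2.0, 1.0],
})

shares = missing_share(sample)
print(shares)
```

On the real data, `missing_share(df).head(20)` would list the columns with the most gaps first, which may help decide which free-text fields are worth keeping.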

In [68]:
df.room_type.value_counts()

Entire home/apt    25296
Private room       22397
Shared room         1171
Name: room_type, dtype: int64

In [73]:
df.beds.value_counts()

1.0     31207
2.0     10408
3.0      3587
4.0      1524
0.0      1027
5.0       537
6.0       270
7.0        87
8.0        67
9.0        37
12.0       19
11.0       17
10.0       13
13.0        8
16.0        3
15.0        3
14.0        2
21.0        2
26.0        1
40.0        1
17.0        1
22.0        1
Name: beds, dtype: int64

In [74]:
df.property_type.value_counts()

Apartment                 38605
House                      3846
Townhouse                  1659
Condominium                1495
Loft                       1412
Serviced apartment          505
Guest suite                 363
Hotel                       227
Boutique hotel              190
Other                       118
Bed and breakfast            88
Resort                       72
Hostel                       62
Guesthouse                   56
Bungalow                     38
Villa                        28
Tiny house                   19
Aparthotel                   17
Boat                         13
Camper/RV                    10
Cottage                       7
Tent                          6
Earth house                   4
Houseboat                     3
Cabin                         3
Casa particular (Cuba)        2
Yurt                          2
Farm stay                     2
Bus                           2
Cave                          2
Barn                          2
Nature l

In [79]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48864 entries, 0 to 48863
Data columns (total 106 columns):
id                                              int64
listing_url                                     object
scrape_id                                       int64
last_scraped                                    object
name                                            object
summary                                         object
space                                           object
description                                     object
experiences_offered                             object
neighborhood_overview                           object
notes                                           object
transit                                         object
access                                          object
interaction                                     object
house_rules                                     object
thumbnail_url                                   float64
medium_url 

In [75]:
df_summary_listings = pd.read_csv(os.path.join(raw_data, "summary_listings.csv"))

In [77]:
df_summary_listings.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48864 entries, 0 to 48863
Data columns (total 16 columns):
id                                48864 non-null int64
name                              48848 non-null object
host_id                           48864 non-null int64
host_name                         48846 non-null object
neighbourhood_group               48864 non-null object
neighbourhood                     48864 non-null object
latitude                          48864 non-null float64
longitude                         48864 non-null float64
room_type                         48864 non-null object
price                             48864 non-null int64
minimum_nights                    48864 non-null int64
number_of_reviews                 48864 non-null int64
last_review                       38733 non-null object
reviews_per_month                 38733 non-null float64
calculated_host_listings_count    48864 non-null int64
availability_365                  48864 non-null int64

In [40]:
df_summary_listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,46,2019-07-14,0.39,2,288
1,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
2,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,274,2019-07-26,4.64,1,212
3,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
4,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,75,2019-07-21,0.6,1,127
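
`last_review` and `reviews_per_month` are both populated for 38,733 of the 48,864 rows, and the preview shows a listing with zero reviews where both fields are empty. The sketch below assumes the gaps simply mean "no reviews yet", which has not been verified on the full file, so it checks that assumption before filling anything. The inline CSV is a three-row excerpt of the preview above.

```python
import io

import pandas as pd

# Three rows taken from the preview above, reduced to the relevant columns.
csv_text = """id,number_of_reviews,last_review,reviews_per_month
2595,46,2019-07-14,0.39
3647,0,,
3831,274,2019-07-26,4.64
"""

# parse_dates turns last_review into datetime64; empty fields become NaT.
listings = pd.read_csv(io.StringIO(csv_text), parse_dates=["last_review"])

# Assumption to verify: both fields are missing exactly when there are no reviews.
no_reviews = listings["number_of_reviews"] == 0
assumption_holds = bool(
    (listings["last_review"].isnull() == no_reviews).all()
    and (listings["reviews_per_month"].isnull() == no_reviews).all()
)

# Only fill the review rate with 0 if the assumption holds on this data.
if assumption_holds:
    listings["reviews_per_month"] = listings["reviews_per_month"].fillna(0.0)

print(assumption_holds)
print(listings.dtypes)
```

`last_review` is deliberately left as `NaT` for listings without reviews, since there is no meaningful date to substitute.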


In [41]:
# the detailed listings were loaded into `df` above
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20190806030549,2019-08-07,Skylit Midtown Castle,"Find your romantic getaway to this beautiful, spacious skylit studio in the heart of Midtown, Manhattan. STUNNING SKYLIT STUDIO / 1 BED + SINGLE / FULL BATH / FULL KITCHEN / FIREPLACE / CENTRALLY...","- Spacious (500+ft²), immaculate and nicely furnished & designed studio. - Tuck yourself into the ultra comfortable bed under the skylight. Fall in love with a myriad of bright lights in the city ...","Find your romantic getaway to this beautiful, spacious skylit studio in the heart of Midtown, Manhattan. STUNNING SKYLIT STUDIO / 1 BED + SINGLE / FULL BATH / FULL KITCHEN / FIREPLACE / CENTRALLY...",none,"Centrally located in the heart of Manhattan just a few blocks from all subway connections in the very desirable Midtown location a few minutes walk to Times Square, the Theater District, Bryant Pa...",...,f,f,strict_14_with_grace_period,t,t,2,1,0,1,0.39
1,3647,https://www.airbnb.com/rooms/3647,20190806030549,2019-08-06,THE VILLAGE OF HARLEM....NEW YORK !,,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY This Spacious 1 bedroom is with Plenty of Windows with a View....... Sleeps.....Four Adults.....two in the Livingrm. with (2) Sofa-beds. (Website hid...,WELCOME TO OUR INTERNATIONAL URBAN COMMUNITY This Spacious 1 bedroom is with Plenty of Windows with a View....... Sleeps.....Four Adults.....two in the Livingrm. with (2) Sofa-beds. (Website hid...,none,,...,f,f,strict_14_with_grace_period,t,t,1,0,1,0,
2,3831,https://www.airbnb.com/rooms/3831,20190806030549,2019-08-06,Cozy Entire Floor of Brownstone,"Urban retreat: enjoy 500 s.f. floor in 1899 brownstone, with wood and ceramic flooring throughout (completed Aug. 2015 through Sept. 2015), roomy bdrm, & upgraded kitchen & bathroom (completed Oct...","Greetings! We own a double-duplex brownstone in Clinton Hill on Gates near Classon Avenue - (7 blocks to C train, 5 blocks to G train, minutes to all), in which we host on the entire top flo...","Urban retreat: enjoy 500 s.f. floor in 1899 brownstone, with wood and ceramic flooring throughout (completed Aug. 2015 through Sept. 2015), roomy bdrm, & upgraded kitchen & bathroom (completed Oct...",none,Just the right mix of urban center and local neighborhood; close to all but enough quiet for a calming walk.,...,f,f,moderate,f,f,1,1,0,0,4.64
3,5022,https://www.airbnb.com/rooms/5022,20190806030549,2019-08-06,Entire Apt: Spacious Studio/Loft by central park,,Loft apartment with high ceiling and wood flooring located 10 minutes away from Central Park in Harlem - 1 block away from 6 train and 3 blocks from 2 & 3 line. This is in a recently renovated bui...,Loft apartment with high ceiling and wood flooring located 10 minutes away from Central Park in Harlem - 1 block away from 6 train and 3 blocks from 2 & 3 line. This is in a recently renovated bui...,none,,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.1
4,5099,https://www.airbnb.com/rooms/5099,20190806030549,2019-08-06,Large Cozy 1 BR Apartment In Midtown East,"My large 1 bedroom apartment is true New York City living. The apt is in midtown on the east side and centrally located, just a 10-minute walk from Grand Central Station, Empire State Building, T...","I have a large 1 bedroom apartment centrally located in Midtown East. A 10 minute walk from Grand Central Station, Times Square, Empire State Building and all major subway and bus lines. The apar...","My large 1 bedroom apartment is true New York City living. The apt is in midtown on the east side and centrally located, just a 10-minute walk from Grand Central Station, Empire State Building, T...",none,My neighborhood in Midtown East is called Murray Hill. The area is very centrally located with easy access to explore . The apartment is about 5 blocks (7 minute walk) to the United Nations and Gr...,...,f,f,strict_14_with_grace_period,t,t,1,1,0,0,0.6


## Explore df_summary_listings

In [81]:
df_summary_listings.shape

(48864, 16)

In [82]:
df_summary_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48864 entries, 0 to 48863
Data columns (total 16 columns):
id                                48864 non-null int64
name                              48848 non-null object
host_id                           48864 non-null int64
host_name                         48846 non-null object
neighbourhood_group               48864 non-null object
neighbourhood                     48864 non-null object
latitude                          48864 non-null float64
longitude                         48864 non-null float64
room_type                         48864 non-null object
price                             48864 non-null int64
minimum_nights                    48864 non-null int64
number_of_reviews                 48864 non-null int64
last_review                       38733 non-null object
reviews_per_month                 38733 non-null float64
calculated_host_listings_count    48864 non-null int64
availability_365                  48864 non-null int64

In [86]:
cols = ["id", "latitude", "longitude", "neighbourhood_group", "availability_365", "price"]

In [87]:
df_summary_listings[cols].head()

Unnamed: 0,id,latitude,longitude,neighbourhood_group,availability_365,price
0,2595,40.75362,-73.98377,Manhattan,288,225
1,3647,40.80902,-73.9419,Manhattan,365,150
2,3831,40.68514,-73.95976,Brooklyn,212,89
3,5022,40.79851,-73.94399,Manhattan,0,80
4,5099,40.74767,-73.975,Manhattan,127,200


In [91]:
import geopandas

ImportError: dlopen(/Users/Shravan/anaconda3/envs/xgboost/lib/python3.6/site-packages/fiona/ogrext.cpython-36m-darwin.so, 2): Library not loaded: @rpath/libkea.1.4.7.dylib
  Referenced from: /Users/Shravan/anaconda3/envs/xgboost/lib/libgdal.20.dylib
  Reason: image not found
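
The traceback points at the environment rather than the notebook code: `libgdal` references a `libkea` shared library that cannot be found. Until the environment is repaired, one option is to guard the import so that later cells still run. This is a minimal sketch; `optional_import` and `HAS_GEOPANDAS` are hypothetical names introduced here for illustration.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both a missing package and a failed shared-library load
        # surfaced as ImportError, as in the traceback above.
        print(f"{module_name} unavailable: {exc}")
        return None


geopandas = optional_import("geopandas")
HAS_GEOPANDAS = geopandas is not None
```

Cells that need geospatial plotting can then check `HAS_GEOPANDAS` first and skip, or fall back to plain latitude and longitude columns, when it is `False`.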

In [42]:
# load only the first 100 rows as a preview of the calendar file
df_calendar = pd.read_csv(os.path.join(raw_data, "detailed_calendar.csv"), nrows=100)

In [43]:
df_calendar.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
listing_id        100 non-null int64
date              100 non-null object
available         100 non-null object
price             100 non-null object
adjusted_price    100 non-null object
minimum_nights    100 non-null int64
maximum_nights    100 non-null int64
dtypes: int64(3), object(4)
memory usage: 5.5+ KB


In [46]:
df_calendar.head(10)

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,38638,2019-08-06,f,$219.00,$219.00,4,200
1,16595,2019-08-06,t,$275.00,$275.00,1,1124
2,16595,2019-08-07,f,$225.00,$225.00,1,1124
3,16595,2019-08-08,t,$225.00,$225.00,1,1124
4,16595,2019-08-09,f,$225.00,$225.00,1,1124
5,16595,2019-08-10,f,$225.00,$225.00,1,1124
6,16595,2019-08-11,f,$225.00,$225.00,1,1124
7,16595,2019-08-12,f,$225.00,$225.00,1,1124
8,16595,2019-08-13,f,$225.00,$225.00,1,1124
9,16595,2019-08-14,f,$225.00,$225.00,1,1124
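
In this preview `date`, `available`, `price` and `adjusted_price` are all read as `object`: the dates are strings, availability is encoded as `t`/`f`, and prices carry a `$` prefix. A minimal sketch of converting them to proper types follows, using three rows copied from the preview. The helper name `parse_price` is hypothetical, and the handling of thousands separators is an assumption, since no such value appears in the rows shown.

```python
import pandas as pd

# Three rows copied from the calendar preview above.
calendar = pd.DataFrame({
    "listing_id": [38638, 16595, 16595],
    "date": ["2019-08-06", "2019-08-06", "2019-08-07"],
    "available": ["f", "t", "f"],
    "price": ["$219.00", "$275.00", "$225.00"],
    "adjusted_price": ["$219.00", "$275.00", "$225.00"],
})


def parse_price(col: pd.Series) -> pd.Series:
    """Strip '$' and ',' characters, then convert to float.

    Assumption: large prices may be written like '$1,250.00'.
    """
    return col.str.replace(r"[$,]", "", regex=True).astype(float)


calendar["date"] = pd.to_datetime(calendar["date"])
calendar["available"] = calendar["available"].map({"t": True, "f": False})
for name in ["price", "adjusted_price"]:
    calendar[name] = parse_price(calendar[name])

print(calendar.dtypes)
```

With these types in place, date-based filtering and numeric aggregation of prices work directly on the columns.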


In [54]:
df_neighbourhood = pd.read_csv(os.path.join(raw_data, "neighbourhoods.csv"))

In [55]:
df_neighbourhood.shape

(230, 2)

In [56]:
df_neighbourhood.head()

Unnamed: 0,neighbourhood_group,neighbourhood
0,Bronx,Allerton
1,Bronx,Baychester
2,Bronx,Belmont
3,Bronx,Bronxdale
4,Bronx,Castle Hill
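
The lookup file has 230 rows, while the listings contain 222 distinct `neighbourhood_cleansed` values, so some neighbourhoods may have no listings at all. The sketch below shows one way to find them with a left merge. It uses made-up toy frames and assumes that `neighbourhood_cleansed` uses the same spelling as `neighbourhoods.csv`, which has not been checked here.

```python
import pandas as pd

# Toy stand-ins for df_neighbourhood and the detailed listings frame.
neighbourhoods = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx"],
    "neighbourhood": ["Allerton", "Baychester", "Belmont"],
})
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_cleansed": ["Allerton", "Belmont", "Allerton"],
})

# Left merge against the distinct neighbourhoods seen in the listings;
# indicator=True adds a `_merge` column marking rows with no match.
merged = neighbourhoods.merge(
    listings[["neighbourhood_cleansed"]].drop_duplicates(),
    how="left",
    left_on="neighbourhood",
    right_on="neighbourhood_cleansed",
    indicator=True,
)
unlisted = merged.loc[merged["_merge"] == "left_only", "neighbourhood"].tolist()
print(unlisted)
```

If the same neighbourhood name appears under more than one `neighbourhood_group`, the merge keys would need to include the group as well.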
