# Classification analysis of Airbnb listings data

The API for the Airbnb listings open dataset used in this classification analysis is available here: https://public.opendatasoft.com/explore/dataset/airbnb-reviews/api/. The goal of the notebook is a supervised machine learning task: classifying listings, based on their features, according to their review score ratings. We will use Amazon SageMaker hosting and software to this end, so we begin with the necessary imports...

In [2]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import pandas as pd

sess = sagemaker.Session()

role = get_execution_role()
print(role)

bucket = sess.default_bucket() 
print(bucket)

arn:aws:iam::232666250507:role/service-role/AmazonSageMaker-ExecutionRole-20200611T080886
sagemaker-eu-west-2-232666250507


Let's now download and unzip the listings open dataset from http://insideairbnb.com and inspect it...

In [3]:
!wget http://data.insideairbnb.com/united-kingdom/england/london/2020-04-14/data/listings.csv.gz
!gunzip listings.csv.gz

--2020-06-11 15:23:53--  http://data.insideairbnb.com/united-kingdom/england/london/2020-04-14/data/listings.csv.gz
Resolving data.insideairbnb.com (data.insideairbnb.com)... 52.216.164.178
Connecting to data.insideairbnb.com (data.insideairbnb.com)|52.216.164.178|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78560292 (75M) [application/x-gzip]
Saving to: ‘listings.csv.gz’


2020-06-11 15:24:01 (11.0 MB/s) - ‘listings.csv.gz’ saved [78560292/78560292]
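As an aside, pandas can read a gzip-compressed CSV directly, so the `gunzip` step is optional. The following is a minimal sketch, assuming Python 3 and a tiny stand-in file written to a temporary directory; the two-row contents and the names `csv_text` and `toy_listings` are hypothetical and are not part of the real dataset.

```python
import gzip
import os
import tempfile

import pandas as pd

# Toy stand-in for listings.csv.gz (hypothetical contents, two columns only).
csv_text = "id,price\n1,$65.00\n2,$100.00\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "listings.csv.gz")

    # Write the toy CSV as a gzip-compressed text file.
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)

    # compression="infer" is the pandas default; gzip is picked from the ".gz" suffix.
    toy_listings = pd.read_csv(gz_path, compression="infer")

print(toy_listings.shape)
```

With the real download, the same call would be `pd.read_csv('listings.csv.gz')`, which leaves the compressed file on disk and skips the separate unzip step.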



## Reading, cleaning and encoding the data

In [64]:
listings_dataf = pd.read_csv('listings.csv') 
listings_dataf.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,13913,https://www.airbnb.com/rooms/13913,20200414180850,2020-04-16,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,"Hello Everyone, I'm offering my lovely double ...",My bright double bedroom with a large window h...,business,Finsbury Park is a friendly melting pot commun...,...,f,f,moderate,f,f,2,1,1,0,0.18
1,15400,https://www.airbnb.com/rooms/15400,20200414180850,2020-04-16,Bright Chelsea Apartment. Chelsea!,Lots of windows and light. St Luke's Gardens ...,Bright Chelsea Apartment This is a bright one...,Lots of windows and light. St Luke's Gardens ...,romantic,It is Chelsea.,...,t,f,strict_14_with_grace_period,t,t,1,1,0,0,0.71
2,17402,https://www.airbnb.com/rooms/17402,20200414180850,2020-04-15,Superb 3-Bed/2 Bath & Wifi: Trendy W1,You'll have a wonderful stay in this superb mo...,"This is a wonderful very popular beautiful, sp...",You'll have a wonderful stay in this superb mo...,none,"Location, location, location! You won't find b...",...,t,f,strict_14_with_grace_period,f,f,15,15,0,0,0.38
3,17506,https://www.airbnb.com/rooms/17506,20200414180850,2020-04-16,Boutique Chelsea/Fulham Double bed 5-star ensuite,Enjoy a chic stay in this elegant but fully mo...,Enjoy a boutique London townhouse bed and brea...,Enjoy a chic stay in this elegant but fully mo...,business,Fulham is 'villagey' and residential – a real ...,...,f,f,strict_14_with_grace_period,f,f,2,0,2,0,
4,25023,https://www.airbnb.com/rooms/25023,20200414180850,2020-04-15,All-comforts 2-bed flat near Wimbledon tennis,"Large, all comforts, 2-bed flat; first floor; ...",10 mins walk to Southfields tube and Wimbledon...,"Large, all comforts, 2-bed flat; first floor; ...",none,This is a leafy residential area with excellen...,...,t,f,moderate,f,f,1,1,0,0,0.7


In [65]:
for n, c in enumerate(listings_dataf.columns): print(n, c)

(0, 'id')
(1, 'listing_url')
(2, 'scrape_id')
(3, 'last_scraped')
(4, 'name')
(5, 'summary')
(6, 'space')
(7, 'description')
(8, 'experiences_offered')
(9, 'neighborhood_overview')
(10, 'notes')
(11, 'transit')
(12, 'access')
(13, 'interaction')
(14, 'house_rules')
(15, 'thumbnail_url')
(16, 'medium_url')
(17, 'picture_url')
(18, 'xl_picture_url')
(19, 'host_id')
(20, 'host_url')
(21, 'host_name')
(22, 'host_since')
(23, 'host_location')
(24, 'host_about')
(25, 'host_response_time')
(26, 'host_response_rate')
(27, 'host_acceptance_rate')
(28, 'host_is_superhost')
(29, 'host_thumbnail_url')
(30, 'host_picture_url')
(31, 'host_neighbourhood')
(32, 'host_listings_count')
(33, 'host_total_listings_count')
(34, 'host_verifications')
(35, 'host_has_profile_pic')
(36, 'host_identity_verified')
(37, 'street')
(38, 'neighbourhood')
(39, 'neighbourhood_cleansed')
(40, 'neighbourhood_group_cleansed')
(41, 'city')
(42, 'state')
(43, 'zipcode')
(44, 'market')
(45, 'smart_location')
(46, 'country_co

Let's look at a subset of these columns for the primary analysis...

In [66]:
listings_dataf = listings_dataf[[
                    'host_is_superhost',
                    'host_total_listings_count',
                    'host_identity_verified',
                    'neighbourhood_cleansed',
                    'is_location_exact',
                    'property_type',
                    'room_type',
                    'accommodates',
                    'bathrooms',
                    'bedrooms',
                    'bed_type',
                    'amenities',
                    'price',
                    'guests_included',
                    'minimum_nights',
                    'maximum_nights',
                    'number_of_reviews',
                    'requires_license',
                    'instant_bookable',
                    'cancellation_policy',
                    'review_scores_rating']]
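Selecting columns by name fails with a fairly terse `KeyError` if a listed column is absent, which could happen if a different data snapshot uses a different schema. The sketch below shows one possible guard, assuming a hypothetical helper named `select_columns` and a toy frame; neither is part of the original notebook.

```python
import pandas as pd


def select_columns(frame, wanted):
    """Return frame[wanted], raising a clear error if any column is absent."""
    missing = [name for name in wanted if name not in frame.columns]
    if missing:
        raise KeyError("columns not found: {}".format(missing))
    return frame[wanted]


# Toy frame standing in for the full listings table (hypothetical values).
toy = pd.DataFrame({
    "listing_url": ["x"],
    "price": ["$10.00"],
    "bedrooms": [1.0],
    "review_scores_rating": [95.0],
})

subset = select_columns(toy, ["price", "bedrooms", "review_scores_rating"])
print(list(subset.columns))
```

Calling `select_columns(toy, ['bed_type'])` on this toy frame raises a `KeyError` that names the missing column.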

In [67]:
listings_dataf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86358 entries, 0 to 86357
Data columns (total 21 columns):
host_is_superhost            86348 non-null object
host_total_listings_count    86348 non-null float64
host_identity_verified       86348 non-null object
neighbourhood_cleansed       86358 non-null object
is_location_exact            86358 non-null object
property_type                86358 non-null object
room_type                    86358 non-null object
accommodates                 86358 non-null int64
bathrooms                    86226 non-null float64
bedrooms                     86216 non-null float64
bed_type                     86355 non-null object
amenities                    86358 non-null object
price                        86358 non-null object
guests_included              86358 non-null int64
minimum_nights               86358 non-null int64
maximum_nights               86358 non-null int64
number_of_reviews            86358 non-null int64
requires_license          

Some cleaning is needed: a few columns must be converted from strings to floats, and we can only train on listings whose review scores actually exist!

In [82]:
# Reduce to all listings with review scores
listings_dataf = listings_dataf[listings_dataf['review_scores_rating'].notnull()]

# Reduce further (only a few more) to all listings with a 'bathrooms' record
listings_dataf = listings_dataf[listings_dataf['bathrooms'].notnull()]

# Reduce further (only a few more) to all listings with a 'bedrooms' record
listings_dataf = listings_dataf[listings_dataf['bedrooms'].notnull()]

# Reduce further (only 2 more) to all listings with a 'host_is_superhost' record
listings_dataf = listings_dataf[listings_dataf['host_is_superhost'].notnull()]

# Clean price data by removing dollar signs and commas
listings_dataf['price'] = listings_dataf['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
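The cleaning steps above can be checked on a small example. This is a minimal sketch, assuming a toy frame with made-up values and a hypothetical `required` list; it uses `dropna(subset=...)` as a compact equivalent of the chained `notnull()` filters, and passes `regex=False` so that `$` is treated as a literal character rather than a regular-expression anchor.

```python
import pandas as pd

# Toy frame with made-up values; only the first row is complete.
toy = pd.DataFrame({
    "price": ["$1,250.00", "$65.00", "$80.00"],
    "bathrooms": [1.0, None, 2.0],
    "bedrooms": [2.0, 1.0, 1.0],
    "host_is_superhost": ["t", "f", "f"],
    "review_scores_rating": [97.0, 90.0, None],
})

# Keep only rows where every required column is present.
required = ["review_scores_rating", "bathrooms", "bedrooms", "host_is_superhost"]
cleaned = toy.dropna(subset=required).copy()

# Strip the dollar sign and thousands separator, then convert to float.
cleaned["price"] = (
    cleaned["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(cleaned)
```

The `.copy()` call makes `cleaned` an independent frame, so assigning the new `price` column does not trigger a `SettingWithCopyWarning`.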

In [83]:
listings_dataf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63117 entries, 0 to 86106
Data columns (total 21 columns):
host_is_superhost            63117 non-null object
host_total_listings_count    63117 non-null float64
host_identity_verified       63117 non-null object
neighbourhood_cleansed       63117 non-null object
is_location_exact            63117 non-null object
property_type                63117 non-null object
room_type                    63117 non-null object
accommodates                 63117 non-null int64
bathrooms                    63117 non-null float64
bedrooms                     63117 non-null float64
bed_type                     63117 non-null object
amenities                    63117 non-null object
price                        63117 non-null float64
guests_included              63117 non-null int64
minimum_nights               63117 non-null int64
maximum_nights               63117 non-null int64
number_of_reviews            63117 non-null int64
requires_license         