# Price recommender with Airbnb London listings data

The API for the Airbnb listings open dataset used in this regression analysis is available here: https://public.opendatasoft.com/explore/dataset/airbnb-reviews/api/. The goal of this notebook is a supervised machine learning task: use listing features, including price, to predict each listing's review score rating, as the basis for a price recommendation engine for Airbnb hosts. We use Amazon SageMaker throughout, so we begin with the necessary imports...

In [1]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = get_execution_role()
region = boto3.Session().region_name

bucket = sagemaker.Session().default_bucket()
prefix = 'airbnb-recommender-data/'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

# Instantiate the sagemaker sklearn processor
sklearn_proc = SKLearnProcessor(framework_version='0.20.0',
                                role=role,
                                instance_type='ml.m5.xlarge',
                                instance_count=1)

Let's now download and unzip the listings open dataset from http://insideairbnb.com and inspect it...

In [22]:
!wget http://data.insideairbnb.com/united-kingdom/england/london/2020-04-14/data/listings.csv.gz
!gunzip listings.csv.gz

--2020-06-12 05:30:31--  http://data.insideairbnb.com/united-kingdom/england/london/2020-04-14/data/listings.csv.gz
Resolving data.insideairbnb.com (data.insideairbnb.com)... 52.216.144.210
Connecting to data.insideairbnb.com (data.insideairbnb.com)|52.216.144.210|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78560292 (75M) [application/x-gzip]
Saving to: ‘listings.csv.gz’


2020-06-12 05:30:39 (9.90 MB/s) - ‘listings.csv.gz’ saved [78560292/78560292]
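As an aside, pandas can read a gzipped CSV directly, which is handy for a quick local look before unzipping. The sketch below uses a tiny stand-in file with made-up rows (`toy_listings.csv.gz` is a hypothetical name, not the real download). The processing job later in this notebook reads the uncompressed `listings.csv`, so the `gunzip` step is still needed for it.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzipped CSV as a stand-in for listings.csv.gz (toy rows, not real data)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'toy_listings.csv.gz')
with gzip.open(gz_path, 'wt') as f:
    f.write('id,price\n13913,$65.00\n15400,$100.00\n')

# pandas infers gzip compression from the '.gz' file extension
toy_dataf = pd.read_csv(gz_path)
print(toy_dataf.shape)
```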



## Preprocessing the data

Let's inspect the data before preprocessing it...

In [8]:
listings_dataf = pd.read_csv('listings.csv') 
listings_dataf.head()


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,13913,https://www.airbnb.com/rooms/13913,20200414180850,2020-04-16,Holiday London DB Room Let-on going,My bright double bedroom with a large window h...,"Hello Everyone, I'm offering my lovely double ...",My bright double bedroom with a large window h...,business,Finsbury Park is a friendly melting pot commun...,...,f,f,moderate,f,f,2,1,1,0,0.18
1,15400,https://www.airbnb.com/rooms/15400,20200414180850,2020-04-16,Bright Chelsea Apartment. Chelsea!,Lots of windows and light. St Luke's Gardens ...,Bright Chelsea Apartment This is a bright one...,Lots of windows and light. St Luke's Gardens ...,romantic,It is Chelsea.,...,t,f,strict_14_with_grace_period,t,t,1,1,0,0,0.71
2,17402,https://www.airbnb.com/rooms/17402,20200414180850,2020-04-15,Superb 3-Bed/2 Bath & Wifi: Trendy W1,You'll have a wonderful stay in this superb mo...,"This is a wonderful very popular beautiful, sp...",You'll have a wonderful stay in this superb mo...,none,"Location, location, location! You won't find b...",...,t,f,strict_14_with_grace_period,f,f,15,15,0,0,0.38
3,17506,https://www.airbnb.com/rooms/17506,20200414180850,2020-04-16,Boutique Chelsea/Fulham Double bed 5-star ensuite,Enjoy a chic stay in this elegant but fully mo...,Enjoy a boutique London townhouse bed and brea...,Enjoy a chic stay in this elegant but fully mo...,business,Fulham is 'villagey' and residential – a real ...,...,f,f,strict_14_with_grace_period,f,f,2,0,2,0,
4,25023,https://www.airbnb.com/rooms/25023,20200414180850,2020-04-15,All-comforts 2-bed flat near Wimbledon tennis,"Large, all comforts, 2-bed flat; first floor; ...",10 mins walk to Southfields tube and Wimbledon...,"Large, all comforts, 2-bed flat; first floor; ...",none,This is a leafy residential area with excellen...,...,t,f,moderate,f,f,1,1,0,0,0.7


In [9]:
for n, (c, d) in enumerate(zip(listings_dataf.columns, listings_dataf.dtypes)):
    print(n, c, d)

0 id int64
1 listing_url object
2 scrape_id int64
3 last_scraped object
4 name object
5 summary object
6 space object
7 description object
8 experiences_offered object
9 neighborhood_overview object
10 notes object
11 transit object
12 access object
13 interaction object
14 house_rules object
15 thumbnail_url float64
16 medium_url float64
17 picture_url object
18 xl_picture_url float64
19 host_id int64
20 host_url object
21 host_name object
22 host_since object
23 host_location object
24 host_about object
25 host_response_time object
26 host_response_rate object
27 host_acceptance_rate object
28 host_is_superhost object
29 host_thumbnail_url object
30 host_picture_url object
31 host_neighbourhood object
32 host_listings_count float64
33 host_total_listings_count float64
34 host_verifications object
35 host_has_profile_pic object
36 host_identity_verified object
37 street object
38 neighbourhood object
39 neighbourhood_cleansed object
40 neighbourhood_group_cleansed float64
41 city obje
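Alongside the dtypes, it can help to count missing values in the columns we intend to keep, since several cleaning steps in the preprocessing script depend on them. The following is a minimal sketch on a toy frame; `summarize_columns` is a hypothetical helper and the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

def summarize_columns(dataf, columns):
    """Return the dtype and missing-value count for each requested column."""
    subset = dataf[columns]
    return pd.DataFrame({
        'dtype': subset.dtypes.astype(str),
        'n_missing': subset.isnull().sum(),
    })

# Toy rows standing in for the real listings data
toy = pd.DataFrame({
    'price': ['$65.00', '$1,200.00', None],
    'bedrooms': [1.0, np.nan, 2.0],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
})

summary = summarize_columns(toy, ['price', 'bedrooms'])
print(summary)
```

On the real data the same call could be made with `listings_dataf` and the list of columns of interest.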

Now to write the data preprocessing script...

In [13]:
%%writefile airbnb-recommender-preprocessing.py
import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

columns_of_interest = ['review_scores_rating','price', 'minimum_nights',
                       'maximum_nights', 'number_of_reviews', 'accommodates', 'guests_included',
                       'bathrooms', 'bedrooms', 'host_total_listings_count', 'host_is_superhost',
                       'host_identity_verified', 'neighbourhood_cleansed',
                       'is_location_exact', 'property_type', 'room_type', 'bed_type',
                       'requires_license', 'instant_bookable', 'cancellation_policy']

# Defined without the label: 'review_scores_rating'
numeric_column_names = ['price', 'minimum_nights',
                        'maximum_nights', 'number_of_reviews', 'accommodates', 'guests_included',
                        'bathrooms', 'bedrooms', 'host_total_listings_count']

categorical_column_names = ['host_is_superhost','host_identity_verified', 'neighbourhood_cleansed',
                            'is_location_exact', 'property_type', 'room_type', 'bed_type',
                            'requires_license', 'instant_bookable', 'cancellation_policy']

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    args, _ = parser.parse_known_args()
    
    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'listings.csv')
    
    print('Reading input data from {}'.format(input_data_path))
    df = pd.read_csv(input_data_path, error_bad_lines=False)
    df = pd.DataFrame(data=df, columns=columns_of_interest)

    print('Data shape: {}'.format(df.shape))
    
    # Reduce to all listings with review scores
    df = df[df['review_scores_rating'].notnull()]

    # Reduce further (only a few more) to all listings with a 'bathrooms' record
    df['bathrooms'] = pd.to_numeric(df['bathrooms'],errors='coerce')
    df = df[df['bathrooms'].notnull()]

    # Reduce further (only a few more) to all listings with a 'bedrooms' record
    df['bedrooms'] = pd.to_numeric(df['bedrooms'],errors='coerce')
    df = df[df['bedrooms'].notnull()]

    # Reduce further (only 2 more) to all listings with a 'host_is_superhost' record
    df = df[df['host_is_superhost'].notnull()]

    # Clean price data by removing dollar signs and commas
    df['price'] = df['price'].str.replace('$','').str.replace(',','').astype(float)

    # Convert 'host_total_listings_count' from float to int
    df['host_total_listings_count'] = df['host_total_listings_count'].astype(int)
    
    # Remove property types that are implausible for London listings
    df = df[df['property_type']!='Minsu (Taiwan)']
    df = df[df['property_type']!='Island']
    
    print('Data shape post-cleaning: {}'.format(df.shape))
    
    split_ratio = args.train_test_split_ratio
    print('Splitting data into train and test sets with ratio {}'.format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop('review_scores_rating', axis=1), df['review_scores_rating'],
        test_size=split_ratio, random_state=0)

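    # Preprocessing all of the columns at once. Note that scikit-learn 0.20.0 expects
    # (columns, transformer) tuples here; later releases expect (transformer, columns).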
    preprocess = make_column_transformer(
        (numeric_column_names, StandardScaler()),
        (categorical_column_names, OneHotEncoder(handle_unknown='error', sparse=False))
    )
    
    print('Running preprocessing and feature engineering transformations')
    
    train_features = preprocess.fit_transform(X_train)
    test_features = preprocess.transform(X_test)
    
    print('Train data shape after preprocessing: {}'.format(train_features.shape))
    print('Test data shape after preprocessing: {}'.format(test_features.shape))
    
    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_features.csv')
    train_labels_output_path = os.path.join('/opt/ml/processing/train', 'train_labels.csv')
    
    test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_output_path = os.path.join('/opt/ml/processing/test', 'test_labels.csv')
    
    print('Saving training features to {}'.format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)
    
    print('Saving test features to {}'.format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)
    
    print('Saving training labels to {}'.format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)
    
    print('Saving test labels to {}'.format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)

Overwriting airbnb-recommender-preprocessing.py
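To make the cleaning steps concrete, here is a small sketch of the row filtering and price parsing that the script's comments describe, run on an invented four-row frame. It is an illustration under stated assumptions, not the job's exact code: the `regex=False` argument is an addition here (it assumes a pandas version that supports it) so that `$` is always treated as a literal character.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real listings data
toy = pd.DataFrame({
    'review_scores_rating': [95.0, np.nan, 80.0, 88.0],
    'price': ['$65.00', '$100.00', '$1,200.00', '$45.00'],
    'bedrooms': ['1', '2', None, '3'],
})

# Keep only listings that have a review score (the label)
toy = toy[toy['review_scores_rating'].notnull()].copy()

# Coerce 'bedrooms' to numeric, then drop rows with no recorded value
toy['bedrooms'] = pd.to_numeric(toy['bedrooms'], errors='coerce')
toy = toy[toy['bedrooms'].notnull()].copy()

# Strip '$' and ',' as literal characters before converting to float
toy['price'] = (toy['price']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False)
                .astype(float))

print(toy)
```

Two rows survive: the second is dropped for its missing review score and the third for its missing bedroom count.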


Let's now run this processing script on the data with a train-test split of 80-20...

In [14]:
sklearn_proc.run(code='airbnb-recommender-preprocessing.py',
                inputs=[ProcessingInput(source='listings.csv',
                                        destination='/opt/ml/processing/input')],
                outputs=[ProcessingOutput(output_name='train_data',
                                          source='/opt/ml/processing/train'),
                         ProcessingOutput(output_name='test_data',
                                          source='/opt/ml/processing/test')],
                arguments=['--train-test-split-ratio', '0.2'])

preprocessing_job_description = sklearn_proc.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']


Job Name:  sagemaker-scikit-learn-2020-06-12-12-09-57-332
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-232666250507/sagemaker-scikit-learn-2020-06-12-12-09-57-332/input/input-1/listings.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-232666250507/sagemaker-scikit-learn-2020-06-12-12-09-57-332/input/code/airbnb-recommender-preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-2-232666250507/sagemaker-scikit-learn-2020-06-12-12-09-57-332/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': '
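The loop that picks out the train and test locations can also be written as a small lookup. The sketch below assumes only the dictionary layout used in the cell above (`ProcessingOutputConfig` → `Outputs` → `OutputName` / `S3Output` / `S3Uri`); `output_uris` is a hypothetical helper and the bucket names are placeholders.

```python
def output_uris(job_description):
    """Map each processing output name to its S3 URI."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {o['OutputName']: o['S3Output']['S3Uri'] for o in outputs}

# Hypothetical description with placeholder URIs, shaped like the describe() result
fake_description = {
    'ProcessingOutputConfig': {
        'Outputs': [
            {'OutputName': 'train_data',
             'S3Output': {'S3Uri': 's3://example-bucket/job/output/train_data'}},
            {'OutputName': 'test_data',
             'S3Output': {'S3Uri': 's3://example-bucket/job/output/test_data'}},
        ]
    }
}

uris = output_uris(fake_description)
print(uris['train_data'])
```

With the real job, `output_uris(preprocessing_job_description)` would give the same values as `preprocessed_training_data` and `preprocessed_test_data`.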

In [16]:
# The features CSV was written without a header row, so read it with header=None
training_features = pd.read_csv(preprocessed_training_data + '/train_features.csv', nrows=10, header=None)
print('Training features shape: {}'.format(training_features.shape))

Training features shape: (10, 102)
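The width of 102 presumably comes from the nine scaled numeric columns plus one column per level of each one-hot encoded categorical column, assuming the transformer drops any column it was not given (the scikit-learn default). The toy sketch below shows the same arithmetic on a smaller scale. It uses the `ColumnTransformer` class directly rather than the script's `make_column_transformer` call, and `sparse_threshold=0` is an addition here to force a dense result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows: two numeric columns and one categorical column with three levels
toy_train = pd.DataFrame({
    'price': [65.0, 100.0, 300.0, 45.0],
    'accommodates': [2, 2, 6, 1],
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt', 'Shared room'],
})

preprocess = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), ['price', 'accommodates']),
        ('categorical', OneHotEncoder(handle_unknown='error'), ['room_type']),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy_train)
print(features.shape)  # 2 numeric columns + 3 one-hot columns
```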


## Training the model