# Modelling Boston Airbnb prices

Maybe you're fortunate enough to have a place you could list on Airbnb, but you're not sure if it's worth the hassle. With Airbnb host listing data publicly available, we can break down the prices to understand which factors influence how much you can charge. We'll go through that here, providing an easy guide to get a rough idea of how much to charge if you have a place in Boston you can rent out.

Note: Airbnb also gives pricing tips based on much more sophisticated models than I can accomplish here ([more details](https://www.vrmintel.com/inside-airbnbs-algorithm/)). Use this kernel though for a very rough guide to get you started.

Alternatively, the findings from this kernel apply to those looking to understand which filters to play around with when trying to find a cheap (or expensive?) place, other than using the price filter on the search page ¯\_(ツ)_/¯.

# What data are we working with?

We're given three datasets:

- calendar.csv - listings with dates detailing whether they're available or not and how much they cost to rent on each date.
- listings.csv - lots of information about each listing including text descriptions, host details, number of bedrooms, bathrooms, location and more.
- reviews.csv - full text reviews for those listings that have been reviewed

We will mostly focus on `listings.csv` and pull date-sensitive prices from `calendar.csv`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Show maximum 500 columns when displaying dataframes
pd.set_option('display.max_columns', 500)

# Use Seaborn standard design palette for plots
sns.set()

calendar = pd.read_csv('../input/boston/calendar.csv')
listings = pd.read_csv('../input/boston/listings.csv')

for dataset in [calendar, listings]:
    display(dataset.sample(3))

## Listings data clean up

In [None]:
print(f'The dataset contains {listings.id.nunique()} listings from {listings.host_id.nunique()} hosts.')

In [None]:
import missingno as msno

msno.matrix(listings);

From the missingno matrix we see that we have a few features with only null values and some with mostly null values. We'll drop these along with any columns containing only one unique value and columns containing IDs or URL links.

In [None]:
# Note number of columns
before = len(listings.columns)

# Drop columns with all null values
all_null = len(listings.columns[listings.isnull().mean() == 1])
listings = listings.dropna(how='all', axis=1)

# Drop columns with more than 50% null values
more_than_50_null = listings.columns[listings.isnull().mean() > 0.5]
listings = listings.drop(more_than_50_null, axis=1)

# Drop columns with only one value
one_value_columns = [
    column for column in listings.columns if len(listings[column].unique()) == 1
]
listings.drop(one_value_columns, axis=1, inplace=True)

# Drop url, ID (except for 'id') and name columns
url_id_columns = listings.columns[listings.columns.str.contains('url|_id|name')]
listings = listings.drop(url_id_columns, axis=1)

print(
    '{} columns dropped:\
    \n\t{} columns with only null values\
    \n\t{} columns with more than 50% null values\
    \n\t{} columns with only one unique value\
    \n\t{} URL/ID/name columns'.format(
        before - len(listings.columns), all_null, len(more_than_50_null),
        len(one_value_columns), len(url_id_columns)
    )
)

Now we will drop a bunch of features for various reasons:
- they're covered by other features
- they are likely to add variance to the model by being collinear with our target variable
- they are almost completely one value

In [None]:
to_drop = [
    'host_neighbourhood', 'host_listings_count', 'host_total_listings_count',
    'host_verifications', 'host_has_profile_pic', 'street', 'neighbourhood',
    'city', 'zipcode', 'market', 'smart_location', 'latitude', 'longitude',
    'is_location_exact', 'cleaning_fee', 'guests_included', 'extra_people',
    'minimum_nights', 'maximum_nights', 'calendar_updated', 'availability_30',
    'bed_type', 'availability_60', 'availability_90', 'availability_365',
    'first_review', 'last_review', 'review_scores_rating', 
    'review_scores_accuracy', 'review_scores_cleanliness', 
    'review_scores_checkin', 'review_scores_communication',
    'review_scores_location', 'require_guest_profile_picture', 
    'require_guest_phone_verification'
]

listings = listings.drop(to_drop, axis=1)

We are now going to convert text features to numerical ones.

- `summary`, `description` and the other free-text features will be converted to character counts to see if a longer description helps
- We will create a boolean `is_local` feature based on whether the host is from Boston or not
- `host_since` will be converted to the number of days between each host's sign-up date and the earliest host sign-up in the dataset
- `host_response_time` will be mapped to an ordinal scale (0 for missing, up to 4 for 'within an hour')
- `%`, `$` and `,` characters will be stripped from the rate and price columns
- `amenities` will be converted to a count of the amenities listed
- Boolean t/f features will be converted to 1 or 0

In [None]:
# Define description features
description_features = ['summary', 'space', 'description', 'neighborhood_overview', 'transit', 'access', 'interaction', 'house_rules', 'host_about']

# Convert null values to empty strings
listings[description_features] = listings[description_features].fillna('')

# Convert description features to character counts
for column in description_features:
    listings[column] = listings[column].str.len()

# Convert host_since to datetime and create host_since_days timedelta feature
listings.host_since = pd.to_datetime(listings.host_since, yearfirst=True)
listings['host_since_days'] = (listings.host_since - listings.host_since.min()).dt.days

# Create is_local feature based on host_location
local_destination = 'Boston, Massachusetts, United States'
listings['is_local'] = listings.host_location.apply(
    lambda location: 1 if location==local_destination else 0
)

# Drop converted features
listings = listings.drop(['host_since', 'host_location'], axis=1)

# Map host_response_time values to numerical values
response_map = {
    np.nan: 0,
    'a few days or more': 1,
    'within a day': 2,
    'within a few hours': 3,
    'within an hour': 4
}
listings.host_response_time = listings.host_response_time.replace(response_map)

# Remove '$', ',' and '%' and convert to float
str_to_float_columns = ['host_response_rate', 'host_acceptance_rate', 'price']
for column in str_to_float_columns:
    listings[column] = listings[column].apply(lambda value: re.sub(r'\$|,|%', '', str(value))).astype(float)

# Convert boolean t/f columns to 1/0 columns
boolean_columns = ['host_is_superhost', 'instant_bookable']
for column in boolean_columns:
    listings[column] = (listings[column] == 't').astype(int)
    
# Convert amenities to amenities_count
listings['amenities_count'] = listings.amenities.str.count(',')+1
listings = listings.drop('amenities', axis=1)
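As a quick sanity check of the stripping step, here is a minimal sketch on made-up values (the `raw` Series is hypothetical, not taken from the dataset). It uses the vectorised `str.replace` with a regular expression instead of `apply`, which should give the same result for inputs like these.

```python
import pandas as pd

# Hypothetical values in the same style as the price and rate columns
raw = pd.Series(['$1,250.00', '95%', '$85.00', None])

# Strip '$', ',' and '%' in one pass, then cast; missing values become NaN
cleaned = raw.str.replace(r'[$,%]', '', regex=True).astype(float)

print(cleaned.tolist())
```

Inside a character class the `$` is a literal dollar sign, so no escaping is needed.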

### Dropping listings without reviews

We want to work out what price to set for a new listing, so we'll model this on "successful" listings and remove those without a review. This has the handy side effect of removing some listings that we might otherwise class as outliers. Outliers can greatly affect a model's ability to predict prices and account for variance, but that alone is no reason to remove them. Data points may look like outliers, but unless they're an obvious mistake (e.g. a typo), they are valid.

In [None]:
num_without_reviews = len(listings[listings.number_of_reviews==0])
percent_gone = num_without_reviews/len(listings)

print(f'{num_without_reviews} rows or {percent_gone:.2%} of rows dropped as having no reviews')

listings = listings[listings.number_of_reviews!=0]

## Calendar data clean up

First off, with `price` as our target variable, we will drop any rows where the price is null. A quick glance at the data shows that this value is null when the listing is not available for a given date.

Therefore we drop rows where price is null and drop the `available` column as this then contains only one value.

In [None]:
calendar.sample(2)

In [None]:
# Drop any rows without the target value (price)
calendar.dropna(subset=['price'], inplace=True)

# Convert price into a float
calendar.price = calendar.price.apply(
    lambda value: re.sub(r'\$|,', '', value)
).astype(float)

# Extract month from date string and rewrite values
calendar['month'] = calendar.date.apply(lambda value: value.split('-')[1])
calendar['month'] = calendar['month'].replace({
    '01': 'Jan',
    '02': 'Feb',
    '03': 'Mar',
    '04': 'Apr',
    '05': 'May',
    '06': 'Jun',
    '07': 'Jul',
    '08': 'Aug',
    '09': 'Sep',
    '10': 'Oct',
    '11': 'Nov',
    '12': 'Dec'
})

# Drop available and date columns
calendar = calendar.drop(['available', 'date'], axis=1)

calendar.sample(2)
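An alternative to splitting the date string is to parse it as a datetime first. The sketch below assumes dates in `YYYY-MM-DD` form and uses made-up values; `dates` and `month_names` are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical dates, assumed to be in 'YYYY-MM-DD' format
dates = pd.Series(['2017-01-15', '2016-12-31', '2017-09-05'])

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# dt.month is 1-based, so shift by one to index into month_names
month_numbers = pd.to_datetime(dates, format='%Y-%m-%d').dt.month
months = month_numbers.map(lambda m: month_names[m - 1])

print(months.tolist())  # ['Jan', 'Dec', 'Sep']
```

Parsing with an explicit format has the side benefit of failing loudly on malformed dates, which the string split would silently accept.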

## Merge datasets

With both datasets cleaned up and transformed where needed, we can now merge the two. This will give us two `price` columns so we'll keep only the time-sensitive one coming from the calendar dataset.

In [None]:
df = pd.merge(
    listings, calendar, how='left',
    left_on='id', right_on='listing_id',
)

df = df.drop(['price_x', 'listing_id' ], axis=1)
df = df.rename({'price_y': 'price'}, axis=1)

# Drop any remaining rows without a price value
df = df.dropna(subset=['price'])

df.sample(3)
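To see where `price_x` and `price_y` come from, here is a small sketch of the same merge on two made-up frames (`listings_toy` and `calendar_toy` are hypothetical stand-ins for the cleaned datasets).

```python
import pandas as pd

# Hypothetical miniature versions of the two cleaned frames
listings_toy = pd.DataFrame({
    'id': [1, 2],
    'bedrooms': [1, 2],
    'price': [100.0, 250.0],
})
calendar_toy = pd.DataFrame({
    'listing_id': [1, 1, 2],
    'price': [95.0, 120.0, 240.0],
    'month': ['Jan', 'Jul', 'Jan'],
})

merged = pd.merge(listings_toy, calendar_toy, how='left',
                  left_on='id', right_on='listing_id')

# Both frames have a 'price' column, so pandas adds _x (left) and _y (right)
print(merged.columns.tolist())

# Keep only the calendar price, mirroring the cell above
merged = merged.drop(['price_x', 'listing_id'], axis=1)
merged = merged.rename(columns={'price_y': 'price'})
print(merged)
```

Note that the merged frame has one row per listing and date, so each listing's attributes are repeated for every date on which it has a price.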

# Exploring the data for trends

With some basic cleaning and feature engineering done, let's explore our dataset.

It's clear from the basic plots below that we have some informative features in our dataset.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 10), dpi=80)

# Neighbourhoods
neighbourhoods = listings.groupby('neighbourhood_cleansed').agg(
    {
        'price': 'mean'
    }
).reset_index()
sns.barplot(
    x='price', y='neighbourhood_cleansed',
    data=neighbourhoods.sort_values('price'),
    orient='h',
    palette='Blues',
    ax=axs[0, 0]
)
axs[0, 0].set_title('Neighbourhood')
axs[0, 0].set_xlabel('Mean Price ($)')
axs[0, 0].set_ylabel('')

# Room type
# The easiest way to plot this would be using histplot available in seaborn >= 0.11.0
# sns.histplot(x='price', hue='room_type', data=listings, ax=axs[0, 1])
# Instead we use distplot
sns.distplot(listings[listings.room_type == 'Private room']['price'],
             kde=False, ax=axs[0, 1], label='Private room')
sns.distplot(listings[listings.room_type == 'Shared room']['price'],
             kde=False, ax=axs[0, 1], label='Shared room')
sns.distplot(listings[listings.room_type == 'Entire home/apt']['price'],
             kde=False, ax=axs[0, 1], label='Entire home/apt')
axs[0, 1].set_xlim(0, 600)
axs[0, 1].set_title('Room Type')
axs[0, 1].set_xlabel('Price ($)')
axs[0, 1].legend()

# Cancellation Policy
sns.boxplot(x='price', y='cancellation_policy', fliersize=1, linewidth=0.75,
            data=listings, palette='Blues', ax=axs[1, 0],
            order=['flexible', 'moderate', 'strict', 'super_strict_30'])
axs[1, 0].set_xlim(0, 600)
axs[1, 0].set_title('Cancellation Policy')
axs[1, 0].set_xlabel('Price ($)')
axs[1, 0].set_ylabel('')

# Property type
sns.boxplot(x='price', y='property_type', fliersize=1, linewidth=0.75,
            data=listings, palette='Blues', ax=axs[1, 1])
axs[1, 1].set_xlim(0, 600)
axs[1, 1].set_title('Property Type')
axs[1, 1].set_xlabel('Price ($)')
axs[1, 1].set_ylabel('')

plt.tight_layout()
plt.show();

There's a particularly strong relationship here between price and the number of bedrooms. The month plot is on quite a narrow scale, suggesting that its impact is not so great. We can also see a potentially non-linear relationship between bathrooms and price here.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20, 5))

# Month plot
months = df.groupby('month').agg({'price': 'mean'}).reset_index()

# Converting to category to be able to set the order
months.month = months.month.astype('category')
sorter = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# set_categories returns a new Series (the inplace argument was removed in pandas 2.0)
months.month = months.month.cat.set_categories(sorter)

sns.lineplot(
    x='month', y='price',
    data=months,
#     orient='v',
    color='#2F7FBC',
    ax=axs[0]
)
axs[0].set_title('Month of the Year')
axs[0].set_xlabel('')

# Bedrooms
bedrooms = df.groupby('bedrooms').agg({'price': 'mean'}).reset_index()
sns.barplot(
    x='bedrooms', y='price',
    data=bedrooms,
    orient='v',
    color='#2F7FBC',
    ax=axs[1]
)
axs[1].set_title('Number of Bedrooms')
axs[1].set_xlabel('')
axs[1].set_ylabel('')

# Bathrooms
bathrooms = df.groupby('bathrooms').agg({'price': 'mean'}).reset_index()
sns.barplot(
    x='bathrooms', y='price',
    data=bathrooms,
    orient='v',
    color='#2F7FBC',
    ax=axs[2]
)
axs[2].set_title('Number of Bathrooms')
axs[2].set_xlabel('')
axs[2].set_ylabel('')

plt.tight_layout()
plt.show()
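The month ordering in the first plot relies on a categorical dtype with an explicit category order. A minimal sketch of that idea on made-up prices (the `toy` frame is hypothetical):

```python
import pandas as pd

sorter = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical monthly mean prices, deliberately out of calendar order
toy = pd.DataFrame({'month': ['Mar', 'Jan', 'Feb'],
                    'price': [180.0, 160.0, 170.0]})

# An ordered categorical sorts by position in `sorter`, not alphabetically
toy['month'] = pd.Categorical(toy['month'], categories=sorter, ordered=True)
toy = toy.sort_values('month')

print(toy['month'].tolist())  # ['Jan', 'Feb', 'Mar']
```

Without the categorical dtype, sorting the month strings would give alphabetical order (Apr, Aug, Dec, ...).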

Let's take a look at the numerical description features we created from the text summaries about the listing, neighbourhood, etc. We transformed text descriptions into character counts.

In [None]:
description_features = ['summary', 'space', 'description',
                        'neighborhood_overview', 'transit', 'access',
                        'interaction', 'house_rules', 'host_about']

fig, axs = plt.subplots(3, 3, figsize=(15, 15))

for feature, ax in zip(description_features, axs.reshape(-1)):
    sns.regplot(x=feature, y='price', data=df, ax=ax, ci=None, line_kws={'color': 'orange'})

With no evidence of any strong relationships here, we'll drop these features to reduce complexity. It's also simple enough to run the regression models with or without these features to see that they have no impact.

In [None]:
df = df.drop(description_features, axis=1)

# Preprocessing step

As a final bit of preprocessing, we'll deal with missing values, encode categorical features and drop the `id` column.

5% of the rows contain missing data in one of the columns, mostly the `host_response_rate` and `host_acceptance_rate` columns. Even though it seems lazy, I see no reason why we shouldn't impute these missing values with the median rather than drop this 5% of the data.

In [None]:
# Dropping the extremely small subset of rows with no property_type
df = df.dropna(subset=['property_type']).copy()

# Imputing the median for the remaining columns with null values
columns_with_null = df.columns[df.isnull().any()]
for column in columns_with_null:
    df[column] = df[column].fillna(df[column].median())
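For reference, the same median imputation can be written without a loop. This is only a sketch on a made-up frame (`toy` is hypothetical) and assumes every column with nulls is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with a few missing values
toy = pd.DataFrame({
    'host_response_rate': [100.0, np.nan, 80.0, 90.0],
    'bedrooms': [1.0, 2.0, np.nan, 1.0],
})

# DataFrame.median skips NaN by default; fillna matches medians by column name
filled = toy.fillna(toy.median())

print(filled)
```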

Finally we create dummies and drop the id column.

In [None]:
df = pd.get_dummies(df)
df = df.drop(['id'], axis=1)
df.sample(3)
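The dummy encoding is what produces column names such as `room_type_Private room`, which show up again later as coefficient names. A small sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    'bedrooms': [1, 2],
    'room_type': ['Private room', 'Entire home/apt'],
})

# Text columns are expanded to one column per category, prefixed with the
# original column name; numeric columns are left untouched
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```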

# Modelling

We'll fit the data to a few different regression methods but the trusty Linear Regression may suit our needs best. The data may contain non-linear relationships that Linear Regression will not be able to capture but it does have the significant benefit of being easily interpretable and the coefficients provided by the model will allow us to easily get a rough idea for how much we should be charging for listings.

We'll then use a few tree-based ensemble methods. Tree-based methods give us great flexibility in their ability to describe non-linear relationships but they also tend to be very sensitive to small variations in the training data and, unconstrained, can lead to overfitting. This is why we'll use a few ensemble methods to reduce this tendency to overfit. Further, tree-based methods allow for measuring the importance of features in prediction.

## Linear Regression

With a moderate number of features, we'll need to keep an eye on overfitting by scoring the model on both the training and test sets. If the training score turns out to be much higher than the test score (high variance), we can look at one of the regularisation methods.

In [None]:
def print_scores(model):
    """Print the R-squared and RMSE scores for the train and test set
    
    Parameters
        model: fitted regression model
    """
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)
    rmse_train = (mean_squared_error(y_train, y_pred_train))**0.5
    rmse_test = (mean_squared_error(y_test, y_pred_test))**0.5

    print(
        'Train R-squared: {:.3f}\tTrain RMSE: ${:.2f}\
        \nTest R-squared: {:.3f}\tTest RMSE: ${:.2f}'
        .format(r2_train, rmse_train, r2_test, rmse_test)
    )

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop('price', axis=1)
y = df.price

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Use previously created function to output metric scores
print_scores(model)

Our model gives an R-squared score of 60%, meaning that Linear Regression accounts for around 60% of the variance in price using our features. We also need to bear in mind the sizeable standard deviation of the residuals, represented by the $97 root-mean-squared error: predictions made with the model have a large error margin.
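To make the two metrics concrete, here is a sketch that computes them by hand with NumPy on three made-up prices (the arrays are hypothetical and unrelated to the model's actual predictions).

```python
import numpy as np

# Hypothetical true and predicted prices
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

residuals = y_true - y_pred

# R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual, in the units of the target
rmse = np.sqrt(np.mean(residuals ** 2))

print(f'R-squared: {r2:.3f}, RMSE: ${rmse:.2f}')  # R-squared: 0.985, RMSE: $10.00
```

Because RMSE is in dollars, it reads directly as a typical size of the prediction error.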

Now let's have a look at the coefficients.

In [None]:
feature_importance = pd.DataFrame(
    {'features': X.columns, 'coefficients': model.coef_}
).sort_values(by='coefficients')

import plotly.express as px
fig = px.bar(x='features', y='coefficients',
             data_frame=feature_importance, height=600)
fig.show();

With this bird's-eye view of the coefficients, we can see quite a few features having little to no impact on the price according to our linear regression model. Notable features that don't influence the price here:

- details about the host such as their response time/rate and number of listings
- number of reviews or the average rating given to a listing

Below we pull out the features deemed relevant by the model in predicting the price. The neighbourhood, property type and room type, alongside the number of rooms and properties stand out as key features in deciding the price.

The way to read this would be to go through the different features and add/subtract the coefficients together to get a rough idea for what price to put your Airbnb up for. To this you then need to add the **'intercept' of the model**, in our case $-\$86$. That means you need to subtract $\$86$ from the final price. The intercept is negative because we have certain features such as bedrooms which start at 1 instead of 0.

For example, say I have a two bed apartment in South End that I want to rent out in January. I would calculate

- $+\$50$ for the South End neighbourhood
- $+\$20$ because it's an apartment
- $-\$20$ since January is a cheap month
- $+\$0$ because I want to use a moderate cancellation policy
- $+\$42$ since I'll rent out the whole apartment
- $+2\times\$63$ for the two bedrooms
- $+\$35$ for the one bathroom
- $+3\times\$6$ for the three beds (there's a sofa bed)
- $+6\times\$6$ since the place accommodates 6 people
- $-\$86$ for the intercept

Therefore the model says I should put the place up for around $\$221$. Again, bearing in mind the potential error margin here, I might think that my apartment is in a particularly nice part of South End and push the price up a bit. Or maybe it's a bit of a cheek calling it a two-bed, so I knock a few dollars off.
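The same sum can be written out in a few lines of Python. The dollar amounts are the rounded coefficients from the list above; the dictionary keys are just labels chosen for this sketch.

```python
# Rounded contributions from the worked example, in dollars
contributions = {
    'South End neighbourhood': 50,
    'apartment': 20,
    'January': -20,
    'moderate cancellation policy': 0,
    'entire home/apt': 42,
    'bedrooms (2 x $63)': 2 * 63,
    'bathrooms (1 x $35)': 1 * 35,
    'beds (3 x $6)': 3 * 6,
    'accommodates (6 x $6)': 6 * 6,
    'intercept': -86,
}

estimate = sum(contributions.values())
print(f'Estimated nightly price: ${estimate}')  # Estimated nightly price: $221
```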

Feel free to go through the features here to get a rough idea of how much you might price your property.

In [None]:
def filter_coefficients(keyword, coefficients_df):
    """
    Filters a dataframe of coefficients for specific features
    
    Parameters:
        keyword (str): The keyword to filter the features by
        coefficients_df (DataFrame): the coefficients df to filter by
    
    Returns:
        df (DataFrame): a keyword filtered dataframe of coefficients
    """
    
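    df = coefficients_df[coefficients_df.\
                         features.str.contains(keyword)].copy()
    # Remove the 'keyword_' prefix; str.lstrip would strip a set of
    # characters rather than the prefix itself
    df.features = df.features.str.replace(
        '^' + re.escape(keyword) + '_', '', regex=True)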
    
    return df

def plot_coefficients(coefficients_df, ax=None, palette='icefire', xlabel='Coefficients', title=None):
    """Plots a horizontal barplot of the coefficients"""
    
    sns.barplot(
        y='features', x='coefficients',
        orient='h', data=coefficients_df,
        palette=palette,
        ax = ax
    )
    if ax:
        ax.set_xlabel(xlabel)
        ax.set_ylabel('')
        ax.set_title(title)
    else:
        plt.xlabel(xlabel)
        plt.ylabel('')
        plt.title(title)
        
# Create the subplot grid
fig, axs = plt.subplots(2, 3, figsize=(14, 8), dpi=100,
                        gridspec_kw = {'height_ratios': [4, 2]})

# Create lists to loop over for simple plots
keywords = ['neighbourhood_cleansed', 'property_type', 'cancellation_policy', 'room_type']
titles = ['Neighbourhood', 'Property Type', 'Cancellation Policy', 'Room Type']
axes = [axs[0, 0], axs[0, 1], axs[1, 0], axs[1, 1]]

for kw, title, axis in zip(keywords, titles, axes):
    plot_coefficients(
        filter_coefficients(kw, feature_importance),
        ax=axis,
        title=title
    )

# Filter and sort months
month = filter_coefficients('month', feature_importance)
month.features = month.features.astype('category')
sorter = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month['features'] = month['features'].cat.set_categories(sorter)
plot_coefficients(month, ax=axs[0, 2], title='Month of the Year')

# Pick final few features
other = feature_importance[feature_importance.features.isin(
    ['bedrooms', 'bathrooms', 'beds', 'accommodates',
     'host_is_superhost', 'is_local']
)].copy()
plot_coefficients(other, ax=axs[1, 2], title='Other Features')

fig.tight_layout()
plt.show();

There are some clear drawbacks to this model. For one thing, it's possible to plug in data that returns a negative listing price, implying that you should pay someone to come and stay in your dorm in West Roxbury in December. However, whilst Linear Regression may not be the best method for accurately predicting listing prices in this case, it does provide an easily interpreted model that can be used (at least in some form) with nothing more than a pen and paper.

We'll now look at some ensemble methods to see if we can better account for the variation in the data.

## Ensemble methods

As mentioned, ensemble methods allow us to leverage the flexibility of (in this case) tree-based models whilst reducing their tendency to memorise noise.

### Random Forests

Random Forests is one of the most powerful Machine Learning algorithms, despite its simplicity. It is an ensemble of Decision Trees, taking the average prediction from multiple individual trees all trained on a different random subset of the training data. Let's see how it does on our data.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=20, n_jobs=-1)
rf.fit(X_train, y_train)
print_scores(rf)

Random Forests accounts for over 90% of the variance, with no sign of overfitting! Note, though, the sizeable standard deviation of the residuals. There is still room for error even with so much of the variability in our dataset accounted for.

We'll look at the key features driving the prediction in a moment. First we'll look at a second ensemble method.

### XGBoost

XGBoost stands for Extreme Gradient Boosting, a "boosting" ensemble method that works by sequentially adding predictors to an ensemble, each one correcting its predecessor. Another popular boosting method is AdaBoost, which stands for Adaptive Boosting.

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(X_train, y_train)
print_scores(xgb)

We've chosen tree-based models here not only for their flexibility but also their ability to assess feature importance. Feature importance is measured by how much the tree nodes use a particular feature to predict prices. Scikit-Learn scales the results so that the sum of all importances is equal to 1.

Let's see the top most important features according to the two ensemble methods.

In [None]:
def get_feature_importances(model):
    """Return sorted model feature importances"""
    
    df = pd.DataFrame(
        {'features': X.columns, 'coefficients': model.feature_importances_}
    ).sort_values(by='coefficients', ascending=False)[:15]
    return df
        
fig, axs = plt.subplots(1, 2, figsize=(15, 6), dpi=80)
plot_coefficients(
    get_feature_importances(rf), ax=axs[0], xlabel='Importance',
    title='Random Forest Feature Importance', palette='Blues_r')
plot_coefficients(
    get_feature_importances(xgb), ax=axs[1], xlabel='Importance',
    title='XGBoost Feature Importance', palette='Blues_r')

plt.tight_layout()
plt.show()

Both ensemble methods agree that the number of bedrooms, the number of bathrooms and whether the listing is an entire home/apt are the most important features. For both models, these features account for 57% of the importance. The remaining features each contribute only a small share.
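A combined share like the one quoted above is just the sum of the largest normalised importances. The sketch below shows the calculation on made-up scores for five placeholder features; the numbers are hypothetical and are not the models' actual importances.

```python
import pandas as pd

# Hypothetical raw importance scores for five placeholder features
raw_scores = pd.Series({
    'feature_a': 8.0,
    'feature_b': 5.0,
    'feature_c': 3.0,
    'feature_d': 2.4,
    'feature_e': 1.6,
})

# Normalise so that the importances sum to 1
importances = (raw_scores / raw_scores.sum()).sort_values(ascending=False)

# Combined share of the three most important features
top_three_share = importances.head(3).sum()
print(f'Top three features: {top_three_share:.0%} of total importance')
```

With a fitted model, `importances` could instead be built from `model.feature_importances_` indexed by `X.columns`, as in `get_feature_importances` above.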

# Conclusion

## Avoiding complexity

Through the models chosen and with the task at hand, we've managed to avoid some extra steps in making our predictions. We did not need to scale our data, since we used plain Linear Regression rather than one of the regularized methods, and feature scaling does not affect tree-based ensemble methods. Further, since we are only providing guidelines, we did not need to optimize our models with hyperparameter tuning. Our ensemble methods scored well enough by most standards without the added complexity.
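The claim about scaling and unregularised Linear Regression can be checked numerically. The sketch below fits ordinary least squares with NumPy on synthetic data, once on the raw features and once on standardised features, and compares the predictions. The data and the `ols_predict` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales, plus a noisy linear target
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + 10 + rng.normal(size=50)


def ols_predict(features, target):
    """Fit least squares with an intercept and return in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Standardise each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# The coefficients differ, but the fitted values should agree
same = np.allclose(ols_predict(X, y), ols_predict(X_scaled, y))
print(same)
```

Scaling changes the size of each coefficient but not the fitted values, which is why it only starts to matter once a penalty on the coefficients (as in Ridge or Lasso) is introduced.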


## That _Je Ne Sais Quoi_

The models have given us some sensible guidelines for choosing how much to put a place up for on Airbnb but in each case, there was a sizeable error margin.

While our models do a great job at generalizing prices based on features, each listing is unique and what makes a traveller decide that your place is worth the money comes down to many features we won't be able to capture, although Airbnb do their best with their model.

