# Airbnb Boston Analysis

## Business Understanding

1. What are the most expensive neighbourhoods in Boston?
2. Is there price seasonality?
3. Based on the reviews, are there months in which people prefer to visit Boston?

## Data Understanding & Preparation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline
sns.set_theme(style="darkgrid")

from datetime import datetime


df_calendar = pd.read_csv('./calendar.csv')
df_listings = pd.read_csv('./listings.csv', index_col='id')
df_reviews = pd.read_csv('./reviews.csv')

In [None]:
# remove '$' and ',' from price and convert price to float
# regex=False matches '$' literally instead of as the regex end-of-string anchor
df_calendar['price'] = df_calendar['price'].str.replace('$', '', regex=False)
df_calendar['price'] = df_calendar['price'].str.replace(',', '', regex=False)
df_calendar['price'] = df_calendar['price'].astype(float)

# convert date from dtype 'object' to datetime
df_calendar['date'] = pd.to_datetime(df_calendar['date'])

In [None]:
# share of days for which no price is available
print("The share of days for which the listings are unavailable is {}".format(df_calendar['price'].isnull().sum()/df_calendar.shape[0]))

To work with the calendar price data, we reshape it into a pivot table with one row per date and one column per listing. **Listings that have no price on any day are dropped.** From here on, only listings with at least one available price are considered.
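
As a small illustration of this reshaping step, the sketch below runs `pivot_table` on a made-up calendar. The frame `toy_calendar` and its values are hypothetical and not taken from `calendar.csv`. Listing 3 has no price on any day, so with the default `dropna=True` it does not appear as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format calendar: listing 3 never has a price
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 2, 3, 3],
    'date': pd.to_datetime(['2016-09-06', '2016-09-07'] * 3),
    'price': [100.0, np.nan, 80.0, 90.0, np.nan, np.nan],
})

# One row per date, one column per listing
toy_price = toy_calendar.pivot_table(index='date', columns='listing_id', values='price')
print(toy_price)
```

Listing 1 is kept because it has one price. Its missing day stays `NaN` in the pivot table.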

In [None]:
# transform the calendar data into a more convenient layout
df_price = df_calendar.pivot_table(index='date',columns='listing_id',values='price')

# relevant listings
listings = df_price.columns

In [None]:
print("The share of listings for which there is no variability in the price is {}".format((df_price.std()==0).sum()/df_price.shape[1]))

In [None]:
# interpolate linearly for missing values; values missing at the beginning of the period are back-filled
# .bfill() replaces the deprecated fillna(method='bfill')
df_price = df_price.interpolate(method='linear').bfill()
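
To make the effect of this gap filling concrete, here is a minimal sketch on a made-up single-listing series. The frame `toy` and its values are hypothetical. Linear interpolation fills the gap between two known prices and carries the last known price forward, but it leaves a missing value at the start untouched. The backward fill covers that one.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a gap at the start, in the middle, and at the end
toy = pd.DataFrame(
    {'a': [np.nan, 100.0, np.nan, 120.0, np.nan]},
    index=pd.date_range('2016-09-06', periods=5, freq='D'),
)

# Same two steps as above: linear interpolation, then backward fill
filled = toy.interpolate(method='linear').bfill()
print(filled)
```

Note that the filled values are estimates, not observed prices, so averages computed from them inherit that assumption.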

In [None]:
def listing_mean_neighbourhood(listing_id):
    neighbourhood = df_listings.loc[listing_id, 'neighbourhood_cleansed']
    mean_price = df_price[listing_id].mean()

    return neighbourhood, mean_price

In [None]:
d = []

for listing in listings:
    neighbourhood, mean_price = listing_mean_neighbourhood(listing)
    d.append(
        {
            'listing_id':listing,
            'neighbourhood': neighbourhood,
            'mean_price': mean_price
        }
    )

df_listing_mean_neighbourhood = pd.DataFrame(d)

In [None]:
mean_price_by_neighbourhood = df_listing_mean_neighbourhood[['neighbourhood','mean_price']].groupby('neighbourhood').mean().sort_values(by='mean_price')
mean_price_by_neighbourhood

In [None]:
ax = mean_price_by_neighbourhood.plot.bar()

plt.title('Mean Airbnb Price per Neighbourhood in Boston (MA)')
plt.savefig('mean_price_neighbor.jpeg',bbox_inches="tight",dpi=600);

In [None]:
ax = df_price.mean(axis=1).plot()

plt.title('Mean Airbnb Price per Night in Boston (MA)')
plt.savefig('mean_price_per_night.jpeg',bbox_inches="tight",dpi=600);
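
A daily curve can be noisy, so one possible way to look at price seasonality is to average the daily mean by calendar month. The sketch below uses a made-up daily series. The names `daily_mean` and `monthly_mean` and the price levels are hypothetical. In the notebook, the input would be `df_price.mean(axis=1)`.

```python
import pandas as pd

# Hypothetical daily mean prices: 100 in September, 120 in October, 90 in November
dates = pd.date_range('2016-09-01', '2016-11-30', freq='D')
level = {9: 100.0, 10: 120.0, 11: 90.0}
daily_mean = pd.Series([level[d.month] for d in dates], index=dates)

# Group the daily values by calendar month and average them
monthly_mean = daily_mean.groupby(daily_mean.index.to_period('M')).mean()
print(monthly_mean)
```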

## Data Modeling

Addressing question #3 is more complex than questions #1 & #2. The goal is to use features of the listing (type, rooms, neighbourhood, ...) and of its reviews (no. of reviews per listing, frequency of reviews, ...) to predict the price of a given listing on the day the `listings.csv` data was scraped. Before modeling, we have to decide which features to include in the model.
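
To give a rough idea of what such a model could look like, the sketch below fits a linear regression on synthetic data. Everything in it is an assumption made for illustration: the frame `toy`, the three features, the coefficients used to generate prices, and the 70/30 split. It is not the model built from `listings.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic listings (hypothetical values, not taken from listings.csv)
toy = pd.DataFrame({
    'accommodates': rng.integers(1, 7, size=n),
    'bedrooms': rng.integers(1, 4, size=n),
    'room_type': rng.choice(['Entire home/apt', 'Private room'], size=n),
})
# Synthetic price: linear in the features plus a little noise
toy['price'] = (
    40
    + 25 * toy['accommodates']
    + 30 * toy['bedrooms']
    + 50 * (toy['room_type'] == 'Entire home/apt')
    + rng.normal(0, 5, size=n)
)

# One-hot encode the categorical feature, then split, fit, and score
X = pd.get_dummies(toy.drop(columns='price'), columns=['room_type'],
                   drop_first=True, dtype=float)
y = toy['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the held-out data: {test_r2:.3f}")
```

The score is high here only because the synthetic prices are linear by construction. Real listing data will behave differently.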

In [None]:
df_listings_red = df_listings[[
                    'last_scraped',
                    'host_since',
                    'host_response_time',
                    'host_response_rate',
                    'host_acceptance_rate',
                    'host_is_superhost',
                    'host_listings_count',
                    'host_verifications',
                    'host_has_profile_pic',
                    'host_identity_verified',
                    'neighbourhood_cleansed',
                    'property_type',
                    'room_type',
                    'accommodates',
                    'bathrooms',
                    'bedrooms',
                    'beds',
                    'bed_type',
                    'amenities',
                    'price',
                    'cleaning_fee',
                    'guests_included',
                    'minimum_nights',
                    'review_scores_rating',
                    'review_scores_accuracy',
                    'review_scores_cleanliness',
                    'review_scores_checkin',
                    'review_scores_communication',
                    'review_scores_location',
                    'review_scores_value',
                    'cancellation_policy',
                    'reviews_per_month']].copy()

First, we have to prepare the columns that are in the wrong format: for example, `host_since` is turned into the number of days up to the scrape date (2016-09-07), and numbers stored as strings with a '$' or '%' sign are converted to floats.
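
The sketch below shows these conversions on a made-up frame. The frame `toy` and its values are hypothetical. Passing `regex=False` makes `str.replace` treat '$' as a literal character. As a regular expression, '$' would match the end of the string and the dollar sign would stay in place.

```python
import pandas as pd

# Hypothetical raw values in the same string formats as the listings data
toy = pd.DataFrame({
    'price': ['$1,250.00', '$85.00'],
    'host_response_rate': ['95%', None],
    'host_since': ['2015-09-07', '2016-08-08'],
})

# Dollar amounts: strip '$' and the thousands separator, then cast to float
toy['price'] = (
    toy['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Rates: strip '%' and rescale to the interval [0, 1]; missing values stay NaN
toy['host_response_rate'] = (
    toy['host_response_rate'].str.replace('%', '', regex=False).astype(float) / 100
)

# Host tenure in days relative to the scrape date
scrape_date = pd.Timestamp('2016-09-07')
toy['host_since_days'] = (scrape_date - pd.to_datetime(toy['host_since'])).dt.days
print(toy)
```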

In [None]:
#convert last_scraped and host_since to datetime
df_listings_red[['last_scraped','host_since']] = df_listings_red[['last_scraped','host_since']].apply(pd.to_datetime)

#calculate host_since in days
df_listings_red['host_since_days'] = (df_listings_red['last_scraped'] - df_listings_red['host_since']) / np.timedelta64(1, 'D')

In [None]:
#remove % sign from rate columns and convert them to dtype float
def convert_rate(df: pd.DataFrame, cols: list):
    
    """
    Remove '%' sign and convert column to float between 0 and 1
    # Parameters
    df:     dataframe of interest
    cols:   list of columns which feature rates
    """

    for col in cols:
        df[col] = df[col].str.replace('%','')
        df[col] = df[col].astype(float)/100

#remove $ sign and , as separator and turn dollar amount columns into float
def convert_dollar(df: pd.DataFrame, cols: list):

    """
    Remove '$' and ',' sign and convert column to float
    # Parameters
    df:     dataframe of interest
    cols:   list of columns which feature dollar amounts
    """

    for col in cols:
        # regex=False matches '$' literally instead of as the regex end-of-string anchor
        df[col] = df[col].str.replace('$', '', regex=False)
        df[col] = df[col].str.replace(',', '', regex=False)
        df[col] = df[col].astype(float)

In [None]:
convert_rate(df_listings_red,['host_response_rate', 'host_acceptance_rate'])
convert_dollar(df_listings_red,['price', 'cleaning_fee'])