# Airbnb Boston Analysis

## Business Understanding

1. What are the most expensive neighbourhoods in Boston?
2. Is there price seasonality?
3. Based on the reviews, are there months in which people prefer to visit Boston?

## Data Understanding

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline
sns.set_theme(style="darkgrid")


df_calendar = pd.read_csv('./calendar.csv')
df_listings = pd.read_csv('./listings.csv', index_col='id')
df_reviews = pd.read_csv('./reviews.csv')

In [None]:
# remove '$' and ',' from price and convert price to float
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
df_calendar['price'] = df_calendar['price'].str.replace('$', '', regex=False)
df_calendar['price'] = df_calendar['price'].str.replace(',', '', regex=False)
df_calendar['price'] = df_calendar['price'].astype(float)

# convert date from dtype 'object' to 'datetime64'
df_calendar['date'] = pd.to_datetime(df_calendar['date'])
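
The price cleaning above can be checked on a small hand-made example. This is a minimal sketch: the values in `raw` are made up for illustration and are not taken from `calendar.csv`. It assumes prices are strings such as `"$1,250.00"` and that missing prices are NaN.

```python
import numpy as np
import pandas as pd

# hypothetical raw price strings, including one missing value
raw = pd.Series(["$1,250.00", "$85.00", np.nan])

# strip the currency symbol and the thousands separator as literal characters
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(clean)
```

With `regex=True`, `'$'` is interpreted as the end-of-string anchor, so the dollar sign would not be removed and the conversion to float would fail.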

In [None]:
# share of days for which no price is available
print("The share of days for which the listings are unavailable is {}".format(df_calendar['price'].isnull().sum()/df_calendar.shape[0]))

To work with the calendar price data, the data is transformed into a pivot table with one row per date and one column per listing. **Listings for which no price is available at all are dropped.** From here on, only listings with at least one available price are considered.
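
A minimal sketch of this behaviour on made-up data (the listing ids, dates and prices in `toy_calendar` are hypothetical): with its default `dropna=True`, `pivot_table` leaves out columns whose entries are all NaN, so a listing without any price does not appear in the result.

```python
import numpy as np
import pandas as pd

# listing 1 has all prices, listing 2 has none, listing 3 has one
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime(["2017-01-01", "2017-01-02"] * 3),
    "price": [100.0, 110.0, np.nan, np.nan, np.nan, 80.0],
})

# one row per date, one column per listing
toy_price = toy_calendar.pivot_table(index="date", columns="listing_id", values="price")
print(toy_price)
```

Listing 2 is missing from the columns, while listing 3 is kept with a NaN on the date for which it has no price.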

In [None]:
# transform the calendar data into a more convenient layout (rows: dates, columns: listings)
df_price = df_calendar.pivot_table(index='date',columns='listing_id',values='price')

# relevant listings
listings = df_price.columns

In [None]:
print("The share of listings for which there is no variability in the price is {}".format((df_price.std()==0).sum()/df_price.shape[1]))

In [None]:
# interpolate missing values linearly; for values at the beginning of the period use a backward fill
df_price = df_price.interpolate(method='linear').bfill()
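
To see what this filling step does, here is a minimal sketch on a single made-up price series (the values in `s` are hypothetical). Interior gaps are filled on a straight line between the neighbouring prices, and the leading gap, which linear interpolation leaves untouched by default, is filled backwards from the first known price.

```python
import numpy as np
import pandas as pd

# hypothetical daily prices with a leading gap, an interior gap and a trailing gap
s = pd.Series([np.nan, 100.0, np.nan, np.nan, 130.0, np.nan])

# linear interpolation first, then backward fill for the leading NaN
filled = s.interpolate(method="linear").bfill()
print(filled)
```

Note that the trailing gap is filled with the last known price, so listings with long unavailable stretches at the end of the period get a flat price there.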

In [None]:
def listing_mean_neighbourhood(listing_id):
    """Return the neighbourhood and the mean price over the period for one listing."""
    neighbourhood = df_listings.loc[listing_id, 'neighbourhood_cleansed']
    mean_price = df_price[listing_id].mean()

    return neighbourhood, mean_price

In [None]:
d = []

for listing in listings:
    neighbourhood, mean_price = listing_mean_neighbourhood(listing)
    d.append(
        {
            'listing_id':listing,
            'neighbourhood': neighbourhood,
            'mean_price': mean_price
        }
    )

df_listing_mean_neighbourhood = pd.DataFrame(d)

In [None]:
mean_price_by_neighbourhood = df_listing_mean_neighbourhood[['neighbourhood','mean_price']].groupby('neighbourhood').mean().sort_values(by='mean_price')
mean_price_by_neighbourhood

In [None]:
ax = mean_price_by_neighbourhood.plot.bar()

plt.title('Mean Price per Neighbourhood in Boston (MA)')
plt.savefig('mean_price_neighbor.jpeg',bbox_inches="tight",dpi=600);