# Boston Airbnb listing Dataset

Airbnb is an online website/application offering rooms, lodgings and homestays all over the world. It gives travellers a unique and personalized way of experiencing new places and socializing with local people, which sets it apart from the run-of-the-mill hotels and inns that travellers usually stay in.

This dataset describes the listings on Airbnb for the city of Boston. It has all the details about the listings, hosts and also some metrics to analyse the user behaviour and draw conclusions from it.

Here we work through some questions that help us analyse user behaviour:
1. First we take a general look at the data and run some basic housekeeping commands to get a feel for it: data exploration, null coverage, and a correlation matrix to visualize and filter out the important features of the dataset.
2. Then we dive deeper into the data to see which amenities are most commonly provided in homestays across Boston.
3. Further we analyse how these amenities affect the popularity of the property. We also look at what other features attract guests to rent a property.
4. Finally, using the property's geolocation coordinates, we see how the location of the property affects its price. We do this by training a linear model on the data and then examining the coefficients of the model to see which features affect the listing price and by what factor.

In [1]:
#Boilerplate imports for data exploration and analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
from collections import defaultdict
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
import math
%matplotlib inline

df = pd.read_csv('../input/boston/listings.csv')
sns.set_style('darkgrid')

Let's start with some basic data exploration - getting to know the data


In [2]:
df.columns

In [3]:
df.shape

In [4]:
df.dtypes.value_counts()

In [5]:
#Columns where every value is null (null count equals the number of rows)
df.isnull().sum()[df.isnull().sum()==len(df)]

In [6]:
#Columns with no nulls and all distinct values
df.nunique()[df.nunique()==len(df)]

In [7]:
#Number of distinct values in the remaining columns (NaN is not counted)
df.nunique()[df.nunique() != len(df)].sort_values(ascending=False)
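
The cells above count distinct values. For null coverage itself, a minimal sketch is shown below. It assumes a small made-up frame (`toy`) rather than the real listings file, and the 0.5 threshold is only an illustration.

```python
import pandas as pd

# Toy frame standing in for the listings data (all values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$100.00", None, "$80.00", "$120.00"],
    "license": [None, None, None, None],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Columns that are mostly empty (threshold chosen for illustration only).
mostly_null = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_null)
```

Taking the mean of the boolean null mask gives the share of missing rows directly, which is easier to compare across columns than raw counts.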

Now that we have seen the data type coverage, before getting the answers to our questions we can go ahead and clean the dataset so that it can be visualised by seaborn.

Starting off we can drop the columns which are not essential for our analysis - Web scraping housekeeping columns/image urls

In [8]:
cols = ['thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'listing_url', 'host_url',
       'host_thumbnail_url', 'host_picture_url', 'country', 'country_code', 'neighbourhood',
       'smart_location', 'street', 'market', 'first_review', 'last_review', 'state', 'calendar_last_scraped',
       'calendar_updated', 'city', 'scrape_id', 'last_scraped', 'space', 'host_neighbourhood', 
        'neighborhood_overview', 'host_listings_count', 'zipcode', 'is_location_exact', 'host_location',
       'host_total_listings_count']
df.drop(cols, axis=1, inplace=True)
# drop the columns where more than half the values are null
cols = df.columns[df.isnull().sum()/df.shape[0] > 0.5]
df.drop(cols, axis=1, inplace=True)
print(df.shape)

In [9]:
# Fixing data types: strip currency/percent symbols and convert to numeric. (Seaborn can't visualize object columns for regression)
cols = ['host_response_rate', 'host_acceptance_rate', 'price', 'cleaning_fee', 'extra_people']
for col in cols:
    # remove '$', ',' and '%' so that values such as "$1,250.00" are parsed in full
    df[col] = pd.to_numeric(df[col].str.replace(r'[$,%]', '', regex=True), errors='coerce')
df[cols].dtypes
temp = pd.to_datetime('06/30/2021')
df['host_since'] = pd.to_datetime(df.host_since)
df['host_len'] = df.host_since.apply(lambda x: pd.Timedelta(temp-x).days)
df = df.drop('host_since', axis=1)
# extract the number of amenities (strip both braces and the quotes before splitting)
df['n_amenities'] = df['amenities'].apply(lambda x: len(x.replace('{', '').\
                        replace('}', '').replace('"', '').split(',')))
df_num = df.select_dtypes(include=['int', 'float'])
# fill na for the columns
int_fillmean = lambda x: x.fillna(round(x.mean()))
df_num = df_num.apply(int_fillmean, axis=0)
df_num = df_num.drop(['id', 'host_id', 'latitude', 'longitude'], axis=1).astype(float)
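
One pitfall when cleaning price strings: a pattern that captures only the first run of digits, such as `(\d+)`, stops at the thousands separator. The sketch below illustrates this with a hypothetical helper, `parse_currency`, which is not part of the notebook; it assumes prices are non-negative and formatted like `$1,250.00`.

```python
import re

def parse_currency(text):
    """Convert a string such as '$1,250.00' to a float; return None if no number is found."""
    if not isinstance(text, str):
        return None
    cleaned = re.sub(r"[^\d.]", "", text)  # drop '$', ',' and any other symbols
    try:
        return float(cleaned)
    except ValueError:
        return None

# The first digit run of "$1,250.00" is just "1".
print(re.search(r"(\d+)", "$1,250.00").group(1))
# Stripping the symbols first keeps the whole amount.
print(parse_currency("$1,250.00"))
```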

In [10]:
# visualise the price
plt.figure(figsize=(10, 10))
sns.histplot(df_num['price'], bins=50, kde=True, stat='density')  # distplot is deprecated in recent seaborn
plt.ylabel('Density', fontsize=12)
plt.xlabel('Price (USD)', fontsize=12)
plt.title('Listed Price Distribution', fontsize=16);

In [11]:
# visualize the correlation matrix
corr = df_num.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(20, 16))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, annot=True, fmt='.2f', cmap='coolwarm')

As the heatmap shows, amenities come out as the most influential feature for the price column. With this information, let's dive deeper to answer some of our questions.

**Q2. Then we dive deeper into the data to see what are the common amenities provided in the homestays across Boston.**

In [12]:
#refreshing the dataframe
df = pd.read_csv('../input/boston/listings.csv')
#First let's get all the amenities values from the data,
#stripping the surrounding braces from each listing's string
amenities = [s.strip("'{}") for s in df['amenities']]
#Getting unique amenities from the list
amenities_str = ",".join(amenities)
amenities = list(set(amenities_str.split(",")))
amenities = [value for value in amenities if value != ''] # Removing empty elements

In [13]:
#Let's prep our graph for analysis: count each distinct amenities string, naming the columns amenities/count
#(rename_axis + reset_index(name=...) gives the same column names across pandas versions)
df_am = df['amenities'].value_counts().rename_axis('amenities').reset_index(name='count')

#Calculate count for each amenities value
new_df = defaultdict(int)
for val in amenities:
    for idx in range(df_am.shape[0]):
        if val in df_am['amenities'][idx]:
            new_df[val] += int(df_am['count'][idx])
new_df = pd.DataFrame(pd.Series(new_df)).reset_index()
new_df.columns = ['amenities','count']
new_df.sort_values('count', ascending=False, inplace=True)
new_df.set_index('amenities', inplace=True)

#Visualize plot
(new_df/len(df))[:20].plot(kind='bar', legend=None);
plt.title('Most Common Amenities');
plt.ylabel('Fraction of listings')
plt.show()
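
Note that the `in` check above is a substring match, so `Internet` is also counted for listings that only have `"Wireless Internet"`. A minimal sketch of exact-match counting follows, assuming the raw strings look like `{TV,"Wireless Internet",Kitchen}`; the helper name `split_amenities` and the sample rows are made up for illustration.

```python
from collections import Counter

def split_amenities(raw):
    """Split a raw string like '{TV,"Wireless Internet"}' into clean amenity names."""
    inner = raw.strip("{}")
    return [item.strip().strip('"') for item in inner.split(",") if item.strip()]

# Made-up rows in the same format as the amenities column.
rows = ['{TV,"Wireless Internet",Kitchen}', '{Internet,"Wireless Internet"}', '{}']

# Count each amenity at most once per listing, matching whole names only.
counts = Counter(name for raw in rows for name in set(split_amenities(raw)))
print(counts)
```

Splitting first and comparing whole names keeps `Internet` and `Wireless Internet` separate, which may change the ranking compared with the substring approach.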


From the above chart, the most common amenities are:
* Internet
* Wireless Internet
* Heating
* Kitchen
* Essentials
* Dryer

**Q3. Further we analyse how these amenities affect the popularity of the property. Also we look at what other features attract the guests to rent a property.**

To gauge popularity we use the `availability_30` column, which gives the number of days a listing is available over the next 30 days. Treating unavailable days as booked (an approximation, since hosts can also block dates themselves), we create a column holding the booking rate for each listing.

In [14]:
df['booking']=1-(df['availability_30']/30)
df['booking'].mean() #Average booking

Let's play around with this column to gauge property popularity against a couple of features.

In [15]:
#Popularity based on room_type
(df.groupby(['room_type'])['booking'].mean().sort_values(ascending=False)).plot(kind='bar')
plt.title('Booking % vs room_type');
plt.show()

In [16]:
#Booking popularity based on property_type
(df.groupby(['property_type'])['booking'].mean().sort_values(ascending=False)).plot(kind='bar')
plt.title('Booking % vs property_type');
plt.show()

In [17]:
#Let's make binary columns indicating whether each amenity is present in a listing
#(substring match on the raw amenities string, as in the counting step above)
for amenity in amenities:
    df[amenity] = df['amenities'].str.contains(amenity, regex=False).astype(int)

#Mean booking rate with and without each amenity, and the difference between the two
rows = []
for amenity in amenities:
    booking_by_flag = df.groupby(amenity)['booking'].mean()
    positive = booking_by_flag.get(1, np.nan)  # NaN if no listing has the amenity
    negative = booking_by_flag.get(0, np.nan)  # NaN if every listing has the amenity
    rows.append({'Amenity_Name': amenity,
                 'Amenity_positive_booking': positive,
                 'Amenity_negative_booking': negative,
                 'Booking Difference': positive - negative})

amenities_df = pd.DataFrame(rows).set_index('Amenity_Name')

In [18]:
#Visualization of popular amenities vs booking
amenities_df['Booking Difference'].sort_values(ascending = False)[:10].plot(kind='bar', legend=None)
plt.title('Amenity Popularity');
plt.ylabel('Booking %')
plt.show()

From the plot, listings with a smoke detector have a noticeably higher booking rate than those without. A buzzer or a fire extinguisher is also associated with higher booking numbers. These differences are correlations, so they do not prove that adding the amenity causes more bookings.

Apart from the amenities, private rooms are the most booked room type, and villas are booked more than other property types.

**Q4. Finally, using the property's geolocation coordinates, we see how the location of the property affects its price. We do this by training a linear model on the data and then examining the coefficients of the model to see which features affect the listing price and by what factor.**

Below I write a function that calculates the distance from downtown. According to Google Maps, the coordinates of downtown Boston are approximately [42.3557, -71.0572] (west longitudes are negative). I'll use this point to calculate each listing's distance from downtown.

In [19]:
def dist(lat, long, downtown=(42.3557, -71.0572)):
    # Haversine great-circle distance in km between (lat, long) and downtown
    lat1 = math.radians(downtown[0])
    long1 = math.radians(downtown[1])
    lat2 = math.radians(lat)
    long2 = math.radians(long)
    dlong = long2 - long1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlong / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * 6373.0  # approximate Earth radius in km
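
As a quick sanity check on the haversine formula, the standalone sketch below uses a hypothetical function `haversine_km` (not the notebook's `dist`) and assumes a mean Earth radius of 6371 km. It also shows how much a missing minus sign on the longitude distorts the result.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))

downtown = (42.3557, -71.0572)

# Same point: distance is zero.
same_point = haversine_km(*downtown, *downtown)
# One degree of latitude is roughly 111 km.
one_degree = haversine_km(42.0, -71.0, 43.0, -71.0)
# Flipping the sign of the longitude moves the reference point thousands of km away.
wrong_sign = haversine_km(*downtown, downtown[0], -downtown[1])

print(same_point, one_degree, wrong_sign)
```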

In [20]:
#Dropping off the extra columns from the dataframe

filtered_features=['bathrooms', 'bedrooms', 'beds', 'latitude', 'longitude', 'reviews_per_month',
        'booking', 'accommodates', 'guests_included', '"24-Hour Check-in"', '"Suitable for Events"','"Other pet(s)"', 'Essentials', '"Wireless Internet"',
        '"Laptop Friendly Workspace"', '"Pets Allowed"', 'Pool',
        '"Free Parking on Premises"','"Pets live on this property"', 'Dog(s)', '"Smoking Allowed"',
       '"Buzzer/Wireless Intercom"', 'TV', 'Gym', 'Washer',
        'Doorman', 'Dryer','"Hot Tub"', '"Air Conditioning"', '"Carbon Monoxide Detector"', '"Safety Card"', 'Kitchen',
       '"Hair Dryer"', '"Fire Extinguisher"', 'Breakfast', '"Washer / Dryer"',
       '"Lock on Bedroom Door"', 'Cat(s)', 'Hangers',
    '"Family/Kid Friendly"','"Wheelchair Accessible"', 'Iron', 'Shampoo', '"Smoke Detector"', '"First Aid Kit"',
       '"Indoor Fireplace"', '"Elevator in Building"', 'Internet','"Cable TV"', 'Heating', 'host_is_superhost',
        'property_type','room_type','bed_type','price','cleaning_fee', 'extra_people', 'instant_bookable', 'cancellation_policy']


model_df = df[filtered_features].copy()  # copy so later assignments don't act on a view of df

In [21]:
model_df['cleaning_fee']=model_df['cleaning_fee'].fillna("$0.00")

In [22]:
cols = ['price', 'cleaning_fee','extra_people']
for col in cols:
    # strip '$' and ',' so that values such as "$1,250.00" are parsed in full
    model_df[col] = model_df[col].str.replace(r'[$,]', '', regex=True).astype(float)

In [23]:
model_df.dtypes

In [24]:
#cleaning the dataframe and fixing missing values (fill with the column mean)
for col in ['bedrooms', 'bathrooms', 'reviews_per_month', 'beds']:
    model_df[col] = model_df[col].fillna(model_df[col].mean())
#Taking care of the categorical columns: one-hot encode every object column
cat_cols = model_df.select_dtypes(include=['object']).columns
model_df = pd.get_dummies(model_df, columns=cat_cols, prefix_sep='_', drop_first=True, dtype=int)

Now that the dataframe is ready, let's use the `dist` function to calculate each listing's distance from downtown.

In [25]:
for i in range(len(model_df)):
    model_df.loc[ i , 'distance_from_downtown'] = dist(model_df.loc[ i ,'latitude'] , model_df.loc[ i ,'longitude'])
model_df.drop(columns=['latitude', 'longitude'], inplace=True)
model_df.fillna(0, inplace=True)
y = model_df['price']
X = model_df.drop(columns='price')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42) # Train test split

#Training a linear regression model on the final dataset
model = LinearRegression()  # plain OLS; the `normalize` argument was removed in scikit-learn 1.2
model.fit(X_train, y_train)
y_test_preds = model.predict(X_test)
test_score = r2_score(y_test, y_test_preds)
print(test_score)

In [26]:
#This function tabulates the coefficient of each feature, sorted by absolute size. I found it on Udacity. We will use it to see how the location and other features affect the price prediction in our model.
def coef_weights(coefficients, X_train):
    coefs_df = pd.DataFrame()
    coefs_df['feat'] = X_train.columns
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df

#Use the function
coef_df = coef_weights(model.coef_, X_train)

In [27]:
coef_df.sort_values(by='coefs').head(10)

In [28]:
coef_df[coef_df['feat'] == 'distance_from_downtown']
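
When reading coefficients "by what factor", keep in mind that a raw coefficient depends on the units of its feature, so coefficients of features on different scales are not directly comparable. One rough way to compare them is to multiply each coefficient by the standard deviation of its feature. The sketch below assumes synthetic data (all numbers are made up) and fits ordinary least squares with NumPy rather than the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (all values are made up).
bedrooms = rng.integers(1, 5, size=n).astype(float)
distance_m = rng.uniform(0, 10_000, size=n)  # metres from some reference point
price = 50 + 40 * bedrooms - 0.005 * distance_m + rng.normal(0, 5, size=n)

X = np.column_stack([bedrooms, distance_m])
A = np.column_stack([np.ones(n), X])                 # prepend an intercept column
coef = np.linalg.lstsq(A, price, rcond=None)[0][1:]  # fitted slopes, intercept dropped

# Change in price for a one-standard-deviation change in each feature.
scaled = coef * X.std(axis=0)

print("raw coefficients:   ", coef)
print("per-std-dev effects:", scaled)
```

Here the raw coefficient for distance looks negligible next to the one for bedrooms, but the per-standard-deviation effects are of comparable magnitude, so both features matter.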