AI@Penn Venture Fellows Airbnb Exploratory Data Analysis
By - Michael O'Farrell

We'll begin this analysis of the Airbnb data by importing the libraries we need.

In [None]:
# Library imports
import collections
import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from numpy import median, random
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import tqdm
from tqdm.notebook import tqdm_notebook # public replacement for the private tqdm._tqdm_notebook module
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import tensorflow as tf
import keras
from keras import Sequential
from keras.layers import Dense
from keras.preprocessing.text import Tokenizer
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Next, we'll read in the Airbnb data and examine its first few rows.

In [None]:
data = pd.read_csv("../input/us-airbnb-open-data/AB_US_2020.csv")
data.head()

Examining the missing values, most of the missing data comes from the neighbourhood group column and the review-related features.

In [None]:
# A quick look at the missing values indicates that the vast majority of the missing data stems
# from neighbourhood_group, last_review, and reviews_per_month
print(data.isna().sum())
data.info()  # info() prints its report directly and returns None, so no print() is needed
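
Raw missing counts are easier to compare as fractions of the column length. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the real listings data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "neighbourhood_group": [np.nan, np.nan, "Manhattan", np.nan],
    "number_of_reviews": [0, 12, 3, 0],
    "reviews_per_month": [np.nan, 1.5, 0.2, np.nan],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same one-liner would presumably rank the neighbourhood group and review columns at the top.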

Turning to the quantitative features, we'll drop the id column and look at the standard summary statistics.

The incredibly large maximum values, compared to the relatively small upper-quartile values, together with the large standard deviations, suggest that this data contains many outliers.

In [None]:
numeric_features = data.select_dtypes(include = ['int64', 'float64'])
nonnumeric_features = data.select_dtypes(include = ['object'])
numeric_features = numeric_features.drop(['id'], axis = 1)
data = data.drop(['id'], axis = 1)
numeric_features.describe()

We'll first deal with some of the missing data by looking at how the number of reviews, the last review date, and reviews per month relate to each other when one of them is missing.

Here we see that when the number of reviews is zero, the other two are NaN, so we'll fix this by setting reviews per month to zero whenever this occurs.

In [None]:
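# If the number of reviews is 0, the last review should be NaN
no_reviews = data['number_of_reviews'] == 0
no_last_review = data['last_review'].isna()
# Named masks (or parentheses) are needed because & binds more tightly than ==
print(data.loc[no_reviews & no_last_review, ['number_of_reviews', 'last_review']].shape)
print(data.loc[no_last_review, ['last_review']].shape)
print(data.loc[no_reviews, ['number_of_reviews']].shape)

# Investigating whether number_of_reviews == 0 also leads to reviews_per_month being NaN
print(data.loc[data['reviews_per_month'].isna(), ['number_of_reviews', 'last_review']].shape)
# Assign through .loc so the fill changes data itself rather than a temporary copy
data.loc[no_reviews & data['reviews_per_month'].isna(), 'reviews_per_month'] = 0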

In [None]:
# Parenthesize the comparison: & binds more tightly than ==
unreviewed = (data['number_of_reviews'] == 0) & data['last_review'].isna()
data.loc[unreviewed, 'reviews_per_month'] = 0
data.loc[unreviewed, ['number_of_reviews', 'last_review', 'reviews_per_month']]
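
Boolean masks that mix `==` and `&` are sensitive to operator precedence, because Python evaluates `&` before `==`. The sketch below uses a made-up three-row frame (the `toy` data is an assumption for illustration) to show how the two spellings can select different rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "number_of_reviews": [0, 0, 5],
    "last_review": [np.nan, "2020-01-01", np.nan],
})

# Without parentheses this parses as: number_of_reviews == (0 & last_review.isna()),
# so the last_review condition never actually constrains the result.
unparenthesized = toy["number_of_reviews"] == 0 & toy["last_review"].isna()

# With parentheses, both conditions must hold for a row to be selected.
parenthesized = (toy["number_of_reviews"] == 0) & toy["last_review"].isna()

print(unparenthesized.tolist())
print(parenthesized.tolist())
```

In the toy frame the second row has zero reviews but a last-review date, so only the parenthesized mask excludes it.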


Before filtering out the many outliers, I first wanted to get a good handle on where the Airbnb listings in this dataset are located. I found the neighbourhood feature to be of mixed usefulness for figuring that out.

In [None]:
# Let us know what locations we are dealing with
data['neighbourhood'].unique()

Then, using the latitude and longitude coordinates, I plotted the Airbnb locations on a map of the US, with the size of each dot proportional to the listing's availability throughout the year. I did this using Python's Basemap library.

In [None]:
fig = plt.figure(figsize = (12,9))
m = Basemap(width=12000000, height = 9000000, projection = 'lcc', lat_1 = 45., lat_2 = 55, lat_0 = 50, lon_0 = -107.)

m.shadedrelief()
m.drawstates(color = 'black')
m.drawcountries(color = 'black')
m.scatter(data['longitude'].tolist(), data['latitude'].tolist(), latlon=True, c = ['red'], s = data['availability_365']/4,
         marker = 'o', alpha = .4, edgecolor = 'k', zorder = 2)

plt.title('Airbnb Listing Density in US', fontsize = 20)
plt.show()

I wanted to get a good idea of which states most of the Airbnbs are in, because the 'city' values are often really county names or, in the case of Rhode Island, a state name.
To get a uniform sense of where the Airbnb listings are, I decided to reverse geocode the latitude and longitude coordinates into state information using Nominatim.
After getting the state information, I added it to the dataset as an engineered state feature.

In [None]:
# Reverse geocode one listing per city, then map that state onto every listing in the city
# This lets us feature engineer a state column
locator = Nominatim(user_agent="myGeocoder")
# Nominatim's usage policy allows at most one request per second, so throttle the calls
reverse = RateLimiter(locator.reverse, min_delay_seconds=1)
states = {}
for x in data['city'].unique():
    house = data[data['city'] == x].iloc[0]
    geom = str(house['latitude']) + "," + str(house['longitude'])
    location = reverse(geom)
    states[x] = location.raw['address']['state']
data['state'] = data['city'].map(states)

From here I was able to visualize how many Airbnb listings were in each 'city' and each state. 

In [None]:
g = sns.catplot(x = 'city', data = data, kind = 'count', palette = 'crest', label = 'big', aspect = 3, order = data['city'].value_counts().index)
plt.title("Number of Listings in Each City")
plt.xticks(rotation = 45)
plt.show()

In [None]:
g = sns.catplot(x = 'state', data = data, palette = "crest", kind = 'count', aspect= 2, label = 'big', order = data['state'].value_counts().index)
plt.title("Number of Listings in Each State")
plt.xticks(rotation = 45)
plt.show()

I then turned to focus on the outlier data.
I did this by first visualizing the density curves for the numeric features. 

In [None]:
f = plt.figure(figsize=(9, 9))
gs = f.add_gridspec(3, 3)
index = 0
for i in range(3):
    for j in range(3):
        ax = f.add_subplot(gs[i, j])
        sns.kdeplot(data = data, x  = numeric_features.columns[index], ax= ax).set_title(numeric_features.columns[index] + " density")
        index += 1
plt.subplots_adjust(wspace=1, hspace=1)
        

For price, I only kept listings whose price lies within 1.5 times the IQR of the lower and upper quartiles.
When I applied the same rule to the other features, I was left with only 67 listings. Thus, for calculated host listings, reviews per month, and number of reviews, I instead read cutoffs off the density curves above, keeping roughly the lower quarter of each x-axis (below 125, 10, and 250 respectively).
I also restricted minimum nights to less than 2 months (60 nights).

In [None]:
iqr = data['price'].quantile(.75)-data['price'].quantile(.25)
lower = data['price'].quantile(.25)-1.5*(iqr)
upper = data['price'].quantile(.75)+1.5*(iqr)
without_outliers = data[data['price'].between(lower, upper)]  # both bounds are inclusive by default
without_outliers = without_outliers[without_outliers['calculated_host_listings_count'] < 125]
without_outliers = without_outliers[without_outliers['reviews_per_month'] < 10]
without_outliers = without_outliers[without_outliers['number_of_reviews'] < 250]
without_outliers = without_outliers[without_outliers['minimum_nights'] < 60]
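
To make the 1.5 × IQR rule concrete, here is a small worked sketch on a made-up price series (the numbers are assumptions chosen for easy arithmetic, not values from the dataset):

```python
import pandas as pd

# Made-up nightly prices with one obvious outlier.
prices = pd.Series([40, 60, 80, 100, 120, 150, 5000])

q1 = prices.quantile(0.25)   # lower quartile
q3 = prices.quantile(0.75)   # upper quartile
iqr = q3 - q1

# Fences sit 1.5 IQRs outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() keeps values inside the fences (bounds included by default).
kept = prices[prices.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Since prices cannot be negative, the lower fence is often below zero, as it is here, so in practice the rule mostly trims the expensive tail.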

Revisualizing the dataset without outliers, we see multimodal distributions for minimum nights and availability_365. This makes sense for minimum nights: the density peaks at a single-digit value and again around 30, suggesting two groups of listings, daily/weekly rentals and monthly ones.

By the looks of it, number of reviews, reviews per month, and calculated host listings all follow power-law distributions.

In [None]:
f = plt.figure(figsize=(9, 9))
gs = f.add_gridspec(3, 3)
index = 0
for i in range(3):
    for j in range(3):
        ax = f.add_subplot(gs[i, j])
        sns.kdeplot(data = without_outliers, x  = numeric_features.columns[index], ax= ax).set_title(numeric_features.columns[index] + " density")
        index += 1
plt.subplots_adjust(wspace=1, hspace=1)
        

With this filtered dataset, I now begin to look at median numerical feature data in each city and state, to get a better sense of how the airbnb listings vary by state. 

In [None]:

order = without_outliers.groupby(['city'])['price'].aggregate(np.median).reset_index().sort_values('price', ascending = False)
sns.catplot(x = 'price', y = 'city', kind = 'bar', order = order['city'], palette = "crest", orient = 'h', data =without_outliers, estimator = median)
plt.title("Median Listing Price in each city")
plt.show()

In [None]:

order = without_outliers.groupby(['state'])['price'].aggregate(np.median).reset_index().sort_values('price', ascending = False)
sns.catplot(x = 'state', y = 'price', kind = 'bar', aspect = 2, order = order['state'], palette = "crest", orient = 'v', data =without_outliers, estimator = median)
plt.title("Median Listing Price in each State")
plt.xticks(rotation = 45)
plt.show()

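Next, I compare availability and review counts across states, look at price against review count for each room type, and then prepare a correlation matrix for the numeric features of our filtered dataset. Unfortunately, there are no strong predictors of price among the numeric features.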

In [None]:
order = without_outliers.groupby(['state'])['availability_365'].aggregate(np.mean).reset_index().sort_values('availability_365', ascending = False)
sns.catplot(x = 'availability_365', y = 'state', kind = 'bar', order = order['state'], palette = "crest", orient = 'h', data =without_outliers)
plt.title("Mean Availability in each state")
plt.show()

In [None]:
order = without_outliers.groupby(['state'])['number_of_reviews'].aggregate(np.median).reset_index().sort_values('number_of_reviews', ascending = False)
sns.catplot(x = 'number_of_reviews', y = 'state', kind = 'bar', order = order['state'], palette = "crest", orient = 'h', data =without_outliers, estimator = median)
plt.title("Median Review Count in each state")
plt.show()

In [None]:
order = without_outliers.groupby(['state'])['availability_365'].aggregate(np.median).reset_index().sort_values('availability_365', ascending = False)
sns.catplot(x = 'availability_365', y = 'state', kind = 'bar', order = order['state'], palette = "crest", orient = 'h', data =without_outliers, estimator = median)
plt.title("Median Availability in each state")
plt.show()

In [None]:
g = sns.FacetGrid(without_outliers, col = 'room_type')
g.map(sns.scatterplot, 'number_of_reviews', 'price')

In [None]:
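# Restrict to numeric columns; recent pandas versions raise an error if corr() sees text columns
sns.heatmap(without_outliers.select_dtypes(include = 'number').corr(), annot=True, fmt = ".2f", cmap = "crest")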

I then wondered whether the name of the Airbnb listing (which usually also includes a description of the place) could help us predict price. Using the WordCloud library and an image mask of a house, I was able to generate a house-shaped word cloud.

In [None]:
# Configure the house image mask
house_mask = np.array(Image.open("../input/image-of-house/45180.png"))
print(house_mask.shape)
def transform_format(val):
    if val == 0:
        return 255
    elif val == 247:
        return 255
    else:
        return val
transformed_house_mask = np.ndarray((house_mask.shape[0],house_mask.shape[1]), np.int32)
for i in range(len(house_mask)):
    transformed_house_mask[i] = list(map(transform_format, house_mask[i]))

In [None]:
# Build the word cloud
contents = " ".join(name for name in without_outliers[without_outliers['name'].notna()]['name'].values.astype(str))
stopwords = set(STOPWORDS)
wc = WordCloud(background_color = "white", stopwords=stopwords, min_font_size = 8, max_words=100, mask = transformed_house_mask,contour_width=3, contour_color='black').generate(contents)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("Words Associated with Airbnb Names")
plt.show()

Descriptions like 'beautiful', 'downtown', and 'beach' could be associated with higher-priced Airbnbs.
We will now begin to build the model for predicting Airbnb listing prices. First, though, I want to quantify the words shown in the word cloud.
We'll begin by dropping all listings with no description.

In [None]:
without_outliers = without_outliers[without_outliers['name'].notna()]

We'll then convert the room type and state features into one-hot numeric representations and split our data into training and test sets.

In [None]:
# Training Data
features = ['name', 'calculated_host_listings_count', 'room_type', 'state', 'minimum_nights','availability_365']
X = without_outliers[features]
X = pd.get_dummies(X, columns = ['room_type', 'state'])
X_train, X_test, y_train, y_test = train_test_split(X, without_outliers['price'], test_size = .2)

We will now convert the names of the Airbnbs into integer token sequences (the input format an embedding layer expects), and then pad them so that they are all the same length.

In [None]:
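# Cap the vocabulary at 40000 so every token id fits the embedding layer's input size
tokenizer = Tokenizer(num_words = 40000)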
tokenizer.fit_on_texts(X['name'].values)
name_train_embed = tokenizer.texts_to_sequences(texts = X_train['name'])
name_test_embed = tokenizer.texts_to_sequences(texts = X_test['name'])
train_embed = keras.preprocessing.sequence.pad_sequences(name_train_embed, maxlen=120)
test_embed = keras.preprocessing.sequence.pad_sequences(name_test_embed, maxlen=120)
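
To illustrate what tokenizing and padding do to a listing name, here is a simplified pure-Python sketch. The helper names `fit_word_index` and `to_padded_sequences` are hypothetical, and the sketch only mimics the general behaviour (frequency-ranked ids starting at 1, zeros padded at the front, long sequences truncated from the front); it skips the punctuation filtering the Keras `Tokenizer` also performs.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, maxlen):
    """Encode texts as id lists, keep the last maxlen ids, and left-pad with 0."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]                          # truncate from the front
        padded.append([0] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

# Made-up listing names for illustration.
names = ["Cozy downtown loft", "Downtown studio", "Beach house near downtown pier"]
word_index = fit_word_index(names)
padded = to_padded_sequences(names, word_index, maxlen=4)
for row in padded:
    print(row)
```

Id 0 is reserved for padding, which is why real word ids start at 1.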

We will now build a sequence model that produces a numeric output from the name of the Airbnb. The padded token sequences are passed through an embedding layer, flattened, and then run through a dense layer.

In [None]:
layers = keras.layers
deep_inputs = layers.Input(shape=(120,))
# Embed each of the 120 token ids into 10 dimensions (the layer can index ids below 40000)
embedding = layers.Embedding(40000, 10)(deep_inputs)
embedding = layers.Flatten()(embedding)
embedding = layers.Dense(units = 15, activation = 'relu')(embedding)
embed_out = layers.Dense(units = 1,activation = 'relu')(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
# Price prediction is a regression task, so track RMSE rather than accuracy
deep_model.compile(loss='mse', optimizer='adam', metrics=[keras.metrics.RootMeanSquaredError()])
deep_model.summary()

In [None]:
deep_model.fit(train_embed, y_train, epochs = 10, verbose = 1)
_, test_rmse = deep_model.evaluate(test_embed, y_test)
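
As a rough check on the size of the network defined above, the parameter count can be worked out by hand from the layer sizes in the code (vocabulary 40000, embedding width 10, sequence length 120, hidden width 15). This is plain arithmetic, assuming the usual weights-plus-biases count for dense layers, and should agree with what `deep_model.summary()` reports:

```python
vocab_size, embed_dim, seq_len, hidden_units = 40000, 10, 120, 15

# Embedding layer: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Flatten turns (seq_len, embed_dim) into a single vector; it has no parameters.
flattened_width = seq_len * embed_dim

# Dense layers: one weight per input-output pair plus one bias per output unit.
hidden_params = flattened_width * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Nearly all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size.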

We'll now evaluate the RMSE of this sequence model on the test set.

In [None]:
predictions_test = deep_model.predict(test_embed)
# Take the square root ourselves; the squared=False argument is deprecated in newer scikit-learn
rms = np.sqrt(mean_squared_error(y_test, predictions_test))
print(rms)
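
An RMSE is easier to judge next to a naive baseline. The sketch below computes RMSE by hand on made-up numbers and compares it with always predicting the mean (the arrays are assumptions for illustration; with real data the baseline would use the training-set mean price):

```python
import numpy as np

y_true = np.array([100.0, 150.0, 80.0, 200.0])   # made-up prices
y_pred = np.array([110.0, 140.0, 90.0, 170.0])   # made-up model predictions

# RMSE: square the errors, average them, take the square root.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Baseline: predict the same constant (the mean) for every listing.
baseline = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline) ** 2))

print(round(rmse, 2), round(baseline_rmse, 2))
```

A model whose RMSE is not clearly below the constant baseline has learned little from its inputs.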

We'll now feature engineer an 'output' feature, which is a numeric representation of how the name of an Airbnb could relate to its price.

In [None]:
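# Work on explicit copies to avoid pandas' SettingWithCopyWarning on the split frames
X_train = X_train.copy()
X_test = X_test.copy()
# predict() returns an (n, 1) array, so flatten it into a single column
X_train['output'] = deep_model.predict(train_embed).ravel()
X_test['output'] = predictions_test.ravel()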

We'll now feature scale minimum_nights, availability_365, the host_listings count, and the output feature, and then we'll drop the name, keeping only numeric features

In [None]:

col_names = ['availability_365','minimum_nights',
       'calculated_host_listings_count', 'output']
scaled_X_train = X_train.copy()
scaled_X_test = X_test.copy()

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler().fit(scaled_X_train[col_names].values)
scaled_X_train[col_names] = scaler.transform(scaled_X_train[col_names].values)
scaled_X_test[col_names] = scaler.transform(scaled_X_test[col_names].values)

# Drop the raw text column, keeping only numeric features
scaled_X_train = scaled_X_train.drop(columns = 'name')
scaled_X_test = scaled_X_test.drop(columns = 'name')
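
A common convention when standardizing is to compute the mean and standard deviation on the training set only and reuse them for the test set, so both sets share one scale. A minimal NumPy sketch on made-up values (the arrays are assumptions for illustration):

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])   # made-up training feature
test = np.array([[2.0], [5.0]])           # made-up test feature

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel(), test_scaled.ravel())
```

The test value 5.0 lands well above every scaled training value, which is the information a separately fitted test scaler would erase.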

We'll now evaluate which model performs best.
Here we'll compare a random forest regressor with an SGD regressor and a Lasso regression. I was going to try an SVR as well, but it did not scale well with the data.
Comparing the RMSE, the RF regressor is far better than the SGD regressor.

In [None]:
# RMSE is computed as sqrt(MSE); the squared=False argument is deprecated in newer scikit-learn
rfr = RandomForestRegressor(max_depth = 5)
rfr.fit(scaled_X_train,y_train)
print("Random Forest regressor RMSE: " + str(np.sqrt(mean_squared_error(y_test, rfr.predict(scaled_X_test)))))

sgd = SGDRegressor(max_iter=2000, tol=1e-4)
sgd.fit(scaled_X_train, y_train)
print("SGD Regressor RMSE: " +  str(np.sqrt(mean_squared_error(y_test, sgd.predict(scaled_X_test)))))

lasso = Lasso()
lasso.fit(scaled_X_train, y_train)
print("Lasso RMSE: " + str(np.sqrt(mean_squared_error(y_test, lasso.predict(scaled_X_test)))))