# New York City: An Analysis of AirBNB Prices

In [4]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import folium
import pandas as pd
import random as rnd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor


We have a dataset of New York City AirBNB listings for 2019.

In [5]:
# Forward slashes avoid invalid escape sequences such as "\d" and work on every OS
data = pd.read_csv("./data/AB_NYC_2019.csv")
data.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | | | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [11]:
n = data.shape[0]
n

48895

We will be using the `folium` package to plot our location data:

In [12]:
m = folium.Map(location = [40.64749, -73.97237],
               tiles = 'CartoDB positron',
               )
m

In [13]:
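# Price terciles, computed once on the price column rather than on every call
lower_tercile = data['price'].quantile(1/3)
upper_tercile = data['price'].quantile(2/3)

def ColourFunc(price):
    """Map a price to a marker colour by tercile: green (low), yellow (mid), red (high)."""
    if price < lower_tercile:
        return 'green'
    elif price < upper_tercile:
        return 'yellow'
    else:
        return 'red'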

We can add a random sample of listings to our map.

In [14]:
sample_size = 500
# Draw distinct row positions; rnd.sample never repeats an index or runs past the last row
sample = rnd.sample(range(n), sample_size)
for i in sample:  # iterate over the sampled rows, not the first 500 rows
    row = data.iloc[i]
    name = str(row.loc['name'])
    price = float(row.loc['price'])
    neighbourhood = str(row.loc['neighbourhood'])
    borough = str(row.loc['neighbourhood_group'])
    location = list(row.loc[['latitude', 'longitude']])
    roomtype = str(row.loc['room_type'])

    # HTML shown when a marker is clicked; <br> forces the line breaks
    popuptext = f"""<h4>{name}</h4>
    <p>Price: ${price:.2f}<br>
    Area: {neighbourhood}, {borough}<br>
    Room Type: {roomtype}</p>"""

    # Private rooms get a smaller marker than every other room type
    if roomtype == 'Private room':
        size = 15
    else:
        size = 30

    folium.Circle(
        radius=size,
        location=location,
        popup=popuptext,
        color=ColourFunc(price),
        fill=False,
    ).add_to(m)
m

## Machine Learning Model
Our target is the price of renting the property. We want to build a regression model that predicts this value from the other features available for each listing: specifically the location data (`neighbourhood_group`, `neighbourhood`, `latitude`, `longitude`) and the type of listing (`room_type`, `minimum_nights`). We won't use the review features or any other feature that exists only once a property is listed on AirBNB. The motivation is to let people use the model to estimate the price they could charge for their property if they were to list it on AirBNB.

### Model specifics
* Model: Random Forest Regression
* Measure: Root Mean Squared Error
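
Root Mean Squared Error is the square root of the average squared difference between predicted and actual prices, so it is expressed in the same units as the price. A minimal sketch of the calculation follows; the helper name `rmse` and the predicted values are made up for illustration, while the actual prices are the first four from the table above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [149, 225, 150, 89]     # prices of the first four listings
predicted = [139, 245, 150, 99]  # hypothetical model output, not real results

# Errors are 10, -20, 0, -10, so the mean squared error is 600 / 4 = 150
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly mispriced listings raise this measure more than many small misses would.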

### Data
We have 48895 instances to train and test with. We will use a 90-5-5 split for our training, validation, and test sets.
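
As a rough check on what a 90-5-5 split means in rows, the arithmetic can be sketched as below. The rounding choices here (round the training share down, then halve the remainder) are an assumption for illustration; a given splitting function may round differently by a row.

```python
n = 48895  # number of listings in the dataset

n_train = int(0.9 * n)                # training share, rounded down
n_holdout = n - n_train               # rows left for validation and test
n_validation = n_holdout // 2         # half of the holdout
n_test = n_holdout - n_validation     # the other half

print(n_train, n_validation, n_test)  # 44005 2445 2445
```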

In [None]:
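# Separate the data into training (90%), validation (5%), and test (5%) sets
seed = 1606 # we will use this random state wherever there is a random process
# train_test_split returns two frames, so unpack both rather than keeping the list
training_data, holdout_data = train_test_split(data, train_size = 0.9, random_state = seed)
# Split the remaining 10% evenly into the validation and test sets
validation_data, test_data = train_test_split(holdout_data, train_size = 0.5, random_state = seed)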
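
The notebook imports `ColumnTransformer`, `OneHotEncoder`, `StandardScaler`, and `RandomForestRegressor` but stops before using them. One way they might be combined is sketched below. This is an assumption about the intended design, not the notebook's actual model: the `Pipeline` wrapper, the step names, and `n_estimators=10` are illustrative choices, and the toy frame holds only the five rows shown earlier so that the block runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature groups named in the modelling section
categorical = ['neighbourhood_group', 'neighbourhood', 'room_type']
numeric = ['latitude', 'longitude', 'minimum_nights']

# Stand-in for training_data: the five listings from data.head()
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Manhattan'],
    'neighbourhood': ['Kensington', 'Midtown', 'Harlem', 'Clinton Hill', 'East Harlem'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Entire home/apt'],
    'latitude': [40.64749, 40.75362, 40.80902, 40.68514, 40.79851],
    'longitude': [-73.97237, -73.98377, -73.9419, -73.95976, -73.94399],
    'minimum_nights': [1, 1, 3, 1, 10],
    'price': [149, 225, 150, 89, 80],
})

# One-hot encode the categorical columns and standardise the numeric ones
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('scale', StandardScaler(), numeric),
])

# Chain preprocessing and the forest so both are fitted on training rows only
model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestRegressor(n_estimators=10, random_state=1606)),
])

X, y = toy[categorical + numeric], toy['price']
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Random forests do not depend on feature scale, so the `StandardScaler` step is optional here; it is included only because the notebook imports it. Setting `handle_unknown='ignore'` keeps prediction from failing on a neighbourhood that appears in the validation or test set but not in the training set.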