# Data Exploration
The Data Cleansing notebook produced a cleaner version of our NYC Apartments data, which we can now explore. This notebook examines relationships between the predictors and the response (price), as well as among the predictors themselves. This will help later when we build a model to predict the price of an apartment.

## Libraries

In [116]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Range1d
output_notebook()

## Read in the data

In [106]:
df = pd.read_csv('housing_cleaned.csv', index_col='id').drop('Unnamed: 0', axis=1)
df.describe()

statistic,area,bedrooms,bikeScore,distanceToNearestIntersection,has_image,has_map,price,repost_of,transitScore,walkScore,...,includes_area,year,month,dow,day,hour,bedrooms_filled,advertises_no_fee,is_repost,sideOfStreetEncoded
count,802.0,2799.0,3127.0,3104.0,3128.0,3128.0,3128.0,1022.0,1894.0,3127.0,...,3128.0,3128.0,3128.0,3128.0,3128.0,3128.0,3023.0,3128.0,3128.0,3128.0
mean,964.013716,2.191854,75.474576,42.499535,0.882353,1.0,2810.990729,6467005000.0,95.008448,88.777103,...,0.256394,2019.0,6.0,4.762148,21.762148,11.478581,2.081376,0.150575,0.326726,0.503517
std,419.766603,0.972821,15.472363,61.244591,0.322241,0.0,1550.892282,781580600.0,11.253405,17.043335,...,0.436712,0.0,0.0,0.684012,0.684012,7.150593,1.05161,0.357692,0.469091,0.500068
min,230.0,1.0,6.0,0.0,0.0,1.0,0.0,2482441000.0,31.0,1.0,...,0.0,2019.0,6.0,4.0,21.0,0.0,0.0,0.0,0.0,0.0
25%,733.0,1.0,65.0,0.063578,1.0,1.0,2050.0,6520257000.0,97.0,89.0,...,0.0,2019.0,6.0,4.0,21.0,5.0,1.0,0.0,0.0,0.0
50%,850.0,2.0,80.0,22.314235,1.0,1.0,2585.0,6835913000.0,100.0,95.0,...,0.0,2019.0,6.0,5.0,22.0,12.0,2.0,0.0,0.0,1.0
75%,1100.0,3.0,87.0,66.62823,1.0,1.0,3200.0,6892385000.0,100.0,98.0,...,1.0,2019.0,6.0,5.0,22.0,18.0,3.0,0.0,1.0,1.0
max,3400.0,6.0,97.0,769.854208,1.0,1.0,28500.0,6916887000.0,100.0,100.0,...,1.0,2019.0,6.0,6.0,23.0,23.0,6.0,1.0,1.0,1.0


## Assess the amount of NULL data points
This data is far from perfectly clean; we are scraping from Craigslist, after all. The data pipeline is Craigslist Apartment Data -> Enrich with Mapquest Data -> Enrich with Walk Score data, and each step can fail to return a value, so we tend to have a lot of missing data points. Let's quantify this.

In [107]:
# Create a DF where column 0 is the null count per feature and column 1 is the fraction of rows that are null
nulls = pd.concat([df.isnull().sum(axis = 0), df.isnull().sum(axis = 0)/len(df)], axis=1)
nulls

feature,0,1
address,0,0.0
area,2326,0.743606
bedrooms,329,0.105179
bikeScore,1,0.00032
datetime,0,0.0
distanceToNearestIntersection,24,0.007673
has_image,0,0.0
has_map,0,0.0
name,0,0.0
postalCode,0,0.0
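
The null summary above has unnamed columns (`0` and `1`). As a minimal sketch, the same summary can be built with descriptive column names and sorted so the worst features appear first. The `toy` DataFrame and the names `null_count` and `null_fraction` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the apartments DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    'area': [700.0, np.nan, np.nan, np.nan],
    'bedrooms': [1.0, 2.0, np.nan, 3.0],
    'price': [2700, 2600, 2875, 2800],
})

# isnull().sum() counts nulls per column; isnull().mean() gives the fraction directly
null_summary = pd.DataFrame({
    'null_count': toy.isnull().sum(),
    'null_fraction': toy.isnull().mean(),
}).sort_values('null_fraction', ascending=False)

print(null_summary)
```

Naming the columns makes later filtering, such as selecting features above a missing-value threshold, easier to read.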


Because some features, such as area, have many null values, we are unlikely to use them directly. However, we were able to derive features from them, such as "includes_area", which states whether the post includes an area. For now we will drop any feature with more than 10% missing values.
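
As a rough illustration of how an indicator like "includes_area" can be derived from a sparse column, here is a minimal sketch. The actual derivation lives in the Data Cleansing notebook and is not shown here, so this is an assumption about the approach; the `toy` DataFrame is made up.

```python
import numpy as np
import pandas as pd

# Toy listings: two posts state an area, two do not (values are made up)
toy = pd.DataFrame({'area': [850.0, np.nan, 1100.0, np.nan]})

# 1 if the post states an area, 0 otherwise
toy['includes_area'] = toy['area'].notnull().astype(int)

print(toy)
```

This is consistent with the summary statistics earlier in the notebook, where the mean of includes_area (0.256394) matches the share of rows with a non-null area (802 of 3128).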

## Drop features with missing values

In [108]:
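# Maximum tolerated fraction of missing values per feature
threshold = 0.1
# Keep only the columns whose non-null count is at least 90% of the rows
df.dropna(thresh=(1-threshold)*len(df), axis=1, inplace=True)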
nulls = pd.concat([df.isnull().sum(axis = 0), df.isnull().sum(axis = 0)/len(df)], axis=1)
nulls

feature,0,1
address,0,0.0
bikeScore,1,0.00032
datetime,0,0.0
distanceToNearestIntersection,24,0.007673
has_image,0,0.0
has_map,0,0.0
name,0,0.0
postalCode,0,0.0
price,0,0.0
sideOfStreet,0,0.0
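
To make the meaning of `thresh` concrete: it is the minimum number of non-null values a column needs in order to be kept. The sketch below, on a made-up 20-row `toy` DataFrame, shows an explicit boolean-mask version next to the `dropna` call. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# 20 toy rows with 75%, 15%, 5% and 0% missing values respectively
toy = pd.DataFrame({
    'area': [800.0] * 5 + [np.nan] * 15,
    'bedrooms': [2.0] * 17 + [np.nan] * 3,
    'bikeScore': [80.0] * 19 + [np.nan],
    'price': [2500] * 20,
})

threshold = 0.1

# Explicit version: keep columns whose null fraction is at most the threshold
null_fraction = toy.isnull().mean()
kept = toy.loc[:, null_fraction <= threshold]

# dropna version: a column needs at least 18 non-null values (90% of 20 rows)
min_non_null = 18
same = toy.dropna(axis=1, thresh=min_non_null)

print(list(kept.columns))
```

Here `area` and `bedrooms` are dropped, mirroring what happens to those two features in the real data.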


In [129]:
# Shrunk DF with only features we will plot
df_shrunk = df[['bikeScore', 'distanceToNearestIntersection', 'has_image', 'has_map', 'sideOfStreet',
                'walkScore', 'where', 'includes_area', 'year', 'month', 'dow', 'day', 'hour', 'bedrooms_filled',
                'advertises_no_fee', 'is_repost', 'price']]

In [131]:
df_shrunk

id,bikeScore,distanceToNearestIntersection,has_image,has_map,sideOfStreet,walkScore,where,includes_area,year,month,dow,day,hour,bedrooms_filled,advertises_no_fee,is_repost,price
6911917730,64.0,0.000000,1,1,R,92.0,bed-stuy,0,2019,6,4,21,14,3.0,1,0,2700
6917210186,88.0,203.483553,1,1,L,98.0,harlem / morningside,1,2019,6,4,21,14,1.0,0,1,2600
6914527887,79.0,0.013114,1,1,R,94.0,bed-stuy,0,2019,6,4,21,14,3.0,0,0,2875
6914529944,79.0,0.013114,1,1,R,94.0,bed-stuy,0,2019,6,4,21,14,3.0,1,0,2800
6917173545,81.0,61.301497,1,1,L,93.0,,1,2019,6,4,21,14,1.0,1,1,3500
6915622461,95.0,74.557864,1,1,L,100.0,east village,0,2019,6,4,21,14,1.0,1,1,3000
6915400788,87.0,1.049047,1,1,R,96.0,bushwick,0,2019,6,4,21,14,2.0,0,0,2300
6915405145,77.0,0.000000,1,1,L,94.0,bushwick,0,2019,6,4,21,14,3.0,1,0,2875
6917234115,86.0,0.000000,1,1,L,99.0,"jersey city, nj",1,2019,6,4,21,14,2.0,0,0,4545
6908895228,65.0,81.224653,1,1,L,88.0,,1,2019,6,4,21,14,1.0,0,0,2065
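
One quick way to get a first look at how numeric predictors relate to price is a correlation ranking. The sketch below uses a handful of values copied from the rows displayed above, so the numbers it produces say nothing reliable about the full dataset. It is an illustrative assumption about how such a check could look, not a step from the original notebook.

```python
import pandas as pd

# A few columns from the ten rows displayed above (far too few rows for real conclusions)
toy = pd.DataFrame({
    'bikeScore': [64.0, 88.0, 79.0, 79.0, 81.0, 95.0, 87.0, 77.0, 86.0, 65.0],
    'walkScore': [92.0, 98.0, 94.0, 94.0, 93.0, 100.0, 96.0, 94.0, 99.0, 88.0],
    'bedrooms_filled': [3.0, 1.0, 3.0, 3.0, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0],
    'sideOfStreet': ['R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'L'],
    'price': [2700, 2600, 2875, 2800, 3500, 3000, 2300, 2875, 4545, 2065],
})

# Pearson correlation only applies to numeric columns, so drop the categorical one
numeric = toy.select_dtypes(include='number')

# Correlation of each numeric predictor with price, strongest positive first
corr_with_price = numeric.corr()['price'].drop('price').sort_values(ascending=False)

print(corr_with_price)
```

Categorical features such as sideOfStreet or where need a different treatment, for example comparing price distributions per category.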


## Price Over Time
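Let's look at how the median price varies over time for NYC apartments. Note that, per the summary statistics above, all listings were posted between June 21 and June 23, 2019, so the daily series contains only a few points.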

In [114]:
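# Parse the posting timestamp so it can be used as a time index
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')
df_price_per_day = df.set_index('datetime')[['price']]
# Median price per calendar day ('D' is the canonical daily frequency alias)
df_price_per_day = df_price_per_day.resample('D').median().reset_index()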
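
To show what the daily resampling step does, here is a minimal sketch on made-up timestamps and prices. Only the pattern (set a datetime index, resample by day, take the median) comes from the cell above; the `toy` data is an assumption.

```python
import pandas as pd

# Five toy listings spread over two days (values are made up)
toy = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2019-06-21 09:00:00', '2019-06-21 14:30:00', '2019-06-21 20:15:00',
        '2019-06-22 08:45:00', '2019-06-22 19:00:00',
    ]),
    'price': [2000, 2600, 3500, 2400, 3000],
})

# Group rows into calendar-day bins and take the median price in each bin
daily_median = toy.set_index('datetime')[['price']].resample('D').median().reset_index()

print(daily_median)
```

The median is used rather than the mean because it is less sensitive to extreme listings, such as the 28500 maximum price in the summary statistics.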

In [124]:
# Create the plot
source = ColumnDataSource(df_price_per_day)
p = figure(title="NYC Apartment Median Price Over Time", sizing_mode='stretch_width', x_axis_type='datetime')
p.line(x='datetime', y='price', line_width=2, color='#2e485c', source=source)
p.y_range = Range1d(0, df_price_per_day['price'].max()*1.05)
show(p)