# New York City Property Sales Analysis

## Introduction

This notebook analyzes a 12-month period of property sales in New York City. It uses the "NYC Property Sales" dataset published on Kaggle, enriched with location data obtained by geocoding the ZIP codes through OpenStreetMap. This is my first published notebook; I hope you like it.

## Context

Some years ago, I had the opportunity to visit New York on one of my vacations, and I was amazed by how different it was from anything I had seen before. The streets, parks, skyscrapers, stores and restaurants all seem to contribute to the uniqueness of the City that Never Sleeps. Wandering its streets for a while was enough to make me wonder how much it costs to live in a place like that.

Given the city's well-known high density of buildings, we can expect to find interesting things by first analysing the sale prices of properties in New York. That is the purpose of this notebook. Here, a 12-month dataset of property sales is analysed, involving the following features:

* Location Features: Borough, Neighborhood, Block, Lot, Address, Zip Code and Geographical Coordinates;
* Class and Category of the Building;
* Physical Properties of the Building: Year Built, Land and Gross Square Feet, Residential and Commercial Units;
* Sales Information: Sales Price and Sales Date.

By analyzing these features, we can define some goals for the Analysis.

## Objectives

A good way to set goals for an analysis is to pose questions, based on the available features, that the experiments on the data should answer. For this work, we can try to answer:

* How does the Sales Price vary according to the type of Building?
* How does the Sales Price vary with Time?
* What is the Average Sales Price in different locations of the City?

## Libraries

First, some specific packages used in future sections are installed.

### Geocoder

Geocoder is a simple framework that makes geocoding with Python easy by interacting with multiple online providers, such as Google and OpenStreetMap; the latter is used in this notebook.

In [None]:
!pip install geocoder

### Kepler.gl

Kepler.gl is an open-source geospatial analysis tool that features excellent map visualizations with fast data integration. It runs standalone via the [website](https://kepler.gl/), or inside a notebook through a specific integration. The installation process used here was inspired by a notebook by [Prageeth Anjula](https://www.kaggle.com/praanj), which can also be found on Kaggle [here](https://www.kaggle.com/praanj/kelper-gl-geospatial-visualization-on-kaggle?scriptVersionId=37216419).

In [None]:
!pip install keplergl
!conda install -y -c conda-forge/label/cf202003 nodejs
!jupyter labextension install @jupyter-widgets/jupyterlab-manager keplergl-jupyter

### Imports

Here all the libraries used throughout this notebook are imported.

In [None]:
import numpy as np 
from datetime import datetime
import pandas as pd 
import matplotlib.pyplot as plt 
import matplotlib.ticker as ticker
import seaborn as sns 
sns.set_style('darkgrid')
from statsmodels.graphics.tsaplots import plot_acf
import geocoder
from keplergl import KeplerGl

Here, warnings are disabled.

In [None]:
import warnings
warnings.filterwarnings("ignore")

## Read Data

Here the dataset is read and some information about its format is displayed.

In [None]:
data = pd.read_csv('../input/nyc-property-sales/nyc-rolling-sales.csv')

In [None]:
data.head()

In [None]:
data.info()

## Geocoding

Before pre-processing the data, this section applies the Geocoder framework to obtain the geographical coordinates of the properties, which will be useful in subsequent analyses. The ZIP codes of the properties are sent as queries to OpenStreetMap, an open-source location data provider, which returns the corresponding coordinates. Using ZIP codes may not be the most precise approach, but it is a faster way of obtaining this data.

In [None]:
zip_codes = data['ZIP CODE'].unique()

In [None]:
x_coordinates = []
y_coordinates = []

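for code in zip_codes:
    # Query OpenStreetMap with ', New York' appended to narrow the search
    g = geocoder.osm(str(code) + ', New York')
    if g.ok:
        x_coordinates.append(g.osm['x'])
        y_coordinates.append(g.osm['y'])
    else:
        # Placeholder for ZIP codes the provider could not resolve
        x_coordinates.append('Not Found')
        y_coordinates.append('Not Found')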

In [None]:
zip_codes_df = pd.DataFrame(list(zip(zip_codes,x_coordinates,y_coordinates)),columns=['Zip Codes','X Coordinate','Y Coordinate'])
zip_codes_df

In [None]:
data = data.merge(zip_codes_df,how='left',left_on='ZIP CODE',right_on='Zip Codes')
data = data.drop('Zip Codes',axis=1)
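
The lookup loop above sends one request per ZIP code. What follows is only a minimal sketch of the same idea, under two assumptions: a pause is added between requests (the public Nominatim usage policy asks for at most one request per second), and unresolved ZIP codes are stored as `NaN` instead of the `'Not Found'` string so that the coordinate columns stay numeric. The `lookup` argument and the `fake_lookup` helper are hypothetical stand-ins for the call to `geocoder.osm`, used here so that the example runs without network access.

```python
import math
import time

import pandas as pd


def geocode_zip_codes(zip_codes, lookup, pause_seconds=1.0):
    """Return one row per ZIP code with its coordinates (NaN when not found).

    `lookup` is a hypothetical callable that takes a query string and
    returns an (x, y) tuple, or None when the provider finds nothing.
    """
    rows = []
    for code in zip_codes:
        coords = lookup(f"{code}, New York")
        x, y = coords if coords is not None else (math.nan, math.nan)
        rows.append({"Zip Codes": code, "X Coordinate": x, "Y Coordinate": y})
        time.sleep(pause_seconds)  # pause between requests to the service
    return pd.DataFrame(rows, columns=["Zip Codes", "X Coordinate", "Y Coordinate"])


def fake_lookup(query):
    """Stand-in for the real provider call; the coordinates are made up."""
    known = {"10001, New York": (-73.99, 40.75)}
    return known.get(query)


# pause_seconds=0 only because no real service is contacted here.
zip_codes_example = geocode_zip_codes([10001, 0], fake_lookup, pause_seconds=0)
print(zip_codes_example)
```

With the real provider, `lookup` could wrap `geocoder.osm` and return `(g.osm['x'], g.osm['y'])` when `g.ok` is true. Rows with `NaN` coordinates would then be dropped with `dropna` rather than by comparing against `'Not Found'`.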

## Pre-Process

First, unused columns are removed from the dataset.

In [None]:
data = data.drop(['Unnamed: 0','BLOCK','LOT','APARTMENT NUMBER'],axis=1)

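Next, the columns holding prices and areas are converted from text to floating-point numbers so they can be used in calculations. The `' -  '` placeholder that marks missing values is replaced by zero first.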

In [None]:
data[['SALE PRICE','LAND SQUARE FEET','GROSS SQUARE FEET']] = data[['SALE PRICE','LAND SQUARE FEET','GROSS SQUARE FEET']].replace({' -  ':'0'})
data[['SALE PRICE','LAND SQUARE FEET','GROSS SQUARE FEET']] = data[['SALE PRICE','LAND SQUARE FEET','GROSS SQUARE FEET']].astype('float64')
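
The conversion above turns the missing-value placeholder into zero. As a hedged alternative, the sketch below keeps missing values as `NaN` by using `pd.to_numeric` with `errors='coerce'`, so that later averages are not pulled down by artificial zeros. The sample rows are made up for illustration; this is not what the rest of the notebook does.

```python
import pandas as pd

# Made-up rows using the same ' -  ' placeholder as the dataset.
sample = pd.DataFrame({
    "SALE PRICE": ["450000", " -  ", "1200000"],
    "GROSS SQUARE FEET": ["1800", "2400", " -  "],
})

numeric_columns = ["SALE PRICE", "GROSS SQUARE FEET"]
# errors='coerce' turns anything that is not a number into NaN.
converted = sample[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted.mean())  # NaN values are skipped by default
```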

In [None]:
data['SALE DATE'] = pd.to_datetime(data['SALE DATE'])

In the Sale Price column, some values appear to be outliers. To remove possibly invalid data, properties priced at $5,000,000 or more are removed from the dataset. Also, a large number of properties have a price of zero (missing prices were also set to zero in the previous step), which corresponds to transfers of ownership, as stated in the documentation of the dataset. These rows are also removed.

In [None]:
data = data[(data['SALE PRICE'] > 0) & (data['SALE PRICE'] < 5000000)]
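
Before applying a filter like this, it can help to see how many rows each rule removes. The sketch below does that on made-up prices; `summarize_price_filter` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def summarize_price_filter(prices, upper_limit=5_000_000):
    """Count how many rows each filtering rule would remove or keep."""
    zero_or_missing = int((prices.fillna(0) <= 0).sum())
    above_limit = int((prices >= upper_limit).sum())
    kept = int(((prices > 0) & (prices < upper_limit)).sum())
    return {"zero_or_missing": zero_or_missing, "above_limit": above_limit, "kept": kept}


# Made-up prices covering the three cases.
example_prices = pd.Series(
    [0, 0, 350_000, 820_000, 4_999_999, 5_000_000, 75_000_000], dtype="float64"
)
summary = summarize_price_filter(example_prices)
print(summary)
```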

The Borough column is encoded as numbers. According to the description of the data provided on Kaggle, the possible values are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). Let's apply that mapping to the data.

In [None]:
data['BOROUGH'] = data['BOROUGH'].replace({1:'Manhattan', 2:'Bronx', 3:'Brooklyn', 4:'Queens', 5:'Staten Island'})
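
`replace` leaves any value that is not in the dictionary untouched, so an unexpected code would pass silently. As a small sketch of a stricter variant, `Series.map` returns `NaN` for unmapped codes, which makes them easy to spot. The input codes below are made up, and 9 is deliberately invalid.

```python
import pandas as pd

BOROUGH_NAMES = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn", 4: "Queens", 5: "Staten Island"}

codes = pd.Series([1, 3, 4, 9])  # 9 is a made-up invalid code
names = codes.map(BOROUGH_NAMES)  # unmapped codes become NaN

unmapped = codes[names.isna()].unique().tolist()
print(names.tolist())
print("Unmapped codes:", unmapped)
```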

In [None]:
data.info()

## Initial Analysis

Let's start with a simple look at key variables of the dataset. First, let's check the distribution of the target variable, Sale Price.

In [None]:
plot_data = data['SALE PRICE']

In [None]:
plt.figure(figsize=(10,8))
plotd = sns.histplot(plot_data,bins=100)

tick_spacing=250000 
plotd.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plotd.ticklabel_format(axis='both',style='plain')
plt.xticks(rotation=30) 
plt.xlabel('Sales Price')
plt.title('Properties Sales Price Distribution')

# Mark the mean with a dashed vertical line, keeping the original y-limits
mean_price = plot_data.mean()
ylim = plotd.get_ylim()
plotd.plot([mean_price, mean_price], ylim, color='red', ls='--')
plotd.set_ylim(ylim)
plotd.text(mean_price + 100000, 3000, 'Mean: ' + str(round(mean_price, 2)), fontsize=13, color='red')

The distribution is right-skewed: most properties fall in the lower price range, with a long tail towards more valuable ones. The red line marks the mean of the distribution, which is nearly $827,000.

Let's look at the distribution of properties by borough to check for imbalances in the data.
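
One way to put a number on this asymmetry is to compare the mean with the median and to compute the sample skewness. The sketch below uses made-up prices, not the dataset itself; in a right-skewed sample the mean sits above the median and the skewness is positive.

```python
import pandas as pd

# Made-up prices: many moderate values and one very expensive property.
prices = pd.Series([300_000, 400_000, 450_000, 500_000, 550_000, 600_000, 3_000_000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive for a long right tail

print(f"Mean:   {mean_price:,.0f}")
print(f"Median: {median_price:,.0f}")
print(f"Skew:   {skewness:.2f}")
```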

In [None]:
plot_data = data[['BOROUGH','SALE PRICE']].groupby(by='BOROUGH').count()
plot_data = plot_data.reset_index()
plot_data.columns =  ['Borough','Counts']
plot_data = plot_data.sort_values('Counts',ascending=False)

In [None]:
plt.figure(figsize=(10,8))
plt.bar(x = plot_data['Borough'],height=plot_data['Counts'])
plt.title('Properties by Borough')

Queens, Brooklyn and Manhattan account for many more properties than the other boroughs.

It is also interesting to analyse the distribution of properties by Building Class Category.

In [None]:
plot_data = data[['BUILDING CLASS CATEGORY','SALE PRICE']].groupby(by='BUILDING CLASS CATEGORY').count()
plot_data = plot_data.reset_index()
plot_data.columns =  ['Building Class','Counts']
plot_data = plot_data.sort_values('Counts')

In [None]:
plt.figure(figsize=(10,8))
plt.barh(y = plot_data['Building Class'],width=plot_data['Counts'])
plt.title('Properties by Building Class')

It seems that the properties in the dataset are highly concentrated in Apartments and Family Dwellings. What about the area of the properties?

In [None]:
plot_data = data[data['GROSS SQUARE FEET'] > 0]['GROSS SQUARE FEET']

In [None]:
plt.figure(figsize=(10,8))
plotd = sns.histplot(plot_data,bins=100,log_scale=True)

plt.xticks(rotation=30) 
plt.xlabel('Gross Square Feet')
plt.title('Properties Gross Square Feet Distribution (Log Scale)')

# Mark the mean with a dashed vertical line, keeping the original y-limits
mean_area = plot_data.mean()
ylim = plotd.get_ylim()
plotd.plot([mean_area, mean_area], ylim, color='red', ls='--')
plotd.set_ylim(ylim)
plotd.text(mean_area + 500, 2500, 'Mean: ' + str(round(mean_area, 2)), fontsize=13, color='red')

As some large values are involved, a logarithmic scale is used, which reveals an asymmetric distribution with a long tail. The mean of the distribution is shown by the red vertical line.

## Deeper Analysis

After the Initial Analysis section, which focused on the distributions of single important variables, this section explores the data in more depth, combining multiple variables to try to answer the questions defined in the Objectives section.

### Prices Variation by Building Class

The first question considered is how Sales Prices vary according to building type. Let's first separate the appropriate data.

In [None]:
plot_data = data[['BUILDING CLASS CATEGORY','RESIDENTIAL UNITS','COMMERCIAL UNITS','TAX CLASS AT TIME OF SALE','SALE PRICE']]
plot_data

As an initial step, let's plot the Sales Price by Building Class. As some types have a small number of samples, this analysis is limited to the 15 most common building types in this dataset.

In [None]:
selected_class = plot_data['BUILDING CLASS CATEGORY'].value_counts().index[:15]

In [None]:
plot_data = plot_data[plot_data['BUILDING CLASS CATEGORY'].isin(selected_class)]

In [None]:
plt.figure(figsize=(15,12))
sns.boxplot(y='BUILDING CLASS CATEGORY',x='SALE PRICE',data=plot_data,orient='h',fliersize=0)
sns.despine(trim=True, left=True)
plt.title('Sales Price Distribution by Building Class')

There is no significant difference between most classes in this plot, but three of them stand out with higher prices and wider spreads: "Rentals - Walkup Apartments", "Rentals - 4-10 Unit" and "Store Buildings". It is interesting to see the first two building types with high prices, since they generally consist of smaller properties. This may suggest that the buildings most in demand in New York are generally small and, probably as a consequence, that families or groups of people living together also tend to be small.
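
A box plot is read by eye, so a summary table per class can make the comparison more concrete. The sketch below computes the count, median, mean and standard deviation of the price for each class. The two class labels and all prices are made up for illustration.

```python
import pandas as pd

# Made-up sales for two hypothetical building classes.
sales = pd.DataFrame({
    "BUILDING CLASS CATEGORY": ["Class A", "Class A", "Class A", "Class B", "Class B", "Class B"],
    "SALE PRICE": [400_000, 500_000, 600_000, 300_000, 900_000, 1_500_000],
})

summary = (
    sales.groupby("BUILDING CLASS CATEGORY")["SALE PRICE"]
    .agg(["count", "median", "mean", "std"])
    .sort_values("median", ascending=False)  # most expensive class first
)
print(summary)
```

On the real data, the same `groupby` could be applied to `plot_data` after it has been limited to the 15 most common classes.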

### Prices Variation with Time

As mentioned earlier, the dataset contains information about the date of each registered sale. So, let's try to extract some information from it.

In [None]:
time_data = data[['SALE DATE','SALE PRICE']]
time_data

Since it's possible to have more than one sale per day, let's group the sales by day and obtain the mean of each day.

In [None]:
time_data = time_data.groupby('SALE DATE').mean()
time_data = time_data.reset_index()  # 'SALE DATE' becomes a regular column again
time_data

In [None]:
plt.figure(figsize=(15,8))
plt.plot(time_data['SALE DATE'],time_data['SALE PRICE'])
plt.title('Sales Price by Date')

Despite the lack of a clear trend, there is strong seasonality in the data. Let's try to capture it in an autocorrelation plot.

In [None]:
lags = 60
plt.rc("figure", figsize=(15,6))
fig = plot_acf(time_data['SALE PRICE'],lags=lags,title='Autocorrelation Plot of Sales Price')
x_ticks = plt.xticks(range(lags+1))
plt.tight_layout()

There seems to be a more prominent correlation at periods of 6-7 days, which may indicate that prices vary regularly over the week. Let's try to visualize that.
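
One caveat when reading lags as days: the autocorrelation is computed over consecutive rows, so a lag only equals one calendar day if no date is missing from the series. The sketch below, on made-up daily means with a weekend dip and one missing day, reindexes the series to a complete daily calendar with `asfreq('D')` and then uses `Series.autocorr` at a few lags. It is an illustration of the idea, not a replacement for `plot_acf`.

```python
import pandas as pd

# Made-up daily mean prices for four weeks: lower on weekends, with the
# second Sunday removed to mimic a day without any recorded sale.
dates = pd.date_range("2017-01-02", periods=28, freq="D")  # starts on a Monday
values = [800_000 if day.weekday() < 5 else 500_000 for day in dates]
daily = pd.Series(values, index=dates, dtype="float64").drop(dates[13])

# Reindex to a complete calendar so that one lag equals one day;
# the missing day becomes NaN and is ignored by autocorr.
daily = daily.asfreq("D")

for lag in (1, 3, 7):
    print(f"lag {lag}: {daily.autocorr(lag=lag):.2f}")
```

For a pattern that repeats exactly every week, the correlation at lag 7 is the highest of the three.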

In [None]:
time_data['SALE WEEK DAY NAME'] = time_data['SALE DATE'].dt.day_name()
time_data['SALE WEEK DAY NUMBER'] = time_data['SALE DATE'].dt.weekday  # Monday is 0

In [None]:
# Average per weekday; the weekday number is kept only to sort Monday to Sunday
week_data = time_data.groupby('SALE WEEK DAY NAME')[['SALE PRICE','SALE WEEK DAY NUMBER']].mean()
week_data = week_data.reset_index()
week_data = week_data.sort_values('SALE WEEK DAY NUMBER')
week_data

In [None]:
plt.figure(figsize=(15,8))
plt.bar(x = week_data['SALE WEEK DAY NAME'],height=week_data['SALE PRICE'])
plt.title('Mean Sales Price by Day of Week')

So, it seems that the average sale price drops sharply on weekends, while it remains practically constant during the week. This appears to be natural behaviour for the kind of sales registered on Saturdays and Sundays.

Another important time-related analysis involves the variation of price with the age of the building. One would expect older buildings to be valued less, but by how much?
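
A mean per weekday says nothing about how many sales it is based on. The sketch below, on made-up individual sales, reports the number of sales next to the mean price for each weekday, using an ordered categorical so that the days appear from Monday to Sunday. Note that it averages individual sales, whereas the cells above average daily means, so the two results are not directly comparable.

```python
import pandas as pd

# Made-up individual sales; 2017-01-07 is a Saturday and 2017-01-08 a Sunday.
sales = pd.DataFrame({
    "SALE DATE": pd.to_datetime([
        "2017-01-02", "2017-01-03", "2017-01-03",
        "2017-01-06", "2017-01-07", "2017-01-08",
    ]),
    "SALE PRICE": [800_000, 750_000, 850_000, 900_000, 300_000, 200_000],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sales["SALE WEEK DAY NAME"] = pd.Categorical(
    sales["SALE DATE"].dt.day_name(), categories=weekday_order, ordered=True
)

# observed=False keeps weekdays without any sale in the result.
weekday_summary = (
    sales.groupby("SALE WEEK DAY NAME", observed=False)["SALE PRICE"]
    .agg(["count", "mean"])
)
print(weekday_summary)
```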

In [None]:
age_data = data[['YEAR BUILT','SALE PRICE','SALE DATE']].copy()  # copy so new columns do not touch `data`
age_data['SALE YEAR'] = age_data['SALE DATE'].dt.year
age_data

This dataset covers sales between September 2016 and September 2017, so the age of each building should be calculated taking the year of sale as a reference.

In [None]:
age_data['Age'] = age_data['SALE YEAR'] - age_data['YEAR BUILT']
age_data = age_data.groupby('Age').mean()

In [None]:
# Keep only plausible ages; much larger values likely come from a missing or zero YEAR BUILT
age_data = age_data[age_data.index <= 300]

In [None]:
plt.figure(figsize=(15,8))
plt.plot(age_data.index,age_data['SALE PRICE'])
plt.title('Mean Sales Price by Age of Property')

Surprisingly, prices vary a lot for buildings around 150 years old. A comparison suggests that these buildings differ significantly from the rest in Gross Square Feet, as the next plot shows.

In [None]:
old_priced_buildings = data[(2017 - data['YEAR BUILT'] >= 125) & (2017 - data['YEAR BUILT'] <= 175)]

In [None]:
old_gross_area_mean = old_priced_buildings['GROSS SQUARE FEET'].mean()
general_gross_area_mean = data['GROSS SQUARE FEET'].mean()

In [None]:
plt.figure(figsize=(6,8))
plt.bar(x=['150 Year Old Buildings','All Buildings'],height=[old_gross_area_mean,general_gross_area_mean])
plt.title('Average Gross Square Feet of Properties Comparison')
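
Because missing areas were set to zero during pre-processing, both averages in this comparison include artificial zeros. The sketch below shows, on made-up rows, how the same comparison could be made using only properties with a known gross area. It assumes, as in the cells above, that age is measured relative to 2017 and that "around 150 years" means between 125 and 175 years.

```python
import pandas as pd

# Made-up rows; a gross area of 0 stands for a missing value.
buildings = pd.DataFrame({
    "YEAR BUILT": [1860, 1875, 1950, 1990, 2005, 1880],
    "GROSS SQUARE FEET": [6000.0, 0.0, 1800.0, 0.0, 2200.0, 8000.0],
})

# Keep only rows where the area is actually known.
known_area = buildings[buildings["GROSS SQUARE FEET"] > 0]
age = 2017 - known_area["YEAR BUILT"]
is_old = age.between(125, 175)  # inclusive on both ends

old_mean = known_area.loc[is_old, "GROSS SQUARE FEET"].mean()
overall_mean = known_area["GROSS SQUARE FEET"].mean()
print(f"Old buildings: {old_mean:,.0f} sq ft")
print(f"All buildings: {overall_mean:,.0f} sq ft")
```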

### Prices Variation with Location

Last but not least, probably the most practical question is how the sale price varies with the location of the properties. For this, the geocoded data extracted in a previous section is used to create a visualization with the aforementioned Kepler.gl framework.


In [None]:
map_data = data[['X Coordinate','Y Coordinate','SALE PRICE']]
map_data = map_data[map_data['X Coordinate'] != 'Not Found']

In [None]:
map_data['X Coordinate'] = map_data['X Coordinate'].astype(float)
map_data['Y Coordinate'] = map_data['Y Coordinate'].astype(float)

Here, a preset configuration file for the map is executed, so that the initial state of the visualization focuses on New York and the data presented.

In [None]:
%run ../input/nyc-map-config/map_config.py

In [None]:
ny_map = KeplerGl(config=config)
ny_map.add_data(data=map_data,name='NY Properties')
ny_map

So, as expected, the most expensive properties are located in Manhattan. It is also possible to see that sales in this borough are concentrated in its northern part, but that may be an artifact of the data collected. Outside this area, buildings become cheaper the farther they are from the center of the city.

## Conclusion

This notebook presented an exploratory analysis of a dataset containing one year of property sales in New York City. Some interesting insights were obtained:
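
Since the coordinates were looked up per ZIP code, every sale in the same ZIP code lands on the same point of the map. A possible refinement, sketched below on made-up rows, is to aggregate the sales to one row per coordinate pair with the mean price and the number of sales before sending them to the map.

```python
import pandas as pd

# Made-up sales; coordinates repeat because they were looked up per ZIP code.
map_points = pd.DataFrame({
    "X Coordinate": [-73.99, -73.99, -73.95, -73.95, -73.95],
    "Y Coordinate": [40.75, 40.75, 40.68, 40.68, 40.68],
    "SALE PRICE": [1_500_000, 2_500_000, 600_000, 700_000, 800_000],
})

# One row per location, with the average price and the number of sales.
zip_level = (
    map_points.groupby(["X Coordinate", "Y Coordinate"], as_index=False)
    .agg(mean_price=("SALE PRICE", "mean"), sales=("SALE PRICE", "count"))
)
print(zip_level)
```

The aggregated frame could then be passed to `ny_map.add_data` in place of the row-level data, in which case the map configuration would need to refer to the new column names.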

* Sale prices have a high variance, but most properties are concentrated around the $500,000 price range, with a mean area of approximately 3,000 square feet;
* Most of the properties sold are located in Queens and Brooklyn, and they are generally apartments and family dwellings;
* Although prices do not vary much with building type, small residential properties, like 4-10 Unit rentals and Walkup Apartments, stand out in the data with higher and more variable prices;
* During the year analysed, no trend in sale prices was detected, but it was possible to see a strong weekly seasonal effect, with lower mean sale prices on weekends;
* The age of a building doesn't appear to be a negative factor for its price, since the data contains buildings approximately 150 years old with high prices;
* As expected, the most expensive properties are located in Manhattan.

Again, this was my first notebook published on Kaggle. If you enjoyed it, or if it helped you in some way, please consider giving it an upvote. It will mean a lot to me, and will help me find out whether I am on a good learning path in Data Science.