# Predicting NYC apartment values with open data

According to real estate brokers in NYC, a variety of factors determine the value of an apartment, including but not limited to:

1. Recent sales in your building / neighbourhood
2. Square footage
3. Renovation status
4. View, proximity to the subway, number of bedrooms, etc.

Unfortunately, this data, especially apartment square footage, is not easily available to anyone who is not a member of REBNY (the Real Estate Board of New York).

In this notebook, I explore an alternative approach to pricing Manhattan apartments.

In [2]:
import numpy as np
import pandas as pd

from datascience import *

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from constants import *

import locale
import os
import xlrd  # used by pandas to read the legacy .xls files

## Import data

Sources:
1. [Annualized Sales Data](https://www1.nyc.gov/site/finance/taxes/property-annualized-sales-update.page)
2. [Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [3]:

# Retrieve current working directory (`cwd`)
cwd = os.getcwd()

# List the files and directories in the raw data directory
excel_files = os.listdir(raw_directory)

# Select only the xls files
excel_files = [k for k in excel_files if '.xls' in k]

In [4]:
excel_files

['2015_manhattan.xls',
 '2011_manhattan.xls',
 '2016_manhattan.xls',
 '2012_manhattan.xls',
 'rollingsales_manhattan.xls',
 'sales_manhattan_03.xls',
 'sales_manhattan_06.xls',
 'sales_manhattan_04.xls',
 'sales_2007_manhattan.xls',
 'sales_manhattan_05.xls',
 '2009_manhattan.xls',
 '2013_manhattan.xls',
 '2017_manhattan.xls',
 'sales_2008_manhattan.xls',
 '2010_manhattan.xls',
 '2014_manhattan.xls']
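The file-listing step can also be written with `pathlib`. The following is a minimal sketch, not the notebook's own code: `list_excel_files` is a hypothetical helper, and the file names in the demo are made up. Unlike the substring test `'.xls' in k`, it matches only names whose extension is exactly `.xls`, and it sorts the result so the load order is reproducible.

```python
from pathlib import Path
import tempfile


def list_excel_files(directory):
    """Return the sorted names of the .xls files found directly in `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".xls"
    )


# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2015_manhattan.xls", "2011_manhattan.xls", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_excel_files(tmp)

print(found)
```

In the notebook this would be called as `list_excel_files(raw_directory)`, assuming `raw_directory` points at the folder holding the downloaded files.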

Unfortunately, not all files are formatted the same way. Some have the header in row 4, others in row 5. We can tell which by checking that 'BOROUGH' is the first column in the imported dataset.
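A more general variant of this check scans the first few rows for the 'BOROUGH' marker instead of trying two fixed positions. The sketch below is an illustration under assumptions: `find_header_row` is a hypothetical helper, and the sample is in-memory CSV text rather than a real Excel file so that it runs without any data on disk. The same idea should carry over to `pd.read_excel(..., header=None)`.

```python
import io

import pandas as pd


def find_header_row(raw, marker="BOROUGH", max_rows=10):
    """Return the index of the first row whose first cell starts with `marker`."""
    for i in range(min(max_rows, len(raw))):
        first_cell = raw.iloc[i, 0]
        if isinstance(first_cell, str) and first_cell.strip().startswith(marker):
            return i
    raise ValueError(f"no row starting with {marker!r} in the first {max_rows} rows")


# Made-up sample: three preamble rows, then the header, then one data row.
sample = io.StringIO(
    "Manhattan sales,,\n"
    "Some preamble,,\n"
    ",,\n"
    "BOROUGH,NEIGHBORHOOD,SALE PRICE\n"
    "1,CHELSEA,750000\n"
)

# Read everything as raw cells, with no header row assumed.
raw = pd.read_csv(sample, header=None)
header_row = find_header_row(raw)

# Promote the detected row to column names and keep only the rows below it.
data = raw.iloc[header_row + 1:].reset_index(drop=True)
data.columns = [str(c).strip() for c in raw.iloc[header_row]]

print(header_row)
print(list(data.columns))
```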

In [5]:

# Create a data store
all_sales_data = pd.DataFrame()

# Load the individual Excel files.
for excel_file in excel_files:
    print(excel_file)
    excel_path = os.path.join(raw_directory, excel_file)

    # Read the Excel file. Note the header could be in row 4 or row 5 (index 3 or 4).
    yearly_sales_data = pd.read_excel(excel_path, header=3)

    # Check if the first column is "BOROUGH"
    if not yearly_sales_data.columns[0].startswith('BOROUGH'):
        # Otherwise the header is in row 5.
        yearly_sales_data = pd.read_excel(excel_path, header=4)

    print(yearly_sales_data.shape)

    # Strip stray whitespace from the column names
    yearly_sales_data.rename(columns=lambda x: x.strip(), inplace=True)

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    # sort=True keeps the alphabetical column order that older pandas applied
    # (with a FutureWarning) when the column sets differ.
    all_sales_data = pd.concat([all_sales_data, yearly_sales_data], sort=True)

    print(all_sales_data.shape)

2015_manhattan.xls
(24989, 21)
(24989, 21)
2011_manhattan.xls
(21500, 21)
(46489, 21)
2016_manhattan.xls
(21241, 21)
(67730, 21)
2012_manhattan.xls
(26258, 21)
(93988, 21)
rollingsales_manhattan.xls
(16828, 21)
(110816, 21)
sales_manhattan_03.xls
(22210, 21)
(133026, 21)
sales_manhattan_06.xls
(26352, 21)
(159378, 21)
sales_manhattan_04.xls
(25894, 21)
(185272, 21)
sales_2007_manhattan.xls
(28439, 21)
(213711, 21)
sales_manhattan_05.xls
(26388, 21)
(240099, 21)
2009_manhattan.xls
(19166, 21)
(259265, 21)
2013_manhattan.xls
(26715, 21)
(285980, 21)
2017_manhattan.xls
(18642, 21)
(304622, 23)
sales_2008_manhattan.xls
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
(25994, 21)
(330616, 23)
2010_manhattan.xls
(17296, 21)
(347912, 23)
2014_manhattan.xls
(24524, 21)
(372436, 23)
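The combined frame grows from 21 to 23 columns once `2017_manhattan.xls` is added. That is what happens when one file names some columns differently: stacking keeps the union of all column names and fills the gaps with missing values. The spot-check sample in this notebook shows both `... AT PRESENT` and `... AS OF FINAL ROLL 17/18` variants, which is consistent with this explanation, though the notebook does not confirm it. A minimal sketch with two made-up one-row frames:

```python
import pandas as pd

# Two toy frames that share two columns and differ in one (illustrative values only).
older = pd.DataFrame(
    {"BOROUGH": [1], "TAX CLASS AT PRESENT": ["2"], "SALE PRICE": [494800]}
)
newer = pd.DataFrame(
    {"BOROUGH": [1], "TAX CLASS AS OF FINAL ROLL 17/18": ["4"], "SALE PRICE": [10600000]}
)

# Stacking keeps the union of the column names; sort=True orders them alphabetically.
combined = pd.concat([older, newer], sort=True)

# Columns present in only one of the inputs.
only_in_newer = sorted(set(newer.columns) - set(older.columns))
only_in_older = sorted(set(older.columns) - set(newer.columns))

print(combined.shape)
print(only_in_newer, only_in_older)
print(combined.isna().sum())
```

Comparing `set(frame.columns)` between files in this way is one quick method for finding which columns would need renaming before the files are stacked.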


Spot-check a random sample to verify the data

In [6]:

all_sales_data.sample(10)


| | ADDRESS | APARTMENT NUMBER | BLOCK | BOROUGH | BUILDING CLASS AS OF FINAL ROLL 17/18 | BUILDING CLASS AT PRESENT | BUILDING CLASS AT TIME OF SALE | BUILDING CLASS CATEGORY | COMMERCIAL UNITS | EASE-MENT | ... | NEIGHBORHOOD | RESIDENTIAL UNITS | SALE DATE | SALE PRICE | TAX CLASS AS OF FINAL ROLL 17/18 | TAX CLASS AT PRESENT | TAX CLASS AT TIME OF SALE | TOTAL UNITS | YEAR BUILT | ZIP CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1278 | 405 WEST 57TH STREET, 4H | | 1067 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | CLINTON | 0 | 2009-08-19 | 494800 | | 2.0 | 2 | 0 | 1940 | 10019 |
| 7412 | 768 5 AVENUE | 1302 | 1274 | 1 | | R4 | R4 | 13 CONDOS - ELEVATOR APARTMENTS | 0 | | ... | MIDTOWN WEST | 1 | 2009-11-17 | 1090000 | | 2.0 | 2 | 1 | 0 | 10019 |
| 964 | 539 AVENUE OF THE AMER | | 790 | 1 | K4 | | K4 | 22 STORE BUILDINGS | 2 | | ... | CHELSEA | 2 | 2017-04-18 | 10600000 | 4.0 | | 4 | 4 | 1920 | 10011 |
| 20744 | 510 EAST 80 STREET | 4A | 1576 | 1 | | R4 | R4 | 13 CONDOS - ELEVATOR APARTMENTS | 0 | | ... | UPPER EAST SIDE (79-96) | 1 | 2004-07-07 | 505500 | | 2.0 | 2 | 1 | 1986 | 10021 |
| 13603 | 225 EAST 36TH STREET, 17B | | 917 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | MURRAY HILL | 0 | 2011-11-08 | 0 | | 2.0 | 2 | 0 | 1963 | 10016 |
| 17780 | 230 EAST 79TH STREET, 18-F | | 1433 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | UPPER EAST SIDE (59-79) | 0 | 2005-12-22 | 550000 | | 2.0 | 2 | 0 | 1964 | 10021 |
| 1321 | 516 WEST 47TH | S2J | 1075 | 1 | | R4 | R4 | 13 CONDOS - ELEVATOR APARTMENTS | 0 | | ... | CLINTON | 1 | 2009-11-12 | 420000 | | 2.0 | 2 | 1 | 0 | 10036 |
| 26170 | 643-645 171 STREET | | 2142 | 1 | | C1 | C1 | 07 RENTALS - WALKUP APARTMENTS | 0 | | ... | WASHINGTON HEIGHTS LOWER | 31 | 2013-12-16 | 5500000 | | 2.0 | 2 | 31 | 1914 | 10032 |
| 398 | 223 WEST 21 STREET, 5L | | 771 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | CHELSEA | 0 | 2006-04-20 | 750000 | | 2.0 | 2 | 0 | 1889 | 10011 |
| 7942 | 102 WEST 57TH STREET | | 1009 | 1 | | H2 | H2 | 25 LUXURY HOTELS | 2 | | ... | MIDTOWN WEST | 0 | 2010-08-02 | 32505 | | 4.0 | 4 | 2 | 2007 | 10019 |


In [7]:
# Check for duplicate entries (rows identical across all columns)
all_sales_data.duplicated().sum()



16137

In [8]:
# Delete the duplicates and check that it worked
all_sales_data = all_sales_data.drop_duplicates(keep='last')
all_sales_data.duplicated().sum()

0
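To make the de-duplication step concrete, here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from the sales files). `duplicated()` flags a row only when every column matches an earlier row, and `keep='last'` retains the final copy of each repeated row along with its original index label.

```python
import pandas as pd

# Toy data: the first two rows are identical across all columns.
toy = pd.DataFrame(
    {
        "ADDRESS": ["A", "A", "B"],
        "SALE DATE": ["2017-01-05", "2017-01-05", "2017-02-01"],
        "SALE PRICE": [500000, 500000, 750000],
    }
)

# Count rows that repeat an earlier row exactly.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats, keeping the last occurrence of each.
deduped = toy.drop_duplicates(keep="last")

print(n_dupes)
print(deduped)
```

Duplicates of this kind can arise, for instance, if the rolling sales file covers dates that an annual file also covers; that is a plausible explanation rather than something verified here.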

Save the data as a CSV for further cleanup and analysis. See Step 2.

In [9]:
# save to csv
all_sales_data.to_csv(csv_directory+"manhattan.csv")
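One detail to keep in mind when reloading this file: `to_csv` writes the row index by default, and because the yearly files were stacked without resetting the index, the labels repeat across files. The sketch below uses an in-memory buffer and made-up values to show how the saved index comes back as an `Unnamed: 0` column unless `index_col=0` is passed to `read_csv`.

```python
import io

import pandas as pd

# Toy frame with a repeated index label, as happens after stacking several files.
frame = pd.DataFrame({"BLOCK": [1067, 1274]}, index=[0, 0])

# Write to an in-memory buffer; the index becomes an unnamed first column.
buffer = io.StringIO()
frame.to_csv(buffer)

# Default read: the saved index shows up as a regular column named "Unnamed: 0".
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Reading with index_col=0 restores it as the index instead.
buffer.seek(0)
reloaded_indexed = pd.read_csv(buffer, index_col=0)

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
```

If the original row labels are not needed downstream, passing `index=False` to `to_csv` is an alternative that avoids the extra column altogether.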