# Predicting NYC apartment values with open data

According to real estate brokers in NYC, a variety of factors determine the value of your apartment, including but not limited to:

1. Recent sales in your building / neighborhood
2. Square footage
3. Renovation status
4. View, proximity to the subway, number of bedrooms, etc.

Unfortunately, this data, especially apartment square footage, is not easily available to non-REBNY members.

In this notebook, I will explore an alternative approach to pricing Manhattan apartments.

In [3]:
import numpy as np
import pandas as pd

from datascience import *

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import locale
import os
import xlrd  # engine pandas uses to read .xls files

## Import data

Source: XXXX

In [4]:
# Constants
raw_directory = "data/raw/"
csv_directory = "data/csv/"

# Retrieve current working directory (`cwd`)
cwd = os.getcwd()

# List files and directories in the raw data directory
excel_files = os.listdir(raw_directory)

# Select only the xls files
excel_files = [k for k in excel_files if k.endswith('.xls')]

In [5]:
excel_files

['2015_manhattan.xls',
 '2011_manhattan.xls',
 '2016_manhattan.xls',
 '2012_manhattan.xls',
 'rollingsales_manhattan.xls',
 'sales_manhattan_03.xls',
 'sales_manhattan_06.xls',
 'sales_manhattan_04.xls',
 'sales_2007_manhattan.xls',
 'sales_manhattan_05.xls',
 '2009_manhattan.xls',
 '2013_manhattan.xls',
 '2017_manhattan.xls',
 'sales_2008_manhattan.xls',
 '2010_manhattan.xls',
 '2014_manhattan.xls']

Unfortunately, not all files are formatted the same way. Some have the header in row 4, others in row 5. We can tell which by checking whether 'BOROUGH' is the first column of the imported dataset.
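The header check can also be sketched without touching the files. The helper below is hypothetical (`find_header_row` is not part of pandas or of this notebook), and the sample rows are made up for illustration. It scans the first few rows of a sheet, given as plain lists of cell values, and returns the zero-based index of the first row whose first cell starts with 'BOROUGH'.

```python
def find_header_row(rows, marker="BOROUGH", max_rows=10):
    """Return the zero-based index of the first row whose first cell starts with `marker`.

    `rows` is a list of rows, each a list of cell values (hypothetical input,
    for example the first rows of a sheet read without a header).
    Returns None if no such row is found within the first `max_rows` rows.
    """
    for index, row in enumerate(rows[:max_rows]):
        if not row:
            continue  # skip completely empty rows
        first_cell = row[0]
        if isinstance(first_cell, str) and first_cell.strip().startswith(marker):
            return index
    return None


# Made-up preamble rows, for illustration only
sheet_a = [
    ["Title line"],
    [None],
    ["Notes"],
    ["BOROUGH", "NEIGHBORHOOD", "SALE PRICE"],
    [1, "INWOOD", 5600000],
]
sheet_b = [["Extra title line"]] + sheet_a

print(find_header_row(sheet_a))  # header in row 4 (index 3)
print(find_header_row(sheet_b))  # header in row 5 (index 4)
```

The returned index could then be passed as the `header` argument of `pd.read_excel`.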

In [7]:

# Create a data store
all_sales_data = pd.DataFrame()

# Load the individual Excel files.
for excel_file in excel_files:
    print(excel_file)

    # Read the Excel file. Note the headers could be in row 4 or row 5 (index 3 or 4).
    yearly_sales_data = pd.read_excel(raw_directory + excel_file, header=3)

    # Check if the first column is "BOROUGH"
    if not str(yearly_sales_data.columns[0]).startswith('BOROUGH'):
        # Otherwise the header is in row 5.
        yearly_sales_data = pd.read_excel(raw_directory + excel_file, header=4)

    print(yearly_sales_data.shape)

    # Strip stray whitespace from the column names
    yearly_sales_data.rename(columns=lambda x: x.strip(), inplace=True)

    # Add this file's rows to the combined data; sort=True orders the columns
    # alphabetically when files disagree on column names
    all_sales_data = pd.concat([all_sales_data, yearly_sales_data], sort=True)

    print(all_sales_data.shape)





2015_manhattan.xls
(24989, 21)
(24989, 21)
2011_manhattan.xls
(21500, 21)
(46489, 21)
2016_manhattan.xls
(21241, 21)
(67730, 21)
2012_manhattan.xls
(26258, 21)
(93988, 21)
rollingsales_manhattan.xls
(16828, 21)
(110816, 21)
sales_manhattan_03.xls
(22210, 21)
(133026, 21)
sales_manhattan_06.xls
(26352, 21)
(159378, 21)
sales_manhattan_04.xls
(25894, 21)
(185272, 21)
sales_2007_manhattan.xls
(28439, 21)
(213711, 21)
sales_manhattan_05.xls
(26388, 21)
(240099, 21)
2009_manhattan.xls
(19166, 21)
(259265, 21)
2013_manhattan.xls
(26715, 21)
(285980, 21)
2017_manhattan.xls
(18642, 21)
(304622, 23)
sales_2008_manhattan.xls
(25994, 21)
(330616, 23)
2010_manhattan.xls
(17296, 21)
(347912, 23)
2014_manhattan.xls
(24524, 21)
(372436, 23)
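The combined shape grows from 21 to 23 columns once `2017_manhattan.xls` is appended, which suggests that this file carries column names the others lack. A minimal sketch for locating such mismatches follows, assuming the column names of each file have been collected into a dictionary. The function name `column_mismatches` is hypothetical, and the dictionary holds shortened, illustrative column lists rather than the real 21-column headers.

```python
def column_mismatches(columns_by_file):
    """Map each file name to the columns it has that are not shared by every file."""
    column_sets = {name: set(cols) for name, cols in columns_by_file.items()}
    if not column_sets:
        return {}
    # Columns present in every single file
    shared = set.intersection(*column_sets.values())
    return {
        name: sorted(cols - shared)
        for name, cols in column_sets.items()
        if cols - shared
    }


# Illustration data only: shortened column lists, assumed for this example
columns_by_file = {
    "2016_manhattan.xls": [
        "BOROUGH",
        "BUILDING CLASS AT PRESENT",
        "TAX CLASS AT PRESENT",
    ],
    "2017_manhattan.xls": [
        "BOROUGH",
        "BUILDING CLASS AS OF FINAL ROLL 17/18",
        "TAX CLASS AS OF FINAL ROLL 17/18",
    ],
}

for name, extra in column_mismatches(columns_by_file).items():
    print(name, extra)
```

In the loading loop, the dictionary could be filled with `columns_by_file[excel_file] = list(yearly_sales_data.columns)` after the column names are stripped.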


Spot check a random sample of rows to verify the data.

In [8]:

all_sales_data.sample(10)


| | ADDRESS | APARTMENT NUMBER | BLOCK | BOROUGH | BUILDING CLASS AS OF FINAL ROLL 17/18 | BUILDING CLASS AT PRESENT | BUILDING CLASS AT TIME OF SALE | BUILDING CLASS CATEGORY | COMMERCIAL UNITS | EASE-MENT | ... | NEIGHBORHOOD | RESIDENTIAL UNITS | SALE DATE | SALE PRICE | TAX CLASS AS OF FINAL ROLL 17/18 | TAX CLASS AT PRESENT | TAX CLASS AT TIME OF SALE | TOTAL UNITS | YEAR BUILT | ZIP CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11615 | 870 7 AVENUE | 2301 | 1027 | 1 | | R5 | R5 | 28 COMMERCIAL CONDOS | 1 | | ... | MIDTOWN WEST | 0 | 2011-06-22 | 29000 | | 4 | 4 | 1 | 0 | 10019 |
| 18434 | 870 WEST 181ST STREET, 28 | | 2177 | 1 | D4 | | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | WASHINGTON HEIGHTS UPPER | 0 | 2017-11-01 | 395000 | 2.0 | | 2 | 0 | 1923 | 10033 |
| 5973 | 72-78 SEAMAN AVENUE | | 2248 | 1 | | D1 | D1 | 08 RENTALS - ELEVATOR APARTMENTS | 0 | | ... | INWOOD | 47 | 2003-10-28 | 5600000 | | 2 | 2 | 47 | 1926 | 10034 |
| 20562 | 178 EAST 80TH STREET, 23A | | 1508 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | UPPER EAST SIDE (79-96) | 0 | 2006-06-13 | 850000 | | 2 | 2 | 0 | 1973 | 10021 |
| 11889 | 343 EAST 74TH STREET, 14L | | 1449 | 1 | | R9 | R9 | 17 CONDO COOPS | 0 | | ... | UPPER EAST SIDE (59-79) | 0 | 2017-09-18 | 755000 | | 2 | 2 | 0 | 1986 | 10021 |
| 16423 | 90 LASALLE STREET, 14B | | 1978 | 1 | | D4 | D4 | 10 COOPS - ELEVATOR APARTMENTS | 0 | | ... | MORNINGSIDE HEIGHTS | 0 | 2007-10-05 | 468700 | | 2 | 2 | 0 | 1956 | 10027 |
| 11437 | 127 MADISON AVENUE | 8 | 860 | 1 | | R1 | R1 | 15 CONDOS - 2-10 UNIT RESIDENTIAL | 0 | | ... | MURRAY HILL | 1 | 2010-08-27 | 1443369 | | 2C | 2 | 1 | 1920 | 10016 |
| 7055 | 47 WEST 127 STREET | | 1725 | 1 | | C0 | C0 | 03 THREE FAMILY HOMES | 0 | | ... | HARLEM-CENTRAL | 3 | 2007-05-23 | 0 | | 1 | 1 | 3 | 2002 | 10027 |
| 12184 | 1335 AVENUE OF THE AMERIC | TIMES | 1006 | 1 | | RH | RH | 45 CONDO HOTELS | 0 | | ... | MIDTOWN WEST | 0 | 2015-11-17 | 19997 | | 4 | 4 | 1 | 1963 | 10019 |
| 4947 | 152 WEST 131 ST. APT. 5 | | 1915 | 1 | | C6 | C6 | 09 COOPS - WALKUP APARTMENTS | 0 | | ... | HARLEM-CENTRAL | 0 | 2003-05-09 | 0 | | 2C | 2 | 0 | 1920 | 10027 |


Save the data as a CSV for further clean-up and analysis. See Step 2.

In [19]:
# save to csv
all_sales_data.to_csv(csv_directory+"manhattan.csv")