# Data Cleaning 
The data comes from [Zillow](https://www.zillow.com/research/data/). It contains location information along with the Zillow Home Value Index (ZHVI). (*From the Zillow website: "A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range."*)

In [1]:
import pandas as pd
import numpy as np
import pickle

In [2]:
ls DATA

Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv
timeseries_queens_p.csv


# Data Import

In [3]:
df = pd.read_csv('DATA/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')
df.head(3)

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1996-01-31,...,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30,2020-07-31,2020-08-31
0,61639,0,10025,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,245762.0,...,1292776.0,1288753.0,1269532.0,1243884.0,1211977.0,1197322.0,1185428.0,1179938.0,1175379.0,1173231.0
1,84654,1,60657,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,209547.0,...,487111.0,486300.0,486154.0,487283.0,488823.0,489789.0,489865.0,490118.0,491195.0,493022.0
2,61637,2,10023,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,230594.0,...,1080810.0,1099111.0,1117633.0,1130101.0,1129983.0,1138594.0,1143043.0,1147409.0,1149477.0,1155724.0


The dataframe is in wide format, with one row per region and one column per month. I want it to have a column per region and a row per timestamp. I will also filter the locations down to include only New York. First, let's check whether our regions are unique.
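
The notebook does not show the uniqueness check itself. A minimal sketch of one way to do it, using a toy frame in place of the Zillow data and assuming the ZIP codes live in `RegionName` (the `toy` frame and its values are illustrative only):

```python
import pandas as pd

# Toy stand-in for the Zillow frame; the values are illustrative only.
toy = pd.DataFrame({
    "RegionName": [10025, 60657, 10023],
    "CountyName": ["New York County", "Cook County", "New York County"],
})

# True when no ZIP code appears more than once.
regions_unique = toy["RegionName"].is_unique

# Every row whose ZIP code is repeated (empty when the regions are unique).
duplicated_regions = toy.loc[toy["RegionName"].duplicated(keep=False), "RegionName"]
```

If `regions_unique` is `False`, `duplicated_regions` lists the offending ZIP codes so they can be inspected before `RegionName` is used as an index.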

## Export Metadata
Let's separate out the region meta info so we can reference it later.

In [4]:
meta = df.iloc[:, 0:9]

In [5]:
# Create the output folder if it does not exist yet
import os
os.makedirs('PKL', exist_ok=True)

In [6]:
# saving meta data
with open('PKL/meta.pkl', 'wb') as fp:
    pickle.dump(meta, fp)

## Subset
For now, I'll subset the data to the ZIP codes in Queens County.

In [7]:
queens = df[df.CountyName == 'Queens County']

In [8]:
meta_cols = list(df.columns[0:9])

In [9]:
meta_cols.remove('RegionName')

In [10]:
queens = queens.drop(meta_cols, axis = 1)

In [11]:
queens.shape

(54, 297)

In [12]:
queens.columns[queens.isnull().sum() != 0][-1]

'2003-08-31'

The last column with any missing values is August 2003, so we have complete data for every Queens ZIP code from September 2003 onward. Let's start the series there.

In [13]:
queens = queens.dropna(axis = 1)
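
To make the two steps above concrete, here is a small sketch on made-up values (the `toy` frame is hypothetical, not Zillow data). It finds the last column that still contains a missing value and then drops every incomplete column:

```python
import numpy as np
import pandas as pd

# Hypothetical values; one ZIP code is missing its first two months.
toy = pd.DataFrame({
    "RegionName": [11375, 11377],
    "2003-07-31": [np.nan, 300000.0],
    "2003-08-31": [np.nan, 301000.0],
    "2003-09-30": [250000.0, 302000.0],
    "2003-10-31": [251000.0, 303000.0],
})

# Number of missing values per column.
null_counts = toy.isnull().sum()

# Label of the last column that still has a gap.
last_incomplete = toy.columns[null_counts != 0][-1]

# Keep only the columns with no missing values at all.
complete = toy.dropna(axis=1)
```

Note that `dropna(axis=1)` removes any column with a gap, wherever it sits. It matches "start after the last incomplete month" only when every column before that month also has a gap, which is what the check above suggests for the Queens data.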

## Percentage Increase
We are trying to find the best neighborhood to invest in, and there is more than one way to approach this. I can predict the home values for the coming year and then calculate the difference, or I can predict the month-over-month percent increase at each time point. I'll try the percent increase first.

In [14]:
def calculate_percent_increase(x1, x2):
    return ((x2-x1)/x1)*100

In [15]:
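queens_p = queens.copy()
# Walk the date columns from last to first so each prior column is still a raw value when it is read.
# Column 0 is RegionName and column 1 (2003-09-30) has no prior month, so the loop stops at column 2.
for i in range(queens_p.shape[1] - 1, 1, -1):
    prior = queens_p.iloc[:, i - 1]
    current = queens_p.iloc[:, i]
    queens_p.iloc[:, i] = calculate_percent_increase(prior, current)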

In [16]:
# The first month has no prior month to compare against, so drop it
queens_p = queens_p.drop(['2003-09-30'], axis = 1)

In [17]:
queens_p.head(2)

Unnamed: 0,RegionName,2003-10-31,2003-11-30,2003-12-31,2004-01-31,2004-02-29,2004-03-31,2004-04-30,2004-05-31,2004-06-30,...,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30,2020-07-31,2020-08-31
20,11375,0.641106,0.913096,0.393119,0.631139,0.435425,0.625393,0.455833,0.666596,1.167856,...,-0.235229,0.303563,-0.195293,0.024489,-0.426299,-0.05689,-0.313316,0.178322,0.533047,0.64746
108,11377,0.025764,0.023655,0.921259,1.147181,1.159911,0.756265,0.799079,0.995691,1.538103,...,0.310192,0.464618,1.262602,1.144025,1.365576,1.264332,0.898478,0.54954,0.489825,0.827001
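
The loop in this section computes the month-over-month percent change one column at a time. As an alternative that the notebook does not use, the same formula can be applied to all columns at once with `DataFrame.shift`. A minimal sketch on made-up values, assuming `RegionName` has already been moved into the index so that only numeric columns remain:

```python
import pandas as pd

# Hypothetical home values for two ZIP codes over three months.
values = pd.DataFrame(
    {
        "2003-09-30": [100.0, 200.0],
        "2003-10-31": [110.0, 190.0],
        "2003-11-30": [121.0, 190.0],
    },
    index=pd.Index([11375, 11377], name="RegionName"),
)

# Shift every column one position to the right so that each cell
# lines up with the previous month's value.
prior = values.shift(periods=1, axis=1)

# Same formula as calculate_percent_increase: ((x2 - x1) / x1) * 100.
pct = (values - prior) / prior * 100

# The first month has no prior month, so it is all NaN; drop it.
pct = pct.drop(columns="2003-09-30")
```

Because `prior` is built from the untouched `values` frame, there is no need to iterate backwards to avoid overwriting inputs.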


## Transpose
Now I'll transpose the dataframes so that each row is a timestamp and each column is a ZIP code.

In [18]:
queens_p = queens_p.set_index('RegionName').transpose()

In [19]:
queens = queens.set_index('RegionName').transpose()
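
For illustration, here is what `set_index` followed by `transpose` does to a small frame shaped like `queens_p` (the `wide` frame and its numbers are made up):

```python
import pandas as pd

# One row per ZIP code, one column per month (hypothetical values).
wide = pd.DataFrame({
    "RegionName": [11375, 11377],
    "2003-10-31": [0.64, 0.03],
    "2003-11-30": [0.91, 0.02],
})

# Move the ZIP code into the index, then flip rows and columns.
by_date = wide.set_index("RegionName").transpose()
```

After the transpose, the date labels form the index and the ZIP codes form the columns, so `by_date[11377]` is the full series for one ZIP code. The columns axis keeps the name `RegionName` until it is cleared.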

## Fix Datetime
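Let's parse the index as datetimes and reformat the labels as month/year strings. Note that `strftime` returns strings, so the final index is no longer a `DatetimeIndex`.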

In [20]:
queens_p.index = pd.DatetimeIndex(queens_p.index)
queens_p.index = queens_p.index.strftime('%m/%Y')
queens_p.columns.name = None

In [21]:
queens.index = pd.DatetimeIndex(queens.index)
queens.index = queens.index.strftime('%m/%Y')
queens.columns.name = None
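
A short sketch of what this conversion produces, using a few made-up date labels. The `to_period` variant at the end is an alternative that the notebook does not use; it is shown only for comparison:

```python
import pandas as pd

labels = ["2003-10-31", "2003-11-30", "2003-12-31", "2004-01-31"]

# Step 1: parse the string labels into real timestamps.
as_dates = pd.DatetimeIndex(labels)

# Step 2: format them as month/year. The result is an Index of strings.
as_strings = as_dates.strftime("%m/%Y")

# Alternative (not used in the notebook): monthly periods, which
# drop the day but still behave like dates.
as_periods = as_dates.to_period("M")

# Sorting the string labels alphabetically breaks chronological order.
sorted_strings = sorted(as_strings)
```

Here `sorted_strings` starts with `'01/2004'`, ahead of the 2003 months, so any later step that sorts or slices by date would need to parse the `%m/%Y` labels again.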

## Exporting 
Now let's export both dataframes.

In [22]:
with open('PKL/timeseries_queens_p.pkl', 'wb') as fp:
    pickle.dump(queens_p, fp)

In [23]:
with open('PKL/timeseries_queens.pkl', 'wb') as fp:
    pickle.dump(queens, fp)

In [24]:
queens_p.to_csv('DATA/timeseries_queens_p.csv', header=True)
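
To read these files back later, the exports can be reversed. A minimal round-trip sketch, writing a made-up frame to a temporary folder instead of `PKL/` and `DATA/` (the `frame` values and the file names are hypothetical):

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical frame shaped like the exported time series.
frame = pd.DataFrame({11375: [0.64, 0.91]}, index=["10/2003", "11/2003"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Pickle round trip: the frame comes back unchanged.
    pkl_path = os.path.join(tmp_dir, "timeseries_demo.pkl")
    with open(pkl_path, "wb") as fp:
        pickle.dump(frame, fp)
    with open(pkl_path, "rb") as fp:
        restored = pickle.load(fp)

    # CSV round trip: the first column is read back as the index.
    csv_path = os.path.join(tmp_dir, "timeseries_demo.csv")
    frame.to_csv(csv_path, header=True)
    from_csv = pd.read_csv(csv_path, index_col=0)
```

The pickle preserves the frame exactly, while the CSV stores everything as text: integer column labels such as `11375` come back as the string `'11375'`.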