#### Capstone Project
Arthur: Wilson Lau
Date: 2015 November

### Capstone Milestone Report

## Introduction
The problem I want to tackle in my capstone project is to detect investment opportunities for cash flow real estate property.  

## The Problem
Real estate investment is a passive income opportunity for investors to accumulate return without active management.  Huge long term appreciation is very much possible for real estate.  However, appreciation often happens in short period of time and the appreciation is heavily affected local market condition.  It is difficult for out-of-the-market investors to detect the trend and acquire properties before price goes up.  My capstone project is trying to help investors to detect real estate investment opportunities and be able to predict which zip code will experience long term appreciation in the near future. 

My project is divided into two stages:

# Stage One
- the first stage is to use anomaly detection technique to discover which zipcodes have success sprike of upward price appreciation. 
- The Anomaly detection is compared new monthly updated home price with historical price as well as comparing with the home price of the surrending areas based on neighborhoods, city, menpolian area, and state.

# Stage Two
- Based on the result of stage one, more data will be collected for the zipcodes that have sudden increase of home price.  
- Those new dataset will be used to figure out what feature contributes to the price appreciation and a predictive model will be created using those features. 

## Dateset
Dataset used is downloaded from Quandl's Zillow data.  This dataset is a monthly time series recording the change of home price. 

# What important fields and information does the data set have?
The important columns for this dataset is time(date end of the month), home price and zipcode. 



# What are its limitations i.e. what are some questions that you cannot answer with this data set?
This dataset actually have 22 columns.  For example, it has columns for different home size e.g. 1 bedroom, 2 bedroom, 3 bedroom, price per square foot, average listing price, etc...  Since it seems that most of these columns are highly corelate to home price, they may be very help for predicting home price. 

# What kind of cleaning and wrangling did you need to do?
Quandl's API is very powerful and very flexible.  At the same time, it is a bit complicated to pull the home price.  One way to pull home price is to pass zipcode as input parameter, and Quandl will return the historical price data with all the columns related to that zipcode.  So, to get the data, I will need to use a zipcode dataset that contains all the zipcode of the US and then use the Quandl API to pull data one zipcode at a time. 

# Are there other datasets you can find, use and combine with, to answer the questions that matter?
To order to group home price data by city, state, etc.., I will need to find other dataset that relates individual zipcodes with their city name and state name.  I will need to merge the dataset together in order to find out which zipcode belongs to which city or state. 

## Preliminary Findings and Hypothesis test.
.....






In [94]:
''' Problem: Try to find which zipcode and in which months has its price increase sigificantly higher than the mean price with
the same zipcode and also than the rest of the zipcode. '''

#load home price data into dataframe
import numpy as np
import pandas as pd
import pickle
from scipy import stats

#function for calculating moving average 
def movingaverage(interval, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(interval, window, 'same')



#load home price data from Datafiles
df = pickle.load( open( ".\DataPfiles\93063_to_93524.p", "rb" ) ) 

###pivot All Homes prirce to row and grouped them by zipcode
df = df.reset_index()
dfAllHomes = df.pivot(index='index', columns='ZipCode', values='All Homes')

### calcaluate the home price's moving average for every 3 months, with minimal of 3 months
dfAllHomesMovingAve = pd.rolling_mean(dfAllHomes,window=3,min_periods=3)

### shift entire set of Moving Average number downward for one month
### This is needed because I need to line up the previous month moving average with the current month moving average in the same row
dfAllHomesMovingAve = dfAllHomesMovingAve.shift(periods=1,freq=None,axis=0)

### since I shift the all rows for one month earlier, so I can just subtract the two dataframe and calculate the price change.
dfAllHomesDiffFromMovAve = dfAllHomes - dfAllHomesMovingAve
dfAllHomesStdDevInZipCode = dfAllHomesDiffFromMovAve.copy() #dataframe to hold z score within zipcode


##loop through each zipcode column
for x in dfAllHomesStdDevInZipCode.columns:
    if sum(dfAllHomesStdDevInZipCode[x].isnull()) > 0:  # if any month has null value, drop the entire zipcode
        dfAllHomesStdDevInZipCode.drop(x,axis=1)
    dfAllHomesStdDevInZipCode[x] = (dfAllHomesStdDevInZipCode[x] - dfAllHomesStdDevInZipCode[x].mean()) / dfAllHomesStdDevInZipCode[x].std()  #calculate z score for All Homes price each month

##Calculate z score comparing price across zipcode
dfAllHomesStdDevAcrossZipCode = dfAllHomesStdDevInZipCode.copy() #dataframe to hold z score across different zipcode
dfAllHomesStdDevAcrossZipCode['MeanHomePrice'] = dfAllHomesStdDevAcrossZipCode.mean(axis=1)
dfAllHomesStdDevAcrossZipCode['StdDevHomePrice'] = dfAllHomesStdDevAcrossZipCode.std(axis=1)
dfAllHomesStdDevAcrossZipCode = (dfAllHomesStdDevAcrossZipCode - dfAllHomesStdDevAcrossZipCode['MeanHomePrice']) / dfAllHomesStdDevAcrossZipCode['StdDevHomePrice']

dfAllHomesStdDevAcrossZipCode.drop('MeanHomePrice',axis=1)
dfAllHomesStdDevAcrossZipCode.drop('StdDevHomePrice',axis=1)

print dfAllHomesStdDevInZipCode.shape
print dfAllHomesStdDevAcrossZipCode.shape
print dfAllHomesStdDevAcrossZipCode.columns

dfAllHomesStdDevAcrossZipCode.tail()
# ((dfAllHomesStdDevInZipCode > 2) & (dfAllHomesStdDevAcrossZipCode > 2))
# dfAllHomesStdDevInZipCode[(dfAllHomesStdDevInZipCode > 2) & (dfAllHomesStdDevAcrossZipCode > 2)]

(231, 125)
(231, 127)
Index([u'93063', u'93065', u'93066', u'93067', u'93101', u'93103', u'93105',
       u'93108', u'93109', u'93110', 
       ...
       u'93512', u'93513', u'93514', u'93516', u'93517', u'93518', u'93519',
       u'93523', u'MeanHomePrice', u'StdDevHomePrice'],
      dtype='object', name=u'ZipCode', length=127)


ZipCode,93063,93065,93066,93067,93101,93103,93105,93108,93109,93110,...,93512,93513,93514,93516,93517,93518,93519,93523,MeanHomePrice,StdDevHomePrice
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-02-28,-0.844769,-0.542349,-1.702577,,-0.944722,1.078083,0.302272,-0.537323,0.937375,-1.396686,...,,,,,,,,,0,0.298068
2015-03-31,-0.789081,-0.255822,-1.716821,,-1.344337,-0.068684,0.535508,-0.932992,1.897178,-0.644438,...,,,,,,,,,0,0.191029
2015-04-30,-0.643006,-0.036051,-1.795146,,-1.668408,-0.858257,0.548202,-1.304778,2.174798,0.149499,...,,,,,,,,,0,0.420017
2015-05-31,-0.712149,-0.32266,-1.927069,,-1.572385,-0.976485,0.592327,-1.384341,2.317391,1.147357,...,,,,,,,,,0,0.907535
2015-06-30,-0.601716,-0.781055,-2.106352,,-1.491822,-0.43255,0.529372,-1.888489,1.896495,0.732294,...,,,,,,,,,0,1.060436


In [1]:
### This cell as well as the cell after this one are just my playground to play with different functions.  This is not part of the project. 


from pylab import plot, ylim, xlim, show, xlabel, ylabel
from numpy import linspace, loadtxt
import numpy as np

r=3.0

x = p.head()
y = pz

def movingaverage(interval, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(interval, window, 'same')

# plot(x,y)
# xlim(0,1000)

x_av = movingaverage(x, r)
# plot(x_av, y)

# xlabel("Months since Jan 1749.")
# ylabel("No. of Sun spots")
# show()
print x_av

p = df11.iloc[:,0]  #ALl Homes price
pz = df11.iloc[:,-1] 
print pz


t1 = df.iloc[:,1] 
t2 = df3.iloc[:,0]
print "mean is % 4.3F and sd is % 4.3F " % (t1.mean(),t1.std
                                            
                                            # add a new column for moving average of All Homes price


p = df11.iloc[:,0] #ALl Homes price
pz = df11.iloc[:,-1] 


window_size = 3.0 #set the number of sample to gathering centered in the middle
movingave = lambda x: np.convolve(x, np.ones(int(window_size))/float(window_size), 'same')
transformed = df11.groupby('ZipCode')
transformed['All Homes'].transform(movingave)
# df11.info()

# df11 = df11[df11['All Homes'].isnull()]
grouped = df11[['All Homes','ZipCode']].groupby('ZipCode')
b = pd.DataFrame()
newdf = pd.DataFrame()
for name,group in grouped:
    g = group.copy()
#     print group.shape
#     print "size: % 3.2F" % movingaverage(group['All Homes'],3).size
#     if sum(g.isnull()) < 0:
#         g['Moving Ave'] = movingaverage(group['All Homes'],3)

#     print g.shape
#     print g.head()

    #     if(newdf.isnull):
#         newdf = g
#     newdf = newdf.append(g)
#     b = group['All Homes']
#     a = movingaverage(group['All Homes'],3)
    
#     group["MovingAve"] = np.convolve(group['All Homes'], np.ones(int(window_size))/float(window_size), 'same')
#     newdf = newdf.append(group)
# print b
# newdf
# newdf.info()
    
    


# df11['Moving Ave'] = movingaverage(df11.iloc[:,0], 3)  #para#1 is All Homes price; para#2 is numer of sample to gather centered in the middle

# df11.head()
x                                             

SyntaxError: invalid syntax (<ipython-input-1-2f2d3327462f>, line 40)

In [57]:
np.sum(dfAllHomesStdDevAcrossZipCode > 2)


ZipCode
93063               0
93065               1
93066              16
93067               0
93101               3
93103               6
93105               3
93108              20
93109              10
93110               7
93111               2
93117               7
93201               0
93202               7
93203               0
93204              11
93205               0
93206               0
93207               0
93208               0
93210               0
93212               0
93215               3
93218               0
93219              16
93221               4
93222              10
93223               2
93224               0
93225               4
                   ..
93442               2
93444               1
93445               5
93446               5
93449              17
93450               0
93451              14
93452               0
93453              11
93454               0
93455               1
93458               2
93460              12
93461               0
93

In [21]:
np.average(dfAllHomes['93063'][(-1-y):-1])

483400.0

In [48]:
import pandas as pd
dfAllHomesDiffFromMovAve

ZipCode,93063,93065,93066,93067,93101,93103,93105,93108,93109,93110,...,93505,93510,93512,93513,93514,93516,93517,93518,93519,93523
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1996-04-30,,,,,,,,,,,...,,,,,,,,,,
1996-05-31,,,,,,,,,,,...,,,,,,,,,,
1996-06-30,,,,,,,,,,,...,,,,,,,,,,
1996-07-31,100.000000,-200.000000,-11100.000000,,-1766.666667,-8666.666667,3466.666667,,-66.666667,-6100.000000,...,-533.333333,-8866.666667,,,,,,,,
1996-08-31,1000.000000,33.333333,-2433.333333,,-133.333333,-4200.000000,6100.000000,,800.000000,-3266.666667,...,-200.000000,-5833.333333,,,,,,,,
1996-09-30,733.333333,566.666667,3866.666667,,1800.000000,2400.000000,6666.666667,,1466.666667,400.000000,...,-766.666667,-2666.666667,,,,,,,,
1996-10-31,-200.000000,600.000000,2000.000000,,3466.666667,7200.000000,6233.333333,,1800.000000,1533.333333,...,-1000.000000,-533.333333,,,,,,,,
1996-11-30,-600.000000,300.000000,-1166.666667,,3466.666667,7666.666667,4966.666667,,-166.666667,-1066.666667,...,-866.666667,833.333333,,,,,,,,
1996-12-31,-566.666667,466.666667,-66.666667,,1900.000000,4200.000000,2700.000000,,-2133.333333,-5300.000000,...,-566.666667,1466.666667,,,,,,,,
1997-01-31,-433.333333,866.666667,1200.000000,,700.000000,766.666667,1533.333333,,-1500.000000,-6566.666667,...,-733.333333,533.333333,,,,,,,,


In [47]:
print dfAllHomesStdDevAcrossZipCode.shape
dfAllHomesStdDevAcrossZipCode
# np.sum(dfAllHomesStdDevAcrossZipCode > 2)

(231, 127)


ZipCode,93063,93065,93066,93067,93101,93103,93105,93108,93109,93110,...,93512,93513,93514,93516,93517,93518,93519,93523,MeanHomePrice,StdDevHomePrice
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1996-04-30,,,,,,,,,,,...,,,,,,,,,,
1996-05-31,,,,,,,,,,,...,,,,,,,,,,
1996-06-30,,,,,,,,,,,...,,,,,,,,,,
1996-07-31,0.002422,-0.084785,-3.253319,,-0.540201,-2.545970,0.981083,,-0.046026,-1.799863,...,,,,,,,,,0,0.973353
1996-08-31,0.141966,-0.189977,-1.037005,,-0.247209,-1.643660,1.893254,,0.073288,-1.323163,...,,,,,,,,,0,0.798576
1996-09-30,-0.072360,-0.144320,1.280493,,0.388186,0.647242,2.489425,,0.244265,-0.216281,...,,,,,,,,,0,0.611015
1996-10-31,-0.550377,-0.107203,0.668352,,1.480838,3.548984,3.013482,,0.557558,0.409834,...,,,,,,,,,0,0.560416
1996-11-30,-0.657150,-0.108867,-1.002365,,1.820276,4.378930,2.734081,,-0.393162,-0.941444,...,,,,,,,,,0,0.708372
1996-12-31,-0.485677,0.235568,-0.136688,,1.236004,2.841356,1.794387,,-1.579177,-3.789444,...,,,,,,,,,0,0.909844
1997-01-31,-0.366208,0.551109,0.786318,,0.433504,0.480546,1.021528,,-1.118878,-4.694062,...,,,,,,,,,0,0.939564


In [51]:
dfAllHomesDiffFromMovAve.mean(axis =1 )

index
1996-04-30             NaN
1996-05-31             NaN
1996-06-30             NaN
1996-07-31       91.666667
1996-08-31      586.574074
1996-09-30      900.925926
1996-10-31      793.518519
1996-11-30      478.703704
1996-12-31      129.166667
1997-01-31       85.648148
1997-02-28      405.555556
1997-03-31      780.092593
1997-04-30      760.648148
1997-05-31      875.877193
1997-06-30      813.157895
1997-07-31      971.491228
1997-08-31      860.526316
1997-09-30      707.456140
1997-10-31      701.315789
1997-11-30      964.935065
1997-12-31     1424.675325
1998-01-31     2119.480519
1998-02-28     2412.554113
1998-03-31     1835.064935
1998-04-30     2538.961039
1998-05-31     1176.666667
1998-06-30     1123.333333
1998-07-31     1037.777778
1998-08-31      933.333333
1998-09-30     1007.777778
                  ...     
2013-01-31     6131.034483
2013-02-28     6220.689655
2013-03-31     6819.540230
2013-04-30     7923.371648
2013-05-31     9233.333333
2013-06-30    10330.26

In [67]:
dfAllHomesDiffFromMovAve.std(axis = 1)

index
1996-04-30             NaN
1996-05-31             NaN
1996-06-30             NaN
1996-07-31     3464.217942
1996-08-31     2932.579180
1996-09-30     2332.347023
1996-10-31     1817.827203
1996-11-30     1653.007934
1996-12-31     1442.762277
1997-01-31     1427.121888
1997-02-28     1391.973414
1997-03-31     1408.032620
1997-04-30     1622.073157
1997-05-31     2654.676370
1997-06-30     2697.252808
1997-07-31     3186.660641
1997-08-31     3585.095528
1997-09-30     3629.339696
1997-10-31     3964.921897
1997-11-30     3951.352186
1997-12-31     3483.127261
1998-01-31     3688.560436
1998-02-28     3728.497961
1998-03-31     3325.169371
1998-04-30     4241.060094
1998-05-31     1650.801500
1998-06-30     2190.962804
1998-07-31     2559.449973
1998-08-31     2060.614051
1998-09-30     1683.858080
                  ...     
2013-01-31     9600.902726
2013-02-28     9834.221798
2013-03-31    10534.479566
2013-04-30    12485.185138
2013-05-31    15415.985529
2013-06-30    16315.11

In [70]:
True & False

False