#### Capstone Project
Author: Wilson Lau
Date: 2015 November

### Capstone Milestone Report

## Introduction
In my capstone project, I want to detect investment opportunities in cash-flow real estate properties.

## The Problem
Real estate investment is a passive income opportunity: investors can accumulate returns without active management. Substantial long-term appreciation is possible, but it often happens within a short period of time and is heavily affected by local market conditions. This makes it difficult for investors outside a market to detect the trend and acquire properties before prices go up. My capstone project aims to help investors detect real estate investment opportunities and predict which zipcodes will experience long-term appreciation in the near future.

My project is divided into two stages:

# Stage One
- The first stage uses anomaly detection to discover which zipcodes show a sudden upward spike in home price.
- The anomaly detection compares each newly updated monthly home price with the zipcode's historical prices, and with home prices in the surrounding areas at the neighborhood, city, metropolitan area, and state level.
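
A minimal sketch of this two-sided comparison, assuming a table of monthly fractional price changes with one row per month and one column per zipcode. The function name `flag_price_spikes`, the threshold of 1.5, and the toy numbers are illustrative assumptions, not the project's final method:

```python
import pandas as pd

def flag_price_spikes(pct_change, threshold=1.5):
    """Flag (month, zipcode) cells that are unusual in two ways at once.

    pct_change: DataFrame, rows = month-end dates, columns = zipcodes,
    values = fractional price change for that month (assumed input shape).
    """
    # z score of each month against the same zipcode's own history
    within = (pct_change - pct_change.mean()) / pct_change.std()
    # z score of each zipcode against all zipcodes in the same month
    across = pct_change.sub(pct_change.mean(axis=1), axis=0).div(
        pct_change.std(axis=1), axis=0)
    return (within > threshold) & (across > threshold)

# Made-up data: zipcode 90003 jumps in the last month, the others stay flat.
months = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31",
                         "2015-04-30", "2015-05-31", "2015-06-30"])
changes = pd.DataFrame({
    "90001": [0.010, 0.012, 0.011, 0.009, 0.010, 0.011],
    "90002": [0.008, 0.010, 0.009, 0.011, 0.010, 0.009],
    "90003": [0.010, 0.011, 0.009, 0.010, 0.011, 0.080],
    "90004": [0.011, 0.009, 0.010, 0.011, 0.009, 0.010],
    "90005": [0.009, 0.010, 0.012, 0.010, 0.011, 0.012],
}, index=months)

flags = flag_price_spikes(changes)
print(flags[flags.any(axis=1)])
```

Requiring both z scores to exceed the threshold filters out months where every zipcode rose together, since those are high against history but not against the neighbors.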

# Stage Two
- Based on the results of stage one, more data will be collected for the zipcodes that show a sudden increase in home price.
- Those new datasets will be used to figure out which features contribute to the price appreciation, and a predictive model will be built using those features.

## Dataset
The dataset is downloaded from Quandl's Zillow data. It is a monthly time series recording the change in home prices.

# What important fields and information does the data set have?
The important columns in this dataset are time (the month-end date), home price, and zipcode.



# What are its limitations i.e. what are some questions that you cannot answer with this data set?
This dataset actually has 22 columns. For example, it has columns for different home sizes (1 bedroom, 2 bedroom, 3 bedroom), price per square foot, average listing price, etc. Since most of these columns appear to be highly correlated with home price, they may be very helpful for predicting home price.
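
One way to check how strongly the other columns track home price is a correlation table. The sketch below uses made-up numbers, and the column names (including the hypothetical `Listing Count`) are assumptions for illustration only:

```python
import pandas as pd

# Made-up monthly values for one zipcode; column names are illustrative.
zip_prices = pd.DataFrame({
    "All Homes": [400.0, 410.0, 425.0, 440.0, 460.0],
    "2 Bedroom": [300.0, 307.5, 318.75, 330.0, 345.0],  # moves in step with All Homes
    "Listing Count": [52.0, 47.0, 55.0, 49.0, 51.0],    # hypothetical unrelated column
})

# Pearson correlation of every other column with the overall home price
corr_with_price = zip_prices.corr()["All Homes"].drop("All Homes")

# Columns whose absolute correlation exceeds an arbitrary 0.9 cut-off
highly_correlated = corr_with_price[corr_with_price.abs() > 0.9].index.tolist()
print(corr_with_price)
print(highly_correlated)
```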

# What kind of cleaning and wrangling did you need to do?
Quandl's API is powerful and flexible, but pulling home prices with it is a bit complicated. One way is to pass a zipcode as the input parameter; Quandl then returns the historical price data, with all of its columns, for that zipcode. So, to get the data, I need a zipcode dataset that contains all US zipcodes, and then I use the Quandl API to pull the data one zipcode at a time.
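
The one-zipcode-at-a-time loop could look roughly like the sketch below. The actual Quandl call is deliberately left out: `fetch` is a placeholder for whatever function wraps the API request, and `collect_by_zip` and `fake_fetch` are hypothetical names used only for this illustration:

```python
import pandas as pd

def collect_by_zip(zipcodes, fetch):
    """Call fetch(zipcode) for every zipcode and stack the results.

    fetch is a placeholder for the real API wrapper. It is assumed to
    return a DataFrame indexed by month-end date, or None when there is
    no data for that zipcode.
    """
    frames = []
    for zipcode in zipcodes:
        data = fetch(zipcode)
        if data is None or data.empty:
            continue  # skip zipcodes that return no data
        data = data.copy()
        data["ZipCode"] = zipcode  # remember which zipcode each row came from
        frames.append(data)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, axis=0)

def fake_fetch(zipcode):
    """Stand-in for the real API call; returns made-up prices."""
    if zipcode == "00000":
        return None
    dates = pd.to_datetime(["2015-09-30", "2015-10-31"])
    return pd.DataFrame({"All Homes": [500000.0, 505000.0]}, index=dates)

prices = collect_by_zip(["90001", "00000", "90002"], fake_fetch)
print(prices)
```

Passing the fetch function in as an argument keeps the loop testable without network access, and makes it easy to add retries or rate limiting around the real call later.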

# Are there other datasets you can find, use and combine with, to answer the questions that matter?
In order to group home price data by city, state, etc., I will need to find another dataset that relates individual zipcodes to their city and state names. I will then merge the datasets to find out which zipcode belongs to which city or state.
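
A minimal sketch of that merge with pandas, assuming a lookup table with `ZipCode`, `City`, and `State` columns. The lookup table layout and all values here are made up for illustration:

```python
import pandas as pd

# Made-up price rows; zipcodes are kept as strings so leading zeros survive.
prices = pd.DataFrame({
    "ZipCode": ["90001", "90002", "94102", "99999"],
    "All Homes": [400000.0, 380000.0, 1100000.0, 250000.0],
})

# Hypothetical lookup table relating each zipcode to a city and state.
zip_lookup = pd.DataFrame({
    "ZipCode": ["90001", "90002", "94102"],
    "City": ["Los Angeles", "Los Angeles", "San Francisco"],
    "State": ["CA", "CA", "CA"],
})

# A left join keeps every price row; indicator=True marks rows with no match.
merged = prices.merge(zip_lookup, on="ZipCode", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "ZipCode"].tolist()

# Average home price per city, using only the rows that found a city.
city_mean = merged.dropna(subset=["City"]).groupby("City")["All Homes"].mean()
print(unmatched)
print(city_mean)
```

Checking `unmatched` before grouping shows how many zipcodes the lookup table fails to cover, which matters if the two sources were compiled at different times.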

## Preliminary Findings and Hypothesis test.
.....






In [570]:
''' Problem: find which zipcodes, and in which months, have a price increase significantly higher than the mean for
the same zipcode and also higher than the rest of the zipcodes. '''

#load home price data into dataframe
import numpy as np
import pandas as pd
import pickle
from scipy import stats
import os
import glob
from os import path
import dateUtility
import datetime as dt
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_row', 1000)

def trim_fraction(text):
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text

#load home price data from Datafiles
# df = pickle.load( open( ".\DataPfiles\93063_to_93524.p", "rb" ) ) 
df = pd.DataFrame()
for filename in glob.glob(os.path.join('.\\DataPfiles\\', '9*.p')):
    df = pd.concat([df, pd.read_pickle(filename)], axis=0)
# there may be a problem here.  There are only about 1700 zipcode in the .p file although CA has over 2700 zipcode

### Pivot individual columns into new dataframes for analysis:
### put 1) All Homes price and 2) Price-To-Rent Ratio into one row per month and one column per zipcode
df = df.reset_index()
dfAllHomes = df.pivot(index='index', columns='ZipCode', values='All Homes')
dfPriceToRent = df.pivot(index='index', columns='ZipCode', values='Price-to-Rent Ratio')  #I don't make use of this feature now,but will do later

### calculate the home price's 3-month moving average (all 3 months must be present)
dfAllHomesMovingAve = dfAllHomes.rolling(window=3,min_periods=3).mean()

### shift entire set of Moving Average number downward for one month
### This is needed to line up the previous month moving average with the current month moving average in the same row
dfAllHomesMovingAve = dfAllHomesMovingAve.shift(periods=1,freq=None,axis=0)

### the moving average was shifted down one month, so subtracting the two dataframes gives each month's price change from the previous 3-month average
dfAllHomesDiffFromMovAve = dfAllHomes - dfAllHomesMovingAve #price change
dfAllHomesDiffFromMovAvePercent = (dfAllHomes - dfAllHomesMovingAve)/dfAllHomes.shift(periods=1,freq=None,axis=0) #price change as a fraction of the previous month's price
NumOfMonthForward = 6 #number of months to include when calculating the forward-looking average percentage increase
dfAllHomesDiffFromMovAvePercentMovAve = dfAllHomesDiffFromMovAvePercent.rolling(window=NumOfMonthForward,min_periods=NumOfMonthForward).mean() # 6-month moving average of the percentage increase
dfAllHomesDiffFromMovAvePercentMovAve2 = dfAllHomesDiffFromMovAvePercentMovAve.shift(periods=-(NumOfMonthForward-1),freq=None,axis=0) # shift the dataframe up 5 rows so the current month shows the average percentage increase over the next 6 months (including the current month)
dfAllHomesStdDevInZipCode = dfAllHomesDiffFromMovAvePercentMovAve2.copy() #dataframe to hold z score within zipcode

##calculate the z score of each month within each zipcode column
##(mean() and std() skip null months, so zipcodes with missing months are kept rather than dropped)
dfAllHomesStdDevInZipCode = (dfAllHomesStdDevInZipCode - dfAllHomesStdDevInZipCode.mean()) / dfAllHomesStdDevInZipCode.std()

##Calculate z score comparing price change across zipcodes within the same month
dfAllHomesStdDevAcrossZipCode = dfAllHomesDiffFromMovAvePercent.copy() #dataframe to hold z score across different zipcode
rowMean = dfAllHomesStdDevAcrossZipCode.mean(axis=1) #mean over all zipcodes for each month
rowStd = dfAllHomesStdDevAcrossZipCode.std(axis=1) #standard deviation over all zipcodes for each month
##subtract and divide row by row (axis=0) so that each month uses its own mean and standard deviation
dfAllHomesStdDevAcrossZipCode = dfAllHomesStdDevAcrossZipCode.sub(rowMean,axis=0).div(rowStd,axis=0)

##Anomaly detection method ONE
##Find which zipcodes and months have a z score larger than targetZscore in both the df..InZipCode and df..AcrossZipCode dataframes
targetZscore = 1.5
targetZipcodeBoolean = ((dfAllHomesStdDevInZipCode > targetZscore) & (dfAllHomesStdDevAcrossZipCode > targetZscore))
# targetZipCodes = dfAllHomesMovingAve[targetZipcodeBoolean]#.dropna
selectedZipCodes = pd.DataFrame(dfAllHomesMovingAve[targetZipcodeBoolean].sum(axis = 0) > 0)  #find out which zipcodes are over target zscore
selectedZipCodes[selectedZipCodes[0] == True] #extract only the zipcode that are over zscore

### pivot a new dataframe moving targeted zipcodes into index joining the month-end time (multi-level) index
TargetZipCode = pd.DataFrame(targetZipcodeBoolean.stack())  # stack zipcode from column names into column
TargetZipCode.columns = ['PredictZipCode'] # specify column name
TargetZipCode = pd.DataFrame(TargetZipCode.reset_index()) # move index into column
dfPredictZipCode = pd.merge(df,TargetZipCode,on=['index','ZipCode']) #merge targeted zipcode column for prediction into main dataframe(which is loaded from 9*.p files)
dfPredictZipCode.rename(columns={'index':'Month'}, inplace=True)
dfPredictZipCode['Month'] = dfPredictZipCode['Month'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
dfPredictZipCode.set_index(['Month','ZipCode'],inplace=True)

### Convert dfPredictZipCode.PredictZipCode from boolean to float, so False = 0.0 and True = 1.0
dfPredictZipCode['PredictZipCode'] = dfPredictZipCode['PredictZipCode'].astype(float)
print(sum(dfPredictZipCode.PredictZipCode))


### Clean dfPredictZipCode by turning 'nan' (string), ' ' (space) and 0 cells into np.nan, then fill those cells from neighbouring rows
# find cells holding 'nan', ' ' or 0 and turn them into np.nan
for x in dfPredictZipCode.columns:
    if x == 'PredictZipCode':  # label column (last): 0 is a valid value here, so stop
        print("found predictZipCode")
        break
    nalist = dfPredictZipCode[x].isin(['nan',' ',0])
    dfPredictZipCode.loc[nalist,x] = np.nan

# back-fill missing values from the following rows, then forward-fill whatever is still missing
# note: rows are filled in dataframe order, so a fill can carry a value across a zipcode boundary
dfPredictZipCode = dfPredictZipCode.reset_index()
dfPredictZipCode = dfPredictZipCode.bfill().ffill()
dfPredictZipCode = dfPredictZipCode.set_index(['Month','ZipCode'])
dfPredictZipCode.head()

#check whether there are still nan-like values in the dataframe (counts are per column)
print("Predict Zip:")
print(sum(dfPredictZipCode.PredictZipCode))
print("nan:")
print(sum(dfPredictZipCode.isin(['nan',' ',0]).values))
print("null value:")
print(sum(dfPredictZipCode.isnull().values))

### tasks to do
### process IRS dataset - DONE
### process Census dataset - DONE
### process proximityone dataset
### apply training model to learn how IRS and Census dataset can predict target zipcodes - DONE
### evaluate and test model by breaking down into training data and test data - DONE
### apply feature scaling to dataset
### use PCA to combine some features e.g. I think I can combine all the home price for different number of bedrooms into one feature
### modify withinZipcode detection not to use z score, but just average monthly price increase over a certain percentage
### use F1 score and some other ways to evaluate the performance of the model
### apply visualization on data; add Google maps to show the results
### rewrite Storytelling





3095.0
found predictZipCode
Predict Zip:
3095.0
nan:
[     0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0 305422]
null value:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
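
The moving-average and shift steps in the cell above are easy to get off by one month, so it can help to check them on a toy series. This is a minimal sketch with made-up prices and a 2-month forward window (the cell above uses 6); the variable names are illustrative:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 104.0, 106.0, 120.0, 130.0, 131.0, 132.0])

# Trailing 3-month average, shifted down one row so that each row holds
# the average of the three months before it.
prev_avg = prices.rolling(window=3, min_periods=3).mean().shift(1)

# Change from that average, as a fraction of the previous month's price.
pct = (prices - prev_avg) / prices.shift(1)

# Average of this month's and next month's change: compute a trailing
# window, then shift it up so the window starts at the current row.
months_forward = 2
fwd = pct.rolling(window=months_forward, min_periods=months_forward).mean()
fwd = fwd.shift(-(months_forward - 1))

print(pd.DataFrame({"price": prices, "prev_avg": prev_avg,
                    "pct": pct, "fwd": fwd}))
```

The first three rows and the last row of `fwd` are empty: the first rows lack a full trailing window, and the last row has no following month to average with.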


In [197]:
# print "nan:"
# print sum(dfAllHomesMovingAve.isin(['nan',' ',0]).values)
# print "null value:"
# print sum(dfAllHomesMovingAve.isnull().values)
# sum(TargetZipCode.PredictZipCode)
dfPredictZipCode1 = pd.DataFrame(dfPredictZipCode.PredictZipCode)
dfPredictZipCode1.unstack(level=1)

Unnamed: 0_level_0,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,P
redictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,Predic
tZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipC
ode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode,PredictZipCode
ZipCode,90001,90002,90003,90004,90005,90006,90007,90008,90010,90011,90012,90013,90014,90015,90016,90017,90018,90019,90020,90021,90022,90023,90024,90025,90026,90027,90028,90029,90031,90032,90033,90034,90035,90036,90037,90038,90039,90040,90041,90042,90043,90044,90045,90046,90047,90048,90049,90056,90057,90058,90059,90061,90062,90063,90064,90065,90066,90067,90068,90069,90077,90089,90094,90201,90210,90211,90212,90220,90221,90222,90230,90232,90240,90241,90242,90245,90247,90248,90249,90250,90254,90255,90260,90262,90265,90266,90270,90272,90274,90275,90277,90278,90280,90290,90291,90292,90293,90301,90302,90303,90304,90305,90401,90402,90403,90404,90405,90501,90502,90503,90504,90505,90601,90602,90603,90604,90605,90606,90620,90621,90623,90630,90631,90638,90640,90650,90660,90670,90680,90701,90703,90704,90706,90710,90712,90713,90715,90716,90717,90720,90723,90731,90732,90740,90742,90743,90744,90745,90746,90755,90802,90803,90804,90805,90806,90807,90808,90810,90813,90814,90815,91001,91006,91007,91008,91010,91011,91016,91020,91024,91030,91040,91042,91101,91103,91104,91105,91106,91107,91108,91201,91202,91203,91204,91205,91206,91207,91208,91210,91214,91301,91302,91303,91304,91306,91307,91311,91316,91320,91321,91324,91325,91326,91331,91335,91340,91342,91343,91344,91345,91350,91351,91352,91354,91355,91356,91360,91361,91362,91364,91367,91377,91384,91387,91390,91401,91402,91403,91405,91406,91411,91423,91436,91501,91502,91504,91505,91506,91601,91602,91604,91605,91606,91607,91701,91702,91706,91708,91709,91710,...,95689,95690,95691,95692,95693,95694,95695,95697,95699,95701,95703,95709,95713,95714,95715,95717,95720,95721,95722,95724,95726,95728,95735,95736,95742,95746,95747,95757,95758,95762,95765,95776,95798,95799,95811,95813,95814,95815,95816,95817,95818,95819,95820,95821,95822,95823,95824,95825,95826,95827,95828,95829,95830,95831,95832,95833,95834,95835,95837,95838,95840,95841,95842,95843,95852,95864,95865,95866,95901,95912,95913,95914,95916,95917,95918,95919,95920,95922,95923,95924,95925,95
926,95928,95930,95931,95932,95934,95935,95936,95937,95938,95939,95940,95941,95942,95944,95945,95946,95947,95948,95949,95950,95951,95953,95954,95955,95956,95957,95959,95960,95961,95962,95963,95965,95966,95968,95969,95971,95973,95974,95975,95977,95978,95979,95981,95982,95983,95984,95987,95988,95991,95992,95993,96001,96002,96003,96006,96007,96008,96009,96010,96013,96015,96016,96017,96019,96020,96021,96022,96023,96024,96025,96027,96028,96029,96032,96033,96034,96035,96037,96038,96039,96040,96041,96044,96045,96046,96047,96048,96049,96050,96051,96052,96055,96056,96057,96058,96059,96061,96062,96063,96064,96065,96067,96069,96071,96073,96074,96075,96076,96080,96084,96086,96087,96088,96089,96090,96091,96092,96093,96094,96095,96096,96097,96101,96103,96104,96105,96106,96107,96108,96109,96110,96111,96112,96113,96114,96116,96117,96118,96119,96120,96121,96122,96123,96124,96125,96126,96127,96128,96129,96130,96133,96134,96135,96136,96137,96140,96141,96142,96143,96145,96146,96148,96150,96151,96152,96160,96161,96162
Month,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2,Unnamed: 61_level_2,Unnamed: 62_level_2,Unnamed: 63_level_2,Unnamed: 64_level_2,Unnamed: 65_level_2,Unnamed: 66_level_2,Unnamed: 67_level_2,Unnamed: 68_level_2,Unnamed: 69_level_2,Unnamed: 70_level_2,Unnamed: 71_level_2,Unnamed: 72_level_2,Unnamed: 73_level_2,Unnamed: 74_level_2,Unnamed: 75_level_2,Unnamed: 76_level_2,Unnamed: 77_level_2,Unnamed: 78_level_2,Unnamed: 79_level_2,Unnamed: 80_level_2,Unnamed: 81_level_2,Unnamed: 82_level_2,Unnamed: 83_level_2,Unnamed: 84_level_2,Unnamed: 85_level_2,Unnamed: 86_level_2,Unnamed: 87_level_2,Unnamed: 88_level_2,Unnamed: 89_level_2,Unnamed: 90_level_2,Unnamed: 91_level_2,Unnamed: 92_level_2,Unnamed: 93_level_2,Unnamed: 94_level_2,Unnamed: 95_level_2,Unnamed: 96_level_2,Unnamed: 97_level_2,Unnamed: 98_level_2,Unnamed: 99_level_2,Unnamed: 
100_level_2,Unnamed: 101_level_2,Unnamed: 102_level_2,Unnamed: 103_level_2,Unnamed: 104_level_2,Unnamed: 105_level_2,Unnamed: 106_level_2,Unnamed: 107_level_2,Unnamed: 108_level_2,Unnamed: 109_level_2,Unnamed: 110_level_2,Unnamed: 111_level_2,Unnamed: 112_level_2,Unnamed: 113_level_2,Unnamed: 114_level_2,Unnamed: 115_level_2,Unnamed: 116_level_2,Unnamed: 117_level_2,Unnamed: 118_level_2,Unnamed: 119_level_2,Unnamed: 120_level_2,Unnamed: 121_level_2,Unnamed: 122_level_2,Unnamed: 123_level_2,Unnamed: 124_level_2,Unnamed: 125_level_2,Unnamed: 126_level_2,Unnamed: 127_level_2,Unnamed: 128_level_2,Unnamed: 129_level_2,Unnamed: 130_level_2,Unnamed: 131_level_2,Unnamed: 132_level_2,Unnamed: 133_level_2,Unnamed: 134_level_2,Unnamed: 135_level_2,Unnamed: 136_level_2,Unnamed: 137_level_2,Unnamed: 138_level_2,Unnamed: 139_level_2,Unnamed: 140_level_2,Unnamed: 141_level_2,Unnamed: 142_level_2,Unnamed: 143_level_2,Unnamed: 144_level_2,Unnamed: 145_level_2,Unnamed: 146_level_2,Unnamed: 147_level_2,Unnamed: 148_level_2,Unnamed: 149_level_2,Unnamed: 150_level_2,Unnamed: 151_level_2,Unnamed: 152_level_2,Unnamed: 153_level_2,Unnamed: 154_level_2,Unnamed: 155_level_2,Unnamed: 156_level_2,Unnamed: 157_level_2,Unnamed: 158_level_2,Unnamed: 159_level_2,Unnamed: 160_level_2,Unnamed: 161_level_2,Unnamed: 162_level_2,Unnamed: 163_level_2,Unnamed: 164_level_2,Unnamed: 165_level_2,Unnamed: 166_level_2,Unnamed: 167_level_2,Unnamed: 168_level_2,Unnamed: 169_level_2,Unnamed: 170_level_2,Unnamed: 171_level_2,Unnamed: 172_level_2,Unnamed: 173_level_2,Unnamed: 174_level_2,Unnamed: 175_level_2,Unnamed: 176_level_2,Unnamed: 177_level_2,Unnamed: 178_level_2,Unnamed: 179_level_2,Unnamed: 180_level_2,Unnamed: 181_level_2,Unnamed: 182_level_2,Unnamed: 183_level_2,Unnamed: 184_level_2,Unnamed: 185_level_2,Unnamed: 186_level_2,Unnamed: 187_level_2,Unnamed: 188_level_2,Unnamed: 189_level_2,Unnamed: 190_level_2,Unnamed: 191_level_2,Unnamed: 192_level_2,Unnamed: 193_level_2,Unnamed: 194_level_2,Unnamed: 
195_level_2,Unnamed: 196_level_2,Unnamed: 197_level_2,Unnamed: 198_level_2,Unnamed: 199_level_2,Unnamed: 200_level_2,Unnamed: 201_level_2,Unnamed: 202_level_2,Unnamed: 203_level_2,Unnamed: 204_level_2,Unnamed: 205_level_2,Unnamed: 206_level_2,Unnamed: 207_level_2,Unnamed: 208_level_2,Unnamed: 209_level_2,Unnamed: 210_level_2,Unnamed: 211_level_2,Unnamed: 212_level_2,Unnamed: 213_level_2,Unnamed: 214_level_2,Unnamed: 215_level_2,Unnamed: 216_level_2,Unnamed: 217_level_2,Unnamed: 218_level_2,Unnamed: 219_level_2,Unnamed: 220_level_2,Unnamed: 221_level_2,Unnamed: 222_level_2,Unnamed: 223_level_2,Unnamed: 224_level_2,Unnamed: 225_level_2,Unnamed: 226_level_2,Unnamed: 227_level_2,Unnamed: 228_level_2,Unnamed: 229_level_2,Unnamed: 230_level_2,Unnamed: 231_level_2,Unnamed: 232_level_2,Unnamed: 233_level_2,Unnamed: 234_level_2,Unnamed: 235_level_2,Unnamed: 236_level_2,Unnamed: 237_level_2,Unnamed: 238_level_2,Unnamed: 239_level_2,Unnamed: 240_level_2,Unnamed: 241_level_2,Unnamed: 242_level_2,Unnamed: 243_level_2,Unnamed: 244_level_2,Unnamed: 245_level_2,Unnamed: 246_level_2,Unnamed: 247_level_2,Unnamed: 248_level_2,Unnamed: 249_level_2,Unnamed: 250_level_2,Unnamed: 251_level_2,Unnamed: 252_level_2,Unnamed: 253_level_2,Unnamed: 254_level_2,Unnamed: 255_level_2,Unnamed: 256_level_2,Unnamed: 257_level_2,Unnamed: 258_level_2,Unnamed: 259_level_2,Unnamed: 260_level_2,Unnamed: 261_level_2,Unnamed: 262_level_2,Unnamed: 263_level_2,Unnamed: 264_level_2,Unnamed: 265_level_2,Unnamed: 266_level_2,Unnamed: 267_level_2,Unnamed: 268_level_2,Unnamed: 269_level_2,Unnamed: 270_level_2,Unnamed: 271_level_2,Unnamed: 272_level_2,Unnamed: 273_level_2,Unnamed: 274_level_2,Unnamed: 275_level_2,Unnamed: 276_level_2,Unnamed: 277_level_2,Unnamed: 278_level_2,Unnamed: 279_level_2,Unnamed: 280_level_2,Unnamed: 281_level_2,Unnamed: 282_level_2,Unnamed: 283_level_2,Unnamed: 284_level_2,Unnamed: 285_level_2,Unnamed: 286_level_2,Unnamed: 287_level_2,Unnamed: 288_level_2,Unnamed: 289_level_2,Unnamed: 
290_level_2,Unnamed: 291_level_2,Unnamed: 292_level_2,Unnamed: 293_level_2,Unnamed: 294_level_2,Unnamed: 295_level_2,Unnamed: 296_level_2,Unnamed: 297_level_2,Unnamed: 298_level_2,Unnamed: 299_level_2,Unnamed: 300_level_2,Unnamed: 301_level_2,Unnamed: 302_level_2,Unnamed: 303_level_2,Unnamed: 304_level_2,Unnamed: 305_level_2,Unnamed: 306_level_2,Unnamed: 307_level_2,Unnamed: 308_level_2,Unnamed: 309_level_2,Unnamed: 310_level_2,Unnamed: 311_level_2,Unnamed: 312_level_2,Unnamed: 313_level_2,Unnamed: 314_level_2,Unnamed: 315_level_2,Unnamed: 316_level_2,Unnamed: 317_level_2,Unnamed: 318_level_2,Unnamed: 319_level_2,Unnamed: 320_level_2,Unnamed: 321_level_2,Unnamed: 322_level_2,Unnamed: 323_level_2,Unnamed: 324_level_2,Unnamed: 325_level_2,Unnamed: 326_level_2,Unnamed: 327_level_2,Unnamed: 328_level_2,Unnamed: 329_level_2,Unnamed: 330_level_2,Unnamed: 331_level_2,Unnamed: 332_level_2,Unnamed: 333_level_2,Unnamed: 334_level_2,Unnamed: 335_level_2,Unnamed: 336_level_2,Unnamed: 337_level_2,Unnamed: 338_level_2,Unnamed: 339_level_2,Unnamed: 340_level_2,Unnamed: 341_level_2,Unnamed: 342_level_2,Unnamed: 343_level_2,Unnamed: 344_level_2,Unnamed: 345_level_2,Unnamed: 346_level_2,Unnamed: 347_level_2,Unnamed: 348_level_2,Unnamed: 349_level_2,Unnamed: 350_level_2,Unnamed: 351_level_2,Unnamed: 352_level_2,Unnamed: 353_level_2,Unnamed: 354_level_2,Unnamed: 355_level_2,Unnamed: 356_level_2,Unnamed: 357_level_2,Unnamed: 358_level_2,Unnamed: 359_level_2,Unnamed: 360_level_2,Unnamed: 361_level_2,Unnamed: 362_level_2,Unnamed: 363_level_2,Unnamed: 364_level_2,Unnamed: 365_level_2,Unnamed: 366_level_2,Unnamed: 367_level_2,Unnamed: 368_level_2,Unnamed: 369_level_2,Unnamed: 370_level_2,Unnamed: 371_level_2,Unnamed: 372_level_2,Unnamed: 373_level_2,Unnamed: 374_level_2,Unnamed: 375_level_2,Unnamed: 376_level_2,Unnamed: 377_level_2,Unnamed: 378_level_2,Unnamed: 379_level_2,Unnamed: 380_level_2,Unnamed: 381_level_2,Unnamed: 382_level_2,Unnamed: 383_level_2,Unnamed: 384_level_2,Unnamed: 
385_level_2,Unnamed: 386_level_2,Unnamed: 387_level_2,Unnamed: 388_level_2,Unnamed: 389_level_2,Unnamed: 390_level_2,Unnamed: 391_level_2,Unnamed: 392_level_2,Unnamed: 393_level_2,Unnamed: 394_level_2,Unnamed: 395_level_2,Unnamed: 396_level_2,Unnamed: 397_level_2,Unnamed: 398_level_2,Unnamed: 399_level_2,Unnamed: 400_level_2,Unnamed: 401_level_2,Unnamed: 402_level_2,Unnamed: 403_level_2,Unnamed: 404_level_2,Unnamed: 405_level_2,Unnamed: 406_level_2,Unnamed: 407_level_2,Unnamed: 408_level_2,Unnamed: 409_level_2,Unnamed: 410_level_2,Unnamed: 411_level_2,Unnamed: 412_level_2,Unnamed: 413_level_2,Unnamed: 414_level_2,Unnamed: 415_level_2,Unnamed: 416_level_2,Unnamed: 417_level_2,Unnamed: 418_level_2,Unnamed: 419_level_2,Unnamed: 420_level_2,Unnamed: 421_level_2,Unnamed: 422_level_2,Unnamed: 423_level_2,Unnamed: 424_level_2,Unnamed: 425_level_2,Unnamed: 426_level_2,Unnamed: 427_level_2,Unnamed: 428_level_2,Unnamed: 429_level_2,Unnamed: 430_level_2,Unnamed: 431_level_2,Unnamed: 432_level_2,Unnamed: 433_level_2,Unnamed: 434_level_2,Unnamed: 435_level_2,Unnamed: 436_level_2,Unnamed: 437_level_2,Unnamed: 438_level_2,Unnamed: 439_level_2,Unnamed: 440_level_2,Unnamed: 441_level_2,Unnamed: 442_level_2,Unnamed: 443_level_2,Unnamed: 444_level_2,Unnamed: 445_level_2,Unnamed: 446_level_2,Unnamed: 447_level_2,Unnamed: 448_level_2,Unnamed: 449_level_2,Unnamed: 450_level_2,Unnamed: 451_level_2,Unnamed: 452_level_2,Unnamed: 453_level_2,Unnamed: 454_level_2,Unnamed: 455_level_2,Unnamed: 456_level_2,Unnamed: 457_level_2,Unnamed: 458_level_2,Unnamed: 459_level_2,Unnamed: 460_level_2,Unnamed: 461_level_2,Unnamed: 462_level_2,Unnamed: 463_level_2,Unnamed: 464_level_2,Unnamed: 465_level_2,Unnamed: 466_level_2,Unnamed: 467_level_2,Unnamed: 468_level_2,Unnamed: 469_level_2,Unnamed: 470_level_2,Unnamed: 471_level_2,Unnamed: 472_level_2,Unnamed: 473_level_2,Unnamed: 474_level_2,Unnamed: 475_level_2,Unnamed: 476_level_2,Unnamed: 477_level_2,Unnamed: 478_level_2,Unnamed: 479_level_2,Unnamed: 
480_level_2,Unnamed: 481_level_2,Unnamed: 482_level_2,Unnamed: 483_level_2,Unnamed: 484_level_2,Unnamed: 485_level_2,Unnamed: 486_level_2,Unnamed: 487_level_2,Unnamed: 488_level_2,Unnamed: 489_level_2,Unnamed: 490_level_2,Unnamed: 491_level_2,Unnamed: 492_level_2,Unnamed: 493_level_2,Unnamed: 494_level_2,Unnamed: 495_level_2,Unnamed: 496_level_2,Unnamed: 497_level_2,Unnamed: 498_level_2,Unnamed: 499_level_2,Unnamed: 500_level_2,Unnamed: 501_level_2
[Output condensed: 10 monthly rows, 1996-04-30 through 1997-01-31. Every displayed cell is either False or empty (missing); the middle columns are elided as "..." in the original display.]


In [198]:
dfPredictZipCodetest1 = dfPredictZipCode.reset_index()
dfPredictZipCodetest1[dfPredictZipCodetest1.ZipCode == '96097']

Unnamed: 0,Month,ZipCode,All Homes,1 Bedroom,Median List Price per Square Foot,2 Bedroom,Number of Homes for Rent,4 Bedroom,5 or More Bedroom,Ratio of Sale Price to List Price,"Estimated Rent, All Homes in Region",Median List Price,Median Sale Price per Square Foot,Price per Square Foot,Median Sale Price,Single Family Residences,Listings with Price Cut in Last 30 Days,Condominiums,"Median Rent, Homes Listed for Rent",3 Bedroom,Estimated Rent per Square Foot,Percentage of Sales that were Foreclosures,Price-to-Rent Ratio,PredictZipCode
303941,1996-04-30,96097,76900,139800,166.111574,66000,1.8581,120000,816400,0.9842,1281,259000,66.894531,60,96000.0000,77300,5.6338,760000,325,80400,0.974,9.0909,9.41,False
303942,1996-05-31,96097,77300,139800,166.111574,65400,1.8581,125000,816400,0.9842,1281,259000,66.894531,61,96000.0000,77900,5.6338,760000,325,81700,0.974,9.0909,9.41,False
303943,1996-06-30,96097,77700,139800,166.111574,64500,1.8581,127600,816400,0.9842,1281,259000,66.894531,60,96000.0000,78300,5.6338,760000,325,82700,0.974,9.0909,9.41,False
303944,1996-07-31,96097,77900,139800,166.111574,64200,1.8581,126200,816400,0.9842,1281,259000,66.894531,60,96000.0000,78500,5.6338,760000,325,83000,0.974,9.0909,9.41,False
303945,1996-08-31,96097,78800,139800,166.111574,64500,1.8581,125600,816400,0.9842,1281,259000,66.894531,60,96000.0000,79300,5.6338,760000,325,83100,0.974,9.0909,9.41,False
303946,1996-09-30,96097,79700,139800,166.111574,65000,1.8581,125700,816400,0.9842,1281,259000,64.537000,61,83307.6923,80100,5.6338,760000,325,83100,0.974,9.0909,9.41,False
303947,1996-10-31,96097,80200,139800,166.111574,65400,1.8581,125500,816400,0.9842,1281,259000,64.537000,61,83307.6923,80800,5.6338,760000,325,83800,0.974,9.0909,9.41,False
303948,1996-11-30,96097,81200,139800,166.111574,65800,1.8581,124400,816400,0.9842,1281,259000,61.633282,61,84500.0000,82100,5.6338,760000,325,85200,0.974,9.0909,9.41,False
303949,1996-12-31,96097,82300,139800,166.111574,66000,1.8581,123000,816400,0.9842,1281,259000,61.633282,61,84500.0000,83100,5.6338,760000,325,86600,0.974,9.0909,9.41,False
303950,1997-01-31,96097,82500,139800,166.111574,65600,1.8581,122600,816400,0.9842,1281,259000,61.633282,62,84500.0000,83300,5.6338,760000,325,87400,0.974,9.0909,9.41,False


In [199]:
# sum(dfPredictZipCodeFinalnotstack.PredictZipCode.isnull().values)


In [200]:
### process IRS dataset
### prepare 1998 IRS data
irsdata = pd.read_csv('./DataPfiles/98zp05ca.csv',header=4, na_values=['**','--'])
irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
# irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
#           'TaxableInterestRtn','TaxableInterestAmt','EarnedIncomeCreditRtn','EarnedIncomeCreditAmt', \
#           'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalScheduleC','ScheduleFTotalRtn','ScheduleFTotalScheduleF',\
#           'ScheduleARtn','ScheduleAAmt'
#           ]
irsdata.columns= irsCol #assign column names to dataframe
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns: #clean up data by removing non numeric characters
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']
# print irsdata.columns
# print calzipcode.columns
# print irsdata.ZipCode
irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata1998 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))].copy()  #copy so the column assignments below do not raise SettingWithCopyWarning
irsdata1998['NumberOfReturns'] = irsdata1998['NumberOfReturns'].astype(int)
irsdata1998['AverageAGI'] = irsdata1998.AGI.astype(int) * 1000 / irsdata1998.NumberOfReturns  #AGI scaled by 1000, divided by the number of returns

irsdata1998full = pd.DataFrame(columns=irsdata1998.columns.to_series().append(pd.Series("Month")))
# irsdata1998full = irsdata1998.copy()

# repeat each zip code's annual row 12 times and label the copies with the month-end dates of 1998
for x in range(0,irsdata1998.shape[0]):
    listtemp = []
    irsdata1998t = pd.DataFrame(columns=irsdata1998.columns)
    irsdata1998t = irsdata1998t.append([irsdata1998.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("1998-%s-02"%thisMonth)
#         print dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d")
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata1998t["Month"] = pd.Series(listtemp)
    irsdata1998full = irsdata1998full.append(irsdata1998t, ignore_index=True)

irsdata1998full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe
irsdata1998small = irsdata1998full[irsdata1998full.index.get_level_values('Month') == '1998-03-31']
irsdata1998small2 = irsdata1998full[irsdata1998full.index.get_level_values('ZipCode') == '90011']

# irsdata1998small = irsdata1998full[irsdata1998full.index.get_level_values('Month') == '1998-03-31']
### start the combined IRS frame (the merge with the zipcode price data is commented out below)
irsdatafull = irsdata1998full
print "done"
# dfPredictZipCodesmall = dfPredictZipCode[dfPredictZipCode.index.get_level_values('Month') == '1998-03-31']
# dfPredictZipCodesmall = pd.concat([dfPredictZipCodesmall.unstack(level=1),irsdata1998small.unstack(level=1)], axis=1)
# dfPredictZipCodesmall.stack(level=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


2396
done



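The `SettingWithCopyWarning` in the output is what pandas emits when columns are assigned on a frame that was produced by filtering another frame. The following is a minimal sketch of the same cleanup steps (strip the marker characters, convert to numbers, keep the known zip codes, compute the average AGI) that avoids the warning by taking an explicit copy. It assumes Python 3 and a recent pandas, which is not what this notebook was run on; the helper name `clean_irs_frame` and the toy data are hypothetical and only there for illustration.

```python
import pandas as pd


def clean_irs_frame(raw, valid_zips):
    """Strip '*', '-' and ',' markers, coerce to numbers, keep known zip codes."""
    df = raw.copy()
    df["ZipCode"] = df["ZipCode"].astype(str).str.strip()
    for col in df.columns.drop("ZipCode"):
        # Like the notebook's translate() call, this also removes minus signs.
        cleaned = df[col].astype(str).str.replace(r"[*,\-]", "", regex=True)
        # Cells that held only markers become NaN instead of raising.
        df[col] = pd.to_numeric(cleaned, errors="coerce")
    # .copy() gives an independent frame, so adding a column does not warn.
    out = df[df["ZipCode"].isin(valid_zips)].copy()
    # Same arithmetic as the notebook: AGI * 1000 / number of returns.
    out["AverageAGI"] = out["AGI"] * 1000 / out["NumberOfReturns"]
    return out


# Toy input (made up): one padded zip code, one suppressed value, one unknown zip.
raw = pd.DataFrame({
    "ZipCode": [" 90011 ", "99999", "96097"],
    "NumberOfReturns": ["1,000", "50", "200"],
    "AGI": ["25,000", "**", "8,000"],
})
result = clean_irs_frame(raw, valid_zips={"90011", "96097"})
print(result)
```

Using `pd.to_numeric(..., errors="coerce")` is a design choice of this sketch: suppressed cells end up as missing values rather than stopping the conversion, so they need to be handled before any later `astype(int)`.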

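The IRS cells expand each annual row into 12 rows, one per month end, so that the IRS frame can share the `(Month, ZipCode)` index of the price data. A sketch of that expansion without the row-by-row `append` loop is shown here. It assumes Python 3 and a recent pandas, and it replaces the notebook's `dateUtility` helper with the standard library's `calendar.monthrange`; the function names `month_end_labels` and `expand_to_months` and the sample values are hypothetical.

```python
import calendar

import pandas as pd


def month_end_labels(year):
    """Return the 12 month-end dates of `year` as 'YYYY-MM-DD' strings."""
    return ["%04d-%02d-%02d" % (year, m, calendar.monthrange(year, m)[1])
            for m in range(1, 13)]


def expand_to_months(annual, year):
    """Repeat every annual row 12 times and index by (Month, ZipCode)."""
    months = month_end_labels(year)
    annual = annual.reset_index(drop=True)  # ensure a unique 0..n-1 index
    # index.repeat(12) yields 0,0,...,0,1,1,...: each row 12 times in a block.
    repeated = annual.loc[annual.index.repeat(len(months))].reset_index(drop=True)
    # One full Jan..Dec cycle per original row, matching the block order above.
    repeated["Month"] = months * len(annual)
    return repeated.set_index(["Month", "ZipCode"])


# Made-up annual values for two zip codes.
annual = pd.DataFrame({"ZipCode": ["90011", "96097"],
                       "AverageAGI": [25000.0, 40000.0]})
monthly = expand_to_months(annual, 1998)
print(monthly.head(13))
```

Building the month labels once and repeating the rows in a single step avoids growing a DataFrame inside a loop, which gets slow as the number of zip codes increases.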
In [201]:
### prepare 2001 IRS data
irsdata = pd.read_csv('./DataPfiles/01zp05ca.csv',header=4, na_values=['**','--'])
irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']
# print irsdata.columns
# print calzipcode.columns
# print irsdata.ZipCode
irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2001 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]
irsdata2001['AverageAGI'] = irsdata2001.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2001['NumberOfReturns'] = irsdata2001['NumberOfReturns'].astype(int)
irsdata2001['AverageAGI'] = irsdata2001['AverageAGI'] * 1000 / irsdata2001.NumberOfReturns
print irsdata2001['AverageAGI'].mean()

irsdata2001full = pd.DataFrame(columns=irsdata2001.columns.to_series().append(pd.Series("Month")))
# irsdata1998full = irsdata1998.copy()

for x in range(0,irsdata2001.shape[0]):
    listtemp = []
    irsdata2001t = pd.DataFrame(columns=irsdata2001.columns)
    irsdata2001t = irsdata2001t.append([irsdata2001.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2001-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2001t["Month"] = pd.Series(listtemp)
    irsdata2001full = irsdata2001full.append(irsdata2001t, ignore_index=True)
    
irsdata2001full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2001 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2001full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)

# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdata1998full.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCodeFinal.unstack(level=1),irsdata2001full.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)

print "done"
# b = dfPredictZipCodeFinal.index.get_level_values(0) == '1998-09-30'  #for testing the result
# b1 = dfPredictZipCodeFinal.index.get_level_values(0) == '2001-09-30'  #for testing the result
# dfPredictZipCodeFinal[b1]  #for testing the result

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


2412
47996.9858028
done
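
The inner loop above builds twelve month-end labels through `dateUtility`, a helper module that is not shown in this report. A minimal sketch of the same labels using only the standard library, assuming `mkLastOfMonth` returns the last calendar day of the month (`month_end_labels` is a hypothetical name, and the code uses Python 3 syntax rather than the Python 2 of the cells):

```python
import calendar
from datetime import date

def month_end_labels(year):
    """Return the 12 month-end dates of `year` as 'YYYY-MM-DD' strings."""
    labels = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        labels.append(date(year, month, last_day).strftime("%Y-%m-%d"))
    return labels

labels_2001 = month_end_labels(2001)
print(labels_2001[:3])
```

Because the labels depend only on the year, they could be computed once per year instead of once per row.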



In [202]:
### prepare 2002 IRS data
irsdata = pd.read_csv('./DataPfiles/zptab02ca.csv',header=4, na_values=['**','--'])
irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']
# print irsdata.columns
# print calzipcode.columns
# print irsdata.ZipCode
irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2002 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]
irsdata2002['AverageAGI'] = irsdata2002.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2002['NumberOfReturns'] = irsdata2002['NumberOfReturns'].astype(int)
irsdata2002['AverageAGI'] = irsdata2002['AverageAGI'] * 1000 / irsdata2002.NumberOfReturns
print irsdata2002['AverageAGI'].mean()

irsdata2002full = pd.DataFrame(columns=irsdata2002.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2002.shape[0]):
    listtemp = []
    irsdata2002t = pd.DataFrame(columns=irsdata2002.columns)
    irsdata2002t = irsdata2002t.append([irsdata2002.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2002-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2002t["Month"] = pd.Series(listtemp)
    irsdata2002full = irsdata2002full.append(irsdata2002t, ignore_index=True)
    
irsdata2002full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2002 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2002full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1) #not concat until all irsdatafull is created
print "done"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


2414
49689.0262676
done
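
The `SettingWithCopyWarning` printed above appears when new columns are assigned on a frame obtained by filtering another frame. One common way to avoid it is to take an explicit copy of the filtered rows. The sketch below uses a made-up three-row table, not the IRS file, and Python 3 syntax; `irs`, `ca_zips` and `irs_ca` are illustrative names:

```python
import pandas as pd

# Toy stand-in for the IRS table; the values are made up for illustration.
irs = pd.DataFrame({
    "ZipCode": ["90001", "90002", "10001"],
    "AGI": ["500", "300", "900"],            # thousands of dollars, as strings
    "NumberOfReturns": ["10", "5", "20"],
})
ca_zips = ["90001", "90002"]

# .copy() makes the filtered frame independent of `irs`, so the column
# assignments below do not trigger SettingWithCopyWarning.
irs_ca = irs[irs["ZipCode"].isin(ca_zips)].copy()
irs_ca["NumberOfReturns"] = irs_ca["NumberOfReturns"].astype(int)
irs_ca["AverageAGI"] = irs_ca["AGI"].astype(int) * 1000 / irs_ca["NumberOfReturns"]
print(irs_ca)
```

The original frame is left untouched, which also makes it clear which table the new `AverageAGI` column belongs to.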


In [203]:
### prepare 2004 IRS data
irsdata = pd.read_csv('./DataPfiles/ZIPCode2004CA.csv',header=4, na_values=['**','--'])
irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']
irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2004 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]
irsdata2004['AverageAGI'] = irsdata2004.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2004['NumberOfReturns'] = irsdata2004['NumberOfReturns'].astype(int)
irsdata2004['AverageAGI'] = irsdata2004['AverageAGI'] * 1000 / irsdata2004.NumberOfReturns
print irsdata2004['AverageAGI'].mean()

irsdata2004full = pd.DataFrame(columns=irsdata2004.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2004.shape[0]):
    listtemp = []
    irsdata2004t = pd.DataFrame(columns=irsdata2004.columns)
    irsdata2004t = irsdata2004t.append([irsdata2004.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2004-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2004t["Month"] = pd.Series(listtemp)
    irsdata2004full = irsdata2004full.append(irsdata2004t, ignore_index=True)
    
irsdata2004full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2004 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2004full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)  #not concat until all irsdatafull is created
print "done"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


2390
52308.0969728
done
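
Each cell expands one yearly row per zip code into twelve monthly rows by appending inside a Python loop, which grows slowly as the table gets larger. A possible vectorized alternative is sketched below, under the assumption that only the month-end label differs between the twelve copies. `expand_to_months` is a hypothetical helper, the sample values are made up, and the code uses Python 3 syntax:

```python
import calendar
import pandas as pd

def expand_to_months(yearly, year):
    """Repeat every row of `yearly` 12 times, one per month-end of `year`."""
    month_ends = [
        "%04d-%02d-%02d" % (year, m, calendar.monthrange(year, m)[1])
        for m in range(1, 13)
    ]
    # One copy of the yearly table per month, each tagged with its label.
    pieces = [yearly.assign(Month=label) for label in month_ends]
    monthly = pd.concat(pieces, ignore_index=True)
    return monthly.set_index(["Month", "ZipCode"]).sort_index()

yearly = pd.DataFrame({"ZipCode": ["90001", "90002"],
                       "AverageAGI": [50000.0, 60000.0]})
monthly = expand_to_months(yearly, 2004)
print(monthly.head())
```

The row order differs from the loop version (sorted by month, then zip code), which should not matter once the frame is indexed by `Month` and `ZipCode`.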


In [204]:
### prepare 2005 IRS data
irsdata = pd.read_csv('./DataPfiles/ZIPCode2005CA.csv',header=4, na_values=['**','--'])
irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2005 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]
irsdata2005['AverageAGI'] = irsdata2005.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2005['NumberOfReturns'] = irsdata2005['NumberOfReturns'].astype(int)
irsdata2005['AverageAGI'] = irsdata2005['AverageAGI'] * 1000 / irsdata2005.NumberOfReturns
print irsdata2005['AverageAGI'].mean()

irsdata2005full = pd.DataFrame(columns=irsdata2005.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2005.shape[0]):
    listtemp = []
    irsdata2005t = pd.DataFrame(columns=irsdata2005.columns)
    irsdata2005t = irsdata2005t.append([irsdata2005.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2005-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2005t["Month"] = pd.Series(listtemp)
    irsdata2005full = irsdata2005full.append(irsdata2005t, ignore_index=True)
    
irsdata2005full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2005 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2005full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)  #not concat until all irsdatafull is created
print "done"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


2396
55154.0374922
done
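
The cleaning loop in these cells relies on the Python 2 form `translate(None, deletechars)`. A rough Python 3 equivalent on plain strings is sketched below; `strip_markers` is a hypothetical name. It also illustrates that the deletion works character by character, so listing `'--'` removes every single `-`, including a leading minus sign. Whether any amount in these files is negative is not something this report shows.

```python
chars_to_remove = ['*', '**', '--', ',']

# Joining gives "***--,", so the table deletes every '*', '-' and ','
# character individually; the multi-character entries add nothing.
delete_table = str.maketrans('', '', ''.join(chars_to_remove))

def strip_markers(text):
    """Remove footnote markers and thousands separators from one cell."""
    return text.translate(delete_table)

print(strip_markers("1,234"))
print(strip_markers("**"))
print(strip_markers("-56"))
```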


In [205]:
### prepare 2006 IRS data
irsdata = pd.read_csv('./DataPfiles/ZIPCode2006CA.csv',header=4, na_values=['**','--'])
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ZipCode'] = irsdata['ZipCode'].astype(str)  #convert dtype from float64 to string object
irsdata['ZipCode'] = irsdata['ZipCode'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2006 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2006 = irsdata2006[(irsdata2006['Range'] == irsdata2006['ZipCode'])]  #keep only the rows whose Range cell repeats the zip code
irsdata2006 = irsdata2006.drop('Range', axis=1)
irsdata2006['AverageAGI'] = irsdata2006.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2006['NumberOfReturns'] = irsdata2006['NumberOfReturns'].astype(int)
irsdata2006['AverageAGI'] = irsdata2006['AverageAGI'] * 1000 / irsdata2006.NumberOfReturns
print irsdata2006['AverageAGI'].mean()

irsdata2006full = pd.DataFrame(columns=irsdata2006.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2006.shape[0]):
    listtemp = []
    irsdata2006t = pd.DataFrame(columns=irsdata2006.columns)
    irsdata2006t = irsdata2006t.append([irsdata2006.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2006-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2006t["Month"] = pd.Series(listtemp)
    irsdata2006full = irsdata2006full.append(irsdata2006t, ignore_index=True)
    
irsdata2006full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2006 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2006full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)  #not concat until all irsdatafull is created
print "done"

19128
55230.5241484
done
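
The count printed for 2006 (19128) is roughly eight times the counts printed for 2001 to 2005 (2390 to 2414). That suggests the 2006 file carries several rows per zip code and that the count is taken before the `Range` filter is applied. The sketch below reproduces the effect on a made-up table. The layout, with one total row whose `Range` cell repeats the zip code followed by per-range rows, is an assumption based on the filter used in the cell, and the code uses Python 3 syntax:

```python
import pandas as pd

# Hypothetical layout: one total row per zip code (Range repeats the zip code)
# followed by rows for individual income ranges. All values are made up.
raw = pd.DataFrame({
    "Range":   ["90001", "Under $10,000", "$10,000 under $25,000",
                "90002", "Under $10,000"],
    "ZipCode": ["90001", "90001", "90001", "90002", "90002"],
    "NumberOfReturns": [30, 12, 18, 20, 20],
})

matched = raw["ZipCode"].isin(["90001", "90002"])
print(matched.sum())   # counts every matching row, range rows included

# Keep only the total rows, then drop the helper column.
totals = raw[matched & (raw["Range"] == raw["ZipCode"])].drop("Range", axis=1)
print(len(totals))     # one row per zip code
```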


In [206]:
### prepare 2007 IRS data
def trim_fraction(text):
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text

irsdata = pd.read_csv('./DataPfiles/ZIPCode2007CA.csv',header=4, na_values=['**','--'])
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol
irsdata['ZipCode'] = irsdata['ZipCode'].astype(str)  #convert dtype from float64 to string object
irsdata['ZipCode'] = irsdata['ZipCode'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleCTotalAmt'] = irsdata['ScheduleCTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleFTotalAmt'] = irsdata['ScheduleFTotalAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['ScheduleAAmt'] = irsdata['ScheduleAAmt'].apply(trim_fraction) #remove the decimals from the string object
irsdata['DepedentExemptionsAmt'] = irsdata['DepedentExemptionsAmt'].astype(str)  #convert dtype from float64 to string object
irsdata['DepedentExemptionsAmt'] = irsdata['DepedentExemptionsAmt'].apply(trim_fraction) #remove the decimals from the string object
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2007 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2007 = irsdata2007[pd.isnull(irsdata2007['Range'])]  #keep only the rows with no Range value
irsdata2007 = irsdata2007.drop('Range', axis=1)  #reassign the result; drop(..., inplace=True) failed on this line
irsdata2007['AverageAGI'] = irsdata2007.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2007['NumberOfReturns'] = irsdata2007['NumberOfReturns'].astype(int)
irsdata2007['AverageAGI'] = irsdata2007['AverageAGI'] * 1000 / irsdata2007.NumberOfReturns
print irsdata2007['AverageAGI'].mean()

irsdata2007full = pd.DataFrame(columns=irsdata2007.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2007.shape[0]):
    listtemp = []
    irsdata2007t = pd.DataFrame(columns=irsdata2007.columns)
    irsdata2007t = irsdata2007t.append([irsdata2007.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2007-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2007t["Month"] = pd.Series(listtemp)
    irsdata2007full = irsdata2007full.append(irsdata2007t, ignore_index=True)
    
irsdata2007full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2007 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2007full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)   #not concat until all irsdatafull is created

19176
52460.2197889
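
`trim_fraction` cuts the string at the last `.0` it finds, which works for values such as `95014.0` but would also shorten a value like `12.05`. The comparison below puts the version from the cell next to a stricter variant; `trim_trailing_zero_fraction` is a hypothetical alternative, not part of the notebook, and the sample strings are made up:

```python
def trim_fraction(text):
    # Version from the cell above: cuts at the last ".0" anywhere in the string.
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text

def trim_trailing_zero_fraction(text):
    # Hypothetical stricter variant: only drops a ".0" that ends the string.
    if text.endswith('.0'):
        return text[:-2]
    return text

for sample in ["95014.0", "12.05", "nan"]:
    print(sample, trim_fraction(sample), trim_trailing_zero_fraction(sample))
```

If every float in these columns is a whole number, both versions give the same result.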


In [207]:
### prepare 2008 IRS data
def trim_fraction(text):
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text

irsdata = pd.read_csv('./DataPfiles/08zp05ca.csv',header=4, na_values=['**','--'])
# irsdata = pd.read_csv('./DataPfiles/08zp05ca.csv',header=4, na_values=['**','--'], \
#                       dtype={'Range': np.str, 'ZipCode': np.str, 'NumberOfReturns': np.str, \
#                             'ExemptionsRtn':np.str, 'DepedentExemptionsAmt':np.str,'AGI':np.str,'SalariesWagesRtn':np.str, \
#                             'SalariesWagesAmt':np.str,'TaxableInterestRtn':np.str, 'TaxableInterestAmt':np.str, \
#                             'TotalTaxRtn':np.str, 'TotalTaxAmt':np.str, 'ScheduleCTotalRtn': np.str,'ScheduleCTotalAmt':np.str, \
#                             'ScheduleFTotalRtn':np.str, 'ScheduleFTotalAmt':np.str, 'ScheduleARtn':np.str,'ScheduleAAmt':np.str}
#                       ) 
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol

for x in irsdata.columns:
    irsdata[x] = irsdata[x].astype(str)  #convert dtype from float64 to string object
    irsdata[x] = irsdata[x].apply(trim_fraction) #remove the decimals from the string object 
    
#ZipCode and the amount columns were already converted to trimmed strings by the loop above
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2008 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2008 = irsdata2008[irsdata2008['Range'] == 'nan']  #Range was cast to str above, so a missing value shows up as the string 'nan'
irsdata2008 = irsdata2008.drop('Range', axis=1)  #reassign the result; drop(..., inplace=True) failed on this line
irsdata2008['AverageAGI'] = irsdata2008.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2008['NumberOfReturns'] = irsdata2008['NumberOfReturns'].astype(int)
irsdata2008['AverageAGI'] = irsdata2008['AverageAGI'] * 1000 / irsdata2008.NumberOfReturns
print irsdata2008['AverageAGI'].mean()

irsdata2008full = pd.DataFrame(columns=irsdata2008.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2008.shape[0]):
    listtemp = []
    irsdata2008t = pd.DataFrame(columns=irsdata2008.columns)
    irsdata2008t = irsdata2008t.append([irsdata2008.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2008-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2008t["Month"] = pd.Series(listtemp)
    irsdata2008full = irsdata2008full.append(irsdata2008t, ignore_index=True)
    
irsdata2008full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2008 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2008full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1) #not concat until all irsdatafull is created
print "done"

15600
56658.0840399
done
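
The commented-out `read_csv` call in the 2008 cell reads every column as a string through a per-column `dtype` mapping. Passing `dtype=str` for the whole file is a shorter way to get the same effect, sketched here on a made-up in-memory CSV in Python 3 syntax. The real files have more columns and extra header rows, so this is only an illustration:

```python
import io
import pandas as pd

# Made-up two-row extract; values and layout are for illustration only.
csv_text = (
    "Range,ZipCode,NumberOfReturns,AGI\n"
    ",90001,\"1,200\",\"48,000\"\n"
    "Under $10000,90001,300,**\n"
)

# dtype=str keeps zip codes and amounts as text, so no ".0" suffix appears;
# na_values still turns the footnote markers into missing values.
frame = pd.read_csv(io.StringIO(csv_text), dtype=str, na_values=["**", "--"])
print(frame.dtypes)
print(frame)
```

With this approach an empty `Range` cell stays a missing value rather than becoming the string `'nan'`, so a filter on it would use `pd.isnull` as in the 2007 cell.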


In [208]:
### prepare 2009 IRS data
def trim_fraction(text):
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text

irsdata = pd.read_csv('./DataPfiles/09zp05ca.csv',header=4, na_values=['**','--'])
# irsdata = pd.read_csv('./DataPfiles/08zp05ca.csv',header=4, na_values=['**','--'], \
#                       dtype={'Range': np.str, 'ZipCode': np.str, 'NumberOfReturns': np.str, \
#                             'ExemptionsRtn':np.str, 'DepedentExemptionsAmt':np.str,'AGI':np.str,'SalariesWagesRtn':np.str, \
#                             'SalariesWagesAmt':np.str,'TaxableInterestRtn':np.str, 'TaxableInterestAmt':np.str, \
#                             'TotalTaxRtn':np.str, 'TotalTaxAmt':np.str, 'ScheduleCTotalRtn': np.str,'ScheduleCTotalAmt':np.str, \
#                             'ScheduleFTotalRtn':np.str, 'ScheduleFTotalAmt':np.str, 'ScheduleARtn':np.str,'ScheduleAAmt':np.str}
#                       ) 
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol

for x in irsdata.columns:
    irsdata[x] = irsdata[x].astype(str)  #convert dtype from float64 to string object
    irsdata[x] = irsdata[x].apply(trim_fraction) #remove the decimals from the string object 
    
#ZipCode and the amount columns were already converted to trimmed strings by the loop above
chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2009 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2009 = irsdata2009[irsdata2009['Range'] == 'nan']  #Range was cast to str above, so a missing value shows up as the string 'nan'
irsdata2009 = irsdata2009.drop('Range', axis=1)  #reassign the result; drop(..., inplace=True) failed on this line
irsdata2009['AverageAGI'] = irsdata2009.AGI.astype(int)  #total AGI per zip code, reported in thousands of dollars
irsdata2009['NumberOfReturns'] = irsdata2009['NumberOfReturns'].astype(int)
irsdata2009['AverageAGI'] = irsdata2009['AverageAGI'] * 1000 / irsdata2009.NumberOfReturns
print irsdata2009['AverageAGI'].mean()

irsdata2009full = pd.DataFrame(columns=irsdata2009.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2009.shape[0]):
    listtemp = []
    irsdata2009t = pd.DataFrame(columns=irsdata2009.columns)
    irsdata2009t = irsdata2009t.append([irsdata2009.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2009-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2009t["Month"] = pd.Series(listtemp)
    irsdata2009full = irsdata2009full.append(irsdata2009t, ignore_index=True)
    
irsdata2009full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2009 rows to the combined IRS frame (the merge with zipcode price data is done after all years are loaded)
irsdatafull = irsdatafull.append(irsdata2009full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1) #not concat until all irsdatafull is created
print "done"

10409
55546.2507987
done
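
The cells for 2001 to 2009 differ mainly in the file name and in whether the column list starts with `Range`. One way to capture those differences in a single table is sketched below, so that one loop could drive the preparation. The file names and column names are copied from the cells above; `BASE_COLUMNS`, `IRS_FILES` and `columns_for` are hypothetical names, and the code uses Python 3 syntax:

```python
# Column names shared by every year, copied from the cells above.
BASE_COLUMNS = [
    'ZipCode', 'NumberOfReturns', 'ExemptionsRtn', 'DepedentExemptionsAmt', 'AGI',
    'SalariesWagesRtn', 'SalariesWagesAmt', 'TaxableInterestRtn', 'TaxableInterestAmt',
    'TotalTaxRtn', 'TotalTaxAmt', 'ScheduleCTotalRtn', 'ScheduleCTotalAmt',
    'ScheduleFTotalRtn', 'ScheduleFTotalAmt', 'ScheduleARtn', 'ScheduleAAmt',
]

# year -> (file name as used above, whether the file has a leading Range column)
IRS_FILES = {
    2001: ('01zp05ca.csv', False),
    2002: ('zptab02ca.csv', False),
    2004: ('ZIPCode2004CA.csv', False),
    2005: ('ZIPCode2005CA.csv', False),
    2006: ('ZIPCode2006CA.csv', True),
    2007: ('ZIPCode2007CA.csv', True),
    2008: ('08zp05ca.csv', True),
    2009: ('09zp05ca.csv', True),
}

def columns_for(year):
    """Return the column names to assign to the frame read for `year`."""
    file_name, has_range = IRS_FILES[year]
    return (['Range'] if has_range else []) + BASE_COLUMNS

for year in sorted(IRS_FILES):
    print(year, IRS_FILES[year][0], len(columns_for(year)))
```

The rule for selecting the total rows also changes between 2006 and later years, so a full refactor would need a per-year filter in addition to this table.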


In [209]:
### prepare 2010 IRS data
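def trim_fraction(text):
    # strip only a trailing '.0' (left over from the float-to-string conversion), so a value such as '1.05' stays intact
    if text.endswith('.0'):
        return text[:-2]
    return text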

irsdata = pd.read_csv('./DataPfiles/10zp05ca.csv',header=4, na_values=['**','--'])
# irsdata = pd.read_csv('./DataPfiles/08zp05ca.csv',header=4, na_values=['**','--'], \
#                       dtype={'Range': np.str, 'ZipCode': np.str, 'NumberOfReturns': np.str, \
#                             'ExemptionsRtn':np.str, 'DepedentExemptionsAmt':np.str,'AGI':np.str,'SalariesWagesRtn':np.str, \
#                             'SalariesWagesAmt':np.str,'TaxableInterestRtn':np.str, 'TaxableInterestAmt':np.str, \
#                             'TotalTaxRtn':np.str, 'TotalTaxAmt':np.str, 'ScheduleCTotalRtn': np.str,'ScheduleCTotalAmt':np.str, \
#                             'ScheduleFTotalRtn':np.str, 'ScheduleFTotalAmt':np.str, 'ScheduleARtn':np.str,'ScheduleAAmt':np.str}
#                       ) 
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol

for x in irsdata.columns:
    irsdata[x] = irsdata[x].astype(str)  #convert dtype from float64 to string object
    irsdata[x] = irsdata[x].apply(trim_fraction) #remove the decimals from the string object 
    
# ZipCode and the amount columns were already converted to strings and trimmed by the loop above, so no per-column repeat is needed
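chars_to_remove = ['*', '**', '--',',']
for col in irsdata.columns:
    #str.translate deletes single characters, so this strips every '*', '-' and ',' (not only the '**' and '--' markers)
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))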

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2010 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2010 = irsdata2010[irsdata2010['Range'] == 'nan']  #keep only the rows whose Range cell is empty ('nan' after the string conversion)
irsdata2010 = irsdata2010.drop('Range', axis=1)  #not inplace: drop(..., inplace=True) returns None, so assigning its result would replace the dataframe with None
irsdata2010['AverageAGI'] = irsdata2010.AGI.astype(int)  #total AGI for the zipcode; divided by the number of returns below
irsdata2010['NumberOfReturns'] = irsdata2010['NumberOfReturns'].astype(int)
irsdata2010['AverageAGI'] = irsdata2010['AverageAGI'] * 1000 / irsdata2010.NumberOfReturns  #the * 1000 treats AGI as reported in thousands of dollars
print irsdata2010['AverageAGI'].mean()

irsdata2010full = pd.DataFrame(columns=irsdata2010.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2010.shape[0]):
    listtemp = []
    irsdata2010t = pd.DataFrame(columns=irsdata2010.columns)
    irsdata2010t = irsdata2010t.append([irsdata2010.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2010-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2010t["Month"] = pd.Series(listtemp)
    irsdata2010full = irsdata2010full.append(irsdata2010t, ignore_index=True)
    
irsdata2010full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2010 IRS data to the combined IRS dataframe (the merge with zipcode price data happens in a later cell)
irsdatafull = irsdatafull.append(irsdata2010full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1) #not concat until all irsdatafull is created
print "done"

10402
56226.670196
done
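
The cleanup step in the cell above passes `''.join(chars_to_remove)` as the delete set of Python 2's `str.translate`, which removes individual characters rather than whole substrings such as `**` or `--`. The sketch below reproduces that behaviour with `str.replace` so that it runs on either Python version; `strip_chars` is a hypothetical helper, and the sample strings are made up for illustration.

```python
def strip_chars(text, chars_to_remove):
    """Delete every character that occurs in any entry of chars_to_remove."""
    delete_set = set(''.join(chars_to_remove))  # {'*', '-', ','} for the list below
    for ch in delete_set:
        text = text.replace(ch, '')
    return text

chars_to_remove = ['*', '**', '--', ',']
print(strip_chars('1,234**', chars_to_remove))  # thousands separator and flags are removed
print(strip_chars('-56', chars_to_remove))      # note: a single minus sign is removed too
```

Because a lone `-` is deleted as well, any negative amount in the source file would lose its sign, which may be worth checking before the columns are converted to `int`.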


In [210]:
### prepare 2011 IRS data
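def trim_fraction(text):
    # strip only a trailing '.0' (left over from the float-to-string conversion), so a value such as '1.05' stays intact
    if text.endswith('.0'):
        return text[:-2]
    return text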

irsdata = pd.read_csv('./DataPfiles/11zp05ca.csv',header=4, na_values=['**','--'])
# irsdata = pd.read_csv('./DataPfiles/08zp05ca.csv',header=4, na_values=['**','--'], \
#                       dtype={'Range': np.str, 'ZipCode': np.str, 'NumberOfReturns': np.str, \
#                             'ExemptionsRtn':np.str, 'DepedentExemptionsAmt':np.str,'AGI':np.str,'SalariesWagesRtn':np.str, \
#                             'SalariesWagesAmt':np.str,'TaxableInterestRtn':np.str, 'TaxableInterestAmt':np.str, \
#                             'TotalTaxRtn':np.str, 'TotalTaxAmt':np.str, 'ScheduleCTotalRtn': np.str,'ScheduleCTotalAmt':np.str, \
#                             'ScheduleFTotalRtn':np.str, 'ScheduleFTotalAmt':np.str, 'ScheduleARtn':np.str,'ScheduleAAmt':np.str}
#                       ) 
irsCol = ['Range', 'ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
          'TaxableInterestRtn','TaxableInterestAmt', \
          'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleCTotalAmt','ScheduleFTotalRtn','ScheduleFTotalAmt',\
          'ScheduleARtn','ScheduleAAmt'
          ]
irsdata.columns= irsCol

for x in irsdata.columns:
    irsdata[x] = irsdata[x].astype(str)  #convert dtype from float64 to string object
    irsdata[x] = irsdata[x].apply(trim_fraction) #remove the decimals from the string object 
    
# ZipCode and the amount columns were already converted to strings and trimmed by the loop above, so no per-column repeat is needed
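chars_to_remove = ['*', '**', '--',',','.']
for col in irsdata.columns:
    #str.translate deletes single characters, so this strips every '*', '-', ',' and '.' (decimal points included, unlike the 2010 cell)
    irsdata[col] = irsdata[col].str.translate(None, ''.join(chars_to_remove))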

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California']

irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
irsdata2011 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

irsdata2011 = irsdata2011[irsdata2011['Range'] == 'nan']  #keep only the rows whose Range cell is empty ('nan' after the string conversion)
irsdata2011 = irsdata2011.drop('Range', axis=1)  #not inplace: drop(..., inplace=True) returns None, so assigning its result would replace the dataframe with None
irsdata2011['AverageAGI'] = irsdata2011.AGI.astype(int)  #total AGI for the zipcode; divided by the number of returns below
irsdata2011['NumberOfReturns'] = irsdata2011['NumberOfReturns'].astype(int)
irsdata2011['AverageAGI'] = irsdata2011['AverageAGI'] * 1000 / irsdata2011.NumberOfReturns  #the * 1000 treats AGI as reported in thousands of dollars
print irsdata2011['AverageAGI'].mean()

irsdata2011full = pd.DataFrame(columns=irsdata2011.columns.to_series().append(pd.Series("Month")))

for x in range(0,irsdata2011.shape[0]):
    listtemp = []
    irsdata2011t = pd.DataFrame(columns=irsdata2011.columns)
    irsdata2011t = irsdata2011t.append([irsdata2011.iloc[x]]*12,ignore_index=True)
    for i in range(12):
        thisMonth = ("0%i"%(i+1,))[-2:]
        d = dateUtility.mkDateTime("2011-%s-02"%thisMonth)
        listtemp.append(dateUtility.mkLastOfMonth(d).strftime("%Y-%m-%d"))
    irsdata2011t["Month"] = pd.Series(listtemp)
    irsdata2011full = irsdata2011full.append(irsdata2011t, ignore_index=True)
    
irsdata2011full.set_index(['Month','ZipCode'],inplace=True)  #set dataframe index to match with zipcode price dataframe

### append the 2011 IRS data to the combined IRS dataframe (the merge with zipcode price data happens in a later cell)
irsdatafull = irsdatafull.append(irsdata2011full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)
# print "done"

10409
55833.1618228


In [211]:
### Clean the irsdatafull dataframe: turn 'nan' strings, blank cells and zeros into np.nan, then fill them from neighbouring rows

irsdatafull['ScheduleFTotalRtn'] = irsdatafull['ScheduleFTotalRtn'].str.strip()  # remove extra spaces in the cell
irsdatafull['ScheduleFTotalAmt'] = irsdatafull['ScheduleFTotalAmt'].str.strip()  # remove extra spaces in the cell

# find cells holding 'nan', ' ', '', '0' or 0 and turn them into np.nan
for x in irsdatafull.columns:
    nalist = irsdatafull[x].isin(['nan',' ',0,'0',''])
    irsdatafull.loc[nalist,x] = np.nan

# use fillna to back fill: each missing cell takes the value of the next row that has one
irsdatafull = irsdatafull.reset_index() # reset the index first because the current index is the (Month, ZipCode) MultiIndex
irsdatafull = irsdatafull.set_index('Month') # index by Month only; the rows for one zipcode's months are listed consecutively
irsdatafull = irsdatafull.fillna(method='bfill') # back fill
irsdatafull = irsdatafull.fillna(method='ffill') # forward fill the trailing cells that back filling could not reach
irsdatafull = irsdatafull.reset_index()
irsdatafull.set_index(['Month','ZipCode'],inplace=True) # restore the (Month, ZipCode) index for the later merge into dfPredictZipCodeFinal

# check whether any nan-like values remain in the dataframe
print sum(irsdatafull.isin(['nan',' ',0,'0','']).values)
print sum(irsdatafull.isnull().values)


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
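
The cell above fills gaps by back filling first and forward filling second. A minimal pure-Python sketch of that two-pass idea on a single column follows, with `None` standing in for `np.nan`; `bfill_then_ffill` is a hypothetical helper and does not reproduce every detail of pandas' `fillna`.

```python
def bfill_then_ffill(values):
    """Fill None entries from the next known value, then from the previous one."""
    filled = list(values)
    # back fill: walk from the end, carrying the next known value backwards
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    # forward fill: trailing gaps have no later value, so carry the last known value forward
    prev = None
    for i in range(len(filled)):
        if filled[i] is None:
            filled[i] = prev
        else:
            prev = filled[i]
    return filled

print(bfill_then_ffill([None, 10, None, 30, None]))
```

Since each yearly IRS value is repeated across all twelve months of a zipcode, a fill applied to the whole frame can carry a value from one zipcode's rows into a neighbouring zipcode's rows when an entire block is missing. Filling within each zipcode separately is one way to avoid that.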


In [571]:
### merge IRS data with zipcode price data
# irsdatafull = irsdatafull.append(irsdata2011full)
dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinalnotstack = dfPredictZipCodeFinal.copy()
dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)
# dfPredictZipCodeFinalstack = dfPredictZipCodeFinal.copy()

In [572]:
### Clean the merged dfPredictZipCodeFinal dataframe: turn 'nan' strings, blank cells and zeros into np.nan, then fill them from neighbouring rows

# find cells holding 'nan', ' ', '', '.' or 0 and turn them into np.nan
for x in dfPredictZipCodeFinal.columns:
    if (dfPredictZipCodeFinal[x].dtypes == 'object'): # for object ("O") columns,
        dfPredictZipCodeFinal[x] = dfPredictZipCodeFinal[x].astype(str).str.strip() # strip surrounding spaces in the cell
    nalist = dfPredictZipCodeFinal[x].isin(['nan',' ',0,'','.'])
    dfPredictZipCodeFinal.loc[nalist,x] = np.nan
    print dfPredictZipCodeFinal[x].dtypes


# use fillna to back fill: each missing cell takes the value of the next row that has one
dfPredictZipCodeFinal['PredictZipCode'].fillna(value=False,inplace=True) # replace the NaN values that stack/unstack created during the merge with False
dfPredictZipCodeFinal = dfPredictZipCodeFinal.reset_index() # reset the index first because the current index is a two-level MultiIndex
dfPredictZipCodeFinal = dfPredictZipCodeFinal.set_index('level_0') # index by month (level_0); the rows for one zipcode's months are listed consecutively
dfPredictZipCodeFinal = dfPredictZipCodeFinal.fillna(method='bfill') # back fill
dfPredictZipCodeFinal = dfPredictZipCodeFinal.fillna(method='ffill') # forward fill the trailing cells that back filling could not reach
dfPredictZipCodeFinal = dfPredictZipCodeFinal.reset_index()
dfPredictZipCodeFinal.rename(columns={'level_0':'Month'}, inplace=True)
#### Disabled for now: making Month and ZipCode the index doesn't seem to add much value
# dfPredictZipCodeFinal.set_index(['Month','ZipCode'],inplace=True)


# check whether any nan-like values remain in the dataframe
print "nan:"
print sum(dfPredictZipCodeFinal.isin(['nan',' ','',0,'.']).values)
print "null value:"
print sum(dfPredictZipCodeFinal.isnull().values)

float64
float64
float64
float64
float64
object
float64
float64
float64
object
float64
float64
object
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
object
object
object
object
object
object
object
object
float64
object
object
object
object
nan:
[     0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0 412370      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0]
null value:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0]
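
One column in the `nan:` check above still reports 412370 matches. A likely explanation, offered as an assumption rather than something the notebook states, is that this is the boolean `PredictZipCode` column: Python treats `False` as equal to `0`, so `False` entries match the `0` in the list passed to `isin`. The plain-Python sketch below shows the equality involved.

```python
nan_like = ['nan', ' ', '', 0, '.']

# bool is a subclass of int, so False compares (and hashes) equal to 0
print(False == 0)
print(False in nan_like)
print(True in nan_like)
```

If that is the cause, the 412370 figure counts non-target rows rather than missing data, and dropping `0` from the list for boolean columns would make the check easier to read.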


In [573]:
# pd.set_option('display.max_rows', 500)
# print sum(dfPredictZipCodeFinal.PredictZipCode)
# dfPredictZipCodeFinal.PredictZipCode.isnull

# print sum(dfPredictZipCodeFinal['PredictZipCode'].isnull().values)
# dfPredictZipCodeFinalnotstack['PredictZipCode'].fillna(value='test', inplace=True)
# # dfp1["preTestScore"].fillna(value=False , inplace=True)
# print sum(dfPredictZipCodeFinalnotstack['PredictZipCode'].isnull().values)
dfPredictZipCodeFinal.describe()

### simple grouping analysis ###

## Average of each column (All Homes price, AverageAGI, etc.), grouped by targeted vs. non-targeted zipcodes
print dfPredictZipCodeFinal.groupby('PredictZipCode').mean()


selectedZipCodes[selectedZipCodes[0] == True]

                    1 Bedroom      2 Bedroom      3 Bedroom      4 Bedroom  \
PredictZipCode                                                               
0               184437.134127  279913.232776  381749.027572  509540.048258   
1               179464.620355  259671.663974  362444.781906  499482.132472   

                5 or More Bedroom      All Homes    AverageAGI   Condominiums  \
PredictZipCode                                                                  
0                   764913.459757  384845.254747  50716.233060  275579.344036   
1                   689189.919225  359840.226171  38189.692297  252101.195477   

                Estimated Rent per Square Foot  \
PredictZipCode                                   
0                                     1.267083   
1                                     1.312438   

                Estimated Rent, All Homes in Region  \
PredictZipCode                                        
0                                       1930.486602

            0
ZipCode
90001    True
90003    True
90004    True
90006    True
90007    True
90008    True
90010    True
90011    True
90012    True
90013    True


In [574]:
### simple grouping analysis ###

## number of rows flagged as targeted (1) or not targeted (0) in each year
dfPredictZipCodeFinal['Year'] = dfPredictZipCodeFinal.Month.str[:4]  # add a Year column taken from the first 4 characters of Month
TargetZipCodeCount = pd.DataFrame(dfPredictZipCodeFinal.groupby(['Year','PredictZipCode']).count()['ZipCode'])
TargetZipCodeCount = TargetZipCodeCount.unstack(1)
TargetZipCodeCount = TargetZipCodeCount.rename(columns = {'ZipCode':'ZipCodeCount'})
TargetZipCodeCount.fillna(0,inplace=True)
TargetZipCodeCount

               ZipCodeCount
PredictZipCode            0    1
Year
1996                  11051    9
1997                  14878   31
1998                  29022   63
1999                  14911  458
2000                  14805  544
2001                  29043  105
2002                  28960  188
2003                  15073  295
2004                  28208  556
2005                  28413  387


In [575]:
### simple grouping analysis ###

#for each zipcode that was ever targeted, count how many times it was selected as a target
TargetedZipCodeCount = pd.DataFrame(dfPredictZipCodeFinal[dfPredictZipCodeFinal['PredictZipCode'] == True].groupby(['ZipCode'])['PredictZipCode'].count())
TargetedZipCodeCount.rename(columns = {'PredictZipCode':'PredictZipCodeCount'}, inplace = True)
TargetedZipCodeCount.sort('PredictZipCodeCount', ascending=False)
# sum(TargetedZipCodeCount['PredictZipCode'] > 1)

# dfPredictZipCodeFinal.groupby(['Month','ZipCode']).count()

         PredictZipCodeCount
ZipCode
94609                     16
93280                     16
95113                     15
96093                     14
93501                     13
95811                     13
94301                     13
93426                     13
94022                     13
95070                     13


In [577]:
###  Prepare training and testing dataset
from sklearn.cross_validation import train_test_split

# separate dataset into training and testing sets
X = dfPredictZipCodeFinal.copy()
X.set_index(['Month'],inplace=True) #index by Month
# y = X['PredictZipCode'].values * 1  #set dependent variable PredictZipCode from dataframe
y = X['PredictZipCode']  #set dependent variable PredictZipCode from dataframe
X.drop('PredictZipCode',1,inplace=True) #remove PredictZipCode from dataframe

######################################## TEMPORARY: remove this subsetting to model on the entire dataset once testing is done
X = X.iloc[20000:80000,:]  #positional slice: rows 20000 to 79999
y = y.iloc[20000:80000]
######################################## TEMPORARY: remove this subsetting to model on the entire dataset once testing is done

# X_train, X_test, y_train, y_test = train_test_split(X.ix[20000:40000,:], y[20000:40000], test_size=0.33, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print X_train.shape, X_test.shape, y_train.shape, y_test.shape

print sum(y_train)
#check whether y_train contains any True values; if not, a larger or better sample is needed
# if True in y_train:  
#     print "yes"
# else:
#     print "no"

(48000, 40) (12000, 40) (48000L,) (12000L,)
812.0
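
Only 812 of the 48000 training rows are positive (about 1.7%), so a purely random split can leave the two sets with noticeably different positive rates. One option, not used in the notebook, is to split each class separately. The sketch below is a small pure-Python illustration of that idea; `stratified_split` and its arguments are hypothetical names, and the toy labels are made up.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same positive rate in both."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # shuffle the row positions of this class, then cut off the test share
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 20 + [0] * 980          # toy labels with 2% positives
train_idx, test_idx = stratified_split(labels)
print(len(train_idx))
print(len(test_idx))
print(sum(labels[i] for i in test_idx))
```

Newer scikit-learn releases expose a `stratify` argument on `train_test_split` that serves the same purpose.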


In [588]:
### Training prediction model

#LogisticRegression
from sklearn import svm
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X_train, y_train) 
y_pred = list(clf.predict(X_test))
print sum(y_pred)

#Evaluate model
print "accuracy score normalized:", accuracy_score(y_test, y_pred)
print "accuracy score not normalized (number of correct predictions):", accuracy_score(y_test, y_pred, normalize=False)
print "F1 score binary:",f1_score(y_test, y_pred, average='binary')
print "F1 score micro:",f1_score(y_test, y_pred, average='micro')
print "F1 score weighted:",f1_score(y_test, y_pred, average='weighted') 


27.0
accuracy score normalized: 0.98275
accuracy score not normalized (number of correct predictions): 11793
F1 score binary: 0.0717488789238
F1 score micro: 0.0717488789238
F1 score weighted: 0.0717488789238
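
The gap between the 0.98 accuracy and the 0.07 F1 score above comes from class imbalance. The printed numbers are consistent with 8 true positives, 19 false positives and 188 false negatives out of 12000 test rows. These counts are reconstructed here from the output (27 predicted positives, 11793 correct predictions, F1 equal to 16/223) and are not printed by the notebook. The sketch below recomputes both metrics from those assumed counts; `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and binary F1 from confusion-matrix counts."""
    accuracy = float(tp + tn) / (tp + fp + fn + tn)
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# counts inferred from the printed output above (an assumption, see the lead-in)
acc, f1 = accuracy_and_f1(tp=8, fp=19, fn=188, tn=11785)
print(round(acc, 5))
print(round(f1, 4))
```

With recall at 8 out of 196 (about 4%), the model misses almost every target row, which accuracy alone hides because nearly all rows are negative.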




In [None]:
### Training prediction model

#SVM
from sklearn import svm
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

clf = svm.SVC()  #default SVC settings as a starting point
clf.fit(X_train, y_train) 
y_pred = list(clf.predict(X_test))
print sum(y_pred)

#Evaluate model
print "accuracy score normalized:", accuracy_score(y_test, y_pred)
print "accuracy score not normalized (number of correct predictions):", accuracy_score(y_test, y_pred, normalize=False)
print "F1 score binary:",f1_score(y_test, y_pred, average='binary')
print "F1 score micro:",f1_score(y_test, y_pred, average='micro')
print "F1 score weighted:",f1_score(y_test, y_pred, average='weighted') 


In [None]:
### This cell combines the individual California zipcode pickle (.p) files into a single file
### so the main analytic code can load all zipcodes and the distances between them into a single dataframe.

import os
import glob
import pickle
import pandas as pd

datadir = './DataPfiles/'

calzipcodeAllLong = pd.DataFrame()    
# match only the batch files (calzip<first>_to_<last>.p) so that a previously written calzipcodeAllLong.p is not read back in
for filename in glob.glob(os.path.join(datadir, 'calzip*_to_*.p')):
    calzipcodeAllLong = pd.concat([calzipcodeAllLong, pd.read_pickle(filename)], axis=1)

dfname = './DataPfiles/' + 'calzipcodeAllLong' + ".p"
pickle.dump( calzipcodeAllLong, open( dfname, "wb" ) )
print 'Done: ' + dfname

In [None]:
##Anomaly detection method TWO
##Find zipcodes that have z score > 2 in df..InZipCode and also z score > 2 in df..AcrossZipCode among the 10 closest zipcodes



### This cell calculates the distance between two zipcodes (within California); note that calculateDistance returns kilometers, not miles
import pandas as pd
import pickle
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
from math import radians, cos, sin, asin, sqrt
def calculateDistance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km


geolocator = Nominatim()

zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
calzipcode = zipcode[zipcode.State == 'California'].copy()  #copy so that the new distance columns are not written to a view of zipcode

startzipcodeIndex = int(100)    #start after the last batch processed by the script in test.ipynb
endzipcodeIndex = int(200)

for z in calzipcode.iloc[startzipcodeIndex:endzipcodeIndex,0]:
# for z in calzipcode.iloc[2:3,0]:
# for z in calzipcode['Postal Code']:    
    # look up the origin zipcode's coordinates once, as scalars, instead of once per inner iteration
    origin = calzipcode[calzipcode['Postal Code'] == int(z)].iloc[0]
    ziplon1 = origin['Longitude']
    ziplat1 = origin['Latitude']
    for i,r in calzipcode.iterrows():
        ziplon2 = r['Longitude']
        ziplat2 = r['Latitude']
        calzipcode.loc[i,z] = calculateDistance(ziplon1,ziplat1,ziplon2,ziplat2)

dfname = './DataPfiles/' + 'calzip' + str(int(calzipcode.iloc[startzipcodeIndex]['Postal Code'])) + "_to_" + str(int(calzipcode.iloc[endzipcodeIndex-1]['Postal Code'])) + ".p"
pickle.dump( calzipcode, open( dfname, "wb" ) )
print 'Done: ' + dfname
 
    
# geolocator = Nominatim()
# location = geolocator.geocode("94582")
# print(location.address)
# print((location.latitude, location.longitude))
# print(location.raw)

# # >> newport_ri = (41.49008, -71.312796)
# # >>> cleveland_oh = (41.499498, -81.695391)
# print(vincenty((geolocator.geocode("94582").latitude,geolocator.geocode("94582").longitude), (geolocator.geocode("94122").latitude,geolocator.geocode("94122").longitude)).miles)
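
`calculateDistance` returns kilometres. If miles are wanted, the same haversine formula can be followed by a unit conversion. A self-contained sketch is shown below; it reuses the notebook's 6367 km Earth radius, and `haversine_miles` is a hypothetical name. The two coordinates are the Newport and Cleveland points quoted in the commented-out geopy example.

```python
from math import radians, cos, sin, asin, sqrt

KM_PER_MILE = 1.609344

def haversine_miles(lon1, lat1, lon2, lat2, radius_km=6367.0):
    """Great-circle distance in miles between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = [radians(v) for v in (lon1, lat1, lon2, lat2)]
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # haversine formula
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return radius_km * 2 * asin(sqrt(a)) / KM_PER_MILE

newport_ri = (41.49008, -71.312796)      # (latitude, longitude)
cleveland_oh = (41.499498, -81.695391)
print(haversine_miles(newport_ri[1], newport_ri[0], cleveland_oh[1], cleveland_oh[0]))
```

The haversine result treats the Earth as a sphere, so it will differ slightly from geopy's ellipsoidal `vincenty` distance for the same pair of points.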

In [None]:
### This cell as well as the cell after this one are just my playground to play with different functions.  This is not part of the project. 


from pylab import plot, ylim, xlim, show, xlabel, ylabel
from numpy import linspace, loadtxt
import numpy as np

r=3.0

x = p.head()
y = pz

def movingaverage(interval, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(interval, window, 'same')

# plot(x,y)
# xlim(0,1000)

x_av = movingaverage(x, r)
# plot(x_av, y)

# xlabel("Months since Jan 1749.")
# ylabel("No. of Sun spots")
# show()
print x_av

p = df11.iloc[:,0]  #All Homes price
pz = df11.iloc[:,-1] 
print pz


t1 = df.iloc[:,1] 
t2 = df3.iloc[:,0]
print "mean is % 4.3F and sd is % 4.3F " % (t1.mean(),t1.std())

# add a new column for moving average of All Homes price


p = df11.iloc[:,0] #All Homes price
pz = df11.iloc[:,-1] 


window_size = 3.0 #number of samples in the centered moving-average window
movingave = lambda x: np.convolve(x, np.ones(int(window_size))/float(window_size), 'same')
transformed = df11.groupby('ZipCode')
transformed['All Homes'].transform(movingave)
# df11.info()

# df11 = df11[df11['All Homes'].isnull()]
grouped = df11[['All Homes','ZipCode']].groupby('ZipCode')
b = pd.DataFrame()
newdf = pd.DataFrame()
for name,group in grouped:
    g = group.copy()
#     print group.shape
#     print "size: % 3.2F" % movingaverage(group['All Homes'],3).size
#     if sum(g.isnull()) < 0:
#         g['Moving Ave'] = movingaverage(group['All Homes'],3)

#     print g.shape
#     print g.head()

    #     if(newdf.isnull):
#         newdf = g
#     newdf = newdf.append(g)
#     b = group['All Homes']
#     a = movingaverage(group['All Homes'],3)
    
#     group["MovingAve"] = np.convolve(group['All Homes'], np.ones(int(window_size))/float(window_size), 'same')
#     newdf = newdf.append(group)
# print b
# newdf
# newdf.info()
    
    


# df11['Moving Ave'] = movingaverage(df11.iloc[:,0], 3)  #param 1 is the All Homes price; param 2 is the number of samples in the centered window

# df11.head()
x                                             

In [None]:
np.average(dfAllHomes['93063'][(-1-y):-1])

In [None]:
import pandas as pd
#dfAllHomesDiffFromMovAve

In [None]:
print dfAllHomesStdDevAcrossZipCode.shape
# dfAllHomesStdDevAcrossZipCode
# np.sum(dfAllHomesStdDevAcrossZipCode > 2)

In [None]:
#dfAllHomesDiffFromMovAve.mean(axis =1 )

In [None]:
dfAllHomesDiffFromMovAve.std(axis = 1)
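
The anomaly rule described earlier in the notebook flags a zipcode when its z score exceeds 2. As a hedged illustration of that rule on one month of data, the sketch below computes z scores across a handful of made-up price changes and returns the zipcodes above the threshold. The function name, the sample values and the use of the sample standard deviation are assumptions, not taken from the notebook's dataframes.

```python
from math import sqrt

def flag_high_z(values_by_zip, threshold=2.0):
    """Return zipcodes whose value lies more than `threshold` sample std devs above the mean."""
    values = list(values_by_zip.values())
    n = len(values)
    mean = sum(values) / float(n)
    # sample standard deviation (divides by n - 1)
    std = sqrt(sum((v - mean) ** 2 for v in values) / float(n - 1))
    if std == 0:
        return []
    return sorted(z for z, v in values_by_zip.items() if (v - mean) / std > threshold)

# hypothetical month-over-month price changes (percent) for a few zipcodes
changes = {'90001': 0.4, '90003': 0.5, '90004': 0.3, '90006': 0.6,
           '90007': 0.4, '90008': 0.5, '90010': 0.5, '90011': 6.0}
print(flag_high_z(changes))
```

With very few zipcodes in the comparison group a single outlier inflates the standard deviation, so a threshold of 2 becomes hard to exceed; larger groups make the rule more sensitive.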

In [None]:
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

geolocator = Nominatim()
location = geolocator.geocode("94582")
print(location.address)
print((location.latitude, location.longitude))
print(location.raw)

# distance in miles between zipcodes 94582 and 94122, reusing the lookup above instead of geocoding 94582 again
other = geolocator.geocode("94122")
print(vincenty((location.latitude, location.longitude), (other.latitude, other.longitude)).miles)

In [None]:
import pandas as pd
import pickle

df = pickle.load( open( "./DataPfiles/calzip90001_to_90010.p", "rb" ) )  #forward slashes avoid accidental backslash escapes in the path

# print "abcde"
# zipcode = pd.read_csv('./DataPfiles/us_postal_codes.csv')
# calzipcode = zipcode[zipcode.State == 'California']

In [None]:
import os
from os import path
files = [f for f in os.listdir("./DataPfiles/")]# if path.isfile(f)]
files
# print os.listdir(".\DataPfiles\\")

In [None]:
PtoR = dfPriceToRent.mean()
print dfPriceToRent.mean(),dfPriceToRent.std()
print PtoR.mean(),PtoR.std()

highPtoRzip = pd.DataFrame(dfPriceToRent.mean() > 18)
highPtoRzip[highPtoRzip[0] == True]
# targetZipCodes[targetZipCodes[0] == True]

In [None]:
# calzipcode = zipcode[zipcode.State == 'California']

# irsdata['ZipCode'] = irsdata['ZipCode'].str.strip()  #strip white space in the cell
# print sum(irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str)))  #match the rows in irsdata with calzipcode
# irsdata2007 = irsdata[irsdata['ZipCode'].isin(calzipcode['Postal Code'].apply(int).apply(str))]

# irsdata2007 = irsdata2007[irsdata2007['Range'] == ""]
# irsdata2007 = irsdata2007.drop('Range', axis=1, inplace=True)
# irsdata2007['AverageAGI'] = irsdata2007.AGI.astype(int) #/ irsdata2002['NumberOfReturns'])
# irsdata2007['NumberOfReturns'] = irsdata2007['NumberOfReturns'].astype(int)
# irsdata2007['AverageAGI'] = irsdata2007['AverageAGI'] * 1000 / irsdata2007.NumberOfReturns
# print irsdata2007['AverageAGI'].mean()


# irsdata = pd.read_csv('./DataPfiles/ZIPCode2006CA.csv',header=4, na_values=['**','--'])
# irsCol = ['ZipCode','NumberOfReturns','ExemptionsRtn','DepedentExemptionsAmt','AGI','SalariesWagesRtn','SalariesWagesAmt', \
#           'TaxableInterestRtn','TaxableInterestAmt', \
#           'TotalTaxRtn','TotalTaxAmt','ScheduleCTotalRtn','ScheduleFTotalRtn',\
#           'ScheduleARtn'
#           ]
# irsdata.columns= irsCol
# irsdata['ZipCode']

# print type(irsdata2007['Range'])
# import numpy as np
# L = [4,np.nan ,6]
# dftest = Series(L)
# dftest.apply(np.isnan)
# np.isNan(dftest)
# np.isnan()
# sum(irsdata2007.Range == "NaN")
# irsdata2007.Range 

dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)
print "done"
dfPredictZipCodeFinal
# dfPredictZipCode.unstack(level=1)
# irsdatafull.unstack(level=1)

In [None]:
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCodeFinal.unstack(level=1),irsdata2001full.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)

# b = dfPredictZipCodeFinal.index.get_level_values(0) == '1998-09-30'  #for testing the result
# # b1 = dfPredictZipCodeFinal.index.get_level_values(0) == '2001-09-30'  #for testing the result
# dfPredictZipCodeFinal[b]  #for testing the result

# irsdatafull = irsdata1998full
# irsdatafull = irsdatafull.append(irsdata2001full)
# irsdatafull = irsdatafull.append(irsdata2002full)
# dfPredictZipCodeFinal = pd.concat([dfPredictZipCode.unstack(level=1),irsdatafull.unstack(level=1)], axis=1)
# dfPredictZipCodeFinal = dfPredictZipCodeFinal.stack(level=1)
# dfPredictZipCodeFinal

b = dfPredictZipCodeFinal.index.get_level_values(0) == '1998-09-30'  #for testing the result
b1 = dfPredictZipCodeFinal.index.get_level_values(0) == '2001-09-30'  #for testing the result
b2 = dfPredictZipCodeFinal.index.get_level_values(0) == '2002-09-30'  #for testing the result
b4 = dfPredictZipCodeFinal.index.get_level_values(0) == '2004-09-30'  #for testing the result
b5 = dfPredictZipCodeFinal.index.get_level_values(0) == '2005-09-30'  #for testing the result
b6 = dfPredictZipCodeFinal.index.get_level_values(0) == '2006-09-30'  #for testing the result
b7 = dfPredictZipCodeFinal.index.get_level_values(0) == '2007-09-30'  #for testing the result
b8 = dfPredictZipCodeFinal.index.get_level_values(0) == '2008-09-30'  #for testing the result
b9 = dfPredictZipCodeFinal.index.get_level_values(0) == '2009-09-30'  #for testing the result
b10 = dfPredictZipCodeFinal.index.get_level_values(0) == '2010-09-30'  #for testing the result
b11 = dfPredictZipCodeFinal.index.get_level_values(0) == '2011-09-30'  #for testing the result
dfPredictZipCodeFinal[b11].head(5)  #for testing the result

In [None]:
dfPredictZipCodeFinal.info()
# dfPredictZipCodeFinal['TotalTaxAmt']

In [None]:
dffinaltest = dfPredictZipCodeFinal.copy()
# dffinaltest.mean
dffinaltest1 = dffinaltest.fillna(method='backfill')
dffinaltest2 = dffinaltest1.dropna(axis=0,how='any')
dffinaltest2.describe().ix[:,10:36]
# dffinaltest2['SalariesWagesAmt']

In [None]:
dffinaltest = dfPredictZipCodeFinal.copy()
# dffinaltest.mean
dffinaltest1 = dffinaltest.dropna()
dffinaltest1.info()
dffinaltest2 = dffinaltest1.mean(axis=0)
dffinaltest2
# dffinaltest1 = dffinaltest1.ix[:,0:2]
# dffinaltest1.mean(axis=0)

In [None]:
dffinaltest = dfPredictZipCodeFinal.copy()
a = "nan"
# nalist = dffinaltest['DepedentExemptionsAmt'].str.contains("NaN", na=True)
# dffinaltest.ix[nalist,'DepedentExemptionsAmt'] == np.nan

# dffinaltest['DepedentExemptionsAmt'] = dffinaltest['DepedentExemptionsAmt'].convert_objects(convert_numeric=True)
dffinaltest = dffinaltest.convert_objects(convert_numeric=True)
# dffinaltest['DependentExemptionsAmt'] = dffinaltest['DependentExemptionsAmt'].astype(float)
dffinaltest2 = dffinaltest.fillna(dffinaltest.mean(numeric_only=True))

dffinaltest2['DepedentExemptionsAmt']

# dffinaltest['DepedentExemptionsAmt']
# pd.isnull(dffinaltest['DepedentExemptionsAmt'])

# dffinaltest['DepedentExemptionsAmt'].fillna("abc")
# dffinaltest['DepedentExemptionsAmt']
# a = np.nan
# np.isnan(a)
# type(a)