# Kaggle Competition - Improving Zillow Zestimate

Submission by Robert Latimer

## Introduction

Zillow’s Zestimate home valuation has shaken up the U.S. real estate industry since it was first released 11 years ago.

A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important. The Zestimate was created to give consumers as much information as possible about homes and the housing market, marking the first time consumers had access to this type of home value information at no cost.

“Zestimates” are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. And, by continually improving the median margin of error (from 14% at the onset to 5% today), Zillow has since become established as one of the largest, most trusted marketplaces for real estate information in the U.S. and a leading example of impactful machine learning.

Zillow Prize, a competition with a one million dollar grand prize, is challenging the data science community to help push the accuracy of the Zestimate even further. Winning algorithms stand to impact the home values of 110M homes across the U.S.

In this million-dollar competition, participants will develop an algorithm that makes predictions about the future sale prices of homes. The contest is structured into two rounds: the qualifying round, which opens May 24, 2017, and the private round for the 100 top qualifying teams, which opens on Feb 1, 2018. In the qualifying round, you’ll be building a model to improve the Zestimate residual error. In the final round, you’ll build a home valuation algorithm from the ground up, using external data sources to help engineer new features that give your model an edge over the competition.

Because real estate transaction data is public information, there will be a three-month sales tracking period after each competition round closes where your predictions will be evaluated against the actual sale prices of the homes. The final leaderboard won’t be revealed until the close of the sales tracking period.

The goal of this model is to improve the Zestimate residual error. More specifically, we are trying to minimize the mean absolute error between the predicted log error and the actual log error. This information is recorded in the transactions training data.

**logerror = log(Zestimate) − log(SalePrice)**
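To make the metric concrete, here is a minimal sketch of the log error and the mean absolute error on a few made-up prices. The helper names and the dollar amounts are illustrative assumptions, and the sketch assumes the natural logarithm.

```python
import numpy as np

def log_error(zestimate, sale_price):
    """logerror = log(Zestimate) - log(SalePrice); natural log is assumed."""
    return np.log(zestimate) - np.log(sale_price)

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual log errors."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

# Made-up Zestimates and sale prices, for illustration only.
zestimates = np.array([310000.0, 495000.0, 250000.0])
sale_prices = np.array([300000.0, 500000.0, 250000.0])

actual = log_error(zestimates, sale_prices)

# A naive baseline that predicts a log error of 0 for every home.
baseline = np.zeros_like(actual)
mae = mean_absolute_error(baseline, actual)
print(actual, mae)
```

A positive log error means the Zestimate was above the sale price, and a negative one means it was below.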
 
For each property (unique parcelid), we will predict a log error for each time point. We should be predicting 6 timepoints: **October 2016 (201610), November 2016 (201611), December 2016 (201612), October 2017 (201710), November 2017 (201711), and December 2017 (201712)**. The file should contain a header and have the following format:

`ParcelId,201610,201611,201612,201710,201711,201712
10754147,0.1234,1.2234,-1.3012,1.4012,0.8642,3.1412
10759547,0,0,0,0,0,0
etc.`
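As a rough sketch of how a file in this format could be assembled, the snippet below fills every month with one constant prediction. The function name `build_submission` and the value `0.0115` are placeholders invented for the example, not part of the competition kit.

```python
import pandas as pd

# The six required time points, used as column headers.
SUBMISSION_MONTHS = ["201610", "201611", "201612", "201710", "201711", "201712"]

def build_submission(parcel_ids, prediction=0.0):
    """Return a frame with one row per parcel and one column per time point."""
    submission = pd.DataFrame({"ParcelId": list(parcel_ids)})
    for month in SUBMISSION_MONTHS:
        submission[month] = prediction
    return submission

# Two parcel ids taken from the sample rows above; the prediction is a placeholder.
submission = build_submission([10754147, 10759547], prediction=0.0115)

# Four decimal places matches the precision shown in the sample file.
csv_text = submission.to_csv(index=False, float_format="%.4f")
print(csv_text)
```

In a real run, the constant would be replaced by per-parcel, per-month model predictions before writing the file.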

## Environment Setup

Load modules for data analysis and visualization.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import scipy.stats as stats
import lightgbm as lgb
%matplotlib inline
pd.options.display.max_columns = 999

## Import Data

We have four spreadsheets:
* properties_2016.csv - all the properties with their home features for 2016
* train_2016_v2.csv - the training set with transactions from 1/1/2016 to 12/31/2016
* sample_submission.csv - a sample submission file in the correct format
* zillow_data_dictionary.xlsx - explains the data fields

### Training Data

The `Training Data` dataset contains the log error and transaction dates for 90,275 homes sold during 2016.

In [3]:
train_data = pd.read_csv('train_2016_v2.csv', parse_dates=["transactiondate"])
print(train_data.shape)
train_data.head(10)

(90275, 3)


Unnamed: 0,parcelid,logerror,transactiondate
0,11016594,0.0276,2016-01-01
1,14366692,-0.1684,2016-01-01
2,12098116,-0.004,2016-01-01
3,12643413,0.0218,2016-01-02
4,14432541,-0.005,2016-01-02
5,11509835,-0.2705,2016-01-02
6,12286022,0.044,2016-01-02
7,17177301,0.1638,2016-01-02
8,14739064,-0.003,2016-01-02
9,14677559,0.0843,2016-01-03


### Property Data

The `Property Data` dataset contains 58 different features (or details) on almost 3 million homes. Unlike the `Training Data`, this dataset covers all homes, not just the ones that have been sold. We will see that many of the homes in this dataset are missing information, signaling that this dataset will likely need some pruning.

In [4]:
property_data = pd.read_csv('properties_2016.csv')
print(property_data.shape)
property_data.head(10)


(2985217, 58)


Unnamed: 0,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,calculatedfinishedsquarefeet,finishedsquarefeet12,finishedsquarefeet13,finishedsquarefeet15,finishedsquarefeet50,finishedsquarefeet6,fips,fireplacecnt,fullbathcnt,garagecarcnt,garagetotalsqft,hashottuborspa,heatingorsystemtypeid,latitude,longitude,lotsizesquarefeet,poolcnt,poolsizesum,pooltypeid10,pooltypeid2,pooltypeid7,propertycountylandusecode,propertylandusetypeid,propertyzoningdesc,rawcensustractandblock,regionidcity,regionidcounty,regionidneighborhood,regionidzip,roomcnt,storytypeid,threequarterbathnbr,typeconstructiontypeid,unitcnt,yardbuildingsqft17,yardbuildingsqft26,yearbuilt,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,10754147,,,,0.0,0.0,,,,,,,,,,,,6037.0,,,,,,,34144442.0,-118654084.0,85768.0,,,,,,010D,269.0,,60378000.0,37688.0,3101.0,,96337.0,0.0,,,,,,,,,,,9.0,2015.0,9.0,,,,
1,10759547,,,,0.0,0.0,,,,,,,,,,,,6037.0,,,,,,,34140430.0,-118625364.0,4083.0,,,,,,0109,261.0,LCA11*,60378000.0,37688.0,3101.0,,96337.0,0.0,,,,,,,,,,,27516.0,2015.0,27516.0,,,,
2,10843547,,,,0.0,0.0,,,,,,73026.0,,,73026.0,,,6037.0,,,,,,,33989359.0,-118394633.0,63085.0,,,,,,1200,47.0,LAC2,60377030.0,51617.0,3101.0,,96095.0,0.0,,,,2.0,,,,,,650756.0,1413387.0,2015.0,762631.0,20800.37,,,
3,10859147,,,,0.0,0.0,3.0,7.0,,,,5068.0,,,5068.0,,,6037.0,,,,,,,34148863.0,-118437206.0,7521.0,,,,,,1200,47.0,LAC2,60371410.0,12447.0,3101.0,27080.0,96424.0,0.0,,,,,,,1948.0,1.0,,571346.0,1156834.0,2015.0,585488.0,14557.57,,,
4,10879947,,,,0.0,0.0,4.0,,,,,1776.0,,,1776.0,,,6037.0,,,,,,,34194168.0,-118385816.0,8512.0,,,,,,1210,31.0,LAM1,60371230.0,12447.0,3101.0,46795.0,96450.0,0.0,,,,1.0,,,1947.0,,,193796.0,433491.0,2015.0,239695.0,5725.17,,,
5,10898347,,,,0.0,0.0,4.0,7.0,,,,2400.0,,,2400.0,,,6037.0,,,,,,,34171873.0,-118380906.0,2500.0,,,,,,1210,31.0,LAC4,60371250.0,12447.0,3101.0,46795.0,96446.0,0.0,,,,,,,1943.0,1.0,,176383.0,283315.0,2015.0,106932.0,3661.28,,,
6,10933547,,,,0.0,0.0,,,,,,,,,,,,6037.0,,,,,,,34131929.0,-118351474.0,,,,,,,010V,260.0,LAC2,60371440.0,12447.0,3101.0,274049.0,96049.0,0.0,,,,,,,,,,397945.0,554573.0,2015.0,156628.0,6773.34,,,
7,10940747,,,,0.0,0.0,,,,,,3611.0,,,3611.0,,,6037.0,,,,,,,34171345.0,-118314900.0,5333.0,,,,,,1210,31.0,BUC4YY,60373110.0,396054.0,3101.0,,96434.0,0.0,,,,,,,1946.0,1.0,,101998.0,688486.0,2015.0,586488.0,7857.84,,,
8,10954547,,,,0.0,0.0,,,,,,,,,,,,6037.0,,,,,,,34218210.0,-118331311.0,145865.0,,,,,,010D,269.0,BUR1*,60373100.0,396054.0,3101.0,,96436.0,0.0,,,,,,,,,,,9.0,2015.0,9.0,,,,
9,10976347,,,,0.0,0.0,3.0,7.0,,,,3754.0,,,3754.0,,,6037.0,,,,,,,34289776.0,-118432085.0,7494.0,,,,,,1210,31.0,SFC2*,60373200.0,47547.0,3101.0,,96366.0,0.0,,,,,,,1978.0,1.0,,218440.0,261201.0,2015.0,42761.0,4054.76,,,


## Visualize and Clean Data
`Property Data` contains many cells with the value "NaN" ("Not a Number"), which means that the information for that property is missing. The first home entry above, for example, is missing "`airconditioningtypeid`", "`architecturalstyletypeid`", "`basementsqft`", and "`buildingclasstypeid`", along with too many other categories to list. Because "NaN" cannot be used directly in most calculations, we will often replace it with "0", but we will decide case by case as we continue to explore the information.

Let's take a deeper look at each feature (column) to see the percentage of houses that actually have information (not "NaN") in its respective feature. If 99% of the homes are missing information on, say, "`architecturalstyletypeid`", there will not be a great amount of information gained by keeping that feature in our dataset.

In [None]:
plt.figure(figsize=(12,20))
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = True).plot(kind = 'barh')
plt.title('Percentage of Present Information by Feature')


In [None]:
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = False)

Aha, as expected, "`architecturalstyletypeid`" is present in less than 1% of the homes listed. In fact, several features are present in less than 1% of homes. Because these features are almost entirely absent, they are unlikely to provide much insight. It is also worth noting that not a single feature is present in 100% of the homes.
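To turn that observation into a shortlist, one option is to collect every feature whose share of non-null values falls below a cutoff. The sketch below does this on a toy frame. The helper name `sparse_features` and the 1% threshold are assumptions made for illustration, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def sparse_features(frame, min_present=0.01):
    """Return the columns whose fraction of non-null values is below min_present."""
    present = frame.notnull().mean()
    return present[present < min_present].sort_values().index.tolist()

# Toy stand-in for property_data (200 invented rows).
toy = pd.DataFrame({
    "parcelid": range(200),
    "bedroomcnt": [3.0] * 200,                           # always present
    "architecturalstyletypeid": [7.0] + [np.nan] * 199,  # present in 0.5% of rows
    "poolcnt": [1.0] * 40 + [np.nan] * 160,              # present in 20% of rows
})

candidates = sparse_features(toy.drop("parcelid", axis=1))
print(candidates)
```

A feature on this shortlist is only a candidate for dropping: a "NaN" may still carry meaning (for example, "no pool"), so each one deserves a look first.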

Before we decide to drop some of the (mostly-missing) features, let's make sure that a cell that reads "NaN" is really symbolic of a "No". For example, if a home has a hot tub or spa, "`hashottuborspa`" has a value entered, but if it does not, the cell is left as "NaN". Knowing whether or not a home has a pool or a hot tub is valuable information, so let's correct that category and a few others in similar scenarios.

### Pools & Hot tubs

There are actually multiple features related to pools: 
* **"`poolcnt`"** - Number of pools on a lot. "NaN" means "0 pools", so we can update that to reflect "0" instead of "NaN".

* **"`hashottuborspa`"** - Does the home have a hot tub or a spa? "NaN" means "0 hot tubs or spas", so we can update that to reflect "0" instead of "NaN".

* **"`poolsizesum`"** - Total square footage of pools on property. Similarly, "NaN" means "0 square feet of pools", so we can also adjust that to read "0". For homes that do have pools but are missing this information, we will fill the "NaN" with the median value of other homes with pools.

* **"`pooltypeid2`" & "`pooltypeid7`" & "`pooltypeid10`"** - Type of pool or hot tub present on property. These categories will only contain non-null information if "`poolcnt`" or "`hashottuborspa`" contain non-null information. For the pool-related categories, we can fill the "NaN" value with a "0". And because "`pooltypeid10`" tells us the exact same information as "`hashottuborspa`", we can probably drop that category from our model.
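Before dropping a feature as redundant, it may be worth measuring how often the two columns are populated on the same rows. This is a minimal sketch on invented data. `same_presence` is a hypothetical helper, and the toy rows deliberately include one disagreement to show what the check reports.

```python
import numpy as np
import pandas as pd

def same_presence(frame, col_a, col_b):
    """Fraction of rows where both columns are populated or both are missing."""
    return float((frame[col_a].notnull() == frame[col_b].notnull()).mean())

# Invented rows: the third home has the spa flag but no pooltypeid10 value.
toy = pd.DataFrame({
    "hashottuborspa": [True, np.nan, True, np.nan],
    "pooltypeid10": [1.0, np.nan, np.nan, np.nan],
})

overlap = same_presence(toy, "hashottuborspa", "pooltypeid10")
print(overlap)
```

A result of 1.0 on the real data would support treating the two features as interchangeable. Anything lower suggests looking at the disagreeing rows before dropping one.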

In [None]:
# "0 pools"
property_data.poolcnt.fillna(0,inplace = True)

# "0 hot tubs or spas"
property_data.hashottuborspa.fillna(0,inplace = True)
# Convert "True" to 1
property_data.hashottuborspa.replace(to_replace = True, value = 1,inplace = True)

# Set properties that have a pool but no info on poolsize equal to the median poolsize value.
property_data.loc[property_data.poolcnt==1, 'poolsizesum'] = property_data.loc[property_data.poolcnt==1, 'poolsizesum'].fillna(property_data[property_data.poolcnt==1].poolsizesum.median())

# "0 pools" = "0 sq ft of pools"
property_data.loc[property_data.poolcnt==0, 'poolsizesum']=0

# "0 pools with a spa/hot tub"
property_data.pooltypeid2.fillna(0,inplace = True)

# "0 pools without a hot tub"
property_data.pooltypeid7.fillna(0,inplace = True)

# Drop redundant feature
property_data.drop('pooltypeid10', axis=1, inplace=True)

In [None]:
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = False)

Just like that, now all of the "pool" related categories have information.

### Fireplace Data

There are two features related to fireplaces:
* **"`fireplaceflag`"** - Does the home have a fireplace? The answers are either "True" or "NaN". We will change the "True" values to "1" and the "NaN" values to "0".
* **"`fireplacecnt`"** - How many fireplaces in the home? We can replace "NaN" values with "0".

Looking deeper, it seems odd that over 10% of the homes have 1 or more fireplaces according to the "`fireplacecnt`" feature, but less than 1% of homes actually have "`fireplaceflag`" set to "True". There are obviously some errors with this data collection. To fix this, we will do the following:
* If "`fireplaceflag`" is "True" and "`fireplacecnt`" is "NaN", we will set "`fireplacecnt`" equal to the median value of "1".
* If "`fireplacecnt`" is 1 or larger and "`fireplaceflag`" is "NaN", we will set "`fireplaceflag`" to "True".
* We will change "True" in "`fireplaceflag`" to "1", so we can more easily analyze the information.

In [None]:
# If "fireplaceflag" is "True" and "fireplacecnt" is "NaN", we will set "fireplacecnt" equal to the median value of "1".
property_data.loc[property_data.fireplaceflag==True, 'fireplacecnt'] = property_data.loc[property_data.fireplaceflag==True, 'fireplacecnt'].fillna(1)
property_data.fireplacecnt.fillna(0,inplace = True)

# If "fireplacecnt" is 1 or larger and "fireplaceflag" is "NaN", we will set "fireplaceflag" to "True".
# (Testing notnull() here would flag every home, because the "NaN" counts were just filled with "0".)
property_data.loc[property_data.fireplacecnt >= 1, 'fireplaceflag'] = True
property_data.fireplaceflag.fillna(0,inplace = True)

# Change "True" in "fireplaceflag" to "1", so we can more easily analyze the information.
property_data.fireplaceflag.replace(to_replace = True, value = 1,inplace = True)


In [None]:
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = False)

Now "`fireplacecnt`" and "`fireplaceflag`" all contain values.

### Garage Data

There are two features related to garages:
* **"`garagecarcnt`"** - How many garages does the house have? Easy fix here - we can replace "NaN" with "0" if a house doesn't have a garage.
* **"`garagetotalsqft`"** - What is the square footage of the garage? Again, if a home doesn't have a garage, we can replace "NaN" with "0".

Unlike the **Fireplace** category where we have several Type II errors (false negative), we do not have any scenarios where a home has a "`garagecarcnt`" of "NaN", but a "`garagetotalsqft`" of some value.
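A claim like this can be checked by counting the rows where exactly one of the two garage features is missing. The sketch below shows the idea on a toy frame. `count_mismatches` is a hypothetical helper and the values are invented.

```python
import numpy as np
import pandas as pd

def count_mismatches(frame, count_col, area_col):
    """Count rows where exactly one of the two related columns is missing."""
    missing_count = frame[count_col].isnull() & frame[area_col].notnull()
    missing_area = frame[count_col].notnull() & frame[area_col].isnull()
    return int(missing_count.sum()), int(missing_area.sum())

# Invented rows in which the two garage features are always missing together.
toy = pd.DataFrame({
    "garagecarcnt": [2.0, np.nan, 1.0, np.nan],
    "garagetotalsqft": [400.0, np.nan, 220.0, np.nan],
})

mismatches = count_mismatches(toy, "garagecarcnt", "garagetotalsqft")
print(mismatches)
```

Two zeros mean the features are consistent, so filling both with "0" does not contradict any existing value.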

In [None]:
property_data.garagecarcnt.fillna(0,inplace = True)
property_data.garagetotalsqft.fillna(0,inplace = True)

In [None]:
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = False)

### Tax Delinquency Data

There are two features related to tax delinquency:
* **"`taxdelinquencyflag`"** - Property taxes for this parcel are past due as of 2015.
* **"`taxdelinquencyyear`"** - Year for which the unpaid property taxes were due.

In [None]:
# Replace "NaN" with "0"
property_data.taxdelinquencyflag.fillna(0,inplace = True)

# Change "Y" to "1"
property_data.taxdelinquencyflag.replace(to_replace = 'Y', value = 1,inplace = True)

In [None]:
property_data.drop('parcelid',axis=1).notnull().mean().sort_values(ascending = False)

### The Rest

* **"`storytypeid`"** - Numerical ID that describes all types of homes. Mostly missing, so we should drop this category. A far-fetched idea would be to pull the street view of each home and use image recognition to classify its story type.

In [None]:
# Drop "storytypeid"
property_data.drop('storytypeid', axis=1, inplace=True)

* **"`basementsqft`"** - Square footage of basement. Mostly missing, suggesting no basement, so we will replace "NaN" with "0".

In [None]:
# Replace "NaN" with 0, signifying no basement.
property_data.basementsqft.fillna(0,inplace = True)

* **"`yardbuildingsqft26`"** - Storage shed square footage. We can set "NaN" values to "0". It might be useful to change this to a binary feature of just "1"s and "0"s (has a storage shed vs. doesn't have one), but some of the sheds are enormous and others are tiny, so we will keep the actual square footage.
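The two options are not mutually exclusive: the square footage can be kept and a yes/no indicator derived from it. A minimal sketch on invented values follows. The column name `has_shed` is a hypothetical addition, not something in the Zillow data.

```python
import numpy as np
import pandas as pd

# Invented shed sizes; NaN stands for "no storage shed".
toy = pd.DataFrame({"yardbuildingsqft26": [np.nan, 120.0, np.nan, 1366.0]})

# Keep the actual square footage, treating a missing value as 0 sq ft.
toy["yardbuildingsqft26"] = toy["yardbuildingsqft26"].fillna(0)

# Hypothetical extra feature: 1 if the home has a shed of any size, else 0.
toy["has_shed"] = (toy["yardbuildingsqft26"] > 0).astype(int)
print(toy)
```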

In [None]:
print(property_data['yardbuildingsqft26'].value_counts())

In [None]:
# Replace 'yardbuildingsqft26' "NaN"s with "0".
property_data.yardbuildingsqft26.fillna(0,inplace = True)

* **"`architecturalstyletypeid`"** - What is the architectural style of the house? Examples: ranch, bungalow, Cape Cod, etc. Because this is only present in a small fraction of the homes, I'm going to drop this category. (Idea: One can also assume that most homes in the same neighborhood have the same style. Could also try image recognition.)

In [None]:
# Drop "architecturalstyletypeid"
property_data.drop('architecturalstyletypeid', axis=1, inplace=True)

* **"`typeconstructiontypeid`"** - What material is the house made out of? Missing for most homes, so we will drop this category. Recovering it from photos would be a very difficult image recognition problem.

* **"`finishedsquarefeet13`"** - Perimeter of living area. This seems more like describing the shape of the house and is closely related to the square footage. I recommend dropping the category.

In [None]:
# Drop "typeconstructiontypeid" and "finishedsquarefeet13"
property_data.drop('typeconstructiontypeid', axis=1, inplace=True)
property_data.drop('finishedsquarefeet13', axis=1, inplace=True)

* **"`buildingclasstypeid`"** - Describes the internal structure of the home. Not a lot of information gained and present in less than 1% of properties. I will drop.

In [None]:
# Drop "buildingclasstypeid"
property_data.drop('buildingclasstypeid', axis=1, inplace=True)

Now let's do a quick checkup on our `property_data` to see the current shape and null-value percentages.

In [None]:
print(property_data.shape)
property_data.notnull().mean().sort_values(ascending = False)

We are now down from 58 features to 52 features, and fewer of our features have a large percentage of null-values. Still a bit more to do!

* **"`decktypeid`"** - Type of deck (if any) on property. Looks like a value is either "66.0" or "NaN". I will keep this feature and change the "66.0" to "1" for "Yes" and "NaN" to "0" for "No".

In [None]:
# Let's check the unique values for "decktypeid"
print(property_data['decktypeid'].value_counts())

In [None]:
# Change "decktypeid" "NaN"s to "0"
property_data.decktypeid.fillna(0,inplace = True)
# Convert "decktypeid" "66.0" to "1"
property_data.decktypeid.replace(to_replace = 66.0, value = 1,inplace = True)

* **"`finishedsquarefeet6`"** - Base unfinished and finished area. Not sure what this means. Seems like it gives valuable information, but replacing "NaN"s with "0"s would be incorrect. Perhaps it is a subset of other categories. Probably drop, but TBD.

* **"`finishedsquarefeet15`"** - Total area. Should be equal to sum of all other finishedsquarefeet categories.

* **"`finishedfloor1squarefeet`"** - Sq footage of first floor. Could cross check this with number of stories.

* **"`finishedsquarefeet50`"** - Identical to above category? Drop one of them. Duplicate.

* **"`finishedsquarefeet12`"** - Finished living area.

* **"`calculatedfinishedsquarefeet`"** - Total finished living area of home.

In [None]:
print(property_data['finishedsquarefeet6'].value_counts())

In [None]:
# Compare the square-footage columns on rows where these area features are all present.
has_areas = (property_data['finishedsquarefeet15'].notnull()
             & property_data['finishedsquarefeet50'].notnull()
             & property_data['lotsizesquarefeet'].notnull())
squarefeet = property_data[has_areas]
squarefeet[['calculatedfinishedsquarefeet','finishedsquarefeet15','finishedsquarefeet50','numberofstories','lotsizesquarefeet','landtaxvaluedollarcnt','structuretaxvaluedollarcnt','taxvaluedollarcnt','taxamount']]


**"`finishedsquarefeet6`"** is rarely present, and even when it is present, it is equal to **"`calculatedfinishedsquarefeet`"**. Because of this, we will drop it. Same scenario with **"`finishedsquarefeet12`"**, so we will drop that as well. **"`finishedsquarefeet50`"** is identical to **"`finishedfloor1squarefeet`"**, so we will also drop **"`finishedfloor1squarefeet`"**.
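One way to back up an "identical" claim like these is to measure how often two columns agree on the rows where both are present. The sketch below uses invented values, and `agreement_rate` is a hypothetical helper rather than anything built into pandas.

```python
import numpy as np
import pandas as pd

def agreement_rate(frame, col_a, col_b):
    """Share of rows, among those with both values present, where the columns match."""
    both = frame[[col_a, col_b]].dropna()
    if both.empty:
        return float("nan")
    return float(np.isclose(both[col_a], both[col_b]).mean())

# Invented rows: the second has only one value, the last one disagrees.
toy = pd.DataFrame({
    "finishedsquarefeet50": [1200.0, np.nan, 950.0, 1800.0],
    "finishedfloor1squarefeet": [1200.0, 700.0, 950.0, 1750.0],
})

rate = agreement_rate(toy, "finishedsquarefeet50", "finishedfloor1squarefeet")
print(rate)
```

A rate near 1.0 on the real data would support dropping one of the two columns. The same helper can be pointed at "`finishedsquarefeet6`" or "`finishedsquarefeet12`" against "`calculatedfinishedsquarefeet`".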

In [None]:
# Drop "finishedsquarefeet6"
property_data.drop('finishedsquarefeet6', axis=1, inplace=True)

# Drop "finishedfloor1squarefeet"
property_data.drop('finishedfloor1squarefeet', axis=1, inplace=True)

* ~~**"`finishedsquarefeet6`"** - Base unfinished and finished area.~~ DROPPED

* **"`finishedsquarefeet15`"** - Total area. Should be equal to sum of all other finishedsquarefeet categories.

* ~~**"`finishedfloor1squarefeet`"** - Sq footage of first floor.~~ DROPPED

* **"`finishedsquarefeet50`"** - Sq footage of first floor.

* ~~**"`finishedsquarefeet12`"** - Finished living area.~~ DROPPED

* **"`calculatedfinishedsquarefeet`"** - Total finished living area of home.

In [None]:
# Drop "finishedsquarefeet12"
property_data.drop('finishedsquarefeet12', axis=1, inplace=True)

Looks like our **"`fireplaceflag`"** didn't update properly. Let's try that again.

In [None]:
# Change "True" in "fireplaceflag" to "1", so we can more easily analyze the information.
property_data.fireplaceflag.replace(to_replace = True, value = 1,inplace = True)

In [None]:
property_data['fireplaceflag'] = property_data['fireplaceflag'] * 1
property_data

In [None]:
print(property_data['fireplaceflag'].value_counts())

* **"`yardbuildingsqft17`"** - Patio in yard. Do same as storage shed category.

* **"`threequarterbathnbr`"** - Number of 3/4 baths = shower, sink, toilet.

* **"`numberofstories`"** - Self explanatory. If "NaN", replace with "1".

* **"`airconditioningtypeid`"** - If "NaN", change to "5" for "None".

* **"`regionidneighborhood`"** - Neighborhood. Could fill in blanks. Would need a key that maps lat & longitude regions with specific neighborhoods. 

* **"`heatingorsystemtypeid`"** - Change "NaN" to "13" for "None"

* **"`buildingqualitytypeid`"** - Change "NaN" to median

* **"`unitcnt`"** - Change "NaN" to "1"

* **"`propertyzoningdesc`"** - Meh. Leave as is.      

* **"`lotsizesquarefeet`"** - Area of lot in square feet.

* **"`fullbathcnt`"** - Number of full bathrooms - tub, sink, toilet

* **"`calculatedbathnbr`"** - Total number of bathrooms including partials.

* **"`censustractandblock`"** - Census tract and block ID combined

* **"`landtaxvaluedollarcnt`"** - Assessed value of land area of parcel.

* **"`regionidcity`"** - City property is located in. Might be redundant?

* **"`yearbuilt`"** - Year home was built.

* **"`structuretaxvaluedollarcnt`"** - Assessed value of built structure on land.

* **"`taxvaluedollarcnt`"** - Total tax assessed value of property. "structuretax..." + "landtax...".

* **"`taxamount`"** - Total property tax assessed for assessment year.
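Several of the notes above name a specific fill value, so they can be collected in one place. This is only a sketch of that plan on a toy frame: `CONSTANT_FILLS` and `apply_planned_fills` are hypothetical names, and the codes "5" and "13" are taken from the notes above rather than checked against the data dictionary.

```python
import numpy as np
import pandas as pd

# Fill values taken from the notes above.
CONSTANT_FILLS = {
    "numberofstories": 1,         # assume a single story
    "unitcnt": 1,                 # assume a single unit
    "airconditioningtypeid": 5,   # "None" per the note above
    "heatingorsystemtypeid": 13,  # "None" per the note above
}

def apply_planned_fills(frame):
    """Return a copy with the planned constant and median fills applied."""
    filled = frame.copy()
    for column, value in CONSTANT_FILLS.items():
        if column in filled.columns:
            filled[column] = filled[column].fillna(value)
    if "buildingqualitytypeid" in filled.columns:
        median = filled["buildingqualitytypeid"].median()
        filled["buildingqualitytypeid"] = filled["buildingqualitytypeid"].fillna(median)
    return filled

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "numberofstories": [2.0, np.nan, 1.0, np.nan],
    "unitcnt": [np.nan, 2.0, 1.0, np.nan],
    "airconditioningtypeid": [1.0, np.nan, np.nan, 13.0],
    "heatingorsystemtypeid": [np.nan, 2.0, 7.0, np.nan],
    "buildingqualitytypeid": [4.0, np.nan, 4.0, 7.0],
})

filled = apply_planned_fills(toy)
print(filled)
```

Working on a copy keeps the original frame untouched, which makes it easy to compare the data before and after the fills.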

In [None]:
print(property_data.columns)
property_data.head(10).values

In [None]:
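# "finishedsquarefeet13" is dropped earlier in this notebook, so only inspect it if it is still loaded.
if 'finishedsquarefeet13' in property_data.columns:
    notnull13 = property_data[property_data['finishedsquarefeet13'].notnull()]
    print(notnull13['finishedsquarefeet13'][:20])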

In [None]:
decktype = property_data[property_data['decktypeid'].notnull()]
decktype['decktypeid'][:10]

In [None]:
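# "finishedsquarefeet6" is dropped earlier in this notebook, so only inspect it if it is still loaded.
if 'finishedsquarefeet6' in property_data.columns:
    notnull6 = property_data[property_data['finishedsquarefeet6'].notnull()]
    print(notnull6['finishedsquarefeet6'][:20])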

In [None]:
propertyzoningdesc = property_data[property_data['propertyzoningdesc'].notnull()]
propertyzoningdesc['propertyzoningdesc'][:50]

Tomorrow - finish cleaning data.<br>
Graph simple statistics.<br>
Create a few new categories.<br>
Apply Month, Day, Year.<br>
Create Tax Features.<br>
Merge with transaction data.<br>
Generate geographical maps.<br>
Year built vs. log error. <br>
Correlation Coefficients.<br>
Build Train & Testing Data. <br>
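Two of these to-do items, applying the month, day, and year and merging with the transaction data, could look roughly like the sketch below. The function name and the new column names are assumptions, and the "`yearbuilt`" values are invented for the example.

```python
import pandas as pd

def merge_with_transactions(transactions, properties):
    """Left-join property features onto each sale and split out the date parts."""
    merged = transactions.merge(properties, how="left", on="parcelid")
    merged["transaction_year"] = merged["transactiondate"].dt.year
    merged["transaction_month"] = merged["transactiondate"].dt.month
    merged["transaction_day"] = merged["transactiondate"].dt.day
    return merged

# The first two training rows shown earlier.
transactions = pd.DataFrame({
    "parcelid": [11016594, 14366692],
    "logerror": [0.0276, -0.1684],
    "transactiondate": pd.to_datetime(["2016-01-01", "2016-01-01"]),
})

# Toy property table; the yearbuilt values are invented.
properties = pd.DataFrame({
    "parcelid": [11016594, 14366692, 10754147],
    "yearbuilt": [1959.0, 2014.0, None],
})

merged = merge_with_transactions(transactions, properties)
print(merged)
```

A left join keeps one row per transaction, so properties that never sold simply do not appear in the merged training frame.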



In [None]:
censustractandblock = property_data[property_data['censustractandblock'].notnull()]
censustractandblock['censustractandblock'][-10:]

In [None]:
print(property_data['propertyzoningdesc'].value_counts())