In [3]:
import urllib #urllib & json are used for accessing and formatting our API Calls
import json
import random
import numpy as np
import pandas as pd
import gmaps #Google Maps
from ipywidgets import * #You'll need this for displaying Google Maps inside the notebook

# Working with Real Estate Data:
## Plotting Addresses on a Map & Building your own Zillow-Style Real Estate Estimator

The intent of this tutorial is to expedite future work for anyone wishing to work with property data. We'll accomplish this using property data from the Western Pennsylvania Regional Data Center (WPRDC), covering how to convert features to useful formats, visualization, and prediction.

In this tutorial, we'll be leveraging data from two different sources -- Western Pennsylvania Regional Data Center and Google Maps' API. 


## Table of Contents
<ol>
<li>System Setup</li>
<li>Pulling Data from WPRDC's "Allegheny County Property Assessment" Database</li>
<li>Plotting Addresses in Google Maps</li>
<li>Your Own Zillow: Perceptron for Property Price Prediction</li>
<li>Recommendations for Follow-On Projects</li>
</ol>

## System Setup
### Google API Key

Our first step is to set up our system so we can access and display the required data.

The Google Maps API comes with some great documentation. Recent changes to Google's API policies require users to have an API key. You can get one by following their instructions here:

https://developers.google.com/maps/documentation/geocoding/get-api-key

Once you do that, copy and paste your API key here:

In [34]:
google_api_key = '' # Your key here!

### Installing the Javascript Widget
To visualize Google Maps in the notebook, we'll want the Javascript Widget from nbextension next. In a command line, run the following:
```
pip install ipywidgets
```
then
```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```
<strong>Note:</strong> If you have more than one kernel, make sure you switch to that environment before running the above commands.

If you have trouble with those commands you can try this Conda alternative:
```
conda install -c conda-forge ipywidgets
```
### Installing the gmaps package in your environment
With that in place, check out Google's <a href="https://developers.google.com/maps/documentation/geocoding/intro">documentation on geocoding</a>.

We'll be accessing it through Python's gmaps 0.3.5. You can find more documentation at these two sites:
<ul>
<li>https://pypi.python.org/pypi/gmaps/0.3.5</li>
<li>https://github.com/pbugnion/gmaps</li>
</ul>

You can easily install the required packages using the following command in a terminal:
```
pip install gmaps
```


## Pulling Data from WPRDC

Using the <a href="https://data.wprdc.org/">WPRDC's data</a> comes with several stipulations. Most are covered in the Terms of Use, which you must agree to before you are allowed to access the numerous data sets available. However, be aware that there may be additional restrictions on individual data sets.

WPRDC provides 146 datasets that cover everything from crash and police incidents to air quality, asbestos permits, and even dog permits in Allegheny County. For this tutorial, we'll be focusing on the <a href="https://data.wprdc.org/dataset/property-assessments">Allegheny County Property Assessments</a>.

Specifically, we'll be using this dataset: https://data.wprdc.org/dataset/property-assessments/resource/518b583f-7cc8-4f60-94d0-174cc98310dc

<strong>Note:</strong> Click on the green API icon in the top right corner of the page to get a feel for the different API calls you can make. Everything from SQL commands to built-in queries can be passed.
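
As a rough sketch of what those API calls look like, the helpers below build query URLs for the CKAN `datastore_search` and `datastore_search_sql` actions. The helper names and the `offset` paging parameter are illustrative assumptions, not part of this tutorial's code, and whether a numeric comparison such as the one on `SALEPRICE` works depends on how that column is typed in the datastore. The import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE = 'https://data.wprdc.org/api/action/'
RESOURCE_ID = '518b583f-7cc8-4f60-94d0-174cc98310dc'

def search_url(limit=100, offset=0):
    # Plain paged search: `limit` rows starting at row `offset`.
    params = [('resource_id', RESOURCE_ID), ('limit', limit), ('offset', offset)]
    return BASE + 'datastore_search?' + urlencode(params)

def sql_url(where_clause, limit=100):
    # SQL search: the resource id doubles as the table name.
    sql = 'SELECT * FROM "{0}" WHERE {1} LIMIT {2}'.format(RESOURCE_ID, where_clause, limit)
    return BASE + 'datastore_search_sql?' + urlencode([('sql', sql)])

print(search_url(limit=5))
print(sql_url('"SALEPRICE" > 100000', limit=5))
```

Building the query string with `urlencode` keeps spaces and quotes in the SQL text from breaking the URL.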

For a detailed breakdown on specific features, this is the reference you want: https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/43a275de-4745-446b-8ba0-b9389568e568/download/property-assessment-data-dictionaryrev.pdf

We'll use the function below to query the data in the format we want. Note that there are <strong>500,000+</strong> records available, but we'll only use 50,000 for our tutorial. 

In [5]:
def get_propasses(limit=None):
    # This function downloads the desired data set from WPRDC via a query. The query result is JSON, which
    # we convert to a Pandas DataFrame.
    # Inputs: limit, the maximum number of records to request (optional)
    # Output: Pandas DataFrame containing property assessment data
    
    url = 'https://data.wprdc.org/api/action/datastore_search?resource_id=518b583f-7cc8-4f60-94d0-174cc98310dc&q'
    
    # If the user specifies a limit, append it to the URL. There are 500k records, so we can get quite a lot!
    if limit is not None:
        url += '&limit='+str(limit)
    
    # Query the WPRDC database
    fileobj = urllib.urlopen(url)
    
    # Convert the data to JSON format
    data = json.loads(fileobj.read())
    records = data['result']['records']
    df = pd.read_json(json.dumps(records))
    
    # For the purpose of this tutorial, we only want residential properties with valid sales. That filtering
    # is done in the preprocessing step that follows, so the raw records are returned unchanged here.
       
    return df
results = get_propasses(50000)

Right now, we have 50,000 records at our disposal, but not all of them are usable. WPRDC is still in the process of cleaning up the dataset, so in the next step we'll filter out the records we can't use.

We're being overly strict for this implementation, but you're free to relax some of the constraints.

In [6]:
def preprocess(df):

    # We'll set PARID (Parcel Identifier) to the index
    df = df.set_index(['PARID'])   
    
    # Remove so called "skeleton" listings. These have a USECODE of 001 & 002.
    df = df[df['USECODE'] > 2]

    # We're going to eliminate known outliers or special cases so we can focus on predicting fair market value 
    # for residential properties
    df = df[df['SALEPRICE'] > 1000] # Play around with this number. You'll notice the perceptron struggles with sales under $100k
    df = df[df['CARDNUMBER'] == 1] # Some parcels have more than one building; we'll restrict this to single-building properties
    df = df[df['SALEDESC'] == 'VALID SALE'] # We only want records that the county considers to be at fair market value
    df = df[df['TAXCODE'] == 'T']
    df = df[df['OWNERCODE'] == 10] # Regular, residential owners only
    
    # We'll also remove any special use properties
    df = df[pd.isnull(df['TAXSUBCODE'])]
    df = df[pd.isnull(df['ABATEMENTFLAG'])]
    df = df[pd.isnull(df['FARMSTEADFLAG'])]
    df = df[pd.isnull(df['HOMESTEADFLAG'])]
    df = df[pd.isnull(df['CLEANGREEN'])]

    # Remove the now homogeneous columns
    extracol = ['OWNERCODE', 'OWNERDESC','CARDNUMBER', 'SALEDESC','TAXCODE','TAXSUBCODE','USECODE', 'ABATEMENTFLAG', 'FARMSTEADFLAG', 'HOMESTEADFLAG', 'CLEANGREEN']
    df = df.drop(extracol, axis=1)

    # Correct the format of specific columns:
    df['SALEDATE'] = pd.to_datetime(df['SALEDATE'], errors='coerce') # convert to datetime format; unparseable dates become NaT
    df['LOTAREA'] = df['LOTAREA'].astype(float)
    df['PROPERTYZIP'] = df['PROPERTYZIP'].astype(str)
    df['PROPERTYHOUSENUM'] = df['PROPERTYHOUSENUM'].astype(int)
    
    #Fill in missing values
    df['NEIGHDESC'] = df['NEIGHDESC'].fillna("")
    df['PROPERTYFRACTION'] = df['PROPERTYFRACTION'].fillna("")
    df['ROOFDESC'] = df['ROOFDESC'].fillna("")
    
    # Remove "skeleton" listings that aren't properly labelled
    df = df.dropna(subset=['BASEMENT','BEDROOMS','BSMTGARAGE','CONDITION','STORIES','TOTALROOMS','YEARBLT'],axis=0)

    # Drop the columns that are redundant, we have no interest in, that would lead our algorithm astray, 
    # or are just filler
    extracol = ['STYLE','NEIGHCODE','SALECODE','_id','CLASS','TAXDESC','MUNICODE','SCHOOLCODE','LEGAL1','LEGAL2','LEGAL3','ALT_ID','ASOFDATE','RECORDDATE','DEEDBOOK','DEEDPAGE','TAXSUBCODE_DESC','LOCALTOTAL','FAIRMARKETBUILDING','FAIRMARKETLAND','FAIRMARKETTOTAL','EXTERIORFINISH','ROOF','BASEMENTDESC','GRADEDESC','CONDITIONDESC','CDUDESC','PREVSALEDATE','PREVSALEPRICE','PREVSALEDATE2','PREVSALEPRICE2','CHANGENOTICEADDRESS1','CHANGENOTICEADDRESS2','CHANGENOTICEADDRESS3','CHANGENOTICEADDRESS4','COUNTYBUILDING','COUNTYLAND','COUNTYTOTAL','COUNTYEXEMPTBLDG','LOCALBUILDING','LOCALLAND','CLASSDESC','HEATINGCOOLING','TAXYEAR']
    df = df.drop(extracol, axis=1)
    
    # Finally, cycle through the columns and drop any rows that have "NaN" in a numeric feature
    for col in df.columns:
        if df[col].dtype == float or df[col].dtype == int:
            df = df[pd.notnull(df[col])]
    
    return df

prop_assess = preprocess(results)
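
If you want to see how much each constraint costs before relaxing it, one option is to apply the filters one at a time and report the surviving row count. The sketch below is a hypothetical helper (`filter_report` is not part of this tutorial) run on a made-up four-row table; the column names match the dataset, but the values are invented for illustration.

```python
import pandas as pd

def filter_report(df, filters):
    # Apply each (name, mask_function) pair in order, printing rows kept after each step.
    for name, mask_fn in filters:
        df = df[mask_fn(df)]
        print('{0:<12} -> {1} rows left'.format(name, len(df)))
    return df

# Made-up toy data, only for demonstrating the helper.
toy = pd.DataFrame({
    'SALEPRICE': [500, 85000, 120000, 240000],
    'SALEDESC': ['VALID SALE', 'VALID SALE', 'OTHER', 'VALID SALE'],
})
kept = filter_report(toy, [
    ('price', lambda d: d['SALEPRICE'] > 1000),
    ('valid sale', lambda d: d['SALEDESC'] == 'VALID SALE'),
])
```

The same pattern works on `results` with the masks used in `preprocess`, which makes it easy to spot the filter that throws away the most records.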

At this point, our data should be clean and ready to be processed. Let's take a look...

In [7]:
print prop_assess.head()
print prop_assess.dtypes

                  BASEMENT  BEDROOMS  BSMTGARAGE CDU  CONDITION  \
PARID                                                             
0104S00104000000       5.0       2.0         1.0  AV        3.0   
0190N00200000000       5.0       2.0         0.0  AV        3.0   
0190J00056000000       5.0       3.0         1.0  AV        3.0   
0190G00174000000       5.0       3.0         1.0  AV        3.0   
0190G00266000000       5.0       3.0         0.0  FR        3.0   

                 EXTFINISH_DESC  FINISHEDLIVINGAREA  FIREPLACES  FULLBATHS  \
PARID                                                                        
0104S00104000000          Frame              1170.0         1.0        1.0   
0190N00200000000  Masonry FRAME              1012.0         1.0        1.0   
0190J00056000000          Brick              1274.0         1.0        1.0   
0190G00174000000  Masonry FRAME              1470.0         0.0        1.0   
0190G00266000000          Brick              1519.0         0.

Columns that are listed as "object" will be treated as nominal features. Everything else can safely be considered a continuous feature.
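
A minimal sketch of that rule, using the same `dtype == object` check that appears later in this tutorial. The toy table and the variable names `nominal_cols` and `continuous_cols` are assumptions made for illustration.

```python
import pandas as pd

# Toy table with one text column (stored as object) and two numeric columns.
toy = pd.DataFrame({
    'BEDROOMS': [2.0, 3.0],
    'CDU': pd.Series(['AV', 'FR'], dtype=object),
    'LOTAREA': [5000.0, 7200.0],
})

# Object columns are treated as nominal; everything else as continuous.
nominal_cols = [c for c in toy.columns if toy[c].dtype == object]
continuous_cols = [c for c in toy.columns if toy[c].dtype != object]

print(nominal_cols)
print(continuous_cols)
```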

### Separating the data for further processing

Now that our dataset has been cleaned, we'll divide it into convenient segments for training. One caveat: property addresses are not readily useful for training our model, but they are useful for bringing context to the data. For that reason, we'll save them in a separate dataset, the "Key," for future reference.

We'll do that here and build the other data sets later <strong>(after we have some fun with Google Maps).</strong>

In [8]:
key = prop_assess.loc[:,['PROPERTYCITY', 'PROPERTYZIP', 'PROPERTYSTATE','PROPERTYHOUSENUM', 'PROPERTYFRACTION','PROPERTYADDRESS','PROPERTYUNIT']].dropna(axis=1, how='all').fillna("")
print key.head()
print key.shape

                 PROPERTYCITY PROPERTYZIP PROPERTYSTATE  PROPERTYHOUSENUM  \
PARID                                                                       
0104S00104000000     CARNEGIE       15106            PA               306   
0190N00200000000   PITTSBURGH       15234            PA              3551   
0190J00056000000   PITTSBURGH       15234            PA              3417   
0190G00174000000   PITTSBURGH       15234            PA              3125   
0190G00266000000   PITTSBURGH       15234            PA              3111   

                 PROPERTYFRACTION PROPERTYADDRESS PROPERTYUNIT  
PARID                                                           
0104S00104000000                         BELL AVE               
0190N00200000000                       LIBRARY RD               
0190J00056000000                       POPLAR AVE               
0190G00174000000                    BELLEVILLE ST               
0190G00266000000                           MAY ST               
(1475

## Pulling Latitude and Longitude from Google's Geocoding API

To make sure you get your money's worth from your Google API key, we'll leverage another Google toolkit, Geocoding, to convert our street addresses to latitude and longitude.

Geocoding works by converting the street addresses to grid coordinates (and yes, you can go the other way too). 

We'll write a function to handle the API call for us, <i>getlatlon</i>, shown below. The function sends our API request and parses the latitude and longitude out of the response.

<strong>Note:</strong> You only have 25,000 API calls in a 24 hour period. Use them wisely.

In [21]:
def getlatlon(address, api_key):
    
    url = "https://maps.googleapis.com/maps/api/geocode/json?address="+'"'+address+'"'+"&key="+api_key

    # Query the Google Geocoding API
    fileobj = urllib.urlopen(url)

    # Convert the data to JSON format
    data = json.loads(fileobj.read())

    if data['status'] == 'OK':
        lat = data['results'][0]['geometry']['location']['lat']
        lon = data['results'][0]['geometry']['location']['lng']
        return lat, lon
    else:
        print "Address not found: ", data['status']
        return None, None
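
To experiment with the parsing logic without spending API calls, the sketch below applies the same field lookups to a JSON string you already have. `parse_latlon` is a hypothetical helper, and the sample payload is hand-written with made-up coordinates and trimmed to just the fields read above.

```python
import json

def parse_latlon(payload):
    # Mirrors the parsing in getlatlon, but works on an already-downloaded JSON string.
    data = json.loads(payload)
    if data.get('status') != 'OK' or not data.get('results'):
        return None, None
    location = data['results'][0]['geometry']['location']
    return location['lat'], location['lng']

# Hand-written sample with made-up coordinates, trimmed to the fields we read.
sample = '{"status": "OK", "results": [{"geometry": {"location": {"lat": 40.4, "lng": -80.0}}}]}'
print(parse_latlon(sample))
print(parse_latlon('{"status": "ZERO_RESULTS", "results": []}'))
```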

Next, we'll add two placeholder columns to our "key" dataframe for latitude and longitude. Then we'll cycle through our addresses, combining each into something readable for the API and calling our function. Results are added to their respective columns in "key" as we receive them. If an address isn't found, its coordinates are left empty, and we'll drop that row from our key in the next step.

In [22]:
key['lat'] = np.zeros(key.shape[0])
key['lon'] = np.zeros(key.shape[0])

for idx, row in key.iterrows():
    address = str(row['PROPERTYHOUSENUM'])+' '+str(row['PROPERTYADDRESS'])+', '+str(row['PROPERTYCITY'])+', '+str(row['PROPERTYSTATE'])+' '+str(row['PROPERTYZIP'])    
    key.loc[idx,'lat'], key.loc[idx,'lon'] = getlatlon(address, google_api_key)
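
Because the daily quota is limited, it can help to remember addresses that have already been looked up. The wrapper below is one possible sketch: `make_cached_geocoder` and `fake_geocode` are hypothetical names, and the fake function stands in for the real API call so the example runs offline.

```python
def make_cached_geocoder(geocode):
    # Wrap any address -> (lat, lon) function so repeated addresses cost one call.
    cache = {}
    def cached(address):
        if address not in cache:
            cache[address] = geocode(address)
        return cache[address]
    return cached

calls = []
def fake_geocode(address):
    # Stand-in for the real API call; records each lookup and returns dummy coordinates.
    calls.append(address)
    return 0.0, 0.0

lookup = make_cached_geocoder(fake_geocode)
lookup('306 BELL AVE, CARNEGIE, PA 15106')
lookup('306 BELL AVE, CARNEGIE, PA 15106')
print(len(calls))  # the second lookup is served from the cache
```

With the function from this tutorial, the wrapper could be built as `make_cached_geocoder(lambda a: getlatlon(a, google_api_key))`. The cache lives only in memory, so it helps when re-running the loop in the same session rather than across sessions.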


## Plotting Coordinates in Google Maps

Now that we have our grid coordinates, let's put them to use. We'll create a list of (lat, lon) tuples from which Google will generate and display a heatmap.

You can see that we create our map object, add the heatmap layer, and display the result.

In [23]:
key = key.dropna(subset=['lat','lon']) # Drop rows that Google couldn't geocode

coords = [(k['lat'], k['lon']) for i, k in key.iterrows()] # The list of (lat, lon) tuples
gmaps.configure(api_key = google_api_key)
m = gmaps.Map()
m.add_layer(gmaps.Heatmap(data=coords))
m

In [24]:
print key.shape 

(1474, 9)


<strong>Note:</strong> Key changed in size here (one address was dropped). We'll have to remember that for later.

### Creating datasets for training our perceptron

We will do this in two steps. First, we will create our output vector from the "SALEPRICE" column in our property assessment data.

You can see that here:

In [25]:
prices = prop_assess.loc[:, 'SALEPRICE']
print prices.head()
print prices.shape

PARID
0104S00104000000    63000.0
0190N00200000000    29100.0
0190J00056000000    93000.0
0190G00174000000    74900.0
0190G00266000000    82000.0
Name: SALEPRICE, dtype: float64
(1475,)


Creating our "Prices" dataset was the easy part. Creating our feature set will take a little more work.

A quirk of perceptrons is that they are very susceptible to extreme feature values. For that reason, we'll scale every numeric feature to the range [0,1] (but you could just as easily use [-1,1]).
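
The scaling itself is min-max normalization, $x' = (x - x_{min}) / (x_{max} - x_{min})$. Here is a small sketch of it on a toy column; the helper name `minmax` and the guard for constant columns are additions for illustration and are not part of the tutorial's function.

```python
import pandas as pd

def minmax(col):
    # Scale a numeric column to [0, 1]; a constant column maps to all zeros.
    span = col.max() - col.min()
    if span == 0:
        return col * 0.0
    return (col - col.min()) / span

bedrooms = pd.Series([1.0, 2.0, 3.0, 5.0])
print(minmax(bedrooms).tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Without the guard, a column where every value is identical would divide by zero and fill the feature with NaN.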

For nominal features, we'll need to create new features to compensate. So-called "One Hot" encoding will allow us to leverage these feature labels. You can read a <a href="http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning#17470183">great writeup here if you're curious.</a>
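
For comparison, pandas can produce the same kind of encoding directly with `pd.get_dummies`. The toy roof labels below are made up for illustration, and `prefix_sep='='` is chosen only to give column names in a `COLUMN=label` style.

```python
import pandas as pd

# Made-up labels, only to show the shape of the result.
toy = pd.DataFrame({'ROOFDESC': ['SHINGLE', 'SLATE', 'SHINGLE']})

# One new 0/1 column per distinct label; the original column is dropped.
encoded = pd.get_dummies(toy, columns=['ROOFDESC'], prefix_sep='=', dtype=float)
print(encoded.columns.tolist())  # ['ROOFDESC=SHINGLE', 'ROOFDESC=SLATE']
```

Each row has a 1.0 in exactly one of the new columns, so no artificial ordering is imposed on the labels.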

We'll nest all of this into a function, "formatfeatures", which is three functions in one. Since we preprocessed our features, we can easily process them based on their datatype. You can see the function below:

In [26]:
def formatfeatures(df):
    
    def numericfeature(dfcol):
    
        mincol = min(dfcol)
        maxcol = max(dfcol)

        return (dfcol-mincol)/(maxcol-mincol)

    def nominalfeature(col, df):
        # One-hot encode: add one 0/1 column per distinct label, then drop the original column
        labels = sorted(df[col].dropna().unique())

        for label in labels:
            df[str(col+"="+label)] = 1.0 * (df[col] == label)

        df = df.drop([col], axis=1)

        return df

    for col in df.columns:
        
        if df[col].dtype == object:
            df = nominalfeature(col, df)
            
        else:
            df[col] = numericfeature(df[col])
    
    return df

Next we'll generate our feature set from the desired columns and pass them to our "formatfeatures" function. 

Don't forget to check the output (and make sure everything has been scaled correctly).

In [27]:
features = prop_assess.loc[:, ['BASEMENT','BEDROOMS','BSMTGARAGE','CDU','CONDITION','EXTFINISH_DESC','FINISHEDLIVINGAREA','FIREPLACES','FULLBATHS','GRADE','HALFBATHS','HEATINGCOOLINGDESC','LOTAREA','MUNIDESC','NEIGHDESC','ROOFDESC','SALEDATE','SCHOOLDESC','STORIES','STYLEDESC','TOTALROOMS','USEDESC','YEARBLT']]
features = formatfeatures(features)
print features.head()
print features.shape

                  BASEMENT  BEDROOMS  BSMTGARAGE  CONDITION  \
PARID                                                         
0104S00104000000       1.0     0.125        0.25   0.166667   
0190N00200000000       1.0     0.125        0.00   0.166667   
0190J00056000000       1.0     0.250        0.25   0.166667   
0190G00174000000       1.0     0.250        0.25   0.166667   
0190G00266000000       1.0     0.250        0.00   0.166667   

                  FINISHEDLIVINGAREA  FIREPLACES  FULLBATHS  HALFBATHS  \
PARID                                                                    
0104S00104000000            0.106430    0.333333       0.25   0.000000   
0190N00200000000            0.083394    0.333333       0.25   0.000000   
0190J00056000000            0.121592    0.333333       0.25   0.000000   
0190G00174000000            0.150168    0.000000       0.25   0.333333   
0190G00266000000            0.157312    0.000000       0.25   0.333333   

                   LOTAREA  SALEDATE   

### Creating test and training sets

Finally, we're going to divide our data into test and training sets.

We'll do this probabilistically using a threshold value and random values. For each index value (from "PARID" in this case), we'll generate a random value in [0,1). If the random value is below the threshold (default = 0.7), the index is added to the training set; otherwise it's added to the test set.

<strong>Note:</strong> We want to pass the "Key" index instead of "Prices" or "Features", since the geocoding step above may have eliminated several properties that Google couldn't find.

In [28]:
def generatesplit(index, thresh=0.7):
    
    train = []
    test = []
    
    random.seed() # With no argument, seeds from the current time or OS randomness, so each run gives a different split
    
    for idx in index:
        prob = random.random()
        
        if prob < thresh:
            train.append(idx)
        else:
            test.append(idx)

    return train, test

# Get the random indexes of our training and test sets
itrain, itest = generatesplit(key.index)

The outputs of this function are two lists of indexes that we'll use for the training and testing sets. We'll use these to split our "Prices" and "Features" sets.

In [29]:
# Split our features and prices
ptrain = prices[itrain]
ptest = prices[itest]

ftrain = features.loc[itrain]
ftest = features.loc[itest]

# Split our key (not really necessary)
ktrain = key.loc[itrain]
ktest = key.loc[itest]
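
Since the split above changes on every run, results are hard to compare between experiments. A possible variation, sketched below under the assumption that a fixed seed is acceptable, uses the same thresholding idea with a seeded NumPy generator. `seeded_split` and the toy `ids` list are illustrative names, not part of the tutorial.

```python
import numpy as np

def seeded_split(index, thresh=0.7, seed=0):
    # Same thresholding idea as generatesplit, but a fixed seed makes the split repeatable.
    rng = np.random.RandomState(seed)
    draws = rng.random_sample(len(index))
    train = [idx for idx, p in zip(index, draws) if p < thresh]
    test = [idx for idx, p in zip(index, draws) if p >= thresh]
    return train, test

ids = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
train_a, test_a = seeded_split(ids, seed=42)
train_b, test_b = seeded_split(ids, seed=42)
print(train_a == train_b)  # True: the same seed reproduces the same split
```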

## Building the Perceptron
Finally, we're ready to start training our perceptron. 
Here's the obligatory Wikipedia link:

https://en.wikipedia.org/wiki/Perceptron

If that's too vague, you can also check out these links (which cut out a lot of the fluff):
<ul>
<li>https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/</li>
<li>http://glowingpython.blogspot.com/2011/10/perceptron.html</li>
</ul>

For our purposes, we'll create a weight vector with the same number of columns as our "Features" Set. We'll then initialize these weights to random initial values. 

When we <i>train</i> a perceptron, we train the weights. We're computing two equations. 

The first is the prediction equation, shown here:

$y = \max\left(\sum_{i=0}^{n-1} w_i x_i,\ 0\right)$, where $n$ is the number of columns in our feature set.

We use our prediction to update weights using this equation for each sample in our training set:

$w_i := w_i + \mathit{learningrate} \cdot \mathit{error} \cdot x_i$ for $0 \le i < n$, where $\mathit{error} = d - y$

The "d" is simply the "Prices" value that corresponds to the row we're reviewing. The learning rate is typically less than 1; the function below defaults to 0.2, though the call in this tutorial passes 0.1. Since our training set is so small, we'll iterate through the entire set a few times.
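
To make the two equations concrete, here is a single update step traced on made-up numbers. The two-feature weights, the input row, and the target are all invented for illustration; only the arithmetic follows the equations above.

```python
import numpy as np

w = np.array([0.5, 0.5])   # current weights
x = np.array([1.0, 0.0])   # one scaled feature row
d = 2.0                    # target value for that row
lr = 0.2                   # learning rate

y = max(np.dot(w, x), 0.0)       # prediction: 0.5
w_new = w + lr * (d - y) * x     # only the weight of the nonzero feature moves
print(y, w_new)
```

The first weight moves from 0.5 toward 0.8 because the prediction undershot the target, while the second weight is untouched because its feature was zero for this row.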

In [30]:
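def trainperceptron(w, X, D, numiter=5, lr = 0.2):
    # w: initial weights, X: feature rows, D: target prices indexed like X
    for _ in range(numiter):            # pass over the whole training set numiter times
        for i, x in X.iterrows():
            d = D[i]                    # target price for this row
            y = max(sum(w*x),0)         # prediction equation
            error = (d - y)
            w = w + lr*error*x          # weight update equation
    return w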
# Initialize weights 
w = [random.random() for _ in xrange(len(features.columns))]
# Train perceptron
w = trainperceptron(w, ftrain, ptrain, numiter=5, lr=0.1)

## Perceptron Regression

Predicting values is easy with perceptrons. We'll use this equation from above:

$y = \max\left(\sum_{i=0}^{n-1} w_i x_i,\ 0\right)$

We'll iterate through our testing set and output the results. We'll then check our results for accuracy. 

In [31]:
def predict(w, X):
    y = pd.Series(index=X.index, dtype=float)
    
    for i, x in X.iterrows():
        y[i] = max(sum(w*x), 0) # same prediction equation used in training
    return y
predictedprices = predict(w,ftest)
print "average error: $",sum(abs(predictedprices - ptest)/float(len(ptest)))

# Uncomment this code to see the error for each property in the test set
#for i in range(len(predictedprices)):
    #print predictedprices[i], ptest[i], abs(ptest[i] - predictedprices[i])/ptest[i]*100

average error: $ 32535.3370307


Since our training set is so small, we should expect limited accuracy. You can experiment with both preprocessing and feature engineering to further refine results.
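
When comparing experiments, a dollar error alone can be misleading, because a $30,000 miss means something very different on a $60,000 house than on a $600,000 one. The sketch below computes both the mean absolute error and a percentage version on made-up numbers; the function names and the sample prices are assumptions for illustration.

```python
import numpy as np

def mean_abs_error(pred, actual):
    # Average dollar miss, ignoring direction.
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual)))

def mean_abs_pct_error(pred, actual):
    # Average miss as a percentage of the actual price.
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual) / actual) * 100.0)

actual = [100000.0, 200000.0]
pred = [110000.0, 180000.0]
print(mean_abs_error(pred, actual))      # 15000.0
print(mean_abs_pct_error(pred, actual))
```

Both functions accept plain lists or pandas Series, so they could be applied to `predictedprices` and `ptest` as long as the two are in the same row order.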

## Recommendations for Follow-On Projects
<ol>

<li>Include more data</li>

Preprocessing for this tutorial was quite heavy-handed. Try your hand at experimenting with more relaxed techniques.

<li>Calculate the distance from Downtown Pittsburgh.</li>

Since we already have the grid coordinates for every property, we could quickly calculate their distance from Downtown Pittsburgh. Better yet, include their local city center too! Hint: you can use <a href="https://pypi.python.org/pypi/geopy">geopy</a>.

<li>Compare results to Zillow (or Redfin).</li>

Since we're building our own estimator, you could check how we did against Zillow's "Zestimates." You can read up on it here: http://www.zillow.com/howto/api/APIOverview.htm

<li>Implement a Multi-Layer Perceptron</li>

A single-layer perceptron can be quite limited (for instance, it can't learn XOR relationships). Taking the time to implement a multi-layer perceptron network may improve your results.

</ol>
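
As a starting point for the distance idea above, here is a dependency-free sketch of the haversine great-circle formula. The `DOWNTOWN` coordinates are an approximate, assumed location for Downtown Pittsburgh, and `haversine_km` is a hypothetical helper; geopy offers more accurate alternatives.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two (lat, lon) points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

DOWNTOWN = (40.4406, -79.9959)  # approximate, assumed coordinates for Downtown Pittsburgh

# One degree of latitude is roughly 111 km, which makes a handy sanity check.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

Applied to the "key" dataframe, a new feature could be built with something like `[haversine_km(lat, lon, *DOWNTOWN) for lat, lon in zip(key['lat'], key['lon'])]` and then scaled to [0,1] like the other numeric features.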