In [20]:
import os
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
spark

import matplotlib.pyplot as plt

# Overview

## Business Problem

<blockquote source="https://docs.google.com/document/d/1WQ5MPTmzHR7PWzl6byvtts_TQ-4c1goZAuW4rE3EtDY/edit">
    Zillow gives consumers as much information as possible about homes and the housing market, marking the first time consumers had access to this type of home value information at no cost.
Predict the actual sale price, given all the features of a home and some key factors that affect real estate prices.
    <br/>
    <b>Business goal:</b> Create a Dynamic Pricing tool that will predict the actual home sale price, given all the features of a home and some key factors that affect real estate prices.
</blockquote>

Zillow describes the Zestimate as follows:

<blockquote source="https://www.zillow.com/zestimate/">
The Zestimate® home valuation model is Zillow’s estimate of a home's market value. The Zestimate incorporates public and user-submitted data, taking into account home facts, location and market conditions.
<br />
It is not an appraisal and it should be used as a starting point. We encourage buyers, sellers and homeowners to supplement the Zestimate with other research such as visiting the home, getting a professional appraisal of the home, or requesting a comparative market analysis (CMA) from a real estate agent.
</blockquote>
    
Plan for improving the existing Zestimate:

- Based on [Airbnb's Dynamic Pricing Model](https://github.com/tule2236/Airbnb-Dynamic-Pricing-Optimization), we plan to augment the Zestimate with a comparative market analysis akin to the expertise offered by a local realtor.

The comparative market analysis adds value as follows:
- helps home sellers evaluate local realtors
- increases home sellers' confidence in setting the price of their home without a realtor

## Data Sources

In the first version of this project, we will focus on the static data provided by Zillow's Kaggle competition. When implementing the competitive price analysis, we will look for local government datasets and, if there is time, look into web scraping new data sources.

# References

Thottuvaikkatumana, R. (2016). Apache Spark 2 for Beginners. Packt Publishing.

tule2236 (2018). Dynamic Pricing Optimization for Airbnb. Retrieved from https://github.com/tule2236/Airbnb-Dynamic-Pricing-Optimization.

Zillow Kaggle Competition. Retrieved from https://www.kaggle.com/c/zillow-prize-1/data.

# Data Exploration
* Full list of real estate properties in three counties (Los Angeles, Orange, and Ventura, California). 
* Most training transactions occurred before October 15, 2016, plus a few after that date.
* Testing data are transactions between October 15 and December 31, 2016.

### Objective
We are supposed to predict the log-error at 6 time points for all properties: October 2016 (201610), November 2016 (201611), December 2016 (201612), October 2017 (201710), November 2017 (201711), and December 2017 (201712).

Our target variable is the log-error between their Zestimate and the actual sale price. 

`logerror = log(Zestimate) - log(SalePrice)`

This value is given to us in the training files (`train_2016_v2.csv` and `train_2017.csv`).
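
To make the sign convention concrete, here is a small worked example. The prices are made up for illustration, and the use of the natural logarithm is an assumption, since the formula above does not state the base.

```python
import math

def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(SalePrice) using the natural log."""
    return math.log(zestimate) - math.log(sale_price)

# Hypothetical prices, not taken from the dataset.
over = log_error(525_000, 500_000)   # Zestimate 5% above the sale price
under = log_error(475_000, 500_000)  # Zestimate 5% below the sale price
print(round(over, 4), round(under, 4))
```

A positive value means the Zestimate was higher than the sale price, a negative value means it was lower, and 0 means the two matched exactly.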

# Download the data
Download the data from https://www.kaggle.com/c/zillow-prize-1/data and extract it into a folder called `zillow-prize-1`. This folder should be in the base level directory (the same directory as the .gitignore file).
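
A quick way to confirm the extraction worked is to check for the files this notebook loads. This is only a sketch: `missing_files` is a hypothetical helper, and the `../zillow-prize-1` path assumes the notebook sits one level below the data folder, as the loading cell does.

```python
from pathlib import Path

# File names as used in the loading cell of this notebook.
EXPECTED_FILES = [
    "properties_2017.csv",
    "train_2016_v2.csv",
    "train_2017.csv",
    "sample_submission.csv",
]

def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]

# An empty list means every expected file was found.
print(missing_files("../zillow-prize-1"))
```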

### What is the data?

All features can be found in `properties_2016.csv` and `properties_2017.csv`. These two CSV files list every home in 2016 and 2017 along with its properties.

The training data is found in `train_2016_v2.csv` and `train_2017.csv`.

In [2]:
features_location = "../zillow-prize-1/properties_2017.csv"
train1_location = "../zillow-prize-1/train_2016_v2.csv"
train2_location = "../zillow-prize-1/train_2017.csv"
test_location = "../zillow-prize-1/sample_submission.csv"


features = spark.read.csv(features_location, header=True)
training1 = spark.read.csv(train1_location, header=True)
training2 = spark.read.csv(train2_location, header=True)
testing = spark.read.csv(test_location, header=True)

### What do our features look like?

In [7]:
print("Feature Schema: ")
features.printSchema()

Feature Schema: 
root
 |-- parcelid: string (nullable = true)
 |-- airconditioningtypeid: string (nullable = true)
 |-- architecturalstyletypeid: string (nullable = true)
 |-- basementsqft: string (nullable = true)
 |-- bathroomcnt: string (nullable = true)
 |-- bedroomcnt: string (nullable = true)
 |-- buildingclasstypeid: string (nullable = true)
 |-- buildingqualitytypeid: string (nullable = true)
 |-- calculatedbathnbr: string (nullable = true)
 |-- decktypeid: string (nullable = true)
 |-- finishedfloor1squarefeet: string (nullable = true)
 |-- calculatedfinishedsquarefeet: string (nullable = true)
 |-- finishedsquarefeet12: string (nullable = true)
 |-- finishedsquarefeet13: string (nullable = true)
 |-- finishedsquarefeet15: string (nullable = true)
 |-- finishedsquarefeet50: string (nullable = true)
 |-- finishedsquarefeet6: string (nullable = true)
 |-- fips: string (nullable = true)
 |-- fireplacecnt: string (nullable = true)
 |-- fullbathcnt: string (nullable = true)
 |-- ga

In [8]:
print("The first entry:")
features.show(1, vertical=True)

The first entry:
-RECORD 0--------------------------------------
 parcelid                     | 10754147       
 airconditioningtypeid        | null           
 architecturalstyletypeid     | null           
 basementsqft                 | null           
 bathroomcnt                  | 0.0            
 bedroomcnt                   | 0.0            
 buildingclasstypeid          | null           
 buildingqualitytypeid        | null           
 calculatedbathnbr            | null           
 decktypeid                   | null           
 finishedfloor1squarefeet     | null           
 calculatedfinishedsquarefeet | null           
 finishedsquarefeet12         | null           
 finishedsquarefeet13         | null           
 finishedsquarefeet15         | null           
 finishedsquarefeet50         | null           
 finishedsquarefeet6          | null           
 fips                         | 06037          
 fireplacecnt                 | null           
 fullbathcnt           

In [17]:
features.describe('airconditioningtypeid').show()

+-------+---------------------+
|summary|airconditioningtypeid|
+-------+---------------------+
|  count|               815362|
|   mean|   1.9457234945950388|
| stddev|   3.1605068777112315|
|    min|                    1|
|    max|                    9|
+-------+---------------------+



In [13]:
features.describe('landtaxvaluedollarcnt').show()

+-------+---------------------+
|summary|landtaxvaluedollarcnt|
+-------+---------------------+
|  count|              2925291|
|   mean|   268455.76912108914|
| stddev|    486509.7132898551|
|    min|                  1.0|
|    max|             999996.0|
+-------+---------------------+



In [14]:
features.describe('fireplacecnt').show()

+-------+-------------------+
|summary|       fireplacecnt|
+-------+-------------------+
|  count|             313124|
|   mean| 1.1689586234207534|
| stddev|0.46185487571362344|
|    min|                  1|
|    max|                  9|
+-------+-------------------+
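
One thing to watch in these summaries: the schema above shows every column as `string`, so the `min` and `max` rows are probably string comparisons rather than numeric ones (as text, "9" sorts after "13"). Below is a minimal sketch of the pitfall and one possible fix, assuming a cast to `double` suits the column. `describe_numeric` is a hypothetical helper, not part of PySpark.

```python
def describe_numeric(df, column):
    """Summarize one column of a Spark DataFrame after casting it to double."""
    return df.select(df[column].cast("double").alias(column)).describe()

# Plain-Python illustration with made-up values.
raw = ["1", "13", "5", "9"]
text_max = max(raw)                       # "9": compared character by character
numeric_max = max(float(v) for v in raw)  # 13.0: compared as numbers
print(text_max, numeric_max)
```

With the notebook's DataFrame this would be called as `describe_numeric(features, 'landtaxvaluedollarcnt').show()`.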



In [9]:
# What is the size of our DataFrame?
"The size of our data is {} rows and {} columns.".format(features.count(), len(features.columns))

'The size of our data is 2985217 rows and 58 columns.'
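
The `count` row in each `describe` output above is the number of non-null values, so combining it with the total row count gives a rough picture of how sparse each column is. A small sketch using only numbers copied from the outputs above:

```python
# Values copied from the outputs above.
total_rows = 2985217
non_null_counts = {
    "airconditioningtypeid": 815362,
    "landtaxvaluedollarcnt": 2925291,
    "fireplacecnt": 313124,
}

# Fraction of rows where each column is null.
missing_fraction = {
    name: 1 - count / total_rows for name, count in non_null_counts.items()
}
for name, fraction in missing_fraction.items():
    print(f"{name}: {fraction:.1%} missing")
```

Roughly 73% of the air conditioning values and close to 90% of the fireplace counts are missing, while the land tax value is nearly complete.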

In [50]:
# Let's see all of the columns.
features.columns

['parcelid',
 'airconditioningtypeid',
 'architecturalstyletypeid',
 'basementsqft',
 'bathroomcnt',
 'bedroomcnt',
 'buildingclasstypeid',
 'buildingqualitytypeid',
 'calculatedbathnbr',
 'decktypeid',
 'finishedfloor1squarefeet',
 'calculatedfinishedsquarefeet',
 'finishedsquarefeet12',
 'finishedsquarefeet13',
 'finishedsquarefeet15',
 'finishedsquarefeet50',
 'finishedsquarefeet6',
 'fips',
 'fireplacecnt',
 'fullbathcnt',
 'garagecarcnt',
 'garagetotalsqft',
 'hashottuborspa',
 'heatingorsystemtypeid',
 'latitude',
 'longitude',
 'lotsizesquarefeet',
 'poolcnt',
 'poolsizesum',
 'pooltypeid10',
 'pooltypeid2',
 'pooltypeid7',
 'propertycountylandusecode',
 'propertylandusetypeid',
 'propertyzoningdesc',
 'rawcensustractandblock',
 'regionidcity',
 'regionidcounty',
 'regionidneighborhood',
 'regionidzip',
 'roomcnt',
 'storytypeid',
 'threequarterbathnbr',
 'typeconstructiontypeid',
 'unitcnt',
 'yardbuildingsqft17',
 'yardbuildingsqft26',
 'yearbuilt',
 'numberofstories',
 'firep

### What does our training set look like?

In [59]:
# Concatenate 2016 and 2017 (union matches columns by position, not by name)
training = training1.union(training2)

In [60]:
training.show(3)

+--------+--------+---------------+
|parcelid|logerror|transactiondate|
+--------+--------+---------------+
|11016594|  0.0276|     2016-01-01|
|14366692| -0.1684|     2016-01-01|
|12098116|  -0.004|     2016-01-01|
+--------+--------+---------------+
only showing top 3 rows



In [63]:
# What is the size of our DataFrame?
print((training.count(), len(training.columns)))
print("2016 Count: ", (training1.count(), len(training1.columns)))
print("2017 Count: ", (training2.count(), len(training2.columns)))

(167888, 3)
2016 Count:  (90275, 3)
2017 Count:  (77613, 3)


In [62]:
# Let's see all of the columns.
training.columns

['parcelid', 'logerror', 'transactiondate']

## Things to note about training data
We can join the training data to the features on `parcelid`.

We are predicting the value of `logerror`.

We may want to treat this as a time series (using `transactiondate`).
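
The notes above could be sketched as follows. This is one possible approach, not a settled design: `build_training_frame` is a hypothetical helper, and the casts assume `logerror` is numeric and `transactiondate` is in `yyyy-MM-dd` form, as the sample rows suggest.

```python
def build_training_frame(training, features):
    """Attach property features to each transaction via parcelid."""
    # The CSV reader loaded everything as strings, so cast the target and date.
    typed = training.withColumn("logerror", training["logerror"].cast("double"))
    typed = typed.withColumn(
        "transactiondate", typed["transactiondate"].cast("date")
    )
    # A left join keeps every transaction even if no feature row matches.
    return typed.join(features, on="parcelid", how="left")
```

With the DataFrames loaded earlier, this would be called as `build_training_frame(training, features)`.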

### What does our testing set look like?

In [10]:
testing.show(3)

+--------+------+------+------+------+------+------+
|ParcelId|201610|201611|201612|201710|201711|201712|
+--------+------+------+------+------+------+------+
|10754147|     0|     0|     0|     0|     0|     0|
|10759547|     0|     0|     0|     0|     0|     0|
|10843547|     0|     0|     0|     0|     0|     0|
+--------+------+------+------+------+------+------+
only showing top 3 rows



In [56]:
# What is the size of our DataFrame?
print((testing.count(), len(testing.columns)))

(2985217, 7)


We would want to exclude the testing data. I think?

In [64]:
# Let's see all of the columns.
testing.columns

['ParcelId', '201610', '201611', '201612', '201710', '201711', '201712']

### Things to note about the testing set
1. We will want to exclude all things in the training set. 
2. The ideal `zscore` is 0. 
    * I'm not 100% sure what this means for us...
        * Do we just try to predict the `zscore` in the training set?
  
I'm not sure if this is actually our testing set... hmm
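
If parcels from the training set do turn out to need excluding (item 1 above), a left anti join is one way to do it. This is a sketch under that assumption. `unseen_parcels` is a hypothetical helper, and the column names `ParcelId` and `parcelid` are taken from the outputs above.

```python
def unseen_parcels(testing, training):
    """Return rows of `testing` whose parcel never appears in `training`."""
    condition = testing["ParcelId"] == training["parcelid"]
    # A left anti join keeps only the left-side rows with no match.
    return testing.join(training, condition, "left_anti")
```

With the DataFrames loaded earlier, this would be called as `unseen_parcels(testing, training)`.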