# Housing Case Study

## Introduction

RasterFrames is a powerful tool for combining geospatial data with Spark DataFrames. In this case study, we will explore the rate of development of a housing complex just outside Charlottesville, Virginia, using aerial imagery captured over the course of four years. This tutorial assumes a basic knowledge of DataFrames; more information can be found [here](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes).

## Background 

[Cascadia](http://southern-development.com/communities/cascadia/) is a housing development built to the northeast of Charlottesville. The land was rezoned in 2007; we are going to analyze how long it then took to break ground and clear away the forest that previously stood there. The data comes from NAIP (the National Agriculture Imagery Program) and consists of three multispectral images taken over the course of four years. The pictures were taken during the summers of 2012, 2014, and 2016, and their high spatial resolution allows a more precise measurement of the development's progress.

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyrasterframes import *
from pyrasterframes.rasterfunctions import *

# Add other configuration options as needed

spark = SparkSession.builder. \
    master("local[*]"). \
    appName("RasterFrames"). \
    config("spark.ui.enabled", "false"). \
    getOrCreate(). \
    withRasterFrames()

In [32]:
from pyspark.sql.functions import lit
# Read the 2012, 2014, and 2016 scenes and tag each one with its year
RFs = [spark.read.geotiff("data/{}.tif".format(year)). \
       withColumn("year", lit(str(year))) for year in (2012, 2014, 2016)]

In [33]:
from functools import reduce
# Stack the three per-year frames into a single RasterFrame
unionRF = reduce(lambda rf1, rf2: rf1.union(rf2), RFs).asRF()

In [35]:
# Rename the four tile columns from tile_N to band_N
for num in range(1, 5):
    unionRF = unionRF.withColumnRenamed("tile_{}".format(num), "band_{}".format(num))

## Aggregate Statistics

Aggregate statistics are an important tool for analyzing tiles. Aggregate functions operate column-wise: `aggSum` finds the sum of all cells in every tile of a column, and `aggMean` finds the mean over those same cells. `aggStats` is the aggregate counterpart of `tileStats`: instead of reporting statistics for each tile, it computes the full set of statistics once over all tiles in the column. This is the result of running `aggStats` across the blue band:

In [36]:
unionRF.select(aggStats("band_1")).show(truncate=False)

+------------------------------------------------------------+
|aggStats(band_1)                                            |
+------------------------------------------------------------+
|[1014390,0,23.0,255.0,121.69001863188714,2521.3564791043373]|
+------------------------------------------------------------+
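
To make the six numbers in that output easier to read, the sketch below computes the same kind of column-wise summary in plain Python, without Spark. It assumes the fields are, in order, the data-cell count, the NoData-cell count, the minimum, the maximum, the mean, and the variance; that ordering is inferred from the values shown and is not stated in this case study. The `agg_stats` helper and the toy tiles are hypothetical, and `None` stands in for a NoData cell.

```python
def agg_stats(tiles):
    """Summarize every cell of every tile in one column.

    `tiles` is a list of 2-D lists; None marks a NoData cell.
    Returns (data_cells, no_data_cells, min, max, mean, variance).
    """
    # Flatten all tiles in the column into one list of cells
    cells = [cell for tile in tiles for row in tile for cell in row]
    data = [float(c) for c in cells if c is not None]
    mean = sum(data) / len(data)
    # Population variance (an assumption; the real function may differ)
    variance = sum((c - mean) ** 2 for c in data) / len(data)
    return (len(data), len(cells) - len(data),
            min(data), max(data), mean, variance)

# Two toy 2x2 tiles standing in for one band column
band = [
    [[23, 100], [255, None]],
    [[120, 130], [140, 150]],
]
stats = agg_stats(band)
print(stats)
```

The point of the sketch is that the statistics are taken over the pooled cells of every tile, not averaged tile by tile.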



### Local Aggregate Statistics

<img style="float: right; width:300px;height:600px;" src="pics/localAdd.png">


Aggregate statistics often need to be computed on a local, or cell-wise, basis. Local aggregate functions take all the tiles in one band column and output a single tile in which each cell is the result of applying the function to the cells at that position across the input tiles. For instance, `localAggSum` returns a tile of the same size as the input tiles whose cells are the sums of the corresponding input cells, much as `localAdd` does (although `localAggSum` operates on more than two tiles at once). The diagram to the right shows the result of performing `localAdd` on two tiles. Other local aggregate functions include `localAggMin`, `localAggMax`, and `localAggDataCells`. A full list can be found in [the docs](Link goes here).
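
The relationship between the two-tile and many-tile cases can be illustrated with a minimal sketch in plain Python. It uses nested lists rather than RasterFrames tiles and ignores NoData handling; `local_add` and `local_agg_sum` are hypothetical stand-ins written for this illustration, not the library functions themselves.

```python
from functools import reduce

def local_add(tile_a, tile_b):
    """Cell-wise sum of two equally sized tiles."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(tile_a, tile_b)]

def local_agg_sum(tiles):
    """Cell-wise sum over any number of equally sized tiles."""
    # Folding the pairwise add over the column yields the aggregate
    return reduce(local_add, tiles)

# Three toy 2x2 tiles standing in for one band column
tiles = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
pair_sum = local_add(tiles[0], tiles[1])
column_sum = local_agg_sum(tiles)
print(pair_sum)    # [[11, 22], [33, 44]]
print(column_sum)  # [[111, 222], [333, 444]]
```

In both cases the output tile has the same shape as the inputs; only the number of tiles contributing to each cell differs.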

### Mask the scenes to keep only the site data (use the later scene to determine what to mask in the earlier ones)
### Load label data, or find another way to classify cells
### Train a model on the scenes
### Combine the classified image with the image from before the mask