# Teranet dataset
# Feature extraction
# Alpha-shapes
This notebook describes the process of generating [alpha shapes](https://en.wikipedia.org/wiki/Alpha_shape) from Teranet records.  


## Summary of the previous steps
In the previous steps, Teranet records were: 
* cleaned and filtered for duplicates
    * values of `consideration_amt` below $30 were reset to NaN (Not a Number, i.e. missing values)
    * records matching on all columns were removed (83,798 records)
    * records matching on all columns excluding `pin` were removed (729,182 records)
    * **813,138 duplicate entries** were removed in total from the original Teranet dataset 
    * 8,226,103 unique records remained after duplicates were removed
    * see notebook `data_cleaning/Teranet_data_cleaning.ipynb` for details

* filtered to include only records from the Greater Toronto and Hamilton Area (GTHA) 
    * filtering was performed via a spatial join
    * `xy` coordinates of Teranet records were joined (`how='inner'`, `op='within'`) with the dissemination area (DA) geometry for the GTHA 
    * DA geometry was provided by York Municipal Government (accessed via the Esri Open Data portal)
    * 6,062,853 records have `xy` coordinates within the GTHA boundary
    * see notebook `data_cleaning/Teranet_GTHA_DA_spatial_join.ipynb` for details
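
The duplicate-removal steps listed above can be sketched in pandas as follows. This is a minimal illustration on invented toy rows, not the code from `data_cleaning/Teranet_data_cleaning.ipynb`; the row values are made up, and keeping the first record of each duplicate group (the pandas default) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy records, invented for illustration; the real data has many more columns.
records = pd.DataFrame({
    'pin': ['A1', 'A1', 'B2', 'C3', 'D4'],
    'registration_date': pd.to_datetime(
        ['2010-01-05', '2010-01-05', '2011-03-01', '2012-07-15', '2012-07-15']),
    'consideration_amt': [500000.0, 500000.0, 2.0, 750000.0, 750000.0],
    'xy': ['p1', 'p1', 'p2', 'p3', 'p3'],
})

# 1. Treat prices below $30 as missing values.
records.loc[records['consideration_amt'] < 30, 'consideration_amt'] = np.nan

# 2. Remove records that match on all columns (assumption: keep the first copy).
cleaned = records.drop_duplicates()

# 3. Remove records that match on all columns except `pin`.
cols_without_pin = [c for c in cleaned.columns if c != 'pin']
cleaned = cleaned.drop_duplicates(subset=cols_without_pin)

print(len(records), '->', len(cleaned), 'records')
```

On these toy rows, the second `A1` row is an exact duplicate and the `D4` row differs from `C3` only in `pin`, so five rows reduce to three.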

## Alpha shapes
From [Wikipedia](https://en.wikipedia.org/wiki/Alpha_shape):  
In computational geometry, an alpha shape, or α-shape, is a family of piecewise linear simple curves in the Euclidean plane associated with the shape of a finite set of points. They were first defined by [Edelsbrunner, Kirkpatrick & Seidel (1983)](https://ieeexplore.ieee.org/document/1056714). The alpha-shape associated with a set of points is a generalization of the concept of the convex hull, i.e. every convex hull is an alpha-shape but not every alpha shape is a convex hull.

<img src='img/alpha_shapes.png'>
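
To make the definition concrete, the sketch below finds alpha-shape edges by brute force, using one common formulation: two points are joined when some disk of a chosen radius passes through both of them and has no other point strictly inside it. It is only an illustration on five invented points using the standard library. It is not how PySAL computes alpha shapes, and the function name `alpha_shape_edges` and its `radius` parameter are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def alpha_shape_edges(points, radius, eps=1e-9):
    """Return index pairs (i, j) joined by an empty disk of the given radius."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        (px, py), (qx, qy) = points[i], points[j]
        d = dist(points[i], points[j])
        if d == 0 or d > 2 * radius:
            continue  # no disk of this radius passes through both points
        # Distance from the chord midpoint to either candidate disk center.
        h = sqrt(max(0.0, radius ** 2 - (d / 2) ** 2))
        mx, my = (px + qx) / 2, (py + qy) / 2
        ux, uy = -(qy - py) / d, (qx - px) / d  # unit normal to the chord
        for sign in (1, -1):
            center = (mx + sign * h * ux, my + sign * h * uy)
            others = (p for k, p in enumerate(points) if k not in (i, j))
            if all(dist(center, p) >= radius - eps for p in others):
                edges.add((i, j))
                break
    return edges


# Four corners of a square plus one point just above the bottom side.
pts = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 0.4)]
hull_like = alpha_shape_edges(pts, radius=100)  # very large disks
tighter = alpha_shape_edges(pts, radius=1.2)    # smaller disks
print(sorted(hull_like))
print(sorted(tighter))
```

With the very large radius the edges are the four sides of the square, i.e. the convex hull. With `radius=1.2` the bottom side is replaced by two edges through the fifth point, so the outline becomes concave. The check is cubic in the number of points, so it only suits tiny examples.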

In this notebook, alpha shapes (polygons) will be generated from Teranet point data using the PySAL library for Python.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pysal.lib.cg import alpha_shape_auto
import os
import sys
import time

In [6]:
os.listdir()

['.git',
 '.gitattributes',
 '.gitignore',
 '.idea',
 '.ipynb_checkpoints',
 'data',
 'downloads',
 'img',
 'notebooks',
 'presentations',
 'README.md',
 'src',
 '__pycache__']

In [None]:
sys.path.append('src')

In [None]:
dtypes = {
    'decade': 'int',
    'year': 'int',
    'lro_num': 'category',
    'pin': 'category',
    'postal_code': 'category',
    'street_designation': 'category',
    'street_direction': 'category',
    'municipality': 'category',
    'da_id': 'category',
    'da_city': 'category',
    'xy': 'category'
}
t = time.time()
teranet_path = 'data/HHSaleHistory_cleaned_v0.9_GTHA_DA_with_cols_v0.9.csv'
# Load the records, index them by registration date, and sort chronologically.
df = pd.read_csv(teranet_path,
                 dtype=dtypes,
                 parse_dates=['registration_date'])\
        .set_index('registration_date').sort_index()
elapsed = time.time() - t
print("----- DataFrame with Teranet records loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)

In [None]:
# Drop every row that has a missing value in any column.
df = df.dropna()