# GTHA housing market database
# OSEMN methodology Step 3: Explore
# Exploratory Spatial Data Analysis (ESDA) of the Teranet dataset
# Alpha-shapes

This notebook describes the process of generating [alpha shapes](https://en.wikipedia.org/wiki/Alpha_shape) from Teranet records.  


## Previous steps
Previous steps included: 

**Step 2: Scrub:**

* **Step 2.1:** spatial join between the Teranet points and the polygons of GTHA Dissemination Areas (DAs)
    
    * During step 2.1, Teranet records whose coordinates fall outside of the GTHA boundary (as defined by the DA geometry) have been filtered out (6,803,691 of the original 9,039,241 Teranet records remain in the dataset)
     
    * In addition to that, three new columns (`OBJECTID`, `DAUID`, and `CSDNAME`) derived from DA attributes have been added to each Teranet transaction

    * for details, see `notebooks/2.scrub/2.1_teranet_gtha_spatial_join.ipynb`

* **Step 2.2:** correction for consistency of the Teranet records

    * column names were converted to lower case
    
    * inconsistent capitalization was fixed in the following columns:
    
        * `municipality`    
        * `street_name`
        * `street_designation`
        * `postal_code` (did not show problems, converted as a preventive measure)
        
    * columns `province` and `street_suffix` were removed from the dataset
    
    * new column `street_name_raw` was created as a backup copy of the unmodified `street_name`
    
    * column `street_name` was parsed and cleaned for:
    
        * `postal_code`
        * `unitno`
        * `street_number`
        * `street_direction`
        * `street_designation`
        
    * plots of the count and percentage of missing values per column were produced
    
    * inconsistent entries were fixed in the following columns:
        
        * `street_direction`
        * `street_designation`
        * `municipality`
        * `street_name`
        * `unitno`
        
    * for details, see `notebooks/2.scrub/2.2_teranet_consistency.ipynb`

* **Step 2.3:** addition of new attributes to the Teranet dataset

    * during Step 2.3, **two versions of the Teranet dataset were produced**:

        * one where `consideration_amt` was left unmodified
    
        * one where `consideration_amt` < 10,000 CAD was reset to NaN and those records were then removed from the dataset: 1,615,178 records (23.74% of the total) were removed, leaving 5,188,513 records in the Teranet dataset
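The second version of the dataset can be illustrated with a minimal pandas sketch. The toy values, the `MIN_PRICE` constant, and the names `df_full` and `df_nonan` are assumptions made for illustration; the actual processing is in the Step 2.3 notebooks and may differ in detail.

```python
import numpy as np
import pandas as pd

MIN_PRICE = 10000  # CAD threshold taken from the text; the constant name is hypothetical

# toy records, values are made up
df = pd.DataFrame({'transaction_id': [0, 1, 2, 3],
                   'consideration_amt': [2.0, 9999.0, 10000.0, 350000.0]})

# version 1: consideration_amt left unmodified
df_full = df.copy()

# version 2: reset low amounts to NaN, then drop those records
df_nonan = df.copy()
low = df_nonan['consideration_amt'] < MIN_PRICE
df_nonan.loc[low, 'consideration_amt'] = np.nan
df_nonan = df_nonan.dropna(subset=['consideration_amt'])

removed = len(df_full) - len(df_nonan)
print("{0:,} records ({1:.2%} of the total) removed, {2:,} remain"
      .format(removed, removed / len(df_full), len(df_nonan)))
```

Note that `dropna` would also drop any records whose amount was already missing before the threshold was applied.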

New attributes were added to both versions of the Teranet dataset:
 
* surrogate key:

    * `transaction_id`: unique identifier for each Teranet transaction; essentially a simple range index giving the row number of a record in the full Teranet dataset (filtered to include only GTHA records), ordered by date (from earliest to latest) and `pin`
    
* attributes for display

    * `date_disp`: `registration_date` converted to `datetime.date` data type to exclude the timestamp (original `registration_date` is stored in NumPy's `datetime64` format to allow more efficient datetime operations)
    
    * `price_disp`: `consideration_amt` formatted to include thousands separator (_e.g.,_ '3,455,122') and stored as a string, for display purposes
    
* attributes for record grouping
    
    * `year`: year parsed from `registration_date`, to simplify record grouping
    
    * `3year`: `registration_date` parsed for 3-year intervals (_e.g.,_ '2014-2016'), to simplify record grouping
    
    * `5year`: `registration_date` parsed for 5-year intervals (_e.g.,_ '2012-2016'), to simplify record grouping
    
    * `10year`: `registration_date` parsed for 10-year intervals (_e.g.,_ '2007-2017'), to simplify record grouping
    
    * `xy`: `x` and `y` coordinates concatenated together (_e.g.,_ '43.098324_-79.234235'), can be used to identify and group records by their coordinate pairs
    
* correction of `consideration_amt` for inflation    
    
    * `price_infl`: `consideration_amt` corrected for inflation
    
* exploratory attributes

    * `pin/xy_total_sales`: total records for this `pin`/`xy`

    * `pin/xy_prev_sales`: previous records from this `pin`/`xy` (not counting current transaction)

    * `pin/xy_price_cum_sum`: cumulative price of all records to date from this `pin`/`xy`

    * `pin/xy_price_pct_change`: price percentage change compared to previous record from this `pin`/`xy`

    * `price_da_pct_change`: price percentage change compared to previous record from this DA (by `da_id`)

    * `pin/xy_years_since_last_sale`: years since last sale from this `pin`/`xy`

    * `da_days_since_last_sale`, `da_years_since_last_sale`: days or years since last sale from this DA (by `da_id`)

    * `sale_next_6m/1y/3y`: "looks into the future" to see whether there is another transaction from this `pin`/`xy` within the given time horizon (6 months, 1 year, 3 years)

    * for details, see `notebooks/2.scrub/2.3_teranet_new_cols.ipynb` and `notebooks/2.scrub/2.3_teranet_nonan_new_cols.ipynb`
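Several of the attributes listed above can be derived with ordinary pandas group operations. The following is a minimal sketch on made-up records, grouped by `pin` only. The exact definitions used in the Step 2.3 notebooks are not shown here, so the formulas below (for example, dividing days by 365.25 and treating 3 years as 3 × 365 days) are assumptions.

```python
import pandas as pd

# toy records; column names follow the notebook, values are made up
df = pd.DataFrame({
    'pin': [1, 1, 2, 1],
    'registration_date': pd.to_datetime(
        ['2001-03-01', '2003-09-01', '2005-06-15', '2010-03-01']),
    'consideration_amt': [200000.0, 250000.0, 400000.0, 500000.0],
})

# order by date and pin, then use the row number as a surrogate key
df = df.sort_values(['registration_date', 'pin']).reset_index(drop=True)
df['transaction_id'] = df.index
df['year'] = df['registration_date'].dt.year

g = df.groupby('pin')

# total and previous records for each pin
df['pin_total_sales'] = g['consideration_amt'].transform('size')
df['pin_prev_sales'] = g.cumcount()

# cumulative price and change relative to the previous record of the same pin
df['pin_price_cum_sum'] = g['consideration_amt'].cumsum()
df['pin_price_pct_change'] = (df['consideration_amt']
                              / g['consideration_amt'].shift() - 1)

# years since the previous record of the same pin (assumed 365.25 days per year)
since_last = df['registration_date'] - g['registration_date'].shift()
df['pin_years_since_last_sale'] = since_last.dt.days / 365.25

# "look into the future": is there another record for this pin within 3 years?
until_next = g['registration_date'].shift(-1) - df['registration_date']
df['pin_sale_next_3y'] = until_next <= pd.Timedelta(days=3 * 365)

print(df[['pin', 'year', 'pin_prev_sales', 'pin_price_pct_change',
          'pin_sale_next_3y']])
```

The same pattern would apply to the `xy` variants by grouping on the `xy` column instead of `pin`.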

---

For description of OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

For background information, description of the Teranet dataset, and its attributes, see `methodology/1.obtain/obtain.pdf`.

For description of _Step 2: Scrub_ of OSEMN methodology, see `methodology/2.scrub/scrub.pdf`.

## Alpha shapes
From [Wikipedia](https://en.wikipedia.org/wiki/Alpha_shape):  
In computational geometry, an alpha shape, or α-shape, is a family of piecewise linear simple curves in the Euclidean plane associated with the shape of a finite set of points. They were first defined by [Edelsbrunner, Kirkpatrick & Seidel (1983)](https://ieeexplore.ieee.org/document/1056714). The alpha-shape associated with a set of points is a generalization of the concept of the convex hull, i.e. every convex hull is an alpha-shape but not every alpha shape is a convex hull.

<img src='img/alpha_shapes.png'>

In this notebook, alpha shapes (polygons) will be generated from Teranet point data using the PySAL library in Python.
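To give some intuition for how alpha controls the tightness of the shape, one common construction triangulates the points (Delaunay) and keeps only triangles whose circumscribed circle is small enough. The sketch below shows just that circumradius test in plain Python. The helper names and the convention "keep if circumradius < 1/alpha" are assumptions for illustration, not a description of PySAL internals.

```python
import math


def circumradius(a, b, c):
    """Radius of the circle through 2-D points a, b, c (inf if collinear)."""
    la = math.hypot(b[0] - c[0], b[1] - c[1])
    lb = math.hypot(a[0] - c[0], a[1] - c[1])
    lc = math.hypot(a[0] - b[0], a[1] - b[1])
    # triangle area from the cross product of two edge vectors
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    area = abs(cross) / 2.0
    if area == 0.0:
        return math.inf
    return la * lb * lc / (4.0 * area)


def keep_triangle(tri, alpha):
    """Keep a triangle when its circumradius is below 1 / alpha (assumed convention)."""
    return circumradius(*tri) < 1.0 / alpha


compact = ((0.0, 0.0), (1.0, 0.0), (0.5, 0.866))  # near-equilateral, side about 1
wide = ((0.0, 0.0), (3.0, 0.0), (0.0, 4.0))       # 3-4-5 right triangle

for alpha in (0.5, 0.1):
    print(alpha, keep_triangle(compact, alpha), keep_triangle(wide, alpha))
```

With a larger alpha only compact triangles survive, so the outline hugs the points. As alpha approaches zero every triangle is kept and the result approaches the convex hull.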

## Import dependencies

In [16]:
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import contextily as ctx
import seaborn as sns
import os
import sys
from pysal.lib.cg import alpha_shape_auto
from shapely.geometry import Point
from time import time

In [2]:
sys.path.append('../../src')
from plot_utils import map_alpha

## Load Teranet data

In [3]:
teranet_path = '../../data/teranet/'
os.listdir(teranet_path)

['1.1_Teranet_DA.csv',
 '1.3_Teranet_DA_TAZ_PG_FSA.csv',
 '2_Teranet_consistent.csv',
 'parcel16_epoi13.csv',
 '1.2_Teranet_DA_TAZ.csv',
 '1.4_Teranet_DA_TAZ_FSA_LU_LUDMTI.csv',
 '1.4_Teranet_DA_TAZ_FSA_LU.csv',
 '.ipynb_checkpoints',
 'ParcelLandUse.zip',
 'ParcelLandUse',
 'HHSaleHistory.csv',
 '3_Teranet_nonan_new_cols.csv',
 'GTAjoinedLanduseSales']

In [4]:
# load DataFrame with Teranet records
t = time()
teranet_df = pd.read_csv(teranet_path + '3_Teranet_nonan_new_cols.csv',
                         parse_dates=['registration_date'])
elapsed = time() - t
print("----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(teranet_df.shape[0], teranet_df.shape[1]) + 
      "\n-- Column names:\n", teranet_df.columns)

----- DataFrame loaded
in 44.55 seconds
with 5,188,513 rows
and 54 columns
-- Column names:
 Index(['transaction_id', 'lro_num', 'pin', 'consideration_amt',
       'registration_date', 'postal_code', 'unitno', 'street_name',
       'street_designation', 'street_direction', 'municipality',
       'street_number', 'x', 'y', 'dauid', 'csduid', 'csdname', 'taz_o', 'fsa',
       'pca_id', 'postal_code_dmti', 'pin_lu', 'landuse', 'prop_code',
       'dmti_lu', 'street_name_raw', 'year', 'year_month', 'year3',
       'census_year', 'census2001_year', 'tts_year', 'tts1991_year', 'xy',
       'pin_total_sales', 'xy_total_sales', 'pin_prev_sales', 'xy_prev_sales',
       'pin_price_cum_sum', 'xy_price_cum_sum', 'pin_price_pct_change',
       'xy_price_pct_change', 'price_da_pct_change',
       'pin_years_since_last_sale', 'xy_years_since_last_sale',
       'da_days_since_last_sale', 'da_years_since_last_sale',
       'pin_sale_next_6m', 'pin_sale_next_1y', 'pin_sale_next_3y',
       'xy_sale_nex

## Generate maps of alpha shapes by municipality from Teranet records
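Before shapes are built, the loop in this section drops municipalities that have too few records in a given year. A minimal sketch of that filter on made-up data is shown here; it counts rows with `size()`, whereas the notebook cell counts non-null values of one column, so the two can differ when that column has missing values.

```python
import pandas as pd

min_count = 8

# toy subset for one year: 10 records in one municipality, 3 in another (made up)
s = pd.DataFrame({'csdname': ['Toronto'] * 10 + ['Uxbridge'] * 3,
                  'x': range(13),
                  'y': range(13)})

# number of records per municipality, keeping only those above the threshold
mun_counts = s.groupby('csdname').size()
mun_counts = mun_counts[mun_counts > min_count]

# keep only records from municipalities that passed the threshold
s_kept = s[s['csdname'].isin(mun_counts.index)]

print(mun_counts)
print(len(s_kept), "records kept")
```

Looking up a count later by municipality name, as in `mun_counts['Toronto']`, avoids relying on the row order of the filtered counts.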

In [40]:
as_dir = 'results/maps/alpha_shapes/'
os.listdir(as_dir)

[]

In [41]:
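start_year = 1985
end_year = 2017
min_count = 8
# column used to group and colour the shapes; it is not defined in the cells shown here
# and is assumed to be 'csdname', since the map labels below read mun['csdname']
color_col = 'csdname'
teranet_crs = {'proj': 'latlong', 'ellps': 'WGS84', 'datum': 'WGS84', 'no_defs': True}

# plot alpha shapes produced from Teranet records taken by year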
for year in range(start_year, end_year + 1):
    
    s = teranet_df.query('year == {0}'.format(year))
    print("Generating alpha shapes from the {0:,} Teranet records from {1}...".format(len(s), year))
    
    # filter Teranet records for minimum count per municipality required to produce alpha shapes
    mun_counts = s.groupby(color_col)['price_2016'].count()
    mun_counts = mun_counts[mun_counts > min_count]
    mask1 = s[color_col].isin(mun_counts.index)
    s = s[mask1]

    # generate alpha shapes from Teranet records as a GeoDataFrame
    alpha = s.groupby(color_col)[['x', 'y']].apply(lambda x: alpha_shape_auto(x.values))
    alpha = gpd.GeoDataFrame({'geometry': alpha}, crs=teranet_crs)\
        .to_crs(epsg=3857).reset_index()
    
    # plot alpha shapes created from the Teranet subset
    f, ax = plt.subplots(1, figsize=(12, 12))
    alpha.plot(column=color_col, legend=True, legend_kwds={'loc': 'lower right'}, ax=ax, alpha=0.5)

    # add municipality names and record counts
    # (counts are looked up by municipality name rather than by row position)
    for _, mun in alpha.iterrows():
        mun_centroid = mun['geometry'].centroid
        ax.text(mun_centroid.x, mun_centroid.y, mun['csdname'] + 
                "\n{:,} records".format(mun_counts[mun['csdname']]),
                va='center', ha='center')

    # add a basemap
    ctx.add_basemap(ax=ax, url=ctx.sources.ST_TONER_BACKGROUND)

    # configure axis parameters
    ax.set_xlim(-8940996.776086302, -8723064.623629777)
    ax.set_ylim(5313237.739935117, 5555494.494204169)
    ax.set_title("Alpha shapes produced from {0:,} Teranet records from {1}"\
                 .format(len(s), year), fontsize=20)
    ax.set_axis_off()
    plt.savefig(as_dir + 'teranet_alpha_' + str(year) + '.png', dpi=400, bbox_inches='tight')
    plt.close(f)

Generating alpha shapes from the 19,912 Teranet records from 1985...
Generating alpha shapes from the 35,291 Teranet records from 1986...
Generating alpha shapes from the 36,529 Teranet records from 1987...
Generating alpha shapes from the 51,180 Teranet records from 1988...
Generating alpha shapes from the 81,903 Teranet records from 1989...
Generating alpha shapes from the 80,297 Teranet records from 1990...
Generating alpha shapes from the 81,096 Teranet records from 1991...
Generating alpha shapes from the 87,769 Teranet records from 1992...
Generating alpha shapes from the 80,936 Teranet records from 1993...
Generating alpha shapes from the 100,207 Teranet records from 1994...
Generating alpha shapes from the 88,685 Teranet records from 1995...
Generating alpha shapes from the 141,955 Teranet records from 1996...
Generating alpha shapes from the 154,189 Teranet records from 1997...
Generating alpha shapes from the 145,558 Teranet records from 1998...
Generating alpha shapes from t