# GTHA housing market database
# OSEMN methodology Step 2: Scrub
# Step 2.3 Spatial join of Teranet points with parcel-level land use from DMTI

---

This notebook documents Step 2.3 of the cleanup of the Teranet dataset (part of _Step 2: Scrub_ of the OSEMN methodology).

Step 2.3 involves the spatial join of Teranet points with parcel-level land use from DMTI.

Previous steps included: 

* Step 2.1 

    * Teranet points were spatially joined with the polygons of GTHA Dissemination Areas (DAs).
    
    * Teranet records whose coordinates fall outside the GTHA boundary (as defined by the DA geometry) were filtered out; 6,803,691 of the original 9,039,241 Teranet records remain in the dataset.
     
    * Three new columns (`OBJECTID`, `DAUID`, and `CSDNAME`) derived from DA attributes were added to each Teranet transaction.

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

For background information and a description of the Teranet dataset and its attributes, see `methodology/1.obtain/obtain.pdf`.

For a description of _Step 2: Scrub_ of the OSEMN methodology, see `methodology/2.scrub/scrub.pdf`.

For a description of the cleanup plan for the Teranet dataset, see `methodology/2.scrub/teranet_cleanup_plan.pdf`.

For a description of Step 2.1 of the cleanup process, see `notebooks/2.scrub/2.1_teranet_gtha_spatial_join.ipynb`.


## Import dependencies

In [8]:
%matplotlib inline
import pandas as pd 
import geopandas as gpd
import os
from shapely.geometry import Point
from glob import glob
from time import time

In [7]:
data_path = '../../data/'
teranet_path = data_path + 'teranet/'
os.listdir(teranet_path)

['3_Teranet_new_cols.csv',
 '1.1_Teranet_DA.csv',
 '1.3_Teranet_DA_TAZ_PG_FSA.csv',
 '2_Teranet_consistent.csv',
 'parcel16_epoi13.csv',
 '1.2_Teranet_DA_TAZ.csv',
 '1.4_Teranet_DA_TAZ_FSA_LU.csv',
 '.ipynb_checkpoints',
 'ParcelLandUse.zip',
 'ParcelLandUse',
 'HHSaleHistory.csv',
 '3_Teranet_nonan_new_cols.csv',
 'GTAjoinedLanduseSales']

In [10]:
dmti_lu_path = data_path + 'dmti/dmti_lu_gtha/'
os.listdir(dmti_lu_path)

['lu_gtha02.shx',
 'lu_gtha10.shp',
 'lu_gtha10.shx',
 'lu_gtha03.sbx',
 'lu_gtha12.shp.xml',
 'lu_gtha02.prj',
 'lu_gtha13.shp.xml',
 'lu_gtha14.shp',
 'lu_gtha12.shx',
 'lu_gtha12.shp',
 'lu_gtha05.shp',
 'lu_gtha14.dbf',
 'lu_gtha02.dbf',
 'lu_gtha11.shp.xml',
 'lu_gtha09.cpg',
 'lu_gtha14.cpg',
 'lu_gtha01.prj',
 'lu_gtha11.cpg',
 'lu_gtha07.dbf',
 'lu_gtha14.sbn',
 'lu_gtha14.shx',
 'lu_gtha12.dbf',
 'lu_gtha09.dbf',
 'lu_gtha12.cpg',
 'lu_gtha05.sbx',
 'lu_gtha06.prj',
 'lu_gtha11.sbn',
 'lu_gtha10.sbx',
 'lu_gtha01.sbn',
 'lu_gtha08.cpg',
 'lu_gtha09.prj',
 'lu_gtha09.shp.xml',
 'lu_gtha07.sbx',
 'lu_gtha13.shp',
 'lu_gtha10.shp.xml',
 'lu_gtha12.prj',
 'lu_gtha05.sbn',
 'lu_gtha09.sbn',
 'lu_gtha12.sbn',
 'lu_gtha04.sbn',
 'lu_gtha07.prj',
 'lu_gtha08.sbx',
 'lu_gtha11.shp',
 'lu_gtha03.prj',
 'lu_gtha13.dbf',
 'lu_gtha03.shx',
 'lu_gtha07.shp.xml',
 'lu_gtha06.shp.xml',
 'lu_gtha10.cpg',
 'lu_gtha05.cpg',
 'lu_gtha07.shx',
 'lu_gtha01.shp.xml',
 'lu_gtha04.shp.xml',
 'lu_gtha0

In [19]:
dmti_lu_shapefile_list = glob(dmti_lu_path + '*.shp')
dmti_lu_shapefile_list.sort()
dmti_lu_shapefile_list

['../../data/dmti/dmti_lu_gtha/lu_gtha01.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha02.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha03.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha04.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha05.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha06.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha07.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha08.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha09.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha10.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha11.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha12.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha13.shp',
 '../../data/dmti/dmti_lu_gtha/lu_gtha14.shp']
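
The two-digit suffix in each file name appears to encode the year of the land use layer (the join loop later in this notebook reads it that way, prefixing `'20'`). As a minimal sketch under that assumption, the year can be parsed with a pattern check so that an unexpected file name fails loudly instead of producing a wrong year. The helper name `year_from_shapefile` is hypothetical and not part of the notebook.

```python
import os
import re


def year_from_shapefile(path):
    """Return the four-digit year encoded in a name such as 'lu_gtha07.shp'.

    Assumes every layer belongs to the 2000s, as the notebook's own loop does.
    """
    name = os.path.basename(path)
    match = re.fullmatch(r'lu_gtha(\d{2})\.shp', name)
    if match is None:
        raise ValueError('unexpected shapefile name: {0}'.format(name))
    return 2000 + int(match.group(1))


paths = ['../../data/dmti/dmti_lu_gtha/lu_gtha01.shp',
         '../../data/dmti/dmti_lu_gtha/lu_gtha14.shp']
# map each year to the shapefile that holds its land use polygons
shapefile_by_year = {year_from_shapefile(p): p for p in paths}
```

Compared with slicing by position (`shapefile[-6:-4]`), the pattern match rejects sidecar files such as `lu_gtha12.shp.xml` if they ever end up in the list.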

## Load Teranet data

In [31]:
t = time()
teranet_df = pd.read_csv(teranet_path + '1.4_Teranet_DA_TAZ_FSA_LU.csv',
                         parse_dates=['registration_date'])
elapsed = time() - t
print("----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(teranet_df.shape[0], teranet_df.shape[1]) + 
      "\n-- Column names:\n", teranet_df.columns)

t = time()
# combine values in columns 'X' and 'Y' into POINT geometry objects
geometry = [Point(xy) for xy in zip(teranet_df['X'], teranet_df['Y'])]
# generate a new GeoDataFrame by adding point geometry to DataFrame 'teranet_df'
teranet_gdf = gpd.GeoDataFrame(teranet_df, geometry=geometry)
elapsed = time() - t
print("\n----- Geometry generated from 'X' and 'Y' pairs, GeoDataFrame created!"
      "\nin {0:.2f} seconds ({1:.2f} minutes)".format(elapsed, elapsed / 60) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(teranet_gdf.shape[0], teranet_gdf.shape[1]) + 
      "\n-- Column names:\n", teranet_gdf.columns)

# add CRS for WGS84 (lat-long) to GeoDataFrame with Teranet records
teranet_gdf.crs = {'proj': 'latlong', 
                   'ellps': 'WGS84', 
                   'datum': 'WGS84', 
                   'no_defs': True}
print("\n----- CRS dictionary for WGS84 added to geo data frame 'teranet_gdf'!")

----- DataFrame loaded
in 1161.73 seconds
with 6,803,767 rows
and 27 columns
-- Column names:
 Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'POSTAL_CODE', 'PROVINCE', 'UNITNO', 'STREET_NAME',
       'STREET_DESIGNATION', 'STREET_DIRECTION', 'MUNICIPALITY',
       'STREET_SUFFIX', 'STREET_NUMBER', 'X', 'Y', 'DAUID', 'CSDUID',
       'CSDNAME', 'TAZ_O', 'FSA', 'PCA_ID', 'postal_code_dmti', 'MAF_ID',
       'DEL_M_ID', 'pin_lu', 'LANDUSE', 'PROP_CODE'],
      dtype='object')

----- Geometry generated from 'X' and 'Y' pairs, GeoDataFrame created!
in 110.04 seconds (1.83 minutes)
with 6,803,767 rows
and 28 columns
-- Column names:
 Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'POSTAL_CODE', 'PROVINCE', 'UNITNO', 'STREET_NAME',
       'STREET_DESIGNATION', 'STREET_DIRECTION', 'MUNICIPALITY',
       'STREET_SUFFIX', 'STREET_NUMBER', 'X', 'Y', 'DAUID', 'CSDUID',
       'CSDNAME', 'TAZ_O', 'FSA', 'PCA_ID', 'postal_code_dmti', 'MAF_ID',
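
Building one `Point` per row in a Python list comprehension took close to two minutes in the cell above. As a hedged alternative, GeoPandas provides `points_from_xy`, which builds the geometry column from the coordinate columns in a single call, and the CRS can be passed when the GeoDataFrame is constructed. The sketch below uses a two-row toy DataFrame with made-up coordinates rather than the Teranet records, and it was not timed against the original approach.

```python
import geopandas as gpd
import pandas as pd

# toy stand-in for teranet_df: 'X' is longitude, 'Y' is latitude (illustrative values)
df = pd.DataFrame({'X': [-79.38, -79.64], 'Y': [43.65, 43.59]})

# build point geometry from the coordinate columns and set WGS 84 in one step
gdf = gpd.GeoDataFrame(df,
                       geometry=gpd.points_from_xy(df['X'], df['Y']),
                       crs='EPSG:4326')
```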
 

## Perform the spatial join of Teranet points with DMTI land use parcels by year
### Validate projections
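>Note: EPSG:4326 and WGS 84 refer to the [same coordinate reference system](https://spatialreference.org/ref/epsg/wgs-84/): EPSG:4326 is the EPSG code for geographic (latitude/longitude) coordinates on the WGS 84 datum.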
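
The join later in this notebook emits a `CRS of frames being joined does not match!` warning, so it may be worth comparing the two CRSs explicitly and reprojecting one layer before joining. The following is a minimal sketch of such a check on toy geometries. The helper `align_crs` is hypothetical, and EPSG:3857 is used only to simulate a mismatched polygon layer; the actual CRS of the DMTI shapefiles is not shown in this notebook.

```python
import geopandas as gpd
from shapely.geometry import Point, box


def align_crs(points, polygons):
    """Return polygons reprojected to the CRS of points if the two differ."""
    if points.crs is None or polygons.crs is None:
        raise ValueError('both layers need a CRS before they can be compared')
    if points.crs != polygons.crs:
        polygons = polygons.to_crs(points.crs)
    return polygons


# toy point in WGS 84 and a toy polygon stored in a different CRS (assumed for illustration)
points = gpd.GeoDataFrame(geometry=[Point(-79.38, 43.65)], crs='EPSG:4326')
polygons = gpd.GeoDataFrame(geometry=[box(-79.5, 43.5, -79.2, 43.8)],
                            crs='EPSG:4326').to_crs('EPSG:3857')

aligned = align_crs(points, polygons)
```

Reprojecting the polygons rather than the points is an arbitrary choice here; with millions of points and fewer polygons it is likely the cheaper direction.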

In [None]:
'OBJECTID' in teranet_gdf.columns

In [29]:
for shapefile in dmti_lu_shapefile_list:
    # read one year of DMTI land use data, keeping only the columns needed for the join
    lu_gdf = gpd.read_file(shapefile)
    cols = ['OBJECTID', 'CATEGORY', 'geometry']
    lu_gdf = lu_gdf[cols]
    
    # take a subset of Teranet records for that year
    # ('teranet_gdf' has no 'year' column, so the year is taken from 'registration_date')
    year = int('20' + shapefile[-6:-4])
    s = teranet_gdf[teranet_gdf['registration_date'].dt.year == year]
    
    # report polygon and point counts and both CRSs ahead of the spatial join
    print("----- {0}. Joining {1:,} land use polygons from DMTI with {2:,} Teranet points..."
          .format(year, len(lu_gdf), len(s)) + 
         "\nCRS of DMTI land use shapefile:", lu_gdf.crs, "CRS of Teranet points:", s.crs)
#    t = time()
#    teranet_lu_gdf = gpd.sjoin(teranet_gdf, lu_gdf, 
#                               how='left', op='within')
#    elapsed = time() - t
#    print("\nSpatial join completed in {0:,.2f} seconds ({1:,.2f} minutes). {2:,} rows in the resultant GeoDataFrame"
#          .format(elapsed, elapsed / 60, len(teranet_lu_gdf)))
#    if 'OBJECTID_x' in teranet_lu_gdf.columns:
#        mask1 = teranet_lu_gdf['OBJECTID_x'].isnull()
#        teranet_lu_gdf.loc[mask1, 'OBJECTID_x'] = teranet_lu_gdf.loc[mask1, 'OBJECTID_y']
#        teranet_lu_gdf.loc[mask1, 'CATEGORY_x'] = teranet_lu_gdf.loc[mask1, 'CATEGORY_y']
#        teranet_lu_gdf = teranet_lu_gdf.drop(['OBJECTID_y', 'CATEGORY_y'], axis=1)\
#                            .rename(columns={'OBJECTID_x': 'OBJECTID', 'CATEGORY_x': 'CATEGORY'})
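
One way a per-year join like the loop above could be assembled is to join each year's points with that year's polygons and concatenate the pieces, which avoids merging `_x`/`_y` column pairs afterwards. This is only a sketch on toy data, not the notebook's final method. The `layers` dictionary and its contents are invented for illustration, and the call uses the `predicate` keyword of recent GeoPandas releases, whereas the cells in this notebook use the older `op` keyword.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# toy stand-ins for the Teranet points and two yearly land use layers
points = gpd.GeoDataFrame(
    {'registration_date': pd.to_datetime(['2001-03-01', '2002-07-15', '2002-09-30'])},
    geometry=[Point(0.5, 0.5), Point(0.5, 0.5), Point(5, 5)],
    crs='EPSG:4326')
layers = {
    2001: gpd.GeoDataFrame({'CATEGORY': [1]}, geometry=[box(0, 0, 1, 1)], crs='EPSG:4326'),
    2002: gpd.GeoDataFrame({'CATEGORY': [2]}, geometry=[box(0, 0, 1, 1)], crs='EPSG:4326'),
}

parts = []
for year, polygons in layers.items():
    # keep only the transactions registered in the year of this land use layer
    subset = points[points['registration_date'].dt.year == year]
    # a left join keeps points that fall in no polygon (their CATEGORY stays missing)
    joined = gpd.sjoin(subset, polygons, how='left', predicate='within')
    parts.append(joined.drop(columns='index_right'))

# stack the yearly results and restore the original row order
result = pd.concat(parts).sort_index()
```

Transactions registered in a year with no land use layer would be dropped by this sketch, so they would need separate handling.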

### Perform the spatial join


In [8]:
t = time()
teranet_lu_gdf = gpd.sjoin(teranet_gdf, lu_gdf, 
                           how='left', op='within')
elapsed = time() - t
print("\n----- Spatial join completed, new GeoDataFrame created"
      "\nin {0:.2f} seconds ({1:.2f} minutes)".format(elapsed, elapsed / 60) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(teranet_lu_gdf.shape[0], teranet_lu_gdf.shape[1]) + 
      "\n-- Column names:\n", teranet_lu_gdf.columns)

  warn('CRS of frames being joined does not match!')



----- Spatial join completed, new GeoDataFrame created
in 1186.97 seconds (19.78 minutes)
with 6,803,767 rows
and 29 columns
-- Column names:
 Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'POSTAL_CODE', 'PROVINCE', 'UNITNO', 'STREET_NAME',
       'STREET_DESIGNATION', 'STREET_DIRECTION', 'MUNICIPALITY',
       'STREET_SUFFIX', 'STREET_NUMBER', 'X', 'Y', 'DAUID', 'CSDUID',
       'CSDNAME', 'TAZ_O', 'FSA', 'PCA_ID', 'postal_code_dmti', 'MAF_ID',
       'DEL_M_ID', 'geometry', 'index_right', 'PIN', 'LANDUSE', 'PROP_CODE'],
      dtype='object')
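
After a left spatial join it can be useful to check how many points found no polygon and whether any point matched more than one polygon, since each extra match adds a row. The row count reported above equals the input row count, which suggests no duplication occurred here. The sketch below shows both checks on a small toy DataFrame with invented values; the repeated index label imitates a point that matched two overlapping polygons.

```python
import pandas as pd

# toy stand-in for the joined table; index label 2 appears twice on purpose
joined = pd.DataFrame(
    {'pin': [101, 102, 103, 103],
     'LANDUSE': ['Residential', None, 'Commercial', 'Commercial']},
    index=[0, 1, 2, 2])

n_rows = len(joined)
# points with no matching polygon carry a missing LANDUSE value
n_unmatched = int(joined['LANDUSE'].isna().sum())
match_rate = 1 - n_unmatched / n_rows
# rows whose index label repeats come from points matched to several polygons
n_duplicated = int(joined.index.duplicated().sum())

print("{0:,} of {1:,} rows unmatched ({2:.1%} matched), {3:,} duplicated rows"
      .format(n_unmatched, n_rows, match_rate, n_duplicated))
```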


#### Rename column 'PIN' to 'pin_lu'

In [9]:
teranet_lu_gdf = teranet_lu_gdf.rename(columns={'PIN': 'pin_lu'})
print("Column was renamed.")

Column was renamed.


### Save results to a .csv file

In [10]:
teranet_lu_gdf.columns

Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'POSTAL_CODE', 'PROVINCE', 'UNITNO', 'STREET_NAME',
       'STREET_DESIGNATION', 'STREET_DIRECTION', 'MUNICIPALITY',
       'STREET_SUFFIX', 'STREET_NUMBER', 'X', 'Y', 'DAUID', 'CSDUID',
       'CSDNAME', 'TAZ_O', 'FSA', 'PCA_ID', 'postal_code_dmti', 'MAF_ID',
       'DEL_M_ID', 'geometry', 'index_right', 'pin_lu', 'LANDUSE',
       'PROP_CODE'],
      dtype='object')

In [11]:
save_path = teranet_path + '1.4_Teranet_DA_TAZ_FSA_LU.csv'
t = time()
teranet_lu_gdf.drop(['index_right', 'geometry'], axis=1).to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds ({1:.2f} minutes)".format(elapsed, elapsed / 60))

DataFrame saved to file:
 ../../data/teranet/1.4_Teranet_DA_TAZ_FSA_LU.csv 
took 179.54 seconds (2.99 minutes)
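
Note that `save_path` points to the same file that was loaded at the start of this notebook, so an interrupted write could leave the input damaged. As a hedged sketch, the file can be written under a temporary name in the same directory and swapped in only once the write has finished. The helper `save_csv_atomically` is hypothetical, and the demo uses a throwaway directory rather than the Teranet data.

```python
import os
import tempfile

import pandas as pd


def save_csv_atomically(df, path):
    """Write df to a temporary file next to path, then replace path with it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix='.csv', dir=directory)
    os.close(fd)
    try:
        df.to_csv(tmp_path, index=False)
        os.replace(tmp_path, path)  # swap in the finished file
    except BaseException:
        # remove the partial file and leave the original untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# demo in a throwaway directory
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, 'demo.csv')
pd.DataFrame({'pin': [1]}).to_csv(target, index=False)  # existing "input" file
updated = pd.DataFrame({'pin': [1, 2], 'LANDUSE': ['a', 'b']})
save_csv_atomically(updated, target)
reloaded = pd.read_csv(target)
```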
