# GTHA housing market database
# OSEMN methodology Step 2: Scrub
# Data quality issues discovered in the Teranet dataset

---

This notebook presents examples of data quality issues discovered in the Teranet dataset.

Previous steps included:

* Step 2.1

    * Teranet points were spatially joined with the polygons of GTHA Dissemination Areas (DAs).

    * Teranet records whose coordinates fall outside the GTHA boundary (as defined by the DA geometry) were filtered out; 6,803,691 of the original 9,039,241 Teranet records remain in the dataset.

    * Three new columns (`OBJECTID`, `DAUID`, and `CSDNAME`), derived from DA attributes, were added to each Teranet transaction.
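
To put the Step 2.1 filter in proportion, the record counts quoted above can be turned into a removed count and a retained share. This is a small arithmetic sketch; the variable names are illustrative and not taken from the earlier notebooks.

```python
# Record counts quoted in the Step 2.1 summary.
total_records = 9_039_241      # original Teranet records
retained_records = 6_803_691   # records that fall inside the GTHA boundary

removed_records = total_records - retained_records
retained_share = retained_records / total_records

print("removed:  {0:,}".format(removed_records))
print("retained: {0:.2%}".format(retained_share))
```

Roughly three quarters of the original records survive the spatial filter.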

---

## Import dependencies

In [1]:
import pandas as pd
import os
from time import time

## Locate the data files

In [2]:
data_path = '../../data/teranet/'
os.listdir(data_path)

['Teranet_with_DA_cols.csv', 'HHSaleHistory.csv']

## Load Teranet data

In [3]:
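# Time the load of the full Teranet table.
t = time()
# Alternative that parses dates on load (left disabled, so
# 'registration_date' stays a string column):
#df = pd.read_csv(data_path + 'Teranet_with_DA_cols.csv',
#                 parse_dates=['registration_date'])
df = pd.read_csv(data_path + 'Teranet_with_DA_cols.csv')
elapsed = time() - t
# Report load time, table shape, and column names.
print("----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) +
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) +
      "\n-- Column names:\n", df.columns)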


----- DataFrame loaded
in 20.37 seconds
with 6,803,691 rows
and 18 columns
-- Column names:
 Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'POSTAL_CODE', 'PROVINCE', 'UNITNO', 'STREET_NAME',
       'STREET_DESIGNATION', 'STREET_DIRECTION', 'MUNICIPALITY',
       'STREET_SUFFIX', 'STREET_NUMBER', 'X', 'Y', 'OBJECTID', 'DAUID',
       'CSDNAME'],
      dtype='object')


In [4]:
df.columns = df.columns.str.lower()
df.columns

Index(['lro_num', 'pin', 'consideration_amt', 'registration_date',
       'postal_code', 'province', 'unitno', 'street_name',
       'street_designation', 'street_direction', 'municipality',
       'street_suffix', 'street_number', 'x', 'y', 'objectid', 'dauid',
       'csdname'],
      dtype='object')

In [7]:
# Display-only copy of the amount with thousands separators (stored as strings).
df['price'] = df['consideration_amt'].apply(lambda x: '{:,}'.format(x))
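
The formatting step above turns numeric amounts into strings, which is worth keeping in mind before filtering or sorting on `price`. The sketch below uses a toy frame (`demo` is a hypothetical name; the two non-missing amounts are copied from outputs shown later in this notebook) to illustrate what the format call produces, including for a missing amount.

```python
import pandas as pd

# Toy frame, not the Teranet data: two amounts plus one missing value.
demo = pd.DataFrame({'consideration_amt': [5439000.0, 2.0, float('nan')]})

# Same formatting expression as in the notebook cell.
demo['price'] = demo['consideration_amt'].apply(lambda x: '{:,}'.format(x))

formatted = demo['price'].tolist()
# Every value, including the missing one, is now a Python string.
all_strings = all(isinstance(value, str) for value in formatted)
# The missing amount became the literal text 'nan', so isna() no longer flags it.
missing_after = int(demo['price'].isna().sum())

print(formatted)
print(all_strings, missing_after)
```

For numeric work (comparisons, sums, missing-value checks), `consideration_amt` is the safer column; `price` is only convenient for reading large amounts.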

## Multiple transactions
Multiple transactions can be recorded in the Teranet dataset in a number of ways.

In [9]:
mask1 = df['pin'] == 248580237
df.loc[mask1, ['pin', 'registration_date', 'price', 'unitno', 
               'street_name', 'street_designation', 'municipality', 'y', 'x']]\
    .sort_values('registration_date')

| | pin | registration_date | price | unitno | street_name | street_designation | municipality | y | x |
|---|---|---|---|---|---|---|---|---|---|
| 163593 | 248580237 | 12/22/2011 0:00:00 | 5439000.0 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE | 43.383163 | -79.737999 |
| 163478 | 248580237 | 5/2/2016 0:00:00 | 10504865.0 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE | 43.383163 | -79.737999 |
| 163485 | 248580237 | 6/29/2017 0:00:00 | 2.0 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE | 43.383163 | -79.737999 |
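
Because the data was loaded without `parse_dates`, `registration_date` holds strings, and `sort_values` orders strings character by character rather than chronologically. The rows above happen to come out in date order, but that is not guaranteed in general. A minimal sketch, using date strings that appear in this notebook's outputs and assuming the `month/day/year hour:minute:second` layout seen there:

```python
import pandas as pd

# Date strings copied from outputs in this notebook (not the full column).
dates = pd.Series(['12/22/2011 0:00:00', '5/2/2016 0:00:00',
                   '10/7/2016 0:00:00', '6/29/2017 0:00:00'])

# Sorting the raw strings compares characters, so '10/...' sorts before '12/...'.
string_order = dates.sort_values().tolist()

# Parsing first gives a true chronological order.
parsed = pd.to_datetime(dates, format='%m/%d/%Y %H:%M:%S')
chronological_order = dates.loc[parsed.sort_values().index].tolist()

print(string_order)
print(chronological_order)
```

Parsing the column once (either with `pd.to_datetime` or with the commented-out `parse_dates` option in the load cell) avoids this pitfall for every later sort.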


In [6]:
mask1 = df['pin'] == 32063841
df.loc[mask1, ['pin', 'registration_date', 'price', 'unitno', 
               'street_name', 'street_designation', 'municipality', 'y', 'x']]\
    .sort_values('registration_date')

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


| | pin | registration_date | price | unitno | street_name | street_designation | municipality | y | x |
|---|---|---|---|---|---|---|---|---|---|
| 811529 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811530 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811531 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811532 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811533 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811534 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
| 811535 | 32063841 | 8/23/2016 0:00:00 | | | - 298 KING ST. & 4 - 8 PARKER AVEN | | RICHMOND HILL | 43.943214 | -79.465033 |
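
The output above shows the same PIN registered seven times on the same date. One way to find such groups across the whole table is to flag rows that share a key. The sketch below runs on a toy frame (`records` is a hypothetical name; the values are copied from outputs in this notebook), and the choice of `pin` plus `registration_date` as the key is an assumption. Whether such rows are true duplicates or distinct instruments registered on the same day cannot be decided from these columns alone.

```python
import pandas as pd

# Toy frame: three copies of the repeated record plus one unrelated record.
records = pd.DataFrame({
    'pin': [32063841, 32063841, 32063841, 248580237],
    'registration_date': ['8/23/2016 0:00:00'] * 3 + ['5/2/2016 0:00:00'],
    'street_name': ['- 298 KING ST. & 4 - 8 PARKER AVEN'] * 3
                   + [',3525,3535 & 3545 REBECCA STREET'],
})

# Assumed key: same property (pin) registered on the same date.
key = ['pin', 'registration_date']

# keep=False marks every member of a repeated group, not only the later copies.
is_repeated = records.duplicated(subset=key, keep=False)

# Number of rows in each repeated group.
copies = records[is_repeated].groupby(key).size()

print(is_repeated.tolist())
print(copies)
```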


In [10]:
mask1 = df['street_name'].str.contains('&', na=False)
df.loc[mask1, ['pin', 'registration_date', 'unitno', 'street_name', 'street_designation', 'municipality']]

| | pin | registration_date | unitno | street_name | street_designation | municipality |
|---|---|---|---|---|---|---|
| 121142 | 250590032 | 6/7/1989 0:00:00 | | & 239 ARMSTRONG AVENUE | | HALTON HILLS |
| 121154 | 250590032 | 3/22/1993 0:00:00 | | & 239 ARMSTRONG AVENUE | | HALTON HILLS |
| 122722 | 250590032 | 10/7/2016 0:00:00 | | & 239 ARMSTRONG AVENUE | | HALTON HILLS |
| 122843 | 250590032 | 6/22/2017 0:00:00 | | & 239 ARMSTRONG AVENUE | | HALTON HILLS |
| 163478 | 248580237 | 5/2/2016 0:00:00 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE |
| 163485 | 248580237 | 6/29/2017 0:00:00 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE |
| 163593 | 248580237 | 12/22/2011 0:00:00 | | ,3525,3535 & 3545 REBECCA STREET | | OAKVILLE |
| 421658 | 134130101 | 9/29/2008 0:00:00 | | & 3450 RIDGEWAY DR., 3715 LAIRD RO | | MISSISSAUGA |
| 421775 | 134130101 | 1/15/2013 0:00:00 | | & 3450 RIDGEWAY DR., 3715 LAIRD RO | | MISSISSAUGA |
| 691581 | 210180481 | 7/31/2009 0:00:00 | | & 332 REAR LEE AVENUE | | TORONTO |