# EDA of Teranet data with 12 added columns
This notebook presents Exploratory Data Analysis (EDA) for Teranet records that were:
* cleaned and filtered for duplicates
    * `consideration_amt` < $30 reset to NaN (Not a Number, missing values)
    * records matching on all columns have been removed (83,798 records)
    * records matching on all columns excluding `pin` have been removed (729,182 records)
    * **813,138 duplicate entries** removed in total from original Teranet dataset
    * 8,226,103 unique records remain after duplicates have been removed
    * see notebook `Teranet_data_cleaning.ipynb` for details
* filtered to include only records from the Greater Toronto and Hamilton Area (GTHA)
    * filtering performed via a spatial join
    * `xy` coordinates of Teranet records joined (how='inner', op='within') with Dissemination Area (DA) geometry for GTHA
    * DA geometry provided by York Municipal Government (accessed via Esri Open Data portal)
    * 6,062,853 records have `xy` coordinates within GTHA boundary
    * see notebook `Teranet_GTHA_DA_spatial_join.ipynb` for details
* filtered to exclude records with missing (or under $30) `consideration_amt`
    * `consideration_amt` < $30 was reset to NaN during data cleaning
    * records with `consideration_amt` == NaN were removed before adding new columns
    * 4,637,584 records have non-NaN `consideration_amt`
    * see notebook `Teranet_add_new_columns.ipynb` for details
* new columns added
    * `da_id`, `da_city`, `da_median_tot_inc`: added during the spatial join with DA data
    * `xy`: `x` and `y` concatenated together (used for grouping by coordinate pairs)
    * `total_sales_pin`: total records for this `pin`
    * `total_sales_xy`: total records for this `xy` coordinate pair
    * `pin_prev_sales`: 
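
The cleaning and filtering steps listed above can be illustrated on a toy table. This is only a sketch under stated assumptions, not the code from `Teranet_data_cleaning.ipynb`: the sample rows are made up, the steps are collapsed into one place (in the notebooks they are spread over several files), and it assumes that "matching on all columns excluding `pin`" means dropping a row that repeats an earlier row in every column other than `pin`.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (not Teranet data); column names follow the notebook.
toy = pd.DataFrame({
    'registration_date': pd.to_datetime(
        ['2001-05-01', '2001-05-01', '2001-05-01', '2003-07-15', '2005-02-10']),
    'pin': [1, 1, 2, 3, 4],
    'consideration_amt': [250000.0, 250000.0, 250000.0, 2.0, 410000.0],
    'x': [-79.40, -79.40, -79.40, -79.50, -79.60],
    'y': [43.70, 43.70, 43.70, 43.80, 43.90],
})

# 1. Reset prices under $30 to NaN.
cleaned = toy.copy()
cleaned.loc[cleaned['consideration_amt'] < 30, 'consideration_amt'] = np.nan

# 2. Drop rows that match an earlier row on all columns.
cleaned = cleaned.drop_duplicates()

# 3. Drop rows that match an earlier row on all columns except `pin`.
cols_without_pin = [c for c in cleaned.columns if c != 'pin']
cleaned = cleaned.drop_duplicates(subset=cols_without_pin)

# 4. Keep only records with a non-NaN price.
with_price = cleaned.dropna(subset=['consideration_amt'])

print("rows before cleaning:", len(toy))
print("rows after removing duplicates:", len(cleaned))
print("rows with non-NaN consideration_amt:", len(with_price))
```

`drop_duplicates` keeps the first occurrence by default, so the order of rows decides which of two matching records survives.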

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import os

In [4]:
os.chdir('Documents/repos/geo')
os.listdir()

['.git',
 '.gitattributes',
 '.gitignore',
 '.idea',
 '.ipynb_checkpoints',
 'data',
 'img',
 'notebooks',
 'presentations',
 'README.md',
 'src',
 '__pycache__']

In [5]:
# column `pin` will be converted to dtype=category
# after records with NaN `consideration_amt` are dropped
dtypes = {
    'decade': 'int',
    'year': 'int',
    'lro_num': 'category',
    'postal_code': 'category',
    'street_designation': 'category',
    'street_direction': 'category',
    'municipality': 'category',
    'da_id': 'category',
    'da_city': 'category',
}
t = time.time()
teranet_path = 'data/HHSaleHistory_cleaned_v0.9_GTHA_DA_with_cols_v0.9.csv'
df = pd.read_csv(teranet_path,
                 dtype=dtypes,
                 parse_dates=['registration_date'])
df = df.sort_values('registration_date')
elapsed = time.time() - t
print("----- DataFrame with Teranet records loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)

----- DataFrame with Teranet records loaded
in 30.38 seconds
with 4,637,584 rows
and 28 columns
-- Column names:
 Index(['registration_date', 'decade', 'year', 'lro_num', 'pin',
       'consideration_amt', 'postal_code', 'unitno', 'street_name',
       'street_designation', 'street_direction', 'municipality',
       'street_number', 'x', 'y', 'da_id', 'da_city', 'da_median_tot_inc',
       'xy', 'total_sales_pin', 'pin_prev_sales', 'xy_prev_sales',
       'pin_price_cum_sum', 'xy_price_cum_sum', 'pin_price_pct_change',
       'xy_price_pct_change', 'pin_years_since_last_sale',
       'xy_years_since_last_sale'],
      dtype='object')


In [12]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | registration_date | decade | year | lro_num | pin | consideration_amt | postal_code | unitno | street_name | ... | pin_total_sales | xy_total_sales | pin_prev_sales | xy_prev_sales | pin_price_cum_sum | xy_price_cum_sum | pin_price_pct_change | xy_price_pct_change | pin_years_since_last_sale | xy_years_since_last_sale |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1986-07-09 | 198 | 1986 | 65 | 29000001 | 185000.0 | L3P6K5 |  | Cairns | ... |  |  | 1 | 1 | 185000.0 | 185000.0 |  |  |  |  |
| 1 | 1 | 1986-04-14 | 198 | 1986 | 65 | 29000002 | 171000.0 | L3P6K5 |  | Cairns | ... |  |  | 1 | 1 | 171000.0 | 171000.0 |  |  |  |  |
| 2 | 2 | 1988-09-09 | 198 | 1988 | 65 | 29000003 | 318000.0 | L3P6K5 |  | Cairns | ... |  |  | 1 | 1 | 318000.0 | 318000.0 |  |  |  |  |
| 3 | 3 | 1999-01-29 | 199 | 1999 | 65 | 29000003 | 273000.0 | L3P6K5 |  | Cairns | ... |  |  | 2 | 2 | 591000.0 | 591000.0 | -0.141509 | -0.141509 | 10.394521 | 10.394521 |
| 4 | 4 | 2011-02-18 | 201 | 2011 | 65 | 29000003 | 558000.0 | L3P6K5 |  | Cairns | ... |  |  | 3 | 3 | 1149000.0 | 1149000.0 | 1.043956 | 1.043956 | 12.063014 | 12.063014 |
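
The derived columns in this preview are consistent with simple per-`pin` group operations: a running count of records, a running sum of prices, the relative change from the previous price, and the gap in days since the previous record divided by 365. The sketch below reproduces the values shown for `pin` 29000003. It illustrates one way to obtain them and is not the code from `Teranet_add_new_columns.ipynb`; the variable names `sales`, `prev_price` and `prev_date` are hypothetical.

```python
import pandas as pd

# Three records for one `pin`, copied from the preview above.
sales = pd.DataFrame({
    'registration_date': pd.to_datetime(
        ['1988-09-09', '1999-01-29', '2011-02-18']),
    'pin': [29000003, 29000003, 29000003],
    'consideration_amt': [318000.0, 273000.0, 558000.0],
})

# The group operations below assume chronological order within each `pin`.
sales = sales.sort_values(['pin', 'registration_date'])

# Running count of records for the pin (1 for the first record).
sales['pin_prev_sales'] = sales.groupby('pin').cumcount() + 1

# Running sum of prices for the pin.
sales['pin_price_cum_sum'] = sales.groupby('pin')['consideration_amt'].cumsum()

# Relative change from the previous price (NaN for the first record).
prev_price = sales.groupby('pin')['consideration_amt'].shift()
sales['pin_price_pct_change'] = sales['consideration_amt'] / prev_price - 1

# Days since the previous record divided by 365 (NaN for the first record).
prev_date = sales.groupby('pin')['registration_date'].shift()
sales['pin_years_since_last_sale'] = (
    (sales['registration_date'] - prev_date).dt.days / 365)

print(sales.round(6))
```

Grouping by `xy` instead of `pin` would give the corresponding `xy_*` columns under the same assumptions.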


In [None]:
# `null_counts` was renamed to `show_counts` in pandas 1.2
df.info(show_counts=True)

## Records per `pin` vs records per `xy` pair
There are significantly more unique `pin` values than unique `xy` coordinate pairs, which appears reasonable: some `xy` coordinate pairs correspond to records with different `pin`s.
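
One way to check this is to compare the number of distinct values in each column and to count how many `xy` pairs are shared by more than one `pin`. A minimal sketch on made-up rows follows; the frame `toy`, its values and the string format of `xy` are hypothetical, and in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up rows: pins 101 and 102 share one coordinate pair.
toy = pd.DataFrame({
    'pin': [101, 101, 102, 103, 104],
    'xy': ['-79.4_43.7', '-79.4_43.7', '-79.4_43.7',
           '-79.5_43.8', '-79.6_43.9'],
})

n_pins = toy['pin'].nunique()
n_xy = toy['xy'].nunique()

# Number of distinct pins recorded at each coordinate pair.
pins_per_xy = toy.groupby('xy')['pin'].nunique()
shared_xy = pins_per_xy[pins_per_xy > 1]

print("unique pins:", n_pins)
print("unique xy pairs:", n_xy)
print("xy pairs with more than one pin:", len(shared_xy))
```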

In [None]:
pin_counts = df['pin'].value_counts()
xy_counts = df['xy'].value_counts()

f, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# figure-level title; plt.title() would only set the title of the last axes
f.suptitle("Histogram of records")

pin_counts.hist(bins=100, ax=axes[0])
axes[0].set_title("\nfor all pins ({0:,} pins total)"
                  .format(len(pin_counts)))
axes[0].set_ylabel("# of pins")
axes[0].set_xlabel("# of records")
axes[0].grid(linestyle=':')

xy_counts.hist(bins=100, ax=axes[1])
axes[1].set_title("\nfor all xy pairs ({0:,} pairs total)"
                  .format(len(xy_counts)))
axes[1].set_ylabel("# of xy pairs")
axes[1].set_xlabel("# of records")
axes[1].grid(linestyle=':')

plt.show()

In [None]:
max_counts = 10
pin_low = pin_counts[pin_counts < max_counts]
xy_low = xy_counts[xy_counts < max_counts]

f, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# figure-level title; plt.title() would only set the title of the last axes
f.suptitle("Histogram of records")

pin_low.hist(bins=20, ax=axes[0])
axes[0].set_title("\nfor pins with <{0} records"
                  "\n({1:,} pins, {2:.5f}% of the total)"
                  .format(max_counts, len(pin_low),
                          len(pin_low) / len(pin_counts) * 100))
axes[0].set_ylabel("# of pins")
axes[0].set_xlabel("# of records")
axes[0].grid(linestyle=':')

xy_low.hist(bins=20, ax=axes[1])
axes[1].set_title("\nfor xy pairs with <{0} records"
                  "\n({1:,} pairs, {2:.5f}% of the total)"
                  .format(max_counts, len(xy_low),
                          len(xy_low) / len(xy_counts) * 100))
axes[1].set_ylabel("# of xy pairs")
axes[1].set_xlabel("# of records")
axes[1].grid(linestyle=':')

plt.show()

### Plot distributions of days since last sale

In [None]:
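# `days_since_last_sale_pin` is not among the loaded columns; as an assumption,
# days are derived here from `pin_years_since_last_sale` (years = days / 365)
(df['pin_years_since_last_sale'] * 365).hist(bins=200)
plt.title("Days since last sale, by `pin`")
plt.xlabel("days since last sale")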

In [None]:
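# assumption: days derived from `xy_years_since_last_sale` (years = days / 365)
days_since_last_sale_xy = df['xy_years_since_last_sale'] * 365
days_since_last_sale_xy[days_since_last_sale_xy < 50].hist(bins=50)
plt.title("Days since last sale (under 50 days), by `xy`")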