# GTHA housing market database
# OSEMN methodology Step 1: Obtain
# Obtain the "adapter" table with temporal spans for various data sources

---

This notebook describes _Step 1: Obtain_ of the OSEMN methodology: the process of obtaining an "adapter" table that shows matching temporal spans for different data sources.

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

## Temporal matching of data sources in the GTHA housing market database

The proposed GTHA housing market database includes datasets whose variables are measured at different time intervals. Records in Teranet's dataset of real estate transactions are dated to the day, which can be treated as an approximately continuous measurement at the time scale of years and decades, given the annual frequency of Teranet records after 1985. Census and TTS surveys, on the other hand, are conducted at 5-year intervals, with no information available between survey years, so these measurements are distinctly discrete. As a result, temporal matching of Teranet data with other sources, such as Census or TTS tables, presents an additional challenge. The proposed GTHA housing market database addresses this challenge via an "adapter" table specifying the temporal spans to be used when matching various data sources.

## Teranet records and Census / TTS variables

Teranet and Census / TTS variables can be matched in a number of ways:

### Direct match with appropriate Teranet subsets

* use subsets of Teranet records from the Census / TTS years and match only with data directly recorded for that year 
* for example, take a subset of Teranet records from 2016 and match it with 2016 Census / TTS variables
* technically, any date span can be specified when creating a Teranet subset, but in this case, appropriate Census / TTS variables must be selected manually
    * benefits:
        * precision of match: variables from Census would be composed of the actual values produced by the survey, rather than an interpolation based on assumptions
        * flexibility of use: a new Census table can be added and its variables can be matched with appropriate Teranet records via simple SQL queries
    * disadvantages:
        * limited match: only Teranet data from Census years can be used; records from between Census years cannot be matched with the Census variables
        * hard to generalize SQL queries: need custom SQL queries to match data to different tables, several queries to match Teranet data from different Census years

### Interpolation of discrete Census / TTS variables

* discrete Census / TTS variables can be turned into continuous via interpolation
* Teranet records can be matched to both recorded and interpolated values by year, or at a finer time scale
    * benefits:
        * most Teranet records used: all Teranet records within the Census / TTS range can be used (within the interpolation region)
        * closest match: closest temporal match between Teranet records and Census / TTS variables
        * precise, if correctly assessed: when the interpolation assumptions are correct, this method gives the most precise match
    * disadvantages:
        * more assumptions: assumptions need to be made about the dynamics of each Census / TTS variable between Census years
        * inaccurate, if incorrectly assessed: if the assumptions are incorrect, there is a risk of lower accuracy compared with other matching methods
        * interpolated rather than recorded: Teranet values from non-Census years will be matched to variables that are interpolated rather than recorded
        * more data pre-processing needed: each Census / TTS variable needs to be processed in order to produce interpolated values
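
As a rough sketch of the interpolation approach, the snippet below fills the years between surveys for a single variable. The values are made up, and linear change between Census years is exactly the kind of assumption the list above warns about; other variables may call for a different model.

```python
import pandas as pd

# Hypothetical Census variable observed only in Census years (made-up values).
census = pd.Series({2006: 1000.0, 2011: 1100.0, 2016: 1250.0}, name='population')

# Reindex to every year in the range, then fill the gaps linearly.
# Assumption: the variable changes linearly between Census years.
yearly = census.reindex(range(2006, 2017)).interpolate(method='linear')
print(yearly)
```

Teranet records could then be joined to `yearly` by year, with the caveat that only the 2006, 2011, and 2016 values were actually recorded.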
        
### Use "adapter" table with temporal spans for each Census / TTS survey

* each Census / TTS survey can be assigned a temporal span of 5 years representing a group of Teranet records to which its variables can be matched
* Teranet records are matched by year: each year yields the Census or TTS variables from the survey whose temporal span contains it
* for example, the Census of 2016 would have a temporal span of 2014-2018, and thus a Teranet record from 2015 would be matched to variables from the 2016 Census. The Census of 1991 would have a temporal span of 1989-1993, and thus a Teranet record from 1993 would be matched with variables from the 1991 Census.
* TTS surveys would be matched in a similar manner
    * benefits:
        * most Teranet records used: all Teranet records that fall within the specified temporal spans can be matched to appropriate Census / TTS variables
        * recorded rather than interpolated: Teranet records are matched to actual recorded Census / TTS values
        * avoid interpolation assumptions: since no interpolation is performed, no additional assumptions are needed
        * no additional data pre-processing: all matching is done through an "adapter" table, original Census / TTS variables do not need to be changed
    * disadvantages:
        * step-change in Census / TTS variables: when matching Teranet records from non-Census years, the same Census / TTS variables are used for a group of 5 years centered on each Census year, instead of interpolated values
        * varying accuracy: the match is likely to be less accurate further away from the Census year (+/- 2 years). In addition, there is a step change between the last year of one span and the first year of the next (e.g., from 1998 to 1999)
        * need "adapter" table: an additional "adapter" table specifying the matching groups of Census / TTS variables to years of Teranet records needs to be added to the database to facilitate temporal integrity of the joining operations
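
The two-step join through an adapter table might look like the sketch below. The small `adapter` frame follows the +/- 2 year spans described above, while the Teranet and Census tables, their column names (`transaction_id`, `population`), and their values are hypothetical.

```python
import pandas as pd

# Small slice of an adapter table: 2009-2013 -> 2011 Census, 2014-2017 -> 2016 Census.
adapter = pd.DataFrame({
    'year': list(range(2009, 2018)),
    'census_year': [2011] * 5 + [2016] * 4,
})

# Hypothetical Teranet records reduced to a transaction id and a year.
teranet = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'year': [2012, 2015, 2017],
})

# Hypothetical Census variables, one row per Census year (made-up values).
census = pd.DataFrame({
    'census_year': [2011, 2016],
    'population': [1100, 1250],
})

# Step 1: look up the Census year for each Teranet record via the adapter.
# Step 2: attach the recorded Census variables for that Census year.
matched = (
    teranet
    .merge(adapter, on='year', how='left')
    .merge(census, on='census_year', how='left')
)
print(matched)
```

The original Census table is left untouched; only the adapter decides which survey a given year maps to.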

## Temporal spans of data sources

Temporal spans of data sources used in the proposed GTHA housing market database are presented in the table below (not including datasets prior to 1984):

<img src="../../../methodology/2.scrub/temporal_spans.png">

## Import dependencies

In [1]:
import pandas as pd
import os

## Construct the "adapter" table with temporal spans

### Add primary key: range of years in Teranet records

In [2]:
df = pd.DataFrame(data={'year': range(1865, 2018)})
df['year']

0      1865
1      1866
2      1867
3      1868
4      1869
       ... 
148    2013
149    2014
150    2015
151    2016
152    2017
Name: year, Length: 153, dtype: int64

### Add foreign key: Census years

In [3]:
min_year = 1971
max_year = 2017
census_years = list(range(min_year, max_year + 1, 5))
census_years

[1971, 1976, 1981, 1986, 1991, 1996, 2001, 2006, 2011, 2016]

In [5]:
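# Assign each Census year to the Teranet years within +/- 2 years of it;
# years outside every span keep a missing value (NaN)
for census_year in census_years:
    mask1 = census_year - 2 <= df['year']
    mask2 = df['year'] <= census_year + 2
    df.loc[mask1 & mask2, 'census_year'] = census_year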
    
df['census_year'].value_counts().sort_index()

1971.0    5
1976.0    5
1981.0    5
1986.0    5
1991.0    5
1996.0    5
2001.0    5
2006.0    5
2011.0    5
2016.0    4
Name: census_year, dtype: int64
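
The counts above list `census_year` as floats such as `1971.0` because years outside every span hold `NaN`, which forces a float column. If integer keys are preferred for later joins, one option is pandas' nullable `Int64` dtype. The sketch below rebuilds the column on a fresh frame so that it runs on its own; it is an optional variation, not part of the notebook's pipeline.

```python
import pandas as pd

# Rebuild the year range and Census spans as in the cells above.
df = pd.DataFrame({'year': range(1865, 2018)})
for census_year in range(1971, 2018, 5):
    in_span = df['year'].between(census_year - 2, census_year + 2)
    df.loc[in_span, 'census_year'] = census_year

# Convert from float64 (with NaN) to nullable integers (with <NA>).
df['census_year'] = df['census_year'].astype('Int64')

print(df['census_year'].dtype)
print(df.tail(3))
```

Note that saving to `.csv` and reading the file back with default settings would again produce a float column, so the conversion would need to be repeated after loading.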

### Add foreign key: TTS years

In [7]:
min_year = 1986
max_year = 2017
tts_years = list(range(min_year, max_year + 1, 5))
tts_years

[1986, 1991, 1996, 2001, 2006, 2011, 2016]

In [8]:
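# Assign each TTS year to the Teranet years within +/- 2 years of it;
# years outside every span keep a missing value (NaN)
for tts_year in tts_years:
    mask1 = tts_year - 2 <= df['year']
    mask2 = df['year'] <= tts_year + 2
    df.loc[mask1 & mask2, 'tts_year'] = tts_year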
    
df['tts_year'].value_counts().sort_index()

1986.0    5
1991.0    5
1996.0    5
2001.0    5
2006.0    5
2011.0    5
2016.0    4
Name: tts_year, dtype: int64
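
Since the Census and TTS cells repeat the same logic, the assignment could be factored into a helper. The function below, `add_survey_year`, and its `half_span` parameter are hypothetical names introduced for illustration; the sketch assumes the same +/- 2 year spans used above.

```python
import pandas as pd

def add_survey_year(df, survey_years, column, half_span=2):
    """Hypothetical helper: label each year with the survey year whose span contains it."""
    out = df.copy()
    out[column] = float('nan')  # years outside every span stay missing
    for survey_year in survey_years:
        in_span = out['year'].between(survey_year - half_span, survey_year + half_span)
        out.loc[in_span, column] = survey_year
    return out

years = pd.DataFrame({'year': range(1865, 2018)})
spans = add_survey_year(years, range(1971, 2018, 5), 'census_year')
spans = add_survey_year(spans, range(1986, 2018, 5), 'tts_year')
print(spans.notna().sum())
```

Adding another survey with a 5-year cycle would then take one extra call rather than a copied cell.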

## Save results to a .csv file

In [10]:
df.tail()

     year  census_year  tts_year
148  2013       2011.0    2011.0
149  2014       2016.0    2016.0
150  2015       2016.0    2016.0
151  2016       2016.0    2016.0
152  2017       2016.0    2016.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 3 columns):
year           153 non-null int64
census_year    49 non-null float64
tts_year       34 non-null float64
dtypes: float64(2), int64(1)
memory usage: 3.7 KB
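
Before saving, a few sanity checks on the adapter table may be worthwhile. The sketch below rebuilds the table so that it runs on its own and then verifies properties implied by the construction above; the specific checks are suggestions, not part of the original workflow.

```python
import pandas as pd

# Rebuild the adapter table with the same spans as in the cells above.
df = pd.DataFrame({'year': range(1865, 2018)})
for column, first_year in [('census_year', 1971), ('tts_year', 1986)]:
    for survey_year in range(first_year, 2018, 5):
        in_span = df['year'].between(survey_year - 2, survey_year + 2)
        df.loc[in_span, column] = survey_year

# The primary key must not contain duplicates.
assert df['year'].is_unique

for column in ['census_year', 'tts_year']:
    assigned = df.dropna(subset=[column])
    # Every assigned year lies within 2 years of its survey year.
    assert ((assigned['year'] - assigned[column]).abs() <= 2).all()
    # The assigned years are contiguous, so no year falls between two spans.
    assert assigned['year'].diff().dropna().eq(1).all()

print(df.notna().sum())
```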


In [11]:
misc_data_path = '../../../data/misc/'
os.listdir(misc_data_path)

['Infl_adjustment.xlsx',
 'temporal_spans.csv',
 'fuel_prices_StatsCan.xlsx',
 'Data_sources_years.xlsx',
 'fuel_price.csv']

In [12]:
save_path = os.path.join(misc_data_path, 'temporal_spans.csv')
df.to_csv(save_path, index=False)
print("DataFrame saved to file:\n", save_path)

DataFrame saved to file:
 ../../../data/misc/temporal_spans.csv
