# GTHA housing market database
# OSEMN methodology Step 1: Obtain
# Obtain Select TAZ-level TTS Variables

---

This notebook describes _Step 1: Obtain_ of the OSEMN methodology: the process of obtaining select TAZ-level TTS variables.

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

## Import dependencies

In [1]:
import pandas as pd
import os
from time import time

In [2]:
data_path = '../../../data/tts/'
os.listdir(data_path)

['tts_num_jobs_tidy.csv',
 'TAZ_2001shp.zip',
 'TAZ_2001shp',
 'taz_info.csv',
 'TTS_variables.xlsx',
 'taz_tts.xlsx',
 'Num_of_Jobs.xlsx']

## Load TAZ-level TTS variables

In [3]:
t = time()
df = pd.read_excel(data_path + 'taz_tts.xlsx')
elapsed = time() - t
print("\n----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)


----- DataFrame loaded
in 1.54 seconds
with 1,716 rows
and 42 columns
-- Column names:
 Index(['TAZ_O', 'Pop16', 'FT_wrk16', 'Stu16', 'HH16', 'Jobs16', 'Cars16',
       'Pop11', 'FT_wrk11', 'Stu11', 'HH11', 'Jobs11', 'Cars11', 'Pop06',
       'FT_wrk06', 'Stu06', 'HH06', 'Jobs06', 'Cars06', 'Pop01', 'FT_wrk01',
       'Stu01', 'HH01', 'Jobs01', 'Cars01', 'Pop96', 'FT_wrk96', 'HH96',
       'Stu96', 'Jobs96', 'Cars96', 'Pop91', 'FT_wrk91', 'Stu91', 'HH91',
       'Jobs91', 'Cars91', 'Pop86', 'FT_wrk86', 'Stu86', 'HH86', 'Cars86'],
      dtype='object')


In [5]:
df.columns.sort_values()

Index(['Cars01', 'Cars06', 'Cars11', 'Cars16', 'Cars86', 'Cars91', 'Cars96',
       'FT_wrk01', 'FT_wrk06', 'FT_wrk11', 'FT_wrk16', 'FT_wrk86', 'FT_wrk91',
       'FT_wrk96', 'HH01', 'HH06', 'HH11', 'HH16', 'HH86', 'HH91', 'HH96',
       'Jobs01', 'Jobs06', 'Jobs11', 'Jobs16', 'Jobs91', 'Jobs96', 'Pop01',
       'Pop06', 'Pop11', 'Pop16', 'Pop86', 'Pop91', 'Pop96', 'Stu01', 'Stu06',
       'Stu11', 'Stu16', 'Stu86', 'Stu91', 'Stu96', 'TAZ_O'],
      dtype='object')

## Convert data to tidy format
The same variables are stored in several columns corresponding to different TTS years, which violates one of the conditions of Tidy Data. To correct this, a new column `year` will be added and used, together with `taz_o`, as part of the **_primary key_** of the `taz_tts` table in the database. The DataFrame will be melted so that each variable is stored in a single column, with the column `year` identifying the TTS year of each observation.
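As a minimal sketch of this reshaping, the following melts a toy wide table for a single variable. The two-zone DataFrame and its values are invented for illustration; only the column naming pattern (`Pop86`, `Pop16`) follows the real table.

```python
import pandas as pd

# Toy wide table: one column per TTS year for the same variable (invented values).
wide = pd.DataFrame({
    'TAZ_O': [1, 2],
    'Pop86': [100, 200],
    'Pop16': [150, 250],
})

# Melt so that each row is one (zone, year) observation.
tidy = pd.melt(wide, id_vars='TAZ_O', var_name='year', value_name='Pop')

# Turn the column-name suffix into a four-digit year:
# suffixes 01, 06, 11, 16 belong to the 2000s, the rest to the 1900s.
yy = tidy['year'].str.slice(-2)
century = yy.isin(['01', '06', '11', '16']).map({True: '20', False: '19'})
tidy['year'] = (century + yy).astype(int)

tidy = tidy.sort_values(['TAZ_O', 'year']).reset_index(drop=True)
print(tidy)
```

Each (`TAZ_O`, `year`) pair appears exactly once in the result, which is what allows the pair to act as a primary key.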

### Desired shape of the DataFrame
There are 6 unique variables for each TTS year except 1986, for which the variable `Jobs` is not present. There are 7 separate TTS years: 1986, 1991, 1996, 2001, 2006, 2011, 2016.
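One possible way to find such gaps without inspecting each year by hand is to split the column names into a variable stub and a year suffix. This is only a sketch: the column list below is a small subset of the real names, and the `missing` dictionary is a name introduced here for illustration.

```python
import pandas as pd

# A subset of the real column names; no data is needed for this check.
columns = pd.Index(['TAZ_O', 'Pop16', 'Jobs16', 'Cars16',
                    'Pop91', 'Jobs91', 'Cars91',
                    'Pop86', 'Cars86'])

# Split each name into a variable stub and a two-digit year suffix.
# Names without a two-digit suffix (here TAZ_O) do not match and are dropped.
parts = columns.to_series().str.extract(r'^(?P<var>.+?)(?P<yy>\d{2})$').dropna()

all_vars = set(parts['var'])
# For each year suffix, list the variables that have no column.
missing = {yy: sorted(all_vars - set(group['var']))
           for yy, group in parts.groupby('yy')}
print(missing)
```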

In [32]:
mask1 = df.columns.str.contains('86')
cols = df.columns[mask1]
cols

Index(['Pop86', 'FT_wrk86', 'Stu86', 'HH86', 'Cars86'], dtype='object')

In [22]:
mask1 = df.columns.str.contains('01')
cols = df.columns[mask1]
cols

Index(['Pop01', 'FT_wrk01', 'Stu01', 'HH01', 'Jobs01', 'Cars01'], dtype='object')

In [29]:
mask1 = df.columns.str.contains('Jobs')
cols = df.columns[mask1]
cols.sort_values()

Index(['Jobs01', 'Jobs06', 'Jobs11', 'Jobs16', 'Jobs91', 'Jobs96'], dtype='object')

In [31]:
mask1 = df.columns.str.contains('Cars')
cols = df.columns[mask1]
cols.sort_values()

Index(['Cars01', 'Cars06', 'Cars11', 'Cars16', 'Cars86', 'Cars91', 'Cars96'], dtype='object')

#### All TTS variables in the original table

In [33]:
var_names = df.columns[df.columns.str.contains(r'\d')]
var_names = var_names.str.slice(stop=-2)
var_names = var_names.unique()
len(var_names)

6

In [34]:
var_names

Index(['Pop', 'FT_wrk', 'Stu', 'HH', 'Jobs', 'Cars'], dtype='object')

There are seven different TTS years:

In [37]:
mask1 = df.columns.str.contains('HH')
cols = df.columns[mask1]
cols.sort_values()

Index(['HH01', 'HH06', 'HH11', 'HH16', 'HH86', 'HH91', 'HH96'], dtype='object')

There are 1,716 Transportation Analysis Zones.

In [38]:
df['TAZ_O'].nunique()

1716

In [39]:
len(df)

1716

In [40]:
len(df) * 7

12012

This means that after melting we should get a table with 1,716 x 7 = 12,012 rows and 8 columns (`TAZ_O`, `year`, and 6 columns with TTS variables).
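The same arithmetic can be derived from the column names rather than typed in by hand. The sketch below uses a toy stand-in for the wide table (invented values, real naming pattern) and assumes every value column ends in a two-digit year suffix.

```python
import pandas as pd

# Toy stand-in for the wide table (invented values).
wide = pd.DataFrame({
    'TAZ_O': [1, 2, 3],
    'Pop86': [10, 20, 30], 'Pop91': [11, 21, 31],
    'Jobs91': [5, 6, 7],
})

value_cols = wide.columns.drop('TAZ_O')
n_years = value_cols.str.slice(-2).nunique()      # distinct year suffixes
n_vars = value_cols.str.slice(stop=-2).nunique()  # distinct variable stubs

# One row per (zone, year); id column + year column + one column per variable.
expected_shape = (len(wide) * n_years, 2 + n_vars)
print(expected_shape)
```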

### Melt all the variables in the table

In [47]:
i = 0
id_col = 'TAZ_O'

for var_name in var_names:

    # select a subset of columns containing different years of a TTS variable
    mask1 = df.columns.str.contains(r'{0}\d'.format(var_name))
    var_cols = df.columns[mask1]
    s = df[[id_col] + list(var_cols)]

    df_melt = pd.melt(s, id_vars=id_col, value_name=var_name)
    
    # convert the two-digit column suffix to a four-digit year
    df_melt['variable'] = df_melt['variable'].str.slice(-2)
    mask1 = df_melt['variable'].isin(['01', '06', '11', '16'])
    df_melt.loc[mask1, 'variable'] = '20' + df_melt.loc[mask1, 'variable']
    df_melt.loc[~mask1, 'variable'] = '19' + df_melt.loc[~mask1, 'variable']
    df_melt = df_melt.rename(columns={'variable': 'year'})
    df_melt['year'] = df_melt['year'].astype('int')
    
    # the first variable starts the tidy table; the rest are merged onto it
    if i == 0:
        df_tidy = df_melt.sort_values(id_col)
    else:
        df_tidy = pd.merge(df_tidy, df_melt, how='left',
                           on=[id_col, 'year'])
    i += 1

df_tidy.sort_values([id_col, 'year'])

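|       | TAZ_O | year | Pop  | FT_wrk | Stu  | HH     | Jobs   | Cars |
|-------|-------|------|------|--------|------|--------|--------|------|
| 1     | 1     | 1986 | 4885 | 2322   | 2943 | 1889.0 |        | 2295 |
| 5     | 1     | 1991 | 5396 | 2790   | 744  | 2233.0 | 1211.0 | 2977 |
| 3     | 1     | 1996 | 4967 | 2257   | 3027 | 2177.0 | 888.0  | 2750 |
| 2     | 1     | 2001 | 3932 | 1820   | 1376 | 1829.0 | 1509.0 | 2151 |
| 4     | 1     | 2006 | 4607 | 2086   | 1316 | 1931.0 | 1142.0 | 2261 |
| ...   | ...   | ...  | ...  | ...    | ...  | ...    | ...    | ...  |
| 12008 | 2670  | 1996 | 42   | 21     | 0    | 21.0   | 2829.0 | 21   |
| 12010 | 2670  | 2001 | 143  | 41     | 328  | 61.0   | 2354.0 | 81   |
| 12007 | 2670  | 2006 | 0    | 0      | 0    |        | 2797.0 | 0    |
| 12009 | 2670  | 2011 | 0    | 0      | 0    |        | 2965.0 | 0    |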
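For comparison, pandas also provides `pd.wide_to_long`, which can reshape several stub-named variables in one call. This is not the approach used in this notebook, only a sketch of an alternative on a toy table with invented values; the intermediate column name `yy` is an assumption made here.

```python
import pandas as pd

# Toy wide table (invented values); Jobs has no 1986 column, as in the real data.
wide = pd.DataFrame({
    'TAZ_O': [1, 2],
    'Pop86': [100, 200], 'Pop16': [150, 250],
    'Jobs16': [40, 50],
})

# Reshape all variables at once; 'yy' receives the two-digit suffix.
tidy = pd.wide_to_long(wide, stubnames=['Pop', 'Jobs'],
                       i='TAZ_O', j='yy', sep='', suffix=r'\d{2}')
tidy = tidy.reset_index()

# Expand the suffix to a four-digit year: suffixes above 50 are 19xx, the rest 20xx.
yy = tidy['yy'].astype(int)
tidy['year'] = yy.where(yy > 50, yy + 100) + 1900
tidy = (tidy.drop(columns='yy')
            .sort_values(['TAZ_O', 'year'])
            .reset_index(drop=True))
print(tidy)
```

Variables that lack a column for some year come out as missing values for that year, which matches the behaviour of the left merges in the loop above.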


#### Validate the shape of the melted DataFrame

In [48]:
df.shape

(1716, 42)

In [49]:
df_tidy.shape

(12012, 8)
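Beyond comparing shapes by eye, a few assertions can make the validation explicit. The sketch below runs on a toy tidy table with invented values standing in for `df_tidy`; the counts `n_zones`, `n_years`, and `n_vars` are set by hand for the toy data.

```python
import pandas as pd

# Toy tidy table (invented values) standing in for df_tidy.
tidy = pd.DataFrame({
    'TAZ_O': [1, 1, 2, 2],
    'year': [1986, 1991, 1986, 1991],
    'Pop': [10, 11, 20, 21],
    'Jobs': [None, 5.0, None, 6.0],
})
n_zones, n_years, n_vars = 2, 2, 2

# Shape: one row per (zone, year); id + year + one column per variable.
assert tidy.shape == (n_zones * n_years, 2 + n_vars)

# (TAZ_O, year) must be unique to serve as a primary key.
assert not tidy.duplicated(subset=['TAZ_O', 'year']).any()

# Jobs has no 1986 column in the wide table, so it should be missing for 1986.
jobs_1986_all_missing = tidy.loc[tidy['year'] == 1986, 'Jobs'].isna().all()
print(jobs_1986_all_missing)
```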

## Save results to a .csv file

In [50]:
save_path = data_path + 'taz_tts_tidy.csv'
t = time()
df_tidy.sort_values(['TAZ_O', 'year']).to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))

DataFrame saved to file:
 ../../../data/tts/taz_tts_tidy.csv 
took 0.26 seconds
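A quick round-trip check can confirm that a saved file reads back unchanged. The sketch below uses a toy table with invented values and a hypothetical file name in a temporary directory, so it does not touch the real `taz_tts_tidy.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy tidy table (invented values) standing in for df_tidy.
tidy = pd.DataFrame({
    'TAZ_O': [1, 1],
    'year': [1986, 1991],
    'Jobs': [None, 5.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary directory.
    path = os.path.join(tmp_dir, 'tidy_check.csv')
    tidy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare values and dtypes; missing entries in the same positions count as equal.
round_trip_ok = reloaded.equals(tidy)
print(round_trip_ok)
```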
