# GTHA housing market database
# OSEMN methodology Step 1: Obtain
# Obtain Age, Education and Employment TAZ-level Census Variables

---

This notebook describes _Step 1: Obtain_ of the OSEMN methodology: selecting TAZ-level Census variables and converting them to a tidy format.

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

## Import dependencies

In [1]:
import pandas as pd
import os
from time import time

In [2]:
data_path = '../../../data/census/taz_level_vars/'
os.listdir(data_path)

['taz_census_housing_monetaryvars.xlsx',
 'taz_census_age_edu_employment.xlsx',
 'da_census_housing_monetary_tidy.csv']

## Load TAZ-level Census variables

In [3]:
t = time()
df = pd.read_excel(data_path + 'taz_census_age_edu_employment.xlsx')
elapsed = time() - t
print("\n----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)


----- DataFrame loaded
in 4.16 seconds
with 1,716 rows
and 252 columns
-- Column names:
 Index(['TAZ_O', 'Pop16', 'Pop_15pl16', 'Akid16', 'Ayad16', 'Amid16', 'Asen16',
       'Hi_sch16', 'Posec_dip16', 'Posec_deg16',
       ...
       'Agri71', 'Min71', 'Manu71', 'Cons71', 'Trans71', 'Trad71', 'Fin71',
       'Com71', 'Gov71', 'Oth71'],
      dtype='object', length=252)


## Convert data to tidy format
The same variables are stored in several columns, one per Census year, which violates one of the conditions of Tidy Data. To correct this, a new column `year` will be added and used as part of a **_primary key_** along with `TAZ_O` when the table is stored in the database. The DataFrame will be melted so that each variable is stored in a single column, with the Census year recorded in the column `year`.
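To illustrate the wide-to-tidy conversion, here is a minimal sketch on a toy table. The values are made up, and only the `Pop11`/`Pop16` naming pattern is borrowed from the real data.

```python
import pandas as pd

# Toy wide table: one row per zone, one column per variable-year pair.
# The values are invented for illustration only.
wide = pd.DataFrame({
    'TAZ_O': [1, 2],
    'Pop11': [100, 200],
    'Pop16': [110, 210],
})

# Turn the year-specific columns into rows.
tidy = pd.melt(wide, id_vars='TAZ_O', var_name='year', value_name='Pop')

# Keep the two-digit suffix; both suffixes here belong to the 2000s.
tidy['year'] = ('20' + tidy['year'].str.slice(-2)).astype(int)
tidy = tidy.sort_values(['TAZ_O', 'year']).reset_index(drop=True)
print(tidy)
```

Each (`TAZ_O`, `year`) pair now appears exactly once, which is what allows the pair to act as a primary key.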

### Desired shape of the DataFrame
There are 31 variables for the 2001 Census year.

In [4]:
mask1 = df.columns.str.contains('01')
cols = df.columns[mask1]
len(cols)

31

#### All Census variables in the original table

In [5]:
var_names = df.columns[df.columns.str.contains(r'\d')]
var_names = var_names.str.slice(stop=-2)
var_names = var_names.unique()
len(var_names)

33

In [6]:
var_names

Index(['Pop', 'Pop_15pl', 'Akid', 'Ayad', 'Amid', 'Asen', 'Hi_sch',
       'Posec_dip', 'Posec_deg', 'Avg_HHsize', 'Lbrfrc', 'Emp', 'Unemp',
       'Employee', 'Self_emp', 'WfH', 'No_fix_wkpl', 'Usl_wkpl', 'WrkCSD_res',
       'WrkCSD_diff', 'Agri', 'Min', 'Manu', 'Cons', 'Tran', 'Trad', 'Fin',
       'Com', 'Busi', 'Gov', 'Oth', 'Trans', 'pop'],
      dtype='object')
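The list above contains near-duplicates such as `Tran`/`Trans` and `Pop`/`pop`, which may be the same variable named inconsistently across Census years. As a rough sketch (not part of the original workflow), column names can be split into a variable name and a two-digit suffix to see which years each name covers. The column list below is a small hand-picked sample, not the full table.

```python
import pandas as pd

# Small sample of column names taken from the table above.
cols = pd.Index(['TAZ_O', 'Pop16', 'Pop_15pl16', 'Pop_15pl01', 'Trans71'])

# Split each name into a variable part and a trailing two-digit suffix.
# Names without such a suffix (e.g. TAZ_O) yield NaN and are dropped.
parts = cols.str.extract(r'^(?P<var>.+?)(?P<yy>\d{2})$').dropna()

# For each variable name, list the suffixes that occur.
coverage = parts.groupby('var')['yy'].apply(list).to_dict()
print(coverage)
```

Running the same split on `df.columns` would show, for each name, which Census years it covers, making any spelling variants easy to spot.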

There are nine different Census years (shown here for `Pop_15pl`):

In [9]:
mask1 = df.columns.str.contains('Pop_15pl')
cols = df.columns[mask1]
cols.sort_values()

Index(['Pop_15pl01', 'Pop_15pl06', 'Pop_15pl11', 'Pop_15pl16', 'Pop_15pl71',
       'Pop_15pl81', 'Pop_15pl86', 'Pop_15pl91', 'Pop_15pl96'],
      dtype='object')
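The two-digit suffixes span two centuries (`71` to `96` and `01` to `16`). A minimal sketch of mapping them to four-digit years follows. The helper `suffix_to_year` and its `pivot` parameter are hypothetical, and it assumes that suffixes up to `16` belong to the 2000s and all others to the 1900s.

```python
import pandas as pd

def suffix_to_year(suffix, pivot=16):
    """Map a two-digit Census suffix to a four-digit year.

    Assumption: suffixes up to `pivot` are 20xx, all others are 19xx.
    """
    yy = int(suffix)
    return 2000 + yy if yy <= pivot else 1900 + yy

# Suffixes of the Pop_15pl columns listed above.
suffixes = pd.Series(['01', '06', '11', '16', '71', '81', '86', '91', '96'])
years = suffixes.map(suffix_to_year)
print(years.tolist())
```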

There are 1'716 zones (one row per `TAZ_O`).

In [10]:
df.index.nunique()

1716

In [11]:
len(df)

1716

In [12]:
len(df) * 9

15444

That means that after melting we should get a table with 1'716 x 9 = 15'444 rows and 35 columns (`TAZ_O`, `year`, and 33 columns with Census variables).

### Melt all the variables in the table

In [13]:
for i, var_name in enumerate(var_names):

    # select the columns holding the different Census years of this variable
    # (raw string so that \d reaches the regex engine unchanged)
    mask1 = df.columns.str.contains(r'{0}\d'.format(var_name))
    var_cols = df.columns[mask1]
    s = df[['TAZ_O'] + list(var_cols)]

    # one row per (TAZ_O, year) pair, with the variable in a single column
    df_melt = pd.melt(s, id_vars='TAZ_O', value_name=var_name)
    # NOTE: every two-digit suffix is prefixed with '20', so suffixes 71-96
    # are labelled 2071-2096 rather than 1971-1996
    df_melt['variable'] = '20' + df_melt['variable'].str.slice(-2)
    df_melt = df_melt.rename(columns={'variable': 'year'})
    df_melt['year'] = df_melt['year'].astype('int')
    if i == 0:
        df_tidy = df_melt.sort_values('TAZ_O')
    else:
        # left merge: only (TAZ_O, year) pairs of the first variable are kept
        df_tidy = pd.merge(df_tidy, df_melt, how='left', on=['TAZ_O', 'year'])

df_tidy.sort_values(['TAZ_O', 'year'])

Unnamed: 0,TAZ_O,year,Pop,Pop_15pl,Akid,Ayad,Amid,Asen,Hi_sch,Posec_dip,...,Cons,Tran,Trad,Fin,Com,Busi,Gov,Oth,Trans,pop
2,1,2001,4159,3316.0,659.0,685.0,2312.0,520.0,505.0,915.0,...,116.0,,437.0,295.0,727.0,336.0,196.0,110.0,176.0,
7,1,2006,3842,3007.0,480.0,600.0,1945.0,445.0,757.0,905.0,...,117.0,,310.0,236.0,720.0,309.0,137.0,85.0,127.0,
1,1,2011,3984,3157.0,479.0,603.0,1998.0,531.0,711.0,886.0,...,150.0,,351.0,102.0,381.0,222.0,125.0,30.0,80.0,
0,1,2016,4228,3355.0,492.0,662.0,2142.0,620.0,934.0,929.0,...,145.0,175.0,394.0,188.0,868.0,355.0,192.0,132.0,,
8,1,2071,4278,3285.0,,,,,3116.0,48.0,...,102.0,,491.0,123.0,455.0,,97.0,181.0,239.0,
6,1,2076,4001,,,,,,,,...,,,,,,,,,,
5,1,2081,4009,2243.0,,,,,286.0,439.0,...,87.0,,222.0,138.0,356.0,,61.0,36.0,181.0,
4,1,2086,3402,1781.0,318.0,799.0,797.0,252.0,241.0,622.0,...,52.0,,271.0,73.0,264.0,73.0,34.0,80.0,143.0,
3,1,2091,4243,2437.0,656.0,1019.0,2212.0,473.0,370.0,593.0,...,70.0,,295.0,179.0,334.0,109.0,119.0,85.0,208.0,
9,2,2001,5545,4346.0,909.0,911.0,2739.0,979.0,660.0,1329.0,...,226.0,,543.0,286.0,752.0,390.0,266.0,213.0,203.0,
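As an alternative to melting and merging one variable at a time, `pd.wide_to_long` can reshape several variables in one call. The sketch below uses a toy table with made-up values and is not the method used in this notebook. It assumes every column name is a variable name followed directly by a two-digit suffix.

```python
import pandas as pd

# Toy wide table with two variables and two Census years (invented values).
wide = pd.DataFrame({
    'TAZ_O': [1, 2],
    'Pop11': [100, 200],
    'Pop16': [110, 210],
    'Emp11': [50, 90],
    'Emp16': [55, 95],
})

# Reshape all stub names at once; the suffix becomes the `year` index level.
tidy = pd.wide_to_long(wide, stubnames=['Pop', 'Emp'],
                       i='TAZ_O', j='year', sep='', suffix=r'\d{2}')
tidy = tidy.reset_index()

# Both suffixes in this toy table belong to the 2000s.
tidy['year'] = tidy['year'].astype(int) + 2000
tidy = tidy.sort_values(['TAZ_O', 'year']).reset_index(drop=True)
print(tidy)
```

Because each stub name is matched against the whole column name, a stub such as `Pop` should not pick up columns like `Pop_15pl16`. As far as I can tell, the per-variable pieces are combined with an outer join, so years missing from the first variable would be kept rather than dropped.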


#### Validate the shape of the melted DataFrame

In [14]:
df.shape

(1716, 252)

In [15]:
df_tidy.shape

(15444, 35)
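The shape check can also be written as an explicit assertion, together with a check that (`TAZ_O`, `year`) is unique. The helper `check_tidy` below is a hypothetical sketch, demonstrated on a toy table with made-up values.

```python
import pandas as pd

def check_tidy(tidy, n_zones, n_years, n_vars, key=('TAZ_O', 'year')):
    """Raise AssertionError if the tidy table has an unexpected shape
    or if the key columns do not identify each row uniquely."""
    key = list(key)
    expected = (n_zones * n_years, n_vars + len(key))
    assert tidy.shape == expected, \
        'expected {0}, got {1}'.format(expected, tidy.shape)
    assert not tidy.duplicated(subset=key).any(), 'duplicate key rows'
    return expected

# Toy tidy table: 2 zones x 2 years, 1 variable.
toy = pd.DataFrame({
    'TAZ_O': [1, 1, 2, 2],
    'year': [2011, 2016, 2011, 2016],
    'Pop': [100, 110, 200, 210],
})
shape = check_tidy(toy, n_zones=2, n_years=2, n_vars=1)
print(shape)
```

For the table in this notebook, the call would be `check_tidy(df_tidy, n_zones=1716, n_years=9, n_vars=33)`, which expects a shape of `(15444, 35)`.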

## Save results to a .csv file

In [16]:
save_path = data_path + 'taz_census_age_edu_employment_tidy.csv'
t = time()
df_tidy.sort_values(['TAZ_O', 'year']).to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))

DataFrame saved to file:
 ../../../data/census/taz_level_vars/taz_census_age_edu_employment_tidy.csv 
took 0.45 seconds
