# GTHA housing market database
# OSEMN methodology Step 1: Obtain
# Obtain Housing Monetary TAZ-level Census Variables

---

This notebook describes _Step 1: Obtain_ of the OSEMN methodology: selecting housing and monetary Census variables, here aggregated to Traffic Analysis Zones (TAZs).

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

## Import dependencies

In [1]:
import pandas as pd
import os
from time import time

In [2]:
data_path = '../../../data/census/taz_level_vars/'
os.listdir(data_path)

['taz_census_housing_monetaryvars.xlsx',
 'taz_census_age_edu_employment.xlsx',
 'da_census_housing_monetary_tidy.csv',
 'taz_census_age_edu_employment_tidy.csv']

## Load TAZ-level Census variables

In [3]:
t = time()
df = pd.read_excel(data_path + 'taz_census_housing_monetaryvars.xlsx')
elapsed = time() - t
print("\n----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)


----- DataFrame loaded
in 2.13 seconds
with 1,716 rows
and 125 columns
-- Column names:
 Index(['TAZ_O', 'Avg_own_payt16', 'Avg_own_paytinf16', 'Avg_val_dwel16',
       'Avg_val_dwelinf16', 'Avg_rent16', 'Avg_rentinf16', 'Avg_HHinc16',
       'Avg_HHincinf16', 'Med_HHincinf16',
       ...
       'Owned81', 'Rented81', 'Avg_HHinc71', 'Avg_HHincinf71', 'Dwel71',
       'Sgl_det71', 'Apt_5plus71', 'Sgl_att71', 'Owned71', 'Rented71'],
      dtype='object', length=125)


## Convert data to tidy format
The same variable is stored in several columns, one per Census year, which violates one of the conditions of Tidy Data (each variable forms a single column). To correct this, a new column `year` is added; together with the zone identifier (`TAZ_O` in this file, the counterpart of `dauid` in the `da_census_select` database table) it forms a **_primary key_**. The DataFrame is melted variable by variable so that each variable is stored in a single column, with the Census year given by `year`.
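
As a minimal sketch of the wide-to-tidy conversion described above, the toy table below (hypothetical values, not taken from the Census file) is melted so that each row holds one `(TAZ_O, year)` observation. It assumes every suffix belongs to the 2000s, which holds for this toy data only.

```python
import pandas as pd

# Hypothetical wide table: one column per (variable, two-digit Census year).
wide = pd.DataFrame({
    "TAZ_O": [1, 2],
    "Avg_rent11": [930.0, 710.0],
    "Avg_rent16": [1080.0, 750.0],
})

# Melt so that each row is one (TAZ_O, year) observation.
tidy = wide.melt(id_vars="TAZ_O", var_name="year", value_name="Avg_rent")

# Keep only the two-digit suffix; both years in this toy table are in the 2000s.
tidy["year"] = 2000 + tidy["year"].str[-2:].astype(int)
tidy = tidy.sort_values(["TAZ_O", "year"]).reset_index(drop=True)
print(tidy)
```

After melting, `TAZ_O` and `year` together identify each row, which is what makes them usable as a composite key.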

### Desired shape of the DataFrame
There are 15 variables recorded for the 2001 Census (columns whose names contain `01`):

In [4]:
mask1 = df.columns.str.contains('01')
cols = df.columns[mask1]
len(cols)

15

#### All Census variables in the original table

In [5]:
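# columns whose names contain a digit, i.e. every column except 'TAZ_O'
var_names = df.columns[df.columns.str.contains(r'\d')]
# drop the two-digit year suffix and keep each variable name once
var_names = var_names.str.slice(stop=-2)
var_names = var_names.unique()
len(var_names)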

15

In [6]:
var_names

Index(['Avg_own_payt', 'Avg_own_paytinf', 'Avg_val_dwel', 'Avg_val_dwelinf',
       'Avg_rent', 'Avg_rentinf', 'Avg_HHinc', 'Avg_HHincinf', 'Med_HHincinf',
       'Dwel', 'Sgl_det', 'Apt_5plus', 'Sgl_att', 'Owned', 'Rented'],
      dtype='object')

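Seven Census years are available for `Avg_own_paytinf` (1986 through 2016); the `71` and `81` suffixes appear only on some of the other variables: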

In [7]:
mask1 = df.columns.str.contains('Avg_own_paytinf')
cols = df.columns[mask1]
cols.sort_values()

Index(['Avg_own_paytinf01', 'Avg_own_paytinf06', 'Avg_own_paytinf11',
       'Avg_own_paytinf16', 'Avg_own_paytinf86', 'Avg_own_paytinf91',
       'Avg_own_paytinf96'],
      dtype='object')

There are 1,716 zones (TAZs), one row each:

In [9]:
df.index.nunique()

1716

In [10]:
len(df)

1716

In [11]:
len(df) * 9

15444

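The table covers nine Census years in total, so 15,444 rows is an upper bound. The melt loop below left-joins every variable onto the first one, `Avg_own_payt`, which has seven years, so after melting we should get a table with 1,716 x 7 = 12,012 rows and 17 columns (`TAZ_O`, `year`, and 15 columns with Census variables). Values for years that `Avg_own_payt` lacks (the `71` and `81` columns of other variables) are dropped by the left join.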
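
The expected shape can also be derived from the column names rather than counted by hand. The helper below is a sketch, not part of the original notebook: `expected_tidy_shape` is a hypothetical name, and it assumes that every variable column ends in a two-digit year suffix, that the identifier column is unique per row, and that all variables are left-joined onto the first one.

```python
def expected_tidy_shape(columns, n_zones, id_col="TAZ_O"):
    """Expected (rows, columns) of the tidy table when every variable
    is left-joined onto the first variable found in `columns`."""
    years_by_var = {}
    for col in columns:
        # Skip the identifier and anything without a two-digit year suffix.
        if col == id_col or not col[-2:].isdigit():
            continue
        years_by_var.setdefault(col[:-2], set()).add(col[-2:])
    first_var = next(iter(years_by_var))
    n_rows = n_zones * len(years_by_var[first_var])
    n_cols = 2 + len(years_by_var)  # id column + year + one column per variable
    return n_rows, n_cols

# Hypothetical column names, for illustration only.
toy_columns = ["TAZ_O", "Avg_rent11", "Avg_rent16", "Dwel71", "Dwel11", "Dwel16"]
shape = expected_tidy_shape(toy_columns, n_zones=3)
print(shape)
```

Under the same assumptions, calling `expected_tidy_shape(df.columns, len(df))` on the loaded DataFrame would be expected to return the row and column counts worked out above.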

### Melt all the variables in the table

In [12]:
for i, var_name in enumerate(var_names):

    # select the columns holding the different Census years of this variable
    mask1 = df.columns.str.contains(r'{0}\d'.format(var_name))
    var_cols = df.columns[mask1]
    s = df[['TAZ_O'] + list(var_cols)]

    # melt to long format: one row per (TAZ_O, year)
    df_melt = pd.melt(s, id_vars='TAZ_O', value_name=var_name)
    # NOTE: prefixing '20' to the two-digit suffix labels the 1986, 1991 and 1996
    # Censuses as 2086, 2091 and 2096 (visible in the output below)
    df_melt['variable'] = '20' + df_melt['variable'].str.slice(-2)
    df_melt = df_melt.rename(columns={'variable': 'year'})
    df_melt['year'] = df_melt['year'].astype('int')
    if i == 0:
        df_tidy = df_melt.sort_values('TAZ_O')
    else:
        # left join: only (TAZ_O, year) pairs present for the first variable are kept
        df_tidy = pd.merge(df_tidy, df_melt, how='left', on=['TAZ_O', 'year'])

df_tidy.sort_values(['TAZ_O', 'year'])

Unnamed: 0,TAZ_O,year,Avg_own_payt,Avg_own_paytinf,Avg_val_dwel,Avg_val_dwelinf,Avg_rent,Avg_rentinf,Avg_HHinc,Avg_HHincinf,Med_HHincinf,Dwel,Sgl_det,Apt_5plus,Sgl_att,Owned,Rented
2,1,2001,1009.714286,1318.989771,230989.428571,301741.490543,793.785714,1036.922279,60338.500000,78820.182550,66281.102157,1851.0,609.0,298.0,964.0,982.0,877.0
4,1,2006,1210.785714,1428.121750,359926.785714,424533.643750,889.000000,1048.575500,67891.571429,80078.108500,63148.071000,1641.0,477.0,283.0,870.0,886.0,753.0
6,1,2011,1314.875000,1404.286500,443741.187500,473915.588250,931.437500,994.775250,79507.937500,84914.477250,64161.168000,1744.0,487.0,323.0,910.0,1032.0,720.0
0,1,2016,1675.812500,1675.812500,733765.875000,733765.875000,1080.437500,1080.437500,94044.500000,94044.500000,68736.000000,1891.0,496.0,462.0,935.0,961.0,909.0
1,1,2086,535.000000,1044.052500,101758.333333,198581.387500,492.000000,960.138000,33165.888889,64723.232167,67771.692000,914.0,334.0,147.0,436.0,389.0,530.0
5,1,2091,841.333333,1303.982533,229163.333333,355180.250333,742.222222,1150.370222,46928.888889,72735.084889,66278.373700,1233.0,423.0,281.0,527.0,677.0,553.0
3,1,2096,964.111111,1393.719022,211851.444444,306252.448089,724.222222,1046.935644,47880.222222,69215.649244,59774.114400,1257.0,380.0,281.0,595.0,676.0,581.0
8,2,2001,909.300000,1187.818590,235829.450000,308064.010535,665.450000,869.277335,70680.200000,92329.545260,82881.338620,2086.0,1589.0,62.0,421.0,1544.0,525.0
12,2,2006,1153.050000,1360.022475,350297.050000,413175.370475,705.600000,832.255200,78954.200000,93126.478900,80224.872000,2068.0,1338.0,81.0,665.0,1673.0,398.0
10,2,2011,1280.454545,1367.525455,479115.772727,511695.645273,713.954545,762.503455,82225.000000,87816.300000,76228.500000,2152.0,1402.0,84.0,671.0,1758.0,379.0
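
The years 2086, 2091 and 2096 in the output above come from prefixing every two-digit suffix with `20`. One possible way to recover four-digit years is sketched below. `expand_census_year` is a hypothetical helper, and its pivot of 21 is an assumption: suffixes up to the pivot are read as 20xx, larger ones as 19xx.

```python
def expand_census_year(suffix, pivot=21):
    """Map a two-digit Census suffix to a four-digit year.

    Suffixes up to `pivot` are read as 20xx, larger ones as 19xx.
    The pivot value is an assumption chosen for the suffixes in this table.
    """
    yy = int(suffix)
    return 2000 + yy if yy <= pivot else 1900 + yy

suffixes = ["71", "81", "86", "91", "96", "01", "06", "11", "16"]
years = [expand_census_year(s) for s in suffixes]
print(years)
```

Inside the melt loop, the same helper could be applied with `df_melt['variable'].str.slice(-2).map(expand_census_year)` in place of the string concatenation. The recorded outputs and the saved file in this notebook were produced without that change.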


#### Validate the shape of the melted DataFrame

In [13]:
df.shape

(1716, 125)

In [14]:
df_tidy.shape

(12012, 17)
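
Beyond comparing shapes, the composite key can be checked directly. The sketch below uses a hypothetical helper, `check_tidy`, and toy data; it assumes the key columns are named `TAZ_O` and `year`, as in the melted table.

```python
import pandas as pd

def check_tidy(df_tidy, n_zones, key=("TAZ_O", "year")):
    """Report whether the key is unique and whether every zone has every year."""
    key = list(key)
    n_years = df_tidy["year"].nunique()
    return {
        "unique_key": not df_tidy.duplicated(subset=key).any(),
        "complete": len(df_tidy) == n_zones * n_years,
    }

# Toy tidy tables, for illustration only.
ok = pd.DataFrame({"TAZ_O": [1, 1, 2, 2], "year": [2011, 2016, 2011, 2016]})
dup = pd.concat([ok, ok.iloc[[0]]], ignore_index=True)

result_ok = check_tidy(ok, n_zones=2)
result_dup = check_tidy(dup, n_zones=2)
print(result_ok)
print(result_dup)
```

For the notebook's data the equivalent call would be `check_tidy(df_tidy, n_zones=len(df))`.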

## Save results to a .csv file

In [15]:
save_path = data_path + 'taz_census_housing_monetary_tidy.csv'
t = time()
df_tidy.sort_values(['TAZ_O', 'year']).to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))

DataFrame saved to file:
 ../../../data/census/taz_level_vars/taz_census_housing_monetary_tidy.csv 
took 0.24 seconds
