# GTHA housing market database
# OSEMN methodology Step 1: Obtain
# Obtain TTS number-of-jobs data

---

This notebook describes _Step 1: Obtain_ of the OSEMN methodology: the process of obtaining TTS number-of-jobs data.

---

For a description of the OSEMN methodology, see `methodology/0.osemn/osemn.pdf`.

## Import dependencies

In [1]:
import pandas as pd
import os
from time import time

In [2]:
data_path = '../../data/tts/'
os.listdir(data_path)

['TTS_variables.xlsx', 'Num_of_Jobs.xlsx']

## Load number of jobs by Transportation Analysis Zone (TAZ)

In [3]:
t = time()
df = pd.read_excel(data_path + 'Num_of_Jobs.xlsx')
df = df.rename(columns={'TAZ ID':'taz_id'})
elapsed = time() - t
print("\n----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)


----- DataFrame loaded
in 0.43 seconds
with 1,716 rows
and 7 columns
-- Column names:
 Index(['taz_id', 1991, 1996, 2001, 2006, 2011, 2016], dtype='object')


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716 entries, 0 to 1715
Data columns (total 7 columns):
taz_id    1716 non-null int64
1991      1470 non-null float64
1996      1594 non-null float64
2001      1660 non-null float64
2006      1675 non-null float64
2011      1678 non-null float64
2016      1667 non-null float64
dtypes: float64(6), int64(1)
memory usage: 94.0 KB


## Convert data to tidy format
The `num_jobs` variable is spread across several columns, one per TTS year, which violates the Tidy Data condition that each variable forms a single column. To correct this, the DataFrame is melted so that `num_jobs` is stored in a single column, with a new column `year` identifying the TTS year. Together, `taz_id` and `year` form the **_primary key_** of the `tts_num_jobs` table in the database.
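As a minimal sketch of the wide-to-tidy conversion described above, the following melts a small toy table. The values and the name `wide` are invented for illustration and are not taken from `Num_of_Jobs.xlsx`; only the column layout (`taz_id` plus one column per year) mirrors the real data.

```python
import pandas as pd

# Hypothetical toy data: two zones, two survey years (values are made up).
wide = pd.DataFrame({
    'taz_id': [1, 2],
    1991: [10.0, 20.0],
    1996: [11.0, None],   # missing values survive the melt as NaN
})

# One row per (taz_id, year) pair; the year columns collapse into `num_jobs`.
tidy = (
    pd.melt(wide, id_vars='taz_id', var_name='year', value_name='num_jobs')
    .sort_values(['taz_id', 'year'])
    .reset_index(drop=True)
)
print(tidy)
```

Each year column of the wide table contributes one row per zone, so 2 zones x 2 years gives 4 rows in the tidy table.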

### Desired shape of the DataFrame
There is 1 variable with 6 values for each `taz_id`, one per TTS year.

In [14]:
df.columns

Index(['taz_id', 1991, 1996, 2001, 2006, 2011, 2016], dtype='object')

There are 1'716 unique Transportation Analysis Zones (TAZ):

In [15]:
df['taz_id'].nunique()

1716

In [16]:
len(df)

1716

In [17]:
len(df) * 6

10296

This means that after melting we should get a table with 1'716 x 6 = 10'296 rows and 3 columns (`taz_id`, `year`, and `num_jobs`).

In [13]:
df_tidy = pd.melt(df, id_vars='taz_id', var_name='year', value_name='num_jobs').sort_values(['taz_id', 'year'])
df_tidy.head()

Unnamed: 0,taz_id,year,num_jobs
0,1,1991,1211.0
1716,1,1996,888.0
3432,1,2001,1509.0
5148,1,2006,1142.0
6864,1,2011,893.0


#### Validate the shape of the melted DataFrame

In [18]:
df.shape

(1716, 7)

In [19]:
df_tidy.shape

(10296, 3)
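The shape comparison above can also be expressed as assertions, which additionally check that (`taz_id`, `year`) is unique and therefore usable as a primary key. This is a sketch on made-up data; `check_tidy` is a hypothetical helper, not part of this notebook or of pandas.

```python
import pandas as pd

def check_tidy(wide, tidy, id_col='taz_id', key=('taz_id', 'year')):
    """Hypothetical helper: raise AssertionError if the melt lost or duplicated rows."""
    n_value_cols = wide.shape[1] - 1  # every column except the id column
    # Expected shape: one row per (zone, year) pair and three columns.
    assert tidy.shape == (len(wide) * n_value_cols, 3), tidy.shape
    # The key columns must identify each row uniquely.
    assert not tidy.duplicated(subset=list(key)).any(), "duplicate key"
    # No zone may appear or disappear during the melt.
    assert set(tidy[id_col]) == set(wide[id_col])
    return True

# Toy data with invented values, laid out like the real table.
wide = pd.DataFrame({
    'taz_id': [1, 2, 3],
    1991: [5.0, None, 7.0],
    1996: [6.0, 8.0, 9.0],
})
tidy = pd.melt(wide, id_vars='taz_id', var_name='year', value_name='num_jobs')
ok = check_tidy(wide, tidy)
```

With the notebook's own objects, the call would presumably be `check_tidy(df, df_tidy)`.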

## Save results to a .csv file

In [21]:
save_path = data_path + 'tts_num_jobs_tidy.csv'
t = time()
df_tidy.to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))

DataFrame saved to file:
 ../../data/tts/tts_num_jobs_tidy.csv 
took 0.07 seconds
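A quick way to confirm the saved file is usable is to read it back and compare it with the DataFrame in memory. The sketch below does this on invented data and writes to a temporary directory so that it does not touch `../../data/tts/`; the variable names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy tidy table with made-up values, including one missing num_jobs.
tidy = pd.DataFrame({
    'taz_id': [1, 1, 2, 2],
    'year': [1991, 1996, 1991, 1996],
    'num_jobs': [10.0, 11.0, None, 21.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'tts_num_jobs_tidy.csv')
    tidy.to_csv(tmp_path, index=False)   # same call as in the notebook
    reloaded = pd.read_csv(tmp_path)     # read while the file still exists

# Round-trip checks: shape, column names, and count of missing values.
same_shape = reloaded.shape == tidy.shape
same_columns = list(reloaded.columns) == list(tidy.columns)
missing_preserved = reloaded['num_jobs'].isna().sum() == tidy['num_jobs'].isna().sum()
print(same_shape, same_columns, missing_preserved)
```

Because `index=False` is passed to `to_csv`, the reloaded table has no extra index column, so the column lists should match exactly.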
