<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Preparation-Pt2-(Features)" data-toc-modified-id="Data-Preparation-Pt2-(Features)-1">Data Preparation Pt2 (Features)</a></span><ul class="toc-item"><li><span><a href="#I.-Feature-Cleaning" data-toc-modified-id="I.-Feature-Cleaning-1.1">I. Feature Cleaning</a></span><ul class="toc-item"><li><span><a href="#I-1.-Bond-Yields" data-toc-modified-id="I-1.-Bond-Yields-1.1.1">I-1. Bond Yields</a></span></li><li><span><a href="#I-2.-Interest-Rate" data-toc-modified-id="I-2.-Interest-Rate-1.1.2">I-2. Interest Rate</a></span></li><li><span><a href="#I-3.-Construction--Produce-Price-Index" data-toc-modified-id="I-3.-Construction--Produce-Price-Index-1.1.3">I-3. Construction  Produce Price Index</a></span></li><li><span><a href="#I-4.-Weekly-Income" data-toc-modified-id="I-4.-Weekly-Income-1.1.4">I-4. Weekly Income</a></span></li><li><span><a href="#I-5.-Household-size" data-toc-modified-id="I-5.-Household-size-1.1.5">I-5. Household size</a></span></li><li><span><a href="#I-6-Population-by-Age" data-toc-modified-id="I-6-Population-by-Age-1.1.6">I-6 Population by Age</a></span></li><li><span><a href="#I-7-Cultural-Diversity-&amp;-Immigration" data-toc-modified-id="I-7-Cultural-Diversity-&amp;-Immigration-1.1.7">I-7 Cultural Diversity &amp; Immigration</a></span><ul class="toc-item"><li><span><a href="#I-7-1-Australian-Citizenship" data-toc-modified-id="I-7-1-Australian-Citizenship-1.1.7.1">I-7-1 Australian Citizenship</a></span></li><li><span><a href="#I-7-2-Indigenous-Status" data-toc-modified-id="I-7-2-Indigenous-Status-1.1.7.2">I-7-2 Indigenous Status</a></span></li><li><span><a href="#I-7-3-Year-of-Arrival-in-Australia" data-toc-modified-id="I-7-3-Year-of-Arrival-in-Australia-1.1.7.3">I-7-3 Year of Arrival in Australia</a></span></li></ul></li></ul></li><li><span><a href="#II.-Group-Features-into-Two-DataFrames" data-toc-modified-id="II.-Group-Features-into-Two-DataFrames-1.2">II. 
Group Features into Two DataFrames</a></span><ul class="toc-item"><li><span><a href="#II-2-features_postcode" data-toc-modified-id="II-2-features_postcode-1.2.1">II-2 <code>features_postcode</code></a></span></li><li><span><a href="#II-3-features_timePeriod" data-toc-modified-id="II-3-features_timePeriod-1.2.2">II-3 <code>features_timePeriod</code></a></span></li></ul></li><li><span><a href="#III-Merge-Features-Data-with-Housing-Data" data-toc-modified-id="III-Merge-Features-Data-with-Housing-Data-1.3">III Merge Features Data with Housing Data</a></span><ul class="toc-item"><li><span><a href="#III-1-Merge-features-into-the-stacked-housing-data-file" data-toc-modified-id="III-1-Merge-features-into-the-stacked-housing-data-file-1.3.1">III-1 Merge features into the stacked housing data file</a></span><ul class="toc-item"><li><span><a href="#III-1-1-features_postcode$\xrightarrow{merge}$-stacked-master" data-toc-modified-id="III-1-1-features_postcode$\xrightarrow{merge}$-stacked-master-1.3.1.1">III-1-1 <code>features_postcode</code>$\xrightarrow{merge}$ stacked master</a></span></li><li><span><a href="#III-2-1-features_timePeriod$\xrightarrow{merge}$-stacked-master" data-toc-modified-id="III-2-1-features_timePeriod$\xrightarrow{merge}$-stacked-master-1.3.1.2">III-2-1 <code>features_timePeriod</code>$\xrightarrow{merge}$ stacked master</a></span></li></ul></li><li><span><a href="#III-2-Merge-Features-into-the-unstacked-housing-data-file" data-toc-modified-id="III-2-Merge-Features-into-the-unstacked-housing-data-file-1.3.2">III-2 Merge Features into the unstacked housing data file</a></span></li></ul></li></ul></li></ul></div>

# Data Preparation Pt2 (Features)

This notebook is Part 2 of the data cleaning process. It follows <i>Data Preparation Pt1 (NSW Housing Data)</i>, in which we cleaned and merged all house sales and rent data into the complete housing dataset. The objective of this notebook is to clean and prepare the following feature data for modelling:
1. Bond yields 
2. Interest rate
3. Construction producer price index 
4. Weekly income
5. Household size
6. Population by age
7. Cultural diversity & immigration

The feature data above was collected from the following sources:
* Yieldbroker ([source link](https://www.yieldbroker.com/)): 1
* Reserve Bank of Australia ([source link](https://www.rba.gov.au/statistics/historical-data.html)): 2
* 2016 Census from Australian Bureau of Statistics ([source link](https://www.abs.gov.au/census)): 4-7

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

## I. Feature Cleaning

### I-1. Bond Yields ###
`yields_join`

In [2]:
# read data in 
yields_data = "Files/Bond Yields/f02hist.xls"
yields = pd.read_excel(yields_data, sheet_name='Data', usecols='A:B,E', header=None, skiprows=range(0,12))
yields.columns = ['Date', '2yBonds%', '10yBonds%']
yields.head()

Unnamed: 0,Date,2yBonds%,10yBonds%
0,2013-07-31,2.5375,3.75
1,2013-08-31,2.5,3.86
2,2013-09-30,2.6875,3.995
3,2013-10-31,2.7075,3.97
4,2013-11-30,2.77,4.125


In [3]:
# split date column into year and quarter
dates = pd.to_datetime(yields["Date"])
yields["Year"] = dates.dt.year
yields["Quarter"] = dates.dt.quarter

# set datetime as index
yields.set_index('Date', inplace = True)

In [4]:
# resample to quarterly frequency, averaging the values within each quarter
yields_quarter_rates = yields.resample('QS').mean()
yields_quarter_rates.head(2)

Unnamed: 0_level_0,2yBonds%,10yBonds%,Year,Quarter
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2013-07-01,2.575,3.868333,2013.0,3.0
2013-10-01,2.729167,4.1125,2013.0,4.0


In [5]:
# convert year and quarter to int
yields_quarter_rates["Year"] = yields_quarter_rates["Year"].astype(int)
yields_quarter_rates["Quarter"] = yields_quarter_rates["Quarter"].astype(int)

# create time period variable from 'Year' and 'Quarter'
yields_quarter_rates["time_period"] = yields_quarter_rates["Year"].map(str) + " Q" + yields_quarter_rates["Quarter"].map(str)

# Drop 'Year' and Quarter 
yields_join = yields_quarter_rates[["time_period", "2yBonds%", "10yBonds%"]] 
yields_join.head()

Unnamed: 0_level_0,time_period,2yBonds%,10yBonds%
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-07-01,2013 Q3,2.575,3.868333
2013-10-01,2013 Q4,2.729167,4.1125
2014-01-01,2014 Q1,2.7,4.135
2014-04-01,2014 Q2,2.685,3.835833
2014-07-01,2014 Q3,2.583333,3.474167


The resulting cleaned bond yield df is **yields_join**.
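
As a compact check of the steps above, the sketch below averages a few made-up month-end values by quarter and derives the `time_period` label directly from the resampled index, which avoids averaging the `Year` and `Quarter` columns and casting them back to int. The values and the names `toy` and `toy_q` are illustrative assumptions, not data from the source files.

```python
import pandas as pd

# Hypothetical month-end observations (illustrative values only)
toy = pd.DataFrame(
    {"2yBonds%": [2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(["2013-07-31", "2013-08-31", "2013-09-30", "2013-10-31"]),
)

# Average by quarter; 'QS' labels each quarter by its start date
toy_q = toy.resample("QS").mean()

# Build the 'YYYY Qn' label from the quarter-start timestamps in the index
toy_q["time_period"] = [f"{ts.year} Q{ts.quarter}" for ts in toy_q.index]
print(toy_q)
```

The label format matches the `time_period` key used for merging later in this notebook.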

----

### I-2. Interest Rate ###
`rates_join`  

In [6]:
# read data in
interest = "Files/Interest Rates/f01d.xls"
interest = pd.read_excel(interest, sheet_name = "Data", usecols = "A:B", header = None, skiprows = range(0,12))
interest.columns = ['Date', 'Rate']

# get date time
dates = pd.to_datetime(interest["Date"])
interest["Year"] = dates.dt.year
interest["Quarter"] = dates.dt.quarter

# set date as index
interest.set_index('Date', inplace = True)


# calculate average
quarter_rates = interest.resample('QS').mean()

# convert year and quarter to int
quarter_rates["Year"] = quarter_rates["Year"].astype(int)
quarter_rates["Quarter"] = quarter_rates["Quarter"].astype(int)

# time period
quarter_rates["time_period"] = quarter_rates["Year"].map(str) + " Q" + quarter_rates["Quarter"].map(str)

# remove Year and Quarter
quarter_rates = quarter_rates[['Rate','time_period']]
quarter_rates.head()

Unnamed: 0_level_0,Rate,time_period
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2011-01-01,4.75,2011 Q1
2011-04-01,4.75,2011 Q2
2011-07-01,4.75,2011 Q3
2011-10-01,4.52381,2011 Q4
2012-01-01,4.25,2012 Q1


In [7]:
# read data in
interest = "Files/Interest Rates/f01d.xls"

interest = pd.read_excel(interest, sheet_name = "Data", usecols = "A:B", header = None, skiprows = range(0,12))
interest.columns = ['Date', 'Rate']
interest.head()

Unnamed: 0,Date,Rate
0,2011-01-05,4.75
1,2011-01-06,4.75
2,2011-01-07,4.75
3,2011-01-10,4.75
4,2011-01-11,4.75


In [8]:
# check data types
interest.dtypes

Date    datetime64[ns]
Rate           float64
dtype: object

In [9]:
# split date column into year and quarter
dates = pd.to_datetime(interest["Date"])
interest["Year"] = dates.dt.year
interest["Quarter"] = dates.dt.quarter

# set datetime as index
interest.set_index('Date', inplace = True)

In [10]:
# resample to quarterly frequency, averaging the values within each quarter
quarter_rates = interest.resample('QS').mean()
quarter_rates.head(2)

Unnamed: 0_level_0,Rate,Year,Quarter
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2011-01-01,4.75,2011.0,1.0
2011-04-01,4.75,2011.0,2.0


In [11]:
# convert year and quarter to int
quarter_rates["Year"] = quarter_rates["Year"].astype(int)
quarter_rates["Quarter"] = quarter_rates["Quarter"].astype(int)

# create time period variable for join from 'Year' and 'Quarter'
quarter_rates["time_period"] = quarter_rates["Year"].map(str) + " Q" + quarter_rates["Quarter"].map(str)
quarter_rates.head()

# Drop 'Year' and 'Quarter'
rates_join = quarter_rates[["time_period", "Rate"]]
rates_join.head(2)

Unnamed: 0_level_0,time_period,Rate
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2011-01-01,2011 Q1,4.75
2011-04-01,2011 Q2,4.75


The resulting cleaned interest rate df is **rates_join**.

----

### I-3. Construction  Produce Price Index
`df_cons_clean`

In [12]:
# --read file, --rename columns
construction_file = "Files/Construction/Quarterly, Building construction prices rose, due to Homebuilder grants and government infrastructure investment.xlsx"
df_cons = pd.read_excel(construction_file,header=1,usecols="A:B", skipfooter=2)
df_cons.columns=['date','constr_index']

# --convert to datetime
df_cons['date'] = pd.to_datetime(df_cons['date'],format='%b-%y')

# --get year and quarter, --concatenate as time_period format, --drop other columns
df_cons['year'] = df_cons.date.dt.year
df_cons['quarter'] = df_cons.date.dt.quarter
df_cons['time_period'] = df_cons.year.map(str) + " Q" + df_cons.quarter.map(str)
df_cons_clean = df_cons.drop(columns=['date','year','quarter'],axis=1)
df_cons_clean.head()

Unnamed: 0,constr_index,time_period
0,100.1,2012 Q2
1,100.3,2012 Q3
2,100.2,2012 Q4
3,101.0,2013 Q1
4,101.6,2013 Q2


### I-4. Weekly Income
`incp_gr`

In [13]:
# Read the raw data in
census_INCP = "Files/Census/POA (UR) by INCP Toal Personal Income (Weekly).csv"

incp_raw = pd.read_csv(census_INCP, skiprows=9, nrows=11142,
                       usecols=['POA (UR)', 'INCP Total Personal Income (weekly)', 'Count'])

# Rename column for easier referencing
incp_cols = {'POA (UR)':'postcode', 'INCP Total Personal Income (weekly)':'INCP_WK'}
incp_raw.rename(columns=incp_cols, inplace=True)

# Unstack
incp = incp_raw.groupby(['postcode','INCP_WK'])['Count'].sum().unstack()

# Remove the last row (grand total)
incp = incp[:-1]

incp.head(2)

INCP_WK,"$1,000-$1,249 ($52,000-$64,999)","$1,250-$1,499 ($65,000-$77,999)","$1,500-$1,749 ($78,000-$90,999)","$1,750-$1,999 ($91,000-$103,999)","$1-$149 ($1-$7,799)","$150-$299 ($7,800-$15,599)","$2,000-$2,999 ($104,000-$155,999)","$3,000 or more ($156,000 or more)","$300-$399 ($15,600-$20,799)","$400-$499 ($20,800-$25,999)","$500-$649 ($26,000-$33,799)","$650-$799 ($33,800-$41,599)","$800-$999 ($41,600-$51,999)",Negative income,Nil income,Not applicable,Not stated,Total
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
"2000, NSW",1676,1028,977,677,502,1112,1456,1796,1978,2201,1915,1599,1638,204,3297,1251,4115,27411
"2006, NSW",11,9,9,3,275,291,4,11,98,47,31,5,3,9,267,15,156,1261


In [14]:
# Remove 'NSW' in the index and cast postcode to int64
incp.reset_index(inplace=True)
incp['postcode'] = incp['postcode'].str.split(",").str.get(0)
incp['postcode'] = incp['postcode'].astype('int64')
incp = incp.set_index('postcode')

In [15]:
# Clean column names
income_cols= {'$1,000-$1,249 ($52,000-$64,999)' : '$1000-1249', 
            '$1,250-$1,499 ($65,000-$77,999)' : '$1250-1499',
            '$1,500-$1,749 ($78,000-$90,999)' : '$1500-1749', 
            '$1,750-$1,999 ($91,000-$103,999)': '$1750-1999',
            '$1-$149 ($1-$7,799)': '$1-149', 
            '$150-$299 ($7,800-$15,599)' : '$150-299',
            '$2,000-$2,999 ($104,000-$155,999)':'$2000-2999',
            '$3,000 or more ($156,000 or more)':'>=$3000', 
            '$300-$399 ($15,600-$20,799)':'$300-399',
            '$400-$499 ($20,800-$25,999)':'$400-499', 
            '$500-$649 ($26,000-$33,799)':'$500-649',
            '$650-$799 ($33,800-$41,599)':'$650-799', 
            '$800-$999 ($41,600-$51,999)':'$800-999'}
incp.rename(columns=income_cols, inplace=True)

# Combine 'not applicable' and 'not stated' into 'total_na'
incp['total_na'] = incp['Not applicable'] + incp['Not stated']

# Drop the 'Total' column and the two columns folded into 'total_na'
incp = incp.drop(columns=['Not applicable', 'Not stated', 'Total'])

# Reorder columns from lowest to highest income band
cols = ['$1-149','$150-299','$300-399','$400-499','$500-649','$650-799',
        '$800-999','$1000-1249','$1250-1499','$1500-1749',
        '$1750-1999','$2000-2999','>=$3000',
        'Negative income','Nil income','total_na']
incp=incp[cols]

incp.head(1)

INCP_WK,$1-149,$150-299,$300-399,$400-499,$500-649,$650-799,$800-999,$1000-1249,$1250-1499,$1500-1749,$1750-1999,$2000-2999,>=$3000,Negative income,Nil income,total_na
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2000,502,1112,1978,2201,1915,1599,1638,1676,1028,977,677,1456,1796,204,3297,5366


In [16]:
# Create income buckets and save into incp_gr
incp['INCP_LOW'] = incp.iloc[:, 0:6].sum(axis=1)
incp['INCP_MID'] = incp.iloc[:, 6:10].sum(axis=1)
incp['INCP_HIGH'] = incp.iloc[:, 10:13].sum(axis=1)
incp['INCP_NEG_NIL'] = incp.iloc[:, 13:15].sum(axis=1)
incp_gr = incp[['INCP_LOW', 'INCP_MID', 'INCP_HIGH', 'INCP_NEG_NIL']]

# Reset index
incp_gr.reset_index(inplace=True)

incp_gr.head(1)

INCP_WK,postcode,INCP_LOW,INCP_MID,INCP_HIGH,INCP_NEG_NIL
0,2000,9307,5319,3929,3501


In [17]:
incp.iloc[:, 10:13]

INCP_WK,$1750-1999,$2000-2999,>=$3000
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,677,1456,1796
2006,3,4,11
2007,180,342,196
2008,405,588,359
2009,587,1347,1271
...,...,...,...
2877,60,76,56
2878,8,5,13
2879,7,14,8
2880,497,624,197


*The resulting cleaned df is <b>incp_gr</b>*
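
The buckets above are built by column position (`iloc`), which silently depends on the column order set in the previous cell. As a hedged alternative, the sketch below groups columns by label instead. The frame `toy_incp`, its counts, and the reduced set of bands in `buckets` are illustrative assumptions, not the full grouping used in this notebook.

```python
import pandas as pd

# Hypothetical counts for two postcodes and a handful of income bands
toy_incp = pd.DataFrame(
    {"$1-149": [5, 1], "$150-299": [10, 2], "$800-999": [7, 3], ">=$3000": [4, 0]},
    index=pd.Index([2000, 2006], name="postcode"),
)

# Map each bucket to explicit column labels (assumed grouping for this toy frame)
buckets = {
    "INCP_LOW": ["$1-149", "$150-299"],
    "INCP_MID": ["$800-999"],
    "INCP_HIGH": [">=$3000"],
}

# Sum the listed bands row by row, then expose postcode as a column for merging
toy_gr = pd.DataFrame(
    {name: toy_incp[cols].sum(axis=1) for name, cols in buckets.items()}
).reset_index()
print(toy_gr)
```

With labels, reordering or adding a column cannot shift a band into the wrong bucket; a missing label raises a `KeyError` instead.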

----

### I-5. Household size
`cprf`

In [18]:
# Read data
census_cprf = "Files/Census/POA by CPRF Count of Persons in Family by STATE.xlsx"
cprf = pd.read_excel(census_cprf, sheet_name="Data Sheet 0", skiprows=9, nrows=619)

# Remove redundant rows and columns 
cprf = cprf[1:] #remove the first row
cprf = cprf.drop(columns='CPRF Count of Persons in Family') # remove the first column

# Rename columns
cprf_cols= {'Unnamed: 1' : 'postcode', 
            'Two persons in family' : 'CPRF_2',
            'Three persons in family' : 'CPRF_3', 
            'Four persons in family': 'CPRF_4',
            'Five persons in family': 'CPRF_5', 
            'Six or more persons in family' : 'CPRF_6+',
            'Not applicable':'CPRF_na',
            'Total' :'CPRF_HHOLD_NO'}
cprf.rename(columns=cprf_cols, inplace=True)

cprf.head(1)

Unnamed: 0,postcode,CPRF_2,CPRF_3,CPRF_4,CPRF_5,CPRF_6+,CPRF_na,CPRF_HHOLD_NO
1,"2000, NSW",3453.0,857.0,354.0,54.0,21.0,8125.0,12861.0


In [19]:
# Remove 'NSW' in the index and cast postcode to int64
cprf.reset_index(inplace=True)
cprf['postcode'] = cprf['postcode'].str.split(",").str.get(0)
cprf['postcode'] = cprf['postcode'].astype('int64')
cprf = cprf.set_index('postcode')
cprf = cprf.drop(columns='index', axis=1)

In [20]:
# Reset index for merging
cprf.reset_index(inplace=True)
cprf.head(1)

Unnamed: 0,postcode,CPRF_2,CPRF_3,CPRF_4,CPRF_5,CPRF_6+,CPRF_na,CPRF_HHOLD_NO
0,2000,3453.0,857.0,354.0,54.0,21.0,8125.0,12861.0


*The resulting cleaned df is <b>cprf</b>*

----

### I-6 Population by Age
`age_gr`

In [21]:
census_age5p = "Files/Census/POA (UR) by AGE5P - Age in Five Year Groups.xlsx"
age = pd.read_excel(census_age5p, sheet_name="Data Sheet 0", skiprows=8, nrows=619)

# Remove redundant rows and columns 
age = age[1:] #remove the first row
age = age.drop(columns='AGE5P - Age in Five Year Groups') # remove the first column

# Rename columns
age.rename(columns={"Unnamed: 1":"postcode"}, inplace=True)

age.head(1)

Unnamed: 0,postcode,0-4 years,5-9 years,10-14 years,15-19 years,20-24 years,25-29 years,30-34 years,35-39 years,40-44 years,...,60-64 years,65-69 years,70-74 years,75-79 years,80-84 years,85-89 years,90-94 years,95-99 years,100 years and over,Total
1,"2000, NSW",698.0,292.0,260.0,1040.0,4791.0,6015.0,4661.0,2703.0,1450.0,...,780.0,660.0,449.0,268.0,158.0,107.0,40.0,10.0,0.0,27411.0


In [22]:
# Remove 'NSW' in the index and cast postcode to int64
age.reset_index(inplace=True)
age['postcode'] = age['postcode'].str.split(",").str.get(0)
age['postcode'] = age['postcode'].astype('int64')
age = age.set_index('postcode')
age = age.drop(columns='index', axis=1)

In [23]:
# Create age brackets 

age['0-4yo'] = age.iloc[:, 0:1].sum(axis=1)
age['5-14yo'] = age.iloc[:, 1:3].sum(axis=1)
age['15-24yo'] = age.iloc[:, 3:5].sum(axis=1)
age['25-34yo'] = age.iloc[:, 5:7].sum(axis=1)
age['35-54yo'] = age.iloc[:, 7:11].sum(axis=1)
age['55-64yo'] = age.iloc[:, 11:13].sum(axis=1)
age['65+yo'] = age.iloc[:, 13:21].sum(axis=1)
age['population_2016']=age.iloc[:, 21:22]

age_gr = age[['0-4yo', '5-14yo', '15-24yo', '25-34yo', '35-54yo', '55-64yo','65+yo', 'population_2016']]

In [24]:
# Reset index for merging
age_gr.reset_index(inplace=True)
age_gr.head(1)

Unnamed: 0,postcode,0-4yo,5-14yo,15-24yo,25-34yo,35-54yo,55-64yo,65+yo,population_2016
0,2000,698.0,552.0,5831.0,10676.0,6298.0,1670.0,1692.0,27411.0


### I-7 Cultural Diversity & Immigration
`cald`

#### I-7-1 Australian Citizenship
`citp`

In [25]:
census_citp = "Files/Census/POA (UR) by CITP Australian Citizenship by STATE (UR).xlsx"
citp = pd.read_excel(census_citp, sheet_name="Data Sheet 0", skiprows=9, nrows=619)


# Remove redundant rows and columns 
citp = citp[1:] #remove the first row
citp = citp.drop(columns='CITP Australian Citizenship') # remove the first column


# Rename columns
citp.rename(columns={"Unnamed: 1":"postcode",
                    "Australian":"citizen_AU",
                     "Not Australian":"citizen_non_AU"}, inplace=True)

# Remove 'not stated' and total
citp = citp.drop(columns=["Not stated",'Total'])
citp.tail(1)

Unnamed: 0,postcode,citizen_AU,citizen_non_AU
618,"2898, NSW",317.0,26.0


#### I-7-2 Indigenous Status
`ing`

In [26]:
census_ing = "Files/Census/POA (UR) by INGP Indigenous Status by STATE (UR).xlsx"
ing = pd.read_excel(census_ing, sheet_name="Data Sheet 0", skiprows=9, nrows=619)

# Remove redundant rows and columns 
ing = ing[1:] #remove the first row
ing = ing.drop(columns=['INGP Indigenous Status',"Non-Indigenous","Not stated", "Total"])

# Rename columns
ing.rename(columns={"Unnamed: 1":"postcode"}, inplace=True)

ing.head(1)

Unnamed: 0,postcode,Aboriginal,Torres Strait Islander,Both Aboriginal and Torres Strait Islander
1,"2000, NSW",45.0,8.0,0.0


In [27]:
# Create a total column for all aboriginal and Torres Strait Islanders
cols = ['Aboriginal', 'Torres Strait Islander','Both Aboriginal and Torres Strait Islander']
ing['ATSI']=ing.loc[:,cols].sum(axis=1)

# Drop cols
ing = ing[['postcode', 'ATSI']]

ing.head()

Unnamed: 0,postcode,ATSI
1,"2000, NSW",53.0
2,"2006, NSW",16.0
3,"2007, NSW",58.0
4,"2008, NSW",77.0
5,"2009, NSW",125.0


#### I-7-3 Year of Arrival in Australia
`yarrp_gr`

In [28]:
census_yarrp = "Files/Census/POA (UR) by YARRP Year of Arrival in Australia.xlsx"
yarrp = pd.read_excel(census_yarrp, sheet_name="Data Sheet 0", skiprows=8, nrows=619)

# Remove redundant rows and columns 
yarrp = yarrp[1:] #remove the first row
yarrp = yarrp.drop(columns=['YARRP Year of Arrival in Australia (ranges)',
                        "Not stated", "Not applicable", "Total"])

# Rename columns
yarrp.rename(columns={"Unnamed: 1":"postcode","Arrived 1996-2005":"YARRP 1996-2005"}, inplace=True)

yarrp.head(1)

Unnamed: 0,postcode,Arrived 1900-1945,Arrived 1946-1955,Arrived 1956-1965,Arrived 1966-1975,Arrived 1976-1985,Arrived 1986-1995,YARRP 1996-2005,Arrived 2006-2015,Arrived 1 Jan 2016 - 9 August 2016
1,"2000, NSW",5.0,83.0,133.0,304.0,489.0,1101.0,2175.0,11390.0,1633.0


In [29]:
# Create year-of-arrival brackets 

yarrp['YARRP <1975'] = yarrp.iloc[:, 1:5].sum(axis=1)
yarrp['YARRP 1976-1995'] = yarrp.iloc[:, 5:7].sum(axis=1)
yarrp['YARRP 2006-2016'] = yarrp.iloc[:, 8:10].sum(axis=1)

yarrp_gr = yarrp[['postcode','YARRP <1975', 'YARRP 1976-1995', 'YARRP 1996-2005', 'YARRP 2006-2016']]
yarrp_gr.head(1)

Unnamed: 0,postcode,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016
1,"2000, NSW",525.0,1590.0,2175.0,13023.0


Now join the three DFs together:

In [30]:
cald = citp.merge(yarrp_gr, left_on='postcode',right_on='postcode')
cald = cald.merge(ing, left_on='postcode',right_on='postcode')

In [31]:
# Remove 'NSW' in the index and cast postcode to int64
cald.reset_index(inplace=True)
cald['postcode'] = cald['postcode'].str.split(",").str.get(0)
cald['postcode'] = cald['postcode'].astype('int64')
cald = cald.set_index('postcode')
cald = cald.drop(columns='index', axis=1)

# Reset index for merging
cald.reset_index(inplace=True)
cald.head(1)

Unnamed: 0,postcode,citizen_AU,citizen_non_AU,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016,ATSI
0,2000,8691.0,14662.0,525.0,1590.0,2175.0,13023.0,53.0
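
The same "strip the state suffix and cast to int" step is repeated for the income, household, age, and cultural-diversity tables above. A minimal sketch of a reusable helper follows; `clean_postcode` and `toy_census` are hypothetical names introduced here and are not part of the original notebook.

```python
import pandas as pd

def clean_postcode(df, col="postcode"):
    """Return a copy of df with 'NNNN, NSW' labels converted to int64 postcodes."""
    out = df.copy()
    # Keep the text before the first comma, then cast to integer
    out[col] = out[col].str.split(",").str.get(0).astype("int64")
    return out

# Two rows in the raw census label format
toy_census = pd.DataFrame({"postcode": ["2000, NSW", "2006, NSW"], "ATSI": [53.0, 16.0]})
cleaned = clean_postcode(toy_census)
print(cleaned)
```

Because the helper works on a copy and keeps `postcode` as a regular column, the result can be passed straight to `pd.merge(..., on='postcode')`.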


## II. Group Features into Two DataFrames


1. Cross-sectional features will later be merged using <u>postcode</u> as the key. These features are first combined into the **`features_postcode`** DataFrame in this section, and then merged into the housing data.


2. Time-series features, which range from 2011 to 2021, will be merged using <u>time period</u> as the key. These features are first combined into the **`features_timePeriod`** DataFrame. 


### II-2 `features_postcode`

Merge:
* `incp_gr` weekly income
* `cprf` household size
* `age_gr` population by age group
* `cald` AU citizenship, year of arrival in AUS, and Indigenous status

In [32]:
print("number of postcodes in incp:", len(set(incp_gr['postcode'])))
print("number of postcodes in cprf:", len(set(cprf['postcode'])))
print("number of postcodes in age_gr:", len(set(age_gr['postcode'])))
print("number of postcodes in cald:", len(set(cald['postcode'])))

number of postcodes in incp: 618
number of postcodes in cprf: 618
number of postcodes in age_gr: 618
number of postcodes in cald: 618


In [33]:
# merging
features_postcode = pd.merge(incp_gr, cprf, on='postcode')
features_postcode = pd.merge(features_postcode, age_gr, on='postcode')
features_postcode = pd.merge(features_postcode, cald, on='postcode')

In [34]:
print("Number of postcodes in features_postcode:",len(set(features_postcode['postcode'])))
features_postcode.head()

Number of postcodes in features_postcode: 618


Unnamed: 0,postcode,INCP_LOW,INCP_MID,INCP_HIGH,INCP_NEG_NIL,CPRF_2,CPRF_3,CPRF_4,CPRF_5,CPRF_6+,...,55-64yo,65+yo,population_2016,citizen_AU,citizen_non_AU,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016,ATSI
0,2000,9307,5319,3929,3501,3453.0,857.0,354.0,54.0,21.0,...,1670.0,1692.0,27411.0,8691.0,14662.0,525.0,1590.0,2175.0,13023.0,53.0
1,2006,747,32,18,276,0.0,0.0,0.0,0.0,5.0,...,13.0,11.0,1261.0,991.0,104.0,0.0,11.0,92.0,160.0,16.0
2,2007,2803,1458,718,2410,830.0,215.0,98.0,25.0,4.0,...,333.0,380.0,8846.0,2726.0,5033.0,109.0,479.0,593.0,4517.0,58.0
3,2008,3321,2496,1352,2779,1308.0,243.0,105.0,17.0,7.0,...,470.0,370.0,11712.0,4358.0,5962.0,155.0,541.0,607.0,5527.0,77.0
4,2009,2896,3232,3205,1040,2107.0,635.0,298.0,56.0,8.0,...,1227.0,1150.0,12813.0,7407.0,4142.0,430.0,1201.0,1097.0,3730.0,125.0


In [35]:
features_postcode.to_csv("Files/Cleaned/Features/Features_postcode_demo_census2016.csv")

### II-3 `features_timePeriod`
Merge:
* `yields_join` (bond yields)
* `rates_join` (interest rates)
* `df_cons_clean` (construction price index)

In [36]:
print("number of time period in bond yields:", len(set(yields_join['time_period'])))
print("number of time period in interest rates:", len(set(rates_join['time_period'])))
print("number of time period in construction:", len(set(df_cons_clean['time_period'])))

number of time period in bond yields: 33
number of time period in interest rates: 43
number of time period in construction: 37


We'll only keep the shared time periods in the three dataframes.
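
Keeping only the shared periods is what `pd.merge` does by default, since its `how` parameter defaults to `"inner"`. The toy frames below (illustrative names and values, assumed for this sketch) show the effect.

```python
import pandas as pd

# Two hypothetical series that overlap on only two quarters
left = pd.DataFrame({"time_period": ["2013 Q2", "2013 Q3", "2013 Q4"], "a": [1, 2, 3]})
right = pd.DataFrame({"time_period": ["2013 Q3", "2013 Q4", "2014 Q1"], "b": [10, 20, 30]})

# Inner join (the default): rows whose key is missing on either side are dropped
shared = pd.merge(left, right, on="time_period")
print(shared)
```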

In [37]:
# Merge
features_timePeriod = pd.merge(yields_join, df_cons_clean, on='time_period')
features_timePeriod = pd.merge(features_timePeriod, rates_join, on='time_period')
features_timePeriod.head()

Unnamed: 0,time_period,2yBonds%,10yBonds%,constr_index,Rate
0,2013 Q3,2.575,3.868333,101.9,2.602273
1,2013 Q4,2.729167,4.1125,102.2,2.5
2,2014 Q1,2.7,4.135,102.5,2.5
3,2014 Q2,2.685,3.835833,103.7,2.5
4,2014 Q3,2.583333,3.474167,104.7,2.5


In [38]:
# Check the number of time period after merge
len(set(features_timePeriod['time_period']))

32

In [39]:
features_timePeriod['time_period'].unique()

array(['2013 Q3', '2013 Q4', '2014 Q1', '2014 Q2', '2014 Q3', '2014 Q4',
       '2015 Q1', '2015 Q2', '2015 Q3', '2015 Q4', '2016 Q1', '2016 Q2',
       '2016 Q3', '2016 Q4', '2017 Q1', '2017 Q2', '2017 Q3', '2017 Q4',
       '2018 Q1', '2018 Q2', '2018 Q3', '2018 Q4', '2019 Q1', '2019 Q2',
       '2019 Q3', '2019 Q4', '2020 Q1', '2020 Q2', '2020 Q3', '2020 Q4',
       '2021 Q1', '2021 Q2'], dtype=object)

Even though we started with more time periods and ended up with fewer (32 quarters, 2013 Q3 to 2021 Q2), this is sufficient to cover the time periods in the housing data we're interested in.

*Note: It is possible to do the above in a single line of code, though for clarity, I've left it like this. As seen in https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns*
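
For reference, one way to write the single-line version mentioned in the note is `functools.reduce` over a list of frames. This is a sketch on toy data; the list `frames` and its values are assumptions for illustration.

```python
from functools import reduce

import pandas as pd

# Three hypothetical frames sharing a 'time_period' key
frames = [
    pd.DataFrame({"time_period": ["2020 Q1", "2020 Q2"], "x": [1, 2]}),
    pd.DataFrame({"time_period": ["2020 Q1", "2020 Q2"], "y": [3, 4]}),
    pd.DataFrame({"time_period": ["2020 Q2", "2020 Q3"], "z": [5, 6]}),
]

# Fold the list into one frame, inner-joining each frame onto the running result
merged = reduce(lambda acc, df: pd.merge(acc, df, on="time_period"), frames)
print(merged)
```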

## III Merge Features Data with Housing Data

The final step is to merge the features into the previously prepared housing data files:
* <i>Master_Sales_Rent_2017Q4_2021Q1.csv</i> (Stacked complete housing data Q4'17-Q1'21)
* <i>Pivot_Sales_Rent_5Quarters_Imputed.csv</i> (Unstacked complete housing data Q1'20-Q1'21)

As explained in cleaning notebook Pt1, the stacked file (each quarter of data is stacked on top of the previous one, so the same postcode appears multiple times in the index) will be used mainly for exploring trends. The unstacked file has the housing statistics from previous quarters and the features as columns, to be used for modelling.



### III-1 Merge features into the stacked housing data file

Source files are:
* <i>Master_Sales_Rent_2017Q4_2021Q1.csv</i> 
* features_postcode DataFrame
* features_timePeriod DataFrame

The output files are:
* `master_mg1` DataFrame
* <i>Master_Sales_Rent_2017Q4_2021Q1_pcFeatures.csv</i>, saved under 'Files/Cleaned/Postcode-based'.

#### III-1-1 `features_postcode`$\xrightarrow{merge}$ stacked master

In [40]:
# Import Master Housing DF

master = pd.read_csv("Files/Cleaned/Housing/Master_Sales_Rent_2017Q4_2021Q1.csv")
print(master.shape)
master.head(1)

(20717, 21)


Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,Qdelta_count,Adelta_count,rkey,median_rent_newb,new_bonds_no,total_bonds_no,Qdelta_median_rent,Qdelta_new_bonds,Adelta_median_rent,Adelta_new_bonds
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,-0.325,-0.3112,r121,640.0,1169.0,7914.0,-0.2,0.5545,,


In [41]:
print("Number of unique postcodes in features_postcode:", 
      features_postcode['postcode'].nunique())
print("Number of unique postcodes in the housing data", 
      master['postcode'].nunique())

Number of unique postcodes in features_postcode: 618
Number of unique postcodes in the housing data 587


In [42]:
master_mg1 = master.merge(features_postcode, left_on='postcode', right_on='postcode')
master_mg1.head(1)

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,55-64yo,65+yo,population_2016,citizen_AU,citizen_non_AU,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016,ATSI
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,1670.0,1692.0,27411.0,8691.0,14662.0,525.0,1590.0,2175.0,13023.0,53.0


In [43]:
print("master_merge1 shape:", master_mg1.shape)
print("Number of unique postcodes in master_merge1:", master_mg1['postcode'].nunique())

master_merge1 shape: (20590, 47)
Number of unique postcodes in master_merge1: 581
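
The inner merge reduces the housing data from 587 to 581 unique postcodes, so 6 postcodes have no match in `features_postcode`. A hedged way to list such postcodes is a left merge with `indicator=True`. The frames `housing` and `census` below are small stand-ins with partly made-up values, not the real files.

```python
import pandas as pd

# Stand-in frames: postcode 2999 exists only in the housing data
housing = pd.DataFrame({"postcode": [2000, 2006, 2999], "median_price": [1350.0, 900.0, 500.0]})
census = pd.DataFrame({"postcode": [2000, 2006, 2007], "population_2016": [27411.0, 1261.0, 8846.0]})

# A left merge keeps every housing row; '_merge' records where each row was found
check = housing.merge(census, on="postcode", how="left", indicator=True)
dropped = check.loc[check["_merge"] == "left_only", "postcode"].unique()
print(dropped)
```

Running the same check on `master` and `features_postcode` would show which postcodes an inner merge discards.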


#### III-2-1 `features_timePeriod`$\xrightarrow{merge}$ stacked master

In [44]:
features_timePeriod.head()

Unnamed: 0,time_period,2yBonds%,10yBonds%,constr_index,Rate
0,2013 Q3,2.575,3.868333,101.9,2.602273
1,2013 Q4,2.729167,4.1125,102.2,2.5
2,2014 Q1,2.7,4.135,102.5,2.5
3,2014 Q2,2.685,3.835833,103.7,2.5
4,2014 Q3,2.583333,3.474167,104.7,2.5


In [45]:
master_mg1 = master_mg1.merge(features_timePeriod, 
                              left_on='time_period', right_on='time_period')
master_mg1.head(1)

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,citizen_non_AU,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016,ATSI,2yBonds%,10yBonds%,constr_index,Rate
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,14662.0,525.0,1590.0,2175.0,13023.0,53.0,1.814167,2.646667,112.0,1.5


In [46]:
print("master_merge1 shape:", master_mg1.shape)
print("Time periods in master_merge1:\n", master_mg1.time_period.unique())

master_merge1 shape: (20590, 51)
Time periods in master_merge1:
 ['2017 Q3' '2017 Q4' '2018 Q1' '2018 Q2' '2018 Q3' '2018 Q4' '2019 Q1'
 '2019 Q2' '2019 Q3' '2019 Q4' '2020 Q1' '2020 Q2' '2020 Q3' '2020 Q4'
 '2021 Q1']


Saving as CSV

In [47]:
master_mg1.to_csv('Files/Cleaned/Postcode-based/Master_Sales_Rent_2017Q4_2021Q1_pcFeatures.csv', index=False)

In [48]:
master_mg1.head(1)

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,citizen_non_AU,YARRP <1975,YARRP 1976-1995,YARRP 1996-2005,YARRP 2006-2016,ATSI,2yBonds%,10yBonds%,constr_index,Rate
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,14662.0,525.0,1590.0,2175.0,13023.0,53.0,1.814167,2.646667,112.0,1.5


### III-2 Merge Features into the unstacked housing data file

Source files are:
* <i>Pivot_Sales_Rent_5Quarters_Imputed.csv</i> 
* features_postcode DataFrame
* features_timePeriod DataFrame

The output files are:
* `unstacked_mg2` DataFrame
* <i>Unstacked_Sales_Rent_5Quarters_Imputed_pcFeatures.csv</i>, saved under 'Files/Cleaned/Postcode-based'.

In [57]:
# Import unstacked Housing df

unstacked2 = pd.read_csv("Files/Cleaned/Housing/Pivot_Sales_Rent_5Quarters_Imputed.csv")
print("Unstacked shape:", unstacked2.shape)
print("Number of postcodes in unstacked:", unstacked2['postcode'].nunique())
unstacked2.head(1)

Unstacked shape: (577, 31)
Number of postcodes in unstacked: 577


Unnamed: 0,postcode,mean_price 2020 Q1,mean_price 2020 Q2,mean_price 2020 Q3,mean_price 2020 Q4,mean_price 2021 Q1,median_price 2020 Q1,median_price 2020 Q2,median_price 2020 Q3,median_price 2020 Q4,...,sales_no 2020 Q1,sales_no 2020 Q2,sales_no 2020 Q3,sales_no 2020 Q4,sales_no 2021 Q1,total_bonds_no 2020 Q1,total_bonds_no 2020 Q2,total_bonds_no 2020 Q3,total_bonds_no 2020 Q4,total_bonds_no 2021 Q1
0,2000,1541.0,1322.0,1631.0,1379.0,2794.0,1225.0,1000.0,1390.0,1110.0,...,105.0,74.0,100.0,155.0,184.0,8615.0,7595.0,8069.0,9140.0,9327.0


**`features_postcode` $\xrightarrow{merge}$ unstacked**

In [50]:
# Merge features_postcode in
unstacked_mg2= unstacked2.merge(features_postcode, left_on='postcode', right_on='postcode')
print("unstacked_mg2 shape:", unstacked_mg2.shape)
print("Number of postcodes in the unstacked_mg2:", unstacked_mg2['postcode'].nunique())

unstacked_mg2 shape: (573, 57)
Number of postcodes in the unstacked_mg2: 573


As with the stacked file, the merge drops postcodes that are missing from `features_postcode`: this time we lose 4 (577 → 573).

**`features_timePeriod` $\xrightarrow{merge}$ unstacked**

In [51]:
# Keep only the relevant time periods (Q1 2020 - Q1 2021)

TP = ['2020 Q1','2020 Q2','2020 Q3','2020 Q4','2021 Q1']
features_timePeriod = features_timePeriod.loc[features_timePeriod.time_period.isin(TP)]
features_timePeriod

Unnamed: 0,time_period,2yBonds%,10yBonds%,constr_index,Rate
26,2020 Q1,0.616667,1.006667,116.0,0.638889
27,2020 Q2,0.24,0.896667,116.2,0.25
28,2020 Q3,0.243333,0.89,116.2,0.25
29,2020 Q4,0.103333,0.89,116.9,0.15625
30,2021 Q1,0.09,1.353333,117.8,0.1


In [52]:
# Flatten the features DF into a single row (one column per feature and quarter)
features_TP_pvt = pd.pivot_table(features_timePeriod, 
                                 index=features_timePeriod.index,
                                 columns='time_period',
                                 values=['2yBonds%', '10yBonds%', 'constr_index', 'Rate'])

features_TP_pvt.columns = [' '.join(col) for col in features_TP_pvt.columns]

# Each column holds a single non-null value, so the column sums collapse the pivot into one row.
# (DataFrame.append was removed in pandas 2.0, so the row is built with to_frame().T instead.)
features_TP_pvt = features_TP_pvt.sum(numeric_only=True).to_frame().T

features_TP_pvt

Unnamed: 0,10yBonds% 2020 Q1,10yBonds% 2020 Q2,10yBonds% 2020 Q3,10yBonds% 2020 Q4,10yBonds% 2021 Q1,2yBonds% 2020 Q1,2yBonds% 2020 Q2,2yBonds% 2020 Q3,2yBonds% 2020 Q4,2yBonds% 2021 Q1,Rate 2020 Q1,Rate 2020 Q2,Rate 2020 Q3,Rate 2020 Q4,Rate 2021 Q1,constr_index 2020 Q1,constr_index 2020 Q2,constr_index 2020 Q3,constr_index 2020 Q4,constr_index 2021 Q1
0,1.006667,0.896667,0.89,0.89,1.353333,0.616667,0.24,0.243333,0.103333,0.09,0.638889,0.25,0.25,0.15625,0.1,116.0,116.2,116.2,116.9,117.8


In [53]:
# Convert the one-row DF into a dictionary 
tp_dict = features_TP_pvt.to_dict("records")[0]
type(tp_dict)

dict

In [54]:
unstacked_mg2 = unstacked_mg2.assign(**tp_dict)

print("unstacked_mg2 shape:", unstacked_mg2.shape)
unstacked_mg2.head(1)

unstacked_mg2 shape: (573, 77)


Unnamed: 0,postcode,mean_price 2020 Q1,mean_price 2020 Q2,mean_price 2020 Q3,mean_price 2020 Q4,mean_price 2021 Q1,median_price 2020 Q1,median_price 2020 Q2,median_price 2020 Q3,median_price 2020 Q4,...,Rate 2020 Q1,Rate 2020 Q2,Rate 2020 Q3,Rate 2020 Q4,Rate 2021 Q1,constr_index 2020 Q1,constr_index 2020 Q2,constr_index 2020 Q3,constr_index 2020 Q4,constr_index 2021 Q1
0,2000,1541.0,1322.0,1631.0,1379.0,2794.0,1225.0,1000.0,1390.0,1110.0,...,0.638889,0.25,0.25,0.15625,0.1,116.0,116.2,116.2,116.9,117.8
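
The `assign(**tp_dict)` call adds one column per dictionary key and broadcasts each scalar to every row. A small sketch with assumed names (`toy_unstacked`, `toy_tp`) and illustrative values:

```python
import pandas as pd

# Hypothetical one-row-per-postcode frame
toy_unstacked = pd.DataFrame({"postcode": [2000, 2006, 2007]})

# Scalar features keyed by the new column names (illustrative values)
toy_tp = {"Rate 2020 Q1": 0.64, "Rate 2020 Q2": 0.25}

# Each key becomes a column; each scalar is repeated for every row
toy_out = toy_unstacked.assign(**toy_tp)
print(toy_out)
```

Note that columns added this way are constant across postcodes, so in the unstacked file they carry no variation between rows.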


In [55]:
# Save as CSV
unstacked_mg2.to_csv('Files/Cleaned/Postcode-based/Unstacked_Sales_Rent_5Quarters_Imputed_pcFeatures.csv', index=False)