# to do 10/10/21
**Chris**
- put LGA fixing from step 6 into pt2
- merge remaining features:  incp_gr, cprf, yields_join, df_cons_clean, quarter_rates,

**Alexis**
- merge with steve 

**Ken / Felix**
- check for nans (think of any columns or variables that we may not need) 
- sort out geopandas

Felix wants to know the purpose of each dataframe so he can understand how the DFs are structured. 


**The objective of this notebook is to collate the code for cleaning the data below:**
1. Bond yields (F)
2. Interest rate (F)
3. Population (K)
4. Construction (K)
5. Weekly income (A)
6. Household size (A)

(please add features to the list if there are any additional ones)

**and merge all features into a complete feature set at the end.**

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

## Feature Cleaning

### 1. Bond Yields ###
`yields_join`

In [2]:
# read data in 
yields_data = "Files/Bond Yields/f02hist.xls"
yields = pd.read_excel(yields_data, sheet_name='Data', usecols='A:B,E', header=None, skiprows=range(0,12))
yields.columns = ['Date', '2yBonds%', '10yBonds%']
yields.head()

Unnamed: 0,Date,2yBonds%,10yBonds%
0,2013-07-31,2.5375,3.75
1,2013-08-31,2.5,3.86
2,2013-09-30,2.6875,3.995
3,2013-10-31,2.7075,3.97
4,2013-11-30,2.77,4.125


In [3]:
# split date column into year and quarter
dates = pd.to_datetime(yields["Date"])
yields["Year"] = dates.dt.year
yields["Quarter"] = dates.dt.quarter

# set datetime as index
yields.set_index('Date', inplace = True)

In [4]:
# resample to quarterly frequency, averaging each column within the quarter
yields_quarter_rates = yields.resample('QS').mean()
yields_quarter_rates.head(2)

Date,2yBonds%,10yBonds%,Year,Quarter
2013-07-01,2.575,3.868333,2013.0,3.0
2013-10-01,2.729167,4.1125,2013.0,4.0


In [5]:
# convert year and quarter to int
yields_quarter_rates["Year"] = yields_quarter_rates["Year"].astype(int)
yields_quarter_rates["Quarter"] = yields_quarter_rates["Quarter"].astype(int)

# create time period variable from 'Year' and 'Quarter'
yields_quarter_rates["time_period"] = yields_quarter_rates["Year"].map(str) + " Q" + yields_quarter_rates["Quarter"].map(str)

# Drop 'Year' and 'Quarter'
yields_join = yields_quarter_rates[["time_period", "2yBonds%", "10yBonds%"]] 
yields_join.head()

Date,time_period,2yBonds%,10yBonds%
2013-07-01,2013 Q3,2.575,3.868333
2013-10-01,2013 Q4,2.729167,4.1125
2014-01-01,2014 Q1,2.7,4.135
2014-04-01,2014 Q2,2.685,3.835833
2014-07-01,2014 Q3,2.583333,3.474167


The resulting cleaned bond yield df is **yields_join**.
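As a cross-check on the steps above, the quarterly label can also be built straight from the resampled index. This avoids averaging the `Year` and `Quarter` columns and casting them back to int. A minimal sketch on made-up monthly values (the `toy` frame and its numbers are illustrative and are not read from `f02hist.xls`):

```python
import pandas as pd

# Toy monthly series; the values are invented for illustration only.
toy = pd.DataFrame(
    {"rate": [2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(["2013-07-31", "2013-08-31", "2013-09-30", "2013-10-31"]),
)

# Average within each calendar quarter, labelled by the quarter start date.
toy_q = toy.resample("QS").mean()

# Build the "YYYY Qn" join key from the index itself.
toy_q["time_period"] = [f"{ts.year} Q{ts.quarter}" for ts in toy_q.index]

print(toy_q)
```

The resulting `time_period` strings follow the same "YYYY Qn" pattern used as the join key elsewhere in this notebook.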

----

### 2. Interest Rate ###
`rates_join`  

In [6]:
# read data in
interest = "Files/Interest Rates/f01d.xls"
interest = pd.read_excel(interest, sheet_name = "Data", usecols = "A:B", header = None, skiprows = range(0,12))
interest.columns = ['Date', 'Rate']

# get date time
dates = pd.to_datetime(interest["Date"])
interest["Year"] = dates.dt.year
interest["Quarter"] = dates.dt.quarter

# set date as index
interest.set_index('Date', inplace = True)


# calculate average
quarter_rates = interest.resample('QS').mean()

# convert year and quarter to int
quarter_rates["Year"] = quarter_rates["Year"].astype(int)
quarter_rates["Quarter"] = quarter_rates["Quarter"].astype(int)

# time period
quarter_rates["time_period"] = quarter_rates["Year"].map(str) + " Q" + quarter_rates["Quarter"].map(str)

# remove Year and Quarter
quarter_rates = quarter_rates[['Rate','time_period']]
quarter_rates.head()

Date,Rate,time_period
2011-01-01,4.75,2011 Q1
2011-04-01,4.75,2011 Q2
2011-07-01,4.75,2011 Q3
2011-10-01,4.52381,2011 Q4
2012-01-01,4.25,2012 Q1


In [7]:
# read data in
interest = "Files/Interest Rates/f01d.xls"

interest = pd.read_excel(interest, sheet_name = "Data", usecols = "A:B", header = None, skiprows = range(0,12))
interest.columns = ['Date', 'Rate']
interest.head()

Unnamed: 0,Date,Rate
0,2011-01-05,4.75
1,2011-01-06,4.75
2,2011-01-07,4.75
3,2011-01-10,4.75
4,2011-01-11,4.75


In [8]:
# check data types
interest.dtypes

Date    datetime64[ns]
Rate           float64
dtype: object

In [9]:
# split date column into year and quarter
dates = pd.to_datetime(interest["Date"])
interest["Year"] = dates.dt.year
interest["Quarter"] = dates.dt.quarter

# set datetime as index
interest.set_index('Date', inplace = True)

In [10]:
# resample to quarterly frequency, averaging each column within the quarter
quarter_rates = interest.resample('QS').mean()
quarter_rates.head(2)

Date,Rate,Year,Quarter
2011-01-01,4.75,2011.0,1.0
2011-04-01,4.75,2011.0,2.0


In [11]:
# convert year and quarter to int
quarter_rates["Year"] = quarter_rates["Year"].astype(int)
quarter_rates["Quarter"] = quarter_rates["Quarter"].astype(int)

# create time period variable for join from 'Year' and 'Quarter'
quarter_rates["time_period"] = quarter_rates["Year"].map(str) + " Q" + quarter_rates["Quarter"].map(str)
quarter_rates.head()

# Drop 'Year' and 'Quarter'
rates_join = quarter_rates[["time_period", "Rate"]]
rates_join.head(2)

Date,time_period,Rate
2011-01-01,2011 Q1,4.75
2011-04-01,2011 Q2,4.75


The resulting cleaned interest rate df is **rates_join**.

----

### 3. Population / Age bands 

#### 3.1 Household count
`hhold`  

In [13]:
# Read data
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_hhold_wide = pd.read_excel(popfile,sheet_name='LGA Household Totals',header=6,usecols="A:C",skipfooter=3)
df_hhold_wide.columns=['LGA','hhold_count_2016','hhold_count_2021']
df_hhold_wide['LGA'] = df_hhold_wide.LGA.str.split('(').str.get(0)
df_hhold_wide.head(3)

Unnamed: 0,LGA,hhold_count_2016,hhold_count_2021
0,Albury,21940,23227
1,Armidale Regional,11755,13041
2,Ballina,18178,19080


In [16]:
# Calculate difference in HH counts between 2016 and 2021
df_hhold_wide['hhold_count_delta'] = df_hhold_wide.hhold_count_2021 - df_hhold_wide.hhold_count_2016

# Shorten df name (copy, so the raw df_hhold_wide is not changed by later renaming)
hhold = df_hhold_wide.copy()
hhold.head()

Unnamed: 0,LGA,hhold_count_2016,hhold_count_2021,hhold_count_delta
0,Albury,21940,23227,1287
1,Armidale Regional,11755,13041,1286
2,Ballina,18178,19080,902
3,Balranald,963,1015,52
4,Bathurst Regional,16105,17351,1246


The resulting cleaned household count df is **hhold**.

Please note that some of the LGA names in hhold don't exactly match those in our master df. We will resolve this in a later section of this notebook. 
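One possible source of mismatches, assuming the raw names look like `Albury (C)`, is that splitting on `(` leaves a trailing space behind. The sketch below uses hypothetical labels (not read from the workbook) to show the effect and a way to normalise the names:

```python
import pandas as pd

# Hypothetical raw labels in the style "Name (suffix)"; illustrative only.
raw = pd.Series(["Albury (C)", "Armidale Regional (A)", "Ballina (A)"])

# Splitting on "(" alone keeps the space that preceded the bracket.
split_only = raw.str.split("(").str.get(0)

# Strip whitespace and upper-case so the names can be compared with the mapping df.
clean = split_only.str.strip().str.upper()

print(repr(split_only[0]), "->", repr(clean[0]))
```

If the real names do carry that trailing space, exact comparisons such as `hhold.LGA == 'Albury'` would silently match nothing, so stripping first is the safer order.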

----

#### 3.2 Population movement in 5 year period
`move`

In [23]:
# Read data
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_move = pd.read_excel(popfile,sheet_name='LGA population accounts', header=5, skipfooter=3, usecols="A:C")
df_move.columns=['LGA','pop_move','2016-2021']
df_move['LGA'] = df_move.LGA.str.split('(').str.get(0) #Get rid of the bracket in each LGA name
df_move.head(8)

Unnamed: 0,LGA,pop_move,2016-2021
0,Albury,Population at Start of Period,52171
1,Albury,Births,3390
2,Albury,Deaths,2219
3,Albury,Natural change,1171
4,Albury,Net Migration (all sources),1031
5,Albury,Population at End of Period,54374
6,Armidale Regional,Population at Start of Period,30313
7,Armidale Regional,Births,1768


In [25]:
# Pivot features from rows to columns
df_move_melt = pd.melt(df_move,id_vars=['LGA','pop_move'], value_vars=['2016-2021'], var_name='year')
df_move_pivot = df_move_melt.pivot(index=['LGA','year'], columns='pop_move', values='value').reset_index()

# Rename columns
df_move_pivot = df_move_pivot.rename(columns={'    Births':'Births_16_21', 
                                              '    Deaths':'Deaths_16_21',
                                              'Population at End of Period':'population_2021',
                                              'Population at Start of Period':'population_2016'})

# Drop the 'year' column
df_move_pivot = df_move_pivot.drop(columns=['year'],axis=1)
df_move_pivot.head(1)

pop_move,LGA,Births_16_21,Deaths_16_21,Natural change,Net Migration (all sources),population_2021,population_2016
0,Albury,3390.0,2219.0,1171.0,1031.0,54374.0,52171.0


In [28]:
# Create delta variable for population difference 
df_move_pivot['pop_delta'] = df_move_pivot['population_2021'] - df_move_pivot['population_2016']

# Shorten df name
move = df_move_pivot

move.head(1)

pop_move,LGA,Births_16_21,Deaths_16_21,Natural change,Net Migration (all sources),population_2021,population_2016,pop_delta
0,Albury,3390.0,2219.0,1171.0,1031.0,54374.0,52171.0,2203.0


The resulting cleaned population movement df is **move**.

Please note that some of the LGA names in move don't exactly match those in our master df. We will resolve this in a later section of this notebook. 

----

#### 3.3 Population Age
`age_bracket_delta`

In [32]:
# Read data
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_age = pd.read_excel(popfile,sheet_name='LGA Sex Age projections',header=5,usecols="A:E",skipfooter=3)
df_age.columns=['LGA','sex','age','2016','2021']
df_age['LGA'] = df_age.LGA.str.split('(').str.get(0)
df_age.head(20)

Unnamed: 0,LGA,sex,age,2016,2021
0,Albury,Female,00-04,1693,1661
1,Albury,Female,05-09,1597,1694
2,Albury,Female,10-14,1617,1647
3,Albury,Female,15-19,1705,1724
4,Albury,Female,20-24,1928,1785
5,Albury,Female,25-29,1766,1740
6,Albury,Female,30-34,1693,1781
7,Albury,Female,35-39,1623,1815
8,Albury,Female,40-44,1665,1651
9,Albury,Female,45-49,1708,1723


In [35]:
# Create a variable for change of population by age band
df_age['age_delta'] = df_age['2021'] - df_age['2016']

# Combine the population of both genders
df_age_pivot = pd.pivot_table(df_age,index=['LGA','age'], values=['2016','2021','age_delta'], 
               aggfunc=({'2016':np.sum, '2021':np.sum, 'age_delta':np.sum})).reset_index()
df_age_pivot['LGA'] = df_age_pivot['LGA'].str.strip()

df_age_pivot.head()

Unnamed: 0,LGA,age,2016,2021,age_delta
0,Albury,00-04,3505,3401,-104
1,Albury,05-09,3279,3510,231
2,Albury,10-14,3228,3370,142
3,Albury,15-19,3381,3306,-75
4,Albury,20-24,3744,3448,-296


We'll lump the age brackets into fewer bands for ease of future analysis:
* Child: 0-14
* Youth: 15-24
* Adult: 25-44
* MiddleAge: 45-64
* Senior: 65+
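The banding above can also be expressed as a rule on the lower bound of each bracket, which does not depend on the order in which the brackets appear. This is a sketch only: `age_band` is a hypothetical helper, and it assumes every label starts with a two-digit lower bound (for example `00-04`, or `85+` for the top bracket).

```python
import pandas as pd

def age_band(label: str) -> str:
    """Map a bracket label such as '00-04' or '85+' to one of the five bands."""
    lower = int(label[:2])  # assumes a two-digit lower bound at the start
    if lower < 15:
        return "Child"
    if lower < 25:
        return "Youth"
    if lower < 45:
        return "Adult"
    if lower < 65:
        return "MiddleAge"
    return "Senior"

# Illustrative labels; the '85+' format is an assumption about the source file.
labels = pd.Series(["00-04", "10-14", "15-19", "40-44", "45-49", "60-64", "65-69", "85+"])
bands = labels.map(age_band)
print(bands.tolist())
```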

In [38]:
# Group age brackets into 5 bands
Child = df_age_pivot.age.unique()[:3]
Youth = df_age_pivot.age.unique()[3:5]
Adult = df_age_pivot.age.unique()[5:9]
MiddleAge = df_age_pivot.age.unique()[9:13]
Senior = df_age_pivot.age.unique()[13:]

# Add ageband label
age_categ = [df_age_pivot['age'].isin(Child),
             df_age_pivot['age'].isin(Youth),
             df_age_pivot['age'].isin(Adult),
             df_age_pivot['age'].isin(MiddleAge),
             df_age_pivot['age'].isin(Senior)]

age_output = ['Child','Youth','Adult','MiddleAge','Senior']
df_age_pivot['age_bracket'] = np.select(age_categ,age_output)
df_age_pivot.head(20)

In [41]:
# Pivot by ageband and remove original groupings
age_bracket_delta = pd.pivot_table(df_age_pivot, index=['LGA'], columns=['age_bracket'], 
                                   values='age_delta', aggfunc=np.sum)

# Shorten df name
ageband = age_bracket_delta
ageband.reset_index(inplace=True)
ageband.head()

age_bracket,LGA,Adult,Child,MiddleAge,Senior,Youth
0,Albury,485,269,302,1519,-371
1,Armidale Regional,1501,195,211,630,-114
2,Ballina,-62,-186,5,1535,-48
3,Balranald,14,36,-29,106,-20
4,Bathurst Regional,716,-73,476,1062,-112


The resulting cleaned population by age bracket df is **ageband**.

Please note that some of the LGA names in ageband don't exactly match those in our master df. We will resolve this in a later section of this notebook. 

----

### 4. Construction  
`df_cons_clean`

In [43]:
# --read file, --rename columns
construction_file = "Files/Construction/Quarterly, Building construction prices rose, due to Homebuilder grants and government infrastructure investment.xlsx"
df_cons = pd.read_excel(construction_file,header=1,usecols="A:B", skipfooter=2)
df_cons.columns=['date','constr_index']

# --convert to datetime
df_cons['date'] = pd.to_datetime(df_cons['date'],format='%b-%y')

# --get year and quarter, --concatenate as time_period format, --drop other columns
df_cons['year'] = df_cons.date.dt.year
df_cons['quarter'] = df_cons.date.dt.quarter
df_cons['time_period'] = df_cons.year.map(str) + " Q" + df_cons.quarter.map(str)
df_cons_clean = df_cons.drop(columns=['date','year','quarter'],axis=1)
df_cons_clean.head()

Unnamed: 0,constr_index,time_period
0,100.1,2012 Q2
1,100.3,2012 Q3
2,100.2,2012 Q4
3,101.0,2013 Q1
4,101.6,2013 Q2


### 5. Weekly Income
`incp_gr`

In [44]:
# Read data into the raw df
census_INCP = "Files/Census/POA (UR) by INCP Toal Personal Income (Weekly).csv"

incp_raw = pd.read_csv(census_INCP, skiprows=9, nrows=11142,
                       usecols=['POA (UR)', 'INCP Total Personal Income (weekly)', 'Count'])

# Rename column for easier referencing
incp_cols = {'POA (UR)':'postcode', 'INCP Total Personal Income (weekly)':'INCP_WK'}
incp_raw.rename(columns=incp_cols, inplace=True)

# Unstack
incp = incp_raw.groupby(['postcode','INCP_WK'])['Count'].sum().unstack()

# Remove the last row (grand total)
incp = incp[:-1]

incp.head(2)

postcode,"$1,000-$1,249 ($52,000-$64,999)","$1,250-$1,499 ($65,000-$77,999)","$1,500-$1,749 ($78,000-$90,999)","$1,750-$1,999 ($91,000-$103,999)","$1-$149 ($1-$7,799)","$150-$299 ($7,800-$15,599)","$2,000-$2,999 ($104,000-$155,999)","$3,000 or more ($156,000 or more)","$300-$399 ($15,600-$20,799)","$400-$499 ($20,800-$25,999)","$500-$649 ($26,000-$33,799)","$650-$799 ($33,800-$41,599)","$800-$999 ($41,600-$51,999)",Negative income,Nil income,Not applicable,Not stated,Total
"2000, NSW",1676,1028,977,677,502,1112,1456,1796,1978,2201,1915,1599,1638,204,3297,1251,4115,27411
"2006, NSW",11,9,9,3,275,291,4,11,98,47,31,5,3,9,267,15,156,1261


In [45]:
# Remove 'NSW' in the index and cast postcode to int64
incp.reset_index(inplace=True)
incp['postcode'] = incp['postcode'].str.split(",").str.get(0)
incp['postcode'] = incp['postcode'].astype('int64')
incp = incp.set_index('postcode')

In [46]:
# Clean column names
income_cols= {'$1,000-$1,249 ($52,000-$64,999)' : '$1000-1249', 
            '$1,250-$1,499 ($65,000-$77,999)' : '$1250-1499',
            '$1,500-$1,749 ($78,000-$90,999)' : '$1500-1749 ', 
            '$1,750-$1,999 ($91,000-$103,999)': '$1750-1999',
            '$1-$149 ($1-$7,799)': '$1-149', 
            '$150-$299 ($7,800-$15,599)' : '$150-299',
            '$2,000-$2,999 ($104,000-$155,999)':'$2000-2999',
            '$3,000 or more ($156,000 or more)':'>=$3000', 
            '$300-$399 ($15,600-$20,799)':'$300-399',
            '$400-$499 ($20,800-$25,999)':'$400-499', 
            '$500-$649 ($26,000-$33,799)':'$500-649',
            '$650-$799 ($33,800-$41,599)':'$650-799', 
            '$800-$999 ($41,600-$51,999)':'$800-999'}
incp.rename(columns=income_cols, inplace=True)

# Combine 'not applicable' and 'not stated' into 'total_na'
incp['total_na'] = incp['Not applicable'] + incp['Not stated']

# Drop the 'Not applicable', 'Not stated' and 'Total' columns
incp = incp.drop(columns=['Not applicable', 'Not stated', 'Total'], axis=1)

# Reorder columns
cols = incp.columns.tolist()
cols = ['$1-149','$150-299','$300-399','$400-499','$500-649','$650-799',
        '$800-999','$1000-1249','$1250-1499','$1500-1749 ',
        '$1750-1999','$2000-2999','>=$3000',
        'Negative income','Nil income','total_na']
incp=incp[cols]

incp.head(1)

postcode,$1-149,$150-299,$300-399,$400-499,$500-649,$650-799,$800-999,$1000-1249,$1250-1499,$1500-1749,$1750-1999,$2000-2999,>=$3000,Negative income,Nil income,total_na
2000,502,1112,1978,2201,1915,1599,1638,1676,1028,977,677,1456,1796,204,3297,5366


In [47]:
# Create income buckets and save into incp_gr
incp['INCP_LOW'] = incp.iloc[:, 0:6].sum(axis=1)
incp['INCP_MID'] = incp.iloc[:, 6:10].sum(axis=1)
incp['INCP_HIGH'] = incp.iloc[:, 10:13].sum(axis=1)
incp['INCP_NEG_NIL'] = incp.iloc[:, 13:15].sum(axis=1)
incp_gr = incp[['INCP_LOW', 'INCP_MID', 'INCP_HIGH', 'INCP_NEG_NIL']]

# Reset index
incp_gr.reset_index(inplace=True)

incp_gr.head(1)

INCP_WK,postcode,INCP_LOW,INCP_MID,INCP_HIGH,INCP_NEG_NIL
0,2000,9307,5319,3929,3501


*The resulting cleaned df is <b>incp_gr</b>*
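The income buckets are built with positional slices (`iloc`), which depend on the column order set in the previous cell. A sketch of the same grouping by column name is shown below, using the postcode 2000 row from the output above. The `buckets` dict is illustrative, and the column `$1500-1749` is written here without the trailing space it carries in the notebook.

```python
import pandas as pd

# Row for postcode 2000, copied from the output above.
row = pd.DataFrame(
    {"$1-149": [502], "$150-299": [1112], "$300-399": [1978], "$400-499": [2201],
     "$500-649": [1915], "$650-799": [1599], "$800-999": [1638],
     "$1000-1249": [1676], "$1250-1499": [1028], "$1500-1749": [977],
     "$1750-1999": [677], "$2000-2999": [1456], ">=$3000": [1796],
     "Negative income": [204], "Nil income": [3297]},
    index=pd.Index([2000], name="postcode"),
)

# Bucket definitions by name, mirroring the iloc ranges 0:6, 6:10, 10:13, 13:15.
buckets = {
    "INCP_LOW": ["$1-149", "$150-299", "$300-399", "$400-499", "$500-649", "$650-799"],
    "INCP_MID": ["$800-999", "$1000-1249", "$1250-1499", "$1500-1749"],
    "INCP_HIGH": ["$1750-1999", "$2000-2999", ">=$3000"],
    "INCP_NEG_NIL": ["Negative income", "Nil income"],
}

grouped = pd.DataFrame({name: row[cols].sum(axis=1) for name, cols in buckets.items()})
print(grouped)
```

The totals match the `incp_gr` row shown above (9307, 5319, 3929, 3501), which suggests the positional slices and the named buckets agree for this column order.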

----

### 6. Household size
`cprf`

In [48]:
# Read data
census_cprf = "Files/Census/POA by CPRF Count of Persons in Family by STATE.xlsx"
cprf = pd.read_excel(census_cprf, sheet_name="Data Sheet 0", skiprows=9, nrows=619)

# Remove redundant rows and columns 
cprf = cprf[1:] #remove the first row
cprf = cprf.drop(columns='CPRF Count of Persons in Family') # remove the first column

# Rename columns
cprf_cols= {'Unnamed: 1' : 'postcode', 
            'Two persons in family' : 'CPRF_2',
            'Three persons in family' : 'CPRF_3', 
            'Four persons in family': 'CPRF_4',
            'Five persons in family': 'CPRF_5', 
            'Six or more persons in family' : 'CPRF_6+',
            'Not applicable':'CPRF_na',
            'Total' :'CPRF_TOTAL_FAM_NO'}
cprf.rename(columns=cprf_cols, inplace=True)

cprf.head(1)

Unnamed: 0,postcode,CPRF_2,CPRF_3,CPRF_4,CPRF_5,CPRF_6+,CPRF_na,CPRF_TOTAL_FAM_NO
1,"2000, NSW",3453.0,857.0,354.0,54.0,21.0,8125.0,12861.0


In [49]:
# Remove 'NSW' in the index and cast postcode to int64
cprf.reset_index(inplace=True)
cprf['postcode'] = cprf['postcode'].str.split(",").str.get(0)
cprf['postcode'] = cprf['postcode'].astype('int64')
cprf = cprf.set_index('postcode')
cprf = cprf.drop(columns='index', axis=1)

In [50]:
# Reset index for merging
cprf.reset_index(inplace=True)
cprf.head(1)

Unnamed: 0,postcode,CPRF_2,CPRF_3,CPRF_4,CPRF_5,CPRF_6+,CPRF_na,CPRF_TOTAL_FAM_NO
0,2000,3453.0,857.0,354.0,54.0,21.0,8125.0,12861.0


*The resulting cleaned df is <b>cprf</b>*

----

## LGA Mapping

The unique area identifier in the master housing df is postcode; however, some of our features (household counts, population age and movements) are grouped by LGA. Hence, we have two tasks at hand:
1. Map postcodes in the master df to LGA using the **mapping** data file.
2. Resolve the mismatch of LGA names between the **mapping** df and the **feature dfs** (`hhold`,`move`,`ageband`) before they can be harmonised

### 1. Map postcodes to LGA 

In [56]:
# Read master housing data we cleaned and merged earlier
master = pd.read_csv("Files/Cleaned/Master_Sales_Rent_2017Q4_2021Q1.csv")

# Number of postcodes in the master df
print("number of postcodes in the master housing df:", master['postcode'].nunique())

# Store unique postcodes in the master df to an array
pc_master = master['postcode'].unique()

number of postcodes in the master housing df: 587


In [53]:
# Read the mapping data into df
lga_poa = "Files/Area/Postcode_and_LGA.xlsx"
mapping = pd.read_excel(lga_poa, sheet_name="SuburbLGA", 
                        usecols=['lganame','councilnam','suburbname','postcode'])
# Drop na
mapping = mapping.dropna()

# Rename columns 
rename_cols= {'lganame':'LGA',
              'councilnam':'council', 
              'suburbname':'suburb'}
mapping.rename(columns=rename_cols,inplace=True)

mapping.head()

Unnamed: 0,LGA,council,suburb,postcode
0,ALBURY CITY,ALBURY CITY COUNCIL,ALBURY,2640.0
1,ALBURY CITY,ALBURY CITY COUNCIL,EAST ALBURY,2640.0
2,ALBURY CITY,ALBURY CITY COUNCIL,ETTAMOGAH,2640.0
3,ALBURY CITY,ALBURY CITY COUNCIL,GLENROY,2640.0
4,ALBURY CITY,ALBURY CITY COUNCIL,HAMILTON VALLEY,2641.0


In [57]:
# Number of postcodes in the mapping  df
print("number of postcodes in the mapping df:", mapping['postcode'].nunique())

# Store unique postcodes in the mapping df to an array
pc_map = mapping['postcode'].unique()

number of postcodes in the mapping df: 622


In [58]:
# Check if the mapping df covers all the postcodes in the housing df
pc_shared = list(set(pc_master).intersection(pc_map))
print("number of postcodes in both lists:", len(pc_shared))

number of postcodes in both lists: 587


GREAT NEWS - ALL POSTCODES IN THE MASTER SALES/RENT DF CAN BE FOUND IN THE MAPPING FILE
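The conclusion follows because the intersection has as many postcodes as the master list (587). A more direct way to express the same check is a set difference, sketched here on toy postcodes (the `_toy` arrays are illustrative, not the real data):

```python
import numpy as np

# Toy stand-ins for pc_master and pc_map; the mapping postcodes are floats,
# as in the mapping df above.
pc_master_toy = np.array([2000, 2006, 2640])
pc_map_toy = np.array([2000.0, 2006.0, 2640.0, 2641.0])

# Postcodes in the master list that the mapping does not cover.
missing = set(pc_master_toy.tolist()) - set(pc_map_toy.tolist())
print("postcodes missing from mapping:", sorted(missing))
```

An empty result means full coverage, and a non-empty one lists exactly which postcodes need attention.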

The next step is to sort out the mapping df and join it with the master df.

In [None]:
# Dealing with the unincorporated

unico = mapping.loc[mapping.LGA=='UNINCORPORATED']
print("number of unincorporated postcodes:", unico['postcode'].nunique())
unico = unico.sort_values(by='postcode', ascending=True)
unico

In [None]:
# import manually looked-up LGAs for the unincorporated ones

patch = ["CITY OF SYDNEY",
"CITY OF SYDNEY",
"CITY OF SYDNEY",
"CITY OF SYDNEY",
"CITY OF SYDNEY",
"CITY OF SYDNEY",
"WOOLLAHRA",
"WOOLLAHRA",
"WOOLLAHRA",
"INNER WEST",
"INNER WEST",
"CANADA BAY",
"CANADA BAY",
"NORTH SYDNEY",
"LANE COVE",
"MOSMAN",
"NORTH SYDNEY",
"NORTH SYDNEY",
"NORTHERN BEACHES",
"HUNTERS HILL",
"CITY OF RYDE",
"WENTWORTH SHIRE",
"WENTWORTH SHIRE",
"CENTRAL DARLING SHIRE",
"BOURKE SHIRE",
"CENTRAL DARLING SHIRE",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST",
"UNICORPORATED FAR WEST"]

# Add patch
unico['LGA patch'] = patch

# Drop LGA
unico = unico.drop(columns=['LGA'],axis=1)

# Rename patch as the new LGA
unico = unico.rename(columns={"LGA patch":"LGA"})

# Reorder columns
cols = ['LGA', 'council', 'suburb', 'postcode']
unico = unico[cols]

unico.head()

In [None]:
# Remove unincorporated from the original mapping df
mapping = mapping.loc[mapping['LGA'] != "UNINCORPORATED"]

# Append patched unico to mapping (DataFrame.append was removed in pandas 2.0)
mapping = pd.concat([mapping, unico])

# Save complete mapping df to csv
mapping.to_csv('Files/Area/Cleaned_LGA_Mapping_07102021.csv', index=False)

**UP TO THIS POINT, WE'VE SOLVED THE UNINCORPORATED ISSUES OF THE MAPPING FILE AND CAN START MERGING**

In [None]:
# Remove the suburb and council columns
mapping_reduce = mapping.drop(columns=['council','suburb'], axis=1)

# Drop duplicates
mapping_reduce = mapping_reduce.drop_duplicates()

mapping_reduce

In [None]:
# Join LGA from mapping with master
master_map = master.merge(mapping_reduce, left_on='postcode', right_on='postcode')

cols = ['postcode', 'LGA','skey', 'time_period', 'year', 'quarter', 'dwelling_type',
       'median_price', 'mean_price', 'sales_no', 'Qdelta_median',
       'Adelta_median', 'Qdelta_count', 'Adelta_count', 'rkey',
       'median_rent_newb', 'new_bonds_no', 'total_bonds_no',
       'Qdelta_median_rent', 'Qdelta_new_bonds', 'Adelta_median_rent',
       'Adelta_new_bonds']

master_map = master_map[cols]
master_map.head(1)

In [None]:
master_map.to_csv('Files/Cleaned/Master_Sales_Rent_2017Q4_2021Q1_wLGA.csv', index=False)

In [None]:
master_map.columns

### Solving the mismatch of LGA names between the master file and the feature DFs

#### `df_hhold_wide`

In [None]:
hhold.head()

In [None]:
print("Number of LGAs in the population df:", hhold['LGA'].nunique())
print("Number of LGAs in the mapping df:", mapping['LGA'].nunique())

In [None]:
# Find shared LGA in both 
LGA_map = list(mapping['LGA'].unique())
LGA_hhold = list(hhold['LGA'].unique())
LGA_HHOLD = list(map(str.upper, LGA_hhold)) # Convert LGA in hhold to uppercase

LGA_shared= list(set(LGA_map).intersection(LGA_HHOLD))
print(len(LGA_shared))

In [None]:
# Find LGA in hhold ONLY
set(LGA_HHOLD) - set(LGA_shared)

In [None]:
# Find LGA in mapping ONLY
set(LGA_map) - set(LGA_shared)

In [None]:
# Match name, except for unincorporated NSW

hhold.loc[hhold.LGA=='Albury', 'LGA'] = 'ALBURY CITY'
hhold.loc[hhold.LGA=='Lithgow', 'LGA'] = 'LITHGOW CITY'
hhold.loc[hhold.LGA=='Nambucca', 'LGA'] = 'NAMBUCCA VALLEY'
hhold.loc[hhold.LGA=='Parramatta', 'LGA'] = 'CITY OF PARRAMATTA'
hhold.loc[hhold.LGA=='Upper Hunter Shire', 'LGA'] = 'UPPER HUNTER'
hhold.loc[hhold.LGA=='Warrumbungle Shire', 'LGA'] = 'WARRUMBUNGLE'

LGA_hhold = list(hhold['LGA'].unique())
LGA_HHOLD = list(map(str.upper, LGA_hhold)) # Convert LGA in hhold to uppercase

LGA_shared= list(set(LGA_map).intersection(LGA_HHOLD))
print("number of LGAs matched:", len(LGA_shared))
print("number of LGAs in the hhold df:", hhold['LGA'].nunique())

In [None]:
# Find LGA in hhold ONLY
set(LGA_HHOLD) - set(LGA_shared)

In [None]:
# making the LGAs uppercase
hhold['LGA'] = hhold['LGA'].str.upper()
hhold.head()

#### `df_move_pivot`

In [None]:
LGA_move = list(move['LGA'].unique())
LGA_hhold = list(df_hhold_wide['LGA'].unique())

In [None]:
# Check if movement df and hhold df have the same LGAs

print("Number of LGAs in the move df:", len(LGA_move))
move_hhold = list(set(LGA_hhold).intersection(LGA_move))
print("Common LGAs in raw move and raw hhold df:", len(move_hhold))

So the same renaming procedure can be applied to the move df.
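Since the same six renames are repeated for each feature df, they could be collected in one place. The sketch below is one way to do that: `harmonise_lga` and `LGA_RENAMES` are hypothetical names, the renames are the ones used in this section, and the extra `str.strip()` is an added assumption to guard against stray whitespace.

```python
import pandas as pd

# Renames taken from the cells in this section.
LGA_RENAMES = {
    "Albury": "ALBURY CITY",
    "Lithgow": "LITHGOW CITY",
    "Nambucca": "NAMBUCCA VALLEY",
    "Parramatta": "CITY OF PARRAMATTA",
    "Upper Hunter Shire": "UPPER HUNTER",
    "Warrumbungle Shire": "WARRUMBUNGLE",
}

def harmonise_lga(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with LGA names renamed and upper-cased (hypothetical helper)."""
    out = df.copy()
    out["LGA"] = out["LGA"].str.strip().replace(LGA_RENAMES).str.upper()
    return out

# Toy frame for illustration only.
toy = pd.DataFrame({"LGA": ["Albury", "Ballina", "Upper Hunter Shire"]})
fixed = harmonise_lga(toy)
print(fixed["LGA"].tolist())
```

Because the helper works on a copy, the raw df stays available for the "raw vs raw" comparisons made in this section.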

In [None]:
# Match name, except for unincorporated NSW
move.loc[move.LGA=='Albury', 'LGA'] = 'ALBURY CITY'
move.loc[move.LGA=='Lithgow', 'LGA'] = 'LITHGOW CITY'
move.loc[move.LGA=='Nambucca', 'LGA'] = 'NAMBUCCA VALLEY'
move.loc[move.LGA=='Parramatta', 'LGA'] = 'CITY OF PARRAMATTA'
move.loc[move.LGA=='Upper Hunter Shire', 'LGA'] = 'UPPER HUNTER'
move.loc[move.LGA=='Warrumbungle Shire', 'LGA'] = 'WARRUMBUNGLE'

move['LGA'] = move['LGA'].str.upper()

move.head()

In [None]:
# Check matching result

LGA_MOVE = list(move['LGA'].unique())
move_map= list(set(LGA_map).intersection(LGA_MOVE))
print("matched LGA in mapping and move:", len(move_map))

#### `age_bracket_delta`


In [None]:
ageband.head()

In [None]:
# Check if ageband df and hhold df have the same LGAs
LGA_age = list(ageband['LGA'].unique())

print("Number of LGAs in the ageband df:", len(LGA_age))
age_hhold = list(set(LGA_hhold).intersection(LGA_age))
print("Common LGAs in raw ageband and raw hhold df:", len(age_hhold))

In [None]:
# Apply the same renaming procedure to ageband
ageband.loc[ageband.LGA=='Albury', 'LGA'] = 'ALBURY CITY'
ageband.loc[ageband.LGA=='Lithgow', 'LGA'] = 'LITHGOW CITY'
ageband.loc[ageband.LGA=='Nambucca', 'LGA'] = 'NAMBUCCA VALLEY'
ageband.loc[ageband.LGA=='Parramatta', 'LGA'] = 'CITY OF PARRAMATTA'
ageband.loc[ageband.LGA=='Upper Hunter Shire', 'LGA'] = 'UPPER HUNTER'
ageband.loc[ageband.LGA=='Warrumbungle Shire', 'LGA'] = 'WARRUMBUNGLE'

ageband['LGA'] = ageband['LGA'].str.upper()

ageband.head()

In [None]:
LGA_AGE = list(ageband['LGA'].unique())
LGA_AGE

In [None]:
# Check matching result

age_map= list(set(LGA_map).intersection(LGA_AGE))
print("matched LGA in mapping and ageband:", len(age_map))

## Merge the three population related data together
`features_lga`

In [None]:
hhold.head(1)

In [None]:
ageband = ageband.rename(columns={'age':'age_band', 
                        '2016':'age_2016',
                        '2021':'age_2021'})
ageband.head(5)

In [None]:
features_lga = hhold.merge(move, left_on='LGA', right_on='LGA')
features_lga = features_lga.merge(ageband, left_on='LGA', right_on='LGA')
features_lga.head()

In [None]:
print("df shape:", features_lga.shape)
print("number of LGAs in features_lga:", features_lga['LGA'].nunique())

In [None]:
features_lga.to_csv('Files/Population/Population_cleaned_081021.csv', index=False)

## Merge remaining features
`features_postcode`

`features_timePeriod`

--------

help with `merge()` https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

### `features_postcode`

In [None]:
len(set(incp_gr['postcode']))

In [None]:
len(set(cprf['postcode']))

In [None]:
features_postcode = pd.merge(incp_gr, cprf, on='postcode')
features_postcode.head()

In [None]:
len(set(features_postcode['postcode']))

### `features_timePeriod`

`pd.merge()` defaults to an inner join, so only `time_period` values present in both DFs are kept. For example, the latest quarter in `yields_join` is 2021 Q3 and the latest quarter in `df_cons_clean` is 2021 Q2, so the latest quarter in the merged DF is 2021 Q2. 
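To see which quarters an inner join drops, an outer join with `indicator=True` can be run alongside it. A small sketch on toy frames (the `left` and `right` frames and their values are made up for illustration):

```python
import pandas as pd

# Toy frames: `left` has one more quarter than `right`.
left = pd.DataFrame({"time_period": ["2021 Q1", "2021 Q2", "2021 Q3"], "a": [1, 2, 3]})
right = pd.DataFrame({"time_period": ["2021 Q1", "2021 Q2"], "b": [10, 20]})

# Default inner join: only keys present in both frames survive.
inner = pd.merge(left, right, on="time_period")

# Outer join with an indicator column to list the keys the inner join drops.
outer = pd.merge(left, right, on="time_period", how="outer", indicator=True)
dropped = outer.loc[outer["_merge"] != "both", "time_period"].tolist()

print(inner["time_period"].tolist(), dropped)
```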

In [None]:
features_timePeriod = pd.merge(yields_join, df_cons_clean, on='time_period')

Next, merge in the interest rates and check the number of unique time periods; `merge()` drops any `time_period` that is not shared.

In [None]:
# use the cleaned rates_join so 'Year' and 'Quarter' are not carried into the feature set
features_timePeriod = pd.merge(features_timePeriod, rates_join, on='time_period')
features_timePeriod.head()

In [None]:
len(set(features_timePeriod['time_period']))

In [None]:
len(set(quarter_rates['time_period']))

We now have 32 unique quarters (8 years). We started with 43 quarters. 

The merges above could be done in a single line of code, but they are kept separate here for clarity. See https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns
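For reference, one common way to chain several merges on the same key is `functools.reduce`. This is a sketch on toy frames, not the notebook's real data, and the column names `x`, `y`, `z` are placeholders:

```python
from functools import reduce

import pandas as pd

# Toy frames sharing a `time_period` key; values are illustrative.
frames = [
    pd.DataFrame({"time_period": ["2021 Q1", "2021 Q2"], "x": [1, 2]}),
    pd.DataFrame({"time_period": ["2021 Q1", "2021 Q2"], "y": [3, 4]}),
    pd.DataFrame({"time_period": ["2021 Q2"], "z": [5]}),
]

# Fold the list into one frame with successive inner joins on the key.
combined = reduce(lambda acc, nxt: pd.merge(acc, nxt, on="time_period"), frames)
print(combined)
```

As with the step-by-step version, only quarters present in every frame remain.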

### `features_lga` $\xrightarrow{merge}$ master

In [None]:
# recall master sales/rent df w. LGA
master_map.head()

In [None]:
master_LGA = list(master_map['LGA'].unique())
pop_LGA = list(features_lga['LGA'].unique())
shared_LGA = list(set(master_LGA).intersection(pop_LGA))
print("Number of shared LGAs in master df and population df:", len(shared_LGA))

Hence there should be 128 LGAs after merging
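The expectation can be checked mechanically. Passing `validate="many_to_one"` to the merge raises an error if an LGA appears more than once in the feature df, and the unique LGA count can then be compared after the merge. A sketch on toy frames (the `_toy` names and postcodes are illustrative; the deltas are taken from the `hhold` output above):

```python
import pandas as pd

# Toy master: several postcodes can share one LGA.
master_toy = pd.DataFrame(
    {"postcode": [2640, 2641, 2478], "LGA": ["ALBURY CITY", "ALBURY CITY", "BALLINA"]}
)

# Toy LGA-level features: exactly one row per LGA.
features_toy = pd.DataFrame(
    {"LGA": ["ALBURY CITY", "BALLINA"], "hhold_count_delta": [1287, 902]}
)

# many_to_one: the right-hand keys must be unique, otherwise pandas raises MergeError.
merged = master_toy.merge(features_toy, on="LGA", validate="many_to_one")
print(merged.shape, merged["LGA"].nunique())
```

With unique right-hand keys the merge cannot add rows to the master df, so the row count stays at or below the original.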

In [None]:
master_pop = master_map.merge(features_lga, left_on='LGA', right_on='LGA')
master_pop.head(1)

In [None]:
print("df shape:", master_pop.shape)
print("number of LGAs in master_pop:", master_pop['LGA'].nunique())

### `features_postcode`$\xrightarrow{merge}$ master

In [None]:
features_postcode.head()

In [None]:
master = master_pop

In [None]:
master = master.merge(features_postcode, left_on='postcode', right_on='postcode')

In [None]:
master.head()

In [None]:
print("df shape:", master.shape)
print("number of postcodes in master:", master['postcode'].nunique())

### `features_timePeriod`$\xrightarrow{merge}$ master

In [None]:
features_timePeriod.head()

In [None]:
master = master.merge(features_timePeriod, left_on='time_period', right_on='time_period')

In [None]:
master.head()

In [None]:
print("df shape:", master.shape)
print("number of time periods in master:", master['time_period'].nunique())

Saving as CSV

In [None]:
master.to_csv('Files/Cleaned/Master_Sales_Rent_2017Q4_2021Q1_wFeatures.csv', index=False)