This notebook cleans the __Construction cost index__ and __Population data__  

- Construction cost index is fairly straightforward
- Population data has a lot of categories

##### To do:
- Decide which population data to include
- Decide how to merge
- Clean LGA Map data inconsistencies:
    - Albury City vs Albury
    - City of Parramatta vs Parramatta
    - Nambucca Valley vs Nambucca
    - 'Unincorporated NSW' (meaning still to be confirmed)
    - Lord Howe Island

In [36]:
import pandas as pd
import numpy as np
import datetime as dt

# Construction cost index

In [37]:
# --read file, --rename columns
construction_file = "Files/Construction/Quarterly, Building construction prices rose, due to Homebuilder grants and government infrastructure investment.xlsx"
df_cons = pd.read_excel(construction_file,header=1,usecols="A:B", skipfooter=2)
df_cons.columns=['date','constr_index']

In [38]:
# --convert to datetime
df_cons['date'] = pd.to_datetime(df_cons['date'],format='%b-%y')

# --get year and quarter, --concatenate as time_period format, --drop other columns
df_cons['year'] = df_cons.date.dt.year
df_cons['quarter'] = df_cons.date.dt.quarter
df_cons['time_period'] = df_cons.year.map(str) + " Q" + df_cons.quarter.map(str)
df_cons_clean = df_cons.drop(columns=['date','year','quarter'])
df_cons_clean.head()

Unnamed: 0,constr_index,time_period
0,100.1,2012 Q2
1,100.3,2012 Q3
2,100.2,2012 Q4
3,101.0,2013 Q1
4,101.6,2013 Q2


#### Join statement:
Replace both xxx placeholders with the master dataframe:

xxx = pd.merge(xxx, df_cons_clean, on='time_period',how='left')
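As a minimal sketch of the join above, assuming the master dataframe carries a time_period column in the same "2020 Q4" string format: the toy frames below are hypothetical stand-ins, not the real files. A left merge keeps every master row, and `validate="m:1"` raises an error if the construction table ever contains a duplicated quarter.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical rows, same column names).
master_toy = pd.DataFrame({
    "postcode": [2000, 2007, 2000],
    "time_period": ["2020 Q4", "2020 Q4", "2021 Q1"],
})
cons_toy = pd.DataFrame({
    "constr_index": [116.9, 117.8],
    "time_period": ["2020 Q4", "2021 Q1"],
})

# Left join: keep every master row. validate="m:1" guards against duplicate
# quarters on the construction side, which would silently multiply master rows.
merged = pd.merge(master_toy, cons_toy, on="time_period", how="left", validate="m:1")

# Master rows without a matching quarter would show up as NaN in constr_index.
n_unmatched = int(merged["constr_index"].isna().sum())
print(merged)
print("unmatched rows:", n_unmatched)
```

Checking `n_unmatched` after the merge is a cheap way to spot quarters that the index file does not cover.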

# Population

### LGA to Postcode mapping file

Do we need suburb name?

In [39]:
SuburbLGA = "Files/Area/Postcode_and_LGA.xlsx"
postcodeLGA = pd.read_excel(SuburbLGA, usecols = "A, C, D") #suburbname optional

postcodeLGA = postcodeLGA.dropna()
postcodeLGA["postcode"] = postcodeLGA["postcode"].astype(int)
postcodeLGA["lganame"] = postcodeLGA.lganame.str.title()
postcodeLGA.head()

Unnamed: 0,lganame,suburbname,postcode
0,Albury City,ALBURY,2640
1,Albury City,EAST ALBURY,2640
2,Albury City,ETTAMOGAH,2640
3,Albury City,GLENROY,2640
4,Albury City,HAMILTON VALLEY,2641
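On the suburb-name question: if the master data is keyed by postcode only, one option is to drop suburbname, de-duplicate, and then check whether any postcode still maps to more than one LGA. This is a rough sketch on toy rows. The "Example Shire" entry is invented purely so that the check fires, and the column names are assumed to match postcodeLGA.

```python
import pandas as pd

# Toy rows in the shape of postcodeLGA. "Example Shire" / "SOMEWHERE" are
# made-up values illustrating a postcode that straddles two LGAs.
toy = pd.DataFrame({
    "lganame": ["Albury City", "Albury City", "Albury City", "Example Shire"],
    "suburbname": ["ALBURY", "EAST ALBURY", "HAMILTON VALLEY", "SOMEWHERE"],
    "postcode": [2640, 2640, 2641, 2641],
})

# Drop the suburb level and keep one row per (postcode, LGA) pair.
pc_lga = toy[["postcode", "lganame"]].drop_duplicates()

# Count LGAs per postcode; anything above 1 needs a rule before merging.
lga_per_pc = pc_lga.groupby("postcode")["lganame"].nunique()
ambiguous = lga_per_pc[lga_per_pc > 1].index.tolist()
print(pc_lga)
print("postcodes with more than one LGA:", ambiguous)
```

If the ambiguous list is non-empty on the real file, merging population data by postcode would duplicate rows unless one LGA is chosen per postcode.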


### Household per LGA, 2016 and 2021

Long format

In [40]:
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_hhold = pd.read_excel(popfile,sheet_name='LGA Household Totals',header=6,usecols="A:C",skipfooter=3)

# --convert wide to long format with melt, --rename cols, --clean LGA name
df_hhold = pd.melt(df_hhold, id_vars='Counting households', value_vars=[2016,2021])
df_hhold.columns=['LGA','year','hhold_count']
df_hhold['LGA'] = df_hhold.LGA.str.split('(').str.get(0)
df_hhold

Unnamed: 0,LGA,year,hhold_count
0,Albury,2016,21940
1,Armidale Regional,2016,11755
2,Ballina,2016,18178
3,Balranald,2016,963
4,Bathurst Regional,2016,16105
...,...,...,...
253,Wingecarribee,2021,20577
254,Wollondilly,2021,18402
255,Wollongong,2021,87168
256,Woollahra,2021,24009


Wide format

In [41]:
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_hhold_wide = pd.read_excel(popfile,sheet_name='LGA Household Totals',header=6,usecols="A:C",skipfooter=3)
df_hhold_wide.columns=['LGA','hhold_count_2016','hhold_count_2021']
df_hhold_wide['LGA'] = df_hhold_wide.LGA.str.split('(').str.get(0)

df_hhold_wide['hhold_count_delta'] = df_hhold_wide.hhold_count_2021 - df_hhold_wide.hhold_count_2016

df_hhold_wide.head()

Unnamed: 0,LGA,hhold_count_2016,hhold_count_2021,hhold_count_delta
0,Albury,21940,23227,1287
1,Armidale Regional,11755,13041,1286
2,Ballina,18178,19080,902
3,Balranald,963,1015,52
4,Bathurst Regional,16105,17351,1246


Check LGA Map name vs household count LGA Name

In [42]:
lgamap = pd.Series(postcodeLGA.lganame.unique())
lgadf = pd.Series(df_hhold.LGA.unique())

lgacomps = pd.concat([lgamap,lgadf],axis=1)
with pd.option_context('display.max_rows', None, 'display.max_columns', None): #force to display all data
    print(lgacomps)

                                          0                               1
0                               Albury City                         Albury 
1                         Armidale Regional              Armidale Regional 
2                                   Ballina                        Ballina 
3                                 Balranald                      Balranald 
4                         Bathurst Regional              Bathurst Regional 
5                                   Bayside                        Bayside 
6                               Bega Valley                    Bega Valley 
7                                 Bellingen                      Bellingen 
8                                  Berrigan                       Berrigan 
9                                 Blacktown                      Blacktown 
10                                    Bland                          Bland 
11                                  Blayney                        Blayney 
12          

There are some inconsistencies between the LGA mapping file and the household data:
- Albury City vs Albury
- City of Parramatta vs Parramatta
- Nambucca Valley vs Nambucca
- Unincorporated vs 'Unincorporated NSW'
- Lord Howe Island - Unincorporated Area
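One possible way to find and fix these mismatches without eyeballing the printout is a set difference followed by a small rename table. This is only a sketch: the name lists are a toy subset with assumed spellings, and the `rename` dictionary is hypothetical. The household names keep the trailing space left behind by splitting on "(", so both sides are stripped first.

```python
import pandas as pd

# Toy name lists (assumed spellings; household names carry a trailing space).
map_names = pd.Series(["Albury City", "Ballina", "City Of Parramatta", "Nambucca Valley"])
hhold_names = pd.Series(["Albury ", "Ballina ", "Parramatta ", "Nambucca "])

hhold_clean = hhold_names.str.strip()

# Names that only exist in the mapping file before any fix.
before = sorted(set(map_names.str.strip()) - set(hhold_clean))

# Hypothetical rename table: mapping-file spelling -> household-file spelling.
rename = {
    "Albury City": "Albury",
    "City Of Parramatta": "Parramatta",
    "Nambucca Valley": "Nambucca",
}
map_clean = map_names.str.strip().replace(rename)

# Both differences should be empty once the names are harmonised.
only_in_map = sorted(set(map_clean) - set(hhold_clean))
only_in_hhold = sorted(set(hhold_clean) - set(map_clean))
print("before:", before)
print("after:", only_in_map, only_in_hhold)
```

The unincorporated areas and Lord Howe Island are left out of the rename table here, since how to treat them is still an open question.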

### Population movement in 5 year period


In [43]:
df_move = pd.read_excel(popfile,sheet_name='LGA population accounts', header=5, skipfooter=3, usecols="A:C")
df_move.columns=['LGA','pop_move','2016-2021']
df_move.head() 

Unnamed: 0,LGA,pop_move,2016-2021
0,Albury (C),Population at Start of Period,52171
1,Albury (C),Births,3390
2,Albury (C),Deaths,2219
3,Albury (C),Natural change,1171
4,Albury (C),Net Migration (all sources),1031


In [44]:
df_move_melt = pd.melt(df_move,id_vars=['LGA','pop_move'], value_vars=['2016-2021'], var_name='year')
df_move_pivot = df_move_melt.pivot(index=['LGA','year'], columns='pop_move', values='value').reset_index()
df_move_pivot['LGA'] = df_move_pivot.LGA.str.split('(').str.get(0)
df_move_pivot['pop_delta'] = df_move_pivot['Population at End of Period'] - df_move_pivot['Population at Start of Period']
df_move_pivot.head()

pop_move,LGA,year,Births,Deaths,Natural change,Net Migration (all sources),Population at End of Period,Population at Start of Period,pop_delta
0,Albury,2016-2021,3390.0,2219.0,1171.0,1031.0,54374.0,52171.0,2203.0
1,Armidale Regional,2016-2021,1768.0,1266.0,502.0,1921.0,32736.0,30313.0,2423.0
2,Ballina,2016-2021,1790.0,2491.0,-701.0,1945.0,44237.0,42993.0,1244.0
3,Balranald,2016-2021,194.0,96.0,98.0,8.0,2437.0,2330.0,107.0
4,Bathurst Regional,2016-2021,2500.0,1710.0,790.0,1277.0,44310.0,42244.0,2066.0


### Population Age

In [45]:
popfile = "Files/Population/2019 NSW Population Projections ASGS 2019 LGA.xlsx"
df_age = pd.read_excel(popfile,sheet_name='LGA Sex Age projections',header=5,usecols="A:E",skipfooter=3)
df_age.columns=['LGA','sex','age','2016','2021']
df_age['age_delta'] = df_age['2021'] - df_age['2016']
df_age['LGA'] = df_age.LGA.str.split('(').str.get(0)
df_age

Unnamed: 0,LGA,sex,age,2016,2021,age_delta
0,Albury,Female,00-04,1693,1661,-32
1,Albury,Female,05-09,1597,1694,97
2,Albury,Female,10-14,1617,1647,30
3,Albury,Female,15-19,1705,1724,19
4,Albury,Female,20-24,1928,1785,-143
...,...,...,...,...,...,...
4639,Yass Valley,Male,65-69,472,484,12
4640,Yass Valley,Male,70-74,335,433,98
4641,Yass Valley,Male,75-79,209,295,86
4642,Yass Valley,Male,80-84,128,160,32


In [46]:
df_age_pivot = pd.pivot_table(df_age,index=['LGA','age'], values=['2016','2021','age_delta'], 
               aggfunc={'2016':'sum', '2021':'sum', 'age_delta':'sum'}).reset_index()
df_age_pivot.head()

Unnamed: 0,LGA,age,2016,2021,age_delta
0,Albury,00-04,3505,3401,-104
1,Albury,05-09,3279,3510,231
2,Albury,10-14,3228,3370,142
3,Albury,15-19,3381,3306,-75
4,Albury,20-24,3744,3448,-296


In [47]:
df_age_pivot.age.unique()

array(['00-04', '05-09', '10-14', '15-19', '20-24', '25-29', '30-34',
       '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69',
       '70-74', '75-79', '80-84', '85+'], dtype=object)

Group  
0-14: Child  
15-24: Youth  
25-44: Adult  
45-64: Middle Age  
65+ : Senior  

In [48]:
# --slice the age bands into clusters (relies on the sorted order returned by unique())

Child = df_age_pivot.age.unique()[:3]
Youth = df_age_pivot.age.unique()[3:5]
Adult = df_age_pivot.age.unique()[5:9]
MiddleAge = df_age_pivot.age.unique()[9:13]
Senior = df_age_pivot.age.unique()[13:]

print('Child',Child)
print('Youth',Youth)
print('Adult',Adult)
print('MiddleAge',MiddleAge)
print('Senior',Senior)

Child ['00-04' '05-09' '10-14']
Youth ['15-19' '20-24']
Adult ['25-29' '30-34' '35-39' '40-44']
MiddleAge ['45-49' '50-54' '55-59' '60-64']
Senior ['65-69' '70-74' '75-79' '80-84' '85+']


In [49]:
age_categ = [df_age_pivot['age'].isin(Child),
             df_age_pivot['age'].isin(Youth),
             df_age_pivot['age'].isin(Adult),
             df_age_pivot['age'].isin(MiddleAge),
             df_age_pivot['age'].isin(Senior)]
age_output = ['Child','Youth','Adult','MiddleAge','Senior']

df_age_pivot['age_bracket'] = np.select(age_categ,age_output)
df_age_pivot.head(20)

Unnamed: 0,LGA,age,2016,2021,age_delta,age_bracket
0,Albury,00-04,3505,3401,-104,Child
1,Albury,05-09,3279,3510,231,Child
2,Albury,10-14,3228,3370,142,Child
3,Albury,15-19,3381,3306,-75,Youth
4,Albury,20-24,3744,3448,-296,Youth
5,Albury,25-29,3485,3505,20,Adult
6,Albury,30-34,3400,3543,143,Adult
7,Albury,35-39,3143,3526,383,Adult
8,Albury,40-44,3206,3145,-61,Adult
9,Albury,45-49,3330,3284,-46,MiddleAge
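An alternative to positional slicing plus np.select is an explicit lookup from age band to bracket, which does not depend on the order of unique(). The sketch below follows the cluster printout above (Adult ends at 40-44) and uses a few of the Albury rows as toy data. The `bracket_of` dictionary and the final groupby are assumptions about a possible next step, not part of the original workflow.

```python
import pandas as pd

# Explicit band -> bracket lookup, mirroring the cluster printout above.
bracket_of = {
    "00-04": "Child", "05-09": "Child", "10-14": "Child",
    "15-19": "Youth", "20-24": "Youth",
    "25-29": "Adult", "30-34": "Adult", "35-39": "Adult", "40-44": "Adult",
    "45-49": "MiddleAge", "50-54": "MiddleAge", "55-59": "MiddleAge", "60-64": "MiddleAge",
    "65-69": "Senior", "70-74": "Senior", "75-79": "Senior", "80-84": "Senior", "85+": "Senior",
}

# Toy frame using four of the Albury rows shown above.
toy = pd.DataFrame({
    "LGA": ["Albury"] * 4,
    "age": ["00-04", "05-09", "15-19", "20-24"],
    "2016": [3505, 3279, 3381, 3744],
    "2021": [3401, 3510, 3306, 3448],
})
toy["age_bracket"] = toy["age"].map(bracket_of)

# With map(), a band missing from the lookup becomes NaN and is easy to spot.
n_unmapped = int(toy["age_bracket"].isna().sum())

# Collapse the five-year bands into one row per LGA and bracket.
by_bracket = toy.groupby(["LGA", "age_bracket"], as_index=False)[["2016", "2021"]].sum()
print(by_bracket)
print("unmapped bands:", n_unmapped)
```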


# Try merging

In [50]:
master = pd.read_csv("Files/Cleaned/Master_Sales_Rent_2017Q4_2021Q1.csv")
master.head()

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,Qdelta_count,Adelta_count,rkey,median_rent_newb,new_bonds_no,total_bonds_no,Qdelta_median_rent,Qdelta_new_bonds,Adelta_median_rent,Adelta_new_bonds
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,-0.325,-0.3112,r121,640.0,1169.0,7914.0,-0.2,0.5545,,
1,2000,s122,2017 Q3,2017,Q3,Strata,1350.0,1516.328059,135.0,0.1431,...,-0.3182,-0.2703,r121,640.0,1169.0,7914.0,-0.2,0.5545,,
2,2007,s122,2017 Q3,2017,Q3,Total,817.5,804.448333,36.0,-0.109,...,-0.4462,-0.3455,r121,535.0,301.0,2231.0,-0.1371,1.1049,,
3,2007,s122,2017 Q3,2017,Q3,Strata,817.5,804.448333,36.0,-0.0815,...,-0.4194,-0.3208,r121,535.0,301.0,2231.0,-0.1371,1.1049,,
4,2008,s122,2017 Q3,2017,Q3,Total,995.0,1061.807024,41.0,-0.0005,...,-0.1087,-0.1277,r121,479.0,762.0,5020.0,-0.1812,1.3374,,


In [51]:
master.isna().sum()

postcode                 0
skey                     0
time_period              0
year                     0
quarter                  0
dwelling_type            0
median_price             0
mean_price               0
sales_no                 0
Qdelta_median         5857
Adelta_median         5861
Qdelta_count          5857
Adelta_count          5861
rkey                     6
median_rent_newb         6
new_bonds_no             6
total_bonds_no           6
Qdelta_median_rent    3885
Qdelta_new_bonds      3706
Adelta_median_rent    3908
Adelta_new_bonds      3908
dtype: int64

In [52]:
master[master.isna().any(axis=1)]

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,Qdelta_median,...,Qdelta_count,Adelta_count,rkey,median_rent_newb,new_bonds_no,total_bonds_no,Qdelta_median_rent,Qdelta_new_bonds,Adelta_median_rent,Adelta_new_bonds
0,2000,s122,2017 Q3,2017,Q3,Total,1350.0,1516.328059,135.0,0.1345,...,-0.3250,-0.3112,r121,640.0,1169.0,7914.0,-0.2000,0.5545,,
1,2000,s122,2017 Q3,2017,Q3,Strata,1350.0,1516.328059,135.0,0.1431,...,-0.3182,-0.2703,r121,640.0,1169.0,7914.0,-0.2000,0.5545,,
2,2007,s122,2017 Q3,2017,Q3,Total,817.5,804.448333,36.0,-0.1090,...,-0.4462,-0.3455,r121,535.0,301.0,2231.0,-0.1371,1.1049,,
3,2007,s122,2017 Q3,2017,Q3,Strata,817.5,804.448333,36.0,-0.0815,...,-0.4194,-0.3208,r121,535.0,301.0,2231.0,-0.1371,1.1049,,
4,2008,s122,2017 Q3,2017,Q3,Total,995.0,1061.807024,41.0,-0.0005,...,-0.1087,-0.1277,r121,479.0,762.0,5020.0,-0.1812,1.3374,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20708,2875,s136,2021 Q1,2021,Q1,Non Strata,843.0,888.000000,5.0,,...,,,r135,450.0,5.0,20.0,,,,
20709,2876,s136,2021 Q1,2021,Q1,Total,760.5,760.500000,5.0,,...,,,r135,450.0,5.0,5.0,,,,
20710,2876,s136,2021 Q1,2021,Q1,Non Strata,843.0,888.000000,5.0,,...,,,r135,450.0,5.0,5.0,,,,
20713,2879,s136,2021 Q1,2021,Q1,Total,760.5,760.500000,5.0,,...,,,r135,450.0,5.0,5.0,,,,


In [53]:
# Create a subset that only contains 2 Quarters of data (FOR TESTING PURPOSE)
subset = master.loc[master['time_period'].isin(['2020 Q4', '2021 Q1'])]

# Drop some of the (potentially) unnecessary variables
subset = subset.drop(columns=['Qdelta_median','Adelta_median','Qdelta_count','Adelta_count',
                              'Qdelta_median_rent', 'Adelta_median_rent','Qdelta_new_bonds','Adelta_new_bonds'])

# And only keep 'Total' dwelling type (i.e. get rid of Strata and Non-strata)
subset_total = subset.loc[subset['dwelling_type'] == 'Total']

In [54]:
subset_total.head(1)

Unnamed: 0,postcode,skey,time_period,year,quarter,dwelling_type,median_price,mean_price,sales_no,rkey,median_rent_newb,new_bonds_no,total_bonds_no
17831,2000,s135,2020 Q4,2020,Q4,Total,1110.0,1379.0,155.0,r134,550.0,1705.0,9140.0


#### Merge

In [55]:
try_merge = subset_total.merge(df_cons_clean,on='time_period', how='inner')

In [56]:
#pivoting to get 'Steve DF'
pivot = try_merge.pivot_table(index='postcode', columns='time_period', values=
                                 ['median_price', 'mean_price', 'sales_no', 
                                  'median_rent_newb','new_bonds_no', 'total_bonds_no','constr_index'])

pivot.head()

Unnamed: 0_level_0,constr_index,constr_index,mean_price,mean_price,median_price,median_price,median_rent_newb,median_rent_newb,new_bonds_no,new_bonds_no,sales_no,sales_no,total_bonds_no,total_bonds_no
time_period,2020 Q4,2021 Q1,2020 Q4,2021 Q1,2020 Q4,2021 Q1,2020 Q4,2021 Q1,2020 Q4,2021 Q1,2020 Q4,2021 Q1,2020 Q4,2021 Q1
postcode,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2
2000,116.9,117.8,1379.0,2794.0,1110.0,1371.0,550.0,600.0,1705.0,1469.0,155.0,184.0,9140.0,9327.0
2007,116.9,117.8,677.0,754.0,651.0,763.0,480.0,455.0,347.0,388.0,20.0,20.0,1979.0,2024.0
2008,116.9,117.8,1184.0,937.0,991.0,855.0,490.0,500.0,915.0,987.0,45.0,49.0,4514.0,4228.0
2009,116.9,117.8,1661.0,1427.0,1075.0,1188.0,550.0,600.0,543.0,471.0,75.0,54.0,3247.0,3333.0
2010,116.9,117.8,2267.0,1371.0,1240.0,1201.0,518.0,525.0,1158.0,1139.0,211.0,218.0,8521.0,8626.0


In [57]:
pivot.columns = [' '.join(col) for col in pivot.columns] # flatten the MultiIndex; only run once (a rerun would space out every character)
pivot.head()

Unnamed: 0_level_0,constr_index 2020 Q4,constr_index 2021 Q1,mean_price 2020 Q4,mean_price 2021 Q1,median_price 2020 Q4,median_price 2021 Q1,median_rent_newb 2020 Q4,median_rent_newb 2021 Q1,new_bonds_no 2020 Q4,new_bonds_no 2021 Q1,sales_no 2020 Q4,sales_no 2021 Q1,total_bonds_no 2020 Q4,total_bonds_no 2021 Q1
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2000,116.9,117.8,1379.0,2794.0,1110.0,1371.0,550.0,600.0,1705.0,1469.0,155.0,184.0,9140.0,9327.0
2007,116.9,117.8,677.0,754.0,651.0,763.0,480.0,455.0,347.0,388.0,20.0,20.0,1979.0,2024.0
2008,116.9,117.8,1184.0,937.0,991.0,855.0,490.0,500.0,915.0,987.0,45.0,49.0,4514.0,4228.0
2009,116.9,117.8,1661.0,1427.0,1075.0,1188.0,550.0,600.0,543.0,471.0,75.0,54.0,3247.0,3333.0
2010,116.9,117.8,2267.0,1371.0,1240.0,1201.0,518.0,525.0,1158.0,1139.0,211.0,218.0,8521.0,8626.0
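The column flattening above breaks if the cell is run twice, because joining an already-flat string inserts a space between every character. A guarded version could look like the sketch below. `flatten_columns` is a hypothetical helper name, and the toy pivot only imitates the (value, time_period) column layout.

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Join MultiIndex column levels with a space; leave flat columns alone."""
    if isinstance(df.columns, pd.MultiIndex):
        df = df.copy()
        df.columns = [" ".join(map(str, col)) for col in df.columns]
    return df

# Toy pivot with the same (value, time_period) column layout as above.
toy = pd.DataFrame(
    [[116.9, 117.8]],
    index=pd.Index([2000], name="postcode"),
    columns=pd.MultiIndex.from_tuples(
        [("constr_index", "2020 Q4"), ("constr_index", "2021 Q1")],
        names=[None, "time_period"],
    ),
)

once = flatten_columns(toy)
twice = flatten_columns(once)  # no-op: the columns are already flat
print(list(twice.columns))
```

Because the helper returns a new frame and checks the column type first, rerunning the cell is harmless.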


In [58]:
pivot.loc[pivot.isna().any(axis=1)]

Unnamed: 0_level_0,constr_index 2020 Q4,constr_index 2021 Q1,mean_price 2020 Q4,mean_price 2021 Q1,median_price 2020 Q4,median_price 2021 Q1,median_rent_newb 2020 Q4,median_rent_newb 2021 Q1,new_bonds_no 2020 Q4,new_bonds_no 2021 Q1,sales_no 2020 Q4,sales_no 2021 Q1,total_bonds_no 2020 Q4,total_bonds_no 2021 Q1
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2175,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,84.0,
2311,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,20.0,
2342,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,5.0,
2345,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,20.0,
2355,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,20.0,
2369,,117.8,,760.5,,760.5,,450.0,,5.0,,5.0,,20.0
2403,,117.8,,760.5,,760.5,,450.0,,5.0,,5.0,,20.0
2409,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,20.0,
2415,,117.8,,760.5,,760.5,,450.0,,5.0,,5.0,,20.0
2424,116.9,,733.0,,733.0,,450.0,,5.0,,5.0,,5.0,


### Try loading cleaned data


In [59]:
clean_sr = pd.read_csv('Files/Cleaned/Pivot_Sales_Rent_5Quarters_SharedPOA.csv')
clean_sr.head()

Unnamed: 0,mean_price 2020 Q1,mean_price 2020 Q2,mean_price 2020 Q3,mean_price 2020 Q4,mean_price 2021 Q1,median_price 2020 Q1,median_price 2020 Q2,median_price 2020 Q3,median_price 2020 Q4,median_price 2021 Q1,...,sales_no 2020 Q1,sales_no 2020 Q2,sales_no 2020 Q3,sales_no 2020 Q4,sales_no 2021 Q1,total_bonds_no 2020 Q1,total_bonds_no 2020 Q2,total_bonds_no 2020 Q3,total_bonds_no 2020 Q4,total_bonds_no 2021 Q1
0,1541.0,1322.0,1631.0,1379.0,2794.0,1225.0,1000.0,1390.0,1110.0,1371.0,...,105.0,74.0,100.0,155.0,184.0,8615.0,7595.0,8069.0,9140.0,9327.0
1,834.0,739.0,658.0,677.0,754.0,745.0,775.0,655.0,651.0,763.0,...,20.0,20.0,20.0,20.0,20.0,2116.0,1810.0,1876.0,1979.0,2024.0
2,956.0,1144.0,985.0,1184.0,937.0,750.0,1173.0,890.0,991.0,855.0,...,35.0,20.0,36.0,45.0,49.0,4613.0,4286.0,4116.0,4514.0,4228.0
3,1277.0,1282.0,1373.0,1661.0,1427.0,986.0,1100.0,1085.0,1075.0,1188.0,...,20.0,32.0,56.0,75.0,54.0,2913.0,2793.0,2913.0,3247.0,3333.0
4,1357.0,1395.0,1476.0,2267.0,1371.0,1280.0,1325.0,1270.0,1240.0,1201.0,...,151.0,125.0,157.0,211.0,218.0,8338.0,8079.0,8182.0,8521.0,8626.0
