**NOTE & UPDATE:**

*Alexis (8:00 1-Sep):*
* Tried to find the suburbs with the most total rental bonds (no. 1 is 2170 Liverpool?) and the most new rental bonds (no. 1 is 2017 Waterloo/Zetland - where I live!) in Q1 2021
* Fancy map visualisation?

*Alexis (8:00 31-AUG):*
* Looked at rent data - it has more granularity to it:
    * Each postcode is first grouped by **4 dwelling types**: house, townhouse, flat/unit and other
    * Each dwelling type is then grouped by **'number of bedrooms'**: 1, 2, 3, 4, Bedsitter(?) or na
    These can be used to compare rents for the same dwelling type / bedroom count across postcodes, which is a piece of analysis in itself
    
    
* We can also explore correlation between rent and sales price of the same postcode - from a time series point of view
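One way the rent vs. sales correlation idea could be sketched is shown below. It assumes several quarters of rent data have been loaded and labelled with the same `time_period` values as the sales data, which this notebook does not do yet. The numbers are made up for illustration and do not come from the FACS tables. The column names `50%_price` and `50%_wrent_newb` follow the renaming used later in this notebook.

```python
import pandas as pd

periods = ["Q2 2020", "Q3 2020", "Q4 2020", "Q1 2021"]

# Hypothetical median sales prices (AUD 000s) - illustrative values only
sales = pd.DataFrame({
    "postcode": [2000] * 4 + [2017] * 4,
    "time_period": periods * 2,
    "50%_price": [1200.0, 1220.0, 1250.0, 1300.0, 900.0, 890.0, 910.0, 950.0],
})

# Hypothetical median weekly rents (AUD) - illustrative values only
rent = pd.DataFrame({
    "postcode": [2000] * 4 + [2017] * 4,
    "time_period": periods * 2,
    "50%_wrent_newb": [700.0, 690.0, 680.0, 700.0, 600.0, 590.0, 610.0, 640.0],
})

# Align the two series on postcode and quarter
merged = sales.merge(rent, on=["postcode", "time_period"], how="inner")

# Pearson correlation between median price and median rent, per postcode
corr_by_pc = (
    merged.groupby("postcode")[["50%_price", "50%_wrent_newb"]]
    .apply(lambda g: g["50%_price"].corr(g["50%_wrent_newb"]))
)
print(corr_by_pc)
```

With only a handful of quarters per postcode the correlation will be very noisy, so this is probably only useful once more back issues are loaded.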

**DATA SOURCE:**

[NSW Housing Rent and Sales](https://www.facs.nsw.gov.au/resources/statistics/rent-and-sales/back-issues)

Sales data - renamed vs. original variable names:
* <b>dwelling_type</b>: Dwelling Type
* <b>25%_price</b>: First Quartile Sales Price (AUD 000s)
* <b>50%_price</b>: Median Sales Price (AUD 000s)
* <b>75%_price</b>: Third Quartile Sales Price (AUD 000s)
* <b>mean_price</b>: Mean Sales Price (AUD 000s)
* <b>sales_no</b>: Number of Sales
* <b>Qdelta_median</b>: Qtly change in Median
* <b>Adelta_median</b>: Annual change in Median
* <b>Qdelta_count</b>: Qtly change in Count
* <b>Adelta_count</b>: Annual change in Count

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

### Understand the data structure of the sales and rent data

**Sales data**

In [2]:
s136 = "Files/Sales/Issue-136-Sales-tables-March-2021-quarter.xlsx"
s135 = "Files/Sales/Issue-135-Sales-tables-December-2020-quarter.xlsx"
s134 = "Files/Sales/Issue-134-Sales-tables-September-2020-quarter.xlsx"
s133 = "Files/Sales/Issue-133-Sales-tables-June-2020-quarter.xlsx"
s132 = "Files/Sales/Issue-132-Sales-tables-March-2020-quarter.xlsx"

# Read the "Postcode" sheet of each workbook into its own dataframe
s136 = pd.read_excel(s136, sheet_name="Postcode", na_values='-', header=6)
s135 = pd.read_excel(s135, sheet_name="Postcode", na_values='-', header=6)
s134 = pd.read_excel(s134, sheet_name="Postcode", na_values='-', header=6)
s133 = pd.read_excel(s133, sheet_name="Postcode", na_values='-', header=6)
s132 = pd.read_excel(s132, sheet_name="Postcode", na_values='-', header=6)

# Sale prices in any geographical area where the number of sales is 10 or less were not shown for confidentiality
# They were represented as '-' in the table

print("Q1 2021(s136):", s136.shape)
print("Q4 2020(s135):", s135.shape)
print("Q3 2020(s134):", s134.shape)
print("Q2 2020(s133):", s133.shape)
print("Q1 2020(s132):", s132.shape)

Q1 2021(s136): (1427, 11)
Q4 2020(s135): (1459, 11)
Q3 2020(s134): (1419, 11)
Q2 2020(s133): (1332, 11)
Q1 2020(s132): (1361, 11)


In [4]:
# Add time period and key columns before merging

s136['key'] = 's136'
s135['key'] = 's135'
s134['key'] = 's134'
s133['key'] = 's133'
s132['key'] = 's132'

s136['time_period'] = 'Q1 2021'
s135['time_period'] = 'Q4 2020'
s134['time_period'] = 'Q3 2020'
s133['time_period'] = 'Q2 2020'
s132['time_period'] = 'Q1 2020'

s136['year'] = '2021'
s135['year'] = '2020'
s134['year'] = '2020'
s133['year'] = '2020'
s132['year'] = '2020'

s136['quarter'] = '1'
s135['quarter'] = '4'
s134['quarter'] = '3'
s133['quarter'] = '2'
s132['quarter'] = '1'
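The labels set by hand above all follow from the file names, so they could be derived rather than typed. The sketch below is one possible approach; `parse_issue` and `MONTH_TO_QUARTER` are hypothetical helpers, not part of the notebook, and the pattern assumes the file naming seen in the paths above.

```python
import re

# Month in the file name -> calendar quarter (as used in this notebook)
MONTH_TO_QUARTER = {"March": "1", "June": "2", "September": "3", "December": "4"}

def parse_issue(path):
    """Derive key / time_period / year / quarter labels from a sales file name."""
    m = re.search(r"Issue-(\d+)-Sales-tables-([A-Za-z]+)-(\d{4})-quarter", path)
    if m is None:
        raise ValueError(f"Unexpected file name: {path}")
    issue, month, year = m.groups()
    quarter = MONTH_TO_QUARTER[month]
    return {
        "key": f"s{issue}",
        "time_period": f"Q{quarter} {year}",
        "year": year,
        "quarter": quarter,
    }

labels = parse_issue("Files/Sales/Issue-136-Sales-tables-March-2021-quarter.xlsx")
print(labels)
```

The returned dict could then be attached to a freshly read dataframe with something like `df.assign(**parse_issue(path))`, which would replace the four blocks of per-file assignments.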

In [11]:
# Merge sales file into one master file
frames = [s132, s133, s134, s135, s136]
sales_master = pd.concat(frames)

# Check master sales data's shape and dtypes
print("sales_master:", sales_master.shape, "\n")
print(sales_master.dtypes)

sales_master: (6998, 15) 

Postcode                                int64
Dwelling Type                          object
First Quartile Sales Price\n$'000s    float64
Median Sales Price\n$'000s            float64
Third Quartile Sales Price\n'000s     float64
Mean Sales Price\n$'000s              float64
Sales\nNo.                             object
Qtly change in Median                 float64
Annual change in Median               float64
Qtly change in Count                  float64
Annual change in Count                float64
time_period                            object
year                                   object
quarter                                object
key                                    object
dtype: object


In [14]:
# Rename column for easier referencing
rename_cols= {'Postcode':'postcode', 
             'Dwelling Type':'dwelling_type', 
             "First Quartile Sales Price\n$'000s" : '25%_price',
             "Median Sales Price\n$'000s" : '50%_price', 
             "Third Quartile Sales Price\n'000s" : '75%_price',
             "Mean Sales Price\n$'000s" : 'mean_price',
             'Sales\nNo.':'sales_no',
             'Qtly change in Median':'Qdelta_median',
             'Annual change in Median':'Adelta_median',
             'Qtly change in Count':'Qdelta_count',
             'Annual change in Count':'Adelta_count'}

sales_master.rename(columns=rename_cols, inplace=True)
sales_master.head(5)

Unnamed: 0,postcode,dwelling_type,25%_price,50%_price,75%_price,mean_price,sales_no,Qdelta_median,Adelta_median,Qdelta_count,Adelta_count,time_period,year,quarter,key
0,2000,Total,900.0,1225.0,1950.0,1541.0,105,-0.02,0.0524,-0.3787,-0.0278,Q1 2020,2020,1,s132
1,2000,Non Strata,,,,,,,,,,Q1 2020,2020,1,s132
2,2000,Strata,920.0,1238.0,1950.0,1584.0,102,-0.01,0.0668,-0.3964,-0.0286,Q1 2020,2020,1,s132
3,2007,Total,611.0,745.0,1039.0,834.0,s,-0.1312,0.1622,-0.5652,0.1765,Q1 2020,2020,1,s132
4,2007,Strata,611.0,745.0,1039.0,834.0,s,-0.1183,0.1622,-0.5455,0.1765,Q1 2020,2020,1,s132


Note that each postcode has a total row, a row for strata properties, and a row for non-strata properties.

Need to use groupby to tease them apart later.
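A minimal sketch of what teasing the dwelling types apart might look like, using a few rows copied from the `head()` output above (the frame name `sample` is just for illustration). It shows two options: splitting into one dataframe per dwelling type with `groupby`, or pivoting so each dwelling type becomes a column.

```python
import pandas as pd

# A few rows taken from the head() output above
sample = pd.DataFrame({
    "postcode": [2000, 2000, 2007, 2007],
    "dwelling_type": ["Total", "Strata", "Total", "Strata"],
    "50%_price": [1225.0, 1238.0, 745.0, 745.0],
    "time_period": ["Q1 2020"] * 4,
})

# Option 1: one dataframe per dwelling type
by_type = {name: grp for name, grp in sample.groupby("dwelling_type")}

# Option 2: one row per postcode/quarter, one column per dwelling type
wide = sample.pivot_table(
    index=["postcode", "time_period"],
    columns="dwelling_type",
    values="50%_price",
)
print(wide)
```

The wide layout makes it easy to compare strata and non-strata medians side by side within a postcode.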

In [16]:
# Check NaN counts - values are suppressed ('-') where a postcode has 10 or fewer sales
sales_master.isnull().sum()

postcode            0
dwelling_type       0
25%_price        1847
50%_price        1847
75%_price        1847
mean_price       1847
sales_no         1847
Qdelta_median    1850
Adelta_median    1855
Qdelta_count     1850
Adelta_count     1855
time_period         0
year                0
quarter             0
key                 0
dtype: int64

In [17]:
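# Drop rows with no sales count (the suppressed rows);
# .copy() avoids SettingWithCopyWarning on the assignments further down
sales_master = sales_master[sales_master['sales_no'].notna()].copy()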

# Check data type and df shape after dropna
print("data types:", sales_master.dtypes, "\n")
print("shape: ", sales_master.shape)

data types: postcode           int64
dwelling_type     object
25%_price        float64
50%_price        float64
75%_price        float64
mean_price       float64
sales_no          object
Qdelta_median    float64
Adelta_median    float64
Qdelta_count     float64
Adelta_count     float64
time_period       object
year              object
quarter           object
key               object
dtype: object 

shape:  (5151, 15)


`sales_no` was read into the dataframe as an object (mixed numbers and strings) column because, according to the Explanatory Note, "statistics calculated from sample sizes between 10 and 30 are shown by an ‘s’ in the relevant table.  We suggest these data are treated with caution, particularly when assessing quarterly and annual changes."
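Since the note asks for these rows to be treated with caution, it may be worth remembering which rows were marked 's' before the marker is overwritten. The sketch below is an alternative to imputing a single number, not what this notebook does: it keeps a boolean flag (`small_sample`, a hypothetical column name) and leaves the unknown count as NaN.

```python
import pandas as pd

# Two rows mimicking the mixed sales_no column seen in the head() output
sample = pd.DataFrame({"postcode": [2000, 2007], "sales_no": [105, "s"]})

# Flag rows whose count was reported only as 's' (sample size between 10 and 30)
sample["small_sample"] = sample["sales_no"].eq("s")

# Convert to numeric; the 's' marker becomes NaN instead of an imputed value
sample["sales_no"] = pd.to_numeric(sample["sales_no"], errors="coerce")
print(sample)
```

The flag can later be used to exclude or down-weight these rows when looking at quarterly and annual changes.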

In [19]:
# Replace 's' with 20, the midpoint of the 10-30 range, since there are quite a few such rows
sales_master.loc[sales_master['sales_no'] == 's', 'sales_no'] = 20.0

# Cast type as float
sales_master['sales_no'] = sales_master['sales_no'].astype(float)

sales_master.describe().round(2)

Unnamed: 0,postcode,25%_price,50%_price,75%_price,mean_price,sales_no,Qdelta_median,Adelta_median,Qdelta_count,Adelta_count
count,5151.0,5151.0,5151.0,5151.0,5151.0,5151.0,5148.0,5143.0,5148.0,5143.0
mean,2333.42,745.29,908.89,1150.6,984.22,59.01,0.03,0.09,0.17,0.38
std,243.52,508.82,637.52,917.86,718.33,54.67,0.19,0.19,0.63,0.86
min,2000.0,62.0,80.0,133.0,110.0,20.0,-0.97,-0.97,-0.72,-0.8
25%,2125.5,441.0,525.0,630.0,549.0,20.0,-0.04,0.01,-0.17,-0.03
50%,2280.0,638.0,735.0,867.0,774.0,40.0,0.02,0.08,0.03,0.22
75%,2528.0,868.0,1085.5,1396.0,1184.5,74.0,0.09,0.16,0.34,0.53
max,2880.0,5630.0,6630.0,15551.0,7374.0,418.0,8.46,6.52,10.0,13.0


**Rent data**

In [None]:
rent = "Files/Rent/Issue-135-Rent-tables-March-2021-quarter.xlsx"
rent_pc = pd.read_excel(rent, sheet_name="Postcode", na_values='-', header=7)

In [None]:
# Rename column for easier referencing (rent)

rename_cols= {'Postcode':'postcode', 
             'Dwelling Types':'dwelling_type', 
              'Number of Bedrooms':'bed_number',
             'First Quartile Weekly Rent for New Bonds\n$': '25%_wrent_newb',
             'Median Weekly Rent for New Bonds\n$': '50%_wrent_newb', 
             'Third Quartile Weekly Rent for New Bonds\n$': "75%_wrent_newb",
             'New Bonds Lodged\nNo.' : 'new_bonds_number',
              'Total Bonds Held\nNo.': 'total_bonds_number',
             'Sales\nNo.':'sales_number'}

rent_pc.rename(columns=rename_cols,inplace=True)
rent_pc.head(5)

In [None]:
# Check data types
rent_pc.dtypes

In [None]:
# As with sales, replace the small-sample marker 's' with 20 (midpoint of 10-30)
rent_pc.loc[rent_pc['new_bonds_number'] == 's', 'new_bonds_number'] = 20.0
rent_pc.loc[rent_pc['total_bonds_number'] == 's', 'total_bonds_number'] = 20.0

rent_pc['new_bonds_number'] = rent_pc['new_bonds_number'].astype(float)
rent_pc['total_bonds_number'] = rent_pc['total_bonds_number'].astype(float)

# Check data types again
rent_pc.dtypes

In [None]:
# Check unique values of dwelling type
rent_pc.groupby('dwelling_type').size()

In [None]:
# Check unique values of bed_number
rent_pc.groupby('bed_number').size()

In [None]:
print(rent_pc.shape, "\n")
print(rent_pc.isnull().sum())

In [None]:
# Check the top 20 postcodes with the highest total bond number in Q1 2021
tbonds_pc = rent_pc.groupby(["postcode", "dwelling_type"])['total_bonds_number'].sum().unstack()

tbonds_pc.sort_values(by="Total", ascending=False).head(20)

In [None]:
# Check the top 20 postcodes with the most new bonds lodged in Q1 2021
nbonds_pc = rent_pc.groupby(["postcode", "dwelling_type"])['new_bonds_number'].sum().unstack()
nbonds_pc.sort_values(by='Total', ascending=False).head(20)