**NOTE & UPDATE:**

*Alexis (8:00 31-AUG):*
* Looked at the rent data, which has more granularity to it:
    * Each postcode is first grouped by **4 dwelling types**: house, townhouse, flat/unit and other
    * Each dwelling type is then grouped by **'number of bedrooms'**: 1, 2, 3, 4, Bedsitter(?) or na
    
    These can be used to compare rents for the same dwelling type / bedroom count across postcodes, which is a piece of analysis in itself
    
    
* We can also explore the correlation between rent and sales price within the same postcode, from a time series point of view
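
A minimal sketch of the rent vs. sales correlation idea, assuming we eventually have one row per quarter for a single postcode. The frame `ts`, its column names and all numbers are made up for illustration and do not come from the FACS tables:

```python
import pandas as pd

# Hypothetical quarterly medians for one postcode (illustrative numbers only)
ts = pd.DataFrame(
    {
        "median_price": [900, 910, 905, 930, 950, 980, 1010, 1050],  # $'000s
        "median_wrent": [600, 605, 600, 610, 590, 580, 585, 600],    # $ per week
    },
    index=["2019Q1", "2019Q2", "2019Q3", "2019Q4",
           "2020Q1", "2020Q2", "2020Q3", "2020Q4"],
)

# Pearson correlation of the two series in levels
level_corr = ts["median_price"].corr(ts["median_wrent"])

# Correlation of quarter-on-quarter percentage changes
changes = ts.pct_change().dropna()
change_corr = changes["median_price"].corr(changes["median_wrent"])

print(round(level_corr, 2), round(change_corr, 2))
```

Correlating changes rather than levels is one way to reduce the risk that a shared upward trend makes the two series look more related than they are.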

**DATA SOURCE:**

[NSW Housing Rent and Sales](https://www.facs.nsw.gov.au/resources/statistics/rent-and-sales/back-issues)

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

### Understand the data structure of the sales and rent data
Use Q1 (March) 2021 data as an example

In [2]:
sales = "Files/Sales/Issue-136-Sales-tables-March-2021-quarter.xlsx"
rent = "Files/Rent/Issue-135-Rent-tables-March-2021-quarter.xlsx"

# Read the "Postcode" sheet of each workbook into its own dataframe
# (header=N means the column names sit on 0-indexed row N of the sheet)
sales_pc = pd.read_excel(sales, sheet_name="Postcode", na_values='-', header=6)
rent_pc = pd.read_excel(rent, sheet_name="Postcode", na_values='-', header=7)

# Note:
# Sale prices for any geographical area with 10 or fewer sales are not shown, for confidentiality
# They are represented as '-' in the table, which na_values='-' converts to NaN
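
To see what `na_values='-'` does without opening the workbooks, here is a small sketch on in-memory text. It uses `pd.read_csv` rather than `pd.read_excel` purely so it runs without a file; both functions accept `na_values`. The two-row table `raw` is a made-up stand-in for the real sheet:

```python
import io

import pandas as pd

# Made-up stand-in for the sheet: '-' marks a suppressed value
raw = "Postcode,Median\n2000,1371\n2007,-\n"

# '-' is parsed as NaN, so the column becomes float instead of object
demo = pd.read_csv(io.StringIO(raw), na_values="-")

print(demo)
print(demo.dtypes)
```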

**Sales data**

In [3]:
# Rename columns for easier referencing

rename_cols = {'Postcode': 'postcode',
               'Dwelling Type': 'dwelling_type',
               "First Quartile Sales Price\n$'000s": '25%_price',
               "Median Sales Price\n$'000s": '50%_price',
               "Third Quartile Sales Price\n'000s": '75%_price',
               "Mean Sales Price\n$'000s": 'mean_price',
               'Sales\nNo.': 'sales_number'}

sales_pc.rename(columns=rename_cols, inplace=True)
sales_pc.head(5)

| | postcode | dwelling_type | 25%_price | 50%_price | 75%_price | mean_price | sales_number | Qtly change in Median | Annual change in Median | Qtly change in Count | Annual change in Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Total | 924.0 | 1371.0 | 3500.0 | 2794.0 | 184 | 0.1425 | 0.0711 | 0.0888 | 0.5862 |
| 1 | 2000 | Strata | 924.0 | 1371.0 | 3500.0 | 2794.0 | 184 | 0.1331 | 0.0632 | 0.1018 | 0.6429 |
| 2 | 2007 | Total | 619.0 | 763.0 | 954.0 | 754.0 | s | 0.1713 | 0.0235 | 0.0 | 0.3 |
| 3 | 2007 | Non Strata | | | | | | | | | |
| 4 | 2007 | Strata | 549.0 | 710.0 | 882.0 | 688.0 | s | 0.0906 | -0.047 | -0.0769 | 0.2 |


Note that each postcode has a total row, a row for strata properties, and a row for non-strata properties.

Need to use groupby to tease them apart later.
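
One possible way to tease the row types apart, sketched on a toy frame. The names `toy_sales`, `by_type` and `wide` are illustrative and not part of the notebook; the values are copied from the `head()` output above:

```python
import pandas as pd

# Toy frame using the renamed sales columns
toy_sales = pd.DataFrame(
    {
        "postcode": [2000, 2000, 2007, 2007],
        "dwelling_type": ["Total", "Strata", "Total", "Strata"],
        "50%_price": [1371.0, 1371.0, 763.0, 710.0],
    }
)

# Option 1: one sub-frame per dwelling type
by_type = {
    name: grp.reset_index(drop=True)
    for name, grp in toy_sales.groupby("dwelling_type")
}

# Option 2: a wide table with one row per postcode and one column per dwelling type
wide = toy_sales.pivot(index="postcode", columns="dwelling_type", values="50%_price")

print(wide)
```

The wide layout makes it easy to compare strata and total medians side by side for the same postcode.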

In [4]:
# Check NaNs: most come from cells suppressed for small samples (10 or fewer sales per postcode)
sales_pc.isnull().sum()

postcode                     0
dwelling_type                0
25%_price                  316
50%_price                  316
75%_price                  316
mean_price                 316
sales_number               316
Qtly change in Median      316
Annual change in Median    321
Qtly change in Count       316
Annual change in Count     321
dtype: int64

In [5]:
# Drop rows with no sales figures; .copy() avoids a SettingWithCopyWarning on later assignments
sales_pc = sales_pc[sales_pc['sales_number'].notna()].copy()

# Check data type and df shape after dropna
print("data types:", sales_pc.dtypes, "\n")
print("shape: ", sales_pc.shape)

data types: postcode                     int64
dwelling_type               object
25%_price                  float64
50%_price                  float64
75%_price                  float64
mean_price                 float64
sales_number                object
Qtly change in Median      float64
Annual change in Median    float64
Qtly change in Count       float64
Annual change in Count     float64
dtype: object 

shape:  (1111, 11)


`sales_number` was read into the dataframe with dtype `object` (mixed integers and strings) because, according to the Explanatory note, "statistics calculated from sample sizes between 10 and 30 are shown by an ‘s’ in the relevant table.  We suggest these data are treated with caution, particularly when assessing quarterly and annual changes."

In [10]:
# Replace 's' with 20, the midpoint of the 10-30 range, since quite a few rows are affected
sales_pc.loc[sales_pc['sales_number'] == 's', 'sales_number'] = 20.0

# Cast type as float
sales_pc['sales_number'] = sales_pc['sales_number'].astype(float)

sales_pc.describe().round(2)

| | postcode | 25%_price | 50%_price | 75%_price | mean_price | sales_number | Qtly change in Median | Annual change in Median | Qtly change in Count | Annual change in Count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1111.0 | 1111.0 | 1111.0 | 1111.0 | 1111.0 | 1111.0 | 1111.0 | 1106.0 | 1111.0 | 1106.0 |
| mean | 2340.06 | 786.08 | 965.06 | 1221.18 | 1047.83 | 64.69 | 0.06 | 0.13 | -0.03 | 0.62 |
| std | 246.4 | 566.09 | 706.56 | 948.62 | 791.96 | 60.61 | 0.14 | 0.2 | 0.41 | 0.85 |
| min | 2000.0 | 88.0 | 120.0 | 147.0 | 125.0 | 20.0 | -0.9 | -0.46 | -0.59 | -0.49 |
| 25% | 2130.5 | 452.5 | 540.0 | 650.0 | 574.5 | 20.0 | -0.01 | 0.03 | -0.22 | 0.17 |
| 50% | 2284.0 | 652.0 | 770.0 | 907.0 | 804.0 | 43.0 | 0.05 | 0.11 | -0.1 | 0.41 |
| 75% | 2533.5 | 924.0 | 1188.0 | 1499.0 | 1257.0 | 82.5 | 0.12 | 0.2 | 0.06 | 0.74 |
| max | 2880.0 | 5375.0 | 6400.0 | 9000.0 | 7374.0 | 397.0 | 0.78 | 2.04 | 6.5 | 7.0 |
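
Because the 's' rows are ones the Explanatory note says to treat with caution, it may be worth remembering which rows they were before overwriting the marker. A minimal sketch on a toy frame; the column name `small_sample` is a hypothetical addition, and 20 is the same assumed midpoint used above:

```python
import pandas as pd

# Toy frame with the mixed int / 's' column
toy = pd.DataFrame({"postcode": [2000, 2007, 2007], "sales_number": [184, "s", "s"]})

# Flag the rows whose count was shown as 's' (sample size between 10 and 30)
toy["small_sample"] = toy["sales_number"].eq("s")

# to_numeric turns 's' into NaN, which is then filled with the assumed midpoint of 20
toy["sales_number"] = pd.to_numeric(toy["sales_number"], errors="coerce").fillna(20.0)

print(toy)
```

The flag allows later analysis to exclude or down-weight these rows, for example when looking at quarterly and annual changes.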


**Rent data**

In [18]:
# Rename columns for easier referencing (rent)

rename_cols = {'Postcode': 'postcode',
               'Dwelling Types': 'dwelling_type',
               'Number of Bedrooms': 'bed_number',
               'First Quartile Weekly Rent for New Bonds\n$': '25%_wrent_newb',
               'Median Weekly Rent for New Bonds\n$': '50%_wrent_newb',
               'Third Quartile Weekly Rent for New Bonds\n$': '75%_wrent_newb',
               'New Bonds Lodged\nNo.': 'new_bonds_number',
               'Total Bonds Held\nNo.': 'total_bonds_number'}

rent_pc.rename(columns=rename_cols, inplace=True)
rent_pc.head(5)

| | postcode | dwelling_type | bed_number | 25%_wrent_newb | 50%_wrent_newb | 75%_wrent_newb | new_bonds_number | total_bonds_number | Quarterly change in Median Weekly Rent | Annual change in Median Weekly Rent | Quarterly change in New Bonds Lodged | Annual change in New Bonds Lodged |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Total | Total | 500.0 | 600.0 | 750.0 | 1469 | 9327 | 0.0909 | -0.1429 | -0.1384 | 0.1943 |
| 1 | 2000 | Total | Bedsitter | 250.0 | 365.0 | 400.0 | 89 | 382 | 0.0429 | -0.2843 | 0.0349 | 0.9778 |
| 2 | 2000 | Total | 1 Bedroom | 450.0 | 540.0 | 620.0 | 741 | 4063 | 0.102 | -0.1692 | -0.1136 | -0.0326 |
| 3 | 2000 | Total | Not Specified | 350.0 | 445.0 | 548.0 | 34 | 511 | -0.1524 | -0.3904 | -0.32 | 0.2593 |
| 4 | 2000 | Total | 2 Bedrooms | 640.0 | 750.0 | 850.0 | 517 | 3741 | 0.1194 | -0.1979 | -0.2083 | 0.6056 |


In [19]:
# Check unique values of dwelling type
rent_pc.groupby('dwelling_type').size()

dwelling_type
Flat/Unit    3160
House        3653
Other        3257
Total        3903
Townhouse    2424
dtype: int64

In [20]:
# Check unique values of bed_number
rent_pc.groupby('bed_number').size()

bed_number
1 Bedroom             2336
2 Bedrooms            2655
3 Bedrooms            2637
4 or more Bedrooms    2330
Bedsitter             1296
Not Specified         2310
Total                 2833
dtype: int64
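
With both groupings known, comparing rents for one dwelling type / bedroom combination across postcodes could look roughly like this. It is a sketch on a toy frame: `toy_rent`, `segment` and `ranked` are illustrative names, and the rent values are made up:

```python
import pandas as pd

# Hypothetical rows using the renamed rent columns (rent values are made up)
toy_rent = pd.DataFrame(
    {
        "postcode": [2000, 2000, 2007, 2010],
        "dwelling_type": ["Flat/Unit", "House", "Flat/Unit", "Flat/Unit"],
        "bed_number": ["2 Bedrooms", "3 Bedrooms", "2 Bedrooms", "2 Bedrooms"],
        "50%_wrent_newb": [750.0, 1100.0, 680.0, 720.0],
    }
)

# Keep a single dwelling type / bedroom combination
segment = toy_rent[
    (toy_rent["dwelling_type"] == "Flat/Unit")
    & (toy_rent["bed_number"] == "2 Bedrooms")
]

# Rank postcodes by median weekly rent, highest first
ranked = segment.sort_values("50%_wrent_newb", ascending=False)[
    ["postcode", "50%_wrent_newb"]
]

print(ranked)
```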

In [21]:
print(rent_pc.shape, "\n")
print(rent_pc.isnull().sum())

(16397, 12) 

postcode                                      0
dwelling_type                                 0
bed_number                                    0
25%_wrent_newb                            12274
50%_wrent_newb                            12274
75%_wrent_newb                            12274
new_bonds_number                          12274
total_bonds_number                         6095
Quarterly change in Median Weekly Rent    12276
Annual change in Median Weekly Rent       12277
Quarterly change in New Bonds Lodged      12276
Annual change in New Bonds Lodged         12277
dtype: int64


### WOW SUCH NULLS!!!

Roughly 75% of rows (12,274 of 16,397) have no rent quartiles. Need to think about how to deal with them.
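
A possible first step, sketched on a toy frame, is to check where the nulls concentrate before deciding whether to drop them. The frame `toy_rent` and the names `null_share` and `usable` are illustrative; dropping rows is only one option and has not been decided here:

```python
import numpy as np
import pandas as pd

# Toy frame with the renamed rent columns (values are made up)
toy_rent = pd.DataFrame(
    {
        "dwelling_type": ["House", "House", "Other", "Other", "Total"],
        "bed_number": ["3 Bedrooms", "Bedsitter", "1 Bedroom", "Bedsitter", "Total"],
        "50%_wrent_newb": [650.0, np.nan, np.nan, np.nan, 600.0],
    }
)

# Share of rows with a missing median rent, per dwelling type
null_share = toy_rent["50%_wrent_newb"].isna().groupby(toy_rent["dwelling_type"]).mean()

# One option: keep only the rows that actually have a median rent
usable = toy_rent.dropna(subset=["50%_wrent_newb"])

print(null_share)
print(usable.shape)
```

If the nulls turn out to sit mostly in sparse combinations (for example bedsitters in the "Other" dwelling type), dropping them would lose little information.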