# Introduction

This notebook analyses the data structures in:
* [`raw/2021-04-10.gz`](#Airbnb-listing-data)
* [`raw/2016Census_G01_NSW_LGA.csv`](#Census-G01-data)
* [`raw/2016Census_G02_NSW_LGA.csv`](#Census-G02-data)
* [`raw/shapefile`](#Shapefile)

to design a [star schema](https://en.wikipedia.org/wiki/Star_schema).

The raw data is uploaded to Postgres with 'test_' prepended to the table names.

## Joins

The four data sets are to be joined to each other.
1. The listings data is joined to the shapefile with a point-in-polygon join on each listing's latitude and longitude. Although the listings data has a `neighbourhood_cleansed` column, its values don't match the list of LGAs exactly, so the coordinates are the more reliable join key.
2. Using the official LGA names from the shapefile as the key, the G01 and G02 data can be joined.
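
A minimal sketch of what a point-in-polygon join does, assuming each LGA boundary is a simple polygon given as a list of `(longitude, latitude)` vertices. The ray-casting test, the helper names `point_in_polygon` and `assign_lga`, and the two square "LGAs" are illustrative assumptions, not this notebook's implementation; with the real shapefile, a spatial join such as `geopandas.sjoin` would normally do this work.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: toggle `inside` each time a ray to the right crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_lga(point: Point, lga_polygons: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first polygon containing the point, else None."""
    for name, polygon in lga_polygons.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical, made-up square "LGAs" (not real boundaries).
toy_lgas = {
    'LGA A': [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    'LGA B': [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(assign_lga((0.5, 0.5), toy_lgas))
print(assign_lga((1.5, 0.25), toy_lgas))
print(assign_lga((5.0, 5.0), toy_lgas))
```

Points that fall in no polygon return `None`, which is the case to watch for listings whose coordinates lie outside every LGA boundary.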

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
from datetime import datetime
import sqlalchemy as sa
import os
import gzip
import shutil
import requests
import pandas as pd
import geopandas as gpd
from pathlib import Path
from psycopg2.extras import execute_values
from dotenv import (
    load_dotenv,
    find_dotenv
)
import psycopg2

from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

from src.data.database import (
    get_connection_string
)
from src.utils.utils import (
    stringify_columns,
    get_create_query
)

In [3]:
load_dotenv(find_dotenv())

project_dir = Path(find_dotenv()).parent
data_dir = project_dir / 'data'
raw_data_dir = data_dir / 'raw'
interim_data_dir = data_dir / 'interim'
reports_dir = project_dir / 'reports'
references_dir = project_dir / 'references'

In [4]:
pd.set_option('display.max_columns', 100)

# Connect to Postgres

In [5]:
conn_string = get_connection_string()
print(conn_string)
engine = sa.create_engine(conn_string)

postgresql+psycopg2://airflow:airflow@postgres:5432/airflow


In [6]:
schema = 'star'
table_name = f'fact_airbnb'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df = pd.read_sql(con=engine,
                 sql=query)

In [7]:
schema = 'star'
table_name = f'dim_host'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df_host = pd.read_sql(con=engine,
                      sql=query)

In [8]:
schema = 'star'
table_name = f'dim_property'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df_property = pd.read_sql(con=engine,
                          sql=query)

In [9]:
df_merged = (
    df
    .merge(df_host, on='host_id', how='left')
    .merge(df_property, on='id', how='left')
)
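
The length of `df_merged` differs between outputs later in this notebook (11974 rows in one, 410973 in another), so it may be worth checking that the dimension joins do not multiply fact rows. A small sketch on made-up frames, assuming each dimension should hold one row per key; `validate='many_to_one'` asks pandas to raise if the right-hand keys are not unique.

```python
import pandas as pd

# Toy stand-ins for the fact and host dimension tables (values are made up).
fact = pd.DataFrame({'id': [1, 2, 3], 'host_id': [10, 10, 20]})
dim_host = pd.DataFrame({'host_id': [10, 20],
                         'host_is_superhost': [True, False]})

# A many-to-one join keeps exactly one output row per fact row.
merged = fact.merge(dim_host, on='host_id', how='left', validate='many_to_one')
print(len(merged) == len(fact))

# Duplicate a dimension row to simulate a dimension with repeated keys.
dim_host_dup = pd.concat([dim_host, dim_host.iloc[[0]]], ignore_index=True)
try:
    fact.merge(dim_host_dup, on='host_id', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    # pandas refuses the merge because host_id is not unique on the right.
    duplicate_detected = True
print(duplicate_detected)
```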

In [31]:
groupby = [
    'neighbourhood_cleansed',
    pd.Grouper(freq='M')
]
(
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
    .agg(active_listing_rate = ('has_availability', lambda s: s.mean() * 100))
)

Unnamed: 0_level_0,Unnamed: 1_level_0,active_listing_rate
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1
Ashfield,2020-05-31,100.0
Ashfield,2020-06-30,100.0
Ashfield,2020-07-31,100.0
Ashfield,2020-08-31,100.0
Ashfield,2020-09-30,100.0
...,...,...
Woollahra,2020-12-31,100.0
Woollahra,2021-01-31,100.0
Woollahra,2021-02-28,100.0
Woollahra,2021-03-31,100.0


In [32]:
(
    df_merged
    .set_index('execution_date')
    .query('has_availability')
    .groupby(groupby)
    .agg(min_price = ('price', 'min'),
         max_price = ('price', 'max'),
         median_price = ('price', 'median'),
         average_price = ('price', 'mean'))
)

Unnamed: 0_level_0,Unnamed: 1_level_0,min_price,max_price,median_price,average_price
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Ashfield,2020-05-31,4400.0,9900.0,8600.0,7633.333333
Ashfield,2020-06-30,4400.0,9900.0,8600.0,7633.333333
Ashfield,2020-07-31,4400.0,9900.0,8600.0,7633.333333
Ashfield,2020-08-31,4400.0,12000.0,9250.0,8725.000000
Ashfield,2020-09-30,4400.0,9900.0,8600.0,7633.333333
...,...,...,...,...,...
Woollahra,2020-12-31,86.0,120000.0,16500.0,25184.320000
Woollahra,2021-01-31,86.0,120000.0,16500.0,25306.720000
Woollahra,2021-02-28,86.0,120000.0,17500.0,26000.750000
Woollahra,2021-03-31,86.0,120000.0,17500.0,26000.750000


In [34]:
(
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
    .agg(distinct_host = ('host_id', pd.Series.nunique))
)

Unnamed: 0_level_0,Unnamed: 1_level_0,distinct_host
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1
Ashfield,2020-05-31,3
Ashfield,2020-06-30,3
Ashfield,2020-07-31,3
Ashfield,2020-08-31,4
Ashfield,2020-09-30,3
...,...,...
Woollahra,2020-12-31,48
Woollahra,2021-01-31,47
Woollahra,2021-02-28,46
Woollahra,2021-03-31,46


In [40]:
(
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
    .apply(lambda x: pd.Series([x.host_id.nunique(), 
                                x.drop_duplicates(subset=['host_id']).host_is_superhost.mean()], 
                               index=['n_distinct_hosts', 
                                      'superhost_rate']))
)

Unnamed: 0_level_0,Unnamed: 1_level_0,n_distinct_hosts,superhost_rate
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1,Unnamed: 3_level_1
Ashfield,2020-05-31,3.0,0.000000
Ashfield,2020-06-30,3.0,0.000000
Ashfield,2020-07-31,3.0,0.000000
Ashfield,2020-08-31,4.0,0.000000
Ashfield,2020-09-30,3.0,0.000000
...,...,...,...
Woollahra,2020-12-31,48.0,0.125000
Woollahra,2021-01-31,47.0,0.127660
Woollahra,2021-02-28,46.0,0.130435
Woollahra,2021-03-31,46.0,0.130435


In [44]:
(
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
    .apply(lambda x: pd.Series(
        [x.has_availability.mean() * 100,
         x.query('has_availability').review_scores_rating.mean()],
        index=['active_listing_rate', 'average_review_scores_rating']))
)

Unnamed: 0_level_0,Unnamed: 1_level_0,active_listing_rate,average_review_scores_rating
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1,Unnamed: 3_level_1
Ashfield,2020-05-31,100.0,96.500000
Ashfield,2020-06-30,100.0,96.500000
Ashfield,2020-07-31,100.0,96.500000
Ashfield,2020-08-31,100.0,92.333333
Ashfield,2020-09-30,100.0,96.500000
...,...,...,...
Woollahra,2020-12-31,100.0,92.555556
Woollahra,2021-01-31,100.0,92.644444
Woollahra,2021-02-28,100.0,92.714286
Woollahra,2021-03-31,100.0,92.714286


# `pct_change`

In [102]:
df_grouped = (
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
)
pct_change = (
    df_grouped
    .apply(lambda x: pd.Series([x.has_availability.sum(),
                               (~x.has_availability).sum()], 
                               index=['n_active_listings', 'n_inactive_listings']))
    .reset_index()
    .pivot_table(index='execution_date', 
                 values=['n_active_listings', 'n_inactive_listings'], 
                 columns='neighbourhood_cleansed',
                 fill_value=0)
    .pct_change(periods=1)
#     .reset_index()
#     .melt(
#         value_vars=['n_active_listings', 'n_inactive_listings'], 
#         col_level=1,
#         ignore_index=False
#     )
#     .set_index(['neighbourhood_cleansed', 'execution_date'])

)
pct_change

Unnamed: 0_level_0,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_active_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings,n_inactive_listings
neighbourhood_cleansed,Ashfield,Auburn,Blacktown,Botany Bay,Burwood,Camden,Canada Bay,Canterbury,City Of Kogarah,Fairfield,Holroyd,Hornsby,Hunters Hill,Hurstville,Ku-Ring-Gai,Lane Cove,Leichhardt,Manly,Marrickville,Mosman,North Sydney,Parramatta,Penrith,Pittwater,Randwick,Rockdale,Ryde,Sutherland Shire,Sydney,The Hills Shire,Warringah,Waverley,Willoughby,Woollahra,Ashfield,Auburn,Blacktown,Botany Bay,Burwood,Camden,Canada Bay,Canterbury,City Of Kogarah,Fairfield,Holroyd,Hornsby,Hunters Hill,Hurstville,Ku-Ring-Gai,Lane Cove,Leichhardt,Manly,Marrickville,Mosman,North Sydney,Parramatta,Penrith,Pittwater,Randwick,Rockdale,Ryde,Sutherland Shire,Sydney,The Hills Shire,Warringah,Waverley,Willoughby,Woollahra
execution_date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2,Unnamed: 61_level_2,Unnamed: 62_level_2,Unnamed: 63_level_2,Unnamed: 64_level_2,Unnamed: 65_level_2,Unnamed: 66_level_2,Unnamed: 67_level_2,Unnamed: 68_level_2
2020-05-31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-06-30,0.0,0.0,0.0,0.5,0.0,,0.0,0.0,inf,0.0,0.0,0.0,0.0,0.0,0.25,0.0,-0.021739,0.014925,-0.03125,0.0,0.030303,0.0,0.0,0.0,0.031915,-0.166667,0.0,0.0,-0.007968,0.0,-0.037037,0.0,0.0,-0.033898,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-07-31,0.0,-0.333333,0.0,0.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.3,0.0,0.066667,0.044118,-0.032258,0.0625,0.058824,0.0,0.0,-0.073171,0.0,-0.2,0.0,0.0,-0.044177,0.5,0.019231,0.004717,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-31,0.333333,0.0,0.142857,0.0,0.0,inf,0.0,0.0,0.0,0.0,-0.25,0.0,0.0,0.0,-0.230769,-0.142857,0.0,0.028169,0.1,-0.176471,-0.055556,0.0,-0.333333,0.078947,-0.041237,-0.25,0.25,0.0,0.004202,0.333333,0.018868,-0.014085,0.2,0.035088,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-09-30,-0.25,0.5,-0.125,0.333333,0.0,-1.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.3,0.0,0.041667,-0.027397,-0.060606,0.0,-0.058824,0.0,0.0,-0.04878,0.010753,0.333333,-0.2,0.111111,0.012552,-0.25,-0.037037,0.014286,0.0,-0.016949,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-10-31,0.0,0.0,0.0,0.0,0.0,inf,0.0,0.0,0.0,0.0,0.0,-0.2,0.0,0.0,0.0,0.0,-0.02,0.056338,0.0,0.0,-0.03125,0.0,0.0,-0.025641,-0.074468,0.0,0.0,0.0,-0.004132,0.0,-0.019231,0.037559,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-11-30,0.333333,0.0,0.142857,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026667,-0.032258,0.0,0.0,0.0,0.0,0.131579,0.011494,0.0,0.25,-0.1,0.0,0.333333,-0.019608,-0.00905,0.0,-0.068966,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-12-31,0.0,-0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.5,0.0,0.0,0.0,-0.307692,0.0,-0.142857,0.0,0.133333,0.214286,0.064516,0.0,0.0,0.069767,-0.011364,0.0,0.0,0.0,0.016598,0.0,0.02,0.0,0.333333,-0.074074,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2021-01-31,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,-0.083333,0.0,0.0,0.0,0.0,0.0,0.0,-0.029412,0.0,-0.030303,0.0,0.0,-0.021739,0.022989,0.25,0.0,0.111111,-0.016327,0.0,0.0,0.022831,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2021-02-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.444444,0.166667,0.0,-0.025974,-0.060606,0.0,0.0625,0.0,0.0,-0.044444,0.022472,0.0,0.0,-0.1,-0.008299,-0.25,0.019608,0.0,0.0,-0.04,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [104]:
pct_change.columns.levels

FrozenList([['n_active_listings', 'n_inactive_listings'], ['Ashfield', 'Auburn', 'Blacktown', 'Botany Bay', 'Burwood', 'Camden', 'Canada Bay', 'Canterbury', 'City Of Kogarah', 'Fairfield', 'Holroyd', 'Hornsby', 'Hunters Hill', 'Hurstville', 'Ku-Ring-Gai', 'Lane Cove', 'Leichhardt', 'Manly', 'Marrickville', 'Mosman', 'North Sydney', 'Parramatta', 'Penrith', 'Pittwater', 'Randwick', 'Rockdale', 'Ryde', 'Sutherland Shire', 'Sydney', 'The Hills Shire', 'Warringah', 'Waverley', 'Willoughby', 'Woollahra']])

In [120]:
df_grouped = (
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
)
new_column = 'n_active_listings'
pct_change = (
    df_grouped
    .apply(lambda x: pd.Series([x.has_availability.sum()], 
                               index=[new_column]))
    .reset_index()
    .pivot_table(index='execution_date', 
                 values=new_column, 
                 columns='neighbourhood_cleansed',
                 fill_value=0)
    .pct_change(periods=1)
    .melt(ignore_index=False, value_name=new_column)
    .reset_index()
    .set_index(['neighbourhood_cleansed', 'execution_date'])

)
pct_change

Unnamed: 0_level_0,Unnamed: 1_level_0,n_active_listings
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1
Ashfield,2020-05-31,
Ashfield,2020-06-30,0.000000
Ashfield,2020-07-31,0.000000
Ashfield,2020-08-31,0.333333
Ashfield,2020-09-30,-0.250000
...,...,...
Woollahra,2020-12-31,-0.074074
Woollahra,2021-01-31,0.000000
Woollahra,2021-02-28,-0.040000
Woollahra,2021-03-31,0.000000


In [121]:
df_grouped = (
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
)
new_column = 'n_inactive_listings'
pct_change = (
    df_grouped
    .apply(lambda x: pd.Series([(~x.has_availability).sum()], 
                               index=[new_column]))
    .reset_index()
    .pivot_table(index='execution_date', 
                 values=new_column, 
                 columns='neighbourhood_cleansed',
                 fill_value=0)
    .pct_change(periods=1)
    .melt(ignore_index=False, value_name=new_column)
    .reset_index()
    .set_index(['neighbourhood_cleansed', 'execution_date'])

)
pct_change

Unnamed: 0_level_0,Unnamed: 1_level_0,n_inactive_listings
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1
Ashfield,2020-05-31,
Ashfield,2020-06-30,
Ashfield,2020-07-31,
Ashfield,2020-08-31,
Ashfield,2020-09-30,
...,...,...
Woollahra,2020-12-31,
Woollahra,2021-01-31,
Woollahra,2021-02-28,
Woollahra,2021-03-31,


In [118]:
df_grouped = (
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
)
pct_change = (
    df_grouped
    .apply(lambda x: pd.Series([x.has_availability.sum()], 
                               index=['n_active_listings']))
    .reset_index()
    .pivot_table(index='execution_date', 
                 values='n_active_listings', 
                 columns='neighbourhood_cleansed',
                 fill_value=0)
    .pct_change(periods=1)
    .melt(ignore_index=False)
    .reset_index()
    .set_index(['neighbourhood_cleansed', 'execution_date'])

)
pct_change

Unnamed: 0_level_0,Unnamed: 1_level_0,value
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1
Ashfield,2020-05-31,
Ashfield,2020-06-30,0.000000
Ashfield,2020-07-31,0.000000
Ashfield,2020-08-31,0.333333
Ashfield,2020-09-30,-0.250000
...,...,...
Woollahra,2020-12-31,-0.074074
Woollahra,2021-01-31,0.000000
Woollahra,2021-02-28,-0.040000
Woollahra,2021-03-31,0.000000


In [99]:
df_grouped = (
    df_merged
    .set_index('execution_date')
    .groupby(groupby)
)
pct_change = (
    df_grouped
    .apply(lambda x: pd.Series([x.has_availability.sum(),
                               (~x.has_availability).sum()], 
                               index=['n_active_listings', 'n_inactive_listings']))
    .reset_index()
    .pivot_table(index='execution_date', 
                 values=['n_active_listings', 'n_inactive_listings'], 
                 columns='neighbourhood_cleansed',
                 fill_value=0)
    .pct_change(periods=1)
#     .reset_index()
    .melt(
#         value_vars=['n_active_listings', 'n_inactive_listings'], 
        col_level=1,
        ignore_index=False
    )
#     .set_index(['neighbourhood_cleansed', 'execution_date'])

)
pct_change

Unnamed: 0_level_0,neighbourhood_cleansed,value
execution_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-05-31,Ashfield,
2020-06-30,Ashfield,0.000000
2020-07-31,Ashfield,0.000000
2020-08-31,Ashfield,0.333333
2020-09-30,Ashfield,-0.250000
...,...,...
2020-12-31,Woollahra,
2021-01-31,Woollahra,
2021-02-28,Woollahra,
2021-03-31,Woollahra,


# Number of stays

In [129]:
df_merged.price

0           6400.0
1        1431500.0
2          47000.0
3          10000.0
4          13100.0
           ...    
11969       6500.0
11970      14100.0
11971       7200.0
11972      11100.0
11973      20000.0
Name: price, Length: 11974, dtype: float64

In [131]:
def calc_stays_revenue(x):
    # Only active listings count towards stays and revenue.
    active = x.query('has_availability')
    # Nights not available in the next 30 days, treated here as stays.
    stays = 30 - active.availability_30
    return pd.Series([stays.sum(), (stays * active.price).mean()],
                     index=['n_stays', 'est_revenue_per_active_listing'])

n_stays_revenue = df_grouped.apply(calc_stays_revenue)

n_stays_revenue

Unnamed: 0_level_0,Unnamed: 1_level_0,n_stays,est_revenue_per_active_listing
neighbourhood_cleansed,execution_date,Unnamed: 2_level_1,Unnamed: 3_level_1
Ashfield,2020-05-31,63.0,189400.000000
Ashfield,2020-06-30,63.0,189400.000000
Ashfield,2020-07-31,63.0,189400.000000
Ashfield,2020-08-31,93.0,232050.000000
Ashfield,2020-09-30,63.0,189400.000000
...,...,...,...
Woollahra,2020-12-31,1079.0,538807.000000
Woollahra,2021-01-31,1083.0,572997.880000
Woollahra,2021-02-28,872.0,523891.666667
Woollahra,2021-03-31,916.0,542458.333333


In [136]:
def agg_group(df: pd.DataFrame) -> pd.Series:
    """
    To be used in pandas.core.groupby.GroupBy.apply
    :param df: Rows of a single group
    :return: One value per KPI, indexed by KPI name
    """
    # Several KPIs only consider active listings.
    active = df.query('has_availability')
    # Nights not available in the next 30 days, treated here as stays.
    stays = 30 - active.availability_30
    # TODO: need to do pct_change outside of the groupby
    # (active_listings_pct_change, inactive_listings_pct_change)
    return pd.Series({
        'active_listing_rate': df.has_availability.mean() * 100,
        'average_review_scores_rating': active.review_scores_rating.mean(),
        'min_price': df.price.min(),
        'max_price': df.price.max(),
        'median_price': df.price.median(),
        'average_price': df.price.mean(),
        'n_distinct_hosts': df.host_id.nunique(),
        # Count each host once when computing the superhost rate.
        'superhost_rate': df.drop_duplicates(subset=['host_id']).host_is_superhost.mean(),
        'n_stays': stays.sum(),
        'est_revenue_per_active_listing': (stays * active.price).mean(),
    })

In [158]:
df_grouped.apply(agg_group).reset_index()

Unnamed: 0,neighbourhood_cleansed,execution_date,active_listing_rate,average_review_scores_rating,min_price,max_price,median_price,average_price,n_distinct_hosts,superhost_rate,n_stays,est_revenue_per_active_listing
0,Ashfield,2020-05-31,100.0,96.500000,4400.0,9900.0,8600.0,7633.333333,3.0,0.000000,63.0,189400.000000
1,Ashfield,2020-06-30,100.0,96.500000,4400.0,9900.0,8600.0,7633.333333,3.0,0.000000,63.0,189400.000000
2,Ashfield,2020-07-31,100.0,96.500000,4400.0,9900.0,8600.0,7633.333333,3.0,0.000000,63.0,189400.000000
3,Ashfield,2020-08-31,100.0,92.333333,4400.0,12000.0,9250.0,8725.000000,4.0,0.000000,93.0,232050.000000
4,Ashfield,2020-09-30,100.0,96.500000,4400.0,9900.0,8600.0,7633.333333,3.0,0.000000,63.0,189400.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
394,Woollahra,2020-12-31,100.0,92.555556,86.0,120000.0,16500.0,25184.320000,48.0,0.125000,1079.0,538807.000000
395,Woollahra,2021-01-31,100.0,92.644444,86.0,120000.0,16500.0,25306.720000,47.0,0.127660,1083.0,572997.880000
396,Woollahra,2021-02-28,100.0,92.714286,86.0,120000.0,17500.0,26000.750000,46.0,0.130435,872.0,523891.666667
397,Woollahra,2021-03-31,100.0,92.714286,86.0,120000.0,17500.0,26000.750000,46.0,0.130435,916.0,542458.333333


In [145]:
logging.basicConfig(filename='db.log')
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

In [151]:
year_previous = 2020
month_previous = 4
execution_date = '2020-04-01'
query_prev = f"""
SELECT * 
FROM star.fact_airbnb
WHERE 
    execution_date = '{execution_date}'
"""
print(query_prev)

df_test = pd.read_sql(con=engine,
                      sql=query_prev)


SELECT * 
FROM star.fact_airbnb
WHERE 
    execution_date = '2020-04-01'

[2021-05-21 12:58:30,779] {base.py:132} INFO - select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
[2021-05-21 12:58:30,780] {base.py:132} INFO - [cached since 9833s ago] {'name': "\nSELECT * \nFROM star.fact_airbnb\nWHERE \n    execution_date = '2020-04-01'\n"}
[2021-05-21 12:58:30,782] {base.py:132} INFO - 
SELECT * 
FROM star.fact_airbnb
WHERE 
    execution_date = '2020-04-01'
[2021-05-21 12:58:30,783] {base.py:132} INFO - [raw sql] {}


In [160]:
# Inspect the property-related columns
df_merged[['property_type', 'room_type', 'accommodates']]

Unnamed: 0,property_type,room_type,accommodates
0,Apartment,Private room,1
1,Townhouse,Private room,2
2,House,Entire home/apt,6
3,Apartment,Private room,2
4,Loft,Entire home/apt,2
...,...,...,...
11969,Entire apartment,Entire home/apt,2
11970,Private room in house,Private room,2
11971,Private room in house,Private room,2
11972,Entire apartment,Entire home/apt,2


# Test reading from a non-existent table

In [165]:
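query = """
SELECT *
FROM data_mart.table1
"""

try:
    test_df = pd.read_sql(con=engine,
                          sql=query)
except Exception:
    # The table does not exist yet, so fall back to None.
    # Unlike a bare `except:`, this does not swallow KeyboardInterrupt.
    test_df = None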

[2021-05-21 13:51:15,633] {base.py:132} INFO - select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
[2021-05-21 13:51:15,634] {base.py:132} INFO - [cached since 1.3e+04s ago] {'name': '\nSELECT *\nFROM data_mart.table1\n'}
[2021-05-21 13:51:15,636] {base.py:132} INFO - 
SELECT *
FROM data_mart.table1
[2021-05-21 13:51:15,638] {base.py:132} INFO - [raw sql] {}
[2021-05-21 13:51:15,639] {base.py:132} INFO - ROLLBACK
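
The "read it, or get `None` if the table is missing" pattern can be sketched in a self-contained way. To keep the example runnable without a Postgres server it swaps in the standard-library `sqlite3` module, so the exception class caught here (`sqlite3.OperationalError`) is specific to SQLite and will differ from what is raised through SQLAlchemy and psycopg2; the helper name `read_table_or_none` is hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def read_table_or_none(conn: sqlite3.Connection,
                       table: str) -> Optional[List[Tuple]]:
    """Return all rows of `table`, or None if the table does not exist."""
    try:
        return conn.execute(f'SELECT * FROM "{table}"').fetchall()
    except sqlite3.OperationalError:
        # SQLite raises OperationalError ("no such table") for a missing table.
        return None


# In-memory database with a single one-row table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE table1 (id INTEGER)')
conn.execute('INSERT INTO table1 VALUES (1)')

existing = read_table_or_none(conn, 'table1')
missing = read_table_or_none(conn, 'table2')
print(existing, missing)
```

Catching the narrowest exception that signals a missing table keeps other failures, such as a lost connection, visible instead of silently turning them into `None`.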


## Check `neighbourhood_cleansed`

In [170]:
path = raw_data_dir / '2020-06-11_1000.gz'
df = pd.read_csv(path, compression='gzip')

In [171]:
df.neighbourhood_cleansed

0            Sydney
1            Sydney
2             Manly
3            Sydney
4            Mosman
           ...     
995          Sydney
996    North Sydney
997           Manly
998    North Sydney
999          Sydney
Name: neighbourhood_cleansed, Length: 1000, dtype: object

In [175]:
rename_dict = {
            'n_active_listings': 'n_active_listings_prev',
            'n_inactive_listings': 'n_inactive_listings_prev'}

In [180]:
groupby = ['neighbourhood_cleansed']

In [183]:
current_listing_cols = ['n_active_listings', 'n_inactive_listings']
prev_listing_cols = ['n_active_listings_prev', 'n_inactive_listings_prev']

In [184]:
dict(zip(current_listing_cols, prev_listing_cols))

{'n_active_listings': 'n_active_listings_prev',
 'n_inactive_listings': 'n_inactive_listings_prev'}

# Check the `host_neighbourhood` vs LGA names

In [32]:
path = raw_data_dir / 'shapefile/LGA_2016_AUST.shp'
df_shape = gpd.read_file(path)

In [47]:
path = raw_data_dir / 'shapefile_ssc_2011/SSC_2011_AUST.shp'
df_shape_ssc_2011 = gpd.read_file(path)

In [125]:
df_shape_ssc_2011

Unnamed: 0,SSC_CODE,SSC_NAME,CONF_VALUE,SQKM,geometry
0,10001,Abbotsbury,Very good,4.984673,"POLYGON ((150.85118 -33.87069, 150.85104 -33.8..."
1,10002,Abbotsford (NSW),Very good,1.017855,"POLYGON ((151.12593 -33.84578, 151.12678 -33.8..."
2,10003,Abercrombie,Very good,1.041389,"POLYGON ((149.55478 -33.39421, 149.55414 -33.3..."
3,10004,Aberdare,Very good,1.649523,"POLYGON ((151.36829 -32.83665, 151.37073 -32.8..."
4,10005,Aberdeen (NSW),Good,129.908434,"POLYGON ((150.84626 -32.12995, 150.84627 -32.1..."
...,...,...,...,...,...
8524,90003,Home Island,Very good,0.893725,"MULTIPOLYGON (((96.89374 -12.12017, 96.89374 -..."
8525,90004,Jervis Bay (OT),Very good,67.798498,"MULTIPOLYGON (((150.69504 -35.18410, 150.69517..."
8526,90005,West Keeling Island,Very good,5.927649,"MULTIPOLYGON (((96.82264 -12.17193, 96.82263 -..."
8527,99494,No usual address (OT),,,


In [37]:
lga_names = df_shape.LGA_NAME16.str.split('(').str[0]
lga_names

0                               Albury 
1                    Armidale Regional 
2                              Ballina 
3                            Balranald 
4                    Bathurst Regional 
                     ...               
558                   No usual address 
559    Migratory - Offshore - Shipping 
560         Unincorp. Other Territories
561                   No usual address 
562    Migratory - Offshore - Shipping 
Name: LGA_NAME16, Length: 563, dtype: object

In [49]:
df_shape_ssc_2011.columns

Index(['SSC_CODE', 'SSC_NAME', 'CONF_VALUE', 'SQKM', 'geometry'], dtype='object')

In [127]:
ssc_names_2011 = (
    df_shape_ssc_2011
    .SSC_NAME
    .str.split('(').str[0]
    .str.strip()
)
ssc_names_2011

0                            Abbotsbury
1                            Abbotsford
2                           Abercrombie
3                              Aberdare
4                              Aberdeen
                     ...               
8524                        Home Island
8525                         Jervis Bay
8526                West Keeling Island
8527                   No usual address
8528    Migratory - Offshore - Shipping
Name: SSC_NAME, Length: 8529, dtype: object

In [129]:
df_shape_ssc_2011.loc[:, 'in_nsw'] = df_shape_ssc_2011.SSC_NAME.str.contains('NSW')
df_shape_ssc_2011

Unnamed: 0,SSC_CODE,SSC_NAME,CONF_VALUE,SQKM,geometry,in_nsw
0,10001,Abbotsbury,Very good,4.984673,"POLYGON ((150.85118 -33.87069, 150.85104 -33.8...",False
1,10002,Abbotsford (NSW),Very good,1.017855,"POLYGON ((151.12593 -33.84578, 151.12678 -33.8...",True
2,10003,Abercrombie,Very good,1.041389,"POLYGON ((149.55478 -33.39421, 149.55414 -33.3...",False
3,10004,Aberdare,Very good,1.649523,"POLYGON ((151.36829 -32.83665, 151.37073 -32.8...",False
4,10005,Aberdeen (NSW),Good,129.908434,"POLYGON ((150.84626 -32.12995, 150.84627 -32.1...",True
...,...,...,...,...,...,...
8524,90003,Home Island,Very good,0.893725,"MULTIPOLYGON (((96.89374 -12.12017, 96.89374 -...",False
8525,90004,Jervis Bay (OT),Very good,67.798498,"MULTIPOLYGON (((150.69504 -35.18410, 150.69517...",False
8526,90005,West Keeling Island,Very good,5.927649,"MULTIPOLYGON (((96.82264 -12.17193, 96.82263 -...",False
8527,99494,No usual address (OT),,,,False


In [74]:
non_match_index = (
    pd.Series(df_merged.host_neighbourhood.unique())
    .str.split('/').str[0]
    .str.strip()
    .sort_values()
    .isin(ssc_names_2011)
    .loc[lambda x: ~x]
    .index
)
non_match_index

Int64Index([250, 182, 271, 238, 176, 249, 230, 269, 221, 113, 140, 256, 235,
            232, 245, 223, 260, 222,  10, 205, 233, 253, 194, 217, 188, 179,
             72, 247, 252, 201, 219, 189, 240, 244, 146, 243, 261, 227, 141,
            266, 229, 187,  98, 234, 258, 225, 218, 237, 263, 231, 202, 265,
            209, 228, 206, 257, 210, 270, 264, 200, 267, 242, 224, 212, 213,
            215, 211, 193, 268, 262, 207, 251,  59, 208,  96, 186, 241, 184,
             11],
           dtype='int64')

In [80]:
ssc_clean = df_shape_ssc_2011.SSC_NAME.str.split('(').str[0].str.strip()

In [76]:
for col in pd.Series(df_merged.host_neighbourhood.unique()).sort_values().loc[non_match_index]:
    print(col)

Amsterdam Centrum
Anaheim
Arlington Ridge
Balham
Ballsbridge
Ban Rim Pha
Baumettes
Bela Vista
Beverly Park
Brighton-Le-Sands
Brixton
Brockley
Bugis/Kampong Glam
Bushwick
Cannes
Canonbury
Capucins - Victoire
Central Area
Central Business District
Chinatown
City Centre
Coral Way
Da'an
Dalston
Dansaert
Deceyville
Denpasar
Dreta de l'Eixample
Earls Court
Fortress Hill
Gangnam
Gramercy Park
Hammersmith
Hampstead
Hilo
Isle of Dogs
Kaunlaran/Valencia
Kreuzberg
Kuta Village
LB of Brent
LB of Camden
LB of Islington
Ludwigsvorstadt - Isarvorstadt
Merkaz HaIr
Mescal Corridor NW
Mid-Levels
Midtown East
Mitte
Monceau
Mong Kok
Murray Hill
Núñez
Oud-West
Palermo
Pasir Ris
Poblados Marítimos
Prenzlauer Berg
Punta Cancun
Ratchathewi/Phaya Thai
Rifredi
Saint Peters
Santa Catarina
Shabazi
Sheung Wan
Soho
Södermalm
Tai Ping Shan
The Liberties
Tsim Sha Tsui
Ubud
Vaugirard
Vijay Vihar Phase II
Waverly
West Village
Whitechapel/Brick Lane
Williamsburg
XI Arrondissement
Žižkov
None


In [131]:
df_mapping = pd.read_csv(references_dir / 'host_neighbourhood_mapping.csv')
df_mapping

Unnamed: 0,host_neighbourhood,suburb
0,Amsterdam Centrum,Other
1,Anaheim,Other
2,Arlington Ridge,Other
3,Balham,Other
4,Ballsbridge,Other
...,...,...
74,Whitechapel/Brick Lane,Other
75,Williamsburg,Other
76,XI Arrondissement,Other
77,Žižkov,Other


In [133]:
df_merged.host_neighbourhood

0          Potts Point
1              Pyrmont
2            Balgowlah
3         Darlinghurst
4         Darlinghurst
              ...     
410968    Marrickville
410969      Palm Beach
410970            None
410971            None
410972            None
Name: host_neighbourhood, Length: 410973, dtype: object

In [135]:
replace_dict = dict(zip(df_mapping.host_neighbourhood, df_mapping.suburb))

In [136]:
df_merged.loc[:, 'host_neighbourhood_cleansed'] = df_merged.host_neighbourhood.replace(replace_dict)

In [145]:
contains = 'Potts'
ssc_clean[ssc_clean.str.contains(contains)]

1903     Potts Hill
1904    Potts Point
1905     Pottsville
Name: SSC_NAME, dtype: object

In [146]:
ssc_clean.loc[1904]

'Potts Point'

In [148]:
df_merged.host_neighbourhood_cleansed.loc[0]

'Potts Point'

In [149]:
df_shape_ssc_2011.query('SSC_NAME == "Potts Point"')

Unnamed: 0,SSC_CODE,SSC_NAME,CONF_VALUE,SQKM,geometry,in_nsw
1904,11905,Potts Point,Very good,0.616643,"POLYGON ((151.22489 -33.87301, 151.22504 -33.8...",False


In [169]:
df_shape_ssc_2011_cleaned = (
    df_shape_ssc_2011
    # Suburb names that also exist elsewhere carry a qualifier in
    # parentheses, e.g. 'Abbotsford (NSW)'.
    .assign(in_nsw = lambda x: x.SSC_NAME.str.contains('NSW'),
            is_duplicated = lambda x: x.SSC_NAME.str.contains(r'\('),
            SSC_NAME_cleaned = lambda x: x.SSC_NAME.str.split('(').str[0].str.strip())
    .loc[:, ['SSC_NAME', 'SSC_NAME_cleaned', 'geometry', 'in_nsw', 'is_duplicated']]
    .dropna(subset=['geometry'])
)

df_shape_ssc_2011_cleaned

Unnamed: 0,SSC_NAME,SSC_NAME_cleaned,geometry,in_nsw,is_duplicated
0,Abbotsbury,Abbotsbury,"POLYGON ((150.85118 -33.87069, 150.85104 -33.8...",False,False
1,Abbotsford (NSW),Abbotsford,"POLYGON ((151.12593 -33.84578, 151.12678 -33.8...",True,True
2,Abercrombie,Abercrombie,"POLYGON ((149.55478 -33.39421, 149.55414 -33.3...",False,False
3,Aberdare,Aberdare,"POLYGON ((151.36829 -32.83665, 151.37073 -32.8...",False,False
4,Aberdeen (NSW),Aberdeen,"POLYGON ((150.84626 -32.12995, 150.84627 -32.1...",True,True
...,...,...,...,...,...
8522,Christmas Island,Christmas Island,"POLYGON ((105.63262 -10.52337, 105.63263 -10.5...",False,False
8523,Directions Island,Directions Island,"MULTIPOLYGON (((96.88902 -12.20096, 96.88904 -...",False,False
8524,Home Island,Home Island,"MULTIPOLYGON (((96.89374 -12.12017, 96.89374 -...",False,False
8525,Jervis Bay (OT),Jervis Bay,"MULTIPOLYGON (((150.69504 -35.18410, 150.69517...",False,True


In [179]:
drop_index = df_shape_ssc_2011_cleaned.loc[lambda x: ~x.in_nsw & x.is_duplicated].index
drop_index

Int64Index([2629, 2635, 2636, 2637, 2638, 2642, 2644, 2650, 2651, 2653,
            ...
            8494, 8496, 8497, 8501, 8502, 8506, 8507, 8510, 8517, 8525],
           dtype='int64', length=1044)

In [182]:
ssc_df_clean = df_shape_ssc_2011_cleaned.drop(drop_index)

In [187]:
df_merged.loc[:, 'host_neighbourhood_cleansed'] = (
    df_merged.host_neighbourhood
    .replace(replace_dict)
    # For values with a forward slash, take only the first value.
    .str.split('/')
    .str[0]
    .str.strip()
)

In [188]:
df_merged.host_neighbourhood_cleansed

0          Potts Point
1              Pyrmont
2            Balgowlah
3         Darlinghurst
4         Darlinghurst
              ...     
410968    Marrickville
410969      Palm Beach
410970            None
410971            None
410972            None
Name: host_neighbourhood_cleansed, Length: 410973, dtype: object
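
The cleansing chain above can be tried on a handful of values. This is a minimal sketch: the sample series and the one-entry mapping are hypothetical stand-ins for `df_merged.host_neighbourhood` and `host_neighbourhood_mapping.csv`.

```python
import pandas as pd

# Hypothetical sample; the real values come from df_merged.host_neighbourhood.
host_neighbourhood = pd.Series(
    ['Potts Point', 'Abbotsford/Wareemba', 'Williamsburg', None]
)

# Hypothetical excerpt of the mapping file: overseas neighbourhoods map to 'Other'.
replace_dict = {'Williamsburg': 'Other'}

cleansed = (
    host_neighbourhood
    .replace(replace_dict)
    # For values with a forward slash, keep only the part before it.
    .str.split('/')
    .str[0]
    .str.strip()
)
print(cleansed.tolist())
```

Missing values pass through the chain as missing, so hosts without a neighbourhood stay unmatched in the later join.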

In [189]:
df_merged = df_merged.merge(ssc_df_clean[['SSC_NAME_cleaned', 'geometry']],
                     left_on='host_neighbourhood_cleansed',
                     right_on='SSC_NAME_cleaned',
                     how='left')
df_merged

Unnamed: 0,id,host_id,execution_date,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable_x,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,LGA_CODE_2016_x,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,LGA_CODE_2016_y,name,description,neighborhood_overview,picture_url,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,price,number_of_reviews,instant_bookable_y,host_neighbourhood_cleansed,SSC_NAME_cleaned,geometry
0,11156,40855,2020-05-01,2.0,2.0,180.0,180.0,2.0,180.0,True,28,58,88,363,2020-05-11,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,False,1,0,1,0,1.54,,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",within a day,100%,93%,False,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,17200,An Oasis in the City,Very central to the city which can be reached ...,"It is very close to everything and everywhere,...",https://a0.muscache.com/im/pictures/2797669/17...,Sydney,-33.86917,151.22656,Apartment,Private room,1,6400.0,196,False,Potts Point,Potts Point,"POLYGON ((151.22489 -33.87301, 151.22504 -33.8..."
1,12351,17061,2020-05-01,2.0,2.0,7.0,7.0,2.0,7.0,True,0,0,0,0,2020-05-10,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,False,2,0,2,0,4.41,,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,75%,False,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,17200,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/im/pictures/763ad5c8-c...,Sydney,-33.86515,151.19190,Townhouse,Private room,2,1431500.0,526,False,Pyrmont,Pyrmont,"POLYGON ((151.19113 -33.86555, 151.19112 -33.8..."
2,14250,55948,2020-05-01,5.0,5.0,22.0,22.0,5.0,22.0,True,0,0,0,141,2020-05-11,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,False,2,2,0,0,0.04,,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,100%,52%,True,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,15990,Manly Harbour House,"Beautifully renovated, spacious and quiet, our...",Balgowlah Heights is one of the most prestigio...,https://a0.muscache.com/im/pictures/56935671/f...,Manly,-33.80093,151.26172,House,Entire home/apt,6,47000.0,2,False,Balgowlah,Balgowlah,"POLYGON ((151.25673 -33.80046, 151.25565 -33.8..."
3,15253,59850,2020-05-01,2.0,2.0,7.0,7.0,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,100%,99%,False,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Unique Designer Rooftop Apartment in City Loca...,Penthouse living at it best ... You will be st...,The location is really central and there is nu...,https://a0.muscache.com/im/pictures/46dcb8a1-5...,Sydney,-33.87964,151.21680,Apartment,Private room,2,10000.0,364,True,Darlinghurst,Darlinghurst,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8..."
4,44545,112237,2020-05-01,3.0,3.0,365.0,365.0,3.0,365.0,True,0,0,0,0,2020-05-12,2010-10-20,2020-01-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,False,1,1,0,0,0.65,,https://www.airbnb.com/users/show/112237,Atari,2010-04-22,"Sydney, New South Wales, Australia",Curious about the world and full of wanderlust...,,,85%,True,https://a0.muscache.com/im/pictures/user/34708...,https://a0.muscache.com/im/pictures/user/34708...,Darlinghurst,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Sunny Darlinghurst Warehouse Apartment,Sunny warehouse/loft apartment in the heart of...,Darlinghurst is home to some of Sydney's best ...,https://a0.muscache.com/im/pictures/a88d8e14-4...,Sydney,-33.87888,151.21439,Loft,Entire home/apt,2,13100.0,76,False,Darlinghurst,Darlinghurst,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
412292,49118065,4358703,2021-04-01,3.0,3.0,1125.0,1125.0,3.0,1125.0,True,9,35,65,156,2021-04-12,,,,,,,,,,True,3,3,0,0,,,https://www.airbnb.com/users/show/4358703,Galina,2012-12-08,"Sydney, New South Wales, Australia",My name is Galina. I'm a proud mum of a 5 year...,within an hour,100%,98%,False,https://a0.muscache.com/im/pictures/user/c226a...,https://a0.muscache.com/im/pictures/user/c226a...,Marrickville,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,14170,"Marvellous Marrickville apartment, 15 mins to ...",For those wanting easy access to inner Sydney'...,,https://a0.muscache.com/pictures/94a3bd3f-4ac2...,Marrickville,-33.91559,151.15586,Entire house,Entire home/apt,4,8800.0,0,True,Marrickville,Marrickville,"POLYGON ((151.14322 -33.90899, 151.14336 -33.9..."
412293,49118280,95214788,2021-04-01,1.0,1.0,1125.0,1125.0,1.0,1125.0,True,7,7,7,7,2021-04-12,,,,,,,,,,True,35,35,0,0,,,https://www.airbnb.com/users/show/95214788,Cushie - Concierge Services,2016-09-15,"Avalon Beach, New South Wales, Australia",cushie provides hosting management and concier...,within a few hours,91%,93%,False,https://a0.muscache.com/im/pictures/user/8b55f...,https://a0.muscache.com/im/pictures/user/8b55f...,Palm Beach,36.0,36.0,"['email', 'phone', 'google', 'reviews', 'offli...",t,f,15990,Peaceful Pittwater Views from Eclectic Getaway,Immerse yourself in this north facing peaceful...,,https://a0.muscache.com/pictures/96d93fbd-590d...,Pittwater,-33.63855,151.31895,Entire house,Entire home/apt,4,40000.0,0,True,Palm Beach,Palm Beach,"POLYGON ((151.32729 -33.61595, 151.32728 -33.6..."
412294,49118321,382207272,2021-04-01,2.0,2.0,1125.0,1125.0,2.0,1125.0,True,25,55,85,176,2021-04-12,,,,,,,,,,False,26,26,0,0,,,https://www.airbnb.com/users/show/382207272,Victor,2020-12-29,AU,,within a few hours,100%,100%,False,https://a0.muscache.com/im/pictures/user/3537d...,https://a0.muscache.com/im/pictures/user/3537d...,,7.0,7.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,t,16700,MQ13 Convinent 2 Bedroom Close MQ Shopping Centre,Brand new luxury 2 bedroom apartments in the h...,,https://a0.muscache.com/pictures/ad703b5e-e9d9...,Ryde,-33.77901,151.12065,Entire apartment,Entire home/apt,5,12800.0,0,False,,,
412295,49118480,382207272,2021-04-01,2.0,2.0,1125.0,1125.0,2.0,1125.0,True,14,44,74,75,2021-04-13,,,,,,,,,,False,26,26,0,0,,,https://www.airbnb.com/users/show/382207272,Victor,2020-12-29,AU,,within a few hours,100%,100%,False,https://a0.muscache.com/im/pictures/user/3537d...,https://a0.muscache.com/im/pictures/user/3537d...,,7.0,7.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,t,11520,FD50 Newly Furnished 1 Bedroom in Five Dock,Five dock is a real gem in waterside centrally...,,https://a0.muscache.com/pictures/b6c19d4b-424f...,Canada Bay,-33.86390,151.13073,Entire apartment,Entire home/apt,2,11000.0,0,False,,,
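
The row index of the merged output runs to 412295, while `df_merged` had 410973 rows before the merge. This suggests that some `SSC_NAME_cleaned` values are not unique, so a listing can match more than one suburb. A minimal sketch of how this could be detected, using hypothetical miniature tables (the second 'Alison' entry is made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the listings table.
listings = pd.DataFrame({
    'id': [1, 2, 3],
    'host_neighbourhood_cleansed': ['Alison', 'Pyrmont', None],
})

# Hypothetical suburb table: two suburbs collapse to the same cleaned name.
suburbs = pd.DataFrame({
    'SSC_NAME': ['Alison (Dungog - NSW)', 'Alison (Wyong - NSW)', 'Pyrmont'],
    'SSC_NAME_cleaned': ['Alison', 'Alison', 'Pyrmont'],
})

merge_kwargs = dict(
    left_on='host_neighbourhood_cleansed',
    right_on='SSC_NAME_cleaned',
    how='left',
)

# A plain left merge silently duplicates the listing that matches twice.
merged = listings.merge(suburbs, **merge_kwargs)
rows_added = len(merged) - len(listings)

# validate='many_to_one' raises MergeError if the right-hand key is not unique.
try:
    listings.merge(suburbs, validate='many_to_one', **merge_kwargs)
    right_key_is_unique = True
except pd.errors.MergeError:
    right_key_is_unique = False

# List the cleaned names that occur more than once.
duplicated_names = (
    suburbs
    .loc[suburbs.SSC_NAME_cleaned.duplicated(keep=False), 'SSC_NAME_cleaned']
    .unique()
    .tolist()
)
print(rows_added, right_key_is_unique, duplicated_names)
```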


## The `SSC_NAME`s that have '('

In [164]:
df_shape_ssc_2011_cleaned.SSC_NAME.loc[df_shape_ssc_2011_cleaned.SSC_NAME.str.contains(r'\(')]

1            Abbotsford (NSW)
4              Aberdeen (NSW)
23      Alison (Dungog - NSW)
25            Allandale (NSW)
38            Annandale (NSW)
                ...          
8506             Spence (ACT)
8507           Stirling (ACT)
8510           Theodore (ACT)
8517             Weston (ACT)
8525          Jervis Bay (OT)
Name: SSC_NAME, Length: 1440, dtype: object
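
`Series.str.contains` treats its pattern as a regular expression by default, so the opening parenthesis has to be escaped or matched literally. A small sketch on hypothetical sample names, which also pulls out the qualifier inside the parentheses (the `qualifier` column is an illustration and not part of the notebook):

```python
import pandas as pd

# Hypothetical sample of SSC_NAME values.
names = pd.Series(['Abbotsford (NSW)', 'Potts Point', 'Spence (ACT)'])

# Two equivalent ways to test for a literal '('.
has_paren_regex = names.str.contains(r'\(')
has_paren_literal = names.str.contains('(', regex=False)

# Name without the qualifier, as used for SSC_NAME_cleaned.
cleaned = names.str.split('(').str[0].str.strip()

# Text between the parentheses; missing where there is no qualifier.
qualifier = names.str.extract(r'\(([^)]*)\)', expand=False)

print(pd.DataFrame({'name': names, 'cleaned': cleaned, 'qualifier': qualifier}))
```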

In [141]:
df_merged.merge(df_shape_ssc_2011
               # Some suburb names are duplicated in other states.
               .assign(in_nsw = lambda x: x.SSC_NAME.str.contains('NSW'))
               .query('in_nsw')
               .loc[:, ['SSC_NAME', 'geometry']]
               .dropna(subset=['geometry']),
               left_on='host_neighbourhood_cleansed',
               right_on='SSC_NAME',
               how='left')

Unnamed: 0,id,host_id,execution_date,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable_x,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,LGA_CODE_2016_x,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,LGA_CODE_2016_y,name,description,neighborhood_overview,picture_url,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,price,number_of_reviews,instant_bookable_y,host_neighbourhood_cleansed,SSC_NAME,geometry
0,11156,40855,2020-05-01,2.0,2.0,180.0,180.0,2.0,180.0,True,28,58,88,363,2020-05-11,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,False,1,0,1,0,1.54,,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",within a day,100%,93%,False,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,17200,An Oasis in the City,Very central to the city which can be reached ...,"It is very close to everything and everywhere,...",https://a0.muscache.com/im/pictures/2797669/17...,Sydney,-33.86917,151.22656,Apartment,Private room,1,6400.0,196,False,Potts Point,,
1,12351,17061,2020-05-01,2.0,2.0,7.0,7.0,2.0,7.0,True,0,0,0,0,2020-05-10,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,False,2,0,2,0,4.41,,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,75%,False,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,17200,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/im/pictures/763ad5c8-c...,Sydney,-33.86515,151.19190,Townhouse,Private room,2,1431500.0,526,False,Pyrmont,,
2,14250,55948,2020-05-01,5.0,5.0,22.0,22.0,5.0,22.0,True,0,0,0,141,2020-05-11,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,False,2,2,0,0,0.04,,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,100%,52%,True,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,15990,Manly Harbour House,"Beautifully renovated, spacious and quiet, our...",Balgowlah Heights is one of the most prestigio...,https://a0.muscache.com/im/pictures/56935671/f...,Manly,-33.80093,151.26172,House,Entire home/apt,6,47000.0,2,False,Balgowlah,,
3,15253,59850,2020-05-01,2.0,2.0,7.0,7.0,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,100%,99%,False,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Unique Designer Rooftop Apartment in City Loca...,Penthouse living at it best ... You will be st...,The location is really central and there is nu...,https://a0.muscache.com/im/pictures/46dcb8a1-5...,Sydney,-33.87964,151.21680,Apartment,Private room,2,10000.0,364,True,Darlinghurst,,
4,44545,112237,2020-05-01,3.0,3.0,365.0,365.0,3.0,365.0,True,0,0,0,0,2020-05-12,2010-10-20,2020-01-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,False,1,1,0,0,0.65,,https://www.airbnb.com/users/show/112237,Atari,2010-04-22,"Sydney, New South Wales, Australia",Curious about the world and full of wanderlust...,,,85%,True,https://a0.muscache.com/im/pictures/user/34708...,https://a0.muscache.com/im/pictures/user/34708...,Darlinghurst,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Sunny Darlinghurst Warehouse Apartment,Sunny warehouse/loft apartment in the heart of...,Darlinghurst is home to some of Sydney's best ...,https://a0.muscache.com/im/pictures/a88d8e14-4...,Sydney,-33.87888,151.21439,Loft,Entire home/apt,2,13100.0,76,False,Darlinghurst,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
410968,49118065,4358703,2021-04-01,3.0,3.0,1125.0,1125.0,3.0,1125.0,True,9,35,65,156,2021-04-12,,,,,,,,,,True,3,3,0,0,,,https://www.airbnb.com/users/show/4358703,Galina,2012-12-08,"Sydney, New South Wales, Australia",My name is Galina. I'm a proud mum of a 5 year...,within an hour,100%,98%,False,https://a0.muscache.com/im/pictures/user/c226a...,https://a0.muscache.com/im/pictures/user/c226a...,Marrickville,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,14170,"Marvellous Marrickville apartment, 15 mins to ...",For those wanting easy access to inner Sydney'...,,https://a0.muscache.com/pictures/94a3bd3f-4ac2...,Marrickville,-33.91559,151.15586,Entire house,Entire home/apt,4,8800.0,0,True,Marrickville,,
410969,49118280,95214788,2021-04-01,1.0,1.0,1125.0,1125.0,1.0,1125.0,True,7,7,7,7,2021-04-12,,,,,,,,,,True,35,35,0,0,,,https://www.airbnb.com/users/show/95214788,Cushie - Concierge Services,2016-09-15,"Avalon Beach, New South Wales, Australia",cushie provides hosting management and concier...,within a few hours,91%,93%,False,https://a0.muscache.com/im/pictures/user/8b55f...,https://a0.muscache.com/im/pictures/user/8b55f...,Palm Beach,36.0,36.0,"['email', 'phone', 'google', 'reviews', 'offli...",t,f,15990,Peaceful Pittwater Views from Eclectic Getaway,Immerse yourself in this north facing peaceful...,,https://a0.muscache.com/pictures/96d93fbd-590d...,Pittwater,-33.63855,151.31895,Entire house,Entire home/apt,4,40000.0,0,True,Palm Beach,,
410970,49118321,382207272,2021-04-01,2.0,2.0,1125.0,1125.0,2.0,1125.0,True,25,55,85,176,2021-04-12,,,,,,,,,,False,26,26,0,0,,,https://www.airbnb.com/users/show/382207272,Victor,2020-12-29,AU,,within a few hours,100%,100%,False,https://a0.muscache.com/im/pictures/user/3537d...,https://a0.muscache.com/im/pictures/user/3537d...,,7.0,7.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,t,16700,MQ13 Convinent 2 Bedroom Close MQ Shopping Centre,Brand new luxury 2 bedroom apartments in the h...,,https://a0.muscache.com/pictures/ad703b5e-e9d9...,Ryde,-33.77901,151.12065,Entire apartment,Entire home/apt,5,12800.0,0,False,,,
410971,49118480,382207272,2021-04-01,2.0,2.0,1125.0,1125.0,2.0,1125.0,True,14,44,74,75,2021-04-13,,,,,,,,,,False,26,26,0,0,,,https://www.airbnb.com/users/show/382207272,Victor,2020-12-29,AU,,within a few hours,100%,100%,False,https://a0.muscache.com/im/pictures/user/3537d...,https://a0.muscache.com/im/pictures/user/3537d...,,7.0,7.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,t,11520,FD50 Newly Furnished 1 Bedroom in Five Dock,Five dock is a real gem in waterside centrally...,,https://a0.muscache.com/pictures/b6c19d4b-424f...,Canada Bay,-33.86390,151.13073,Entire apartment,Entire home/apt,2,11000.0,0,False,,,
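
Every `SSC_NAME` in the output above is empty. A likely reason is that filtering on `in_nsw` keeps only names that carry an '(NSW)' qualifier, which drops unqualified names such as 'Potts Point', while the remaining names still include the qualifier and so never equal the cleansed host neighbourhood. A minimal sketch with hypothetical rows, using `indicator=True` to count the matches:

```python
import pandas as pd

# Hypothetical miniature tables.
listings = pd.DataFrame(
    {'host_neighbourhood_cleansed': ['Potts Point', 'Abbotsford']}
)
suburbs = pd.DataFrame(
    {'SSC_NAME': ['Potts Point', 'Abbotsford (NSW)', 'Abbotsford (Vic.)']}
)

# Same filter as in the cell above: keep names that mention NSW.
nsw_only = suburbs[suburbs.SSC_NAME.str.contains('NSW')]

check = listings.merge(
    nsw_only,
    left_on='host_neighbourhood_cleansed',
    right_on='SSC_NAME',
    how='left',
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)
print(check)
```

In this sketch both listings end up as `left_only`, which mirrors the empty `SSC_NAME` and `geometry` columns above.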


# Test groupby

In [29]:
def agg_group(df: pd.DataFrame) -> pd.Series:
    """
    Aggregate one group; to be used with pandas.core.groupby.GroupBy.apply.
    :param df: The rows of a single group.
    :return: A pd.Series with one entry per calculated metric.
    """
    calc_list = []
    calc_names = [
        'n_distinct_hosts',
        'est_revenue',
        'est_revenue_per_host'
    ]

    calc_list.append(df.host_id.nunique())
    calc_list.append(((30 - df.availability_30) * df.price).sum())
    calc_list.append(((30 - df.availability_30) * df.price).sum() / df.host_id.nunique())

    result = pd.Series(calc_list, index=calc_names)
    return result

In [30]:
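groupby = ['host_neighbourhood']
# The result of this query is not assigned, so the aggregation below runs on
# every execution_date in df_merged.
df_merged.query('execution_date == "2020-05-01"')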

(
    df_merged
    .groupby(groupby)
    .apply(agg_group)
)

Unnamed: 0_level_0,n_distinct_hosts,est_revenue,est_revenue_per_host
host_neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abbotsford,1.0,4.848300e+07,4.848300e+07
Abbotsford/Wareemba,15.0,6.341240e+07,4.227493e+06
Albert Park,1.0,4.950000e+06,4.950000e+06
Alexandria,129.0,5.303063e+08,4.110902e+06
Allawah,14.0,3.127530e+07,2.233950e+06
...,...,...,...
Woollahra,167.0,1.582896e+09,9.478421e+06
Woolloomooloo,146.0,8.650559e+08,5.925040e+06
XI Arrondissement,1.0,5.400000e+06,5.400000e+06
Zetland,166.0,8.074835e+08,4.864359e+06
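
The same three metrics can also be computed without `GroupBy.apply`, by deriving the per-listing revenue first and then using named aggregation. This is a sketch on hypothetical rows and is not the approach used in the notebook; the formula `(30 - availability_30) * price` is taken from `agg_group`.

```python
import pandas as pd

# Hypothetical listings; column names follow df_merged.
df = pd.DataFrame({
    'host_neighbourhood': ['Pyrmont', 'Pyrmont', 'Pyrmont', 'Zetland'],
    'host_id': [1, 1, 2, 3],
    'availability_30': [0, 10, 30, 20],
    'price': [100.0, 200.0, 150.0, 50.0],
})

summary = (
    df
    # Estimated revenue per listing: booked nights in the next 30 days * price.
    .assign(est_revenue=lambda x: (30 - x.availability_30) * x.price)
    .groupby('host_neighbourhood')
    .agg(n_distinct_hosts=('host_id', 'nunique'),
         est_revenue=('est_revenue', 'sum'))
    .assign(est_revenue_per_host=lambda x: x.est_revenue / x.n_distinct_hosts)
)
print(summary)
```

Built-in aggregations usually run faster than a Python function passed to `apply`, which may matter on a table of this size.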


# Polygon in Polygon

In [212]:
def _join_ssc(df: pd.DataFrame,
              engine: sa.engine.base.Engine) -> gpd.GeoDataFrame:
    """
    Join suburb level data from `star.SSC_2011_AUST` to `df`. The goal
    is to produce `host_neighbourhood_cleansed` and attach the suburb `geometry`.
    :param df: The dataframe to which the suburb `geometry` is added.
    :param engine: SQLAlchemy engine connected to the Postgres database.
    :return: `df` joined to the suburb data, as a GeoDataFrame.
    """

    ssc_df = gpd.GeoDataFrame.from_postgis(
        con=engine,
        sql='SELECT * FROM star."SSC_2011_AUST"',
        geom_col='geometry'
    )
    
    # The values of `host_neighbourhood` don't all match to the SSC_NAME of
    # the suburb data
    df_mapping = pd.read_csv(references_dir / 'host_neighbourhood_mapping.csv')
    replace_dict = dict(zip(df_mapping.host_neighbourhood, df_mapping.suburb))
    df.loc[:, 'host_neighbourhood_cleansed'] = (
        df.host_neighbourhood
        .replace(replace_dict)
        # For values with a forward slash, take only the first value.
        .str.split('/')
        .str[0]
        .str.strip()
    )

    ssc_df_clean = (
        ssc_df
        # Some suburb names are duplicated in other states.
        .assign(in_nsw = lambda x: x.SSC_NAME.str.contains('NSW'),
                is_duplicated = lambda x: x.SSC_NAME.str.contains(r'\('),
                SSC_NAME_cleaned = lambda x: x.SSC_NAME.str.split('(').str[0].str.strip())
        .dropna(subset=['geometry'])
    )
    print(f'type(ssc_df_clean): {type(ssc_df_clean)}')

    # Drop suburbs whose qualifier in parentheses does not mention NSW.
    # NOTE: `ssc_df_clean` is not used in the merge below, which still joins
    # on the raw `SSC_NAME`.
    drop_index = ssc_df_clean.loc[lambda x: ~x.in_nsw & x.is_duplicated].index
    ssc_df_clean.drop(drop_index, inplace=True)

    # Join the geometry of the suburb. The GeoDataFrame is the left-hand side
    # so that the result is a GeoDataFrame; `how='right'` keeps every row of `df`.
    df_merged = (
        ssc_df
        .merge(df,
               left_on='SSC_NAME',
               right_on='host_neighbourhood_cleansed',
               how='right')
    )

    return df_merged

In [193]:
ds = '2020-05-01'
query = f"""
        SELECT * 
        FROM star.fact_airbnb
        WHERE execution_date = '{ds}'
        """

df = pd.read_sql(con=engine, sql=query)
df

Unnamed: 0,id,host_id,execution_date,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,LGA_CODE_2016
0,11156,40855,2020-05-01,2,2,180,180,2.0,180.0,True,28,58,88,363,2020-05-11,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,False,1,0,1,0,1.54,
1,12351,17061,2020-05-01,2,2,7,7,2.0,7.0,True,0,0,0,0,2020-05-10,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,False,2,0,2,0,4.41,
2,14250,55948,2020-05-01,5,5,22,22,5.0,22.0,True,0,0,0,141,2020-05-11,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,False,2,2,0,0,0.04,
3,15253,59850,2020-05-01,2,2,7,7,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,
4,44545,112237,2020-05-01,3,3,365,365,3.0,365.0,True,0,0,0,0,2020-05-12,2010-10-20,2020-01-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,False,1,1,0,0,0.65,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37554,43386958,342929486,2020-05-01,1,1,1125,1125,1.0,1125.0,True,30,60,90,365,2020-05-12,,,,,,,,,,True,1,1,0,0,,
37555,43391404,345727481,2020-05-01,45,45,1125,1125,45.0,1125.0,True,30,60,90,365,2020-05-12,,,,,,,,,,True,1,0,1,0,,
37556,43391666,300655692,2020-05-01,30,30,1125,1125,30.0,1125.0,True,24,54,84,84,2020-05-11,,,,,,,,,,False,53,10,7,36,,
37557,43392171,223730845,2020-05-01,2,2,365,365,2.0,365.0,True,24,51,81,172,2020-05-11,,,,,,,,,,False,1,1,0,0,,


In [198]:
df_host = pd.read_sql(con=engine, sql='SELECT * FROM star.dim_host')
df_merged = df.merge(df_host, on='host_id', how='left')

In [235]:
rename_dict = {'geometry': 'host_neighbourhood_geometry'}
# Renaming the geometry column is not recommended, so `rename_dict` is not applied.
df_ssc = _join_ssc(df_merged, engine)

type(ssc_df_clean): <class 'geopandas.geodataframe.GeoDataFrame'>


In [205]:
df_lga = gpd.GeoDataFrame.from_postgis(
    sql='SELECT * FROM star."LGA_2016_AUST"',
    con=engine,
    geom_col='geometry'
)

In [220]:
print(f'df_ssc: {type(df_ssc)}')
print(f'df_lga: {type(df_lga)}')

df_ssc: <class 'geopandas.geodataframe.GeoDataFrame'>
df_lga: <class 'geopandas.geodataframe.GeoDataFrame'>


Each suburb polygon in `df_ssc.geometry` (the host neighbourhood geometry) should be contained in an LGA polygon in `df_lga.geometry`.

In [240]:
gpd.sjoin(left_df=df_ssc,
          right_df=df_lga,
          op='intersects',
          how='left')

Unnamed: 0,SSC_CODE,SSC_NAME,CONF_VALUE,SQKM,geometry,id,host_id,execution_date,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,LGA_CODE_2016_left,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,host_neighbourhood_cleansed,index_right,LGA_CODE_2016_right,LGA_NAME16,STE_CODE16,STE_NAME16,AREASQKM16
0,11905,Potts Point,Very good,0.616643,"POLYGON ((151.22489 -33.87301, 151.22504 -33.8...",11156,40855,2020-05-01,2,2,180,180,2.0,180.0,True,28,58,88,363,2020-05-11,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,False,1,0,1,0,1.54,,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",within a day,100%,93%,False,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,Potts Point,105.0,17200,Sydney (C),1,New South Wales,26.7429
1,11923,Pyrmont,Very good,0.932580,"POLYGON ((151.19113 -33.86555, 151.19112 -33.8...",12351,17061,2020-05-01,2,2,7,7,2.0,7.0,True,0,0,0,0,2020-05-10,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,False,2,0,2,0,4.41,,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,75%,False,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,Pyrmont,105.0,17200,Sydney (C),1,New South Wales,26.7429
2,10084,Balgowlah,Very good,1.955000,"POLYGON ((151.25673 -33.80046, 151.25565 -33.8...",14250,55948,2020-05-01,5,5,22,22,5.0,22.0,True,0,0,0,141,2020-05-11,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,False,2,2,0,0,0.04,,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,100%,52%,True,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,Balgowlah,85.0,15990,Northern Beaches (A),1,New South Wales,254.2074
3,10691,Darlinghurst,Very good,0.857011,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8...",15253,59850,2020-05-01,2,2,7,7,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,100%,99%,False,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,Darlinghurst,105.0,17200,Sydney (C),1,New South Wales,26.7429
3,10691,Darlinghurst,Very good,0.857011,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8...",15253,59850,2020-05-01,2,2,7,7,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,100%,99%,False,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,Darlinghurst,127.0,18500,Woollahra (A),1,New South Wales,12.2775
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37556,12531,Wolli Creek,Very good,0.679303,"POLYGON ((151.14991 -33.93172, 151.14978 -33.9...",43391666,300655692,2020-05-01,30,30,1125,1125,30.0,1125.0,True,24,54,84,84,2020-05-11,,,,,,,,,,False,53,10,7,36,,,https://www.airbnb.com/users/show/300655692,Eraldo,2019-10-07,AU,,within a few hours,90%,64%,False,https://a0.muscache.com/im/pictures/user/14943...,https://a0.muscache.com/im/pictures/user/14943...,Wolli Creek,91.0,91.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,f,Wolli Creek,96.0,16650,Rockdale (C),1,New South Wales,28.2078
37556,12531,Wolli Creek,Very good,0.679303,"POLYGON ((151.14991 -33.93172, 151.14978 -33.9...",43391666,300655692,2020-05-01,30,30,1125,1125,30.0,1125.0,True,24,54,84,84,2020-05-11,,,,,,,,,,False,53,10,7,36,,,https://www.airbnb.com/users/show/300655692,Eraldo,2019-10-07,AU,,within a few hours,90%,64%,False,https://a0.muscache.com/im/pictures/user/14943...,https://a0.muscache.com/im/pictures/user/14943...,Wolli Creek,91.0,91.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,f,Wolli Creek,23.0,11570,Canterbury-Bankstown (A),1,New South Wales,110.2368
37556,12531,Wolli Creek,Very good,0.679303,"POLYGON ((151.14991 -33.93172, 151.14978 -33.9...",43391666,300655692,2020-05-01,30,30,1125,1125,30.0,1125.0,True,24,54,84,84,2020-05-11,,,,,,,,,,False,53,10,7,36,,,https://www.airbnb.com/users/show/300655692,Eraldo,2019-10-07,AU,,within a few hours,90%,64%,False,https://a0.muscache.com/im/pictures/user/14943...,https://a0.muscache.com/im/pictures/user/14943...,Wolli Creek,91.0,91.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,f,Wolli Creek,55.0,14170,Inner West (A),1,New South Wales,35.3713
37557,,,,,,43392171,223730845,2020-05-01,2,2,365,365,2.0,365.0,True,24,51,81,172,2020-05-11,,,,,,,,,,False,1,1,0,0,,,https://www.airbnb.com/users/show/223730845,Elizabeth,2018-11-01,AU,,,,,False,https://a0.muscache.com/im/pictures/user/38886...,https://a0.muscache.com/im/pictures/user/38886...,,2.0,2.0,"['email', 'phone', 'offline_government_id', 's...",t,f,,,,,,,


In [239]:
gpd.sjoin(left_df=df_lga.loc[:, ['geometry', 'LGA_NAME16', 'LGA_CODE_2016']],
          right_df=df_ssc.loc[:, ['geometry', 'host_neighbourhood_cleansed']],
          op='contains',
          how='left')

Unnamed: 0,geometry,LGA_NAME16,LGA_CODE_2016,index_right,host_neighbourhood_cleansed
0,"POLYGON ((146.82130 -36.04997, 146.82138 -36.0...",Albury (C),10050,,
1,"POLYGON ((151.32425 -30.26923, 151.32419 -30.2...",Armidale Regional (A),10130,,
2,"MULTIPOLYGON (((153.57094 -28.87390, 153.57097...",Ballina (A),10250,,
3,"POLYGON ((143.00432 -33.78165, 143.01538 -33.7...",Balranald (A),10300,,
4,"POLYGON ((149.90753 -33.39968, 149.90717 -33.4...",Bathurst Regional (A),10470,,
...,...,...,...,...,...
540,"MULTIPOLYGON (((132.99223 -11.08298, 132.99068...",West Arnhem (R),74660,,
541,"MULTIPOLYGON (((129.69812 -14.80951, 129.69522...",West Daly (R),74680,,
542,"MULTIPOLYGON (((130.02044 -13.17982, 130.01951...",Unincorporated NT,79399,,
543,"POLYGON ((149.06241 -35.15916, 149.07352 -35.1...",Unincorporated ACT,89399,,


In [233]:
print(type(df_ssc))
print(type(df_ssc[['host_neighbourhood_geometry', 'host_neighbourhood_cleansed']]))
print(type(df_ssc.loc[:, ['host_neighbourhood_geometry', 'host_neighbourhood_cleansed']]))

<class 'geopandas.geodataframe.GeoDataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'geopandas.geodataframe.GeoDataFrame'>


In [224]:
?gpd.sjoin

Signature:
gpd.sjoin(
    left_df,
    right_df,
    how='inner',
    op='intersects',
    lsuffix='left',
    rsuffix='right',
)
Docstring:
Spatial join of two GeoDataFrames.

See the User Guide page :doc:`../../user_guide/mergingdata` for details.


Parameters
----------
left_df, right_df : GeoDataFrames
how : string, default 'inner'
    The type of join:

    * 'left': use keys from left_df; retain only left_df geometry column
    * 'right': use keys from right_df; retain only right_df geometry column
    * 'inner': use intersection of ke

In [223]:
df_ssc[['host_neighbourhood_geometry', 'host_neighbourhood_cleansed']]

Unnamed: 0,host_neighbourhood_geometry,host_neighbourhood_cleansed
0,"POLYGON ((151.22489 -33.87301, 151.22504 -33.8...",Potts Point
1,"POLYGON ((151.19113 -33.86555, 151.19112 -33.8...",Pyrmont
2,"POLYGON ((151.25673 -33.80046, 151.25565 -33.8...",Balgowlah
3,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8...",Darlinghurst
4,"POLYGON ((151.21771 -33.88409, 151.21764 -33.8...",Darlinghurst
...,...,...
37554,,
37555,,
37556,"POLYGON ((151.14991 -33.93172, 151.14978 -33.9...",Wolli Creek
37557,,


In [216]:
df_property = pd.read_sql(con=engine,
                          sql='SELECT * FROM star.dim_property')
df_merged_property = (
    df_ssc
    .merge(df_property, on='id', how='left')
)

In [241]:
df_mapping = pd.read_csv(references_dir / 'host_neighbourhood_mapping.csv')


# GeoPandas point-in-polygon join

In [244]:
schema = 'star'
table_name = 'fact_airbnb'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df = pd.read_sql(con=engine,
                 sql=query)

In [245]:
schema = 'star'
table_name = 'dim_host'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df_host = pd.read_sql(con=engine,
                      sql=query)

In [246]:
schema = 'star'
table_name = 'dim_property'
query = f"""
SELECT *
FROM {schema}.{table_name}
"""

df_property = pd.read_sql(con=engine,
                          sql=query)

In [251]:
schema = 'star'
table_name = 'LGA_2016_AUST'
query = f"""
SELECT *
FROM {schema}."{table_name}"
"""

df_lga = gpd.GeoDataFrame.from_postgis(
    con=engine,
    sql=query,
    geom_col='geometry'
)
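
The loading cells above repeat the same `SELECT *` pattern, and only the mixed-case `LGA_2016_AUST` table needs double quotes. The sketch below is one way to factor that out. The helpers `quote_ident` and `build_select_all` are hypothetical names, not part of `src.utils.utils`, and the approach assumes that always quoting both identifiers is acceptable (lowercase names such as `fact_airbnb` resolve the same way when quoted).

```python
def quote_ident(name: str) -> str:
    """Double-quote a Postgres identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select_all(schema: str, table_name: str) -> str:
    """Return a SELECT * query with the schema and table name both quoted."""
    return f"SELECT *\nFROM {quote_ident(schema)}.{quote_ident(table_name)}"


# Hypothetical usage with the schema and table names from the cells above
query = build_select_all('star', 'LGA_2016_AUST')
print(query)
```

The resulting string can be passed as `sql=query` to `pd.read_sql` or `gpd.GeoDataFrame.from_postgis` exactly as in the cells above.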

In [247]:
df_merged = (
    df
    .merge(df_host, on='host_id', how='left')
    .merge(df_property, on='id', how='left')
)

In [249]:
gdf = gpd.GeoDataFrame(
    df_merged,
    geometry=gpd.points_from_xy(df_merged.longitude, df_merged.latitude)
)

In [264]:
gdf_sjoin = gpd.sjoin(left_df=gdf.loc[:, ['geometry', 'neighbourhood_cleansed']].set_crs('EPSG:4283'),
                      right_df=df_lga.loc[:, ['geometry', 'LGA_CODE16', 'LGA_NAME16']],
                      op='intersects',
                      how='left')

In [265]:
gdf_sjoin

Unnamed: 0,geometry,neighbourhood_cleansed,index_right,LGA_CODE16,LGA_NAME16
0,POINT (151.22656 -33.86917),Sydney,105.0,17200,Sydney (C)
1,POINT (151.19190 -33.86515),Sydney,105.0,17200,Sydney (C)
2,POINT (151.26172 -33.80093),Manly,85.0,15990,Northern Beaches (A)
3,POINT (151.21680 -33.87964),Sydney,105.0,17200,Sydney (C)
4,POINT (151.21439 -33.87888),Sydney,105.0,17200,Sydney (C)
...,...,...,...,...,...
37554,POINT (150.98975 -33.72536),The Hills Shire,109.0,17420,The Hills Shire (A)
37555,POINT (150.78146 -33.73593),Blacktown,8.0,10750,Blacktown (C)
37556,POINT (151.19215 -33.86996),Sydney,105.0,17200,Sydney (C)
37557,POINT (151.29550 -33.63772),Pittwater,,,
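
To make the join above less of a black box, here is a minimal pure-Python sketch of a point-in-polygon test using ray casting. It illustrates the idea only and is not how GeoPandas implements `sjoin`. The rectangle `toy_lga` is a made-up shape, not a real LGA boundary, and `point_in_polygon` is a hypothetical helper.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the closed ring.

    `ring` is a list of (x, y) vertices. Points exactly on the boundary
    are not handled specially.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle around central Sydney (NOT a real LGA boundary)
toy_lga = [(151.17, -33.92), (151.24, -33.92), (151.24, -33.85), (151.17, -33.85)]

print(point_in_polygon(151.22656, -33.86917, toy_lga))  # coordinates of row 0
print(point_in_polygon(151.29550, -33.63772, toy_lga))  # coordinates of row 37557
```

A point that falls in no polygon gets no match, which is how a `how='left'` join ends up with empty `LGA_CODE16` and `LGA_NAME16` values, as in row 37557.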


In [261]:
gdf.set_crs('EPSG:4326')

Unnamed: 0,id,host_id,execution_date,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable_x,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,LGA_CODE_2016,name,description,neighborhood_overview,picture_url,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,price,number_of_reviews,instant_bookable_y,geometry
0,11156,40855,2020-05-01,2,2,180,180,2.0,180.0,True,28,58,88,363,2020-05-11,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,False,1,0,1,0,1.54,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",within a day,100%,93%,False,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,17200,An Oasis in the City,Very central to the city which can be reached ...,"It is very close to everything and everywhere,...",https://a0.muscache.com/im/pictures/2797669/17...,Sydney,-33.86917,151.22656,Apartment,Private room,1,6400.0,196,False,POINT (151.22656 -33.86917)
1,12351,17061,2020-05-01,2,2,7,7,2.0,7.0,True,0,0,0,0,2020-05-10,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,False,2,0,2,0,4.41,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,75%,False,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,17200,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/im/pictures/763ad5c8-c...,Sydney,-33.86515,151.19190,Townhouse,Private room,2,1431500.0,526,False,POINT (151.19190 -33.86515)
2,14250,55948,2020-05-01,5,5,22,22,5.0,22.0,True,0,0,0,141,2020-05-11,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,False,2,2,0,0,0.04,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,100%,52%,True,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,15990,Manly Harbour House,"Beautifully renovated, spacious and quiet, our...",Balgowlah Heights is one of the most prestigio...,https://a0.muscache.com/im/pictures/56935671/f...,Manly,-33.80093,151.26172,House,Entire home/apt,6,47000.0,2,False,POINT (151.26172 -33.80093)
3,15253,59850,2020-05-01,2,2,7,7,2.0,7.0,True,30,60,90,344,2020-05-11,2012-02-23,2020-03-17,88.0,10.0,9.0,10.0,10.0,10.0,9.0,True,1,0,1,0,3.64,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,100%,99%,False,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Unique Designer Rooftop Apartment in City Loca...,Penthouse living at it best ... You will be st...,The location is really central and there is nu...,https://a0.muscache.com/im/pictures/46dcb8a1-5...,Sydney,-33.87964,151.21680,Apartment,Private room,2,10000.0,364,True,POINT (151.21680 -33.87964)
4,44545,112237,2020-05-01,3,3,365,365,3.0,365.0,True,0,0,0,0,2020-05-12,2010-10-20,2020-01-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,False,1,1,0,0,0.65,https://www.airbnb.com/users/show/112237,Atari,2010-04-22,"Sydney, New South Wales, Australia",Curious about the world and full of wanderlust...,,,85%,True,https://a0.muscache.com/im/pictures/user/34708...,https://a0.muscache.com/im/pictures/user/34708...,Darlinghurst,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,17200,Sunny Darlinghurst Warehouse Apartment,Sunny warehouse/loft apartment in the heart of...,Darlinghurst is home to some of Sydney's best ...,https://a0.muscache.com/im/pictures/a88d8e14-4...,Sydney,-33.87888,151.21439,Loft,Entire home/apt,2,13100.0,76,False,POINT (151.21439 -33.87888)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37554,43386958,342929486,2020-05-01,1,1,1125,1125,1.0,1125.0,True,30,60,90,365,2020-05-12,,,,,,,,,,True,1,1,0,0,,https://www.airbnb.com/users/show/342929486,Linda,2020-04-01,AU,We are a happy family and expecting your visit...,within a day,100%,50%,False,https://a0.muscache.com/im/pictures/user/0c94f...,https://a0.muscache.com/im/pictures/user/0c94f...,,2.0,2.0,"['email', 'phone', 'offline_government_id', 's...",t,f,17420,Self-contained house for a relaxed stay,Self contained house with your own entrance on...,,https://a0.muscache.com/im/pictures/ee664a76-7...,The Hills Shire,-33.72536,150.98975,House,Entire home/apt,3,7200.0,0,True,POINT (150.98975 -33.72536)
37555,43391404,345727481,2020-05-01,45,45,1125,1125,45.0,1125.0,True,30,60,90,365,2020-05-12,,,,,,,,,,True,1,0,1,0,,https://www.airbnb.com/users/show/345727481,Waleed,2020-05-08,AU,,,,,False,https://a0.muscache.com/im/pictures/user/9aee8...,https://a0.muscache.com/im/pictures/user/9aee8...,,0.0,0.0,"['email', 'phone']",t,f,10750,Private Room to Rent,This is a beautiful 3 bedroom house located in...,"5 minutes walk to coles, pharmacy, dominos and...",https://a0.muscache.com/im/pictures/d29d588d-d...,Blacktown,-33.73593,150.78146,House,Private room,1,17100.0,0,True,POINT (150.78146 -33.73593)
37556,43391666,300655692,2020-05-01,30,30,1125,1125,30.0,1125.0,True,24,54,84,84,2020-05-11,,,,,,,,,,False,53,10,7,36,,https://www.airbnb.com/users/show/300655692,Eraldo,2019-10-07,AU,,within a few hours,90%,64%,False,https://a0.muscache.com/im/pictures/user/14943...,https://a0.muscache.com/im/pictures/user/14943...,Wolli Creek,91.0,91.0,"['email', 'phone', 'jumio', 'offline_governmen...",t,f,17200,FULLY FURNISHED 2 BEDROOM APARTMENT-AVAILABLE NOW,PERFECT FOR COUPLES or 4 FRIENDS and can be us...,"EXCELLENT LOCATION: Murray Street, Pyrmont 9 m...",https://a0.muscache.com/im/pictures/9fd3da8e-e...,Sydney,-33.86996,151.19215,Apartment,Entire home/apt,4,7500.0,0,False,POINT (151.19215 -33.86996)
37557,43392171,223730845,2020-05-01,2,2,365,365,2.0,365.0,True,24,51,81,172,2020-05-11,,,,,,,,,,False,1,1,0,0,,https://www.airbnb.com/users/show/223730845,Elizabeth,2018-11-01,AU,,,,,False,https://a0.muscache.com/im/pictures/user/38886...,https://a0.muscache.com/im/pictures/user/38886...,,2.0,2.0,"['email', 'phone', 'offline_government_id', 's...",t,f,,Florence Boathouse - luxury waterfront retreat,At the water’s edge with a North East aspect e...,,https://a0.muscache.com/im/pictures/e1ec4323-1...,Pittwater,-33.63772,151.29550,House,Entire home/apt,6,49000.0,0,False,POINT (151.29550 -33.63772)


# Part 4

## a

In [267]:
path = raw_data_dir / '2016Census_G01_NSW_LGA.csv'
g01 = pd.read_csv(path)
g01.head()

Unnamed: 0,LGA_CODE_2016,Tot_P_M,Tot_P_F,Tot_P_P,Age_0_4_yr_M,Age_0_4_yr_F,Age_0_4_yr_P,Age_5_14_yr_M,Age_5_14_yr_F,Age_5_14_yr_P,Age_15_19_yr_M,Age_15_19_yr_F,Age_15_19_yr_P,Age_20_24_yr_M,Age_20_24_yr_F,Age_20_24_yr_P,Age_25_34_yr_M,Age_25_34_yr_F,Age_25_34_yr_P,Age_35_44_yr_M,Age_35_44_yr_F,Age_35_44_yr_P,Age_45_54_yr_M,Age_45_54_yr_F,Age_45_54_yr_P,Age_55_64_yr_M,Age_55_64_yr_F,Age_55_64_yr_P,Age_65_74_yr_M,Age_65_74_yr_F,Age_65_74_yr_P,Age_75_84_yr_M,Age_75_84_yr_F,Age_75_84_yr_P,Age_85ov_M,Age_85ov_F,Age_85ov_P,Counted_Census_Night_home_M,Counted_Census_Night_home_F,Counted_Census_Night_home_P,Count_Census_Nt_Ewhere_Aust_M,Count_Census_Nt_Ewhere_Aust_F,Count_Census_Nt_Ewhere_Aust_P,Indigenous_psns_Aboriginal_M,Indigenous_psns_Aboriginal_F,Indigenous_psns_Aboriginal_P,Indig_psns_Torres_Strait_Is_M,Indig_psns_Torres_Strait_Is_F,Indig_psns_Torres_Strait_Is_P,Indig_Bth_Abor_Torres_St_Is_M,...,Birthplace_Elsewhere_F,Birthplace_Elsewhere_P,Lang_spoken_home_Eng_only_M,Lang_spoken_home_Eng_only_F,Lang_spoken_home_Eng_only_P,Lang_spoken_home_Oth_Lang_M,Lang_spoken_home_Oth_Lang_F,Lang_spoken_home_Oth_Lang_P,Australian_citizen_M,Australian_citizen_F,Australian_citizen_P,Age_psns_att_educ_inst_0_4_M,Age_psns_att_educ_inst_0_4_F,Age_psns_att_educ_inst_0_4_P,Age_psns_att_educ_inst_5_14_M,Age_psns_att_educ_inst_5_14_F,Age_psns_att_educ_inst_5_14_P,Age_psns_att_edu_inst_15_19_M,Age_psns_att_edu_inst_15_19_F,Age_psns_att_edu_inst_15_19_P,Age_psns_att_edu_inst_20_24_M,Age_psns_att_edu_inst_20_24_F,Age_psns_att_edu_inst_20_24_P,Age_psns_att_edu_inst_25_ov_M,Age_psns_att_edu_inst_25_ov_F,Age_psns_att_edu_inst_25_ov_P,High_yr_schl_comp_Yr_12_eq_M,High_yr_schl_comp_Yr_12_eq_F,High_yr_schl_comp_Yr_12_eq_P,High_yr_schl_comp_Yr_11_eq_M,High_yr_schl_comp_Yr_11_eq_F,High_yr_schl_comp_Yr_11_eq_P,High_yr_schl_comp_Yr_10_eq_M,High_yr_schl_comp_Yr_10_eq_F,High_yr_schl_comp_Yr_10_eq_P,High_yr_schl_comp_Yr_9_eq_M,High_yr_schl_comp_Yr_9_eq_F,High_yr_schl_comp_Yr_9_eq_P,High_yr_schl_comp_Yr_8_belw_M,High_yr_schl_comp_Yr_8_belw_F,High_yr_schl_comp_Yr_8_belw_P,High_yr_schl_comp_D_n_g_sch_M,High_yr_schl_comp_D_n_g_sch_F,High_yr_schl_comp_D_n_g_sch_P,Count_psns_occ_priv_dwgs_M,Count_psns_occ_priv_dwgs_F,Count_psns_occ_priv_dwgs_P,Count_Persons_other_dwgs_M,Count_Persons_other_dwgs_F,Count_Persons_other_dwgs_P
0,LGA10050,24662,26411,51076,1689,1594,3286,3208,3117,6328,1611,1635,3248,1695,1810,3508,3194,3299,6498,2972,3228,6205,3169,3329,6497,3045,3327,6372,2328,2573,4907,1251,1659,2913,490,841,1329,23024,24811,47832,1639,1597,3244,661,700,1363,19,15,29,12,...,2866,5540,21282,22839,44120,1634,1808,3446,21701,23328,45032,328,335,669,2933,2868,5798,1095,1167,2258,412,640,1046,618,1140,1754,7677,9423,17096,2183,2231,4413,5236,5180,10414,1649,1639,3287,1040,1036,2076,134,154,287,22056,23627,45686,2555,2523,5081
1,LGA10130,14227,15220,29449,844,825,1669,1833,1833,3667,1254,1302,2560,1369,1422,2793,1700,1761,3464,1465,1635,3100,1772,1889,3657,1713,1840,3552,1344,1400,2747,705,893,1595,220,426,649,13166,14144,27311,1062,1079,2137,1031,1076,2113,19,17,37,16,...,1869,3631,12024,12911,24937,1142,1192,2333,12156,13073,25226,230,200,433,1691,1675,3368,922,1015,1936,698,827,1518,651,924,1576,5650,6621,12270,698,731,1431,2579,2585,5166,828,828,1651,614,553,1169,37,40,77,11921,12759,24682,2409,2596,5006
2,LGA10250,20127,21658,41790,1074,997,2072,2565,2294,4856,1245,1133,2384,780,788,1571,1732,1853,3581,2207,2456,4669,2701,3060,5759,3036,3398,6434,2699,2805,5503,1437,1773,3203,655,1104,1755,18988,20545,39536,1141,1113,2251,628,670,1297,31,19,53,11,...,2446,4609,17944,19268,37214,749,832,1578,17996,19331,37330,270,280,548,2338,2113,4444,900,864,1758,178,252,423,419,881,1302,6826,7684,14511,1205,1213,2414,4690,5276,9963,1324,1526,2851,824,853,1672,43,48,92,18343,19718,38063,1813,2006,3820
3,LGA10300,1177,1115,2287,80,71,143,172,162,336,74,56,135,81,56,132,99,136,233,147,130,276,156,132,290,194,186,377,119,86,202,59,76,131,12,22,35,1113,1059,2172,57,57,114,98,96,194,0,0,0,3,...,82,144,984,915,1902,70,80,154,1017,953,1974,13,16,24,153,143,288,49,36,80,3,8,10,10,29,37,215,299,514,95,108,208,251,190,439,111,75,186,103,79,183,7,6,15,1032,990,2015,277,199,480
4,LGA10470,20695,20605,41300,1339,1209,2546,2883,2691,5571,1515,1460,2980,1604,1544,3148,2765,2399,5171,2450,2430,4878,2572,2665,5238,2477,2517,4991,1917,1951,3862,882,1137,2019,293,600,892,19480,19541,39021,1217,1066,2283,1131,1011,2146,23,28,45,27,...,1733,3565,17597,18308,35906,853,839,1695,17695,18362,36060,319,306,626,2662,2483,5147,1072,1079,2151,503,673,1178,533,921,1457,6296,7729,14024,1155,1042,2202,4664,4391,9053,1253,1223,2475,771,689,1466,45,51,98,17733,18358,36098,2905,2219,5124


In [281]:
for col in g01.filter(regex='Age.*P').columns:
    print(f', "{col}" / total_population::float AS {col}_perc ')

, "Age_0_4_yr_P" / total_population::float AS Age_0_4_yr_P_perc 
, "Age_5_14_yr_P" / total_population::float AS Age_5_14_yr_P_perc 
, "Age_15_19_yr_P" / total_population::float AS Age_15_19_yr_P_perc 
, "Age_20_24_yr_P" / total_population::float AS Age_20_24_yr_P_perc 
, "Age_25_34_yr_P" / total_population::float AS Age_25_34_yr_P_perc 
, "Age_35_44_yr_P" / total_population::float AS Age_35_44_yr_P_perc 
, "Age_45_54_yr_P" / total_population::float AS Age_45_54_yr_P_perc 
, "Age_55_64_yr_P" / total_population::float AS Age_55_64_yr_P_perc 
, "Age_65_74_yr_P" / total_population::float AS Age_65_74_yr_P_perc 
, "Age_75_84_yr_P" / total_population::float AS Age_75_84_yr_P_perc 
, "Age_85ov_P" / total_population::float AS Age_85ov_P_perc 
, "Age_psns_att_educ_inst_0_4_P" / total_population::float AS Age_psns_att_educ_inst_0_4_P_perc 
, "Age_psns_att_educ_inst_5_14_P" / total_population::float AS Age_psns_att_educ_inst_5_14_P_perc 
, "Age_psns_att_edu_inst_15_19_P" / total_population::float
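
The generated select-list entries above divide by `total_population` directly, which raises a division-by-zero error in Postgres for any row where that value is 0. A possible variant, sketched here with the hypothetical helper `percentage_columns`, wraps the denominator in `NULLIF` so such rows yield NULL instead. It also anchors the regex so that only names starting with `Age` and ending in `_P` match, and quotes the alias so its mixed case is preserved. Whether quoted aliases suit the downstream queries is an assumption.

```python
import re


def percentage_columns(columns, total='total_population', pattern=r'^Age.*_P$'):
    """Yield one SQL select-list entry per column name matching `pattern`."""
    regex = re.compile(pattern)
    for col in columns:
        if regex.search(col):
            # NULLIF turns a zero denominator into NULL, so the division
            # returns NULL rather than raising an error
            yield f', "{col}" / NULLIF({total}, 0)::float AS "{col}_perc"'


# Small sample of G01 column names taken from the header shown earlier
sample = ['LGA_CODE_2016', 'Age_0_4_yr_M', 'Age_0_4_yr_P', 'Age_85ov_P', 'Tot_P_P']
for line in percentage_columns(sample):
    print(line)
```

In the notebook this could be called as `percentage_columns(g01.columns)`. Passing `pattern=r'^Age_\d.*_P$'` would restrict the output to the age bands and skip the `Age_psns_att_*` education columns, if that is the intent.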

In [270]:
?pd.DataFrame.filter

Signature:
pd.DataFrame.filter(
    self: 'FrameOrSeries',
    items=None,
    like: 'Optional[str]' = None,
    regex: 'Optional[str]' = None,
    axis=None,
) -> 'FrameOrSeries'
Docstring:
Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its
contents. The filter is applied to the labels of the index.

Parameters
----------
items : list-like
    Keep labels from axis which are in items.
like :