# Introduction

This notebook analysis the data structures in:
* [`raw/2021-04-10.gz`](#Airbnb-listing-data)
* [`raw/2016Census_G01_NSW_LGA.csv`](#Census-G01-data)
* [`raw/2016Census_G02_NSW_LGA.csv`](#Census-G02-data)
* [`raw/shapefile`](#Shapefile)

to design a [star schema](https://en.wikipedia.org/wiki/Star_schema).

The raw data is uploaded to Postgres with 'test_' prepended to the table names.

## Joins

The four data sets are to be joined to each.
1. The listings data is to be joined to the shapefile using point-in-polygon join, which is the most robust method of joining. Even though the listings data has `neighbourhood_cleansed` column, the value don't match perfectly to the list of LGAs. Using the latitude and longitude values is the better method.
2. Using the official LGA names from the shapefile as the key, the G01 and G02 data can be joined.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import sqlalchemy as sa
import os
import gzip
import shutil
import requests
import pandas as pd
import geopandas as gpd
from pathlib import Path
from psycopg2.extras import execute_values
from dotenv import (
    load_dotenv,
    find_dotenv
)
import psycopg2

from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

from src.data.database import (
    get_connection_string
)
from src.utils.utils import (
    stringify_columns,
    get_create_query
)

In [3]:
load_dotenv(find_dotenv())

project_dir = Path(find_dotenv()).parent
data_dir = project_dir / 'data'
raw_data_dir = data_dir / 'raw'
interim_data_dir = data_dir / 'interim'
reports_dir = project_dir / 'reports'
references_dir = project_dir / 'references'

In [4]:
pd.set_option('display.max_columns', 100)

# Load data

In [5]:
path = raw_data_dir / '2021-04-10.gz'
listing_df = pd.read_csv(path, compression='gzip')

In [6]:
path = raw_data_dir / '2016Census_G01_NSW_LGA.csv'
g01_df = pd.read_csv(path)

In [7]:
path = raw_data_dir / '2016Census_G02_NSW_LGA.csv'
g02_df = pd.read_csv(path)

In [59]:
path = raw_data_dir / 'shapefile/LGA_2016_AUST.shp'
shape_df = gpd.read_file(path)

## Airbnb listing data

In [9]:
listing_df.shape

(32679, 74)

In [10]:
listing_df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,11156,https://www.airbnb.com/rooms/11156,20210410042103,2021-04-12,An Oasis in the City,Very central to the city which can be reached ...,"It is very close to everything and everywhere,...",https://a0.muscache.com/pictures/2797669/17895...,40855,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",,,,f,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,"Potts Point, New South Wales, Australia",Sydney,,-33.86767,151.22497,Private room in apartment,Private room,1,,1 shared bath,1.0,0.0,"[""Dishwasher"", ""Backyard"", ""Kitchen"", ""Shower ...",$65.00,2,180,2,2,180,180,2.0,180.0,,t,29,59,89,364,2021-04-12,196,0,0,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,,f,1,0,1,0,1.42
1,12351,https://www.airbnb.com/rooms/12351,20210410042103,2021-04-15,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/pictures/763ad5c8-c951...,17061,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,,f,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,"Pyrmont, New South Wales, Australia",Sydney,,-33.8649,151.19171,Private room in townhouse,Private room,2,,1 shared bath,1.0,1.0,"[""Microwave"", ""Patio or balcony"", ""Wifi"", ""Dis...","$14,315.00",2,7,2,2,7,7,2.0,7.0,,t,0,0,0,0,2021-04-15,526,0,0,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,2,0,2,0,4.03
2,14250,https://www.airbnb.com/rooms/14250,20210410042103,2021-04-14,Manly Harbour House,"Beautifully renovated, spacious and quiet, our...",Balgowlah Heights is one of the most prestigio...,https://a0.muscache.com/pictures/56935671/fdb8...,55948,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,90%,79%,t,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Balgowlah, New South Wales, Australia",Manly,,-33.80084,151.26378,Entire house,Entire home/apt,6,,3 baths,3.0,3.0,"[""Stove"", ""Dedicated workspace"", ""Iron"", ""Pati...",$470.00,5,22,5,5,22,22,5.0,22.0,,t,0,0,0,122,2021-04-14,2,0,0,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,,f,2,2,0,0,0.03
3,15253,https://www.airbnb.com/rooms/15253,20210410042103,2021-04-12,Unique Designer Rooftop Apartment in City Loca...,Penthouse living at it best ... You will be st...,The location is really central and there is nu...,https://a0.muscache.com/pictures/46dcb8a1-5d5b...,59850,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,90%,95%,f,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Darlinghurst, New South Wales, Australia",Sydney,,-33.87964,151.2168,Private room in apartment,Private room,2,,1 private bath,1.0,1.0,"[""Dishwasher"", ""Kitchen"", ""Shower gel"", ""Cooki...",$80.00,2,90,2,2,90,90,2.0,90.0,,t,21,48,78,336,2021-04-12,367,3,0,2012-02-23,2021-03-07,88.0,10.0,9.0,10.0,10.0,10.0,9.0,,t,1,0,1,0,3.3
4,44545,https://www.airbnb.com/rooms/44545,20210410042103,2021-04-13,Sunny Darlinghurst Warehouse Apartment,Sunny warehouse/loft apartment in the heart of...,Darlinghurst is home to some of Sydney's best ...,https://a0.muscache.com/pictures/a88d8e14-4f63...,112237,https://www.airbnb.com/users/show/112237,Atari,2010-04-22,"Sydney, New South Wales, Australia",Curious about the world and full of wanderlust...,,,,t,https://a0.muscache.com/im/pictures/user/34708...,https://a0.muscache.com/im/pictures/user/34708...,Darlinghurst,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Darlinghurst, New South Wales, Australia",Sydney,,-33.87888,151.21439,Entire loft,Entire home/apt,2,,1 bath,1.0,1.0,"[""Dishwasher"", ""Kitchen"", ""Cooking basics"", ""C...",$130.00,3,365,3,3,365,365,3.0,365.0,,t,0,0,0,0,2021-04-13,76,0,0,2010-10-20,2020-01-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,1,1,0,0,0.6


In [37]:
listing_df.host_name.value_counts(dropna=False)

NaN              977
David            296
Team Gospodin    246
MadeComfy        200
James            194
                ... 
Bruno Tomé         1
Isabella-Rae       1
Honour             1
Braden & Liz       1
Antonietta         1
Name: host_name, Length: 7500, dtype: int64

## Census G01 data

In [11]:
g01_df.shape

(132, 109)

In [12]:
g01_df.head()

Unnamed: 0,LGA_CODE_2016,Tot_P_M,Tot_P_F,Tot_P_P,Age_0_4_yr_M,Age_0_4_yr_F,Age_0_4_yr_P,Age_5_14_yr_M,Age_5_14_yr_F,Age_5_14_yr_P,Age_15_19_yr_M,Age_15_19_yr_F,Age_15_19_yr_P,Age_20_24_yr_M,Age_20_24_yr_F,Age_20_24_yr_P,Age_25_34_yr_M,Age_25_34_yr_F,Age_25_34_yr_P,Age_35_44_yr_M,Age_35_44_yr_F,Age_35_44_yr_P,Age_45_54_yr_M,Age_45_54_yr_F,Age_45_54_yr_P,Age_55_64_yr_M,Age_55_64_yr_F,Age_55_64_yr_P,Age_65_74_yr_M,Age_65_74_yr_F,Age_65_74_yr_P,Age_75_84_yr_M,Age_75_84_yr_F,Age_75_84_yr_P,Age_85ov_M,Age_85ov_F,Age_85ov_P,Counted_Census_Night_home_M,Counted_Census_Night_home_F,Counted_Census_Night_home_P,Count_Census_Nt_Ewhere_Aust_M,Count_Census_Nt_Ewhere_Aust_F,Count_Census_Nt_Ewhere_Aust_P,Indigenous_psns_Aboriginal_M,Indigenous_psns_Aboriginal_F,Indigenous_psns_Aboriginal_P,Indig_psns_Torres_Strait_Is_M,Indig_psns_Torres_Strait_Is_F,Indig_psns_Torres_Strait_Is_P,Indig_Bth_Abor_Torres_St_Is_M,...,Birthplace_Elsewhere_F,Birthplace_Elsewhere_P,Lang_spoken_home_Eng_only_M,Lang_spoken_home_Eng_only_F,Lang_spoken_home_Eng_only_P,Lang_spoken_home_Oth_Lang_M,Lang_spoken_home_Oth_Lang_F,Lang_spoken_home_Oth_Lang_P,Australian_citizen_M,Australian_citizen_F,Australian_citizen_P,Age_psns_att_educ_inst_0_4_M,Age_psns_att_educ_inst_0_4_F,Age_psns_att_educ_inst_0_4_P,Age_psns_att_educ_inst_5_14_M,Age_psns_att_educ_inst_5_14_F,Age_psns_att_educ_inst_5_14_P,Age_psns_att_edu_inst_15_19_M,Age_psns_att_edu_inst_15_19_F,Age_psns_att_edu_inst_15_19_P,Age_psns_att_edu_inst_20_24_M,Age_psns_att_edu_inst_20_24_F,Age_psns_att_edu_inst_20_24_P,Age_psns_att_edu_inst_25_ov_M,Age_psns_att_edu_inst_25_ov_F,Age_psns_att_edu_inst_25_ov_P,High_yr_schl_comp_Yr_12_eq_M,High_yr_schl_comp_Yr_12_eq_F,High_yr_schl_comp_Yr_12_eq_P,High_yr_schl_comp_Yr_11_eq_M,High_yr_schl_comp_Yr_11_eq_F,High_yr_schl_comp_Yr_11_eq_P,High_yr_schl_comp_Yr_10_eq_M,High_yr_schl_comp_Yr_10_eq_F,High_yr_schl_comp_Yr_10_eq_P,High_yr_schl_comp_Yr_9_eq_M,High_yr_schl_comp_Yr_9_eq_F,High_yr_schl_comp_Yr_9_eq_P,High_yr_schl_comp_Yr_8_belw_M,High_yr_schl_comp_Yr_8_belw_F,High_yr_schl_comp_Yr_8_belw_P,High_yr_schl_comp_D_n_g_sch_M,High_yr_schl_comp_D_n_g_sch_F,High_yr_schl_comp_D_n_g_sch_P,Count_psns_occ_priv_dwgs_M,Count_psns_occ_priv_dwgs_F,Count_psns_occ_priv_dwgs_P,Count_Persons_other_dwgs_M,Count_Persons_other_dwgs_F,Count_Persons_other_dwgs_P
0,LGA10050,24662,26411,51076,1689,1594,3286,3208,3117,6328,1611,1635,3248,1695,1810,3508,3194,3299,6498,2972,3228,6205,3169,3329,6497,3045,3327,6372,2328,2573,4907,1251,1659,2913,490,841,1329,23024,24811,47832,1639,1597,3244,661,700,1363,19,15,29,12,...,2866,5540,21282,22839,44120,1634,1808,3446,21701,23328,45032,328,335,669,2933,2868,5798,1095,1167,2258,412,640,1046,618,1140,1754,7677,9423,17096,2183,2231,4413,5236,5180,10414,1649,1639,3287,1040,1036,2076,134,154,287,22056,23627,45686,2555,2523,5081
1,LGA10130,14227,15220,29449,844,825,1669,1833,1833,3667,1254,1302,2560,1369,1422,2793,1700,1761,3464,1465,1635,3100,1772,1889,3657,1713,1840,3552,1344,1400,2747,705,893,1595,220,426,649,13166,14144,27311,1062,1079,2137,1031,1076,2113,19,17,37,16,...,1869,3631,12024,12911,24937,1142,1192,2333,12156,13073,25226,230,200,433,1691,1675,3368,922,1015,1936,698,827,1518,651,924,1576,5650,6621,12270,698,731,1431,2579,2585,5166,828,828,1651,614,553,1169,37,40,77,11921,12759,24682,2409,2596,5006
2,LGA10250,20127,21658,41790,1074,997,2072,2565,2294,4856,1245,1133,2384,780,788,1571,1732,1853,3581,2207,2456,4669,2701,3060,5759,3036,3398,6434,2699,2805,5503,1437,1773,3203,655,1104,1755,18988,20545,39536,1141,1113,2251,628,670,1297,31,19,53,11,...,2446,4609,17944,19268,37214,749,832,1578,17996,19331,37330,270,280,548,2338,2113,4444,900,864,1758,178,252,423,419,881,1302,6826,7684,14511,1205,1213,2414,4690,5276,9963,1324,1526,2851,824,853,1672,43,48,92,18343,19718,38063,1813,2006,3820
3,LGA10300,1177,1115,2287,80,71,143,172,162,336,74,56,135,81,56,132,99,136,233,147,130,276,156,132,290,194,186,377,119,86,202,59,76,131,12,22,35,1113,1059,2172,57,57,114,98,96,194,0,0,0,3,...,82,144,984,915,1902,70,80,154,1017,953,1974,13,16,24,153,143,288,49,36,80,3,8,10,10,29,37,215,299,514,95,108,208,251,190,439,111,75,186,103,79,183,7,6,15,1032,990,2015,277,199,480
4,LGA10470,20695,20605,41300,1339,1209,2546,2883,2691,5571,1515,1460,2980,1604,1544,3148,2765,2399,5171,2450,2430,4878,2572,2665,5238,2477,2517,4991,1917,1951,3862,882,1137,2019,293,600,892,19480,19541,39021,1217,1066,2283,1131,1011,2146,23,28,45,27,...,1733,3565,17597,18308,35906,853,839,1695,17695,18362,36060,319,306,626,2662,2483,5147,1072,1079,2151,503,673,1178,533,921,1457,6296,7729,14024,1155,1042,2202,4664,4391,9053,1253,1223,2475,771,689,1466,45,51,98,17733,18358,36098,2905,2219,5124


## Census G02 data

In [13]:
g02_df.shape

(132, 9)

In [14]:
g02_df.head()

Unnamed: 0,LGA_CODE_2016,Median_age_persons,Median_mortgage_repay_monthly,Median_tot_prsnl_inc_weekly,Median_rent_weekly,Median_tot_fam_inc_weekly,Average_num_psns_per_bedroom,Median_tot_hhd_inc_weekly,Average_household_size
0,LGA10050,39,1421,642,231,1532,0.8,1185,2.3
1,LGA10130,36,1393,561,250,1465,0.8,1173,2.4
2,LGA10250,48,1733,601,340,1426,0.8,1156,2.3
3,LGA10300,41,950,624,150,1438,0.8,1174,2.5
4,LGA10470,37,1670,646,280,1632,0.8,1310,2.5


## Shapefile

In [15]:
shape_df.shape

(562, 6)

In [16]:
shape_df.head()

Unnamed: 0,LGA_CODE20,LGA_NAME20,STE_CODE16,STE_NAME16,AREASQKM20,geometry
0,10050,Albury (C),1,New South Wales,305.9459,"POLYGON ((146.82130 -36.04997, 146.82138 -36.0..."
1,10180,Armidale Regional (A),1,New South Wales,7809.4405,"POLYGON ((151.32425 -30.26923, 151.32419 -30.2..."
2,10250,Ballina (A),1,New South Wales,484.9389,"MULTIPOLYGON (((153.57094 -28.87390, 153.57097..."
3,10300,Balranald (A),1,New South Wales,21690.6753,"POLYGON ((143.00432 -33.78165, 143.01538 -33.7..."
4,10470,Bathurst Regional (A),1,New South Wales,3817.8646,"POLYGON ((149.90753 -33.39968, 149.90717 -33.4..."


In [17]:
shape_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 562 entries, 0 to 561
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   LGA_CODE20  562 non-null    object  
 1   LGA_NAME20  562 non-null    object  
 2   STE_CODE16  562 non-null    object  
 3   STE_NAME16  562 non-null    object  
 4   AREASQKM20  562 non-null    float64 
 5   geometry    544 non-null    geometry
dtypes: float64(1), geometry(1), object(4)
memory usage: 26.5+ KB


In [18]:
shape_df.dropna(inplace=True)

[[34m2021-05-17 10:35:28,288[0m] {[34mutils.py:[0m145} INFO[0m - Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.[0m
[[34m2021-05-17 10:35:28,289[0m] {[34mutils.py:[0m157} INFO[0m - NumExpr defaulting to 8 threads.[0m


In [19]:
#
shape_df.geometry

0      POLYGON ((146.82130 -36.04997, 146.82138 -36.0...
1      POLYGON ((151.32425 -30.26923, 151.32419 -30.2...
2      MULTIPOLYGON (((153.57094 -28.87390, 153.57097...
3      POLYGON ((143.00432 -33.78165, 143.01538 -33.7...
4      POLYGON ((149.90753 -33.39968, 149.90717 -33.4...
                             ...                        
551    MULTIPOLYGON (((132.99223 -11.08298, 132.99068...
552    MULTIPOLYGON (((129.69812 -14.80951, 129.69522...
553    MULTIPOLYGON (((131.03247 -12.05832, 131.03075...
556    POLYGON ((149.06241 -35.15916, 149.07352 -35.1...
559    MULTIPOLYGON (((167.96327 -29.07217, 167.96325...
Name: geometry, Length: 544, dtype: geometry

## Check the LGA code in shapefile and census data

All the LGA codes in the census files are accounted for in the shapefile.

In [61]:
shape_df.LGA_CODE16

0      10050
1      10130
2      10250
3      10300
4      10470
       ...  
558    89499
559    89799
560    99399
561    99499
562    99799
Name: LGA_CODE16, Length: 563, dtype: object

### Compare with G01

In [62]:
len(set(shape_df.LGA_CODE16) - set(g01_df.LGA_CODE_2016.str[3:]))

431

In [63]:
len(set(g01_df.LGA_CODE_2016.str[3:]) - set(shape_df.LGA_CODE16))

0

### Compare with G02

In [64]:
len(set(shape_df.LGA_CODE16) - set(g02_df.LGA_CODE_2016.str[3:]))

431

In [65]:
len(set(g02_df.LGA_CODE_2016.str[3:]) - set(shape_df.LGA_CODE16))

0

# Connect to Postgres

In [20]:
conn_string = get_connection_string()
print(conn_string)
engine = sa.create_engine('postgresql+psycopg2://airflow:airflow@postgres:5432/airflow')

postgresql+psycopg2://airflow:airflow@postgres:5432/airflow


## Create schemas

There are several schemas to be created:
1. raw: first landing for the raw data
1. star: the separated data using the star schema
1. data_mart: joins and statistics from the tables in the star schema

In [22]:
schema_list = ['raw', 'star', 'data_mart']
for schema in schema_list:
    query_create_schema = f'CREATE SCHEMA {schema}'
    print(query_create_schema)
    engine.connect().execute(query_create_schema)

CREATE SCHEMA raw
CREATE SCHEMA star
CREATE SCHEMA data_mart


In [32]:
df_dict = {
    'listing_df': listing_df,
    'g01_df': g01_df,
    'g02_df': g02_df,
    'shape_df': shape_df,
}
schema = 'raw'

for key, df in df_dict.items():
    table_name = f'test_{key}'
    print(f'Uploading {table_name}')
    if isinstance(df, gpd.geodataframe.GeoDataFrame):
        df.to_postgis(con=engine,
                      name='test_shape_df',
                      if_exists='replace',
                      schema=schema,
                      index=False)
    else:
        df.to_sql(con=engine,
                  name=table_name,
                  if_exists='replace',
                  schema=schema,
                  index=False)

Uploading test_listing_df
Uploading test_g01_df
Uploading test_g02_df
Uploading test_shape_df


In [24]:
# Check database
query = """
SELECT *
FROM information_schema.tables
WHERE table_type='BASE TABLE'
AND table_schema='raw';
"""

pd.read_sql(con=engine,
            sql=query)

Unnamed: 0,table_catalog,table_schema,table_name,table_type,self_referencing_column_name,reference_generation,user_defined_type_catalog,user_defined_type_schema,user_defined_type_name,is_insertable_into,is_typed,commit_action
0,airflow,raw,test_listing_df,BASE TABLE,,,,,,YES,NO,
1,airflow,raw,test_g01_df,BASE TABLE,,,,,,YES,NO,
2,airflow,raw,test_g02_df,BASE TABLE,,,,,,YES,NO,
3,airflow,raw,test_shape_df,BASE TABLE,,,,,,YES,NO,


# Dimensions

* Property
* Host: some hosts have multiple properties

### Column actions

An extension to the official [data dictionary](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896) shows the action to be taken for each column:
* keep
* move to another table
* drop

The file is `references\star_schema_column_actions.xlsx`. 

In [21]:
path = references_dir / 'star_schema_column_actions.xlsx'
data_dict_df = pd.read_excel(path)
data_dict_df

Unnamed: 0,Field,Type,Calculated,Description,action,table,comment
0,id,integer,,Airbnb's unique identifier for the listing,move,property,key
1,listing_url,text,y,,move,property,
2,scrape_id,bigint,y,"Inside Airbnb ""Scrape"" this was part of",keep,,
3,last_scraped,datetime,y,"UTC. The date and time this listing was ""scrap...",keep,,
4,name,text,,Name of the listing,move,property,
...,...,...,...,...,...,...,...
69,calculated_host_listings_count,integer,y,The number of listings the host has in the cur...,keep,,
70,calculated_host_listings_count_entire_homes,integer,y,The number of Entire home/apt listings the hos...,keep,,
71,calculated_host_listings_count_private_rooms,integer,y,The number of Private room listings the host h...,keep,,
72,calculated_host_listings_count_shared_rooms,integer,y,The number of Shared room listings the host ha...,keep,,


In [22]:
listing_df.groupby('host_id').id.count()

host_id
10857        1
17061        2
17331        2
18459        1
19082        2
            ..
395175349    1
395304707    1
395864637    1
396018514    1
396020039    1
Name: id, Length: 24406, dtype: int64

In [23]:
listing_df.query('host_id == 17061')

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
1,12351,https://www.airbnb.com/rooms/12351,20210410042103,2021-04-15,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/pictures/763ad5c8-c951...,17061,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,,f,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,"Pyrmont, New South Wales, Australia",Sydney,,-33.8649,151.19171,Private room in townhouse,Private room,2,,1 shared bath,1.0,1.0,"[""Microwave"", ""Patio or balcony"", ""Wifi"", ""Dis...","$14,315.00",2,7,2,2,7,7,2.0,7.0,,t,0,0,0,0,2021-04-15,526,0,0,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,2,0,2,0,4.03
16,73639,https://www.airbnb.com/rooms/73639,20210410042103,2021-04-15,Sydney City Home with Harbour Views,Come stay with Vinh & Stuart (Awarded one of A...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/pictures/547497/8aa33a...,17061,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,,f,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,"Pyrmont, New South Wales, Australia",Sydney,,-33.86459,151.19177,Private room in townhouse,Private room,2,,1 shared bath,1.0,1.0,"[""Microwave"", ""Patio or balcony"", ""Wifi"", ""Dis...","$14,315.00",1,10,1,1,10,10,1.0,10.0,,t,0,0,0,0,2021-04-15,386,0,0,2011-03-17,2019-09-22,96.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,2,0,2,0,3.14


## Property dimension

In [24]:
data_dict_df.query('table == "property"')

Unnamed: 0,Field,Type,Calculated,Description,action,table,comment
0,id,integer,,Airbnb's unique identifier for the listing,move,property,key
1,listing_url,text,y,,move,property,
4,name,text,,Name of the listing,move,property,
5,description,text,,Detailed description of the listing,move,property,
6,neighborhood_overview,text,,Host's description of the neighbourhood,move,property,
7,picture_url,text,,URL to the Airbnb hosted regular sized image f...,move,property,
27,neighbourhood_cleansed,text,y,The neighbourhood as geocoded using the latitu...,move,property,In case the point-in-polygon method doesn't work.
29,latitude,numeric,,Uses the World Geodetic System (WGS84) project...,move,property,
30,longitude,numeric,,Uses the World Geodetic System (WGS84) project...,move,property,
31,property_type,text,,Self selected property type. Hotels and Bed an...,move,property,


### Find a unique key(s) for Property

`id` is a unique. So this can be used to as the key:
* the **primary** key of the `property` table
* the **foreign** key of the `listings` table

In [25]:
listing_df['id'].is_unique

True

## Host dimension

`host_id` is the unique identifier for a host. In the `listing_df`, `host_id` is not unique because some hosts have multiple properties.

In [26]:
data_dict_df.query('table == "host"')

Unnamed: 0,Field,Type,Calculated,Description,action,table,comment
8,host_id,integer,,Airbnb's unique identifier for the host/user,move,host,key
9,host_url,text,y,The Airbnb page for the host,move,host,
10,host_name,text,,Name of the host. Usually just the first name(s).,move,host,
11,host_since,date,,The date the host/user was created. For hosts ...,move,host,
12,host_location,text,,The host's self reported location,move,host,
13,host_about,text,,Description about the host,move,host,
14,host_response_time,,,,move,host,
15,host_response_rate,,,,move,host,
16,host_acceptance_rate,,,That rate at which a host accepts booking requ...,move,host,
17,host_is_superhost,boolean [t=true; f=false],,,move,host,


In [27]:
listing_df['host_id'].is_unique

False

# Separate property from the listing

In [28]:
schema = 'raw'
table_name = 'test_listing_df'

query = f"""
SELECT *
FROM {schema}.{table_name}
"""

db_df = pd.read_sql(con=engine,
                    sql=query)

db_df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,11156,https://www.airbnb.com/rooms/11156,20210410042103,2021-04-12,An Oasis in the City,Very central to the city which can be reached ...,"It is very close to everything and everywhere,...",https://a0.muscache.com/pictures/2797669/17895...,40855,https://www.airbnb.com/users/show/40855,Colleen,2009-09-23,"Potts Point, New South Wales, Australia","Recently retired, I've lived & worked on 4 con...",,,,f,https://a0.muscache.com/im/users/40855/profile...,https://a0.muscache.com/im/users/40855/profile...,Potts Point,1.0,1.0,"['email', 'phone', 'reviews']",t,f,"Potts Point, New South Wales, Australia",Sydney,,-33.86767,151.22497,Private room in apartment,Private room,1,,1 shared bath,1.0,0.0,"[""Dishwasher"", ""Backyard"", ""Kitchen"", ""Shower ...",$65.00,2,180,2,2,180,180,2.0,180.0,,t,29,59,89,364,2021-04-12,196,0,0,2009-12-05,2020-03-13,92.0,10.0,9.0,10.0,10.0,10.0,10.0,,f,1,0,1,0,1.42
1,12351,https://www.airbnb.com/rooms/12351,20210410042103,2021-04-15,Sydney City & Harbour at the door,Come stay with Vinh & Stuart (Awarded as one o...,"Pyrmont is an inner-city village of Sydney, on...",https://a0.muscache.com/pictures/763ad5c8-c951...,17061,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,,f,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,Pyrmont,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t,"Pyrmont, New South Wales, Australia",Sydney,,-33.8649,151.19171,Private room in townhouse,Private room,2,,1 shared bath,1.0,1.0,"[""Microwave"", ""Patio or balcony"", ""Wifi"", ""Dis...","$14,315.00",2,7,2,2,7,7,2.0,7.0,,t,0,0,0,0,2021-04-15,526,0,0,2010-07-24,2019-09-22,95.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,2,0,2,0,4.03
2,14250,https://www.airbnb.com/rooms/14250,20210410042103,2021-04-14,Manly Harbour House,"Beautifully renovated, spacious and quiet, our...",Balgowlah Heights is one of the most prestigio...,https://a0.muscache.com/pictures/56935671/fdb8...,55948,https://www.airbnb.com/users/show/55948,Heidi,2009-11-20,"Sydney, New South Wales, Australia",I am a Canadian who has made Australia her hom...,within a few hours,90%,79%,t,https://a0.muscache.com/im/users/55948/profile...,https://a0.muscache.com/im/users/55948/profile...,Balgowlah,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Balgowlah, New South Wales, Australia",Manly,,-33.80084,151.26378,Entire house,Entire home/apt,6,,3 baths,3.0,3.0,"[""Stove"", ""Dedicated workspace"", ""Iron"", ""Pati...",$470.00,5,22,5,5,22,22,5.0,22.0,,t,0,0,0,122,2021-04-14,2,0,0,2016-01-02,2019-01-02,90.0,8.0,8.0,9.0,8.0,9.0,8.0,,f,2,2,0,0,0.03
3,15253,https://www.airbnb.com/rooms/15253,20210410042103,2021-04-12,Unique Designer Rooftop Apartment in City Loca...,Penthouse living at it best ... You will be st...,The location is really central and there is nu...,https://a0.muscache.com/pictures/46dcb8a1-5d5b...,59850,https://www.airbnb.com/users/show/59850,Morag,2009-12-03,"Sydney, New South Wales, Australia",I am originally Scottish but I have made Sydne...,within an hour,90%,95%,f,https://a0.muscache.com/im/pictures/user/730ee...,https://a0.muscache.com/im/pictures/user/730ee...,Darlinghurst,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Darlinghurst, New South Wales, Australia",Sydney,,-33.87964,151.2168,Private room in apartment,Private room,2,,1 private bath,1.0,1.0,"[""Dishwasher"", ""Kitchen"", ""Shower gel"", ""Cooki...",$80.00,2,90,2,2,90,90,2.0,90.0,,t,21,48,78,336,2021-04-12,367,3,0,2012-02-23,2021-03-07,88.0,10.0,9.0,10.0,10.0,10.0,9.0,,t,1,0,1,0,3.3
4,9995212,https://www.airbnb.com/rooms/9995212,20210410042103,2021-04-13,Close to everything Sydney!,Beautiful 1 bed apt in Pyrmont.<br /><br />Lar...,,https://a0.muscache.com/pictures/599acdd7-87b5...,10697503,https://www.airbnb.com/users/show/10697503,,,,,,,,,,,,,,,,,,Sydney,,-33.86639,151.19215,Entire apartment,Entire home/apt,2,,1 bath,1.0,1.0,"[""Pool"", ""Kitchen"", ""Iron"", ""Cable TV"", ""TV wi...",$280.00,5,10,5,5,10,10,5.0,10.0,,t,0,0,0,0,2021-04-13,0,0,0,,,,,,,,,,,t,1,1,0,0,


In [29]:
# Get the columns that would be moved the dimension tables
schema = 'star'
table_name = 'property'
column_df = data_dict_df.query('action == "move" and table == @table_name')
column_df

Unnamed: 0,Field,Type,Calculated,Description,action,table,comment
0,id,integer,,Airbnb's unique identifier for the listing,move,property,key
1,listing_url,text,y,,move,property,
4,name,text,,Name of the listing,move,property,
5,description,text,,Detailed description of the listing,move,property,
6,neighborhood_overview,text,,Host's description of the neighbourhood,move,property,
7,picture_url,text,,URL to the Airbnb hosted regular sized image f...,move,property,
27,neighbourhood_cleansed,text,y,The neighbourhood as geocoded using the latitu...,move,property,In case the point-in-polygon method doesn't work.
29,latitude,numeric,,Uses the World Geodetic System (WGS84) project...,move,property,
30,longitude,numeric,,Uses the World Geodetic System (WGS84) project...,move,property,
31,property_type,text,,Self selected property type. Hotels and Bed an...,move,property,


In [30]:
dim_columns = column_df['Field']
key_column = column_df.query('comment == "key"').iloc[0, 0]

In [31]:
# Create a query to subset the listing_df
dim_columns_str = stringify_columns(fields=dim_columns)
print(dim_columns_str)

    id, 
    listing_url, 
    name, 
    description, 
    neighborhood_overview, 
    picture_url, 
    neighbourhood_cleansed, 
    latitude, 
    longitude, 
    property_type, 
    room_type, 
    accommodates, 
    bathrooms_text, 
    bedrooms, 
    beds, 
    amenities, 
    price, 
    number_of_reviews, 
    license, 
    instant_bookable


In [32]:
source_schema = 'raw'
source_table_name = 'test_listing_df'
target_schema = 'star'
target_table_name = 'test_dim_property'

query = f"""
SELECT
{dim_columns_str}
INTO {target_schema}.{target_table_name}
FROM {source_schema}.{source_table_name}
"""

print(query)


SELECT
    id, 
    listing_url, 
    name, 
    description, 
    neighborhood_overview, 
    picture_url, 
    neighbourhood_cleansed, 
    latitude, 
    longitude, 
    property_type, 
    room_type, 
    accommodates, 
    bathrooms_text, 
    bedrooms, 
    beds, 
    amenities, 
    price, 
    number_of_reviews, 
    license, 
    instant_bookable
INTO star.test_dim_property
FROM raw.test_listing_df



In [43]:
# ⚠ Doesn't work
engine.connect().execute(query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fca2803f9d0>

The above code doesn't work, however, it works using the `PostgresOperator`, see `dags/dag_test_apply_constraint_key_success.py`.

# Separate host from the listing

`host_id` is not unique, because a host could have multiple properties.

Check that for each `host_id`, all the column names that have `host_` in it only has one unique value.

In [195]:
(
    listing_df
    .filter(regex='^host_')
    .groupby('host_id')
    .apply(pd.DataFrame.nunique)
    .max()
    .max()
)

1

Select some data from `raw.listing_df` and insert into `star.dim_host`.

In [177]:
# Get the columns that would be moved the dimension tables
schema = 'star'
table_name = 'host'
column_df = data_dict_df.query('action == "move" and table == @table_name')
column_df

Unnamed: 0,Field,Type,Calculated,Description,action,table,comment
8,host_id,integer,,Airbnb's unique identifier for the host/user,move,host,key
9,host_url,text,y,The Airbnb page for the host,move,host,
10,host_name,text,,Name of the host. Usually just the first name(s).,move,host,
11,host_since,date,,The date the host/user was created. For hosts ...,move,host,
12,host_location,text,,The host's self reported location,move,host,
13,host_about,text,,Description about the host,move,host,
14,host_response_time,,,,move,host,
15,host_response_rate,,,,move,host,
16,host_acceptance_rate,,,That rate at which a host accepts booking requ...,move,host,
17,host_is_superhost,boolean [t=true; f=false],,,move,host,


In [198]:
host_cols = stringify_columns(fields=column_df['Field'],
                              dtypes=column_df['Type'],
                              keys=column_df['comment'] == 'key',
                              with_dtype=False)
print(host_cols)

    host_id, 
    host_url, 
    host_name, 
    host_since, 
    host_location, 
    host_about, 
    host_response_time, 
    host_response_rate, 
    host_acceptance_rate, 
    host_is_superhost, 
    host_thumbnail_url, 
    host_picture_url, 
    host_listings_count, 
    host_total_listings_count, 
    host_verifications, 
    host_has_profile_pic, 
    host_identity_verified


In [219]:
source_schema = 'raw'
source_table = 'test_listing_df'
query = f"""
SELECT
  host_id
, MAX(host_url) AS host_url
, MAX(host_name) AS host_name
, MAX(host_since) AS host_since
, MAX(host_location) AS host_location
, MAX(host_about) AS host_about
, MAX(host_response_time) AS host_response_time
, MAX(host_response_rate) AS host_response_rate
, MAX(host_acceptance_rate) AS host_acceptance_rate
, MAX(host_is_superhost) AS host_is_superhost
, MAX(host_thumbnail_url) AS host_thumbnail_url
, MAX(host_picture_url) AS host_picture_url
, MAX(host_listings_count) AS host_listings_count
, MAX(host_total_listings_count) AS host_total_listings_count
, MAX(host_verifications) AS host_verifications
, MAX(host_has_profile_pic) AS host_has_profile_pic
, MAX(host_identity_verified) AS host_identity_verified
FROM {source_schema}.{source_table}
GROUP BY 1
"""
print(query)


SELECT
  host_id
, MAX(host_url) AS host_url
, MAX(host_name) AS host_name
, MAX(host_since) AS host_since
, MAX(host_location) AS host_location
, MAX(host_about) AS host_about
, MAX(host_response_time) AS host_response_time
, MAX(host_response_rate) AS host_response_rate
, MAX(host_acceptance_rate) AS host_acceptance_rate
, MAX(host_is_superhost) AS host_is_superhost
, MAX(host_thumbnail_url) AS host_thumbnail_url
, MAX(host_picture_url) AS host_picture_url
, MAX(host_listings_count) AS host_listings_count
, MAX(host_total_listings_count) AS host_total_listings_count
, MAX(host_verifications) AS host_verifications
, MAX(host_has_profile_pic) AS host_has_profile_pic
, MAX(host_identity_verified) AS host_identity_verified
FROM raw.test_listing_df
GROUP BY 1



In [220]:
dim_host_df = pd.read_sql(con=engine,
                          sql=query)

In [221]:
destination_schema = 'star'
destination_table = 'dim_host'

dim_host_df.to_sql

Unnamed: 0,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified
0,10857,https://www.airbnb.com/users/show/10857,Percy,2009-03-20,"Lima Region, Peru",From Australia,within an hour,100%,100%,f,https://a0.muscache.com/im/users/10857/profile...,https://a0.muscache.com/im/users/10857/profile...,0.0,0.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t
1,17061,https://www.airbnb.com/users/show/17061,Stuart,2009-05-14,"Sydney, New South Wales, Australia","G'Day from Australia!\r\n\r\nHe's Vinh, and I'...",,,,f,https://a0.muscache.com/im/users/17061/profile...,https://a0.muscache.com/im/users/17061/profile...,2.0,2.0,"['email', 'phone', 'manual_online', 'reviews',...",t,t
2,17331,https://www.airbnb.com/users/show/17331,Marisa,2009-05-16,"North Bondi, New South Wales, Australia",,,,0%,f,https://a0.muscache.com/im/users/17331/profile...,https://a0.muscache.com/im/users/17331/profile...,2.0,2.0,"['email', 'phone']",t,f
3,18459,https://www.airbnb.com/users/show/18459,Barry,2009-05-23,"Darlinghurst, New South Wales, Australia",30's law student and fashion designer. \n\nI ...,within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/ea2df...,https://a0.muscache.com/im/pictures/user/ea2df...,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t
4,19082,https://www.airbnb.com/users/show/19082,Alana,2009-05-28,"Potts Point, New South Wales, Australia",Creative female working in the digital space. ...,within a day,100%,0%,f,https://a0.muscache.com/im/pictures/user/e53be...,https://a0.muscache.com/im/pictures/user/e53be...,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24401,395175349,https://www.airbnb.com/users/show/395175349,Aaron,2021-04-01,"Sydney, New South Wales, Australia",,within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/4a722...,https://a0.muscache.com/im/pictures/user/4a722...,0.0,0.0,"['email', 'phone']",t,t
24402,395304707,https://www.airbnb.com/users/show/395304707,Antonio,2021-04-02,AU,,,,,f,https://a0.muscache.com/im/pictures/user/cf06f...,https://a0.muscache.com/im/pictures/user/cf06f...,0.0,0.0,['phone'],t,t
24403,395864637,https://www.airbnb.com/users/show/395864637,Apple,2021-04-07,AU,,,,,f,https://a0.muscache.com/im/pictures/user/3a415...,https://a0.muscache.com/im/pictures/user/3a415...,1.0,1.0,"['email', 'phone']",t,t
24404,396018514,https://www.airbnb.com/users/show/396018514,Richard,2021-04-08,AU,,,,50%,f,https://a0.muscache.com/im/pictures/user/02ef7...,https://a0.muscache.com/im/pictures/user/02ef7...,1.0,1.0,"['email', 'phone', 'work_email']",t,t


In [212]:
engine.connect().execute(query)

ProgrammingError: (psycopg2.errors.DuplicateColumn) column "max" specified more than once

[SQL: 
SELECT
  host_id
, MAX(host_url)
, MAX(host_name)
, MAX(host_since)
, MAX(host_location)
, MAX(host_about)
, MAX(host_response_time)
, MAX(host_response_rate)
, MAX(host_acceptance_rate)
, MAX(host_is_superhost)
, MAX(host_thumbnail_url)
, MAX(host_picture_url)
, MAX(host_listings_count)
, MAX(host_total_listings_count)
, MAX(host_verifications)
, MAX(host_has_profile_pic)
, MAX(host_identity_verified)
INTO star.dim_host
FROM raw.test_listing_df
GROUP BY 1
]
(Background on this error at: http://sqlalche.me/e/14/f405)

In [178]:
dim_columns = column_df['Field']
key_column = column_df.query('comment == "key"').iloc[0, 0]
dim_df = listing_df.loc[:, dim_columns]

In [180]:
dim_table_name = f'dim_{table_name}'
dim_df.to_sql(con=engine,
              name=dim_table_name,
              schema=schema,
              index=False)

In [213]:
# The pandas method doesn’t have a parameter to specify the keys. So specifying keys is done after uploading the table.
query = f"""
ALTER TABLE {schema}.{dim_table_name} ADD PRIMARY KEY ({key_column});
"""
print(query)
engine.connect().execute(query)


ALTER TABLE star.dim_host ADD PRIMARY KEY (host_id);



ProgrammingError: (psycopg2.errors.UndefinedTable) relation "star.dim_host" does not exist

[SQL: 
ALTER TABLE star.dim_host ADD PRIMARY KEY (host_id);
]
(Background on this error at: http://sqlalche.me/e/14/f405)

# Point in polygon join

In [47]:
query = f"""
select 
  tld.neighbourhood_cleansed
, tld.latitude 
, tld.longitude 
, tsd."LGA_NAME20" 
from raw.test_listing_df tld
left join raw.test_shape_df tsd 
on ST_CONTAINS(st_setsrid(tsd.geometry, 4326), ST_SetSRID(st_point(tld.longitude, tld.latitude), 4326))
"""

polygon_join = pd.read_sql(con=engine,
                           sql=query)

In [49]:
polygon_join

Unnamed: 0,neighbourhood_cleansed,latitude,longitude,LGA_NAME20
0,Sydney,-33.86767,151.22497,Sydney (C)
1,Sydney,-33.86490,151.19171,Sydney (C)
2,Manly,-33.80084,151.26378,Northern Beaches (A)
3,Sydney,-33.87964,151.21680,Sydney (C)
4,Sydney,-33.86639,151.19215,Sydney (C)
...,...,...,...,...
32674,Canada Bay,-33.86390,151.13073,Canada Bay (A)
32675,Canada Bay,-33.82780,151.08599,Canada Bay (A)
32676,Auburn,-33.84596,151.07130,Parramatta (C)
32677,Marrickville,-33.91559,151.15586,Inner West (A)


In [48]:
query = f"""
select 
  tld.neighbourhood_cleansed
, tld.latitude 
, tld.longitude 
, tsd."LGA_NAME20" 
into star.test_listing_join
from raw.test_listing_df tld
left join raw.test_shape_df tsd 
on ST_CONTAINS(st_setsrid(tsd.geometry, 4326), ST_SetSRID(st_point(tld.longitude, tld.latitude), 4326))
limit 10
"""

engine.connect().execute(query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f73ea48b250>

In [46]:
polygon_join.LGA_NAME20.notnull().value_counts()

True    32539
Name: LGA_NAME20, dtype: int64

# Appendix: Uploading to Postgres through dictionaries

This lead to an error. The [`pandas.DataFrame.to_sql`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) method was preferred instead.

In [131]:
dim_table_name = 'property'
column_df = data_dict_df.query('action == "move" and table == @dim_table_name')
create_table_query = get_create_query(
    table_name=dim_table_name,
    fields=column_df['Field'], 
    dtypes=column_df['Type'],
    keys=column_df['comment'] == 'key'
)
print(create_table_query)

CREATE TABLE property(
    id bigint PRIMARY KEY, 
    listing_url text, 
    name text, 
    description text, 
    neighborhood_overview text, 
    picture_url text, 
    neighbourhood_cleansed text, 
    latitude float, 
    longitude float, 
    property_type text, 
    room_type text, 
    accommodates bigint, 
    bathrooms_text text, 
    bedrooms bigint, 
    beds bigint, 
    amenities text, 
    price money, 
    number_of_reviews bigint, 
    license text, 
    instant_bookable boolean
)


In [132]:
engine.connect().execute(create_table_query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f6ae97b6e20>

In [108]:
insert_df = listing_df[column_df['Field']]
values = insert_df.to_dict('split')['data']

In [95]:
print(stringify_columns( 
    fields=data_dict_property_df['Field'], 
    dtypes=data_dict_property_df['Type'],
    keys=data_dict_property_df['comment'] == 'key',
    with_dtype=False
))

    id, 
    listing_url, 
    name, 
    description, 
    neighborhood_overview, 
    picture_url, 
    neighbourhood_cleansed, 
    latitude, 
    longitude, 
    property_type, 
    room_type, 
    accommodates, 
    bathrooms_text, 
    bedrooms, 
    beds, 
    amenities, 
    price, 
    number_of_reviews, 
    license, 
    instant_bookable


In [134]:
stringified_columns = stringify_columns(
    fields=column_df['Field'], 
    dtypes=column_df['Type'],
    keys=column_df['comment'] == 'key',
    with_dtype=False
)
insert_query = f"""
INSERT INTO public.{table_name}(\n{stringified_columns}\n)\nVALUES %s
"""

print(insert_query)


INSERT INTO public.property(
    id, 
    listing_url, 
    name, 
    description, 
    neighborhood_overview, 
    picture_url, 
    neighbourhood_cleansed, 
    latitude, 
    longitude, 
    property_type, 
    room_type, 
    accommodates, 
    bathrooms_text, 
    bedrooms, 
    beds, 
    amenities, 
    price, 
    number_of_reviews, 
    license, 
    instant_bookable
)
VALUES %s



In [116]:
cursor = engine.raw_connection().cursor()
cursor

<cursor object at 0x7f6ac16bb4f0; closed: 0>

In [70]:
listing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32679 entries, 0 to 32678
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            32679 non-null  int64  
 1   listing_url                                   32679 non-null  object 
 2   scrape_id                                     32679 non-null  int64  
 3   last_scraped                                  32679 non-null  object 
 4   name                                          32667 non-null  object 
 5   description                                   31442 non-null  object 
 6   neighborhood_overview                         19345 non-null  object 
 7   picture_url                                   32679 non-null  object 
 8   host_id                                       32679 non-null  int64  
 9   host_url                                      32679 non-null 

In [71]:
listing_df.shape

(32679, 74)

In [74]:
Path.cwd()

PosixPath('/home/jovyan/work/notebooks')

In [76]:
Path('.test_folder').exists()

False

In [5]:
d = {'a': 1, 'b': 2}

In [7]:
list(d.keys())

['a', 'b']

[1, 2]