# 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position)
This notebook explores the data join process and determines the join threshold.

In [11]:
import pandas as pd
import geopandas as gpd
import helper.functions as hf
from datetime import datetime
import sqldf

In [2]:
# make logger
start = datetime.now()
logger = hf.make_logger('2-initial_data_transformation_explore')

In [3]:
# read service data
logger.info('Loading files')
load_time = datetime.now()
sr = hf.load_service_data('data/raw/sr.csv.gz')
# read geojson
geo = gpd.read_file('data/raw/city-hex-polygons-8.geojson')
load_time = datetime.now() - load_time
logger.info(f'Files loaded: {load_time}')

2022-08-20 15:43:04,623 2-initial_data_transformation_explore INFO     Loading files
2022-08-20 15:44:10,756 2-initial_data_transformation_explore INFO     Files loaded: 0:01:06.123228


In [4]:
# join using geospatial join
logger.info("Joining files on geometry")
join_time = datetime.now()
sr.crs = geo.crs
sr['h3_level8_index'] = sr.sjoin(geo, how='left')['index'] # only want the index col
join_time = datetime.now() - join_time
logger.info(f"Finished join: {join_time}")
sr = sr.drop(['geometry'], axis=1) # don't need to keep geometry data so dropping

2022-08-20 15:44:10,803 2-initial_data_transformation_explore INFO     Joining files on geometry
2022-08-20 15:45:41,519 2-initial_data_transformation_explore INFO     Finished join: 0:01:30.710751


## What does the join data tell us about nulls?
About 23% of the records can't be joined to the geo data.
A number of records are also missing a reference_number, which is odd.

In [5]:
fails = hf.count_na(sr,'h3_level8_index')
df_set = sr.shape[0]
logger.info(f"Records failed join: {fails}")
logger.info(f"Records total: {df_set}")
logger.info(f"Join error: {round(fails/df_set, 2)*100}%")

2022-08-20 15:45:46,439 2-initial_data_transformation_explore INFO     Records failed join: 212367
2022-08-20 15:45:46,440 2-initial_data_transformation_explore INFO     Records total: 941634
2022-08-20 15:45:46,441 2-initial_data_transformation_explore INFO     Join error: 23.0%
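
The logged figure is rounded to two decimals before being scaled to a percentage, which is why it reads 23.0%. As a quick cross-check, here is a minimal sketch that scales first and rounds afterwards, using the counts copied from the log above (the variable names are just for illustration):

```python
# Counts copied from the log output above
fails = 212367
total = 941634

# Scale to a percentage first, then round, so two decimals survive
join_error_pct = round(fails / total * 100, 2)
print(f"Join error: {join_error_pct}%")
```

The unrounded rate comes out slightly below the 23% headline figure, which makes no practical difference here but could matter if a threshold ever sits close to the observed rate.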


In [6]:
# set unmatched hexes to '0'
# Records that didn't join have a null hex; it's easier to fill the nulls than to deal with NA lat/lons
sr['h3_level8_index'] = sr['h3_level8_index'].fillna('0')

## Validation
Let's compare our records to sr_hex.csv.

In [7]:
col_types = {
    'notification_number':str,
    'reference_number':str
    }
date_cols = ['creation_timestamp','completion_timestamp']
val = pd.read_csv('data/raw/sr_hex.csv', parse_dates=date_cols,dtype=col_types)
logger.info(val.dtypes)

2022-08-20 15:46:19,553 2-initial_data_transformation_explore INFO     notification_number                                    object
reference_number                                       object
creation_timestamp      datetime64[ns, pytz.FixedOffset(120)]
completion_timestamp    datetime64[ns, pytz.FixedOffset(120)]
directorate                                            object
department                                             object
branch                                                 object
section                                                object
code_group                                             object
code                                                   object
cause_code_group                                       object
cause_code                                             object
official_suburb                                        object
latitude                                              float64
longitude                                             float64

In [23]:
# Simple checks
if sr.shape[0] == val.shape[0]:
    logger.info(f"Dataset size match and thus no duplicates or dropouts")
else:
    logger.error(f"Dataset size doesn't match my set {sr.shape[0]} and validation set {val.shape[0]}")

if sr[sr['h3_level8_index'] == '0'].shape[0] == val[val['h3_level8_index'] == '0'].shape[0]:
    logger.info(f"Dataset size match for null hexes")
else:
    logger.error(f"Null hexes not matching my set {sr[sr['h3_level8_index'] == '0'].shape[0]} and validation set {val[val['h3_level8_index'] == '0'].shape[0]}")

2022-08-20 16:14:26,398 2-initial_data_transformation_explore INFO     Dataset size match and thus no duplicates or dropouts
2022-08-20 16:14:33,931 2-initial_data_transformation_explore ERROR    Null hexes not matching my set 212367 and validation set 212364


## Reviewing the simple checks
Oddly, my set has 3 more null hexes than the validation set, i.e. I mapped 3 fewer records to the geo set.
We'll have to compare the two sets line by line, using a join, to see what the hex codes are.

In [17]:
# more checks
query = '''
select sr.*, 
       val.h3_level8_index val_hex,
       val.latitude val_lat,
       val.longitude val_lon
from sr
left join val on val.notification_number = sr.notification_number
where sr.h3_level8_index <> val.h3_level8_index
'''
test = sqldf.run(query)

In [19]:
my_hex = test['h3_level8_index'].unique()
val_hex = test['val_hex'].unique()
test[['notification_number', 'h3_level8_index', 'val_hex','latitude','longitude','val_lat','val_lon']].sort_values('notification_number')

Unnamed: 0,notification_number,h3_level8_index,val_hex,latitude,longitude,val_lat,val_lon
0,1015706215,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
1,1015706515,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
2,1015720316,88ad36d5b1fffff,88ad36d5b5fffff,-34.055004,18.817866,-34.055004,18.817866
3,1015732806,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
4,1015760130,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
5,1015802932,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
6,1015818134,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
7,1015819134,88ad36135bfffff,88ad361353fffff,-34.015341,18.610848,-34.015341,18.610848
8,1015833016,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912
9,1015835341,88ad360221fffff,88ad360227fffff,-33.871389,18.512912,-33.871389,18.512912


In [22]:
print(my_hex)
print(val_hex)

['88ad360221fffff' '88ad36d5b1fffff' '88ad36135bfffff' '0'
 '88ad360151fffff']
['88ad360227fffff' '88ad36d5b5fffff' '88ad361353fffff' '88ad36c629fffff'
 '88ad360157fffff' '88ad361b51fffff']


## Why are the joins not 100%?
I can't find hexes 88ad361b51fffff and 88ad36c629fffff in city-hex-polygons-8.geojson or city-hex-polygons-8-10.geojson, so they don't exist for mapping.
I'm not sure why the rest don't match. The spatial join is pretty simple and no duplication occurred.
The lat/longs in the two sets appear to match.
My assumption is therefore that the mapping table used in the challenge differs from the provided data, perhaps on purpose to see if candidates detect the mistake.

Either way, re-mapping won't resolve this issue, and reconstructing a polygon from sr_hex.csv won't work either.
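
The lookup for hexes that are absent from the polygon file can be sketched with the standard library alone. This is only an illustration: `missing_hexes` is a hypothetical helper, the toy FeatureCollection stands in for the real geojson, and it assumes each feature stores its hex ID under an `index` property (the column used in the spatial join above).

```python
import json

def missing_hexes(geojson, hex_ids, key="index"):
    """Return the hex IDs that have no matching feature in the GeoJSON."""
    known = {
        feature.get("properties", {}).get(key)
        for feature in geojson.get("features", [])
    }
    return [h for h in hex_ids if h not in known]

# Toy data standing in for the real polygon file (assumption, not real content).
# For a real file: with open(path) as fh: geojson = json.load(fh)
toy_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"index": "88ad360227fffff"}, "geometry": null},
    {"type": "Feature", "properties": {"index": "88ad36d5b5fffff"}, "geometry": null}
  ]
}
""")

to_check = ["88ad360227fffff", "88ad361b51fffff", "88ad36c629fffff"]
not_found = missing_hexes(toy_geojson, to_check)
print(not_found)
```

Running the same helper against both polygon files would confirm whether a hex is missing from every available mapping source.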


## Final thoughts
The join threshold would depend on the proportion of successful joins and the amount of usable data.
If most of the data can join and the joined dataset is large enough, it should be fine.

We'll build a function based on this principle.
We have 941,634 records in total (~1 million), with 212,367 failing to join: a 23% failure rate.

I would argue that if the data dipped below half its current size, it would be unusable.
So perhaps requiring more than 500k joined records and more than 50% successful joins would work on other sets?
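
A minimal sketch of such a threshold function, assuming the two cut-offs proposed above (more than 500k joined records and more than 50% successful joins). The function name, parameter names, and defaults are placeholders, not a final implementation:

```python
def join_is_usable(total_records, failed_joins,
                   min_joined=500_000, min_success_rate=0.5):
    """Return True if the join leaves enough data, in both absolute and relative terms."""
    if total_records <= 0:
        return False  # nothing to join, so nothing usable
    joined = total_records - failed_joins
    success_rate = joined / total_records
    # Both conditions must hold: enough rows AND a high enough share joined
    return joined > min_joined and success_rate > min_success_rate

# Counts from this notebook: 941,634 records, 212,367 failed joins
usable = join_is_usable(941_634, 212_367)
print(f"Join usable: {usable}")
```

Keeping both thresholds as parameters makes it easy to tune them for other datasets, where 500k rows may be too strict or too lenient.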