# Main Data Science Workflow

This notebook demonstrates the complete pipeline from data loading to model evaluation.

**Contents:**
1. Setup and Configuration
2. Data Loading from S3
3. Data Transformation

**Last Updated:** 2025-10-19
**Author:** Wiebke Hutiri

In [1]:
# Notebook Configuration

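# Reload edited project modules automatically before each cell runs
%load_ext autoreload
%autoreload 2

# Silence all Python warnings to keep cell output readable
import warnings
warnings.filterwarnings('ignore')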

## Setup and Configuration

In [2]:
from ds_code_challenge.data import download_from_s3, load_data, spatial_join
from ds_code_challenge.config import Config

Config.setup_directories()

2025-10-19 23:46:29.099 | INFO     | ds_code_challenge.config:Config:26 - PROJ_ROOT path is: /Users/wiebke/PycharmProjects/ds_code_challenge


## Data Loading from S3

In [None]:
s3_dataset_keys = [
    'sr.csv.gz',
    'sr_hex.csv.gz',
    'sr_hex_truncated.csv',
    'city-hex-polygons-8.geojson',
    'city-hex-polygons-8-10.geojson',
    'images/swimming-pool/yes',
    'images/swimming-pool/no',
]

for key in s3_dataset_keys:
    download_from_s3(key, 'raw')
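
The download cell above appears not to have been executed in this run (its prompt reads `In [None]`), which is a common choice once the raw files are already on disk. One way to make the cell safe to re-run is to filter the key list down to files that are still missing. The sketch below is illustrative only: `keys_to_download` is a hypothetical helper, not part of `ds_code_challenge`, and it assumes each key is stored under the raw data directory at a path that matches the key.

```python
import tempfile
from pathlib import Path


def keys_to_download(keys, raw_dir):
    """Return the keys with no matching file or directory under raw_dir.

    Assumption: a downloaded key lives at raw_dir / key.
    """
    raw_dir = Path(raw_dir)
    return [key for key in keys if not (raw_dir / key).exists()]


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sr.csv.gz").touch()  # pretend this file was fetched earlier
    pending = keys_to_download(["sr.csv.gz", "sr_hex.csv.gz"], tmp)

print(pending)
```

In the notebook, the loop could then iterate over the filtered list instead of the full key list, provided the raw directory layout matches the assumption above.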

In [3]:
sr_df = load_data('sr.csv.gz')
hex8_df = load_data('city-hex-polygons-8.geojson')
sr_hex_trun_df = load_data('sr_hex_truncated.csv')

## Data Transformation

In [4]:
sr_hex_df = spatial_join(sr_df, hex8_df)

INFO:ds_code_challenge.data.transform: Join completed in 0.47 seconds.
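
The internals of `spatial_join` are not shown in this notebook. As a rough sketch of the idea, assuming the join labels each service request with the polygon that contains its coordinates, the example below uses a plain ray-casting point-in-polygon test. The names `point_in_polygon` and `assign_cells`, the toy squares, and the cell ids are all hypothetical; the real function may work quite differently, for example by using a spatial index.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_cells(points, cells, no_match=None):
    """Label each (lon, lat) point with the id of the first cell containing it."""
    labels = []
    for lon, lat in points:
        label = no_match
        for cell_id, polygon in cells.items():
            if point_in_polygon(lon, lat, polygon):
                label = cell_id
                break
        labels.append(label)
    return labels


# Toy data: two unit squares stand in for hexagons (ids are made up).
cells = {
    "cell_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "cell_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]
labels = assign_cells(points, cells)
print(labels)
```

This brute-force version checks every point against every polygon and treats longitude and latitude as planar coordinates, so it only illustrates the concept and is not a replacement for the project function.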


In [6]:
sr_hex_df.columns

Index(['notification_number', 'reference_number', 'creation_timestamp',
       'completion_timestamp', 'directorate', 'department', 'branch',
       'section', 'code_group', 'code', 'cause_code_group', 'cause_code',
       'official_suburb', 'latitude', 'longitude', 'h3_level8_index'],
      dtype='object')
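
The listing includes `h3_level8_index`, presumably the hexagon identifier attached by `spatial_join`. A quick guard like the one sketched below can catch a join that silently drops or renames a column before later steps depend on it. The helper `missing_columns` and the choice of `REQUIRED` columns are assumptions made for illustration; the column names themselves are copied from the output above.

```python
def missing_columns(columns, required):
    """Return the required names that are absent from columns, in required order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Columns that later steps are assumed to rely on (an illustrative choice).
REQUIRED = ["notification_number", "latitude", "longitude", "h3_level8_index"]

# Column names copied from the sr_hex_df.columns output.
joined_columns = [
    "notification_number", "reference_number", "creation_timestamp",
    "completion_timestamp", "directorate", "department", "branch",
    "section", "code_group", "code", "cause_code_group", "cause_code",
    "official_suburb", "latitude", "longitude", "h3_level8_index",
]

missing = missing_columns(joined_columns, REQUIRED)
assert not missing, f"joined frame lacks columns: {missing}"
```

In the notebook itself, `missing_columns(sr_hex_df.columns, REQUIRED)` would perform the same check, since iterating over a pandas `Index` yields the column labels.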