# Main Data Science Workflow

This notebook demonstrates the complete pipeline from data loading to model evaluation.

**Contents:**
1. Setup and Configuration
2. Data Loading from S3
3. Data Transformation
4. Anomaly Detection Challenge

**Last Updated:** 2025-10-19
**Author:** Wiebke Hutiri

In [1]:
# Notebook Configuration

%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

## Setup and Configuration

In [2]:
from ds_code_challenge.data import (download_from_s3, load_data, spatial_join, validate_join,
                                    requests_by_department_per_day, filter_date_range)
from ds_code_challenge.config import Config

Config.setup_directories()

2025-10-20 02:00:37.480 | INFO     | ds_code_challenge.config:Config:26 - PROJ_ROOT path is: /Users/wiebke/PycharmProjects/ds_code_challenge


## Data Loading from S3

In [None]:
s3_dataset_keys = [
    'sr.csv.gz',
    'sr_hex.csv.gz',
    'sr_hex_truncated.csv',
    'city-hex-polygons-8.geojson',
    'city-hex-polygons-8-10.geojson',
    'images/swimming-pool/yes',
    'images/swimming-pool/no',
]

for key in s3_dataset_keys:
    download_from_s3(key, 'raw')

In [3]:
sr_df = load_data('sr.csv.gz')
hex8_df = load_data('city-hex-polygons-8.geojson')
sr_hex_trun_df = load_data('sr_hex_truncated.csv')
sr_hex_val_df = load_data('sr_hex.csv.gz')

## Data Transformation

In [116]:
sr_hex_df = spatial_join(sr_df, hex8_df)

INFO:ds_code_challenge.data.transform: Join completed in 0.55 seconds.


In [118]:
diff = validate_join(sr_hex_df, sr_hex_val_df)

Found 29 differences:
                     h3_level8_index                 
                                self            other
notification_number                                  
1015706215           88ad360221fffff  88ad360227fffff
1015706515           88ad360221fffff  88ad360227fffff
1015720316           88ad36d5b1fffff  88ad36d5b5fffff
1015732806           88ad360221fffff  88ad360227fffff
1015760130           88ad360221fffff  88ad360227fffff
1015802932           88ad360221fffff  88ad360227fffff
1015818134           88ad360221fffff  88ad360227fffff
1015819134           88ad36135bfffff  88ad361353fffff
1015833016           88ad360221fffff  88ad360227fffff
1015835341           88ad360221fffff  88ad360227fffff
1015835955           88ad360221fffff  88ad360227fffff
1015836310           88ad360221fffff  88ad360227fffff
1015839765           88ad360221fffff  88ad360227fffff
1015840714           88ad360221fffff  88ad360227fffff
1015868773           88ad360221fffff  88ad360227fffff
101587
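
The `self`/`other` column layout in the output above is the layout that pandas' `DataFrame.compare` produces. The sketch below shows how such a comparison could be written. It is not the implementation of `validate_join`, which is not shown in this notebook; the helper name `compare_hex_assignments` is hypothetical, and the sketch assumes that `notification_number` is unique in both frames.

```python
import pandas as pd


def compare_hex_assignments(joined, reference,
                            key="notification_number",
                            column="h3_level8_index"):
    """Return the rows whose hex index differs between two frames.

    Hypothetical helper; assumes `key` is unique in both frames.
    """
    left = joined.set_index(key)[[column]].sort_index()
    right = reference.set_index(key)[[column]].sort_index()
    # DataFrame.compare needs identically labelled frames,
    # so restrict both sides to the keys they share.
    shared = left.index.intersection(right.index)
    return left.loc[shared].compare(right.loc[shared])


# Toy data: only notification 2 is assigned to a different hex.
joined = pd.DataFrame({"notification_number": [1, 2, 3],
                       "h3_level8_index": ["a", "b", "c"]})
reference = pd.DataFrame({"notification_number": [3, 2, 1],
                          "h3_level8_index": ["c", "x", "a"]})

diff = compare_hex_assignments(joined, reference)
print(f"Found {len(diff)} differences:")
print(diff)
```

Keys that exist in only one frame are dropped by the intersection, so a fuller check would report those separately.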

## Anomaly Detection Challenge

Reshape the `sr_hex.csv` data into the number of requests created per department, per day. Identify any days in the first six months of 2020 on which an anomalous number of requests was created for a particular department. Describe how you would justify to the director of that department why they should investigate the anomaly. Your argument should rely on the contents of the dataset and/or your anomaly detection model.
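
One possible starting point for this task is a robust per-department outlier score. The sketch below uses the modified z-score (median and median absolute deviation, with the commonly cited cutoff of 3.5). It is an illustrative baseline and not the method used in this notebook. The function name `flag_anomalies` and the toy data are hypothetical; the column names `date`, `department` and `request_count` follow the aggregated output shown in this section.

```python
import pandas as pd


def flag_anomalies(daily, threshold=3.5):
    """Flag department-days far from that department's median count.

    Hypothetical helper using the modified z-score (median / MAD).
    """
    counts = daily["request_count"]
    median = counts.groupby(daily["department"]).transform("median")
    # Median absolute deviation (MAD), computed per department.
    abs_dev = (counts - median).abs()
    mad = abs_dev.groupby(daily["department"]).transform("median")
    out = daily.copy()
    # 0.6745 makes the MAD comparable to a standard deviation for
    # normally distributed data. A MAD of zero yields NaN, which is
    # never flagged.
    out["robust_z"] = 0.6745 * (counts - median) / mad.where(mad > 0)
    out["is_anomaly"] = out["robust_z"].abs() > threshold
    return out


# Toy data: department A has one obvious spike on the last day.
dates = pd.date_range("2020-01-01", periods=7)
daily = pd.DataFrame({
    "date": list(dates) * 2,
    "department": ["A"] * 7 + ["B"] * 7,
    "request_count": [100, 102, 98, 101, 99, 103, 400,
                      10, 11, 9, 10, 12, 10, 11],
})

flagged = flag_anomalies(daily)
print(flagged[flagged["is_anomaly"]])
```

This baseline ignores weekday effects and trends, so a department with strong weekly seasonality would likely need the score computed per weekday or on residuals from a seasonal model.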

In [4]:
agg_requests = requests_by_department_per_day(sr_hex_val_df)
agg_requests.sort_values(by='request_count', ascending=False)

INFO:ds_code_challenge.data.transform:Aggregating 941634 service requests by department and day
INFO:ds_code_challenge.data.transform:Aggregated to 5304 date-department combinations
INFO:ds_code_challenge.data.transform:Date range: 2020-01-01 00:00:00 to 2020-12-31 00:00:00
INFO:ds_code_challenge.data.transform:Departments: 20


      date        department                                        request_count
2729  2020-07-13  Electricity Generation and Distribution           3156
2744  2020-07-14  Electricity Generation and Distribution           2015
3937  2020-10-01  Electricity Generation and Distribution           1844
2791  2020-07-17  Electricity Generation and Distribution           1792
2759  2020-07-15  Electricity Generation and Distribution           1715
...   ...         ...                                               ...
4469  2020-11-05  Social Development & Early Childhood Development  1
2406  2020-06-20  Social Development & Early Childhood Development  1
1252  2020-03-26  Social Development & Early Childhood Development  1
1258  2020-03-27  Customer Relations                                1
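
For readers without access to the `ds_code_challenge` package, the reshaping step can be approximated with a plain pandas `groupby`. This is a minimal sketch, not the code behind `requests_by_department_per_day`. The function name `count_requests_per_department_day` is hypothetical, and the input column names `creation_timestamp` and `department` are assumptions about the raw data.

```python
import pandas as pd


def count_requests_per_department_day(requests,
                                      timestamp_col="creation_timestamp"):
    """Count requests per (date, department) pair.

    Hypothetical helper; the input column names are assumptions.
    """
    # Truncate each timestamp to midnight so that all requests
    # created on the same calendar day share one key.
    dates = pd.to_datetime(requests[timestamp_col]).dt.normalize()
    return (requests.assign(date=dates)
            .groupby(["date", "department"])
            .size()
            .reset_index(name="request_count"))


# Toy data: two requests on 1 January, one on 2 January.
requests = pd.DataFrame({
    "creation_timestamp": ["2020-01-01 08:00", "2020-01-01 17:30",
                           "2020-01-02 09:15"],
    "department": ["Water", "Water", "Water"],
})

counts = count_requests_per_department_day(requests)
print(counts)
```

By default `groupby` drops rows whose department is missing, so such rows would need to be filled or counted separately if they matter for the analysis.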


In [5]:
jan_to_june_2020 = filter_date_range(agg_requests, '2020-01-01', '2020-06-30')
jan_to_june_2020.sort_values(by='request_count', ascending=False)

INFO:ds_code_challenge.data.transform:Filtering to date range: 2020-01-01 to 2020-06-30
INFO:ds_code_challenge.data.transform:Filtered to 2551 records


      date        department                               request_count
271   2020-01-20  Electricity Generation and Distribution  1346
2250  2020-06-10  Electricity Generation and Distribution  1330
2265  2020-06-11  Electricity Generation and Distribution  1321
2523  2020-06-29  Electricity Generation and Distribution  1260
2498  2020-06-27  Electricity Generation and Distribution  1253
...   ...         ...                                      ...
1234  2020-03-25  Operational Coordination                 1
1220  2020-03-24  Operational Coordination                 1
1213  2020-03-23  Valuations                               1
1197  2020-03-22  Technical Services                       1

