# Main Data Science Workflow

This notebook demonstrates the complete pipeline from data loading to model evaluation.

**Contents:**
1. Setup and Configuration
2. Data Loading from S3
3. Data Transformation
4. Anomaly Detection Challenge

**Last Updated:** 2025-10-19
**Author:** Wiebke Hutiri

In [1]:
# Notebook Configuration

%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

## Setup and Configuration

In [14]:
from ds_code_challenge.data import (download_from_s3, load_data, spatial_join, validate_join,
                                    requests_by_department_per_day, filter_date_range)
from ds_code_challenge.config import Config
from ds_code_challenge.modeling import detect_anomalies_zscore

Config.setup_directories()

## Data Loading from S3

In [None]:
# Keys of the raw challenge datasets to fetch
s3_dataset_keys = [
    'sr.csv.gz',
    'sr_hex.csv.gz',
    'sr_hex_truncated.csv',
    'city-hex-polygons-8.geojson',
    'city-hex-polygons-8-10.geojson',
    'images/swimming-pool/yes',
    'images/swimming-pool/no',
]

for key in s3_dataset_keys:
    download_from_s3(key, 'raw')

In [3]:
sr_df = load_data('sr.csv.gz')
hex8_df = load_data('city-hex-polygons-8.geojson')
sr_hex_trun_df = load_data('sr_hex_truncated.csv')
sr_hex_val_df = load_data('sr_hex.csv.gz')

## Data Transformation

In [116]:
sr_hex_df = spatial_join(sr_df, hex8_df)

INFO:ds_code_challenge.data.transform: Join completed in 0.55 seconds.
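The join above assigns each service request to a level-8 hexagon. The implementation of `spatial_join` is not shown in this notebook, so the sketch below only illustrates the underlying idea: it uses a plain ray-casting point-in-polygon test, which may well differ from what the package does. The function names, the `(longitude, latitude)` ordering, and the two square "polygons" are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # assumed order: (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: a point is inside if a ray to the right crosses an odd number of edges."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_polygon(point: Point, polygons: Dict[str, List[Point]]) -> Optional[str]:
    """Return the key of the first polygon containing the point, or None."""
    for index, ring in polygons.items():
        if point_in_polygon(point, ring):
            return index
    return None


# Two adjacent unit squares standing in for hexagons (made-up geometry).
toy_polygons = {
    "a": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "b": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
in_a = assign_polygon((0.5, 0.5), toy_polygons)
in_b = assign_polygon((1.5, 0.5), toy_polygons)
outside = assign_polygon((3.0, 3.0), toy_polygons)
```

A linear scan over every polygon is quadratic in the worst case; a real join over hundreds of thousands of requests would normally rely on a spatial index.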


In [118]:
diff = validate_join(sr_hex_df, sr_hex_val_df)

Found 29 differences:
                     h3_level8_index                 
                                self            other
notification_number                                  
1015706215           88ad360221fffff  88ad360227fffff
1015706515           88ad360221fffff  88ad360227fffff
1015720316           88ad36d5b1fffff  88ad36d5b5fffff
1015732806           88ad360221fffff  88ad360227fffff
1015760130           88ad360221fffff  88ad360227fffff
1015802932           88ad360221fffff  88ad360227fffff
1015818134           88ad360221fffff  88ad360227fffff
1015819134           88ad36135bfffff  88ad361353fffff
1015833016           88ad360221fffff  88ad360227fffff
1015835341           88ad360221fffff  88ad360227fffff
1015835955           88ad360221fffff  88ad360227fffff
1015836310           88ad360221fffff  88ad360227fffff
1015839765           88ad360221fffff  88ad360227fffff
1015840714           88ad360221fffff  88ad360227fffff
1015868773           88ad360221fffff  88ad360227fffff
101587
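The output above lists records whose `h3_level8_index` differs between the two inputs, presumably with `self` being the computed join and `other` the reference file. How `validate_join` works internally is not shown here; the following is a minimal sketch of that kind of comparison using plain dictionaries keyed by `notification_number`. The function name `diff_assignments` is hypothetical, and the record with key `1` is made up to show a matching row.

```python
from typing import Dict, Tuple


def diff_assignments(
    computed: Dict[int, str], reference: Dict[int, str]
) -> Dict[int, Tuple[str, str]]:
    """Return {key: (computed, reference)} for shared keys whose values differ."""
    shared = computed.keys() & reference.keys()
    return {
        key: (computed[key], reference[key])
        for key in sorted(shared)
        if computed[key] != reference[key]
    }


# Two mismatching rows copied from the output above, plus one made-up matching row.
computed = {
    1: "88ad360221fffff",  # hypothetical record that agrees in both inputs
    1015706215: "88ad360221fffff",
    1015720316: "88ad36d5b1fffff",
}
reference = {
    1: "88ad360221fffff",
    1015706215: "88ad360227fffff",
    1015720316: "88ad36d5b5fffff",
}
diffs = diff_assignments(computed, reference)
print(f"Found {len(diffs)} differences")
```

Keys present in only one input are ignored in this sketch; a stricter check would report them separately.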

## Anomaly Detection Challenge

Reshape the `sr_hex.csv` data into the number of requests created per department, per day. Identify any days in the first 6 months of 2020 on which an anomalous number of requests was created for a particular department. Describe how you would persuade the director of that department to investigate that anomaly. Your argument should rely on the contents of the dataset and/or your anomaly detection model.

In [15]:
agg_requests = requests_by_department_per_day(sr_hex_val_df)
anomalies = detect_anomalies_zscore(agg_requests)
filter_date_range(anomalies, '2020-01-01', '2020-06-30')

INFO:ds_code_challenge.data.transform:Aggregating 941634 service requests by department and day
INFO:ds_code_challenge.data.transform:Aggregated to 5304 date-department combinations
INFO:ds_code_challenge.data.transform:Date range: 2020-01-01 00:00:00 to 2020-12-31 00:00:00
INFO:ds_code_challenge.data.transform:Departments: 20
INFO:ds_code_challenge.modeling.anomaly_prediction:Detecting anomalies with Z-score threshold: 3
INFO:ds_code_challenge.modeling.anomaly_prediction:Found 44 anomalous days
INFO:ds_code_challenge.modeling.anomaly_prediction:Departments affected: 10
INFO:ds_code_challenge.data.transform:Filtering to date range: 2020-01-01 to 2020-06-30
INFO:ds_code_challenge.data.transform:Filtered to 19 records


| Unnamed: 0 | date | department | request_count | mean | std | median | min | max | z_score | is_anomaly |
|---|---|---|---|---|---|---|---|---|---|---|
| 743 | 2020-02-21 | City Health | 72 | 18.792614 | 14.825083 | 17.0 | 1 | 72 | 3.589011 | True |
| 273 | 2020-01-20 | Operational Coordination | 12 | 3.622581 | 2.704929 | 3.0 | 1 | 15 | 3.097094 | True |
| 629 | 2020-02-13 | Operational Coordination | 14 | 3.622581 | 2.704929 | 3.0 | 1 | 15 | 3.836485 | True |
| 820 | 2020-02-26 | Operational Coordination | 12 | 3.622581 | 2.704929 | 3.0 | 1 | 15 | 3.097094 | True |
| 441 | 2020-01-31 | Property Management | 59 | 8.815668 | 14.630104 | 4.0 | 1 | 120 | 3.43021 | True |
| 2139 | 2020-06-02 | Social Development & Early Childhood Development | 17 | 2.604811 | 1.96833 | 2.0 | 1 | 19 | 7.313404 | True |
| 2155 | 2020-06-03 | Social Development & Early Childhood Development | 19 | 2.604811 | 1.96833 | 2.0 | 1 | 19 | 8.329493 | True |
| 295 | 2020-01-21 | Technical Services | 232 | 21.74184 | 38.06333 | 8.0 | 1 | 272 | 5.523903 | True |
| 343 | 2020-01-24 | Technical Services | 165 | 21.74184 | 38.06333 | 8.0 | 1 | 272 | 3.763679 | True |
| 399 | 2020-01-28 | Technical Services | 272 | 21.74184 | 38.06333 | 8.0 | 1 | 272 | 6.574784 | True |
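The rows above are consistent with a per-department Z-score, `z = (request_count - mean) / std`, flagged when it passes the logged threshold of 3. The source of `detect_anomalies_zscore` is not shown, so the sketch below is an illustration under assumptions: it uses the sample standard deviation and flags on `|z|`, either of which may differ from the package. The names `z_score` and `flag_anomalies` and the "Example Dept" data are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def z_score(count: float, mu: float, sigma: float) -> float:
    """Number of standard deviations by which count deviates from mu."""
    return (count - mu) / sigma


def flag_anomalies(
    daily_counts: Dict[str, Dict[str, int]], threshold: float = 3.0
) -> List[Tuple[str, str, float]]:
    """Return (department, date, z) for days whose |z| exceeds the threshold."""
    flagged = []
    for department, counts in daily_counts.items():
        values = list(counts.values())
        if len(values) < 2:
            continue  # the sample standard deviation needs at least two days
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # every day is identical, so nothing can stand out
        for date, count in counts.items():
            z = z_score(count, mu, sigma)
            if abs(z) > threshold:
                flagged.append((department, date, z))
    return flagged


# Check against the City Health row above: count 72, mean 18.792614, std 14.825083.
city_health_z = z_score(72, 18.792614, 14.825083)  # about 3.589, as in the z_score column

# Made-up data: twenty quiet days and one spike for a hypothetical department.
toy = {"Example Dept": {f"2020-01-{day:02d}": 5 for day in range(1, 21)}}
toy["Example Dept"]["2020-01-21"] = 50
flagged = flag_anomalies(toy)
```

Because the mean and standard deviation are computed from data that includes the spikes themselves, large outliers inflate both. The gap between mean (21.74) and median (8.0) for Technical Services hints at this, and a median-based score is a possible alternative worth comparing.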
