## PREPROCESSING

1. Parse dates.
2. Interpolate missing weather data.
3. Find anomalies in the number of couriers online, save them to a separate file, then replace them with interpolated values (so no rows are dropped and the date order is preserved).
4. Extract day names from the date column.
5. Create an ordinal day-type variable from the day names (0, 1, 2 for three groups ordered by mean).
6. Add a days-from-beginning feature.
7. Normalize the numeric feature variables.

In [1]:
# Common imports
import pandas as pd
import numpy as np
import sys
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Custom imports

# Add the chosen directory to the Python path
chosen_directory = '../src/'
sys.path.append(chosen_directory)

from data_preprocessing import (handle_missing_values, show_outliers, handle_outliers, 
                                extract_day_category, extract_days_from_beginning)

## Load dataset

In [3]:
# Define the file path
raw_data_path = '../data/raw/daily_cp_activity_dataset.csv'

# Load the dataset
df = pd.read_csv(raw_data_path)

## Data overview

In [4]:
# Data Overview
df.head()

| | date | courier_partners_online | temperature | relative_humidity | precipitation |
|---|---|---|---|---|---|
| 0 | 2021-05-01 | 49 | 18.27 | 0.57 | 0.0 |
| 1 | 2021-05-02 | 927 | 19.88 | 0.55 | 0.0 |
| 2 | 2021-05-03 | 40 | 16.88 | 0.6 | 0.0 |
| 3 | 2021-05-04 | 51 | 21.88 | 0.53 | 0.0 |
| 4 | 2021-05-05 | 50 | 21.11 | 0.54 | 0.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761 entries, 0 to 760
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   date                     761 non-null    object 
 1   courier_partners_online  761 non-null    int64  
 2   temperature              731 non-null    float64
 3   relative_humidity        761 non-null    float64
 4   precipitation            756 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 29.9+ KB


In [6]:
df.describe()

| | courier_partners_online | temperature | relative_humidity | precipitation |
|---|---|---|---|---|
| count | 761.0 | 731.0 | 761.0 | 756.0 |
| mean | 72.417871 | 17.532585 | 0.653193 | 0.914735 |
| std | 96.039679 | 10.007564 | 0.171553 | 1.749988 |
| min | 34.0 | -9.98 | 0.43 | 0.0 |
| 25% | 58.0 | 10.93 | 0.52 | 0.0 |
| 50% | 66.0 | 18.63 | 0.59 | 0.0 |
| 75% | 72.0 | 24.41 | 0.79 | 0.91 |
| max | 1506.0 | 37.95 | 1.0 | 12.9 |


___
## PREPROCESSING PIPELINE
---

### Parse dates

In [7]:
# Parse dates
df['date'] = pd.to_datetime(df['date'])
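
Because later steps rely on the date order, it can be worth checking that the parsed dates form an unbroken daily sequence. The sketch below is only an illustration on a toy frame (the values and the `toy` name are made up, not taken from the dataset); it parses with an explicit format and lists any skipped days.

```python
import pandas as pd

# Toy frame standing in for the raw dataset (illustrative values only)
toy = pd.DataFrame({"date": ["2021-05-01", "2021-05-02", "2021-05-04"]})

# An explicit format is stricter than letting pandas infer it
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Compare against a complete daily range to find skipped days
expected = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")
missing_days = expected.difference(pd.DatetimeIndex(toy["date"]))
print(missing_days)
```

An empty result would mean every calendar day between the first and last date is present.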

### Handling missing values

In [8]:
# Interpolate every column that contains missing values
for column in df.columns:
    if df[column].isna().any():
        df[column] = handle_missing_values(df[column])

Missing values handled using interpolation method: linear
Missing values handled using interpolation method: linear
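
The source of `handle_missing_values` is not shown in this notebook. Judging only by its log message, it probably wraps pandas linear interpolation; a minimal sketch under that assumption (the function name `interpolate_missing` and the `limit_direction` choice are hypothetical) might look like this:

```python
import numpy as np
import pandas as pd


def interpolate_missing(series: pd.Series, method: str = "linear") -> pd.Series:
    """Hypothetical stand-in for handle_missing_values (the real helper is not shown)."""
    # limit_direction="both" also fills NaNs at the very start or end of the series
    filled = series.interpolate(method=method, limit_direction="both")
    print(f"Missing values handled using interpolation method: {method}")
    return filled


# Toy temperature series with gaps (illustrative values only)
temps = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])
filled = interpolate_missing(temps)
print(filled.tolist())
```

Linear interpolation treats the rows as equally spaced, which fits a daily series with no skipped dates.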


### Handling outliers

In [9]:
# Find outliers and keep the resulting dataframes in a dictionary keyed by column name
outliers = {}

for column in df.columns:
    outliers[column] = show_outliers(df, column)

There are outliers in column: courier_partners_online over threshold 93.0
There are outliers in column: precipitation over threshold 2.275
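
Both reported thresholds are consistent with Tukey's upper fence, Q3 + 1.5 × IQR, applied to the quartiles from `df.describe()`: 72 + 1.5 × (72 − 58) = 93.0 for couriers and 0.91 + 1.5 × (0.91 − 0) = 2.275 for precipitation. The internals of `show_outliers` are not shown, so the sketch below is an assumption about how such a check could be written; `upper_fence` and `find_upper_outliers` are hypothetical names.

```python
import pandas as pd


def upper_fence(series: pd.Series, k: float = 1.5) -> float:
    """Tukey's upper fence: Q3 + k * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + k * (q3 - q1)


def find_upper_outliers(data: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for show_outliers: rows above the upper fence."""
    return data[data[column] > upper_fence(data[column])]


# Toy data (illustrative values only)
toy = pd.DataFrame({"couriers": [58, 58, 66, 72, 72, 927]})
fence = upper_fence(toy["couriers"])
flagged = find_upper_outliers(toy, "couriers")
print(fence)
print(flagged)
```

Only the upper fence is checked here because the notebook output reports values "over threshold" only.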


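Precipitation is ignored here: its distribution is heavily right-skewed (the median is 0.0, so at least half of the days are dry), and values above the threshold are likely genuine rainy days rather than anomalies. The outliers in courier_partners_online deserve a closer look.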

In [12]:
# Show outliers for couriers online
outliers['courier_partners_online']

| | date | courier_partners_online | temperature | relative_humidity | precipitation |
|---|---|---|---|---|---|
| 1 | 2021-05-02 | 927 | 19.88 | 0.55 | 0.0 |
| 138 | 2021-09-16 | 1367 | 30.68 | 0.46 | 0.0 |
| 269 | 2022-01-25 | 1176 | -2.61 | 0.92 | 0.84 |
| 326 | 2022-03-23 | 1506 | 22.81 | 0.55 | 0.0 |
| 635 | 2023-01-26 | 1175 | -1.88 | 0.92 | 0.85 |


In [13]:
# Optionally save outliers to file
# Define file path
outliers_file_path = '../data/processed/anomalies.csv'

# Save to the file
outliers['courier_partners_online'].to_csv(outliers_file_path)

In [11]:
# Handle outliers on courier_partners_online
df['courier_partners_online'] = handle_outliers(
    data=df,
    column='courier_partners_online',
    replacement_strategy='interpolate'
)

Outliers handled using threshold: 93.0
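
Replacing outliers by interpolation, rather than dropping the rows, keeps one row per date. The real `handle_outliers` is not shown; the sketch below assumes it masks values above the threshold and fills them from neighbouring days. The function name and the float conversion are hypothetical choices.

```python
import pandas as pd


def replace_outliers_by_interpolation(series: pd.Series, threshold: float) -> pd.Series:
    """Hypothetical stand-in for handle_outliers(..., replacement_strategy='interpolate')."""
    # Mask values above the threshold as NaN, keeping every row (and date) in place
    masked = series.astype(float).mask(series > threshold)
    # Fill the gaps linearly from the neighbouring days
    return masked.interpolate(method="linear", limit_direction="both")


# First five courier counts from df.head(); 927 is above the 93.0 threshold
couriers = pd.Series([49, 927, 40, 51, 50])
cleaned = replace_outliers_by_interpolation(couriers, threshold=93.0)
print(cleaned.tolist())
```

In this sketch the flagged value becomes the midpoint of its two neighbours, and the series length is unchanged.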


### Extracting extra features

In [14]:
# Add day category feature
df['day_category'] = extract_day_category(df['date'])

# Add days from beginning feature
df['day_from_beginning'] = extract_days_from_beginning(df['date'])

Day categories extracted successfully from the Series
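
Neither helper's source is shown. `extract_day_category` receives only the date column, so its weekday-to-group mapping is presumably fixed inside the helper. The sketch below instead derives the three groups from a target column to illustrate the "ordered by mean" idea from the step list; all function names, the `target` argument, and the synthetic data are assumptions.

```python
import pandas as pd


def day_category_by_mean(dates: pd.Series, target: pd.Series, n_groups: int = 3) -> pd.Series:
    """Hypothetical sketch: rank weekdays by mean target, then bin them into ordinal groups."""
    day_names = dates.dt.day_name()
    # Mean of the target per weekday, sorted from lowest to highest
    means = target.groupby(day_names).mean().sort_values()
    # Rank 0..6 mapped to groups 0..n_groups-1 with integer arithmetic
    ranks = pd.Series(range(len(means)), index=means.index)
    groups = ranks * n_groups // len(means)
    return day_names.map(groups)


def days_from_beginning(dates: pd.Series) -> pd.Series:
    """Hypothetical sketch: whole days elapsed since the earliest date."""
    return (dates - dates.min()).dt.days


dates = pd.Series(pd.date_range("2021-05-01", periods=14, freq="D"))
couriers = dates.dt.dayofweek * 10  # synthetic target, rising from Monday to Sunday
categories = day_category_by_mean(dates, couriers)
offsets = days_from_beginning(dates)
print(pd.DataFrame({"date": dates, "day_category": categories, "day_from_beginning": offsets}))
```

With seven weekdays and three groups, the bins cannot be equal in size; this sketch puts three weekdays in the lowest group and two in each of the others.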


### Normalization

In [15]:
# Specify numerical columns for scaling
numerical_columns = ['temperature', 'relative_humidity', 'precipitation']

# Create a MinMaxScaler instance
scaler = MinMaxScaler()

# Fit and transform the selected columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
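
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as (x − min) / (max − min). The sketch below reproduces that arithmetic with plain pandas on made-up values, purely to show what the transformed columns contain.

```python
import pandas as pd

# Toy weather columns (illustrative values only)
toy = pd.DataFrame({
    "temperature": [-10.0, 0.0, 30.0],
    "precipitation": [0.0, 1.0, 4.0],
})

# Column-wise min-max scaling to the [0, 1] range
col_min = toy.min()
col_max = toy.max()
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

Each column's minimum maps to 0 and its maximum to 1. If the data is later split into training and test sets, fitting the scaler on the training part only would avoid leaking test-set ranges into the features.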

### Save to file

In [16]:
# Define path for processed file
file_path = "../data/processed/daily_cp_activity_processed.csv"

# Save the dataframe to csv file
df.to_csv(file_path)
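
As written, `df.to_csv(file_path)` also stores the row index as an unnamed first column, which comes back as `Unnamed: 0` on reload unless `index_col=0` is passed. A possible alternative, sketched on a toy frame with an in-memory buffer instead of the real file, is to skip the index when writing and parse the dates again when reading:

```python
import io

import pandas as pd

# Toy frame standing in for the processed dataset (illustrative values only)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-01", "2021-05-02"]),
    "courier_partners_online": [49.0, 44.5],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids the extra unnamed column
buffer.seek(0)

# parse_dates restores the datetime dtype, which CSV does not preserve
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```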