# Procurement & Supplier Management — Notebook

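This notebook follows the `STEP nA / nB / nC` pattern: each step `n` is split into sub-steps A, B, and C. In the cleaning steps (1 to 3) these are SEE (inspect), TREAT (fix), and VERIFY (re-check).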

- **df_raw** = raw purchase order (PO) data
- **df_treat** = cleaned dataset after TREAT steps

Steps:
- STEP 1: SEE/TREAT/VERIFY structure
- STEP 2: SEE/TREAT/VERIFY missing dates
- STEP 3: SEE/TREAT/VERIFY outliers & suppliers
- STEP 4: VERIFY KPIs
- STEP 5: Descriptive analytics
- STEP 6: Diagnostic analytics
- STEP 7: Predictive logistic regression
- STEP 8: Prescriptive supplier allocation


In [3]:
# STEP 0 — imports
# Run this cell first; every later cell depends on these imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
%matplotlib inline


In [4]:
# STEP 0B — load dataset
# Load the CSV file from the GitHub URL below (requires internet access);
# to use a local copy instead, replace the URL with the path to the file
df_raw = pd.read_csv('https://raw.githubusercontent.com/saikisri97/17_Hof_Lecture_Code_Pingo/refs/heads/main/Supply_Chain_Analytics/data/procurement_po_data.csv')
df_raw.head()


Unnamed: 0,PO_ID,Supplier,Lead_Time_Days,Unit_Price,Defect_Flag,Quantity,PO_Date,Delivery_Date
0,PO_001,A,10.8,10.5,1,357,2024-02-15,2024-02-25
1,PO_002,B,5.5,11.0,0,524,2024-02-19,2024-02-24
2,PO_003,A,14.5,10.2,0,339,2024-01-15,2024-01-29
3,PO_004,C,8.1,0.5,0,480,2024-02-18,2024-02-26
4,PO_005,Sup B,-12.2,10.2,0,578,2024-02-24,2024-02-24


## Helper Functions

We will use the following helper functions in multiple steps:

- `missing_report(df)`
- `step1_check_structure(df)`
- `step2_check_missing_and_dates(df)`
- `step3_check_outliers_and_suppliers(df)`
- `step4_check_verify_kpis(df, promised_lead_time=10)`


In [5]:
def missing_report(df):
    """Print count of missing values per column."""
    print(df.isna().sum())
    print()

def step1_check_structure(df):
    """Print dtypes, duplicate PO_ID count, non-positive lead times, and supplier labels."""
    print("dtypes:\n", df.dtypes)
    print("\nDuplicate PO_ID count:", df.duplicated('PO_ID').sum())
    if 'Lead_Time_Days' in df.columns:
        print("Negative/zero lead times:", (df['Lead_Time_Days']<=0).sum())
    print("\nUnique suppliers:", df['Supplier'].unique())
    print()

def step2_check_missing_and_dates(df):
    """Print counts of missing Delivery_Date and Lead_Time_Days values."""
    print("Missing Delivery_Date:", df['Delivery_Date'].isna().sum())
    if 'Lead_Time_Days' in df.columns:
        print("Missing Lead_Time_Days:", df['Lead_Time_Days'].isna().sum())
    print()

def step3_check_outliers_and_suppliers(df):
    """Print lead-time summary statistics, the count above 45 days, and supplier frequencies."""
    if 'Lead_Time_Days' in df.columns:
        print(df['Lead_Time_Days'].describe())
        print("Lead times > 45 days:", (df['Lead_Time_Days']>45).sum())
    print("\nSuppliers frequency:")
    print(df['Supplier'].value_counts())
    print()

def step4_check_verify_kpis(df, promised_lead_time=10):
    """Print and return per-supplier KPIs: OTD%, Mean_LT, Std_LT, CV%, DefectRate%."""
    df = df.copy()
    df['Late_Flag'] = (df['Lead_Time_Days'] > promised_lead_time).astype(int)
    grp = df.groupby('Supplier')
    kpi = pd.DataFrame(index=grp.size().index)
    kpi['OTD%'] = (1 - grp['Late_Flag'].mean()) * 100
    kpi['Mean_LT'] = grp['Lead_Time_Days'].mean()
    kpi['Std_LT'] = grp['Lead_Time_Days'].std()
    kpi['CV%'] = kpi['Std_LT'] / kpi['Mean_LT'] * 100
    kpi['DefectRate%'] = grp['Defect_Flag'].mean() * 100
    print(kpi)
    return kpi


## STEP 1 — SEE / TREAT / VERIFY: Structure
Check structure of df_raw: dtypes, duplicates, impossible lead times, supplier names.


In [None]:
# STEP 1A — SEE structure
# TODO:
# - Call step1_check_structure(df_raw)
# - Inspect dtypes, duplicate PO_IDs, negative/zero Lead_Time_Days, supplier values
# step1_check_structure(df_raw)
step1_check_structure(df_raw)

dtypes:
 PO_ID              object
Supplier           object
Lead_Time_Days    float64
Unit_Price        float64
Defect_Flag         int64
Quantity            int64
PO_Date            object
Delivery_Date      object
dtype: object

Duplicate PO_ID count: 0
Negative/zero lead times: 2

Unique suppliers: ['A' 'B' 'C' 'Sup B' 'Supp A' 'supA']



In [7]:
# STEP 1B — TREAT structure
# TODO:
# - Create df_treat = df_raw.copy()
# - Standardize Supplier names (e.g., map 'Supp A', 'supA' -> 'A')
# - Convert PO_Date and Delivery_Date to datetime
# - Recompute Lead_Time_Days from dates
# - Set Lead_Time_Days <= 0 to NaN
# - Drop duplicate PO_ID (keep first)
# df_treat = df_raw.copy()
# your cleaning code here
df_treat = df_raw.copy()

In [8]:
mapping = {
    'A': 'A', 'Supp A': 'A', 'supA': 'A',
    'B': 'B', 'Sup B': 'B',
    'C': 'C'
}
df_treat['Supplier'] = df_treat['Supplier'].map(mapping)

In [10]:
step1_check_structure(df_treat)

dtypes:
 PO_ID              object
Supplier           object
Lead_Time_Days    float64
Unit_Price        float64
Defect_Flag         int64
Quantity            int64
PO_Date            object
Delivery_Date      object
dtype: object

Duplicate PO_ID count: 0
Negative/zero lead times: 2

Unique suppliers: ['A' 'B' 'C']



In [11]:
df_treat['PO_Date'] = pd.to_datetime(df_treat['PO_Date'])
df_treat['Delivery_Date'] = pd.to_datetime(df_treat['Delivery_Date'])

In [15]:

df_treat['Lead_Time_Days'] = (df_treat['Delivery_Date'] - df_treat['PO_Date']).dt.days

In [17]:
df_raw

Unnamed: 0,PO_ID,Supplier,Lead_Time_Days,Unit_Price,Defect_Flag,Quantity,PO_Date,Delivery_Date
0,PO_001,A,10.8,10.5,1,357,2024-02-15,2024-02-25
1,PO_002,B,5.5,11.0,0,524,2024-02-19,2024-02-24
2,PO_003,A,14.5,10.2,0,339,2024-01-15,2024-01-29
3,PO_004,C,8.1,0.5,0,480,2024-02-18,2024-02-26
4,PO_005,Sup B,-12.2,10.2,0,578,2024-02-24,2024-02-24
...,...,...,...,...,...,...,...,...
115,PO_116,Supp A,6.9,11.0,1,487,2024-01-20,2024-01-26
116,PO_117,Supp A,10.1,10.0,0,473,2024-02-12,2024-02-22
117,PO_118,B,4.9,50.0,0,506,2024-02-08,2024-02-12
118,PO_119,Supp A,13.4,50.0,0,486,2024-01-13,2024-01-26


In [16]:
df_treat

Unnamed: 0,PO_ID,Supplier,Lead_Time_Days,Unit_Price,Defect_Flag,Quantity,PO_Date,Delivery_Date
0,PO_001,A,10.0,10.5,1,357,2024-02-15,2024-02-25
1,PO_002,B,5.0,11.0,0,524,2024-02-19,2024-02-24
2,PO_003,A,14.0,10.2,0,339,2024-01-15,2024-01-29
3,PO_004,C,8.0,0.5,0,480,2024-02-18,2024-02-26
4,PO_005,B,0.0,10.2,0,578,2024-02-24,2024-02-24
...,...,...,...,...,...,...,...,...
115,PO_116,A,6.0,11.0,1,487,2024-01-20,2024-01-26
116,PO_117,A,10.0,10.0,0,473,2024-02-12,2024-02-22
117,PO_118,B,4.0,50.0,0,506,2024-02-08,2024-02-12
118,PO_119,A,13.0,50.0,0,486,2024-01-13,2024-01-26


In [None]:
# STEP 1C — VERIFY structure
# TODO: re-run step1_check_structure(df_treat) and confirm that supplier labels are only A/B/C,
# both date columns are datetime, and no duplicate PO_IDs or non-positive lead times remain
# step1_check_structure(df_treat)
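
The TREAT cells above cover supplier names, date parsing, and the lead-time recomputation, but the `df_treat` preview still shows a 0-day lead time for PO_005. The two remaining TODO items (blank out non-positive lead times, drop duplicate PO_IDs) could be sketched as follows. The helper name `finish_structure_treatment` and the sample rows are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows (NOT from the real dataset): one duplicate PO_ID
# and one same-day delivery, i.e. a lead time of 0 days.
demo = pd.DataFrame({
    "PO_ID": ["PO_A", "PO_B", "PO_B", "PO_C"],
    "Supplier": ["A", "B", "B", "C"],
    "PO_Date": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-05", "2024-01-10"]),
    "Delivery_Date": pd.to_datetime(["2024-01-11", "2024-01-05", "2024-01-05", "2024-01-18"]),
})
demo["Lead_Time_Days"] = (demo["Delivery_Date"] - demo["PO_Date"]).dt.days


def finish_structure_treatment(df):
    """Drop repeated PO_IDs (keep first) and set lead times <= 0 to NaN."""
    out = df.drop_duplicates(subset="PO_ID", keep="first").copy()
    # Cast to float so the column can hold NaN.
    out["Lead_Time_Days"] = out["Lead_Time_Days"].astype(float)
    out.loc[out["Lead_Time_Days"] <= 0, "Lead_Time_Days"] = np.nan
    return out


demo_clean = finish_structure_treatment(demo)
print(demo_clean)
```

Any later cell that recomputes `Lead_Time_Days` from the two date columns will bring the non-positive values back, so this step may need to be repeated after such a recomputation.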


## STEP 2 — SEE / TREAT / VERIFY: Missing Dates & Lead Times
Handle missing Delivery_Date and Lead_Time_Days using median lead time per supplier.


In [None]:
# STEP 2A — SEE missingness
# TODO:
# - Call missing_report(df_treat)
# - Call step2_check_missing_and_dates(df_treat)
# missing_report(df_treat)
# step2_check_missing_and_dates(df_treat)


In [None]:
# STEP 2B — TREAT missing Delivery_Date
# TODO:
# - Compute median Lead_Time_Days per Supplier (ignore NaNs)
# - For rows with missing Delivery_Date, set Delivery_Date = PO_Date + median_lead_time_for_supplier
# - Recompute Lead_Time_Days from dates
# median_lt = df_treat.groupby('Supplier')['Lead_Time_Days'].median()
# your imputation code here
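
One possible way to implement the imputation described in the TODO above is sketched here. The function name `impute_delivery_dates` and the sample rows are hypothetical. Rounding the median to whole days is an assumption of this sketch, made so that imputed dates fall on calendar days.

```python
import pandas as pd

# Hypothetical sample rows (NOT from the real dataset); None marks a missing date.
demo = pd.DataFrame({
    "Supplier": ["A", "A", "A", "B", "B"],
    "PO_Date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-02", "2024-01-04"]),
    "Delivery_Date": pd.to_datetime(
        ["2024-01-11", "2024-01-15", None, "2024-01-07", None]),
})
demo["Lead_Time_Days"] = (demo["Delivery_Date"] - demo["PO_Date"]).dt.days


def impute_delivery_dates(df):
    """Fill missing Delivery_Date with PO_Date + median supplier lead time."""
    out = df.copy()
    # Median per supplier, broadcast back to every row; NaNs are skipped.
    median_lt = out.groupby("Supplier")["Lead_Time_Days"].transform("median").round()
    missing = out["Delivery_Date"].isna()
    out.loc[missing, "Delivery_Date"] = (
        out.loc[missing, "PO_Date"] + pd.to_timedelta(median_lt[missing], unit="D")
    )
    # Recompute lead time so it agrees with the (possibly imputed) dates.
    out["Lead_Time_Days"] = (out["Delivery_Date"] - out["PO_Date"]).dt.days
    return out


demo_filled = impute_delivery_dates(demo)
print(demo_filled)
```

Rows whose `Supplier` is missing get no median and therefore stay unfilled, which matches the exception mentioned in STEP 2C.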


In [None]:
# STEP 2C — VERIFY missingness
# TODO:
# - Run missing_report(df_treat) again
# - Confirm Delivery_Date and Lead_Time_Days have no missing values except possibly where Supplier is missing
# missing_report(df_treat)
# step2_check_missing_and_dates(df_treat)


## STEP 3 — SEE / TREAT / VERIFY: Outliers & Supplier Labels
Check lead-time outliers (> 45 days) and ensure Supplier ∈ {A,B,C}.


In [None]:
# STEP 3A — SEE outliers & suppliers
# TODO:
# - Use df_treat['Lead_Time_Days'].describe()
# - Count how many Lead_Time_Days > 45
# - Inspect supplier value counts
# step3_check_outliers_and_suppliers(df_treat)


In [None]:
# STEP 3B — TREAT outliers & suppliers
# TODO:
# - Remove or cap Lead_Time_Days > 45 (e.g., drop those rows)
# df_treat = df_treat[df_treat['Lead_Time_Days'] <= 45]
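
The two options named in the TODO (drop or cap) behave differently when `Lead_Time_Days` contains NaN. The sketch below uses made-up sample values and a hypothetical constant `MAX_LT` to show both. Note that a filter of the form `df['Lead_Time_Days'] <= 45` also removes NaN rows, because comparisons with NaN evaluate to False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values (NOT from the real dataset).
demo = pd.DataFrame({
    "Supplier": ["A", "B", "C", "A"],
    "Lead_Time_Days": [9.0, 60.0, np.nan, 45.0],
})
MAX_LT = 45  # outlier threshold in days, as used in STEP 3

# Option 1: drop only the rows that are definitely above the threshold.
# Negating "> MAX_LT" keeps rows where the lead time is NaN.
dropped = demo[~(demo["Lead_Time_Days"] > MAX_LT)]

# Option 2: cap (winsorise) instead of dropping; NaN stays NaN.
capped = demo.assign(Lead_Time_Days=demo["Lead_Time_Days"].clip(upper=MAX_LT))

print(dropped)
print(capped)
```

Dropping removes the whole purchase order from later KPIs, while capping keeps the order but understates how late it was; which is preferable depends on whether the extreme values are believed to be data errors.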


In [None]:
# STEP 3C — VERIFY outliers & suppliers
# TODO: re-run step3_check_outliers_and_suppliers(df_treat) and confirm outliers are gone
# step3_check_outliers_and_suppliers(df_treat)


## STEP 4 — VERIFY: KPI Stability
Compute OTD%, CV%, DefectRate% per supplier to verify dataset stability.
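
Written out, the per-supplier KPIs computed by the helper `step4_check_verify_kpis` are as follows. The bar denotes the mean over one supplier's rows, and `Late_Flag` is 1 when `Lead_Time_Days` exceeds `promised_lead_time` (default 10 days) and 0 otherwise.

$$
\text{OTD\%} = \bigl(1 - \overline{\text{Late\_Flag}}\bigr) \times 100, \qquad
\text{CV\%} = \frac{\text{Std\_LT}}{\text{Mean\_LT}} \times 100, \qquad
\text{DefectRate\%} = \overline{\text{Defect\_Flag}} \times 100
$$

Two details are worth keeping in mind when reading the table: pandas `.std()` returns the sample standard deviation (`ddof=1`), and a row with a missing `Lead_Time_Days` compares as not late, so it counts as on time in OTD%.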


In [None]:
# STEP 4A — Compute KPI table
# TODO:
# - Define promised_lead_time (e.g., 10 days)
# - Create Late_Flag = (Lead_Time_Days > promised_lead_time).astype(int)
# - Group by Supplier and compute OTD%, Mean_LT, Std_LT, CV%, DefectRate%
# promised_lead_time = 10
# df_treat['Late_Flag'] = ...
# kpi_table = ...


In [None]:
# STEP 4B — Inspect KPI table
# TODO: visually inspect kpi_table for unrealistic values
# (e.g., OTD% outside 0-100, a negative Mean_LT, or NaN in Std_LT)
# kpi_table


### STEP 4C — Interpretation (student)
- Which supplier looks most stable?
- Who has the highest OTD% and lowest CV%?


## STEP 5 — Descriptive Analytics
Build a supplier scorecard: lead time, defects, price behaviour.


In [None]:
# STEP 5A — Descriptive KPIs
# TODO:
# - Group by Supplier and compute mean, std, CV% of Lead_Time_Days
# - Compute DefectRate% and mean/std of Unit_Price
# desc = ...
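
A scorecard like the one described in the TODO could be built with pandas named aggregation, as in this sketch. The sample rows are made up, and the column names (`Mean_LT`, `CV_LT_pct`, and so on) are illustrative choices; named aggregation needs valid Python identifiers, so `%` is spelled `_pct` here.

```python
import pandas as pd

# Hypothetical sample rows (NOT from the real dataset).
demo = pd.DataFrame({
    "Supplier": ["A", "A", "B", "B"],
    "Lead_Time_Days": [10, 14, 5, 7],
    "Unit_Price": [10.0, 11.0, 12.0, 12.0],
    "Defect_Flag": [1, 0, 0, 0],
})

desc = demo.groupby("Supplier").agg(
    Mean_LT=("Lead_Time_Days", "mean"),
    Std_LT=("Lead_Time_Days", "std"),      # sample std (ddof=1)
    Mean_Price=("Unit_Price", "mean"),
    Std_Price=("Unit_Price", "std"),
    DefectRate_pct=("Defect_Flag", "mean"),
)
desc["DefectRate_pct"] *= 100                                 # share -> percent
desc["CV_LT_pct"] = desc["Std_LT"] / desc["Mean_LT"] * 100    # relative spread

print(desc.round(2))
```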


In [None]:
# STEP 5B — Visualisations
# TODO:
# - Plot bar chart of Mean_LT by Supplier
# - Plot bar chart of DefectRate% by Supplier
# - Optional: boxplot of Unit_Price by Supplier
# (use plt / sns)
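
The two bar charts from the TODO could be drawn side by side as sketched below. The `scorecard` values are invented placeholders standing in for the table from STEP 5A, and the column name `DefectRate_pct` is an assumption of this sketch.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scorecard with invented values (NOT results from the dataset).
scorecard = pd.DataFrame(
    {"Mean_LT": [12.0, 6.0, 9.0], "DefectRate_pct": [50.0, 0.0, 25.0]},
    index=pd.Index(["A", "B", "C"], name="Supplier"),
)

# One figure, two panels: mean lead time and defect rate per supplier.
fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
scorecard["Mean_LT"].plot.bar(ax=axes[0], title="Mean lead time (days)")
scorecard["DefectRate_pct"].plot.bar(ax=axes[1], title="Defect rate (%)")
for ax in axes:
    ax.set_xlabel("Supplier")
fig.tight_layout()
```

For the optional boxplot, `sns.boxplot(data=df_treat, x='Supplier', y='Unit_Price')` follows the same idea with seaborn.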


### STEP 5C — Interpretation (student)
- Summarise which supplier is fastest, most reliable, and cheapest.


## STEP 6 — Diagnostic Analytics
Explain **why** suppliers behave differently (variance drivers, relationships).


In [None]:
# STEP 6A — Lead time variance by supplier
# TODO: create boxplot of Lead_Time_Days by Supplier
# sns.boxplot(...)


In [None]:
# STEP 6B — Relationship between defects and lead time
# TODO:
# - Compute mean Lead_Time_Days for Defect_Flag = 0 vs 1
# df_treat.groupby('Defect_Flag')['Lead_Time_Days'].mean()
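
Alongside the group means, a single correlation coefficient can summarise the relationship. The sketch below uses made-up values; the Pearson correlation between a numeric column and a 0/1 flag is the point-biserial correlation, which is what `Series.corr` returns here.

```python
import pandas as pd

# Hypothetical sample values (NOT from the real dataset).
demo = pd.DataFrame({
    "Lead_Time_Days": [5, 6, 7, 12, 14, 13],
    "Defect_Flag": [0, 0, 0, 1, 1, 0],
})

# Mean lead time for non-defective (0) vs defective (1) orders.
mean_lt_by_defect = demo.groupby("Defect_Flag")["Lead_Time_Days"].mean()

# Pearson correlation with a binary flag = point-biserial correlation.
corr = demo["Lead_Time_Days"].corr(demo["Defect_Flag"])

print(mean_lt_by_defect)
print(f"correlation: {corr:.2f}")
```

A positive value indicates that defective orders tend to have longer lead times, but it says nothing about causation and can be driven by a single supplier, so a per-supplier breakdown is a sensible follow-up.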


### STEP 6C — Interpretation (student)
- Does higher lead time correlate with more defects?
- Are some suppliers 'too cheap' and unstable?


## STEP 7 — Predictive Analytics (Logistic Regression)
Model Late_Flag using Lead_Time_Days, Defect_Flag, Supplier_Index, Unit_Price.


In [None]:
# STEP 7A — Feature preparation
# TODO:
# - Ensure Late_Flag exists
# - Encode Supplier as Supplier_Index (0,1,2,...)
# - Build X with columns: ['Lead_Time_Days','Defect_Flag','Supplier_Index','Unit_Price']
# - y = Late_Flag


In [None]:
# STEP 7B — Train logistic regression and predict probabilities
# TODO:
# - Fit LogisticRegression on X, y
# - Add Late_Prob = predicted probability of Late_Flag=1
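
STEP 7A and 7B together could look like the sketch below, using made-up rows. Encoding `Supplier` with category codes and setting `max_iter=1000` are assumptions of this sketch. Two caveats apply: `Late_Flag` is derived directly from `Lead_Time_Days`, so with that column as a feature the model mostly relearns the threshold, and `LogisticRegression` does not accept NaN, so rows with missing features must be dropped or imputed first.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sample rows (NOT from the real dataset).
demo = pd.DataFrame({
    "Supplier": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "Lead_Time_Days": [12, 9, 14, 5, 6, 11, 8, 13, 7, 10],
    "Defect_Flag": [1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    "Unit_Price": [10.5, 10.2, 10.0, 11.0, 11.2, 10.9, 9.5, 9.8, 9.6, 9.7],
})
promised_lead_time = 10

# STEP 7A: target and features.
demo["Late_Flag"] = (demo["Lead_Time_Days"] > promised_lead_time).astype(int)
# Category codes follow sorted label order: A -> 0, B -> 1, C -> 2.
demo["Supplier_Index"] = demo["Supplier"].astype("category").cat.codes
features = ["Lead_Time_Days", "Defect_Flag", "Supplier_Index", "Unit_Price"]
X = demo[features]
y = demo["Late_Flag"]

# STEP 7B: fit and attach the predicted probability of Late_Flag = 1.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
demo["Late_Prob"] = model.predict_proba(X)[:, 1]

coefs = pd.Series(model.coef_[0], index=features)
late_prob_by_supplier = demo.groupby("Supplier")["Late_Prob"].mean()
print(coefs)
print(late_prob_by_supplier)
```

An integer `Supplier_Index` implies an ordering (A < B < C) that the suppliers do not really have; one-hot encoding with `pd.get_dummies` is a common alternative if that matters for the interpretation in STEP 7C.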


### STEP 7C — Interpretation (student)
- Which features increase late probability?
- Which supplier has lowest average Late_Prob?


## STEP 8 — Prescriptive Decision (Supplier Allocation)
Combine KPIs and Late_Prob to recommend sourcing allocation.


In [None]:
# STEP 8A — Aggregate supplier performance
# TODO:
# - Group by Supplier and compute: OTD, Mean_LT, CV_LT, DefectRate, Price, Late_Prob
# summary = ...


In [None]:
# STEP 8B — Choose allocation (student decision)
# TODO:
# - Based on summary table, propose % allocation to each supplier (e.g., 70% B, 30% A, 0% C)
# - Represent it as a small dict: {'A':..., 'B':..., 'C':...}
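
Once an allocation has been chosen, it can be checked against the summary table by computing allocation-weighted averages. The sketch below uses the example split from the TODO (70% B, 30% A, 0% C); the `summary` values are invented placeholders and `blended_metrics` is a hypothetical helper.

```python
import pandas as pd

# Placeholder summary with invented values (NOT results from the dataset).
summary = pd.DataFrame(
    {"Late_Prob": [0.40, 0.10, 0.30], "Price": [10.2, 11.0, 9.6]},
    index=pd.Index(["A", "B", "C"], name="Supplier"),
)

# Allocation in percent per supplier, as suggested in the TODO.
allocation = {"A": 30, "B": 70, "C": 0}


def blended_metrics(summary, allocation):
    """Return allocation-weighted averages of each summary column."""
    shares = pd.Series(allocation, dtype=float)
    if abs(shares.sum() - 100) > 1e-9:
        raise ValueError("allocation must sum to 100%")
    if not shares.index.isin(summary.index).all():
        raise ValueError("allocation contains an unknown supplier")
    weights = shares / 100
    # Multiply each supplier row by its weight, then add the rows up.
    return summary.loc[weights.index].mul(weights, axis=0).sum()


blended = blended_metrics(summary, allocation)
print(blended)
```

Comparing `blended` across a few candidate allocations makes the trade-off between price and lateness risk explicit, which can feed directly into the managerial summary.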


### STEP 8C — Managerial summary (student)
- Write 3–5 bullet points as advice to a Procurement Manager.
