**Modern BI-Ready Analytics Pipeline for Customer Behavior Segmentation Using Python, dbt, and Power BI**



**Goal**

To design and implement a scalable machine learning pipeline that predicts delivery delays and segments customer behavior using historical logistics, rating, and transactional data. The project combines feature engineering, model training, dashboard reporting, and CI/CD automation to support operational decision-making and data reliability.

**Intended Audience**

- Data Engineers – for implementing ETL, CI/CD, and model lifecycle

- BI Analysts – for visualization and customer segmentation

- Operations Managers – for warehouse and logistics optimization

- ML Practitioners – for experimentation, evaluation, and insights

- Product and Strategy Teams – to support delay reduction and customer experience improvement

**Strategy & Pipeline Steps**

1. Data Ingestion

- Load delivery data from Google Drive or local storage (CSV)

- Handle encoding, data types, and NA values

2. Data Quality & Transformation (dbt-style logic)

- Clean fields (e.g., Customer_rating, Cost_of_the_Product)

- Engineer delivery_risk_score using custom weight logic

- Encode delay labels and normalize features

3. Exploratory Data Analysis

- Visualize relationships in Power BI across Warehouse_block, Customer_care_calls, Discount_offered, and Gender

4. Machine Learning Modeling

- Train classification models (Logistic Regression, Random Forest, or XGBoost)

- Predict Reached.on.Time_Y.N as Delivery_Status

5. Model Evaluation

- Use accuracy, precision, recall, and F1-score

- Apply cross-validation and feature importance analysis

6. CI/CD Pipeline

- Use GitHub Actions to run unit tests and linting on each push

- Enforce test coverage and code quality via pytest, flake8, and coverage

7. Dashboard Reporting

- Generate business KPIs by warehouse, gender, and risk group in Power BI

- Use filterable visuals and exportable CSVs
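Steps 4 and 5 above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training code: the data is synthetic (the real Train.csv is not loaded here), the toy rule linking discounts to delays is an assumption made only so the model has something to learn, and the choice of Random Forest with 100 trees is one option among the models listed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for Train.csv (illustrative values only, not real data).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "Customer_care_calls": rng.integers(2, 8, n),
    "Customer_rating": rng.integers(1, 6, n),
    "Discount_offered": rng.integers(1, 66, n),
    "Weight_in_gms": rng.integers(1000, 7000, n),
})
# Assumed toy rule: larger discounts are more often delayed (1 = delayed).
demo["Reached.on.Time_Y.N"] = (
    demo["Discount_offered"] + rng.normal(0, 10, n) > 30
).astype(int)

X = demo.drop(columns="Reached.on.Time_Y.N")
y = demo["Reached.on.Time_Y.N"]

# Stratify so both classes keep their proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Precision, recall, and F1 per class, plus overall accuracy.
report = classification_report(
    y_test, model.predict(X_test), target_names=["On Time", "Delayed"]
)
print(report)

# 5-fold cross-validated F1 for the "Delayed" class (label 1).
cv_f1 = cross_val_score(model, X, y, cv=5, scoring="f1")
print("Mean CV F1:", cv_f1.mean())

# Impurity-based feature importances, highest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

On the real dataset, `demo` would be replaced by the cleaned dataframe, and categorical columns such as Warehouse_block would need encoding before fitting.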

**Challenges**

- Imbalanced delivery label classes (delayed vs on-time)

- No time-based features (e.g., delivery date, order timestamp)

- Limited contextual signals (e.g., traffic, region, weather)

- Missing customer feedback sentiment for deeper segmentation

- Operational anomalies (e.g., bulk orders, priority shipments)
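For the class-imbalance challenge, one common mitigation is to weight each class inversely to its frequency. The sketch below computes such "balanced" weights with pandas alone; the ten labels are made-up illustrative values, and the formula `n_samples / (n_classes * class_count)` is the standard balanced-weight heuristic rather than something this project is known to use.

```python
import pandas as pd

# Toy labels (illustrative only): 1 = delayed, 0 = on time.
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], name="Reached.on.Time_Y.N")

# Share of each class, to quantify the imbalance.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Balanced weights: n_samples / (n_classes * class_count).
counts = labels.value_counts()
class_weights = (len(labels) / (counts.size * counts)).to_dict()
print(class_weights)
```

A dictionary like this can be passed to the `class_weight` parameter of scikit-learn classifiers such as `LogisticRegression` or `RandomForestClassifier`, so that errors on the minority class count for more during training.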



**Problem Statement**

How can we proactively identify delivery risks and segment customer interaction profiles to reduce delays and improve satisfaction, using a robust, testable, and automated AI/ML pipeline?

**Pipeline Implementation**

**1. Ingest Raw CSV from Mounted Google Drive**

In [None]:
import pandas as pd

file_path = '/content/drive/My Drive/Staff Engineer, Analytics & Insights/Train.csv'
df = pd.read_csv(file_path)
df.head()


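| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached.on.Time_Y.N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |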
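The ingestion step also calls for handling encoding, data types, and NA values, which the plain `pd.read_csv(file_path)` call leaves to pandas defaults. A minimal sketch of a more explicit loader follows. The helper name `load_deliveries`, the choice of `category` dtypes, and the extra NA markers are all assumptions for illustration; the two sample rows are copied from the `df.head()` output.

```python
import io
import pandas as pd

# Explicit dtypes for the low-cardinality text columns (assumed choice).
DTYPES = {
    "Warehouse_block": "category",
    "Mode_of_Shipment": "category",
    "Product_importance": "category",
    "Gender": "category",
}

def load_deliveries(source, encoding="utf-8"):
    """Read the delivery CSV with explicit encoding, dtypes, and NA markers.

    `source` may be a file path or a file-like object.
    """
    return pd.read_csv(
        source,
        encoding=encoding,
        dtype=DTYPES,
        # Hypothetical extra NA markers, on top of the pandas defaults.
        na_values=["?", "unknown"],
    )

# In-memory sample so the sketch runs without the real file.
sample_csv = io.StringIO(
    "ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,"
    "Cost_of_the_Product,Prior_purchases,Product_importance,Gender,"
    "Discount_offered,Weight_in_gms,Reached.on.Time_Y.N\n"
    "1,D,Flight,4,2,177,3,low,F,44,1233,1\n"
    "2,F,Flight,4,5,216,2,low,M,59,3088,1\n"
)

sample = load_deliveries(sample_csv)
print(sample.dtypes)
print(sample.isna().sum())  # per-column count of missing values
```

With the real file, `load_deliveries(file_path)` would replace the in-memory sample, and the NA counts give a quick check before any rows are dropped.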


**2. Data Quality & Cleaning**

In [None]:
import pandas as pd

# First, update your file_path accordingly:
file_path = '/content/drive/My Drive/Staff Engineer, Analytics & Insights/Train.csv'

# Load the dataframe before using it.
df = pd.read_csv(file_path)

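# Drop rows with missing values, then enforce the expected dtypes.
df.dropna(inplace=True)
df['Customer_rating'] = df['Customer_rating'].astype(int)
df['Cost_of_the_Product'] = df['Cost_of_the_Product'].astype(float)
# Map the binary target to readable labels (0 = on time, 1 = delayed).
df['Reached.on.Time_Y.N'] = df['Reached.on.Time_Y.N'].replace({0: 'On Time', 1: 'Delayed'})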


**3. ETL Transformation (dbt-style logic)**

**Create a feature for operational KPI tracking:**

In [None]:
df['delivery_risk_score'] = (
    df['Customer_care_calls'] * 0.2 +
    df['Customer_rating'] * -0.3 +
    df['Discount_offered'] * 0.5
)
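Since the pipeline aims to be testable under CI, the same weighting can be wrapped in a pure function with a pytest-style check. This is a sketch under assumptions: the function name `add_delivery_risk_score` and the `RISK_WEIGHTS` constant are hypothetical, while the weights themselves are copied from the cell above.

```python
import pandas as pd

# Weights copied from the risk-score cell above.
RISK_WEIGHTS = {
    "Customer_care_calls": 0.2,
    "Customer_rating": -0.3,
    "Discount_offered": 0.5,
}

def add_delivery_risk_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a delivery_risk_score column added."""
    out = frame.copy()
    out["delivery_risk_score"] = sum(
        out[col] * weight for col, weight in RISK_WEIGHTS.items()
    )
    return out

def test_delivery_risk_score():
    # First row of the df.head() output: 4 calls, rating 2, discount 44.
    row = pd.DataFrame({
        "Customer_care_calls": [4],
        "Customer_rating": [2],
        "Discount_offered": [44],
    })
    scored = add_delivery_risk_score(row)
    # 4 * 0.2 + 2 * (-0.3) + 44 * 0.5 = 0.8 - 0.6 + 22.0 = 22.2
    assert abs(scored.loc[0, "delivery_risk_score"] - 22.2) < 1e-9
    # The input frame must not be mutated.
    assert "delivery_risk_score" not in row.columns

test_delivery_risk_score()
```

Keeping the weights in one dictionary means a change to the scoring logic touches a single place, and pytest would discover `test_delivery_risk_score` automatically if the code lived in a test module.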


**Group by warehouse:**

In [None]:
warehouse_summary = df.groupby('Warehouse_block').agg({
    'Reached.on.Time_Y.N': lambda x: (x == 'Delayed').mean(),
    'delivery_risk_score': 'mean',
    'Cost_of_the_Product': 'mean'
}).reset_index()
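The dashboard step mentions KPIs by risk group, which requires turning the continuous score into segments. One possible approach is quantile binning, sketched below on toy data. The three-way split, the Low/Medium/High labels, and the toy rows are assumptions; the source does not say how risk groups are defined.

```python
import io
import pandas as pd

# Toy rows (illustrative only), shaped like the cleaned dataframe.
toy = pd.DataFrame({
    "Warehouse_block": ["A", "A", "B", "B", "C", "C"],
    "delivery_risk_score": [2.0, 5.0, 9.0, 14.0, 20.0, 30.0],
    "Reached.on.Time_Y.N": [
        "On Time", "Delayed", "On Time", "Delayed", "Delayed", "Delayed",
    ],
})

# Tercile-based risk groups (assumed segmentation rule).
toy["risk_group"] = pd.qcut(
    toy["delivery_risk_score"], q=3, labels=["Low", "Medium", "High"]
)

# Order count and delay rate per risk group.
risk_summary = (
    toy.groupby("risk_group", observed=True)
    .agg(
        orders=("delivery_risk_score", "size"),
        delay_rate=("Reached.on.Time_Y.N", lambda s: (s == "Delayed").mean()),
    )
    .reset_index()
)

# Export as CSV text; an in-memory buffer keeps the sketch side-effect free.
buffer = io.StringIO()
risk_summary.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Replacing the buffer with a file path would produce a CSV that Power BI can import as a data source, and the same pattern applies to `warehouse_summary`.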


**Conceptual Enhancement – AGI-Inspired Extensions (Future Work):**

- Multi-Agent Optimization: Simulate warehouse agents that autonomously rebalance inventory and allocate carriers

- Reinforcement Learning: Adaptively modify discount or shipping modes to minimize delay risk

- NLP Integration: Use LLMs to summarize customer complaints into quantifiable inputs

- Real-Time Streaming: Integrate traffic/weather APIs and IoT data streams to adjust risk prediction dynamically

**Reference**

- Kaggle Dataset: Customer Analytics

- scikit-learn: Classification Models

- GitHub Actions Docs: CI/CD Automation

- Power BI: Dashboarding Guide

- Pandas, Streamlit, and Pytest Docs