***How to Do Experiment Tracking:***

1. Go to the VS Code Terminal.
2. Ensure the Conda environment is active.
3. Navigate to the root directory of the project (where the `mlruns` folder is located).
4. Type the command `mlflow ui` and hit Enter.
5. Wait for a message saying "Serving on http://127.0.0.1:5000".
6. Ctrl + Click that link (or open a web browser and type http://localhost:5000).
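
Step 3 matters because, assuming the default local file store, `mlflow ui` reads runs from the `mlruns` folder of the directory it is launched from. As a minimal sketch, the helper below walks up from the current directory until it finds a folder that contains `mlruns`. The name `find_mlruns_root` is hypothetical and is not part of MLflow or of this project.

```python
from pathlib import Path
from typing import Optional


def find_mlruns_root(start: Path) -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains an `mlruns` folder."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / "mlruns").is_dir():
            return candidate
    return None


# Example: report where `mlflow ui` should be launched from.
root = find_mlruns_root(Path.cwd())
if root is None:
    print("No mlruns folder found; run at least one tracked experiment first.")
else:
    print(f"Run `mlflow ui` from: {root}")
```

If the helper reports no folder, the UI will most likely open with an empty experiment list.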

# 3.1 Business Assumptions 

The datasets used in this project were obtained from a Sri Lankan telco company. The general estimates below were therefore derived from **Dialog Axiata's FY2024 Financial Statements** and official reports of the **TRCSL (Telecommunications Regulatory Commission of Sri Lanka)**.

## Data Sources & References:

  * **Revenue & Subscriber Base:**

      * **Source:** *Dialog Axiata PLC Annual Report 2024*.
      * **Data:** Revenue of **Rs. 171.17 Billion** and a total subscriber base of **19.1 Million**.
      * **Link:** [Dialog Axiata PLC Annual Report 2024 (Colombo Stock Exchange)](https://cdn.cse.lk/cmt/upload_report_file/389_1747616410421.pdf)

  * **Prepaid vs. Postpaid Split (92% Prepaid):**

      * **Source:** *Dialog Axiata Fact Sheet (2024)*.
      * **Data:** 17.5M Prepaid users vs. 1.5M Postpaid users.
      * **Link:** [Dialog Axiata Fact Sheet](https://www.dialog.lk/fact-sheet)

  * **SMS Pricing (Cost of Contact):**

      * **Source:** *Dialog Enterprise / Third-Party Bulk SMS Rates*.
      * **Data:** Standard commercial bulk SMS rates in Sri Lanka range from **LKR 0.50 to LKR 1.00** per SMS.
      * **Link:** [Sri Lanka SMS Pricing Benchmarks](https://www.sent.dm/resources/sri-lanka-sms-pricing)

  * **Industry Market Context:**

      * **Source:** *Telecommunications Regulatory Commission of Sri Lanka (TRCSL)*.
      * **Data:** Confirmation of mobile penetration rates and competitive operator landscape (Dialog, SLT-Mobitel, Hutch).
      * **Link:** [TRCSL Telecom Statistics 2024/25](https://www.trc.gov.lk/pages_e.php?id=12)
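
The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted in this list; the constant names are illustrative. If the revenue figure covers more than mobile services, the resulting ARPU is a rough blended estimate rather than a prepaid-specific value.

```python
# Figures quoted in the sources above (rounded as published there).
REVENUE_LKR = 171.17e9      # FY2024 revenue
SUBSCRIBERS = 19.1e6        # total subscriber base
PREPAID_USERS = 17.5e6
POSTPAID_USERS = 1.5e6

# Prepaid share of the prepaid + postpaid total.
prepaid_share = PREPAID_USERS / (PREPAID_USERS + POSTPAID_USERS)

# Average revenue per user per month (ARPU).
monthly_arpu_lkr = REVENUE_LKR / SUBSCRIBERS / 12

print(f"Prepaid share: {prepaid_share:.1%}")
print(f"Monthly ARPU:  Rs. {monthly_arpu_lkr:,.2f}")
```

Both results are consistent with the 92% prepaid split and with the ARPU of roughly Rs. 750 used in the assumptions.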

In [None]:
# --- SRI LANKA BUSINESS ASSUMPTIONS (2024/25 DATA) ---
# Sources: Dialog Axiata PLC Annual Report 2024 & TRCSL Statistics

# 1. ARPU (Average Revenue Per User)
# Source: Dialog Axiata FY2024 Revenue (Rs. 171.2Bn) / Subscribers (19.1Mn)
# Calculation: 171,171,000,000 / 19,097,715 / 12 months = Rs. 746.90
ARPU_LKR = 750.00 

# 2. Retention Period
# Logic: Prepaid churn in developing markets ranges from 3-6% monthly.
# "Saved" customers are higher risk, so we discount the standard lifetime 
# to a conservative 12-month period.
RETENTION_PERIOD_MONTHS = 12

# 3. Cost of Contact (SMS)
# Source: Local bulk SMS rates (Dialog/Mobitel Enterprise Rates)
# Rate: Approx Rs. 0.50 - 1.00 per SMS. 
# Campaign: 3 SMS sequence + Overhead = Rs. 3.00
COST_CONTACT_LKR = 3.00

# 4. Cost of Offer (The "Save" Incentive)
# Logic: We apply a "Gold Tier" retention offer (30% discount), aligning with
# Dialog's standard loyalty benefits for mid-value customers.
# Calculation: Rs. 750 (ARPU) * 30% = Rs. 225.
COST_OFFER_LKR = 225.00

# 5. Acceptance Rate
# Based on price elasticity in the Sri Lankan prepaid market, uptake drops slightly.
# We estimate a conservative 25% acceptance rate for this tier.
ACCEPTANCE_RATE = 0.25

# --- DERIVED METRICS ---
LTV_LKR = ARPU_LKR * RETENTION_PERIOD_MONTHS # ~Rs. 9,000
BREAKEVEN_PROB = (COST_OFFER_LKR + COST_CONTACT_LKR) / (LTV_LKR * ACCEPTANCE_RATE)

print("--- 🇱🇰 SRI LANKA MARKET CONTEXT (FY2024) ---")
print(f"Verified ARPU: Rs. {ARPU_LKR:,.2f} (Based on Dialog FY24 Reports)")
print(f"Estimated LTV: Rs. {LTV_LKR:,.2f}")
print(f"Campaign Cost: Rs. {COST_CONTACT_LKR:.2f} (3x SMS)")
print(f"Offer Cost:    Rs. {COST_OFFER_LKR:.2f} (30% Discount)")
print(f"Breakeven Probability: {BREAKEVEN_PROB:.1%} (We profit if churn probability > {BREAKEVEN_PROB:.1%})")

--- 🇱🇰 SRI LANKA MARKET CONTEXT (FY2024) ---
Verified ARPU: Rs. 750.00 (Based on Dialog FY24 Reports)
Estimated LTV: Rs. 9,000.00
Campaign Cost: Rs. 3.00 (3x SMS)
Offer Cost:    Rs. 225.00 (30% Discount)
Breakeven Probability: 10.1% (We profit if churn probability > 10.1%)
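
The breakeven formula follows from requiring the expected profit of contacting one customer to be non-negative. The sketch below spells that out under the assumptions implied by the cell above: every contacted customer incurs the full offer and contact cost, and a churner who accepts the offer is retained for the full LTV. The function names are illustrative and are not part of the project package.

```python
def expected_profit_per_contact(churn_prob, ltv, acceptance, cost):
    """Expected profit (LKR) of contacting one customer with churn probability `churn_prob`."""
    return churn_prob * acceptance * ltv - cost


def breakeven_probability(ltv, acceptance, cost):
    """Churn probability at which the expected profit of a contact is exactly zero."""
    return cost / (ltv * acceptance)


LTV = 750.00 * 12        # ARPU x retention period
COST = 225.00 + 3.00     # offer cost + contact cost
ACCEPTANCE = 0.25

p_star = breakeven_probability(LTV, ACCEPTANCE, COST)
print(f"Breakeven probability: {p_star:.1%}")
print(f"Expected profit at p=0.05: {expected_profit_per_contact(0.05, LTV, ACCEPTANCE, COST):+.2f}")
print(f"Expected profit at p=0.30: {expected_profit_per_contact(0.30, LTV, ACCEPTANCE, COST):+.2f}")
```

Customers scored below the breakeven probability lose money on average when contacted, while those above it are expected to be profitable.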


# 3.2 Library & Experiment Tracking Setup

In [6]:
# Essential Libraries
import pandas as pd
import numpy as np

In [None]:
# Hyperparameter Optimization
import optuna

In [None]:
# Initialize experiment tracking
import mlflow
import sys
import os

# Add project root to path (standard setup)
sys.path.append(os.path.abspath('..'))

# Initialize MLflow
# Import the helper from this project's own package
from telco_customer_churn_prediction import configure_mlflow

configure_mlflow()

In [None]:
# Preprocessing and Modeling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import make_scorer, recall_score, precision_score, f1_score, accuracy_score

from functools import partial

In [None]:
# Module for Evaluation 
%load_ext autoreload
%autoreload 2

from telco_customer_churn_prediction.modeling import (
    profit_calculator,
    advanced_churn_evaluation, 
    run_sensitivity_analysis,
    )

2025-12-03 18:29:37.482 | INFO     | telco_customer_churn_prediction.config:<module>:11 - PROJ_ROOT path is: E:\Data_Science\Repositories\telco-customer-churn-prediction


In [10]:
# Garbage Collection
import gc

# 3.3 Data Loading & Preprocessing

In [16]:
# Load the dataset for training
train_df = pd.read_parquet("../data/processed/train_df.parquet", engine="fastparquet")

# Define Selected Features
selected_features=['trend_data_w4_vs_w1', 'data_gini_coefficient', 'trend_spend_w4_vs_w1',
                   'trend_data_w3_vs_w1', 'data_volatility_shift', 'peak_spend_week',
                   'spend_volatility_shift', 'lowest_data_week', 'trend_spend_w2_vs_w1',
                   'peak_data_week', 'trend_spend_w3_vs_w1', 'trend_data_w2_vs_w1',
                   'ratio_min_daily_data_to_avg', 'pct_video_w4', 'spend_consistency_score',
                   'pct_messaging_w4', 'pct_messaging_w3', 'pct_video_w3', 'pct_messaging_w2',
                   'ratio_min_daily_spend_to_avg', 'pct_messaging_w1', 'pct_video_w2', 'pct_video_w1']

X_train = train_df[selected_features]
y_train = train_df['churn']
del train_df
gc.collect()
print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")

Training data shape: (52004, 23), Training labels shape: (52004,)


# 3.4 Logistic Regression

## 3.4.1 Hyperparameter-Tuning

In [None]:
# Create a version of the profit_calculator function with specific numbers
profit_scorer = partial(profit_calculator, ltv=LTV_LKR,
                        cost=COST_OFFER_LKR+COST_CONTACT_LKR,
                        acceptance=ACCEPTANCE_RATE)

# Convert it to a Scikit-Learn Scorer
profit = make_scorer(profit_scorer)
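
`make_scorer` expects a function of the form `score_func(y_true, y_pred, **kwargs)`. The real `profit_calculator` lives in the project package and is not shown here, so the function below is only a hypothetical stand-in illustrating what a profit metric with the same keyword arguments (`ltv`, `cost`, `acceptance`) might compute. It assumes that every predicted churner is contacted and that only true churners can be saved.

```python
def profit_metric_sketch(y_true, y_pred, ltv, cost, acceptance):
    """Hypothetical profit metric: net LKR from contacting every predicted churner."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # A correctly targeted churner is retained with probability `acceptance`.
    gain = true_positives * (ltv * acceptance - cost)
    # A targeted non-churner only costs the campaign spend.
    loss = false_positives * cost
    return gain - loss


# Toy example: one true positive, one false negative, one false positive, one true negative.
example_profit = profit_metric_sketch(
    y_true=[1, 1, 0, 0],
    y_pred=[1, 0, 1, 0],
    ltv=9000.0,
    cost=228.0,
    acceptance=0.25,
)
print(f"Example profit: Rs. {example_profit:,.2f}")
```

Because a higher value is better, a function like this can be wrapped with `make_scorer` using its default `greater_is_better=True`.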

In [None]:
# Define the objective function for Optuna
def objective(trial):
    with mlflow.start_run(nested=True):
        # 1. Define Strategy for Handling Class Imbalance
        # "Do we fix imbalance using Weights (Model) or SMOTE (Data)?"
        imbalance_strategy = trial.suggest_categorical('imbalance_strategy', ['class_weight', 'smote'])
        
        # Initialize variables
        smote_step = None
        lr_class_weight = None
        
        if imbalance_strategy == 'class_weight':
            # Option A: Use Class Weights
            lr_class_weight = 'balanced'
            # We must still put a placeholder step for SMOTE in the pipeline or handle it conditionally
            # To keep pipeline structure consistent, we can set SMOTE to 'passthrough' (do nothing)
            smote_step = 'passthrough' 
            
        else:
            # Option B: Use SMOTE
            lr_class_weight = None # Let the model see raw counts
            
            # Tune the ratio: 
            # 0.3 (23/77) to 0.47 (32/68)
            # This number represents: N_minority / N_majority
            target_ratio = trial.suggest_float('smote_ratio', 0.30, 0.47)
            smote_step = SMOTE(sampling_strategy=target_ratio, random_state=42)
            
        # 2. Define the Other Hyperparameters Search Space
        solver = trial.suggest_categorical('solver', ['lbfgs', 'saga'])
        
        # Logic for penalty compatibility
        if solver == 'lbfgs':
            penalty = trial.suggest_categorical('penalty_lbfgs', ['l2', None])
        else: # saga
            penalty = trial.suggest_categorical('penalty_saga', ['elasticnet', 'l1', 'l2', None])
        
        C = trial.suggest_float('C', 0.001, 100, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0.0, 1.0) if penalty == 'elasticnet' else None
        max_iter = trial.suggest_int('max_iter', 1000, 2000)
        
        # 3. Log Parameters to MLflow
        params = trial.params
        
        if l1_ratio is not None:
            params['l1_ratio'] = l1_ratio
            
        mlflow.log_params(params)

        # 4. Build Pipeline
        # We use the pipeline to ensure transformations happen INSIDE the CV fold
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()), 
            ('yeo_johnson', PowerTransformer(method='yeo-johnson')),
            ('smote', smote_step),
            ('classifier', LogisticRegression(
                random_state=42,
                class_weight=lr_class_weight,
                solver=solver,
                penalty=penalty,
                C=C,
                l1_ratio=l1_ratio,
                max_iter=max_iter,
                n_jobs=-1 # Parallel processing for speed
            ))
        ])

        # 5. Cross-Validation with multiple metrics
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        
        # Define the metrics we want to track
        scoring = {
            'recall': 'recall',
            'precision': 'precision',
            'f1': 'f1',
            'accuracy': 'accuracy',
            'profit': profit
        }
        
        # scores is a dictionary containing arrays of per-fold scores
        scores = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)

        # 6. Log Metrics
        mlflow.log_metric("mean_recall", scores['test_recall'].mean())
        mlflow.log_metric("mean_precision", scores['test_precision'].mean())
        mlflow.log_metric("mean_f1", scores['test_f1'].mean())
        mlflow.log_metric("mean_accuracy", scores['test_accuracy'].mean())
        mlflow.log_metric("mean_profit", scores['test_profit'].mean())
        
        return scores['test_recall'].mean()
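
The `smote_ratio` bounds in the objective are easier to read as class shares. For a binary target, a float `sampling_strategy` in SMOTE is the desired ratio N_minority / N_majority after resampling, so a target minority share converts to that ratio as shown in this small sketch. The helper name is illustrative.

```python
def minority_share_to_ratio(minority_share):
    """Convert a target minority share of the dataset into N_minority / N_majority."""
    if not 0.0 < minority_share < 1.0:
        raise ValueError("minority_share must be strictly between 0 and 1")
    return minority_share / (1.0 - minority_share)


# The two bounds used in the search space: a 23/77 split and a 32/68 split.
for share in (0.23, 0.32):
    print(f"{share:.0%} minority -> ratio {minority_share_to_ratio(share):.2f}")
```

This reproduces the 0.30 and 0.47 limits quoted in the comment inside the objective function.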

In [None]:
# Run Optuna Optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)

print(f"Best Recall: {study.best_value:.4f}")
print(f"Best Params: {study.best_params}")

*Note: In Telco churn, Recall is often prioritized over Precision. It is usually less damaging to offer a discount to a loyal customer (False Positive) than to lose a customer because we failed to identify them (False Negative).*
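
The asymmetry described in this note can be quantified with the business assumptions from Section 3.1. The sketch below is a rough illustration, assuming that a false positive costs one campaign (offer plus contact) and that a false negative forfeits the expected net gain of a successful save. The variable names are illustrative.

```python
LTV_LKR = 9000.00            # ARPU x retention period
ACCEPTANCE_RATE = 0.25
CAMPAIGN_COST_LKR = 228.00   # offer cost + contact cost

# False positive: a loyal customer receives an offer they did not need.
cost_false_positive = CAMPAIGN_COST_LKR

# False negative: a churner is never contacted, so the expected net gain is lost.
cost_false_negative = LTV_LKR * ACCEPTANCE_RATE - CAMPAIGN_COST_LKR

asymmetry = cost_false_negative / cost_false_positive
print(f"Cost of a false positive: Rs. {cost_false_positive:,.2f}")
print(f"Cost of a false negative: Rs. {cost_false_negative:,.2f}")
print(f"A missed churner costs about {asymmetry:.1f}x as much as a wasted offer")
```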

## 3.4.2 Final-Model Building

In [None]:
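# Extract best params
best_params = study.best_params.copy()

# Rebuild the imbalance handling chosen by the study.
# 'class_weight' is not a tuned parameter; it follows from 'imbalance_strategy'.
if best_params['imbalance_strategy'] == 'class_weight':
    final_class_weight = 'balanced'
    final_smote_step = 'passthrough'
else:
    final_class_weight = None
    final_smote_step = SMOTE(sampling_strategy=best_params['smote_ratio'], random_state=42)

# Clean up the param dictionary for the model (handling the conditional penalty names)
final_params = {
    'solver': best_params['solver'],
    'C': best_params['C'],
    'max_iter': best_params['max_iter'],
    'class_weight': final_class_weight,
    'random_state': 42,
    'n_jobs': -1
}

# Assign the correct penalty param name back to 'penalty'
if best_params['solver'] == 'lbfgs':
    final_params['penalty'] = best_params['penalty_lbfgs']
else:
    final_params['penalty'] = best_params['penalty_saga']
    if final_params['penalty'] == 'elasticnet':
        final_params['l1_ratio'] = best_params['l1_ratio']

# Create final pipeline (same steps as in the tuning objective, including SMOTE)
final_model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('yeo_johnson', PowerTransformer(method='yeo-johnson')),
    ('smote', final_smote_step),
    ('classifier', LogisticRegression(**final_params))
])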

In [None]:
# Fit on full training data
final_model.fit(X_train, y_train)

## 3.4.3 Evaluation on the Training Set

### 3.4.3.1 Advanced Business Evaluation

In [None]:
# Advanced Business Evaluation
# Note: Inputs are defined in LKR
advanced_churn_evaluation(
    model=final_model, 
    X=X_train, 
    y=y_train,
    model_name="Logistic Regression",
    ltv=LTV_LKR,
    cost_offer=COST_OFFER_LKR,
    cost_contact=COST_CONTACT_LKR,
    acceptance_rate=ACCEPTANCE_RATE
)

### 3.4.3.2 Sensitivity Analysis

In [None]:
# Sensitivity Analysis
print("Running Sensitivity Analysis on Campaign Acceptance Rates...")
run_sensitivity_analysis(final_model, X_train, y_train)
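
The internals of `run_sensitivity_analysis` are defined in the project package and are not shown here. As a rough, model-free illustration of why the acceptance rate matters, the hypothetical sketch below recomputes the breakeven churn probability from Section 3.1 for several acceptance rates, keeping LTV and campaign cost fixed. The acceptance rates other than 0.25 are arbitrary example values.

```python
def breakeven_by_acceptance(ltv, cost, acceptance_rates):
    """Map each acceptance rate to the churn probability at which a contact breaks even."""
    return {rate: cost / (ltv * rate) for rate in acceptance_rates}


table = breakeven_by_acceptance(
    ltv=9000.0,
    cost=228.0,
    acceptance_rates=[0.15, 0.20, 0.25, 0.30, 0.35],
)
for rate, breakeven in table.items():
    print(f"acceptance {rate:.0%} -> breakeven churn probability {breakeven:.1%}")
```

The lower the acceptance rate, the more confident the model has to be before contacting a customer pays off.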

### 3.4.3.3 Error Analysis

In [None]:
# Error Analysis
# Get predictions
y_train_pred = final_model.predict(X_train)
y_train_prob = final_model.predict_proba(X_train)[:, 1]

# Create a dataframe of errors
errors_train = X_train.copy()
errors_train['True_Label'] = y_train
errors_train['Pred_Prob']  = y_train_prob
errors_train['Prediction'] = y_train_pred

# Filter: High Probability of Churn (> 80%) but DID NOT Churn (False Positive)
# These are the "Happy Customers" we annoyed with an offer.
false_positives_train = errors_train[(errors_train['True_Label'] == 0) & (errors_train['Pred_Prob'] > 0.8)]

# Filter: Low Probability of Churn (< 20%) but DID Churn (False Negative)
# These are the "Silent Leavers" we missed. Costly!
false_negatives_train = errors_train[(errors_train['True_Label'] == 1) & (errors_train['Pred_Prob'] < 0.2)]

print(f"High Confidence False Positives: {len(false_positives_train)}")
print(f"High Confidence False Negatives: {len(false_negatives_train)}")

# Inspect the averages of False Negatives to see what we missed
print("Profile of Missed Churners (False Negatives):")
print(false_negatives_train[['trend_data_w4_vs_w1', 'spend_volatility_shift']].mean())

### 3.4.3.4 Logging the Evaluation Metrics

In [None]:
# Create a dedicated run for the final model's evaluation on the training set

with mlflow.start_run(run_name="Final_Model_Evaluation_for_Training"):
    # Log the best params again
    mlflow.log_params(best_params)
    
    # Log the "deep" metrics calculated on the training set.
    # NOTE: The values below are illustrative placeholders. Replace them with the
    # figures computed during the evaluation above (amounts are in LKR).
    mlflow.log_metric("train_roi_percentage", 150.5) 
    mlflow.log_metric("train_max_profit_lkr", 45000000)
    mlflow.log_metric("train_top_decile_lift", 2.8)
    
    # Save the Sensitivity Plot as an image
    # plt.savefig("sensitivity_chart.png")
    # mlflow.log_artifact("sensitivity_chart.png")
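
Rather than typing a lift value by hand, the top-decile lift could be computed from the labels and predicted probabilities. The function below is a minimal sketch of one common definition (churn rate among the 10% highest-scored customers divided by the overall churn rate). It is not the project's own implementation, and the toy data is made up for illustration.

```python
def top_decile_lift(y_true, y_prob):
    """Churn rate in the top 10% of scores divided by the overall churn rate."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    # Rank customers from highest to lowest predicted churn probability.
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no churners")
    return top_rate / base_rate


# Toy data: 10 customers, 2 churners, the highest-scored customer is a churner.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.4, 0.2, 0.5]
lift = top_decile_lift(labels, scores)
print(f"Top-decile lift: {lift:.1f}")
```

With the notebook's variables, the result of `top_decile_lift(list(y_train), list(y_train_prob))` could be passed to `mlflow.log_metric` in place of the hard-coded value.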

***Insights:***

* The model struggles to identify churners whose spending volatility is stable (`spend_volatility_shift` near 0) but whose data usage drops suddenly in Week 4. A dedicated 'Sudden Week 4 Drop' feature may be needed in version 2.0.

## 3.4.4 Final Evaluation

*This is only performed once after finding the best final logistic regression model that will not be changed again!*

# 3.5 XGBoost