# 🚀 Sales Churn Prediction - GPU Training Pipeline (Google Colab)

## Complete ML Pipeline with GPU Acceleration

This notebook trains an XGBoost churn prediction model using:
- ✅ **GPU Acceleration** (CUDA)
- ✅ **Pre-computed Optimal Hyperparameters** (no Optuna needed)
- ✅ **MLflow Experiment Tracking**
- ✅ **Production-Ready Model**

### ⚙️ Setup Instructions:
1. **Enable GPU**: Runtime → Change runtime type → Hardware accelerator → **GPU**
2. **Run all cells** in order
3. **Download trained model** at the end

Expected Runtime: **~3-5 minutes** with GPU

## 1️⃣ Check GPU Availability

Verify that a GPU is enabled and available for training.

In [None]:
# Check GPU availability
!nvidia-smi

# Confirm that the NVIDIA driver can actually list a GPU
import subprocess
try:
    result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True)
    gpu_found = result.returncode == 0 and 'GPU' in result.stdout
except FileNotFoundError:
    # nvidia-smi is not installed on CPU-only runtimes
    gpu_found = False

if gpu_found:
    print("\n✓ GPU is available and ready for training!")
else:
    print("\n⚠ GPU not detected. Please enable GPU in Runtime settings.")
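
To see which GPU model the runtime was assigned, the `nvidia-smi -L` listing can be parsed into plain names. This is a minimal sketch: `parse_gpu_list` is a hypothetical helper, the sample line (including its UUID) is illustrative only, and the pattern assumes the usual `GPU <index>: <name> (UUID: ...)` layout.

```python
import re


def parse_gpu_list(listing: str) -> list[str]:
    """Extract GPU names from `nvidia-smi -L` style output."""
    # Assumed line layout: "GPU 0: Tesla T4 (UUID: GPU-...)"
    pattern = re.compile(r"^GPU \d+: (.+?) \(UUID: .+\)$")
    names = []
    for line in listing.splitlines():
        match = pattern.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


# Illustrative sample line, not real output from this notebook
sample = "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)"
print(parse_gpu_list(sample))
```

In the notebook, the text to parse would come from `subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True).stdout`; an empty list means no GPU was listed.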

## 2️⃣ Clone Repository & Set Up Project

Clone your project from GitHub or upload files manually.

In [None]:
# Option A: Clone from GitHub (recommended)
!git clone https://github.com/your-username/sales-data-churn.git
%cd sales-data-churn

# Option B: Upload manually (uncomment if needed)
# from google.colab import files
# uploaded = files.upload()  # Upload your project zip
# !unzip sales-data-churn.zip
# %cd sales-data-churn

## 3️⃣ Install Required Packages

Install XGBoost (with GPU support), MLflow, and other dependencies.

In [None]:
# Install required packages
!pip install -q xgboost scikit-learn pandas numpy mlflow optuna

# Verify the XGBoost version and whether this build was compiled with CUDA
import xgboost as xgb
print(f"XGBoost version: {xgb.__version__}")
print("GPU support (USE_CUDA):", xgb.build_info().get("USE_CUDA"))

## 4️⃣ Verify Data Files

Check that your training and test data are available.

In [None]:
# Check data files
import os
from pathlib import Path

data_files = {
    "train": "data/raw/train.csv",
    "test": "data/raw/test.csv",
}

print("Checking data files:")
for name, path in data_files.items():
    if os.path.exists(path):
        print(f"  ✓ {name}: {path}")
    else:
        print(f"  ✗ {name}: {path} NOT FOUND")

# If files are missing, upload them
# from google.colab import files
# uploaded = files.upload()  # Upload train.csv and test.csv
# !mkdir -p data/raw
# !mv train.csv data/raw/
# !mv test.csv data/raw/
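
An existence check does not confirm that the files are usable. The sketch below goes one step further and compares headers and row counts using only the standard library. `csv_summary` and `columns_missing_from_test` are hypothetical helpers, not part of the project, and the sketch assumes both files are comma-separated with a header row.

```python
import csv
from pathlib import Path


def csv_summary(path):
    """Return (column names, number of data rows) for a CSV file."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


def columns_missing_from_test(train_path, test_path):
    """List training columns that do not appear in the test file."""
    train_cols, _ = csv_summary(train_path)
    test_cols, _ = csv_summary(test_path)
    return [col for col in train_cols if col not in test_cols]


# Paths taken from the cell above; the check is skipped if a file is absent
train_path, test_path = "data/raw/train.csv", "data/raw/test.csv"
if Path(train_path).exists() and Path(test_path).exists():
    print("train:", csv_summary(train_path)[1], "rows")
    print("test:", csv_summary(test_path)[1], "rows")
    print("missing from test:", columns_missing_from_test(train_path, test_path))
```

A test file may legitimately lack the target column, so a non-empty result is a prompt to look at the data, not necessarily an error.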

## 5️⃣ View Pipeline Configuration

Review the pre-configured hyperparameters and training settings.

In [None]:
import json

# Pre-computed optimal hyperparameters (from 250 Optuna trials)
best_params = {
    "booster": "gbtree",
    "lambda": 0.00032762263951052436,
    "alpha": 0.00017370640229832804,
    "max_depth": 7,
    "eta": 0.2960673713462837,
    "gamma": 0.00017131007397068948,
    "min_child_weight": 6,
    "subsample": 0.7605678991335877,
    "colsample_bytree": 0.9988324896159033,
    "colsample_bylevel": 0.7777131466076425,
    "n_estimators": 900,
}

print("📊 Training Configuration:")
print("-" * 50)
print("Model: XGBoost Classifier")
print("Device: CUDA (GPU)")
print("Optimization Metric: Recall")
print("Decision Threshold: 0.35 (probability cutoff for predicting churn)")
print("\n🎯 Hyperparameters:")
print(json.dumps(best_params, indent=2))
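
The hyperparameters above do not say how the GPU is selected, and the right setting depends on the installed XGBoost version: releases from 2.0 onward use `device="cuda"` with `tree_method="hist"`, while 1.x releases use `tree_method="gpu_hist"`. The sketch below picks between the two. `gpu_training_params` is a hypothetical helper, and how `scripts/colab_pipeline.py` actually sets the device is not shown in this notebook.

```python
def gpu_training_params(xgboost_version: str) -> dict:
    """Return GPU-related parameters suited to the given XGBoost version."""
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the GPU through the `device` parameter
        return {"tree_method": "hist", "device": "cuda"}
    # XGBoost 1.x selects the GPU through a dedicated tree method
    return {"tree_method": "gpu_hist"}


# Two of the hyperparameters listed above, merged with the GPU settings
example_params = {"max_depth": 7, "eta": 0.2960673713462837}
params = {**example_params, **gpu_training_params("2.0.3")}
print(params)
```

In the notebook, the version string would come from `xgb.__version__`.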

## 6️⃣ Run Complete Training Pipeline

Execute the full ML pipeline with GPU acceleration.

This will:
1. Load and validate data
2. Preprocess features
3. Train XGBoost model on GPU
4. Track experiments with MLflow
5. Save the best model

In [None]:
# Run the GPU-optimized pipeline
!python scripts/colab_pipeline.py
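
A `!` shell command does not raise a Python exception when the script fails, so "Run all" continues into the later cells even if training crashed. One possible alternative, sketched below, runs the script through `subprocess` and raises on a non-zero exit code. `run_script` is a hypothetical helper; it captures output and prints it only after the script finishes, so there is no live progress.

```python
import subprocess
import sys


def run_script(script_path: str) -> str:
    """Run a Python script, print its output, and raise if it fails."""
    completed = subprocess.run(
        [sys.executable, script_path], capture_output=True, text=True
    )
    print(completed.stdout)
    if completed.returncode != 0:
        print(completed.stderr, file=sys.stderr)
        raise RuntimeError(
            f"{script_path} failed with exit code {completed.returncode}"
        )
    return completed.stdout


# In the notebook: run_script("scripts/colab_pipeline.py")
```

Because the exception stops execution, the results and download cells will not run against a missing or stale model.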

## 7️⃣ View Training Results

Load the trained model and display its basic properties.

In [None]:
import pickle
import os

# Find the trained model (sorted so the choice is deterministic)
model_dir = "models"
model_files = []
if os.path.isdir(model_dir):
    model_files = sorted(f for f in os.listdir(model_dir) if f.endswith('.pkl'))

if model_files:
    model_path = os.path.join(model_dir, model_files[0])
    print(f"✓ Model found: {model_path}")

    # Load model (optional - just to verify); only unpickle files you trust
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print(f"  Model type: {type(model).__name__}")
    print(f"  Number of features: {model.n_features_in_}")
    print(f"  Number of trees: {model.n_estimators}")
else:
    print("⚠ No model file found. Check pipeline execution above.")
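
If the `models` directory accumulates files from several runs, taking the first `.pkl` may not give the model that was just trained. A minimal sketch that picks the most recently modified file instead is shown below. `newest_model` is a hypothetical helper, and it assumes modification time reflects training order.

```python
from pathlib import Path


def newest_model(model_dir="models", pattern="*.pkl"):
    """Return the most recently modified model file, or None if there is none."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob(pattern))
    if not candidates:
        return None
    # Compare files by last-modified timestamp
    return max(candidates, key=lambda path: path.stat().st_mtime)


print(newest_model())
```

The returned `Path` can be passed straight to `open(..., 'rb')` in place of `model_path`.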

## 8️⃣ Download Trained Model

Download your trained model to your local machine.

In [None]:
# Download the trained model
import os
from google.colab import files

model_dir = "models"
model_files = sorted(f for f in os.listdir(model_dir) if f.endswith('.pkl'))

if model_files:
    for model_file in model_files:
        model_path = os.path.join(model_dir, model_file)
        print(f"Downloading: {model_path}")
        files.download(model_path)
    print("✓ Download complete!")
else:
    print("⚠ No model files to download")
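
When there are several model files, bundling them into one archive means a single download. The sketch below zips the whole directory with the standard library. `bundle_models` and the archive name `churn_models` are illustrative choices, not part of the project.

```python
import shutil


def bundle_models(model_dir="models", archive_stem="churn_models"):
    """Zip the contents of model_dir and return the path of the archive."""
    # make_archive appends ".zip" to archive_stem and returns the full path
    return shutil.make_archive(archive_stem, "zip", root_dir=model_dir)


# In Colab, after training:
# from google.colab import files
# files.download(bundle_models())
```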

## 9️⃣ Optional: Save to Google Drive

Mount Google Drive and save all outputs for persistence across sessions.

In [None]:
# Mount Google Drive (optional)
from google.colab import drive
drive.mount('/content/drive')

# Copy models and outputs to Drive
import os
import shutil
drive_path = "/content/drive/MyDrive/churn_models"

try:
    os.makedirs(drive_path, exist_ok=True)

    # Copy models
    if os.path.exists("models"):
        shutil.copytree("models", f"{drive_path}/models", dirs_exist_ok=True)
        print(f"✓ Models saved to: {drive_path}/models")

    # Copy MLflow runs
    if os.path.exists("mlruns"):
        shutil.copytree("mlruns", f"{drive_path}/mlruns", dirs_exist_ok=True)
        print(f"✓ MLflow experiments saved to: {drive_path}/mlruns")

    print("\n✓ All outputs saved to Google Drive!")
except Exception as e:
    print(f"⚠ Error saving to Drive: {e}")

## ✅ Pipeline Complete!

### 🎉 What You've Accomplished:
- ✓ Trained an XGBoost model with GPU acceleration
- ✓ Used pre-optimized hyperparameters (900 trees)
- ✓ Optimized the model for recall on churn prediction
- ✓ Tracked experiments with MLflow
- ✓ Downloaded the trained model

### 📊 Next Steps:
1. **Test the model** on holdout data
2. **Deploy to production** using the downloaded .pkl file
3. **Monitor performance** and retrain as needed

### 🔧 Customization:
To modify training parameters, edit `scripts/colab_pipeline.py`:
- Change `THRESHOLD_VALUE` for different precision/recall tradeoff
- Adjust `PREPROCESSING_STRATEGY` for data handling
- Modify `BEST_PARAMS` to try different hyperparameters
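
To make the threshold trade-off concrete, the toy sketch below applies two cutoffs to the same predicted probabilities and compares recall and precision. The labels and probabilities are made-up values, not output from this pipeline, and the `>=` comparison is an assumption about how `THRESHOLD_VALUE` is applied.

```python
def apply_threshold(probabilities, threshold=0.35):
    """Label a customer as churn (1) when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def recall(y_true, y_pred):
    """Share of actual churners that were predicted as churn."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


def precision(y_true, y_pred):
    """Share of churn predictions that were actual churners."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / predicted_pos if predicted_pos else 0.0


# Made-up labels and churn probabilities for six customers
y_true = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.4, 0.2, 0.45, 0.1, 0.05]

for threshold in (0.5, 0.35):
    y_pred = apply_threshold(proba, threshold)
    print(threshold, round(recall(y_true, y_pred), 2), round(precision(y_true, y_pred), 2))
```

On this toy data, lowering the cutoff from 0.5 to 0.35 catches one more churner at the cost of one false alarm, which is the trade-off a recall-oriented threshold accepts.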

---

**Training Time:** ~3-5 minutes with GPU  
**Model Performance:** Optimized for maximum recall (catching churners)  
**GPU Utilization:** Full CUDA acceleration for XGBoost