✅ **Module 3.5: CI/CD Automation with MLflow** 


### 🧠 **Learning Objectives**

### 1️⃣ **Automate Retraining by Monitoring Parameter Files (e.g., params.yaml)**

* **What it means:**
  Automatically starting a new training run whenever your model parameters or hyperparameters are updated. Parameters are usually saved in a simple configuration file like `params.yaml`.

* **Detailed Steps:**

  * Keep parameters (like `alpha`, `learning_rate`, or number of iterations) in a separate file (`params.yaml`).
  * Set up an automation system to detect changes to this file.
  * Trigger automatic model retraining when parameter updates occur, ensuring the latest configurations are always used.

* **Why it matters:**
  This automation ensures consistency and removes manual intervention, reducing mistakes and making experimentation and deployment faster and smoother.
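
One way to sketch the "detect changes" step locally is to compare a hash of the parameter file against the hash saved on the previous check. The helper names (`file_digest`, `params_changed`) and the idea of a separate state file are assumptions made for illustration, not part of MLflow or of this module's code.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def params_changed(params_path, state_path):
    """Return True if params_path differs from the digest saved in state_path.

    The current digest is written to state_path on every call, so a second
    call with an unchanged file returns False.
    """
    current = file_digest(params_path)
    state = Path(state_path)
    previous = state.read_text().strip() if state.exists() else None
    state.write_text(current)
    return current != previous
```

A scheduled job or watcher loop could call `params_changed("config/params.yaml", ".params.sha256")` and start a training run only when it returns `True`. In a hosted CI system the platform usually does this comparison for you, based on the files touched by a commit.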

---

### 2️⃣ **Simulate CI/CD Triggers (e.g., via GitHub Actions)**

* **What it means:**
  Automatically triggering actions (like model retraining or testing) whenever certain conditions are met—like pushing code changes to GitHub or updating parameter files.

* **Detailed Steps:**

  * Set up GitHub Actions to monitor changes in your repository.
  * Define specific triggers (e.g., file changes, commits, or merges):

    ```yaml
    on:
      push:
        paths:
          - 'config/params.yaml'
    ```
  * Automate tasks like installing dependencies, retraining models, logging results, and more upon trigger.

* **Why it matters:**
  It enables automation and rapid iteration, significantly improving development workflow efficiency, reproducibility, and collaboration.
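
To reason about when a path-filtered trigger would fire, the check can be simulated in plain Python. This is only a rough approximation: GitHub Actions uses its own glob syntax for `paths`, while the sketch below uses the standard library's `fnmatch` rules, and `should_trigger` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def should_trigger(changed_files, path_patterns):
    """Return True if any changed file matches any watched pattern."""
    return any(
        fnmatchcase(changed, pattern)
        for changed in changed_files
        for pattern in path_patterns
    )


watched = ["config/params.yaml"]

# A push that touches the parameter file would start the workflow.
print(should_trigger(["README.md", "config/params.yaml"], watched))  # True

# A push that only touches other files would not.
print(should_trigger(["README.md", "train.py"], watched))  # False
```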

---

### 3️⃣ **Structure Training Scripts to be Triggered Programmatically**

* **What it means:**
  Organizing your training script so that an automation system can run it end to end with a single command, with no manual steps or interactive input.

* **Detailed Steps:**

  * Write a `train.py` script that reads parameters from a file:

    ```python
    import yaml
    import mlflow

    with open("config/params.yaml") as f:
        params = yaml.safe_load(f)

    # use params in training
    ```
  * Ensure the script is self-contained—loading data, training the model, logging metrics and models to MLflow automatically when executed.

* **Why it matters:**
  Clear, structured scripts simplify integration into automated workflows, improving maintainability and reproducibility.
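
A script is easier to trigger from a pipeline when it exposes a `main()` function that takes its arguments explicitly and returns an exit code, since CI systems decide success or failure from that code. The following is a minimal sketch under assumptions: the `--params` flag, the `REQUIRED_KEYS` tuple, and the function names are illustrative choices, and the training step itself is left as a placeholder.

```python
import argparse

# Hypothetical: the keys used by this module's example params.yaml.
REQUIRED_KEYS = ("alpha", "max_iter")


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Train a model from a params file.")
    parser.add_argument("--params", default="config/params.yaml",
                        help="path to the YAML parameter file")
    return parser.parse_args(argv)


def validate_params(params):
    """Raise ValueError if params is not a mapping or lacks a required key."""
    if not isinstance(params, dict):
        raise ValueError("params file must contain a mapping")
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise ValueError(f"missing parameters: {', '.join(missing)}")
    return params


def load_params(path):
    """Read and validate the YAML parameter file."""
    import yaml  # PyYAML; imported here so the helpers above work without it

    with open(path) as f:
        return validate_params(yaml.safe_load(f))


def main(argv=None):
    """Run the pipeline and return a process exit code (0 = success)."""
    args = parse_args(argv)
    try:
        params = load_params(args.params)
    except (OSError, ValueError) as exc:
        print(f"error: {exc}")
        return 1
    # Placeholder: data loading, training, and MLflow logging would go here.
    print(f"training with {params}")
    return 0
```

A real `train.py` would typically end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard, so that `python train.py --params config/params.yaml` fails the CI job when the parameter file is missing or incomplete.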

---

### 4️⃣ **Log Updated Models to MLflow from Automation Pipelines**

* **What it means:**
  Automatically recording every training run and its results (parameters, metrics, artifacts) in MLflow whenever automation triggers model retraining.

* **Detailed Steps:**

  * In your automated `train.py` script, add MLflow logging:

    ```python
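    import mlflow
    import mlflow.sklearn

    # `params`, `model`, and `accuracy` come from the training code earlier in train.py
    mlflow.set_experiment("cicd-automation")

    with mlflow.start_run():
        mlflow.log_param("alpha", params["alpha"])
        mlflow.log_metric("accuracy", accuracy)
        mlflow.sklearn.log_model(model, "model")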
    ```
  * Every triggered run automatically logs its details to MLflow.

* **Why it matters:**
  It provides a complete history of your training runs, so models are easy to compare and track, and it makes the model development process more reproducible and auditable.
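
For auditability it can help to record which commit and which CI job produced each run. The sketch below collects that metadata from environment variables that GitHub Actions sets by default (`GITHUB_SHA`, `GITHUB_RUN_ID`, `GITHUB_EVENT_NAME`). The helper name `ci_tags`, the tag names, and the `"local"` fallback are assumptions for illustration.

```python
import os


def ci_tags(env=None):
    """Collect CI metadata from GitHub Actions' default environment variables."""
    env = os.environ if env is None else env
    mapping = {
        "git_sha": "GITHUB_SHA",
        "ci_run_id": "GITHUB_RUN_ID",
        "ci_event": "GITHUB_EVENT_NAME",
    }
    # Outside GitHub Actions these variables are unset, so mark such runs as local.
    return {tag: env.get(var, "local") for tag, var in mapping.items()}
```

Inside the `with mlflow.start_run():` block, calling `mlflow.set_tags(ci_tags())` would attach these values to the run, so each logged model can be traced back to the commit that triggered it.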


```python
# 📓 Module 3.5: CI/CD Automation with MLflow
# Goal: Simulate a CI/CD workflow that automatically retrains and logs models when parameters change

# ✅ Step 1: Install dependencies
!pip install -q mlflow scikit-learn pyyaml

# ✅ Step 2: Create params.yaml (simulating user-updated config file)
import yaml
import os

params = {"alpha": 0.3, "max_iter": 200}
os.makedirs("config", exist_ok=True)

with open("config/params.yaml", "w") as f:
    yaml.dump(params, f)

# ✅ Step 3: Create a training script that reads parameters and logs a model
training_code = '''
import yaml
import mlflow
import mlflow.sklearn
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

with open("config/params.yaml") as f:
    config = yaml.safe_load(f)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=config["alpha"], max_iter=config["max_iter"])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

mlflow.set_experiment("cicd-simulation")
with mlflow.start_run():
    mlflow.log_param("alpha", config["alpha"])
    mlflow.log_param("max_iter", config["max_iter"])
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "ridge_model")
    print("✅ Model logged.")
'''

with open("train.py", "w") as f:
    f.write(training_code)

# ✅ Step 4: Simulate CI/CD trigger with CLI (run in GitHub Actions or locally)
print("""
📦 Simulate the CI/CD trigger by running:

python train.py

🛠️ In GitHub Actions, you’d automate this with a .github/workflows/train.yml file:

name: Train Model on Param Change
on:
  push:
    paths:
      - 'config/params.yaml'

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: pip install mlflow scikit-learn pyyaml
      - name: Train model
        run: python train.py
""")
```


---

## 📝 Assessment: CI/CD Automation with MLflow

### 📘 Multiple Choice (Correct answers in **bold**)

**1. What is the purpose of `params.yaml` in a CI/CD context?**    
A. Tracks model artifacts    
**B. Stores training parameters that can trigger automation** ✅    
C. Launches model registry    
D. Configures Docker containers    

---

**2. Which GitHub Actions trigger is used to detect file changes?**    
A. `on.workflow_dispatch`    
**B. `on.push.paths`** ✅    
C. `on.commit.message`    
D. `on.model.save`    

---

**3. What is the role of `mlflow.log_param()` inside a CI/CD pipeline?**    
A. Runs batch predictions    
**B. Records parameters used in that training run** ✅    
C. Sends alerts to Slack    
D. Checks Python dependencies    

---

**4. Why is it useful to log models during CI/CD?**    
A. Speeds up Git commits    
B. Replaces notebooks    
**C. Creates traceable and versioned production models** ✅    
D. Avoids using `git pull`    

---

### ✏️ Short Answer    

**5. What’s the benefit of automating model retraining through CI/CD?**    
*Ensures consistency, reproducibility, and rapid deployment when data or configuration changes, reducing manual errors.*    

---

**6. How does MLflow improve visibility in a CI/CD pipeline?**    
*It provides a centralized record of each run’s parameters, metrics, and artifacts, making it easy to audit and compare models over time.*    

---

### 🧪 Mini Project

**7. Task:**

* Create a GitHub repo with a `params.yaml` file and `train.py`
* Create a GitHub Actions YAML workflow that runs when `params.yaml` changes
* Log the model and metrics with MLflow
* Push a change to the file and confirm the action triggers and logs the run
