## Monitoring & Model Management

**Monitor**

- Data inputs: are they drifting from the training distribution?
- Predictions: are they degrading in accuracy or stability?
- System health: latency, error rates, resource usage.
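One common way to quantify input drift is the Population Stability Index (PSI), which compares how a feature's values fall into bins in the training sample versus recent data. The sketch below is a minimal, standard-library illustration and not what any particular monitoring tool computes internally. The function name `psi`, the ten equal-width bins, and the synthetic Gaussian samples are assumptions made for the example.

```python
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


rng = random.Random(0)
train_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = psi(train_sample, same_dist)
psi_shifted = psi(train_sample, shifted)
print(f"same distribution: {psi_same:.3f}")
print(f"shifted by one sigma: {psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, but the right threshold depends on the feature and the bin layout.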

**Manage**

- Version models so they can be promoted or demoted in production.
- Roll back to a previous version if performance drops.
- Automate retraining when drift is detected.
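Automated retraining usually needs a guard so that a noisy drift signal does not launch a job on every check. The following sketch shows one possible policy; it is not a feature of any tool named in this section. The class name `RetrainPolicy`, the drift threshold, the number of consecutive drifted checks, and the cooldown are all hypothetical parameters chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    """Decide when a drift score should trigger a retraining job."""
    threshold: float = 0.2                   # drift score that counts as "drifted"
    required_consecutive: int = 2            # drifted checks needed in a row
    cooldown: timedelta = timedelta(hours=24)
    _streak: int = 0
    _last_trigger: Optional[datetime] = None

    def should_retrain(self, drift_score: float, now: datetime) -> bool:
        # Track how many checks in a row met or exceeded the threshold
        self._streak = self._streak + 1 if drift_score >= self.threshold else 0
        if self._streak < self.required_consecutive:
            return False
        # Suppress triggers while the last retraining is still within cooldown
        if self._last_trigger is not None and now - self._last_trigger < self.cooldown:
            return False
        self._last_trigger = now
        self._streak = 0
        return True


policy = RetrainPolicy()
start = datetime(2024, 1, 1)
scores = [0.05, 0.31, 0.28, 0.35, 0.40]  # one drift check per hour
decisions = [
    policy.should_retrain(score, start + timedelta(hours=i))
    for i, score in enumerate(scores)
]
print(decisions)
```

Here the third check triggers retraining (two drifted checks in a row), while the later high scores are suppressed because they fall inside the cooldown window.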

| Type of Monitoring      | Examples                                       | Tools                                                       |
| ----------------------- | ---------------------------------------------- | ----------------------------------------------------------- |
| **Data Drift**          | Feature distributions changing over time       | Evidently AI, WhyLabs, Fiddler, AWS SageMaker Model Monitor |
| **Concept Drift**       | Relationship between features & target changes | River (online learning), custom drift detectors             |
| **Prediction Quality**  | Accuracy, Precision, Recall, F1, AUC over time | MLflow, Prometheus, Grafana                                 |
| **Operational Metrics** | Latency, Throughput, Error Rates               | Prometheus, Grafana, ELK stack                              |
| **Business KPIs**       | Conversion rate, revenue impact                | BI tools (Tableau, Power BI)                                |
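Prediction quality can only be tracked once ground-truth labels arrive, often with a delay. As a rough sketch of the "accuracy over time" idea in the table, the class below keeps a sliding window of recent outcomes and flags degradation against a baseline. `RollingAccuracy`, the window size, and the tolerance are illustrative assumptions; in practice the resulting value would be exported to a metrics system or logged alongside the model.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window: int, baseline: float, tolerance: float = 0.05):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.baseline = baseline              # e.g. test accuracy at deployment
        self.tolerance = tolerance            # allowed drop before alerting

    def update(self, predicted, actual) -> None:
        self.outcomes.append(1 if predicted == actual else 0)

    @property
    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else float("nan")

    def degraded(self) -> bool:
        # Only judge once the window is full, to avoid alerts on a handful of samples
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.baseline - self.tolerance


monitor = RollingAccuracy(window=10, baseline=0.95)

for _ in range(10):          # ten correct predictions
    monitor.update(1, 1)
healthy = monitor.degraded()

for _ in range(3):           # three mistakes push older outcomes out of the window
    monitor.update(0, 1)
print(monitor.accuracy, healthy, monitor.degraded())
```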


### Model Management Essentials

**Model Registry** (e.g., MLflow Model Registry, SageMaker Model Registry, Neptune.ai)

- Stores models with metadata (version, metrics, source code hash).
- Supports staging → production promotion workflows.

**Lifecycle Stages**

- Development → Staging → Production → Archived.

**Rollback Mechanism**

- If a new version underperforms, quickly switch back to an older version.
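To make the promotion and rollback flow concrete, here is a toy in-memory registry. It only illustrates the bookkeeping (one Production version at a time, with the previous one archived and restorable) and is not how MLflow or SageMaker implement their registries. `ToyRegistry`, `ModelVersion`, and their method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    metrics: Dict[str, float]
    stage: str = "Development"  # Development -> Staging -> Production -> Archived


@dataclass
class ToyRegistry:
    versions: List[ModelVersion] = field(default_factory=list)
    history: List[int] = field(default_factory=list)  # past Production versions

    def register(self, metrics: Dict[str, float]) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, metrics=metrics)
        self.versions.append(mv)
        return mv

    def production(self) -> Optional[ModelVersion]:
        return next((v for v in self.versions if v.stage == "Production"), None)

    def promote(self, version: int) -> None:
        current = self.production()
        if current is not None:
            current.stage = "Archived"
            self.history.append(current.version)
        self.versions[version - 1].stage = "Production"

    def rollback(self) -> None:
        """Archive the current Production version and restore the previous one."""
        if not self.history:
            raise RuntimeError("no earlier Production version to roll back to")
        previous = self.history.pop()
        current = self.production()
        if current is not None:
            current.stage = "Archived"
        self.versions[previous - 1].stage = "Production"


registry = ToyRegistry()
v1 = registry.register({"test_accuracy": 0.95})
v2 = registry.register({"test_accuracy": 0.97})
registry.promote(v1.version)
registry.promote(v2.version)   # v1 becomes Archived
registry.rollback()            # v2 underperforms live, so v1 is restored
print([(v.version, v.stage) for v in registry.versions])
```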

### MLflow + Evidently AI + Prometheus

#### Step 1: Track Models in MLflow Registry

In [None]:
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Point the client at the tracking server and select (or create) the experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("iris-prod")

X, y = load_iris(return_X_y=True)
# Fix the seed so the split, and therefore the logged metrics, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run() as run:
    # The default max_iter=100 can stop before convergence on unscaled features
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))

    # Registering creates a new version of "iris-model" in the Model Registry
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "iris-model")


#### Step 2: Monitor Drift with Evidently

In [None]:
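# Legacy Report API (Evidently 0.4.x); newer releases restructured these imports
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from sklearn.datasets import load_iris
import pandas as pd

# Reference = training data, current = data to compare (X_train, X_test from Step 1)
feature_names = load_iris().feature_names
train_df = pd.DataFrame(X_train, columns=feature_names)
current_df = pd.DataFrame(X_test, columns=feature_names)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=current_df)
report.save_html("drift_report.html")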


#### Step 3: Expose Metrics with FastAPI + Prometheus


In [None]:
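from typing import List

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = FastAPI()
PREDICTION_COUNT = Counter("prediction_requests_total", "Number of prediction requests")

@app.post("/predict")
def predict(features: List[float]):
    PREDICTION_COUNT.inc()
    # Your prediction logic...
    return {"prediction": 1}

@app.get("/metrics")
def metrics():
    # Return the raw exposition text; a bare return value would be JSON-encoded by FastAPI
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)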


#### Step 4: Manage in Kubernetes

- Deploy Prometheus and Grafana using Helm charts.
- Deploy the MLflow tracking server and model registry.
- Deploy your inference API with a sidecar that logs predictions to storage.
- Set up Jenkins or GitHub Actions to:
  - Detect drift and trigger a retraining job.
  - Register the new model in MLflow.
  - Promote the new model if metrics improve.
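The "promote if metrics improve" step can be expressed as a small gate in the pipeline. The sketch below separates the decision (`should_promote`, a hypothetical helper with an assumed `min_gain` margin) from the action. `promote` expects an object exposing `transition_model_version_stage`, which `mlflow.tracking.MlflowClient` provides in MLflow 2.x; registry stages are deprecated in recent MLflow releases in favour of aliases, so treat this as matching the stage-based workflow used in this section. A stub client stands in for the real one so the example runs without a tracking server.

```python
def should_promote(candidate: dict, production: dict,
                   metric: str = "test_acc", min_gain: float = 0.0) -> bool:
    """Return True if the candidate beats production on `metric` by more than min_gain."""
    if metric not in production:
        return True  # nothing in production to compare against yet
    return candidate[metric] - production[metric] > min_gain


def promote(client, name: str, version: str) -> None:
    """Move `version` to Production and archive whatever was there before."""
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )


class StubClient:
    """Stand-in for MlflowClient that just records calls (illustration only)."""

    def __init__(self):
        self.calls = []

    def transition_model_version_stage(self, **kwargs):
        self.calls.append(kwargs)


client = StubClient()  # in the pipeline: mlflow.tracking.MlflowClient()
candidate_metrics = {"test_acc": 0.97}
production_metrics = {"test_acc": 0.95}

if should_promote(candidate_metrics, production_metrics, min_gain=0.01):
    promote(client, "iris-classifier", "2")
print(client.calls)
```

Keeping the comparison in a pure function makes the gate easy to unit-test in CI before it is allowed to touch the registry.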


In summary, each tool covers one role:

- MLflow → model tracking and registry
- Evidently → data drift detection
- Prometheus + Grafana → API health monitoring
- Helm → deployment to Kubernetes
- Jenkins → automated retraining and rollout

#### Project Structure



In [None]:
mlops-iris/
├── train_model.py             # Train and register model in MLflow
├── inference_api.py           # FastAPI inference service
├── drift_monitor.py           # Drift detection with Evidently
├── requirements.txt
├── Dockerfile
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── prometheus-values.yaml
│   └── grafana-values.yaml
├── helm-chart/                # Helm deployment templates
└── Jenkinsfile


#### Model Training & MLflow Registry

train_model.py

In [None]:
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# MLflow config
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("iris-pipeline")

# Load data and rename columns to the snake_case names that inference_api.py sends;
# scikit-learn checks feature names at predict time and recent versions raise on a mismatch
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train & log
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    mlflow.sklearn.log_model(model, "model")
    mlflow.log_params({"max_iter": 200})
    mlflow.log_metrics({
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    })

    # Register model. New versions start with no stage, so one must be promoted
    # to Production before inference_api.py can load it.
    mlflow.register_model(
        f"runs:/{run.info.run_id}/model",
        "iris-classifier",
    )

# Save test set for drift detection
X_test.to_csv("reference_data.csv", index=False)


#### FastAPI Inference API with Prometheus Metrics

inference_api.py

In [None]:
from typing import List

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
import mlflow.pyfunc
import pandas as pd

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Load the model version currently in the Production stage
# (assumes MLFLOW_TRACKING_URI points at the tracking server)
model = mlflow.pyfunc.load_model("models:/iris-classifier/Production")

app = FastAPI()

REQUEST_COUNT = Counter("prediction_requests_total", "Total prediction requests")
REQUEST_LATENCY = Histogram("prediction_latency_seconds", "Prediction latency in seconds")

@app.post("/predict")
def predict(data: List[List[float]]):
    REQUEST_COUNT.inc()
    # Histogram.time() observes the elapsed seconds when the block exits
    with REQUEST_LATENCY.time():
        df = pd.DataFrame(data, columns=FEATURE_COLUMNS)
        prediction = model.predict(df)
    return {"prediction": prediction.tolist()}

@app.get("/metrics")
def metrics():
    # Serve the raw Prometheus exposition format rather than a JSON-encoded string
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
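The endpoint above does not check that each row has exactly four finite values. A small check before building the DataFrame gives clients a clear error instead of a stack trace from pandas. This is only a sketch: `validate_rows` is a hypothetical helper, and in a FastAPI handler the `ValueError` would typically be converted into an HTTP 4xx response.

```python
import math
from typing import List, Sequence

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def validate_rows(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return rows as lists of floats, or raise ValueError describing the first problem."""
    if not data:
        raise ValueError("request contains no rows")
    cleaned = []
    for i, row in enumerate(data):
        if len(row) != len(FEATURE_COLUMNS):
            raise ValueError(
                f"row {i}: expected {len(FEATURE_COLUMNS)} values, got {len(row)}"
            )
        values = [float(v) for v in row]
        if any(math.isnan(v) or math.isinf(v) for v in values):
            raise ValueError(f"row {i}: values must be finite numbers")
        cleaned.append(values)
    return cleaned


rows = validate_rows([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(len(rows), "valid rows")

try:
    validate_rows([[5.1, 3.5, 1.4]])
except ValueError as exc:
    error_message = str(exc)
    print("rejected:", error_message)
```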


#### Drift Detection with Evidently

drift_monitor.py

In [None]:
# Legacy Report API (Evidently 0.4.x)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
import requests

# Load reference data (from training phase)
reference_data = pd.read_csv("reference_data.csv")

# Fetch recent live data (for example, from API logs or DB)
# Here we simulate by shuffling the reference rows, so no drift is expected
current_data = reference_data.sample(frac=1).reset_index(drop=True)  # replace with real data

# Run drift detection
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)

# Save HTML report
report.save_html("drift_report.html")

# The preset's first metric (DatasetDriftMetric) carries the dataset-level flag
dataset_drift = report.as_dict()["metrics"][0]["result"]["dataset_drift"]

# Trigger retraining if drift detected
# (a secured Jenkins also needs credentials; none are configured here)
if dataset_drift:
    response = requests.post("http://jenkins:8080/job/retrain/build", timeout=10)
    response.raise_for_status()
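To replace the simulated `current_data` with real traffic, the inference service (or its logging sidecar) needs to persist the feature rows it receives. One simple assumption is an append-only JSON Lines file. The helper names `log_request` and `load_recent`, the file name, and the record layout below are all hypothetical; the rows returned by `load_recent` can be passed to `pd.DataFrame(...)` to build `current_data`.

```python
import json
import os
import tempfile
import time
from typing import Dict, List

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def log_request(path: str, features: List[float], prediction: int, ts: float) -> None:
    """Append one served request to a JSON Lines log."""
    record = {"ts": ts, "prediction": prediction, **dict(zip(FEATURE_COLUMNS, features))}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_recent(path: str, since_ts: float) -> List[Dict[str, float]]:
    """Return feature rows logged at or after `since_ts`, ready for a DataFrame."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["ts"] >= since_ts:
                rows.append({col: record[col] for col in FEATURE_COLUMNS})
    return rows


log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
now = time.time()
log_request(log_path, [5.1, 3.5, 1.4, 0.2], prediction=0, ts=now - 7200)  # two hours ago
log_request(log_path, [6.2, 2.9, 4.3, 1.3], prediction=1, ts=now - 60)    # one minute ago

recent_rows = load_recent(log_path, since_ts=now - 3600)  # last hour only
print(recent_rows)
```

Filtering by timestamp keeps the drift comparison focused on a recent window instead of the full request history.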



#### Dockerfile

In [None]:
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "inference_api:app", "--host", "0.0.0.0", "--port", "8000"]


#### Helm Chart Structure

In [None]:
helm-chart/
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
├── values.yaml
└── Chart.yaml


#### Example Deployment Template (templates/deployment.yaml):

In [None]:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-iris
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iris
  template:
    metadata:
      labels:
        app: iris
    spec:
      containers:
      - name: iris
        image: your-dockerhub/iris-api:latest
        ports:
        - containerPort: 8000


#### Jenkinsfile for CI/CD

In [None]:
pipeline {
    agent any
    stages {
        stage('Train Model') {
            steps {
                sh 'python train_model.py'
            }
        }
        stage('Build & Push Docker Image') {
            steps {
                sh 'docker build -t your-dockerhub/iris-api:latest .'
                sh 'docker push your-dockerhub/iris-api:latest'
            }
        }
        stage('Deploy to K8s via Helm') {
            steps {
                sh 'helm upgrade --install iris helm-chart/ --namespace mlops --create-namespace'
            }
        }
    }
}


#### Monitoring Stack
Deploy Prometheus & Grafana via Helm:

In [None]:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install prom prometheus-community/prometheus
helm install graf grafana/grafana
