### <div align="center">ML Ops & Cloud Tools</div>

#### 12.1: ML Ops
- ML Ops is a set of practices that streamline and automate the end-to-end machine learning lifecycle.
- ML Lifecycle: Develop -> Deploy -> Monitor
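- ML Ops is analogous to DevOps in traditional software development.
- Model performance can degrade after deployment due to:
  - Data drift: the distribution of the input data changes.
  - Concept drift: the relationship between the inputs and the target changes.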
- Setting up ML Ops in a project offers 3 main benefits:
  - Improved efficiency and productivity
  - Enhanced quality and reliability
  - Better collaboration and governance

#### 12.3: ML Flow: Purpose and Overview
- MLflow is an open-source platform that simplifies managing the machine learning lifecycle. It supports MLOps by providing tools for experiment tracking, reproducible runs, model packaging, and a model registry.
- Benefits of MLflow:
  - Experiment Tracking
  - Reproducibility & Deployment
  - Model Management

In [2]:
# Install MLflow locally
# pip install mlflow

# Start the MLflow UI locally (serves on http://localhost:5000 by default)
# mlflow ui

#### 12.4: ML Flow Experiment Tracking
- MLflow provides an interface and tooling to help ML practitioners with experiment tracking.
- Experiment tracking lets data scientists and AI engineers record the parameters and metrics of each run so that different experiments can be compared.

In [33]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

In [34]:
# Step 1: Create an imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=8, 
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

np.unique(y, return_counts=True)

(array([0, 1]), array([900, 100], dtype=int64))

In [35]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [36]:
# Experiment 1: Train Logistic Regression Classifier
log_reg = LogisticRegression(C=1, solver='liblinear')
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
print(classification_report(y_test, y_pred_log_reg))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95       270
           1       0.60      0.50      0.55        30

    accuracy                           0.92       300
   macro avg       0.77      0.73      0.75       300
weighted avg       0.91      0.92      0.91       300



In [37]:
# Experiment 2: Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=30, max_depth=3)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       270
           1       0.95      0.70      0.81        30

    accuracy                           0.97       300
   macro avg       0.96      0.85      0.89       300
weighted avg       0.97      0.97      0.96       300



In [40]:
# Experiment 3: Train XGBoost
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       270
           1       0.96      0.80      0.87        30

    accuracy                           0.98       300
   macro avg       0.97      0.90      0.93       300
weighted avg       0.98      0.98      0.98       300



In [42]:
# Experiment 4: Handle class imbalance using SMOTETomek and then Train XGBoost
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_train_res, y_train_res = smt.fit_resample(X_train, y_train)
np.unique(y_train_res, return_counts=True)

(array([0, 1]), array([619, 619], dtype=int64))

In [43]:
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(X_train_res, y_train_res)
y_pred_xgb = xgb_clf.predict(X_test)
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       270
           1       0.81      0.83      0.82        30

    accuracy                           0.96       300
   macro avg       0.89      0.91      0.90       300
weighted avg       0.96      0.96      0.96       300



##### Track Experiments Using MLflow

In [44]:
models = [
    (
        "Logistic Regression", 
        {"C": 1, "solver": 'liblinear'},
        LogisticRegression(), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "Random Forest", 
        {"n_estimators": 30, "max_depth": 3},
        RandomForestClassifier(), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "XGBClassifier",
        {"use_label_encoder": False, "eval_metric": 'logloss'},
        XGBClassifier(), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "XGBClassifier With SMOTE",
        {"use_label_encoder": False, "eval_metric": 'logloss'},
        XGBClassifier(), 
        (X_train_res, y_train_res),
        (X_test, y_test)
    )
]

In [47]:
reports = []

for model_name, params, model, train_set, test_set in models:
    # Use loop-local names so the global train/test splits are not overwritten
    X_tr, y_tr = train_set
    X_te, y_te = test_set

    model.set_params(**params)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    # output_dict=True returns the report as a nested dict for logging later
    report = classification_report(y_te, y_pred, output_dict=True)
    reports.append(report)

In [48]:
import mlflow
import mlflow.sklearn
import mlflow.xgboost

In [52]:
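# Initialize MLflow: point at the tracking server first, then select the
# experiment, so the experiment is looked up (or created) on that server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("Anomaly Detection Swetank")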

for i, element in enumerate(models):
    model_name = element[0]
    params = element[1]
    model = element[2]
    report = reports[i]
    
    with mlflow.start_run(run_name=model_name):        
        mlflow.log_params(params)
        mlflow.log_metrics({
            'accuracy': report['accuracy'],
            'recall_class_1': report['1']['recall'],
            'recall_class_0': report['0']['recall'],
            'f1_score_macro': report['macro avg']['f1-score']
        })  
        
        if "XGB" in model_name:
            mlflow.xgboost.log_model(model, "model")
        else:
            mlflow.sklearn.log_model(model, "model")  

2025/09/15 10:33:46 INFO mlflow.tracking.fluent: Experiment with name 'Anomaly Detection Swetank' does not exist. Creating a new experiment.


🏃 View run Logistic Regression at: http://localhost:5000/#/experiments/620974005201839646/runs/18ab1e95bc594775a2c3d1ea41779e2b
🧪 View experiment at: http://localhost:5000/#/experiments/620974005201839646




🏃 View run Random Forest at: http://localhost:5000/#/experiments/620974005201839646/runs/28f8ff51f32749e7bb725d3c7363ccb9
🧪 View experiment at: http://localhost:5000/#/experiments/620974005201839646




🏃 View run XGBClassifier at: http://localhost:5000/#/experiments/620974005201839646/runs/bf95a09ef46a485aad3fabc2589ffd28
🧪 View experiment at: http://localhost:5000/#/experiments/620974005201839646




🏃 View run XGBClassifier With SMOTE at: http://localhost:5000/#/experiments/620974005201839646/runs/3c1fbb5cad934e19997680eef8b2a1f6
🧪 View experiment at: http://localhost:5000/#/experiments/620974005201839646


##### Register the Model

In [None]:
model_name = 'XGB-Smote'
run_id = input('Please type RunID: ')
# "model" is the artifact path that was passed to log_model() during tracking
model_uri = f'runs:/{run_id}/model'

with mlflow.start_run(run_id=run_id):
    mlflow.register_model(model_uri=model_uri, name=model_name)

##### Load the Model

In [None]:
model_version = 1
model_uri = f"models:/{model_name}/{model_version}"

loaded_model = mlflow.xgboost.load_model(model_uri)
y_pred = loaded_model.predict(X_test)
y_pred[:4]

##### Transition the Model to Production

In [None]:
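# Assumes a version of the registered model already carries the "challenger" alias
# (aliases can be assigned to a model version in the MLflow UI)
current_model_uri = f"models:/{model_name}@challenger"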
production_model_name = "anomaly-detection-prod"

client = mlflow.MlflowClient()
client.copy_model_version(src_model_uri=current_model_uri, dst_name=production_model_name)

In [None]:
# Assumes the copied version in the production model has been given the "champion" alias
prod_model_uri = f"models:/{production_model_name}@champion"

loaded_model = mlflow.xgboost.load_model(prod_model_uri)
y_pred = loaded_model.predict(X_test)
y_pred[:4]

Please refer to the following link to learn more about model registry workflows:

https://mlflow.org/docs/latest/model-registry.html#model-registry-workflows

#### 12.6: ML Flow Centralized Server Using DagsHub
- DagsHub is similar to GitHub. It allows code version control and collaboration.
- DagsHub has additional features such as:
  - Data Version Control
  - Experiment Tracking
- Data scientists can use DagsHub to publish experiment metrics on a centralized server.
- For more details about the DagsHub implementation, please refer to the `ml_flow_dagshub` notebook.

#### 12.8: What is an API?
- API stands for Application Programming Interface. It defines how one piece of software can request data or services from another.
- FastAPI, Flask, Express (on Node.js), etc. are web frameworks that allow you to build APIs.
- FastAPI is a modern framework that many ML practitioners use to build inference servers for their models.

#### 12.9: FastAPI Basics
- Install with `pip install "fastapi[standard]"` (the quotes keep shells such as zsh from interpreting the square brackets).
- FastAPI is a web framework that lets you build servers that can serve inference requests.
- FastAPI is a modern framework that offers several benefits over other options such as Flask.
- The benefits are:
  - Speed
  - In-built data validation
  - In-built documentation
  - Faster code development

#### 12.10: Build FastAPI Server For Credit Risk Project
- Data scientists and AI engineers use tools such as Postman to test a backend inference server.
- Pydantic is a Python library used for data validation.
- It can be used along with FastAPI to validate the input to and the output from the server.
- AI tools such as ChatGPT, claude.ai, and meta.ai can help AI engineers and data scientists write code faster and boost productivity.
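
The Pydantic validation mentioned above can be sketched as follows. The `LoanApplication` model and its fields and limits are hypothetical, chosen only for illustration; they are not the actual schema of the credit risk project.

```python
from pydantic import BaseModel, Field, ValidationError


class LoanApplication(BaseModel):
    # Hypothetical fields and constraints, for illustration only
    age: int = Field(ge=18, le=100)
    income: float = Field(gt=0)
    loan_amount: float = Field(gt=0)


def validate_payload(payload: dict):
    """Return (application, 0) on success or (None, number_of_errors) on failure."""
    try:
        return LoanApplication(**payload), 0
    except ValidationError as exc:
        return None, len(exc.errors())


# A valid payload is parsed into a typed object
ok, ok_errors = validate_payload({"age": 30, "income": 50000, "loan_amount": 10000})

# An invalid payload is rejected, with one error per failing field
bad, bad_errors = validate_payload({"age": 15, "income": -1, "loan_amount": 10000})
print(ok, ok_errors, bad, bad_errors)
```

When a model like this is used as a FastAPI request body, the same checks run automatically and an invalid request is rejected before the endpoint code executes.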

#### 12.15: AWS Sagemaker: Sagemaker Studio
- AWS SageMaker is a popular cloud platform that lets you streamline your end-to-end ML project development workflow.
- You can enroll in the AWS Free Tier to get a limited amount of free usage.
- SageMaker Studio provides an integrated development environment for end-to-end machine learning workflows.
- It offers seamless collaboration and version control for data scientists and developers.
- The visual interface simplifies data preparation, model training, tuning, and deployment.
- It integrates with other AWS services, enabling a comprehensive and scalable machine learning ecosystem.

#### 12.16: AWS Sagemaker: 4 Ways to Train Model
- SageMaker offers various ways to run model training.
- 4 prominent ways are:
  - Training inside notebook
  - Built-in algorithms
  - Script mode
  - Custom containers
    - Custom containers require you to build your own Docker image with all its dependencies and deploy it. Using built-in algorithms or script mode saves you this effort.

#### 12.17: AWS Sagemaker - Built In Algorithms
- Built-in algorithms let you use prebuilt containers inside the SageMaker environment to run model training. This speeds up model development.
- Often, people need high compute with GPUs only for training, while a lower-end machine is enough for data processing and writing code in the notebook. Built-in algorithms allow you to run just the training on high compute while keeping the rest on a separate, smaller cloud machine.
- SageMaker’s built-in algorithms are optimized for performance and scalability, handling large datasets efficiently.
- For a built-in model training example, please refer to the `builtin_algo_classification` notebook.

#### 12.18: AWS Sagemaker - Script Mode
- Script mode in SageMaker allows you to run model training using your own Python script. This method is more customizable compared to a Built-In Algorithm.
- Just like Built-In Algorithms, it uses pre-built containers (or Docker images) to run the training. This saves the time you would otherwise spend building your own container for training in the cloud.
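
As a rough sketch of what a script-mode training script can look like: SageMaker passes hyperparameters to the script as command-line flags and exposes locations such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` as environment variables. The hyperparameter name, the `model.json` artifact, and the "training" step below are hypothetical placeholders.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags (hypothetical name)
    parser.add_argument("--n-estimators", type=int, default=30)
    # Inside a SageMaker training container these environment variables are set;
    # the fallbacks allow a local dry run
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    return parser.parse_args(argv)


def train(args):
    # Placeholder: a real script would load data from args.train and fit a model.
    # Here only the settings are written, to show where the artifact must go.
    os.makedirs(args.model_dir, exist_ok=True)
    artifact_path = os.path.join(args.model_dir, "model.json")
    with open(artifact_path, "w") as f:
        json.dump({"n_estimators": args.n_estimators, "train_dir": args.train}, f)
    return artifact_path


# Local dry run with explicit arguments and a temporary model directory
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = train(parse_args(["--model-dir", tmp_dir, "--n-estimators", "50"]))
    with open(artifact_path) as f:
        saved = json.load(f)
print(saved)
```

In an actual entry point, the dry run at the bottom would be replaced by `train(parse_args())` under an `if __name__ == "__main__":` guard, so the script reads its arguments from the command line that SageMaker builds.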

#### 12.20: Data Drift Detection Using PSI & CSI
- PSI (Population Stability Index) and CSI (Characteristic Stability Index) are two ways to measure data drift.
- PSI is mainly used to measure overall drift in a population (mainly for target variable distribution changes).
- PSI > 0.2 is considered a high level of data drift that warrants investigation.
- Note: 0.2 is not a fixed threshold. The threshold can vary based on the situation.
- CSI measures drift in an individual feature (or independent variable).
- The formula is exactly the same for both: sum over all bins of (Actual % - Expected %) * ln(Actual % / Expected %).
- For categorical variables, use categories to measure Expected % and Actual % for CSI.
- For continuous variables, bin them to measure Expected % and Actual % for CSI.
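
The calculation described above can be sketched in plain Python. The helper names, the bin edges, the small `eps` floor for empty bins, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def stability_index(expected_pct, actual_pct, eps=1e-6):
    """Sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


def binned_pct(values, edges):
    """Share of values per bin for a continuous variable; edges are inner cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > edge for edge in edges)] += 1
    return [c / len(values) for c in counts]


def category_pct(values, categories):
    """Share of values per category for a categorical variable."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


# Continuous variable: bin both samples with the same edges
expected = list(range(100))             # reference (e.g. training) sample
actual = [v + 20 for v in expected]     # shifted (e.g. production) sample
edges = [25, 50, 75]
psi_same = stability_index(binned_pct(expected, edges), binned_pct(expected, edges))
psi_shift = stability_index(binned_pct(expected, edges), binned_pct(actual, edges))

# Categorical variable: use the categories themselves as bins
cats = ["A", "B", "C"]
expected_cat = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
actual_cat = ["A"] * 30 + ["B"] * 30 + ["C"] * 40
csi_cat = stability_index(category_pct(expected_cat, cats), category_pct(actual_cat, cats))

print(psi_same, psi_shift, csi_cat)
```

Identical distributions give an index of zero, and the index grows as the two distributions move apart. In this toy data both the shifted sample and the changed category mix land above the 0.2 level mentioned above.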