# **PHASE 3: ADVANCED ANALYTICS (Days 9-11)**

## **DAY 12 (20/01/26) - MLflow Basics**

### **Section 1 - Learn**:

### **_1. MLflow components (tracking, registry, models)_**

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. It is built around several core components that work together to move a model from a local script to a production-ready service.

##### 1. MLflow Tracking

This is the central logging component. It allows you to record and query experiments, which are essentially executions of your data science code (called **Runs**).

* **What it tracks:** Parameters (hyperparameters), Metrics (accuracy, loss), Code version (Git hash), and Artifacts (plots, data files, or the model itself).
* **Key Concept:** **Experiments** group multiple runs together, making it easy to compare different iterations and visualize performance trends through the MLflow UI.

##### 2. MLflow Models

This component provides a **standardized format** for packaging machine learning models so they can be used in various downstream tools—such as real-time serving via a REST API or batch inference on Apache Spark.

* **Flavors:** Every MLflow Model is stored as a directory containing an `MLmodel` file. This file defines "flavors" (e.g., `python_function`, `sklearn`, `pytorch`) that tell different deployment tools how to understand the model.
* **Signatures:** It allows you to define a "Model Signature," which acts as a data contract specifying exactly what input types the model expects and what it will return.

##### 3. MLflow Model Registry

The Registry is a centralized model store, set of APIs, and UI to collaboratively manage the full lifecycle of an MLflow Model. It provides:

* **Versioning:** Automatically tracks and manages different versions of a model (e.g., Version 1, Version 2).
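* **Stage Transitions:** In the legacy Workspace Model Registry, you can assign "Stages" to a model version, such as **Staging**, **Production**, or **Archived**. Stages are deprecated in recent MLflow releases; models registered in Unity Catalog use aliases (for example, `champion`) and tags instead.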
* **Lineage:** Links every registered model version back to the specific MLflow Run and experiment that created it, ensuring full reproducibility.


##### Comparison Summary

| Component | Primary Purpose | Key Output |
| --- | --- | --- |
| **Tracking** | Logging and comparing experiment results. | Metrics, Parameters, and Artifacts. |
| **Models** | Packaging models for deployment across platforms. | Standardized `MLmodel` directory. |
| **Registry** | Governing model versions and lifecycle stages. | Versioned, production-ready models. |

---

### **_2. Experiment tracking_**

In Databricks, **Experiment Tracking** is powered by a managed version of **MLflow**. It serves as a digital lab notebook that records the details of your machine learning training sessions (parameters, metrics, code versions, and resulting models), either automatically through autologging or through explicit logging calls, so you can compare them and reproduce results later.

##### **1. Key Concepts: Experiments vs. Runs**

* **Run:** A single execution of your model training code. During a run, you log "ingredients" (parameters) and "results" (metrics).
* **Experiment:** A logical container that groups related runs. For example, you might have one experiment for "Sales Forecasting" and run it 50 times with different settings to find the best model.
* **Artifacts:** Heavy files produced during a run, such as the trained model itself, feature importance plots, or CSVs of predictions.

##### **2. Manual Logging vs. Autologging**

There are two ways to track your work in Databricks:

* **Autologging (Recommended):** Databricks can automatically detect when you use popular libraries (like Scikit-learn, TensorFlow, or PyTorch) and log your parameters and metrics without you writing extra code.
    * **How to enable:** Use `mlflow.autolog()` at the start of your notebook.

* **Manual Logging:** Use the MLflow API for full control over what gets saved.

```python
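import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder data so the example runs end to end
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=42)

# Start a tracked session
with mlflow.start_run(run_name="RandomForest_v1"):
    # Log 'Ingredients'
    params = {"n_estimators": 100, "max_depth": 5}
    mlflow.log_params(params)

    # Train the model
    my_model = RandomForestRegressor(**params, random_state=42).fit(X, y)
    predictions = my_model.predict(X)

    # Log 'Results' (training-set metrics, for illustration only)
    mlflow.log_metric("rmse", mean_squared_error(y, predictions) ** 0.5)
    mlflow.log_metric("r2_score", r2_score(y, predictions))

    # Save the 'Model'
    mlflow.sklearn.log_model(my_model, "model_artifact")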

```

##### **3. The Experiment Sidebar**

You don't need to leave your notebook to see your progress. The **Experiments Sidebar** (located on the right-hand side of the Databricks notebook UI) provides a live feed of all runs.

* **Comparison View:** You can select multiple runs and click **"Compare"** to see a side-by-side table or a parallel coordinates plot that shows which hyperparameters led to the best accuracy.
* **Reproduce with One Click:** Every run is linked to the exact version of the notebook that created it. If you find a great model from six months ago, you can jump back to that specific code snapshot instantly.

##### **4. Best Practices for Tracking**

* **Log Training Data Versions:** Use the `mlflow.data` API together with `mlflow.log_input()` to record a reference to the specific Delta table version used for training, so each run can be traced back to the exact data it was trained on.
* **Use Tags:** Add tags like `team: marketing` or `status: candidate` to your runs so you can filter through thousands of experiments later in the **MLflow UI**.
* **Centrally Manage with Unity Catalog:** In modern Databricks (2025/2026), register your models in **Unity Catalog** for centralized governance and access control.

##### **Summary of Benefits**

| Benefit | How it helps you |
| --- | --- |
| **Reproducibility** | Exactly recreate any past result by seeing the exact code, data, and environment used. |
| **Model Selection** | Use visual charts to find the "sweet spot" for hyperparameters (e.g., learning rate). |
| **Auditability** | Maintain a complete history of every model ever trained for compliance and debugging. |
| **Collaboration** | Share an experiment link with a teammate so they can see your results without you sending screenshots. |


---

### **_3. Model logging_**

In Databricks, **Model Logging** is the process of saving a trained machine learning model along with its metadata (dependencies, schema, and environment) into the MLflow tracking system.

Unlike simply saving a file (like a `.pkl` or `.h5`), logging a model creates a standardized **MLflow Model**—a directory that contains everything needed for someone else (or a production server) to run the model without manually installing libraries.

##### **1. The Logging Workflow**

There are two primary ways to log your models:

* **Autologging (Fastest):** Most popular libraries (Scikit-learn, XGBoost, PyTorch, etc.) are supported. By calling `mlflow.autolog()`, Databricks automatically captures parameters, metrics, and the model itself when you call `.fit()`.
* **Manual Logging (Best for Production):** Provides granular control. You explicitly define what to save using `mlflow.<flavor>.log_model()`.

```python
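import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data so the example runs; replace with your own
X_train = pd.DataFrame({"views": [10, 200, 35, 410], "cart_adds": [0, 12, 1, 30]})
y_train = [0, 1, 0, 1]

# Standard manual logging pattern
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Log the model specifically for scikit-learn
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model_folder",
        input_example=X_train.iloc[[0]],  # Highly recommended for Unity Catalog
        # Optional: auto-registers the model. With Unity Catalog, use a
        # three-level name such as "<catalog>.<schema>.sales_predictor".
        registered_model_name="sales_predictor",
    )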

```

##### **2. Logged Model vs. Artifact**

It is important to distinguish between `log_artifact()` and `log_model()`:

* **`log_artifact()`**: Saves a specific file (e.g., a `.png` plot of feature importance or a `.csv` of predictions). It is plain file storage with no model metadata attached.
* **`log_model()`**: Saves a "smart" package. It includes an `MLmodel` configuration file, a `conda.yaml` for environment reproduction, and the model binary. This allows for **Model Serving** (turning the model into a REST API) with one click.

##### **3. Why Signatures Matter**

In modern Databricks (2025/2026), **Model Signatures** are critical. A signature defines the expected input and output data types (e.g., "Input: 3 Floats, Output: 1 Boolean").

* **Unity Catalog Requirement:** You cannot register a model in Unity Catalog without a signature.
* **The "Easy" Way:** Pass an `input_example` when logging (as shown in the code above). MLflow will automatically infer the signature from that example.

##### **4. Best Practices**

* **Log Dependencies:** Always check the `requirements.txt` generated by MLflow to ensure custom libraries (like a specific version of `pandas`) are included.
* **Use Unity Catalog:** Instead of the legacy Workspace Model Registry, log and register your models directly into **Unity Catalog** for better security and cross-workspace access.
* **Add Descriptions:** Add a description to each registered model version to explain what it does (e.g., "Updated with Q4 seasonal data"), either in the UI or with `MlflowClient.update_model_version()`.

##### **Summary Comparison**

| Feature | `log_artifact()` | `log_model()` |
| --- | --- | --- |
| **Object Type** | Generic files (images, logs, csvs) | Standardized MLflow Model |
| **Reproducibility** | Manual effort required | Automated (packages dependencies) |
| **Serving** | Not supported | Integrated with Databricks Serving |
| **Unity Catalog** | Cannot be registered | Can be registered as a model version (signature required) |


---

### **_4. MLflow UI_**

The **MLflow Tracking UI** is the visual command center for your machine learning lifecycle. In Databricks, it provides a centralized interface to visualize, search, and compare the "DNA" of your models—every parameter, metric, and code version you've ever logged.

##### **1. Key Components of the UI**

The UI is organized into a hierarchy that mirrors how you work:

* **Experiments Page:** A high-level list of all your projects. Each experiment acts as a folder containing related training attempts.
* **Runs Table:** A spreadsheet-like view of every individual training session. You can **filter**, **sort**, and **search** for runs by specific metrics (e.g., `accuracy > 0.9`).
* **Run Details Page:** Clicking on a run name opens a deep dive into that specific session. It includes:
    * **Parameters:** What "ingredients" were used (e.g., `learning_rate: 0.01`).
    * **Metrics:** How the "recipe" turned out (e.g., `F1-Score: 0.88`).
    * **Artifacts:** The resulting files, including the **Model binary**, **Conda environment**, and **Feature Importance plots**.
    * **Traceability:** A direct link to the exact **Notebook version** and **Git commit** that produced the results.


##### **2. Advanced Visual Comparison**

One of the UI's most powerful features is the **Chart View**. Instead of looking at raw numbers, you can select multiple runs and generate:

* **Parallel Coordinates Plots:** Visualize how different combinations of hyperparameters (like `depth` and `estimators`) impact your final performance.
* **Scatter Plots:** Compare two metrics (e.g., `Latency` vs. `Accuracy`) to find the most efficient model.
* **Bar Charts:** Side-by-side comparisons of key KPIs for candidate models.

##### **3. MLflow UI vs. Notebook Sidebar**

Databricks offers two ways to view your MLflow data:

| Feature | Notebook Experiment Sidebar | Full MLflow UI |
| --- | --- | --- |
| **Location** | Right-hand side of your notebook. | Left-hand sidebar under **Mosaic AI > Experiments**. |
| **Best For...** | Quick checks while coding. | Deep analysis and multi-run comparison. |
| **Functionality** | Shows recent runs for *that* notebook only. | Shows *all* runs across all notebooks/jobs. |
| **Visuals** | Basic list of parameters/metrics. | Advanced charts, search, and artifact browsing. |


##### **4. Accessing the UI in Databricks**

1. **Direct Sidebar:** In the left-hand navigation menu, click **Experiments** (the laboratory beaker icon).
2. **From a Notebook:** Click the **Experiment** icon in the top right of your notebook to open the sidebar, then click the "External Link" icon to jump to the full UI.
3. **From Unity Catalog:** If you've registered a model, the **Catalog Explorer** provides links back to the original MLflow run and experiment that created it.

---

### **Practice**

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```

```python
# Prepare data
# Note: toPandas() collects the whole table onto the driver, which can exhaust
# driver memory on a large table (a likely cause of the OOM error shown below).
# Selecting only the needed columns before converting reduces the load.
df = spark.table("workspace.gold.products").toPandas()
X = df[["views", "cart_adds"]]
y = df["purchases"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

```
The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.

Last messages on stderr:
Wed Jan 21 10:42:26 2026 Connection to spark from PID  2562
Wed Jan 21 10:42:27 2026 Initialized gateway on port 36113
Wed Jan 21 10:42:27 2026 Connected to spark.

Last messages on stdout:
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either
```

```python
# MLflow experiment
with mlflow.start_run(run_name="linear_regression_v1"):
    # Log parameters
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("test_size", 0.2)

    # Train
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Evaluate
    score = model.score(X_test, y_test)
    mlflow.log_metric("r2_score", score)

    # Log model
    mlflow.sklearn.log_model(model, "model")

print(f"R² Score: {score:.4f}")
```

----

### **_Resources_**

- [MLflow](https://docs.databricks.com/mlflow/)
- [MLflow with Databricks](https://youtu.be/ds__AEIqUfE?si=b3EB01LmOB5suO2v)

---