**Module 1.3: Autologging with Scikit-learn**

---

## 🎯 **Learning Objectives Expanded**

### 1️⃣ **Enable and Use MLflow’s `autolog()` Function**

* **What it means:**
  MLflow's `autolog()` feature automatically logs key details from your model training process, such as parameters, metrics, and models, without needing explicit manual logging commands.

* **Detailed Steps:**

  * Import MLflow:

    ```python
    import mlflow.sklearn
    ```
  * Enable autologging:

    ```python
    mlflow.sklearn.autolog()
    ```
  * Once enabled, every call to `fit()` on a scikit-learn estimator in your session logs its parameters, training metrics, and the fitted model as an artifact.

* **Why it matters:**
  Autologging significantly reduces boilerplate code and helps you easily track experiments and model development.

---

### 2️⃣ **Train and Evaluate a Real-world Model (RandomForest on Housing Data)**

* **What it means:**
  Practically training and evaluating a popular machine learning model (RandomForest) on a real dataset (e.g., California housing dataset).

* **Detailed Steps:**

  * Load the dataset:

    ```python
    from sklearn.datasets import fetch_california_housing
    data = fetch_california_housing()
    X, y = data.data, data.target
    ```
  * Split dataset:

    ```python
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ```
  * Train a RandomForest model:

    ```python
    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X_train, y_train)
    ```
  * Evaluate performance:

    ```python
    from sklearn.metrics import mean_squared_error
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    ```

* **Why it matters:**
  Working with real-world data and popular models provides practical experience and demonstrates MLflow’s power in realistic scenarios.
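
The evaluation step reports MSE, while the notebook later prints RMSE, so it may help to see how the two relate. The sketch below works through the arithmetic on three made-up values (not housing data). Because the California housing target is expressed in units of $100,000, an RMSE of 0.68 on that dataset corresponds to a typical error of roughly $68,000.

```python
import math

# Made-up values purely for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared error per sample: (3-2)^2, (5-5)^2, (7-9)^2
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# MSE is the mean of the squared errors; RMSE is its square root,
# which puts the error back in the same units as the target.
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")  # MSE: 1.6667, RMSE: 1.2910
```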

---

### 3️⃣ **Automatically Log All Relevant Parameters, Metrics, Model Artifacts**

* **What it means:**
  With autologging enabled, parameters (like `n_estimators`), training metrics (e.g., training RMSE and R²), and the trained model files are captured and stored without explicit logging commands.

* **Detailed Explanation:**
  Once `mlflow.sklearn.autolog()` is enabled:

  * **Parameters** such as hyperparameters used in training (`max_depth`, `n_estimators`) are logged.
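  * **Metrics** computed on the training data (for a regressor: `training_score`, `training_root_mean_squared_error`, `training_r2_score`, `training_mean_absolute_error`) are recorded when `fit()` completes. Metrics you compute afterwards on test data are not logged under a name of your choosing; depending on the MLflow version, they may appear as post-training metrics with generated names.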
  * **Artifacts** including the trained model binary (e.g., joblib or pickle file) are stored in MLflow’s artifact storage.

* **Why it matters:**
  Automatic logging ensures you consistently track key details for reproducibility, easier comparison of experiments, and streamlined model governance.

---

### 4️⃣ **View and Compare Results without Manual Logging Code**

* **What it means:**
  Accessing and analyzing automatically logged runs to compare their results, parameters, and metrics using MLflow's UI or via code, without explicitly writing logging statements.

* **Detailed Steps:**

  * Using MLflow’s UI:

    ```bash
    mlflow ui
    ```

    * Open the URL printed in the terminal (by default `http://localhost:5000`) in your browser.
    * Select experiments, view metrics charts, and compare multiple runs visually.
  * Programmatically comparing runs:

    ```python
    import mlflow
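    runs_df = mlflow.search_runs(experiment_names=["experiment_name"])  # replace with your experiment's name
    # Autologged training metrics carry a "training_" prefix; sort by training RMSE (lower is better)
    runs_df.sort_values(by="metrics.training_root_mean_squared_error").head()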
    ```

    This DataFrame shows metrics and parameters logged automatically by autologging.

* **Why it matters:**
  Easy comparison without manual intervention streamlines your workflow, saving significant time and effort.



```python
# 📓 Module 1.3: Autologging with Scikit-learn
# Goal: Learn how to use MLflow autologging to automatically capture parameters, metrics, model, and artifacts during training

# ✅ Step 1: Install necessary packages
!pip install -q mlflow scikit-learn

# ✅ Step 2: Import required libraries
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# ✅ Step 3: Load dataset and split into training and test sets
# We'll use the California housing dataset as a real-world regression task
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# ✅ Step 4: Set the experiment to group related runs
mlflow.set_experiment("autologging-random-forest")

# ✅ Step 5: Enable autologging
# This enables automatic tracking of parameters, training metrics, and the fitted model
mlflow.sklearn.autolog()

# ✅ Step 6: Train and evaluate the model within an MLflow run context
with mlflow.start_run():
    # Train a Random Forest Regressor
    model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate performance on the test set
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print(f"Test RMSE: {rmse:.4f}")

    # Note: We do not need to call mlflow.log_param, mlflow.log_metric, or mlflow.sklearn.log_model
    # for the parameters, training metrics, and model; autologging captures these

    # Print Run ID for reference
    print("Run ID:", mlflow.active_run().info.run_id)

# ✅ Step 7: Compare and review runs
# You can view details in the MLflow UI or search programmatically
runs_df = mlflow.search_runs(experiment_names=["autologging-random-forest"])
# Show a compact view if a "test_rmse" metric was logged; otherwise fall back to the first rows
runs_df[["run_id", "metrics.training_score", "metrics.test_rmse"]] if "metrics.test_rmse" in runs_df else runs_df.head()
```

2025/07/31 15:05:21 INFO mlflow.tracking.fluent: Experiment with name 'autologging-random-forest' does not exist. Creating a new experiment.


Test RMSE: 0.6803
Run ID: 9d74c4bab7d84bf595282a078d3faf78


Row 0 of `runs_df`, shown one column per line (elided columns and truncated values are as displayed in the notebook):

| Column | Value |
| --- | --- |
| run_id | 9d74c4bab7d84bf595282a078d3faf78 |
| experiment_id | 563453553747295585 |
| status | FINISHED |
| artifact_uri | file:///content/mlruns/563453553747295585/9d74... |
| start_time | 2025-07-31 15:05:21.981000+00:00 |
| end_time | 2025-07-31 15:05:34.522000+00:00 |
| metrics.training_score | 0.681887 |
| metrics.training_root_mean_squared_error | 0.652109 |
| metrics.training_r2_score | 0.681887 |
| metrics.training_mean_absolute_error | 0.476699 |
| ... | ... |
| params.max_features | 1.0 |
| params.monotonic_cst | |
| params.bootstrap | True |
| params.oob_score | False |
| tags.mlflow.source.name | /usr/local/lib/python3.11/dist-packages/colab_... |
| tags.mlflow.source.type | LOCAL |
| tags.estimator_class | sklearn.ensemble._forest.RandomForestRegressor |
| tags.mlflow.user | root |
| tags.mlflow.runName | sedate-calf-684 |
| tags.estimator_name | RandomForestRegressor |


## 📝 Assessment: Autologging with Scikit-learn   

### 📘 Multiple Choice (Choose the best answer)

**1. What does `mlflow.sklearn.autolog()` do?**   
A. Logs only model parameters   
**B. Automatically logs parameters, metrics, and the model artifact** ✅   
C. Launches the MLflow UI      
D. Logs metrics only during model evaluation      

---

**2. Do you still need to call `mlflow.log_param()` or `mlflow.sklearn.log_model()` when using `autolog()`?**   
A. Yes, always   
B. Only for custom metrics   
**C. No, MLflow handles it automatically** ✅   
D. Only when using PyTorch   

---

**3. Which call does the notebook still use, even with autologging enabled, to wrap training and evaluation in a single run?**   
A. `mlflow.save_model()`   
**B. `mlflow.start_run()`** ✅   
C. `mlflow.ui()`   
D. `mlflow.create_project()`   

---

**4. What types of information are typically captured by MLflow autologging for sklearn models?**   
A. Model architecture only   
**B. Parameters, metrics, model artifacts, and sometimes feature importance** ✅   
C. Just training accuracy   
D. Only hyperparameters and plots   

---

### ✏️ Short Answer   

**5. What are the benefits of using `mlflow.sklearn.autolog()` over manual logging?**   
*Mention reduced boilerplate code, automatic capture of metrics/parameters, and consistency.*   

---

**6. In which scenarios might you still prefer manual logging even with autolog enabled?**   
*Hint: Custom metrics, additional logic, or external artifacts.*   

---

### 🧪 Mini Project   

**7. Task:**   
Modify the notebook to try different values for `max_depth` (e.g., 3, 5, 7) for the `RandomForestRegressor`.   

* Run three experiments with different values   
* Use `autolog()` (do not manually log anything)   
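* Use `mlflow.search_runs()` to compare `metrics.training_score` across runs, together with the test RMSE each run prints (check `runs_df.columns` to see whether your MLflow version also logged a test-set metric)   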
* Determine which value performed best on the test set   
