# **PHASE 4: AI & ML (Days 12-14)**

## **DAY 13 (21/01/26) - Model Comparison & Feature Engineering**

### **Section 1 - Learn**:

### **_1. Training multiple models_**

Training multiple models in Databricks can range from testing a few different algorithms to training thousands of models for different "groups" (like a separate sales model for every store). Depending on your scale, you should choose one of the following three patterns.

##### **1. Hyperparameter Tuning (Parallel Search)**

If you are training one type of model but want to find the **best settings** (e.g., testing 100 different `max_depth` values), use a tuning library that integrates with Spark.

* **Optuna / Ray Tune (Modern Standard):** These libraries distribute trials across the workers of your cluster. With their MLflow integrations, each trial is logged to **MLflow** as a run, so you can see which run won.
* **SparkTrials (Hyperopt):** A legacy but still common way to use Spark to run many small model trainings in parallel.

> **Key Setting:** With `SparkTrials`, size the `parallelism` parameter to the CPU capacity of your cluster so that trials run concurrently instead of one by one.

##### **2. Grouped Model Training (The "Many Models" Pattern)**

If you need to train a separate model for every **Region**, **Store**, or **Product ID**, use the **Pandas UDF (User Defined Function)** pattern. This is "embarrassingly parallel" and highly efficient.

* **How it works:** Spark splits your giant table into "groups." Each group is sent to a worker node as a **Pandas DataFrame**.
* **The Logic:** Your worker node trains the model (using `scikit-learn`, `prophet`, etc.) and logs the results.
* **Scale:** This method can be used to train **thousands of models** in minutes.

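```python
import mlflow
import pandas as pd
from sklearn.linear_model import LinearRegression

# Example: training a separate model for each 'store_id'.
# The feature/target columns ('views', 'sales') and the model are placeholders.
def train_store_model(pdf: pd.DataFrame) -> pd.DataFrame:
    store_id = pdf["store_id"].iloc[0]
    with mlflow.start_run(run_name=f"Store_{store_id}", nested=True):
        model = LinearRegression().fit(pdf[["views"]], pdf["sales"])
        mlflow.sklearn.log_model(model, "model")
    return pd.DataFrame({"store_id": [store_id], "status": ["Success"]})

# Apply in parallel across the cluster (one function call per store_id group)
results = (df.groupBy("store_id")
             .applyInPandas(train_store_model, schema="store_id long, status string"))

# applyInPandas is lazy: an action is needed to actually run the training
results.show()
```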

##### **3. Automated Model Selection (AutoML)**

If you don't want to write any training code, use **Databricks AutoML**.

* **Process:** You point it at a table and tell it what you want to predict.
* **Results:** It will automatically try dozens of different algorithms (XGBoost, LightGBM, Random Forest) with different parameters.
* **Transparency:** It generates a **source-code notebook** for every model it tests, so you can see exactly how the "best" model was built and customize it.

##### **Choosing Your Strategy**

| Scenario | Recommended Approach | Library/Tool |
| --- | --- | --- |
| **Optimize one model** | Hyperparameter Tuning | Optuna or Ray Tune |
| **Train 100+ distinct models** | Grouped Map UDF | `applyInPandas()` |
| **Don't know which model is best** | AutoML | Databricks AutoML UI |
| **Deep Learning (Multi-node)** | Distributed Training | `TorchDistributor` / Horovod |


---

### **_2. Hyperparameter tuning_**

Hyperparameter tuning is the process of finding the optimal settings (hyperparameters) for your machine learning model to maximize performance. Unlike model weights, which are learned during training, hyperparameters must be set beforehand.

In Databricks, hyperparameter tuning is primarily executed using **Optuna** or **Ray Tune** for modern workloads, while **Hyperopt** remains common for legacy Spark-native tuning.

##### **1. Core Tuning Strategies**

| Strategy | Logic | Efficiency |
| --- | --- | --- |
| **Grid Search** | Tests every possible combination in a fixed grid. | **Low** (Brute force) |
| **Random Search** | Picks random combinations from a range. | **Medium** (Better for large spaces) |
| **Bayesian Optimization** | Uses previous results to predict which settings will work best next. | **High** (Smart & fast) |
| **Hyperband** | Starts many configurations on a small budget (few epochs or little data) and stops the losers early. | **Very High** (Budget-friendly) |
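
The strategies in the table can be compared on a toy problem. The sketch below is a minimal, standard-library illustration, not how Optuna or Hyperopt implement them: `toy_score` is a made-up stand-in for a validation metric, and `successive_halving` shows only the core loop that Hyperband builds on. Bayesian optimization is left out because it needs a surrogate model.

```python
import itertools
import math
import random


def toy_score(learning_rate: float, max_depth: int) -> float:
    """Hypothetical validation score; it peaks at learning_rate=0.01, max_depth=6."""
    return -((math.log10(learning_rate) + 2) ** 2) - ((max_depth - 6) / 4) ** 2


def grid_search(learning_rates, depths):
    """Grid search: evaluate every combination in the fixed grid."""
    return max(itertools.product(learning_rates, depths), key=lambda p: toy_score(*p))


def random_search(n_trials, rng):
    """Random search: evaluate n_trials random combinations from the ranges."""
    candidates = [(10 ** rng.uniform(-5, -1), rng.randint(2, 32)) for _ in range(n_trials)]
    return max(candidates, key=lambda p: toy_score(*p))


def successive_halving(candidates, evaluate, min_budget=1, eta=2):
    """Score all candidates on a small budget, keep the best 1/eta,
    then repeat with eta times the budget until one candidate is left."""
    budget = min_budget
    survivors = list(candidates)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


rng = random.Random(0)
best_grid = grid_search([1e-4, 1e-3, 1e-2, 1e-1], [2, 6, 10, 14])
best_random = random_search(20, rng)

pool = [(10 ** rng.uniform(-5, -1), rng.randint(2, 32)) for _ in range(8)]
# This toy evaluate() ignores the budget; a real one would train for `budget` epochs.
best_halving = successive_halving(pool, lambda c, budget: toy_score(*c))

print("grid:", best_grid)
print("random:", best_random)
print("successive halving:", best_halving)
```

Grid search only finds the optimum here because the grid happens to contain it; random search and successive halving trade that guarantee for coverage of a much larger space on the same budget.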

##### **2. Optuna + MLflow (The Modern Standard)**

As of 2026, **Optuna** is the recommended tool for tuning in Databricks. It integrates deeply with **MLflow** using the `MlflowStorage` class, allowing you to use your Databricks tracking server as the database for your search.

###### **Key Feature: Parallel Tuning with `MlflowSparkStudy`**

You can distribute your tuning trials across your entire Spark cluster to finish in minutes instead of hours.

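```python
import mlflow
import optuna
from mlflow.optuna.storage import MlflowStorage
from mlflow.pyspark.optuna.study import MlflowSparkStudy

def objective(trial):
    # Define the search space
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 2, 32)

    # ... Train and evaluate your model here ...
    # Optuna studies minimize by default: return a loss (e.g. 1 - accuracy)
    # unless the study is configured to maximize.
    return accuracy

# Use an MLflow experiment as the Optuna storage backend
# (assumes MLflow 3.x; the experiment path is a placeholder)
experiment_id = mlflow.set_experiment("/Shared/my-tuning-exp").experiment_id
storage = MlflowStorage(experiment_id=experiment_id)

# Run 50 trials in total, up to 4 at a time, on Spark workers
study = MlflowSparkStudy(study_name="my-tuning-exp", storage=storage)
study.optimize(objective, n_trials=50, n_jobs=4)
```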

##### **3. Scaling with SparkTrials (Hyperopt)**

For teams using the **Hyperopt** library, Databricks provides the `SparkTrials` class. This class tells Hyperopt to distribute each "trial" (individual model training) as a separate Spark job.

* **Best for:** Tuning single-node libraries like `scikit-learn` or `XGBoost` across a large cluster.
* **Automatic Tracking:** If you have MLflow autologging enabled, every trial is automatically recorded as a "child run" in the MLflow UI.

##### **4. Best Practices for Tuning**

* **Logarithmic Scales:** For parameters like `learning_rate` or `regularization`, always use a **logarithmic scale** (e.g., `1e-5` to `1e-1`) rather than a linear one to explore different orders of magnitude.
* **Pruning:** Use Optuna's **MedianPruner**. It stops a trial early when its intermediate result is worse than the median of earlier trials at the same step, which saves compute cost.
* **Use GPU Clusters sparingly:** If your individual trial doesn't *require* a GPU (e.g., small Random Forest), use a CPU-based cluster for tuning to save money. Save GPUs for the final training of the "best" model.
* **Parallelism vs. Adaptiveness:**
    * **High Parallelism:** Faster results, but the "Bayesian" brain has less time to learn from previous results.
    * **Low Parallelism:** Slower, but usually finds a slightly more optimal model because each step is more informed.
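
Two of the practices above, logarithmic sampling and median-based pruning, can be sketched in plain Python. This is an illustrative simplification rather than Optuna's implementation: `sample_log_uniform` and `should_prune` are hypothetical helper names, and the pruning rule assumes that higher intermediate values are better.

```python
import math
import random
import statistics


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def should_prune(current_value: float, previous_values_at_step: list,
                 n_startup_trials: int = 5) -> bool:
    """Median rule (higher is better): prune when the current intermediate value
    is below the median of earlier trials' values at the same step."""
    if len(previous_values_at_step) < n_startup_trials:
        return False  # not enough history yet to judge
    return current_value < statistics.median(previous_values_at_step)


rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(10_000)]

# On a log scale about half of the draws land below 1e-3;
# linear sampling over the same range would put roughly 1% there.
share_below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"share of samples below 1e-3: {share_below_1e3:.2f}")

history = [0.70, 0.80, 0.75, 0.90, 0.72]  # hypothetical accuracies at epoch 3
print(should_prune(0.60, history))  # True: below the median of 0.75
print(should_prune(0.80, history))  # False: above the median
```

The startup threshold mirrors the idea that a pruner should wait for a few completed trials before it starts cutting, since an early median is unreliable.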


---

### **_3. Feature importance_**

Feature importance tells you which variables have the most "influence" on your model's predictions. In Databricks, there are three main ways to calculate and visualize this, ranging from fast built-in metrics to mathematically rigorous "game theory" approaches.

##### **1. Built-in (Model-Specific) Importance**

Most tree-based models (like **Random Forest**, **XGBoost**, or **GBT**) automatically calculate feature importance during training based on **Gini Impurity** or **Gain**.

* **Logic:** It sums how much each feature reduced impurity (Gini or entropy) or improved the loss ("gain") across all splits in all trees, then normalizes the totals.
* **Pros:** Extremely fast; zero extra computation required.
* **Cons:** Can be biased toward high-cardinality features (like IDs or timestamps) even if they aren't truly predictive.

```python
# For Spark ML models
importance_scores = model.featureImportances
print(importance_scores)

# For XGBoost
import matplotlib.pyplot as plt
from xgboost import plot_importance
plot_importance(xgb_model)
plt.show()

```
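
To make the "impurity reduction" behind these scores concrete, here is a minimal sketch of the quantity computed for a single split. The labels and the two candidate splits are made up for illustration. Tree libraries accumulate this kind of decrease per feature over every split and then normalize, which this sketch does not do.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical split on feature A: separates the two classes perfectly
split_a = ([0, 0, 0, 0], [1, 1, 1, 1])
# Hypothetical split on feature B: both children stay as mixed as the parent
split_b = ([0, 0, 1, 1], [0, 0, 1, 1])

decrease_a = impurity_decrease(parent, *split_a)
decrease_b = impurity_decrease(parent, *split_b)
print(decrease_a)  # 0.5
print(decrease_b)  # 0.0
```

Feature A would be credited with the full 0.5 reduction for this node, while feature B would earn nothing.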

##### **2. Permutation Feature Importance (Model-Agnostic)**

This is a more robust technique that works for **any** model (not just trees).

* **Logic:** You take a single feature and "shuffle" its values (permute them) while leaving the others untouched. If the model's performance crashes, that feature was important. If performance stays the same, the model wasn't really using that feature.
* **Pros:** Reliable; avoids the "high-cardinality bias" of built-in methods.
* **Cons:** Computationally expensive because it requires re-evaluating the model many times.
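
The shuffle-and-rescore logic can be sketched with the standard library alone. In this illustration the "model" is a hypothetical `predict` function that uses only the first feature, and importance is measured as the average drop in R² over a few shuffles. For real models, scikit-learn ships a ready-made version as `sklearn.inspection.permutation_importance`.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in R² when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling the feature the model relies on destroys the score.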

##### **3. SHAP Values (The "Gold Standard")**

**SHAP (SHapley Additive exPlanations)** is the modern standard for model interpretability in Databricks. It uses game theory to assign a "fair" contribution value to each feature for every single prediction.

* **Global Importance:** Shows which features matter most across the entire dataset.
* **Local Importance:** Explains **why** a specific customer was predicted to churn (e.g., "The high monthly bill increased the risk, while the long tenure decreased it").

```python
import shap
# Use TreeExplainer for fast results on XGBoost/RandomForest
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize the top features
shap.summary_plot(shap_values, X_test)

```
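
The "fair contribution" idea can be made concrete with a brute-force calculation on a tiny model. The sketch below is not what `shap.TreeExplainer` does internally (it uses a tree-specific algorithm). It averages each feature's marginal contribution over every ordering in which features can be switched from a baseline value to the customer's value. The churn-risk function, the customer, and the baseline are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.
    A feature that is 'absent' takes its baseline value."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i on
            new_output = predict(current)
            contributions[i] += new_output - previous_output
            previous_output = new_output
    return [c / len(orderings) for c in contributions]


def predict(features):
    """Hypothetical churn-risk score with an interaction between bill and support calls."""
    monthly_bill, tenure_years, support_calls = features
    return 0.01 * monthly_bill - 0.05 * tenure_years + 0.02 * support_calls * (monthly_bill > 80)


customer = [120.0, 6.0, 3.0]   # high bill, long tenure, several support calls
baseline = [60.0, 2.0, 1.0]    # reference customer

phi = shapley_values(predict, customer, baseline)
for name, value in zip(["monthly_bill", "tenure_years", "support_calls"], phi):
    print(f"{name}: {value:+.3f}")

# Additivity: the contributions sum to the gap between the two predictions
print(sum(phi), predict(customer) - predict(baseline))
```

The high monthly bill pushes the risk up and the long tenure pulls it down, and the contributions add up to the difference between this customer's score and the baseline score. The cost grows factorially with the number of features, which is why practical explainers rely on model-specific shortcuts or sampling.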

##### **Summary Table**

| Method | Speed | Rigor | Use Case |
| --- | --- | --- | --- |
| **Built-in** | Instant | Low | Quick sanity check during development. |
| **Permutation** | Slow | Medium | Verifying feature selection for non-tree models. |
| **SHAP** | Moderate | **High** | Production explainability and business stakeholder reporting. |

##### **Pro-Tip: Scaling SHAP with Spark**

For massive datasets, calculating SHAP values can be slow. You can use a **Pandas UDF** in Databricks to distribute the SHAP calculations across all nodes in your cluster, making it possible to explain millions of rows in minutes.


---

### **_4. Spark ML Pipelines_**

A **Spark ML Pipeline** is a high-level API that allows you to combine multiple data transformers and a machine learning estimator into a single, cohesive workflow.

Think of it like an **assembly line**: raw data enters at one end, passes through various cleaning and feature engineering stations, and emerges as a trained model at the other.

##### **1. Core Components**

A pipeline consists of a sequence of "stages," which are typically one of two types:

* **Transformers:** Algorithms that convert one DataFrame into another (e.g., `StringIndexer`, `VectorAssembler`, or a trained Model). They implement the `.transform()` method.
* **Estimators:** Algorithms that "learn" from data to produce a Transformer (e.g., `LogisticRegression` or `RandomForestClassifier`). They implement the `.fit()` method, which returns the fitted model.
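
The Estimator/Transformer contract can be illustrated without Spark. The classes below are hypothetical stand-ins that work on lists of dicts rather than DataFrames; they only mirror the shape of the API: an estimator's `fit()` returns a transformer, and a pipeline fits its stages in order while passing the transformed data along.

```python
from collections import Counter


class IndexerModelSketch:
    """Transformer: applies an already-learned mapping via transform()."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class IndexerSketch:
    """Estimator: fit() learns a label-to-index mapping and returns a Transformer."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label gets index 0; ties are broken alphabetically
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: i for i, label in enumerate(ordered)}
        return IndexerModelSketch(self.input_col, self.output_col, mapping)


class PipelineSketch:
    """Fits each estimator in order, transforming the data between stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return fitted  # stands in for a fitted PipelineModel


train = [{"category": "b"}, {"category": "a"}, {"category": "b"}]
fitted_stages = PipelineSketch([IndexerSketch("category", "categoryIndex")]).fit(train)

test = [{"category": "a"}]
for stage in fitted_stages:
    test = stage.transform(test)
print(test)  # [{'category': 'a', 'categoryIndex': 1}]
```

The mapping learned on the training rows is reused unchanged on the test row, which is the property that keeps training and inference preprocessing consistent.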

##### **2. Why Use Pipelines?**

* **Consistency:** Pipelines ensure that the exact same preprocessing steps applied to your **Training** data are applied to your **Test** or **Production** data, preventing "training-serving skew."
* **Simplified Code:** Instead of managing 10 different intermediate DataFrames, you manage one `Pipeline` object.
* **Tuning:** You can pass the entire Pipeline into a cross-validator to find the best hyperparameters for the preprocessing steps and the model simultaneously.

##### **3. Typical Pipeline Structure**

A standard pipeline often follows this sequence:

1. **StringIndexer:** Converts categorical strings to numbers.
2. **OneHotEncoder:** Converts those numbers into binary vectors.
3. **VectorAssembler:** Combines all feature columns into a single "features" vector.
4. **Estimator:** The actual ML algorithm (e.g., `DecisionTreeClassifier`).

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Define the stages
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
assembler = VectorAssembler(inputCols=["categoryIndex", "age", "income"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Construct the pipeline
pipeline = Pipeline(stages=[indexer, assembler, rf])

# Train the entire assembly line with one command
model = pipeline.fit(train_df)

# Make predictions (the model 'remembers' the indexing and assembly steps)
predictions = model.transform(test_df)

```

##### **4. Persistence (Saving/Loading)**

One of the biggest advantages is that you can save the entire pipeline to disk. This includes the string mappings and the vector logic.

* **Save:** `model.save("s3://my-bucket/models/rf_pipeline")`
* **Load:** `loaded_model = PipelineModel.load("s3://my-bucket/models/rf_pipeline")` (after `from pyspark.ml import PipelineModel`)

##### **5. Best Practices**

* **Pipeline vs. PipelineModel:** Use `Pipeline` for the untrained workflow and `PipelineModel` for the workflow that has already been "fitted" to data.
* **Keep it Modular:** If you have very heavy preprocessing (like cleaning 100TB of raw logs), do that in a separate **Silver-to-Gold** ETL job rather than inside the ML pipeline.
* **Inference Tables:** When deploying a Pipeline in Databricks Model Serving, use **Inference Tables** to log the endpoint's requests and responses, so you can monitor the deployed pipeline's inputs and predictions over time.

---

### **Practice**

In [0]:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [0]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [0]:
# Prepare data
df = spark.table("workspace.gold.products").limit(10000).toPandas()
X = df[["views", "cart_adds"]]
y = df["purchases"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.

Last messages on stderr:
Wed Jan 21 12:55:39 2026 Connection to spark from PID  2587
Wed Jan 21 12:55:39 2026 Initialized gateway on port 43459
Wed Jan 21 12:55:39 2026 Connected to spark.

Last messages on stdout:
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either

In [0]:
models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5),
    "random_forest": RandomForestRegressor(n_estimators=100)
}

for name, model in models.items():
    with mlflow.start_run(run_name=f"{name}_model"):
        mlflow.log_param("model_type", name)

        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)

        mlflow.log_metric("r2_score", score)
        mlflow.sklearn.log_model(model, "model")

        print(f"{name}: R² = {score:.4f}")

In [0]:
# Spark ML Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression as SparkLR

assembler = VectorAssembler(inputCols=["views","cart_adds"], outputCol="features")
lr = SparkLR(featuresCol="features", labelCol="purchases")
pipeline = Pipeline(stages=[assembler, lr])

spark_df = spark.table("workspace.gold.products")  # same table as the scikit-learn cell above
train, test = spark_df.randomSplit([0.8, 0.2])
model = pipeline.fit(train)

### **_Resources_**

- [Spark ML](https://spark.apache.org/docs/latest/ml-classification-regression.html)

---