# Day 12: Machine Learning with MLflow

**Goal:** Train a model to predict user purchases and track the experiment using MLflow.
**The Data:** We are using the `gold_features` table created in Day 11, which contains:
* `price_log` (Log-Transformed Price)
* `is_weekend` (Binary Weekend Flag: 0 or 1)
* `hour` (Temporal Feature)
* `is_purchased` (Target: 0 or 1)

#### Setup & Data Loading
**Note:** Scikit-Learn runs on a single machine (not a cluster), so we convert our Spark DataFrame to Pandas.

In [0]:
import mlflow
import mlflow.sklearn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score


In [0]:
# 1. Load Data from Unity Catalog
# We limit to 100k rows to keep things fast for this demo since we are converting to Pandas
df_spark = spark.sql("SELECT * FROM ecommerce.silver.gold_features LIMIT 100000")

# 2. Convert to Pandas (Required for Scikit-Learn)
df = df_spark.toPandas()

print(f"Data Loaded into Pandas. Shape: {df.shape}")
display(df.head())

Data Loaded into Pandas. Shape: (100000, 18)


event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,ingestion_ts,source_file,event_date,price_tier,hour,day_of_week,is_weekend,price_log,is_purchased
2019-10-04T08:53:08.000Z,view,3601278,2053013563810775923,appliances.kitchen.washer,bosch,385.85,540858814,a7d6a230-fc37-4e6c-9305-a00857744568,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium,8,6,0,5.958037020995463,0
2019-10-04T08:54:11.000Z,view,1004777,2053013555631882655,electronics.smartphone,xiaomi,135.01,514503087,7c89bb78-8922-4860-bc69-e9943771c9d1,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium,8,6,0,4.912728412444662,0
2019-10-04T08:54:17.000Z,view,12600008,2053013554751078769,appliances.kitchen.grill,tefal,159.52,515957698,0a892879-4090-4bc6-9ec8-8c1d5cd2556f,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium,8,6,0,5.078418545398716,0
2019-10-04T08:54:34.000Z,view,13900345,2053013557343158789,construction.components.faucet,calorie,19.75,513647458,cb6ebf1a-4eb2-4286-b49f-3f0125b266b8,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,budget,8,6,0,3.0325462466767075,0
2019-10-04T08:55:15.000Z,view,4100346,2053013561218695907,,sony,390.95,552270784,551e751c-e40f-4d1c-aec1-9824cecab7a6,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium,8,6,0,5.971134280634731,0


#### Split Data (Train vs. Test)

In [0]:
# Define Features (X) and Target (y)
# We want to predict 'is_purchased' from three engineered features
X = df[["price_log", "hour", "is_weekend"]]
y = df["is_purchased"]

# Split into Training (80%) and Testing (20%) sets
# random_state=42 ensures we get the same split every time (reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Rows: {len(X_train)}")
print(f"Testing Rows:  {len(X_test)}")

Training Rows: 80000
Testing Rows:  20000
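
With a rare target, a purely random split can leave the train and test sets with slightly different purchase rates. One option is to pass `stratify` to `train_test_split`. The sketch below is only an illustration on synthetic data: the ~2% positive rate and the names `X_demo` and `y_demo` are assumptions, not values taken from `gold_features`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 2% positives (an assumption).
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.02).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo keeps the positive rate (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Overall positive rate: {y_demo.mean():.4f}")
print(f"Train positive rate:   {y_tr.mean():.4f}")
print(f"Test positive rate:    {y_te.mean():.4f}")
```

In the notebook itself this would amount to adding `stratify=y` to the existing call, which may change the exact rows in each split.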


#### 1: The Baseline Model
I am using **Logistic Regression**. Because the target is binary (0 or 1), it is an appropriate starting point: it estimates the probability of a purchase and thresholds it into a class label.

First, I test whether **`price_log` alone** is enough to predict a purchase.

In [0]:
# Define the Experiment Name (this organizes your runs in the UI)
mlflow.set_experiment("/Shared/Day_12_Purchase_Prediction")

run_name = "Run_1_Price_Only"

with mlflow.start_run(run_name=run_name):
    # Define Features for this specific run
    # We only use Price to see if it's a good predictor on its own
    features = ["price_log"]
    X_train_sub = X_train[features]
    X_test_sub = X_test[features]
    
    # Log Parameters (The "Ingredients")
    mlflow.log_param("features", features)
    mlflow.log_param("model_type", "LogisticRegression")
    
    # Train the Model
    model = LogisticRegression()
    model.fit(X_train_sub, y_train)
    
    # Evaluate
    predictions = model.predict(X_test_sub)
    accuracy = accuracy_score(y_test, predictions)
    
    # Log Metrics (The "Results")
    mlflow.log_metric("accuracy", accuracy)
    print(f"Run 1 Accuracy: {accuracy:.4f}")
    
    # Log the Model itself (The "Artifact")
    mlflow.sklearn.log_model(model, "model")

Run 1 Accuracy: 0.9821




#### 2: The Complex Model
Now, let's see if adding **Time Context** (`hour`, `is_weekend`) improves the model.
My Day 11 analysis showed that weekends have higher conversion rates, so I expect this model to perform better.

In [0]:
run_name = "Run_2_All_Features"

with mlflow.start_run(run_name=run_name):
    # 1. Use ALL features this time
    features = ["price_log", "hour", "is_weekend"]
    
    # 2. Log Parameters
    mlflow.log_param("features", features)
    mlflow.log_param("model_type", "LogisticRegression")
    
    # 3. Train
    model = LogisticRegression()
    model.fit(X_train[features], y_train)
    
    # 4. Evaluate
    predictions = model.predict(X_test[features])
    accuracy = accuracy_score(y_test, predictions)
    
    # 5. Log Metrics
    mlflow.log_metric("accuracy", accuracy)
    print(f"Run 2 Accuracy: {accuracy:.4f}")
    
    # 6. Log Model
    mlflow.sklearn.log_model(model, "model")

Run 2 Accuracy: 0.9821
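
Two different feature sets producing the same accuracy is worth comparing against a model that ignores the features entirely. A minimal sketch using scikit-learn's `DummyClassifier`, on synthetic labels with an assumed ~2% positive rate (the names `X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with ~2% positives (an assumption, not the real table).
rng = np.random.default_rng(0)
y_demo = (rng.random(20_000) < 0.02).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# "most_frequent" always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_preds)
print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
print(f"Positive predictions:    {int(baseline_preds.sum())}")
```

If a trained model's accuracy is no higher than this baseline's, the features are probably not contributing to its predictions.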




#### 3: Changing the Algorithm
Since Logistic Regression hit a ceiling (likely predicting "No" for everyone), I tried a **Decision Tree**.
* **Hypothesis:** A tree might capture non-linear rules (e.g., "If Weekend AND Price < $50, then Buy").
* **MLflow Feature:** I also logged a **Confusion Matrix** as an image artifact. This allows me to visually inspect *how* the model is making errors directly inside the MLflow UI, rather than just relying on a single accuracy number.

In [0]:
run_name = "Run_3_Decision_Tree"

with mlflow.start_run(run_name=run_name):
    # Define Features & Model Type
    features = ["price_log", "hour", "is_weekend"]
    model_type = "DecisionTreeClassifier"
    
    # Log Parameters (Key step for comparison later)
    mlflow.log_param("features", features)
    mlflow.log_param("model_type", model_type)
    mlflow.log_param("criterion", "gini")  # scikit-learn's default split criterion
    mlflow.log_param("max_depth", 5)       # Keep in sync with the model below
    
    # Train
    model = DecisionTreeClassifier(random_state=42, max_depth=5)
    model.fit(X_train[features], y_train)
    
    # Evaluate
    predictions = model.predict(X_test[features])
    accuracy = accuracy_score(y_test, predictions)
    
    # Log Metrics
    mlflow.log_metric("accuracy", accuracy)
    print(f"Run 3 Accuracy: {accuracy:.4f}")
    
    # Create Confusion Matrix
    cm = confusion_matrix(y_test, predictions)
    
    # Plot it
    plt.figure(figsize=(6,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix - {run_name}")
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    
    # Save plot to a file
    plot_filename = "confusion_matrix.png"
    plt.savefig(plot_filename)
    
    # Log that file to MLflow
    mlflow.log_artifact(plot_filename)
    
    # Log Model
    mlflow.sklearn.log_model(model, "model")
    
    # Close the figure so the notebook does not render it a second time
    plt.close()

Run 3 Accuracy: 0.9821
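
To check whether a tree really learned rules like the one in the hypothesis, scikit-learn's `export_text` prints the fitted splits as readable if/else branches. The sketch below runs on synthetic data with a made-up rule baked in (weekend and low `price_log`); the data, the rule, and the 4.0 threshold are all assumptions for illustration, not findings from this experiment.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in using the notebook's feature names (values are made up).
rng = np.random.default_rng(42)
n = 5_000
X_demo = pd.DataFrame({
    "price_log": rng.uniform(1.0, 7.0, n),
    "hour": rng.integers(0, 24, n),
    "is_weekend": rng.integers(0, 2, n),
})
# Hypothetical rule: weekend shoppers buy cheap items.
y_demo = ((X_demo["is_weekend"] == 1) & (X_demo["price_log"] < 4.0)).astype(int)

tree = DecisionTreeClassifier(random_state=42, max_depth=5)
tree.fit(X_demo, y_demo)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=list(X_demo.columns))
print(rules)
```

Applied to the Run 3 model, the same call would show whether any branch ends in class 1 at all.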




### The "Accuracy Paradox" Diagnosis
My experiments revealed a critical flaw in using **Accuracy** as a metric for e-commerce data.

**The Evidence (Run 3 - Decision Tree):**
* **Accuracy:** `98.21%` (Looks excellent)
* **Confusion Matrix:**
    * **True Negatives:** 19,642 (Correctly ignored non-buyers)
    * **True Positives:** **0** (Failed to identify a single buyer)
    * **False Negatives:** 357 (Missed every actual sale)
    * **False Positives:** 1 (Implied by the 20,000-row test set: 20,000 - 19,642 - 0 - 357)
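
The gap between accuracy and usefulness shows up when recall and precision are computed from the same counts. The sketch below uses the numbers reported above. The false-positive count is derived from the 20,000-row test set rather than read from the plot, so treat it as an inference.

```python
# Counts from Run 3; fp follows from the test-set size (20,000 - tn - tp - fn).
total = 20_000
tn, tp, fn = 19_642, 0, 357
fp = total - tn - tp - fn

accuracy = (tp + tn) / total
recall = tp / (tp + fn)  # share of real buyers the model found
# Guard against division by zero when the model predicts no positives at all.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"False positives (inferred): {fp}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
```

Accuracy stays high because it is dominated by the true negatives, while recall is zero because no actual buyer was identified.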

**Conclusion:**
The model is "lazy." Because roughly 98% of the rows are non-purchases, it minimizes its errors by predicting **0 (No Purchase)** for essentially every row.
* **MLflow Value:** Without logging the **Confusion Matrix artifact**, I would have just seen "98% Accuracy" and deployed a useless model.
* **Next Steps:** In the future, I must use **Class Weights** or **Oversampling (SMOTE)** to force the model to care about the rare "Purchase" events.
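
As a rough illustration of the class-weight idea, the sketch below trains the same `LogisticRegression` twice on synthetic imbalanced data, once with default settings and once with `class_weight="balanced"`, and compares recall. The dataset comes from `make_classification` with an assumed ~2% positive rate, so the numbers it prints say nothing about `gold_features`. SMOTE is not shown because it needs the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~2% positives); not the real gold_features table.
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.98, 0.02], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same algorithm twice; only the class weighting differs.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall without class weights: {recall_plain:.3f}")
print(f"Recall with class weights:    {recall_weighted:.3f}")
```

Class weighting usually trades precision and raw accuracy for recall, so a follow-up run would need to log all three metrics to MLflow for a fair comparison.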