### 📘 Day 12 | MLflow Basics  
### Databricks 14-Day AI Challenge


#### 🎯 Objective
Understand how to track machine learning experiments using MLflow by logging parameters, metrics, models, and comparing multiple runs.


### Feature Engineering & Data Preparation
Preparing features and target variable for regression modeling.


In [0]:
from pyspark.sql import functions as F

df = spark.table("default.ecommerce_transactions")

df_ml = (
    df.select(
        "Purchase_Amount",
        "Age",
        F.dayofweek("Transaction_Date").alias("day_of_week"),
        F.month("Transaction_Date").alias("month")
    )
    .na.drop()  # drops rows with a null in any selected column, including Purchase_Amount
)

display(df_ml.limit(5))
print("Rows:", df_ml.count())

| Purchase_Amount | Age | day_of_week | month |
|---|---|---|---|
| 780.69 | 63 | 6 | 4 |
| 738.56 | 59 | 1 | 7 |
| 178.34 | 26 | 1 | 9 |
| 401.09 | 43 | 4 | 6 |
| 594.83 | 48 | 3 | 10 |


Rows: 50000


### Train-Test Split
Splitting data into training and testing sets for model evaluation.


In [0]:
from sklearn.model_selection import train_test_split

# Convert to pandas
pdf = df_ml.toPandas()

# Features & target
X = pdf[["Age", "day_of_week", "month"]]
y = pdf["Purchase_Amount"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape :", X_test.shape)

Train shape: (40000, 3)
Test shape : (10000, 3)
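
To make the roles of `test_size` and `random_state` concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-slice split. It is not scikit-learn's implementation (which uses NumPy's random generator and will select different rows); the helper name `seeded_split` is hypothetical and only illustrates that a fixed seed yields the same partition on every call.

```python
import random

def seeded_split(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a reproducible shuffled split.

    Hypothetical helper for illustration only; not scikit-learn's algorithm.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same permutation
    n_test = int(round(n_rows * test_size))   # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = seeded_split(50_000)
print(len(train_idx), len(test_idx))
```

Calling `seeded_split(50_000)` again returns the same two index lists, which is the property `random_state=42` provides in the cell above.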


### MLflow Experiment Setup
Setting the active MLflow experiment (it is created if it does not already exist) so that every model run is tracked under it.


In [0]:
import mlflow

# Build a per-user workspace path so the experiment lands in the current user's folder
current_user = spark.sql("select current_user()").first()[0]
mlflow.set_experiment(f"/Users/{current_user}/DAY_12_MLflow_Regression")

print("Experiment set")


Experiment set


### Model Training Run 1 (Baseline Regression)
Training a simple Linear Regression model and logging parameters, metrics, and model artifacts using MLflow.


In [0]:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Cast features to float64: when inferring a schema from the input example,
# MLflow warns that integer columns cannot represent missing values
X_train = X_train.astype("float64")
X_test  = X_test.astype("float64")

with mlflow.start_run(run_name="linear_regression_v1"):
    mlflow.log_param("model", "LinearRegression")
    mlflow.log_param("features", "Age,day_of_week,month")

    lr = LinearRegression().fit(X_train, y_train)
    r2 = lr.score(X_test, y_test)
    mlflow.log_metric("r2_score", r2)

    mlflow.sklearn.log_model(lr, "model", input_example=X_train.head(5))

print(f"Run completed | R2 Score: {r2:.4f}")


Run completed | R2 Score: -0.0006
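
A slightly negative R² can look like a bug, so it helps to recall the definition that `LinearRegression.score` reports: R² = 1 - SS_res / SS_tot. A model that always predicts the mean of the evaluated targets scores exactly 0, and anything worse goes negative. The sketch below uses made-up numbers (not the notebook's data) and a hypothetical `r2_score_manual` helper.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up targets for illustration
y_true = [100.0, 200.0, 300.0, 400.0]
mean_pred = [250.0] * 4   # always predict the mean of y_true
off_pred = [260.0] * 4    # a constant that misses the mean

print(r2_score_manual(y_true, mean_pred))  # 0.0
print(r2_score_manual(y_true, off_pred))   # slightly below zero
```

Read this way, a score of -0.0006 suggests that `Age`, `day_of_week`, and `month` carry essentially no linear signal for `Purchase_Amount` in this dataset: on the test set the fitted model does about as well as predicting a constant.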


### Feature Enhancement
Adding an `is_weekend` flag to test its impact on model performance. Spark's `dayofweek` returns 1 for Sunday and 7 for Saturday, so those two values mark the weekend.


In [0]:
from pyspark.sql import functions as F
from sklearn.model_selection import train_test_split

df2 = (
    spark.table("default.ecommerce_transactions")
    .select(
        "Purchase_Amount",
        "Age",
        F.dayofweek("Transaction_Date").alias("day_of_week"),
        F.month("Transaction_Date").alias("month"),
        F.when(F.dayofweek("Transaction_Date").isin([1,7]), 1).otherwise(0).alias("is_weekend")
    )
    .na.drop()
)

pdf2 = df2.toPandas()

X2 = pdf2[["Age", "day_of_week", "month", "is_weekend"]]
y2 = pdf2["Purchase_Amount"]

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

print("Train-test split:", X2_train.shape, X2_test.shape)

Train-test split: (40000, 4) (10000, 4)
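
The `isin([1,7])` check relies on Spark's `dayofweek` numbering, where 1 is Sunday and 7 is Saturday. Python's `datetime` uses different conventions, so the following sketch reproduces the Spark numbering in plain Python as a way to sanity-check the weekend flag. The helper names `spark_dayofweek` and `is_weekend` are hypothetical.

```python
from datetime import date

def spark_dayofweek(d):
    """Map a date to Spark's dayofweek convention (1 = Sunday ... 7 = Saturday)."""
    # isoweekday() gives Monday=1 ... Sunday=7, so shift Sunday to the front
    return d.isoweekday() % 7 + 1

def is_weekend(d):
    """Mirror of the notebook's flag: 1 for Saturday or Sunday, else 0."""
    return 1 if spark_dayofweek(d) in (1, 7) else 0

# Friday through Monday of the first week of January 2024
for d in (date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 8)):
    print(d.isoformat(), spark_dayofweek(d), is_weekend(d))
```

Because `is_weekend` is derived entirely from `day_of_week`, it adds a non-linear view of a column the model already sees rather than new information.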


### Model Training Run 2 (Enhanced Features)
Training a second regression model with an updated feature set and logging it as a new MLflow run.


In [0]:
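# Cast features to float64, as in run 1, to avoid MLflow's integer-column schema warning
X2_train = X2_train.astype("float64")
X2_test  = X2_test.astype("float64")

# `with` makes start_run a context manager, so the run is closed even if training fails
with mlflow.start_run(run_name="linear_regression_v2_weekend"):
    mlflow.log_param("model", "LinearRegression")
    mlflow.log_param("features", "Age,day_of_week,month,is_weekend")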
    lr2 = LinearRegression().fit(X2_train, y2_train)
    r2_v2 = lr2.score(X2_test, y2_test)
    mlflow.log_metric("r2_score", r2_v2)

    mlflow.sklearn.log_model(lr2, "model", input_example=X2_train.head(5))

print(f"Run 2 completed | R2 Score: {r2_v2:.4f}")


Run 2 completed | R2 Score: -0.0009
