## AML Assignment 2
### Shruti Sharma
### MDS202435

Model Version Control and Experiment Tracking

In [1]:
# importing necessary libraries
import pandas as pd
import mlflow
import mlflow.sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


In [2]:
# loading the data
train = pd.read_csv("train.csv")
val = pd.read_csv("validation.csv")

X_train = train["Message"]
y_train = train["Label"]

X_val = val["Message"]
y_val = val["Label"]

In [3]:
# vectorizing the text data using TF-IDF
vectorizer = TfidfVectorizer()

# The TF-IDF vectorizer is fitted on the training data and then used to transform both the training and validation datasets.
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

In [4]:
# Setting up MLflow experiment
mlflow.set_experiment("SMS_Spam_Classification")

<Experiment: artifact_location='file:c:/Users/SHRUTI/ShrutiWork/Assignment_2/mlruns/1', creation_time=1771157669001, experiment_id='1', last_update_time=1771157669001, lifecycle_stage='active', name='SMS_Spam_Classification', tags={}>

In [5]:
# Function to train model, log metrics, and register model in MLflow
def train_and_log_model(model, model_name):

    with mlflow.start_run(run_name=model_name):

        # Train model
        model.fit(X_train_vec, y_train)

        # Predict probabilities
        y_pred_prob = model.predict_proba(X_val_vec)[:, 1]

        # Calculate AUCPR
        aucpr = average_precision_score(y_val, y_pred_prob)

        # Log metric
        mlflow.log_metric("AUCPR", aucpr)

        # Log model
        mlflow.sklearn.log_model(model, model_name)

        # Register model
        mlflow.register_model(
            f"runs:/{mlflow.active_run().info.run_id}/{model_name}",
            model_name
        )

        print(f"{model_name} AUCPR:", aucpr)

        return aucpr

In [6]:
# Training and logging Logistic Regression model
lr = LogisticRegression(max_iter=1000)

aucpr_lr = train_and_log_model(lr, "LogisticRegression")

Successfully registered model 'LogisticRegression'.


LogisticRegression AUCPR: 0.9754596010357713


Created version '1' of model 'LogisticRegression'.


In [7]:
# Training and logging Random Forest model
rf = RandomForestClassifier(n_estimators=100)

aucpr_rf = train_and_log_model(rf, "RandomForest")

Successfully registered model 'RandomForest'.


RandomForest AUCPR: 0.9892091696819135


Created version '1' of model 'RandomForest'.


In [8]:
# Training and logging Decision Tree model
dt = DecisionTreeClassifier()

aucpr_dt = train_and_log_model(dt, "DecisionTree")

Successfully registered model 'DecisionTree'.


DecisionTree AUCPR: 0.7679889005502019


Created version '1' of model 'DecisionTree'.


In [9]:
# Model Comparison
print("\nModel Comparison")

print("Logistic Regression AUCPR:", aucpr_lr)
print("Random Forest AUCPR:", aucpr_rf)
print("Decision Tree AUCPR:", aucpr_dt)


Model Comparison
Logistic Regression AUCPR: 0.9754596010357713
Random Forest AUCPR: 0.9892091696819135
Decision Tree AUCPR: 0.7679889005502019



Experiment tracking and model version control were implemented with MLflow. Three benchmark models (Logistic Regression, Random Forest, and Decision Tree) were trained on TF-IDF features, evaluated on the validation set, and logged as separate runs. The models were compared by AUCPR, the area under the precision-recall curve, computed here with `average_precision_score`. This metric suits classification under class imbalance because it measures how well the positive class is ranked and is not inflated by a large number of easy negatives.
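As a rough illustration of how AUCPR behaves under class imbalance, the sketch below uses hypothetical toy labels and scores (not the assignment data). It compares average precision with ROC AUC for the same ranking and shows that an uninformative constant score gets an average precision equal to the share of positives.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical toy data: 2 positives among 10 examples (20% prevalence).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A scorer that ranks one negative (0.9) above one of the positives (0.6).
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.9, 0.2, 0.95, 0.6])

ap = average_precision_score(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# A constant score carries no ranking information at all.
ap_baseline = average_precision_score(y_true, np.zeros(len(y_true)))

print("Average precision:", ap)
print("ROC AUC:", auc)
print("Average precision of a constant scorer:", ap_baseline)
```

A single highly ranked false positive costs more in average precision than in ROC AUC here, which is why AUCPR is the more demanding metric when positives are rare.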

The AUCPR scores for the models are as follows:

- Logistic Regression AUCPR: 0.9755
- Random Forest AUCPR: 0.9892
- Decision Tree AUCPR: 0.7680

Among the three models, the Random Forest classifier achieved the highest AUCPR (0.9892), meaning it ranked spam above non-spam most reliably on the validation set. Logistic Regression was close behind (0.9755) and serves as a strong baseline. The single Decision Tree scored markedly lower (0.7680), which suggests that it generalizes less well than the other two models.
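One plausible contributor to the Decision Tree's lower AUCPR, offered here as an assumption and not something verified on the SMS data, is that a fully grown tree has pure leaves and therefore outputs only 0 or 1 as probabilities. That gives a very coarse ranking compared with a forest that averages many trees. The sketch below checks this on synthetic data from `make_classification`; the dataset and all parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the SMS dataset).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Default settings grow the tree until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_scores = tree.predict_proba(X_te)[:, 1]
forest_scores = forest.predict_proba(X_te)[:, 1]

# Count how many distinct score levels each model can use for ranking.
tree_levels = np.unique(tree_scores)
forest_levels = np.unique(forest_scores)

print("Distinct tree scores:", tree_levels)
print("Number of distinct forest scores:", len(forest_levels))
print("Tree AUCPR:", average_precision_score(y_te, tree_scores))
print("Forest AUCPR:", average_precision_score(y_te, forest_scores))
```

If this is the cause, limiting tree growth (for example with `max_depth` or `min_samples_leaf`) would give the tree graded probabilities and may narrow the gap, though that was not tested in this assignment.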

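All three models were tracked and registered in MLflow, which records each run's AUCPR and model artifact and assigns a version to every registered model (version 1 in each case). This supports experiment management and model version control. Note that the Random Forest and Decision Tree were trained without a fixed `random_state`, so rerunning the notebook may give slightly different scores. Based on the validation AUCPR, the Random Forest model is selected as the best-performing model and is recommended for further use or deployment.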