
# MLOps - Experiment Tracking
We are building a quick prototype model (a PoC) for our objective. Scope of this stage:

- Experiment Tracking Using MLflow
- Model Management - :/
- No Model Registry or Deployment

We will learn about:
- How to track experiments using MLflow
   - How to start an MLflow tracking server backed by a metadata store and an artifact store
   - How to log parameters, metrics, artifacts and models for a single run, and for multiple runs during hyperparameter tuning
   - How to use autologging

### Autologging
Autologging takes time to run because it automatically logs all parameters, metrics, artifacts and models. You can choose what gets logged through the arguments of the autolog function (for example `log_models` or `log_input_examples`). Alternatively, skip autologging and log only the items you want inside a `mlflow.start_run()` context manager.

### Model Logging
- Using `mlflow.log_artifact()` to log the model:
    - Save the model in binary format using pickle and then log the file to MLflow.
    - The artifact carries no other information about the model, i.e. no model signature and no environment information needed to run it.
- Using `mlflow.sklearn.log_model()`, `mlflow.xgboost.log_model()` or another framework-specific flavor to log the model:
    - This logs the model together with the information required to run it, i.e. model signature, environment information, etc.
    - It also stores the model in a format that can be used for deployment.


### Experiments:
- From the No-MLOps version:
    - Outliers in trip distance and trip duration throw off the linear regression model: it predicts nearly the same duration (about 16 minutes) for almost all trips, which gives an RMSE of around 10 minutes.
    - Removing outliers brings the RMSE down to around 6.5 minutes for both train and test.
    - Adding pick-up and drop-off location to the model reduces the train and test RMSE to 5.5 minutes.
    - Adding the combination of pick-up and drop-off location improves the RMSE to 4.9-5 minutes.
- Hyperparameter tuning of the XGBoost model

### EDA
- A few trips have dates outside the range covered by the file
- Passenger count is null for many rows; it is unclear what that means
- No missing values for pickup location or drop-off location
- Trips from VendorID=7 seem to have duration=0
- Trip distance also has outliers; after removing them, the correlation between duration and distance becomes significantly higher.
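
The last point can be illustrated on synthetic data. This is only a toy sketch, not the taxi data: the distances, the slope of 3 minutes per distance unit and the five corrupted rows are all made up, and the 0.2 to 100 distance bounds are borrowed from `filter_outliers` in the Utilities cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "clean" trips: duration grows roughly linearly with distance
distance = rng.uniform(0.5, 20, size=1000)
duration = 3 * distance + rng.normal(0, 2, size=1000)

# A few corrupted rows with absurd distances and short durations
bad_distance = np.full(5, 50_000.0)
bad_duration = np.full(5, 10.0)

trips = pd.DataFrame({
    "trip_distance": np.concatenate([distance, bad_distance]),
    "trip_duration": np.concatenate([duration, bad_duration]),
})

corr_raw = trips["trip_distance"].corr(trips["trip_duration"])

# Same distance bounds as filter_outliers
mask = trips["trip_distance"].between(0.2, 100)
clean = trips[mask]
corr_clean = clean["trip_distance"].corr(clean["trip_duration"])

print(f"correlation with outliers:    {corr_raw:.3f}")
print(f"correlation without outliers: {corr_clean:.3f}")
```

Only five bad rows out of 1005 are enough to hide an otherwise strong linear relationship, because their distance values dominate the variance.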

#### The notebook runs on the default ipykernel. To connect it to the conda environment:
- First install ipykernel in the environment: activate the env and run `conda install -c anaconda ipykernel` on the command line
- Then attach this kernel to Jupyter notebooks: `python -m ipykernel install --user --name=env_name`

In [1]:

#!pip install mlflow

In [2]:
#!pip install protobuf==3.20.01

In [3]:
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import os
import numpy as np

In [4]:
os.getcwd()

'/Users/vss/Personal/Git/mlops-zoomcamp/02-experiment-tracking'

### Setting the MLflow metadata store using mlflow.set_tracking_uri
The following line does not start a tracking server. It only tells the MLflow client library which database URI to use for metadata tracking.

> `mlflow.set_tracking_uri("sqlite:///mlflow.db")`

If you want to see the UI or serve it to remote machines, run:

> `mlflow server --backend-store-uri sqlite:///mlflow.db`

### But it is better to run an MLflow server

> `mlflow server \
--backend-store-uri sqlite:///backend.db \
--default-artifact-root ./artifacts \
--host 0.0.0.0 \
--port 5000`


In [5]:
import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("NYC_Taxi_Duration_Prediction_Exp")

<Experiment: artifact_location='/Users/vss/Personal/Git/mlops-zoomcamp/02-experiment-tracking/artifacts/1', creation_time=1752145432906, experiment_id='1', last_update_time=1752145432906, lifecycle_stage='active', name='NYC_Taxi_Duration_Prediction_Exp', tags={}>

## Data Files

In [6]:

data_dir = "/Users/vss/Personal/Git/NYC_Taxi_Trip_Duration_Prediction/data/raw_data/"
file_2025_03 = "{}yellow_tripdata_2025-03.parquet".format(data_dir)
file_2025_04 = "{}yellow_tripdata_2025-04.parquet".format(data_dir)
file_2025_05 = "{}yellow_tripdata_2025-05.parquet".format(data_dir)


### Utilities

In [7]:

def load_parquet_data(file: str) -> pd.DataFrame:
    '''
    Loads parquet data
    '''
    print("Loading data from: ", file)
    data = pd.read_parquet(file)
    print("\n======= SIZE =========")
    print(data.shape)
    print("\n======= DTYPES =========")
    print(data.dtypes)
    print("\n======= SUMMARY =========")
    print(data.describe())
    
    return data


def calculate_target_variable(data: pd.DataFrame) -> pd.DataFrame:
    '''
    Calculate the trip duration in minutes from pickup and drop-off times
    '''
    data['trip_duration'] = data['tpep_dropoff_datetime'] - data['tpep_pickup_datetime']
    # Vectorized conversion from timedelta to minutes
    data['trip_duration'] = data['trip_duration'].dt.total_seconds() / 60
    print(data['trip_duration'].describe())
    print(data['trip_duration'].head())
    return data


def filter_outliers(data: pd.DataFrame) -> pd.DataFrame:
    '''
    Keep trips with a duration of 1 to 60 minutes and a distance of 0.2 to 100
    '''
    flag1 = (data['trip_duration'] >= 1) & (data['trip_duration'] <= 60)
    flag2 = (data['trip_distance'] >= 0.2) & (data['trip_distance'] <= 100)
    print(flag1.sum())            # rows passing the duration filter
    print((flag1 & flag2).sum())  # rows passing both filters
    data = data[flag1 & flag2].reset_index(drop=True)
    return data

from enum import Enum
class Transform_Fun_Options(str, Enum):
    LOG = 'log'
    SQRT = 'sqrt'
    
    
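def transform_numeric_features(data: pd.DataFrame, features: list, transformation_fun: Transform_Fun_Options) -> pd.DataFrame:
    if transformation_fun == Transform_Fun_Options.LOG:
        # log(1 + x) keeps zero values finite
        data[features] = np.log1p(data[features])
    elif transformation_fun == Transform_Fun_Options.SQRT:
        data[features] = np.sqrt(data[features])
    return data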
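
When `transform_numeric_features` is applied to the target with the LOG option, the model predicts `log(1 + duration)`, so predictions have to be mapped back with the inverse before an RMSE in minutes is computed. A minimal sketch with made-up values (the arrays are hypothetical, not taken from the data):

```python
import numpy as np

# Hypothetical trip durations in minutes
y_minutes = np.array([1.0, 7.5, 16.0, 60.0])

# Forward transform: log(1 + x), the LOG option above
y_log = np.log1p(y_minutes)

# Pretend these are model predictions on the log scale
y_pred_log = y_log + 0.1

# Inverse transform: exp(x) - 1 brings values back to minutes
y_pred_minutes = np.expm1(y_pred_log)

rmse_minutes = np.sqrt(np.mean((y_minutes - y_pred_minutes) ** 2))
print(rmse_minutes)
```

Note that a constant error on the log scale turns into a larger error in minutes for longer trips.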

### Load and Process Data


In [8]:
train_data = load_parquet_data(file_2025_03)
test_data = load_parquet_data(file_2025_04)


print("\n======= SNAPSHOT =========")
train_data.head(10)

# Calculate the target, i.e. trip duration
train_data = calculate_target_variable(train_data)
test_data = calculate_target_variable(test_data)

# Create PU_DO feature
train_data['PU_DO'] = train_data['PULocationID'].astype('str') + '_' + train_data['DOLocationID'].astype('str')
test_data['PU_DO'] = test_data['PULocationID'].astype('str') + '_' + test_data['DOLocationID'].astype('str')

# Filter outliers
train_data_filtered = filter_outliers(train_data)
test_data_filtered = filter_outliers(test_data)
print(train_data_filtered.shape)
print(test_data_filtered.shape)
print(train_data_filtered['trip_duration'].describe())


# Target and Features
num_features = ['trip_distance']
cat_features = ['PU_DO']#['PULocationID', 'DOLocationID']
features = num_features + cat_features
target = 'trip_duration'

# Transform features
#features_to_transform = num_features + [target]
#train_data_filtered = transform_numeric_features(train_data_filtered, features_to_transform, Transform_Fun_Options.LOG)
#test_data = transform_numeric_features(test_data, features_to_transform, Transform_Fun_Options.LOG)
#test_data_filtered = transform_numeric_features(test_data_filtered, features_to_transform, Transform_Fun_Options.LOG)

# Change all categorical variables to strings
train_data_filtered[cat_features] = train_data_filtered[cat_features].astype('str')
test_data_filtered[cat_features] = test_data_filtered[cat_features].astype('str')
test_data[cat_features] = test_data[cat_features].astype('str')

# Convert feature rows to a list of dicts for DictVectorizer
train_data_filtered_dict = train_data_filtered[features].to_dict(orient='records')
test_data_filtered_dict = test_data_filtered[features].to_dict(orient='records')
test_data_dict = test_data[features].to_dict(orient='records')

print(train_data_filtered_dict[0])


Loading data from:  /Users/vss/Personal/Git/NYC_Taxi_Trip_Duration_Prediction/data/raw_data/yellow_tripdata_2025-03.parquet

(4145257, 20)

VendorID                          int32
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int32
DOLocationID                      int32
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
cbd_congestion_fee              float64
dtype: object

           VendorID        tpep_pickup_dateti

## Preparing the X, y for training and testing

In [9]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()

X = dv.fit_transform(train_data_filtered_dict)
y = train_data_filtered[target].values
print(X.shape)

X_test_filtered = dv.transform(test_data_filtered_dict)
y_test_filtered = test_data_filtered[target].values

X_test = dv.transform(test_data_dict)
y_test = test_data[target].values

(3922659, 35309)


## Train the model

In [10]:
from sklearn.linear_model import LinearRegression

# Train the model out of mlflow context and then log the model to MLflow
# Because if anything breaks we do not need to run the whole training again

# Train the Model
model = LinearRegression()
model.fit(X, y)
#print(model.coef_, model.intercept_)
print("Model is Trained")

# Make Predictions
import numpy as np
y_predictions = model.predict(X)
y_test_predictions = model.predict(X_test)
y_test_filtered_predictions = model.predict(X_test_filtered)

# Score Predictions
from sklearn.metrics import root_mean_squared_error
train_error = root_mean_squared_error(y, y_predictions)
test_error = root_mean_squared_error(y_test, y_test_predictions)
test_filtered_error = root_mean_squared_error(y_test_filtered, y_test_filtered_predictions)
print("Train Error:", train_error)
print("Test Error:", test_error)  # all test trips, outliers included
print("Test Error:", test_filtered_error)  # test trips after outlier filtering

# First save the model in binary format and then log it to MLflow
os.makedirs('models', exist_ok=True)
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, model), f_out)



Model is Trained
Train Error: 4.874724590750923
Test Error: 1071.8615540959213
Test Error: 5.3162491333348765
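
The gap between the two test errors above is what RMSE does with a few extreme rows, since errors are squared before averaging. A sketch with invented residuals (not the notebook's actual predictions):

```python
import numpy as np

def rmse(errors: np.ndarray) -> float:
    """Root mean squared error from an array of residuals."""
    return float(np.sqrt(np.mean(errors ** 2)))

# 1000 hypothetical trips, each mispredicted by 5 minutes
errors = np.full(1000, 5.0)
rmse_typical = rmse(errors)

# Replace one residual with a 30000-minute miss (e.g. a bad timestamp)
errors_with_outlier = errors.copy()
errors_with_outlier[0] = 30_000.0
rmse_with_outlier = rmse(errors_with_outlier)

print(rmse_typical, rmse_with_outlier)
```

A single bad row out of a thousand moves the score from 5 minutes to several hundred, which is why the filtered test error is the more useful number for comparing models here.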


In [15]:
with mlflow.start_run():
    # Set Tags
    mlflow.set_tag("mlflow.developer", "Vivek")
    #mlflow.set_tag("mlflow.runName", "NYC_Taxi_Duration_Prediction")
    # Set Train and Test input paths
    mlflow.log_param("train-data-path", "{}".format(file_2025_03))
    mlflow.log_param("test-data-path", "{}".format(file_2025_04))
    # Model Type
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("features", features)
    mlflow.log_param("target", target)
    mlflow.log_param("num_features", num_features)
    mlflow.log_param("cat_features", cat_features)
    # Log the  metrics
    mlflow.log_metric("train_rmse_no_outlier", train_error)
    mlflow.log_metric("test_rmse_no_outlier", test_filtered_error)
    mlflow.log_metric("test_rmse_all_trips", test_error)
    # Log the model
    mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models")



🏃 View run NYC_Taxi_Duration_Prediction at: http://127.0.0.1:5000/#/experiments/1/runs/e724851a220d4c6e9ce94a1278238357
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1


## XGBoost Model Experiment, Hyperparameter Tuning and Logging


In [11]:
#!pip install xgboost
#!pip install hyperopt

In [10]:
import xgboost as xgb
from hyperopt import Trials, STATUS_OK, tpe, fmin, hp
from hyperopt.pyll import scope

In [11]:
train = xgb.DMatrix(X, label=y)
valid = xgb.DMatrix(X_test_filtered, label=y_test_filtered)

In [26]:
def objective(params):
    '''
    Objective function for hyperparameter tuning
    '''
    print(params)
    with mlflow.start_run():
        mlflow.set_tag("mlflow.developer", "Vivek")
        mlflow.set_tag("model_type", "xgboost")
        mlflow.log_params(params)

        # Train the model
        model = xgb.train(
            params=params,
            dtrain=train,
            num_boost_round=100,
            evals=[(valid, 'eval')],
            early_stopping_rounds=50
        )

        # Make Predictions
        y_train_pred = model.predict(train)
        y_valid_pred = model.predict(valid)

        # Score Predictions
        train_rmse = root_mean_squared_error(y, y_train_pred)
        test_rmse = root_mean_squared_error(y_test_filtered, y_valid_pred)

        # Log the metric
        mlflow.log_metric("train_rmse_no_outlier", train_rmse)
        mlflow.log_metric("test_rmse_no_outlier", test_rmse)

        mlflow.log_param("train-data-path", "{}".format(file_2025_03))
        mlflow.log_param("test-data-path", "{}".format(file_2025_04))

        # Model Type
        mlflow.log_param("features", features)
        mlflow.log_param("target", target)
        mlflow.log_param("num_features", num_features)
        mlflow.log_param("cat_features", cat_features)
        # Log the model
        #mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models")

        return {'loss': test_rmse, 'status': STATUS_OK}

In [27]:
search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 5, 20, 1)),
    'learning_rate': hp.loguniform('learning_rate', -2, 0),
    'reg_alpha': hp.loguniform('reg_alpha', -2, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -2, -1),
    'min_child_weight': hp.loguniform('min_child_weight', 0, 1),
    'objective': 'reg:squarederror',
    'seed': 42
}

trials = Trials()  # Stores information about each trial, e.g. trials.trials[0]['result']['loss']

best_result = fmin(
    fn=objective,      # Objective function to minimize
    space=search_space,
    algo=tpe.suggest,  # Search algorithm: TPE, smarter than random or grid search
    max_evals=10,      # Maximum number of hyperparameter combinations to try
    trials=trials
)

{'learning_rate': 0.9079992153489611, 'max_depth': 8, 'min_child_weight': 2.682505426238886, 'objective': 'reg:squarederror', 'reg_alpha': 0.32827263176937604, 'reg_lambda': 0.23526559591035467, 'seed': 42}
[0]	eval-rmse:5.72817                                 
[1]	eval-rmse:5.62564                                 
[2]	eval-rmse:5.60755                                 
[3]	eval-rmse:5.59942                                 
[4]	eval-rmse:5.59212                                 
[5]	eval-rmse:5.58530                                 
[6]	eval-rmse:5.57948                                 
[7]	eval-rmse:5.57322                                 
[8]	eval-rmse:5.56766                                 
[9]	eval-rmse:5.56213                                 
[10]	eval-rmse:5.55683                                
[11]	eval-rmse:5.55194                                
[12]	eval-rmse:5.54661                                
[13]	eval-rmse:5.54154                                
[14]	eval-rmse:5.53600 
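
`hp.loguniform(label, low, high)` draws `exp(u)` with `u` uniform in `[low, high]`, so the bounds in `search_space` are exponents, not the values themselves. The sketch below uses only `math` to convert the exponents from the search space above into actual ranges; the helper name `loguniform_bounds` is made up for this illustration.

```python
import math

def loguniform_bounds(low: float, high: float) -> tuple[float, float]:
    """Value range covered by hp.loguniform(label, low, high)."""
    return math.exp(low), math.exp(high)

# Exponent bounds copied from search_space
exponents = {
    "learning_rate": (-2, 0),
    "reg_alpha": (-2, -1),
    "reg_lambda": (-2, -1),
    "min_child_weight": (0, 1),
}

for name, (low, high) in exponents.items():
    lo_val, hi_val = loguniform_bounds(low, high)
    print(f"{name}: {lo_val:.3f} to {hi_val:.3f}")
```

The sampled values in the log above fall inside these ranges, e.g. the `learning_rate` of about 0.908 lies between roughly 0.135 and 1.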

## Now retraining the model with the best hyperparameters
## And logging the model to MLflow (autologging was too slow, so it is disabled below)

In [13]:
best_params = {
    'learning_rate': 0.6283682876870004,
    'max_depth': 18,
    'min_child_weight': 1.3846214120664857,
    'objective': 'reg:squarederror',
    'reg_alpha': 0.1958413715153502,
    'reg_lambda': 0.34286651753790953,
    'seed': 42
}
print(best_params)

# Autologging was taking too much time, so it is disabled for now
#mlflow.xgboost.autolog(
    #log_input_examples=False,
    #log_model_signatures=True,
    #log_models=True,
    #disable=False
#)

model = xgb.train(
        params=best_params,
        dtrain=train,
        num_boost_round=100,
        evals=[(valid, 'eval')],
        early_stopping_rounds=50
)

# Make Predictions
y_train_pred = model.predict(train)
y_valid_pred = model.predict(valid)


{'learning_rate': 0.6283682876870004, 'max_depth': 18, 'min_child_weight': 1.3846214120664857, 'objective': 'reg:squarederror', 'reg_alpha': 0.1958413715153502, 'reg_lambda': 0.34286651753790953, 'seed': 42}
[0]	eval-rmse:6.50674
[1]	eval-rmse:5.69843
[2]	eval-rmse:5.53730
[3]	eval-rmse:5.49217
[4]	eval-rmse:5.47251
[5]	eval-rmse:5.45988
[6]	eval-rmse:5.45281
[7]	eval-rmse:5.44735
[8]	eval-rmse:5.44238
[9]	eval-rmse:5.43779
[10]	eval-rmse:5.43228
[11]	eval-rmse:5.42807
[12]	eval-rmse:5.42312
[13]	eval-rmse:5.41864
[14]	eval-rmse:5.41425
[15]	eval-rmse:5.40987
[16]	eval-rmse:5.40531
[17]	eval-rmse:5.40128
[18]	eval-rmse:5.39754
[19]	eval-rmse:5.39350
[20]	eval-rmse:5.38943
[21]	eval-rmse:5.38577
[22]	eval-rmse:5.38231
[23]	eval-rmse:5.37889
[24]	eval-rmse:5.37550
[25]	eval-rmse:5.37202
[26]	eval-rmse:5.36873
[27]	eval-rmse:5.36551
[28]	eval-rmse:5.36252
[29]	eval-rmse:5.35962
[30]	eval-rmse:5.35407
[31]	eval-rmse:5.35118
[32]	eval-rmse:5.34857
[33]	eval-rmse:5.34618
[34]	eval-rmse:5.343

NameError: name 'root_mean_squared_error' is not defined

In [14]:
from sklearn.metrics import root_mean_squared_error
# Score Predictions
train_rmse = root_mean_squared_error(y, y_train_pred)
test_rmse = root_mean_squared_error(y_test_filtered, y_valid_pred)


In [18]:
with mlflow.start_run():
    # Set Tags
    mlflow.set_tag("mlflow.developer", "Vivek")
    #mlflow.set_tag("mlflow.runName", "NYC_Taxi_Duration_Prediction")
    # Set Train and Test input paths
    mlflow.log_param("train-data-path", "{}".format(file_2025_03))
    mlflow.log_param("test-data-path", "{}".format(file_2025_04))

    # Log the tuned hyperparameters
    mlflow.log_params(best_params)
    # Model Type
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_param("features", features)
    mlflow.log_param("target", target)
    mlflow.log_param("num_features", num_features)
    mlflow.log_param("cat_features", cat_features)
    # Log the  metrics
    mlflow.log_metric("train_rmse_no_outlier", train_rmse)
    mlflow.log_metric("test_rmse_no_outlier", test_rmse)

    # Log the model using mlflow.xgboost.log_model
    mlflow.xgboost.log_model(model, name="xgboost_model")

    # Also, save the preprocessor as binary file and then log the preprocessor
    with open('models/preprocessor.bin', 'wb') as f_out:
        pickle.dump(dv, f_out)
    mlflow.log_artifact("models/preprocessor.bin", artifact_path="preprocessor")



🏃 View run worried-foal-107 at: http://127.0.0.1:5000/#/experiments/1/runs/e32e0d56a5fa4240bc9cbc5522a33ee9
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1


In [20]:
from mlflow.models.signature import infer_signature
signature = infer_signature(train, model.predict(train))
signature




inputs: 
  [Any (required)]
outputs: 
  [Tensor('float32', (-1,))]
params: 
  None