<h1>Part 4 - Experiment Tracking</h1>

# Experiment Tracking and Model Management with MLFlow

There are many ways to use the MLFlow Tracking API. For simple local uses, the best is to leave the data management to MLFlow and let it store runs, metrics, models and artifacts locally. For more advanced usage, all of this information can be stored in databases. You can find the detailed on MLFlow's documentation [here](https://mlflow.org/docs/latest/tracking.html#scenario-1-mlflow-on-localhost).

## Scenario 1: A single data scientist participating in an ML competition

MLflow setup:
* Tracking server: no
* Backend store: local filesystem
* Artifacts store: local filesystem

The experiments can be explored locally by launching the MLflow UI.

Let's print the tracking server URI, where the experiments and runs are going to be logged. We observe it refers to a local path.

In [2]:

mlflow.set_tracking_uri("http://127.0.0.1:5000/")

In [3]:
import mlflow

print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

tracking URI: 'http://127.0.0.1:5000/'


After this initialization, we can connect create a client to connect to the API and see what experiments are present.

By refering to mlflow's [documentation](https://mlflow.org/docs/latest/python_api/mlflow.client.html), create a client and display a list of the available experiments using the search_experiments function. This function could prove useful later to programatically explore experiments (rather than in the UI)

In [4]:
import mlflow
import mlflow.sklearn

# Example: logging an experiment
with mlflow.start_run():
    mlflow.log_param("param1", 5)
    mlflow.log_metric("metric1", 0.86)


In [5]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

experiments = client.search_experiments()
experiments

[<Experiment: artifact_location='mlflow-artifacts:/936627379886763873', creation_time=1736121789416, experiment_id='936627379886763873', last_update_time=1736121789416, lifecycle_stage='active', name='MLflow_track_diamonds', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1736121668381, experiment_id='0', last_update_time=1736121668381, lifecycle_stage='active', name='Default', tags={}>]

We see that there is a default experiment for which the runs are stored locally in the mlruns folder.

### Creating an experiment and logging a new run

An experiment is a logical entity regrouping the logs of multiple attempts at solving a same problem, called runs. \
We will now work with the classic sklearn dataset iris. Our goal here is to manage to classify the different iris species. To track our models performance, we will log every attempt as a "run" and create a new experiment "iris-experiment-1" to regroup them.

Lookup the mlflow.run and mlflow.start_run functions [here](https://mlflow.org/docs/latest/python_api/mlflow.html?highlight=start_run#mlflow.start_run) to find out how to manage runs.
Explore [this part](https://mlflow.org/docs/latest/python_api/mlflow.html) to learn more about the log_params, log_metrics and log_artifact functions. Find out how to log sklearn models [here](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html])

Complete the following in order to log the parameters, interesting metrics and the model.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

mlflow.set_experiment("iris-experiment-1")

with mlflow.start_run() as run:
    run_id = run.info.run_id

    X, y = load_iris(return_X_y=True)

    params = {"C": 0.1, "random_state": 42}
    mlflow.log_params(params)

    lr = LogisticRegression(**params).fit(X, y)
    y_pred = lr.predict(X)
    mlflow.log_metric("accuracy", accuracy_score(y, y_pred))

    mlflow.sklearn.log_model(lr, artifact_path="models")
    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'")

2025/01/05 15:41:56 INFO mlflow.tracking.fluent: Experiment with name 'iris-experiment-1' does not exist. Creating a new experiment.


default artifacts URI: 'mlflow-artifacts:/823888780486837844/eda23daad43e42e0a76ef762237536a4/artifacts'


In [14]:
experiments = client.search_experiments()
experiments

[<Experiment: artifact_location='mlflow-artifacts:/823888780486837844', creation_time=1736088116909, experiment_id='823888780486837844', last_update_time=1736088116909, lifecycle_stage='active', name='iris-experiment-1', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1736088085839, experiment_id='0', last_update_time=1736088085839, lifecycle_stage='active', name='Default', tags={}>]

Try running the training script with various parameters to have runs to compare.
You can now explore your run(s) using the ui: \
(Paste "mlflow ui --host 0.0.0.0 --port 5002" in your terminal, or run the cell below)

**N.B.** Make sure you are in the lecture folder and not the repo root!

In [15]:
#!mlflow ui --host 0.0.0.0 --port 5002

^C


You will have to kill the cell to continue experimenting

### Interacting with the model registry

If you are satisfied with the last run's model, you can transform the logged model into a registered model. It will be logged in the Model Registry, which makes it easier to use in production and manage versions.

In [37]:
# We already have our run id from above. Another way to get it is to use the client:
# run_id = client.list_run_infos(experiment_id='1')[0].run_id

result = mlflow.register_model(f"runs:/{run_id}/models", "iris_lr_model")

Successfully registered model 'iris_lr_model'.
2025/01/06 03:00:47 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: iris_lr_model, version 1
Created version '1' of model 'iris_lr_model'.


## Use Case

The project is *New York City Taxi trip duration prediction*. \
The goal is to use the available data in order to train a simple machine learning model
to predict the trip duration based on **some input that can be available in production environment**.

An ultimate goal for this use case can be to predict in real time trips durations (google-maps/waze itinerary like)
but for simplicity, in this module, we assume that we need batch prediction. The data for which we need predictions
will be stored in a file for ingestion in the trained model.

The machine learning phase is mainly constituted by the following steps : 
- data processing
- model training
- model evaluation
- prediction

The data to use for this module can be downloaded from the [TLC Trip Record Data page](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
To complete this module, you will need 03 samples of data :
- `sample 1 example` : yellow trip 2021-01 data (to train model)
- `sample 2 example` : yellow trip 2021-02 data (to evaluate model)
- `sample 3 example` : yellow trip 2021-03 data (for prediction)

In [36]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

from typing import List
from scipy.sparse import csr_matrix

# 0 - Download Data

In [8]:
import gdown
import os

data_folder = "data"
train_path = f"{data_folder}/yellow_tripdata_2021-01.parquet"
test_path = f"{data_folder}/yellow_tripdata_2021-02.parquet"
predict_path = f"{data_folder}/yellow_tripdata_2021-03.parquet"

# Check whether the specified path exists or not
isExist = os.path.exists(data_folder)
if not isExist:
    # Create a new directory because it does not exist
    os.makedirs(data_folder)
    print(f"New directory {data_folder} created!")

gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    train_path,
    quiet=False,
)
gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet",
    test_path,
    quiet=False,
)
gdown.download(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.parquet",
    predict_path,
    quiet=False,
)

Downloading...
From: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
To: c:\Users\c\10.9\esilv-mlops-crashcourse-24\lessons\01-model-and-experiment-management\data\yellow_tripdata_2021-01.parquet
100%|██████████| 21.7M/21.7M [00:00<00:00, 32.3MB/s]
Downloading...
From: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet
To: c:\Users\c\10.9\esilv-mlops-crashcourse-24\lessons\01-model-and-experiment-management\data\yellow_tripdata_2021-02.parquet
100%|██████████| 21.8M/21.8M [00:00<00:00, 31.8MB/s]
Downloading...
From: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.parquet
To: c:\Users\c\10.9\esilv-mlops-crashcourse-24\lessons\01-model-and-experiment-management\data\yellow_tripdata_2021-03.parquet
100%|██████████| 30.0M/30.0M [00:00<00:00, 36.8MB/s]


'data/yellow_tripdata_2021-03.parquet'

# 1 - Load data

In [35]:
def load_data(path: str):
    return pd.read_parquet(path)


train_df = load_data(train_path)
train_df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,
1,1,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.2,1.0,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.7,1.0,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.6,1.0,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,


# 2 - Prepare the data

Let's prepare the data to make it Machine Learning ready. \
For this, we need to clean it, compute the target (what we want to predict), and compute some features to help the model understand the data better.

## 2-1 Compute the target

We want to predict a taxi trip duration in minutes. Let's compute it as a difference between the drop-off time and the pick-up time for each trip.

In [17]:
def compute_target(
    df: pd.DataFrame,
    pickup_column: str = "tpep_pickup_datetime",
    dropoff_column: str = "tpep_dropoff_datetime",
) -> pd.DataFrame:
    df["duration"] = df[dropoff_column] - df[pickup_column]
    df["duration"] = df["duration"].dt.total_seconds() / 60
    return df


train_df = compute_target(train_df)

In [18]:
train_df["duration"].describe()

count    1.369769e+06
mean     1.391168e+01
std      1.312006e+02
min     -1.350846e+05
25%      5.566667e+00
50%      9.066667e+00
75%      1.461667e+01
max      2.881770e+04
Name: duration, dtype: float64

Let's remove outliers and reduce the scope to trips between 1 minute and 1 hour

In [34]:
MIN_DURATION = 1
MAX_DURATION = 60


def filter_outliers(
    df: pd.DataFrame, min_duration: int = 1, max_duration: int = 60
) -> pd.DataFrame:
    return df[df["duration"].between(min_duration, max_duration)]


train_df = filter_outliers(train_df)

## 2-2 Prepare features

### 2-2-1 Categorical features

Most machine learning models don't work with categorical features. Because of this, they must be transformed so that the ML model can consume them.

In [33]:
CATEGORICAL_COLS = ["PUlocationID", "DOlocationID"]


def encode_categorical_cols(
    df: pd.DataFrame, categorical_cols: List[str] = None
) -> pd.DataFrame:
    if categorical_cols is None:
        categorical_cols = ["PULocationID", "DOLocationID", "passenger_count"]
    df[categorical_cols] = df[categorical_cols].fillna(-1).astype("int")
    df[categorical_cols] = df[categorical_cols].astype("str")
    return df


train_df = encode_categorical_cols(train_df)
train_df = train_df[:50]

In [32]:
def extract_x_y(
    df: pd.DataFrame,
    categorical_cols: List[str] = None,
    dv: DictVectorizer = None,
    with_target: bool = True,
) -> dict:

    if categorical_cols is None:
        categorical_cols = ["PULocationID", "DOLocationID", "passenger_count"]
    dicts = df[categorical_cols].to_dict(orient="records")

    y = None
    if with_target:
        if dv is None:
            dv = DictVectorizer()
            dv.fit(dicts)
        y = df["duration"].values

    x = dv.transform(dicts)
    return x, y, dv


X_train, y_train, dv = extract_x_y(train_df)

# 3 - Train model

We train a basic linear regression model to have a baseline performance

In [22]:
def train_model(x_train: csr_matrix, y_train: np.ndarray):
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    return lr


model = train_model(X_train, y_train)

In [23]:
import xgboost as xgb
def train_model_xgboost(x_train: csr_matrix, y_train: np.ndarray):
    dtrain = xgb.DMatrix(x_train, label=y_train)
    params = {"objective": "reg:squarederror", "eval_metric": "rmse"}
    model = xgb.train(params, dtrain)
    return model

model_xgboost = train_model_xgboost(X_train, y_train)

# 4 - Evaluate model

We evaluate the model on train and test data

## 4-1 On train data

In [24]:
def predict_duration(input_data: csr_matrix, model: LinearRegression):
    return model.predict(input_data)


def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray):
    return mean_squared_error(y_true, y_pred, squared=False)


prediction = predict_duration(X_train, model)
train_me = evaluate_model(y_train, prediction)
train_me

0.26000000000099904

In [25]:
import pickle

def load_pickle(path: str):
    with open(path, "rb") as f:
        loaded_obj = pickle.load(f)
    return loaded_obj


def predict_updated(input_path: str, model: LinearRegression):
    input_data = load_pickle(input_path)
    return model.predict(input_data)

In [26]:
#model_xgboost = train_model_xgboost(X_train, y_train)
dmatrix_data = xgb.DMatrix(X_train)
y_pred_xgboost = model_xgboost.predict(dmatrix_data)

train_me_xgboost = evaluate_model(y_train, y_pred_xgboost)
train_me_xgboost

3.477482658335206

## 4-2 On test data

In [27]:
test_df = load_data(test_path)

In [28]:
test_df = compute_target(test_df)
test_df = encode_categorical_cols(test_df)
X_test, y_test, _ = extract_x_y(test_df, dv=dv)

In [29]:
y_pred_test = predict_duration(X_test, model)
test_me = evaluate_model(y_test, y_pred_test)
test_me

58.71506045956022

In [40]:
#y_pred_test_xgboost = predict_duration(X_test, model_xgboost)
dmatrix_data = xgb.DMatrix(X_test)
y_pred_test = model_xgboost.predict(dmatrix_data)
test_me_xgboost = evaluate_model(y_test, y_pred_test)
test_me_xgboost

58.51669053507327

## 4 - Log Model Parameters to MlFlow

Now that all our development function are built and tested, let's create a training pipeline and log the training parameters, logs and model to MlFlow.

Create a training flow, log all the important parameters, metrics and model. Try to find what could be important and needs to be logged.

In [41]:
# Set the experiment name
mlflow_experiment_path = f"/mlflow/linear_reg_test"
mlflow.set_experiment(mlflow_experiment_path)

# Start a run
with mlflow.start_run() as run:
    run_id = run.info.run_id

    # Set tags for the run
    mlflow.set_tag("Level", "Development")
    mlflow.set_tag("Team", "Data Science")

    # Load data
    train_df = load_data(train_path)
    test_df = load_data(test_path)
    mlflow.log_param(
        "train_date", train_path.split("/")[-1].split(".")[0].split("_")[-1]
    )
    mlflow.log_param("test_date", test_path.split("/")[-1].split(".")[0].split("_")[-1])
    mlflow.log_param("train_set_size", train_df.shape[0])
    mlflow.log_param("test_set_size", test_df.shape[0])

    # Compute target
    train_df_computed = compute_target(train_df)

    # Filter outliers
    mlflow.log_param("filtered_outliers", True)
    train_df_computed = filter_outliers(train_df_computed)

    # Encode categorical columns
    train_df = encode_categorical_cols(train_df)

    # Extract X and y
    X_train, y_train, _ = extract_x_y(train_df)

    # Train model
    model = train_model(X_train, y_train)

    # Evaluate model
    prediction = predict_duration(X_train, model)
    train_me = evaluate_model(y_train, prediction)
    mlflow.log_metric("train_me", train_me)

    # Evaluate model on test set
    test_df = compute_target(test_df)
    test_df = encode_categorical_cols(test_df)
    # Train data
    X_train, y_train, dv = extract_x_y(train_df)

# Test data using the same dv
   

    X_test, y_test, dv = extract_x_y(test_df, dv=dv)
    #dmatrix_data = xgb.DMatrix(X_test)
    #y_pred_test = model_xgboost.predict(dmatrix_data)
    y_pred_test = predict_duration(X_test,model)
    test_me = evaluate_model(y_test, y_pred_test)
    mlflow.log_metric("test_me", test_me)

    # Log your model
    mlflow.sklearn.log_model(model, "models")

    # Register your model as the production model
    mlflow.register_model(f"runs:/{run_id}/models", "linear_reg_test")

Successfully registered model 'linear_reg_test'.
2025/01/06 03:02:12 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: linear_reg_test, version 1
Created version '1' of model 'linear_reg_test'.


In [37]:
# Set the experiment name
mlflow_experiment_path = f"/mlflow/xgboost_test"
mlflow.set_experiment(mlflow_experiment_path)

# Start a run
with mlflow.start_run() as run:
    run_id = run.info.run_id

    # Set tags for the run
    mlflow.set_tag("Level", "Development")
    mlflow.set_tag("Team", "Data Science")

    # Load data
    train_df = load_data(train_path)
    test_df = load_data(test_path)
    mlflow.log_param(
        "train_date", train_path.split("/")[-1].split(".")[0].split("_")[-1]
    )
    mlflow.log_param("test_date", test_path.split("/")[-1].split(".")[0].split("_")[-1])
    mlflow.log_param("train_set_size", train_df.shape[0])
    mlflow.log_param("test_set_size", test_df.shape[0])

    # Compute target
    train_df_computed = compute_target(train_df)

    # Filter outliers
    mlflow.log_param("filtered_outliers", True)
    train_df_computed = filter_outliers(train_df_computed)

    # Encode categorical columns
    train_df = encode_categorical_cols(train_df)

    # Extract X and y
    X_train, y_train, _ = extract_x_y(train_df)

    # Train model
    model_xgboost = train_model(X_train, y_train)

    # Evaluate model
    prediction = predict_duration(X_train, model_xgboost)
    train_me = evaluate_model(y_train, prediction)
    mlflow.log_metric("train_me", train_me)

    # Evaluate model on test set
    test_df = compute_target(test_df)
    test_df = encode_categorical_cols(test_df)
    X_train, y_train, dv = extract_x_y(train_df)
    X_test, y_test, dv = extract_x_y(test_df, dv=dv)
    dmatrix_data = xgb.DMatrix(X_test)
    y_pred_test = model_xgboost.predict(X_test)  # X_test should be a numpy array or sparse matrix

    #y_pred_test = model_xgboost.predict(dmatrix_data)
    #y_pred_test = model_xgboost.predict(dmatrix_data)
    #y_pred_test = predict_duration(X_test, model)
    test_me = evaluate_model(y_test, y_pred_test)
    mlflow.log_metric("test_me", test_me)

    # Log your model
    mlflow.sklearn.log_model(model, "models")

    # Register your model as the production model
    mlflow.register_model(f"runs:/{run_id}/models", "xgboost_test")

2025/01/05 15:44:12 INFO mlflow.tracking.fluent: Experiment with name '/mlflow/xgboost_test' does not exist. Creating a new experiment.
Successfully registered model 'xgboost_test'.
2025/01/05 15:44:39 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: xgboost_test, version 1
Created version '1' of model 'xgboost_test'.


If the model is satisfactory, we stage it as production using the appropriate version. This will help us retreiving it for predictions.

In [43]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# List registered models
for model in client.search_registered_models():
    print(model)

# List all versions for the specific model
model_name = "linear_reg_test"
model_versions = client.get_registered_model(name=model_name)
for version in model_versions.latest_versions:
    print(f"Version: {version.version}, Stage: {version.current_stage}")


<RegisteredModel: aliases={}, creation_timestamp=1736132869984, description='', last_updated_timestamp=1736133029433, latest_versions=[<ModelVersion: aliases=[], creation_timestamp=1736133029433, current_stage='None', description='', last_updated_timestamp=1736133029433, name='RandomForestRegressor_test', run_id='ec311e061e204c28a60c9aca47cfa407', run_link='', source='mlflow-artifacts:/958600936449629552/ec311e061e204c28a60c9aca47cfa407/artifacts/models', status='READY', status_message='', tags={}, user_id='', version='2'>], name='RandomForestRegressor_test', tags={}>
<RegisteredModel: aliases={}, creation_timestamp=1736128847621, description='', last_updated_timestamp=1736128847649, latest_versions=[<ModelVersion: aliases=[], creation_timestamp=1736128847649, current_stage='None', description='', last_updated_timestamp=1736128847649, name='iris_lr_model', run_id='d77f733e1e5140328c7962740bcf7778', run_link='', source='mlflow-artifacts:/718290349955982801/d77f733e1e5140328c7962740bcf77

In [44]:
client = MlflowClient()
production_version = 1

client.transition_model_version_stage(
    name="linear_reg_test", version=production_version, stage="Production"
)

<ModelVersion: aliases=[], creation_timestamp=1736128931993, current_stage='Production', description='', last_updated_timestamp=1736133189437, name='linear_reg_test', run_id='38cb6da02b334c2b9ea53a36383f49d1', run_link='', source='mlflow-artifacts:/718290349955982801/38cb6da02b334c2b9ea53a36383f49d1/artifacts/models', status='READY', status_message='', tags={}, user_id='', version='1'>

In [40]:
client = MlflowClient()
production_version = 1

client.transition_model_version_stage(
    name="xgboost_test", version=production_version, stage="Production"
)

<ModelVersion: aliases=[], creation_timestamp=1736088279758, current_stage='Production', description='', last_updated_timestamp=1736088279963, name='xgboost_test', run_id='16d7df5f0b1144c08e784201a0ea5996', run_link='', source='mlflow-artifacts:/775980936993477273/16d7df5f0b1144c08e784201a0ea5996/artifacts/models', status='READY', status_message='', tags={}, user_id='', version='1'>

## 5 - Predict

We can now use our model to predict on fresh unseen data and forecast what is going to be the duration of a taxi trip depending on trip characteristics.

In [45]:
# Load prediction data
predict_df = load_data(predict_path)

# Apply feature engineering
predict_df = encode_categorical_cols(predict_df)
X_pred, _, dv2 = extract_x_y(predict_df, dv=dv, with_target=False)

# Load production model
model_uri = f"models:/{mlflow_experiment_path}/production"
model = mlflow.sklearn.load_model(model_uri)

# Make predictions
y_pred = predict_duration(X_pred, model)
y_pred

  from .autonotebook import tqdm as notebook_tqdm
Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 99.61it/s] 


array([12.7069711 , 13.81109596, 13.81109596, ..., 14.1538733 ,
       20.51853691, 24.14175089])

In [46]:
import os
import pickle

def save_pickle(path: str, obj: Any):
    # Ensure the directory exists
    dir_name = os.path.dirname(path)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    
    # Save the pickle file
    with open(path, "wb") as f:
        pickle.dump(obj, f)

# Call the function to save the pickle
save_pickle("../../02-model-deployment/solution/web_service/local_models/dv__v0.0.1.pkl", dv2)


NameError: name 'Any' is not defined

In [43]:
loaded_dv2 = load_pickle("../../02-model-deployment/solution/web_service/local_models/dv__v0.0.1.pkl")
print(loaded_dv2)

DictVectorizer()


In [51]:
import os
print(f"Current working directory: {os.getcwd()}")

Current working directory: c:\Users\c\10.9\esilv-mlops-crashcourse-24\lessons\02-model-deployment


In [50]:
cd 02-model-deployment

c:\Users\c\10.9\esilv-mlops-crashcourse-24\lessons\02-model-deployment


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


In [53]:
import pickle

from typing import Any

def save_pickle(path: str, obj: Any):
    """Saves the given object to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path: str):
    """Loads a pickle object from the specified file."""
    with open(path, "rb") as f:
        return pickle.load(f)
save_pickle('web_service/local_models/dv__v0.0.1.pkl', dv)  # Save the DictVectorizer
save_pickle('web_service/local_models/model__v0.0.1.pkl', model)

In [2]:
import os
import pickle
from typing import Any



def save_pickle(path: str, obj: Any):
    # Ensure the directory exists
    dir_name = os.path.dirname(path)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)  # Creates all intermediate directories if they don't exist
    
    # Save the pickle file
    with open(path, "wb") as f:
        pickle.dump(obj, f)

# Example usage: Save the pickle file to the correct path
save_pickle("esilv-mlops-crashcourse-24/lessons/02-model-deployment/web_service/local_models/dv__v0.0.1.pkl", dv2)


NameError: name 'dv2' is not defined

In [1]:
# Saving the preprocessor
from lib.utils import save_pickle

# Assuming dv2 is your DictVectorizer
save_pickle("local_models/dv__v0.0.1.pkl", dv2)

# Saving the model
save_pickle("local_models/model__v0.0.1.pkl", model)


ModuleNotFoundError: No module named 'lib.utils'

In [50]:

############################################
from typing import Any
import pickle

def load_pickle(path: str):
    with open(path, "rb") as f:
        loaded_obj = pickle.load(f)
    return loaded_obj



def save_pickle(path: str, obj: Any):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

save_pickle("esilv-mlops-crashcourse-24/lessons/02-model-deployment/web_service/local_models/dv__v0.0.1.pkl", dv2)

############################################

FileNotFoundError: [Errno 2] No such file or directory: 'esilv-mlops-crashcourse-24/lessons/02-model-deployment/web_service/local_models/dv__v0.0.1.pkl'

In [45]:
from typing import Any

import pickle


def load_pickle(path: str):
    with open(path, "rb") as f:
        loaded_obj = pickle.load(f)
    return loaded_obj


def save_pickle(path: str, obj: Any):
    with open(path, "wb") as f:
        pickle.dump(obj, f)