## Scenario 4: I work locally but need my artifacts in a remote S3 bucket


MLflow setup:
- tracking server: yes, a local server (to see results in real time)
- backend store: SQLite database (stores the metadata: parameters, metrics, tags, and so on)
- artifact store: S3 bucket (yezer-artifacts-remote-01)

To run this example, launch the MLflow server locally with a SQLite backend store and the S3 bucket as the default artifact root by running the following command in your terminal:

`mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root=s3://yezer-artifacts-remote-01`

MLflow talks to S3 through boto3, which looks for AWS credentials in several places. The sources most relevant here are, in order of precedence:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials), which `aws configure` writes
- AWS config file (~/.aws/config)
- IAM role attached to the instance (if running on EC2)

I have AWS credentials configured in ~/.aws/credentials with an access key ID and a secret access key. This is how MLflow can access the S3 bucket yezer-artifacts-remote-01.
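As a quick sanity check before starting the server, the sketch below reports which static credential sources appear to be configured on the machine. It is a simplified stand-in, not boto3's actual resolution logic: the function name `available_credential_sources` and its parameters are hypothetical, and it only looks at the two environment variables and the `default` profile of the credentials file.

```python
import configparser
import os
from pathlib import Path
from typing import List, Mapping, Optional


def available_credential_sources(
    env: Optional[Mapping[str, str]] = None,
    credentials_file: Optional[Path] = None,
    profile: str = "default",
) -> List[str]:
    """Report which static AWS credential sources appear to be configured.

    Hypothetical helper: it only checks for the presence of keys and does
    not validate them against AWS.
    """
    env = os.environ if env is None else env
    credentials_file = credentials_file or Path.home() / ".aws" / "credentials"

    found = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")

    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file is silently ignored
    if parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id"):
        found.append(f"credentials file [{profile}]")
    return found


print(available_credential_sources())
```

An empty list means neither source was found, in which case uploads to the bucket would likely fail unless another mechanism (such as an IAM role) provides credentials.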

The experiments can be explored locally through the tracking server.

Starting the server creates backend.db in the current directory, and the artifacts are stored in the S3 bucket yezer-artifacts-remote-01.
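To confirm that the metadata really lives in the local SQLite file, one option is to list the tables it contains. The sketch below uses only the standard library; `list_tables` is a hypothetical helper, and the exact table names depend on the MLflow version, so none are assumed here.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file."""
    # Open read-only so inspecting the file can never modify the backend store.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("backend.db")` from the directory where the server was started should return the tables MLflow created for experiments, runs, metrics, and related metadata.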

In [31]:
import mlflow
from mlflow.tracking import MlflowClient
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
import pickle as pk
from pydantic import BaseModel
from typing import Any, Dict


### Set up MLflow

In [4]:
mlflow.set_tracking_uri("http://127.0.0.1:5000")
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

tracking URI: 'http://127.0.0.1:5000'


In [5]:
mlflow.search_experiments()

[<Experiment: artifact_location='s3://yezer-artifacts-remote-01/0', creation_time=1750750626480, experiment_id='0', last_update_time=1750750626480, lifecycle_stage='active', name='Default', tags={}>]

In [7]:
mlflow.set_experiment("NYC-taxi-duration-prediction")
mlflow.search_experiments()

2025/06/24 08:54:07 INFO mlflow.tracking.fluent: Experiment with name 'NYC-taxi-duration-prediction' does not exist. Creating a new experiment.


[<Experiment: artifact_location='s3://yezer-artifacts-remote-01/1', creation_time=1750751647033, experiment_id='1', last_update_time=1750751647033, lifecycle_stage='active', name='NYC-taxi-duration-prediction', tags={}>,
 <Experiment: artifact_location='s3://yezer-artifacts-remote-01/0', creation_time=1750750626480, experiment_id='0', last_update_time=1750750626480, lifecycle_stage='active', name='Default', tags={}>]

### Train & log a model

In [9]:
def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)

        # CSV files store timestamps as text, so parse them explicitly
        df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
        df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
    else:
        raise ValueError(f"unsupported file type: {filename}")

    # Trip duration in minutes
    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Keep trips lasting 1 to 60 minutes; copy so the assignment below
    # does not write into a slice of the original frame
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df
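
The duration filter in `read_dataframe` can be illustrated without pandas. The sketch below applies the same rule (duration in minutes, keep trips between 1 and 60 minutes inclusive) to three made-up trips; the helper names and the timestamps are illustrative only and do not come from the dataset.

```python
from datetime import datetime


def duration_minutes(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in minutes, mirroring td.total_seconds() / 60."""
    return (dropoff - pickup).total_seconds() / 60


def keep_trip(minutes: float) -> bool:
    """Same bounds as the DataFrame filter: 1 <= duration <= 60."""
    return 1 <= minutes <= 60


# Made-up trips: one normal, one too short, one too long
trips = [
    (datetime(2021, 1, 1, 0, 15), datetime(2021, 1, 1, 0, 19)),
    (datetime(2021, 1, 1, 0, 25, 0), datetime(2021, 1, 1, 0, 25, 30)),
    (datetime(2021, 1, 1, 1, 0), datetime(2021, 1, 1, 3, 0)),
]

durations = [duration_minutes(pickup, dropoff) for pickup, dropoff in trips]
kept = [minutes for minutes in durations if keep_trip(minutes)]

print(durations)  # [4.0, 0.5, 120.0]
print(kept)       # [4.0]
```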

In [16]:
df_train = read_dataframe('/home/yezer/projects/mlops-zoomcamp/01-intro/data/green_tripdata_2021-01.parquet')
df_val = read_dataframe('/home/yezer/projects/mlops-zoomcamp/01-intro/data/green_tripdata_2021-02.parquet')

df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

categorical = ['PU_DO']  # combined feature, used instead of 'PULocationID' and 'DOLocationID'
numerical = ['trip_distance']

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

In [21]:
import os

with mlflow.start_run():

    lr = LinearRegression(fit_intercept=True)
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_val)

    rmse = root_mean_squared_error(y_val, y_pred)

    # Make sure the local output directory exists before writing the pickle
    os.makedirs('./models', exist_ok=True)
    with open('./models/lin_reg.bin', 'wb') as f_out:
        pk.dump((lr, dv), f_out)

    # Upload the pickle under "models/" in this run's artifact location on S3
    mlflow.log_artifact(local_path="./models/lin_reg.bin", artifact_path="models")
    mlflow.log_metric("rmse", rmse)

    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'")

default artifacts URI: 's3://yezer-artifacts-remote-01/1/4aa11a9110c34841851c4f32a5e506cf/artifacts'
🏃 View run peaceful-sow-991 at: http://127.0.0.1:5000/#/experiments/1/runs/4aa11a9110c34841851c4f32a5e506cf
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
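
The artifact URI printed above follows the pattern `<artifact root>/<experiment id>/<run id>/artifacts`. To locate the uploaded file in the AWS console or with the AWS CLI, it can help to split that URI into a bucket name and an object key. The sketch below does this with the standard library; `split_s3_uri` and `artifact_key` are hypothetical helper names, not part of MLflow.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def artifact_key(run_artifact_uri: str, artifact_path: str) -> Tuple[str, str]:
    """Return (bucket, object key) for a file logged under artifact_path."""
    bucket, prefix = split_s3_uri(run_artifact_uri)
    return bucket, f"{prefix.rstrip('/')}/{artifact_path.lstrip('/')}"


bucket, key = artifact_key(
    "s3://yezer-artifacts-remote-01/1/4aa11a9110c34841851c4f32a5e506cf/artifacts",
    "models/lin_reg.bin",
)
print(bucket)  # yezer-artifacts-remote-01
print(key)     # 1/4aa11a9110c34841851c4f32a5e506cf/artifacts/models/lin_reg.bin
```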


### Get a model from run id

In [37]:
class Ride(BaseModel):  # type: ignore
    PULocationID: str
    DOLocationID: str
    trip_distance: float

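def prepare_features(ride: Dict[str, Any]) -> Dict[str, Any]:
    # `ride` is a plain dict with the same fields as Ride; the subscript
    # access below would fail on a Ride instance, which must be converted
    # to a dict first
    print(ride)
    features: Dict[str, Any] = {}
    features['PU_DO'] = f'{ride["PULocationID"]}_{ride["DOLocationID"]}'
    features['trip_distance'] = ride["trip_distance"]
    return features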


def predict(features: Dict[str, Any], model: Any, dv: Any) -> float:
    X = dv.transform(features)
    preds = model.predict(X)
    return float(preds[0])

In [26]:
client = MlflowClient("http://127.0.0.1:5000")

In [None]:
RUN_ID = "4aa11a9110c34841851c4f32a5e506cf"
ARTIFACT_FILE_PATH = "models/lin_reg.bin"

# Construct the URI to the specific artifact file
artifact_uri = f"runs:/{RUN_ID}/{ARTIFACT_FILE_PATH}"
print(f"Attempting to download artifact from URI: {artifact_uri}")

# Download the artifact.
downloaded_artifact_path = mlflow.artifacts.download_artifacts(artifact_uri=artifact_uri)
print(f"Artifact downloaded to: {downloaded_artifact_path}")

# Load the model and DictVectorizer from the downloaded pickle file.
with open(downloaded_artifact_path, 'rb') as f_in:
    # Unpack the tuple
    loaded_model_from_pickle, loaded_dv_from_pickle = pk.load(f_in)

ride = {
    "PULocationID": "10",
    "DOLocationID": "50",
    "trip_distance": 10
}

features = prepare_features(ride)
print(features)

preds = predict(features, loaded_model_from_pickle, loaded_dv_from_pickle)

print(preds)


Attempting to download artifact from URI: runs:/4aa11a9110c34841851c4f32a5e506cf/models/lin_reg.bin
Artifact downloaded to: /tmp/tmpjb6x_tdz/lin_reg.bin
{'PULocationID': '10', 'DOLocationID': '50', 'trip_distance': 10}
{'PU_DO': '10_50', 'trip_distance': 10}
25.819684421746068
