# HW3 Orchestration

In [14]:
import sys

# Temporarily remove NumPy from the loaded modules
sys.modules.pop("numpy", None)

# Now import the required Scikit-Learn modules
from sklearn.metrics import root_mean_squared_error
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression


In [18]:
import pandas as pd
from prefect import task, flow
import pyarrow
import os
import gc
import mlflow
from tqdm import tqdm

## Question 1. Select the Tool

You can use the same tool you used when completing the module, or choose a different one for your homework.

What's the name of the orchestrator you chose?



### Ans: Prefect

## Question 2. Version
What's the version of the orchestrator?



### Ans: 3.4.6
Output of `prefect --version`: `3.4.6`
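
The version can also be read from inside the notebook, which keeps the answer next to the code that depends on it. Below is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is an illustrative assumption, not part of Prefect.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the Prefect version when Prefect is installed in this environment,
# and None otherwise.
print(installed_version("prefect"))
```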


## Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load?

* 3,003,766
* 3,203,766
* 3,403,766
* 3,603,766

(Include a print statement in your code)

In [7]:
df_raw = pd.read_parquet('./Data/yellow_tripdata_2023-03.parquet')
len(df_raw)

3403766
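
The question asks for a print statement, and the answer options are written with thousands separators. A small sketch of a helper that prints the count in that format follows. `report_count` is a hypothetical name introduced here for illustration.

```python
def report_count(n_rows: int, label: str = "records loaded") -> str:
    """Format a row count with thousands separators, print it, and return it."""
    message = f"{n_rows:,} {label}"
    print(message)
    return message


# In the notebook this would be called as report_count(len(df_raw)).
report_count(3403766)  # prints "3,403,766 records loaded"
```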

## Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously.

This is what we used (adjusted for yellow dataset):


```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df
```

Let's apply to the data we loaded in question 3.

What's the size of the result?

* 2,903,766
* 3,103,766
* 3,316,216
* 3,503,766

In [8]:
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy so the cast below
    # does not trigger a SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

In [9]:
df = read_dataframe('./Data/yellow_tripdata_2023-03.parquet')
len(df)

3316216
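
To see what the filter does, the same steps can be applied to a tiny hand-made frame. This is a sketch on synthetic rows, not the taxi data. `prepare` is an illustrative name, and it takes an in-memory frame instead of a filename.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same duration filter and type casts as read_dataframe."""
    df = df.copy()
    duration = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["duration"] = duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes, inclusive.
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)
    return df


# Three synthetic trips lasting 30 seconds, 10 minutes and 90 minutes.
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 08:00:00"] * 3),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2023-03-01 08:00:30", "2023-03-01 08:10:00", "2023-03-01 09:30:00"]
        ),
        "PULocationID": [161, 48, 237],
        "DOLocationID": [237, 161, 43],
    }
)

kept = prepare(trips)
print(f"{len(kept)} of {len(trips)} trips kept")
```

Only the 10-minute trip survives the filter. On the real data the same filter drops 3,403,766 − 3,316,216 = 87,550 rows, about 2.6% of the file.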

## Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

* Fit a dict vectorizer.
* Train a linear regression with default parameters.
* Use pick up and drop off locations separately; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model?

Hint: print the `intercept_` field in the code block

* 21.77
* 24.77
* 27.77
* 31.77

In [11]:
mlflow.set_tracking_uri('sqlite:///mlflow.db')
mlflow.set_experiment('HW3')

2025/06/14 04:19:28 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/06/14 04:19:28 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

<Experiment: artifact_location='/Users/ting/MLOps/MLOps-zoomcamp/03-orchestration/HW3/mlruns/1', creation_time=1749845969256, experiment_id='1', last_update_time=1749845969256, lifecycle_stage='active', name='HW3', tags={}>

In [15]:
import mlflow.sklearn

mlflow.sklearn.autolog()

In [32]:
def model_training(df: pd.DataFrame, dv: DictVectorizer = None):
    # Turn the two location columns into one dict per row for the vectorizer
    dic = df[['PULocationID', 'DOLocationID']].to_dict(orient='records')
    if dv is None:
        # No vectorizer passed in: fit a new one on this data
        dv = DictVectorizer()
        x = dv.fit_transform(dic)
    else:
        # Reuse an already fitted vectorizer
        x = dv.transform(dic)

    y = df['duration'].values

    with mlflow.start_run():
        training_model = LinearRegression()

        # tqdm has no iterable here, so each bar only reports elapsed time
        with tqdm(desc="Training Model"):
            training_model.fit(x, y)

        with tqdm(desc="Generating Predictions"):
            y_pred = training_model.predict(x)

        # RMSE is computed on the training data itself
        rmse = root_mean_squared_error(y, y_pred)
        mlflow.log_metric('RMSE', rmse)
        mlflow.sklearn.log_model(training_model, artifact_path="models")

        print(f'RMSE: {rmse}')

    return training_model, dv

In [33]:
model, dv = model_training(df)

Training Model: 0it [02:10, ?it/s]
Generating Predictions: 0it [00:43, ?it/s]


RMSE: 8.15868147199633


In [34]:
intercept = model.intercept_
print(intercept)

24.776368754137366


### Ans: 24.77

## Question 6. Register the model

The model is trained, so let's save it with MLflow.

Find the logged model, and find the `MLmodel` file. What's the size of the model? (`model_size_bytes` field):

* 14,534
* 9,534
* 4,534
* 1,534

In [36]:
client = mlflow.MlflowClient()

In [43]:
re_model = client.get_registered_model("HW3")

In [44]:
re_model.latest_versions[0]

<ModelVersion: aliases=[], creation_timestamp=1749849787300, current_stage='None', deployment_job_state=None, description='', last_updated_timestamp=1749849787300, metrics=None, model_id=None, name='HW3', params=None, run_id='', run_link='', source='/Users/ting/MLOps/MLOps-zoomcamp/03-orchestration/HW3/mlruns/1/models/m-9994f104b83648e18a48e617bf998e54/artifacts', status='READY', status_message=None, tags={}, user_id=None, version=2>
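
The `source` path in the output above points at the model's artifact directory, which is where the `MLmodel` file lives. Below is a hedged sketch of reading the size from that file with the standard library only. It assumes `model_size_bytes` appears as a top-level `key: value` line in the file, and `read_model_size` is a hypothetical helper, not an MLflow API.

```python
from pathlib import Path
from typing import Optional


def read_model_size(artifact_dir: str) -> Optional[int]:
    """Return the model_size_bytes value from <artifact_dir>/MLmodel, if present."""
    mlmodel_path = Path(artifact_dir) / "MLmodel"
    for line in mlmodel_path.read_text().splitlines():
        key, _, value = line.partition(":")
        # Only an unindented line matches, so nested keys are ignored.
        if key == "model_size_bytes":
            return int(value.strip())
    return None
```

In this notebook the call would look like `print(read_model_size(re_model.latest_versions[0].source))`, which works here because the source is a local filesystem path.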