## Homework

The goal of this homework is to create a simple training pipeline, using MLflow to track experiments and register the best model, and to run it with Mage.

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), the Yellow taxi data for 2023.

In [2]:
import pandas as pd
import numpy as np

## Question 1. Select the Tool


You can use the same tool you used when completing the module, or choose a different one for your homework.

What's the name of the orchestrator you chose?

### Answer:
- Airflow

## Question 2. Version

What's the version of the orchestrator?



### Answer:
- 3.0.1

## Question 3. Creating a pipeline
Let's read the March 2023 Yellow taxi trips data.

How many records did we load?


- 3,003,766
- 3,203,766
- 3,403,766
- 3,603,766

In [10]:
filename = '/Users/pitsuevt/work_main/learning/datatalks/mlops_zoomcamp_2025/03-orchestration/data/yellow_tripdata_2023-03.parquet'

In [11]:
df = pd.read_parquet(filename)

In [4]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,2,2023-03-01 00:06:43,2023-03-01 00:16:43,1.0,0.0,1.0,N,238,42,2,8.6,1.0,0.5,0.0,0.0,1.0,11.1,0.0,0.0
1,2,2023-03-01 00:08:25,2023-03-01 00:39:30,2.0,12.4,1.0,N,138,231,1,52.7,6.0,0.5,12.54,0.0,1.0,76.49,2.5,1.25
2,1,2023-03-01 00:15:04,2023-03-01 00:29:26,0.0,3.3,1.0,N,140,186,1,18.4,3.5,0.5,4.65,0.0,1.0,28.05,2.5,0.0
3,1,2023-03-01 00:49:37,2023-03-01 01:01:05,1.0,2.9,1.0,N,140,43,1,15.6,3.5,0.5,4.1,0.0,1.0,24.7,2.5,0.0
4,2,2023-03-01 00:08:04,2023-03-01 00:11:06,1.0,1.23,1.0,N,79,137,1,7.2,1.0,0.5,2.44,0.0,1.0,14.64,2.5,0.0


In [6]:
df.shape

(3403766, 19)

### Answer:
- 3,403,766
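The notebook reads the file from a hard-coded local path. In an orchestrated pipeline it is often handier to derive the file name from the year and month. A minimal sketch of that idea follows; `yellow_taxi_url` and `BASE_URL` are hypothetical names, and the host is an assumption about where the TLC parquet files are currently served, so check it against the dataset page linked above.

```python
# Hypothetical helper: builds the download URL for a monthly Yellow taxi file.
# BASE_URL is an assumption about the current hosting location of the
# TLC parquet files; verify it before relying on it.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def yellow_taxi_url(year: int, month: int) -> str:
    """Return the assumed URL of the Yellow taxi parquet file for a month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"{BASE_URL}/yellow_tripdata_{year:04d}-{month:02d}.parquet"


url = yellow_taxi_url(2023, 3)
print(url)
```

`pd.read_parquet` accepts a URL as well as a local path, so the result could be passed to it directly if the assumed host is correct.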

## Question 4. Data preparation
Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously.

This is what we used (adjusted for the Yellow taxi dataset):

In [7]:
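def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Cast location IDs to strings so they are treated as categories
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df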
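To make the duration step concrete, here is a small standard-library sketch that repeats the same arithmetic on the first two rows of the `df.head()` output from Question 3. The helper names `duration_minutes` and `keep_trip` are hypothetical and only mirror the logic of `read_dataframe`; they are not part of the homework code.

```python
from datetime import datetime


def duration_minutes(pickup: str, dropoff: str) -> float:
    """Trip duration in minutes, as total_seconds() / 60."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 60


def keep_trip(minutes: float) -> bool:
    """Same filter as read_dataframe: keep trips of 1 to 60 minutes inclusive."""
    return 1 <= minutes <= 60


# Pickup and drop-off times of the first two rows shown in df.head()
first = duration_minutes("2023-03-01 00:06:43", "2023-03-01 00:16:43")
second = duration_minutes("2023-03-01 00:08:25", "2023-03-01 00:39:30")

print(first)             # 10.0
print(round(second, 6))  # 31.083333
print(keep_trip(first), keep_trip(second))
```

Both values match the `duration` column that appears once `read_dataframe` has been applied.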

Let's apply it to the data we loaded in Question 3.

What's the size of the result?

In [12]:
df_upgrad = read_dataframe(filename)

In [13]:
df_upgrad.shape

(3316216, 20)

### Answer:
- 3,316,216
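As a quick sanity check on the filter, the two shapes reported above can be compared directly. This is plain arithmetic on the numbers already shown, not a new measurement.

```python
# Row counts taken from the df.shape outputs in Questions 3 and 4
loaded = 3_403_766
kept = 3_316_216

dropped = loaded - kept
share_kept = kept / loaded

print(dropped)  # 87550
print(f"{share_kept:.2%}")
```

So the 1 to 60 minute duration filter removes 87,550 rows and keeps a little over 97% of the trips.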

## Question 5. Train a model
We will now train a linear regression model using the same code as in homework 1.

- Fit a dict vectorizer.
- Train a linear regression with default parameters.
- Use pickup and drop-off locations separately; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model?

Hint: print the `intercept_` field in the code block.

- 21.77
- 24.77
- 27.77
- 31.77

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer

In [20]:
df_upgrad.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,duration
0,2,2023-03-01 00:06:43,2023-03-01 00:16:43,1.0,0.0,1.0,N,238,42,2,8.6,1.0,0.5,0.0,0.0,1.0,11.1,0.0,0.0,10.0
1,2,2023-03-01 00:08:25,2023-03-01 00:39:30,2.0,12.4,1.0,N,138,231,1,52.7,6.0,0.5,12.54,0.0,1.0,76.49,2.5,1.25,31.083333
2,1,2023-03-01 00:15:04,2023-03-01 00:29:26,0.0,3.3,1.0,N,140,186,1,18.4,3.5,0.5,4.65,0.0,1.0,28.05,2.5,0.0,14.366667
3,1,2023-03-01 00:49:37,2023-03-01 01:01:05,1.0,2.9,1.0,N,140,43,1,15.6,3.5,0.5,4.1,0.0,1.0,24.7,2.5,0.0,11.466667
4,2,2023-03-01 00:08:04,2023-03-01 00:11:06,1.0,1.23,1.0,N,79,137,1,7.2,1.0,0.5,2.44,0.0,1.0,14.64,2.5,0.0,3.033333


In [21]:
def feature_vectorizing(df_train):
    """Build the feature matrix and target vector from the prepared dataframe."""
    categorical = ['PULocationID', 'DOLocationID']
    # Note: the question asks for the location features only;
    # trip_distance is an additional numerical feature here.
    numerical = ['trip_distance']

    # Build the feature matrix
    dv = DictVectorizer()
    train_dicts = df_train[categorical + numerical].to_dict(orient='records')
    X_train = dv.fit_transform(train_dicts)

    # Set the target
    target = 'duration'
    y_train = df_train[target].values

    return X_train, y_train

In [23]:
X_train, y_train = feature_vectorizing(df_upgrad)

In [25]:
lr = LinearRegression()

In [26]:
lr.fit(X_train, y_train)

In [27]:
lr.intercept_

np.float64(23.848056533743687)

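### Answer:
- 24.77 (the closest option; the run above gives 23.85, with `trip_distance` included as an extra feature beyond the two location columns the question asks for)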
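The question asks for a block that uses only the two location columns and returns both the dict vectorizer and the model. A minimal sketch of such a block is below, assuming scikit-learn is installed. `train_location_model` is a hypothetical name, and the rows are a toy sample built from the `head()` output above, so the intercept it prints says nothing about the homework answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression


def train_location_model(records, durations):
    """Fit a DictVectorizer and a default LinearRegression on location IDs only."""
    categorical = ["PULocationID", "DOLocationID"]
    # DictVectorizer one-hot encodes string values; numeric values would be
    # passed through as a single numeric feature, so cast the IDs to str.
    dicts = [{c: str(row[c]) for c in categorical} for row in records]

    dv = DictVectorizer()
    X = dv.fit_transform(dicts)

    model = LinearRegression()
    model.fit(X, durations)
    return dv, model


# Toy sample: location IDs, distances and durations from the head() output
toy_records = [
    {"PULocationID": 238, "DOLocationID": 42, "trip_distance": 0.0},
    {"PULocationID": 138, "DOLocationID": 231, "trip_distance": 12.4},
    {"PULocationID": 140, "DOLocationID": 186, "trip_distance": 3.3},
    {"PULocationID": 140, "DOLocationID": 43, "trip_distance": 2.9},
]
toy_durations = [10.0, 31.083333, 14.366667, 11.466667]

dv, model = train_location_model(toy_records, toy_durations)
print(dv.feature_names_)   # one feature per distinct location value
print(model.intercept_)    # toy value only
```

Returning `dv` together with `model` matters because the same fitted vectorizer is needed to transform any later data before prediction.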

## Question 6. Register the model

The model is trained, so let's save it with MLflow.

- Let's create a Dockerfile for MLflow, e.g. `mlflow.dockerfile`:
- And add it to `docker-compose.yaml`:


Note that `app-network` is the same network as for the Mage and Postgres containers. If you use a different compose file, adjust it.

If you used the suggested docker-compose snippet, MLflow should be accessible at http://mlflow:5000.

Find the logged model and its `MLmodel` file. What's the size of the model (the `model_size_bytes` field)?

- 14,534
- 9,534
- 4,534
- 1,534
