## Homework

The goal of this homework is to create a simple training pipeline that uses MLflow to track experiments and register the best model, with Mage as the orchestrator.

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), specifically the **Yellow** taxi data for March 2023.

## Question 1. Select the Tool

You can use the same tool you used when completing the module,
or choose a different one for your homework.

What's the name of the orchestrator you chose? 

**Answer**: Prefect


## Question 2. Version

What's the version of the orchestrator? 

```python
import prefect
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
import mlflow
```

```python
print(f'Version: {prefect.__version__}')
```

```
Version: 3.4.5
```
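
The same check can be done without importing the package, which helps when the tool may not be installed in every environment. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of Prefect.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the Prefect version if it is installed, otherwise None.
print(f"Version: {installed_version('prefect')}")
```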


## Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load? 

- 3,003,766
- 3,203,766
- 3,403,766
- 3,603,766

(Include a print statement in your code)

```python
file = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet'
df = pd.read_parquet(file)
```

```python
print(f'Answer: {df.shape[0]}')
```

```
Answer: 3403766
```


## Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously. 

This is what we used (adjusted for yellow dataset):

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
```

Let's apply it to the data we loaded in Question 3.

What's the size of the result? 

- 2,903,766
- 3,103,766
- 3,316,216 
- 3,503,766

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy so the assignment below
    # does not write into a slice of the original frame
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df
```

```python
df = read_dataframe(file)
print(f'Answer: {df.shape[0]}')
```

```
Answer: 3316216
```
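
To see what the duration filter does at its boundaries, here is a small sketch on four made-up trips (the timestamps are hypothetical, not taken from the dataset). It reuses the same duration logic and shows that both bounds are inclusive.

```python
import pandas as pd

# Four hypothetical trips: 30 seconds, 1 minute, 60 minutes, and 61 minutes.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-03-01 08:00:00"] * 4),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-03-01 08:00:30",
        "2023-03-01 08:01:00",
        "2023-03-01 09:00:00",
        "2023-03-01 09:01:00",
    ]),
})

# Same duration logic as read_dataframe: a timedelta converted to minutes.
toy["duration"] = (
    toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
).dt.total_seconds() / 60

# Both bounds are inclusive, so the 1-minute and 60-minute trips are kept.
kept = toy[(toy.duration >= 1) & (toy.duration <= 60)]
print(kept.duration.tolist())  # [1.0, 60.0]
```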


## Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

* Fit a dict vectorizer.
* Train a linear regression with default parameters.
* Use pickup and drop-off locations separately; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model? 

Hint: print the `intercept_` field in the code block

- 21.77
- 24.77
- 27.77
- 31.77

```python
def ohe(df):
    # Turn the dataframe into a list of dictionaries. The IDs are cast to strings
    # so that DictVectorizer one-hot encodes them instead of treating them as numbers.
    id_list = ['DOLocationID', 'PULocationID']
    df = df[id_list].astype(str)
    dict_df = df.to_dict(orient='records')

    return dict_df
```

```python
def train_model(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model
```

```python
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("linreg")
mlflow.sklearn.autolog()
```

```
2025/06/09 10:06:03 INFO mlflow.tracking.fluent: Experiment with name 'linreg' does not exist. Creating a new experiment.
```


```python
with mlflow.start_run():
    # Fit a dictionary vectorizer
    # Get a feature matrix from it
    dv = DictVectorizer()
    dict_df = ohe(df)
    X_train = dv.fit_transform(dict_df)
    y_train = df['duration'].values
    model = train_model(X_train, y_train)
```

```
🏃 View run capricious-crow-89 at: http://127.0.0.1:5000/#/experiments/1/runs/b80cd7bbe1fe4c1d82d2dd69d67593c8
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
```


```python
print(f'Intercept: {model.intercept_:.2f}')
```

```
Intercept: 24.78
```


## Question 6. Register the model 

The model is trained, so let's save it with MLflow.

Find the logged model and open its MLmodel file. What's the size of the model (the `model_size_bytes` field)?

* 14,534
* 9,534
* 4,534
* 1,534


Steps:

* Add the MLflow tracking code to the notebook:
    ```python
    import mlflow
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    mlflow.set_experiment("linreg")
    mlflow.sklearn.autolog()
    # ...
    # ...
    with mlflow.start_run():
        ...  # model training
    ```
* Start the MLflow server:
    ```bash
    mlflow server --backend-store-uri sqlite:///unit3_homework.db --default-artifact-root ./artifacts
    ```
* Open the MLflow UI.
* Run the notebook and make sure every cell completes without errors.
* In the MLflow UI, open the `linreg` experiment, click the run listed under it, open the `Artifacts` tab, and select `model > MLmodel`. The `model_size_bytes` field reads `4501`, so the closest option is 4,534.
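
The last step can also be done in code once the MLmodel file has been downloaded or copied from the artifact store. The sketch below uses only the standard library and scans the file text for the top-level `model_size_bytes` key. `SAMPLE_MLMODEL` is an abbreviated, hypothetical excerpt (a real MLmodel file has more fields), and `read_model_size` is a hypothetical helper, not an MLflow function.

```python
from typing import Optional

# Abbreviated, hypothetical excerpt of an MLmodel file; a real file has more fields.
SAMPLE_MLMODEL = """\
artifact_path: model
flavors:
  sklearn:
    pickled_model: model.pkl
model_size_bytes: 4501
"""


def read_model_size(mlmodel_text: str) -> Optional[int]:
    """Return the top-level model_size_bytes value, or None if it is missing."""
    for line in mlmodel_text.splitlines():
        key, sep, value = line.partition(":")
        # Indented lines keep their leading spaces in `key`, so only
        # unindented (top-level) keys can match here.
        if sep and key == "model_size_bytes":
            return int(value.strip())
    return None


print(read_model_size(SAMPLE_MLMODEL))  # 4501
```

A YAML parser would be the more robust choice for anything beyond reading this one field.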

## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw3
* If your answer doesn't match options exactly, select the closest one.