## Week 3 Homework

The goal of this homework is to familiarize users with workflow orchestration. We start from the solution of homework 1. The notebook can be found below:

https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/01-intro/homework.ipynb

This has already been converted to a script called homework.py in the 03-orchestration folder of this repo.

You will use the FHV dataset like in homework 1.

### Q1. Converting the script to a Prefect flow

If you need Windows support, check `windows.md` for installation instructions.

The current script `homework.py` is fully functional as long as you already have `fhv_tripdata_2021-01.parquet` and `fhv_tripdata_2021-02.parquet` inside a `data` folder. You should be able to run it with:
```
python homework.py
```

We want to bring this to workflow orchestration to add observability around it. The `main` function will be converted to a `flow` and the other functions will be `tasks`. After adding all of the decorators, there is actually one task that you will need to call `.result()` for inside the `flow` to get it to work. Which task is this?

    read_data
    prepare_features
    train_model
    run_model

Important: change all `print` statements to use the Prefect logger. Output from `print` will not appear in the Prefect UI. You have to call `get_run_logger` at the start of the task to use it.

### Answer: train_model

In [1]:

    import pandas as pd
    import os
    from pathlib import Path

In [2]:

    TRAIN_PATH = os.path.join(Path(os.getcwd()).parent.parent,'data')
    print(TRAIN_PATH)

    D:\github_repos\mlops-zoomcamp\data

In [3]:

    PATH_CUR = os.getcwd()
    p = Path(PATH_CUR)
    p.parents[1]

    WindowsPath('D:/github_repos/mlops-zoomcamp')

In [6]:

    !python homework_with_prefect.py

    14:50:04.159 | INFO    | prefect.engine - Created flow run 'vigorous-lemur' for flow 'log-example-flow'
    14:50:04.159 | INFO    | Flow run 'vigorous-lemur' - Using task runner 'ConcurrentTaskRunner'
    14:50:04.309 | INFO    | Flow run 'vigorous-lemur' - Created task run 'read-parquet-task-d3ed3847-0' for task 'read-parquet-task'
    14:50:04.372 | INFO    | Task run 'read-parquet-task-d3ed3847-0' - INFO reading parquet files.
    14:50:04.435 | INFO    | Flow run 'vigorous-lemur' - Created task run 'task-prepare-features-e5923a4a-0' for task 'task-prepare-features'
    14:50:04.920 | INFO    | Flow run 'vigorous-lemur' - Created task run 'read-parquet-task-d3ed3847-1' for task 'read-parquet-task'
    14:50:05.028 | INFO    | Flow run 'vigorous-lemur' - Created task run 'task-prepare-features-e5923a4a-1' for task 'task-prepare-features'
    14:50:07.857 | INFO    | Task run 'read-parquet-task-d3ed3847-1' - INFO reading parquet files.
    14:50:07.954 | INFO    | Flow run 'vigorous-lemur' - Created task run 'train

### Q2. Parameterizing the flow

Right now there are two parameters for `main()` called `train_path` and `val_path`. We want to change the flow function to accept `date` instead. `date` should then be passed to a task that gives both the `train_path` and `val_path` to use.

It should look like this:

```
@flow
def main(date=None):
    train_path, val_path = get_paths(date).result()
```

Because we have two files:

    fhv_tripdata_2021-01.parquet
    fhv_tripdata_2021-02.parquet

Change the `main()` flow call to the following:

```
main(date="2021-03-15")
```

and it should use those files. This is a simplification for testing our homework.
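
The date-to-path rule can be sketched as a plain function. This is a minimal sketch, assuming the training file is the one from two months before `date` and the validation file is from the previous month (consistent with the `2021-03-15` example above), that a missing date falls back to today, and that the files live in `./data`. The helper `shift_month` and the parameter names are hypothetical; in the homework, `get_paths` would additionally carry the `@task` decorator.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def shift_month(year: int, month: int, delta: int) -> Tuple[int, int]:
    """Return (year, month) moved by `delta` months, handling year boundaries."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def get_paths(run_date: Optional[str] = None, data_dir: str = "./data") -> Tuple[str, str]:
    """Build the training and validation file paths for a run date (YYYY-MM-DD)."""
    # Assumption: when no date is given, use today's date.
    day = date.today() if run_date is None else datetime.strptime(run_date, "%Y-%m-%d").date()
    # Assumption: train on the file from two months back, validate on last month's file.
    train_year, train_month = shift_month(day.year, day.month, -2)
    val_year, val_month = shift_month(day.year, day.month, -1)
    train_path = f"{data_dir}/fhv_tripdata_{train_year}-{train_month:02d}.parquet"
    val_path = f"{data_dir}/fhv_tripdata_{val_year}-{val_month:02d}.parquet"
    return train_path, val_path
```

Working on a combined month index avoids special-casing January and February, where subtracting months crosses into the previous year.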

Download the relevant files needed to run the `main` flow if `date` is 2021-08-15.

For example:
```
main(date="2021-08-15")
```
By setting up the logger from the previous step, we should see some logs about our training job. What is the validation MSE when running the flow with this date?

The validation MSE is:

    11.637
    11.837
    12.037
    12.237
    
### Answer: 11.637


### Q3. Saving the model and artifacts

At the moment, we are not saving the model and vectorizer for future use. You don't need a new task for this; you can add it inside the `flow`. The filename requirements were given in the Motivation section and are repeated here:

- Save the model as "model-{date}.bin" where date is in `YYYY-MM-DD` format. Note that `date` here is the value of the `flow` parameter. In practice, this setup makes it very easy to get the latest model to run predictions because you just need to get the most recent one.
- In this example we use a DictVectorizer, which is needed to run future data through our model. Save it as "dv-{date}.b". As above, if the date is 2021-03-15, the output files should be `model-2021-03-15.bin` and `dv-2021-03-15.b`.
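
One way to write the two files is with `pickle` from the standard library. The sketch below is an illustration, not the reference solution: `save_artifacts` is a hypothetical helper, the file names follow the `model-2021-03-15.bin` and `dv-2021-03-15.b` example above, and the objects saved in the usage section are stand-ins for the trained model and the DictVectorizer.

```python
import os
import pickle
import tempfile
from typing import Any, Tuple


def save_artifacts(model: Any, dv: Any, run_date: str, out_dir: str = ".") -> Tuple[str, str]:
    """Pickle the model and the vectorizer under date-stamped file names."""
    model_path = os.path.join(out_dir, f"model-{run_date}.bin")
    dv_path = os.path.join(out_dir, f"dv-{run_date}.b")
    with open(model_path, "wb") as f_out:
        pickle.dump(model, f_out)
    with open(dv_path, "wb") as f_out:
        pickle.dump(dv, f_out)
    return model_path, dv_path


if __name__ == "__main__":
    # Stand-in objects; in the homework these would be the trained model and DictVectorizer.
    with tempfile.TemporaryDirectory() as tmp_dir:
        model_path, dv_path = save_artifacts({"coef": [0.1]}, {"vocabulary": {}}, "2021-03-15", tmp_dir)
        # os.path.getsize reports the size on disk in bytes.
        print(os.path.basename(model_path), os.path.getsize(model_path), "bytes")
        print(os.path.basename(dv_path), os.path.getsize(dv_path), "bytes")
```

`os.path.getsize` is also a quick way to check the size of a saved file without leaving Python.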

By using this file name, during inference, we can just pull the latest model from our model directory and apply it. Assuming we already had a list of filenames:

    ['model-2021-03-15.bin', 'model-2021-04-15.bin', 'model-2021-05-15.bin']

We could do something like `sorted(model_list, reverse=True)[0]` to get the filename of the latest file, since sorting in descending order puts the most recent date first. This is the simplest way to consistently use the latest trained model for inference. Tools like MLflow give us more control logic to use in flows.
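
Because the date in each file name is zero-padded `YYYY-MM-DD`, lexicographic order matches chronological order, so the newest file is the first element of a descending sort. A small sketch, where `latest_model` is a hypothetical helper and all names are assumed to share the same `model-` prefix:

```python
from typing import List


def latest_model(filenames: List[str]) -> str:
    """Return the file name with the most recent YYYY-MM-DD date stamp."""
    if not filenames:
        raise ValueError("no model files found")
    # Descending sort: the newest date comes first.
    return sorted(filenames, reverse=True)[0]


model_list = ["model-2021-03-15.bin", "model-2021-04-15.bin", "model-2021-05-15.bin"]
print(latest_model(model_list))  # model-2021-05-15.bin
```

`max(model_list)` gives the same answer without sorting the whole list.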

What is the file size of the DictVectorizer that we trained when the date is 2021-08-15?

    13,000 bytes
    23,000 bytes
    33,000 bytes
    43,000 bytes
    
### Answer: 13,000 bytes


### Q4. Creating a deployment with a CronSchedule

We previously showed the `IntervalSchedule` in the video tutorials. In some cases, the interval is too rigid. For example, what if we wanted to run this flow on the 15th of every month? An interval of 30 days would drift out of sync because months differ in length. In cases like these, the `CronSchedule` is more appropriate; see the Prefect documentation for details.

Cron is an important part of workflow orchestration. It is used to schedule tasks, and was a predecessor of more mature orchestration frameworks. A lot of teams still use Cron in production. Even if you don't use Cron, the Cron expression is a very common way to write a schedule, and the basics are worth learning for orchestration, even outside Prefect.

For this exercise, use a CronSchedule when creating a Prefect deployment.

What is the Cron expression to run a flow at 9 AM every 15th of the month?

   - `* * 15 9 0`
   - `9 15 * * *`
   - `0 9 15 * *`
   - `0 15 9 1 *`

Hint: there are many Cron to English tools. Try looking for one to help you.

### Answer: `0 9 15 * *`

Create a deployment with `prefect deployment create` after you write your `DeploymentSpec`.
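
To see why `0 9 15 * *` means 9 AM on the 15th, recall that a standard Cron expression has five fields: minute, hour, day of month, month, and day of week. The sketch below is a deliberately simplified matcher that only understands `*` and single integers (no ranges, lists, or steps) and requires every field to match, so it is a teaching aid rather than a full Cron implementation. `parse_cron` and `matches` are hypothetical helpers, not part of Prefect.

```python
from datetime import datetime
from typing import Dict

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression: str) -> Dict[str, str]:
    """Split a five-field Cron expression into named fields."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def matches(expression: str, moment: datetime) -> bool:
    """Check whether `moment` satisfies the expression ('*' and integers only)."""
    fields = parse_cron(expression)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        "day_of_week": moment.isoweekday() % 7,  # Cron counts Sunday as 0
    }
    return all(value == "*" or int(value) == actual[name] for name, value in fields.items())


print(parse_cron("0 9 15 * *"))
print(matches("0 9 15 * *", datetime(2021, 8, 15, 9, 0)))
```

Reading the fields left to right: minute `0`, hour `9`, day of month `15`, any month, any day of the week.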

### Q5. Viewing the Deployment

View the deployment in the UI. When first loading, we may not see that many flows because the default filter is 1 day back and 1 day forward. Remove the filter for 1 day forward to see the scheduled runs.

How many flow runs are scheduled by Prefect in advance? You should not be counting manually. There is a number of upcoming runs on the top right of the dashboard.

   - 0
   - 3
   - 10
   - 25


### Answer: 3

### Q6. Creating a work-queue

To run this flow, you will need an agent and a work queue. Because our flow is scheduled monthly, an agent won't pick it up right away. For this exercise, create a work-queue from the UI and view it using the CLI.

For all CLI commands with Prefect, you can use `--help` to get more information.

For example:

    prefect --help
    prefect work-queue --help

What is the command to view the available work-queues?

   - `prefect work-queue inspect`
   - `prefect work-queue ls`
   - `prefect work-queue preview`
   - `prefect work-queue list`
   
### Answer: `prefect work-queue ls`
