# 3.6 Homework

The goal of this homework is to familiarize users with workflow orchestration. We start from the solution of homework 1. The notebook can be found below:

https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/01-intro/homework.ipynb

This has already been converted to a script called `homework.py` in the `03-orchestration` folder of this repo.

You will use the FHV dataset, as in homework 1.

## Q1. Converting the script to a Prefect flow
After adding all of the decorators, there is one task for which you will need to call `.result()` inside the flow to get it to work. Which task is this?

* read_data
* prepare_features
* train_model
* run_model

Important: replace all print statements with calls to the Prefect logger, because output from `print` does not appear in the Prefect UI. To use the logger, call `get_run_logger` at the start of the task.

### ANS:
```python
# train the model
lr, dv = train_model(df_train_processed, categorical, date).result()
```

## Q2. Parameterizing the flow

The validation MSE is:

* 11.637
* 11.837
* 12.037
* 12.237

In [3]:
!python homework.py

03:13:11.402 | INFO    | prefect.engine - Created flow run 'macho-gorilla' for flow 'main'
03:13:11.402 | INFO    | Flow run 'macho-gorilla' - Using task runner 'SequentialTaskRunner'
03:13:11.489 | INFO    | Flow run 'macho-gorilla' - Created task run 'read_data-4c7f9de4-0' for task 'read_data'
03:13:17.996 | INFO    | Task run 'read_data-4c7f9de4-0' - Finished in state Completed()
03:13:18.033 | INFO    | Flow run 'macho-gorilla' - Created task run 'prepare_features-4ee39d9f-0' for task 'prepare_features'
03:13:18.284 | INFO    | Task run 'prepare_features-4ee39d9f-0' - The mean duration of training is 18.230538791569668
03:13:27.304 | INFO    | Task run 'prepare_features-4ee39d9f-0' - Finished in state Completed()
03:13:27.342 | INFO    | Flow run 'macho-gorilla' - Created task run 'read_data-4c7f9de4-1' for task 'read_data'
03:13:33.455 | INFO    | Task run 'read_data-4c7f9de4-1' - Finished in state Completed()
03:13:33.491 | INFO    | Flow run 'macho-gorilla' - Created task run 'p

### ANS
11.637

## Q3. Saving the model and artifacts
At the moment, we are not saving the model and vectorizer for future use. You don't need a new task for this; you can just add it inside the flow. The filename requirements were given in the Motivation section and are repeated here:

Save the model as "model-{date}.bin", where date is in YYYY-MM-DD format. Note that date here is the value of the flow parameter. In practice, this setup makes it very easy to get the latest model to run predictions, because you just need to pick the most recent file.

In this example we use a DictVectorizer, which is needed to run future data through our model. Save it as "dv-{date}.b". As above, if the date is 2021-03-15, the output files should be model-2021-03-15.bin and dv-2021-03-15.b.
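The naming scheme above can be sketched as follows. This is a minimal illustration rather than the homework solution: `save_artifacts` is a hypothetical helper name, pickling is assumed as the serialization method, and plain dictionaries stand in for the trained model and the DictVectorizer so that the snippet runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(model, dv, date, out_dir="models"):
    """Pickle the model and vectorizer under date-stamped filenames.

    `save_artifacts` is a hypothetical helper, not part of homework.py.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    model_path = out / f"model-{date}.bin"
    dv_path = out / f"dv-{date}.b"

    with open(model_path, "wb") as f_out:
        pickle.dump(model, f_out)
    with open(dv_path, "wb") as f_out:
        pickle.dump(dv, f_out)

    return model_path, dv_path


# Stand-in objects (assumptions): real code would pass the fitted
# LinearRegression and DictVectorizer instead.
with tempfile.TemporaryDirectory() as tmp:
    model_path, dv_path = save_artifacts(
        {"coef": [0.1, 0.2]},
        {"vocabulary": {"PUlocationID=1": 0}},
        "2021-03-15",
        out_dir=tmp,
    )
    print(model_path.name, dv_path.name)
    # File size in bytes, the same figure `ls -l` reports.
    print(dv_path.stat().st_size, "bytes")
```

Because the date is a string in YYYY-MM-DD format, sorting the filenames alphabetically also sorts them chronologically, which is what makes picking the most recent model straightforward.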

What is the file size of the DictVectorizer that we trained when the date is 2021-08-15?

* 13,000 bytes
* 23,000 bytes
* 33,000 bytes
* 43,000 bytes


In [4]:
!ls -alt ./models/

total 880
-rw-rw-r-- 1 ubuntu ubuntu  13191 Jun 11 03:13 dv-2021-08-15.pkl
-rw-rw-r-- 1 ubuntu ubuntu   4581 Jun 11 03:13 model-2021-08-15.pkl
drwxrwxr-x 7 ubuntu ubuntu   4096 Jun 11 02:59 ..
drwxrwxr-x 2 ubuntu ubuntu   4096 Jun 10 14:40 .
-rw-rw-r-- 1 ubuntu ubuntu  13218 Jun 10 14:37 dv-2021-04-01.pkl
-rw-rw-r-- 1 ubuntu ubuntu   4589 Jun 10 14:37 model-2021-04-01.pkl
-rw-rw-r-- 1 ubuntu ubuntu  13218 Jun 10 13:24 preprocessor.b
-rw-rw-r-- 1 ubuntu ubuntu 411381 Jun  6 12:47 lin_reg.bin
-rw-rw-r-- 1 ubuntu ubuntu 411381 Jun  6 12:47 lin_reg_expr1.bin


### ANS
13,000 bytes (the actual file size is 13,191 bytes)

## Q4. Creating a deployment with a CronSchedule
What is the Cron expression to run a flow at 9 AM every 15th of the month?

* `* * 15 9 0`
* `9 15 * * *`
* `0 9 15 * *`
* `0 15 9 1 *`

### ANS:
`0 9 15 * *`

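```python
# Import paths as used in the Prefect 2.0 beta releases that provide DeploymentSpec.
from prefect.deployments import DeploymentSpec
from prefect.flow_runners import SubprocessFlowRunner
from prefect.orion.schemas.schedules import CronSchedule

# Schedule `main` for 09:00 on the 15th of every month, New York time.
DeploymentSpec(
    flow=main,
    name="model_training_2021-08-15",
    schedule=CronSchedule(
        cron="0 9 15 * *",
        timezone="America/New_York"),
    flow_runner=SubprocessFlowRunner(),
    tags=["ml"]
)
```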
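
To see why `0 9 15 * *` is the right choice, it helps to label the five cron fields (minute, hour, day of month, month, day of week). The sketch below is a deliberately simplified checker, not how Prefect evaluates schedules: `parse_cron` and `matches` are hypothetical helpers that understand only plain integers and `*`, and they require every field to match, whereas real cron implementations also support ranges, lists, and steps.

```python
from datetime import datetime

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expr):
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def matches(expr, when):
    """Check `when` against an expression of plain integers and '*' only."""
    fields = parse_cron(expr)
    actual = {
        "minute": when.minute,
        "hour": when.hour,
        "day_of_month": when.day,
        "month": when.month,
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        "day_of_week": when.isoweekday() % 7,
    }
    return all(
        value == "*" or int(value) == actual[name]
        for name, value in fields.items()
    )


print(parse_cron("0 9 15 * *"))
# 09:00 on the 15th matches; 15:00 on the 15th does not.
print(matches("0 9 15 * *", datetime(2021, 8, 15, 9, 0)))
print(matches("0 9 15 * *", datetime(2021, 8, 15, 15, 0)))
# "9 15 * * *" means 15:09 every day, so it fires on the 16th as well.
print(matches("9 15 * * *", datetime(2021, 8, 16, 15, 9)))
```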

## Q5. Viewing the Deployment
View the deployment in the UI. When first loading, we may not see that many flows because the default filter is 1 day back and 1 day forward. Remove the filter for 1 day forward to see the scheduled runs.

How many flow runs are scheduled by Prefect in advance? You should not count them manually; the number of upcoming runs is shown at the top right of the dashboard.

* 0
* 3
* 10
* 25

In [5]:
!prefect deployment create homework.py

Loading deployment specifications from python script at 'homework.py'...
03:27:21.564 | INFO    | prefect.engine - Created flow run 'crouching-grebe' for flow 'main'
03:27:21.564 | INFO    | Flow run 'crouching-grebe' - Using task runner 'SequentialTaskRunner'
03:27:21.683 | INFO    | Flow run 'crouching-grebe' - Created task run 'read_data-c914d840-0' for task 'read_data'
03:27:28.203 | INFO    | Task run 'read_data-c914d840-0' - Finished in state Completed()
03:27:28.237 | INFO    | Flow run 'crouching-grebe' - Created task run 'prepare_features-21588f7a-0' for task 'prepare_features'
03:27:28.494 | INFO    | Task run 'prepare_features-21588f7a-0' - The mean duration of training is 18.230538791569668
03:27:37.451 | INFO    | Task run 'prepare_features-21588f7a-0' - Finished in state Completed()
03:27:37.486 | INFO    | Flow run 'crouching-grebe' - Created task run 'read_data-c914d840-1' for task 'read_data'
03:27:43.474 | INFO    | Task run 'read_data-c914d840-1' - Finished 

### ANS 
4 upcoming Runs

## Q6. Creating a work-queue

In order to run this flow, you will need an agent and a work queue. Because we scheduled our flow to run only once a month, it won't really get picked up by an agent. For this exercise, create a work-queue from the UI and view it using the CLI.

What is the command to view the available work-queues?

* prefect work-queue inspect
* prefect work-queue ls
* prefect work-queue preview
* prefect work-queue list


In [8]:
!prefect work-queue ls

                             Work Queues                             
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃                                   ID ┃ Name   ┃ Concurrency Limit ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ 57b6c7d6-1a4c-4717-8e3e-3074d921018c │ global │ None              │
└──────────────────────────────────────┴────────┴───────────────────┘
                     (**) denotes a paused queue                     
