## Homework

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.

In [1]:
import requests
import datetime
import os

import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from evidently.metrics import (
    ColumnDriftMetric, 
    DatasetDriftMetric, 
    DatasetMissingValuesMetric,
    ColumnQuantileMetric
)

from joblib import load, dump
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error



In [2]:
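files = [
    ('green_tripdata_2023-03.parquet', './data'),
    ('green_tripdata_2023-02.parquet', './data'),
    ('green_tripdata_2023-01.parquet', './data')]

for file, path in files:
    os.makedirs(path, exist_ok=True)  # create the target directory if it is missing
    save_path = f"{path}/{file}"
    if os.path.exists(save_path):  # skip files that were already downloaded
        continue
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # fail early instead of saving an error page
    with open(save_path, "wb") as handle:
        # iter_content() yields 1-byte chunks by default, so the bar counts bytes
        for data in tqdm(resp.iter_content(),
                        desc=f"{file}",
                        postfix=f"save to {save_path}",
                        total=int(resp.headers["Content-Length"])):
            handle.write(data)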

## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2023 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 85371
* 78537
* 62495
* 54396

In [10]:
march_23 = pd.read_parquet('data/green_tripdata_2023-03.parquet')
march_23.shape #??

(72044, 20)

In [8]:
march_23.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,2023-03-01 00:25:10,2023-03-01 00:35:47,N,1.0,82,196,1.0,2.36,13.5,1.0,0.5,0.0,0.0,,1.0,16.0,2.0,1.0,0.0
1,2,2023-03-01 00:14:29,2023-03-01 00:25:04,N,1.0,7,7,1.0,0.78,-6.5,-1.0,-0.5,0.0,0.0,,-1.0,-9.0,3.0,1.0,0.0
2,2,2023-03-01 00:14:29,2023-03-01 00:25:04,N,1.0,7,7,1.0,0.78,6.5,1.0,0.5,0.0,0.0,,1.0,9.0,3.0,1.0,0.0
3,2,2023-02-28 22:59:46,2023-02-28 23:08:38,N,1.0,166,74,1.0,1.66,11.4,1.0,0.5,2.78,0.0,,1.0,16.68,1.0,1.0,0.0
4,2,2023-03-01 00:54:03,2023-03-01 01:03:14,N,1.0,236,229,1.0,3.14,15.6,1.0,0.5,4.17,0.0,,1.0,25.02,1.0,1.0,2.75


## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

In [11]:
reference = pd.read_parquet('data/reference.parquet')

report = Report(
    metrics=[
        DataDriftPreset(columns=[
            'trip_distance', 
            'fare_amount',
            'tip_amount'
        ]),
        ColumnQuantileMetric(
            column_name='fare_amount',
            quantile=0.5
        )
    ]
)

In [14]:
reference.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,...,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,duration_min,prediction
30514,2,2022-01-18 07:10:01,2022-01-18 07:28:00,N,1.0,95,258,1.0,3.14,14.0,...,2.0,0.0,,0.3,16.8,1.0,1.0,0.0,17.983333,13.61648
30515,2,2022-01-18 07:55:50,2022-01-18 08:09:05,N,1.0,95,135,1.0,2.72,11.5,...,0.0,0.0,,0.3,12.3,2.0,1.0,0.0,13.25,11.420535
30516,2,2022-01-18 07:31:13,2022-01-18 07:41:07,N,1.0,74,75,1.0,1.66,8.5,...,2.79,0.0,,0.3,12.09,1.0,1.0,0.0,9.9,10.117697
30517,2,2022-01-18 07:44:52,2022-01-18 07:52:31,N,1.0,75,74,1.0,1.07,7.0,...,0.0,0.0,,0.3,7.8,2.0,1.0,0.0,7.65,8.637719
30518,2,2022-01-18 08:00:14,2022-01-18 08:07:20,N,1.0,74,41,1.0,0.94,6.0,...,2.04,0.0,,0.3,8.84,1.0,1.0,0.0,7.1,8.364784


In [15]:
report.run(
    reference_data=reference[['trip_distance', 'fare_amount', 'tip_amount']], 
    current_data=march_23[['trip_distance', 'fare_amount', 'tip_amount']]
)
report

## Q3. Prefect flow 

Let’s update the Prefect tasks by giving them meaningful names and specifying the number of retries and the delay between them.

Hint: use the `evidently_metrics_calculation.py` script as a starting point to implement your solution. Check the Prefect docs for the available task parameters.

What is the correct way of doing that?

* `@task(retries_num=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries_num=2, retry_delay_seconds=5, name="calculate metrics")`
* `@task(retries=2, retry_seconds=5, task_name="calculate metrics")`
* `@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`

`@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`

## Q4. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2023). 

What is the maximum value of the metric `quantile = 0.5` on the `"fare_amount"` column during March 2023 (calculated daily)?

* 10
* 12.5
* 14
* 14.8

`14`
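
As a rough cross-check outside the monitoring pipeline, the daily value can be approximated directly in pandas. This is a minimal sketch, not the homework's Evidently-based calculation: the helper `max_daily_quantile` and the toy frame are hypothetical, pandas' default linear interpolation is assumed and may not match Evidently exactly, and rows are filtered to the requested date window because the March file also contains trips with pickup times outside March.

```python
import pandas as pd


def max_daily_quantile(df, column, quantile, ts_column, start, end):
    """Return (maximum, per-day Series) of a daily quantile within [start, end).

    Hypothetical helper for illustration; not part of the course scripts.
    """
    ts = pd.to_datetime(df[ts_column])
    # Keep only rows whose timestamp falls inside the half-open window.
    mask = (ts >= pd.Timestamp(start)) & (ts < pd.Timestamp(end))
    # Attach the calendar day positionally to avoid index-alignment surprises.
    window = df.loc[mask, [column]].assign(day=ts[mask].dt.floor("D").to_numpy())
    daily = window.groupby("day")[column].quantile(quantile)
    return daily.max(), daily


# Toy data standing in for the real parquet file (values are made up).
toy = pd.DataFrame({
    "lpep_pickup_datetime": [
        "2023-02-28 23:50:00",  # outside March, should be ignored
        "2023-03-01 08:00:00",
        "2023-03-01 09:00:00",
        "2023-03-01 10:00:00",
        "2023-03-02 08:00:00",
        "2023-03-02 09:00:00",
    ],
    "fare_amount": [99.0, 10.0, 12.0, 20.0, 14.0, 16.0],
})

max_value, daily = max_daily_quantile(
    toy, "fare_amount", 0.5, "lpep_pickup_datetime", "2023-03-01", "2023-04-01"
)
print(daily)
print(max_value)
```

With the real data, the same call on `march_23` would give one median per day; the value stored by the monitoring job remains the reference for the answer.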

## Q5. Dashboard


Finally, let’s add panels for the newly added metrics to the dashboard. After customizing the dashboard, let’s save its config so that we can access it later. Hint: click on “Save dashboard” to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where should the dashboard config file be placed?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)
* `project_folder/dashboards`  (05-monitoring/dashboards)
* `project_folder/data`  (05-monitoring/data)

## Submit the results

* Submit your results here: TBA
* You can submit your solution multiple times. In this case, only the last submission will be used
* If your answer doesn't match options exactly, select the closest one

## Deadline

The deadline for submitting is 3 July (Monday), 23:00 CEST (Berlin time). 

After that, the form will be closed.