## Homework

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.



## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2023 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

In [15]:
import requests
import datetime
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric, ColumnQuantileMetric, ColumnCorrelationsMetric

from joblib import load, dump
from tqdm import tqdm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

In [2]:
files = [('green_tripdata_2023-03.parquet', './data')]

print("Download files:")
for file, path in files:
    url=f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp=requests.get(url, stream=True)
    save_path=f"{path}/{file}"
    with open(save_path, "wb") as handle:
        for data in tqdm(resp.iter_content(),
                        desc=f"{file}",
                        postfix=f"save to {save_path}",
                        total=int(resp.headers["Content-Length"])):
            handle.write(data)

Download files:


green_tripdata_2023-03.parquet: 100%|██████████| 1730999/1730999 [00:05<00:00, 337099.59it/s, save to ./data/green_tripdata_2023-03.parquet]


In [6]:
import pandas as pd
mar23_data = pd.read_parquet('data/green_tripdata_2023-03.parquet')
len(mar23_data)

72044

## Q2. Metric

Let's expand the number of data quality metrics we’d like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`).

What metric did you choose?

DatasetDriftMetric

## Q3. Prefect flow 

Let’s update the Prefect tasks by giving them meaningful names and specifying the number of retries and the delay between them.

Hint: use the `evidently_metrics_calculation.py` script as a starting point to implement your solution. Check the Prefect docs for the available task parameters.

What is the correct way of doing that?

`@task(retries=2, retry_delay_seconds=5, name="calculate metrics")`
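
To make the meaning of these two parameters concrete, here is a minimal plain-Python sketch of retry-with-delay behaviour. It is only an illustration of the semantics (one initial attempt plus up to `retries` further attempts, with a pause in between), not Prefect's implementation. The names `with_retries` and `flaky_metric_calculation` are hypothetical, and the `sleep` argument exists only so the demo does not actually wait.

```python
import time
from functools import wraps


def with_retries(retries=2, retry_delay_seconds=5, sleep=time.sleep):
    """Re-run the wrapped function up to `retries` extra times after a failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # 1 initial attempt + `retries` retries
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries: surface the last error
                    sleep(retry_delay_seconds)  # wait before the next attempt
        return wrapper
    return decorator


calls = []


# Hypothetical task that fails twice before succeeding.
# The no-op `sleep` keeps the demo from actually waiting 5 seconds.
@with_retries(retries=2, retry_delay_seconds=5, sleep=lambda seconds: None)
def flaky_metric_calculation():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = flaky_metric_calculation()
print(outcome, len(calls))
```

With `retries=2`, the function succeeds on its third call; a third failure would have been raised to the caller.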

## Q4. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2023). 

What is the maximum value of the metric `quantile = 0.5` on the `"fare_amount"` column during March 2023 (calculated daily)?

In [7]:
mar23_data["duration_min"] = mar23_data.lpep_dropoff_datetime - mar23_data.lpep_pickup_datetime
mar23_data.duration_min = mar23_data.duration_min.apply(lambda td : float(td.total_seconds())/60)

mar23_data = mar23_data[(mar23_data.duration_min >= 0) & (mar23_data.duration_min <= 60)]
mar23_data = mar23_data[(mar23_data.passenger_count > 0) & (mar23_data.passenger_count <= 8)]


In [8]:
# data labeling like above
target = "duration_min"
num_features = ["passenger_count", "trip_distance", "fare_amount", "total_amount"]
cat_features = ["PULocationID", "DOLocationID"]

In [10]:
reference_data = pd.read_parquet('data/reference.parquet')
with open('models/lin_reg.bin', 'rb') as f_in:
    model = load(f_in)

In [11]:
current_preds = model.predict(mar23_data[num_features + cat_features])
mar23_data['prediction'] = current_preds
print(mean_absolute_error(mar23_data.duration_min, mar23_data.prediction))

3.9341453215257784


In [16]:
column_mapping = ColumnMapping(
    # no target column is analysed
    target=None,
    prediction='prediction',
    numerical_features=num_features,
    categorical_features=cat_features
)

report = Report(metrics=[
    # choose prediction col to analyse
    ColumnDriftMetric(column_name='prediction'),
    DatasetDriftMetric(),
    DatasetMissingValuesMetric(),
    ColumnQuantileMetric(column_name="fare_amount", quantile=0.5),
    ColumnCorrelationsMetric(column_name="prediction")
]
)

In [17]:
report.run(reference_data=reference_data, current_data=mar23_data, column_mapping=column_mapping)
result = report.as_dict()




In [18]:
result

{'metrics': [{'metric': 'ColumnDriftMetric',
   'result': {'column_name': 'prediction',
    'column_type': 'num',
    'stattest_name': 'Wasserstein distance (normed)',
    'stattest_threshold': 0.1,
    'drift_score': 0.40395123422956475,
    'drift_detected': True,
    'current': {'small_distribution': {'x': [-66.42239440921844,
       -30.212701095002565,
       5.996992219213311,
       42.20668553342918,
       78.41637884764506,
       114.62607216186095,
       150.8357654760768,
       187.04545879029268,
       223.25515210450857,
       259.46484541872445,
       295.67453873294033],
      'y': [7.502897273376752e-06,
       0.0001533925664779247,
       0.02718508095957547,
       0.0002517638862844199,
       1.3338484041558668e-05,
       1.2504828788961262e-06,
       1.2504828788961251e-06,
       2.084138131493542e-06,
       4.1682762629870836e-07,
       8.336552525974167e-07]}},
    'reference': {'small_distribution': {'x': [-36.73636669418323,
       -15.174383681787

In [19]:
result['metrics'][3]['result']


{'column_name': 'fare_amount',
 'column_type': 'num',
 'quantile': 0.5,
 'current': {'value': 12.8},
 'reference': {'value': 10.0}}
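
The value above treats the whole month as a single batch, whereas the question asks for the metric calculated daily. One way to cross-check is to group by pickup day in pandas and take the 0.5 quantile per day. This is a sketch that bypasses Evidently, so it assumes pandas' default quantile interpolation is an acceptable stand-in for what `ColumnQuantileMetric` computes. The helper name `daily_quantile` and the toy frame are made up for illustration and are not real taxi data.

```python
import pandas as pd


def daily_quantile(df, column="fare_amount", q=0.5,
                   time_column="lpep_pickup_datetime"):
    """Return the q-quantile of `column` for each calendar day of `time_column`."""
    day = df[time_column].dt.floor("D")  # truncate timestamps to midnight
    return df.groupby(day)[column].quantile(q)


# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2023-03-01 08:00", "2023-03-01 09:30", "2023-03-01 18:15",
        "2023-03-02 07:45", "2023-03-02 12:00",
    ]),
    "fare_amount": [10.0, 12.0, 20.0, 8.0, 14.0],
})

per_day = daily_quantile(toy)
# Largest daily median and the day on which it occurs
print(per_day.max(), per_day.idxmax())
```

Applied to the real data, this would be `daily_quantile(mar23_data).max()`. If the file contains pickups dated outside March 2023, those rows would need to be filtered out first.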

In [20]:
result['metrics'][4]['result']


{'column_name': 'prediction',
 'current': {'pearson': {'column_name': 'prediction',
   'kind': 'pearson',
   'values': {'x': ['passenger_count',
     'trip_distance',
     'fare_amount',
     'total_amount'],
    'y': [0.00912494186184666,
     0.790407847382481,
     0.9933601474505193,
     0.9839636274576733]}},
  'spearman': {'column_name': 'prediction',
   'kind': 'spearman',
   'values': {'x': ['passenger_count',
     'trip_distance',
     'fare_amount',
     'total_amount'],
    'y': [0.017125577009466913,
     0.8154441317470709,
     0.9861252489140606,
     0.9794597652825381]}},
  'kendall': {'column_name': 'prediction',
   'kind': 'kendall',
   'values': {'x': ['passenger_count',
     'trip_distance',
     'fare_amount',
     'total_amount'],
    'y': [0.013710226441397612,
     0.7045651338715974,
     0.9140423458026549,
     0.8788591403348839]}}},
 'reference': {'pearson': {'column_name': 'prediction',
   'kind': 'pearson',
   'values': {'x': ['passenger_count',
     't

In [21]:
result['metrics'][4]['result']['current']


{'pearson': {'column_name': 'prediction',
  'kind': 'pearson',
  'values': {'x': ['passenger_count',
    'trip_distance',
    'fare_amount',
    'total_amount'],
   'y': [0.00912494186184666,
    0.790407847382481,
    0.9933601474505193,
    0.9839636274576733]}},
 'spearman': {'column_name': 'prediction',
  'kind': 'spearman',
  'values': {'x': ['passenger_count',
    'trip_distance',
    'fare_amount',
    'total_amount'],
   'y': [0.017125577009466913,
    0.8154441317470709,
    0.9861252489140606,
    0.9794597652825381]}},
 'kendall': {'column_name': 'prediction',
  'kind': 'kendall',
  'values': {'x': ['passenger_count',
    'trip_distance',
    'fare_amount',
    'total_amount'],
   'y': [0.013710226441397612,
    0.7045651338715974,
    0.9140423458026549,
    0.8788591403348839]}}}

In [22]:
result['metrics'][4]['result']['current']['pearson']['values']['y'][3]


0.9839636274576733
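
Index `3` in the last cell corresponds to `total_amount`, which is easy to misread. A small sketch of a name-based lookup follows, with the `x`/`y` lists copied from the Pearson output above. The variable names `pearson_values` and `by_feature` are illustrative.

```python
# Pearson block copied from the report output shown above
pearson_values = {
    "x": ["passenger_count", "trip_distance", "fare_amount", "total_amount"],
    "y": [0.00912494186184666,
          0.790407847382481,
          0.9933601474505193,
          0.9839636274576733],
}

# Pair each feature name with its correlation so lookups do not rely on position
by_feature = dict(zip(pearson_values["x"], pearson_values["y"]))

print(by_feature["total_amount"])
```

In the notebook, the same dictionary can be built from `result['metrics'][4]['result']['current']['pearson']['values']`.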