## Homework

The goal of this homework is to familiarize users with monitoring for ML batch services, using a PostgreSQL database to store metrics and Grafana to visualize them.


In [35]:
import requests
import datetime
import pandas as pd
import os

from evidently import Report, Dataset, DataDefinition
from evidently.presets import DataDriftPreset

from joblib import load, dump
from tqdm import tqdm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

## Q1. Prepare the dataset

Start with `baseline_model_nyc_taxi_data.ipynb`. Download the March 2024 Green Taxi data. We will use this data to simulate production usage of a taxi trip duration prediction service.

What is the shape of the downloaded data? How many rows are there?

* 72044
* 78537 
* **57457**
* 54396


In [36]:
os.makedirs('data', exist_ok=True)

files = [('green_tripdata_2024-03.parquet', 'data')]

for file, path in files:
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
    resp = requests.get(url, stream=True)
    if resp.status_code != 200:
        raise RuntimeError(f"Error downloading {url}: status {resp.status_code}")
    save_path = os.path.join(path, file)
    with open(save_path, "wb") as handle:
        total = int(resp.headers.get("Content-Length", 0))
        for chunk in tqdm(resp.iter_content(chunk_size=8192),
                          desc=file,
                          total=total // 8192):
            handle.write(chunk)

march_data = pd.read_parquet('data/green_tripdata_2024-03.parquet')
print("Shape:", march_data.shape)


green_tripdata_2024-03.parquet: 168it [00:00, 1324.94it/s]                         

Shape: (57457, 20)





In [37]:
march_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57457 entries, 0 to 57456
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               57457 non-null  int32         
 1   lpep_pickup_datetime   57457 non-null  datetime64[us]
 2   lpep_dropoff_datetime  57457 non-null  datetime64[us]
 3   store_and_fwd_flag     55360 non-null  object        
 4   RatecodeID             55360 non-null  float64       
 5   PULocationID           57457 non-null  int32         
 6   DOLocationID           57457 non-null  int32         
 7   passenger_count        55360 non-null  float64       
 8   trip_distance          57457 non-null  float64       
 9   fare_amount            57457 non-null  float64       
 10  extra                  57457 non-null  float64       
 11  mta_tax                57457 non-null  float64       
 12  tip_amount             57457 non-null  float64       
 13  t

## Q2. Metric

Let's expand the number of data quality metrics we'd like to monitor! Please add one metric of your choice and a quantile value for the `"fare_amount"` column (`quantile=0.5`).

Hint: explore the Evidently metric `ColumnQuantileMetric` (`from evidently.metrics import ColumnQuantileMetric`). The cells below run Evidently 0.7.8, where the quantile metric is imported as `QuantileValue` instead.

What metric did you choose?

In [38]:
import evidently
print(evidently.__version__)

import pkgutil, evidently.metrics
print([m.name for m in pkgutil.iter_modules(evidently.metrics.__path__)])

0.7.8
['_legacy', 'classification', 'column_statistics', 'dataset_statistics', 'group_by', 'recsys', 'regression', 'row_test_summary']


In [39]:
from evidently.metrics.column_statistics import QuantileValue

metric = QuantileValue(column="fare_amount", quantile=0.5)

## Q3. Monitoring

Let’s start monitoring. Run expanded monitoring for a new batch of data (March 2024). 

What is the maximum value of metric `quantile = 0.5` on the `"fare_amount"` column during March 2024 (calculated daily)?

* 10
* 12.5
* **14.2**
* 14.8

In [42]:
# Ensure the 'lpep_pickup_datetime' column is in datetime format
march_data['lpep_pickup_datetime'] = pd.to_datetime(march_data['lpep_pickup_datetime'])

# Filter data for March 2024; .copy() avoids a SettingWithCopyWarning on the column assignment below
df_march = march_data[(march_data['lpep_pickup_datetime'].dt.month == 3) & (march_data['lpep_pickup_datetime'].dt.year == 2024)].copy()

# Extract the date without the time
df_march['pickup_date'] = df_march['lpep_pickup_datetime'].dt.date

# Calculate the daily median of 'fare_amount'
daily_median = df_march.groupby('pickup_date')['fare_amount'].median()

# Get the maximum of the daily medians
max_daily_median = daily_median.max()

print(f"The maximum daily median of 'fare_amount' in March 2024 is: {max_daily_median}")

The maximum daily median of 'fare_amount' in March 2024 is: 14.2


## Q4. Dashboard


Finally, let's add panels with the newly added metrics to the dashboard. After customizing the dashboard, let's save its config so that we can access it later. Hint: click on "Save dashboard" to access the JSON configuration of the dashboard. This configuration should be saved locally.

Where to place a dashboard config file?

* `project_folder` (05-monitoring)
* `project_folder/config`  (05-monitoring/config)
* **`project_folder/dashboards`  (05-monitoring/dashboards)**
* `project_folder/data`  (05-monitoring/data)


Assume the following directory structure:

    05-monitoring/
    ├── dashboards/
    │   └── fare_dashboard.json
    └── provisioning/
        └── dashboards/
            └── dashboard.yaml

- `fare_dashboard.json`: contains the JSON definition of your dashboard.
- `dashboard.yaml`: configuration file that tells Grafana where to find the dashboard JSON files.

**Step 1:** Create the Dashboard in Grafana
- Log in to your Grafana instance.
- Create a new dashboard with the desired panels.
- Click on the gear icon (⚙️) at the top of the dashboard and select "JSON Model".
- Copy the JSON content displayed.
- Paste this content into fare_dashboard.json located in the dashboards/ folder.
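
As an alternative to copying the JSON by hand, the model can be pulled from Grafana's HTTP API (`GET /api/dashboards/uid/<uid>`), whose response wraps the model in a `"dashboard"` key. The sketch below assumes a reachable Grafana instance and an API token for the real call; the helper names, the `fare` uid, and the fake payload are hypothetical, and the demo at the end runs offline.

```python
import json
import tempfile
import urllib.request
from pathlib import Path


def fetch_dashboard_payload(base_url: str, uid: str, token: str) -> dict:
    """Call GET /api/dashboards/uid/<uid> and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def save_dashboard_model(payload: dict, target: Path) -> dict:
    """Write the "dashboard" part of an API response to a JSON file."""
    model = dict(payload["dashboard"])
    # Drop the instance-specific numeric id so the file is portable
    model["id"] = None
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(model, indent=2), encoding="utf-8")
    return model


# Offline demo with a fake payload shaped like the API response
fake_payload = {
    "meta": {"slug": "fare"},
    "dashboard": {"id": 7, "uid": "fare", "title": "Fare", "panels": []},
}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dashboards" / "fare_dashboard.json"
    save_dashboard_model(fake_payload, target)
    saved = json.loads(target.read_text(encoding="utf-8"))

print(saved["title"], saved["id"])
```

Against a live instance, `fake_payload` would be replaced by the result of `fetch_dashboard_payload(...)` and `target` by `dashboards/fare_dashboard.json` in the project folder.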

**Step 2:** Configure File Provisioning in Grafana
- Locate your Grafana configuration file, typically named `grafana.ini`.
- Enable the required feature toggles:

      [feature_toggles]
      provisioning = true
      kubernetesDashboards = true ; use k8s from browser

- Configure the permitted provisioning paths:

      [paths]
      permitted_provisioning_paths = /etc/grafana/provisioning/dashboards

- Restart Grafana to apply the changes.

**Step 3:** Create the Dashboard Configuration File

In the `provisioning/dashboards/` directory, create a file named `dashboard.yaml` with the following content:

    apiVersion: 1

    providers:
      - name: 'Fare Dashboard Provider'
        folder: 'Fare Dashboards'
        type: 'file'
        options:
          path: '/etc/grafana/provisioning/dashboards/dashboards/'

- `name`: a unique name for the provider.
- `folder`: the folder in Grafana where the dashboards will appear.
- `type`: set to `'file'` to load dashboards from files.
- `options.path`: the path to the directory containing the dashboard JSON files.

**Step 4:** Verify the Dashboard in Grafana
- Restart Grafana if you haven't already. 
- Navigate to Dashboards > Manage in the Grafana sidebar.
- You should see a folder named "Fare Dashboards" containing your provisioned dashboard.
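
If the folder does not show up, a quick local check of the JSON files can rule out the most common causes before digging into the Grafana logs. The sketch below is an assumption about what is worth checking (valid JSON, a non-empty `title`, no duplicate `uid`), not Grafana's own validation; `check_dashboards` is a hypothetical helper and the demo files are made up.

```python
import json
import tempfile
from pathlib import Path


def check_dashboards(folder: Path) -> list:
    """Return a list of human-readable problems found in folder/*.json."""
    problems = []
    seen_uids = {}
    for path in sorted(folder.glob("*.json")):
        try:
            model = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(model, dict) or not model.get("title"):
            problems.append(f"{path.name}: missing dashboard title")
            continue
        uid = model.get("uid")
        if uid and uid in seen_uids:
            problems.append(f"{path.name}: uid {uid!r} already used by {seen_uids[uid]}")
        elif uid:
            seen_uids[uid] = path.name
    return problems


# Demo on a temporary folder: one valid dashboard and one broken file
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    valid = {"uid": "fare", "title": "Fare", "panels": []}
    (folder / "fare_dashboard.json").write_text(json.dumps(valid), encoding="utf-8")
    (folder / "broken.json").write_text("{not json", encoding="utf-8")
    problems = check_dashboards(folder)

print(problems)
```

In the project, the function would be pointed at the `dashboards/` folder, for example `check_dashboards(Path("dashboards"))`.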

