# Retrospective analysis of data leakage in a price prediction pipeline

This example revolves around an [ML pipeline for predicting the price of taxi rides](https://github.com/schelterlabs/arguseyes-example/blob/main/pipelines/mlflow-regression-nyctaxifare.py), based on a sample from the [New York City Taxi Fare Prediction](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data) dataset. The pipeline computes additional features, splits the data into a train set and a test set based on date information, and learns a **regression model to predict the fare of a ride**, based on attributes such as the **pickup time**, **dropoff time**, **trip_distance** and the **zip codes** of the pickup and dropoff locations.

When screening this pipeline on GitHub with this [configuration](https://github.com/schelterlabs/arguseyes-example/blob/main/mlflow-regression-nyctaxifare-dataleakage.yaml), ArgusEyes detects a **data leakage problem** in the pipeline. The screenshot shows the result of the [screening during the build triggered by a GitHub action](https://github.com/schelterlabs/arguseyes-example/actions/runs/3523396218/jobs/5907507086): there are **177 input tuples which leaked from the train set to the test set**.

In the following, we show how to **leverage ArgusEyes to retrospectively analyze the pipeline run** (based on metadata and captured data artifacts), and **figure out the root cause of the data leakage issue**.

![data-leakage-screening-via-a-github-action](github-action-dataleakage-screening.png)

### Load the metadata and artifacts from the original run of the pipeline

ArgusEyes needs the ID of the MLflow run in which it stored the metadata and artifacts. (Note that we use a local run here for demo purposes.)

In [14]:
from arguseyes.retrospective import PipelineRun, DataLeakageRetrospective

In [15]:
run_id = 'bc07e7b4c8c54ee694078030860649b2'

run = PipelineRun(run_id=run_id)

### Interactively explore the dataflow plan and data of the pipeline run

We can view a dataflow plan of the pipeline, which highlights the input datasets, as well as the features and labels for the train and test data computed by the pipeline. We can interactively explore the pipeline data. Clicking on the pink data vertices provides us with details about the corresponding data.

In [18]:
run.explore_data()

*[Interactive widget: Pipeline Data Explorer, showing the dataflow plan of the pipeline run; rendered only in a live notebook]*

## Retrospective analysis of the data leakage issue

ArgusEyes allows us to instantiate a special `DataLeakageRetrospective`, which helps us analyze data leakage problems in a pipeline run.

In [19]:
retrospective = DataLeakageRetrospective(run)

### Materialize leaked tuples

We can compute the tuples that were leaked between the train set and the test set.

In [20]:
leaked_data = retrospective.compute_leaked_tuples()
leaked_data

| Unnamed: 0 | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | fare_amount | pickup_zip | dropoff_zip |
|---|---|---|---|---|---|---|
| 37 | 2016-02-15 01:45:11 | 2016-02-15 01:48:40 | 0.60 | 4.5 | 10153 | 10065 |
| 57 | 2016-02-15 08:51:02 | 2016-02-15 09:06:27 | 5.50 | 17.0 | 11371 | 11379 |
| 73 | 2016-02-15 23:03:19 | 2016-02-15 23:38:15 | 11.40 | 35.0 | 11371 | 10011 |
| 77 | 2016-02-15 16:41:54 | 2016-02-15 17:38:20 | 18.43 | 52.0 | 11422 | 10011 |
| 144 | 2016-02-15 12:44:51 | 2016-02-15 13:07:39 | 3.48 | 16.5 | 10011 | 10022 |
| ... | ... | ... | ... | ... | ... | ... |
| 9016 | 2016-02-15 21:02:47 | 2016-02-15 21:05:27 | 1.06 | 5.0 | 10035 | 10029 |
| 9045 | 2016-02-15 18:21:15 | 2016-02-15 18:32:51 | 1.51 | 9.0 | 10119 | 10103 |
| 9084 | 2016-02-15 06:46:32 | 2016-02-15 06:48:29 | 0.47 | 3.5 | 10044 | 10021 |
| 9113 | 2016-02-15 11:21:50 | 2016-02-15 11:27:38 | 0.84 | 5.5 | 10119 | 10199 |
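
To make the notion of a "leaked tuple" concrete, here is a minimal sketch in plain Python of the underlying idea: a record leaks if it ends up in both the train set and the test set. This is not how ArgusEyes computes the result internally (that is not shown in this notebook); the function `find_leaked` and the toy rows are illustrative assumptions, with only the second train row borrowed from the output above.

```python
# Toy rows: (pickup_datetime, dropoff_datetime, trip_distance, fare_amount).
# Hypothetical data for illustration; only the 01:45:11 ride appears in the
# real output above.
train_rows = [
    ("2016-02-13 09:00:00", "2016-02-13 09:10:00", 1.2, 7.0),
    ("2016-02-15 01:45:11", "2016-02-15 01:48:40", 0.6, 4.5),
]
test_rows = [
    ("2016-02-15 01:45:11", "2016-02-15 01:48:40", 0.6, 4.5),
    ("2016-02-16 12:00:00", "2016-02-16 12:20:00", 3.4, 15.0),
]


def find_leaked(train, test):
    """Return the rows that occur in both splits, in train order."""
    test_set = set(test)  # tuples are hashable, so membership checks are O(1)
    return [row for row in train if row in test_set]


leaked = find_leaked(train_rows, test_rows)
print(leaked)
```

On real data, exact-match comparison of this kind only finds identical records; whether that matches the tool's definition of leakage is an assumption of this sketch.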


### Deep dive into leaked tuples

In the following, we explore the leaked tuples in detail to find patterns that help us determine the root cause of the leakage.

In [21]:
leaked_data.trip_distance.describe()

count    177.000000
mean       3.016497
std        3.913846
min        0.300000
25%        1.000000
50%        1.600000
75%        2.870000
max       19.010000
Name: trip_distance, dtype: float64

In [22]:
leaked_data.tpep_pickup_datetime.dt.date.value_counts()

2016-02-15    175
2016-02-14      2
Name: tpep_pickup_datetime, dtype: int64
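
When pickup dates are spread over more than one day, it can help to count rides by (pickup date, dropoff date) pair, which shows at a glance which rides span midnight. The following is a small sketch in plain Python under stated assumptions: the rows are hypothetical stand-ins for the leaked tuples, and `date_pair_counts` is an illustrative helper, not part of ArgusEyes or pandas.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical (pickup, dropoff) timestamps standing in for leaked tuples.
rows = [
    ("2016-02-14 23:50:00", "2016-02-15 00:05:00"),  # spans midnight
    ("2016-02-15 01:45:11", "2016-02-15 01:48:40"),
    ("2016-02-15 08:51:02", "2016-02-15 09:06:27"),
]


def date_pair_counts(rows):
    """Count rides per (pickup date, dropoff date) pair."""
    counts = Counter()
    for pickup, dropoff in rows:
        pickup_date = datetime.strptime(pickup, FMT).date().isoformat()
        dropoff_date = datetime.strptime(dropoff, FMT).date().isoformat()
        counts[(pickup_date, dropoff_date)] += 1
    return counts


pair_counts = date_pair_counts(rows)
for (pickup_date, dropoff_date), n in sorted(pair_counts.items()):
    print(pickup_date, "->", dropoff_date, n)
```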

### Identifying the root cause of the leakage

All 177 leaked tuples share the same dropoff date, 2016-02-15, even though two of them were picked up on 2016-02-14. This is a strong hint that the date-based train/test split does not separate the rides on this date correctly. Fixing the split will remove the data leakage issue in the pipeline.

In [23]:
leaked_data.tpep_dropoff_datetime.dt.date.value_counts()

2016-02-15    177
Name: tpep_dropoff_datetime, dtype: int64
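
One hypothetical way such a pattern can arise is a date-based split whose two sides both include the boundary day. The sketch below is an assumption for illustration only, not the code of the actual pipeline: the cutoff date, the toy rides, and the functions `split_overlapping` and `split_disjoint` are all made up. It shows how an inclusive comparison on both sides duplicates the rides dropped off on the cutoff day, and how a half-open split avoids that.

```python
from datetime import date, datetime

CUTOFF = date(2016, 2, 15)  # hypothetical split date

# Hypothetical rides as (pickup, dropoff) pairs.
rides = [
    (datetime(2016, 2, 13, 9, 0), datetime(2016, 2, 13, 9, 10)),
    (datetime(2016, 2, 14, 23, 50), datetime(2016, 2, 15, 0, 5)),
    (datetime(2016, 2, 15, 8, 51, 2), datetime(2016, 2, 15, 9, 6, 27)),
    (datetime(2016, 2, 16, 12, 0), datetime(2016, 2, 16, 12, 20)),
]


def split_overlapping(rows, cutoff):
    """Buggy split: both sides include rides dropped off on the cutoff day."""
    train = [r for r in rows if r[1].date() <= cutoff]
    test = [r for r in rows if r[1].date() >= cutoff]
    return train, test


def split_disjoint(rows, cutoff):
    """Half-open split: the cutoff day belongs to the test side only."""
    train = [r for r in rows if r[1].date() < cutoff]
    test = [r for r in rows if r[1].date() >= cutoff]
    return train, test


def overlap(train, test):
    """Rides that appear in both splits."""
    return [r for r in train if r in test]


buggy_train, buggy_test = split_overlapping(rides, CUTOFF)
fixed_train, fixed_test = split_disjoint(rides, CUTOFF)
buggy_leak = overlap(buggy_train, buggy_test)
fixed_leak = overlap(fixed_train, fixed_test)
print(len(buggy_leak), len(fixed_leak))
```

In this toy setup, every ride leaked by the buggy split has its dropoff on the cutoff day, while its pickup may fall on the day before, which mirrors the pattern in the value counts above. Whether the real pipeline fails in exactly this way has to be confirmed by inspecting its split logic.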