Labels
- data:sequential: Related to timeseries datasets
- feature request: Request for a new feature
Problem Description
In this paper, we introduced a new methodology for calculating multi-sequence metrics called MSAS. We should add the MSAS-related metrics to SDMetrics so that users with sequential data can use them for evaluation.
Expected behavior
Add a metric called InterRowMSAS that performs the MSAS algorithm for inter-row differences in a sequence.
Data compatibility: 1 ID column (representing the sequence key), and 1 continuous column (datetime or numerical)
Parameters:
- (required) real_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the real data and the second represents a continuous column of data.
- (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the synthetic data and the second represents a continuous column of data.
- n_rows_diff: An integer representing the number of rows to consider when taking the difference
  - (default) 1: Take the difference between a row and the one right before it
  - Int > 0: Take the difference between row n and row n + n_rows_diff
- apply_log: Whether to apply a natural log before taking the difference
  - (default) False: Do not apply a log. This results in the absolute difference, which is useful when you expect the data to grow or shrink linearly
  - True: Apply a log before taking the difference. This is recommended when you expect the data to grow or shrink exponentially
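To make the effect of n_rows_diff and apply_log concrete, here is a small sketch on a single made-up sequence. It uses plain pandas and NumPy rather than the proposed metric, and the values are hypothetical; it only illustrates the kind of difference each setting would produce.

```python
import numpy as np
import pandas as pd

# Hypothetical values for one sequence that doubles at every step.
values = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])

# n_rows_diff=1, apply_log=False: plain difference between consecutive rows.
linear_diff = values.diff(periods=1).dropna()

# n_rows_diff=2, apply_log=False: difference between row n and row n + 2.
lag2_diff = values.diff(periods=2).dropna()

# n_rows_diff=1, apply_log=True: take the natural log first, then the difference.
log_diff = np.log(values).diff(periods=1).dropna()

print(linear_diff.tolist())  # [1.0, 2.0, 4.0, 8.0]
print(lag2_diff.tolist())    # [3.0, 6.0, 12.0]
print(np.allclose(log_diff, np.log(2)))  # True
```

Without the log, the differences keep growing with the data. With the log, every step of this exponentially growing sequence yields the same difference, which is why apply_log=True is suggested for that kind of data.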
Output: A score in range [0, 1] -- 0 being the worst and 1 being the best
from sdmetrics.column_pairs import InterRowMSAS

score = InterRowMSAS.compute(
    real_data=(real_table['patient_id'], real_table['heart_rate']),
    synthetic_data=(synthetic_table['patient_id'], synthetic_table['heart_rate']),
    n_rows_diff=100,
    apply_log=False
)
How does it work? The sequence key determines which continuous values belong to which sequence. This metric computes a statistic for all sequences in the real and synthetic data, and then compares those distributions.
- Calculate the difference between row r and row r+x for each row in the real data. Then take the average over each sequence to form a distribution D_r
- Do the same for the synthetic data to form a new distribution D_s
- Now apply the KSComplement metric to compare the similarity of the distributions (D_r, D_s). Return this score.
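The steps above could be sketched roughly as follows. This is a minimal illustration, not the SDMetrics implementation: the function name inter_row_msas_sketch is hypothetical, x is assumed to be n_rows_diff, the values are assumed to be numerical (no datetime handling), and the KSComplement step is approximated as 1 minus the two-sample Kolmogorov-Smirnov statistic from SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def inter_row_msas_sketch(real_data, synthetic_data, n_rows_diff=1, apply_log=False):
    """Illustrative sketch only; name and details are assumptions."""

    def mean_diff_per_sequence(keys, values):
        keys = pd.Series(np.asarray(keys))
        values = pd.Series(np.asarray(values, dtype=float))
        if apply_log:
            values = np.log(values)
        # Difference between row r and row r + n_rows_diff, within each sequence.
        diffs = values.groupby(keys).diff(periods=n_rows_diff)
        # Average per sequence; sequences too short to produce a difference are dropped.
        return diffs.groupby(keys).mean().dropna()

    d_r = mean_diff_per_sequence(*real_data)
    d_s = mean_diff_per_sequence(*synthetic_data)
    # Assumed stand-in for KSComplement: 1 - KS statistic, so 1 is best and 0 is worst.
    return 1 - ks_2samp(d_r, d_s).statistic


# Hypothetical toy data: two sequences, 'a' and 'b', with three rows each.
keys = ['a', 'a', 'a', 'b', 'b', 'b']
real = (keys, [1, 2, 3, 10, 12, 14])          # per-sequence mean differences: 1 and 2
similar = (keys, [1, 2, 3, 10, 12, 14])       # identical to the real data
different = (keys, [1, 11, 21, 0, 20, 40])    # per-sequence mean differences: 10 and 20

score_same = inter_row_msas_sketch(real, similar)
score_diff = inter_row_msas_sketch(real, different)
print(score_same)  # 1.0
print(score_diff)  # 0.0
```

Grouping by the sequence key before taking the difference keeps rows from one sequence from being subtracted from rows of another, which is the role the sequence key plays in the description above.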