Add metric for inter-row MSAS #640

@npatki

Problem Description

In this paper, we introduced a new methodology for calculating multi-sequence metrics called MSAS. We should add the MSAS-related metrics to SDMetrics so that users with sequential data can use them for evaluation.

Expected behavior

Add a metric called InterRowMSAS that performs the MSAS algorithm for inter-row differences in a sequence.

Data compatibility: 1 ID column (representing the sequence key), and 1 continuous column (datetime or numerical)

Parameters:

  • (required) real_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the real data and the second represents a continuous column of data.
  • (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the synthetic data and the second represents a continuous column of data.
  • n_rows_diff: An integer representing the number of rows to consider when taking the difference
    • (default) 1: Take the difference between each row and the row immediately before it
    • Int > 0: Take the difference between row n and row n + n_rows_diff
  • apply_log: Whether to apply a natural log before taking the difference
    • (default) False: Do not apply a log. This results in the absolute difference, useful when you expect the data to grow or shrink linearly
    • True: Apply a log before taking the difference. This is recommended when you expect the data to grow or shrink exponentially
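To illustrate why apply_log matters, here is a small sketch on a made-up sequence that doubles every row. The values are hypothetical and the code uses only the standard library, not SDMetrics itself. Plain differences grow with the data, while log differences stay constant, which is what makes them easier to compare across sequences with exponential growth.

```python
import math

# Hypothetical sequence that doubles at every row (exponential growth).
values = [100, 200, 400, 800]

# apply_log=False: absolute differences between consecutive rows.
plain_diffs = [later - earlier for earlier, later in zip(values, values[1:])]

# apply_log=True: take the natural log first, then the difference.
log_diffs = [
    math.log(later) - math.log(earlier)
    for earlier, later in zip(values, values[1:])
]

print(plain_diffs)  # [100, 200, 400]
print(log_diffs)    # every entry is ln(2), roughly 0.693
```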

Output: A score in range [0, 1] -- 0 being the worst and 1 being the best

from sdmetrics.column_pairs import InterRowMSAS

score = InterRowMSAS.compute(
  real_data=(real_table['patient_id'], real_table['heart_rate']),
  synthetic_data=(synthetic_table['patient_id'], synthetic_table['heart_rate']),
  n_rows_diff=100,
  apply_log=False
)

How does it work? The sequence key determines which continuous values belong to which sequence. This metric computes a statistic for all sequences in the real and synthetic data, and then compares those distributions.

  1. For each sequence in the real data, calculate the difference between row r and row r + x, where x is n_rows_diff (taking the natural log of the values first if apply_log is True). Then average these differences within each sequence. The per-sequence averages form a distribution D_r
  2. Do the same for the synthetic data to form a second distribution D_s
  3. Apply the KSComplement metric to compare the distributions D_r and D_s, and return this score.
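The steps above could be sketched roughly as follows. This is a dependency-free illustration, not the SDMetrics implementation: the names sequence_mean_diffs, ks_complement and inter_row_msas_sketch are hypothetical, the direction of the difference (later row minus earlier row) and the handling of sequences that are too short are assumptions, and KSComplement is approximated here as 1 minus the two-sample Kolmogorov-Smirnov statistic.

```python
import math
from bisect import bisect_right
from collections import defaultdict


def sequence_mean_diffs(keys, values, n_rows_diff=1, apply_log=False):
    """Return one average inter-row difference per sequence (hypothetical helper)."""
    sequences = defaultdict(list)
    for key, value in zip(keys, values):
        # Optionally move to log space before differencing.
        sequences[key].append(math.log(value) if apply_log else value)

    means = []
    for rows in sequences.values():
        # Assumption: the difference is (row r + n_rows_diff) minus (row r).
        diffs = [rows[i + n_rows_diff] - rows[i]
                 for i in range(len(rows) - n_rows_diff)]
        # Assumption: sequences too short to yield a difference are skipped.
        if diffs:
            means.append(sum(diffs) / len(diffs))
    return means


def ks_complement(sample_a, sample_b):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The largest gap between the two empirical CDFs occurs at a data point.
    statistic = max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
    return 1 - statistic


def inter_row_msas_sketch(real_data, synthetic_data, n_rows_diff=1, apply_log=False):
    """Compare per-sequence average differences of real and synthetic data."""
    d_r = sequence_mean_diffs(*real_data, n_rows_diff, apply_log)
    d_s = sequence_mean_diffs(*synthetic_data, n_rows_diff, apply_log)
    return ks_complement(d_r, d_s)


# Toy data: (sequence keys, continuous values), two sequences each.
real = (['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 10, 20, 30])
synthetic = (['x', 'x', 'x', 'y', 'y', 'y'], [1, 101, 201, 10, 1010, 2010])

same_score = inter_row_msas_sketch(real, real)
different_score = inter_row_msas_sketch(real, synthetic)
```

In this toy data the real sequences have average differences of 1 and 10, while the synthetic ones have 100 and 1000, so the two distributions do not overlap at all. Because the helpers only iterate over their inputs, the same sketch should also accept pandas.Series in place of the lists.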
