Add metric for sequence length similarity #643
Merged: fealho merged 5 commits into feature-branch-timeseries-metrics from issue-638-sequence-similarity on Nov 5, 2024
Commits
d5a3725 ADd metric (fealho)
b9fab0e Merge branch 'main' into issue-638-sequence-similarity (fealho)
9524b00 Fix ordering of the metric (fealho)
ae46e7e Add test case (fealho)
036de6a Merge branch 'feature-branch-timeseries-metrics' into issue-638-seque… (fealho)
@@ -0,0 +1,53 @@
"""SequenceLengthSimilarity module."""

import pandas as pd

from sdmetrics.goal import Goal
from sdmetrics.single_column.statistical.kscomplement import KSComplement


class SequenceLengthSimilarity:
    """Sequence Length Similarity metric.

    Attributes:
        name (str):
            Name to use when reports about this metric are printed.
        goal (sdmetrics.goal.Goal):
            The goal of this metric.
        min_value (Union[float, tuple[float]]):
            Minimum value or values that this metric can take.
        max_value (Union[float, tuple[float]]):
            Maximum value or values that this metric can take.
    """

    name = 'Sequence Length Similarity'
    goal = Goal.MAXIMIZE
    min_value = 0.0
    max_value = 1.0

    @staticmethod
    def compute(real_data: pd.Series, synthetic_data: pd.Series) -> float:
        """Compute this metric.

        The length of a sequence is determined by the number of times the same sequence key occurs.
        For example, if id_09231 appears 150 times in the sequence key, then the sequence is of
        length 150. This metric compares the lengths of all sequence keys in the
        real data vs. the synthetic data.

        It works as follows:
            - Calculate the length of each sequence in the real data
            - Calculate the length of each sequence in the synthetic data
            - Apply the KSComplement metric to compare the two length distributions
            - Return this score

        Args:
            real_data (pandas.Series):
                The sequence key values from the real dataset.
            synthetic_data (pandas.Series):
                The sequence key values from the synthetic dataset.

        Returns:
            float:
                The score.
        """
        return KSComplement.compute(real_data.value_counts(), synthetic_data.value_counts())
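
The steps listed in the `compute` docstring can be illustrated without pandas or sdmetrics. The sketch below is not the code from this PR: `ks_statistic` and `sequence_length_similarity_sketch` are hypothetical names, and it assumes that KSComplement returns one minus the two-sample Kolmogorov-Smirnov statistic, which the diff itself does not show.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def sequence_length_similarity_sketch(real_keys, synthetic_keys):
    """Hypothetical helper (not the sdmetrics API): 1 - KS statistic of sequence lengths."""
    # Length of a sequence = number of rows that share the same sequence key.
    real_lengths = list(Counter(real_keys).values())
    synthetic_lengths = list(Counter(synthetic_keys).values())
    # Assumption: KSComplement is defined as 1 minus the KS statistic.
    return 1 - ks_statistic(real_lengths, synthetic_lengths)
```

Counting keys plays the role of `value_counts()` here. The sketch does not handle empty inputs, and its floating-point result may differ from the library's in the last digit.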
@@ -0,0 +1,41 @@
import pandas as pd

from sdmetrics.timeseries.sequence_length_similarity import SequenceLengthSimilarity


class TestSequenceLengthSimilarity:
    def test_compute(self):
        """Test it runs."""
        # Setup
        real_data = pd.Series(['id1', 'id2', 'id2', 'id3'])
        synthetic_data = pd.Series(['id4', 'id5', 'id6'])

        # Run
        score = SequenceLengthSimilarity.compute(real_data, synthetic_data)

        # Assert
        assert score == 0.6666666666666667

    def test_compute_one(self):
        """Test it returns 1 when real and synthetic data have the same distribution."""
        # Setup
        real_data = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'])
        synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])

        # Run
        score = SequenceLengthSimilarity.compute(real_data, synthetic_data)

        # Assert
        assert score == 1

    def test_compute_low_score(self):
        """Test it for distinct distributions."""
        # Setup
        real_data = pd.Series([f'id{i}' for i in range(100)])
        synthetic_data = pd.Series(['id100'] * 100)

        # Run
        score = SequenceLengthSimilarity.compute(real_data, synthetic_data)

        # Assert
        assert score == 0
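
The expected values in these tests can be checked by hand, assuming KSComplement returns one minus the two-sample Kolmogorov-Smirnov statistic D (an assumption; its implementation is not part of this diff). In `test_compute` the real sequence lengths are {1, 2, 1} and the synthetic lengths are {1, 1, 1}, so with F denoting the empirical CDF of the lengths:

```latex
\begin{aligned}
F_{\text{real}}(1) &= \tfrac{2}{3}, & F_{\text{real}}(2) &= 1, \\
F_{\text{syn}}(1) &= 1, & F_{\text{syn}}(2) &= 1, \\
D &= \max_x \bigl| F_{\text{real}}(x) - F_{\text{syn}}(x) \bigr| = \tfrac{1}{3}, \\
\text{score} &= 1 - D = \tfrac{2}{3} \approx 0.6667 .
\end{aligned}
```

The same reasoning covers the other two cases. In `test_compute_one` both sides have lengths {1, 2, 3}, so D = 0 and the score is 1. In `test_compute_low_score` the real data holds 100 sequences of length 1 while the synthetic data holds a single sequence of length 100, so D = 1 and the score is 0.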