-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create single table synthesis property #398
Create single table synthesis property #398
Conversation
Codecov ReportPatch coverage:
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. Additional details and impacted files@@ Coverage Diff @@
## master #398 +/- ##
==========================================
+ Coverage 76.82% 76.91% +0.09%
==========================================
Files 84 85 +1
Lines 3309 3348 +39
==========================================
+ Hits 2542 2575 +33
- Misses 767 773 +6
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few suggestions.
|
||
Args: | ||
real_data (pandas.DataFrame): | ||
The real data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually end args with dots. Same for args below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
""" | ||
name = self.metric.__name__ | ||
error_message = np.nan | ||
if len(synthetic_data) > 10000: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this here? And should this not log some sort of warning?
It can also be written as one line: sample_size = len(synthetic_data) if len(synthetic_data) <= 10000 else 10000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@npatki I was also not sure about this.
For the NewRowSynthesis
, by default, we limit the sample size
of the synthetic data to 10000 right?
Should we raise a log or a warning if the synthetic data is larger than 10000 rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this not to raise this warning (because the user no longer has the possibility to adjust the sample_size
):
warnings.warn(f'The provided `synthetic_sample_size` of {synthetic_sample_size} ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm honestly, I think you can just change that warning to a log and get rid of the logic here. @npatki What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you can leave it but make it onle line like @fealho suggested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes alright, done in 4669397
if progress_bar: | ||
progress_bar.update() | ||
|
||
result = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of the objects a list? If not, it's weird to me we are using a DataFrame when a simple dict would suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to keep the same abstraction as the other properties. _generate_details
should generate the details table, which is a DataFrame. I agree it's a bit weird because there is only 1 row.
'Error': error_message, | ||
}, index=[0]) | ||
|
||
if result['Error'].isna().all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can result['Error'] be a list? If not we don't need the all
.
I also find it more intuitive to add the errors to result
if there are any, rather than removing them if there were none.
plotly.graph_objects._figure.Figure | ||
""" | ||
labels = ['Exact Matches', 'Novel Rows'] | ||
values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to_numpy
, you can select the first row with iloc[0]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0]) | ||
|
||
average_score = self._compute_average() | ||
if not np.isnan(average_score): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this check, round
works for nans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
'num_matched_rows': 3, | ||
'num_new_rows': 1, | ||
} | ||
# Run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
def test__generate_details(self, newrowsynthesis_mock): | ||
"""Test the ``_generate_details`` method.""" | ||
# Setup | ||
real_data = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the real_data/metadata block, since they are not used for anything (just pass empty dataframes or mocks), and synthetic_data can just be a one liner, since it only checks the size of that. Just to make it more obvious that what's actually been tested is the return of the mock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
call(real_data, synthetic_data, synthetic_sample_size=4) | ||
] | ||
|
||
newrowsynthesis_mock.assert_has_calls(expected_calls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need to test when synthetic data has over 10,000 rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in b00d7ca
269d73e
to
b00d7ca
Compare
""" | ||
name = self.metric.__name__ | ||
error_message = np.nan | ||
if len(synthetic_data) > 10000: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data
|
||
@patch('sdmetrics.reports.single_table._properties.synthesis.' | ||
'NewRowSynthesis.compute_breakdown') | ||
def test__generate_details(self, newrowsynthesis_mock): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a unit test for the error case as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I was doing it hahaha. Done in a310a75
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing all those comments!
def test__generate_details_error(self, newrowsynthesis_mock): | ||
"""Test the ``_generate_details`` method when the metric raises an error.""" | ||
# Setup | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete new line
b55ae79
to
6f98d8a
Compare
* definition * unit test * integration test * docstring * modify integration test * typo * quotes * address comments * fix lint * add test error * sample size * blank line * blank line
* definition * unit test * integration test * docstring * modify integration test * typo * quotes * address comments * fix lint * add test error * sample size * blank line * blank line
* definition * unit test * integration test * docstring * modify integration test * typo * quotes * address comments * fix lint * add test error * sample size * blank line * blank line
Resolve #390