Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create single table synthesis property #398

Merged
merged 13 commits into from
Jul 27, 2023

Conversation

R-Palazzo
Copy link
Contributor

Resolve #390

@R-Palazzo R-Palazzo requested a review from a team as a code owner July 21, 2023 10:37
@R-Palazzo R-Palazzo removed the request for review from a team July 21, 2023 10:37
@codecov-commenter
Copy link

Codecov Report

Patch coverage: 82.05% and project coverage change: +0.09 🎉

Comparison is base (585290f) 76.82% compared to head (269d73e) 76.91%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
+ Coverage   76.82%   76.91%   +0.09%     
==========================================
  Files          84       85       +1     
  Lines        3309     3348      +39     
==========================================
+ Hits         2542     2575      +33     
- Misses        767      773       +6     
Impacted Files Coverage Δ
...rics/reports/single_table/_properties/synthesis.py 81.57% <81.57%> (ø)
...trics/reports/single_table/_properties/__init__.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few suggestions.


Args:
real_data (pandas.DataFrame):
The real data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually end args with dots. Same for args below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

"""
name = self.metric.__name__
error_message = np.nan
if len(synthetic_data) > 10000:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here? And should this not log some sort of warning?

It can also be written as one line: sample_size = len(synthetic_data) if len(synthetic_data) <= 10000 else 10000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npatki I was also not sure about this.
For the NewRowSynthesis, by default, we limit the sample size of the synthetic data to 10000 right?
Should we raise a log or a warning if the synthetic data is larger than 10000 rows?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this not to raise this warning (because the user no longer has the possibility to adjust the sample_size):

warnings.warn(f'The provided `synthetic_sample_size` of {synthetic_sample_size} '

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm honestly, I think you can just change that warning to a log and get rid of the logic here. @npatki What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you can leave it but make it onle line like @fealho suggested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes alright, done in 4669397

if progress_bar:
progress_bar.update()

result = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of the objects a list? If not, it's weird to me we are using a DataFrame when a simple dict would suffice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to keep the same abstraction as the other properties. _generate_details should generate the details table, which is a DataFrame. I agree it's a bit weird because there is only 1 row.

'Error': error_message,
}, index=[0])

if result['Error'].isna().all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can result['Error'] be a list? If not we don't need the all.

I also find it more intuitive to add the errors to result if there are any, rather than removing them if there were none.

plotly.graph_objects._figure.Figure
"""
labels = ['Exact Matches', 'Novel Rows']
values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to_numpy, you can select the first row with iloc[0].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0])

average_score = self._compute_average()
if not np.isnan(average_score):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this check, round works for nans.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

'num_matched_rows': 3,
'num_new_rows': 1,
}
# Run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

def test__generate_details(self, newrowsynthesis_mock):
"""Test the ``_generate_details`` method."""
# Setup
real_data = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need the real_data/metadata block, since they are not used for anything (just pass empty dataframes or mocks), and synthetic_data can just be a one liner, since it only checks the size of that. Just to make it more obvious that what's actually been tested is the return of the mock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

call(real_data, synthetic_data, synthetic_sample_size=4)
]

newrowsynthesis_mock.assert_has_calls(expected_calls)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to test when synthetic data has over 10,000 rows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in b00d7ca

@R-Palazzo R-Palazzo force-pushed the issue-390-synthesis-property branch from 269d73e to b00d7ca Compare July 24, 2023 11:18
@R-Palazzo R-Palazzo requested a review from fealho July 24, 2023 11:30
@R-Palazzo R-Palazzo changed the base branch from master to diagnostic-report-properties July 25, 2023 09:10
"""
name = self.metric.__name__
error_message = np.nan
if len(synthetic_data) > 10000:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data


@patch('sdmetrics.reports.single_table._properties.synthesis.'
'NewRowSynthesis.compute_breakdown')
def test__generate_details(self, newrowsynthesis_mock):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a unit test for the error case as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I was doing it hahaha. Done in a310a75

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all those comments!

def test__generate_details_error(self, newrowsynthesis_mock):
"""Test the ``_generate_details`` method when the metric raises an error."""
# Setup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete new line

@R-Palazzo R-Palazzo force-pushed the issue-390-synthesis-property branch from b55ae79 to 6f98d8a Compare July 27, 2023 08:35
@R-Palazzo R-Palazzo merged commit b57019f into diagnostic-report-properties Jul 27, 2023
45 checks passed
@R-Palazzo R-Palazzo deleted the issue-390-synthesis-property branch July 27, 2023 09:17
R-Palazzo added a commit that referenced this pull request Jul 28, 2023
* definition

* unit test

* integration test

* docstring

* modify integration test

* typo

* quotes

* address comments

* fix lint

* add test error

* sample size

* blank line

* blank line
R-Palazzo added a commit that referenced this pull request Aug 23, 2023
* definition

* unit test

* integration test

* docstring

* modify integration test

* typo

* quotes

* address comments

* fix lint

* add test error

* sample size

* blank line

* blank line
amontanez24 pushed a commit that referenced this pull request Sep 13, 2023
* definition

* unit test

* integration test

* docstring

* modify integration test

* typo

* quotes

* address comments

* fix lint

* add test error

* sample size

* blank line

* blank line
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create single table synthesis property
4 participants