Create single table synthesis property #398

R-Palazzo · 2023-07-21T10:37:29Z

Resolve #390

codecov-commenter · 2023-07-21T10:40:26Z

Codecov Report

Patch coverage: 82.05% and project coverage change: +0.09 🎉

Comparison is base (585290f) 76.82% compared to head (269d73e) 76.91%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
+ Coverage   76.82%   76.91%   +0.09%     
==========================================
  Files          84       85       +1     
  Lines        3309     3348      +39     
==========================================
+ Hits         2542     2575      +33     
- Misses        767      773       +6

Impacted Files	Coverage Δ
...rics/reports/single_table/_properties/synthesis.py	`81.57% <81.57%> (ø)`
...trics/reports/single_table/_properties/__init__.py	`100.00% <100.00%> (ø)`

... and 1 file with indirect coverage changes

fealho

Looking good, just a few suggestions.

fealho · 2023-07-23T00:14:17Z

sdmetrics/reports/single_table/_properties/synthesis.py

+
+        Args:
+            real_data (pandas.DataFrame):
+                The real data


We usually end args with dots. Same for args below.

Yes, done in b00d7ca

fealho · 2023-07-23T00:15:41Z

sdmetrics/reports/single_table/_properties/synthesis.py

+        """
+        name = self.metric.__name__
+        error_message = np.nan
+        if len(synthetic_data) > 10000:


Why is this here? And should this not log some sort of warning?

It can also be written as one line: sample_size = len(synthetic_data) if len(synthetic_data) <= 10000 else 10000

@npatki I was also not sure about this.
For the NewRowSynthesis, by default, we limit the sample size of the synthetic data to 10000 right?
Should we raise a log or a warning if the synthetic data is larger than 10000 rows?

I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data

I did this to avoid raising this warning (because the user can no longer adjust the sample_size):

SDMetrics/sdmetrics/single_table/new_row_synthesis.py

Line 68 in ddec4f7

warnings.warn(f'The provided `synthetic_sample_size` of {synthetic_sample_size} '

Hmm honestly, I think you can just change that warning to a log and get rid of the logic here. @npatki What do you think?

Or you can leave it but make it one line like @fealho suggested

Yes alright, done in 4669397
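
For readers following this thread, the capping logic under discussion can be sketched as below. This is only an illustration, not the code merged in the PR: `cap_sample_size` and `MAX_SAMPLE_SIZE` are hypothetical names, and any sized object stands in for the synthetic data.

```python
MAX_SAMPLE_SIZE = 10000  # the limit discussed in this thread


def cap_sample_size(synthetic_data, limit=MAX_SAMPLE_SIZE):
    """Return the number of rows to sample, never more than ``limit``."""
    # min() expresses the same rule as the conditional one-liner above.
    return min(len(synthetic_data), limit)


small_sample_size = cap_sample_size(range(4))
large_sample_size = cap_sample_size(range(10001))
```

Using `min` keeps the cap on one line and avoids calling `len` twice.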

fealho · 2023-07-23T00:27:53Z

sdmetrics/reports/single_table/_properties/synthesis.py

+            if progress_bar:
+                progress_bar.update()
+
+        result = pd.DataFrame({


Are any of the objects a list? If not, it's weird to me we are using a DataFrame when a simple dict would suffice.

It's to keep the same abstraction as the other properties. _generate_details should generate the details table, which is a DataFrame. I agree it's a bit weird because there is only 1 row.

fealho · 2023-07-23T00:30:17Z

sdmetrics/reports/single_table/_properties/synthesis.py

+            'Error': error_message,
+        }, index=[0])
+
+        if result['Error'].isna().all():


Can result['Error'] be a list? If not we don't need the all.

I also find it more intuitive to add the errors to result if there are any, rather than removing them if there were none.
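
A minimal sketch of the suggestion above, assuming a one-row details table: the `Error` column is added only when an error occurred, instead of being created and dropped afterwards. `build_details` is a hypothetical helper, and the `Metric` and `Score` columns are assumptions; only `Num Matched Rows`, `Num New Rows` and `Error` appear in the diff.

```python
import pandas as pd


def build_details(score, num_matched_rows, num_new_rows, error_message=None):
    """Build a one-row details table, adding 'Error' only when needed."""
    details = {
        'Metric': 'NewRowSynthesis',  # assumed column
        'Score': score,  # assumed column
        'Num Matched Rows': num_matched_rows,
        'Num New Rows': num_new_rows,
    }
    if error_message is not None:
        details['Error'] = error_message

    # index=[0] is required because every value in the dict is a scalar.
    return pd.DataFrame(details, index=[0])


clean = build_details(0.25, 3, 1)
failed = build_details(float('nan'), float('nan'), float('nan'), 'ValueError: boom')
```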

fealho · 2023-07-23T00:34:52Z

sdmetrics/reports/single_table/_properties/synthesis.py

+            plotly.graph_objects._figure.Figure
+        """
+        labels = ['Exact Matches', 'Novel Rows']
+        values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0])


I don't think you need to_numpy, you can select the first row with iloc[0].

Yes, done in b00d7ca
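
As an illustration of the `iloc[0]` suggestion, with a stand-in details table (the values are made up, taken from the test fixture shown later in this review):

```python
import pandas as pd

# Stand-in for self._details; values are illustrative only.
details = pd.DataFrame({'Num Matched Rows': 3, 'Num New Rows': 1}, index=[0])

# Select the first row by position, then convert it to a plain Python list.
values = details[['Num Matched Rows', 'Num New Rows']].iloc[0].tolist()
```

Here `tolist()` is used instead of `list(...)` so that the elements are plain Python integers rather than NumPy scalars; that choice is an assumption of this sketch, not something stated in the review.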

fealho · 2023-07-23T00:37:36Z

sdmetrics/reports/single_table/_properties/synthesis.py

+        values = list(self._details[['Num Matched Rows', 'Num New Rows']].to_numpy()[0])
+
+        average_score = self._compute_average()
+        if not np.isnan(average_score):


I don't think you need this check, round works for nans.

Yes, done in b00d7ca
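
A quick check of the claim above, using only the standard library. Note the assumption that `round` is called with a number of digits: with `ndigits`, NaN rounds to itself, whereas `round(nan)` without `ndigits` raises a `ValueError` because NaN cannot be converted to an integer.

```python
import math

average_score = math.nan

# With ndigits given, NaN is returned unchanged instead of raising.
rounded = round(average_score, 2)

# Without ndigits, Python tries to return an int and fails for NaN.
try:
    round(average_score)
    raised = False
except ValueError:
    raised = True
```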

fealho · 2023-07-23T00:39:40Z

tests/unit/reports/single_table/_properties/test_synthesis.py

+            'num_matched_rows': 3,
+            'num_new_rows': 1,
+        }
+        # Run


Yes, done in b00d7ca

fealho · 2023-07-23T00:56:21Z

tests/unit/reports/single_table/_properties/test_synthesis.py

+    def test__generate_details(self, newrowsynthesis_mock):
+        """Test the ``_generate_details`` method."""
+        # Setup
+        real_data = pd.DataFrame({


I don't think you need the real_data/metadata block, since they are not used for anything (just pass empty dataframes or mocks), and synthetic_data can just be a one-liner, since the method only checks its size. That makes it more obvious that what's actually being tested is the return value of the mock.

Yes, done in b00d7ca

fealho · 2023-07-23T00:57:52Z

tests/unit/reports/single_table/_properties/test_synthesis.py

+            call(real_data, synthetic_data, synthetic_sample_size=4)
+        ]
+
+        newrowsynthesis_mock.assert_has_calls(expected_calls)


You also need to test when synthetic data has over 10,000 rows.

Yes, done in b00d7ca
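
A sketch of what such a test could look like, covering both a small table and one with more than 10,000 rows. `generate_details` is a simplified, hypothetical stand-in for the property's method, and plain lists stand in for the data; only the `synthetic_sample_size` keyword comes from the diff.

```python
from unittest.mock import Mock, call


def generate_details(real_data, synthetic_data, metric):
    """Hypothetical stand-in: cap the sample size and call the metric."""
    sample_size = min(len(synthetic_data), 10000)
    return metric(real_data, synthetic_data, synthetic_sample_size=sample_size)


# Setup
metric_mock = Mock(return_value={'score': 0.25})
real_data = Mock()
small_synthetic = list(range(4))
large_synthetic = list(range(10001))

# Run
generate_details(real_data, small_synthetic, metric_mock)
generate_details(real_data, large_synthetic, metric_mock)

# Assert
expected_calls = [
    call(real_data, small_synthetic, synthetic_sample_size=4),
    call(real_data, large_synthetic, synthetic_sample_size=10000),
]
metric_mock.assert_has_calls(expected_calls)
```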

amontanez24 · 2023-07-25T17:28:37Z

sdmetrics/reports/single_table/_properties/synthesis.py

+        """
+        name = self.metric.__name__
+        error_message = np.nan
+        if len(synthetic_data) > 10000:


I don't think we need this here. Just pass it in as 10000 to the metric. If the data has more than 10000 rows it will subsample, otherwise it will use all the data

amontanez24 · 2023-07-25T17:29:57Z

tests/unit/reports/single_table/_properties/test_synthesis.py

+
+    @patch('sdmetrics.reports.single_table._properties.synthesis.'
+           'NewRowSynthesis.compute_breakdown')
+    def test__generate_details(self, newrowsynthesis_mock):


Can we add a unit test for the error case as well?

Yes I was doing it hahaha. Done in a310a75
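
For illustration, an error-case test might follow the pattern below. This is a sketch under stated assumptions: `generate_details` is a hypothetical stand-in that catches the metric's exception and records it, and the error message format is made up rather than taken from the PR.

```python
import math
from unittest.mock import Mock


def generate_details(real_data, synthetic_data, metric):
    """Hypothetical stand-in: record the error instead of propagating it."""
    try:
        breakdown = metric(real_data, synthetic_data)
        return {'Score': breakdown['score'], 'Error': None}
    except Exception as error:
        # The message format here is an assumption of this sketch.
        return {'Score': math.nan, 'Error': f'{type(error).__name__}: {error}'}


# Setup: the mocked metric raises when called.
metric_mock = Mock(side_effect=ValueError('Mock metric error'))

# Run
result = generate_details(Mock(), Mock(), metric_mock)
```

The `side_effect` argument makes the mock raise on every call, so the test exercises only the error-handling branch.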

amontanez24

LGTM!

fealho

Thanks for addressing all those comments!

fealho · 2023-07-27T04:44:47Z

tests/unit/reports/single_table/_properties/test_synthesis.py

+    def test__generate_details_error(self, newrowsynthesis_mock):
+        """Test the ``_generate_details`` method when the metric raises an error."""
+        # Setup
+


Delete new line

* definition * unit test * integration test * docstring * modify integration test * typo * quotes * address comments * fix lint * add test error * sample size * blank line * blank line

R-Palazzo requested review from amontanez24 and fealho July 21, 2023 10:37

R-Palazzo requested a review from a team as a code owner July 21, 2023 10:37

R-Palazzo removed the request for review from a team July 21, 2023 10:37

fealho requested changes Jul 23, 2023

R-Palazzo force-pushed the issue-390-synthesis-property branch from 269d73e to b00d7ca Compare July 24, 2023 11:18

R-Palazzo requested a review from fealho July 24, 2023 11:30

R-Palazzo changed the base branch from master to diagnostic-report-properties July 25, 2023 09:10

amontanez24 requested changes Jul 25, 2023

amontanez24 approved these changes Jul 26, 2023

fealho approved these changes Jul 27, 2023

R-Palazzo added 13 commits July 27, 2023 09:32

definition

945505c

unit test

6df0092

integration test

9377124

docstring

1d2c45f

modify integration test

d18af0b

typo

2f53d52

quotes

54fa3d7

address comments

e54b96a

fix lint

9f5fea5

add test error

b58f67b

sample size

ef358ab

blank line

3d71fca

blank line

6f98d8a

R-Palazzo force-pushed the issue-390-synthesis-property branch from b55ae79 to 6f98d8a Compare July 27, 2023 08:35

R-Palazzo merged commit b57019f into diagnostic-report-properties Jul 27, 2023
45 checks passed

R-Palazzo deleted the issue-390-synthesis-property branch July 27, 2023 09:17

R-Palazzo mentioned this pull request Aug 23, 2023

Properties for diagnostic report #424

Merged
