Add working multi table benchmark #504
base: feature_branch/mutli_table_benchmark

Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@                 Coverage Diff                            @@
##   feature_branch/mutli_table_benchmark    #504     +/-   ##
==============================================================
+ Coverage    76.39%   77.34%   +0.94%
==============================================================
  Files           30       30
  Lines         2411     2472      +61
==============================================================
+ Hits          1842     1912      +70
+ Misses         569      560       -9
```

Flags with carried forward coverage won't be shown.
(force-pushed from df5e11c to c51dc98)
amontanez24 left a comment:

Looking great!
```diff
 for synthesizer in synthesizers:
-    if synthesizer not in SDV_SINGLE_TABLE_SYNTHESIZERS:
+    if synthesizer not in SDV_SYNTHESIZERS:
```
amontanez24: What happens if someone passes HMA to the single-table benchmark function? Should we check here that the modality matches?
Author reply: Will add validation now that I have the flags in; I forgot to update the logic after rebasing. Nice catch!
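A minimal sketch of what that validation could look like. The helper name and the registry contents are hypothetical; `SDV_SINGLE_TABLE_SYNTHESIZERS` appears in the diff above, and a matching `SDV_MULTI_TABLE_SYNTHESIZERS` is assumed to exist:

```python
# Hypothetical registries; the real ones live in the benchmark module.
SDV_SINGLE_TABLE_SYNTHESIZERS = ['GaussianCopulaSynthesizer', 'CTGANSynthesizer']
SDV_MULTI_TABLE_SYNTHESIZERS = ['HMASynthesizer', 'MultiTableUniformSynthesizer']


def _validate_synthesizer_modality(synthesizers, modality):
    """Reject synthesizers that do not belong to the requested modality."""
    registries = {
        'single_table': SDV_SINGLE_TABLE_SYNTHESIZERS,
        'multi_table': SDV_MULTI_TABLE_SYNTHESIZERS,
    }
    invalid = [synth for synth in synthesizers if synth not in registries[modality]]
    if invalid:
        raise ValueError(
            f'Synthesizers {invalid} are not valid for the {modality} benchmark.'
        )
```

With a check like this, passing `HMASynthesizer` to the single-table benchmark would fail fast instead of erroring later in the run.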
```python
all_columns = list(table.columns)
mandatory = list(metadata._get_all_keys(table_name))
optional_columns = [column for column in all_columns if column not in mandatory]
if len(mandatory) >= 10:
    keep_columns = mandatory
else:
    extra = 10 - len(mandatory)
    keep_columns = mandatory + optional_columns[:extra]

# Filter only the subset of columns that will be used
subset_column_schema = {
    column_name: table.columns[column_name]
    for column_name in keep_columns
    if column_name in table.columns
}
# Replace the columns with only the subset ones
metadata_dict['tables'][table_name]['columns'] = subset_column_schema
```
amontanez24: Is there a way to consolidate this chunk with the single-table version of it below? It should be a very similar process.
Lines 54 to 76 in f5abf35
```python
max_rows, max_columns = (1000, 10)
tables = metadata_dict.get('tables', {})
mandatory_columns = []
table_name, table_info = next(iter(tables.items()))
columns = table_info.get('columns', {})
keep_columns = list(columns)
if modality == 'sequential':
    seq_index = table_info.get('sequence_index')
    seq_key = table_info.get('sequence_key')
    mandatory_columns = [col for col in (seq_index, seq_key) if col]
optional_columns = [col for col in columns if col not in mandatory_columns]
# If we have too many columns, drop extras but never mandatory ones
if len(columns) > max_columns:
    keep_count = max_columns - len(mandatory_columns)
    keep_columns = mandatory_columns + optional_columns[:keep_count]
table_info['columns'] = {
    column_name: column_definition
    for column_name, column_definition in columns.items()
    if column_name in keep_columns
}
```
Author reply: Yeah, let me split it up. At first I wanted to keep the two clearly separated, but it makes sense that each table in the multi-table benchmark goes through the same process as the single-table one. Will update now.
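For reference, one possible shape for the shared per-table step once it is split out. The helper name and signature are hypothetical; the logic follows the two chunks quoted above:

```python
def _subset_columns(columns, mandatory_columns, max_columns=10):
    """Keep all mandatory columns, then fill up to max_columns with optional ones.

    `columns` is one table's column-schema dict from the metadata;
    `mandatory_columns` are the columns that must survive (primary/foreign
    keys for multi-table, sequence index/key for sequential).
    """
    optional_columns = [col for col in columns if col not in mandatory_columns]
    if len(mandatory_columns) >= max_columns:
        keep_columns = list(mandatory_columns)
    else:
        extra = max_columns - len(mandatory_columns)
        keep_columns = list(mandatory_columns) + optional_columns[:extra]

    return {
        column_name: column_definition
        for column_name, column_definition in columns.items()
        if column_name in keep_columns
    }
```

The multi-table path would call this once per table with `metadata._get_all_keys(table_name)` as the mandatory set, and the single-table path would pass the sequence index/key.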
```diff
 if len(metrics) > 0:
-    metrics, metric_kwargs = get_metrics(metrics, modality='single-table')
+    metrics, metric_kwargs = get_metrics(metrics, modality=modality.replace('_', '-'))
```
amontanez24: Should we update get_metrics to just use the same modality formatting?
Author reply: I updated it, then undid the changes because I didn't want to alter the error messaging much:

```python
raise ValueError(f'Unknown {modality} metric: {metric}') from None
```

But I can certainly update that if it's fine. (Basically I didn't want to touch the external portion of the benchmark much, just focus on what's required to make this work.)
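If `get_metrics` does get updated, one low-impact option is to normalize the spelling at the top of the function so the `ValueError` text stays exactly as it is; a sketch (the helper name is hypothetical):

```python
def _normalize_modality(modality):
    """Map 'single_table' / 'multi_table' to the hyphenated spelling used in
    error messages such as: Unknown single-table metric: <name>."""
    return modality.replace('_', '-')
```

`get_metrics` would call this first, so callers could pass either spelling without the `.replace('_', '-')` at every call site.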
```python
uniform_not_included = bool(
    MultiTableUniformSynthesizer not in synthesizers
    and MultiTableUniformSynthesizer.__name__ not in synthesizers
)
if uniform_not_included:
    LOGGER.info('Adding MultiTableUniformSynthesizer to the list of synthesizers.')
    synthesizers.append('MultiTableUniformSynthesizer')
```
amontanez24: This and the single-table one can probably be consolidated into one function that takes in the modality. We could also call it ensure_fallback_is_included and make the fallback synthesizer a parameter.
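A sketch of that consolidation, assuming the fallback is passed as a class (the names follow the comment's suggestion; the logger setup is illustrative):

```python
import logging

LOGGER = logging.getLogger(__name__)


def ensure_fallback_is_included(synthesizers, fallback_synthesizer):
    """Append the fallback synthesizer unless it (or its name) is already listed."""
    fallback_name = fallback_synthesizer.__name__
    if fallback_synthesizer not in synthesizers and fallback_name not in synthesizers:
        LOGGER.info('Adding %s to the list of synthesizers.', fallback_name)
        synthesizers.append(fallback_name)
    return synthesizers
```

The single-table caller would pass its uniform synthesizer and the multi-table caller would pass `MultiTableUniformSynthesizer`, so the two modality-specific copies disappear.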
```python
pd.testing.assert_frame_equal(result_2, saved_result_2, check_dtype=False)


def test_benchmark_multi_table_basic_synthesizers():
```
amontanez24: Nice testing! Can we add a test that forces the fallback to be exercised? We can either make a multi-table synthesizer that always errors or mock a function in HMA, whichever is easiest.
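One possible shape for that test, mocking HMA's fit to raise. The patch target, the import path, and the `benchmark_multi_table` signature and result columns are assumptions based on this PR, not the confirmed API:

```python
from unittest.mock import patch

from sdgym import benchmark_multi_table  # assumed import path


@patch('sdv.multi_table.HMASynthesizer.fit', side_effect=RuntimeError('forced failure'))
def test_benchmark_multi_table_uses_fallback_on_error(mock_fit):
    """When HMA errors, the MultiTableUniformSynthesizer fallback should still score."""
    scores = benchmark_multi_table(synthesizers=['HMASynthesizer'])

    # The fallback should appear in the results even though HMA failed.
    assert 'MultiTableUniformSynthesizer' in scores['Synthesizer'].values
```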
```python
empty_scores = pd.DataFrame()
mock__generate_job_args_list.return_value = []
mock__get_empty_dataframe.return_value = empty_scores
```
amontanez24: If you are passing in no synthesizers or datasets, why do you also need to mock the jobs returned? Shouldn't that list be empty?
Resolves #486
CU-86b7cjbz8