Add preprocess and fit methods to multi-table synthesizers #1074

Closed
amontanez24 opened this issue Oct 19, 2022 · 0 comments

Labels: feature request

Problem Description

As a user, I want to be able to preprocess my data in a step separate from modeling. I also want to be able to do this at the multi-table level.

Acceptance criteria

Add the following methods:

  • preprocess(data)
    • data is a dictionary mapping each table name to a pandas.DataFrame
    • This method should essentially loop through the single table synthesizers for each table and call preprocess on them with the proper data
    • It should return a dictionary mapping each table name to the transformed data
    • This method can be added to the BaseMultiTableSynthesizer
    • It should raise only a single warning if any of the synthesizers has already been fit. The warning should read:
      Warning: This synthesizer has already been fit. To use the new preprocessed data, please refit the synthesizer using 'fit' or 'fit_processed_data'
  • fit_processed_data(processed_data)
    • processed_data is a dictionary mapping each table name to a pandas.DataFrame. This data should already have been run through the data processor.
    • This method will be specific to each MultiTableSynthesizer, so for now only needs to be implemented in the HMASynthesizer.
  • fit(data)
    • data is a dictionary mapping each table name to a pandas.DataFrame
    • should call preprocess and then fit_processed_data
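
A minimal sketch of how the three methods above could fit together. Everything here is an assumption made for illustration: `StubSingleTableSynthesizer`, `SketchMultiTableSynthesizer`, the `fitted` flag and the `_table_synthesizers` attribute are hypothetical names, not the library's classes, and plain lists of rows stand in for pandas.DataFrame so the sketch has no dependencies. The leading "Warning:" in the message is assumed to be display text and is left out of the string.

```python
import warnings


class StubSingleTableSynthesizer:
    """Hypothetical stand-in for a single-table synthesizer (not the real class)."""

    def __init__(self):
        self.fitted = False  # assumed flag; the real attribute may differ

    def preprocess(self, table):
        # Placeholder transform: return a shallow copy of the rows.
        return list(table)


class SketchMultiTableSynthesizer:
    """Illustrative multi-table wrapper; names and attributes are assumptions."""

    REFIT_WARNING = (
        "This synthesizer has already been fit. To use the new preprocessed "
        "data, please refit the synthesizer using 'fit' or 'fit_processed_data'"
    )

    def __init__(self, table_names):
        self._table_synthesizers = {
            name: StubSingleTableSynthesizer() for name in table_names
        }

    def preprocess(self, data):
        # Warn once, no matter how many per-table synthesizers were already fit.
        if any(s.fitted for s in self._table_synthesizers.values()):
            warnings.warn(self.REFIT_WARNING)
        return {
            name: self._table_synthesizers[name].preprocess(table)
            for name, table in data.items()
        }

    def fit_processed_data(self, processed_data):
        # Algorithm-specific in the issue (HMA); here we only mark tables as fit.
        for name in processed_data:
            self._table_synthesizers[name].fitted = True

    def fit(self, data):
        # fit is just preprocess followed by fit_processed_data.
        self.fit_processed_data(self.preprocess(data))
```

Calling `fit` on a fresh instance raises no warning, because `preprocess` runs before any table is marked as fit. A later `preprocess` call on the same instance warns exactly once.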

Expected behavior

  • preprocess
    • This method should essentially loop through each table and call SingleTableSynthesizer.preprocess with the correct data
  • fit_processed_data(processed_data)
    • This is where the current HMA algorithm should take place. Each child table should be modeled, and the parameters of that model should then be used to extend the parent table, until eventually the parent itself is modeled. The existing code in hma should be used as a reference.
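
The bottom-up extension described above can be illustrated with a toy sketch. It is not the HMA implementation: `fit_child_model` and `extend_with_children` are hypothetical helpers, the "model" is only a row count plus per-column means, tables are plain dicts keyed by primary key instead of pandas.DataFrame objects, and the relationships are assumed to be acyclic.

```python
from statistics import mean


def fit_child_model(rows):
    """Toy 'model': row count plus the mean of each numeric column."""
    params = {"num_rows": len(rows)}
    if rows:
        for column in rows[0]:
            values = [row.get(column) for row in rows]
            if all(isinstance(v, (int, float)) for v in values):
                params[f"mean_{column}"] = mean(values)
    return params


def extend_with_children(table_name, tables, relationships):
    """Return a copy of `table_name` extended with parameters of its children.

    tables: {table_name: {primary_key: {column: value}}}
    relationships: list of (parent, child, foreign_key_column) tuples
    """
    extended = {pk: dict(row) for pk, row in tables[table_name].items()}
    for parent, child, foreign_key in relationships:
        if parent != table_name:
            continue
        # Children are extended first, so grandchildren reach the parent too.
        child_rows = extend_with_children(child, tables, relationships)
        for pk, row in extended.items():
            matching = [
                {k: v for k, v in child_row.items() if k != foreign_key}
                for child_row in child_rows.values()
                if child_row[foreign_key] == pk
            ]
            # Store the child model's parameters as extra parent columns.
            for name, value in fit_child_model(matching).items():
                row[f"__{child}__{name}"] = value
    return extended
```

For example, with a `users` table and a `sessions` table whose `user_id` column points at `users`, `extend_with_children("users", tables, [("users", "sessions", "user_id")])` returns the user rows with added `__sessions__num_rows` and `__sessions__mean_...` columns, leaving the input tables untouched. The extended parent is what would then be modeled.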

Additional context

  • It is a requirement that the primary keys be available to the MultiTableSynthesizer before it fits the models. This should be satisfied, as the DataProcessor now makes the primary key the index during transform.
  • There is a slight change in the workflow from what happens in hma. We now transform each table first, and then call the fit method for each model and extend the parent tables with the model parameters of their child tables.