What is the difference between combining and separating tables? #2092

Closed
limhasic opened this issue Jun 26, 2024 · 3 comments
Labels
question General question about the software resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@limhasic

What is the difference between running multi-table synthesis on the separate tables and joining all the tables into one table, then using the existing single-table synthesis?

  1. Use the Multi Table Metadata API on the separate tables
  2. Join the tables, use the Single Table Metadata API, and split the result afterwards

Could you please explain the difference between these two approaches?

@limhasic limhasic added new Automatic label applied to new issues question General question about the software labels Jun 26, 2024
@srinify
Contributor

srinify commented Jun 27, 2024

Hi @limhasic 👋

I'm a bit confused about exactly what you're asking. Do you mind clarifying a bit further? I understood your question to be: "Why keep tables laid out in a multi-table pattern when I can just combine them into a single table and use SDV that way instead?" If this is incorrect, let me know!

Here are the key differences:

Single Table: Works best when you have a single identifier column (e.g. user_id) that uniquely identifies the entities in your data. If the same dataset has other columns with identifier-like properties (e.g. post_id), single table models will not learn the relationship between your primary identifier column (user_id) and your secondary one (post_id). Your synthetic data may then contain rows with user_id and post_id value pairs that don't exist in your real data.

Multi Table: Supports cases where your data has multiple identifier (ID) columns with a relational link between them. With Multi Table, you can specify the relationships between identifier columns, and SDV will learn to model them more effectively. For example, SDV will maintain referential integrity when generating synthetic data (e.g. the combinations of user_id and post_id will follow the same relationships as in your real data).
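To make the distinction above a bit more concrete, here is a minimal, library-free sketch in plain Python. It does not use SDV's API; the `users` and `posts` tables and both helper functions are hypothetical and only illustrate the idea. It shows what a referential integrity check looks like, and which user_id / post_id pairs become possible if the two ID columns are treated as unrelated.

```python
from itertools import product

# Hypothetical toy data: "users" is the parent table, "posts" is the child table.
users = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]
posts = [
    {"post_id": 10, "user_id": 1},
    {"post_id": 11, "user_id": 1},
    {"post_id": 12, "user_id": 3},
]


def has_referential_integrity(parent_rows, child_rows, key):
    """Return True if every child row's key value exists in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return all(row[key] in parent_keys for row in child_rows)


def id_pairs(rows):
    """Collect the (user_id, post_id) pairs that appear in the given rows."""
    return {(row["user_id"], row["post_id"]) for row in rows}


# Pairs that actually occur in the real child table.
real = id_pairs(posts)

# Pairs that become possible if user_id and post_id are treated as two
# unrelated columns (every observed user_id with every observed post_id).
observed_user_ids = {row["user_id"] for row in posts}
observed_post_ids = {row["post_id"] for row in posts}
possible = set(product(observed_user_ids, observed_post_ids))

# Pairs a relationship-unaware approach could emit that never occurred.
unseen = possible - real

print(has_referential_integrity(users, posts, "user_id"))
print(sorted(unseen))
```

In this toy setup half of the possible pairs never occurred in the real data, which is the kind of mismatch described for single table models above.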

@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Jun 27, 2024
@npatki
Contributor

npatki commented Jun 27, 2024

Hi @limhasic,

To add to this, we always recommend using SDV with data that is as close as possible to its original source. The more you modify the data (splitting, joining, etc.), the more logic and dependencies you introduce into your dataset. As a result, it becomes much more difficult for SDV synthesizers to learn the data out of the box, because they must reverse-engineer all the changes that were introduced.

Hope that helps, and as @srinify mentioned, it would be helpful if you can provide an example to help us clarify the question further. Thanks.
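As a rough illustration of the point about joining introducing extra dependencies, here is a small plain-Python sketch. It is not SDV code; the tables, the `country` and `likes` columns, and the `join_tables` / `split_parent` helpers are all hypothetical. Joining a parent table onto a child table repeats each parent's attributes on every child row, which creates an implicit rule ("same user_id implies same country") that a model working on the joined table would have to learn before the table could be split back cleanly.

```python
# Hypothetical toy tables; names and values are illustrative only.
users = [
    {"user_id": 1, "country": "KR"},
    {"user_id": 2, "country": "US"},
]
posts = [
    {"post_id": 10, "user_id": 1, "likes": 5},
    {"post_id": 11, "user_id": 1, "likes": 7},
    {"post_id": 12, "user_id": 2, "likes": 1},
]


def join_tables(parent_rows, child_rows, key):
    """Inner-join each child row with its parent row on `key`."""
    parents = {row[key]: row for row in parent_rows}
    return [{**parents[row[key]], **row} for row in child_rows if row[key] in parents]


def split_parent(joined_rows, key, parent_columns):
    """Rebuild the parent table from joined rows.

    Also returns the set of keys whose parent columns disagree across rows,
    i.e. keys for which the split is ambiguous.
    """
    seen, conflicts = {}, set()
    for row in joined_rows:
        values = {col: row[col] for col in parent_columns}
        if row[key] in seen and seen[row[key]] != values:
            conflicts.add(row[key])
        seen.setdefault(row[key], values)
    parent = [{key: k, **v} for k, v in seen.items()]
    return parent, conflicts


# The join repeats user 1's country on both of their posts.
joined = join_tables(users, posts, "user_id")
recovered_users, conflicts = split_parent(joined, "user_id", ["country"])

# A made-up "synthetic" joined row that breaks the implicit rule:
# user 1 now appears with two different countries.
synthetic_joined = joined + [
    {"user_id": 1, "country": "US", "post_id": 13, "likes": 2}
]
_, synthetic_conflicts = split_parent(synthetic_joined, "user_id", ["country"])

print(recovered_users == users, conflicts)
print(synthetic_conflicts)
```

With the real joined rows the split recovers the original users table, but a single inconsistent generated row is enough to make the split ambiguous. Keeping the tables separate and declaring the relationship avoids having to learn that rule indirectly.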

@srinify
Contributor

srinify commented Jul 8, 2024

Hi @limhasic we hope our answers were helpful! It's been 2 weeks since we've heard from you and our general posture is to close out issues with no response after 2 weeks!

If you have more questions, feel free to open more issues!

@srinify srinify closed this as completed Jul 8, 2024
@srinify srinify added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Jul 8, 2024

3 participants