# MultiTable synthesis - Database with attribute tables

Relational databases often include attribute tables that provide descriptive metadata about entities in a primary table. These attribute tables, such as those describing product categories, geographic locations, or league information, enhance the database structure by offering additional context and classification for the main data.

When generating synthetic data for schemas with attribute tables, it is essential to:
- Ensure that all references in the main table align with valid entries in the attribute tables.
- Maintain the logical connections between entities and their attributes.
- Preserve the distribution and relationships between attribute values and the main data.
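
The first requirement can be checked with plain pandas once the tables are available as DataFrames. The sketch below is only an illustration: the `leagues` and `games` frames hold made-up values rather than rows from the Football database, and `orphan_keys` is a hypothetical helper, not part of the YData SDK.

```python
import pandas as pd

# Toy stand-ins for an attribute table and a main table (illustrative values only).
leagues = pd.DataFrame({"leagueID": [1, 2, 3], "name": ["A", "B", "C"]})
games = pd.DataFrame({"gameID": [10, 11, 12, 13], "leagueID": [1, 2, 2, 4]})


def orphan_keys(child: pd.DataFrame, fk: str, parent: pd.DataFrame, pk: str) -> set:
    """Return foreign-key values in `child` that have no matching key in `parent`."""
    return set(child[fk].dropna()) - set(parent[pk])


# Any value returned here is a reference that points to no attribute row.
orphans = orphan_keys(games, "leagueID", leagues, "leagueID")
print(orphans)
```

An empty set means every reference in the main table resolves to an entry in the attribute table. The same check can be repeated for each foreign key in a schema.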

This notebook introduces strategies for generating synthetic datasets that respect the relationships between primary and attribute tables. By maintaining relational integrity and ensuring realistic attribute distributions, the resulting data can be effectively used for testing, analytics, or machine learning applications.


In this example, we explore how this can be achieved using the [Football database](https://www.kaggle.com/datasets/technika148/football-database).

### Read the data from the Data Catalog

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{insert-datasource-id}')
dataset = datasource.dataset
# Getting the computed Metadata, which holds the profile overview information shown in the labs
metadata = datasource.metadata
print(metadata)

MultiMetadata Summary

Tables Summary
Number of tables: 7

    Table name  # cols  # nrows        Primary keys                        Foreign keys PK characteristics                                                FK characteristics Notes
0      leagues       3        5          [leagueID]                                                   [id]                                                                        
1      players       2     7659          [playerID]                                                   [id]                                                                        
2        teams       2      146            [teamID]                                                   [id]                                                                        
3        games      34    12680            [gameID]  [leagueID, homeTeamID, awayTeamID]               [id]  {'leagueID': ['id'], 'homeTeamID': ['id'], 'awayTeamID': ['id']}      
4  appearances   

## Synthetic data generation - train & sampling

### Train

In this database, some of the tables are attribute tables: they only hold information that describes a certain attribute, activity, place, and so on. Here, the leagues, teams, and players tables can be treated as attribute tables.

For that reason, we pass the *attribute_tables* configuration when fitting the synthetic data generator. Attribute tables are still generated, but with their values masked/anonymized, so that the generated synthetic data can comply with privacy regulations.

In [None]:
from ydata.synthesizers.multitable.model import MultiTableSynthesizer

synthesizer = MultiTableSynthesizer()

synthesizer.fit(
    X=dataset,
    metadata=metadata,
    attribute_tables=["leagues", "teams", "players"],  # or ["leagues"]
)

### Sample

In [None]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
dest_connector = Connectors.get(uid='{insert-connector-id}')

In [None]:
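# Sample the synthetic database and write it to the destination connector.
# The first argument (1.) is assumed here to be the size of the sample relative to the original database.
synthesizer.sample(1., connector=dest_connector.connector)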
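
Once the synthetic tables have been read back from the destination (how to do so depends on the connector and is not shown here), one simple sanity check is to compare how the main table's rows are spread across attribute values in the real and synthetic data. The sketch below uses made-up `real_games` and `synth_games` frames and a hypothetical helper `category_shares`. It also assumes that the key values remain comparable between the two datasets, which may not hold if the keys themselves are anonymized.

```python
import pandas as pd

# Toy stand-ins for the real and synthetic `games` tables (illustrative values only).
real_games = pd.DataFrame({"leagueID": [1, 1, 2, 2, 2, 3]})
synth_games = pd.DataFrame({"leagueID": [1, 2, 2, 2, 3, 3]})


def category_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Relative frequency of each value in `column`, sorted by value."""
    return df[column].value_counts(normalize=True).sort_index()


# Align both distributions on the attribute key; keys missing on one side get a share of 0.
comparison = pd.concat(
    {
        "real": category_shares(real_games, "leagueID"),
        "synthetic": category_shares(synth_games, "leagueID"),
    },
    axis=1,
).fillna(0.0)

# Absolute gap per attribute value, and the largest gap overall.
comparison["abs_diff"] = (comparison["real"] - comparison["synthetic"]).abs()
max_gap = comparison["abs_diff"].max()
print(comparison)
```

A small `max_gap` suggests that the share of games per league is similar in both datasets. What counts as small enough is a judgment call that depends on the use case.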