# Advanced Tutorial on metasyn

In this tutorial, we will be creating synthetic data using the `metasyn` package.

Later in the tutorial we cover some of metasyn's more advanced features, such as handling dates, setting distributions, ensuring uniqueness in columns, and adding variable descriptions.

For more information, refer to the [user's guide](https://metasynth.readthedocs.io/en/latest/usage/usage.html) in the docs.

## Step 0: Install the metasyn package and import required packages

First, let's install the metasyn package.

In [1]:
# Run the following line to install metasyn
# %pip install metasyn

Now, let's import the required packages.

In [2]:
# import required packages
from pathlib import Path

import polars as pl

from metasyn import MetaFrame, VarSpec, demo_file

## Step 1: Loading the dataset

The first step in creating synthetic data is to load your dataset into a DataFrame. For this tutorial, we will be using the [Titanic dataset](https://www.kaggle.com/c/titanic/data), which can easily be accessed through the metasyn `demo_file()` function.

It is important to set the data types of columns in the DataFrame correctly, as this will help metasyn to infer the correct distributions for each variable later.


> **Note** 
> In this tutorial we use [Polars](https://pola.rs) to create the DataFrame, as that is what metasyn uses internally. Pandas is also supported, but will automatically be converted to Polars by metasyn. For best results it is recommended to use Polars.

In [3]:
# get the demonstration data file
csv_path = demo_file("titanic")

# ensure columns are of the correct type
data_types = {"Sex": pl.Categorical, "Embarked": pl.Categorical}

# read the data from the csv path
df = pl.read_csv(csv_path, schema_overrides=data_types, try_parse_dates=True)

# check out the data
df.head()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Braund, Mr. Owen Harris""","""male""",22,0,"""A/5 21171""",7.25,,"""S""",1937-10-28,15:53:04,2022-08-05 04:43:34,
2,"""Cumings, Mrs. John Bradley (Fl…","""female""",38,0,"""PC 17599""",71.2833,"""C85""","""C""",,12:26:00,2022-08-07 01:56:33,
3,"""Heikkinen, Miss. Laina""","""female""",26,0,"""STON/O2. 3101282""",7.925,,"""S""",1931-09-24,16:08:25,2022-08-04 20:27:37,
4,"""Futrelle, Mrs. Jacques Heath (…","""female""",35,0,"""113803""",53.1,"""C123""","""S""",1936-11-30,,2022-08-07 07:05:55,
5,"""Allen, Mr. William Henry""","""male""",35,0,"""373450""",8.05,,"""S""",1918-11-07,10:59:08,2022-08-02 15:13:34,


We can check the data types of our DataFrame as follows:

In [4]:
df.schema

Schema([('PassengerId', Int64),
        ('Name', String),
        ('Sex', Categorical(ordering='physical')),
        ('Age', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', Categorical(ordering='physical')),
        ('Birthday', Date),
        ('Board time', Time),
        ('Married since', Datetime(time_unit='us', time_zone=None)),
        ('all_NA', String)])

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can get more information on the DataFrame by calling its `describe()` method, which returns summary statistics for each variable:

In [5]:
df.describe()

statistic,PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
str,f64,str,str,f64,f64,str,f64,str,str,str,str,str,str
"""count""",891.0,"""891""","""891""",714.0,891.0,"""891""",891.0,"""204""","""889""","""813""","""812""","""799""","""0"""
"""null_count""",0.0,"""0""","""0""",177.0,0.0,"""0""",0.0,"""687""","""2""","""78""","""79""","""92""","""891"""
"""mean""",446.0,,,29.693277,0.381594,,32.204208,,,"""1921-07-27 22:08:24.798000""","""14:38:10.014778""","""2022-07-31 03:43:48.767209""",
"""std""",257.353842,,,14.524527,0.806057,,49.693429,,,,,,
"""min""",1.0,"""Abbing, Mr. Anthony""",,0.0,0.0,"""110152""",0.0,"""A10""",,"""1903-07-28""","""10:39:40""","""2022-07-15 12:21:15""",
"""25%""",224.0,,,20.0,0.0,,7.925,,,"""1911-09-18""","""12:39:02""","""2022-07-23 11:16:56""",
"""50%""",446.0,,,28.0,0.0,,14.4542,,,"""1922-03-26""","""14:29:34""","""2022-07-31 00:36:56""",
"""75%""",669.0,,,38.0,0.0,,31.0,,,"""1930-08-29""","""16:40:12""","""2022-08-08 03:35:52""",
"""max""",891.0,"""van Melkebeke, Mr. Philemon""",,80.0,6.0,"""WE/P 5735""",512.3292,"""T""",,"""1940-05-27""","""18:39:28""","""2022-08-15 10:32:05""",


## Step 2: Generating a MetaFrame

Now that we have properly formatted our DataFrame, we can easily generate a MetaFrame for it. 
We'll do this without passing in any optional parameters, but later on in this tutorial we will cover how custom parameters can help provide control over the MetaFrame generation process. 

> **MetaFrames:**
> A MetaFrame is an object which captures the essential aspects of the dataset, including variable names, types, data types, the percentage of missing values, and distribution parameters. MetaFrame objects capture all the information needed to generate a synthetic dataset that aligns with the original dataset, without containing any *entries* of the original dataset.
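
To make the idea of statistical metadata more concrete, here is a minimal sketch in plain Python of the kind of summary that could be stored for a categorical column: the proportion of missing values and the probability of each label. The function name `summarize_categorical` and the toy column are hypothetical and only serve as illustration; this is not how metasyn is implemented.

```python
from collections import Counter


def summarize_categorical(values):
    """Return the share of missing values and the probability of each label."""
    n_total = len(values)
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    labels = sorted(counts)
    probs = [counts[label] / len(observed) for label in labels]
    return {
        "prop_missing": (n_total - len(observed)) / n_total,
        "labels": labels,
        "probs": probs,
    }


# Hypothetical toy column, not taken from the Titanic data
toy_column = ["male", "female", "male", None, "male", "female", "male", "male"]
summary = summarize_categorical(toy_column)
print(summary)
```

Note that the summary contains only aggregate numbers and labels, not the individual rows of the column.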

More information on generating MetaFrames can be found on the metasyn docs ['generating metaframes'](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html) page.

Generating a MetaFrame can be done by calling the `MetaFrame.fit_dataframe()` class method, passing in the DataFrame as a parameter.

In [6]:
# Generate and fit a MetaFrame to the DataFrame
mf = MetaFrame.fit_dataframe(df)

100%|██████████| 13/13 [00:03<00:00,  3.38it/s]


We can use the built-in Python `print` function to display the (statistical) metadata contained in the MetaFrame in an easy-to-read format:

In [7]:
print(mf)

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.unique_key
	- Provenance: builtin
	- Parameters:
		- lower: 1
		- consecutive: True
	

Column 2: "Name"
- Variable Type: string
- Data Type: String
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.freetext
	- Provenance: builtin
	- Parameters:
		- locale: EN
		- avg_sentences: 2.4691358024691357
		- avg_words: 4.093153759820426
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical(ordering='physical')
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.truncated_normal
	- Provenance: builtin
	- Parameters:
		- lower: -1e-08
	

## Step 3: Exporting the MetaFrame

After creating the MetaFrame, metasyn can serialize and save it to a GMF file using `mf.save()`, passing in the file path as a parameter.


> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format for statistical metadata of (tabular) datasets that is designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

More information on saving and importing MetaFrames can be found on the metasyn docs ['saving and importing metaframes'](https://metasynth.readthedocs.io/en/latest/usage/saving_metaframes.html) page.

In [8]:
file_path = Path("gmf_files", "example_gmf_titanic.json")

# Serialize and save the MetaFrame to a GMF file
mf.save(file_path)

The GMF file should now be saved to the specified file path. Feel free to open and inspect it!
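
Because a GMF file is plain JSON, it can also be inspected with Python's standard `json` module. The sketch below lists the top-level keys of a JSON file together with the type of each value. To keep it runnable on its own, it writes a small stand-in file first; the field names in that stand-in are illustrative assumptions and not necessarily the actual GMF schema. The helper name `top_level_overview` is hypothetical as well.

```python
import json
import tempfile
from pathlib import Path


def top_level_overview(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    with open(path, encoding="utf-8") as handle:
        content = json.load(handle)
    return {key: type(value).__name__ for key, value in content.items()}


# Stand-in file so the sketch runs on its own; the field names are
# illustrative assumptions, not the actual GMF schema.
stand_in = {"n_rows": 891, "n_columns": 13, "vars": [{"name": "Sex"}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in_path = Path(tmp_dir) / "stand_in.json"
    stand_in_path.write_text(json.dumps(stand_in), encoding="utf-8")
    overview = top_level_overview(stand_in_path)

print(overview)
```

Pointing `top_level_overview` at the file saved above should give a quick impression of how the real file is organised.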

It's also possible to get a compact summary of the MetaFrame without saving anything to disk. This can be done as follows:

In [9]:
# Get a compact summary of the MetaFrame (`repr()`) and print it (`print()`)
print(repr(mf))

MetaFrame: size = (891 x 13) <MetaVar <PassengerId, core.unique_key>, MetaVar <Name, core.freetext>, ...>


A GMF file can be imported and loaded into a MetaFrame using the `MetaFrame.load()` class method, passing in the file path as a parameter. 

In [10]:
# Create a MetaFrame based on a GMF (.json or .toml) file
mf = MetaFrame.load(file_path)

## Step 4: Generating synthetic data

Once a MetaFrame is loaded, synthetic data can be generated from it. We can do so by using the `synthesize` method of the MetaFrame, passing in the number of rows the generated data should contain as a parameter. This returns a DataFrame with the synthetic data.

More information on generating synthetic data based on MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html).

In [11]:
# generate synthetic data
syn_df = mf.synthesize(5)

We can now view the synthetic data:

In [12]:
syn_df

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Sister. Benefit. Reality. Deci…","""male""",42.0,0,"""3928""",10.508311,,"""S""",1913-07-13,13:34:24,2022-07-21 01:17:45,
2,"""Truth. System. Role.""","""male""",53.0,0,"""601707""",15.501538,,"""S""",1937-08-31,12:53:59,2022-07-29 22:21:06,
3,"""What.""","""male""",26.0,0,"""4697""",10.67768,,"""S""",1926-06-10,18:23:23,2022-08-12 06:55:58,
4,"""Bed. Late. Town.""","""female""",,0,"""897872""",15.867427,,"""S""",1924-09-05,12:39:47,2022-08-05 04:59:33,
5,"""Large business cup effect. Thr…","""male""",,0,"""6633""",104.973689,,"""S""",1927-12-13,10:46:22,,


As you can see, the synthetic data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.
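
For intuition about what synthesizing a single column from metadata involves, here is a minimal sketch in plain Python. It draws labels according to their probabilities and replaces a share of the values with missing values. The labels and probabilities are those printed for the `Sex` column in Step 2; the missing-value proportion is borrowed from the `Age` column purely to show how missing values could be injected. The function `sample_column` is hypothetical and is not part of metasyn.

```python
import random


def sample_column(labels, probs, prop_missing, n_rows, seed=None):
    """Draw n_rows values: None with probability prop_missing, else a label."""
    rng = random.Random(seed)
    column = []
    for _ in range(n_rows):
        if rng.random() < prop_missing:
            column.append(None)  # simulate a missing value
        else:
            column.append(rng.choices(labels, weights=probs, k=1)[0])
    return column


# Labels and probabilities as printed for `Sex`; prop_missing borrowed from `Age`
synthetic_sex = sample_column(
    labels=["female", "male"],
    probs=[0.35241302, 0.64758698],
    prop_missing=0.1987,
    n_rows=10,
    seed=42,
)
print(synthetic_sex)
```

Nothing in this procedure looks at the original rows: only the stored parameters are used.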

## Step 5: Improving the quality of the synthetic data

The `MetaFrame.fit_dataframe()` method gives you more control over how your synthetic dataset is generated through its optional `var_specs` parameter. `var_specs` is a list of `VarSpec` objects (variable specifications) that give metasyn instructions on a per-variable basis. These instructions can range from setting a variable to be unique to directly setting its distribution.

### Spec: Fake names (and other Faker data types)

Currently, the `Name` column is not synthesized very well. The reason is that the string type interpreter in metasyn is designed for structured strings (like room numbers such as `B1.09`, `B1.01` or `A1.08`) and not for unstructured strings. However, metasyn supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which can fake a wide range of data types. Columns that use faker are not based on the real data at all, so they do not disclose any information about the real data.

We can instruct metasyn to use Faker names for the `Name` column as follows:

In [13]:
# First, we create a list of variable specifications
from metasyn.distribution import FakerDistribution

var_specs = [
    VarSpec("Name", distribution=FakerDistribution("name")),
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(5)

100%|██████████| 13/13 [00:00<00:00, 14.79it/s]


PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Jesus Armstrong""","""male""",12,0,"""4765""",0.984583,,"""C""",1913-05-13,17:59:25,2022-08-01 04:52:02,
2,"""Jeff Ball""","""female""",11,0,"""5832""",48.003537,"""B4""","""S""",1921-05-06,12:40:37,,
3,"""Lindsey Collins""","""female""",36,0,"""346812""",31.828248,,"""S""",1929-07-28,15:55:15,2022-08-15 04:35:43,
4,"""Vincent Stevenson""","""male""",45,1,"""0369""",13.633204,,"""S""",1919-03-30,,2022-08-09 06:01:43,
5,"""Michele Whitney""","""male""",48,0,"""288891""",3.003839,"""E 4""","""S""",1926-09-06,13:46:37,2022-07-18 13:16:16,


That already looks a lot better for the `Name` column!

### Spec: Setting distributions manually

Without user input, metasyn picks, for each variable, the best-fitting distribution among all distributions available for that variable type. However, we can also name the distribution that should be fitted, or fully specify a distribution including its parameters.

In [14]:
from metasyn.distribution import DiscreteUniformDistribution

var_specs = [
    VarSpec("Name", distribution=FakerDistribution("name")),
    VarSpec("Fare", distribution="lognormal"), # estimate / fit a log-normal distribution based on the data
    VarSpec("Age", distribution=DiscreteUniformDistribution(20, 40)) # fully specify a distribution for age (uniform between 20 and 40)
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(5)

100%|██████████| 13/13 [00:00<00:00, 18.91it/s]


PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Steven Solis""","""male""",26,0,"""06781""",4.426615,,"""Q""",1917-10-31,16:19:12,2022-08-03 02:07:29,
2,"""Courtney Ramirez""","""male""",35,0,"""63056""",2.3421,,"""Q""",1931-03-16,18:01:24,2022-07-31 10:38:49,
3,"""Yvonne Hansen""","""male""",30,1,"""5978""",0.157967,,"""C""",,16:47:23,2022-07-15 19:29:30,
4,"""Mark Green""","""male""",24,0,"""3317""",0.880757,,"""S""",1906-04-17,13:02:33,2022-07-15 23:48:15,
5,"""Tony Reynolds""","""female""",39,0,"""195928""",0.422501,,"""C""",1910-08-19,16:04:09,2022-08-03 02:35:05,
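
To give some intuition for what "best fit" can mean, the sketch below fits three candidate distributions by maximum likelihood to the five `Fare` values shown in Step 1 and ranks them by the Akaike information criterion (AIC, lower is better). This illustrates the general idea only: the choice of criterion and the candidate set are assumptions, and metasyn's own selection procedure may differ. With only five values, the ranking is not meaningful for the full dataset.

```python
import math


def normal_loglik(xs):
    """Maximised log-likelihood of a normal distribution."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def exponential_loglik(xs):
    """Maximised log-likelihood of an exponential distribution."""
    n = len(xs)
    rate = n / sum(xs)
    return n * math.log(rate) - rate * sum(xs)


def lognormal_loglik(xs):
    """Maximised log-likelihood of a log-normal distribution (xs must be > 0)."""
    logs = [math.log(x) for x in xs]
    return normal_loglik(logs) - sum(logs)


# The five `Fare` values from the head of the DataFrame in Step 1
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

# (log-likelihood function, number of fitted parameters) per candidate
candidates = {
    "normal": (normal_loglik, 2),
    "exponential": (exponential_loglik, 1),
    "lognormal": (lognormal_loglik, 2),
}

aic = {name: 2 * k - 2 * loglik(fares) for name, (loglik, k) in candidates.items()}
best = min(aic, key=aic.get)
print(aic)
print("lowest AIC:", best)
```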


### Spec: Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings using regular expressions. For example, we see that most `Cabin` values consist of a letter in `[A-F]` followed by a two- or three-digit number. We can include this as follows:

In [15]:
from metasyn.distribution import RegexDistribution

# A regex distribution can be created directly from a regular expression string.
cabin_distribution = RegexDistribution(r"[A-F][0-9]{2,3}")  # The r prefix makes this a raw string literal.
# Just for completeness: data generated from this distribution will always match the regex [A-F][0-9]{2,3}


var_specs = [
    VarSpec("Name", distribution=FakerDistribution("name")),
    VarSpec("Fare", distribution="lognormal"), # estimate / fit a log-normal distribution based on the data
    VarSpec("Age", distribution=DiscreteUniformDistribution(20, 40)), # fully specify a distribution for age (uniform between 20 and 40)
    VarSpec("Cabin", distribution=cabin_distribution),  # Use our previously defined distribution
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(10)

100%|██████████| 13/13 [00:00<00:00, 23.38it/s]


PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Timothy Davidson""","""male""",,0,"""FU 597164""",2.837321,,"""C""",1907-06-17,12:23:18,2022-08-09 15:00:58,
2,"""Christine Russell""","""female""",31.0,0,"""61912""",1.610465,"""B042""","""C""",1916-05-16,15:33:51,2022-07-17 05:51:50,
3,"""James Morris DVM""","""male""",22.0,0,"""178671""",0.835741,"""E958""","""S""",1934-10-29,18:22:11,,
4,"""Jared Hart""","""female""",29.0,0,"""380514""",2.904262,"""A48""","""S""",,14:07:08,2022-07-28 05:03:04,
5,"""Jamie Medina""","""male""",20.0,0,"""0050""",1.432564,,"""Q""",1932-10-06,,2022-07-16 00:21:03,
6,"""Karen Rodriguez""","""female""",29.0,0,"""9589""",0.570791,,"""Q""",1939-12-23,16:39:01,2022-08-04 15:56:22,
7,"""Richard Young""","""female""",,1,"""974010""",3.177976,,"""S""",1916-03-11,,2022-07-26 15:39:12,
8,"""Tiffany Jones""","""male""",25.0,0,"""965653""",0.801674,,"""S""",1919-08-27,17:00:25,2022-08-11 19:24:36,
9,"""Richard Andrews""","""male""",28.0,0,"""93545""",1.798628,,"""S""",1933-01-30,17:36:25,2022-07-23 10:37:15,
10,"""Rebecca Bradshaw DDS""","""female""",,0,"""QAAV 834235""",1.917046,"""D862""","""Q""",1905-09-15,12:40:01,,
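
Before committing to a regular expression, it can help to check how well it describes the values at hand. The sketch below uses Python's standard `re` module to compute the share of non-missing values that fully match the cabin pattern. The cabin values are ones that appear in the outputs earlier in this tutorial; the helper `share_matching` is hypothetical and not part of metasyn.

```python
import re

CABIN_PATTERN = re.compile(r"[A-F][0-9]{2,3}")


def share_matching(values, pattern=CABIN_PATTERN):
    """Return the share of non-missing values that fully match the pattern."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    hits = sum(1 for v in present if pattern.fullmatch(v))
    return hits / len(present)


# Cabin values that appear in the outputs earlier in this tutorial
cabins = ["C85", "C123", "A10", "T", None]
print(share_matching(cabins))
```

Here `"T"` does not match, which is a reminder that a single pattern will not necessarily cover every value in the real column.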


## Step 6: Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [16]:
df.mean()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,str,f64,str,cat,datetime[ms],time,datetime[μs],str
446.0,,,29.693277,0.381594,,32.204208,,,1921-07-27 22:08:24.798,14:38:10.014778325,2022-07-31 03:43:48.767209,


In [17]:
mf.synthesize().mean()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,str,f64,str,cat,datetime[ms],time,datetime[μs],str
446.0,,,29.353693,0.05275,,1.573629,,,1922-02-15 09:31:27.379,14:42:30.612345679,2022-07-31 05:06:31.243750,


Then, we can also see how many missing values are in each column:

In [18]:
df.null_count()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,177,0,0,0,687,2,78,79,92,891


In [19]:
mf.synthesize().null_count()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,168,0,0,0,702,8,78,84,92,891
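
The two `null_count()` outputs above are easier to compare as proportions. The short sketch below copies the counts from both outputs and reports the column with the largest absolute difference in the share of missing values. It assumes that both DataFrames have 891 rows, which matches the `all_NA` counts shown.

```python
N_ROWS = 891  # assumed row count of both DataFrames

columns = [
    "PassengerId", "Name", "Sex", "Age", "Parch", "Ticket", "Fare",
    "Cabin", "Embarked", "Birthday", "Board time", "Married since", "all_NA",
]
# Counts copied from the two outputs above
original_nulls = [0, 0, 0, 177, 0, 0, 0, 687, 2, 78, 79, 92, 891]
synthetic_nulls = [0, 0, 0, 168, 0, 0, 0, 702, 8, 78, 84, 92, 891]

# Absolute difference in the proportion of missing values per column
gaps = {
    name: abs(orig - syn) / N_ROWS
    for name, orig, syn in zip(columns, original_nulls, synthetic_nulls)
}
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 4))
```

Even the largest gap is under two percentage points, so the missingness of the synthetic data follows the original closely.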


## Step 7: Adding descriptions to variables

With the data taken care of, there is one last thing we can do: add descriptions to the variables to clarify what they mean. This can be particularly useful when sharing the `MetaFrame` or generated data with others, as it gives them more context about what they're working with.

One way of adding a description to a variable is to set it in the variable specification, by passing a `description` argument to `VarSpec`. For example, adding a description to the `Cabin` column can be done as follows:

In [20]:
var_specs = [
    # Utilize the Faker library to synthesize realistic names for the `Name` column
    VarSpec("Name", distribution=FakerDistribution("name")),

    # Fit a log-normal distribution to `Fare`, estimating the parameters from the data
    VarSpec("Fare", distribution="lognormal"),

    # Set the `Age` column to a discrete uniform distribution ranging from 20 to 40
    VarSpec("Age", distribution=DiscreteUniformDistribution(20, 40)),

    # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
    VarSpec("Cabin", distribution=cabin_distribution, description="The cabin number of the passenger."),
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)

100%|██████████| 13/13 [00:00<00:00, 21.15it/s]


We can get a list of all the descriptions in the fitted `MetaFrame` by accessing its `descriptions` property, as follows:

In [21]:
print(mf.descriptions)

{'Cabin': 'The cabin number of the passenger.'}


Instead of setting the description in the variable specification (which happens before fitting a `MetaFrame` to a `DataFrame`), we can assign a description to an already generated `MetaFrame` by directly setting a column's description attribute. For example, we can assign a description to the `PassengerId` column as follows:

In [22]:
mf["PassengerId"].description = "The ID of each passenger, as assigned by Pandas."

print(mf.descriptions)

{'PassengerId': 'The ID of each passenger, as assigned by Pandas.', 'Cabin': 'The cabin number of the passenger.'}


We can also set multiple descriptions of an already generated `MetaFrame` at once by passing in a dictionary of descriptions to its `descriptions` property. For example, we can set descriptions for the `Age` and `Name` columns as follows:

In [23]:
mf.descriptions = {"Name": "Name of the passenger", "Age": "Age of the passenger in years"}

print(mf.descriptions)

{'PassengerId': 'The ID of each passenger, as assigned by Pandas.', 'Name': 'Name of the passenger', 'Age': 'Age of the passenger in years', 'Cabin': 'The cabin number of the passenger.'}


Instead of a dictionary, it is also possible to pass in a list of descriptions to the `descriptions` property of a `MetaFrame`. 

This can only be done if the list has the same length as the number of variables. In other words, each description must be passed in. 

This can be useful, for example, when generating placeholder descriptions automatically with a list comprehension, as in the following example:

In [24]:
mf.descriptions = [f"Placeholder description for {var.name}" for var in mf.meta_vars]

print(mf.descriptions)

{'PassengerId': 'Placeholder description for PassengerId', 'Name': 'Placeholder description for Name', 'Sex': 'Placeholder description for Sex', 'Age': 'Placeholder description for Age', 'Parch': 'Placeholder description for Parch', 'Ticket': 'Placeholder description for Ticket', 'Fare': 'Placeholder description for Fare', 'Cabin': 'Placeholder description for Cabin', 'Embarked': 'Placeholder description for Embarked', 'Birthday': 'Placeholder description for Birthday', 'Board time': 'Placeholder description for Board time', 'Married since': 'Placeholder description for Married since', 'all_NA': 'Placeholder description for all_NA'}
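
The length requirement for a list of descriptions can be illustrated with a small stand-alone helper. The sketch below pairs column names with descriptions by position and refuses lists of the wrong length. The function `descriptions_from_list` is hypothetical and only mirrors the rule described above; it is not metasyn's own implementation.

```python
def descriptions_from_list(column_names, descriptions):
    """Pair each column name with the description at the same position."""
    if len(descriptions) != len(column_names):
        raise ValueError(
            f"Expected {len(column_names)} descriptions, got {len(descriptions)}."
        )
    return dict(zip(column_names, descriptions))


# Three of the Titanic column names, used as a small example
names = ["PassengerId", "Name", "Sex"]
placeholders = [f"Placeholder description for {name}" for name in names]
print(descriptions_from_list(names, placeholders))
```

A dictionary avoids this constraint, since each description is tied to its column by name rather than by position.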


## The end

That's it for this tutorial! You should now have a good understanding of how to use metasyn to generate synthetic data from a dataset. If you want to learn more, check out the [metasyn docs](https://metasynth.readthedocs.io/en/latest/).

If you have any questions, feel free to [reach out](https://metasynth.readthedocs.io/en/latest/about/contact.html).

