# Advanced Tutorial on metasyn

In this tutorial, we will be creating synthetic data using the `metasyn` package.

Some advanced features of metasyn will be covered further along the tutorial, such as handling dates, setting distributions, ensuring uniqueness in columns and adding variable descriptions.

For more information refer to the [user's guide](https://metasynth.readthedocs.io/en/latest/usage/usage.html) on the docs.

## Step 0: Install the metasyn package and import required packages

First, let's install the metasyn package.

In [None]:
# Run the following line to install metasyn
# %pip install metasyn

Now, let's import the required packages.

In [None]:
# import required packages
import polars as pl

from metasyn import MetaFrame, demo_file
from metasyn.config import VarConfig
from metasyn.util import DistributionSpec

## Step 1: Loading the dataset

The first step to create synthetic data is to load your dataset into a DataFrame. For this tutorial, we will be using the [Titanic dataset](https://www.kaggle.com/c/titanic/data), which can easily be accessed through the metasyn `demo_file()` function. 

It is important to set the data types of columns in the DataFrame correctly, as this will help metasyn to infer the correct distributions for each variable later.


> **Note** 
> In this tutorial we use [Polars](https://pola.rs) to create the DataFrame, as that is what metasyn uses internally. Pandas is also supported, but will automatically be converted to Polars by metasyn. For best results it is recommended to use Polars.

In [None]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
data_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}

# create the DataFrame
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

We can check the data types of our DataFrame as follows:

In [None]:
dict(zip(df.columns, df.dtypes))

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can get more information on the DataFrame by calling `describe()` on it, which summarizes the distribution of each variable:

In [None]:
df.describe()

## Step 2: Generating a MetaFrame

Now that we have properly formatted our DataFrame, we can easily generate a MetaFrame for it. 
We'll do this without passing in any optional parameters, but later on in this tutorial we will cover how custom parameters can help provide control over the MetaFrame generation process. 

> **MetaFrames:**
> A MetaFrame is an object which captures the essential aspects of the dataset, including variable names, types, data types, the percentage of missing values, and distribution parameters. MetaFrame objects capture all the information needed to generate a synthetic dataset that aligns with the original dataset, without containing any *entries* of the original dataset.

More information on generating MetaFrames can be found on the metasyn docs ['generating metaframes'](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html) page.
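To make the idea of statistical metadata more concrete, here is a minimal conceptual sketch using only the Python standard library. It is not metasyn's implementation: the function `summarize_column`, the dictionary keys and the sample values are all made up for illustration. It only shows that a column can be summarized by its share of missing values and a few distribution parameters, without keeping any of the original entries.

```python
import statistics


def summarize_column(values):
    """Toy summary of a numeric column (hypothetical helper, not part of metasyn)."""
    present = [v for v in values if v is not None]
    return {
        # proportion of entries that are missing
        "prop_missing": 1 - len(present) / len(values),
        # parameters of a normal distribution estimated from the non-missing entries
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
    }


ages = [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0]  # made-up values
summary = summarize_column(ages)
print(summary)
```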

Generating a MetaFrame is done by calling the `MetaFrame.fit_dataframe()` class method, passing in the DataFrame as a parameter.

In [None]:
# Generate and fit a MetaFrame to the DataFrame 
mf = MetaFrame.fit_dataframe(df)

We can use the built-in Python `print` function to display the (statistical) metadata contained in the MetaFrame in an easy-to-read format:

In [None]:
print(mf)

## Step 3: Exporting the MetaFrame

After creating the MetaFrame, metasyn can serialize and export it into a GMF file using `mf.export()`, passing in the filepath as a parameter. 


> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format designed to contain statistical metadata for (tabular) datasets that has been designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

More information on exporting and importing MetaFrames can be found on the metasyn docs ['exporting and importing metaframes'](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html) page.
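Because the export is plain JSON, it can be audited with standard tools. The sketch below shows one way to do that using only the standard library. It writes a toy stand-in file and reads it back; the key names (`vars`, `name`, `prop_missing`, `distribution`) and all values are assumptions for illustration, so check a real exported file for the actual layout.

```python
import json
import tempfile
from pathlib import Path

# Toy stand-in for an exported file; key names and values are assumptions.
toy_metadata = {
    "n_rows": 100,
    "vars": [
        {"name": "Age", "prop_missing": 0.2, "distribution": {"name": "normal"}},
        {"name": "Sex", "prop_missing": 0.0, "distribution": {"name": "categorical"}},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_gmf.json"
    path.write_text(json.dumps(toy_metadata, indent=2))
    # Reading the file back is all that is needed to inspect it
    loaded = json.loads(path.read_text())

# Build a quick overview: which distribution is stored for which variable
overview = {var["name"]: var["distribution"]["name"] for var in loaded["vars"]}
print(overview)
```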

In [None]:
file_path = "example_gmf_titanic.json"

# Serialize and export the MetaFrame to a GMF file
mf.export(file_path)

The GMF file should now be saved to the specified filepath. Feel free to open and inspect it!

It's also possible to preview how the exported file would look, without actually saving it to disk. This can be done as follows:

In [None]:
# Get a preview of the GMF file (`repr()`) and print it (`print()`)
print(repr(mf))

A GMF file can be imported and loaded into a MetaFrame using the `MetaFrame.from_json()` class method, passing in the file path as a parameter. 

In [None]:
# Create a MetaFrame based on a GMF (.json) file
mf = MetaFrame.from_json(file_path)

## Step 4: Generating synthetic data

Once a MetaFrame is loaded, synthetic data can be generated from it. We can do so by using the `synthesize` method of the MetaFrame, passing in the number of rows the generated data should contain. This returns a DataFrame with the synthetic data.

More information on generating synthetic data based on MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html).
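Conceptually, synthesizing a column means drawing new values from the stored distribution and inserting missing values at the stored rate. The following standard-library sketch illustrates that idea for a single numeric column. It is not how metasyn is implemented: `synthesize_column`, the metadata keys and the parameter values are hypothetical.

```python
import random


def synthesize_column(meta, n_rows, seed=None):
    """Draw n_rows values from a toy metadata dictionary (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        # with probability prop_missing, emit a missing value; otherwise sample
        None if rng.random() < meta["prop_missing"] else rng.gauss(meta["mean"], meta["stdev"])
        for _ in range(n_rows)
    ]


age_meta = {"prop_missing": 0.2, "mean": 30.0, "stdev": 14.0}  # made-up parameters
synthetic_ages = synthesize_column(age_meta, n_rows=5, seed=1)
print(synthetic_ages)
```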

In [None]:
# generate synthetic data
syn_df = mf.synthesize(5)

We can now view the synthetic data:

In [None]:
syn_df

As you can see, the synthetic data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.

## Step 5: Improving the quality of the synthetic data

The `MetaFrame.fit_dataframe()` method gives you more control over how your synthetic dataset is generated through the optional `var_specs` (short for variable specifications) parameter. `var_specs` is a list of `VarConfig` objects that give metasyn instructions on a per-variable basis. These instructions can range from setting a variable to be unique to directly setting its distribution. 

### Spec: Setting variables to have unique values

During the MetaFrame generation at the start (using `MetaFrame.fit_dataframe()`), metasyn detected a column (`PassengerId`) as possibly unique, as indicated by the following warning:

> *"Variable PassengerId seems unique, but not set to be unique."*

This is because this column holds a unique identifier for each passenger, which is in fact unique to each passenger. As such, we want the synthetic data generated for this column to be unique as well. 
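The difference between unique and non-unique generation can be illustrated with the standard library: sampling with replacement may repeat values, while sampling without replacement cannot. This is only a conceptual sketch and not how metasyn implements uniqueness; the ID range and sample size are made up.

```python
import random

rng = random.Random(42)
id_pool = range(1, 1001)  # hypothetical pool of possible identifiers

# Sampling with replacement: the same ID may be drawn more than once
with_replacement = rng.choices(id_pool, k=500)

# Sampling without replacement: every drawn ID is distinct
without_replacement = rng.sample(id_pool, k=500)

print(len(set(with_replacement)), len(set(without_replacement)))
```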

To set a variable to be unique, we create a `VarConfig` for it and pass `DistributionSpec(unique=True)` as its `dist_spec`. We can do this for the `PassengerId` column as follows:


In [None]:
# First, we create a list with the variable specifications
var_spec = [VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True))]

# then, we pass that list as the `var_specs` argument
mf = MetaFrame.fit_dataframe(df, var_specs=var_spec)

# then, let's check what the metadata about PassengerId contains!
mf["PassengerId"].to_dict()

So let's check what is generated from this new MetaFrame:

In [None]:
mf.synthesize(5)

As you can see, the `PassengerId` column is now unique!

### Spec: Fake names (and other Faker data types)

Currently, the `Name` column is not synthesized very well. The reason is that the string type inference in metasyn is designed for structured strings (such as room numbers like `B1.09`, `B1.01` or `A1.08`), not for unstructured ones. However, metasyn supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which can fake many data types. Columns that use Faker are not based on the real data at all, so they do not disclose any information about it.

We can specify metasyn to use Faker names for the `Name` column as follows:

In [None]:
# Import the Faker-based distribution, then build the list of variable specifications
from metasyn.distribution import FakerDistribution

var_specs = [
    VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
    VarConfig(name="Name", dist_spec=FakerDistribution("name")),
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(5)

That already looks a lot better for the `Name` column!

### Spec: Setting distributions manually

Without user input, the distribution chosen for each variable is inferred by choosing the distribution with the best fit from all available distributions for the variable type. However, we can also manually specify which distribution to fit, or simply specify the distribution including the parameters for the variable.
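As a rough illustration of what "best fit" can mean, the sketch below fits two candidate distributions to made-up fare values by maximum likelihood and picks the one with the lowest Bayesian information criterion (BIC). The choice of BIC, the two candidates and the data are assumptions for illustration; metasyn's actual selection procedure and set of candidate distributions may differ.

```python
import math
import statistics


def normal_loglik(data):
    """Log-likelihood of the data under a normal distribution fitted by maximum likelihood."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # maximum-likelihood estimate of the standard deviation
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - squared_error / (2 * sigma**2)


def exponential_loglik(data):
    """Log-likelihood of the data under an exponential distribution fitted by maximum likelihood."""
    rate = 1 / statistics.fmean(data)
    return len(data) * math.log(rate) - rate * sum(data)


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2 * loglik


fares = [7.25, 71.28, 7.93, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07]  # made-up values
scores = {
    "normal": bic(normal_loglik(fares), n_params=2, n_obs=len(fares)),
    "exponential": bic(exponential_loglik(fares), n_params=1, n_obs=len(fares)),
}
best = min(scores, key=scores.get)
print(best, scores)
```

In metasyn itself this choice is made during fitting, and it can be overridden per variable.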

In [None]:
from metasyn.distribution import DiscreteUniformDistribution

var_specs = [
    VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
    VarConfig(name="Name", dist_spec=FakerDistribution("name")),
    VarConfig(name="Fare", dist_spec="LogNormalDistribution"), # fit a log-normal distribution, with parameters estimated from the data
    VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)) # fully specify a distribution for age (uniform between 20 and 40)
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(5)

### Spec: Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings based on regular expressions. For example, we see that most Cabins are structured like [A-F] and then 2 or 3 digit numbers. We can include this as follows:

In [None]:
from metasyn.distribution import RegexDistribution

# A regex distribution is created from a regular expression string that describes the structure of the values.
cabin_distribution = RegexDistribution(r"[A-F][0-9]{2,3}")  # The r prefix makes this a raw string.
# just for completeness: data generated from this distribution will always match the regex [A-F][0-9]{2,3}

var_specs = [
    VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
    VarConfig(name="Name", dist_spec=FakerDistribution("name")),
    VarConfig(name="Fare", dist_spec="LogNormalDistribution"), # fit a log-normal distribution, with parameters estimated from the data
    VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)), # fully specify a distribution for age (uniform between 20 and 40)
    VarConfig(name="Cabin", dist_spec=cabin_distribution), # Use the regex distribution for the cabin
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
mf.synthesize(10)
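Before committing to a pattern such as `[A-F][0-9]{2,3}`, it can help to check how many values actually follow it. The sketch below does this with the standard `re` module on a handful of made-up cabin values; the list is illustrative and not taken from the dataset.

```python
import re

cabin_pattern = re.compile(r"[A-F][0-9]{2,3}")

# Made-up example values, including some that do not follow the pattern
cabins = ["C85", "C123", "E46", "G6", "B28", "F G73", "D33", "A5"]

# fullmatch requires the whole string to follow the pattern, not just a part of it
matches = [cabin for cabin in cabins if cabin_pattern.fullmatch(cabin)]
share = len(matches) / len(cabins)

print(matches, share)
```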

## Step 6: Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [None]:
df.mean()

In [None]:
mf.synthesize(len(df)).mean()

Then, we can also see how many missing values are in each column:

In [None]:
df.null_count()

In [None]:
mf.synthesize(len(df)).null_count()

## Step 7: Adding descriptions to variables

With the data being taken care of, we can still do one last thing. We can add descriptions to the variables, to clarify what they mean. This can be particularly useful when sharing the `MetaFrame` or generated data with others, as it gives them more context to what they're working with.

One way of adding a description to a variable is to set it in its variable specification, by passing a `description` argument to `VarConfig`. For example, adding a description to the `Cabin` column can be done as follows:

In [None]:
var_specs = [
    # Ensure unique values for the `PassengerId` column
    VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),

    # Utilize the Faker library to synthesize realistic names for the `Name` column
    VarConfig(name="Name", dist_spec=FakerDistribution("name")),

    # Fit `Fare` to a log-normal distribution, but base the parameters on the data
    VarConfig(name="Fare", dist_spec="LogNormalDistribution"),

    # Set the `Age` column to a discrete uniform distribution ranging from 20 to 40
    VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)),

    # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
    VarConfig(name="Cabin", dist_spec=cabin_distribution, description="The cabin number of the passenger."),
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs) 

We can get a list of all the descriptions in the fitted `MetaFrame` by accessing its `descriptions` property, as follows:

In [None]:
print(mf.descriptions)

Instead of setting the description in the variable specification (which happens before fitting a `MetaFrame` to a `DataFrame`), we can assign a description to an already generated `MetaFrame` by directly setting a column's description attribute. For example, we can assign a description to the `PassengerId` column as follows:

In [None]:
mf["PassengerId"].description = "The ID of each passenger, as assigned by Pandas."

print(mf.descriptions)

We can also set multiple descriptions of an already generated `MetaFrame` at once by passing in a dictionary of descriptions to its `descriptions` property. For example, we can set descriptions for the `Age` and `Name` columns as follows:

In [None]:
mf.descriptions = {"Name": "Name of the passenger", "Age": "Age of the passenger in years"}

print(mf.descriptions)

Instead of a dictionary, it is also possible to pass in a list of descriptions to the `descriptions` property of a `MetaFrame`. 

This can only be done if the list has the same length as the number of variables. In other words, each description must be passed in. 

This can be useful for example when generating placeholder descriptions automatically through list comprehension, as is done in the following example:

In [None]:
mf.descriptions = [f"Placeholder description for {var.name}" for var in mf.meta_vars]

print(mf.descriptions)

## The end

That's it for this tutorial! You should now have a good understanding of how to use metasyn to generate synthetic data from a dataset. If you want to learn more, check out the [metasyn docs](https://metasynth.readthedocs.io/en/latest/).

If you have any questions, feel free to [reach out](https://metasynth.readthedocs.io/en/latest/about/contact.html).

