# Advanced Tutorial on metasyn

In this tutorial, we will create a `MetaFrame`, which is a metadata representation of a given dataset, and then generate synthetic data from it. In the process, we will walk through some of the advanced abilities of metasyn, such as handling dates, setting distributions and ensuring uniqueness in columns. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

## Step 0: Install the metasyn package and import required packages

In [None]:
# uncomment the following line and run the cell to install metasyn
# %pip install metasyn

In [None]:
# import required packages
import datetime as dt
import polars as pl
from metasyn import MetaFrame, demo_file

## Step 1: Transforming your data into a polars DataFrame

The first step in creating the MetaFrame is reading and converting your dataset to a polars DataFrame. 

In [None]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
data_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}

df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

Now, let's check the data types of our DataFrame:

In [None]:
dict(zip(df.columns, df.dtypes))

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can also inspect the data a bit more with `describe()`.

In [None]:
df.describe()

## Step 2: Creating a MetaFrame object from a DataFrame

Now that we have properly formatted our DataFrame, we can easily generate a MetaFrame for it. For now we'll do this using the default settings (i.e. without specifying any optional parameters). 

> **MetaFrames:**
> A **MetaFrame** is an object which captures the essential aspects of the dataset, including variable names, types, data types, the percentage of missing values, and distribution attributes. MetaFrame objects capture all the information needed to generate a synthetic dataset that aligns with the original dataset, without containing any *entries* of the original dataset.

More information on generating MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html).

In [None]:
mf = MetaFrame.fit_dataframe(df)

We can call the `print` function to display the (statistical metadata contained in the) MetaFrame in an easy-to-read format:

In [None]:
print(mf)

## Step 3: Exporting the MetaFrame

After creating the MetaFrame, we can serialize and export it to a GMF file by calling `mf.export()` with the file path as a parameter.


> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format designed to contain statistical metadata for (tabular) datasets that has been designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

More information on exporting and importing MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html).

In [None]:
file_path = "example_gmf_titanic.json"

# Serialize and export the MetaFrame to a GMF file
mf.export(file_path)

You can now open and read the GMF formatted .json file!
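
Because a GMF file is plain JSON, it can also be inspected with Python's built-in `json` module. The following is a minimal sketch, not a description of the official format: `example_gmf` is a hand-written, hypothetical stand-in for an exported file, and its key names (`vars`, `name`, `type`, `prop_missing`) as well as the helper `summarize_gmf` are assumptions made for illustration. Check your own exported file for the actual layout.

```python
import json

# Hypothetical, heavily simplified stand-in for the contents of a GMF file.
# The key names and values below are assumptions for illustration only.
example_gmf = json.dumps({
    "n_rows": 100,
    "vars": [
        {"name": "PassengerId", "type": "discrete", "prop_missing": 0.0},
        {"name": "Cabin", "type": "string", "prop_missing": 0.75},
    ],
})


def summarize_gmf(gmf_text):
    """Return (name, type, proportion missing) for every variable."""
    gmf = json.loads(gmf_text)
    return [(v["name"], v["type"], v["prop_missing"]) for v in gmf["vars"]]


for name, var_type, prop_missing in summarize_gmf(example_gmf):
    print(f"{name}: type={var_type}, missing={prop_missing:.0%}")
```

To apply the same idea to a real file, the text could be read with `open(file_path).read()` and the key names adjusted to match what the file actually contains.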

If you'd like to preview how the exported file would look, without saving it to disk, this can be done by using the `repr` function as follows:

In [None]:
gmf_preview = repr(mf)
print(gmf_preview)

A (previously exported) GMF file can be imported and loaded into a MetaFrame using the `MetaFrame.from_json()` class method, passing in the file path as a parameter. 

In [None]:
# Create a MetaFrame based on a GMF (.json) file
mf = MetaFrame.from_json(file_path)

## Step 4: Generating synthetic data from a MetaFrame

Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to generate as a parameter and returns a DataFrame with the synthetic data.

More information on generating synthetic data based on MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html).

In [None]:
# generate synthetic data
mf.synthesize(5)

As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.

## Step 5: Improving the quality of the synthetic data

### Set unique columns

One column (PassengerId) has been detected as possibly unique by metasyn, as indicated by the following warning:

> "Variable PassengerId seems unique, but not set to be unique."

This column holds a variable with unique passenger identifiers, so we do want the synthetic data generated for this column to be unique as well. We can add this to the metadata by creating a dictionary of options, which we call a `specification`, or `spec`:

In [None]:
# First, we create a specification dictionary for the variables
var_spec = {
    "PassengerId": {"unique": True}
}

# then, we add that dictionary as the `spec` argument
mf = MetaFrame.fit_dataframe(df, spec=var_spec)

# then, let's check what the metadata about PassengerId contains!
mf["PassengerId"].to_dict()

So let's check what is generated from this new MetaFrame:

In [None]:
mf.synthesize(5)

Now we see that the `PassengerId` column is correctly represented with increasing id numbers.
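
A quick way to confirm that a generated column contains no duplicates is to compare the number of distinct values with the number of rows. The sketch below is plain Python and uses hypothetical ID lists instead of real metasyn output; `all_unique` is an illustrative helper, not part of metasyn.

```python
def all_unique(values):
    """Return True if no value occurs more than once."""
    values = list(values)
    return len(set(values)) == len(values)


# Hypothetical examples of what a synthetic ID column might contain.
ids_without_unique = [3, 17, 3, 42, 8]  # the value 3 occurs twice
ids_with_unique = [1, 2, 3, 4, 5]       # every value occurs once

print(all_unique(ids_without_unique))
print(all_unique(ids_with_unique))
```

Assuming `synthesize` returns a polars DataFrame, as in this tutorial, the same check could be applied to real output with `all_unique(mf.synthesize(5)["PassengerId"].to_list())`.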

### Fake names (and others)

As one can see, the `Name` of the passengers is not quite so well synthesized. The reason is that the string type interpreter in metasyn is designed for `structured` strings (like room numbers such as `B1.09`, `B1.01` or `A1.08`) and not unstructured strings. However, metasyn supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which includes a lot of data types that it can fake. The columns using faker are not based on the real data at all so they do not disclose any info about the real data.

We fake names as follows:

In [None]:
# First, we create a specification dictionary for the variables
from metasyn.distribution import FakerDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)

That already looks a lot better for the `Name` column!

### Set distributions manually

Without user input, metasyn infers the distribution for each variable by choosing the best-fitting one among the distributions available for that variable type. However, we can also manually specify which distribution to fit, or even fully specify how the variable should be generated.

In [None]:
from metasyn.distribution import DiscreteUniformDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "LogNormalDistribution"}, # estimate / fit a log-normal distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)} # fully specify a distribution for age (uniform between 20 and 40)
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)

### Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings based on regular expressions. For example, we see that most Cabin values consist of a letter from A to F followed by a two- or three-digit number. We can include this as follows:

In [None]:
from metasyn.distribution import RegexDistribution

# Here, the regex distribution is created directly from a regular expression string.
cabin_distribution = RegexDistribution(r"[ABCDEF][0-9]{2,3}")  # The r prefix makes this a raw string.
# just for completeness: data generated from this distribution will always match the regex [ABCDEF]?(\d{2,3})?

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)
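
To get a feel for which strings this pattern accepts, it can be tried out with Python's built-in `re` module, independently of metasyn. This is only a sketch: the candidate values are hypothetical, and `matches_cabin_format` is an illustrative helper rather than a metasyn function.

```python
import re

# Same pattern as the one passed to RegexDistribution above.
cabin_pattern = re.compile(r"[ABCDEF][0-9]{2,3}")


def matches_cabin_format(value):
    """Return True if the whole string is one letter A-F plus 2 or 3 digits."""
    return cabin_pattern.fullmatch(value) is not None


# Hypothetical cabin values used only to illustrate the pattern.
for candidate in ["B45", "C123", "G12", "A1", "D1234"]:
    print(candidate, matches_cabin_format(candidate))
```

Using `fullmatch` rather than `search` ensures that the entire string must fit the pattern, so values with an extra digit or a letter outside A to F are rejected.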

## Step 6: Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [None]:
df.mean()

In [None]:
mf.synthesize(len(df)).mean()

Then, we can also see how many missing values there are in each column:

In [None]:
df.null_count()

In [None]:
mf.synthesize(len(df)).null_count()
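
Because synthesis is random, the two sets of counts will rarely match exactly, so it can help to look at the difference in the proportion of missing values per column. The following is a minimal sketch in plain Python: the counts are made up and merely stand in for the output of `null_count()`, and `missing_proportion_diff` is a hypothetical helper.

```python
def missing_proportion_diff(real_counts, synthetic_counts, n_rows):
    """Per column: synthetic minus real proportion of missing values."""
    return {
        column: (synthetic_counts[column] - real_counts[column]) / n_rows
        for column in real_counts
    }


# Made-up null counts for a dataset with 100 rows (not real results).
real_nulls = {"Age": 20, "Cabin": 75, "Fare": 0}
synthetic_nulls = {"Age": 23, "Cabin": 71, "Fare": 0}

diffs = missing_proportion_diff(real_nulls, synthetic_nulls, n_rows=100)
for column, diff in diffs.items():
    print(f"{column}: {diff:+.2f}")
```

Differences close to zero suggest that the synthetic data reproduces the missingness of the original column reasonably well.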

## Step 7: Adding descriptions to variables

With the data taken care of, there is one last thing we can do: add descriptions to the variables to clarify what they mean. This can be particularly useful when sharing the `MetaFrame` or generated data with others, as it gives them more context about what they're working with.

A description can be specified for each variable by adding a `description` key to the specification dictionary of that variable before creating a `MetaFrame`. For example, adding a description to the `Cabin` column can be done as follows:

In [None]:
var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution, "description": "The cabin number of the passenger."},
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec) 

We can get a list of all the descriptions in the fitted `MetaFrame` by accessing its `descriptions` property, as follows:

In [None]:
print(mf.descriptions)

Instead of setting the description in the variable specification (which happens before fitting a `MetaFrame` to a `DataFrame`), we can assign a description to an already generated `MetaFrame` by directly setting a column's description attribute. For example, we can assign a description to the `PassengerId` column as follows:

In [None]:
mf["PassengerId"].description = "The ID of each passenger, as assigned by Pandas."

print(mf.descriptions)

We can also set multiple descriptions of an already generated `MetaFrame` at once by passing in a dictionary of descriptions to its `descriptions` property. For example, we can set descriptions for the `Age` and `Name` columns as follows:

In [None]:
mf.descriptions = {"Name": "Name of the passenger", "Age": "Age of the passenger in years"}

print(mf.descriptions)

Instead of a dictionary, it is also possible to pass in a list of descriptions to the `descriptions` property of a `MetaFrame`. 

This can only be done if the list has the same length as the number of variables. In other words, a description must be provided for every variable.

This can be useful for example when generating placeholder descriptions automatically through list comprehension, as is done in the following example:

In [None]:
mf.descriptions = [f"Placeholder description for {var.name}" for var in mf.meta_vars]

print(mf.descriptions)