# Advanced Tutorial on metasyn

In this tutorial, we will create a `MetaFrame` (a metadata representation of a given dataset) and then generate synthetic data from it.

Later in the tutorial, we cover some advanced features of metasyn, such as handling dates, setting distributions, ensuring uniqueness in columns and adding variable descriptions.

## Step 0: Install the metasyn package and import required packages

First, let's install the metasyn package.

In [None]:
# Run the following line to install metasyn
%pip install metasyn

Now, let's import the required packages.

In [3]:
# import required packages
import datetime as dt
import polars as pl
from metasyn import MetaFrame, demo_file

## Step 1: Loading the dataset

Before we can create a MetaFrame, we need to load the dataset into a DataFrame. For this tutorial, we will be using the [Titanic dataset](https://www.kaggle.com/c/titanic/data), which can easily be accessed through the metasyn `demo_file()` function. 

It is important to set the data types of columns in the DataFrame correctly, as this will help metasyn to infer the correct distributions for each variable later.


> **Note** 
> In this tutorial we use [Polars](https://pola.rs) to create the DataFrame, as that is what metasyn uses internally. Pandas is also supported, but will automatically be converted to Polars by metasyn. For best results it is recommended to use Polars.

In [4]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
data_types = {
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}

# create the DataFrame
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Braund, Mr. Ow…","""male""",22,0,"""A/5 21171""",7.25,,"""S""",1937-10-28,15:53:04,2022-08-05 04:43:34,
2,"""Cumings, Mrs. …","""female""",38,0,"""PC 17599""",71.2833,"""C85""","""C""",,12:26:00,2022-08-07 01:56:33,
3,"""Heikkinen, Mis…","""female""",26,0,"""STON/O2. 31012…",7.925,,"""S""",1931-09-24,16:08:25,2022-08-04 20:27:37,
4,"""Futrelle, Mrs.…","""female""",35,0,"""113803""",53.1,"""C123""","""S""",1936-11-30,,2022-08-07 07:05:55,
5,"""Allen, Mr. Wil…","""male""",35,0,"""373450""",8.05,,"""S""",1918-11-07,10:59:08,2022-08-02 15:13:34,


We can check the data types of our DataFrame as follows:

In [5]:
dict(zip(df.columns, df.dtypes))

{'PassengerId': Int64,
 'Name': Utf8,
 'Sex': Categorical,
 'Age': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Categorical,
 'Birthday': Date,
 'Board time': Time,
 'Married since': Datetime(time_unit='us', time_zone=None),
 'all_NA': Utf8}

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can get more information on the DataFrame by calling `describe()` on it, which summarizes the distribution of each variable:

In [11]:
df.describe()

describe,PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
str,f64,str,str,f64,f64,str,f64,str,str,str,str,str,str
"""count""",891.0,"""891""","""891""",891.0,891.0,"""891""",891.0,"""891""","""891""","""891""","""891""","""891""","""891"""
"""null_count""",0.0,"""0""","""0""",177.0,0.0,"""0""",0.0,"""687""","""2""","""78""","""79""","""92""","""891"""
"""mean""",446.0,,,29.693277,0.381594,,32.204208,,,,,,
"""std""",257.353842,,,14.524527,0.806057,,49.693429,,,,,,
"""min""",1.0,"""Abbing, Mr. An…",,0.0,0.0,"""110152""",0.0,"""A10""",,"""1903-07-28""","""10:39:40""","""2022-07-15 12:…",
"""25%""",223.0,,,20.0,0.0,,7.8958,,,,,,
"""50%""",446.0,,,28.0,0.0,,14.4542,,,,,,
"""75%""",669.0,,,38.0,0.0,,31.0,,,,,,
"""max""",891.0,"""van Melkebeke,…",,80.0,6.0,"""WE/P 5735""",512.3292,"""T""",,"""1940-05-27""","""18:39:28""","""2022-08-15 10:…",


## Step 2: Generating a MetaFrame

Now that we have properly formatted our DataFrame, we can easily generate a MetaFrame for it. 
For now we'll do this without passing in any optional parameters, but later on in this tutorial we will cover how custom parameters can help provide control over the MetaFrame generation process. 

> **MetaFrames:**
> A **MetaFrame** is an object which captures the essential aspects of the dataset, including variable names, types, data types, the percentage of missing values, and distribution attributes. MetaFrame objects capture all the information needed to generate a synthetic dataset that aligns with the original dataset, without containing any *entries* of the original dataset.

More information on generating MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html).

Generating a MetaFrame is done by calling the `MetaFrame.fit_dataframe()` class method, passing in the DataFrame as a parameter.

In [12]:
# Generate and fit a MetaFrame to the DataFrame 
mf = MetaFrame.fit_dataframe(df)

Variable PassengerId seems unique, but not set to be unique.


We can use Python's built-in `print` function to display the statistical metadata contained in the MetaFrame in an easy-to-read format:

In [14]:
print(mf)

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.discrete_uniform
	- Provenance: builtin
	- Parameters:
		- low: 1
		- high: 892
	

Column 2: "Name"
- Variable Type: string
- Data Type: Utf8
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.regex
	- Provenance: builtin
	- Parameters:
		- regex: [A-Z][a-z]{2,9}[,][ ][M][a-z]{1,5}[\.][ ][A-Z][a-z]{3,8}(|[ ][A-Z](|[a-z]{3,8}))
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	- Parameters:
		- labels: [ 0  1  2  3  4  5  6  7  8  9 10 11
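
As a quick cross-check of the printed metadata, the proportion of missing values for `Age` can be reproduced from the `describe()` table in Step 1, which reports 177 nulls out of 891 rows. This is a minimal sketch in plain Python; it does not call metasyn.

```python
# Values taken from the describe() output in Step 1
n_rows = 891
age_null_count = 177

# Proportion of missing values, formatted to four decimals as in print(mf)
prop_missing = age_null_count / n_rows
print(f"{prop_missing:.4f}")  # 0.1987
```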

## Step 3: Exporting the MetaFrame

After creating the MetaFrame, metasyn can serialize and export it to a GMF file using `mf.export()`, with the file path passed in as a parameter.


> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format for statistical metadata of (tabular) datasets that is designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

More information on exporting and importing MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html).

In [None]:
file_path = "example_gmf_titanic.json"

# Serialize and export the MetaFrame to a GMF file
mf.export(file_path)

The GMF file should now be saved to the specified file path. Feel free to open and inspect it!
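
Because a GMF file is plain JSON, it can also be inspected with Python's standard `json` module. The sketch below assumes nothing about the GMF schema beyond the top level being a JSON object: it writes a tiny stand-in file (whose keys are hypothetical and not the real GMF layout) and lists its top-level keys. The helper name `summarize_json` is also made up for this example.

```python
import json
import tempfile
from pathlib import Path


def summarize_json(path):
    """Return the top-level keys of a JSON file and the type of each value."""
    with open(path, encoding="utf-8") as handle:
        content = json.load(handle)
    return {key: type(value).__name__ for key, value in content.items()}


# Stand-in file for illustration only; these keys are hypothetical
# and do not claim to mirror the real GMF schema.
stand_in = {"example_count": 3, "example_items": [{"name": "a"}]}
tmp_path = Path(tempfile.mkdtemp()) / "stand_in.json"
tmp_path.write_text(json.dumps(stand_in), encoding="utf-8")

summary = summarize_json(tmp_path)
print(summary)
```

To look at the real export, call `summarize_json(file_path)` with the path used in the cell above.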

It's also possible to preview how the exported file would look, without actually saving it to disk. This can be done as follows:

In [None]:
# Get a preview of the GMF file
gmf_preview = repr(mf)

# Print the preview
print(gmf_preview)

A (previously exported) GMF file can be imported and loaded into a MetaFrame using the `MetaFrame.from_json()` class method, passing in the file path as a parameter. 

In [None]:
# Create a MetaFrame based on a GMF (.json) file
mf = MetaFrame.from_json(file_path)

## Step 4: Generating synthetic data

Once a MetaFrame is loaded, synthetic data can be generated from it. We can do so by calling the `synthesize()` method on the MetaFrame, passing in the number of rows the generated data should contain. This returns a DataFrame with the synthetic data.

More information on generating synthetic data based on MetaFrames can be found on the metasyn docs, [here](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html).

In [15]:
# generate synthetic data
syn_df = mf.synthesize(5)

We can now view the synthetic data:

In [17]:
syn_df

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,f32,cat,date,time,datetime[μs],f32
359,"""Uutbw, Mnbna. …","""male""",,0,"""5756""",107.655552,,"""S""",,12:40:53,2022-08-12 04:05:10,
292,"""Nolrm, Muv. Yv…","""male""",24.0,2,"""HZ 6428""",15.318505,,"""S""",1933-01-30,13:38:39,2022-07-27 10:05:07,
465,"""Wpjblaxepi, Mp…","""female""",,0,"""OZDF 118563""",28.148854,,"""S""",1923-01-31,11:07:23,2022-08-08 02:29:35,
880,"""Pudn, Mh. Uazd…","""female""",54.0,2,"""829558""",31.847405,,"""S""",1924-05-21,18:27:22,2022-08-01 16:09:40,
532,"""Rnsw, Mndiqd. …","""male""",30.0,1,"""258592""",28.991591,,"""S""",1925-12-08,11:15:30,2022-07-16 06:39:34,


As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.

## Step 5: Improving the quality of the synthetic data

### Set unique columns

One column (PassengerId) has been detected as possibly unique by metasyn, as indicated by the following warning:

> "Variable PassengerId seems unique, but not set to be unique."

This column holds unique passenger identifiers, so we do want the synthetic data generated for this column to be unique as well. We can add this to the metadata by creating a dictionary of options, which we call a `specification`, or `spec`:

In [None]:
# First, we create a specification dictionary for the variables
var_spec = {
    "PassengerId": {"unique": True}
}

# then, we add that dictionary as the `spec` argument
mf = MetaFrame.fit_dataframe(df, spec=var_spec)

# then, let's check what the metadata about PassengerId contains!
mf["PassengerId"].to_dict()

So let's check what is generated from this new MetaFrame:

In [None]:
mf.synthesize(5)

Now we see that the `PassengerId` column is correctly represented with increasing id numbers.
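
To see why the `unique` setting matters, consider what happens when 891 IDs are drawn independently from the range 1 to 891: duplicates are practically guaranteed. The sketch below illustrates this with the standard library only. It is not metasyn's implementation, and `n_ids` simply reuses the row count of the dataset.

```python
import random

n_ids = 891  # number of rows in the Titanic demo data
rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Independent draws (with replacement): duplicates are almost certain
independent_ids = [rng.randint(1, n_ids) for _ in range(n_ids)]

# Draws without replacement: every ID occurs exactly once
unique_ids = rng.sample(range(1, n_ids + 1), n_ids)

print(len(set(independent_ids)) < n_ids)  # True: some IDs repeat
print(len(set(unique_ids)) == n_ids)      # True: all IDs are distinct
```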

### Fake names (and others)

As one can see, the `Name` column is not synthesized very well. The reason is that the string type interpreter in metasyn is designed for structured strings (like room numbers such as `B1.09`, `B1.01` or `A1.08`) and not for unstructured strings. However, metasyn supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which can generate many kinds of fake data. Columns that use faker are not based on the real data at all, so they do not disclose any information about it.

We fake names as follows:

In [None]:
# First, we create a specification dictionary for the variables
from metasyn.distribution import FakerDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)

That already looks a lot better for the `Name` column!

### Set distributions manually

Without user input, the distribution for each variable is inferred by choosing the best-fitting of the distributions available for the variable type. However, we can also manually specify which distribution to fit, or even fully specify how the variable should be generated.

In [None]:
from metasyn.distribution import DiscreteUniformDistribution

var_spec = {
    "PassengerId": {"unique": True},
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "LogNormalDistribution"}, # estimate / fit a log-normal distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)} # fully specify a distribution for age (uniform between 20 and 40)
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)
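
To make the difference between fitting and fully specifying a distribution more concrete, the sketch below estimates log-normal parameters from the five `Fare` values shown in Step 1, using only the standard library. It illustrates maximum-likelihood fitting in general and is not metasyn's internal code; the names `mu` and `sigma` are chosen here for the example.

```python
import math
import random
import statistics

# Fare values from the first five rows shown in Step 1
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

# A log-normal variable has a normally distributed logarithm, so the
# maximum-likelihood fit is the mean and population std of log(fare).
# Zero fares would have to be excluded first, since log(0) is undefined.
log_fares = [math.log(fare) for fare in fares]
mu = statistics.fmean(log_fares)
sigma = statistics.pstdev(log_fares)

# Draw synthetic fares from the fitted distribution
rng = random.Random(0)
synthetic_fares = [rng.lognormvariate(mu, sigma) for _ in range(5)]
print(round(mu, 3), round(sigma, 3))
print([round(fare, 2) for fare in synthetic_fares])
```

By contrast, a fully specified distribution such as the uniform one used for `Age` involves no estimation step: its parameters are fixed by hand and the data is not consulted.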

### Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings based on regular expressions. For example, we see that most cabins consist of a letter from A to F followed by a two- or three-digit number. We can include this as follows:

In [None]:
from metasyn.distribution import RegexDistribution

# A regex distribution can be created directly from a regular expression string.
cabin_distribution = RegexDistribution(r"[ABCDEF][0-9]{2,3}")  # The r prefix makes this a raw string, so backslashes are kept literally.
# Data generated from this distribution will always match the regex [ABCDEF][0-9]{2,3}

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)
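
Before committing to a pattern, it can help to check how well it matches values seen in the data. The sketch below tests the cabin pattern against the cabin values visible in the Step 1 output (`C85`, `C123`, `A10` and `T`) with Python's `re` module. It does not call metasyn.

```python
import re

cabin_pattern = re.compile(r"[ABCDEF][0-9]{2,3}")

# Cabin values that appear in the Step 1 output
observed_cabins = ["C85", "C123", "A10", "T"]

# fullmatch requires the whole string to fit the pattern
results = {
    cabin: cabin_pattern.fullmatch(cabin) is not None
    for cabin in observed_cabins
}
for cabin, matched in results.items():
    print(f"{cabin}: {matched}")
```

The value `T` does not match, which shows that a hand-written pattern is a simplification: every synthetic cabin will follow the pattern even though a few real ones do not.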

## Step 6: Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [None]:
df.mean()

In [None]:
mf.synthesize(len(df)).mean()

Then, we can also see how many missing values are in each column:

In [None]:
df.null_count()

In [None]:
mf.synthesize(len(df)).null_count()
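
When comparing null counts, small differences are expected because the synthetic data is randomly generated. As a rough guide, the sketch below computes how much the synthetic null count for `Age` could vary, assuming each value is made missing independently with the fitted proportion. That independence is an assumption made here for illustration, not a documented guarantee of metasyn.

```python
import math

n_rows = 891
age_nulls_real = 177  # from the describe() output in Step 1

# Binomial mean and standard deviation of the null count
p_missing = age_nulls_real / n_rows
expected_nulls = n_rows * p_missing
std_nulls = math.sqrt(n_rows * p_missing * (1 - p_missing))

print(round(expected_nulls), round(std_nulls, 1))  # 177 11.9
```

Under this assumption, a synthetic `Age` null count within about two standard deviations of 177 (roughly 153 to 201) would be unremarkable.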

## Step 7: Adding descriptions to variables

With the data being taken care of, we can still do one last thing. We can add descriptions to the variables, to clarify what they mean. This can be particularly useful when sharing the `MetaFrame` or generated data with others, as it gives them more context to what they're working with.

It is possible to specify a description for each variable. This can be done by adding a `description` key to the specification dictionary of a variable before creating a `MetaFrame`. For example, adding a description to the `Cabin` column can be done as follows:

In [None]:
var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution, "description": "The cabin number of the passenger."},
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec) 

We can get a list of all the descriptions in the fitted `MetaFrame` by accessing its `descriptions` property, as follows:

In [None]:
print(mf.descriptions)

Instead of setting the description in the variable specification (which happens before fitting a `MetaFrame` to a `DataFrame`), we can assign a description to an already generated `MetaFrame` by directly setting a column's description attribute. For example, we can assign a description to the `PassengerId` column as follows:

In [None]:
mf["PassengerId"].description = "The ID of each passenger, as assigned by Pandas."

print(mf.descriptions)

We can also set multiple descriptions of an already generated `MetaFrame` at once by passing in a dictionary of descriptions to its `descriptions` property. For example, we can set descriptions for the `Age` and `Name` columns as follows:

In [None]:
mf.descriptions = {"Name": "Name of the passenger", "Age": "Age of the passenger in years"}

print(mf.descriptions)

Instead of a dictionary, it is also possible to pass in a list of descriptions to the `descriptions` property of a `MetaFrame`. 

This only works if the list contains one description per variable, so its length must equal the number of variables.

This can be useful for example when generating placeholder descriptions automatically through list comprehension, as is done in the following example:

In [None]:
mf.descriptions = [f"Placeholder description for {var.name}" for var in mf.meta_vars]

print(mf.descriptions)