# Advanced Tutorial on metasyn

In this tutorial, we will create a `MetaFrame`, which is a metadata representation of a given dataset, and then generate synthetic data from it. Along the way, we will walk through some of the more advanced features of metasyn, such as handling dates, setting distributions and ensuring uniqueness in columns. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

## Step 0: Install the metasyn package and import required packages

In [3]:
# uncomment the following line and run the cell to install metasyn
# %pip install metasyn

In [4]:
# import required packages
import datetime as dt
import polars as pl
from metasyn import MetaFrame, demo_file

## Step 1: Transforming your data into a polars DataFrame

The first step in creating the MetaFrame is reading and converting your dataset to a polars DataFrame. 

In [5]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
data_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}

df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Braund, Mr. Ow…","""male""",22,0,"""A/5 21171""",7.25,,"""S""",1937-10-28,15:53:04,2022-08-05 04:43:34,
2,"""Cumings, Mrs. …","""female""",38,0,"""PC 17599""",71.2833,"""C85""","""C""",,12:26:00,2022-08-07 01:56:33,
3,"""Heikkinen, Mis…","""female""",26,0,"""STON/O2. 31012…",7.925,,"""S""",1931-09-24,16:08:25,2022-08-04 20:27:37,
4,"""Futrelle, Mrs.…","""female""",35,0,"""113803""",53.1,"""C123""","""S""",1936-11-30,,2022-08-07 07:05:55,
5,"""Allen, Mr. Wil…","""male""",35,0,"""373450""",8.05,,"""S""",1918-11-07,10:59:08,2022-08-02 15:13:34,


Now, let's check the data types of our DataFrame:

In [6]:
dict(zip(df.columns, df.dtypes))

{'PassengerId': Int64,
 'Name': Utf8,
 'Sex': Categorical,
 'Age': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Categorical,
 'Birthday': Date,
 'Board time': Time,
 'Married since': Datetime(time_unit='us', time_zone=None),
 'all_NA': Utf8}

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can also inspect the data a bit more with `describe()`.
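
The three temporal columns end up with three different types: `Date`, `Time` and `Datetime`. As a rough standard-library analogy (this is not how polars parses the file internally), the values in the first row correspond to the following Python objects:

```python
import datetime as dt

# Values copied from the first row of the demo data shown above.
birthday = dt.date.fromisoformat("1937-10-28")                    # polars Date
board_time = dt.time.fromisoformat("15:53:04")                    # polars Time
married_since = dt.datetime.fromisoformat("2022-08-05 04:43:34")  # polars Datetime

print(type(birthday).__name__, type(board_time).__name__, type(married_since).__name__)
```

Keeping these types distinct matters later on, because the distribution fitted to a column depends on its type.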

In [7]:
df.describe()

describe,PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
str,f64,str,str,f64,f64,str,f64,str,str,str,str,str,str
"""count""",891.0,"""891""","""891""",891.0,891.0,"""891""",891.0,"""891""","""891""","""891""","""891""","""891""","""891"""
"""null_count""",0.0,"""0""","""0""",177.0,0.0,"""0""",0.0,"""687""","""2""","""78""","""79""","""92""","""891"""
"""mean""",446.0,,,29.693277,0.381594,,32.204208,,,,,,
"""std""",257.353842,,,14.524527,0.806057,,49.693429,,,,,,
"""min""",1.0,"""Abbing, Mr. An…",,0.0,0.0,"""110152""",0.0,"""A10""",,"""1903-07-28""","""10:39:40""","""2022-07-15 12:…",
"""max""",891.0,"""van Melkebeke,…",,80.0,6.0,"""WE/P 5735""",512.3292,"""T""",,"""1940-05-27""","""18:39:28""","""2022-08-15 10:…",
"""median""",446.0,,,28.0,0.0,,14.4542,,,,,,
"""25%""",223.0,,,20.0,0.0,,7.8958,,,,,,
"""75%""",669.0,,,38.0,0.0,,31.0,,,,,,


## Step 2: Creating a MetaFrame object from a DataFrame

Now a lot of work has already gone into creating a properly formatted DataFrame. This work pays off at this stage: let's convert the DataFrame to a MetaFrame structure with the default options. Note: this takes a little bit of time!

In [8]:
mf = MetaFrame.fit_dataframe(df)

Variable PassengerId seems unique, but not set to be unique.



Then, we can simply print the MetaFrame to display it in an easy-to-read format:

In [9]:
print(mf)

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.discrete_uniform
	- Provenance: builtin
	- Parameters:
		- low: 1
		- high: 892
	

Column 2: "Name"
- Variable Type: string
- Data Type: Utf8
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.regex
	- Provenance: builtin
	- Parameters:
		- regex: [A-Z][a-z]{2,9}[,][ ][M]((|[a][s][t][e])[r][\.][ ][A-Z][a-z]{3,7}(|[ ][A-Z][a-z]{3,7})|[i][s]{2,2}[\.][ ][A-Z][a-z]{3,8}(|[ ][A-Z][a-z]{4,7}))
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	

## Step 3: Exporting the MetaFrame

After creating the MetaFrame, metasyn can serialize and export it into a GMF file.


> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format designed to contain statistical metadata for (tabular) datasets that has been designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

In [11]:
# save the metadata to a file
file_path = "example_gmf_titanic.json"
mf.export(file_path)

# you can now open and read the json file!

Alternatively, we can preview what the exported file would look like without saving it to disk:

In [None]:
gmf_preview = repr(mf)
print(gmf_preview)
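
Because a GMF file is plain JSON, it can be inspected with nothing more than the standard library. The sketch below uses a small hand-written stand-in instead of a real export. The per-variable keys mirror the `to_dict()` output shown later in this tutorial, but the top-level keys (`n_rows`, `n_columns`, `vars`) are assumptions and may differ from what your metasyn version writes.

```python
import json

# Hypothetical, hand-written stand-in for an exported file.
# Top-level keys ("n_rows", "n_columns", "vars") are assumptions.
gmf_text = json.dumps({
    "n_rows": 891,
    "n_columns": 13,
    "vars": [
        {
            "name": "PassengerId",
            "type": "discrete",
            "dtype": "Int64",
            "prop_missing": 0.0,
            "distribution": {
                "implements": "core.discrete_uniform",
                "provenance": "builtin",
                "parameters": {"low": 1, "high": 892},
            },
        }
    ],
})

# Parse the JSON text and summarise each variable.
gmf = json.loads(gmf_text)
summary = [
    (var["name"], var["distribution"]["implements"], var["prop_missing"])
    for var in gmf["vars"]
]
for name, dist, prop_missing in summary:
    print(f"{name}: {dist} (missing: {prop_missing:.4f})")
```

With a real export you would replace `gmf_text` by the contents of the exported file and check the key names against what is actually in it.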

## Step 4: Generating synthetic data from the metadata

A previously exported MetaFrame (.json) file can be loaded into a MetaFrame object. 

In [12]:
# load previously exported MetaFrame (.json) file
mf = MetaFrame.from_json(file_path)

Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to generate as a parameter and returns a DataFrame with the synthetic data.

In [13]:
# generate synthetic data
mf.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,f32,cat,date,time,datetime[μs],f32
44,"""Agptbme, Miss.…","""male""",17.0,0,"""60775""",88.179267,,"""S""",,18:11:28,2022-08-11 02:03:31,
655,"""Atzljlwzp, Mas…","""female""",,1,"""598738""",8.664989,,"""C""",1914-06-10,16:58:14,2022-08-02 08:13:51,
111,"""Sir, Mr. Xzzqe…","""male""",26.0,0,"""PC 077395""",47.715067,,"""S""",1936-06-18,17:40:49,2022-08-10 14:20:01,
523,"""Ftmq, Mr. Wjlt…","""male""",34.0,2,"""PC 4541""",29.418323,,"""C""",,18:25:02,2022-07-22 19:37:23,
426,"""Ewxlihlpq, Mr.…","""male""",24.0,0,"""2337""",57.856906,,"""S""",1936-11-12,11:29:18,2022-07-19 18:40:49,


As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.
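
To make the sampling step more concrete, here is a minimal conceptual sketch, assuming each column is drawn on its own from its fitted distribution and values are blanked out according to the proportion of missing values. It is not metasyn's implementation: `sample_column` is a hypothetical helper, and treating `high` as an exclusive upper bound is an assumption based on `high: 892` in the printout above.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def sample_column(draw, prop_missing, n_rows):
    """Draw n_rows values; each one becomes None with probability prop_missing."""
    return [None if rng.random() < prop_missing else draw() for _ in range(n_rows)]

# Parameters copied from the MetaFrame printout above.
sex = sample_column(
    lambda: rng.choices(["female", "male"], weights=[0.35241302, 0.64758698])[0],
    prop_missing=0.0,
    n_rows=5,
)
# PassengerId: discrete uniform with low=1, high=892 (upper bound assumed exclusive).
passenger_id = sample_column(lambda: rng.randrange(1, 892), prop_missing=0.0, n_rows=5)

print(list(zip(passenger_id, sex)))
```

Nothing in this sketch prevents the same `PassengerId` from being drawn twice, which is the motivation for the uniqueness setting discussed in the next step.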

## Step 5: Improving the quality of the synthetic data

### Set unique columns

One column (PassengerId) has been detected as possibly unique by metasyn, as indicated by the following warning:

> "Variable PassengerId seems unique, but not set to be unique."

This column holds unique passenger identifiers, so we do want the synthetic data generated for this column to be unique as well. We can add this to the metadata by creating a dictionary of options, which we call a `specification`, or `spec`:

In [14]:
# First, we create a specification dictionary for the variables
var_spec = {
    "PassengerId": {"unique": True}
}

# then, we add that dictionary as the `spec` argument
mf = MetaFrame.fit_dataframe(df, spec=var_spec)

# then, let's check what the metadata about PassengerId contains!
mf["PassengerId"].to_dict()

{'name': 'PassengerId',
 'type': 'discrete',
 'dtype': 'Int64',
 'prop_missing': 0.0,
 'distribution': {'implements': 'core.unique_key',
  'version': '1.0',
  'provenance': 'builtin',
  'class_name': 'UniqueKeyDistribution',
  'parameters': {'low': 1, 'consecutive': 1}}}

So let's check what is generated from this new MetaFrame:

In [15]:
mf.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,f32,cat,date,time,datetime[μs],f32
1,"""Yefogmjpea, Mi…","""male""",17,0,"""115936""",25.237458,,"""C""",1924-05-13,13:56:50,2022-07-26 08:22:00,
2,"""Pxfibl, Mr. If…","""male""",25,0,"""PC 8109""",24.613732,,"""S""",1927-04-15,,2022-07-25 18:00:04,
3,"""Zkpidj, Mr. Gn…","""male""",70,0,"""873293""",98.012502,,"""S""",1911-02-03,11:40:45,2022-07-25 18:05:10,
4,"""Qmmcsyf, Miss.…","""male""",38,0,"""529583""",7.802801,,"""C""",1921-12-22,,2022-07-22 18:25:04,
5,"""Sdg, Miss. Fiw…","""male""",26,0,"""64638""",9.796996,,"""S""",1932-11-17,12:32:26,2022-08-02 02:25:18,


Now we see that the `PassengerId` column is correctly represented with increasing id numbers.
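
The `core.unique_key` parameters shown above (`low: 1`, `consecutive: 1`) are consistent with keys that start at `low` and count upward. A conceptual sketch of that consecutive case follows; `consecutive_keys` is a hypothetical helper, not part of metasyn, and the non-consecutive case is not covered.

```python
from itertools import count, islice

def consecutive_keys(low, n_rows):
    """Return n_rows unique integer keys: low, low + 1, low + 2, ..."""
    return list(islice(count(low), n_rows))

keys = consecutive_keys(low=1, n_rows=5)
print(keys)  # [1, 2, 3, 4, 5]
```

Uniqueness is guaranteed by construction here, whereas independent draws from a uniform distribution would only be unique by chance.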

### Fake names (and others)

As one can see, the `Name` of the passengers is not quite so well synthesized. The reason is that the string type interpreter in metasyn is designed for `structured` strings (like room numbers such as `B1.09`, `B1.01` or `A1.08`) and not unstructured strings. However, metasyn supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which includes a lot of data types that it can fake. The columns using faker are not based on the real data at all so they do not disclose any info about the real data.

We fake names as follows:

In [16]:
# First, we create a specification dictionary for the variables
from metasyn.distribution import FakerDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,f32,cat,date,time,datetime[μs],f32
1,"""Jessica Rodrig…","""female""",29.0,0,"""89526""",3.406816,,"""S""",1931-03-25,14:20:23,2022-07-25 09:18:33,
2,"""Ryan Hunt""","""male""",30.0,0,"""909714""",17.027656,,"""C""",1915-09-07,,2022-07-26 10:32:42,
3,"""Lauren Webb""","""female""",,0,"""19210""",96.355455,,"""S""",1922-11-20,14:00:11,2022-08-12 07:52:11,
4,"""Derek Carpente…","""male""",7.0,1,"""PC 10949""",14.2708,,"""S""",1921-07-19,,2022-07-27 06:21:23,
5,"""David Payne""","""male""",,0,"""494771""",26.470742,,"""S""",1924-08-09,10:43:13,2022-07-17 02:25:01,


That already looks a lot better for the `Name` column!

### Set distributions manually

Without user input, metasyn infers the distribution for each variable by choosing the best-fitting distribution among those available for the variable type. However, we can also manually specify which distribution to fit, or even fully specify how the variable should be generated.

In [17]:
from metasyn.distribution import DiscreteUniformDistribution

var_spec = {
    "PassengerId": {"unique": True},
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "LogNormalDistribution"}, # estimate / fit a log-normal distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)} # fully specify a distribution for age (uniform between 20 and 40)
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],f32
1,"""Jessica Rodrig…","""female""",28.0,0,"""ZV.A. 3151""",3.882531,,"""S""",1939-10-08,17:46:39,2022-08-13 14:47:30,
2,"""Ryan Hunt""","""female""",36.0,2,"""7319""",0.902582,,"""S""",1928-08-19,11:04:56,2022-07-23 13:12:24,
3,"""Lauren Webb""","""male""",,0,"""PC 957193""",0.525197,"""Q 9""","""C""",1915-09-12,14:08:43,2022-07-23 04:50:25,
4,"""Derek Carpente…","""female""",36.0,0,"""290374""",0.402547,"""V18""","""S""",1933-04-06,18:12:14,2022-08-05 08:44:36,
5,"""David Payne""","""male""",29.0,2,"""3822""",12.817667,,"""C""",1925-05-09,12:33:11,2022-08-04 17:45:26,
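
The difference between fitting a named distribution and fully specifying one can be sketched with the standard library alone. For brevity the sketch fits an exponential distribution (its maximum-likelihood rate is simply 1 / mean) instead of the log-normal used in the cell above. How metasyn estimates its parameters is not shown in this tutorial, and treating the upper bound of the uniform distribution as exclusive is an assumption.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Fare values copied from the first five rows of the demo data.
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

# "Fit": the parameter is estimated from the data.
# For an exponential distribution the maximum-likelihood rate is 1 / mean.
rate = 1.0 / statistics.fmean(fares)
synthetic_fares = [rng.expovariate(rate) for _ in range(5)]

# "Fully specify": the parameters are fixed up front and the data is ignored.
# Upper bound assumed exclusive, so ages fall in 20..39.
synthetic_ages = [rng.randrange(20, 40) for _ in range(5)]

print(f"fitted mean fare: {1.0 / rate:.2f}")
print(synthetic_fares)
print(synthetic_ages)
```

A fitted distribution still carries information derived from the real data, while a fully specified one carries none.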


### Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings using regular expressions. For example, we see that most Cabins consist of a letter from A to F followed by a two- or three-digit number. We can include this as follows:

In [18]:
from metasyn.distribution import RegexDistribution

# To create a regex distribution, pass the regular expression as a string.
cabin_distribution = RegexDistribution(r"[ABCDEF][0-9]{2,3}")  # The r prefix makes it a raw string, so backslashes are not treated as escapes.
# just for completeness: data generated from this distribution will always match the regex [ABCDEF][0-9]{2,3}

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution}
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],f32
1,"""Jessica Rodrig…","""male""",,0,"""93804""",23.708468,,"""S""",1910-01-07,15:09:30,2022-08-09 07:48:24,
2,"""Ryan Hunt""","""female""",22.0,1,"""2898""",18.812796,"""A002""","""S""",1938-08-09,13:35:09,2022-07-24 17:19:40,
3,"""Lauren Webb""","""female""",,0,"""0798""",63.257429,"""A648""","""C""",1934-06-28,12:21:46,2022-08-02 23:22:06,
4,"""Derek Carpente…","""female""",32.0,2,"""593499""",23.324004,,"""S""",1932-05-05,,2022-07-30 17:03:11,
5,"""David Payne""","""female""",31.0,0,"""527893""",47.319097,,"""S""",1934-04-28,11:25:22,2022-07-17 19:52:04,
6,"""Christopher Yo…","""male""",20.0,0,"""673979""",94.228127,,"""S""",1920-03-05,17:05:51,2022-08-03 06:05:54,
7,"""Shawn Walters""","""female""",,0,"""9008""",215.074375,,"""S""",1907-07-05,17:29:12,2022-08-01 02:54:07,
8,"""Patrick Camach…","""male""",23.0,0,"""172498""",11.913815,"""C87""","""C""",1940-01-22,12:07:01,2022-08-04 18:09:12,
9,"""Shawn Townsend…","""male""",32.0,0,"""049428""",100.368084,,"""S""",1920-12-04,14:55:24,2022-08-14 22:48:27,
10,"""Eric Washingto…","""male""",37.0,0,"""5364""",56.337843,,"""S""",1918-03-14,14:48:56,2022-08-02 19:54:57,
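
A quick way to check generated strings against the intended structure is Python's built-in `re` module. The sketch below tests the non-missing `Cabin` values from the output above, plus two values from the earlier output (generated before the regex was set) as counterexamples.

```python
import re

cabin_pattern = re.compile(r"[ABCDEF][0-9]{2,3}")

# "A002", "A648", "C87": generated with the regex distribution.
# "Q 9", "V18": generated earlier, before the regex was specified.
candidates = ["A002", "A648", "C87", "Q 9", "V18"]

# fullmatch requires the whole string to match, not just a prefix.
matches = {value: bool(cabin_pattern.fullmatch(value)) for value in candidates}
print(matches)
```

Only the first three values match. Note that the pattern constrains the shape of each value but says nothing about how often a cabin is missing; that is still governed by the proportion of missing values.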


## Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [19]:
df.mean()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,str,f64,str,cat,date,time,datetime[μs],str
446.0,,,29.693277,0.381594,,32.204208,,,,,,


In [20]:
mf.synthesize(len(df)).mean()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,str,f64,str,cat,date,time,datetime[μs],f32
446.0,,,29.352617,0.353535,,31.671909,,,,,,


Then, we can also see how many missing values are in each column:

In [21]:
df.null_count()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,177,0,0,0,687,2,78,79,92,891


In [22]:
mf.synthesize(len(df)).null_count()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,175,0,0,0,717,0,70,79,86,891
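
Raw counts are easier to compare as proportions. The sketch below uses only the counts printed above for the columns that have missing values in the original data. The synthetic counts will differ from run to run, since missing values are drawn at random.

```python
n_rows = 891
columns = ["Age", "Cabin", "Embarked", "Birthday", "Board time", "Married since"]
original_nulls = [177, 687, 2, 78, 79, 92]   # from df.null_count()
synthetic_nulls = [175, 717, 0, 70, 79, 86]  # from one synthetic run

# Proportion of missing values per column, original vs. synthetic.
proportions = {
    name: (orig / n_rows, synth / n_rows)
    for name, orig, synth in zip(columns, original_nulls, synthetic_nulls)
}
for name, (orig_prop, synth_prop) in proportions.items():
    print(f"{name:<14} original={orig_prop:.4f} synthetic={synth_prop:.4f}")
```

The original proportion for `Age`, 177 / 891 ≈ 0.1987, matches the "Proportion of Missing Values" reported for that column when the MetaFrame was printed in Step 2.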
