# Advanced Tutorial on MetaSynth

In this tutorial, we will create a `generative metadata format` (`gmf`) metadata file from a dataset using MetaSynth. We will walk through some of the advanced abilities of MetaSynth, such as handling dates, setting distributions and ensuring uniqueness in columns. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

You can run this notebook by checking out the MetaSynth repo and installing metasynth with `pip install metasynth`.

In [1]:
# %pip install metasynth

In [2]:
# import required packages
import datetime as dt
import polars as pl
from metasynth import MetaDataset, demo_file

## Step 1: Transforming your data into a polars DataFrame

The first step in creating the metadata is reading and converting your dataset to a polars DataFrame. 

In [3]:
demonstration_fp = demo_file()
df = pl.read_csv(demonstration_fp, try_parse_dates=True, dtypes={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical})
df.head()

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,str,cat,date,time,datetime[μs],str
1,"""Braund, Mr. Ow…","""male""",22,0,7.25,,"""S""",1937-10-28,15:53:04,2022-08-05 04:43:34,
2,"""Cumings, Mrs. …","""female""",38,0,71.2833,"""C85""","""C""",,12:26:00,2022-08-07 01:56:33,
3,"""Heikkinen, Mis…","""female""",26,0,7.925,,"""S""",1931-09-24,16:08:25,2022-08-04 20:27:37,
4,"""Futrelle, Mrs.…","""female""",35,0,53.1,"""C123""","""S""",1936-11-30,,2022-08-07 07:05:55,
5,"""Allen, Mr. Wil…","""male""",35,0,8.05,,"""S""",1918-11-07,10:59:08,2022-08-02 15:13:34,


Now, let's check the data types of our DataFrame:

In [4]:
dict(zip(df.columns, df.dtypes))

{'PassengerId': Int64,
 'Name': Utf8,
 'Sex': Categorical,
 'Age': Int64,
 'Parch': Int64,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Categorical,
 'Birthday': Date,
 'Board time': Time,
 'Married since': Datetime(time_unit='us', time_zone=None),
 'all_NA': Utf8}

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary.

## Step 2: Creating a MetaDataset object from a DataFrame

Now a lot of work has already gone into creating a properly formatted dataframe. This work pays off at this stage: let's convert the DataFrame to a meta_dataset structure with the default options. Note: this takes a little bit of time!

In [6]:
meta_dataset = MetaDataset.from_dataframe(df)

Variable PassengerId seems unique, but not set to be unique.



Then, we can show the metadata as a dictionary:

In [7]:
print(meta_dataset)

# Rows: 891
# Columns: 12

{'name': 'PassengerId', 'description': None, 'type': 'discrete', 'dtype': 'Int64', 'prop_missing': 0.0, 'distribution': "{'implements': 'core.discrete_uniform', 'provenance': 'builtin', 'class_name': 'DiscreteUniformDistribution', 'parameters': {'low': 1, 'high': 892}}"}

{'name': 'Name', 'description': None, 'type': 'string', 'dtype': 'Utf8', 'prop_missing': 0.0, 'distribution': '.[]{12,82}'}

{'name': 'Sex', 'description': None, 'type': 'categorical', 'dtype': 'Categorical', 'prop_missing': 0.0, 'distribution': "{'implements': 'core.multinoulli', 'provenance': 'builtin', 'class_name': 'MultinoulliDistribution', 'parameters': {'labels': array(['female', 'male'], dtype='<U6'), 'probs': array([0.35241302, 0.64758698])}}"}

{'name': 'Age', 'description': None, 'type': 'discrete', 'dtype': 'Int64', 'prop_missing': 0.19865319865319866, 'distribution': "{'implements': 'core.discrete_uniform', 'provenance': 'builtin', 'class_name': 'DiscreteUniformDistribution', 'p

## Step 3: Saving the metadata in a file

After creating the metadata, we can save it to a file. The default format is `json`, meaning the file is quite legible by humans and computers alike. Therefore, it can be checked by the data controller and, when the disclosure risk is deemed to be low, this file can be shared with others.

In [8]:
file_path = "demonstration_metadata.json"
meta_dataset.to_json(file_path)
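
Before sharing, it can help to skim such a file programmatically. The snippet below is a minimal sketch using only the standard library. It does not read the real file: `example_metadata` is a hand-written stand-in whose per-variable keys (`name`, `type`, `prop_missing`) mirror the printed metadata above, and the top-level keys `n_rows` and `vars` are assumptions. Open your own file and adapt the key names to what MetaSynth actually wrote.

```python
import json

# Hypothetical stand-in for a metadata file; the real layout written by
# MetaSynth may differ, so treat the key names as assumptions.
example_metadata = {
    "n_rows": 891,
    "vars": [
        {"name": "PassengerId", "type": "discrete", "prop_missing": 0.0},
        {"name": "Age", "type": "discrete", "prop_missing": 0.19865319865319866},
    ],
}

# Round-trip through JSON text, as a file on disk would.
text = json.dumps(example_metadata, indent=2)
loaded = json.loads(text)

# Collect a one-line summary per variable for a quick disclosure check.
summary = [
    (var["name"], var["type"], round(var["prop_missing"], 3))
    for var in loaded["vars"]
]
for row in summary:
    print(row)
```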

## Step 4: Generating synthetic data from the metadata

Upon receiving this file, you can use the MetaSynth package to generate a synthetic version of the dataset:

In [9]:
new_meta_dataset = MetaDataset.from_json(file_path)
new_meta_dataset.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,str,cat,date,time,datetime[μs],f32
803,"""9T(y>?R.qYXV%t…","""male""",,0,2.56137,"""Ad!-I*""","""S""",1922-05-16,17:15:49,2022-08-14 11:06:28,
437,"""`.b)P@g,wH+<kb…","""male""",21.0,0,13.841555,,"""S""",1932-01-27,16:30:46,2022-07-15 12:52:28,
791,"""""$)UV7 At_CcQ…","""male""",19.0,1,6.040909,"""B`3E X""","""C""",1918-08-24,15:36:32,2022-07-17 01:02:36,
763,"""eclykNmK)_5AMK…","""male""",,0,117.469172,,"""S""",1930-02-24,11:41:42,2022-07-22 17:20:30,
490,""" ! *@g5nj6e3^…","""male""",34.0,0,24.080239,,"""S""",1916-11-17,13:53:16,2022-07-15 23:04:10,


As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. In the next sections, we will explore manual changes we can make to improve the quality of the synthetic data.
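
To get a feel for what synthesis from metadata means, here is a minimal sketch in plain Python of drawing one column from a discrete uniform distribution with a given proportion of missing values. The function `synthesize_column` is hypothetical and is not MetaSynth's implementation. The parameters `low=1` and `high=892` are taken from the printed metadata for `PassengerId`; treating `high` as an excluded upper bound is an assumption.

```python
import random


def synthesize_column(low, high, prop_missing, n, seed=0):
    """Draw n integers from low..high-1, replacing some with None."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < prop_missing:
            values.append(None)  # missing value
        else:
            # high is excluded here (an assumption about the metadata)
            values.append(rng.randrange(low, high))
    return values


passenger_ids = synthesize_column(low=1, high=892, prop_missing=0.0, n=5)
print(passenger_ids)
```

Nothing in this sketch prevents the same identifier from being drawn twice, which is one reason the default synthetic data can still be improved.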

## Step 5: Improving the quality of the synthetic data

### Set unique columns

One column (PassengerId) has been detected as possibly unique by MetaSynth, as indicated by the following warning:

> "Variable PassengerId seems unique, but not set to be unique."

This column holds a variable with unique passenger identifiers, so in fact we do want synthetic data generated for this column to be unique as well. We can add this to the metadata by creating a list of options which we call a `specification`, or `spec`:

In [10]:
# First, we create a specification dictionary for the variables
var_spec = {
    "PassengerId": {"unique": True}
}

# then, we add that dictionary as the `spec` argument
meta_dataset = MetaDataset.from_dataframe(df, spec=var_spec)

# then, let's check what the metadata about PassengerId contains!
meta_dataset["PassengerId"].to_dict()

{'name': 'PassengerId',
 'type': 'discrete',
 'dtype': 'Int64',
 'prop_missing': 0.0,
 'distribution': {'implements': 'core.unique_key',
  'provenance': 'builtin',
  'class_name': 'UniqueKeyDistribution',
  'parameters': {'low': 1, 'consecutive': 1}}}

So let's check what is generated from this new metadata:

In [11]:
meta_dataset.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,f32,cat,date,time,datetime[μs],f32
1,""" }}j$NN%|(P7Z…","""male""",78.0,0,10.41064,,"""S""",1919-09-21,11:02:37,,
2,""".Q=,^dy^Bi3S`l…","""male""",,1,23.2323,,"""S""",1935-08-14,10:59:55,2022-08-05 00:59:17,
3,""":1 -*BISW	j5 b…","""male""",39.0,1,8.431676,,"""S""",1913-03-20,15:44:37,2022-08-08 11:06:02,
4,"""?ez3=d ""-[]JD5…","""male""",19.0,2,13.494657,,"""S""",1936-07-01,14:30:42,2022-07-29 17:58:08,
5,"""i$-#7i~IvS>IM%…","""male""",17.0,1,4.245998,,"""Q""",1937-12-27,18:00:10,2022-07-23 13:01:22,


Now we see that the `PassengerId` column is correctly represented with increasing id numbers.
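
The two ideas in this section, detecting that a column looks unique and generating consecutive keys, can be sketched in a few lines of plain Python. Both helpers below are hypothetical illustrations. How MetaSynth actually detects uniqueness, and the exact meaning of the `consecutive` parameter, are not shown in this tutorial, so the sketch only reproduces the behaviour visible in the output (ids 1 to 5 starting from `low`).

```python
import itertools


def seems_unique(values):
    """Return True when no non-missing value occurs more than once."""
    present = [v for v in values if v is not None]
    return len(set(present)) == len(present)


def consecutive_keys(low, n):
    """Generate n consecutive integer keys starting at low."""
    return list(itertools.islice(itertools.count(low), n))


print(seems_unique([1, 2, 3, 4]))
print(seems_unique([1, 2, 2, 4]))
print(consecutive_keys(low=1, n=5))
```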

### Fake names (and others)

As one can see, the `Name` of the passengers is not synthesized very well. The reason is that the string type interpreter in MetaSynth is designed for `structured` strings (like room numbers such as `B1.09`, `B1.01` or `A1.08`) and not for unstructured strings. However, MetaSynth supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which can fake many kinds of data. The columns using faker are not based on the real data at all, so they do not disclose any information about it.

We fake names as follows:

In [12]:
# First, we create a specification dictionary for the variables
from metasynth.distribution import FakerDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")}
}

meta_dataset = MetaDataset.from_dataframe(df, spec=var_spec)
meta_dataset.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,f32,cat,date,time,datetime[μs],f32
1,"""John Snyder""","""female""",11.0,1,20.730102,,"""C""",1923-02-24,15:18:06,2022-08-02 02:22:32,
2,"""Loretta Sutton…","""male""",17.0,0,33.783819,,"""S""",1920-03-02,11:11:34,2022-07-29 10:35:05,
3,"""Jeffrey Romero…","""male""",,1,116.916082,,"""C""",1904-03-22,13:39:50,2022-08-12 17:17:57,
4,"""Kristin Silva""","""female""",69.0,1,0.785428,,"""S""",1929-02-07,,2022-08-09 19:34:13,
5,"""Jamie Adkins""","""male""",,1,23.869532,,"""S""",1927-11-24,,2022-07-29 16:21:41,


That already looks a lot better for the `Name` column!

### Set distributions manually

Without user input, the distribution chosen for each variable is inferred by choosing the best fitting from available distributions for the variable type. However, we can also manually specify which distribution to fit, or we can even just fully specify how the variable should be generated.

In [13]:
from metasynth.distribution import DiscreteUniformDistribution

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "LogNormalDistribution"}, # estimate / fit a log-normal distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)} # fully specify a distribution for age (uniform between 20 and 40)
}

meta_dataset = MetaDataset.from_dataframe(df, spec=var_spec)
meta_dataset.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,str,cat,date,time,datetime[μs],f32
1,"""John Snyder""","""female""",,1,0.497802,,"""S""",1903-10-17,12:25:53,2022-07-24 05:03:27,
2,"""Loretta Sutton…","""male""",,0,0.502027,,"""S""",,15:55:38,2022-07-21 18:33:18,
3,"""Jeffrey Romero…","""female""",21.0,0,1.350654,,"""S""",1930-07-15,15:42:38,,
4,"""Kristin Silva""","""male""",,0,0.608075,"""EZOH a|Oo1""","""S""",1920-07-01,14:51:13,2022-08-09 04:45:51,
5,"""Jamie Adkins""","""female""",,0,0.170791,,"""S""",1919-01-17,15:17:18,2022-08-05 16:31:15,
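
The difference between fitting a named distribution and fully specifying one can be illustrated with the standard library alone. The sketch below is not how MetaSynth fits distributions. It uses textbook maximum-likelihood estimates on the five fares shown in Step 1, and then draws ages from hand-picked bounds without looking at the data. Excluding the upper bound of the age range is an assumption made for this sketch.

```python
import math
import random
import statistics

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # first five fares shown in Step 1

# Fit an exponential distribution: the maximum-likelihood rate is 1 / mean.
exp_rate = 1.0 / statistics.fmean(fares)

# Fit a log-normal distribution: mean and standard deviation of the log values.
log_fares = [math.log(x) for x in fares]
ln_mu = statistics.fmean(log_fares)
ln_sigma = statistics.pstdev(log_fares)

rng = random.Random(0)
exp_draws = [rng.expovariate(exp_rate) for _ in range(5)]
lognormal_draws = [rng.lognormvariate(ln_mu, ln_sigma) for _ in range(5)]

# Fully specified: no fitting, the parameters are given by hand.
# The upper bound 40 is excluded in this sketch (an assumption).
age_draws = [rng.randrange(20, 40) for _ in range(5)]

print(exp_draws)
print(lognormal_draws)
print(age_draws)
```

In the fitted cases the parameters change whenever the data change; in the fully specified case they never depend on the data.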


### Specifying the distribution of structured strings

For more or less structured strings, we can manually set the structure of the strings using regular expressions. For example, we see that most Cabins consist of a letter from A to F followed by a 2- or 3-digit number. We can include this as follows:

In [36]:
from metasynth.distribution import RegexDistribution

# To create a regex distribution, pass the regular expression as a single string:
# one letter from A to F, followed by 2 or 3 digits.
cabin_distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")  # The r prefix makes this a raw string, so \d is not treated as an escape sequence.
# just for completeness: data generated from this distribution will always match the regex [ABCDEF]?(\d{2,3})?

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": FakerDistribution("name")},
    "Fare": {"distribution": "ExponentialDistribution"}, # estimate / fit an exponential distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution}
}

meta_dataset = MetaDataset.from_dataframe(df, spec=var_spec)
meta_dataset.synthesize(10)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,f64,str,cat,date,time,datetime[μs],f32
1,"""John Snyder""","""male""",28.0,0,16.816117,"""E742""","""Q""",1924-09-14,17:08:32,2022-08-09 22:38:21,
2,"""Loretta Sutton…","""female""",31.0,0,5.628338,,"""S""",1939-06-19,14:58:47,2022-07-30 10:11:59,
3,"""Jeffrey Romero…","""female""",27.0,0,72.696756,,"""S""",1927-08-03,,2022-08-07 08:12:35,
4,"""Kristin Silva""","""female""",28.0,0,39.490222,,"""S""",1906-09-30,15:09:54,2022-07-20 07:48:03,
5,"""Jamie Adkins""","""male""",35.0,1,12.817315,,"""S""",1907-04-26,17:49:43,2022-08-02 18:23:52,
6,"""Amanda Hawkins…","""male""",22.0,0,28.477249,,"""S""",1929-11-08,,2022-08-13 05:19:31,
7,"""Mark Lopez""","""female""",22.0,1,8.303784,,"""S""",1924-05-29,15:19:10,2022-08-01 04:05:40,
8,"""Pamela Wheeler…","""male""",29.0,1,26.624476,,"""S""",1910-05-16,14:00:12,2022-08-12 19:31:09,
9,"""Sydney Bennett…","""male""",33.0,0,29.606436,"""C286""","""C""",1931-06-22,12:22:27,2022-07-19 04:42:25,
10,"""Jennifer Cohen…","""male""",,1,42.645574,"""E53""","""Q""",1906-09-25,17:24:27,2022-07-24 08:45:19,
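
To see what the pattern `[ABCDEF]\d{2,3}` accepts and produces, here is a small sketch using Python's `re` and `random` modules. The helper `random_cabin` is hypothetical and unrelated to how `RegexDistribution` generates values; it simply builds strings of the same shape and checks them against the pattern.

```python
import random
import re
import string

CABIN_PATTERN = re.compile(r"[ABCDEF]\d{2,3}")


def random_cabin(rng):
    """Build one string of the form: letter A-F plus 2 or 3 digits."""
    letter = rng.choice("ABCDEF")
    n_digits = rng.choice([2, 3])
    digits = "".join(rng.choice(string.digits) for _ in range(n_digits))
    return letter + digits


rng = random.Random(0)
cabins = [random_cabin(rng) for _ in range(5)]
print(cabins)

# Every generated value should match the pattern in full.
all_match = all(CABIN_PATTERN.fullmatch(c) for c in cabins)

# The same pattern can be used to check real values, such as those shown in Step 1.
real_examples = ["C85", "C123"]
real_match = [bool(CABIN_PATTERN.fullmatch(c)) for c in real_examples]
print(all_match, real_match)
```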


## Comparing the final synthetic dataset to the original

Let's first compare the averages of the numerical columns:

In [41]:
df.mean()

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,f64,str,cat,date,time,datetime[μs],str
446.0,,,29.693277,0.381594,32.204208,,,,,,


In [42]:
meta_dataset.synthesize(len(df)).mean()

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
f64,str,cat,f64,f64,f64,str,cat,date,time,datetime[μs],f32
446.0,,,29.417867,0.362514,33.143879,,,,,,


Then, we can also see how many missing values are in each column:

In [44]:
df.null_count()

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,177,0,0,687,2,78,79,92,891


In [45]:
meta_dataset.synthesize(len(df)).null_count()

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,179,0,0,690,0,82,66,89,891
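
The null counts differ slightly between the real and synthetic data because missingness is random in the synthetic data. As a rough check, the sketch below computes how much variation to expect for the `Age` column, assuming each synthetic cell is set to missing independently with probability `prop_missing` (an assumption about the generator, not something stated above). The numbers 891, 0.19865319865319866 and 179 are taken from the outputs in this tutorial.

```python
import math

n_rows = 891
prop_missing_age = 0.19865319865319866  # from the printed metadata in Step 2

# Expected number of missing values and its binomial standard deviation,
# assuming each synthetic cell is set to missing independently.
expected_missing = n_rows * prop_missing_age
std_missing = math.sqrt(n_rows * prop_missing_age * (1 - prop_missing_age))

observed_synthetic = 179  # Age column in the synthetic null counts above
z_score = (observed_synthetic - expected_missing) / std_missing
print(round(expected_missing), round(std_missing, 1), round(z_score, 2))
```

Under that assumption, a synthetic count of 179 against an expected 177 is well within one standard deviation, so the difference is unremarkable.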
