# Advanced Tutorial on MetaSynth

In this tutorial, we will create a `generative metadata format` (`gmf`) file from a dataset using MetaSynth. We will walk through some of MetaSynth's advanced features, such as handling dates, setting distributions, and ensuring uniqueness in columns. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

You can run this notebook by checking out the MetaSynth repo and installing MetaSynth with `pip install metasynth`.

In [1]:
# %pip install metasynth

In [2]:
# import required packages
import datetime as dt
import pandas as pd
from metasynth import MetaDataset
from utils import get_demonstration_fp

## Step 1: Transforming your data into a pandas DataFrame

The first step in creating the metadata is reading and converting your dataset to a pandas DataFrame. 

In [3]:
demonstration_fp = get_demonstration_fp()
df = pd.read_csv(demonstration_fp)
df.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,"Braund, Mr. Owen Harris",male,22.0,0,7.25,,S,1925-09-07,12:33:17,2022-08-10 20:55:21
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,0,71.2833,C85,C,1918-06-15,11:41:00,2022-08-02 14:27:58
2,3,"Heikkinen, Miss. Laina",female,26.0,0,7.925,,S,1920-10-29,12:54:20,2022-07-17 14:30:14
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,0,53.1,C123,S,1937-02-03,11:05:58,2022-07-16 06:28:22
4,5,"Allen, Mr. William Henry",male,35.0,0,8.05,,S,1936-09-21,11:42:34,2022-08-01 02:04:21


MetaSynth will automatically generate the metadata from this DataFrame object, so it is important to __ensure the data types for all the variables are correct__. For example, in the dataset above we see that Age is a floating point number whereas it should be an integer (22 instead of 22.0). In addition, there are a few categorical variables (Sex, Parch, Embarked) that are not loaded as categories by default.

In general, we support [pandas dtypes](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes). For our example dataset we can specify the `dtypes` and load the dataset as follows:

In [4]:
dtypes = {
    "Survived": "category",  # Categories should be assigned this type.
    "Name": "string",  # Strings should be assigned like this
    "Age": "Int64",  # Integer columns that have NA's in them should be explicitly nullable integers.
    "Sex": "category",
    "SibSp": "category",
    "Parch": "category",
    "Ticket": "string",
    "Cabin": "string",
    "Embarked": "category",
}
df = pd.read_csv(demonstration_fp, dtype=dtypes)
df.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,"Braund, Mr. Owen Harris",male,22,0,7.25,,S,1925-09-07,12:33:17,2022-08-10 20:55:21
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,0,71.2833,C85,C,1918-06-15,11:41:00,2022-08-02 14:27:58
2,3,"Heikkinen, Miss. Laina",female,26,0,7.925,,S,1920-10-29,12:54:20,2022-07-17 14:30:14
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,0,53.1,C123,S,1937-02-03,11:05:58,2022-07-16 06:28:22
4,5,"Allen, Mr. William Henry",male,35,0,8.05,,S,1936-09-21,11:42:34,2022-08-01 02:04:21
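
To see the effect of the `dtype` argument in isolation, here is a minimal sketch on a tiny in-memory CSV. The three rows are made up for illustration and are not part of the demonstration file; only standard pandas calls are used.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, invented for illustration only.
csv_text = "Name,Sex,Age\nAlice,female,22\nBob,male,\nCarol,female,38\n"

# Default parsing: the missing Age value forces the column to float64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: nullable integer for Age, category for Sex.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Sex": "category", "Age": "Int64"},
)

print(default_df.dtypes)
print(typed_df.dtypes)
```

With the explicit dtypes, the missing age is kept as `<NA>` and the remaining ages stay integers instead of being converted to floats.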


### Dates, times, and datetimes

One exception to the above is dates, times, and datetimes. Here, we use the types from the built-in `datetime` package, so we have to manually convert the strings in the date, time, and datetime columns to the corresponding objects. Since the columns in our example dataset follow the standard ISO format, we can convert them with the `fromisoformat` method. If they are written in a different format, check out the [datetime library documentation](https://docs.python.org/3/library/datetime.html) on how to convert the strings to datetime/time/date objects.

In [5]:
df["Birthday"] = [dt.date.fromisoformat(x) for x in df["Birthday"]]
df["Board time"] = [dt.time.fromisoformat(x) for x in df["Board time"]]
df["Married since"] = [dt.datetime.fromisoformat(x) for x in df["Married since"]]
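
If your columns are not in ISO format, one option is `datetime.strptime` with an explicit format string. The sketch below assumes three hypothetical formats (the input strings are invented for illustration and do not come from the demonstration file); adjust the format codes to match your own data.

```python
import datetime as dt

# Hypothetical non-ISO strings, invented for illustration only.
birthday_text = "07/09/1925"        # day/month/year
board_time_text = "12.33.17"        # hour.minute.second
married_text = "10-08-2022 20:55"   # day-month-year hour:minute

# strptime always returns a datetime; use .date() or .time() to narrow it.
birthday = dt.datetime.strptime(birthday_text, "%d/%m/%Y").date()
board_time = dt.datetime.strptime(board_time_text, "%H.%M.%S").time()
married_since = dt.datetime.strptime(married_text, "%d-%m-%Y %H:%M")

print(birthday, board_time, married_since)
```

The same conversions can be applied to a whole column with a list comprehension, in the same way as the `fromisoformat` calls above.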

Now, let's check the data types of our DataFrame:

In [6]:
df.dtypes

PassengerId               int64
Name                     string
Sex                    category
Age                       Int64
Parch                  category
Fare                    float64
Cabin                    string
Embarked               category
Birthday                 object
Board time               object
Married since    datetime64[ns]
dtype: object

We see that most variables are now nicely specified as strings, categories and ints where necessary. The `Birthday` and `Board time` columns we just created show the dtype `object`, which is the "catch-all" dtype for pandas, while `Married since` is stored as `datetime64[ns]`. But don't worry, the values in these columns have the correct type and MetaSynth will deal with them correctly:

In [7]:
df["Birthday"][0]

datetime.date(1925, 9, 7)

### Specifying the distribution of structured strings

For strings that follow a more or less fixed structure, we can manually describe that structure with a regular expression. For example, we see that most Cabin values consist of a letter from A to F followed by a 2 or 3 digit number. We can include this as follows:

In [8]:
from metasynth.distribution import RegexDistribution
from metasynth.distribution import DiscreteUniformDistribution

# To create a regex distribution, pass the regular expression as a string.
cabin_distribution = RegexDistribution(r"[ABCDEF]\d{2,3}")  # The r prefix makes this a raw string, so the backslash is kept as-is.
# Just for completeness: data generated from this distribution will always match the regex [ABCDEF]?(\d{2,3})?

var_spec = {
    "PassengerId": {"unique": True}, 
    "Name": {"distribution": "faker.name"},
    "Fare": {"distribution": "LogNormalDistribution"}, # estimate / fit a log-normal distribution based on the data
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)
    "Cabin": {"distribution": cabin_distribution}
}

meta_dataset = MetaDataset.from_dataframe(df, spec=var_spec, privacy_package="disclosure", n_avg=2)
meta_dataset.synthesize(10)

PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
i64,str,cat,i64,cat,f64,f64,cat,date,time,datetime[μs]
0,"""Kevin Griffith...","""female""",34.0,"""0""",1.010605,,"""S""",1908-03-20,15:57:08,2022-08-12 15:00:46
1,"""Jeremy Newman""","""female""",31.0,"""0""",0.217891,,"""S""",1932-11-05,11:44:04,2022-07-28 16:31:04
2,"""Rebecca Colema...","""female""",,"""0""",0.684621,,"""S""",1912-07-19,15:29:43,2022-08-08 20:44:04
3,"""Martin Wilson""","""female""",,"""0""",1.860385,,"""S""",1913-09-08,12:42:26,2022-07-25 16:08:36
4,"""Antonio Keller...","""female""",32.0,"""1""",1.395188,,"""S""",1925-06-08,14:26:20,2022-08-14 19:01:54
5,"""Susan Jackson""","""male""",,"""0""",2.410634,,"""S""",1903-08-21,11:19:07,2022-07-24 18:53:17
6,"""Darlene Frazie...","""female""",33.0,"""0""",0.664675,,"""S""",1922-01-08,17:09:24,2022-07-18 13:53:49
7,"""Melissa Woods""","""female""",,"""0""",0.479335,,"""S""",1913-05-28,12:59:55,2022-07-26 11:10:51
8,"""Joanna Taylor""","""female""",21.0,"""0""",2.362664,,"""S""",1915-12-16,18:36:34,2022-07-29 15:12:32
9,"""Steven Barber""","""female""",35.0,"""2""",0.337777,,"""C""",1936-05-14,12:23:54,2022-07-15 23:13:03
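
Before relying on a pattern such as the one passed to `RegexDistribution` above, it can help to check it against a few values with Python's built-in `re` module. This is a minimal sketch that is independent of MetaSynth: `C85` and `C123` are taken from the data shown earlier, while `G6` and `C1234` are made-up counterexamples.

```python
import re

# Same pattern as used for the Cabin column above.
cabin_pattern = re.compile(r"[ABCDEF]\d{2,3}")

# fullmatch requires the whole string to match, not just a prefix.
candidates = ["C85", "C123", "G6", "C1234"]
matches = {value: bool(cabin_pattern.fullmatch(value)) for value in candidates}

print(matches)
# {'C85': True, 'C123': True, 'G6': False, 'C1234': False}
```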

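To give some intuition for what fitting a log-normal distribution to `Fare` means, the sketch below computes the maximum-likelihood estimates by hand using only the standard library. It is an illustration of the general technique, not MetaSynth's own fitting routine, which is not shown in this tutorial and may differ. The five fares are the values from the first rows of the data shown earlier.

```python
import math
import random
import statistics

# Fare values from the first five rows of the data shown earlier.
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

# Maximum-likelihood estimates for a log-normal distribution:
# the mean and population standard deviation of the log-values.
log_fares = [math.log(fare) for fare in fares]
mu = statistics.fmean(log_fares)
sigma = statistics.pstdev(log_fares)

# Draw a few synthetic values from the fitted distribution.
rng = random.Random(0)
synthetic_fares = [rng.lognormvariate(mu, sigma) for _ in range(5)]

print(mu, sigma)
print(synthetic_fares)
```

Values drawn this way are always positive, which is one reason a log-normal distribution is a natural candidate for prices.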