# Advanced Tutorial on MetaSynth

This example workflow shows how to use MetaSynth to create a synthetic metadata file and synthetic data from a CSV file. It is easily adapted to xls and other formats.

You can run this notebook by checking out the MetaSynth repo and installing metasynth with `pip install .`.

It shows some of the more advanced abilities of MetaSynth, such as handling dates, setting distributions and ensuring uniqueness in columns.

In [1]:
import datetime as dt
import pandas as pd
from metasynth import MetaDataset, MetaVar
from metasynth.distribution import DiscreteUniformDistribution

### Define the pandas types for each column before reading the CSV file

This is the easiest approach, although the types can also be corrected after the CSV file has been read.

In [2]:
dtypes = {
    "Survived": "category",  # Categories should be assigned this type.
    "Name": "string",  # Strings should be assigned like this
    "Age": "Int64",  # Integer columns that contain NAs should be explicitly nullable integers.
    "Sex": "category",
    "SibSp": "category",
    "Parch": "category",
    "Ticket": "string",
    "Cabin": "string",
    "Embarked": "category",
}
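
If the file has already been read without a `dtype` argument, the same types can be applied afterwards with `DataFrame.astype`. The following is a minimal sketch of that alternative. It uses a small inline CSV as a stand-in for the real file, and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV file (illustrative values only).
csv_text = "Name,Sex,Age\nAlice,female,30\nBob,male,\n"

# Without dtypes, pandas falls back to its defaults: Age becomes a float
# column because of the missing value, Name and Sex are read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# Apply the intended types after the fact.
fixed = raw.astype({"Name": "string", "Sex": "category", "Age": "Int64"})

print(fixed.dtypes)
```

Passing `dtype` to `read_csv` remains the shorter route. `astype` is mainly useful when the DataFrame comes from somewhere other than a CSV file.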

### Read the CSV from a file

In [3]:
df = pd.read_csv("demonstration.csv", dtype=dtypes)

### Adjust columns with dates/date-times/times

Columns containing dates, times, or datetimes have to be cast manually. Because the columns were written in ISO format, they can be parsed with the `fromisoformat` method. If they were written in a different format, see the `datetime` library documentation on how to convert strings to `datetime`, `time`, or `date` objects.

In [4]:
df["Birthday"] = [dt.date.fromisoformat(x) for x in df["Birthday"]]
df["Board time"] = [dt.time.fromisoformat(x) for x in df["Board time"]]
df["Married since"] = [dt.datetime.fromisoformat(x) for x in df["Married since"]]
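
For columns that are not in ISO format, `datetime.strptime` with an explicit format string is one option. The sketch below uses hypothetical example strings (they are not taken from the demonstration file) to show the pattern for each of the three types.

```python
import datetime as dt

# Hypothetical non-ISO strings, as they might appear in another CSV export.
birthday_str = "10/03/1933"            # day/month/year
board_time_str = "15.47"               # hours.minutes
married_str = "25-07-2022 00:11:23"    # day-month-year followed by a time

# strptime parses a string according to an explicit format and returns a
# datetime; .date() and .time() then extract the part that is needed.
birthday = dt.datetime.strptime(birthday_str, "%d/%m/%Y").date()
board_time = dt.datetime.strptime(board_time_str, "%H.%M").time()
married_since = dt.datetime.strptime(married_str, "%d-%m-%Y %H:%M:%S")

print(birthday, board_time, married_since)
```

The same calls can be used inside the list comprehensions above in place of `fromisoformat`.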

### Convert the DataFrame to a MetaSynth dataset

Apart from supplying the DataFrame, there are a few options to adjust the inference.

- distribution: You can manually set the distribution of a variable. This can be a string, a distribution class, or a distribution instance. You do this by passing the keyword `distribution` with a dictionary.
- unique: You can force a variable to be unique during generation. This is currently only applicable to integer and string variables. Not all distributions implement a unique variant, so the choice of distributions is more limited.
- privacy_package: This feature is still under construction. The idea is that the name of another package can be supplied here, and that package modifies the distributions so that they leak less information, for example by using differential privacy. These supplemental packages are still under development.

In [5]:
dataset = MetaDataset.from_dataframe(
    df,
    distribution={"Name": "faker.name",  # Use fake names -> no original data is used.
                   "Fare": "LogNormalDistribution",  # Use a log normal distribution for the Fare, fit parameters.
                   "Age": DiscreteUniformDistribution(20, 40),  # Use uniform distribution, no fitting.
                 },
    unique={"PassengerId": True}  # Force the column 'PassengerId' to be unique.
)

### Compare the original DataFrame

In [6]:
df

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,"Braund, Mr. Owen Harris",male,22,0,7.2500,,S,1933-03-10,15:47:49.715986,2022-07-25 00:11:23.175187
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,0,71.2833,C85,C,1912-10-01,14:35:15.821941,2022-08-04 06:48:25.673303
2,3,"Heikkinen, Miss. Laina",female,26,0,7.9250,,S,1938-10-30,14:19:23.630043,2022-07-27 14:51:52.214838
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,0,53.1000,C123,S,1903-09-03,12:21:08.950791,2022-08-12 05:43:01.383134
4,5,"Allen, Mr. William Henry",male,35,0,8.0500,,S,1908-07-22,16:07:27.645982,2022-07-18 14:33:12.037248
...,...,...,...,...,...,...,...,...,...,...,...
886,887,"Montvila, Rev. Juozas",male,27,0,13.0000,,S,1904-05-25,11:35:30.956355,2022-08-08 17:25:22.268903
887,888,"Graham, Miss. Margaret Edith",female,19,0,30.0000,B42,S,1936-09-11,10:43:49.671126,2022-08-09 08:49:15.010494
888,889,"Johnston, Miss. Catherine Helen ""Carrie""",female,,2,23.4500,,S,1917-10-09,17:21:15.777408,2022-07-22 17:15:14.155749
889,890,"Behr, Mr. Karl Howell",male,26,0,30.0000,C148,C,1913-04-12,15:28:57.481429,2022-08-15 07:47:46.565579


### Compare with the synthesized DataFrame

In [7]:
dataset.synthesize(100)

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,Chad Patel,female,25,0,1.266781,,S,1936-09-06,12:57:27.806784,2022-08-10 21:15:53.748277523
1,2,Joseph Dunn,male,30,0,0.517259,,S,1910-05-04,14:23:56.763392,2022-07-15 22:36:17.839494302
2,3,Mary Wheeler,male,24,0,1.161631,A4,S,1926-05-25,12:58:07.308224,2022-08-08 09:45:11.531794853
3,4,Cameron Huffman,male,39,2,0.444786,,S,1919-09-17,17:04:57.715703,2022-07-20 03:32:17.950606279
4,5,Laurie Trevino,male,25,0,0.435297,,S,1924-04-25,14:42:25.795744,2022-08-10 13:35:31.807950512
...,...,...,...,...,...,...,...,...,...,...,...
95,96,Frederick Wilson,female,,0,0.348073,,S,1919-03-07,13:37:59.080215,2022-08-10 01:47:45.711553297
96,97,Cynthia Lee,male,33,0,0.205917,,S,1915-11-09,14:00:04.558347,2022-07-18 10:22:29.686049701
97,98,Megan Williams,male,23,0,0.900390,,S,1939-10-10,17:40:32.776049,2022-07-28 17:33:54.815544171
98,99,Michael Greer,female,39,0,3.075273,C808,S,1926-09-24,15:03:40.463333,2022-07-21 18:29:29.466321173
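
Beyond comparing the two tables by eye, the constraints requested earlier can be checked programmatically. The sketch below assumes that `dataset.synthesize` returns a pandas DataFrame, as the output above suggests. `check_synthetic` is a hypothetical helper, not part of MetaSynth, and the stand-in rows are made up so that the example runs on its own.

```python
import pandas as pd


def check_synthetic(synth: pd.DataFrame) -> dict:
    """Check the constraints requested in this tutorial on a synthetic frame."""
    ages = synth["Age"].dropna()  # missing ages are ignored
    return {
        # 'PassengerId' was requested to be unique.
        "passenger_id_unique": bool(synth["PassengerId"].is_unique),
        # 'Age' was set to DiscreteUniformDistribution(20, 40). This only
        # checks that no value falls outside 20..40; whether the upper
        # bound itself can occur is not assumed.
        "age_in_range": bool(ages.between(20, 40).all()),
    }


# Hypothetical stand-in rows; in the notebook, pass dataset.synthesize(100).
synth = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": pd.array([25, 30, None, 39], dtype="Int64"),
})
print(check_synthetic(synth))
```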
