# Advanced Tutorial on MetaSynth

This example workflow shows how to use MetaSynth to create a synthetic metadata file or synthetic data from a CSV file. It is easily adapted to xls and other formats.

You can run this notebook by checking out the MetaSynth repo and installing metasynth with `pip install .`.

It shows some of the more advanced abilities of MetaSynth, such as handling dates, setting distributions and ensuring uniqueness in columns.

In [1]:
import datetime as dt
import pandas as pd
from metasynth import MetaDataset, MetaVar
from metasynth.distribution import DiscreteUniformDistribution

### Define the pandas types for each column before reading the CSV file

This is the easiest way to do it, though of course the types can also be corrected after reading in the CSV file.

In [2]:
dtypes = {
    "Survived": "category",  # Categories should be assigned this type.
    "Name": "string",  # Strings should be assigned like this
    "Age": "Int64",  # Integer columns that have NA's in them should be explicitly nullable integers.
    "Sex": "category",
    "SibSp": "category",
    "Parch": "category",
    "Ticket": "string",
    "Cabin": "string",
    "Embarked": "category",
}
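If the types were not set while reading, they can be converted afterwards with `DataFrame.astype`. The following is a minimal sketch on a small inline sample invented for illustration (it is not the tutorial's `demonstration.csv`, and the name `sample_df` is hypothetical):

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
csv_text = """Name,Sex,Age
"Braund, Mr. Owen Harris",male,22
"Doe, Ms. Jane",female,
"Behr, Mr. Karl Howell",male,26
"""

# Read without dtype hints: Age becomes float64 because of the missing value.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Correct the types after the fact.
sample_df = sample_df.astype({"Name": "string", "Sex": "category", "Age": "Int64"})
print(sample_df.dtypes)
```

Passing the `dtype` argument to `read_csv` directly, as done above, avoids this second step.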

### Read the CSV from a file

In [3]:
df = pd.read_csv("demonstration.csv", dtype=dtypes)

### Inspect the original DataFrame

Let's first see what the original DataFrame looks like:

In [4]:
pd.set_option('display.max_rows', 5)
df

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,"Braund, Mr. Owen Harris",male,22,0,7.2500,,S,1927-03-18,16:26:28.944096,2022-07-18 20:50:55.208931
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,0,71.2833,C85,C,1921-06-01,15:11:57.571852,2022-07-30 13:09:28.309247
...,...,...,...,...,...,...,...,...,...,...,...
889,890,"Behr, Mr. Karl Howell",male,26,0,30.0000,C148,C,1932-04-20,16:11:11.230772,2022-07-27 05:42:51.179638
890,891,"Dooley, Mr. Patrick",male,32,0,7.7500,,Q,1906-02-15,17:06:36.866675,2022-07-18 21:20:51.180569


### Adjust columns with dates/date-times/times

We have to manually cast the columns containing dates, times, and datetimes. Since the columns were written in ISO format, they can be read back with the `fromisoformat` method. If they were written in a different format, see the `datetime` library documentation on how to convert the strings to datetime/time/date objects.

In [5]:
df["Birthday"] = [dt.date.fromisoformat(x) for x in df["Birthday"]]
df["Board time"] = [dt.time.fromisoformat(x) for x in df["Board time"]]
df["Married since"] = [dt.datetime.fromisoformat(x) for x in df["Married since"]]
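For columns that are not in ISO format, `datetime.strptime` with an explicit format string is one option. The sketch below uses made-up strings (day/month/year dates and dot-separated times); these formats are assumptions for illustration and do not occur in the tutorial's data:

```python
import datetime as dt

# Hypothetical non-ISO strings: day/month/year dates and dot-separated times.
raw_dates = ["18/03/1927", "01/06/1921"]
raw_times = ["16.26.28", "15.11.57"]

# strptime parses a string according to an explicit format specification
# and returns a datetime; .date() and .time() extract the relevant part.
dates = [dt.datetime.strptime(x, "%d/%m/%Y").date() for x in raw_dates]
times = [dt.datetime.strptime(x, "%H.%M.%S").time() for x in raw_times]

print(dates[0].isoformat(), times[0].isoformat())
```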

### Convert DataFrame with default options

Let's first convert the DataFrame to a meta_dataset structure with the default options.

In [6]:
meta_dataset = MetaDataset.from_dataframe(df)
meta_dataset.synthesize(5)

Variable PassengerId seems unique, but not set to be unique.



Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,"/S,Y-*^ /BZ :s>)^\E d.""@I@[RNv9_)^US[W <fA@o....",female,38,0,32.854138,,S,1938-12-03,11:11:31.687521,2022-08-08 03:39:18.326472349
1,2,"+,2O IM+t!UIuGn[""x|Fr.|>gU",male,3,0,4.179439,,S,1917-04-08,12:18:56.760538,2022-08-09 10:25:05.521453740
2,3,\>WasvZ^BxRPh{X(KVz9S#.+2*Ap:i{n.}6!]eck\xl vz,male,51,0,30.966114,,C,1912-03-30,13:28:08.733395,2022-08-13 14:11:09.386668718
3,4,"XXYW{M Mr|JF/E_b#2)| }lhxy""zasP9ISd u",male,77,2,1.493637,,S,1929-11-23,10:53:30.651066,2022-08-14 13:55:13.887436879
4,5,?esa0|PiE{i'EtT263Gc |T\zNY; 8aF=}B-?xz/&0|$...,male,40,0,80.972737,,S,1932-08-01,14:04:31.293774,2022-08-08 23:57:12.667786042


### Set unique columns

As you can see, MetaSynth has detected one column (`PassengerId`) as unique. It warns the user, who might want to add this to the arguments when creating the `meta_dataset`. Unless it is explicitly told that a column is unique, MetaSynth will select only non-unique distributions. To suppress the warning, one can set the uniqueness of the column to `False`, but in this case we want `PassengerId` to be unique:

In [7]:
meta_dataset = MetaDataset.from_dataframe(df, unique={"PassengerId": True})
meta_dataset.synthesize(5)

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,IPGGSl'[p[I}%0X s@<y@s6wun2	j&O$56R#buIR&op@...,female,41,0,34.487166,,Q,1933-02-22,13:24:23.933614,2022-07-24 11:36:58.943335651
1,2,"gz. oW?	s4&g1=Gy]""-UEF<{!ja4 *n8~Lu,vGL0v*z}h...",female,11,0,11.411029,,Q,1928-02-29,15:05:03.204114,2022-08-08 20:33:23.854920559
2,3,"3'%!RNf3m2.] bo*""6sd6%=~sQ&WlyZ^NVD 7)",male,3,0,8.137719,,Q,1940-05-21,12:37:47.191860,2022-08-09 17:58:47.584643916
3,4,"yKTA'5{:1UX""fR6Eep/_9qSNfao:RWa|Ue]GTj[R)UJa?,...",female,66,0,19.792011,,C,1931-08-08,17:18:48.658613,2022-08-10 15:16:15.584206619
4,5,IN|OPita'	V3]WP1`/\3IJ%yW q>f/B<=<Ke^jwP3k@B...,male,53,0,30.392249,,S,1915-07-25,17:14:35.429485,2022-08-14 07:21:12.028817483
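To make the idea of a column that "seems unique" concrete, here is the simplest possible check: no non-missing value occurs more than once. This is only an illustrative assumption; MetaSynth's actual detection logic is not shown in this tutorial, and the helper `seems_unique` is hypothetical.

```python
def seems_unique(values):
    """Return True if there is at least one non-missing value and none repeats."""
    present = [v for v in values if v is not None]
    return len(present) > 0 and len(set(present)) == len(present)


# Hypothetical column contents for illustration.
passenger_ids = [1, 2, 3, 4, 5]
embarked = ["S", "C", "S", "Q", "S"]

print(seems_unique(passenger_ids), seems_unique(embarked))
```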


### Fake names (and others)

As one can see, the `Name` column is not synthesized very well. The reason is that MetaSynth's default string handling works well for structured strings (think `R123`), which names are not. However, MetaSynth supports the [faker](https://faker.readthedocs.io/en/master/index.html) package, which can fake many data types. Columns that use faker are completely generative, i.e. they do not use the original data in any way and are thus privacy safe.

We fake names as follows:

In [8]:
meta_dataset = MetaDataset.from_dataframe(
    df,
    unique={"PassengerId": True},
    distribution={"Name": "faker.name"   # Use fake names -> no original data is used.
                 })
meta_dataset.synthesize(5)

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,Diane Perez,male,65.0,0,8.801308,,S,1906-05-09,12:25:28.823726,2022-08-15 17:59:57.890636192
1,2,Timothy Lee,male,34.0,0,18.683875,,S,1920-01-09,18:12:56.626296,2022-07-28 02:22:43.897241695
2,3,Rachel Barnes,female,,0,21.413619,,S,1928-05-28,12:36:28.439398,2022-08-14 17:51:51.140238876
3,4,Angela Jackson,male,36.0,2,17.301115,,S,1934-12-23,18:04:09.704319,2022-07-26 13:48:53.062253638
4,5,Derrick Duke,female,29.0,1,5.695523,448.0,S,1915-03-01,15:46:49.684362,2022-08-07 02:28:53.358116721


### Set distributions manually

Without user input, the distributions are inferred by choosing the best fitting of the available distributions. A user can, however, also set a distribution manually (either with or without providing the parameters of the distribution).

In [9]:
meta_dataset = MetaDataset.from_dataframe(
    df,
    distribution={"Name": "faker.name",  # Use fake names -> no original data is used.
                   "Fare": "LogNormalDistribution",  # Use a log normal distribution for the Fare, fit parameters.
                   "Age": DiscreteUniformDistribution(20, 40),  # Use uniform distribution, no fitting.
                 },
    unique={"PassengerId": True}  # Force the column 'PassengerId' to be unique.
)
meta_dataset.synthesize(5)

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,Kimberly House,male,36.0,1,1.103517,,S,1904-09-28,15:41:26.225594,2022-07-17 03:46:21.230597739
1,2,Brandy Rice,male,28.0,0,2.172598,B8,S,1929-09-28,12:13:10.154044,2022-08-02 19:51:46.379148512
2,3,Jennifer Mooney,female,,0,0.289978,,S,1917-08-02,12:07:34.133282,2022-08-04 21:33:10.087911850
3,4,Todd Washington,female,,0,7.068193,CBG05,S,1921-02-10,15:20:05.160218,2022-07-15 12:14:27.692548334
4,5,Jennifer Perez,female,24.0,0,1.760132,,S,1916-08-14,13:55:08.695122,2022-08-05 18:46:40.118202507
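The difference between naming a distribution (parameters are fitted) and passing a distribution with fixed parameters (no fitting) can be illustrated outside MetaSynth. The sketch below fits a log-normal distribution by taking the mean and standard deviation of the logged values, then samples from it. This estimator is an assumption chosen for illustration and is not necessarily what MetaSynth's `LogNormalDistribution` does; the fares are simulated, not taken from the tutorial's data.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical "observed" fares, simulated from a known log-normal distribution.
fares = [rng.lognormvariate(3.0, 0.8) for _ in range(5000)]

# The log of a log-normal variable is normally distributed, so the
# parameters can be estimated from the mean and spread of the logged values.
logs = [math.log(x) for x in fares]
mu_hat = statistics.fmean(logs)
sigma_hat = statistics.stdev(logs)

# Draw synthetic fares from the fitted distribution.
synthetic = [rng.lognormvariate(mu_hat, sigma_hat) for _ in range(5)]
print(mu_hat, sigma_hat, synthetic)
```

With fixed parameters, the fitting step is skipped and the sampling step uses the user-supplied values directly.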


### Set regex distribution manually

For more or less structured strings, we can manually set the structure of the strings. For example, we see that most values in `Cabin` consist of a letter in `[A-F]` followed by a two- or three-digit number. We can encode this as follows:

In [12]:
from metasynth.distribution import RegexDistribution

# To create a regex distribution, you need a list of tuples, where each tuple is an element.
# The first part of the tuple is a string representation of the regex, while the second is the proportion of the
# time the regex element is used.
regex = RegexDistribution([(r"[ABCDEF]", 1), (r"\d{2,3}", 0.9)])  # The r prefix makes it a raw string, so backslashes are kept as-is.

meta_dataset = MetaDataset.from_dataframe(
    df,
    distribution={"Name": "faker.name",  # Use fake names -> no original data is used.
                   "Fare": "LogNormalDistribution",  # Use a log normal distribution for the Fare, fit parameters.
                   "Age": DiscreteUniformDistribution(20, 40),  # Use uniform distribution, no fitting.
                   "Cabin": regex,  # Manually set regex distribution.
                 },
    unique={"PassengerId": True}  # Force the column 'PassengerId' to be unique.
)
meta_dataset.synthesize(5)

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,1,Mallory Blair,male,39.0,0,0.33398,,S,1935-03-24,13:45:43.605526,2022-07-29 08:58:09.168029519
1,2,Angela Hogan,male,35.0,0,0.612758,C34,S,1915-12-01,11:49:27.850256,2022-08-14 08:31:33.696420720
2,3,Elizabeth Henderson,male,,0,0.763967,,S,1939-11-25,12:32:44.345763,2022-08-11 15:59:02.157057653
3,4,Susan Howard,male,27.0,0,0.149619,,S,1922-06-14,18:07:45.747317,2022-07-23 02:27:24.819813276
4,5,Jeremy Williamson,male,33.0,0,0.459128,,S,1912-06-10,15:06:42.432252,2022-08-08 03:01:31.229623805
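To illustrate what the list of `(regex, proportion)` tuples describes, the sketch below generates strings by hand: a letter from A to F is always used, and a block of two or three digits is appended 90% of the time. It is a plain-Python approximation based on the comments in the cell above, not the implementation of `RegexDistribution`. The helper `sample_cabin` is hypothetical, and the equal chance of two or three digits is an assumption.

```python
import random
import re
import string


def sample_cabin(rng):
    """Draw one string: a letter A-F, then (90% of the time) 2 or 3 digits."""
    value = rng.choice("ABCDEF")  # First element, proportion 1: always used.
    if rng.random() < 0.9:  # Second element, proportion 0.9.
        n_digits = rng.choice([2, 3])  # Assumed: both lengths equally likely.
        value += "".join(rng.choice(string.digits) for _ in range(n_digits))
    return value


rng = random.Random(42)
cabins = [sample_cabin(rng) for _ in range(1000)]

# Every generated value matches the combined pattern.
pattern = re.compile(r"[A-F](\d{2,3})?")
print(cabins[:5], all(pattern.fullmatch(c) for c in cabins))
```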


### Privacy package (experimental)

The last feature, which is currently experimental, is the implementation of privacy features that reduce the risk of disclosure. We are working on a [disclosure control](https://github.com/sodascience/metasynth-disclosure-control) extension of MetaSynth that replaces the fitting methods with safer ones.

In [11]:
meta_dataset = MetaDataset.from_dataframe(
    df,
    distribution={"Name": "faker.name",  # Use fake names -> no original data is used.
                 },
    unique={"PassengerId": True},  # Force the column 'PassengerId' to be unique.
    privacy_package="disclosure",  # Use the metasynth-disclosure package (needs to be installed).
)
meta_dataset.synthesize(5)

Variable Age seems unique, but not set to be unique.



Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since
0,4,Cassandra Gray,male,2.0,0,77.868293,,Q,1912-12-29,14:11:31.089643,2022-08-13 09:24:39.487873818
1,5,Christopher Hughes,female,4.0,1,250.955985,,Q,1936-06-17,17:56:42.531979,2022-08-09 01:57:16.488588629
2,6,Edward Santos,female,1.0,2,20.15956,,S,1911-06-27,11:15:56.303016,2022-07-27 19:57:17.235885458
3,7,Kenneth Garcia,male,,2,61.437273,,S,1923-10-12,12:45:48.840217,2022-07-17 19:10:47.807715953
4,8,Shelby Proctor,female,,0,2.541149,Port Ricky,S,1933-01-21,10:48:39.910753,2022-08-15 16:09:24.029157483
