# Getting started with metasyn

In this tutorial, we will create a `MetaFrame` (a metadata representation of a given dataset) and then generate synthetic data from it. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.  

## Step 0: Install the metasyn package and import required packages
First, install the metasyn package in your session:

In [None]:
# uncomment the following line and run the cell to install metasyn
# %pip install metasyn

In [1]:
# import required packages
import datetime as dt
import polars as pl
from metasyn import MetaFrame, demo_file

## Step 1: Load the data into a data frame

The first step in creating the metadata is to read your dataset into a DataFrame with the correct data types. We use the [Polars](https://pola.rs) DataFrame library for this, but you could also use Pandas.

In [2]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data, declaring the categorical columns explicitly
# (newer Polars releases rename the `dtypes` argument to `schema_overrides`)
data_types = {
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical,
}
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],str
1,"""Braund, Mr. Ow…","""male""",22,0,"""A/5 21171""",7.25,,"""S""",1937-10-28,15:53:04,2022-08-05 04:43:34,
2,"""Cumings, Mrs. …","""female""",38,0,"""PC 17599""",71.2833,"""C85""","""C""",,12:26:00,2022-08-07 01:56:33,
3,"""Heikkinen, Mis…","""female""",26,0,"""STON/O2. 31012…",7.925,,"""S""",1931-09-24,16:08:25,2022-08-04 20:27:37,
4,"""Futrelle, Mrs.…","""female""",35,0,"""113803""",53.1,"""C123""","""S""",1936-11-30,,2022-08-07 07:05:55,
5,"""Allen, Mr. Wil…","""male""",35,0,"""373450""",8.05,,"""S""",1918-11-07,10:59:08,2022-08-02 15:13:34,


Now, let's check the data types of our DataFrame:

In [6]:
dict(zip(df.columns, df.dtypes))

{'PassengerId': Int64,
 'Name': Utf8,
 'Sex': Categorical,
 'Age': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Categorical,
 'Birthday': Date,
 'Board time': Time,
 'Married since': Datetime(time_unit='us', time_zone=None),
 'all_NA': Utf8}

Most variables now have a suitable type: strings, categoricals, integers, floats, dates and times. We can also inspect the data a bit more with `glimpse()`.

In [7]:
df.glimpse()

Rows: 891
Columns: 13
$ PassengerId            <i64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Name                   <str> Braund, Mr. Owen Harris, Cumings, Mrs. John Bradley (Florence Briggs Thayer), Heikkinen, Miss. Laina, Futrelle, Mrs. Jacques Heath (Lily May Peel), Allen, Mr. William Henry, Moran, Mr. James, McCarthy, Mr. Timothy J, Palsson, Master. Gosta Leonard, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), Nasser, Mrs. Nicholas (Adele Achem)
$ Sex                    <cat> male, female, female, female, male, male, male, male, female, female
$ Age                    <i64> 22, 38, 26, 35, 35, None, 54, 2, 27, 14
$ Parch                  <i64> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0
$ Ticket                 <str> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330877, 17463, 349909, 347742, 237736
$ Fare                   <f64> 7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708
$ Cabin                  <str> None, C85, None, C123, None, None, E46, None, None, None


## Step 2: Generating a MetaFrame

Now that we have properly formatted our DataFrame, we can easily generate a MetaFrame for it. 

> **MetaFrames:**
> A **MetaFrame** is an object that captures the essential aspects of a dataset, including variable names, variable types, data types, the proportion of missing values, and distribution attributes. A MetaFrame contains all the information needed to generate a synthetic dataset that aligns with the original dataset, without containing any *entries* of the original dataset.

More information on generating MetaFrames can be found in the [metasyn docs](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html).

In [8]:
# Generate and fit a MetaFrame to the DataFrame 
mf = MetaFrame.fit_dataframe(df)

Variable PassengerId seems unique, but not set to be unique.



Calling `print` on the MetaFrame displays the statistical metadata it contains in an easy-to-read format:

In [12]:
print(mf)

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.discrete_uniform
	- Provenance: builtin
	- Parameters:
		- low: 1
		- high: 892
	

Column 2: "Name"
- Variable Type: string
- Data Type: Utf8
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.regex
	- Provenance: builtin
	- Parameters:
		- regex: [A-Z][a-z]{2,9}[,][ ][M]((|[a][s][t][e])[r][\.][ ][A-Z][a-z]{3,7}(|[ ][A-Z][a-z]{3,7})|[i][s]{2,2}[\.][ ][A-Z][a-z]{3,8}(|[ ][A-Z][a-z]{4,7}))
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.multinoulli
	- Provenance: builtin
	

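To make the printed fields more concrete, here is a minimal pure-Python sketch of how a missing-value proportion and the `labels`/`probs` of a multinoulli (categorical) distribution can be derived from a column. It is a conceptual illustration only, not metasyn's implementation; the function name `summarize_categorical` and the toy column are hypothetical.

```python
from collections import Counter


def summarize_categorical(values):
    """Return the missing-value proportion and category probabilities.

    Conceptual sketch only: this is not how metasyn is implemented.
    """
    if not values:
        raise ValueError("values must not be empty")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("values must contain at least one non-missing entry")
    prop_missing = (len(values) - len(observed)) / len(values)
    counts = Counter(observed)
    labels = sorted(counts)
    # Probabilities are relative frequencies among the non-missing values
    probs = [counts[label] / len(observed) for label in labels]
    return {"prop_missing": prop_missing, "labels": labels, "probs": probs}


# Toy column (hypothetical data, not the demo file)
toy_sex = ["male", "female", "male", None, "male"]
summary = summarize_categorical(toy_sex)
print(summary)

# The Age column has 177 nulls in 891 rows (see df.describe() further down),
# which matches the "Proportion of Missing Values: 0.1987" printed above.
age_prop_missing = round(177 / 891, 4)
print(age_prop_missing)
```
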
## Step 3: Exporting the MetaFrame

After creating the MetaFrame, you can serialize and export it to a GMF file with `mf.export()`, passing the file path as a parameter.

> **GMF files:**
> GMF files are JSON files that follow the [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format), a format for statistical metadata of (tabular) datasets that is designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

More information on exporting and importing MetaFrames can be found in the [metasyn docs](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html).

In [13]:
file_path = "demonstration_metadata.json"

# Serialize and export the MetaFrame to a GMF file
mf.export(file_path)

You can now open and read the GMF-formatted `.json` file.
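
Because a GMF file is plain JSON, it can also be inspected with Python's standard `json` module. The sketch below is self-contained, so it writes and reads a small stand-in file; the keys in `stand_in_metadata` are hypothetical and do not represent the real GMF schema.

```python
import json
import tempfile
from pathlib import Path

# Stand-in metadata: these keys are hypothetical and only mimic the idea of a
# metadata file; they are NOT the real GMF schema.
stand_in_metadata = {
    "n_rows": 891,
    "vars": [{"name": "Sex", "prop_missing": 0.0}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "stand_in_metadata.json"
    json_path.write_text(json.dumps(stand_in_metadata, indent=2), encoding="utf-8")

    # Reading a JSON file back yields plain dicts and lists
    with json_path.open(encoding="utf-8") as handle:
        loaded = json.load(handle)

# List the top-level keys and confirm the round trip preserved the content
print(sorted(loaded))
print(loaded == stand_in_metadata)
```

To look at the file exported above, the same `json.load` call can be pointed at `file_path` instead of the temporary file.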

A previously exported GMF file can be loaded into a MetaFrame with the `MetaFrame.from_json()` class method, passing the file path as a parameter.

In [14]:
# Create a MetaFrame based on a GMF (.json) file
mf = MetaFrame.from_json(file_path)

## Step 4: Generating synthetic data from a MetaFrame

Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to generate as a parameter and returns a DataFrame with the synthetic data.

More information on generating synthetic data based on MetaFrames can be found in the [metasyn docs](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html).
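
As a rough intuition for what generating from metadata means, the sketch below draws values for a single categorical column using only category probabilities and a missing-value proportion. It is a conceptual illustration under simplifying assumptions, not metasyn's implementation; `sample_column` is a hypothetical helper, and the probabilities are the ones printed for `Sex` earlier.

```python
import random


def sample_column(labels, probs, prop_missing, n_rows, seed=None):
    """Draw n_rows values from a categorical distribution with missing values.

    Conceptual sketch only: this is not how metasyn is implemented.
    """
    rng = random.Random(seed)
    # Draw categories according to their probabilities
    draws = rng.choices(labels, weights=probs, k=n_rows)
    # Replace each draw by None with probability prop_missing
    return [None if rng.random() < prop_missing else value for value in draws]


synthetic_sex = sample_column(
    labels=["female", "male"],
    probs=[0.35241302, 0.64758698],
    prop_missing=0.0,
    n_rows=5,
    seed=1,
)
print(synthetic_sex)
```

Only the metadata (labels, probabilities, proportion missing) enters the function, so no row of the original data is needed at generation time.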


In [15]:
# generate synthetic data
mf.synthesize(5)

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],f32
432,"""Uvy, Miss. Jop…","""female""",23.0,0,"""426628""",0.50295,,"""S""",1921-10-08,16:34:41,2022-08-05 01:14:46,
422,"""Hzqybuluy, Mr.…","""female""",19.0,0,"""4426""",12.037848,"""C2""","""S""",1909-05-11,18:15:32,,
645,"""Vdz, Miss. Ttn…","""female""",,2,"""346080""",67.768537,"""X8""","""S""",1912-02-05,14:26:59,2022-07-24 20:05:30,
417,"""Kurqbf, Mr. Tm…","""female""",50.0,0,"""28148""",0.616262,,"""S""",1930-09-23,17:23:54,2022-07-25 18:27:00,
786,"""Vome, Mr. Mekc…","""male""",45.0,2,"""38747""",0.002768,,"""S""",1937-11-03,14:16:40,2022-07-18 15:40:32,


The synthetic data resembles the real data in structure, but it can still be improved: for example, the names are random strings and `PassengerId` is not guaranteed to be unique. The cell below gives a brief example of such manual improvements. If you want to know more about them, take a look at our [advanced tutorial](https://colab.research.google.com/github/sodascience/metasyn/blob/main/examples/advanced_tutorial.ipynb).

In [17]:
from metasyn.distribution import (
    DiscreteUniformDistribution,
    FakerDistribution,
    LogNormalDistribution,
    RegexDistribution,
)

# Using some advanced features of metasyn
var_spec = {
    # Ensure that the PassengerId column is unique
    "PassengerId": {"unique": True},
    # Use fake names for the Name column
    "Name": {"distribution": FakerDistribution("name")},
    # Estimate / fit a lognormal distribution for Fare
    "Fare": {"distribution": LogNormalDistribution},
    # Manually set a distribution for Age
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)},
    # Manually set a regex distribution for Cabin
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF][0-9]{2,3}")},
}

# fit a MetaFrame using the variable specification
mf = MetaFrame.fit_dataframe(df, spec=var_spec)

# generate synthetic data
syn_df = mf.synthesize(len(df))
syn_df.head()

PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
i64,str,cat,i64,i64,str,f64,str,cat,date,time,datetime[μs],f32
1,"""Christopher Ma…","""male""",33.0,0,"""7009""",1.004962,"""A131""","""C""",1916-12-03,12:22:37,2022-08-10 16:40:02,
2,"""Nicole Crane""","""male""",,0,"""9323""",1.071922,,"""C""",1935-01-13,17:59:27,,
3,"""Seth Rojas""","""male""",28.0,1,"""4588""",0.331356,,"""S""",1934-09-20,11:38:14,2022-08-01 23:53:23,
4,"""Bethany Rosari…","""female""",24.0,0,"""5690""",1.396114,,"""S""",1926-10-18,17:11:01,2022-07-15 18:10:49,
5,"""Amber Peters""","""male""",,0,"""808590""",0.851275,"""F932""","""S""",,17:39:35,2022-08-12 13:53:02,
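
The specification above asks for a lognormal distribution to be fitted to `Fare`. One common way to fit a lognormal is maximum likelihood on the logarithms of the values, sketched below with the standard library. Whether metasyn's `LogNormalDistribution` uses this estimator is not stated in this tutorial, so treat this as an illustration of the general idea. The helper `fit_lognormal`, the toy data, and the choice to drop non-positive values (the real `Fare` column has a minimum of 0.0) are all assumptions.

```python
import math
import random
import statistics


def fit_lognormal(values):
    """Estimate lognormal parameters (mu, sigma) by maximum likelihood.

    Missing and non-positive values are dropped, because log(x) is undefined
    for x <= 0. This is an assumption of the sketch, not metasyn behaviour.
    """
    logs = [math.log(v) for v in values if v is not None and v > 0]
    if len(logs) < 2:
        raise ValueError("need at least two positive values")
    mu = statistics.fmean(logs)
    # The maximum-likelihood estimate uses the population standard deviation
    sigma = statistics.pstdev(logs, mu)
    return mu, sigma


# Toy data (hypothetical): draws from a lognormal with mu=2.5 and sigma=1.0
rng = random.Random(0)
toy_fares = [rng.lognormvariate(2.5, 1.0) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal(toy_fares)
print(round(mu_hat, 2), round(sigma_hat, 2))
```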


Now, let's compare the synthetic data to the real data:

In [18]:
df.describe()

describe,PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
str,f64,str,str,f64,f64,str,f64,str,str,str,str,str,str
"""count""",891.0,"""891""","""891""",891.0,891.0,"""891""",891.0,"""891""","""891""","""891""","""891""","""891""","""891"""
"""null_count""",0.0,"""0""","""0""",177.0,0.0,"""0""",0.0,"""687""","""2""","""78""","""79""","""92""","""891"""
"""mean""",446.0,,,29.693277,0.381594,,32.204208,,,,,,
"""std""",257.353842,,,14.524527,0.806057,,49.693429,,,,,,
"""min""",1.0,"""Abbing, Mr. An…",,0.0,0.0,"""110152""",0.0,"""A10""",,"""1903-07-28""","""38380000000000…","""2022-07-15 12:…",
"""max""",891.0,"""van Melkebeke,…",,80.0,6.0,"""WE/P 5735""",512.3292,"""T""",,"""1940-05-27""","""67168000000000…","""2022-08-15 10:…",
"""median""",446.0,,,28.0,0.0,,14.4542,,,,,,
"""25%""",223.0,,,20.0,0.0,,7.8958,,,,,,
"""75%""",669.0,,,38.0,0.0,,31.0,,,,,,


In [19]:
syn_df.describe()

describe,PassengerId,Name,Sex,Age,Parch,Ticket,Fare,Cabin,Embarked,Birthday,Board time,Married since,all_NA
str,f64,str,str,f64,f64,str,f64,str,str,str,str,str,f64
"""count""",891.0,"""891""","""891""",891.0,891.0,"""891""",891.0,"""891""","""891""","""891""","""891""","""891""",891.0
"""null_count""",0.0,"""0""","""0""",174.0,0.0,"""0""",0.0,"""667""","""1""","""93""","""70""","""83""",891.0
"""mean""",446.0,,,29.44212,0.402918,,1.657678,,,,,,
"""std""",257.353842,,,5.587902,0.888347,,2.0115,,,,,,
"""min""",1.0,"""Aaron Avila""",,20.0,0.0,"""0003""",0.042881,"""A002""",,"""1903-08-01""","""38380000000000…","""2022-07-15 13:…",
"""max""",891.0,"""Zachary Skinne…",,39.0,6.0,"""ZP.A. 9850""",18.589588,"""F96""",,"""1940-05-22""","""67166000000000…","""2022-08-15 09:…",
"""median""",446.0,,,29.0,0.0,,1.057567,,,,,,
"""25%""",223.0,,,25.0,0.0,,0.519708,,,,,,
"""75%""",669.0,,,34.0,0.0,,1.968376,,,,,,

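To read the two summaries side by side, it can help to compute relative differences for a few statistics. The sketch below uses values copied by hand from the `df.describe()` and `syn_df.describe()` outputs above; the helper `relative_difference` is a hypothetical name introduced for illustration.

```python
# Summary statistics copied from the df.describe() output above
real = {
    "Age mean": 29.693277,
    "Age std": 14.524527,
    "Parch mean": 0.381594,
    "Fare mean": 32.204208,
}
# The same statistics copied from the syn_df.describe() output above
synthetic = {
    "Age mean": 29.44212,
    "Age std": 5.587902,
    "Parch mean": 0.402918,
    "Fare mean": 1.657678,
}


def relative_difference(real_value, synthetic_value):
    """Return (synthetic - real) / real."""
    return (synthetic_value - real_value) / real_value


rel_diff = {name: relative_difference(real[name], synthetic[name]) for name in real}
for name, value in rel_diff.items():
    print(f"{name}: {value:+.1%}")
```

Read this way, the `Age` mean differs by less than 1%, while its standard deviation is much smaller, which is consistent with manually setting a uniform distribution between 20 and 40. The `Fare` mean is roughly 95% lower than in the real data, which suggests the lognormal specification for `Fare` deserves a closer look.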