# Getting started with MetaSynth

In this tutorial, we will create a metadata file in the `generative metadata format` (`gmf`) from a dataset using MetaSynth, and then generate synthetic data from it. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

First, install the metasynth package in your session:

In [None]:
# uncomment the following line and run the cell to install metasynth
# %pip install metasynth

In [None]:
# import required packages
import datetime as dt
import polars as pl
from metasynth import MetaFrame, demo_file

## Step 1: Load the data into a data frame

The first step in creating the metadata is reading your dataset and converting it to a DataFrame with the correct data types. We use the [polars](https://pola.rs) dataframe library for this (but you could also use pandas!).

In [None]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
demo_types={
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=demo_types)

# check out the data
df.head()

Now, let's check the data types of our DataFrame:

In [None]:
dict(zip(df.columns, df.dtypes))

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary.

In [None]:
# you can also inspect the data a bit more with describe()
df.describe()

## Step 2: Creating a MetaFrame object from a DataFrame

A lot of work has already gone into creating a properly formatted DataFrame. That work pays off at this stage: let's convert the DataFrame to a `MetaFrame` object (stored here as `meta_dataset`) with the default options.

In [None]:
meta_dataset = MetaFrame.fit_dataframe(df)

Then, we can print the metadata to inspect it:

In [None]:
print(meta_dataset)

## Step 3: Saving the metadata in a file

After creating the metadata, we can save it to a file. The default format is `json`, meaning the file is readable by humans and computers alike. It can therefore be checked by the data controller and, when the disclosure risk is deemed to be low, shared with others.

In [None]:
file_path = "demonstration_metadata.json"
meta_dataset.to_json(file_path)

# you can now open and read the json file!
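
Because the metadata is plain JSON, it can be inspected with nothing more than Python's standard library. The sketch below is a minimal illustration, assuming only that the file is valid JSON with an object at the top level. The helper name `top_level_keys` is hypothetical and not part of MetaSynth, and the keys it reports depend on the MetaSynth version that wrote the file.

```python
import json
from pathlib import Path


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON file (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as handle:
        content = json.load(handle)
    if not isinstance(content, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(content)


# Example usage, assuming the file written in the previous cell exists:
# top_level_keys("demonstration_metadata.json")
```

A quick look at the top-level keys is a cheap first check before reading the full file for disclosure risk.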

## Step 4: Generating synthetic data from the metadata

Upon receiving this file, you can use the MetaSynth package to generate a synthetic version of the dataset:

In [None]:
new_meta_dataset = MetaFrame.from_json(file_path)
new_meta_dataset.synthesize(5)

As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. Below, we create this metadata with additional manual improvements. If you want to know more about these improvements, take a look at our [advanced tutorial](https://colab.research.google.com/github/sodascience/metasynth/blob/main/examples/advanced_tutorial.ipynb). 

In [None]:
from metasynth.distribution import DiscreteUniformDistribution, RegexDistribution, FakerDistribution

# Using some advanced features of metasynth
var_spec = {
    # Ensure that the PassengerId column is unique
    "PassengerId": {"unique": True}, 
    # Use fake names for the name column
    "Name": {"distribution": FakerDistribution("name")}, 
    # Fit a log-normal distribution
    "Fare": {"distribution": "LogNormalDistribution"},
    # Manually set a distribution for age 
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)},
    # Manually set a regex distribution for cabin
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF]\d{2,3}")}
}

# create the high-quality metadata
meta_dataset = MetaFrame.fit_dataframe(df, spec=var_spec)

# generate synthetic data
syn_df = meta_dataset.synthesize(len(df))
syn_df.head()
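
To get a feel for what the cabin pattern `[ABCDEF]\d{2,3}` allows, the following standard-library sketch draws a few strings of that shape and checks them against the regex. It is only an illustration of the pattern itself: the helper `sample_cabin` is hypothetical, and this is not a description of how MetaSynth's `RegexDistribution` samples values internally.

```python
import random
import re

# The same pattern that is passed to RegexDistribution above.
CABIN_PATTERN = re.compile(r"[ABCDEF]\d{2,3}")


def sample_cabin(rng):
    """Draw one string of the form letter + 2 or 3 digits (illustrative only)."""
    letter = rng.choice("ABCDEF")
    n_digits = rng.choice([2, 3])
    digits = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return letter + digits


rng = random.Random(42)
samples = [sample_cabin(rng) for _ in range(5)]

# Every sample matches the whole pattern; a value such as "G12" would not.
all_match = all(CABIN_PATTERN.fullmatch(s) for s in samples)
```

Values such as `C85` or `B123` fit the pattern, while a letter outside `A` to `F` or a single digit does not.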

Now, let's compare the synthetic data to the real data:

In [None]:
df.describe()

In [None]:
syn_df.describe()
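
Beyond reading the two `describe()` tables side by side, a single column can be compared directly. The sketch below uses only the standard library and toy numbers that are not taken from the demo dataset; the helper `compare_column` is hypothetical and not part of MetaSynth.

```python
import statistics


def compare_column(real, synthetic):
    """Return (mean, standard deviation) for a real and a synthetic column."""
    return {
        "real": (statistics.mean(real), statistics.stdev(real)),
        "synthetic": (statistics.mean(synthetic), statistics.stdev(synthetic)),
    }


# Toy values for illustration only; they do not come from the demo file.
real_values = [20, 30, 40]
synthetic_values = [25, 30, 35]
summary = compare_column(real_values, synthetic_values)

# With the DataFrames from this tutorial, the inputs could be built with, e.g.:
# df["Fare"].drop_nulls().to_list() and syn_df["Fare"].drop_nulls().to_list()
```

In this toy case the means agree while the spread differs, which is the kind of gap that a manually specified distribution can introduce.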