# Getting started with MetaSynth

In this tutorial, we create a `MetaFrame`, which is a metadata representation of a given dataset, and then generate synthetic data from it. This example workflow starts from a `.csv` file as input, but it is easily adapted to other formats.

## Step 0: Install the metasynth package and import required packages
First, install the metasynth package in your session:

In [None]:
# uncomment the following line and run the cell to install metasynth
# %pip install metasynth

In [None]:
# import required packages
import datetime as dt
import polars as pl
from metasynth import MetaFrame, demo_file

## Step 1: Load the data into a data frame

The first step in creating the metadata is reading your dataset and converting it to a DataFrame with the correct data types. We use the [Polars](https://pola.rs) dataframe library for this (but you could also use Pandas!).

In [None]:
# get the path of the demo csv
demo_file_path = demo_file()

# read the data with the correct categorical variables
data_types = {
    "Sex": pl.Categorical,
    "Embarked": pl.Categorical
}
df = pl.read_csv(demo_file_path, try_parse_dates=True, dtypes=data_types)

# check out the data
df.head()

Now, let's check the data types of our DataFrame:

In [None]:
dict(zip(df.columns, df.dtypes))

We see that most variables are now nicely specified as strings, categories, dates and ints where necessary. We can also inspect the data a bit more with `describe()`.

In [None]:
df.describe()

## Step 2: Creating a MetaFrame object from a DataFrame

A lot of work has already gone into creating a properly formatted DataFrame. That work pays off at this stage: let's convert the DataFrame to a MetaFrame structure with the default options.

In [None]:
mf = MetaFrame.fit_dataframe(df)

Then, we can simply print the MetaFrame to display it in an easy-to-read format:

In [None]:
print(mf)

Alternatively, we can preview the MetaFrame as it would be output to a file:

In [None]:
json_preview = repr(mf)
print(json_preview)

## Step 3: Saving the metadata in a file

After creating the MetaFrame, we can save it to a file. The default format is `JSON`, which is easy for both humans and computers to read. This allows you to manually inspect the metadata file and verify that no sensitive information would be shared. If the disclosure risk is deemed low, the JSON file can then be securely provided to others for exploratory analysis or other uses without exposing private data.

In [None]:
# save the metadata to a file
file_path = "demonstration_metadata.json"
mf.to_json(file_path)

# you can now open and read the json file!
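
For the manual inspection mentioned above, it can help to flatten the file into a list of every value it contains, so nothing is overlooked while scanning for sensitive entries. The following is a minimal sketch using only the standard library. The helper `list_leaf_values` and the small `example_metadata` dictionary are hypothetical stand-ins for illustration; the sketch makes no assumption about the actual structure of the file written by `mf.to_json`.

```python
import json
import tempfile
from pathlib import Path


def list_leaf_values(node, prefix=""):
    """Recursively collect (path, value) pairs for every leaf in parsed JSON."""
    if isinstance(node, dict):
        pairs = []
        for key, value in node.items():
            pairs.extend(list_leaf_values(value, f"{prefix}/{key}"))
        return pairs
    if isinstance(node, list):
        pairs = []
        for index, value in enumerate(node):
            pairs.extend(list_leaf_values(value, f"{prefix}[{index}]"))
        return pairs
    # Anything that is not a dict or list is a leaf value.
    return [(prefix, node)]


# Hypothetical stand-in for an exported metadata file; the real file has its
# own structure, which this helper does not depend on.
example_metadata = {
    "n_rows": 100,
    "vars": [{"name": "Age", "prop_missing": 0.2}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example_metadata.json"
    example_path.write_text(json.dumps(example_metadata), encoding="utf-8")

    # Read the file back, as you would with a real exported file.
    with open(example_path, encoding="utf-8") as handle:
        metadata = json.load(handle)

leaves = list_leaf_values(metadata)
for path, value in leaves:
    print(f"{path}: {value!r}")
```

To inspect the file saved in this step, open `file_path` instead of the temporary example file and review the printed values one by one.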

## Step 4: Generating synthetic data from the metadata

A previously exported MetaFrame (.json) file can be loaded into a MetaFrame object. 

In [None]:
# load previously exported MetaFrame (.json) file
mf = MetaFrame.from_json(file_path)

Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to generate as a parameter and returns a DataFrame with the synthetic data.
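
To build some intuition for what generating data from metadata means, here is a toy sketch in plain Python. It is not metasynth's implementation: it assumes that each column is drawn independently from its own distribution, with a fixed proportion of missing values. The function `synthesize_rows`, the `toy_specs` dictionary, and its `prop_missing` and `sampler` keys are all hypothetical names chosen for this illustration.

```python
import random


def synthesize_rows(column_specs, n_rows, seed=None):
    """Draw n_rows rows, sampling every column independently of the others."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in column_specs.items():
            # With probability prop_missing the value is missing (None).
            if rng.random() < spec["prop_missing"]:
                row[name] = None
            else:
                row[name] = spec["sampler"](rng)
        rows.append(row)
    return rows


# Hypothetical per-column metadata: a missingness proportion and a sampler.
toy_specs = {
    "Age": {
        "prop_missing": 0.2,
        "sampler": lambda rng: rng.randint(20, 40),
    },
    "Sex": {
        "prop_missing": 0.0,
        "sampler": lambda rng: rng.choice(["male", "female"]),
    },
}

toy_rows = synthesize_rows(toy_specs, n_rows=5, seed=1)
for row in toy_rows:
    print(row)
```

Because the columns in this sketch are sampled separately, relationships between columns in the original data are not reproduced; only the per-column information stored in the metadata is used.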

In [None]:
# generate synthetic data
mf.synthesize(5)

As you can see, the fake data looks a lot like the real data! However, it could still use some improvement. Below is a brief example of such manual improvements. If you want to know more about them, take a look at our [advanced tutorial](https://colab.research.google.com/github/sodascience/metasynth/blob/main/examples/advanced_tutorial.ipynb).

In [None]:
from metasynth.distribution import DiscreteUniformDistribution, RegexDistribution, FakerDistribution

# Using some advanced features of metasynth
var_spec = {
    # Ensure that the PassengerId column is unique
    "PassengerId": {"unique": True}, 
    # Use fake names for the name column
    "Name": {"distribution": FakerDistribution("name")}, 
    # Estimate / fit a log-normal distribution
    "Fare": {"distribution": "LogNormalDistribution"},
    # Manually set a distribution for age 
    "Age": {"distribution": DiscreteUniformDistribution(20, 40)},
    # Manually set a regex distribution for cabin
    "Cabin": {"distribution": RegexDistribution(r"[ABCDEF][0-9]{2,3}")}
}

# create the high-quality metadata
mf = MetaFrame.fit_dataframe(df, spec=var_spec)

# generate synthetic data
syn_df = mf.synthesize(len(df))
syn_df.head()
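
A quick way to see which values the regex used for `Cabin` allows is to test a few strings against it with the standard `re` module. This is a small sketch: the helper `non_matching` and the example values are hypothetical and only illustrate the pattern `[ABCDEF][0-9]{2,3}`, which is one letter from A to F followed by two or three digits.

```python
import re

# Same pattern as passed to RegexDistribution for the Cabin column.
CABIN_PATTERN = re.compile(r"[ABCDEF][0-9]{2,3}")


def non_matching(values, pattern=CABIN_PATTERN):
    """Return the non-missing values that do not fully match the pattern."""
    return [
        value
        for value in values
        if value is not None and pattern.fullmatch(value) is None
    ]


# Hypothetical example values, including a missing entry (None).
example_cabins = ["A12", "F345", None, "G10", "B7", "C1234"]
bad_values = non_matching(example_cabins)
print(bad_values)  # ['G10', 'B7', 'C1234']
```

The same helper could be applied to the synthetic column, for example with `non_matching(syn_df["Cabin"].to_list())`, in which case an empty list would be the expected result.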

Now, let's compare the synthetic data to the real data:

In [None]:
df.describe()

In [None]:
syn_df.describe()
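
Reading two `describe()` tables side by side can be tedious, so it may help to compute the differences directly. Below is a library-independent sketch for a single numeric column using the standard `statistics` module. The helpers `summarize` and `compare_summaries` are hypothetical, and the two lists are made-up values for illustration, not taken from the demo dataset.

```python
import statistics


def summarize(values):
    """Summary statistics for a numeric column, ignoring missing values."""
    present = [value for value in values if value is not None]
    return {
        "count": len(present),
        "null_count": len(values) - len(present),
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def compare_summaries(real, synthetic):
    """Return synthetic minus real for every summary statistic."""
    real_summary = summarize(real)
    synthetic_summary = summarize(synthetic)
    return {
        key: synthetic_summary[key] - real_summary[key]
        for key in real_summary
    }


# Made-up example columns with one missing value each.
real_ages = [22.0, 38.0, 26.0, None, 35.0]
synthetic_ages = [24.0, 31.0, None, 29.0, 36.0]

differences = compare_summaries(real_ages, synthetic_ages)
for name, difference in differences.items():
    print(f"{name}: {difference:+.2f}")
```

Differences close to zero suggest that the synthetic column resembles the real one on that statistic. With the data frames from this tutorial, a column can be passed in as a list, for example `df["Age"].to_list()` and `syn_df["Age"].to_list()`.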