# Setting descriptions in MetaSynth

This example workflow shows how to use MetaSynth to create a synthetic metadata file, or synthetic data, from a CSV file. It is easily adapted to xls and other formats.

You can run this notebook by checking out the MetaSynth repo and installing metasynth with `pip install .`.

It shows some of the more advanced abilities of MetaSynth, such as handling dates, setting distributions and ensuring uniqueness in columns.

In [1]:
import datetime as dt
import pandas as pd
from metasynth import MetaDataset, MetaVar
from metasynth.distribution import DiscreteUniformDistribution
from pprint import pprint

### Define the pandas types for each column before reading the CSV file

Passing the types to `read_csv` is the easiest approach, though the column types can also be fixed after reading the CSV file.

In [2]:
dtypes = {
    "Survived": "category",  # Categories should be assigned this type.
    "Name": "string",  # Strings should be assigned like this
    "Age": "Int64",  # Integer columns that have NA's in them should be explicitly nullable integers.
    "Sex": "category",
    "SibSp": "category",
    "Parch": "category",
    "Ticket": "string",
    "Cabin": "string",
    "Embarked": "category",
}

### Read the CSV from a file

In [3]:
df = pd.read_csv("demonstration.csv", dtype=dtypes)
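
If a file has already been read without a `dtype` mapping, the types can be converted afterwards. The following is a minimal sketch using `DataFrame.astype`; the small in-memory CSV and its second row are made up for illustration and only stand in for `demonstration.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file that was read without a dtype mapping.
csv_text = """Name,Sex,Age
"Braund, Mr. Owen Harris",male,22
"Doe, Ms. Jane",female,
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Convert the columns after reading, using the same kind of mapping as above.
fixed = raw.astype({"Name": "string", "Sex": "category", "Age": "Int64"})
print(fixed.dtypes)
```

The missing age is kept as a missing value in the nullable `Int64` column rather than forcing the column to floats.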

### Inspect the original DataFrame

Let's first see what the original DataFrame looks like:

In [4]:
pd.set_option('display.max_rows', 5)
df

|     | PassengerId | Name | Sex | Age | Parch | Fare | Cabin | Embarked | Birthday | Board time | Married since | all_NA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Braund, Mr. Owen Harris | male | 22 | 0 | 7.2500 |  | S | 1937-10-28 | 15:53:04 | 2022-08-05 04:43:34 |  |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 0 | 71.2833 | C85 | C |  | 12:26:00 | 2022-08-07 01:56:33 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 889 | 890 | Behr, Mr. Karl Howell | male | 26 | 0 | 30.0000 | C148 | C | 1905-04-16 | 13:37:08 | 2022-07-23 08:04:22 |  |
| 890 | 891 | Dooley, Mr. Patrick | male | 32 | 0 | 7.7500 |  | Q |  | 17:21:32 | 2022-08-01 11:07:59 |  |


### Adjust columns with dates/date-times/times

The date, time, and datetime columns have to be cast manually. Since these columns were written in ISO format, they can be parsed with the `fromisoformat` method. If they were written in a different format, consult the `datetime` library documentation on how to convert the strings to datetime/time/date objects.

In [5]:
df["Birthday"] = [dt.date.fromisoformat(x) if not pd.isna(x) else pd.NA for x in df["Birthday"]]
df["Board time"] = [dt.time.fromisoformat(x) if not pd.isna(x) else pd.NA for x in df["Board time"]]
df["Married since"] = [dt.datetime.fromisoformat(x) if not pd.isna(x) else pd.NA for x in df["Married since"]]
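
For columns that are not in ISO format, `datetime.strptime` with an explicit format string is one option. The sketch below assumes day/month/year dates and dot-separated times; the raw values, the helper names `parse_date` and `parse_time`, and the use of `None` (instead of `pd.NA`) for missing entries are illustrative choices, not part of the original data.

```python
import datetime as dt

# Hypothetical raw values in non-ISO formats; None marks a missing entry.
raw_birthdays = ["28/10/1937", None, "16/04/1905"]
raw_board_times = ["15.53.04", "12.26.00", None]


def parse_date(value, fmt="%d/%m/%Y"):
    """Parse a date string with an explicit format; pass missing values through."""
    if value is None:
        return None
    return dt.datetime.strptime(value, fmt).date()


def parse_time(value, fmt="%H.%M.%S"):
    """Parse a time string with an explicit format; pass missing values through."""
    if value is None:
        return None
    return dt.datetime.strptime(value, fmt).time()


birthdays = [parse_date(x) for x in raw_birthdays]
board_times = [parse_time(x) for x in raw_board_times]
```

The resulting lists hold `date` and `time` objects, just like the output of `fromisoformat`, so they could be assigned to DataFrame columns in the same way.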

### Set descriptions while processing the DataFrame.

In [6]:
from metasynth.distribution import FakerDistribution

meta_dataset = MetaDataset.from_dataframe(
    df,
    spec={
        # A distribution can be given as an instance...
        "Name": {"distribution": FakerDistribution("name")},
        # ...or by its name as a string.
        "Fare": {"distribution": "LogNormalDistribution"},
        "Age": {"distribution": DiscreteUniformDistribution(20, 40)},
        # Mark the column as containing unique values.
        "PassengerId": {"unique": True},
        # A description can be set together with the distribution.
        "Cabin": {"distribution": FakerDistribution("city"), "description": "The cabin number of the passenger."},
    }
)
pprint(meta_dataset.descriptions)

{'Cabin': 'The cabin number of the passenger.'}


### Set descriptions for specific variables.

In [7]:
meta_dataset["PassengerId"].description = "Passenger ID assigned by pandas."
pprint(meta_dataset.descriptions)

{'Cabin': 'The cabin number of the passenger.',
 'PassengerId': 'Passenger ID assigned by pandas.'}


### Set multiple descriptions at the same time.

In [8]:
meta_dataset.descriptions = {"Name": "Name of the passenger", "Age": "Age of the passenger in years"}
pprint(meta_dataset.descriptions)

{'Age': 'Age of the passenger in years',
 'Cabin': 'The cabin number of the passenger.',
 'Name': 'Name of the passenger',
 'PassengerId': 'Passenger ID assigned by pandas.'}


### Set all descriptions at the same time.

In [9]:
meta_dataset.descriptions = [var.name for var in meta_dataset.meta_vars]
pprint(meta_dataset.descriptions)

{'Age': 'Age',
 'Birthday': 'Birthday',
 'Board time': 'Board time',
 'Cabin': 'Cabin',
 'Embarked': 'Embarked',
 'Fare': 'Fare',
 'Married since': 'Married since',
 'Name': 'Name',
 'Parch': 'Parch',
 'PassengerId': 'PassengerId',
 'Sex': 'Sex',
 'all_NA': 'all_NA'}
