# Usage of datafaker

### Import two classes from datafaker
1. DatasetDestriber infers the domain of each column in a dataset.
2. SyntheticDataGenerator generates synthetic data according to the dataset description.

In [1]:
from datafaker import DatasetDestriber, SyntheticDataGenerator

### Data types
datafaker currently supports four basic data types.

| data type | example |
|-----------|---------|
| integer   | id, age, ...|
| float     | score, rating, ...|
| string    | first name, gender, ...|
| datetime  | birthday, event time, ...|

Data types can be supplied as part of the input. Any that are not supplied are inferred from the dataset.
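As a rough illustration of what inferring a data type from a column can look like, here is a minimal sketch using pandas. The helper name `infer_data_type` and its rules are assumptions made for this example; they are not datafaker's actual implementation.

```python
import pandas as pd


def infer_data_type(column: pd.Series) -> str:
    """Guess one of 'integer', 'float', 'datetime', 'string' for a column.

    Hypothetical helper for illustration only; not part of datafaker.
    """
    values = column.dropna()
    if pd.api.types.is_integer_dtype(values):
        return "integer"
    if pd.api.types.is_float_dtype(values):
        return "float"
    if pd.api.types.is_datetime64_any_dtype(values):
        return "datetime"
    # Text column: call it a datetime only if every value parses as one.
    try:
        pd.to_datetime(values, errors="raise")
        return "datetime"
    except (ValueError, TypeError, OverflowError):
        return "string"


df = pd.DataFrame({
    "age": [39, 50],
    "score": [1.5, 2.0],
    "sex": ["Male", "Female"],
    "birthday": ["1990-01-01", "2001-05-17"],
})
inferred_types = {name: infer_data_type(df[name]) for name in df.columns}
```

A rule-based guess like this can be wrong for some columns (for example, numeric codes that are really categories), which is why supplying data types explicitly is useful.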

### Data description format

The domain of each column is described as follows.
- "categorical" indicates an attribute that takes a limited set of particular values, e.g., "gender", "nationality".
- Most domains are modeled by a histogram; the exception is a non-categorical "string".

|data type|categorical  |min             |max             |values             |probabilities      |values count      |missing rate|
|---------|----------|----------------|----------------|-------------------|-------------------|------------------|------------|
|int      |True/False|min             |max             |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|float    |True/False|min             |max             |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |   True   |min in length   |max in length   |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |   False  |min in length   |max in length   |0                  |0                  |0               |missing rate|
|datetime |True/False|min in timestamp|max in timestamp|x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
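To make the table concrete, the following sketch builds such a description for a single numeric column. It assumes equal-width bins from `numpy.histogram` for non-categorical columns and relative frequencies for categorical ones; the function name `describe_column` is hypothetical, and the library's own binning may differ.

```python
import numpy as np
import pandas as pd


def describe_column(column: pd.Series, categorical: bool, bins: int = 20) -> dict:
    """Summarize a numeric column in the format of the table above.

    Hypothetical helper for illustration only; not part of datafaker.
    """
    missing_rate = float(column.isna().mean())
    values = column.dropna()
    if categorical:
        # One histogram bar per distinct value.
        freq = values.value_counts(normalize=True)
        xs, ps = freq.index.tolist(), freq.tolist()
    else:
        # Equal-width bins; the x-axis holds the left edge of each bin.
        counts, edges = np.histogram(values, bins=bins)
        xs = edges[:-1].tolist()
        ps = (counts / counts.sum()).tolist()
    return {
        "min": values.min(),
        "max": values.max(),
        "values": xs,
        "probabilities": ps,
        "values count": len(xs),
        "missing rate": missing_rate,
    }


age = pd.Series([17, 25, 25, 40, 90, None])
age_description = describe_column(age, categorical=False, bins=4)
```

For string and datetime columns, the min and max would instead be computed on string lengths and timestamps, as the table indicates.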

##### Step 1: Specify the paths of the input and output files

In [2]:
input_dataset_file = './raw_data/AdultIncomeData/adult.csv'
dataset_description_file = './output/description/AdultIncomeData_description.csv'
synthetic_data_file = './output/synthetic_data/AdultIncomeData_synthetic.csv'

##### Step 2: Initialize a DatasetDescriber

In [3]:
describer = DatasetDestriber()

Initialized a dataset description generator.


##### Step 3: Generate dataset description

- description1 is inferred entirely by the code.
- description2 is also inferred, but applies the user's overrides for data types and categorical indicators.

In [4]:
description1 = describer.get_dataset_description(file_name=input_dataset_file)
description2 = describer.get_dataset_description(file_name=input_dataset_file,
                                                 column_to_datatype_dict={'education-num': 'float'},
                                                 column_to_categorical_dict={'native-country':False,'age':True})
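Conceptually, the two dictionaries passed above act as per-column overrides on top of the inferred settings. A minimal sketch of that precedence rule, using a hypothetical helper `merge_overrides` (an assumption for illustration, not a datafaker function):

```python
from typing import Optional


def merge_overrides(inferred: dict, overrides: Optional[dict] = None) -> dict:
    """Return the inferred settings with user overrides taking precedence.

    Hypothetical helper for illustration only; not part of datafaker.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(inferred)
    if unknown:
        # An override for a column that does not exist is probably a typo.
        raise KeyError(f"Unknown columns in overrides: {sorted(unknown)}")
    return {**inferred, **overrides}


inferred_datatypes = {"age": "integer", "education-num": "integer"}
final_datatypes = merge_overrides(inferred_datatypes, {"education-num": "float"})
```

Columns that the user does not mention keep their inferred settings.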

The input dataset is

In [5]:
describer.input_dataset.head()

Unnamed: 0,ID,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,2,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,3,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,4,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,5,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The dataset description inferred by code is

In [6]:
description1

column,data type,categorical,min,max,values,probabilities,values count,missing
ID,int,False,1.0,32561.0,"[-31.56, 1629.0, 3257.0, 4885.0, 6513.0, 8141....","[0.0500291760081, 0.0499984644206, 0.049998464...",20.0,0.0
age,int,False,17.0,90.0,"[16.927, 20.65, 24.3, 27.95, 31.6, 35.25, 38.9...","[0.0740149258315, 0.097048616443, 0.0755812167...",20.0,0.0
workclass,string,True,2.0,17.0,"[ Local-gov, Self-emp-not-inc, Never-worked,...","[0.0642793525997, 0.0780381437917, 0.000214981...",9.0,0.0
fnlwgt,int,False,12285.0,1484705.0,"[10812.58, 85906.0, 159527.0, 233148.0, 306769...","[0.137680046682, 0.265163846319, 0.33613832499...",20.0,0.0
education,string,True,4.0,13.0,"[ Assoc-voc, 10th, 7th-8th, Preschool, HS-...","[0.0424434139001, 0.0286539111207, 0.019839685...",16.0,0.0
education-num,int,True,1.0,16.0,"[16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...","[0.01268388563, 0.00156629096158, 0.0051595466...",16.0,0.0
marital-status,string,True,8.0,22.0,"[ Married-spouse-absent, Never-married, Sepa...","[0.0128374435675, 0.32809188907, 0.03147937716...",7.0,0.0
occupation,string,True,2.0,18.0,"[ Transport-moving, Exec-managerial, Craft-r...","[0.0490464052087, 0.124873314702, 0.1258867970...",15.0,0.0
relationship,string,True,5.0,15.0,"[ Not-in-family, Other-relative, Own-child, ...","[0.255059734038, 0.0301280673198, 0.1556463253...",6.0,0.0
race,string,True,6.0,19.0,"[ Amer-Indian-Eskimo, Black, Asian-Pac-Islan...","[0.00955130370689, 0.0959429992936, 0.03190933...",5.0,0.0


The dataset description inferred by code, with the data types and categorical indicators supplied by the user applied:
- "education-num" is of data type "float".
- "native-country" is not categorical.
- "age" is categorical.

In [7]:
description2

column,data type,categorical,min,max,values,probabilities,values count,missing
ID,int,False,1.0,32561.0,"[-31.56, 1629.0, 3257.0, 4885.0, 6513.0, 8141....","[0.0500291760081, 0.0499984644206, 0.049998464...",20.0,0.0
age,int,True,17.0,90.0,"[32, 48, 64, 80, 17, 33, 49, 65, 81, 18, 34, 5...","[0.0254291944351, 0.0166763920027, 0.006388010...",73.0,0.0
workclass,string,True,2.0,17.0,"[ Local-gov, Self-emp-not-inc, Never-worked,...","[0.0642793525997, 0.0780381437917, 0.000214981...",9.0,0.0
fnlwgt,int,False,12285.0,1484705.0,"[10812.58, 85906.0, 159527.0, 233148.0, 306769...","[0.137680046682, 0.265163846319, 0.33613832499...",20.0,0.0
education,string,True,4.0,13.0,"[ Assoc-voc, 10th, 7th-8th, Preschool, HS-...","[0.0424434139001, 0.0286539111207, 0.019839685...",16.0,0.0
education-num,float,True,1.0,16.0,"[16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...","[0.01268388563, 0.00156629096158, 0.0051595466...",16.0,0.0
marital-status,string,True,8.0,22.0,"[ Married-spouse-absent, Never-married, Sepa...","[0.0128374435675, 0.32809188907, 0.03147937716...",7.0,0.0
occupation,string,True,2.0,18.0,"[ Transport-moving, Exec-managerial, Craft-r...","[0.0490464052087, 0.124873314702, 0.1258867970...",15.0,0.0
relationship,string,True,5.0,15.0,"[ Not-in-family, Other-relative, Own-child, ...","[0.255059734038, 0.0301280673198, 0.1556463253...",6.0,0.0
race,string,True,6.0,19.0,"[ Amer-Indian-Eskimo, Black, Asian-Pac-Islan...","[0.00955130370689, 0.0959429992936, 0.03190933...",5.0,0.0


##### Step 4: Save the dataset description

In [8]:
describer.dataset_description.to_csv(dataset_description_file)

### Generate synthetic data

##### Step 1: Initialize a SyntheticDataGenerator

In [9]:
generator = SyntheticDataGenerator()

Initialized a synthetic data generator.


##### Step 2: Generate 10 rows of synthetic data

The values are sampled from the histograms in the dataset description file.

In [10]:
synthetic_dataset = generator.get_synthetic_data(dataset_description_file, N=10)
synthetic_dataset

Unnamed: 0,ID,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,29518,33,Private,303172,Bachelors,10,Never-married,Sales,Own-child,White,Male,5418,15,16,RmnbJjiZrZRdUnJnSiBw,<=50K
1,18568,44,Self-emp-not-inc,152725,10th,12,Married-civ-spouse,Farming-fishing,Unmarried,White,Male,576,35,28,WOJhVtCBDSdxKIDInyZY,<=50K
2,27064,49,Private,14808,Bachelors,9,Married-civ-spouse,Tech-support,Husband,White,Female,432,8,44,AHEmZjrJIpRxzYdOEPCJ,>50K
3,7395,28,Self-emp-not-inc,157437,Assoc-voc,13,Married-civ-spouse,Other-service,Husband,White,Female,467,1753,33,ZdESAwnzumFoCLmlDzZU,<=50K
4,31329,47,Private,235808,9th,9,Never-married,Sales,Other-relative,White,Male,569,2,10,ZhQnqtbEILwsxELJTqis,<=50K
5,12875,38,Private,236685,HS-grad,9,Never-married,Prof-specialty,Own-child,White,Male,341,34,39,GxgCfvuIYaoLvwQvxpja,<=50K
6,9743,54,Private,153450,Assoc-voc,13,Married-civ-spouse,Transport-moving,Husband,White,Female,63,5,43,WSfAbYJlOikreDOQSjqq,>50K
7,15148,35,?,234345,HS-grad,9,Married-civ-spouse,Sales,Unmarried,White,Female,394,31,59,prGwDIrJjTqgaCbDIfuM,>50K
8,26565,18,Private,149319,HS-grad,9,Separated,Craft-repair,Husband,White,Female,249,1,21,mQiWuCtkxLOFncBYcgpV,>50K
9,19452,44,Self-emp-not-inc,97171,HS-grad,13,Never-married,Sales,Husband,White,Male,33,36,46,LpOdUPsKZXldrsqytXuL,>50K
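The sampling step can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the generator's actual code: categorical values are drawn by their probabilities, numeric values by picking a bin and then a uniform point inside it, and non-categorical strings (such as "native-country" above) as random letters within the min and max length. All three function names are hypothetical.

```python
import string

import numpy as np

rng = np.random.default_rng(seed=0)


def sample_categorical(values, probabilities, n):
    """Draw n values according to their histogram probabilities."""
    return rng.choice(values, size=n, p=probabilities)


def sample_numeric(bin_left_edges, probabilities, bin_width, n):
    """Pick a bin by probability, then a uniform point inside that bin."""
    left = rng.choice(bin_left_edges, size=n, p=probabilities)
    return left + rng.uniform(0, bin_width, size=n)


def sample_random_strings(min_len, max_len, n):
    """Build n random letter strings with lengths in [min_len, max_len]."""
    letters = list(string.ascii_letters)
    lengths = rng.integers(min_len, max_len + 1, size=n)
    return ["".join(rng.choice(letters, size=k)) for k in lengths]


workclass = sample_categorical(["Private", "Local-gov"], [0.75, 0.25], 100)
ages = sample_numeric([0.0, 10.0], [0.5, 0.5], 10.0, 100)
countries = sample_random_strings(3, 5, 10)
```

Because each column is sampled independently in this sketch, correlations between columns in the input dataset are not preserved.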


##### Step 3: Introduce random missing values in a column

Here, values in column "workclass" are removed at a missing rate of 0.3.

In [11]:
generator.random_missing_on_column(col='workclass', missing=0.3)
generator.synthetic_dataset

Unnamed: 0,ID,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,29518,33,Private,303172,Bachelors,10,Never-married,Sales,Own-child,White,Male,5418,15,16,RmnbJjiZrZRdUnJnSiBw,<=50K
1,18568,44,Self-emp-not-inc,152725,10th,12,Married-civ-spouse,Farming-fishing,Unmarried,White,Male,576,35,28,WOJhVtCBDSdxKIDInyZY,<=50K
2,27064,49,,14808,Bachelors,9,Married-civ-spouse,Tech-support,Husband,White,Female,432,8,44,AHEmZjrJIpRxzYdOEPCJ,>50K
3,7395,28,Self-emp-not-inc,157437,Assoc-voc,13,Married-civ-spouse,Other-service,Husband,White,Female,467,1753,33,ZdESAwnzumFoCLmlDzZU,<=50K
4,31329,47,,235808,9th,9,Never-married,Sales,Other-relative,White,Male,569,2,10,ZhQnqtbEILwsxELJTqis,<=50K
5,12875,38,Private,236685,HS-grad,9,Never-married,Prof-specialty,Own-child,White,Male,341,34,39,GxgCfvuIYaoLvwQvxpja,<=50K
6,9743,54,Private,153450,Assoc-voc,13,Married-civ-spouse,Transport-moving,Husband,White,Female,63,5,43,WSfAbYJlOikreDOQSjqq,>50K
7,15148,35,?,234345,HS-grad,9,Married-civ-spouse,Sales,Unmarried,White,Female,394,31,59,prGwDIrJjTqgaCbDIfuM,>50K
8,26565,18,Private,149319,HS-grad,9,Separated,Craft-repair,Husband,White,Female,249,1,21,mQiWuCtkxLOFncBYcgpV,>50K
9,19452,44,,97171,HS-grad,13,Never-married,Sales,Husband,White,Male,33,36,46,LpOdUPsKZXldrsqytXuL,>50K
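For readers who want to reproduce this effect on their own DataFrame, here is a minimal pandas sketch. It assumes that exactly `round(rate * n_rows)` randomly chosen rows are blanked, which matches the 3 of 10 rows above; the library may instead decide each row independently. The function name `blank_out_fraction` is hypothetical.

```python
import numpy as np
import pandas as pd


def blank_out_fraction(df: pd.DataFrame, col: str, missing: float, seed=None) -> pd.DataFrame:
    """Return a copy of df with a random fraction of one column set to NaN.

    Hypothetical helper for illustration only; not part of datafaker.
    """
    rng = np.random.default_rng(seed)
    n_missing = int(round(len(df) * missing))
    # Choose distinct row positions to blank out.
    rows = rng.choice(len(df), size=n_missing, replace=False)
    result = df.copy()
    result.iloc[rows, result.columns.get_loc(col)] = np.nan
    return result


original = pd.DataFrame({"workclass": ["Private"] * 10, "age": range(10)})
with_missing = blank_out_fraction(original, "workclass", 0.3, seed=1)
```

Unlike the in-place call in the cell above, this sketch returns a modified copy and leaves the original DataFrame untouched.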


##### Step 4: Save the synthetic dataset

In [12]:
synthetic_dataset.to_csv(synthetic_data_file)