### An example of generating *Aggregate-Filter* queries for longitudinal multiple-child data
###### Before executing this notebook, please make sure that the data has already been imported into the database.

In [1]:
! pip install --upgrade pip
! pip install fuzzy_sql-2.0.0b0-py3-none-any.whl

Processing ./fuzzy_sql-2.0.0b0-py3-none-any.whl
fuzzy-sql is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [2]:
import json
import os
from pathlib import Path

from fuzzy_sql.generate import gen_aggfltr_queries
from fuzzy_sql.report import Report

DATASET_NAME='cms'


In [3]:
# set directories
DATA_DIR=os.path.join(os.getcwd(),'data')
DB_DIR=os.path.join(os.getcwd(),'databases')

metadata_dir = os.path.join(DATA_DIR, DATASET_NAME,'metadata')
db_path = os.path.join(DB_DIR, f'{DATASET_NAME}.db')
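
Since the notebook expects the data to be in the database already, a quick check of which tables are present can save a failed run. The following is a minimal sketch, assuming the `.db` file is a SQLite database; `list_tables` and `missing_tables` are hypothetical helpers written for this illustration and are not part of `fuzzy_sql`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def missing_tables(conn, expected):
    """Return the expected table names that are absent, in the given order."""
    present = set(list_tables(conn))
    return [name for name in expected if name not in present]


# Possible usage in this notebook (assumes db_path points to a SQLite file):
# conn = sqlite3.connect(db_path)
# print(list_tables(conn))
# conn.close()
```

Note that `sqlite3.connect` creates an empty file when the path does not exist, so an empty table list may simply mean that `db_path` is wrong.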

### GENERATING RANDOM QUERIES 

In [4]:
# Create lists of table names. The names must match the table names already created in the database.
real_tbl_lst = ['s1_ben_sum_2008', 's1_ben_sum_2009', 's1_ben_sum_2010', 's1_carrier_1a',
    's1_carrier_1b', 's1_inpatient', 's1_outpatient', 's1_prescrp']
syn_tbl_lst = ['s2_ben_sum_2008', 's2_ben_sum_2009', 's2_ben_sum_2010', 's2_carrier_1a',
    's2_carrier_1b', 's2_inpatient', 's2_outpatient', 's2_prescrp']

# Read metadata from the provided json files into a list of dictionaries.
# Note 1: Real and synthetic data share the same metadata file.
# Note 2: Each input table in real_tbl_lst above must have its own metadata file.
# Note 3: Each json file name must match the corresponding table name in real_tbl_lst.
metadata_lst = []
for tbl_name in real_tbl_lst:
    with open(os.path.join(metadata_dir, tbl_name+'.json'), 'r') as f:
        metadata_lst.append(json.load(f))
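
The loading loop above stops with `FileNotFoundError` at the first missing file. A small pre-check can report all missing metadata files at once. This is a sketch only; `missing_metadata_files` is a hypothetical helper introduced here, and it assumes the `<table name>.json` naming used in the loop.

```python
import os


def missing_metadata_files(metadata_dir, table_names):
    """Return the table names that have no '<table>.json' file in metadata_dir."""
    return [
        name
        for name in table_names
        if not os.path.isfile(os.path.join(metadata_dir, name + '.json'))
    ]


# Possible usage in this notebook:
# print(missing_metadata_files(metadata_dir, real_tbl_lst))
```

An empty list means every table in `real_tbl_lst` has a matching metadata file.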

In [5]:
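# Generate 10 random aggregate-filter queries and run them against the real and synthetic tables.
rnd_queries = gen_aggfltr_queries(10, db_path, real_tbl_lst, metadata_lst, syn_tbl_lst)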

Generated Random Aggregate Filter Query - 1 in 10.0 seconds.
Generated Random Aggregate Filter Query - 2 in 10.0 seconds.
Cant wait any further! I am skipping this one!
Generated Random Aggregate Filter Query - 3 in 4.6 seconds.
Cant wait any further! I am skipping this one!
Cant wait any further! I am skipping this one!
Cant wait any further! I am skipping this one!
Cant wait any further! I am skipping this one!
Generated Random Aggregate Filter Query - 4 in 4.3 seconds.
Generated Random Aggregate Filter Query - 5 in 4.7 seconds.
Generated Random Aggregate Filter Query - 6 in 4.3 seconds.
Cant wait any further! I am skipping this one!
Cant wait any further! I am skipping this one!
Generated Random Aggregate Filter Query - 7 in 5.0 seconds.
Generated Random Aggregate Filter Query - 8 in 4.6 seconds.
Generated Random Aggregate Filter Query - 9 in 4.2 seconds.
Cant wait any further! I am skipping this one!
Generated Random Aggregate Filter Query - 10 in 5.3 seconds.


### REPORTING 

In [6]:
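# Build the report from the generated queries.
rprtr = Report(real_tbl_lst, rnd_queries)
# Write the query results to an HTML file.
rprtr.print_html_mltpl(f'{DATASET_NAME}.html')
# Save violin plots of the Hellinger and Euclidean distances.
rprtr.plot_violin('Hellinger', f'{DATASET_NAME}_hlngr.png')
rprtr.plot_violin('Euclidean', f'{DATASET_NAME}_ecldn.png')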
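
For readers unfamiliar with the two metrics named in the plots, the sketch below shows their standard textbook definitions. It is not taken from `fuzzy_sql`, and exactly which query outputs the library feeds into each metric is not stated in this notebook; the assumption here is that the Hellinger distance compares two normalized count distributions and the Euclidean distance compares two vectors of aggregate values.

```python
import math


def normalize(counts):
    """Turn non-negative counts into probabilities that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError('counts must have a positive sum')
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical group counts returned by the same query on real and synthetic tables.
real_counts = [40, 35, 25]
syn_counts = [38, 30, 32]
print(hellinger(normalize(real_counts), normalize(syn_counts)))
```

Under these definitions, values near 0 indicate that the synthetic result closely tracks the real one.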