# Introduction

This notebook runs causal inference methods on semi-synthetic datasets generated using the ICETEA framework. 

In this example, we use the AIPW (augmented inverse probability weighting) estimator with three base models:
- inceptionV3
- resnet50 
- linear regression

The functions assume the dataset is available in a folder in TFRecord format. If you are running on Google Colab, we recommend saving the dataset on Google Drive. We also recommend using a GPU runtime for better performance.

# Imports

In [1]:
#!mkdir estimators 
!git clone --branch clean_up https://github.com/raquelaoki/icetea
!mv  -v /content/icetea/* /content/

Cloning into 'icetea'...
remote: Enumerating objects: 308, done.
remote: Counting objects: 100% (308/308), done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 308 (delta 152), reused 235 (delta 82), pack-reused 0
Receiving objects: 100% (308/308), 410.38 KiB | 6.22 MiB/s, done.
Resolving deltas: 100% (152/152), done.
renamed '/content/icetea/config_yaml' -> '/content/config_yaml'
renamed '/content/icetea/estimators' -> '/content/estimators'
renamed '/content/icetea/helper_data.py' -> '/content/helper_data.py'
renamed '/content/icetea/helper_parameters.py' -> '/content/helper_parameters.py'
renamed '/content/icetea/icetea_data_simulation.py' -> '/content/icetea_data_simulation.py'
renamed '/content/icetea/icetea_feature_extraction.py' -> '/content/icetea_feature_extraction.py'
renamed '/content/icetea/main.py' -> '/content/main.py'
renamed '/content/icetea/notebooks' -> '/content/notebooks'
renamed '/content/icetea/plots.py' -> '/content/plots.py'
renamed 

In [2]:
import cv2
import os
import pandas as pd
import logging
import tensorflow as tf
from tensorflow.io import gfile
import matplotlib.pyplot as plt
import yaml

#Local Imports 
import helper_data as hd
import icetea_feature_extraction as fe
import icetea_data_simulation as ds 
import utils 
import plots
import helper_parameters as hp

debuging = False
use_tpu = False
strategy = None

#logging.basicConfig(level=logging.DEBUG)

path_config = '/content/config_yaml/'

if debuging: 
  config_paths ={
      'path_root':'/content/drive/MyDrive/ColabNotebooks/data',
      'path_images_png':'icetea_png/sample/' ,
      'path_tfrecords':'testing/' ,
      'path_tfrecords_new':'new_data_small/',
      'path_features':'testing/' ,
      'path_results':'testing/' ,
      'path_meta':'trainLabels.csv'
  }
else:
    with open(os.path.join(path_config, 'paths.yaml')) as f:
      config_paths = yaml.safe_load(f)
    config_paths = config_paths['parameters']

from google.colab import drive
drive.mount('/content/drive')

for key in config_paths: 
  if key!= 'path_root': 
    path = os.path.join(config_paths['path_root'], config_paths[key])
  else: 
    path = config_paths['path_root']
  
  if key != 'path_meta':
    assert os.path.isdir(path), config_paths[key]+ ': Folder does not exist!'

Mounted at /content/drive


In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Jun  3 22:52:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Causal Inference 

We illustrate how to use the dataset with the AIPW estimator and three different base models.
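For readers unfamiliar with AIPW, the following is a minimal sketch of the textbook estimator of the average treatment effect, written with NumPy. It is not the ICETEA implementation: the function name `aipw_ate`, the toy data, and the use of oracle outcome models and propensity scores are all assumptions made for illustration. In the notebook, the base model (for example, resnet50) would presumably play the role of the outcome models `mu0` and `mu1`.

```python
import numpy as np


def aipw_ate(y, t, mu0, mu1, e, eps=1e-6):
    """Textbook AIPW estimate of the average treatment effect.

    y   : observed outcomes
    t   : binary treatment indicators (0 or 1)
    mu0 : predicted outcomes under control
    mu1 : predicted outcomes under treatment
    e   : propensity scores P(T=1 | X)
    """
    y, t, mu0, mu1, e = (np.asarray(a, dtype=float) for a in (y, t, mu0, mu1, e))
    # Clip propensities to avoid division by zero (hypothetical safeguard).
    e = np.clip(e, eps, 1.0 - eps)
    # Outcome-model contrast plus inverse-probability-weighted residual corrections.
    scores = (mu1 - mu0
              + t * (y - mu1) / e
              - (1.0 - t) * (y - mu0) / (1.0 - e))
    return float(scores.mean())


# Toy data with a known treatment effect of 1.0 (illustrative only).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))
t = rng.binomial(1, e_true)
y = 2.0 * x + 1.0 * t + rng.normal(scale=0.1, size=n)

# Oracle nuisance functions stand in for trained base models.
tau_hat = aipw_ate(y, t, mu0=2.0 * x, mu1=2.0 * x + 1.0, e=e_true)
print('AIPW estimate:', tau_hat)
```

The estimate is consistent if either the outcome models or the propensity scores are correct, which is why the estimator is often described as doubly robust.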

First, we load a list of all available semi-synthetic datasets. These datasets should already be stored in the TFRecord files; the data simulation phase takes care of this.

Each row of `list_of_datasets` is the seed of a semi-synthetic dataset generated with ICETEA. `sim_id` is a unique identifier of each dataset, defined by its knob, setting id, and repetition index. See icetea_feature_extraction_data_simulation.ipynb for a complete description.
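As an illustration of how a `sim_id` encodes its dataset, here is a small parser. The layout `sim_<setting_id>_<repetition>_<alpha>_<beta>_<gamma>` is an assumption inferred from the values printed for `true_tau_sorted.csv` in this notebook, not a documented format, and `SimId` and `parse_sim_id` are hypothetical names.

```python
from typing import NamedTuple


class SimId(NamedTuple):
    setting_id: str
    knob: str
    repetition: str
    alpha: float
    beta: float
    gamma: float


def parse_sim_id(sim_id: str) -> SimId:
    """Split a sim_id string into its (assumed) components."""
    parts = sim_id.split('_')
    if len(parts) != 6 or parts[0] != 'sim':
        raise ValueError(f'Unexpected sim_id layout: {sim_id!r}')
    _, setting_id, repetition, alpha, beta, gamma = parts
    return SimId(
        setting_id=setting_id,
        # Assumption: the knob is the setting id without its trailing digits.
        knob=setting_id.rstrip('0123456789'),
        repetition=repetition,
        alpha=float(alpha),
        beta=float(beta),
        gamma=float(gamma),
    )


print(parse_sim_id('sim_ks0_b0_0.1_0.5_0.5'))
```

If the assumed layout holds, the parsed fields should agree with the `setting_id`, `knob`, `repetition`, `alpha`, `beta`, and `gamma` columns of the same row.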

In [4]:
list_of_datasets = pd.read_csv(os.path.join(config_paths['path_root'], 
                                            config_paths['path_features'],
                                            'true_tau_sorted.csv'))

print(list_of_datasets.head())
print('Total: ',list_of_datasets.shape[0])

#Below we pick two seeds, indexes 0 and 1, to run the estimators:
running_indexes=[0,1]

   Unnamed: 0  gamma repetition knob setting_id                  sim_id  \
0           0    0.5         b0   ks        ks0  sim_ks0_b0_0.1_0.5_0.5   
1           1    1.0         b0   kh        kh2     sim_kh2_b0_10_0.5_1   
2           2    0.5         b0   kh        kh1   sim_kh1_b0_10_0.5_0.5   
3           3    0.5         b0   ko        ko0     sim_ko0_b0_10_0_0.5   
4           4    0.5         b0   ko        ko1   sim_ko1_b0_10_0.5_0.5   

        tau  setting  alpha  beta running  
0 -1.270433        0    0.1   0.5     yes  
1  1.000000        2   10.0   0.5    done  
2  1.000000        1   10.0   0.5    done  
3  1.000000        0   10.0   0.0    done  
4  1.000000        1   10.0   0.5    done  
Total:  270


Loading config files:

In [5]:
with open(os.path.join(path_config, 'causal_inference_setup.yaml')) as f:
    config_ci = yaml.safe_load(f)
config_ci = config_ci['parameters']

param_data = {}
param_data = utils.adding_paths_to_config(param_data, config_paths)
param_data['image_size'] = config_ci['image_size']
param_data['batch_size'] = config_ci['batch_size']
param_data['name'] = config_ci['name_data']
param_data['prefix_train'] = config_ci['prefix_train']
param_data['path_tfrecords_new'] = os.path.join(param_data['path_root'],param_data['path_tfrecords_new'])
param_data['output_name'] = config_ci['output_name'] 

param_method = {}
#param_method = utils.adding_paths_to_config(param_method, config_paths)
param_method['name_estimator'] = config_ci['name_estimator']
param_method['name_metric'] = config_ci['name_metric']
param_method['name_base_model'] = config_ci['name_base_model']
param_method['learn_prop_score'] = config_ci['learn_prop_score']
param_method['name_prop_score'] = config_ci['name_prop_score']
param_method['epochs'] = config_ci['epochs']
param_method['steps'] = config_ci['steps']
model_repetitions = config_ci.get('repetitions',1)

if debuging:
  param_method['epochs']= [1]
  param_method['steps'] = [1]
  param_data['batch_size']=4

print(param_method)
config_methods = hp.create_configs(param_method)

{'name_estimator': ['aipw'], 'name_metric': ['mse'], 'name_base_model': ['image_regression', 'resnet50', 'inceptionv3'], 'learn_prop_score': [False], 'name_prop_score': [''], 'epochs': [5], 'steps': [5]}
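
The printed `param_method` holds a list of candidate values for each key. One plausible way to turn such a dictionary into individual configurations is a Cartesian product over the lists, sketched below. This is a hypothetical stand-in: the actual behavior of `hp.create_configs` is not shown in this notebook, and `expand_grid` is an illustrative name.

```python
import itertools


def expand_grid(param_lists):
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_lists)
    combos = itertools.product(*(param_lists[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


# Values copied from the printed param_method above.
param_method = {
    'name_estimator': ['aipw'],
    'name_metric': ['mse'],
    'name_base_model': ['image_regression', 'resnet50', 'inceptionv3'],
    'learn_prop_score': [False],
    'name_prop_score': [''],
    'epochs': [5],
    'steps': [5],
}

configs = expand_grid(param_method)
for config in configs:
    print(config['name_estimator'], config['name_base_model'])
```

Under this assumption, the settings above yield three configurations, one per base model.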


Running the simulations for the selected indexes. 

Approximate time per index: about 1h 30m, depending on the GPU.

In [None]:
for i, sim_id in enumerate(list_of_datasets['sim_id']):
  if i in running_indexes: 
    print('running ', i)
    # Load the dataset with the appropriate sim_id and create a temporary
    # DataFrame to collect the results of all repetitions on this dataset.
    # The data is loaded once per sim_id, and every model configuration
    # in config_methods is run on it.
    results_one_dataset = pd.DataFrame()
    for _config in config_methods:
        _config['image_size'] = config_ci['image_size'][0]
        results_one_config = utils.repeat_experiment(param_data=param_data,
                                                      param_method=_config,
                                                      seed_data=sim_id,
                                                      seed_method=i,
                                                      use_tpu=use_tpu,
                                                      strategy=strategy)
        results_one_config['sim_id'] = sim_id
        results_one_config = pd.merge(results_one_config, list_of_datasets, how='left')
        results_one_dataset = pd.concat([results_one_dataset, results_one_config])
        # Intermediate save.
        utils.save_results(using_gc=False, params=param_data, results=results_one_dataset, i=i)
    # Final save.
    utils.save_results(using_gc=False, params=param_data, results=results_one_dataset, i=i)



In [None]:
results_one_dataset