## Requirements and Dependencies

Install the dependencies and make sure the Python version is 3.10 or later.
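The version requirement can be checked from inside the notebook before installing anything. The following is a minimal sketch; the helper name `meets_minimum_version` is illustrative and is not part of CODEX.

```python
import sys

MIN_VERSION = (3, 10)  # minimum Python version stated above


def meets_minimum_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum_version():
    found = sys.version.split()[0]
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]} or later is required, found {found}")
```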

In [2]:
# Assumes the dependency list is in a requirements.txt file in the working directory.
%pip install -r requirements.txt


In [3]:
import os
import sys
import pandas as pd
import json

sys.path.append("../")
import codex

with open("tutorial_materials/input/demo_input-dataset_eval.json") as f:
    codex_input = json.load(f)

## 1) Tabular Dataset
CODEX computes combinatorial coverage metrics over specified features of a tabular dataset, that is, a dataset organized into a table of rows and columns.

If the input data of a machine learning dataset is not tabular, tabular metadata, or tabular data built from the original data's derived features or latent space, can act as a surrogate for the input data. In this case, how meaningful the CODEX experiments are depends on how meaningful the surrogate data is for the application. Such metadata can include intrinsic features or environmental conditions of a sample.

The provided dataset should consist of the following columns:
- One column for sample IDs, a unique identifier given to each sample prior to running CODEX.
- One or more columns for the features (or meta-features), also called variables (or meta-variables), of the data.
    - Value assignment: The value in a feature column is the value that the sample in that row takes for that feature.
- At least one column for labels, or classes of samples.
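The column requirements above can be checked before a dataset is handed to CODEX. The sketch below is not part of CODEX: `check_tabular_dataset` is a hypothetical helper, and it works on plain dictionaries read with the standard `csv` module rather than on a pandas DataFrame.

```python
import csv
import io


def check_tabular_dataset(rows, id_col, feature_cols, label_cols):
    """Return a list of problems found in a list-of-dicts table (empty if none)."""
    if not rows:
        return ["dataset has no rows"]
    problems = []
    if not label_cols:
        problems.append("at least one label column is required")
    required = [id_col, *feature_cols, *label_cols]
    missing = [col for col in required if col not in rows[0]]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    ids = [row[id_col] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append(f"values in '{id_col}' are not unique")
    return problems


# First rows of the abstract example dataset used in this tutorial.
sample_csv = """id,A,B,C,D,lab
0,a2,12.561,c1,d2,l1
1,a1,4.238,c2,d3,l1
2,a1,9.700,c2,d2,l1
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(check_tabular_dataset(rows, "id", ["A", "B", "C", "D"], ["lab"]))
```

An empty list means that the ID column is unique and that every requested feature and label column is present.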

Consider this example abstract dataset containing feature columns "A", "B", "C", "D" as well as a unique sample ID column and label column(s). Multiple strengths $t$ can be provided in one input file to obtain a set of $CC_t$ outputs, one per strength.

In [4]:
dataset = pd.read_csv("tutorial_materials/datasets_tabular/abstract_native.csv").drop(
    "Unnamed: 0", axis=1
)
display(dataset)

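|     | id  | A   | B      | C   | D   | lab |
| --- | --- | --- | ------ | --- | --- | --- |
| 0   | 0   | a2  | 12.561 | c1  | d2  | l1  |
| 1   | 1   | a1  | 4.238  | c2  | d3  | l1  |
| 2   | 2   | a1  | 9.700  | c2  | d2  | l1  |
| 3   | 3   | a1  | 17.740 | c2  | d4  | l2  |
| 4   | 4   | a2  | 22.950 | c1  | d2  | l3  |
| ... | ... | ... | ...    | ... | ... | ... |
| 995 | 995 | a1  | 7.249  | c3  | d2  | l1  |
| 996 | 996 | a2  | 23.911 | c3  | d3  | l3  |
| 997 | 997 | a2  | 0.619  | c1  | d3  | l2  |
| 998 | 998 | a1  | 13.062 | c2  | d1  | l1  |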


In [5]:
id_col_name = codex_input["sample_id_column"]
label_col_name = "lab"  # codex_input['label_column']

display(
    "SAMPLE ID COLUMN",
    type(dataset[id_col_name]),
    dataset[id_col_name],
)
display("LABEL COLUMN", type(dataset[label_col_name]), dataset[label_col_name])

'SAMPLE ID COLUMN'

pandas.core.series.Series

0        0
1        1
2        2
3        3
4        4
      ... 
995    995
996    996
997    997
998    998
999    999
Name: id, Length: 1000, dtype: int64

'LABEL COLUMN'

pandas.core.series.Series

0      l1
1      l1
2      l1
3      l2
4      l3
       ..
995    l1
996    l3
997    l2
998    l1
999    l2
Name: lab, Length: 1000, dtype: object

In [6]:
display("FEATURES IN DATASET", dataset[["A", "B", "C", "D"]].head())

display("VALUES OF FEATURES")
print("Values for A in dataset:", dataset["A"].unique().tolist())
print("Values for B in dataset:", dataset["B"].unique().tolist())
print("Values for C in dataset:", dataset["C"].unique().tolist())
print("Values for D in dataset:", dataset["D"].unique().tolist())

'FEATURES IN DATASET'

|     | A   | B      | C   | D   |
| --- | --- | ------ | --- | --- |
| 0   | a2  | 12.561 | c1  | d2  |
| 1   | a1  | 4.238  | c2  | d3  |
| 2   | a1  | 9.7    | c2  | d2  |
| 3   | a1  | 17.74  | c2  | d4  |
| 4   | a2  | 22.95  | c1  | d2  |


'VALUES OF FEATURES'

Values for A in dataset: ['a2', 'a1']
Values for B in dataset: [12.561, 4.238, 9.7, 17.74, 22.95, 17.404, 27.431, 6.718, 19.498, 26.667, ...]

Note that feature B is a continuous variable.

## 2) Binning file and universe construction

CODEX implements combinatorial coverage over a defined universe that describes the input space. Because combinatorial testing is defined over discrete values, CODEX can only operate over discrete values in the data. Therefore, continuous variables in the data must be discretized. Below is an example of a universe that describes the input space of the dataset.


In [7]:
universe, dataset_df, features = codex.define_input_space(codex_input)

display(universe)

{'features': ['A', 'B', 'C', 'D'],
 'levels': [['a1', 'a2'],
  ['[0.0,10.0)', '[10,20.0)', '[20.0,30.0]'],
  ['c1', 'c2', 'c3'],
  ['d1', 'd2', 'd3', 'd4']]}
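The discretization of the continuous feature B into the three intervals of the universe above can be sketched as follows. This illustrates the idea only and is not the CODEX implementation: `assign_bin` and the tuple layout of `BINS_B` are assumptions made for the example.

```python
# Each bin: (label, lower bound, upper bound, whether the upper bound is included).
BINS_B = [
    ("[0.0,10.0)", 0.0, 10.0, False),
    ("[10,20.0)", 10.0, 20.0, False),
    ("[20.0,30.0]", 20.0, 30.0, True),
]


def assign_bin(value, bins):
    """Return the label of the first bin containing `value`, or None if none does."""
    for label, low, high, include_high in bins:
        if low <= value < high or (include_high and value == high):
            return label
    return None


# First five values of feature B from the example dataset.
for value in [12.561, 4.238, 9.7, 17.74, 22.95]:
    print(value, "->", assign_bin(value, BINS_B))
```

A value outside every interval returns `None` here, which makes out-of-range data easy to spot before computing coverage.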

A binning file specifying the order and sizes of the bins can be provided to bin the features included in coverage. This requires some knowledge on the part of the user, but it lets the user define bins that are semantically meaningful in order and size. Continuous variables in the data **require** a binning scheme in the binning file, while categorical variables can **optionally** be listed in the binning file to specify an order. Otherwise, the levels of categorical variables are learned from the data in order of appearance.

With the dataset and a binning file, a universe that describes the input space with all of its features and levels can be defined. Once constructed, this universe can be saved with the experiment results and reused to define the universe for other experiments and runs, by passing the universe filename from the universe folder in the input file.
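CODEX handles saving and reloading itself through the input file, but the idea of reusing a universe can be illustrated with a plain JSON round trip. The sketch below assumes that the file on disk mirrors the features/levels dictionary displayed in this tutorial; the filename and both helper functions are hypothetical.

```python
import json
import os
import tempfile

universe = {
    "features": ["A", "B", "C", "D"],
    "levels": [
        ["a1", "a2"],
        ["[0.0,10.0)", "[10,20.0)", "[20.0,30.0]"],
        ["c1", "c2", "c3"],
        ["d1", "d2", "d3", "d4"],
    ],
}


def save_universe(universe, path):
    """Write the universe dictionary to a JSON file."""
    with open(path, "w") as f:
        json.dump(universe, f, indent=2)


def load_universe(path):
    """Read a universe back and check that features and levels line up."""
    with open(path) as f:
        loaded = json.load(f)
    if len(loaded["features"]) != len(loaded["levels"]):
        raise ValueError("each feature needs exactly one list of levels")
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "universe-abstract.json")  # hypothetical filename
    save_universe(universe, path)
    restored = load_universe(path)

print(restored == universe)
```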

In [8]:
with open("tutorial_materials/binning/bins-abstract.txt") as f:
    display("BINNING SCHEME")
    print(f.read())

universe, dataset_df_proc, features = codex.define_input_space(codex_input)
display("PROCESSED DATASET", dataset_df_proc.head())
display("UNIVERSE", universe)

'BINNING SCHEME'

A: a1;a2
B: [0.0,10.0);[10,20.0);[20.0,30.0]
C: c1;c2;c3
D: d1;d2;d3;d4


'PROCESSED DATASET'

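|     | id  | A   | B           | C   | D   |
| --- | --- | --- | ----------- | --- | --- |
| 0   | 0   | a2  | [10,20.0)   | c1  | d2  |
| 1   | 1   | a1  | [0.0,10.0)  | c2  | d3  |
| 2   | 2   | a1  | [0.0,10.0)  | c2  | d2  |
| 3   | 3   | a1  | [10,20.0)   | c2  | d4  |
| 4   | 4   | a2  | [20.0,30.0] | c1  | d2  |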


'UNIVERSE'

{'features': ['A', 'B', 'C', 'D'],
 'levels': [['a1', 'a2'],
  ['[0.0,10.0)', '[10,20.0)', '[20.0,30.0]'],
  ['c1', 'c2', 'c3'],
  ['d1', 'd2', 'd3', 'd4']]}
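The relationship between the binning scheme and the universe printed above can be made concrete with a small parser. This is a sketch that assumes the `feature: level;level;...` layout shown in the binning scheme output; `parse_binning_text` is a hypothetical helper and not the parser CODEX uses.

```python
def parse_binning_text(text):
    """Parse lines of the form 'feature: level1;level2;...' into a universe dict."""
    features, levels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, spec = line.partition(":")
        features.append(name.strip())
        levels.append([level.strip() for level in spec.split(";") if level.strip()])
    return {"features": features, "levels": levels}


binning_text = """\
A: a1;a2
B: [0.0,10.0);[10,20.0);[20.0,30.0]
C: c1;c2;c3
D: d1;d2;d3;d4
"""
print(parse_binning_text(binning_text))
```

Splitting on `;` is safe for the interval labels because they use commas internally.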

To summarize, what the user must put in the binning file depends on the variables in the dataset:
- All variables are categorical and none require ordering: Run CODEX with the binning file passed as *null*.
- Categorical variables that *do not* require ordering, together with continuous variables: Specify bins and their order for the continuous features, and allow CODEX to learn the order of the categorical variables.
- Categorical variables that *do* require ordering, together with continuous variables: Specify bins and their order for the continuous variables, and the order for the categorical variables.

## 3) Input File

CODEX modes require an input file to run. The input file contains the information required to run the experiment, including user-defined variables and the locations of files, such as split and performance files, that mode-specific analyses require.

Consider the most complete version of the input file, containing fields for every mode:

In [9]:
with open("tutorial_materials/input/demo_input-dataset_eval.json") as f:
    codex_input = json.load(f)
    display(codex_input)

{'mode': 'dataset evaluation',
 'codex_directory': './',
 'config_id': '__DEMO-dataset_eval_results',
 'dataset_name': 'Abstract',
 'model_name': '',
 'data_directory': 'tutorial_materials/datasets_tabular/',
 'dataset_file': 'abstract_native.csv',
 'features': ['A', 'B', 'C', 'D'],
 'bin_file': 'tutorial_materials/binning/bins-abstract.txt',
 'sample_id_column': 'id',
 'universe': None,
 'use_augmented_universe': False,
 'save_universe': False,
 'counting_mode': 'label_exclusive',
 't': [1, 2, 3, 4],
 'split_folder': '',
 'split_file': '',
 'performance_folder': '',
 'performance_file': '',
 'metric': '',
 'timed_output': False}


### Input File Fields: Explanation and Designations

**Mode-generic fields**
- mode: String designating the type of CODEX experiment to run.
    - Supported CODEX modes: ['dataset evaluation', 'dataset split evaluation', 'dataset split comparison', 'performance by interaction', 'model probing', 'balanced test construction']
- codex_directory: Directory path where CODEX files reside, including binning files, dataset files, and output directories.
- config_id: Unique identifier for the experiment. It names the directory, relative to the CODEX directory, to which the results of a CODEX run are saved.
- timed_output: Boolean indicating whether to tag the CODEX output directory with the timestamp at which the experiment was executed.
- dataset_name: Common name of the dataset used in the experiment.
- model_name: Common name of the model whose performance results are used as input to the experiment.
- data_directory: Directory path, relative to the CODEX directory, where tabular dataset files reside.
- dataset_file: Filename of the tabular dataset used in the CODEX experiment.
- features: List of the names of the features to compute coverage over, as they appear in the dataset.
- bin_file: File path, relative to the CODEX directory, of the binning text file.
- sample_id_column: Name of the column designated as the sample ID column, as it appears in the dataset.
- universe_folder: Directory path, relative to the CODEX directory, where universe JSON files reside.
- universe: Filename of a universe JSON file to use. Pass null if no predefined universe is to be used.
- save_universe: Boolean indicating whether to save the universe for this experiment to the universe folder. The universe filename also defines the path it is saved to.
- use_augmented_universe: Boolean indicating whether to extend the learned universe with extra features or values that do not appear in the predefined universe or the binning scheme.
- counting_mode: Setting that defines whether the label column is included in the coverage computations. The example input above uses the value 'label_exclusive'.
    - Note: Label inclusion is currently turned off; turning it on risks instability.
- t: List of integer strengths $t$, each between 1 and $k$, where $k$ is the number of features. Combinatorial coverage is computed over the $t$-way feature combinations for each strength listed.

**Mode-specific fields**
- split_folder: Directory path, relative to the CODEX directory, where split files reside.
    - Used in: dataset split evaluation, dataset split comparison, performance by interaction, model probing modes
- split_file: Filename, or list of filenames for dataset split comparison, of the split JSON files to be used in CODEX experiments.
    - Used in: dataset split evaluation, dataset split comparison, performance by interaction, model probing modes
- performance_folder: Directory path, relative to the CODEX directory, where performance files reside.
    - Used in: dataset split evaluation, dataset split comparison, performance by interaction, model probing modes
- performance_file: Filename, or list of filenames for dataset split comparison, of the performance JSON files to be used in CODEX experiments.
    - Used in: dataset split evaluation, dataset split comparison, performance by interaction, model probing modes
- metric: String or list of strings naming the performance metrics to compute over, as they appear in the performance files.
    - Used in: dataset split evaluation, dataset split comparison, performance by interaction, model probing modes
- map_functions:
- SDCC-direction: List of SDCC set difference directions for which to plot differential performance.
    - Used in: dataset split evaluation, dataset split comparison modes
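A few of the field constraints described above can be checked before an experiment is launched. The sketch below is an illustration and not a CODEX function: `check_input` is a hypothetical helper, and it only checks the mode, the feature list, and the range of the strengths in t.

```python
SUPPORTED_MODES = [
    "dataset evaluation",
    "dataset split evaluation",
    "dataset split comparison",
    "performance by interaction",
    "model probing",
    "balanced test construction",
]


def check_input(codex_input):
    """Return a list of problems with an input dictionary (empty if none found)."""
    problems = []
    if codex_input.get("mode") not in SUPPORTED_MODES:
        problems.append(f"unsupported mode: {codex_input.get('mode')!r}")
    features = codex_input.get("features") or []
    if not features:
        problems.append("'features' must list at least one feature")
    for t in codex_input.get("t") or []:
        # Each strength must be an integer between 1 and the number of features.
        if not isinstance(t, int) or not 1 <= t <= len(features):
            problems.append(f"strength t={t!r} is outside 1..{len(features)}")
    return problems


example = {
    "mode": "dataset evaluation",
    "features": ["A", "B", "C", "D"],
    "t": [1, 2, 3, 4],
}
print(check_input(example))
```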




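## CODEX OUTPUT

CODEX creates the output folder from the CODEX directory and the config ID. Running CODEX again with the same config ID and without a timestamp tag overwrites the existing output.

CODEX produces two kinds of outputs:
- Mode-specific outputs
- Mode-generic outputs, including coverage.json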

## Running CODEX, CODEX output

The input file provides all the file locations and experiment variables that CODEX needs to run a specific experiment. When run, CODEX stores the experiment results in a directory named after the unique config ID. This directory holds the coverage results, figures, and logs for information and debugging.

In [10]:
results = codex.run(codex_input)
output_dir, strengths = codex.define_experiment_variables(codex_input)

print(os.listdir(output_dir))

['coverage.json', 'CC']



While the input file dictates the parameters of the experiment, the output file 'coverage.json' contains the results of the experiment.

**Mode-generic results:**
- universe: Dictionary containing the universe used for computing coverage
- mode: Mode of the experiment

For each t:
- count appearing interactions and total possible interactions: Number of interactions that appear in the data and total number of possible interactions
- CC: Combinatorial coverage value, the ratio of appearing interactions to total possible interactions
- ranks: Names of the t-way feature combinations, in order
- combination counts: List of lists containing the number of appearances of each interaction, one inner list per combination
- missing interactions: List of lists of the combinations and interactions that are possible in the universe but are not present in the data
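The per-strength quantities listed above can be reproduced on a small scale with a brute-force count. The sketch below is not the CODEX implementation: `combinatorial_coverage` is a hypothetical function, it uses only the first five processed rows of the example dataset, and the order in which `itertools.combinations` yields feature combinations may differ from the order CODEX reports in ranks.

```python
from itertools import combinations, product


def combinatorial_coverage(rows, universe, t):
    """Compute t-way coverage of `rows` (dicts of binned values) over `universe`."""
    levels = dict(zip(universe["features"], universe["levels"]))
    appearing, total, missing = 0, 0, []
    for combo in combinations(universe["features"], t):
        # Interactions of this feature combination that occur in the data.
        seen = {tuple(row[f] for f in combo) for row in rows}
        # Every interaction the universe allows for this combination.
        for interaction in product(*(levels[f] for f in combo)):
            total += 1
            if interaction in seen:
                appearing += 1
            else:
                missing.append(dict(zip(combo, interaction)))
    return {
        "count appearing interactions": appearing,
        "total possible interactions": total,
        "CC": appearing / total if total else 0.0,
        "missing interactions": missing,
    }


universe = {
    "features": ["A", "B", "C", "D"],
    "levels": [
        ["a1", "a2"],
        ["[0.0,10.0)", "[10,20.0)", "[20.0,30.0]"],
        ["c1", "c2", "c3"],
        ["d1", "d2", "d3", "d4"],
    ],
}
# First five rows of the processed (binned) example dataset.
rows = [
    {"A": "a2", "B": "[10,20.0)", "C": "c1", "D": "d2"},
    {"A": "a1", "B": "[0.0,10.0)", "C": "c2", "D": "d3"},
    {"A": "a1", "B": "[0.0,10.0)", "C": "c2", "D": "d2"},
    {"A": "a1", "B": "[10,20.0)", "C": "c2", "D": "d4"},
    {"A": "a2", "B": "[20.0,30.0]", "C": "c1", "D": "d2"},
]
for t in (1, 2, 3, 4):
    result = combinatorial_coverage(rows, universe, t)
    print(t, result["count appearing interactions"],
          result["total possible interactions"], round(result["CC"], 3))
```

With this universe the totals are 12, 53, 102 and 72 possible interactions for t = 1 to 4. With only five rows most interactions are missing, whereas the full 1000-row dataset covers all of them at the strengths shown in this tutorial.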

**Mode-specific results:**
- SDCC: Set difference combinatorial coverage value
    - Dataset split evaluation mode
- performance: Performance metrics for each interaction, by metric
    - Performance by interaction mode
- human readable performance: Performance metrics for each interaction, by metric, in a human-readable layout
    - Performance by interaction mode

In [11]:
# codex.output.output_json_readable(results, print_json=True)
display(results)

{'universe': {'features': ['A', 'B', 'C', 'D'],
  'levels': [['a1', 'a2'],
   ['[0.0,10.0)', '[10,20.0)', '[20.0,30.0]'],
   ['c1', 'c2', 'c3'],
   ['d1', 'd2', 'd3', 'd4']]},
 1: {'count appearing interactions': 12,
  'total possible interactions': 12,
  'CC': 1.0,
  'ranks': ['A', 'B', 'C', 'D'],
  'combination counts': [[504, 496],
   [329, 350, 321],
   [391, 287, 322],
   [243, 232, 255, 270]],
  'missing interactions': []},
 2: {'count appearing interactions': 53,
  'total possible interactions': 53,
  'CC': 1.0,
  'ranks': ['A*B', 'A*C', 'B*C', 'A*D', 'B*D', 'C*D'],
  'combination counts': [[157, 172, 180, 170, 167, 154],
   [180, 211, 149, 138, 175, 147],
   [124, 136, 131, 107, 100, 80, 98, 114, 110],
   [129, 114, 111, 121, 135, 120, 129, 141],
   [76, 82, 85, 78, 74, 80, 83, 94, 78, 92, 100, 78],
   [101, 70, 72, 93, 67, 72, 100, 75, 80, 97, 75, 98]],
  'missing interactions': []},
 3: {'count appearing interactions': 102,
  'total possible interactions': 102,
  'CC': 1.0,
 