# Experiment 1a: the influence of a stable confounder on classification performance

In this experiment, we use a simplified causal graph with three nodes (immune state, confounder, and AIRR) to show that if the distribution of a confounder is stable across the study cohort and the source and target populations, the confounder does not influence the prediction task.

The immune state is a binary variable that takes the value `True` or `False` to indicate whether an individual is diseased or healthy. In this setting the confounder also has two values: `C1` and `C2`. The AIRR is a set of sequences simulated from the values of the immune state and the confounder for the given individual.
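The dependence between the two binary variables can be sketched in plain Python. This is only an illustration, not the notebook's `get_confounder` or `get_immune_state` helpers, whose implementations are not shown here. The function names `sample_confounder`, `sample_immune_state`, and `sample_individuals` are hypothetical, and mapping a "success" draw to `C1` is an assumption. The probabilities 0.5, 0.8, and 0.2 match the parameters set in Step 1.

```python
import random


def sample_confounder(rng, p):
    """Draw the confounder. Assumption: a success (probability p) maps to C1."""
    return "C1" if rng.random() < p else "C2"


def sample_immune_state(rng, confounder, p_conf1, p_conf2):
    """Draw the immune state; its success probability is selected by the confounder."""
    p = p_conf1 if confounder == "C1" else p_conf2
    return rng.random() < p


def sample_individuals(n, p_confounder, p_conf1, p_conf2, seed=0):
    """Sample n individuals in causal order: confounder first, then immune state."""
    rng = random.Random(seed)
    individuals = []
    for _ in range(n):
        confounder = sample_confounder(rng, p_confounder)
        immune_state = sample_immune_state(rng, confounder, p_conf1, p_conf2)
        individuals.append({"confounder": confounder, "immune_state": immune_state})
    return individuals


cohort = sample_individuals(n=2000, p_confounder=0.5, p_conf1=0.8, p_conf2=0.2)

# Fraction of diseased individuals within each confounder group
states_c1 = [ind["immune_state"] for ind in cohort if ind["confounder"] == "C1"]
states_c2 = [ind["immune_state"] for ind in cohort if ind["confounder"] == "C2"]
rate_c1 = sum(states_c1) / len(states_c1)
rate_c2 = sum(states_c2) / len(states_c2)
print(f"P(diseased | C1) ~ {rate_c1:.2f}, P(diseased | C2) ~ {rate_c2:.2f}")
```

Because the immune state is drawn after the confounder and conditioned on it, the two variables are associated in every sample drawn this way, regardless of whether the sample is used for training or testing.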

Steps:

1. Simulate the training and test datasets from a causal graph that includes the confounder (implemented by implanting the 3-mer `ADR` in the repertoire) and the immune state (implemented by implanting a signal in the repertoire, where the exact 3-mer, `QPR` or `EQY`, depends on the value of the confounder).

2. Train an ML model (here: logistic regression on repertoires represented by k-mer frequencies) on the training set and assess its performance on the test set in the presence of a confounder whose distribution is stable across train and test.
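The implanting mentioned in step 1 can be pictured with a small stand-alone sketch. It is not immuneML's implanting code: `implant_motif` and `implant_in_repertoire` are hypothetical helpers, overwriting a random window is an assumed strategy, and the all-glycine placeholder sequences stand in for OLGA-generated ones.

```python
import random


def implant_motif(sequence, motif, rng):
    """Overwrite a randomly placed window of `sequence` with `motif` (length is preserved)."""
    if len(sequence) < len(motif):
        raise ValueError("sequence is shorter than the motif")
    start = rng.randrange(len(sequence) - len(motif) + 1)
    return sequence[:start] + motif + sequence[start + len(motif):]


def implant_in_repertoire(sequences, motif, rate, rng):
    """Implant `motif` into round(rate * len(sequences)) randomly chosen sequences."""
    n_implanted = round(rate * len(sequences))
    chosen = set(rng.sample(range(len(sequences)), n_implanted))
    return [
        implant_motif(seq, motif, rng) if i in chosen else seq
        for i, seq in enumerate(sequences)
    ]


rng = random.Random(42)

# Placeholder repertoire: 500 identical sequences that contain none of the motifs
naive_repertoire = ["G" * 12 for _ in range(500)]

# Immune state signal at a 2% implanting rate
with_signal = implant_in_repertoire(naive_repertoire, "QPR", rate=0.02, rng=rng)
n_with_signal = sum("QPR" in seq for seq in with_signal)
print(f"{n_with_signal} of {len(with_signal)} sequences carry the motif")
```

The sketch makes the role of the implanting rate explicit: it fixes how many sequences in a repertoire carry the signal, so a low rate leaves most of the repertoire unchanged.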

Software used:

- DagSim for simulation of the causal graph
- immuneML v2.1 for implanting signals in AIRRs and for training and assessing machine learning classifiers
- OLGA for simulation of naive AIRRs

In [9]:
import yaml
import dagsim.base as ds
import numpy as np
from util.dataset_util import make_AIRR_dataset, make_dataset, setup_path
from util.implanting import make_immune_state_signals, make_confounding_signal
from util.experiment1 import get_immune_state, get_confounder, get_repertoire

## Step 1: AIRR simulation from a causal graph

In [10]:
result_path = setup_path('./experiment1a/')
data_path = setup_path("./experiment1a/data/")

# how many repertoires to make for training and testing
train_example_count = 200
test_example_count = 100

# immune state: two binomial distributions depending on the confounder value with probability of success p
immune_state_p_conf1 = 0.8 # for confounder = C1
immune_state_p_conf2 = 0.2 # for confounder = C2

# confounder: binomial distribution with probability of success p
confounder_p_train = 0.5
confounder_p_test = 0.5

# other parameters
immune_state_implanting_rate = 0.02 # fraction of repertoire sequences that include the immune state signal
confounding_implanting_rate = 0.2 # fraction of repertoire sequences that include the confounding signal
sequence_count = 500 # number of sequences in one repertoire

Removing experiment1a...
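The parameters above imply a few quantities that are worth working out before simulating. The sketch below assumes that `confounder_p_train` and `confounder_p_test` are the probabilities of `C1` and that an implanting rate is the fraction of sequences in a repertoire that receive the signal. The function `marginal_prevalence` is a hypothetical helper introduced only for this calculation.

```python
def marginal_prevalence(p_c1, p_state_given_c1, p_state_given_c2):
    """P(immune_state = True), obtained by summing over the two confounder values."""
    return p_c1 * p_state_given_c1 + (1 - p_c1) * p_state_given_c2


# Same confounder distribution in train and test, so the prevalence is the same too
prevalence_train = marginal_prevalence(0.5, 0.8, 0.2)
prevalence_test = marginal_prevalence(0.5, 0.8, 0.2)

# Expected number of signal-carrying sequences in a repertoire of 500 sequences
sequences_per_repertoire = 500
expected_immune_state_sequences = 0.02 * sequences_per_repertoire
expected_confounder_sequences = 0.2 * sequences_per_repertoire

print(f"prevalence (train): {prevalence_train:.2f}")
print(f"prevalence (test): {prevalence_test:.2f}")
print(f"immune state signal: ~{expected_immune_state_sequences:.0f} sequences per repertoire")
print(f"confounding signal: ~{expected_confounder_sequences:.0f} sequences per repertoire")
```

With these settings the two classes are balanced overall (0.5 · 0.8 + 0.5 · 0.2 = 0.5), while within each confounder group the classes are strongly imbalanced. Under the stated assumption about implanting rates, the confounding signal is also about ten times more abundant in a repertoire than the immune state signal.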


In [11]:
immune_state_signal_name = "immune_state"
immune_state_signals = make_immune_state_signals(signal_name=immune_state_signal_name)
confounding_signal = make_confounding_signal()

# define nodes of the causal graph

index_node = ds.Generic(name="index", function=np.arange, size_field="stop")

confounder_node = ds.Generic(name="confounder", function=get_confounder, arguments={"p": confounder_p_train})

immune_state_node = ds.Generic(name="immune_state", function=get_immune_state, 
                               arguments={"confounder": confounder_node, "p_conf1": immune_state_p_conf1,
                                         "p_conf2": immune_state_p_conf2})

repertoire_node = ds.Generic(name="repertoire", function=get_repertoire, 
                             arguments={"immune_state": immune_state_node, "confounder": confounder_node, 
                                        "path": data_path / "train", "sequence_count": sequence_count, 
                                        "immune_state_signals": immune_state_signals, 
                                        "confounding_signal": confounding_signal, "seed": index_node, 
                                        'immune_state_implanting_rate': immune_state_implanting_rate, 
                                        'confounding_implanting_rate': confounding_implanting_rate})

# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_1a", 
                 list_nodes=[index_node, confounder_node, immune_state_node, repertoire_node])
graph.draw()






In [12]:
# simulate a dataset using the graph

study_cohort_data = graph.simulate(num_samples=train_example_count, 
                                   csv_name=str(data_path / "train/study_cohort"))

# make an AIRR dataset from the generated repertoires to be used for training

train_dataset = make_dataset(repertoire_paths=study_cohort_data["repertoire"], path=data_path / 'train', 
                             dataset_name="experiment1a_train", 
                             signal_names=[immune_state_signal_name, confounder_node.name])

# make a test dataset

repertoire_node.additional_parameters['path'] = data_path / 'test' # update result_path: to be removed with DagSim update
confounder_node.additional_parameters['p'] = confounder_p_test # update the confounder distribution parameter for test

test_data = graph.simulate(num_samples=test_example_count, csv_name=str(data_path / "test/test_cohort"))

test_dataset = make_dataset(repertoire_paths=test_data["repertoire"], path=data_path / 'test', 
                            dataset_name="experiment1a_test", 
                            signal_names=[immune_state_signal_name, confounder_node.name])

# merge datasets (but the distinction between train and test will be kept)

dataset = make_AIRR_dataset(train_dataset, test_dataset, data_path / 'full_dataset')

Simulation started
Simulation finished in 201.5333 seconds
Simulation started
Simulation finished in 101.0018 seconds
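Since the experiment depends on the confounder distribution being the same in both datasets, it can be useful to compare the empirical proportions after simulation. The sketch below is a stand-alone check with hypothetical helper names (`value_proportions`, `max_proportion_gap`) and placeholder values, not results from this experiment. It could be applied to the confounder columns of the simulated data, assuming they are available as plain lists of values.

```python
from collections import Counter


def value_proportions(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def max_proportion_gap(train_values, test_values):
    """Largest absolute difference in the proportion of any value between two samples."""
    train = value_proportions(train_values)
    test = value_proportions(test_values)
    all_values = set(train) | set(test)
    return max(abs(train.get(v, 0.0) - test.get(v, 0.0)) for v in all_values)


# Placeholder values for illustration only (not output of the simulation above)
train_confounder = ["C1"] * 102 + ["C2"] * 98
test_confounder = ["C1"] * 49 + ["C2"] * 51

gap = max_proportion_gap(train_confounder, test_confounder)
print(f"largest difference in confounder proportions: {gap:.3f}")
```

A small gap is expected when both datasets are drawn with the same confounder parameter. With 200 and 100 examples, some sampling variation around 0.5 is normal.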


## Step 2: train an ML model and assess performance
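Before the model is trained, each repertoire is turned into a vector of k-mer frequencies. The following is a simplified sketch of that idea using relative frequencies of overlapping 3-mers. It is not immuneML's `KmerFrequency` encoder, which has its own normalization and scaling options that are not reproduced here. The function `kmer_frequencies` and the two example sequences are hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Relative frequencies of all overlapping k-mers across the sequences of one repertoire."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# Toy repertoire: one sequence carries QPR, the other carries ADR
repertoire = ["CASSQPRF", "CASSADRF"]
frequencies = kmer_frequencies(repertoire, k=3)

# List the k-mers from most to least frequent
for kmer, freq in sorted(frequencies.items(), key=lambda item: (-item[1], item[0])):
    print(f"{kmer}: {freq:.3f}")
```

In this representation an implanted 3-mer such as `QPR` or `ADR` corresponds to a single feature, which is why the coefficients of a linear model can be compared directly with the implanted signals.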

In [13]:
specs = {
    "definitions": {
        "datasets": {
            "dataset1": {
                "format": 'AIRR',
                "params": {
                    "path": str(data_path / 'full_dataset'),
                    "metadata_file": str(data_path / 'full_dataset/metadata.csv')
                }
            }
        },
        "encodings": {
            "kmer_frequency": {
                "KmerFrequency": {"k": 3}
            }
        },
        "ml_methods": {
            "logistic_regression": {
                "LogisticRegression": {
                    "penalty": "l1",
                    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    "show_warnings": False
                },
                "model_selection_cv": True,
                "model_selection_n_folds": 5
            }
        },
        "reports": {
            "motif_recovery": { # to check how much coefficients overlap with the immune state signal that was implanted
                "MotifSeedRecovery": {
                    "implanted_motifs_per_label": {
                        "immune_state": {
                            "seeds": ["EQY", "QPR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        },
                        "confounder": {
                            "seeds": ["ADR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        }
                    }
                }
            },
            "coefficients": {
                "Coefficients": { # show top 25 logistic regression coefficients and what k-mers they correspond to
                    "coefs_to_plot": ['n_largest'],
                    "n_largest": [25]
                }
            },
            "feature_comparison": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "color_grouping_label": "confounder",
                    "show_error_bar": False,
                    "keep_fraction": 0.1,
                    "log_scale": True
                }
            }, 
            "training_performance": {
                "TrainingPerformance": {
                    "metrics": ["balanced_accuracy", "log_loss", "auc"]
                }
            }
        }
    },
    "instructions": {
        'train_ml': {
            "type": "TrainMLModel",
            "assessment": { # ensure here that train and test dataset are fixed, as per simulation
                "split_strategy": "manual",
                "split_count": 1,
                "manual_config": {
                    "train_metadata_path": str(data_path / "train/experiment1a_train_metadata.csv"),
                    "test_metadata_path": str(data_path / "test/experiment1a_test_metadata.csv")
                },
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison"]
                }
            },
            "selection": {
                "split_strategy": "k_fold",
                "split_count": 5,
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison"]
                }
            },
            "settings": [
                {"encoding": "kmer_frequency", "ml_method": "logistic_regression"}
            ],
            "dataset": "dataset1",
            "refit_optimal_model": False,
            "labels": ["immune_state"],
            "optimization_metric": "balanced_accuracy",
            "metrics": ['log_loss', 'auc']
        }
    }
}

ml_result_path = setup_path("./experiment1a/ml_result/")
specification_path = ml_result_path / "specs.yaml"

with open(specification_path, "w") as file:
    yaml.dump(specs, file)
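The specification uses `balanced_accuracy` as the optimization metric. For a binary label this is commonly defined as the mean of sensitivity and specificity. The sketch below implements that definition with a hypothetical function `balanced_accuracy` and made-up labels; it illustrates the metric and is not the code immuneML runs internally.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on True) and specificity (recall on False)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Made-up labels: four diseased and two healthy individuals
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, False, True]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy: {score:.3f}, accuracy: {plain_accuracy:.3f}")
```

Unlike plain accuracy, this metric weights both classes equally, which keeps it interpretable when the class proportions differ between subsets of the data, for example within one confounder group.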

In [14]:
# run immuneML with the specs file

from immuneML.app.ImmuneMLApp import ImmuneMLApp

output_path = ml_result_path / "result/"

app = ImmuneMLApp(specification_path=specification_path, result_path=output_path)
result = app.run()

print("The results are located under ./experiment1a/")

2022-01-04 10:44:03.596881: Setting temporary cache path to experiment1a/ml_result/result/cache
2022-01-04 10:44:03.597840: ImmuneML: parsing the specification...

2022-01-04 10:44:25.261557: Full specification is available at experiment1a/ml_result/result/full_specs.yaml.

2022-01-04 10:44:25.262666: ImmuneML: starting the analysis...

2022-01-04 10:44:25.263442: Instruction 1/1 has started.
2022-01-04 10:44:25.410154: Training ML model: running outer CV loop: started split 1/1.

2022-01-04 10:44:25.546428: Hyperparameter optimization: running the inner loop of nested CV: selection for label immune_state (label 1 / 1).

2022-01-04 10:44:25.548360: Evaluating hyperparameter setting: kmer_frequency_logistic_regression...
2022-01-04 10:44:25.550528: Encoding started...
2022-01-04 10:44:31.411898: Encoding finished.
2022-01-04 10:44:31.412923: ML model training started...
2022-01-04 10:45:24.229705: ML model training finished.

2022-01-04 10:48:00.572503: Encoding finished.
       immune_state_x  confounder feature  valuemean_x  valuestd_x  \
9499            False        True     FKY     0.000000    0.000000   
6805            False       False     VMC     2.761227    5.166271   
417             False       False     CAY     2.668685    1.887951   
2675            False       False     HVI     2.641805    4.942767   
11814           False        True     MKR     0.000000    0.000000   
...               ...         ...     ...          ...         ...   
14453           False        True     VFE     0.814836    1.351441   
2197            False       False     GPM     0.234247    0.662550   
14543           False        True     VKQ     0.201435    0.753701   
13790           False        True     SMV     0.277357    1.037774   
10212           False        True     HGF     0.813807    2.075289   

       immune_state_y  valuemean_y  valuestd_y  
9499             True     3.132685    4.430489  
6805          

2022-01-04 10:50:22.167368: Completed hyperparameter setting kmer_frequency_logistic_regression.

2022-01-04 10:50:22.168706: Hyperparameter optimization: running the inner loop of nested CV: completed selection for label immune_state (label 1 / 1).

2022-01-04 10:50:22.169262: Training ML model: running the inner loop of nested CV: retrain models for label immune_state (label 1 / 1).

2022-01-04 10:50:22.171159: Evaluating hyperparameter setting: kmer_frequency_logistic_regression...
2022-01-04 10:50:22.172812: Encoding started...
2022-01-04 10:50:26.105506: Encoding finished.
2022-01-04 10:50:26.105980: ML model training started...
2022-01-04 10:51:26.239528: ML model training finished.
       immune_state_x  confounder feature  valuemean_x  valuestd_x  \
5307            False       False     QPR     0.103194    0.153521   
13262           False        True     QYL     0.112848    0.564267   
13272           False        True     QYY     0.266143    0.694090   
13253           False 

In [15]:
from util.plotting import plot_validation_vs_test_performance

plot_validation_vs_test_performance(iml_result=result, result_path=ml_result_path)

In [16]:
# show what the model has learned

from IPython.display import IFrame


IFrame(src=str(ml_result_path / "result/HTML_output/train_ml_split_1_immune_state_kmer_frequency_logistic_regression_optimal_reports_ml_method_coefficients_largest_25_coefficients.html"),  width=700, height=600)