# Experiment 1b: confounder distribution changes but does not influence the performance

In this experiment, we use a simplified causal graph with three nodes (immune state, confounder, and AIRR) to show that even if the distribution of a confounder changes between the source and target populations, the prediction performance can remain the same.

Immune state is a binary variable that takes the value `True` or `False` to indicate whether an individual is diseased or healthy. The confounder is also binary in this setting, with values `C1` and `C2`. AIRR is a set of sequences simulated based on the values of the immune state and the confounder for the given individual.
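
The dependence of the immune state on the confounder can be illustrated with a minimal sketch. It is not the code in `util.experiment1`: the function `sample_individual` is hypothetical, and it assumes that a Bernoulli "success" for the confounder means `C1`. The probabilities are the values this notebook sets later for the training data.

```python
import random

def sample_individual(rng, p_confounder, p_state_c1, p_state_c2):
    """Draw (confounder, immune_state) for one individual.

    Assumption: a Bernoulli "success" for the confounder means C1.
    """
    confounder = "C1" if rng.random() < p_confounder else "C2"
    # The immune state probability depends on the sampled confounder value.
    p_state = p_state_c1 if confounder == "C1" else p_state_c2
    immune_state = rng.random() < p_state
    return confounder, immune_state

rng = random.Random(0)
cohort = [sample_individual(rng, 0.4, 0.8, 0.2) for _ in range(1000)]

# Empirical P(immune_state = True | confounder) should be close to 0.8 and 0.2.
for value in ("C1", "C2"):
    states = [s for c, s in cohort if c == value]
    print(value, round(sum(states) / len(states), 2))
```

Changing `p_confounder` alters how often each confounder value occurs, but leaves the conditional probabilities of the immune state untouched, which is the kind of shift this experiment studies.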

Steps:

1. Simulate training and test datasets from a causal graph that includes a confounder (implemented by implanting the 3-mer `ADR` in the repertoire) and an immune state (implemented by implanting a signal in the repertoire, where the exact 3-mer, either `QPR` or `EQY`, depends on the value of the confounder).

2. Train an ML model (here: logistic regression on repertoires represented by k-mer frequencies) on the training set and assess its performance on the test set when the confounder distribution differs slightly between the training and test sets.
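
The two steps rely on implanting a 3-mer into a fraction of a repertoire's sequences and on representing a repertoire by its k-mer frequencies. The sketch below illustrates both ideas under stated assumptions: `implant_motif` and `kmer_frequencies` are hypothetical helpers, not the immuneML implanting or `KmerFrequency` implementations, and uniform random sequences of length 15 stand in for OLGA-generated naive sequences.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def implant_motif(sequences, motif, rate, rng):
    """Overwrite a random window with `motif` in round(rate * n) sequences."""
    implanted = list(sequences)
    n_implant = round(rate * len(sequences))
    for i in rng.sample(range(len(sequences)), n_implant):
        seq = implanted[i]
        start = rng.randrange(len(seq) - len(motif) + 1)
        implanted[i] = seq[:start] + motif + seq[start + len(motif):]
    return implanted

def kmer_frequencies(sequences, k=3):
    """Relative frequency of every overlapping k-mer across a repertoire."""
    counts = Counter(seq[i:i + k] for seq in sequences
                     for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

rng = random.Random(0)
# Stand-in for a naive repertoire: 500 uniform random sequences of length 15.
naive = ["".join(rng.choice(AMINO_ACIDS) for _ in range(15)) for _ in range(500)]
diseased = implant_motif(naive, "QPR", rate=0.02, rng=rng)

before = kmer_frequencies(naive).get("QPR", 0.0)
after = kmer_frequencies(diseased).get("QPR", 0.0)
print(f"QPR frequency before: {before:.5f}, after: {after:.5f}")
```

A classifier trained on such frequency vectors can pick up the implanted 3-mer as a feature, which is what the coefficient and motif recovery reports later in this notebook inspect.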

Software used:

- DagSim for simulation of the causal graph
- immuneML v2.1 for implanting signals in AIRRs and for training and assessing machine learning classifiers
- OLGA for simulation of naive AIRRs

In [1]:
import os
import yaml
import dagsim.base as ds
import numpy as np
from pathlib import Path
from util.dataset_util import make_AIRR_dataset, make_dataset, setup_path
from util.implanting import make_immune_state_signals, make_confounding_signal
from util.experiment1 import get_immune_state, get_confounder, get_repertoire
from immuneML.simulation.implants.Signal import Signal

## Step 1: AIRR simulation from a causal graph

In [9]:
result_path = setup_path('./experiment1b/')
data_path = setup_path("./experiment1b/data/")

train_example_count = 200
test_example_count = 100

# immune state: two binomial distributions depending on the confounder value with probability of success p
immune_state_p_conf1 = 0.8 # for confounder = C1
immune_state_p_conf2 = 0.2 # for confounder = C2

# confounder: binomial distribution with probability of success p
confounder_p_train = 0.4
confounder_p_test = 0.5

# other parameters
immune_state_implanting_rate = 0.02 # fraction of repertoire sequences that receive the immune state signal
confounding_implanting_rate = 0.2 # fraction of repertoire sequences that receive the confounder signal
sequence_count = 500 # number of sequences in one repertoire

Removing experiment1b...
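
To see how much the change in the confounder distribution moves the overall disease prevalence, the marginal probability of a positive immune state can be worked out from the parameters above. This is a small sketch, not part of the original pipeline: `marginal_disease_probability` is a hypothetical helper, and it assumes that the Bernoulli parameter `p` of the confounder is the probability of `C1` (the actual mapping is defined in `util.experiment1`, which is not shown here).

```python
def marginal_disease_probability(p_c1, p_state_c1, p_state_c2):
    """P(immune_state=True) = P(C1) * P(True | C1) + P(C2) * P(True | C2)."""
    return p_c1 * p_state_c1 + (1 - p_c1) * p_state_c2

# Assumption: the confounder parameter p is the probability of C1.
train = marginal_disease_probability(0.4, 0.8, 0.2)
test = marginal_disease_probability(0.5, 0.8, 0.2)
print(f"train: {train:.2f}, test: {test:.2f}")  # train: 0.44, test: 0.50
```

Under this assumption the expected share of diseased individuals moves from 0.44 to 0.50, a modest shift. If `p` instead denoted the probability of `C2`, the training value would be 0.56 and the test value would still be 0.50.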


In [10]:
immune_state_signal_name = "immune_state"
immune_state_signals = make_immune_state_signals(signal_name=immune_state_signal_name)
confounding_signal = make_confounding_signal()

# define nodes of the causal graph

index_node = ds.Node(name="index", function=np.arange, size_field="stop")

confounder_node = ds.Node(name="confounder", function=get_confounder, kwargs={"p": confounder_p_train})

immune_state_node = ds.Node(name="immune_state", function=get_immune_state, 
                               kwargs={"confounder": confounder_node, "p_conf1": immune_state_p_conf1,
                                         "p_conf2": immune_state_p_conf2})

repertoire_node = ds.Node(name="repertoire", function=get_repertoire, 
                             kwargs={"immune_state": immune_state_node, "confounder": confounder_node, 
                                        "path": data_path / "train", "sequence_count": sequence_count, 
                                        "immune_state_signals": immune_state_signals, 
                                        "confounding_signal": confounding_signal, "seed": index_node, 
                                        'immune_state_implanting_rate': immune_state_implanting_rate, 
                                        'confounding_implanting_rate': confounding_implanting_rate})

# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_1b", 
                 list_nodes=[index_node, confounder_node, immune_state_node, repertoire_node])
graph.draw()






In [11]:
# simulate a dataset using the graph

study_cohort_data = graph.simulate(num_samples=train_example_count, 
                                   csv_name=str(data_path / "train/study_cohort"))

# make an AIRR dataset from the generated repertoires to be used for training

train_dataset = make_dataset(repertoire_paths=study_cohort_data["repertoire"], path=data_path / 'train', 
                             dataset_name="experiment1b_train", 
                             signal_names=[immune_state_signal_name, confounder_node.name])

# make a test dataset

repertoire_node.additional_parameters['path'] = data_path / 'test' # update result_path: to be removed with DagSim update
confounder_node.additional_parameters['p'] = confounder_p_test # update the confounder distribution parameter for test

test_data = graph.simulate(num_samples=test_example_count, csv_name=str(data_path / "test/test_cohort"))

test_dataset = make_dataset(repertoire_paths=test_data["repertoire"], path=data_path / 'test', 
                            dataset_name="experiment1b_test", 
                            signal_names=[immune_state_signal_name, confounder_node.name])

# merge datasets (but the distinction between train and test will be kept)

dataset = make_AIRR_dataset(train_dataset, test_dataset, data_path / 'full_dataset')

Simulation started
Simulation finished in 213.2542 seconds
Simulation started
Simulation finished in 118.9263 seconds


## Step 2: train an ML model and assess performance

In [12]:
specs = {
    "definitions": {
        "datasets": {
            "dataset1": {
                "format": 'AIRR',
                "params": {
                    "path": str(data_path / 'full_dataset'),
                    "metadata_file": str(data_path / 'full_dataset/metadata.csv')
                }
            }
        },
        "encodings": {
            "kmer_frequency": {
                "KmerFrequency": {"k": 3}
            }
        },
        "ml_methods": {
            "logistic_regression": {
                "LogisticRegression": {
                    "penalty": "l1",
                    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    "show_warnings": False
                },
                "model_selection_cv": True,
                "model_selection_n_folds": 3
            }
        },
        "reports": {
            "motif_recovery": { # to check how much coefficients overlap with the immune state signal that was implanted
                "MotifSeedRecovery": {
                    "implanted_motifs_per_label": {
                        "immune_state": {
                            "seeds": ["EQY", "QPR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        },
                        "confounder": {
                            "seeds": ["ADR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        }
                    }
                }
            },
            "coefficients": {
                "Coefficients": { # show top 25 logistic regression coefficients and what k-mers they correspond to
                    "coefs_to_plot": ['n_largest'],
                    "n_largest": [25]
                }
            },
            "feature_comparison_with_confounder": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "color_grouping_label": "confounder",
                    "show_error_bar": False,
                    "log_scale": True
                }
            },
            "feature_comparison_no_confounder": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "show_error_bar": False,
                    "log_scale": True
                }
            }, 
            "training_performance": {
                "TrainingPerformance": {
                    "metrics": ["balanced_accuracy", "log_loss", "auc"]
                }
            }
        }
    },
    "instructions": {
        'train_ml': {
            "type": "TrainMLModel",
            "assessment": { # ensure here that train and test dataset are fixed, as per simulation
                "split_strategy": "manual",
                "split_count": 1,
                "manual_config": {
                    "train_metadata_path": str(data_path / "train/experiment1b_train_metadata.csv"),
                    "test_metadata_path": str(data_path / "test/experiment1b_test_metadata.csv")
                },
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison_with_confounder", "feature_comparison_no_confounder"]
                }
            },
            "selection": {
                "split_strategy": "k_fold",
                "split_count": 5,
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison_with_confounder", "feature_comparison_no_confounder"]
                }
            },
            "settings": [
                {"encoding": "kmer_frequency", "ml_method": "logistic_regression"}
            ],
            "dataset": "dataset1",
            "refit_optimal_model": False,
            "labels": ["immune_state"],
            "optimization_metric": "balanced_accuracy",
            "metrics": ['log_loss', 'auc']
        }
    }
}

ml_result_path = setup_path("./experiment1b/ml_result/")
specification_path = ml_result_path / "specs.yaml"

with open(specification_path, "w") as file:
    yaml.dump(specs, file)
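
The specification above selects the model by balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch following its standard definition (the mean of per-class recall). The helper `balanced_accuracy` is written for illustration only and is not the implementation immuneML uses.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(y_pred[i] == label for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

y_true = [True, True, True, False]
y_pred = [True, True, False, False]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.833
```

Because every class contributes equally regardless of its size, the metric is less sensitive than plain accuracy (0.75 in this toy example) to differences in class prevalence between the training and test sets.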

In [13]:
# run immuneML with the specs file

from immuneML.app.ImmuneMLApp import ImmuneMLApp

output_path = ml_result_path / "result/"

app = ImmuneMLApp(specification_path = specification_path, result_path = output_path)
result = app.run()

print("The results are located under ./experiment1b/")

2022-01-04 15:01:57.660099: Setting temporary cache path to experiment1b/ml_result/result/cache
2022-01-04 15:01:57.660895: ImmuneML: parsing the specification...

2022-01-04 15:02:19.126723: Full specification is available at experiment1b/ml_result/result/full_specs.yaml.

2022-01-04 15:02:19.128032: ImmuneML: starting the analysis...

2022-01-04 15:02:19.128574: Instruction 1/1 has started.
2022-01-04 15:02:19.234606: Training ML model: running outer CV loop: started split 1/1.

2022-01-04 15:02:19.351248: Hyperparameter optimization: running the inner loop of nested CV: selection for label immune_state (label 1 / 1).

2022-01-04 15:02:19.353400: Evaluating hyperparameter setting: kmer_frequency_logistic_regression...
2022-01-04 15:02:19.355802: Encoding started...
2022-01-04 15:02:26.822154: Encoding finished.
2022-01-04 15:02:26.822827: ML model training started...
2022-01-04 15:02:56.859840: ML model training finished.
[Truncated output: feature comparison tables printed during the run. For each 3-mer feature (AAA ... YYY) they list the mean and standard deviation of the feature value for immune_state False (valuemean_x, valuestd_x) and for immune_state True (valuemean_y, valuestd_y); one variant is additionally grouped by confounder.]

In [14]:
from util.plotting import plot_validation_vs_test_performance

plot_validation_vs_test_performance(iml_result=result, result_path=ml_result_path)

In [15]:
# show what the model has learned

from IPython.display import IFrame

IFrame(src=str(ml_result_path / "result/HTML_output/train_ml_split_1_immune_state_kmer_frequency_logistic_regression_optimal_reports_ml_method_coefficients_largest_25_coefficients.html"),  width=700, height=600)
