# Experiment 1a: the influence of a stable confounder on classification performance

In this experiment, we use a simplified causal graph consisting of three nodes: immune state, confounder, and AIRR, to show if the distribution of a confounder is stable in the study cohort, and source and target populations, it does not influence the prediction task.

Immune state is a binary variable and can have values `True` or `False` to indicate if an individual is diseased or healthy. Confounder also in this setting has two values: `C1` and `C2`. AIRR is a set of sequences simulated based on the values of the immune state and the confounder for the given individual.

Steps:

1. Simulate training and test dataset from a causal graph to include: confounder (implemented by implanting 3-mer `ADR` in the repertoire), immune state (implemented by implanting a signal in the repertoire but the exact 3-mer depends on the value of the confounder; it is either `QPR` or `EQY`)

2. Train an ML model (here: logistic regression on repertoires represented by the k-mer frequencies) on the train set and assess its performance on the test set in the presence of confounder with stable distribution across train and test.

Software used: 

- DagSim for simulation of the causal graph; 
- immuneML v2.1 for implanting signal in AIRRs and for training and assessing machine learning classifiers; 
- OLGA for simulation of naive AIRRs

In [11]:
import yaml
import dagsim.base as ds
import numpy as np
from util.dataset_util import make_AIRR_dataset, make_dataset, setup_path
from util.implanting import make_immune_state_signals, make_confounding_signal
from util.experiment1 import get_immune_state, get_confounder, get_repertoire

## Step 1: AIRR simulation from a causal graph

In [12]:
result_path = setup_path('./experiment1a/')
data_path = setup_path("./experiment1a/data/")

# how many repertoires to make for training and testing
train_example_count = 200
test_example_count = 100

# immune state: two binomial distributions depending on the confounder value with probability of success p
immune_state_p_conf1 = 0.8 # for confounder = C1
immune_state_p_conf2 = 0.2 # for confounder = C2

# confounder: binomial distribution with probability of success p
confounder_p_train = 0.5
confounder_p_test = 0.5

# other parameters
immune_state_implanting_rate = 0.02 # percentage of repertoire sequences to include immune state signal
confounding_implanting_rate = 0.2
sequence_count = 500 # number of sequences in one repertoire

Removing experiment1a...


In [13]:
immune_state_signal_name = "immune_state"
immune_state_signals = make_immune_state_signals(signal_name=immune_state_signal_name)
confounding_signal = make_confounding_signal()

# define nodes of the causal graph

index_node = ds.Node(name="index", function=np.arange, kwargs={"start": 0}, size_field="stop")

confounder_node = ds.Node(name="confounder", function=get_confounder, kwargs={"p": confounder_p_train})

immune_state_node = ds.Node(name="immune_state", function=get_immune_state, 
                               kwargs={"confounder": confounder_node, "p_conf1": immune_state_p_conf1,
                                         "p_conf2": immune_state_p_conf2})

repertoire_node = ds.Node(name="repertoire", function=get_repertoire, 
                             kwargs={"immune_state": immune_state_node, "confounder": confounder_node, 
                                        "path": data_path / "train", "sequence_count": sequence_count, 
                                        "immune_state_signals": immune_state_signals, 
                                        "confounding_signal": confounding_signal, "seed": index_node, 
                                        'immune_state_implanting_rate': immune_state_implanting_rate, 
                                        'confounding_implanting_rate': confounding_implanting_rate})

# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_1a", 
                 list_nodes=[index_node, confounder_node, immune_state_node, repertoire_node])
graph.draw()


In [14]:
# simulate a dataset using the graph

study_cohort_data = graph.simulate(num_samples=train_example_count, 
                                   csv_name=str(data_path / "train/study_cohort"))

# make an AIRR dataset from the generated repertoires to be used for training

train_dataset = make_dataset(repertoire_paths=study_cohort_data["repertoire"], path=data_path / 'train', 
                             dataset_name="experiment1a_train", 
                             signal_names=[immune_state_signal_name, confounder_node.name])

Simulation started
Simulation finished in 142.2393 seconds


In [15]:
# make a test dataset
confounder_node = ds.Node(name="confounder", function=get_confounder, kwargs={"p": confounder_p_test})

repertoire_node = ds.Node(name="repertoire", function=get_repertoire,
                             kwargs={"immune_state": immune_state_node, "confounder": confounder_node,
                                        "path": data_path / "test", "sequence_count": sequence_count,
                                        "immune_state_signals": immune_state_signals,
                                        "confounding_signal": confounding_signal, "seed": index_node,
                                        'immune_state_implanting_rate': immune_state_implanting_rate,
                                        'confounding_implanting_rate': confounding_implanting_rate})

graph = ds.Graph(name="graph_experiment_1a",
                 list_nodes=[index_node, confounder_node, immune_state_node, repertoire_node])

test_data = graph.simulate(num_samples=test_example_count, csv_name=str(data_path / "test/test_cohort"))

test_dataset = make_dataset(repertoire_paths=test_data["repertoire"], path=data_path / 'test',
                            dataset_name="experiment1a_test",
                            signal_names=[immune_state_signal_name, confounder_node.name])

# merge datasets (but the distinction between train and test will be kept)

dataset = make_AIRR_dataset(train_dataset, test_dataset, data_path / 'full_dataset')

Simulation started
Simulation finished in 69.2155 seconds


## Step 2: train an ML model and assess performance

In [16]:
specs = {
    "definitions": {
        "datasets": {
            "dataset1": {
                "format": 'AIRR',
                "params": {
                    "path": str(data_path / 'full_dataset'),
                    "metadata_file": str(data_path / 'full_dataset/metadata.csv')
                }
            }
        },
        "encodings": {
            "kmer_frequency": {
                "KmerFrequency": {"k": 3}
            }
        },
        "ml_methods": {
            "logistic_regression": {
                "LogisticRegression": {
                    "penalty": "l1",
                    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    "show_warnings": False
                },
                "model_selection_cv": True,
                "model_selection_n_folds": 5
            }
        },
        "reports": {
            "motif_recovery": { # to check how much coefficients overlap with the immune state signal that was implanted
                "MotifSeedRecovery": {
                    "implanted_motifs_per_label": {
                        "immune_state": {
                            "seeds": ["EQY", "QPR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        },
                        "confounder": {
                            "seeds": ["ADR"],
                            "hamming_distance": False,
                            "gap_sizes": [0] # no gaps
                        }
                    }
                }
            },
            "coefficients": {
                "Coefficients": { # show top 25 logistic regression coefficients and what k-mers they correspond to
                    "coefs_to_plot": ['n_largest'],
                    "n_largest": [25]
                }
            },
            "feature_comparison": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "color_grouping_label": "confounder",
                    "show_error_bar": False,
                    "keep_fraction": 0.1,
                    "log_scale": True
                }
            }, 
            "training_performance": {
                "TrainingPerformance": {
                    "metrics": ["balanced_accuracy", "log_loss", "auc"]
                }
            }
        }
    },
    "instructions": {
        'train_ml': {
            "type": "TrainMLModel",
            "assessment": { # ensure here that train and test dataset are fixed, as per simulation
                "split_strategy": "manual",
                "split_count": 1,
                "manual_config": {
                    "train_metadata_path": str(data_path / "train/experiment1a_train_metadata.csv"),
                    "test_metadata_path": str(data_path / "test/experiment1a_test_metadata.csv")
                },
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison"]
                }
            },
            "selection": {
                "split_strategy": "k_fold",
                "split_count": 5,
                "reports": {
                    "models": ["coefficients", "motif_recovery", "training_performance"],
                    "encoding": ["feature_comparison"]
                }
            },
            "settings": [
                {"encoding": "kmer_frequency", "ml_method": "logistic_regression"}
            ],
            "dataset": "dataset1",
            "refit_optimal_model": False,
            "labels": ["immune_state"],
            "optimization_metric": "balanced_accuracy",
            "metrics": ['log_loss', 'auc']
        }
    }
}

ml_result_path = setup_path("./experiment1a/ml_result/")
specification_path = ml_result_path / "specs.yaml"

with open(specification_path, "w") as file:
    yaml.dump(specs, file)

In [18]:
# run immuneML with the specs file

from immuneML.app.ImmuneMLApp import ImmuneMLApp

output_path = ml_result_path / "result/"

app = ImmuneMLApp(specification_path = specification_path, result_path = output_path)
result = app.run()

print("The results are located under ./experiment1a/")

2022-01-18 20:19:54.777263: Setting temporary cache path to experiment1a/ml_result/result/cache
2022-01-18 20:19:54.779205: ImmuneML: parsing the specification...



ERROR:root:

2022-01-18 20:19:54.849619 --- Exception in _parse_report : ReportParser: invalid parameter __init__() got an unexpected keyword argument 'keep_fraction' when specifying parameters in {'FeatureComparison': {'color_grouping_label': 'confounder', 'comparison_label': 'immune_state', 'keep_fraction': 0.1, 'log_scale': True, 'show_error_bar': False}} under key feature_comparison. Valid parameter names are: ['self', 'dataset', 'result_path', 'comparison_label', 'color_grouping_label', 'row_grouping_label', 'column_grouping_label', 'show_error_bar', 'name']




Exception: ReportParser: invalid parameter __init__() got an unexpected keyword argument 'keep_fraction' when specifying parameters in {'FeatureComparison': {'color_grouping_label': 'confounder', 'comparison_label': 'immune_state', 'keep_fraction': 0.1, 'log_scale': True, 'show_error_bar': False}} under key feature_comparison. Valid parameter names are: ['self', 'dataset', 'result_path', 'comparison_label', 'color_grouping_label', 'row_grouping_label', 'column_grouping_label', 'show_error_bar', 'name']

ImmuneMLParser: an error occurred during parsing in function _parse_report  with parameters: ('feature_comparison', {'FeatureComparison': {'color_grouping_label': 'confounder', 'comparison_label': 'immune_state', 'keep_fraction': 0.1, 'log_scale': True, 'show_error_bar': False}}, SymbolTable()).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

In [None]:
from util.plotting import plot_validation_vs_test_performance

plot_validation_vs_test_performance(iml_result=result, result_path=ml_result_path)

In [None]:
# show what the model has learned

from IPython.display import IFrame


IFrame(src=str(ml_result_path / "result/HTML_output/train_ml_split_1_immune_state_kmer_frequency_logistic_regression_optimal_reports_ml_method_coefficients_largest_25_coefficients.html"),  width=700, height=600)