# Experiment 2: Selection Bias and Batch Effects

In this experiment, we combine selection bias and batch effects. The AIRR data come from two hospitals: `hospital1`, which recruits mostly diseased individuals, and `hospital2`, which recruits mostly healthy individuals. We explore two scenarios:

1. the performance of the ML model when selection bias is present in the training dataset but not in the test dataset.

2. the performance of the ML model when there is no connection between the immune state and the AIRR: we will show that, in this case, the ML model learns only a spurious correlation.
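
To make the second scenario concrete, here is a minimal, self-contained sketch that does not use DagSim or immuneML. It assumes a selection rule that keeps 90% of the "matching" individuals (diseased at `hospital1`, healthy at `hospital2`) and 10% of the others, and a single numeric feature that depends only on the hospital. The function names, the 0.9/0.1 probabilities, and the hand-set threshold (standing in for a trained classifier) are all illustrative assumptions, not the values used in this notebook.

```python
import random

def draw_cohort(n, biased, rng):
    """Return n (diseased, feature) pairs; the feature depends only on the hospital."""
    cohort = []
    while len(cohort) < n:
        diseased = rng.random() < 0.5
        hospital1 = rng.random() < 0.5
        # Assumed selection rule: hospital1 keeps mostly diseased,
        # hospital2 keeps mostly healthy individuals.
        keep_prob = 0.9 if hospital1 == diseased else 0.1
        if biased and rng.random() >= keep_prob:
            continue
        # The protocol shifts the feature; the immune state never does.
        feature = rng.gauss(1.0 if hospital1 else 0.0, 0.3)
        cohort.append((diseased, feature))
    return cohort

def accuracy(cohort, threshold=0.5):
    """Accuracy of the rule 'feature above threshold => diseased'."""
    hits = sum((feature > threshold) == diseased for diseased, feature in cohort)
    return hits / len(cohort)

rng = random.Random(42)
train_acc = accuracy(draw_cohort(2000, biased=True, rng=rng))
test_acc = accuracy(draw_cohort(2000, biased=False, rng=rng))
print(f"biased cohort: {train_acc:.2f}, unbiased cohort: {test_acc:.2f}")
```

Under these assumptions the rule looks predictive on the biased cohort, yet it performs at chance level on the unbiased one, because the feature only tracks the hospital.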

Immune state is a binary variable that takes the value `True` (diseased) or `False` (healthy). The AIRR is a set of sequences simulated from the immune state and the confounder of the given individual. Hospital is a binary variable (`hospital1` or `hospital2`). Each hospital has its own experimental protocol, which influences the observed AIRR: the protocol manifests as a higher frequency of some k-mers in the sequenced AIRRs.
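
As a rough illustration of how a protocol could raise the frequency of a k-mer, the sketch below overwrites a random position with a fixed k-mer in a given fraction of sequences. It is a simplified stand-in, not immuneML's implanting code: `implant_kmer`, the k-mer `AAA`, and the random toy sequences are hypothetical, and only the 1% rate and the 2000 sequences echo the constants used later in this notebook.

```python
import random

def implant_kmer(sequences, kmer, implanting_rate, rng):
    """Overwrite a random position with `kmer` in a fraction of the sequences."""
    n_implant = round(len(sequences) * implanting_rate)
    chosen = set(rng.sample(range(len(sequences)), n_implant))
    result = []
    for i, seq in enumerate(sequences):
        if i in chosen and len(seq) >= len(kmer):
            start = rng.randrange(len(seq) - len(kmer) + 1)
            seq = seq[:start] + kmer + seq[start + len(kmer):]
        result.append(seq)
    return result

rng = random.Random(0)
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
# Toy "naive" repertoire: random amino acid strings, not OLGA output.
naive = ["".join(rng.choice(amino_acids) for _ in range(12)) for _ in range(2000)]
implanted = implant_kmer(naive, kmer="AAA", implanting_rate=0.01, rng=rng)

count_before = sum("AAA" in s for s in naive)
count_after = sum("AAA" in s for s in implanted)
print(f"sequences containing AAA: {count_before} before, {count_after} after")
```

Overwriting rather than inserting keeps sequence lengths unchanged, which is one possible design choice among several.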

Steps for each scenario:

1. Simulate training and test datasets from a causal graph that includes the variables described above.

2. Train an ML model (here: logistic regression on repertoires represented by k-mer frequencies) on the training set, which has selection bias, and assess its performance on the test set, which has no selection bias.
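
The k-mer frequency representation mentioned in step 2 can be sketched in a few lines. This is a simplified relative-frequency version for intuition only; immuneML's `KmerFrequency` encoder has its own normalization and scaling options, which are not reproduced here. The function name and the toy sequences are illustrative assumptions.

```python
from collections import Counter

def kmer_frequencies(sequences, k=3):
    """Relative frequency of every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}

# Toy repertoire of three short sequences (hypothetical data).
toy_repertoire = ["CASSLGQ", "CASSPGT", "CAWSLGQ"]
freqs = kmer_frequencies(toy_repertoire, k=3)
for kmer, freq in sorted(freqs.items(), key=lambda item: -item[1])[:4]:
    print(kmer, round(freq, 3))
```

Each repertoire then becomes one row of a feature matrix with one column per k-mer, which is the input a logistic regression can be fitted on.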

Software used:

- DagSim for simulation of the causal graph;
- immuneML v2.1 for implanting signals in AIRRs and for training and assessing machine learning classifiers;
- OLGA for simulation of naive AIRRs.

In [1]:
import os
import yaml
import dagsim.base as ds
import numpy as np
from pathlib import Path
from util.dataset_util import make_olga_repertoire, load_iml_repertoire, make_AIRR_dataset, make_dataset, setup_path
from util.experiment2 import make_immune_state_signal
from util.simulation import get_immune_state, get_hospital, get_exp_protocol, get_repertoire, get_selection
from immuneML.util.PathBuilder import PathBuilder
from immuneML.simulation.implants.Signal import Signal

In [2]:
setup_path("./experiment2") # remove results from the previous run

train_example_count = 200
test_example_count = 50

Removing experiment2...


## Scenario 1: immune state signal is stronger than the influence of the experimental protocol

In [3]:
# define and build path, remove content if not empty

scenario1_path = setup_path("./experiment2/scenario1/")
scenario1_data_path = setup_path(scenario1_path / "data")

### Step 1: AIRR simulation from a causal graph

In [4]:
# define constants for the simulation

p_immune_state = 0.5 # parameter of binomial distribution for the immune state
p_hospital = 0.5 # parameter of binomial distribution for selecting between hospitals 1 and 2

sequence_count = 2000
immune_state_implanting_rate = 0.01
protocol_implanting_rate = 0.01

immune_state_signal = make_immune_state_signal()


# define nodes of the causal graph

immune_state_node = ds.Generic(name="immune_state", function=get_immune_state, arguments={"p": p_immune_state})

hospital_node = ds.Generic(name="hospital", function=get_hospital, arguments={"p": p_hospital})

experimental_protocol_node = ds.Generic(name="exp_protocol", function=get_exp_protocol, arguments={"hospital": hospital_node})

repertoire_node = ds.Generic(name="repertoire", function=get_repertoire, 
                             arguments={"immune_state": immune_state_node, 
                                        "experimental_protocol": experimental_protocol_node,
                                        "path": scenario1_data_path / "train", "sequence_count": sequence_count, 
                                        "immune_state_signal": immune_state_signal, 
                                        'immune_state_implanting_rate': immune_state_implanting_rate, 
                                        "protocol_implanting_rate": protocol_implanting_rate})

selection_node = ds.Selection(name="S", function=get_selection,
                              arguments={"hospital": hospital_node, "immune_state": immune_state_node})

# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_2_1", 
                 list_nodes=[immune_state_node, hospital_node, experimental_protocol_node, repertoire_node, 
                             selection_node])
graph.draw()

In [5]:
training_data_sc1 = graph.simulate(num_samples=train_example_count, selection=True,
                                   csv_name=str(scenario1_data_path / "train/study_cohort"))

# make an AIRR dataset from the generated repertoires to be used for training

train_dataset = make_dataset(repertoire_paths=training_data_sc1["repertoire"], path=scenario1_data_path / 'train', 
                             dataset_name="experiment2_sc1_train", 
                             signal_names=[immune_state_signal.id, experimental_protocol_node.name])

# make a test dataset

repertoire_node.additional_parameters['path'] = scenario1_data_path / 'test' 

test_data = graph.simulate(num_samples=test_example_count, csv_name=str(scenario1_data_path / "test/test_cohort"), 
                           selection=False)

test_dataset = make_dataset(repertoire_paths=test_data["repertoire"], path=scenario1_data_path / 'test',
                            dataset_name="experiment2_sc1_test", 
                            signal_names=[immune_state_signal.id, experimental_protocol_node.name])

# merge datasets (but the distinction between train and test will be kept in the ML analysis part)

dataset = make_AIRR_dataset(train_dataset, test_dataset, scenario1_data_path / 'full_dataset')

Simulation started
Simulation finished in 638.6343 seconds
Simulation started
Simulation finished in 75.3197 seconds


### Step 2: Training an ML model

In [23]:
specs = {
    "definitions": {
        "datasets": {
            "dataset1": {
                "format": 'AIRR',
                "params": {
                    "path": str(scenario1_data_path / 'full_dataset'),
                    "metadata_file": str(scenario1_data_path / 'full_dataset/metadata.csv')
                }
            }
        },
        "encodings": {
            "kmer_frequency": {
                "KmerFrequency": {"k": 3}
            }
        },
        "ml_methods": {
            "logistic_regression": {
                "LogisticRegression": {
                    "penalty": "l1",
                    "C": [0.01, 0.1, 1, 10, 100],
                    "max_iter": 1500,
                    "show_warnings": False
                },
                "model_selection_cv": True,
                "model_selection_n_folds": 5
            }
        },
        "reports": {
            "coefficients": {
                "Coefficients": { # show top 25 logistic regression coefficients and what k-mers they correspond to
                    "coefs_to_plot": ['n_largest'],
                    "n_largest": [25]
                }
            },
            "feature_comparison": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "color_grouping_label": "experimental_protocol",
                    "show_error_bar": False,
                    "keep_fraction": 0.1
                }
            }
        }
    },
    "instructions": {
        'train_ml': {
            "type": "TrainMLModel",
            "assessment": { # ensure here that train and test dataset are fixed, as per simulation
                "split_strategy": "manual",
                "split_count": 1,
                "manual_config": {
                    "train_metadata_path": str(scenario1_data_path / "train/experiment2_sc1_train_metadata.csv"),
                    "test_metadata_path": str(scenario1_data_path / "test/experiment2_sc1_test_metadata.csv")
                },
                "reports": {
                    "models": ["coefficients"],
                    "encoding": ["feature_comparison"]
                }
            },
            "selection": {
                "split_strategy": "random",
                "train_percentage": 0.7,
                "split_count": 5,
                "reports": {
                    "models": ["coefficients"],
                    "encoding": ["feature_comparison"]
                }
            },
            "settings": [
                {"encoding": "kmer_frequency", "ml_method": "logistic_regression"}
            ],
            "dataset": "dataset1",
            "refit_optimal_model": False,
            "labels": ["immune_state"],
            "optimization_metric": "balanced_accuracy",
            "metrics": ['log_loss', 'auc']
        }
    }
}

scenario1_ml_result_path = setup_path("./experiment2/scenario1/ml_result/")
scenario1_specs_path = scenario1_ml_result_path / "specs.yaml"

with open(scenario1_specs_path, "w") as file:
    yaml.dump(specs, file)

Removing experiment2/scenario1/ml_result...


In [24]:
# run immuneML with the specs file

from immuneML.app.ImmuneMLApp import ImmuneMLApp

scenario1_output_path = scenario1_ml_result_path / "result/"

app = ImmuneMLApp(specification_path = scenario1_specs_path, result_path = scenario1_output_path)
result = app.run()

print("The results are located under ./experiment2/scenario1/")

2022-01-17 15:29:45.813923: Setting temporary cache path to experiment2/scenario1/ml_result/result/cache
2022-01-17 15:29:45.814991: ImmuneML: parsing the specification...

2022-01-17 15:30:30.456961: Full specification is available at experiment2/scenario1/ml_result/result/full_specs.yaml.

2022-01-17 15:30:30.458089: ImmuneML: starting the analysis...

2022-01-17 15:30:30.458611: Instruction 1/1 has started.
2022-01-17 15:30:30.549570: Training ML model: running outer CV loop: started split 1/1.

2022-01-17 15:30:30.656400: Hyperparameter optimization: running the inner loop of nested CV: selection for label immune_state (label 1 / 1).

2022-01-17 15:30:30.658570: Evaluating hyperparameter setting: kmer_frequency_logistic_regression...
2022-01-17 15:30:30.659897: Encoding started...
2022-01-17 15:30:42.311634: Encoding finished.
2022-01-17 15:30:42.312068: ML model training started...
2022-01-17 15:31:55.672430: ML model training finished.
     feature  immune_state_x  experimental_p



2022-01-17 15:32:06.224491: Encoding finished.
     feature  immune_state_x  experimental_protocol  valuemean_x  valuestd_x  \
5430     QPR           False                   True     0.380952    0.275040   
2777     HYI           False                   True     0.642163    2.686461   
2768     HWW           False                   True     0.327783    1.966701   
5300     QHF           False                   True     0.261024    0.681701   
2491     HFY           False                   True     0.318331    0.603925   
...      ...             ...                    ...          ...         ...   
645      CPK           False                   True     0.674929    1.253957   
5213     QCV           False                   True     0.457539    1.229521   
1059     DQE           False                   True     2.243256    1.308209   
4491     NGR           False                   True     1.731784    1.102255   
7766     YLQ           False                   True     0.563871    0.983



2022-01-17 15:32:15.789269: Encoding finished.
2022-01-17 15:32:15.789689: ML model training started...
2022-01-17 15:33:28.336661: ML model training finished.
     feature  immune_state_x  experimental_protocol  valuemean_x  valuestd_x  \
5432     QPR           False                   True     0.374521    0.266289   
5302     QHF           False                   True     0.506314    0.843868   
2475     HFF           False                   True     0.379817    0.701466   
99       AFY           False                   True     1.550439    1.050458   
84       AFF           False                   True     1.290208    0.977024   
...      ...             ...                    ...          ...         ...   
3533     KVG           False                   True     1.436309    0.997422   
532      CHP           False                   True     0.261728    0.845271   
965      DKP           False                   True     0.951370    1.094204   
6080     SGD           False            



2022-01-17 15:35:09.156850: Encoding finished.
2022-01-17 15:35:09.157305: ML model training started...
2022-01-17 15:36:20.655051: ML model training finished.
     feature  immune_state_x  experimental_protocol  valuemean_x  valuestd_x  \
5432     QPR           False                   True     0.366350    0.315127   
5302     QHF           False                   True     0.421277    0.795944   
99       AFY           False                   True     1.679206    0.992079   
2477     HFF           False                   True     0.350969    0.711340   
4644     NQH           False                   True     0.370658    0.778922   
...      ...             ...                    ...          ...         ...   
1834     FPA           False                   True     0.904152    0.929721   
1791     FLT           False                   True     0.933119    0.929788   
3648     LEA           False                   True     1.990087    1.079934   
1741     FIH           False            

2022-01-17 15:39:45.844467: Encoding finished.
     feature  immune_state_x  experimental_protocol  valuemean_x  valuestd_x  \
5442     QPR           False                   True     0.432263    0.319628   
4030     MCV           False                   True     0.000000    0.000000   
4372     MYF           False                   True     0.000000    0.000000   
2760     HWF           False                   True     1.281224    2.748080   
6766     TWW           False                   True     1.210196    2.466957   
...      ...             ...                    ...          ...         ...   
5570     QYD           False                   True     0.318413    0.682939   
2134     GHW           False                   True     0.953795    1.208022   
463      CEE           False                   True     0.395127    0.967864   
2312     GST           False                   True     3.682583    0.973940   
6637     TPL           False                   True     1.936029    0.965

In [33]:
from util.plotting import plot_balanced_error_rate
    
plot_balanced_error_rate(iml_result=result, result_path=scenario1_ml_result_path)
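
For reference, the balanced error rate is commonly defined as one minus the balanced accuracy, i.e. one minus the mean of the per-class recalls. The sketch below computes it for binary labels; it assumes this standard definition and is not the code behind `plot_balanced_error_rate`, whose implementation is not shown in this notebook. The toy labels are made up.

```python
def balanced_error_rate(y_true, y_pred):
    """1 - mean of true positive rate and true negative rate (binary labels)."""
    pos = [pred for true, pred in zip(y_true, y_pred) if true]
    neg = [pred for true, pred in zip(y_true, y_pred) if not true]
    true_positive_rate = sum(1 for pred in pos if pred) / len(pos)
    true_negative_rate = sum(1 for pred in neg if not pred) / len(neg)
    return 1 - (true_positive_rate + true_negative_rate) / 2

# Toy example: 4 diseased and 2 healthy individuals (hypothetical labels).
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, True, False]
ber = balanced_error_rate(y_true, y_pred)
print(ber)  # TPR = 3/4, TNR = 1/2, so BER = 1 - 0.625 = 0.375
```

Unlike the plain error rate, this metric weights both classes equally, which matters when the test cohort is not perfectly balanced.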

## Scenario 2: immune state does not influence the AIRR: learning spurious correlations

### Step 1: AIRR simulation from a causal graph

In [None]:
# define and build path, remove content if not empty

scenario2_path = setup_path("./experiment2/scenario2/")
scenario2_data_path = setup_path(scenario2_path / "data")

# define constants for the simulation

p_immune_state = 0.5 # parameter of binomial distribution for the immune state
p_hospital = 0.5 # parameter of binomial distribution for selecting between hospitals 1 and 2

sequence_count = 2000
immune_state_implanting_rate = 0.0
protocol_implanting_rate = 0.04

immune_state_signal = make_immune_state_signal()

In [None]:
# the repertoire does not depend on the immune state node here: only the protocol signal is implanted
repertoire_node = ds.Generic(name="repertoire", function=get_repertoire,
                             arguments={"immune_state": False, "experimental_protocol": experimental_protocol_node,
                                        "path": scenario2_data_path / "train", "sequence_count": sequence_count, 
                                        "immune_state_signal": immune_state_signal,
                                        "immune_state_implanting_rate": immune_state_implanting_rate,
                                        "protocol_implanting_rate": protocol_implanting_rate})

# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_2_2", 
                 list_nodes=[immune_state_node, hospital_node, experimental_protocol_node, 
                             repertoire_node, selection_node])
graph.draw()

In [None]:
# make a train dataset (with selection bias, as in scenario 1)

training_data_sc2 = graph.simulate(num_samples=train_example_count, selection=True,
                                   csv_name=str(scenario2_data_path / "train/study_cohort"))

train_dataset = make_dataset(repertoire_paths=training_data_sc2["repertoire"], path=scenario2_data_path / 'train', 
                             dataset_name="experiment2_sc2_train",
                             signal_names=[immune_state_signal.id, experimental_protocol_node.name])

# make a test dataset

repertoire_node.additional_parameters['path'] = scenario2_data_path / 'test' # update result_path: to be removed with DagSim update

test_data = graph.simulate(num_samples=test_example_count, csv_name=str(scenario2_data_path / "test/test_cohort"), 
                           selection=False)

test_dataset = make_dataset(repertoire_paths=test_data["repertoire"], path=scenario2_data_path / 'test',
                            dataset_name="experiment2_sc2_test", 
                            signal_names=[immune_state_signal.id, experimental_protocol_node.name])

# merge datasets (but the distinction between train and test will be kept in the ML analysis part)

dataset = make_AIRR_dataset(train_dataset, test_dataset, scenario2_data_path / 'full_dataset')

### Step 2: Training an ML model

In [None]:
specs = {
    "definitions": {
        "datasets": {
            "dataset1": {
                "format": 'AIRR',
                "params": {
                    "path": str(scenario2_data_path / 'full_dataset'),
                    "metadata_file": str(scenario2_data_path / 'full_dataset/metadata.csv')
                }
            }
        },
        "encodings": {
            "kmer_frequency": {
                "KmerFrequency": {"k": 3}
            }
        },
        "ml_methods": {
            "logistic_regression": {
                "LogisticRegression": {
                    "penalty": "l1",
                    "C": [0.01, 0.1, 1, 10, 100],
                    "max_iter": 1500,
                    "show_warnings": False
                },
                "model_selection_cv": True,
                "model_selection_n_folds": 5
            }
        },
        "reports": {
            "coefficients": {
                "Coefficients": { # show top 25 logistic regression coefficients and what k-mers they correspond to
                    "coefs_to_plot": ['n_largest'],
                    "n_largest": [25]
                }
            },
            "feature_comparison": {
                "FeatureComparison": {
                    "comparison_label": "immune_state",
                    "color_grouping_label": "experimental_protocol",
                    "show_error_bar": False,
                    "keep_fraction": 0.1
                }
            }
        }
    },
    "instructions": {
        'train_ml': {
            "type": "TrainMLModel",
            "assessment": { # ensure here that train and test dataset are fixed, as per simulation
                "split_strategy": "manual",
                "split_count": 1,
                "manual_config": {
                    "train_metadata_path": str(scenario2_data_path / "train/experiment2_sc2_train_metadata.csv"),
                    "test_metadata_path": str(scenario2_data_path / "test/experiment2_sc2_test_metadata.csv")
                },
                "reports": {
                    "models": ["coefficients"],
                    "encoding": ["feature_comparison"]
                }
            },
            "selection": {
                "split_strategy": "random",
                "train_percentage": 0.7,
                "split_count": 1,
                "reports": {
                    "models": ["coefficients"],
                    "encoding": ["feature_comparison"]
                }
            },
            "settings": [
                {"encoding": "kmer_frequency", "ml_method": "logistic_regression"}
            ],
            "dataset": "dataset1",
            "refit_optimal_model": False,
            "labels": ["immune_state"],
            "optimization_metric": "balanced_accuracy",
            "metrics": ['log_loss', 'auc']
        }
    }
}

scenario2_ml_result_path = setup_path("./experiment2/scenario2/ml_result/")
scenario2_specs_path = scenario2_ml_result_path / "specs.yaml"

with open(scenario2_specs_path, "w") as file:
    yaml.dump(specs, file)

In [None]:
# run immuneML with the specs file

from immuneML.app.ImmuneMLApp import ImmuneMLApp

scenario2_output_path = scenario2_ml_result_path / "result/"

app = ImmuneMLApp(specification_path = scenario2_specs_path, result_path = scenario2_output_path)
result = app.run()

print("The results are located under ./experiment2/scenario2/")

In [None]:
from util.plotting import plot_balanced_error_rate
    
plot_balanced_error_rate(iml_result=result, result_path=scenario2_ml_result_path)