# Experiment 3: Selection Bias and Batch Effects

In this experiment, we combine selection bias and batch effects. The AIRR data comes from two hospitals: `hospital1`, which recruits mostly diseased individuals, and `hospital2`, which recruits mostly healthy individuals. We explore three scenarios:

1. the immune state signal is stronger than the influence of the experimental protocol used for AIRR-sequencing,

2. the immune state signal is weaker than the influence of the experimental protocol,

3. there is no connection between the immune state and AIRR: in this case, we show that ML models learn only a spurious correlation.

Immune state is a binary variable that takes the values `True` or `False`, indicating whether an individual is diseased or healthy. AIRR is a set of sequences simulated from the values of the immune state and the confounder for the given individual. Hospital is a binary variable (`hospital1` or `hospital2`). Each hospital has its own experimental protocol, which influences the observed AIRR: the protocol's influence shows up as a higher frequency of some k-mers in the sequenced AIRRs.
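To make the selection mechanism concrete, the following is a minimal standalone sketch of how preferential recruitment can couple hospital and immune state even when the two are drawn independently. It is not taken from the notebook's `util` modules: the function names, the keep probabilities (0.9 and 0.1), and the convention that `True` means diseased are all assumptions chosen for illustration.

```python
import random


def simulate_recruitment(n, p_keep_matching=0.9, p_keep_other=0.1, seed=0):
    """Draw (hospital, immune_state) pairs independently, then keep each
    individual with a probability that depends on both variables.

    The keep probabilities are hypothetical values, not the notebook's."""
    rng = random.Random(seed)
    selected = []
    while len(selected) < n:
        immune_state = rng.random() < 0.5  # assumption: True = diseased
        hospital = "hospital1" if rng.random() < 0.5 else "hospital2"
        # hospital1 preferentially keeps diseased individuals,
        # hospital2 preferentially keeps healthy individuals
        matching = (hospital == "hospital1") == immune_state
        p_keep = p_keep_matching if matching else p_keep_other
        if rng.random() < p_keep:
            selected.append((hospital, immune_state))
    return selected


def diseased_fraction(cohort, hospital):
    """Fraction of diseased individuals among those recruited by one hospital."""
    states = [state for h, state in cohort if h == hospital]
    return sum(states) / len(states)


cohort = simulate_recruitment(n=2000)
frac_h1 = diseased_fraction(cohort, "hospital1")
frac_h2 = diseased_fraction(cohort, "hospital2")
print(f"diseased in hospital1: {frac_h1:.2f}, in hospital2: {frac_h2:.2f}")
```

Because each hospital also leaves its own k-mer signature through the experimental protocol, a classifier trained on such a cohort can pick up the protocol signature as a proxy for the immune state.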

Steps for each scenario:

1. Simulate training and test datasets from a causal graph that includes the variables described above.

2. Train an ML model (here: logistic regression on repertoires represented by their k-mer frequencies) on the training set, which has selection bias, and assess its performance on a test set without selection bias.
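The k-mer frequency representation mentioned in step 2 can be sketched in plain Python as below. This is only an illustration of the idea, not immuneML's encoder, which has its own options; the helper name `kmer_frequencies`, the choice `k=3`, and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Return the relative frequency of every overlapping k-mer
    across all sequences of one repertoire."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# made-up sequences, for illustration only
toy_repertoire = ["CASSLGT", "CASSPGQ"]
freqs = kmer_frequencies(toy_repertoire, k=3)
print(freqs["CAS"])
```

Each repertoire then becomes one vector of k-mer frequencies, and a logistic regression assigns one coefficient per k-mer. A protocol that enriches certain k-mers therefore changes the same features the classifier uses to detect the immune state signal.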

Software used:

- DagSim for simulation of the causal graph
- immuneML v2.1 for implanting signal in AIRRs and for training and assessing machine learning classifiers
- OLGA for simulation of naive AIRRs

In [None]:
import os
import yaml
import dagsim.baseDS as ds
import numpy as np
from pathlib import Path
from util.repertoire_util import make_olga_repertoire, load_iml_repertoire, make_AIRR_dataset, make_dataset
from util.implanting import make_immune_signal, make_repertoire_without_signal, make_repertoire_with_signal, make_exp_protocol_signal
from util.simulation import get_immune_state, get_hospital, get_exp_protocol, get_repertoire
from immuneML.util.PathBuilder import PathBuilder
from immuneML.simulation.implants.Signal import Signal

In [None]:
# remove results from the previous run

import shutil

if Path("./experiment3/").is_dir():
    shutil.rmtree("./experiment3/")

## Scenario 1: immune state signal is stronger than the influence of experimental protocol

In [None]:
# define and build path, remove content if not empty

scenario1_path = Path("./experiment3/scenario1/")

if scenario1_path.is_dir():
    shutil.rmtree(scenario1_path)  # rmtree requires the path to remove
    
PathBuilder.build(scenario1_path)

data_path = PathBuilder.build(scenario1_path / "data")

### Step 1: AIRR simulation from a causal graph

In [None]:
# define constants for the simulation

p_immune_state = 0.5 # parameter of binomial distribution for the immune state
p_hospital = 0.5 # parameter of binomial distribution for selecting between hospitals 1 and 2

sequence_count = 2000
repertoire_implanting_rate = 0.01

immune_state_signal = make_immune_signal()
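As a quick sanity check on these constants, the sketch below computes how many sequences per repertoire would carry the immune state signal. It assumes that `repertoire_implanting_rate` is the fraction of sequences in a signal-carrying repertoire that receive the implant; how `get_repertoire` actually uses the rate is not shown here.

```python
sequence_count = 2000
repertoire_implanting_rate = 0.01

# Assumption: the rate is the fraction of sequences per repertoire
# in which the signal is implanted
expected_signal_sequences = round(sequence_count * repertoire_implanting_rate)
print(expected_signal_sequences)  # 20
```

Under that assumption, only about 20 of the 2000 sequences in a repertoire carry the signal, so the classifier has to detect a small shift in k-mer frequencies.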

In [None]:
index_node = ds.Generic(name="index", function=np.arange, size_field="stop")

immune_state_node = ds.Generic(name="immune_state", function=get_immune_state, arguments={"p": p_immune_state})

hospital_node = ds.Generic(name="hospital", function=get_hospital, arguments={"p": p_hospital})

experimental_protocol_node = ds.Generic(name="exp_protocol", function=get_exp_protocol, 
                                        arguments={"hospital": hospital_node})

repertoire_node = ds.Generic(name="repertoire", function=get_repertoire,
                             arguments={"immune_state": immune_state_node,
                                        "experimental_protocol": experimental_protocol_node,
                                        "path": data_path / "train", "sequence_count": sequence_count,
                                        "signal": immune_state_signal, "seed": index_node,
                                        "repertoire_implanting_rate": repertoire_implanting_rate})


In [None]:
# make a causal graph using DagSim and show it graphically

graph = ds.Graph(name="graph_experiment_3_1", list_nodes=[index_node, immune_state_node, hospital_node, experimental_protocol_node, repertoire_node])
graph.draw()