# Experiment 3: Selection Bias and Batch Effects

This experiment combines selection bias and batch effects. The AIRR data come from two hospitals: `hospital1`, which recruits mostly diseased individuals, and `hospital2`, which recruits mostly healthy individuals. We explore three scenarios:

1. the immune state signal is stronger than the influence of the experimental protocol used for AIRR-sequencing,

2. the immune state signal is weaker than the influence of the experimental protocol,

3. there is no connection between the immune state and AIRR: in this case, we show that ML models learn only a spurious correlation.
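The third scenario can be illustrated with a toy simulation that does not use the notebook's pipeline. The sketch below is a minimal illustration under stated assumptions: the single feature stands in for the frequency of a protocol-related k-mer, the recruitment probabilities (0.9 and 0.1) and the noise level are invented for the example, and the logistic regression is a hand-written gradient descent rather than the immuneML classifier. The names `simulate`, `fit_logistic_regression`, and `accuracy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p_h1_if_diseased, p_h1_if_healthy):
    """Toy cohort with one feature driven only by the hospital protocol (assumption)."""
    diseased = rng.random(n) < 0.5
    # selection bias: the probability of being recruited by hospital1 depends on the immune state
    p_h1 = np.where(diseased, p_h1_if_diseased, p_h1_if_healthy)
    hospital1 = rng.random(n) < p_h1
    # the feature depends on the hospital (protocol) but not on the immune state
    feature = rng.normal(loc=np.where(hospital1, 1.0, 0.0), scale=0.3)
    return feature.reshape(-1, 1), diseased.astype(float)

def fit_logistic_regression(X, y, learning_rate=0.5, steps=2000):
    """Plain gradient descent on the mean log-loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= learning_rate * (X.T @ (p - y)) / len(y)
        b -= learning_rate * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    """Fraction of individuals whose predicted label matches the true label."""
    return float(np.mean(((X @ w + b) > 0) == (y > 0.5)))

# training set with selection bias, test set without it
X_train, y_train = simulate(2000, p_h1_if_diseased=0.9, p_h1_if_healthy=0.1)
X_test, y_test = simulate(2000, p_h1_if_diseased=0.5, p_h1_if_healthy=0.5)

w, b = fit_logistic_regression(X_train, y_train)
train_accuracy = accuracy(X_train, y_train, w, b)
test_accuracy = accuracy(X_test, y_test, w, b)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

In this toy setup the model can only learn which hospital an individual came from, so its training accuracy reflects the selection bias, while on the unbiased test set the feature carries no information about the immune state. The exact numbers depend on the assumed probabilities and the seed.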

Immune state is a binary variable: `True` indicates a diseased individual and `False` a healthy one. AIRR is a set of sequences simulated based on the values of the immune state and the confounder (the experimental protocol) for the given individual. Hospital is a binary variable (`hospital1` or `hospital2`). Each hospital has its own experimental protocol that influences the observed AIRR. This influence manifests as a higher frequency of some k-mers in the sequenced AIRRs.
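To make the protocol effect concrete, the sketch below raises the frequency of one k-mer in a set of sequences. It is a minimal illustration and not the implanting code used in this notebook: the helpers `implant_kmer` and `fraction_with_kmer`, the k-mer `QYF`, the implanting rate of 0.2, and the uniformly random sequences (standing in for OLGA-generated ones) are all assumptions made for the example.

```python
import random

def implant_kmer(sequences, kmer, rate, seed=0):
    """Overwrite a random window with `kmer` in roughly a `rate` fraction of sequences."""
    rng = random.Random(seed)
    result = []
    for seq in sequences:
        if len(seq) >= len(kmer) and rng.random() < rate:
            start = rng.randrange(len(seq) - len(kmer) + 1)
            # replace in place so the sequence length stays the same
            seq = seq[:start] + kmer + seq[start + len(kmer):]
        result.append(seq)
    return result

def fraction_with_kmer(sequences, kmer):
    """Fraction of sequences that contain `kmer` at least once."""
    return sum(kmer in seq for seq in sequences) / len(sequences)

# random amino acid sequences as a stand-in for a naive repertoire (assumption)
alphabet = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(1)
naive = ["".join(rng.choice(alphabet) for _ in range(14)) for _ in range(1000)]

protocol_kmer = "QYF"  # hypothetical protocol-specific k-mer
with_protocol_effect = implant_kmer(naive, protocol_kmer, rate=0.2)

print(fraction_with_kmer(naive, protocol_kmer))
print(fraction_with_kmer(with_protocol_effect, protocol_kmer))
```

A repertoire processed this way contains the chosen k-mer far more often than the naive one, which is the kind of batch effect a k-mer-based classifier can pick up.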

Steps for each scenario:

1. Simulate the training and test datasets from a causal graph that includes the variables described above.

2. Train an ML model (here: logistic regression on repertoires represented by k-mer frequencies) on the training set, which has selection bias, and assess its performance on a test set without selection bias.
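The k-mer frequency representation mentioned in step 2 can be sketched as follows. This is a simplified illustration and not immuneML's encoder: the function name `kmer_frequencies`, the choice of overlapping 3-mers, and the normalization to relative frequencies are assumptions made for the example.

```python
from collections import Counter

def kmer_frequencies(sequences, k=3):
    """Relative frequency of every overlapping k-mer across all sequences of a repertoire."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}

# a tiny made-up repertoire with two sequences
repertoire = ["CASSL", "CASSF"]
freqs = kmer_frequencies(repertoire, k=3)
print(freqs)
```

Each repertoire becomes one vector of such frequencies, and the classifier is trained on these vectors rather than on the raw sequences.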

Software used:

- DagSim for simulation of the causal graph,
- immuneML v2.1 for implanting signals in AIRRs and for training and assessing machine learning classifiers,
- OLGA for simulation of naive AIRRs.

In [None]:
import os
import yaml
import dagsim.baseDS as ds
import numpy as np
from pathlib import Path
from util.repertoire_util import make_olga_repertoire, load_iml_repertoire, make_AIRR_dataset, make_dataset
from util.implanting import make_immune_signal, make_repertoire_without_signal, make_repertoire_with_signal
from immuneML.util.PathBuilder import PathBuilder
from immuneML.simulation.implants.Signal import Signal

In [None]:
# remove results from the previous run

import shutil

if Path("./experiment3/").is_dir():
    shutil.rmtree("./experiment3/")

## Scenario 1: immune state signal is stronger than the influence of experimental protocol

### Step 1: AIRR simulation from a causal graph

In [None]:
# define functions to create immune state, hospital, experimental protocol, and AIRR

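def get_immune_state(p_immune_state: float) -> bool:
    # sample the immune state from a Bernoulli distribution: True = diseased, False = healthy
    return bool(np.random.binomial(n=1, p=p_immune_state))

def get_hospital(p: float) -> str:
    # return "hospital1" with probability p and "hospital2" otherwise
    return "hospital1" if np.random.binomial(n=1, p=p) > 0.5 else "hospital2"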

def get_repertoire(immune_state: bool, experimental_protocol: int, path: Path, sequence_count: int, signal: Signal, seed: int, repertoire_implanting_rate: float) -> str:
        
    PathBuilder.build(path)    

    # make OLGA repertoire from the default OLGA TCRB model
    naive_repertoire = make_olga_repertoire(path=path, sequence_count=sequence_count, seed=seed)
    
    # implant a signal in the repertoire based on the immune state
    if immune_state:
        repertoire = make_repertoire_with_signal(repertoire=naive_repertoire, signal=signal, result_path=path / "immuneML_with_signal/", repertoire_implanting_rate=repertoire_implanting_rate)
    else:
        repertoire = make_repertoire_without_signal(repertoire=naive_repertoire, signal_name=signal.id, result_path=path / "immuneML_with_signal/")

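    if experimental_protocol == 1:
        # placeholder: the protocol-specific step (raising the frequency of some k-mers) is missing here
        pass

    return repertoire.data_filename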