# [$R^2$] Repeatable
##### Last Updated: October 2022


## Introduction: <sup>[1]</sup>
> The code is running and producing the expected results. The next step is to make sure that you can produce the same output over successive runs of your program. In other words, the next step is to make your program deterministic, producing repeatable output. Repeatability is valuable. If a run of the program produces a particularly puzzling result, repeatability allows you to scrutinize any step of the execution of the program by re-running it again with extraneous prints, or inside a debugger. Repeatability is also useful to prove that the program did indeed produce the published results. Repeatability is not always possible or easy (Diethelm, 2012; Courtès and Wurmus, 2015). But for sequential and deterministically parallel programs (Hines and Carnevale, 2008; Collange et al., 2015) not depending on analog inputs, it often comes down to controlling the initialization of the pseudo-random number generators (RNG).
> 
> For our program, that means setting the seed of the random module. We may also want to save the output of the program to a file, so that we can easily verify that consecutive runs do produce the same output: eyeballing differences is unreliable and time-consuming, and therefore won't be done systematically.
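
The two ideas in the quoted passage, seeding the generator and saving output to a file for comparison, can be sketched in a few lines. This is a minimal illustration, not code from this project: `run_simulation` and `save_and_hash` are hypothetical helpers, and comparing SHA-256 digests is just one convenient way to check that two output files are byte-identical.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def run_simulation(seed, n=5):
    """Hypothetical stand-in for a program that draws random numbers."""
    rng = random.Random(seed)  # local generator; the global state is untouched
    return [rng.random() for _ in range(n)]


def save_and_hash(values, path):
    """Write one value per line and return the SHA-256 digest of the file."""
    path.write_text("\n".join(repr(v) for v in values) + "\n")
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = save_and_hash(run_simulation(seed=0), Path(tmp) / "run1.txt")
    second = save_and_hash(run_simulation(seed=0), Path(tmp) / "run2.txt")
    other = save_and_hash(run_simulation(seed=1), Path(tmp) / "run3.txt")

print(first == second)  # same seed: the two files are byte-identical
print(first == other)   # different seed: the files differ
```

Comparing digests removes the need to eyeball the output, so the check is cheap enough to run after every change.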

## Solution:

To make the code repeatable, we initialize the random seed with a fixed integer value before execution. Randomization occurs in two parts of the code, synthetic data generation and model fitting, so repeatability can fail in either. This notebook checks both.

### Synthetic Data Generation:

In the synthetic data generation step, when an integer is passed for the optional `seed` parameter of the `SyntheticData` class, the NumPy random state is set to that value. We first show that repeatability does not hold when `seed` is not specified. We then specify `seed` at instantiation to validate repeatability in synthetic data generation.
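
To make the mechanism concrete, here is a minimal sketch of how an optional `seed` parameter can control a generator. `ToyGenerator` is a hypothetical class written for this illustration and is not the project's `SyntheticData` implementation. It assumes a fresh `numpy.random.RandomState` is created on every call, which is one of several ways to restart the random stream.

```python
import numpy as np
import pandas as pd


class ToyGenerator:
    """Hypothetical generator; not the project's SyntheticData class."""

    def __init__(self, seed=None):
        self.seed = seed

    def generate(self, num_points):
        # A fresh RandomState per call: a fixed seed restarts the stream,
        # while seed=None takes fresh entropy from the operating system.
        rng = np.random.RandomState(self.seed)
        z = rng.standard_normal(num_points)
        a = rng.binomial(1, 0.5, size=num_points)
        return pd.DataFrame({"Z": z, "A": a})


seeded = ToyGenerator(seed=0)
print(seeded.generate(100).equals(seeded.generate(100)))      # True

unseeded = ToyGenerator()
print(unseeded.generate(100).equals(unseeded.generate(100)))  # expected to be False
```

Using a local `RandomState` rather than calling `np.random.seed` keeps the seed from affecting other code that reads NumPy's global random state.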


In [1]:
# Without a fixed integer random seed
from src.SyntheticData import SyntheticData

c, k = 0.1, 1.6
num_points = 100000

no_seed = SyntheticData(
    treatment_effect = c, 
    treatment_assignment_bias = k)
df, config = no_seed.generate(num_points = num_points)

df.head()

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome
0,-0.313022,0,1,0,0,0,ok
1,0.786144,0,1,0,0,0,ok
2,-1.700151,1,1,0,0,0,ok
3,-1.422688,0,1,0,0,1,ok
4,0.520892,0,0,1,0,1,harm


In [2]:
_df, _config = no_seed.generate(num_points = num_points)
_df.head()

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome
0,0.320898,0,1,0,0,1,ok
1,-0.471875,1,1,0,0,0,ok
2,-0.680067,1,1,0,0,1,ok
3,-0.68738,1,1,0,0,0,ok
4,-2.222283,1,1,0,0,0,ok


In [3]:
df.equals(_df)

False

In [4]:
# With a fixed integer random seed

c, k = 0.1, 1.6
num_points = 100000

with_seed = SyntheticData(
    treatment_effect = c, 
    treatment_assignment_bias = k,
    seed = 0
)

df, config = with_seed.generate(num_points = num_points)
df.head()

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome
0,1.764052,1,1,1,1,1,harm
1,0.400157,0,1,0,0,1,ok
2,0.978738,1,1,0,0,0,ok
3,2.240893,1,1,0,0,1,ok
4,1.867558,0,1,0,0,0,ok


In [5]:
_df, config = with_seed.generate(num_points = num_points)
_df.head()

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome
0,1.764052,1,1,1,1,1,harm
1,0.400157,0,1,0,0,1,ok
2,0.978738,1,1,0,0,0,ok
3,2.240893,1,1,0,0,1,ok
4,1.867558,0,1,0,0,0,ok


In [6]:
df.equals(_df)

True
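
`DataFrame.equals` returns a single boolean, which says nothing about where two frames differ. When a repeatability check fails, `pandas.testing.assert_frame_equal` gives a more useful diagnostic. The sketch below uses two small made-up frames, not the notebook's data.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Two small illustrative frames (made-up values) that differ in one cell.
left = pd.DataFrame({"Z": [0.1, 0.2, 0.3], "A": [0, 1, 1]})
right = left.copy()
right.loc[1, "Z"] = 0.25

print(left.equals(right))  # False, with no hint about where the frames differ

message = ""
try:
    assert_frame_equal(left, right)
except AssertionError as err:
    message = str(err)
    print(message)  # names the differing column and shows the values

# Positions of the mismatching cells as (row, column) index pairs.
mismatches = np.argwhere((left != right).to_numpy())
print(mismatches)
```

`assert_frame_equal` also accepts `rtol` and `atol` arguments, which help when bitwise equality is too strict, for example when comparing results across machines.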

### Model Fitting:

In the model fitting step, the `SupervisedLearningModel` class relies on `sklearn`'s `train_test_split()` function. The class optionally accepts an integer `seed` that controls how the train and test sets are drawn for the experiment. If `seed` is not specified, the split changes from run to run. `sklearn`'s documentation describes this default behavior as follows:

> `random_state`
> Whenever randomization is part of a Scikit-learn algorithm, a `random_state` parameter may be provided to control the random number generator used. Note that the mere presence of `random_state` doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. `shuffle`, being set.
>
> The passed value will have an effect on the reproducibility of the results returned by the function (`fit`, `split`, or any other function like `k_means`). `random_state`’s value may be:
>
> `None (default)`
> Use the global random state instance from `numpy.random`. Calling the function multiple times will reuse the same instance, and will produce different results.

As in the synthetic data generation step, we first show that repeatability does not hold when `seed` is not specified. Then, using the `df` defined previously, we create an instance of the `SupervisedLearningModel` class with the `seed` parameter set, apply the same fitting procedure twice, and test the results for equality to confirm repeatability.
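
The effect of `random_state` on `train_test_split()` can be seen in isolation with a small example. The arrays below are made up for illustration. How `SupervisedLearningModel` forwards its `seed` to `sklearn` is not shown in this notebook, so this sketch calls `train_test_split()` directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixed integer seed: the shuffle, and therefore the split, is identical each call.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(y_test_a, y_test_b))  # True

# random_state=None (the default): draws from numpy.random's global state,
# so repeated calls are expected to produce different test sets.
splits = [tuple(train_test_split(X, y, test_size=0.3)[3]) for _ in range(20)]
print(len(set(splits)) > 1)
```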


In [7]:
# Without a fixed integer random seed
from sklearn.linear_model import LogisticRegression
from src.SupervisedLearningModel import SupervisedLearningModel

model_list = ['propensity', 'observational', 'counterfactual']

results = df.copy()
for model_key in model_list:
    params = config.copy()
    clf = LogisticRegression(penalty = 'none')
    no_seed_model = SupervisedLearningModel(model = clf, name = model_key)
    if model_key == 'propensity':
        params['target'] = params['treat']['name']
    else:
        params['target'] = params['outcome']['name']
    no_seed_model.fit(results, params)
    train = results[params['features']['training']]
    if model_key == 'propensity':
        results[model_key] = clf.predict_proba(train)[:, 1:]
    else:
        results[model_key] = clf.predict_proba(train)[:, :1]
        
results

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome,propensity,observational,counterfactual
0,1.764052,1,1,1,1,1,harm,0.943044,0.158822,0.779542
1,0.400157,0,1,0,0,1,ok,0.471493,0.233616,0.476512
2,0.978738,1,1,0,0,0,ok,0.884415,0.134992,0.614637
3,2.240893,1,1,0,0,1,ok,0.963580,0.174893,0.851501
4,1.867558,0,1,0,0,0,ok,0.790547,0.303216,0.801171
...,...,...,...,...,...,...,...,...,...,...
99995,-0.337715,0,0,0,0,0,ok,0.301653,0.203104,0.301097
99996,-2.028548,0,0,0,0,0,ok,0.075757,0.144651,0.072008
99997,0.726182,0,1,0,0,0,ok,0.551396,0.248073,0.558852
99998,-1.167831,1,0,0,0,0,ok,0.481248,0.084845,0.153248


In [8]:
_results = df.copy()
for model_key in model_list:
    params = config.copy()
    clf = LogisticRegression(penalty = 'none')
    no_seed_model = SupervisedLearningModel(model = clf, name = model_key)
    if model_key == 'propensity':
        params['target'] = params['treat']['name']
    else:
        params['target'] = params['outcome']['name']
    no_seed_model.fit(_results, params)
    train = _results[params['features']['training']]
    if model_key == 'propensity':
        _results[model_key] = clf.predict_proba(train)[:, 1:]
    else:
        _results[model_key] = clf.predict_proba(train)[:, :1]
        
_results

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome,propensity,observational,counterfactual
0,1.764052,1,1,1,1,1,harm,0.942432,0.157088,0.775389
1,0.400157,0,1,0,0,1,ok,0.470596,0.233467,0.474255
2,0.978738,1,1,0,0,0,ok,0.883661,0.133952,0.610118
3,2.240893,1,1,0,0,1,ok,0.963094,0.172663,0.848052
4,1.867558,0,1,0,0,0,ok,0.788710,0.301423,0.798236
...,...,...,...,...,...,...,...,...,...,...
99995,-0.337715,0,0,0,0,0,ok,0.301684,0.203594,0.300175
99996,-2.028548,0,0,0,0,0,ok,0.076372,0.146125,0.072430
99997,0.726182,0,1,0,0,0,ok,0.550095,0.247600,0.556110
99998,-1.167831,1,0,0,0,0,ok,0.482136,0.085023,0.152540


In [9]:
results.equals(_results)

False

In [10]:
# With a fixed integer random seed

results = df.copy()
for model_key in model_list:
    params = config.copy()
    clf = LogisticRegression(penalty = 'none')
    with_seed_model = SupervisedLearningModel(model = clf, name = model_key, seed = 0)
    if model_key == 'propensity':
        params['target'] = params['treat']['name']
    else:
        params['target'] = params['outcome']['name']
    with_seed_model.fit(results, params)
    train = results[params['features']['training']]
    if model_key == 'propensity':
        results[model_key] = clf.predict_proba(train)[:, 1:]
    else:
        results[model_key] = clf.predict_proba(train)[:, :1]
        
results

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome,propensity,observational,counterfactual
0,1.764052,1,1,1,1,1,harm,0.943788,0.157911,0.776830
1,0.400157,0,1,0,0,1,ok,0.473533,0.234511,0.478288
2,0.978738,1,1,0,0,0,ok,0.885344,0.134639,0.611137
3,2.240893,1,1,0,0,1,ok,0.964167,0.173575,0.849429
4,1.867558,0,1,0,0,0,ok,0.793377,0.302766,0.802018
...,...,...,...,...,...,...,...,...,...,...
99995,-0.337715,0,0,0,0,0,ok,0.302425,0.204496,0.302788
99996,-2.028548,0,0,0,0,0,ok,0.075290,0.146741,0.072685
99997,0.726182,0,1,0,0,0,ok,0.553913,0.248710,0.560508
99998,-1.167831,1,0,0,0,0,ok,0.480242,0.085422,0.151681


In [11]:
_results = df.copy()
for model_key in model_list:
    params = config.copy()
    clf = LogisticRegression(penalty = 'none')
    with_seed_model = SupervisedLearningModel(model = clf, name = model_key, seed = 0)
    if model_key == 'propensity':
        params['target'] = params['treat']['name']
    else:
        params['target'] = params['outcome']['name']
    with_seed_model.fit(_results, params)
    train = _results[params['features']['training']]
    if model_key == 'propensity':
        _results[model_key] = clf.predict_proba(train)[:, 1:]
    else:
        _results[model_key] = clf.predict_proba(train)[:, :1]
        
_results

Unnamed: 0,Z,A,treat_num,Y,Y_1,Y_0,outcome,propensity,observational,counterfactual
0,1.764052,1,1,1,1,1,harm,0.943788,0.157911,0.776830
1,0.400157,0,1,0,0,1,ok,0.473533,0.234511,0.478288
2,0.978738,1,1,0,0,0,ok,0.885344,0.134639,0.611137
3,2.240893,1,1,0,0,1,ok,0.964167,0.173575,0.849429
4,1.867558,0,1,0,0,0,ok,0.793377,0.302766,0.802018
...,...,...,...,...,...,...,...,...,...,...
99995,-0.337715,0,0,0,0,0,ok,0.302425,0.204496,0.302788
99996,-2.028548,0,0,0,0,0,ok,0.075290,0.146741,0.072685
99997,0.726182,0,1,0,0,0,ok,0.553913,0.248710,0.560508
99998,-1.167831,1,0,0,0,0,ok,0.480242,0.085422,0.151681


In [12]:
results.equals(_results)

True
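
The run-twice-and-compare pattern used throughout this notebook can be wrapped in a small helper. The sketch below is a hypothetical illustration, not part of the project: `fit_and_score` and `is_repeatable` are invented names, the data are made up, and `LogisticRegression` is used with its default settings rather than the settings in the cells above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, seed=None):
    """Hypothetical pipeline: split, fit on the train half, score every row."""
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    clf = LogisticRegression()  # default settings; an assumption for this sketch
    clf.fit(X_train, y_train)
    return clf.predict_proba(X)[:, 1]


def is_repeatable(run, n_runs=3):
    """Return True if every call to `run` yields exactly the same array."""
    reference = run()
    return all(np.array_equal(reference, run()) for _ in range(n_runs - 1))


# Made-up data: a noisy binary label driven by a single feature.
rng = np.random.RandomState(42)
X = rng.standard_normal((200, 1))
y = (X[:, 0] + rng.standard_normal(200) > 0).astype(int)

print(is_repeatable(lambda: fit_and_score(X, y, seed=0)))  # True
print(is_repeatable(lambda: fit_and_score(X, y)))          # expected to be False
```

A helper like this makes it easy to add a repeatability check to an automated test suite, so the property is verified on every change rather than by hand.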

## References:

[1] Benureau FCY, Rougier NP. <i>Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions.</i> Front Neuroinform. 2018 Jan 4;11:69. doi: 10.3389/fninf.2017.00069. PMID: 29354046; PMCID: PMC5758530.