![PPGI_UFRJ](imagens/ppgi-ufrj.png)
# Fundamentals of Data Science

---
# PPGI/UFRJ 2020.3/2022.2, 2024.2
## Profs. Sergio Serra and Jorge Zavaleta

---
# Reproducibility in Python

Source: Fabien C. Y. Benureau and Nicolas P. Rougier, "Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions".

> "Replicability is a cornerstone of science. If an experimental result cannot be re-obtained by an independent party, it merely becomes, at best, an observation that may inspire future research (Mesirov, 2010; Open Science Collaboration, 2015)."

## R0 - Irreproducibility

A program can fail as a scientific contribution in many ways and for many reasons: code errors, deprecated methods, older compiler or interpreter versions, lack of documentation, and so on. The cell below was written for Python 2 and no longer runs under Python 3.

In [10]:
import random

for i in xrange(10):                  # xrange exists only in Python 2
    step = random.choice([-1,+1])
    x += step                         # x is never initialized
print x,                              # print statement: Python 2 syntax

SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)? (22499713.py, line 6)

## R1 - Re-Runnable

Re-runnable code should describe, in enough detail to be recreated, an execution environment in which it is executable. Doing so is far from obvious or easy.
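One way to make that description concrete is to let the program report its own environment. The following is a minimal sketch, assuming that the Python version and the platform string are enough detail for an example this small; `describe_environment` is a hypothetical helper name, not part of the original material.

```python
import platform
import sys


def describe_environment():
    """Return a small description of the execution environment (hypothetical helper)."""
    return {
        "python_version": platform.python_version(),         # version string of the interpreter
        "implementation": platform.python_implementation(),  # e.g. CPython or PyPy
        "platform": platform.platform(),                     # OS name, release and architecture
        "executable": sys.executable,                        # path of the interpreter in use
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Saving this dictionary next to the results gives a later reader a starting point for recreating the environment, although it does not capture installed packages.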

In [11]:
# Random walk (R1: re-runnable)
# Tested with Python 3.8
# Where S = steps, D = data, E = environment and R = results

import random

x = 0                               # Initialization
walk = []
for i in range(10):                 # Loop - S’= S and D’= D
    step = random.choice([-1,+1])   # random.choice() returns a random element from a non-empty sequence - E’~ E
    x += step                       # Take one step left or right
    walk.append(x)

print(walk)                         # Output - R != R’ (the generator is not seeded)

[1, 2, 1, 2, 3, 2, 1, 2, 1, 0]


### Run again, again and again...

Is the output the same?

#### S’= S and E’~ E and D’= D and R != R’ 

In [12]:
# Random walk (R1: re-runnable)
# Tested with Python 3.8
# Where S = steps, D = data, E = environment and R = results

import random

x = 0                               # Initialization
walk = []
for i in range(10):                 # Loop - S’= S and D’= D
    step = random.choice([-1,+1])   # random.choice() returns a random element from a non-empty sequence - E’~ E
    x += step                       # Take one step left or right
    walk.append(x)

print(walk)                         # Output - R != R’ (the generator is not seeded)

[-1, -2, -1, -2, -1, -2, -3, -2, -1, -2]
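Running the cell again and again by hand can be automated. The sketch below assumes that each run can be modeled by a fresh, unseeded `random.Random()` instance; it repeats the walk many times and counts how many distinct walks appear. `unseeded_walk` is a hypothetical helper introduced only for this illustration.

```python
import random


def unseeded_walk(steps=10):
    """One random walk driven by a fresh generator with no explicit seed."""
    rng = random.Random()          # seeded from the OS or the clock: differs on every call
    x = 0
    walk = []
    for _ in range(steps):
        x += rng.choice([-1, +1])
        walk.append(x)
    return walk


runs = [tuple(unseeded_walk()) for _ in range(200)]
distinct = set(runs)
print(f"{len(runs)} runs produced {len(distinct)} distinct walks")
```

With the same steps, data and environment, the results still differ from run to run, which is exactly the R != R’ situation described above.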


## R2 - Repeatable

Repeatable code can be rerun and produces the SAME results on successive runs.

The program needs to be deterministic.

The initialization of pseudo-random number generators must be controlled.

Previous results need to be available, so that they can be compared with current results.
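These requirements can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' code: it uses a local `random.Random(seed)` generator instead of the global one, stores the first result as JSON in a temporary directory, and compares a later run against it. The names `seeded_walk` and `previous-walk.json` are hypothetical.

```python
import json
import os
import random
import tempfile


def seeded_walk(seed, steps=10):
    """Random walk driven by a local generator with a fixed seed."""
    rng = random.Random(seed)      # local generator: no hidden global state
    x = 0
    walk = []
    for _ in range(steps):
        x += rng.choice([-1, +1])
        walk.append(x)
    return walk


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "previous-walk.json")   # hypothetical file name

    # First run: save the result to disk
    with open(path, "w") as fd:
        json.dump(seeded_walk(seed=1), fd)

    # Later run: recompute and compare with the saved result
    with open(path) as fd:
        previous = json.load(fd)
    current = seeded_walk(seed=1)
    same = (current == previous)
    print("repeatable:", same)
```

The comparison is only meaningful on the same Python engine; a different interpreter version may implement the generator differently.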

#### S’= S and E’~ E and D’= D and R = R’ 

In [13]:
# Random walk (R2: repeatable)
# Tested with Python 3

import random

random.seed(1)                     # RNG initialization

x = 0                              # Initialization
walk = []
for i in range(10):                # Loop - S’= S and D’= D
    step = random.choice([-1,+1])  # random.choice() returns a random element from a non-empty sequence - E’~ E
    x += step                      # Note: the pseudo-random number generator changed between Python 3.2 and Python 3.3
    walk.append(x)

print(walk)

# Saving output to disk
with open('data/results-R2.txt', 'w') as fd:
    fd.write(str(walk))            # Output - R = R’ on the same Python engine!


[-1, -2, -1, -2, -1, 0, 1, 2, 1, 0]


## R3 - Reproducible

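Reproducible code goes one step beyond repeatable code: another researcher can take the original code and input data, execute them, and re-obtain the SAME results. The program still needs to be deterministic, and each result should be saved together with a record of how it was produced (seed, timestamp, Python version).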

#### S’= S and E’~ E and D’= D and R = R’ 

In [14]:
# Random walk (R3)
# Copyright (c) 2017 N.P. Rougier and F.C.Y. Benureau
# Adapted by Serra
# Run on Windows 10
# Tested with 64 bit (AMD64) 

import sys, subprocess, datetime, random

def compute_walk():
    x = 0
    walk = []
    for i in range(10):
        if random.uniform(-1, +1) > 0:
            x += 1
        else:
            x -= 1
        walk.append(x)
    return walk

# Unit test
random.seed(42)
assert compute_walk() == [1,0,-1,-2,-1,0,1,0,-1,-2]

# Random walk for 10 steps
seed = 1
random.seed(seed)
walk = compute_walk()

# Display & save scientific results with poor provenance
# datetime.datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp instead

print(walk)
results = {
    "data"     : walk,
    "seed"     : seed,
    "timestamp": str(datetime.datetime.now(datetime.timezone.utc)),
    "system"   : sys.version}
with open("data/results-R3a.txt", "w") as fd:
    fd.write(str(results))


[-1, 0, 1, 0, -1, -2, -1, 0, -1, -2]
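Because the record is written with `str(results)`, it can be read back with `ast.literal_eval` and checked by re-running the walk with the recorded seed. The following is a minimal sketch, assuming the record holds only Python literals (lists, numbers, strings), as it does in the cell above; the temporary file and the `verify_record` helper are hypothetical.

```python
import ast
import os
import random
import tempfile


def compute_walk():
    """Same 10-step walk as in the cell above, driven by the global generator."""
    x = 0
    walk = []
    for _ in range(10):
        x += 1 if random.uniform(-1, +1) > 0 else -1
        walk.append(x)
    return walk


def verify_record(path):
    """Re-run the walk with the recorded seed and compare with the recorded data."""
    with open(path) as fd:
        record = ast.literal_eval(fd.read())   # parses literals only, never executes code
    random.seed(record["seed"])
    return compute_walk() == record["data"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.txt")
    random.seed(1)
    with open(path, "w") as fd:
        fd.write(str({"data": compute_walk(), "seed": 1}))
    ok = verify_record(path)
    print("reproduced:", ok)
```

A format such as JSON would be easier to exchange with other tools; `str()` is kept here only to stay close to the cell above.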


## R3 - Reproducible 

As before, the program needs to be deterministic and must produce the SAME results on successive runs. This version also records some retrospective provenance: the agent who ran the code and the git revision of the code.

#### S’= S and E’~ E and D’= D and R = R’ 

#### Some Provenance

In [15]:
# Copyright (c) 2017 N.P. Rougier and F.C.Y. Benureau
# Adapted by Serra
# Run on Windows 10
# Tested with 64 bit (AMD64) 

import sys, subprocess, datetime, random

# Retrospective Provenance
agent  =  "Sergio Serra"
myseed = 42             

def compute_walk():
    x = 0
    walk = []
    for i in range(10):
        if random.uniform(-1, +1) > 0:
            x += 1
        else:
            x -= 1
        walk.append(x)
    return walk

# If repository is dirty, don't run anything
if subprocess.call(("git", "diff-index",
                    "--quiet", "HEAD")):
    print("Repository is dirty, please commit first")
    sys.exit(1)

# Get git hash if any (decoded from bytes, trailing newline removed)
hash_cmd = ("git", "rev-parse", "HEAD")
revision = subprocess.check_output(hash_cmd).decode().strip()

# Unit test
random.seed(int(myseed))
assert compute_walk() == [1,0,-1,-2,-1,0,1,0,-1,-2]

# Random walk for 10 steps
seed = 1
random.seed(seed)
walk = compute_walk()

# Display & save results & some retrospective provenance
print(walk)
results = {
    "data"     : walk,
    "seed"     : seed,
    "myseed"   : myseed,
    "timestamp": str(datetime.datetime.now(datetime.timezone.utc)),
    "revision" : revision,
    "system"   : sys.version,
    "agent"    : agent
    }
with open("data/results-R3b.txt", "w") as fd:
    fd.write(str(results))


[-1, 0, 1, 0, -1, -2, -1, 0, -1, -2]
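The cell above stops with an error when `git` is not installed or the notebook is not inside a repository. As a hedged alternative, the sketch below returns `None` in those cases so that the caller can decide what to record; `git_revision` is a hypothetical helper, and it deliberately leaves out the dirty-repository check.

```python
import subprocess


def git_revision():
    """Return the current commit hash as a string, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ("git", "rev-parse", "HEAD"),
            capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the current directory is not a repository
        return None
    return completed.stdout.strip()


revision = git_revision()
print("revision:", revision if revision is not None else "unknown")
```

Recording "unknown" weakens the provenance, so for published results the stricter behavior of the cell above is the safer choice.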


## R3 - Reproducible - Rich Version

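As before, the program needs to be deterministic and must produce the SAME results on successive runs. This richer version exposes the simulation parameters as function arguments and saves them with the result, along with PROV-style agent, entity and activity names.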

#### S’= S and E’~ E and D’= D and R = R’ 

#### Data Provenance - Poor view

In [16]:
# Random walk (R4)
# Copyright (c) 2017 N.P. Rougier and F.C.Y. Benureau
# Adapted by Serra
# Run on Windows 10
# Python 3.8 - Jupyter notebook
# Tested with 64 bit (AMD64)

import sys, subprocess, datetime, random

# Retrospective Provenance
agent    = input("Enter the name of the person running the program: ")         # PROV-Agent
entity   = input("Enter the name of the dataset: ")                             # PROV-Entity
activity = input("Enter the name of the experiment: ")                          # PROV-Activity

def compute_walk(count, x0=0, step=1, seed=0):
    """Random walk
       count: number of steps
       x0   : initial position (default 0)
       step : step size (default 1)
       seed : seed for the initialization of the
              random generator (default 0)
    """
    random.seed(seed)
    x = x0
    walk = []
    for i in range(count):
        if random.uniform(-1, +1) > 0:
            x += step              # use the step parameter (identical to +1 for the default step)
        else:
            x -= step
        walk.append(x)
    return walk

def compute_results(count, x0=0, step=1, seed=0):
    """Compute a walk and return it with context"""
    # If repository is dirty, don't do anything
    if subprocess.call(("git", "diff-index",
                        "--quiet", "HEAD")):
        print("Repository is dirty, please commit")
        sys.exit(1)

    # Get git hash if any (decoded from bytes, trailing newline removed)
    hash_cmd = ("git", "rev-parse", "HEAD")
    revision = subprocess.check_output(hash_cmd).decode().strip()

    # Compute results and full retrospective provenance
    walk = compute_walk(count=count, x0=x0,
                        step=step, seed=seed)
    return {
        "data"      : walk,
        "parameters": {"count": count, "x0": x0,
                       "step": step, "seed": seed},
        "timestamp" : str(datetime.datetime.now(datetime.timezone.utc)),
        "revision"  : revision,
        "system"    : sys.version,
        # PROV relations: the entity wasAttributedTo the agent
        # and wasGeneratedBy the activity
        "Provenance": {
            "PROV-agent"     : agent,
            "PROV-entity"    : entity,
            "PROV-activity"  : activity,
            "wasAttributedTo": agent,
            "wasGeneratedBy" : activity}
    }

if __name__ == "__main__":
    # Unit test checking reproducibility
    # (will fail with Python<=3.2)
    assert (compute_walk(10, 0, 1, 42) ==
            [1,0,-1,-2,-1,0,1,0,-1,-2])

    # Simulation parameters
    count, x0, seed = 10, 0, 1
    results = compute_results(count, x0=x0, seed=seed)

    # Save & display results
    with open("data/results-R4.txt", "w") as fd:
        fd.write(str(results))
    print(results["data"])

[-1, 0, 1, 0, -1, -2, -1, 0, -1, -2]
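The agent, entity and activity collected above map onto the core W3C PROV concepts. The sketch below shows one possible way to store them with explicit relations and to serialize the record as JSON. The layout is a hypothetical one chosen for readability, not the official PROV-JSON format, and the example values are illustrative.

```python
import json


def prov_record(agent, entity, activity):
    """Build a small PROV-style record (hypothetical layout, not PROV-JSON)."""
    return {
        "agent": agent,
        "entity": entity,
        "activity": activity,
        "relations": [
            # W3C PROV: an entity wasAttributedTo an agent
            {"type": "wasAttributedTo", "entity": entity, "agent": agent},
            # W3C PROV: an entity wasGeneratedBy an activity
            {"type": "wasGeneratedBy", "entity": entity, "activity": activity},
            # W3C PROV: an activity wasAssociatedWith an agent
            {"type": "wasAssociatedWith", "activity": activity, "agent": agent},
        ],
    }


record = prov_record(agent="Sergio Serra",
                     entity="results-R4.txt",
                     activity="random walk, 10 steps")
text = json.dumps(record, indent=2)
print(text)
```

Keeping each relation as its own entry makes it clear which element plays which role, which a flat dictionary of names does not.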


---
#### &copy; Copyright 2021-2022, 2024.2 - Sergio Serra & Jorge Zavaleta