# Reproducible Data Science

## Tory W. Clasen
## Chief Data Scientist
### 18 JUL 19

# My Backstory


 - 20 years of software development (Python < 0.9!)
 -  8 years of cyber stuff
 -  4 years of data science
 
 
 ** Remind Tory to go on a tangent, time permitting

# What does it mean for research to be reproducible?

 - <B>Reproducibility:</B>

A study is reproducible if you can take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study.

** <B>functional purity</B>

 - <B>Replicability:</B>

This is the act of repeating an entire study independently of the original investigator, without using the original data (but generally using the same methods).

# Reproducible research has 4 components:

## Source Data

In [14]:
%%bash
head -n 4 malware_hash_list.txt

b439982f30b5b47fde89ff1384b671e0
de6fdc009fda2de3cf4dab4bfd4529c2
62028945d0ab974e183756ebbb1ca07f
f6febaf70042268a17efc985875131f5


In [15]:
%%bash
cat 1_download_samples.sh

#!/bin/bash

source config.env

if [ ! -d samples ]; then
    mkdir samples

    # Read one hash per line; each sample is saved under its own hash.
    while read -r sample; do
        echo "Downloading $sample"
        curl -o "./samples/$sample" "https://malshare.com/api.php?api_key=$MALSHARE_API_KEY&action=getfile&hash=$sample"
    done < malware_hash_list.txt
fi
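
Source data is only reproducible if the download can be checked. The sketch below assumes the 32-character entries in `malware_hash_list.txt` are MD5 hashes and that each file in `samples/` is named after its hash, as the download script does. `verify_samples` is a hypothetical helper name, and `md5sum` from GNU coreutils is assumed to be installed.

```shell
#!/bin/bash
# Check that every file in a directory hashes to its own file name.
# Assumption: files are named after their MD5 hash (see 1_download_samples.sh).
verify_samples() {
    local dir="$1" status=0 file expected actual
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        expected="$(basename "$file")"
        actual="$(md5sum "$file" | cut -d' ' -f1)"
        if [ "$expected" != "$actual" ]; then
            echo "MISMATCH $expected (got $actual)" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `verify_samples samples` after the download step returns non-zero if any sample does not match its name, for example when the API returned an error page instead of the file.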


## Source Code

In [16]:
%%bash
cat 2_hash_samples.sh

#!/bin/bash

source config.env

if [ ! -d hashes ]; then
    mkdir hashes

    docker run \
        --rm \
        -it \
        -v "$PWD"/samples:/archive \
        -v "$PWD"/hashes:/hashes \
        -w /hashes \
        --name pharos \
        eschwartz/pharos:$PHAROS_DOCKER_TAG \
        /bin/bash -c "\
            for i in /archive/*; do \
                fn2hash \$i; \
            done > hashes.csv; \
            for i in /archive/*; do \
                fn2yara \$i; \
            done"
fi


## Configuration Information

In [17]:
%%bash
cat config.env

JUPYTER_PASSWORD=""
JUPYTER_DOCKER_TAG="6c3390a9292e"

MALSHARE_API_KEY="<REPLACE WITH YOUR API KEY>"

PHAROS_DOCKER_TAG="latest"

** <B>Pets vs. Cattle</B>

## Intermediate Data and Results

### Show your work!

# Reproducible Data Science with 10 easy rules, you won't believe #9!

## Rule 1: For Every Result, Keep Track of How It Was Produced

 - "OP Notes"
 - Jupyter Notebooks
 - "script" command
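
For work done outside a notebook, a small wrapper can keep "OP Notes" automatically. This is a minimal sketch, not a replacement for `script`: `run_logged` and the `PROVENANCE_LOG` variable are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Run a command, then append a provenance record for it.
# PROVENANCE_LOG is a hypothetical variable; it defaults to ./provenance.log.
run_logged() {
    local log="${PROVENANCE_LOG:-provenance.log}" rc
    "$@"
    rc=$?
    # Record: UTC timestamp, working directory, command line, exit code.
    printf '%s\t%s\t%s\texit=%d\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$PWD" "$*" "$rc" >> "$log"
    return "$rc"
}
```

For example, `run_logged ./1_download_samples.sh` runs the script as usual and leaves one tab-separated line in the log saying when and where it ran and whether it succeeded.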

## Rule 2: Avoid Manual Data Manipulation Steps

 - Don't use Excel

 - Instead use:
 
`zcat conn.gz.log | bro-cut id.orig_h id.orig_p id.resp_h id.resp_p orig_bytes | awk -F'\t' -v OFS='\t' '{ a[$1 OFS $2 OFS $3 OFS $4] += $5 } END { for (i in a) print i, a[i] }' | sort -rnk5 | head -n "$size" 2>/dev/null`

## Rule 3: Archive the Exact Versions of All External Programs Used

Do this:

`docker run \
        --rm \
        -it \
        -v "$PWD"/samples:/archive \
        -v "$PWD"/hashes:/hashes \
        -w /hashes \
        --name pharos \
        eschwartz/pharos:$PHAROS_DOCKER_TAG`

But don't do this:

`PHAROS_DOCKER_TAG="latest"`
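
One way to enforce this is to fail fast when a tag is left floating. The guard below is a sketch under assumptions: `require_pinned_tag` is a hypothetical helper, and it only rejects an empty value or `latest`. It cannot tell whether any other tag is truly immutable.

```shell
#!/bin/bash
# Refuse to continue when a Docker tag variable is empty or set to "latest".
# Arguments: variable name (used in the message only) and its value.
require_pinned_tag() {
    local name="$1" value="$2"
    case "$value" in
        ""|latest)
            echo "$name must be pinned to a specific tag or digest, got '$value'" >&2
            return 1
            ;;
    esac
    return 0
}
```

A script could call `require_pinned_tag PHAROS_DOCKER_TAG "$PHAROS_DOCKER_TAG" || exit 1` right after `source config.env`. Referencing the image by digest (`image@sha256:...`) is stricter still, because a tag can be moved to a different image later.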

** <B>CI/CD Pipelines in Data Science</B>

## Rule 4: Version Control All Custom Scripts

In [None]:
%%bash
git status

## Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
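
A checksum manifest is one plain-text way to record intermediate files such as `hashes.csv` so a later run can be compared against them. This sketch assumes GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -r`). The function names and the `MANIFEST.sha256` file name are hypothetical.

```shell
#!/bin/bash
# Write a sorted SHA-256 manifest of every file under a directory.
write_manifest() {
    local dir="$1" manifest="${2:-MANIFEST.sha256}"
    (
        cd "$dir" &&
        find . -type f ! -name "$manifest" -print0 |
            LC_ALL=C sort -z |
            xargs -0 -r sha256sum > "$manifest"
    )
}

# Re-check the files against the manifest; non-zero exit on any difference.
check_manifest() {
    local dir="$1" manifest="${2:-MANIFEST.sha256}"
    ( cd "$dir" && sha256sum --quiet -c "$manifest" )
}
```

After `2_hash_samples.sh` finishes, `write_manifest hashes` records the state of the outputs, and `check_manifest hashes` on a rerun shows whether the pipeline reproduced them byte for byte.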

## Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
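
As a small illustration, a random subsample can be made repeatable by passing the seed in explicitly and writing it down with the result. `sample_lines` is a hypothetical helper. The output is repeatable only for the same awk implementation and version, which is another reason to pin tool versions.

```shell
#!/bin/bash
# Print a pseudo-random subset of a file's lines.
# Arguments: seed, fraction of lines to keep (0 to 1), input file.
sample_lines() {
    local seed="$1" fraction="$2" file="$3"
    awk -v seed="$seed" -v p="$fraction" \
        'BEGIN { srand(seed) } rand() < p' "$file"
}
```

For example, `sample_lines 42 0.1 malware_hash_list.txt` selects roughly a tenth of the hashes, and the same subset comes back on every run with seed 42.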


** <B>functional purity</B>

## Rule 7: Always Store Raw Data behind Plots

## Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

## Rule 9: Connect Textual Statements to Underlying Results

## Rule 10: Provide Public Access to Scripts, Runs, and Results

# Questions?


https://www.amstat.org/asa/files/pdfs/POL-ReproducibleResearchRecommendations.pdf

https://data-ken.org/how-to-build-reproducable-data-science-workflow.html

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s9