# Code Workflows and Tooling for Research in Python

## My experience: a balancing act

<center><img src="images/ss_shipit.jpg" width="500"></center>



- "getting things done" vs. "doing things the right way"
- improving workflows is a process

- the "just ship it" mentality is tempting
- "technical debt" analogy: the uncertainty of research makes taking on technical debt that much easier
- build habits that make "doing things the right way" the efficient option
- don't expect to change everything at once; make incremental improvements to your workflow
- we will go through some not-so-great examples (from yours truly) and, for each topic, three tiers of advice: something easy that you should always do, something to work into your workflow, and a more involved setup to consider
- the focus here is Python, but the same tooling exists and the same principles apply in other languages

## Outline

- Documentation
- Testing
- Version Control
- Automation
- Reproducibility
- Workflow Discussion


## Documentation

1. Comment __while__ you code
2. Ideally, follow a Docstring style
3. Consider documentation generators

- Write comments as you go: your future self will thank you
- Use inline comments `#` to provide context
- Use docstrings `""" """` to describe the behavior of functions

In [1]:
# Not good
def f(n):
    return 1 << n

In [2]:
# Better
def power_of_two(n):
    """Calculates 2^n."""
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n

In [3]:
help(power_of_two)

Help on function power_of_two in module __main__:

power_of_two(n)
    Calculates 2^n.



- Disclaimer: do not reinvent the wheel

In [35]:
import numpy as np

def power_of_two(n):
    """Calculates 2^n, but n can be negative and non-integer now!"""
    # use a float base: NumPy raises ValueError for an integer base with a negative integer exponent
    return np.power(2.0, n)

In [10]:
# Not good
def f(l):
    ps = []
    n = len(l)
    for i in range(1<<n):
        s = [l[j] for j in range(n) if (i & 1 << j)]
        ps.append(s)
    return ps

In [13]:
# Better
def powerset(l):
    """Returns the power set of l."""
    
    ps = []
    n = len(l)
    # use n-bit binary number to indicate whether an item is included in a set
    for i in range(1<<n):
        # generate subset by checking which bits are 1 in i
        s = [l[j] for j in range(n) if (i & 1 << j)]
        ps.append(s)
    return ps

In [None]:
# Disclaimer: do not reinvent the wheel
# itertools-provided recipe for powerset
from itertools import chain, combinations
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
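
The two implementations return different shapes (a list of lists versus an iterator of tuples, in a different order), so it is worth checking that they agree before swapping one for the other. The sketch below does that; the names `powerset_bitmask`, `powerset_itertools`, and `as_canonical` are made up here so that neither definition clobbers the other.

```python
from itertools import chain, combinations


def powerset_bitmask(items):
    """Returns the power set of items as a list of lists (bitmask version)."""
    n = len(items)
    # bit j of i decides whether items[j] belongs to the i-th subset
    return [[items[j] for j in range(n) if i & (1 << j)] for i in range(1 << n)]


def powerset_itertools(iterable):
    """Returns the power set of iterable as an iterator of tuples (itertools recipe)."""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


def as_canonical(subsets):
    """Sorts each subset and the collection so two orderings can be compared."""
    return sorted(tuple(sorted(s)) for s in subsets)


items = [1, 2, 3]
assert as_canonical(powerset_bitmask(items)) == as_canonical(powerset_itertools(items))
```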

In [12]:
help(powerset)

Help on function powerset in module __main__:

powerset(l)
    Returns the power set of l.



### Docstring styles

- reST, NumPy, and Google are the common standardized docstring styles

In [16]:
# Google-style docstrings
def power_of_two(n):
    """(Short description): Calculates 2^n.
    
    (Longer description): Calculates non-negative powers of two via bit shift.
    
    Args:
        n (int): the exponent to raise 2 to.
    Returns:
        int: 2^n.
    Raises:
       ValueError: if n < 0.
    """
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n

In [39]:
help(power_of_two)

Help on function power_of_two in module __main__:

power_of_two(n)
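
For comparison with the Google-style docstring, here is a sketch of the same function documented in NumPy style. The section names follow the NumPy docstring convention. The explicit `n < 0` check is an addition for illustration: `1 << n` already raises `ValueError` for a negative shift count, and the check only makes the error message clearer.

```python
def power_of_two(n):
    """Calculate 2**n for a non-negative integer n.

    Parameters
    ----------
    n : int
        The exponent to raise 2 to. Must be non-negative.

    Returns
    -------
    int
        The value 2**n.

    Raises
    ------
    ValueError
        If n is negative.

    Examples
    --------
    >>> power_of_two(3)
    8
    """
    # explicit check (added for illustration) gives a clearer message
    if n < 0:
        raise ValueError("n must be non-negative")
    # a left bit shift by n is equivalent to 2**n
    return 1 << n
```

The `Examples` section doubles as a doctest, so the documentation and a small regression test live in the same place.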



In [7]:
! cat sphinx_demo/code/power_of_two.py
! make -C sphinx_demo html

def power_of_two(n):
    """Calculates :math:`2^n`.
    
    Calculates non-negative powers of two via bit shift.
    
    Args:
        n (int): the exponent to raise 2 to.
    Returns:
        int: :math:`2^n`.
    Raises:
       ValueError: if :math:`n \lt 0`.
    """
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n
make: Entering directory '/mnt/c/Users/1994t/Documents/Github/code-workflow-lab-teaching/sphinx_demo'
Running Sphinx v1.8.5
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 0 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
no targets are out of date.
build succeeded.

The HTML pages are in _build/html.
make: Leaving directory '/mnt/c/Users/1994t/Documents/Github/code-work

<center><img src="images/sphinx_doc.PNG" width="1500"></center>

## Testing

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

- defensive coding
- research code scares me, since we often don't know what the "correct" output is

1. Think about what your code "should do"
2. Write dedicated tests while you code
3. Consider testing frameworks

- we all write tests -- little print statements to verify the output, etc
- but we end up throwing them away

### Test-driven development mindset

- write code to pass tests -> makes coding sessions more directed
- forces you to think about failure points in your code

### Fizzbuzz example

- `fizzbuzz` function, on input `n`:
    - if n is divisible by both 3 and 5, return "fizzbuzz"
    - if n is divisible by 3 only, return "fizz"
    - if n is divisible by 5 only, return "buzz"
    - otherwise, return n

In [22]:
def fizzbuzz(n):
    pass

In [24]:
# assert statements are your friend
assert fizzbuzz(3) == 'fizz', 'fails divides by 3 case'
assert fizzbuzz(5) == 'buzz', 'fails divides by 5 case'
assert fizzbuzz(15) == 'fizzbuzz', 'fails divides by 15 case'
assert fizzbuzz(1) == 1, 'fails else case'

AssertionError: fails divides by 3 case

- don't delete the assert statements! They help you regression test

### Docstrings part II: testing

In [25]:
def fizzbuzz(n):
    """Performs the fizzbuzz function on input n.
    
    Doctests for regression testing, and examples of usage:
    
    >>> fizzbuzz(3)
    'fizz'
    >>> fizzbuzz(5)
    'buzz'
    >>> fizzbuzz(15)
    'fizzbuzz'
    >>> fizzbuzz(1)
    1
    """
    out = ""
    if n % 3 == 0:
        out += "fizz"
    if n % 5 == 0:
        out += "buzz"
    # fall back to n itself when neither rule applied
    return out if out else n

In [33]:
import doctest
doctest.testmod(verbose=False)

TestResults(failed=0, attempted=4)

### Consider testing frameworks like `unittest`

- provides automation, shared setup/teardown of tests

In [36]:
"""Unit tests for feature extraction methods.


Test contact hashes are:
['1002060a7f4fe408f8137f12982e5d64cf34693',
'10413044ad5f1183e38f5ddf17259326e976231']

"""

import datetime
import os
import pickle

import numpy as np
import pandas as pd
import unittest

class FeatureExtractTests(unittest.TestCase):

    def assert_frame_equal_dict(self, actual_df, expected_dict, columns, check_dtype=True):
        """Helper function for doing df to dict comparison on the given columns."""

        expected_df = pd.DataFrame.from_dict(expected_dict).T
        expected_df.columns = columns

        pd.testing.assert_frame_equal(actual_df[columns],
                                      expected_df,
                                      check_dtype=check_dtype)


    def setUp(self):
        """Populates test DataFrames common to all test cases."""
        self.pid1 = '1002060'
        self.pid2 = '1041304'

        self.combined_hash1 = '1002060a7f4fe408f8137f12982e5d64cf34693'
        self.combined_hash2 = '10413044ad5f1183e38f5ddf17259326e976231'

        with open("../data/test_comm.df", 'rb') as comm_file:
            self.raw_df = pickle.load(comm_file)
            self.call_df = self.raw_df.loc[self.raw_df['comm_type'] == 'PHONE']
            self.sms_df = self.raw_df.loc[self.raw_df['comm_type'] == 'SMS']

        with open("../data/test_emm.df", 'rb') as emm_file:
            self.emm_df = pickle.load(emm_file)


    def test_init_feature_df(self):
        """Tests init_feature_df function.
        
        Checks whether total_comms, total_comm_days, and contact_type columns are populated correctly.
        """
        expected_dict = {
            (self.pid1, self.combined_hash1): [8, 2, 'friend'],
            (self.pid2, self.combined_hash2): [6, 3, 'family_live_together']
        }

        expected_df = pd.DataFrame.from_dict(expected_dict).T
        expected_df.index = expected_df.index.rename(['pid', 'combined_hash'])
        expected_df = expected_df.rename({
                                            0: "total_comms",
                                            1: "total_comm_days",
                                            2: "contact_type"
                                         },
                                         axis='columns')
        expected_df['total_comms'] = expected_df['total_comms'].astype(int)
        expected_df['total_comm_days'] = expected_df['total_comm_days'].astype(int)
        
        actual_df = init_feature_df(self.raw_df)

        pd.testing.assert_frame_equal(actual_df, expected_df)

```bash
(code-workflow) tliu@DESKTOP-3QP831J:feature_extract$ python test_feature_extract.py -v
test_build_avoidance_features (__main__.FeatureExtractTests) ... ok
test_build_channel_selection_features (__main__.FeatureExtractTests) ... ok
test_build_count_features (__main__.FeatureExtractTests) ... ok
test_build_demo_features (__main__.FeatureExtractTests) ... ok
test_build_duration_features (__main__.FeatureExtractTests) ... ok
test_build_holiday_features (__main__.FeatureExtractTests) ... ok
test_build_intensity_features (__main__.FeatureExtractTests) ... ok
test_build_maintenance_features (__main__.FeatureExtractTests) ... ok
test_build_temporal_features (__main__.FeatureExtractTests) ... ok
test_filter_by_holiday (__main__.FeatureExtractTests) ... ok
test_init_feature_df (__main__.FeatureExtractTests) ... ok

----------------------------------------------------------------------
Ran 11 tests in 2.086s

OK

```
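
The example above depends on project data, so here is a smaller, self-contained sketch of the same `unittest` pattern applied to `fizzbuzz`. The class name `FizzbuzzTests` and the helper `run_tests` are made up for this illustration. `setUp` plays the same role as before: it rebuilds a shared fixture before every test method.

```python
import unittest


def fizzbuzz(n):
    """Returns 'fizz', 'buzz', 'fizzbuzz', or n itself."""
    out = ""
    if n % 3 == 0:
        out += "fizz"
    if n % 5 == 0:
        out += "buzz"
    return out if out else n


class FizzbuzzTests(unittest.TestCase):

    def setUp(self):
        """Builds the expected outputs shared by the test cases."""
        self.cases = {3: "fizz", 5: "buzz", 15: "fizzbuzz", 1: 1}

    def test_known_cases(self):
        for n, expected in self.cases.items():
            # subTest reports each failing input separately
            with self.subTest(n=n):
                self.assertEqual(fizzbuzz(n), expected)

    def test_multiples_of_fifteen(self):
        for n in (30, 45, 60):
            self.assertEqual(fizzbuzz(n), "fizzbuzz")


def run_tests():
    """Runs the test case programmatically and returns the result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(FizzbuzzTests)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_tests()
```

In a standalone test file, the usual alternative to `run_tests()` is to end the module with `unittest.main()` under an `if __name__ == "__main__":` guard.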

## Version Control

0. Use it!
1. Commit messages should be informative
2. Ideally subdivide tasks into concrete commits
3. Consider branching strategies

### Commit Messages

![](images/bad_commits.PNG)

- Summarize commit in brief, imperative statement
- Use commit body for more details if needed

### Consider Development Branches

- treat `master` as a "protected" branch
- work on new features in separate branches

### Development Branch Workflow

![](images/feature_branch_01.svg)

```bash
# create new branch, dev-branch
git checkout -b dev-branch master

# do work
git add ...
git commit ...

# push to remote dev-branch, open PR
git push origin dev-branch
```

### Development Branch Workflow

![](images/feature_branch_02.svg)

### Thoughts on Code Review

- "could you read my paper draft?" -> "could you review my code?"

<center><img src="images/code_review.PNG" width="500"></center>

## Automation

1. Move your work out of "interactive mode" as much as possible
2. Ideally batch process computation
3. Consider workflow automation tools like `make`

### `argparse` is your friend

- allows parameterization of entire modules
- another form of documentation!

In [42]:
! python scripting_demo/med_extract.py -h

usage: med_extract.py [-h] [--test] [--m_dm_outcome] [--rx_dm]
                      data_dir yr q out_dir {m,lr,r} chunksize

Extract data from Optum raw files and dump to DataFrames

positional arguments:
  data_dir        directory with all Optum data
  yr              the year to target
  q               the quarter to target
  out_dir         output directory
  {m,lr,r}        Optum table type to target: medical (m), lab reports (lr),
                  prescriptions (r)
  chunksize       number of rows to read per chunk

optional arguments:
  -h, --help      show this help message and exit
  --test          whether to make a test run of the data extraction
  --m_dm_outcome  perform diabetes med outcome extraction
  --rx_dm         perform diabetes rx extraction


### Batch processing

```bash

# nohup means ignore hangup (logouts), ampersand means run in the background
nohup python batch_example.py &

# alternatively, use tmux/screen window managers
tmux new -s batch_session
```

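- much better than running `python ...` in the foreground of a terminal session, and especially better than running a long computation in Jupyter
- tmux/screen are the more sophisticated option: the job keeps running in the background and you can reattach to it later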

### Consider `make` for complex workflows

- disclaimer: I haven't yet encountered a workflow in my research career that `make` makes significantly easier

```bash

# build the documentation by invoking Sphinx directly
sphinx-build . _build -b html

# roughly equivalent Makefile target (what we ran earlier)
make html
```

In [38]:
! cat sphinx_demo/Makefile

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

## Reproducibility

1. Parameterize code to facilitate A/B testing
2. Ideally, track and parameterize your environment too
3. Consider Docker for more complex builds

### Bash scripts

- Code form of an experimental procedure
- gives us a mechanism to track changes in runs, A/B testing
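
The slide has Bash scripts in mind. To stay in one language, the sketch below shows the same idea with a small Python driver: each run is fully described by its parameters, so the list of commands is itself the experimental procedure. The helper names `expand_grid` and `to_command`, the script name `train.py`, and the parameters `lr` and `seed` are hypothetical.

```python
import itertools


def expand_grid(grid):
    """Yields one parameter dict per combination of the values in grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def to_command(script, params):
    """Builds the argument list for a single parameterized run."""
    cmd = ["python", script]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd


# hypothetical A/B comparison: two learning rates, two random seeds
grid = {"lr": [0.1, 0.01], "seed": [0, 1]}
commands = [to_command("train.py", params) for params in expand_grid(grid)]

for cmd in commands:
    # print instead of executing; subprocess.run(cmd) would launch the run
    print(" ".join(cmd))
```

Checking a driver like this into version control records exactly which variants were run.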

### Tracking your environment
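
One lightweight way to track the environment, sketched here with only the standard library (Python 3.8+ for `importlib.metadata`), is to save the interpreter and package versions alongside each run's results. The function name `snapshot_environment` and the package list are illustrative assumptions.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Returns a dict describing the interpreter and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # record missing packages explicitly rather than failing
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


# hypothetical package list; use whatever your project imports
snapshot = snapshot_environment(["numpy", "pandas"])
print(json.dumps(snapshot, indent=2))
```

Writing this JSON next to the output files makes it possible to tell later which versions produced which results.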

## Workflow

### Jupyter notebooks are __notebooks__

- good for playing with the data
- good for presenting and visualizing results
- bad for "doing work" in between

### Jupyter notebook bloat

- A cell that looks like [this](https://gist.github.com/tliu526/6e23aa99a323646be98691fb6d6a0f55)

- as opposed to a cell that looks like this:

```python

from notebook_utils import build_hist, ttest_df, build_bar, ...
```

- note that "doing work" may never happen on a particular research thread, hence the appeal of Jupyter notebooks for research

- deliverables on a project:
    - a paper (but what goes into a paper?)
        - graphs
        - experiments
        - leaving a "code trail"
        - (possibly) a code package!

## References

- [Atlassian Git Feature Branch Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow)
- [Git User Manual](https://git-scm.com/docs/user-manual.html)
- [GNU Make Manual](https://www.gnu.org/software/make/manual/html_node/index.html#SEC_Contents)
- [argparse documentation](https://docs.python.org/3/library/argparse.html)
- [tmux cheat sheet](https://tmuxcheatsheet.com/)