# Code Workflows and Python Tooling for Research

## My experience: a balancing act

<center><img src="images/ss_shipit.jpg" width="500"></center>



- "just shipping it" vs. "coding the *right way*"
- improving workflows is a process

- "technical debt" analogy: the uncertainty of research makes the "just ship it" mentality, and the technical debt that comes with it, that much easier to take on
- try to build habits that make "doing things the right way" efficient
- it is a process: don't expect yourself to change all at once; make incremental improvements to your workflow
- we will go through some easy things to always do, some things to try to incorporate into your workflow, and some more involved setups
- the focus here is Python, but the same principles apply to other languages, which have equivalent tooling

## Outline

- Documentation
- Testing
- Version Control
- Automation
- Reproducibility
- Workflow Discussion


## Documentation

1. Comment __while__ you code
2. Ideally, follow a Docstring style
3. Consider documentation generators

- Write comments as you go: your future self will thank you
- Use inline comments `#` to provide context
- Use docstrings `""" """` to describe the behavior of functions

In [None]:
# Not good
def f(n):
    return 1 << n

In [85]:
# Better
def power_of_two(n):
    """Calculates 2^n."""
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n

In [86]:
help(power_of_two)

Help on function power_of_two in module __main__:

power_of_two(n)
    Calculates 2^n.



- Disclaimer: do not reinvent the wheel

In [None]:
import numpy as np
def power_of_two(n):
    """Calculates 2^n, but n can be negative and non-integer now!"""
    return np.power(2, n) 

### Docstring styles

- reST, NumPy, and Google are the common standardized docstring styles

In [87]:
# Google-style docstrings
def power_of_two(n):
    """(Short description): Calculates 2^n.
    
    (Longer description): Calculates non-negative powers of two via bit shift.
    
    Args:
        n (int): the exponent to raise 2 to.
    Returns:
        int: 2^n.
    Raises:
       ValueError: if n < 0.
    """
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n

In [88]:
help(power_of_two)

Help on function power_of_two in module __main__:

power_of_two(n)
    (Short description): Calculates 2^n.
    
    (Longer description): Calculates non-negative powers of two via bit shift.
    
    Args:
        n (int): the exponent to raise 2 to.
    Returns:
        int: 2^n.
    Raises:
       ValueError: if n < 0.
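
The `Raises` entry above holds even though the function body contains no explicit check, because Python's built-in shift operator rejects negative shift counts on its own. A minimal sketch to confirm this; the `try`/`except` wrapper and the `message` variable are illustrative additions, not part of the original function:

```python
def power_of_two(n):
    """Calculates 2^n for a non-negative integer n via bit shift."""
    # left bit shift by n is equivalent to 2^n
    return 1 << n


print(power_of_two(10))

try:
    power_of_two(-1)
except ValueError as err:
    # raised by the shift operator itself, not by any check we wrote
    message = str(err)
    print("ValueError:", message)
```

If the function is ever rewritten without the bit shift, the documented `ValueError` would need an explicit check to stay true.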



### Consider documentation generators

In [89]:
! cat sphinx_demo/code/power_of_two.py
! make -C sphinx_demo html

def power_of_two(n):
    """Calculates :math:`2^n`.
    
    Calculates non-negative powers of two via bit shift.
    
    Args:
        n (int): the exponent to raise 2 to.
    Returns:
        int: :math:`2^n`.
    Raises:
       ValueError: if :math:`n \lt 0`.
    """
    
    # left bit shift by n equivalent to 2^n.
    return 1 << n
make: Entering directory '/mnt/c/Users/1994t/Documents/Github/code-workflow-lab-teaching/sphinx_demo'
Running Sphinx v1.8.5
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 0 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
no targets are out of date.
build succeeded.

The HTML pages are in _build/html.
make: Leaving directory '/mnt/c/Users/1994t/Documents/Github/code-work

<center><img src="images/sphinx_doc.PNG" width="1500"></center>

## Testing

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

- defensive coding
- research code scares me -- since in scientific settings what is "correct" may be unclear

1. Think about what your code "should do"
2. Write dedicated tests while you code
3. Consider testing frameworks

- we all write tests -- little print statements to verify the output, etc
- but we end up throwing them away

### Test-driven development mindset

- write code to pass tests -> makes coding sessions more directed
- forces you to think about failure points in your code

### Fizzbuzz example



- `fizzbuzz` function, on input `n`:
    - if n is divisible by both 3 and 5, return "fizzbuzz"
    - if n is divisible by 3 only, return "fizz"
    - if n is divisible by 5 only, return "buzz"
    - otherwise, return n

In [98]:
def fizzbuzz(n):
    if n % 15 == 0: return 'fizzbuzz'
    if n % 3 == 0: return 'fizz'
    if n % 5 == 0: return 'buzz'
    return n

In [99]:
# assert statements are your friend
assert fizzbuzz(3) == 'fizz', 'fails divides by 3 case'
assert fizzbuzz(5) == 'buzz', 'fails divides by 5 case'
assert fizzbuzz(15) == 'fizzbuzz', 'fails divides by 15 case'
assert fizzbuzz(1) == 1, 'fails else case'

- don't delete the assert statements! They help you regression test
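
To make the regression-testing point concrete, here is a small sketch that keeps the same assert statements in a reusable function and runs them against a deliberately broken rewrite. The names `check_fizzbuzz` and `fizzbuzz_buggy` are hypothetical and exist only for this illustration:

```python
def fizzbuzz(n):
    if n % 15 == 0: return 'fizzbuzz'
    if n % 3 == 0: return 'fizz'
    if n % 5 == 0: return 'buzz'
    return n


def fizzbuzz_buggy(n):
    # a careless rewrite: the 3 case is checked before the 15 case,
    # so multiples of 15 never reach the 'fizzbuzz' branch
    if n % 3 == 0: return 'fizz'
    if n % 5 == 0: return 'buzz'
    if n % 15 == 0: return 'fizzbuzz'
    return n


def check_fizzbuzz(func):
    """Runs the saved assert statements against any fizzbuzz implementation."""
    assert func(3) == 'fizz', 'fails divides by 3 case'
    assert func(5) == 'buzz', 'fails divides by 5 case'
    assert func(15) == 'fizzbuzz', 'fails divides by 15 case'
    assert func(1) == 1, 'fails else case'


check_fizzbuzz(fizzbuzz)  # passes silently

try:
    check_fizzbuzz(fizzbuzz_buggy)
    caught = None
except AssertionError as err:
    caught = str(err)

print(caught)  # fails divides by 15 case
```

Because the checks were kept, the broken rewrite is caught immediately, and the assertion message points at the failing case.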

### Docstrings part II: testing

In [100]:
def fizzbuzz(n):
    """Performs the fizzbuzz function on input n.
    
    Doctests for regression testing, and examples of usage:
    
    >>> fizzbuzz(3)
    'fizz'
    >>> fizzbuzz(5)
    'buzz'
    >>> fizzbuzz(15)
    'fizzbuzz'
    >>> fizzbuzz(1)
    1
    """
    out = ""
    if n % 3 == 0: out += "fizz"
    if n % 5 == 0: out += "buzz"
    return out if len(out) > 0 else n

In [102]:
import doctest
doctest.testmod(verbose=True)

Trying:
    fizzbuzz(3)
Expecting:
    'fizz'
ok
Trying:
    fizzbuzz(5)
Expecting:
    'buzz'
ok
Trying:
    fizzbuzz(15)
Expecting:
    'fizzbuzz'
ok
Trying:
    fizzbuzz(1)
Expecting:
    1
ok
7 items had no tests:
    __main__
    __main__.FeatureExtractTests
    __main__.FeatureExtractTests.assert_frame_equal_dict
    __main__.FeatureExtractTests.setUp
    __main__.FeatureExtractTests.test_init_feature_df
    __main__.f
    __main__.power_of_two
1 items passed all tests:
   4 tests in __main__.fizzbuzz
4 tests in 8 items.
4 passed and 0 failed.
Test passed.


TestResults(failed=0, attempted=4)

### Consider testing frameworks like `unittest`

- provides automation, shared setup/teardown of tests

In [None]:
"""Unit tests for feature extraction methods.


Test contact hashes are:
['1002060a7f4fe408f8137f12982e5d64cf34693',
'10413044ad5f1183e38f5ddf17259326e976231']

"""

import datetime
import os
import pickle

import numpy as np
import pandas as pd
import unittest

class FeatureExtractTests(unittest.TestCase):

    def assert_frame_equal_dict(self, actual_df, expected_dict, columns, check_dtype=True):
        """Helper function for doing df to dict comparison on the given columns."""

        expected_df = pd.DataFrame.from_dict(expected_dict).T
        expected_df.columns = columns

        pd.testing.assert_frame_equal(actual_df[columns],
                                      expected_df,
                                      check_dtype=check_dtype)


    def setUp(self):
        """Populates test DataFrames common to all test cases."""
        self.pid1 = '1002060'
        self.pid2 = '1041304'

        self.combined_hash1 = '1002060a7f4fe408f8137f12982e5d64cf34693'
        self.combined_hash2 = '10413044ad5f1183e38f5ddf17259326e976231'

        with open("../data/test_comm.df", 'rb') as comm_file:
            self.raw_df = pickle.load(comm_file)
            self.call_df = self.raw_df.loc[self.raw_df['comm_type'] == 'PHONE']
            self.sms_df = self.raw_df.loc[self.raw_df['comm_type'] == 'SMS']

        with open("../data/test_emm.df", 'rb') as emm_file:
            self.emm_df = pickle.load(emm_file)


    def test_init_feature_df(self):
        """Tests init_feature_df function.
        
        Checks whether total_comms, total_comm_days, and contact_type columns are populated correctly.
        """
        expected_dict = {
            (self.pid1, self.combined_hash1): [8, 2, 'friend'],
            (self.pid2, self.combined_hash2): [6, 3, 'family_live_together']
        }

        expected_df = pd.DataFrame.from_dict(expected_dict).T
        expected_df.index = expected_df.index.rename(['pid', 'combined_hash'])
        expected_df = expected_df.rename({
                                            0: "total_comms",
                                            1: "total_comm_days",
                                            2: "contact_type"
                                         },
                                         axis='columns')
        expected_df['total_comms'] = expected_df['total_comms'].astype(int)
        expected_df['total_comm_days'] = expected_df['total_comm_days'].astype(int)
        
        actual_df = init_feature_df(self.raw_df)

        pd.testing.assert_frame_equal(actual_df, expected_df)

In [None]:
(code-workflow) tliu@DESKTOP-3QP831J:feature_extract$ python test_feature_extract.py -v
test_build_avoidance_features (__main__.FeatureExtractTests) ... ok
test_build_channel_selection_features (__main__.FeatureExtractTests) ... ok
test_build_count_features (__main__.FeatureExtractTests) ... ok
test_build_demo_features (__main__.FeatureExtractTests) ... ok
test_build_duration_features (__main__.FeatureExtractTests) ... ok
test_build_holiday_features (__main__.FeatureExtractTests) ... ok
test_build_intensity_features (__main__.FeatureExtractTests) ... ok
test_build_maintenance_features (__main__.FeatureExtractTests) ... ok
test_build_temporal_features (__main__.FeatureExtractTests) ... ok
test_filter_by_holiday (__main__.FeatureExtractTests) ... ok
test_init_feature_df (__main__.FeatureExtractTests) ... ok

----------------------------------------------------------------------
Ran 11 tests in 2.086s

OK
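
For a smaller, self-contained illustration of the same `unittest` ideas (shared setup, automated discovery of `test_*` methods), the sketch below tests the `fizzbuzz` function from earlier. The class name `FizzbuzzTests` and its test methods are made up for this example; the suite is run programmatically so that it also works inside a notebook:

```python
import unittest


def fizzbuzz(n):
    out = ""
    if n % 3 == 0: out += "fizz"
    if n % 5 == 0: out += "buzz"
    return out if len(out) > 0 else n


class FizzbuzzTests(unittest.TestCase):

    def setUp(self):
        """Runs before every test method: shared inputs and expected outputs."""
        self.cases = {3: "fizz", 5: "buzz", 15: "fizzbuzz", 1: 1}

    def test_known_cases(self):
        for n, expected in self.cases.items():
            # subTest reports each failing input separately
            with self.subTest(n=n):
                self.assertEqual(fizzbuzz(n), expected)

    def test_rejects_none(self):
        # None does not support the modulo operator, so a TypeError is expected
        with self.assertRaises(TypeError):
            fizzbuzz(None)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FizzbuzzTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

In a standalone test file, the last two lines would typically be replaced by `unittest.main()` under an `if __name__ == "__main__":` guard.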

## Version Control

0. Use it!
1. Commit messages should be informative
2. Ideally subdivide tasks into concrete commits
3. Consider branching strategies

### Commit Messages


<center><img src="images/bad_commits.PNG" width="500"></center>


- Summarize commit in brief, imperative statement
- Use commit body for more details if needed

### Consider Development Branches

- treat `master` as "protected" branch
- work on new features in separate branches

### Development Branch Workflow

![](images/feature_branch_01.svg)

In [None]:
# create new branch, dev-branch
git checkout -b dev-branch master

# do work
git add ...
git commit ...

# push to remote dev-branch, open PR
git push origin dev-branch

![](images/feature_branch_02.svg)

### Thoughts on Code Review

- "could you read my paper draft?" -> "could you review my code?"

<center><img src="images/code_review.PNG" width="500"></center>

## Automation

1. Move your work out of "interactive mode" as often as possible
2. Ideally batch process computation
3. Consider workflow automation tools like `make`

### `argparse` is your friend

- allows parameterization of entire modules
- another form of documentation!

In [None]:
import argparse
parser = argparse.ArgumentParser(description="Extract data from Optum raw files and dump to DataFrames")
parser.add_argument('data_dir', help='directory with all Optum data')
parser.add_argument('yr', help='the year to target')
parser.add_argument('q', help='the quarter to target')
parser.add_argument('out_dir', help='output directory')
parser.add_argument('table_type', choices=['m', 'lr', 'r'], help='Optum table type to target: medical (m), lab reports (lr), prescriptions (r)')
parser.add_argument('chunksize', type=int, help='number of rows to read per chunk')
parser.add_argument('--test', action='store_true', help='whether to make a test run of the data extraction')
parser.add_argument('--m_dm_outcome', action='store_true', help='perform diabetes med outcome extraction')
parser.add_argument('--rx_dm', action='store_true', help='perform diabetes rx extraction')


args = parser.parse_args()
 

In [103]:
! python scripting_demo/med_extract.py -h

usage: med_extract.py [-h] [--test] [--m_dm_outcome] [--rx_dm]
                      data_dir yr q out_dir {m,lr,r} chunksize

Extract data from Optum raw files and dump to DataFrames

positional arguments:
  data_dir        directory with all Optum data
  yr              the year to target
  q               the quarter to target
  out_dir         output directory
  {m,lr,r}        Optum table type to target: medical (m), lab reports (lr),
                  prescriptions (r)
  chunksize       number of rows to read per chunk

optional arguments:
  -h, --help      show this help message and exit
  --test          whether to make a test run of the data extraction
  --m_dm_outcome  perform diabetes med outcome extraction
  --rx_dm         perform diabetes rx extraction
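
A parser can also be exercised without touching the command line by passing an explicit argument list to `parse_args`, which is handy for quick checks. The sketch below uses a cut-down, hypothetical parser (`build_parser` and the sample argument values are assumptions for illustration, not the real `med_extract.py`):

```python
import argparse


def build_parser():
    """Builds a small parser in the same style as the one above."""
    parser = argparse.ArgumentParser(description="Toy extraction script")
    parser.add_argument('data_dir', help='directory with the input data')
    parser.add_argument('table_type', choices=['m', 'lr', 'r'], help='table type to target')
    parser.add_argument('chunksize', type=int, help='number of rows to read per chunk')
    parser.add_argument('--test', action='store_true', help='whether to make a test run')
    return parser


# passing a list bypasses sys.argv
args = build_parser().parse_args(['raw_data/', 'm', '1000', '--test'])
print(args.data_dir, args.table_type, args.chunksize, args.test)
```

Note that `type=int` converts `chunksize` for you, and `choices` rejects any table type outside the listed values before your code ever runs.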


### Batch processing

In [None]:
# nohup means ignore hangup (logouts), ampersand means run in the background
nohup python batch_example.py &

# alternatively, use tmux/screen window managers
tmux new -s batch_session

- much better than running `python ...` in the foreground of a login session, where a hangup kills the job (and especially better than running a long computation in Jupyter)
- more sophisticated way of scheduling jobs in the background
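
A script that runs unattended is easier to monitor if it writes progress to a log file instead of relying on an open terminal. The following is only a sketch of that idea: the contents of `batch_example.py` are not shown in this document, so `run_batch`, the log file name, and the squared-number "computation" are all stand-ins.

```python
import logging
import tempfile
from pathlib import Path


def run_batch(items, log_path):
    """Processes items one by one, logging progress to log_path."""
    logger = logging.getLogger("batch_example")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    results = []
    try:
        for i, item in enumerate(items, start=1):
            results.append(item ** 2)  # stand-in for the real computation
            logger.info("processed item %d of %d", i, len(items))
    finally:
        # detach the handler so repeated calls do not duplicate log lines
        logger.removeHandler(handler)
        handler.close()
    return results


log_path = Path(tempfile.mkdtemp()) / "batch_example.log"
results = run_batch([1, 2, 3], log_path)
print(log_path.read_text())
```

With a log file in place, progress can be checked from any later session, for example with `tail -f` on the log.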

### Consider `make` for complex workflows

- used for building software, can be also used for data transformation
- disclaimer: I haven't yet encountered a workflow in my research career that `make` makes significantly easier

In [None]:
# direct sphinx-build invocation to build the HTML documentation
sphinx-build -M html . _build

# Makefile equivalent
make html

In [104]:
! cat sphinx_demo/Makefile

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

## Reproducibility

1. Parameterize code to facilitate A/B testing
2. Ideally, track and parameterize your environment too
3. Consider Docker for more complex builds

### Bash scripts

- Code form of an experimental procedure
- gives us a mechanism to track changes in runs, A/B testing

In [105]:
! cat scripting_demo/tie_str_rf_reg.sh

#!/bin/bash

# AutoML random forest runs for tie strength score prediction.
# Positional arguments, in order: input features, output name,
# outcome variable (regression).
# --run_time and --task_time set the training time;
# --rand_forest trains only random forest estimators.
# (Comments cannot follow a line-continuation backslash, so they live here.)
python run_automl.py \
       ../data/final_features/all_tie_str_baseline \
       final_results/tie_str/tie_str_baseline_rf_reg \
       tie_str_score \
       --run_time 1440 --task_time 21600 \
       --rand_forest \
       > tie_str_baseline_rf_reg.out
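
The same "code trail" idea can be carried into Python by saving each run's parameters next to its output, so that two arms of an A/B comparison can always be told apart later. This is a hedged sketch, not the `run_automl.py` script itself: `run_experiment`, the parameter names, and the random "score" are all hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def run_experiment(params, out_dir):
    """Runs a toy experiment and saves its parameters next to its result."""
    rng = random.Random(params["seed"])  # seeded so reruns are repeatable
    # stand-in for real training: the mean of n_samples random draws
    draws = [rng.random() for _ in range(params["n_samples"])]
    score = sum(draws) / len(draws)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    (out_dir / "result.json").write_text(json.dumps({"score": score}))
    return score


base_dir = Path(tempfile.mkdtemp())
run_a = {"seed": 0, "n_samples": 100}
run_b = {"seed": 0, "n_samples": 1000}  # the "B" arm changes one parameter

score_a = run_experiment(run_a, base_dir / "run_a")
score_b = run_experiment(run_b, base_dir / "run_b")
print(score_a, score_b)
```

Fixing the seed and recording it with the other parameters means a rerun with the same `params.json` should reproduce the same score.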




### Tracking your environment

- use `virtualenv` or `conda` to make your environments portable

In [106]:
! conda env list

# conda environments:
#
base                     /home/tliu/miniconda3
code-workflow         *  /home/tliu/miniconda3/envs/code-workflow
py37                     /home/tliu/miniconda3/envs/py37



In [107]:
! conda env export > environment.yml
! cat environment.yml

name: code-workflow
channels:
  - conda-forge
  - anaconda
  - defaults
dependencies:
  - alabaster=0.7.12=py37_0
  - asn1crypto=0.24.0=py37_0
  - babel=2.6.0=py37_0
  - blas=1.0=mkl
  - cffi=1.12.2=py37h2e261b9_1
  - chardet=3.0.4=py37_1
  - cryptography=2.6.1=py37h1ba5d50_0
  - docutils=0.14=py37_0
  - idna=2.8=py37_0
  - imagesize=1.1.0=py37_0
  - intel-openmp=2019.3=199
  - libgfortran-ng=7.3.0=hdf63c60_0
  - mkl=2019.3=199
  - mkl_fft=1.0.10=py37ha843d7b_0
  - mkl_random=1.0.2=py37hd81dba3_0
  - numpy=1.16.2=py37h7e9f1db_0
  - numpy-base=1.16.2=py37hde5b4d6_0
  - packaging=19.0=py37_0
  - pandas=0.24.2=py37he6710b0_0
  - pycparser=2.19=py37_0
  - pyopenssl=19.0.0=py37_0
  - pyparsing=2.3.1=py37_0
  - pysocks=1.6.8=py37_0
  - pytz=2018.9=py37_0
  - requests=2.21.0=py37_0
  - snowballstemmer=1.2.1=py37_0
  - sphinx=1.8.5=py37_0
  - sphinxcontrib=1.0=py37_1
  - sphinxcontrib-websupport=1.1.0=py37_1
  - urllib3=1.24.1=py37_0
  - attrs=19.1.0=py_0
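
As a lightweight complement to `conda env export`, a script can also record the interpreter and key package versions it actually ran with. The sketch below assumes Python 3.8 or later (for `importlib.metadata`); the function name `environment_snapshot` and the package list are illustrative choices:

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Returns the Python version, platform, and versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


snapshot = environment_snapshot(["numpy", "pandas", "not-a-real-package-xyz"])
print(json.dumps(snapshot, indent=2))
```

Saving this dictionary alongside experiment output gives a per-run record, while `environment.yml` remains the way to rebuild the environment.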
 

### Consider Docker for complex environments

- port, share and reproduce "OS-level" configurations
    - custom library installations (e.g., CUDA)
    - UNIX tooling

## Workflow

### Jupyter notebooks are __notebooks__

- good for exploring the data
- good for presenting and visualizing results
- bad for "doing work" in between

### Jupyter notebook bloat

- A cell that looks like [this](https://gist.github.com/tliu526/6e23aa99a323646be98691fb6d6a0f55):

In [None]:
# bad notebook cell
def build_hist(series, bins, xtick_labels, xlabel, title):
    """Builds custom histogram with bar labels and equal bin sizes
    
    :param series: pandas series to bin
    :param bins: list of bin sizes
    :param xtick_labels: list of labels for xticks
    :param xlabel: x-axis str label
    :param title: str title
    """
    counts, _ = np.histogram(series.values, bins=bins)
    rects = plt.bar(range(len(counts)), counts, width=0.5, tick_label=counts)
    for r in rects: 
        h = r.get_height()
        plt.text(r.get_x() + r.get_width()/2, 1.01*h, h, ha='center')
    plt.xticks(range(len(counts)), xtick_labels)
    plt.xlabel(xlabel)
    plt.ylabel("# participants")
    plt.title(title)
    plt.show()
    

def get_score(row, source_df, score_name):
    """Maps the score_name score from the given source DataFrame to a row by pid.
    
    To be used via DataFrame.apply().
    
    Example usage: 
    >>> all_df['score_AUDIT'] = all_df.apply(get_score, 
                                             source_df=screener_df, 
                                             score_name='score_AUDIT',
                                             axis=1)
    """
    
    return source_df[source_df['pid'] == row.pid][score_name].values[0]


def build_ttest_dfs(pid_df, group_col, val_cols):
    """Runs t-tests on the given pid_df DataFrame against the groupings as defined in group_col.
    
    :param pid_df: a DataFrame aggregated by participant pids
    :param group_col: the column name to group on
    :param val_cols: the columns we want to run t-tests on
    :returns: t_df, p_df DataFrames containing the t and p values
    """
    index = pd.Index(data=pid_df[group_col].unique(), name=group_col).sort_values()
    t_df = pd.DataFrame(index = index, columns = val_cols)
    p_df = pd.DataFrame(index = index, columns = val_cols)

    for group in pid_df[group_col].unique():
        selected_group = pid_df[pid_df[group_col] == group]
        rest_group =  pid_df[pid_df[group_col] != group]
        for col in val_cols:
            t, p = ttest_ind(selected_group[col].values, 
                             rest_group[col].values, 
                             nan_policy='omit')

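            # single .loc call avoids chained assignment, which may not write through
            t_df.loc[group, col] = t
            p_df.loc[group, col] = p

    # format all t and p values to three decimal places
    for col in p_df.columns.values:
        t_df[col] = t_df[col].apply(lambda x: format(float(x), '.3f'))
        p_df[col] = p_df[col].apply(lambda x: format(float(x), '.3f'))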
    return t_df, p_df


def display_side_by_side(*args):
    """Displays the given DataFrames side by side as inline HTML tables."""
    html_str=''
    for df in args:
        #for col in df.col.values:
        #    df[col] = df[col].apply(lambda x: format(float(x), '2.3d'))
        html_str+=df._repr_html_()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
    

def build_csv_tables(csv_pid, csv_cols, bins, xlabels, score_col, group_col, show_bar=False, ylabel="", title="", alpha=0.05, width=0.7):
    """Builds and displays descriptive statistics for the given csv Dataframe.
    
    :param csv_pid: DataFrame grouped by pid
    :param csv_cols: the target csv column values to display
    :param bins: the bins defined for the disorder score
    :param xlabels: labels for each bin
    :param score_col: the col name for the disorder score
    :param group_col: the chosen group name for the bins
    """
    csv_pid[group_col] = pd.cut(csv_pid[score_col], bins, labels=xlabels)
    csv_pid = csv_pid.dropna(subset=[group_col])
    csv_group = csv_pid.groupby(group_col)[csv_cols]

    t, p = build_ttest_dfs(csv_pid, group_col, csv_cols)

    display_side_by_side(csv_group.mean().style.set_caption("mean"), 
                         csv_group.std().style.set_caption("std dev"),
                         p.style.set_caption("p-values"))

    
    if show_bar:
        yerr=csv_group.std().T
        ax = csv_group.mean().T.plot.bar(yerr=yerr, width=width, rot=0)
        p_list = p.values.flatten()
        std_list = csv_group.std().values.flatten()
        for i, bar in enumerate(ax.patches):
            sig = "*" if float(p_list[i]) < alpha else ""
            height = bar.get_height()
            text = format(height, ".2f") + sig
            ax.annotate(text, (bar.get_x() + bar.get_width()/2, (height + std_list[i])*1.01), ha='center')
        plt.ylabel(ylabel)
        plt.title(title)
        plt.legend(loc='lower right')
        plt.show()
        
    return csv_group, p


def build_bar(std_df, mean_df, p_val_df, ylabel="", title="", alpha=0.05, show_legend=False):
    """Builds a bar chart with the given DataFrames"""
    yerr=std_df.T
    ax = mean_df.T.plot.bar(yerr=yerr, width=0.7, rot=0)
    p_list = p_val_df.values.flatten()
    std_list = std_df.values.flatten()
    for i, bar in enumerate(ax.patches):
        sig = "*" if float(p_list[i]) < alpha else ""
        height = bar.get_height()
        text = format(height, ".2f") + sig
        ax.annotate(text, (bar.get_x() + bar.get_width()/2, (height + std_list[i])*1.01), ha='center')
    plt.ylabel(ylabel)
    plt.title(title)
    if show_legend:
        plt.legend(loc='lower right')
    
    plt.show()

plt.rcParams["figure.figsize"] = [15,6]


def generate_ems_stats(ems_raw, bins, xlabels, score_col, group_name):
    """Convenience method for generating ems.csv statistics."""
    ems_pid = ems_raw.groupby('pid', as_index=False).mean()
    group, p_vals = build_csv_tables(ems_pid, ems_cols, bins, xlabels, score_col, group_name, show_bar=False)
    sleep_qual_mean = group.mean()['sleep_quality']
    sleep_qual_std = group.std()['sleep_quality']
    sleep_qual_p = p_vals['sleep_quality']
    title = "Average sleep quality for {} groups, significance* at alpha=0.05".format(group_name)
    ylabel = "Average sleep quality 0-8 Likert"
    build_bar(mean_df=sleep_qual_mean, std_df=sleep_qual_std, p_val_df=sleep_qual_p, title=title, ylabel=ylabel)
    

def generate_emm_stats(emm_raw, bins, xlabels, score_col, group_name):
    """Convenience method for processing emm.csv statistics."""
    emm_pid = emm_raw.groupby('pid', as_index=False).mean()
    ylabel = "Average EMA score 0-8 Likert" 
    title = "Average EMA responses for {} groups, significance at alpha=0.05".format(group_name)
    group, p = build_csv_tables(emm_pid, emm_cols, bins, xlabels, score_col, group_name, show_bar=True, ylabel=ylabel, title=title)
    

def generate_coe_stats(coe, bins, xlabels, score_col, group_name):
    """Convenience method for processing coe.csv statistics"""
    coe_pid = coe.groupby('pid', as_index=False).mean()
    ylabel = "Average daily frequency" 
    title = "Average communication daily counts for {} groups, significance at alpha=0.05".format(group_name)
    group, p = build_csv_tables(coe_pid, coe_cols, bins, xlabels, score_col, group_name, show_bar=True, ylabel=ylabel, title=title)
    

def generate_scr_stats(scr, bins, xlabels, score_col, group_name):
    """Convenience method for processing scr.csv statistics."""
    
    scr_pid = scr.groupby('pid', as_index=False).mean()
    ylabel = "Average screen on hours" 
    title = "Average screen on time for {} groups, significance at alpha=0.05".format(group_name)
    group, p = build_csv_tables(scr_pid, ['screen_on'], bins, xlabels, score_col, group_name, show_bar=True, ylabel=ylabel, title=title)
    
    
def generate_tch_stats(tch, bins, xlabels, score_col, group_name):
    """Convenience method for processing tch.csv statistics."""
    
    tch_pid = tch.groupby('pid', as_index=False).mean()
    ylabel = "Average daily touches" 
    title = "Average touch count for {} groups, significance at alpha=0.05".format(group_name)
    group, p_vals = build_csv_tables(tch_pid, ['touch_count'], bins, xlabels, score_col, group_name)
    tch_mean = group.mean()['touch_count']
    tch_std = group.std()['touch_count']
    tch_p = p_vals['touch_count']
    build_bar(mean_df=tch_mean, std_df=tch_std, p_val_df=tch_p, title=title, ylabel=ylabel)


def generate_act_stats(act, bins, xlabels, score_col, group_name):
    """Convenience method for processing act.csv statistics."""
    
    act_pid = act.groupby('pid', as_index=False).mean()
    ylabel = "Average daily readings (10 second intervals)" 
    title = "Average accelerometer activity for {} groups, significance at alpha=0.05".format(group_name)
    group, p = build_csv_tables(act_pid, act_cols, bins, xlabels, score_col, group_name, show_bar=True, ylabel=ylabel, title=title, width=0.8)
    
    
def build_corr_mat(corrs, p_vals, labels, title, alpha):
    """returns the matplotlib plt object for the specified correlations."""
    
    plt.rcParams["figure.figsize"] = [20,12]
    plt.imshow(corrs)
    for i in range(len(labels)):
        for j in range(len(labels)):
            text = "{0:.2f}".format(corrs[i, j])
            p = p_vals[i,j]
            if p < alpha / len(labels):
                text = text + "*"
            plt.text(j,i, text, ha="center", va="center", color="w")
    plt.xticks([x for x in range(len(labels))], labels, rotation=45, ha="right", rotation_mode='anchor')
    plt.yticks([x for x in range(len(labels))], labels)
    plt.colorbar()
    plt.title(title)
    return plt


def run_r_corr(df, corr_type='spearman', p_correction='BH'):
    """Runs R correlation calculations and p-value corrections on the given dataframe.
    
    :returns: a tuple of (correlations, counts, p_values)
    """
    num_cols = len(df.columns.values)
    r_dataframe = pandas2ri.py2ri(df)
    r_as = r['as.matrix']
    rcorr = r['rcorr'] 
    r_p_adjust = r['p.adjust']
    result = rcorr(r_as(r_dataframe), type=corr_type)
    rho = result[0]
    n = result[1]
    p = result[2]
    
    if p_correction is not None:
        p = r_p_adjust(p, p_correction)
    r_corrs = pandas2ri.ri2py(rho)
    r_p_vals = pandas2ri.ri2py(p)
    r_counts = pandas2ri.ri2py(n)
    r_p_vals = np.reshape(r_p_vals, (num_cols,num_cols))
    return r_corrs, r_counts, r_p_vals

### Jupyter notebook bloat

- as opposed to a cell that looks like this:

In [None]:
from notebook_utils import build_hist, ttest_df, build_bar, ...

- note that "doing work" may never happen on a particular research thread, hence the appeal of Jupyter notebooks for research

### Deliverables on a project

- a paper or presentation (but what goes into a paper?)
    - graphs
    - experimental results
- possibly a code package!

### My workflow (a work in progress)

0. Create dedicated conda env, git repository


#### Data exploration

1. Explore data in Jupyter notebook
2. Migrate common functions into modules


#### Computation

3. Sanity check code with tests
4. Parameterize modules, write scripts as experiment "code trails"
5. Things often don't work, but iterate on experiment runs



#### Deliverables

6. Once things work, process output and generate figures in Jupyter notebook
7. Document repository with steps to reproduce results in paper


### Workflow: what works for you?

## References
- [Google-style docstrings](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
- [Sphinx documentation](http://www.sphinx-doc.org/en/master/)
- [unittest documentation](https://docs.python.org/3/library/unittest.html)
- [Atlassian Git Feature Branch Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow)
- [Git User Manual](https://git-scm.com/docs/user-manual.html)
- [GNU Make Manual](https://www.gnu.org/software/make/manual/html_node/index.html#SEC_Contents)
- [argparse documentation](https://docs.python.org/3/library/argparse.html)
- [tmux cheat sheet](https://tmuxcheatsheet.com/)