# Introduction

## Science becoming increasingly computational

- Modeling
- Simulation
- Data analysis
- Data management

## Research software is more than just code

<img src="fig/research-software.png" alt="Research software" style="width: 12em; float: right;" />

- Data
- Organization
- Communication
- Process

## Key features of research software projects

<center><img src="fig/softwaredevelopment.jpg" alt="Pair programming" style="width: 12em; margin: 1em auto 0 auto" /></center>

- Developers (scientists first, THEN programmers)
- Problems (subtle, complicated, important)
- Requirements (exploring vs. engineering)

## Research software training gap

- Little formal training in software for most scientists
- Existing materials focused mostly on professional programmers
- You've probably picked up some software engineering principles
    - But you're probably missing some
    - And you might not be clear on motivations

## A quick survey

- Development branches (version control)
- Unit testing
- Test-driven development
- Continuous integration

## End game

- Software can be used by others
- Reasonable confidence in accuracy
- Small changes and extensions are safe and easy
- Fast enough to be useful
- Sustainable (during its lifecycle)
- Citable

## Plan for the day

- Morning: What ideals should we aspire to?
- Afternoon: What strategies can we use to get there?

## What is your project's value proposition?

Fill in the template below for your current project.

1. For *[description of target users]*
2. who want to *[statement of their need(s)]*,
3. *[project name]*
4. provides *[statement of key benefits]*.
5. Unlike *[name of alternative solution(s)]*,
6. our project enables users to *[key differentiator]*.

## Describe how your project is managed

Write a short point-form description (5-6 bullets) of how your current project is managed:

1. Who uses the software?
2. How?
3. How do they find the software?
4. How do they set it up?
5. Who decides what to change and when?
6. How are decisions and changes circulated?

# Organize Deliberately

## Project organization

<img src="fig/noble.png" alt="Research software" style="width: 12em; float: right;" />

- It's like a diet
- An example: "Noble's Rules"  
  ([Noble 2009, *PLOS Comp. Bio.*](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424))
- Details not important,  
  but principles are

## Project organization (cont.)

- **<font color="red">Name all files to reflect content and purpose</font>**
- Use established conventions
- But (deliberate) adaptations are fine

## How Is Your Project Organized?

Draw a diagram of how your project is currently organized.

1. Is this documented anywhere?
2. Would it be intuitive for a newcomer?
3. Are there any changes you could make to take advantage of common conventions?

# Use Version Control

## Version control facilitates...


- Worry-free tinkering
- Collaboration
- Reproducibility
- Transparency

<center><h2><font color="red">Using version control is a professional obligation!</font></h2></center>

## Feature branch workflow

<center><img src="fig/feature-branch.png" alt="Feature branch workflow" style="width: 18em;" /></center>

## Feature branch workflow (cont.)

- Designate a main development branch ("master")
- For each new feature
    - Create a new branch from "master"
    - Implement and test the feature on that branch
    - Merge the branch into "master" when the feature is done
- The "master" branch is always in a clean, runnable state
    - Enforce automatically when possible

## GitHub flow in action

1. Go to the SNDS GitHub repo
2. Clone to your laptop
3. Create a new branch with your name as a label
4. Add your name and email to the "CONTRIBUTORS" file.
5. Create a new pull request.

# Automate Frequent Tasks

## Don't Repeat Yourself!

- DRY principle: don't repeat yourself
  > *The only thing you can accomplish by typing something repeatedly is to get it wrong.*
- Use an automated build manager
- Use checklists for tasks you can't automate

## Build managers

- GNU Make
    - *The old standby*
- CMake, automake/autoconf, SCons, etc.
    - *"New" flavors*
- rake, pydoit, Snakemake, etc.
    - *Language-specific*

## Build managers (cont.)

- Key feature: dependencies
    - "X depends on Y depends on Z"
    - Usually implemented using timestamps or hashes
    - Tasks only re-executed when needed
- Originally designed for compiling large programs
- Can be adapted for arbitrary workflows
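
The timestamp-based dependency check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular build manager is implemented; the function names `needs_rebuild` and `build` are hypothetical.

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any file in `sources`."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run `action` only when `target` is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        action()
        return True
    return False
```

A real build manager chains many such rules together, so that rebuilding Z triggers Y and then X, and some tools compare content hashes instead of timestamps.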

## Checklists

- "Build file" executed by humans
- Keep in version control
- Adapt over time as needed based on experience and feedback

## Create a task list

1. If your project doesn’t use a build manager, what are the first few tasks you should automate?
2. If your project already uses a build manager, what tasks are used most often?

## Create a setup checklist

1. Write a short point-form checklist describing the things you do when setting up a new machine to do development on your project.
2. How many of the steps in your checklist can be automated using shell scripts or other small programs?
3. How will newcomers know if they have completed the steps in the checklist correctly?

# Make the Software Robust

## Robust software

*Robust* is the difference between 

> *Works for me on my machine.*

and 

> *Works for someone I've never met on a cluster I've never heard of.*

## Taschuk's Rules

See ([Taschuk & Wilson 2017, *PLOS Comp. Bio.*](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005412))

<center>Provide a descriptive README (synopsis, dependencies)</center>

<center><img src="fig/khmer-readme.png" alt="README in terminal and browser" style="width: 24em" /></center>

## Taschuk's Rules (cont.)

<center>Provide a descriptive usage statement<br />Make common operations easy to configure</center>

<center><img src="fig/canon-cli.png" alt="CLI in terminal" style="width: 16em" /></center>
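
As a rough illustration of a usage statement, here is a minimal sketch using Python's standard `argparse` module. The program name `mytool`, its options, and the version number are all hypothetical; the point is that the parser generates `-h/--help` and `--version` output for free, and that common options have short flags and visible defaults.

```python
import argparse

__version__ = "0.1.0"  # hypothetical version number


def build_parser():
    """Build a command-line parser for a hypothetical tool called `mytool`."""
    parser = argparse.ArgumentParser(
        prog="mytool",
        description="Summarize the measurements in a data file.",
    )
    parser.add_argument("infile", help="input data file")
    parser.add_argument(
        "-o", "--out", default="-",
        help="output file (default: standard output)",
    )
    parser.add_argument(
        "-t", "--threshold", type=float, default=0.5,
        help="minimum score to report (default: %(default)s)",
    )
    parser.add_argument(
        "--version", action="version", version="%(prog)s " + __version__,
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Running the script with `--help` prints the usage statement, and `--version` answers the question "which version do I have?" without reading the source.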

## Taschuk's Rules (cont.)

- Use version control
- Release stable versions w/ meaningful version number
- Reuse existing software (whenever possible)
- Use build tools & package managers for installation
- Do not require special privileges
- Eliminate fixed/absolute file paths
- Include a small data set to test installation
- Produce identical results given identical input
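
For software that uses random numbers, "identical results given identical input" usually means exposing the seed. The sketch below is one way to do that with the standard library; the function `bootstrap_mean` and its default seed are hypothetical examples, not part of Taschuk's rules.

```python
import random


def bootstrap_mean(values, n_resamples=1000, seed=42):
    """Estimate the mean of `values` by resampling with replacement.

    The seed is an explicit parameter, so two runs with the same input
    and the same seed return exactly the same number.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)
```

Recording the seed alongside the output makes it possible to regenerate a result later.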

## How do you version now?

1. How many different versions of your project are in use right now? How do you know?
2. If a user has a problem, how will you and they find out which version of the software they have?

## Runtime configuration

1. What options or parameters does your program use?
2. Which ones are users most likely to set or change?
3. How are these parameters set?

# Test All The Things

## Are your tests...

<img src="fig/testing_graphic.jpg" alt="Software testing" style="width: 12em; float: right;" />

- Automated?
- Comprehensive?
- Well documented?

## Types of tests

<img src="fig/science-cat.jpg" alt="Science cat says you need more tests" style="width: 8em; float: right;" />

- Smoke test
- Unit test
- Functional test
- Integration test

## Test evaluation

- Check exact value
- Check match against a pattern
- Check range of value (a.k.a. specify tolerances)
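
The three kinds of check can be sketched with plain `assert` statements. The helper functions `circle_area` and `sample_label` are hypothetical stand-ins for project code.

```python
import math
import re


def circle_area(radius):
    """Hypothetical numerical routine."""
    return math.pi * radius ** 2


def sample_label(index):
    """Hypothetical routine that formats an identifier."""
    return "sample-{:04d}".format(index)


# 1. Exact value: appropriate for integers, strings, and other discrete results.
assert sample_label(7) == "sample-0007"

# 2. Pattern: appropriate when only the form of the result is fixed.
assert re.fullmatch(r"sample-\d{4}", sample_label(123))

# 3. Tolerance: appropriate for floating-point results.
assert 0.1 + 0.2 != 0.3              # exact comparison of floats is fragile
assert math.isclose(0.1 + 0.2, 0.3)  # comparison within a relative tolerance
assert math.isclose(circle_area(2.0), 12.566370614, rel_tol=1e-9)
```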

## Use a testing framework

- Exist for most popular languages
    - `pytest` for Python
    - `testthat` for R
- Implement tests as functions
- Each test results in a pass, fail, or error
- Framework finds and executes tests
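
A minimal sketch of tests written as functions, following the naming convention `pytest` uses for discovery (functions named `test_*` in files named `test_*.py`). The `mean` function is a hypothetical piece of project code, and the block deliberately avoids importing `pytest` so it also runs on its own.

```python
def mean(values):
    """Hypothetical project code: arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean() of an empty sequence")
    return sum(values) / len(values)


def test_mean_of_integers():
    assert mean([1, 2, 3]) == 2


def test_mean_of_single_value():
    assert mean([5]) == 5


def test_mean_of_empty_list_raises():
    # With pytest installed, `pytest.raises(ValueError)` is the idiomatic form.
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
```

Saved as, say, `test_stats.py`, these would be found and run by invoking `pytest` in the same directory.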

## What Are Your Tolerances?

1. What will you measure to determine whether your software is correct enough?
2. What tolerances will you accept on answers?
3. Why?

## Exercise: rectangle overlay

<center><img src="fig/overlay.png" alt="Rectangle overlay" style="height: 6em;" /></center>

Given two rectangles, each defined by the four values `[x0, y0, x1, y1]`, return the rectangle representing their overlap. Assume all coordinates are integer.
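
One possible solution sketch, with test cases, is shown below. The exercise leaves several things unspecified, so the following are assumptions: each rectangle satisfies `x0 <= x1` and `y0 <= y1`, and rectangles that do not overlap (or that only touch along an edge or corner) yield `None`. The function name `overlap` is also made up for this example.

```python
def overlap(rect_a, rect_b):
    """Return the overlap of two rectangles [x0, y0, x1, y1], or None.

    Assumes x0 <= x1 and y0 <= y1 for both inputs. Rectangles that only
    touch along an edge or corner are treated as non-overlapping.
    """
    ax0, ay0, ax1, ay1 = rect_a
    bx0, by0, bx1, by1 = rect_b
    x0, y0 = max(ax0, bx0), max(ay0, by0)
    x1, y1 = min(ax1, bx1), min(ay1, by1)
    if x0 >= x1 or y0 >= y1:
        return None
    return [x0, y0, x1, y1]


def test_partial_overlap():
    assert overlap([0, 0, 2, 2], [1, 1, 3, 3]) == [1, 1, 2, 2]


def test_one_inside_the_other():
    assert overlap([0, 0, 4, 4], [1, 1, 2, 2]) == [1, 1, 2, 2]


def test_disjoint():
    assert overlap([0, 0, 1, 1], [2, 2, 3, 3]) is None


def test_shared_edge_only():
    assert overlap([0, 0, 1, 1], [1, 0, 2, 1]) is None
```

Deciding what the edge-touching and disjoint cases should return is itself part of the exercise: writing the tests forces that decision to be made explicitly.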

# Licensing

## Licensing research software

- Creative works automatically eligible for protection
- Reusing creative works without a license is dangerous (infringement lawsuits)
- Adding a license to your project
    - Makes protections and permissions explicit
    - Signals how you wish to engage the wider research community

## Choose a license

- Put a `LICENSE` or `LICENSE.txt` file in your repository
- Use a common license, **don't write your own**
    - MIT or BSD
    - GPL
    - Creative Commons (CC-0 or CC-BY)
    - others from [Open Source Initiative](http://opensource.org/licenses)

## Licensing considerations

- Do you want to license your project at all? Can you?
- Do you require derivative works to have the same license? (fraught with unintended consequences)
- Is your license compatible with the software your project depends on?

# Hosting

## What are the permanent access points for your project?

- Software under development
- Stable software releases/versions
- Documentation
- Supporting data

## Hosting considerations

- Privacy?
- Ownership, branding
- Reliability
- Management burden

## Hosting options

- Lab / department / university server
- Paid hosting service
- Public hosting service

## Can my work be public?

Find out whether you are allowed to host your work openly on a public forge. Can you do this unilaterally, or do you need permission from someone in your institution? If so, who?

## Sharing your work

For your project, write down where you would point a new lab mate or collaborator to find the following:

- The project's development repository
- The latest stable version / release
- Previous stable versions / releases
- Project documentation
- Supporting data (if any)

# Packaging

## Package managers

- Motivation: ease software installation burden
    - For users and sysadmins at least
    - Result: additional burden on developer
- Single command to install software packages
    - Handles dependencies recursively
    - Easily scripted or otherwise automated

## Dependency management

A hypothetical `requirements.txt` for a Python package:

```
Django>=1.9,<1.10
PyYAML
requests>=2.0
```

Install with `pip install -r requirements.txt`

## Common package managers

- **OS-specific**: homebrew, apt, yum, pacman, etc.
- **Language-specific**: pip, CRAN, npm, gem, CPAN, etc.
- **Generic**: conda, others?

## Containers

- Alternative for configuring software environment
- Complementary to, not competitive with, package managers
- Examples: Docker, Singularity

## Ask your doctor: is packaging right for you?


1. What language is your project primarily implemented in? Does a package manager exist for this language?
2. Write down your project’s dependencies. Can these be installed with a package manager? Is this the same as the package manager above?
3. How do you know if your dependency list is correct? How would you know if something changed and it fell out of date?

# Review

## What ideals should my project aspire to?

- Intuitive organization (Noble's Rules)
- Transparency, provenance (version control)
- Usability (build managers, Taschuk's Rules)
- Accuracy, reliability (testing)
- Usability (licensing)
- Availability (hosting, packaging)

## What ideals should my project aspire to?

- Reproducibility
- Replicability
- Repeatability

# Development Paradigms

## Agile development

- Rapid iteration
- Compatible with informal/underdeveloped/changing requirements
- Frequent (daily) short progress updates
- Works well for small teams

## Agile development

- Rapid iteration
- Compatible with fluid requirements
- Daily short progress updates
- Works well for small teams

## Stand-up meetings

<img src="fig/standup.jpg" alt="Standup meetings" style="width: 12em; float: right;" />

- Report and discuss:
    - yesterday
    - today
    - blockers
- "Stand-up": short, focused
- Break down work into small  
  chunks

## Agile works best when...

- Requirements are informal, underdeveloped, or changing
- Developers can communicate continuously
- Team is small
- Team is disciplined
- Team members *like* to be empowered

## Agile is not "cowboy coding"

- Many developers don't like to plan or document their work
- Agile is not an excuse to avoid doing things you don't want to do
- Agile requires **more** discipline (like musical improv)

## Is your team agile?

- Which of the key agile practices described above are you currently using?
- Which do you think you and your team would actually adopt in the next 3-6 months?
- Which do you think are not good fits to your needs or situation?

## "Sturdy" development

- "Measure twice, cut once"
- Requires formal, mature requirements
- More upfront planning and estimation
- Scheduling enforced by managers
- Can scale to *very* large projects

## Classical approach works best when...

- Stakeholders are committed to the long-term
- Work is so large that division of labor is essential
- Problem domain, solutions, and technologies are well understood

## Classical engineering process

<img src="fig/prioritize.png" alt="Prioritization" style="width: 10em; float: right;" />

1. Gather requirements
2. Analysis and estimation
3. Prioritization (3x3 grid)
4. Scheduling
5. Implementation
6. Wrap up

## Software requirements

The classical approach relies on good, strict requirements.

- As unambiguous as a legal contract or a mathematical proof.
- Two independent, competent practitioners should reach the same conclusions about correctness.

## Prioritize

1. Draw a 3x3 grid of the kind described above.
2. Pick 3-4 open issues for your current project.
3. Decide where each one belongs on the grid. Are any of them so large that they should be broken down into sub-tasks?

## Write requirements

1. Write an unambiguous specification of a feature you would like to add to your current project.
2. Swap specifications with your partner.
3. What is the least helpful (or most damaging) implementation of your partner’s feature you can think of that would technically satisfy the specification they wrote?

# Issues and Action Items

## Issue trackers

- Also called bug trackers
- Shared "to-do" list to manage everything
- Every task is recorded as a separate ticket

## Issue trackers can be used...

- To plan or request new features
- To describe bugs
- To solicit bug reports from community
- To discuss planned changes
- To dump ideas for later reference

<img src="fig/issue-thread-list.png" alt="Issue thread list" />

<img src="fig/issue-thread-short.png" alt="Issue thread list" />

<center><img src="fig/issue-thread-long.png" alt="Issue thread list" style="height: 18em" /></center>

## Components of each issue/task/ticket

- Unique ID (auto-assigned)
- Short descriptive summary (to aid browsing)
- Tags / metadata (to aid searching)
- Status
- Owner (who's responsible)
- Full description
- Threaded discussion

## Key utilities of issue trackers

- Prioritization
    - What has to be done right now? Soon? Later?
- Documenting your work
    - Think of it as a shared lab notebook.
    - Full record of work done, along with relevant discussion, reasoning, etc.
    - Everyone knows what everyone else is working on.

## What's on *YOUR* list?

1. What are the top 3 items on your project’s to-do list?
2. How confident are you that your collaborators and users would agree with your selection?

# Test-Driven Development

## Write tests first

1. Write several tests describing a desired new feature
    - tests won't actually pass
    - serve as requirements
    - go for clear, comprehensive, and precise
2. Implement the desired feature
    - you're done when the tests pass
    - write only enough code to make the tests pass

## Does it work?

Advocates claim TDD:

- Helps with focus
- Ensures code is testable
- Ensures tests actually get written

## Does it work?

Evidence is contradictory

- No strong positive effect from empirical studies
- But many productive programmers swear by it
- Maybe we're measuring the wrong things?

See "Making Software" by Oram & Wilson

## TDD exercise: SNDS

1. Return to the SNDS repo
2. Return to master branch, and create a new branch
3. Write 4-6 tests for the "sums" function
4. Open a new pull request
5. We will implement the "sums" function together

```python
def sum_non_dec_sublists(inlist):
    """
    Given a list of numbers, return a list of the sums of each
    non-decreasing sub-list. For example, if the input is
    [1, 2, 3, 3, 1, 5, 6, 3, 1, 2, 3], the output should be [9, 12, 3, 6].
    """
    sums = list()

    buffer = list()
    for value in inlist:
        if len(buffer) > 0 and value < buffer[-1]:
            newsum = sum(buffer)
            sums.append(newsum)
            buffer = list()
        buffer.append(value)
    newsum = sum(buffer)
    sums.append(newsum)

    return sums
```
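
A few tests of the kind the exercise asks for might look like the sketch below. The test names are made up, and the function is repeated in compact form only so that the block runs on its own. Note that, as written, the function returns `[0]` for an empty list; whether that should instead be `[]` is a requirements question that writing the test brings to the surface.

```python
def sum_non_dec_sublists(inlist):
    """Compact copy of the function above, repeated so this block is runnable."""
    sums, buffer = [], []
    for value in inlist:
        if buffer and value < buffer[-1]:
            sums.append(sum(buffer))
            buffer = []
        buffer.append(value)
    sums.append(sum(buffer))
    return sums


def test_docstring_example():
    data = [1, 2, 3, 3, 1, 5, 6, 3, 1, 2, 3]
    assert sum_non_dec_sublists(data) == [9, 12, 3, 6]


def test_single_value():
    assert sum_non_dec_sublists([5]) == [5]


def test_entirely_non_decreasing():
    assert sum_non_dec_sublists([1, 2, 3]) == [6]


def test_repeated_values_stay_in_one_sublist():
    assert sum_non_dec_sublists([2, 2, 2]) == [6]


def test_strictly_decreasing():
    assert sum_non_dec_sublists([3, 2, 1]) == [3, 2, 1]


def test_empty_list_current_behavior():
    # Documents what the code does today, not necessarily what it should do.
    assert sum_non_dec_sublists([]) == [0]
```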

# Code Review

## Code review

<img src="fig/code-review.png" alt="Pair programming" style="width: 6em; float: right;" />

- Most cost effective way to find bugs
- Natural extension of what academics  
  do with each other's papers
- Should be continuous part of the  
  regular development cycle

## Code review works best when...

- Code already passes tests and style checks
- Each review takes less than an hour
- Focus is on behavior, not appearance
- Reviewer has a checklist to refer to

## Not just for code

Reviews should include:

- Description (issues, PRs)
- Documentation
- Commit messages and/or change log

## Code review in academics

- Domain experts are scarce
- No career incentives to review someone else's project
- At present, only sustainable within team projects
- One of many problems with publishing academic research software

## Pair programming

<img src="fig/pairprogramming.png" alt="Pair programming" style="width: 10em; float: right;" />

- Real-time code review
- Helps with knowledge transfer
- Discourages social media
- Studies prove effectiveness

## Pair programming (cont.)

- Most people don't want to do it all the time
- Problematic with highly specialized problem domains
- Save it for onboarding, difficult tasks, or team building

# Continuous Integration

## What is continuous integration?

- Automatically build and test code with each commit
- Post results somewhere the team can see them
- If a build or tests fail, send notifications

## CI and feature branch workflow

- Build and run tests on feature branches *before they're merged*
- Only do code review when CI passes
- Only merge to `master` after CI and code review pass

<center><img src="fig/ci-check-pending.png" alt="CI check pending" style="width: 20em;" /></center>


## CI platforms

- Travis CI
- Drone.io
- CircleCI
- GitLab (integrated)
- Jenkins (self-hosted)

## Recommended CI checks

- build/compile
- automated tests
- test coverage
- code style

## Test coverage

- Measures what code is being executed and what isn't
- Coverage doesn't guarantee correctness
    - Just less likely to be incorrect for stupid, obvious reasons
- Coverage can't measure everything
    - Example: how different types of input are handled

<center><img src="fig/ci-coverage-1.png" alt="CI code coverage" style="width: 20em;" /></center>

<center><img src="fig/ci-coverage-2.png" alt="CI code coverage" style="width: 20em;" /></center>
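
A small sketch of why full coverage does not guarantee correctness. The function `safe_ratio` is a hypothetical example: its single test executes every line, so a coverage tool would report 100%, yet an entire class of input is never exercised.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator (intended to be safe for any input)."""
    return numerator / denominator


def test_safe_ratio():
    # Executes every line of safe_ratio, giving 100% line coverage...
    assert safe_ratio(6, 3) == 2


# ...but a zero denominator, which no test covers, still raises
# ZeroDivisionError despite the function's stated intent.
```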

## Code style

- Style checkers are sometimes called "linters" after the `lint` program for C
    - **pep8** for Python
    - **formatR** for R
- Automatic enforcement eliminates subjective arguments
    - if the code passes, it's approved
    - guarantee: never have to worry about style
    - discuss exceptions on a case-by-case basis

## Set up CI

Follow the "Continuous Integration tutorial" to set up CI for our non-descending sublists repository.

# Compromise

## Technical debt

- Dissonance between conceptual model and code
- Informed, deliberate suspension of best practice
- Term popularized by "lean tech startup" culture

## Tradeoffs

- Technical debt usually necessary to get a project off the ground
    - Not enough time and resources to be thorough and careful with every idea from the beginning
- Technical debt complicates sustained development
    - Adding new features requires going back and cleaning up
    - Debt must be "paid down"...with interest!

## Stay on target!

- Software engineering is not an end unto itself.
- It is a means to an end. What are your project's ends?
- In science, priorities should be
    - accuracy
    - performance
    - ...
    - ...
    - aesthetics, user experience, etc.

## Rapid iterative development

- Build a quick proof-of-concept
- Evaluate accuracy
- Evaluate performance. If unsatisfactory:
    - profile empirically
    - optimize surgically
- Clean up
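
"Profile empirically" can be as simple as the sketch below, which uses the standard-library `cProfile` and `pstats` modules. The functions `slow_sum_of_squares` and `profile_call` are hypothetical; in practice the profiled function would be the project's own entry point.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot to be profiled."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total


def profile_call(func, *args):
    """Run func(*args) under the profiler; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_sum_of_squares, 100000)
    print(report)
```

The report shows where time is actually spent, which is the evidence needed before optimizing anything "surgically".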

## Incremental improvement

<img src="fig/fix-all-the-things.png" alt="Fix all the things" style="width: 10em; float: right;" />

- Don't try to FIX ALL TEH THINGS  
  AT ONCE!
- When fixing / extending code:
    - check for clarity
    - improve documentation
    - write unit tests
- Improvements will accumulate  
  quickly

## Summary

<font color="red">Compromise on best practices if you must!</font>

- but not on version control
- and not on automated testing
- be deliberate

## Stupidity-driven development

- Write lots of tests for scientific core of the code
- Don't fuss too much about everything else
- When a bug is encountered:
    - write a new regression test to reproduce the bug
    - fix the bug
    - get on with more important things
- Don't write tests for bugs that will never appear!
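
The bug-then-regression-test cycle might look like the following sketch. The function `normalize` and the bug report are hypothetical: the scenario assumes a user reported a `ZeroDivisionError` for all-zero input, a test was written to reproduce it, and the guard was then added.

```python
def normalize(values):
    """Scale values so they sum to 1; an all-zero input is returned unchanged."""
    total = sum(values)
    if total == 0:  # the fix: this guard was missing when the bug was reported
        return list(values)
    return [v / total for v in values]


def test_normalize_all_zeros_regression():
    # Reproduces the (hypothetical) bug report; failed before the fix.
    assert normalize([0, 0, 0]) == [0, 0, 0]


def test_normalize_typical_input():
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]
```

The regression test stays in the suite permanently, so the same bug cannot quietly return.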

# Mentorship

# Build a Community

## Questions of sustainability

- How to get people to use the software?
- How to find scientists working on similar problems?
- How to get people to contribute to the project?
- Fate of the project when funding is depleted?
- How to magnify career impact of the project?

## Build a community

- Treat every user as a potential contributor
- Clarify expectations about peripheral participation
- Lower barriers of entry for use and contribution
- Include a code of conduct
- Manage communication

## Channels

1. What is your project’s primary communication channel?
2. Why and how was it chosen?
3. What discussion(s) take place in other channels? Why?
4. How easy or hard is it for a newcomer to find where things are being discussed?

# Marketing

## Marketing your science

- Tempting to think of science as purely meritocratic
    - "If I do good science, it should speak for itself!"
    - "Advertising is pretentious and self-serving!"

## Scientists are social creatures

- We all have our social networks of trust
- Technology has helped with discoverability of new research
- Still not as effective as getting airtime from a “high-profile” outlet
    - "Glamour" journal (*Nature*, *Science*, etc.)
    - Blog or Twitter account of a "big shot" in the field

## Don't be afraid to market your project

- If it's really good, you should tell people about it
- Concert t-shirt model
    - Funding proposals are your product
    - Papers and software are your advertisements

## Use social media and blogs

<img src="fig/social-marketing.jpg" alt="Social media marketing" style="width: 10em; float: right;" />

- Papers must be objective
- Social media posts don't have to be!
- Worst case: nobody listens
- Best case: the right people  
  notice and advertise your work

## Citations

- Include a `CITATION` file in your project
- If not yet published, get consensus on what's needed for a "software paper"
    - See JOSS and JORS for checklists
    - Very straightforward if your project is already in good shape
- Continue to plan for future papers

## Add citation info to your project


# Conclusion

## Reflection

1. What was the most useful or interesting thing you learned in this class?
2. What was the least useful or interesting?
3. What didn’t make sense?
4. What don’t you believe?
5. What are the next three things you are going to do?