# Best Testing Practices for Data Science

A short tutorial for data scientists on how to write tests for your code and your data. Before the tutorial, please read through this README file, for it contains a lot of useful information that will help you best prepare for the tutorial.

## How to use this repository

The tutorial notes are typed up in Jupyter notebooks, and static HTML versions are available under the [`docs`](./docs/) folder. For the non-bonus material, I suggest working through the notes in order. With the exception of the Projects, the bonus material can be tackled in any order. During the tutorial, be sure to have the HTML versions open.

## Pre-Requisite Knowledge

I am assuming the following about you as a coder:

- You are a data analytics type, who knows how to read/write CSV files with Pandas, and do basic data manipulation (slicing, indexing rows + columns, using the `.apply()` function).
- You are not necessarily a seasoned software developer who has experience running tests.
- You are comfortable with operating in the Terminal environment.
- You have some rudimentary knowledge of `numpy`, particularly the `array.min()`, `array.max()`, `array.mean()`, `array.std()`, and `numpy.allclose(a1, a2)` function calls.

In order to prepare for the tutorial, there are some pieces of Python syntax that will come in handy to know:
- the context manager syntax (`with ....`),
- assertions (`assert condition1 == condition2`),
- file I/O (`with open(....) as f:...`),
- list/dict comprehensions and generator expressions (`[a for a in container if condition(a)]`),
- checking types & attributes (`isinstance(obj, type)` or `hasattr(obj, attr)`).
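
If any of the syntax above is unfamiliar, the following minimal sketch touches each item once. The file name and values are made up for illustration and are not part of the tutorial material.

```python
import os
import tempfile

# Context manager + file I/O: the file is closed automatically on exit.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with open(path, "w") as f:
    f.write("1\n2\n3\n")

with open(path) as f:
    # List comprehension with a filter condition (skips blank lines).
    values = [int(line) for line in f if line.strip()]

# Dict comprehension.
squares = {v: v ** 2 for v in values}

# Assertions raise AssertionError when the condition is False.
assert values == [1, 2, 3]
assert squares[3] == 9

# Checking types and attributes.
assert isinstance(values, list)
assert hasattr(values, "append")
```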

## Feedback

If you've taken a version of this tutorial, please leave feedback [here](https://ericma1.typeform.com/to/Ua0LBs). I use the suggestions in there to adjust the tutorial content and make it better. The changes are always released publicly on GitHub, so everybody benefits!

# Environment Setup

## `conda` setup

This installation route should work cross-platform. I recommend using the [Anaconda distribution](https://www.continuum.io/downloads) of Python because it is a good way to bootstrap your data science environment.

To get set up, create a `conda` environment based on the provided [`environment.yml`](./environment.yml) spec file. Run the following command in your bash terminal.

```bash
$ bash conda-setup.sh
```

## `pip` setup

The alternative way is to use a virtualenv environment:

```bash
$ bash venv-setup.sh
$ source datatest/bin/activate
```

Alternatively, you can `pip` install each of the dependencies listed in the `environment.yml` file. (The `requirements.txt` file may be less eagerly maintained than the `environment.yml` file, given the `conda`-biases that I have.)

## Manual Setup

If you prefer having more control over your installation process, `conda` or `pip` install the dependencies listed in the `environment.yml` file.

## Checks

To check whether the environment is correctly set up, run the `checkenv.py` script:

```bash
$ python checkenv.py
```

It should print `All packages found; environment checks passed.` to your terminal. Otherwise, use `conda` or `pip` to install the missing packages that it reports (they will show up one by one).
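
For readers curious how a check like this can work, here is a minimal sketch. It is not the actual contents of `checkenv.py`; the function name `missing_packages` and the package list are assumptions made for illustration, and the real list would come from `environment.yml`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list; standard-library names are used so the sketch runs anywhere.
REQUIRED = ["json", "csv"]

missing = missing_packages(REQUIRED)
if missing:
    # Report missing packages one by one.
    for name in missing:
        print(f"{name} not found; install it with conda or pip.")
else:
    print("All packages found; environment checks passed.")
```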

# Authors

- [Eric J. Ma](http://www.ericmjl.com)

# Contributors

Special thanks go to individuals who have contributed in ways big and small to the improvement of the material.

- Renee Chu
- Matt Bachmann: @Bachmann1234
- Hugo Bowne-Anderson: @hugobowne
- Boston Python tutorial attendees:
    - @races1986
    - Thao Nguyen: @ThaoNguyen15
    - @ChrisMuir

# Data Credits

- [Divvy Data Set](https://www.divvybikes.com/data)
- [Analyze Boston](https://data.boston.gov/)
- Mia T. Lieberman for the sanitization dataset.


# Best Data Testing Practices for Data Science

Eric J. Ma

MIT Biological Engineering

# How to use these notebooks

- Follow along with Jupyter notebooks in GitHub: [ericmjl/data-testing-tutorial](https://github.com/ericmjl/data-testing-tutorial)
- Most of what we will do is in the terminal & your favourite text editor.

# Why tests?

- We make assumptions about our code & data. 
- There are cases where those assumptions are violated.
- Therefore, automated testing of those assumptions is important.

# Tests: A Definition

> A contract between your current self and your future self.
> What you expect to be right now should hold true in the future.
> What you expect to be wrong now should still be wrong in the future.
> Unless the requirements have changed!

# Let's discuss!

What needs to be tested for:

- code?
- data?
- statistics?

## For code, what needs to be tested?

- Given some example input(s), the output is correct.
- Counter-examples should show up as incorrect.
- Boundary cases are accounted for using defensive programming.
- All lines of stable code are subject to at least one test.
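
As a rough illustration of the points above, here is a sketch of a small function with tests for an example input, a counter-example, and boundary cases. The function `min_max_scale` is a hypothetical example, not one of the tutorial exercises.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    # Defensive programming: reject inputs the function cannot handle.
    if len(values) == 0:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be identical")
    return [(v - lo) / (hi - lo) for v in values]


def test_example_input():
    # Given a known input, the output is correct.
    assert min_max_scale([0, 5, 10]) == [0.0, 0.5, 1.0]


def test_counter_example():
    # A known-wrong answer must not be produced.
    assert min_max_scale([0, 5, 10]) != [0, 5, 10]


def test_boundary_cases():
    # Boundary cases should fail loudly rather than return garbage.
    for bad_input in ([], [3, 3, 3]):
        try:
            min_max_scale(bad_input)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad_input!r}")
```

Saved in a file such as `test_scale.py`, these functions would be collected by `pytest` because their names start with `test_`. With `pytest` available, the boundary check is more commonly written using `pytest.raises(ValueError)`.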

## For data, what needs to be tested?

- Data types are appropriate. (Types)
- Data has not been tampered with. (Integrity)
- Missing values are accounted for. (Completeness)
- Data schema is complete. (Structure)
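
The four checks above can be sketched with the standard library alone. The CSV contents and column names below are invented for illustration; with Pandas, the same ideas would be expressed against a DataFrame.

```python
import csv
import hashlib
import io

# Hypothetical CSV contents; in practice this would be read from a file.
RAW = "station_id,name,capacity\n1,Main St,15\n2,Oak Ave,19\n3,Elm Rd,\n"
EXPECTED_COLUMNS = ["station_id", "name", "capacity"]


def sha256_of(text):
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Assumption: this hash was recorded when the data was first received.
RECORDED_HASH = sha256_of(RAW)

rows = list(csv.DictReader(io.StringIO(RAW)))

# Structure: the schema has exactly the expected columns.
assert list(rows[0].keys()) == EXPECTED_COLUMNS

# Types: every station_id can be interpreted as an integer.
assert all(row["station_id"].isdigit() for row in rows)

# Completeness: missing values are found and accounted for explicitly.
missing_capacity = [row["station_id"] for row in rows if row["capacity"] == ""]
assert missing_capacity == ["3"]

# Integrity: any change to the data changes the hash.
assert sha256_of(RAW) == RECORDED_HASH
assert sha256_of(RAW + " ") != RECORDED_HASH
```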

## For statistical analysis & ML, what else needs to be done?

- Check the underlying distributions of numeric (integer or float) data.
- Classifying data as categorical, ordinal, count, compositional, or continuous.
- Categorical/ordinal values represented as strings should be converted to numerical representations.
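
As one possible illustration of the last point, the sketch below converts string labels to numbers in plain Python and asserts the result. The labels and their ordering are assumptions chosen for the example.

```python
# Hypothetical ordinal column stored as strings.
sizes = ["small", "large", "medium", "small"]

# Ordinal data: the order matters, so spell it out explicitly.
ORDER = ["small", "medium", "large"]
size_to_code = {label: code for code, label in enumerate(ORDER)}
codes = [size_to_code[s] for s in sizes]
assert codes == [0, 2, 1, 0]

# Hypothetical categorical column: no order, so use one-hot encoding.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))  # ["blue", "red"]
one_hot = [[int(c == cat) for cat in categories] for c in colors]
assert one_hot == [[0, 1], [1, 0], [0, 1]]

# Sanity check: every row encodes exactly one category.
assert all(sum(row) == 1 for row in one_hot)
```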

## What you can expect

### Coding

- You'll be implementing only simple functions. Nothing complicated.
- Sample solutions are in the `*_soln.py` files.

### Tutorial Material

- Covered with interspersed lectures.
- Simple exercises designed to get you familiar with how to write tests.
- Give you a set of tools + code to bootstrap testing for another project.

### Bonus Material

- Self-paced material for the final hour of the tutorial or at home.
- More advanced testing topics:
    - File integrity
    - Test coverage
    - Property-based tests
- More superpowers for data testing!

# Take-Homes

- You'll get a ton of practice with [`pytest`](https://docs.pytest.org/en/latest/) and assertion statements.
- You will be left with self-paced learning material for [`hypothesis`](https://hypothesis.readthedocs.io/en/latest/) to do property-based testing.
- You will have a starter set of tools for writing tests for your code and data.

**If anything, I want you to not be afraid to write a test.** If that's all you take back, this tutorial can be deemed a success.

# Let's get going!