# Makefiles for Data Projects

2019-06-19

### Ray Buhr

Data Scientist at Braintree

https://raybuhr.github.io/talks/makefiles-for-data-projects/presentation.slides.html

## Make and Makefiles

`make` is a command line tool for running commands that may depend on the success of other commands.

`Makefile` is a text file that `make` uses to define tasks and dependencies.

**IMPORTANT**: indent with tabs (`\t`), not spaces

## Rules and Targets

A _rule_ describes a task: its target, its dependencies, and the steps involved (aka commands).

A _target_ names the result of the task, usually a file that the commands create.

Basic structure of a `Makefile`

```
target:   dependencies ...
          commands
          ...
```
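
As a concrete instance of this structure, here is a minimal sketch. The file names `data.txt` and `report.txt` are hypothetical, chosen only for illustration.

```Makefile
# report.txt is the target; data.txt is its dependency.
# make runs the command only when report.txt is missing
# or older than data.txt.
report.txt: data.txt
	wc -l data.txt > report.txt
```

Running `make report.txt` a second time without touching `data.txt` should do nothing, because the target is already up to date.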

## Why should I care?

If you describe your tasks in a `Makefile`, you get an automatic CLI for your project
that can be smart about running the steps of your data project
in the order that you want.

## Example Time!

`make install` to set up a virtual environment and install packages

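```Makefile
# Each recipe line runs in its own shell, so `source venv/bin/activate`
# would not carry over to the next line; call the venv's pip directly.
install:
	python3 -m venv venv
	venv/bin/pip install -r requirements.txt
```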

`make test` to run the tests defined for the project

`make test-coverage` to get a test coverage report

```Makefile
test:
	pipenv run python -m pytest --verbose tests

test-coverage:
	pipenv run pytest --cov-config .coveragerc \
		--verbose --cov-report term \
		--cov-report xml --cov=requests tests
```

A data pipeline in three steps:

1. Run a query and save the result to CSV
2. Train an ML model and pickle the object using joblib
3. Save both of those files to AWS S3

```Makefile
ENV = dev
BUCKET = team-data-science-$(ENV)

# `| install` is an order-only prerequisite: install runs first,
# but it never forces the CSV to be rebuilt.
data/training_data.csv: fetch_training_data.sql | install
	mkdir -p data
	psql -d replica_database -t -A -F"," \
		-f fetch_training_data.sql -o data/training_data.csv

# Assumes train_model.py writes data/trained_model.joblib.gz
data/trained_model.joblib.gz: data/training_data.csv train_model.py
	venv/bin/python train_model.py

sync_data_to_s3: data/trained_model.joblib.gz data/training_data.csv
	aws s3 sync data/ s3://$(BUCKET)/data/
```

## Other things to know

**IMPORTANT**: indent with tabs (`\t`), not spaces

`.PHONY` marks a target as a task name rather than a file, so `make` always runs it even if a file with the same name exists

`.DEFAULT_GOAL` sets the task to run by default when only `make` is run; this is super **_helpful_** when you define a `help` task

Add an `@` before a command to stop it from being printed
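
A minimal sketch that puts `.PHONY`, `.DEFAULT_GOAL`, and `@` together. The `help` and `clean` tasks and the messages they print are hypothetical, not part of the original project.

```Makefile
# Run `help` when `make` is called with no arguments.
.DEFAULT_GOAL := help

# Neither task produces a file named after itself.
.PHONY: help clean

help:
	@echo "make help   show this message"
	@echo "make clean  delete compiled Python files"

clean:
	find . -name "*.pyc" -delete
```

With this file, plain `make` should print only the two help lines, since the `@` prefix keeps the `echo` commands themselves from being shown. `make clean` has no `@`, so the `find` command is printed before it runs.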

You can use `;` to separate commands, useful when the first step is to `cd` to a directory
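
A small sketch of the `cd` pattern, assuming a `data/` directory exists; the `list-data` task name is made up for illustration.

```Makefile
.PHONY: list-data

# Each recipe line runs in its own shell, so a `cd` on one line
# is forgotten by the next. Keeping both commands on one line,
# separated by `;`, makes `ls` run inside data/.
list-data:
	cd data; ls -l
```

Using `&&` instead of `;` is a common variation: `cd data && ls -l` skips the second command if the `cd` fails.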

You can have multiple commands in a task

## Online Resources

[Makefile Cheatsheet](https://devhints.io/makefile)

[Makefile Tutorial](https://makefiletutorial.com/)

[Automatic help documentation for Makefile](https://marmelab.com/blog/2016/02/29/auto-documented-makefile.html)

[GNU Make Homepage](https://www.gnu.org/software/make/)