# In which I get to play with cookie cutters. 

I recently started freelance data sciencing on [Upwork](upwork.com), which is awesome! My second contract there involved writing some Python for a client who wanted to run various analyses and visualizations on his securities trading data. Since I was essentially responsible for delivering a piece of software to this dude, I spent a little bit of time reading up on how to structure a project like this. What I discovered was a pretty solid consensus on how all publicly distributed Python packages (modules) should be structured. 

So I conformed the client's code to this standard Python package structure and documented the hell out of it. Then something amazing happened... I felt like I had actually created a persistent, useful tool and code base for myself and this guy, rather than a collection of haphazard scripts with a workflow that nobody would be able to easily reproduce one month hence. So here is the revolutionary idea: **treat your data science projects like Python packages - structure them the same, standard way each time and document the crap out of them.**

# Structuring the Project
So how *does* the standard layout for a publicly distributed python package look? The simplest version looks like this:

In [12]:
captcha = !tree /F  # `tree /F` is the Windows command that lists folders and files
print("\n".join(captcha))  # print() is a function in Python 3

Your project directory should have a handful of top level files like `README.md`, `setup.py`, and `requirements.txt`. Then it should have at minimum the following three folders:
- `docs` to store documentation
- `myproject` to store your actual python package (the name of this folder is the name of the package)
- `myproject/test` to store scripts to run tests on your package
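
If you want to stamp out that layout without clicking around, something like the following would do it. This is only a minimal sketch: the helper name `make_skeleton` and the placeholder package name `myproject` are my own inventions for illustration, and the files it creates are empty.

```python
import tempfile
from pathlib import Path

# Top-level files named in the list above.
TOP_LEVEL_FILES = ["README.md", "setup.py", "requirements.txt"]


def make_skeleton(root, package="myproject"):
    """Create an empty project skeleton under ``root`` and list what was created."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name in TOP_LEVEL_FILES:
        path = root / name
        path.touch()
        created.append(path)
    for folder in ["docs", package, f"{package}/test"]:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # An __init__.py marks the folder as an importable package.
    init_file = root / package / "__init__.py"
    init_file.touch()
    created.append(init_file)
    return sorted(p.relative_to(root).as_posix() for p in created)


if __name__ == "__main__":
    # Build the skeleton in a throwaway folder and show the result.
    with tempfile.TemporaryDirectory() as tmp:
        for entry in make_skeleton(tmp):
            print(entry)
```

Point `make_skeleton` at a real folder instead of a temporary one to keep the result.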



Your standard project structure should have one or more files at the top level that let anyone exactly duplicate the virtual environment needed for the project to run. Since I use `conda` to manage my python packages, I include both a `requirements.txt` file (used by `pip` to recreate an environment) and a `.yml` file (used by `conda` to recreate an environment). If that was confusing to you then you should read the next section.

### Project Structure Resources
- [Jeff Knupp's take on project structure](https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/)
- [The Hitchhiker's Guide's take on project structure](http://docs.python-guide.org/en/latest/writing/structure/)
- [Cookie cutter data science docs](https://drivendata.github.io/cookiecutter-data-science/)

# Use an Environment, I Beg You
Please, for the love of god, use a virtual environment for any substantial project involving code you might ever want to reuse. I can reassure you that although they sound crazy technical they are actually *dead simple* to use. A virtual environment is like a self-contained, preserved install of a specific version of python and packages. It lets you completely reproduce the working environment for your code at any time - you'll never again break ALL the things by updating!

In a more boring and realistic sense, **a virtual environment is just an isolated folder holding specific versions of python and packages.** You make a virtual environment by indicating which version of python should be copied into your new directory (or specify the path from which to grab python) and it will also install `pip` into the environment by default. After this you can enter the virtual environment and use `pip` (or `conda`) like normal to add packages, *but they will be installed into your new virtual environment directory.* **Being inside a virtual environment just means your shell's PATH (where the OS looks for executables) now lists the environment's folder first, so `python` and `pip` resolve to the copies in there, and that python looks for modules inside the same folder.** 
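
If you are ever unsure which environment your code is actually running in, python will tell you. Here is a minimal sketch; the function name `describe_environment` is made up for this post, and the `CONDA_DEFAULT_ENV` check assumes the environment was entered through conda's activate command.

```python
import os
import sys


def describe_environment():
    """Return a few facts about the interpreter that is currently running."""
    return {
        # Path of the python executable in use.
        "executable": sys.executable,
        # Root folder of the active environment.
        "prefix": sys.prefix,
        # True inside a venv (or recent virtualenv), where prefix differs
        # from the base install it was created from.
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # conda sets this variable when one of its environments is activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Folders that python searches for importable modules.
        "module_search_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "->", value)
```

Run it once outside and once inside an environment and compare the `executable` and `prefix` entries.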

The standard way of managing virtual environments in Python is with the package manager `pip` and a 3rd party tool, `virtualenv` (Python 3 has built-in support through the `venv` module, which you can use instead of `virtualenv`). But Conda, the package management solution that ships with the Anaconda distribution of Python, has its own (better) approach for virtual environments. 
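
For the curious, Python 3's standard library can build an environment from inside python itself through the `venv` module. The sketch below is just an illustration: `create_env` is a hypothetical helper name, and it skips installing pip to keep things quick.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a bare virtual environment in ``env_dir`` with the stdlib ``venv`` module."""
    venv.create(
        env_dir,
        with_pip=False,  # pass True to also bootstrap pip into the env
        symlinks=(os.name != "nt"),  # same default the `python -m venv` CLI uses
    )
    # Every venv gets a pyvenv.cfg file that records the base interpreter.
    return Path(env_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "coolprojectname")
        print(cfg.read_text())
```

From a terminal, `python -m venv coolprojectname` does the same job.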

Here is a teaser of how simple conda environments are to use:

In [None]:
conda create --name=coolprojectname python=2.7  # Create a new env with only a fresh install of python 2.7 (no packages)
activate coolprojectname  # Enter the env (`conda activate coolprojectname` on conda 4.4+)
conda install pandas  # Install pandas, best module ever, into your new env

conda env export > coolprojectname.yml  # A file that conda can use to recreate the env
pip freeze > requirements.txt  # A file that pip can use to recreate the env
deactivate  # So long for now!
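
In case `requirements.txt` sounds mysterious: it is just a text file with one `name==version` line per installed package. The sketch below builds a rough approximation of those lines from inside python using the standard library's `importlib.metadata` (Python 3.8+). It is not how `pip freeze` is implemented, and the function name `freeze_lines` is my own.

```python
from importlib import metadata


def freeze_lines():
    """Return sorted ``name==version`` lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
        except Exception:
            # Skip broken installs whose metadata cannot be read.
            continue
        if name:
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```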

Conda by default puts your virtual environment folders inside `/Anaconda3/envs`, so you should be able to go see a new folder there called `coolprojectname`. To make a conda environment with Python 2.7 *and* all the standard Anaconda packages, run `conda create -n myapp python=2.7 anaconda` (notice the `anaconda` at the end).

### Virtual Environment Resources
There are about one billion introductions to virtual environments, so I'll just leave it at that and point you towards some of them. Go read:

- [Primer on virtual environments by RealPython](https://realpython.com/blog/python/python-virtual-environments-a-primer/)
- [Official conda docs on managing environments.](https://conda.io/docs/using/envs.html#id1)
- [General overview of virtual environments in python](https://www.caktusgroup.com/blog/2016/11/03/managing-multiple-python-projects-virtual-environments/)
- Short [stackoverflow answer](http://stackoverflow.com/questions/34398676/does-conda-replace-the-need-for-virtualenv) and a [longer official article](https://www.continuum.io/blog/developer-blog/python-packages-and-environments-conda) about how conda environments are different from (and better than) virtualenv.

# Documenting the Code 
In my opinion there are four essential pieces to this. Starting from the highest level and zooming in to the details, they are:
1. A detailed README with the standard sections.
2. Good, long docstrings at the top of each `.py` file.
3. Good, long, RST-style docstrings at the top of each function (basic helper functions can have a single-line docstring).
4. Good, short, inline comments throughout the code.
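
To make items 2 through 4 concrete, here is roughly what they look like together in one small `.py` file. The function `daily_returns` is a made-up example for this post, not code from the client project.

```python
"""Helpers for summarising price series.

This is the module-level docstring: it says what the file is for and
what a reader can expect to find in it.
"""


def daily_returns(prices):
    """Compute simple period-over-period returns.

    :param prices: Closing prices in chronological order.
    :type prices: list[float]
    :returns: One return per consecutive pair, so ``len(prices) - 1`` values.
    :rtype: list[float]
    :raises ValueError: If fewer than two prices are given.
    """
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    # Pair each price with the one that follows it.
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]
```

The `:param:` and `:returns:` fields are the RST-style bits, which documentation tools such as Sphinx can pick up.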

The `README.md` is the most critical file in my opinion - the one that everyone looks for when faced with a new unknown repo. It should start with a description of the project and a link to additional documentation (for example hosted at readthedocs). It should also include a "Quickstart" section on how to install and start using the project. If the project has non-python dependencies these should also be stated in the README. 

The definitive sources on commenting and docstrings are the PEP 8 style guide (comments) and PEP 257 (docstring conventions).

# Notes from Jeff Knupp's Writeup
[source](https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/)

## Directory Layout

The `setup.py` file is used (by the `distutils` or `setuptools` packages) for the installation of your package, so it contains information like versioning, requirements etc.

The `README.md` is the most critical file. It should start with a description of the project and a link to additional documentation (for example hosted at readthedocs). It should also include a "Quickstart" section on how to install and start using the project. If the project has non-python dependencies these should also be stated in the README. 

|- LICENSE
|- README.md
|- TODO.md
|- docs
|   |-- conf.py
|   |-- generated
|   |-- index.rst
|   |-- installation.rst
|   |-- modules.rst
|   |-- quickstart.rst
|   |-- sandman.rst
|- requirements.txt
|- sandman
|   |-- __init__.py
|   |-- exception.py
|   |-- model.py
|   |-- sandman.py
|   |-- test
|       |-- models.py
|       |-- test_sandman.py
|- setup.py

# Notes from Hitchhiker's Guide
[source](http://docs.python-guide.org/en/latest/writing/structure/)

The goal is to write clean code whose logic and dependencies are clear, and to organize the files and folders of your project sensibly.

The organization of the code is constrained by Python's module and import systems.
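
A quick way to see that constraint in action: the folder name becomes the import name, and python only finds the package if its parent folder is on `sys.path`. The sketch below builds a throwaway package to show this; the names `build_and_import`, `myproject_demo` and `answer` are all hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_and_import(root, package="myproject_demo"):
    """Write a tiny package under ``root`` and import a module from it."""
    pkg_dir = Path(root) / package
    pkg_dir.mkdir(parents=True)
    # __init__.py marks the folder as a package; the folder name is the import name.
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "model.py").write_text("def answer():\n    return 42\n")
    # The package's parent folder has to be on sys.path to be importable.
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(f"{package}.model")
    finally:
        sys.path.remove(str(root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        module = build_and_import(tmp)
        print(module.__name__, module.answer())
```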