# Python Environments

This notebook documents the commands we will be using in this workshop. For more documentation please see slides.pdf in the same folder.

We are working on mybinder.org. Open this link to launch the workshop environment: [https://mybinder.org/v2/gh/univai-ghf/python-environments/HEAD](https://mybinder.org/v2/gh/univai-ghf/python-environments/HEAD).

## ACTIVITY 1: Creating a virtual environment

- `conda create --name environment-name [python=3.6]`
- `conda activate environment-name`
- `conda deactivate` (takes no environment name; it leaves the currently active environment)
- `conda install <packagename>`

Eg:

```
conda create -n newe
conda activate newe
conda install numpy pandas ipykernel
conda deactivate
```
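To confirm from inside Python which environment is active, a minimal sketch follows. It relies on `sys.prefix` and on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper name `describe_environment` is made up for illustration.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is running this code."""
    return {
        # Folder of the environment this interpreter was started from.
        "prefix": sys.prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it once before and once after `conda activate newe`; the `prefix` and `conda_env` values should change if the activation worked.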

## Making sure environments are available on jupyterlab

- in the base installation, make sure you `conda install nb_conda_kernels`.
- whether it is already present depends on how you installed the base; if it is not, you need to run the above command.
- in our binder system, this is already installed.
- now in every new environment make sure you install `ipykernel`.


![inline](images/newe.png)

---

## Capturing a virtual environment

`conda env export` captures the exact dependencies of the active environment. Redirect its output into a file, and you can use that file elsewhere on the same OS to recreate the environment.

The file is usually called `environment.yml`, so we do:

`conda env export > environment.yml`

If this file is in the current folder, just type `conda env create` there to create an environment with these packages.

---

## ACTIVITY 2: Using environment.yml files

- you can write your own `environment.yml` file to be much less tied to specific package versions, unless needed.

For example, for an environment `newe2`, we create a folder `newe2`, and in that folder we create an `environment.yml` with the contents:

```yaml
name: newe2
channels:
- conda-forge
dependencies:
- ipykernel
- matplotlib
- pandas
- numpy
- scipy
- seaborn
- scikit-learn
- tensorflow
- keras
```
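The pinned lines that `conda env export` writes have the form `name=version=build`. As a rough sketch of how such pins could be relaxed into the looser style shown above, the hypothetical helper `loosen_spec` below drops the build string and, optionally, the version. The example pins are invented for illustration.

```python
def loosen_spec(spec, keep_version=False):
    """Drop the build string (and optionally the version) from a conda spec."""
    parts = spec.strip().split("=")
    name = parts[0]
    if keep_version and len(parts) > 1:
        return f"{name}={parts[1]}"
    return name


if __name__ == "__main__":
    # Made-up example pins, for illustration only.
    exported = [
        "numpy=1.21.2=py39h20f2e39_0",
        "pandas=1.3.3=py39h8c16a72_0",
    ]
    print([loosen_spec(s) for s in exported])
    print([loosen_spec(s, keep_version=True) for s in exported])
```

conda can also do something similar itself: `conda env export --from-history` lists only the packages you explicitly asked for.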

---

## A conda env per project

- create a conda environment for each new project
- put an `environment.yml` in each project folder (like `newe2`) with a `name` line reflecting the folder name
- `conda|mamba env create` in project folder (like `newe2`)
- if not per project, at least have one for each new class, or class of projects
- an environment for a class of projects may grow organically, but capture its requirements from time to time.
- for example, on my dual-gpu machine, I have 3 separate environments for pytorch, tensorflow, and jax, as they even had slightly different CUDA requirements.

See the [conda documentation on managing environments](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
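The convention that the `name` line should reflect the folder name can be checked mechanically. The sketch below is one possible check, not part of conda; the helper names are hypothetical, and it reads only the top-level `name:` line rather than parsing the YAML properly.

```python
from pathlib import Path
import tempfile


def read_env_name(env_file):
    """Return the value of the top-level `name:` line, or None if absent."""
    for line in Path(env_file).read_text().splitlines():
        if line.startswith("name:"):
            return line.split(":", 1)[1].strip()
    return None


def name_matches_folder(project_dir):
    """True when environment.yml's name equals the project folder's name."""
    project_dir = Path(project_dir)
    return read_env_name(project_dir / "environment.yml") == project_dir.name


if __name__ == "__main__":
    # Demonstrate on a throwaway folder laid out like `newe2`.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "newe2"
        project.mkdir()
        (project / "environment.yml").write_text(
            "name: newe2\nchannels:\n- conda-forge\ndependencies:\n- numpy\n"
        )
        print(name_matches_folder(project))
```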


---

## Mamba

- mamba is a faster reimplementation of the conda package manager that works alongside conda.
- use `mamba` to create and/or install
- use `conda` to activate/deactivate
- See [https://mamba.readthedocs.io/en/latest/user_guide/mamba.html](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html)
- on our binder environment I can issue `mamba env create`. It's faster. But then I must use `conda activate newe2` to start up the new environment.
- by default conda will use both default and conda-forge channels, mamba will use conda-forge. I found keras currently only installable with mamba.

---

## What we will do, exactly

1. `cd newe2`
2. `mamba env create`
3. `conda activate newe2`
4. run notebook `keras_perceptron.ipynb`
5. run python file `keras_perceptron.py`
6. `conda deactivate`
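Before running the notebook in step 4, it can help to confirm that the activated environment really provides the modules you expect. The sketch below is one way to do that with the standard library; `missing_modules` is a hypothetical helper, and the list of names is an assumption based on the `newe2` environment file.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that the current interpreter cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Use import names, not conda package names
    # (for example, scikit-learn is imported as `sklearn`).
    wanted = ["numpy", "pandas", "tensorflow", "keras"]
    missing = missing_modules(wanted)
    print("missing:", missing if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.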


---


```yaml
# file name: environment.yml

# Give your project an informative name
name: project-name

# Specify the conda channels that you wish to grab packages from, in order of priority.
channels:
- defaults
- conda-forge

# Specify the packages that you would like to install inside your environment.
# Version numbers are allowed, and conda will automatically use its dependency
# solver to ensure that all packages work with one another.
dependencies:
- python=3.7
- conda
- scipy
- numpy
- pandas
- scikit-learn

# There are some packages which are not conda-installable. You can put the pip dependencies here instead.
- pip:
    - tqdm  # for example only, tqdm is actually available by conda.
```

(from http://ericmjl.com/blog/2018/12/25/conda-hacks-for-data-science-efficiency/)

---

## More information

- https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/
- https://goodresearch.dev/setup.html, which is part of the excellent book at https://goodresearch.dev/index.html

---

## The Importance of Structure

- one might as well use the one-env-per-project or set-of-projects structure to organize work
- it is really important to organize your data science work well
- a good tool for this is `cookiecutter`, which sets up a template folder structure for you. Install by `pip install cookiecutter` in your base.
- you install a cookiecutter by running `cookiecutter <source>`, where `<source>` is the location of the template (for example a `gh:` GitHub shorthand).

---

## Two nice cookiecutters

- https://github.com/patrickmineault/true-neutral-cookiecutter
- Install via: `cookiecutter gh:patrickmineault/true-neutral-cookiecutter`
- https://drivendata.github.io/cookiecutter-data-science/
- Install via: `cookiecutter gh:drivendata/cookiecutter-data-science`

---

## True Neutral Cookiecutter

```
├── data
├── doc
├── results
├── scripts
├── src
│   └── __init__.py
├── tests
├── .gitignore
├── environment.yml
├── README.md
└── setup.py
```
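As a quick sanity check after generating a project, the sketch below compares a folder against the tree above. The entry lists are copied from that tree; `missing_entries` is a hypothetical helper and not part of cookiecutter.

```python
from pathlib import Path
import tempfile

# Entries taken from the tree above.
EXPECTED_DIRS = ["data", "doc", "results", "scripts", "src", "tests"]
EXPECTED_FILES = [".gitignore", "environment.yml", "README.md", "setup.py",
                  "src/__init__.py"]


def missing_entries(project_dir):
    """List expected folders and files that are absent from project_dir."""
    root = Path(project_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # An empty folder is missing every entry; a complete project gives [].
        print(missing_entries(tmp))
```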

---

## ACTIVITY 3: An example

- We do: `cookiecutter gh:patrickmineault/true-neutral-cookiecutter`
- name the project perceptron
- create the conda environment:
- `conda create --name perceptron; conda activate perceptron; mamba install ipykernel numpy tensorflow keras` or do `mamba env create` with an appropriate environment file
- `cd perceptron`
- then do `pip install -e .`, which makes the `src` directory importable from Python

---

## ACTIVITY 3: copy files over

In the perceptron folder:

- `cp ../perceptronfiles/perceptron.ipynb scripts/`
- `cp ../perceptronfiles/config.py src/`

Now run the notebook.

---

## Best practices for use

- now we can do development with both the notebook and files. In a notebook cell put the following to have the notebook automatically reload the file when it changes.

```
%load_ext autoreload
%autoreload 2
```
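The extension automates, before each cell runs, a re-import that can also be done by hand with `importlib.reload`. The sketch below shows only that manual step on a throwaway module; the module name `demo_config` and the `LEARNING_RATE` value are hypothetical, and autoreload itself does more than this (it also updates objects you have already imported).

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc caching so the quick rewrite below is always recompiled.
sys.dont_write_bytecode = True


def reload_demo():
    """Edit a throwaway module on disk and pick up the change via reload."""
    with tempfile.TemporaryDirectory() as tmp:
        module_file = Path(tmp) / "demo_config.py"  # hypothetical module
        module_file.write_text("LEARNING_RATE = 0.1\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            demo_config = importlib.import_module("demo_config")
            before = demo_config.LEARNING_RATE
            # Simulate editing the file in an editor.
            module_file.write_text("LEARNING_RATE = 0.01\n")
            importlib.reload(demo_config)
            after = demo_config.LEARNING_RATE
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_config", None)
    return before, after


if __name__ == "__main__":
    print(reload_demo())
```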

- refactor anything repeated multiple times into python files with functions in them. Notebooks should be very readable.
- output all intermediate files into `data` or `results` while writing your pipelines: files from train-test splits, parameter values and results, etc.
- future you will thank the current you.
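One lightweight way to follow the advice about intermediate files is to write each run's parameters and results to a small JSON file under `results`. The sketch below assumes that layout; `save_run`, the run name, and the parameter and metric keys are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path


def save_run(results_dir, run_name, params, metrics):
    """Write one run's parameters and results to <results_dir>/<run_name>.json."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    out_file = results_dir / f"{run_name}.json"
    payload = {"params": params, "metrics": metrics}
    out_file.write_text(json.dumps(payload, indent=2))
    return out_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder values, not real results.
        path = save_run(Path(tmp) / "results", "run-001",
                        params={"epochs": 5, "test_size": 0.2},
                        metrics={"accuracy": None})
        print(path.read_text())
```

Keeping parameters and results in the same file means a result can always be traced back to the settings that produced it.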