<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/006_environments_1_anaconda.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# environment management: `anaconda`

## wait, what class is this?

why are we talking about environments?

<br><div align="center"><img src="https://news-media.stanford.edu/wp-content/uploads/2016/11/10165436/environment_GettyImages-501231894.jpg" width="800px"></div>

in the computer science world, the phrase "environment" is often thrown around with slightly ambiguous meaning. in the broadest sense, it can be the "computing" environment or the "operating" environment -- the combination of hardware and software that a user interacts with; the whole enchilada.

in discussions about specific applications and for certain programming languages, it can be filtered down to the "runtime" environment -- the relevant aspects of the hardware (from that application's point of view) and the codebase which defines that application or language

generally speaking, when I talk about the **environment** I'm focusing on the software (literal files, on your computer's hard drive) that define how *something* behaves. for example...

## your `python` environment

your `python` environment is the tools and packages available to you for use within the `python` programming language, and the way those tools and packages behave. this is completely determined by the literal files defining the `python` language on your computer

### current system `python` environments

let's do a quick `python` version check:

on your `ec2` instance, what `python` version do you have installed?

```sh
python --version
```

```sh
# grrrrr.......
python3 --version
```

In [None]:
%%bash
python --version

different versions of `python` (and different versions of installed packages) correspond to different files defining the language's behavior and thus different levels of compatibility. personally, I think knowing that these files exist is among the more important pieces of information in my `python` learning.

***the way that the code you wrote behaves depends on these files***

recall that the `bash` command `which` will tell us the path of the executable that will actually be called when we type in a command

```sh
which python3
```

In [None]:
%%bash
which python3

your out-of-the-box `ec2` instances will likely return `/usr/bin/python3`. so when you type `python3` on the command line, you will actually call the executable file `/usr/bin/python3`.

the same sort of thing is going on for individual `python` modules we import. every module that is loaded from a file has a "dunder" attribute `__file__`, which holds the path to the file used to define that module:

In [None]:
import os
os.__file__

let's look at that file!

```sh
# for you, it is:
less /usr/lib/python3.6/os.py

# for me, right now, it'll be different -- hence the craziness below. sorry!
```

In [None]:
%%bash
OS_FILE=$(python3 -c "import os; print(os.__file__)")
cat "$OS_FILE"

if you change that file, or your friend (who is running your code) doesn't have that same file, the code that uses `os` will behave differently.

the same caveat goes for every file or environment variable used by your python process on any machine. this collection of files defines what is often called the "`python` environment", and it can be different on any system. `sudo apt install` could totally change it.

yikes!
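to make "this collection of files" a little more concrete, here is a minimal sketch that asks the running interpreter where its own pieces live. `describe_environment` is a name I made up for this illustration; everything it calls comes from the standard library.

```python
import sys
import sysconfig


def describe_environment():
    """Collect a few facts that identify the python environment in use."""
    paths = sysconfig.get_paths()
    return {
        # the interpreter binary that is running this code
        "executable": sys.executable,
        # the version of the language those files implement
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # the root directory the interpreter was installed under
        "prefix": sys.prefix,
        # the directory holding standard library files such as os.py
        "stdlib": paths["stdlib"],
        # the directory where third-party packages get installed
        "site_packages": paths["purelib"],
    }


for key, value in describe_environment().items():
    print(f"{key:>14}: {value}")
```

run this on two different machines (or, later, in two different environments) and the values that differ are a decent first guess at why the same code might behave differently.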

in the real world, the implication is immediate: if one of my programs only works with version 1.2 of a library, and another only works with version 2.1, and the `GOVERNMENT AGENCY NAME REDACTED` sysad just installed version 1.0 and *that* took two years, this will probably be a problem.

It would be nice if this problem were solved...

### virtual environments

on our `ec2` instances, there is only one `python`: the executable file located at `/usr/bin/python3`.

there is also only one `os.py` file, one `datetime.py` file, one `pickle.py` file, etc. the code that quite literally defines what you can do after you run

```python
import os
```

is in the `/usr/lib/python3.6` directory.

what else is in that directory?

for a vanilla installation of `python` on any machine -- like the installation on this `ec2` instance -- there is only one "environment": this collection of `python`-related files.

there is a fixed version in time that defines these files and packages, and no matter who you are or what you are doing on this `ec2`, `python` "looks" the same way.

that consistency is essential for all of the various `python` programs running on your current `ec2` instance at the root level to coordinate and work together. it's a good thing!

but what if you want to do things slightly differently?

what if you want to access a different version of `python` than the one installed?

what if you want to install a package that requires a different version than the version the rest of the system wants to use?

you *could* try and update the system's environment, using `sudo apt-get` or `pip` to install different versions... but that's potentially dangerous, because you're changing things for **everyone**, including the root user that runs and maintains the whole operating system. ideally you'd be prevented from breaking everything, but we are very powerful, after all, and I'm sure we can find a way

the solution to problems like this -- wanting to have a `python` environment that is different than the base `python` environment without changing that base `python` environment -- is a *virtual environment*.

"virtual environments" are ways of isolating out the contents (the files) of libraries you're installing.

this is something you've probably (*kind of*) done in `R` without knowing it. if you've ever tried installing a package but didn't have admin rights, the `R` interpreter prompts you to see if there's some other place you'd like to install things (usually in your home directory).

that is a system-level isolation of the files you want to install. When the interpreter is told to load a package, it looks first for your local copy to see if you have anything spicy, and then it checks for a global copy, and then it cries.

so, generalize that idea: let's make *multiple* separate environments (collections of files defining how our `python` code behaves).

we can generalize this beyond just "global" and "user" (as with `R`), even creating a separate environment for each process or code base.

on a very basic level, all we're doing here is re-installing packages into a special sub-directory somewhere on the machine, and then telling `python` (through environment variables like the `PATH` variable) where to look to find them.
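for modules specifically, the list `python` searches is `sys.path` (which is derived from where the interpreter lives, so it follows whichever `python` the `PATH` variable selected). here is a small sketch of the "look in my special directory first" idea. the module name `spicy_demo` and the temporary directory are both made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# a throwaway directory standing in for "a special sub-directory somewhere"
package_dir = tempfile.mkdtemp()

# `spicy_demo` is a made-up module name used only for this illustration
Path(package_dir, "spicy_demo.py").write_text("FLAVOR = 'local copy'\n")

# putting the directory at the front of sys.path means python searches it first
sys.path.insert(0, package_dir)
importlib.invalidate_caches()

import spicy_demo

print(spicy_demo.FLAVOR)
print(spicy_demo.__file__)
```

the printed `__file__` should point inside the temporary directory, which is the same trick a virtual environment plays at a much larger scale.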

instead of using the `python` at `/usr/bin/python3` and the module files at `/usr/lib/python3.6/*.py`, we create some special folder (say, `~/my-virtual-environment`) and install different versions of `python` or `python` packages in

```
~/my-virtual-environment/bin/python3.9001
~/my-virtual-environment/lib/python3.9001/*.py
```

then we edit the `PATH` variable to have `~/my-virtual-environment/bin` at the front, and now suddenly when we type

```sh
python3
```

our `bash` session finds our special `~/my-virtual-environment` files instead of the "regular" ones

we're tricking `python` into doing the right thing. and `python` is cool about it; once it realizes it's been tricked it's not even mad or anything, in fact it's laughing about it and really *you're* the mad one when you think about it.
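if you want to watch the `PATH` trick happen without touching your real `PATH`, here is a rough sketch using `shutil.which`, which performs the same kind of lookup as the `bash` command `which`. the command name `python-demo` and both directories are invented for the example, and it assumes a unix-like system.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def make_fake_executable(directory, name):
    """Create a tiny executable file called `name` inside `directory`."""
    path = Path(directory, name)
    path.write_text("#!/bin/sh\n")
    path.chmod(path.stat().st_mode | stat.S_IXUSR)
    return str(path)


# two throwaway directories: one plays the "system" bin, one the "virtual" bin
system_bin = tempfile.mkdtemp()
virtual_bin = tempfile.mkdtemp()

# `python-demo` is a made-up command name so we never shadow a real program
system_copy = make_fake_executable(system_bin, "python-demo")
virtual_copy = make_fake_executable(virtual_bin, "python-demo")

# with only the system directory on the search path, we find the system copy
before = shutil.which("python-demo", path=system_bin)

# put the virtual directory at the front, as we would with PATH
search_path = os.pathsep.join([virtual_bin, system_bin])
after = shutil.which("python-demo", path=search_path)

print("before:", before)
print("after: ", after)
```

nothing was deleted or overwritten: the "system" copy is still sitting there, it just isn't the first one found any more.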

once we have the ability to control the versions of the packages we use when we run `python`, it opens up some avenues for getting *really* specific about how the code we use is defined -- specific down to the version of the packages it imports.

finished `python` projects will often ship with a `requirements.txt` file, which lists each `python` package that should be installed and the exact version it was tested against. the expectation is that the project will be run on a system with those same packages and versions.

the "virtual environment" is an isolated set of packages that will meet that requirement.
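as a sketch of what "meets that requirement" could mean in code, the snippet below compares pinned `name==version` lines against what is installed in the current environment. it is a simplification: real `requirements.txt` files allow much richer syntax than `==` pins, the helper names are mine, and `no-such-package-demo` is a made-up package name.

```python
from importlib import metadata


def parse_requirement(line):
    """Split a pinned line such as 'pandas==1.1.1' into (name, version)."""
    name, _, version = line.strip().partition("==")
    return name, version


def check_requirements(lines):
    """Report, per pinned requirement, whether this environment satisfies it."""
    report = {}
    for line in lines:
        # skip blank lines and comments
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, wanted = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"have {installed}, want {wanted}"
    return report


# `no-such-package-demo` is a made-up name, so it should be reported missing
print(check_requirements(["pandas==1.1.1", "no-such-package-demo==0.0.1"]))
```

what gets reported for `pandas` depends entirely on the environment you run this in, which is rather the point.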

prior to the advent of `conda`, the primary way of creating a virtual environment was to use the python utility `virtualenv`, which is awesome and worth checking out.

that being said, however, it's not what I'll recommend.

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

## generalizing virtual environments: `conda`

`conda` is the package and environment manager that ships with `anaconda`, a *distribution* of `python`. it takes the virtual environment concept above and adds a special wrinkle: while most virtual environment managers only let you install different versions of `python` *packages*, `conda` also lets you install different versions of `python` *itself*.

this should help you deal with any `python2` vs. `python3` problems you may experience, as well as allow you to use the newer features in the `python` language even if the computer you are working on is stuck at `python3.5`
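one way code can protect itself from an interpreter that is too old is to check the version at runtime. this is only a sketch (the function name `meets_minimum` is mine), but it shows the comparison involved:

```python
import sys


def meets_minimum(required, actual=None):
    """Return True when the python version is at least `required`."""
    if actual is None:
        # default to the interpreter that is running this code
        actual = sys.version_info[:2]
    return tuple(actual) >= tuple(required)


# f-strings arrived in python 3.6, so code using them needs at least that
print(meets_minimum((3, 6)))

# pretend we are stuck on an older interpreter
print(meets_minimum((3, 6), actual=(3, 5)))
```

with an environment manager that can install `python` itself, a failed check like the second one becomes something you can fix rather than something you have to live with.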

so, let's go ahead and install `conda`, create a virtual environment, and install something.

*note: I would recommend you install `conda` (specifically `miniconda`) on both your laptop and your `ec2` instance, but we will **require** you to install it on your `ec2` instance (it's part of the homework), so you may want to use that instance to do all of this right now*

#### installing `conda`

the full `anaconda` distribution, by default, comes with many of the most commonly downloaded `python` packages. This is great because it gives you a pretty solid working base without any modification, *BUT* given our time and bandwidth limits, I'm going to recommend you install the `miniconda` version (the bare bones) and install packages *as needed* instead of up front.

+ [`anaconda`](https://www.continuum.io/downloads): a big installation, which will take a few minutes, and pre-installs several of the "must haves" (many of the above, and maybe more)
+ [`miniconda`](https://conda.io/miniconda.html): a bare-bones implementation of the above for the *discerning* gentleprogrammer

click on that `miniconda` link (https://conda.io/miniconda.html)

the decision of what to install depends on a few variables:

+ which operating system are you on?
    + `win`, `osx`, and `linux` are the options
+ which "architecture" does your machine have?
    + the options are `32` and `64`
    + this is the number of bits in memory addresses the processor can understand
    + basically, 32 bit processors can handle up to 4GB of RAM at one time, 64 bit processors up to 16 BILLION GB of RAM
+ which version of `python` do you want to use
    + options right now are `2.7`, `3.7`, and `3.8` (depending on your OS)
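to put numbers on the 32 vs. 64 bit bullet above: an address that is `n` bits wide can name `2**n` distinct bytes. the sketch below does that arithmetic in units of GiB (`2**30` bytes) and also reports the pointer width of whichever interpreter runs it. `addressable_gib` is a name I made up.

```python
import struct


def addressable_gib(bits):
    """Number of GiB (2**30 bytes) reachable with `bits`-bit byte addresses."""
    return 2 ** bits // 2 ** 30


print(addressable_gib(32))  # 4
print(addressable_gib(64))  # 17179869184

# size of a pointer in the running interpreter, in bits
print(struct.calcsize("P") * 8)
```

that second number is about 17 billion GiB, which is where the "16 BILLION" figure comes from (it is exactly 16 × 2³⁰).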

we will go with `linux`, `64` bit, and `python3.8`

**<div align="center">mini exercise: everyone installs `conda` on their `ec2`</div>**

on your `ec2` instance:

```sh
cd ~
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# when prompted, we do the following:
# press ENTER to read the license
#     press `d` to scroll *d*own
# yes: approve the license
# ENTER: we are okay with this location
# yes: run conda init so that your PATH *always* includes conda
```

then log out and back in and run

```sh
rm ~/Miniconda3-latest-Linux-x86_64.sh
conda update -n base -c defaults conda
```

recall that we previously called

```sh
which python3
```

and got `/usr/bin/python3`, and we also checked the file path to the `os` package (from within a `python` shell):

```python
import os
os.__file__
```

what do we get now, after installing `conda`?

**<div align="center">mini exercise: see how the `python` file paths have changed after `conda` was installed</div>**

+ on your `ec2` instance, from the `bash` prompt
    + run `which python3`
+ actually execute the command `python3`
+ from the `python` command prompt, run
    + `import os; os.__file__`
+ exit `python`
+ look at the contents of those directories on your machine and compare them to the directories we examined from the base environment

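*everything* the `conda` command creates or installs is put into one and only one directory. "uninstalling" `conda` is equivalent to simply deleting that directory (and, if you want to be tidy, removing the small block that `conda init` added to your `~/.bashrc`).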

take a step back and think about the **python environments** you have now:

1. our vanilla `ubuntu` `python` installation (came with the `ec2` instance)
1. this new `anaconda`-created environment
    + this environment is called the `anaconda` `base` environment

try the command

```sh
conda env list
```

why stop at only two environments?

we can use the `conda` command to *create* new environments as well. let's try that right now:

```sh
conda create -n l33tmode python=3
```

this will use `conda` to create a new environment named "`l33tmode`" with `python` version 3 installed.

`conda create` creates a new environment inside of a new folder under the `envs` sub-directory in that main `conda` directory, and installs all of our required packages there.

as the little dialog will state after you create the environment, you have to "activate" that environment if you want to use it. You have to do this any time you want to use a virtual environment.

what we're *actually* doing here is updating the `PATH` environment variable to "point" `python` to our newly created set of files. Now, when we wish to use `python`, we will be using our specialized, isolated versions

So let's do that:

```sh
conda activate l33tmode
```

This should have made our terminal prompt 10 times l33t3r.
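if you ever need to check from inside `python` which environment is active, here is a small sketch. it relies on my understanding that `conda activate` sets the `CONDA_DEFAULT_ENV` variable in addition to editing `PATH`; the helper names are made up.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the activated conda environment, or None."""
    if environ is None:
        environ = os.environ
    # CONDA_DEFAULT_ENV is set when a conda environment is activated
    return environ.get("CONDA_DEFAULT_ENV")


def front_of_path(environ=None):
    """Return the first directory on PATH, which is searched before the rest."""
    if environ is None:
        environ = os.environ
    return environ.get("PATH", "").split(os.pathsep)[0]


print("active conda env:  ", active_conda_env())
print("first PATH entry:  ", front_of_path())
print("interpreter prefix:", sys.prefix)
```

with `l33tmode` activated, I would expect the first `PATH` entry to be that environment's `bin` directory. outside of any `conda` environment the first function simply returns `None`.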

**<div align="center">mini exercise: see how the `l33tmode` environment file paths are different than the `base` conda environment file paths</div>**

+ same steps as above:
+ on your `ec2` instance, from the `bash` prompt
    + run `which python3`
+ actually execute the command `python3`
+ from the `python` command prompt, run
    + `import os; os.__file__`
+ exit `python`

how do the file paths you saw differ from the `base` file paths?

```sh
which python3
```

should return

```
/home/ubuntu/miniconda3/envs/l33tmode/bin/python3
```

and

```python
import os
os.__file__
```

should return

```
'/home/ubuntu/miniconda3/envs/l33tmode/lib/python3.8/os.py'
```
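one way to state "this file belongs to that environment" precisely is to check whether the file's path sits inside the environment's directory. a minimal sketch using the paths shown above (the function name `lives_under` is mine):

```python
from pathlib import PurePosixPath


def lives_under(file_path, prefix):
    """Return True when `file_path` is somewhere inside the `prefix` directory."""
    try:
        PurePosixPath(file_path).relative_to(PurePosixPath(prefix))
    except ValueError:
        # relative_to raises ValueError when file_path is not inside prefix
        return False
    return True


env_prefix = "/home/ubuntu/miniconda3/envs/l33tmode"
os_file = "/home/ubuntu/miniconda3/envs/l33tmode/lib/python3.8/os.py"

print(lives_under(os_file, env_prefix))                 # True
print(lives_under(os_file, "/usr"))                     # False
print(lives_under(os_file, "/home/ubuntu/miniconda3"))  # True
```

the last line is a reminder that named environments are stored *inside* the main `conda` directory, under `envs`.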

these environments are ours to do with as we wish -- we know that we can do anything to them and we won't break anything important on our system. so let's live dangerously! let's install something (or, two somethings) fun:

```sh
conda install ipython pandas
```

and then try it out. from your `bash` command line run

```sh
ipython
```

this should open a fancier python interpreter (`ipython`). inside, run

```python
import pandas as pd

pd.__version__
```

## installing packages with `pip`

within a `conda` environment, you should **always** try to install new packages with `conda` via `conda install [YOUR PACKAGE NAME HERE]`

that being said, not every package is available to install via `conda`. some packages are only installable with `pip` -- luckily for us, `pip install` works with our `conda` environments as well!

just as `conda` points our `bash` session to a special `python` executable:

```sh
which python
```

it also points us to a special `pip`, which will install its packages in our current environment:

```sh
which pip
```

**<div align="center">mini exercise: `pip` install a package into our `conda` environment</div>**

make sure you have `l33tmode` activated:

```sh
conda activate l33tmode
```

then run the commands

```sh
pip install requests
```

afterwards, open a `python` shell and run

```python
import requests
requests.__file__
exit()
```
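you can also ask `python` where a module *would* come from without importing it. the sketch below uses `importlib.util.find_spec` for that, and `sysconfig` to show where the running interpreter keeps third-party packages. `where_is` is a made-up helper name, and it only handles top-level module names.

```python
import importlib.util
import sysconfig


def where_is(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# where the running interpreter puts third-party packages
print(sysconfig.get_paths()["purelib"])

# a standard library package, always present
print(where_is("json"))

# prints a path only if `requests` is installed in this environment
print(where_is("requests"))
```

if `requests` was `pip install`ed with `l33tmode` active, I would expect its path to fall under that environment's `site-packages` directory.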

## freezing and sharing environments

one of the purposes of working with a `python` environment manager like `conda` was to enable us to install whatever we want, but the *reason* we wanted to be able to do that was so that we could make sure that no matter what computer we run our code on we have the same behavior

if we want to do that, we need to be able to

+ **specify** what our environment is when our code is working, and
+ **recreate** that environment in other places

`conda` can help us do both of these things easily.

recall back to when we chose which `miniconda` installer script to download: we needed to choose our operating system (`win`, `osx`, or `linux`) and our architecture (`32` or `64`). to share our environment, we need to answer one question first:

> does the computer where we want to re-create this environment have ***the same*** architecture and os? or are they ***different***?

### architecture and/or os are *the same*: specify and recreate with `conda list --explicit`

within the same OS and architecture, if every package we care about was `conda install`ed we can be incredibly explicit about what should be installed to re-create our *exact* environment, to the file: we use

```sh
conda list --explicit
```

it is common when running this command to write the output to a file `spec-file.txt` which is then used by other users on other (identical os / architecture) computers to create a matching environment

```sh
# create an environments txt file
conda list --explicit > spec-file.txt

# look at the contents
cat spec-file.txt
```

**<div align="center">mini exercise: creating a `spec-file.txt`</div>**

make sure you have `l33tmode` activated:

```sh
conda activate l33tmode
```

then run the commands

```sh
conda list --explicit > spec-file.txt
cat spec-file.txt
```

anyone on a `linux-64` machine can now create a new environment identical to yours by using this file:

```sh
conda create --name l33tmode-clone --file spec-file.txt
```
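to see why this file is tied to one platform, it helps to look at its shape: a few comment lines (one naming the platform), an `@EXPLICIT` marker, and then one download URL per package file. the sketch below parses a sample laid out the way I understand that output to look. the URLs and package names in `SAMPLE_SPEC` are invented placeholders, not real packages.

```python
# a made-up sample; real files list real package URLs for your platform
SAMPLE_SPEC = """\
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://example.invalid/pkgs/linux-64/somepkg-1.0-0.conda
https://example.invalid/pkgs/linux-64/otherpkg-2.3-1.tar.bz2
"""


def parse_spec_file(text):
    """Return (platform, package_file_names) from explicit spec-file text."""
    platform = None
    packages = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# platform:"):
            platform = line.split(":", 1)[1].strip()
        elif line and not line.startswith("#") and line != "@EXPLICIT":
            # keep only the file name at the end of the download URL
            packages.append(line.rsplit("/", 1)[-1])
    return platform, packages


print(parse_spec_file(SAMPLE_SPEC))
```

every entry is a specific built file for a specific platform, which is exactly why the recipe is so precise and also why it does not travel well.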

#### important caveat about `pip` installed packages

we mentioned above that you *can* install packages via `pip`. we installed `requests` this way (`pip install requests`).

`conda list --explicit` will ***not*** include any packages we installed with `pip install`. if you look at the contents of `spec-file.txt`, you will see that `requests` is not in it.

theoretically, if you want to have both install types, you *could* have all the `conda` packages installed this way, and then separately install the `pip` packages.

in practice, though, if this is your situation it is more common to use the next method (`conda env export`), which supports both `conda` and `pip` installed packages at the same time.

### architecture and/or os are *different*: specify and recreate with `conda env export`

when the architecture or the OS of the computer trying to recreate an environment is different, it can't install *exactly* the same files -- it has to install a version of those files that was built for *its* architecture and OS.

this means we don't need to know the exact file names, we just need to know the versions of the installed packages, and we can figure out what the file names are for our setup

we can use the `conda env export` command to list the packages installed in a given `conda` environment and possibly the versions at which they were installed.

there are two modes for running the command:

1. `conda env export`: lists *every* package installed and *every* version number
    + this includes the packages we explicitly installed, as well as ones installed because they were dependencies of those packages
1. `conda env export --from-history`: lists *only* the packages we installed explicitly, and only versions if we installed specific versions

**<div align="center">mini exercise: what `--from-history` means</div>**

make sure you have `l33tmode` activated:

```sh
conda activate l33tmode
```

then run the commands

```sh
conda env export
conda env export --from-history
```

what is the difference?

`conda env export --from-history` lists out the packages we chose to `conda install` (e.g. `pandas` and `ipython`).

`conda env export` (no flag) lists out

+ all of those packages *and* their current version numbers (e.g. `pandas=1.1.1` and `ipython=7.18.1`)
+ all the things that our particular OS and architecture needed to get `pandas` and `ipython` installed (e.g. `numpy=1.19.1` and `sqlite=3.33.0`)
+ all of the `pip install`ed packages (at the bottom, under a heading `- pip:`)

which one of these options you want to use depends on context:

+ if you want to share environments across different operating systems, you should probably use `--from-history`
    + if this is the case, you probably want to *explicitly* set versions when you run `conda install`
        + e.g. `conda install ipython=7.18.1 pandas=1.1.1`, so that the versions are in the `--from-history` output
    + you will lose any `pip install`ed packages this way
+ if you will be mostly within one OS, you may want the more explicit output of the flag-less command

**<div align="center">mini exercise: creating an `environment.yml` file</div>**

run the commands

```sh
# if you haven't already
conda activate l33tmode
conda env export > environment.yml
cat environment.yml
```

this `environment.yml` file can be sent to other users or re-used by you on future `ec2` instances to create a new but completely identical environment:

```sh
conda env create --name l33tmode-clone-2 --file environment.yml
```
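the `environment.yml` file is plain text, so it is easy to inspect. the sketch below separates the `conda` dependencies from the `pip` ones in a sample resembling the export output with build strings left off. it is a naive line scanner rather than a real YAML parser (a library such as PyYAML would be sturdier), and the `requests` version in the sample is only illustrative.

```python
# a hand-written sample; the requests version is illustrative only
SAMPLE_ENVIRONMENT = """\
name: l33tmode
channels:
  - defaults
dependencies:
  - ipython=7.18.1
  - numpy=1.19.1
  - pandas=1.1.1
  - pip:
    - requests==2.24.0
"""


def split_dependencies(text):
    """Return (conda_packages, pip_packages) listed in environment.yml text."""
    conda_packages, pip_packages = [], []
    section = None
    pip_indent = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw.startswith(" "):
            # a top-level key such as `name: ...` or `dependencies:`
            section = line[:-1] if line.endswith(":") else None
            pip_indent = None
            continue
        if section != "dependencies" or not line.startswith("- "):
            continue
        indent = len(raw) - len(raw.lstrip())
        item = line[2:].strip()
        if item == "pip:":
            # entries indented deeper than this line belong to pip
            pip_indent = indent
        elif pip_indent is not None and indent > pip_indent:
            pip_packages.append(item)
        else:
            pip_indent = None
            conda_packages.append(item)
    return conda_packages, pip_packages


print(split_dependencies(SAMPLE_ENVIRONMENT))
```

notice the two pin styles living side by side: a single `=` for `conda` packages and `==` for `pip` packages.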

### `conda env export` vs. `conda list --explicit`: basic differences

the differences between these two are minor but important:

| command | os- and arch-specific | includes exact versions | includes `pip` files |
|-|-|-|-|
| `conda list --explicit` | yes* | yes | no |
| `conda env export` | yes* | yes | yes |
| `conda env export --from-history` | no | no | no |

*: technically these may work on other os and arch, but it is not guaranteed

### specify and recreate with `pip freeze`

the `environment.yml` and `spec-file.txt` files you created above can be read by `conda`, but not by other `python` virtual environment or package managers. there is a format for specifying packages to install that is much more broadly recognized in the `python` world -- a `requirements.txt` file. this is the sort of file you could use to install all packages using the basic `pip` package installer, for example.

to create a `requirements.txt` file, you can simply execute

```sh
pip freeze > requirements.txt

# look at the contents
cat requirements.txt
```

you can use this on any system which has `pip` installed to install the listed packages into the active environment with

```sh
pip install -r requirements.txt
```

note that the packages installed this way *are not* `conda` packages
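for a feel of what `pip freeze` is reporting, here is a rough approximation built on the standard library's `importlib.metadata`. it is not how `pip` itself is implemented and the output can differ in details (editable installs, for example); `freeze_like` is a name I made up.

```python
from importlib import metadata


def freeze_like():
    """Return sorted 'name==version' strings for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # skip any distribution whose metadata is missing a name
        if name:
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze_like():
    print(line)
```

the list comes from whichever environment the running `python` belongs to, so activating a different environment changes the output.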

<div align="center"><img src="https://i.ytimg.com/vi/BX1EIlwtQvU/maxresdefault.jpg" width="800px"></div>

# END OF LECTURE

next lecture: [environment management pt. 2: `docker`](006_environments_2_docker.ipynb)