# Introduction

This tutorial explains how to set up a working data science environment with Anaconda and Docker. Don't worry if you don't know these names; by the end of the tutorial you will know at least something about both. 

If this is your first encounter with development environments for data science, you will come across many unfamiliar names and terms. For this tutorial you don't need to understand all the details. In fact, you are encouraged to forget about the details and to focus on a high-level understanding.  

**Part 1** We will show you a working data science environment. Playing around with this environment will teach you more than I could possibly convey with a thousand words. 

**Part 2** Here we take you through the steps of building your own data science environment.

### What is a data science environment?

By data science environment, I mean the collection of software that a data scientist uses to develop a product or to perform an analysis. A common setup includes at least three types of software:
- A programming language (Python and R are very popular)
- An IDE that allows for interactive programming (for example Jupyter notebooks and RStudio)
- A collection of libraries that is used for the analysis (the pandas package for Python, and dplyr for R)

On top of that, you may have other software applications:
- Databases (MongoDB, MariaDB)
- Deep learning (Theano, TensorFlow, Keras)
- Distributed systems (Hadoop, Spark)
- Backend for website, app or API (Node.js, Flask)
- Frontend (AngularJS, D3)
- Streaming frameworks (Kafka)
- Cloud solutions (Amazon, Azure, Google)


#### The reproducibility problem
Suppose I want to work with you on my project. Installing all these packages on your laptop is tedious. Moreover, it is very likely that you cannot exactly reproduce the setup that I was using. You may have a different operating system, or different versions of the software. It gets even more complicated when several software applications are configured to work together, as this typically requires manual work.

#### Containers - a solution to the reproducibility problem
A container is an isolated part of your computer that encapsulates an application together with all its dependencies. Programming languages, packages, your code, and even the operating system of the application can be packaged in a container. If you want to work on my project, I send you the project container, which you can start using immediately.

# Part 1: Spin up a data science environment
Let's check things in practice.

**Problem:** You want to try a typical data science environment on your own laptop. However, you have neither the time nor the expertise to install all the required software applications.  
**Solution:** Use a container provided by Anaconda. This container contains a ready-to-use data science environment.

**Prerequisites**
The application that we are going to use for managing and running containers is called Docker. To install Docker on your computer, please follow the [Docker installation instructions](https://www.docker.com/products/overview) for your operating system.

**Stepwise tutorial:**
After installing Docker successfully, you can follow the steps below inside the terminal of your operating system. 

First, let's check if Docker is present. In the terminal, type `docker` and press the enter key.
If Docker is present on your system, the output looks as follows:

In [1]:
docker

Usage: docker [OPTIONS] COMMAND [arg...]
       docker daemon [ --help | ... ]
       docker [ --help | -v | --version ]

A self-sufficient runtime for containers.

Options:

  --config=~/.docker              Location of client config files
  -D, --debug                     Enable debug mode
  -H, --host=[]                   Daemon socket(s) to connect to
  -h, --help                      Print usage
  -l, --log-level=info            Set the logging level
  --tls                           Use TLS; implied by --tlsverify
  --tlscacert=~/.docker/ca.pem    Trust certs signed only by this CA
  --tlscert=~/.docker/cert.pem    Path to TLS certificate file
  --tlskey=~/.docker/key.pem      Path to TLS key file
  --tlsverify                     Use TLS and verify the remote
  -v, --version                   Print version information and quit

Commands:
    attach    Attach to a running container
    build     Build an image from a Dockerfile
    commit    Create a new im
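
If you prefer a short yes/no answer over the full usage text, a check like the one below may help. It is a minimal sketch: `have_command` is a hypothetical helper name, not part of Docker, and it only tests whether an executable called `docker` can be found on your `PATH`, not whether the Docker daemon is actually running.

```shell
# Hypothetical helper: succeed if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command docker; then
    # Prints the installed client version.
    docker --version
else
    echo "docker not found: check the installation and your PATH"
fi
```

If the second message appears, the shell could not find Docker at all, which is the same situation that produces `bash: docker: command not found` when you type `docker` directly.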

Next, we need to download the Anaconda container image:

In [4]:
docker pull continuumio/miniconda

Using default tag: latest
latest: Pulling from continuumio/miniconda
Digest: sha256:1e4a6a3c71d7b7ad92cb600d9eb4746734ea1afca73b824786d857813f3fec59
Status: Image is up to date for continuumio/miniconda:latest


Once the container is downloaded, we spin it up with

In [None]:
docker run -p 8890:8890 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8890 --no-browser"

Now the Jupyter notebook is available on your local machine at http://localhost:8890/tree in your browser.
If you are using a Docker Machine VM (this may be the case if your laptop runs Windows), you need to use
`http://<DOCKER-MACHINE-IP>:8890`.
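
To avoid typing the address by hand, you can let the shell assemble it. The sketch below uses a hypothetical helper called `notebook_url`; the commented-out line additionally assumes that your Docker Machine VM is named `default`, which is common but may differ on your system.

```shell
# Hypothetical helper: print the Jupyter address for a given host and port.
notebook_url() {
    printf 'http://%s:%s/tree\n' "$1" "$2"
}

# Docker running natively: the published port is reachable on localhost.
notebook_url localhost 8890

# Docker Machine: ask for the VM's IP address instead
# (assumes a VM named "default").
# notebook_url "$(docker-machine ip default)" 8890
```

The port is the one published with `-p` in the `docker run` command, so if you change it there, change it here as well.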

What you see is an empty Jupyter tree. You can click "new" to create a new Python notebook or a text file, or to open a shell.

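What has just happened? 
* The container was started from the `continuumio/miniconda` image.
* Inside the container, `conda` installed Jupyter, the directory `/opt/notebooks` was created, and the Jupyter notebook server was started on port 8890.
* The option `-p 8890:8890` published port 8890 of the container on port 8890 of your machine, which is why your browser can reach the notebook.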

## Exercise 1: Get a sample notebook running
Now that you have your data science environment up, you are ready to play with IPython notebooks.
On the internet, you can find many interesting IPython notebooks. In this exercise, you are going to download a notebook and play with it. However, you will usually have to install some dependencies first. How to do this is explained below. 

1. Choose a notebook that you find interesting from https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#introductory-tutorials
2. Open the Jupyter interface and choose `new` and then `terminal`
3. Download the notebook that you chose from the terminal. Make sure that the file that you download has the extension `.ipynb`.  
This is an example:
```
cd /opt/notebooks
wget https://raw.githubusercontent.com/plotly/python-user-guide/master/s3_bubble-charts/s3_bubble-charts.ipynb
```
4. The notebook should appear in the Jupyter overview. Open it and evaluate the lines of code by pressing shift + enter
5. You will receive some errors, because not all dependencies are installed. In the above example, the Python package plotly is not present. `import plotly` results in the error
```
ImportError: No module named plotly
```
6. In order to install plotly, go back to the terminal and type
```
pip install plotly
```
Now try to re-run the import command.
7. Work your way through the notebook, and install all dependencies that are needed.
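
When a notebook has several missing dependencies, the check-then-install routine from steps 5 and 6 can be wrapped in a small shell function. This is only a sketch: `ensure_module` is a hypothetical helper, and it assumes that the name you import and the name you pass to `pip` are identical, which holds for plotly but not for every package.

```shell
# Hypothetical helper: install a Python package only if it cannot be imported yet.
# Assumes the import name and the pip package name are the same.
ensure_module() {
    if python -c "import $1" 2>/dev/null; then
        echo "$1 is already installed"
    else
        pip install "$1"
    fi
}

# Example, to be run in the Jupyter terminal:
#   ensure_module plotly
```

After defining the function in the Jupyter terminal, call it once for each module that an `ImportError` complains about.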


#### Beware
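Once you close your Docker session, your work is easily lost. The changes you made exist only in this one container instance: every `docker run` starts a fresh container from the original image, without your notebooks and installed packages. In Part 2, we are going to save your work by creating a new container image.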

# Part 2: Create your own data science container
In Part 1, you started out with a standard Anaconda container. You downloaded a sample notebook, and you installed all its dependencies. 

You changed the state of the container, but these changes are not part of the image it was started from. A new container started from the original image will not contain them. We can save your work by creating a new container image, based on the current (changed) state of your Anaconda container. To do so, first list the running containers:

In [None]:
docker ps
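
`docker ps` prints a table of all running containers. If you only want the names of the containers that were started from the miniconda image, a filter along the following lines could be used. The helper name `containers_from_image` is hypothetical; the `--filter ancestor=...` and `--format` options belong to `docker ps` itself.

```shell
# Hypothetical helper: print the names of running containers
# that were started from the given image.
containers_from_image() {
    docker ps --filter "ancestor=$1" --format '{{.Names}}'
}

# Example:
#   containers_from_image continuumio/miniconda
```

The name that this prints is the one to use in the rename step.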

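In my case, the name of the container instance is lonely_swartz. Docker assigns these names at random, so look up yours in the NAMES column and use it instead. Let's rename it to plotly: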

In [None]:
docker rename lonely_swartz plotly 

Next, save the current state of the container as an image that is also called plotly:

In [None]:
docker commit -m "Added plotly notebook" plotly plotly

The plotly image should now appear in your list of local images:

In [None]:
docker images

You can spin up the image with

In [None]:
docker run -p 8891:8891 plotly /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8891 --no-browser

## Ideas for further learning
For these exercises you can use the Docker documentation, `man docker`, and a search engine.

1. Use `docker build` to build your container from a script called `Dockerfile`
2. Create a git repository with the following files
    a. readme.md that explains what is inside this docker container
    b. Dockerfile
3. Add an `ENTRYPOINT` to your Dockerfile
4. Explore and try more Docker images at [Docker Hub](https://hub.docker.com/)
5. Create an account on Docker Hub
6. Push your Docker container to your Docker Hub account
7. Figure out how to share files on the host system with the container
8. Correct and extend this notebook, and push changes to GitHub.
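
As a possible starting point for the first exercise, the shell function below writes a small `Dockerfile` that repeats the manual steps of this tutorial. Treat it as a sketch under assumptions rather than a finished solution: `write_dockerfile` is a hypothetical helper name, the Jupyter flags are copied from the `docker run` command in Part 1, and adding an `ENTRYPOINT` is still left to you.

```shell
# Hypothetical helper: write a minimal Dockerfile into the given directory.
write_dockerfile() {
    cat > "$1/Dockerfile" <<'EOF'
# Start from the same base image that Part 1 used.
FROM continuumio/miniconda

# Install Jupyter and plotly, and create the notebook directory.
RUN /opt/conda/bin/conda install jupyter -y --quiet \
    && /opt/conda/bin/pip install plotly \
    && mkdir /opt/notebooks

# Document the port that the notebook server listens on.
EXPOSE 8890

# Start the notebook server when the container starts.
CMD ["/opt/conda/bin/jupyter", "notebook", "--notebook-dir=/opt/notebooks", "--ip=*", "--port=8890", "--no-browser"]
EOF
}

# Example:
#   mkdir my-image && write_dockerfile my-image
#   docker build -t plotly my-image
#   docker run -p 8890:8890 plotly
```

Compared with `docker commit`, a `Dockerfile` records how the image was made, so somebody else can rebuild it from the git repository of exercise 2.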