# Docker and Python Packaging

<img src="https://miro.medium.com/max/504/1*iBGlEPUruUqqT5NreeEF8g.png" width=200>

## Motivation

In the previous section, we saw how to spin up a cluster using Dask. Using a Docker image spins up the workers with a consistent environment so that distributed flows run reliably. But even beyond the use case with Dask, Docker is an integral part of data workflows. Data professionals often hear phrases like "it worked on my machine" or "nothing changed, but my workflow stopped running."

Docker allows us to build images and run them in other execution environments. An image contains all of the dependencies needed for our application, while a container is an instance of the image. By building a fixed image, we can pin the dependencies so that they stay fixed from run to run. 

## When to use Docker?

Data scientists or data engineers might already be familiar with virtual environments. Virtual environment managers such as `poetry`, `pipenv`, and `conda` allow us to fix package versions. What are the use cases then that call for Docker?

1. It lets us run multiple containers with different Python versions.
2. It lets us pin non-Python dependencies, such as Java for Spark.
3. It is the unit for spinning up clusters and jobs (Kubernetes or Dask).
4. It can host legacy applications built on older technology (some Prefect users run containers with specific scientific computing libraries).
5. It encourages reproducible work.
6. It eases the transition from development to deployment (some services like AWS ECS and Google Vertex need containers to run).

## Docker Architecture

The Aqua Security [documentation](https://www.aquasec.com/cloud-native-academy/docker-container/docker-architecture/) has a very good diagram about the Docker architecture. There are three main parts:

* Client - this can be the Docker client or Python client
* Daemon - the daemon is what orchestrates containers and images
* Registry - used to store images for downloading in other places.

![img](docker_architecture.png)

## Sample Project

Alongside this notebook, there is a folder named `docker_with_custom_module`. It represents a Prefect flow that uses custom modules defined by the user. Packaging custom code as a Python module allows us to import it from any directory within the container, as well as reuse it in other projects more easily. Using this folder, we will build a custom Docker image to support our flow. Below is the directory structure.

```
docker_with_custom_module/
├── components/
│   ├── __init__.py
│   ├── componentA.py
│   ├── componentB.py
├── workflow/
│   ├── custom_flow.py
├── requirements.txt
├── Dockerfile
└── setup.py
```

Here is a brief description of each item:

* components - contains the custom code that will be used in multiple flows
* workflow - contains the Prefect flow
* requirements.txt - dependencies of the project
* Dockerfile - the instructions to package this folder into a Docker image
* setup.py - `pip` looks at this file for instructions on how to install the module

### Quick look at the Python code

The components are very simple Python classes. We just want to create something we can import for the main flow. Below is `componentA.py`. `componentB.py` is very similar.

```python
class ComponentA:
    def __init__(self, n=2) -> None:
        self.n = n
```
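Since `componentB.py` is not shown, the sketch below is only a guess at its shape, assuming it mirrors `ComponentA` exactly. It also shows the arithmetic that the flow later performs with the two components.

```python
class ComponentA:
    def __init__(self, n=2) -> None:
        self.n = n


# Hypothetical ComponentB: assumed to mirror ComponentA, since the
# original file is only described as "very similar".
class ComponentB:
    def __init__(self, n=2) -> None:
        self.n = n


# Combine the two components the same way the flow's task does.
total = ComponentA(2).n + ComponentB(2).n
print(f"Test {total}!")  # prints "Test 4!"
```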

From there, we can take a look at the `custom_flow.py` in the `workflow` folder. This just imports the components and uses them inside a task.

```python
from prefect import flow, task

from components.componentA import ComponentA 
from components.componentB import ComponentB

@task
def custom_task():
    x = ComponentA(2)
    y = ComponentB(2)
    _sum = x.n + y.n
    print(f"Test {_sum}!")  # Prints "Test 4!"
    return _sum

@flow
def custom_flow():
    custom_task()

# Run the flow when this file is executed as a script
if __name__ == "__main__":
    custom_flow()
```

### Setup.py

In order to package the `components` as a module, we need to add the `__init__.py` file inside the folder. The package name and version are used by pip to keep track of the package, but they don’t affect how the package is used in Python code. The `find_packages()` function call goes through the subdirectories with an `__init__.py` and includes them in `mypackage`. Notice this file takes care of installing the requirements. The `setup()` function is the one `pip` looks for in order to install the library.

```python
from setuptools import setup, find_packages

with open('requirements.txt') as f:
    requirements = f.read().splitlines()

setup(
    name="mypackage",
    version='0.1',
    packages=find_packages(),
    install_requires=requirements
)
```
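A plain `splitlines()` passes every line of `requirements.txt` through, including blank lines and comments. As a minimal sketch, a small helper can filter those out before handing the list to `install_requires`. The helper name `read_requirements` and the sample file contents are hypothetical, and the sketch does not handle pip options such as `-r other.txt`, which are not requirement specifiers.

```python
from pathlib import Path
import tempfile


def read_requirements(path):
    """Return requirement specifiers, skipping blank lines and comment lines."""
    requirements = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        requirements.append(line)
    return requirements


# Demonstrate on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    req_file = Path(tmp) / "requirements.txt"
    req_file.write_text("# core dependencies\nprefect\n\npandas>=1.4\n")
    parsed = read_requirements(req_file)

print(parsed)  # ['prefect', 'pandas>=1.4']
```

In `setup.py`, the result would be passed as `install_requires=read_requirements("requirements.txt")`.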

### Installing the custom module

With this file written, we can now install the library in editable mode,

```
pip install -e .
```

and this lets us import `components` from other directories because the Python path can resolve it.
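One way to confirm that the install worked is to ask Python whether it can locate the module without actually importing it. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `is_importable` is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the top-level module on its import path."""
    return importlib.util.find_spec(module_name) is not None


# A standard-library module is always found.
print(is_importable("json"))

# True only after `pip install -e .` (or `pip install .`) has been run.
print(is_importable("components"))
```

Running this from a directory outside the project is the more meaningful check, since the current directory is otherwise on the import path.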

## Building the Docker image

Now that we have the custom module installed, we want to create the Docker image so that we can run it in other execution environments. 

```Dockerfile
FROM prefecthq/prefect:latest

WORKDIR /app

ADD . .

RUN pip install .
```

That is all we need. Below is the explanation for each line:

1. `FROM`: the base image that our image builds on. Projects such as Spark and Dask publish their own base images, which are especially useful when a tool is not confined to Python. For example, Spark also needs Java in the container, and using the base Spark image takes care of that.
2. `WORKDIR`: sets the working directory for the container. It will be created if it doesn't exist.
3. `ADD`: copies all of the files from the current directory into the container's `WORKDIR`.
4. `RUN`: installs our library (`-e` is not needed here, as it is meant for development). This also installs all of the requirements because of the way we structured `setup.py` earlier.

In order to build the image, we can run a command like the following:

```
docker build . -t test:latest
```

where `test` is the image name and `latest` is the image tag.
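To make the name and tag split concrete, here is a rough sketch of how an image reference breaks apart. The function `split_image_ref` is hypothetical and deliberately simplified: it ignores digest references (`@sha256:...`) and only relies on the fact that Docker assumes the `latest` tag when none is given.

```python
def split_image_ref(ref):
    """Split an image reference into (name, tag), defaulting the tag to 'latest'."""
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    # A colon only marks a tag if it comes after the last slash;
    # otherwise it belongs to a registry host such as localhost:5000.
    if last_colon > last_slash:
        return ref[:last_colon], ref[last_colon + 1:]
    return ref, "latest"


print(split_image_ref("test:latest"))         # ('test', 'latest')
print(split_image_ref("test"))                # ('test', 'latest')
print(split_image_ref("localhost:5000/img"))  # ('localhost:5000/img', 'latest')
```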

### Using the image

In order to check everything is good, we can run the image interactively,

```
docker run --name containername -i -t test:latest sh
```

and from there we should be in the app directory and we can run our flow with:

```
python workflow/custom_flow.py
```

## Upload to a Registry

Now that the image has been created, you can push it to your registry (Docker Hub, AWS ECR, etc.) using the `docker push` command. Each registry has its own authentication and naming steps, but we'll cover Docker Hub, which is the default registry for the Docker CLI.

### Auth
We need to authenticate our CLI session with the image registry before we can push to it.

```console
docker login
```

### Build and Tag
```console
docker build . --tag zzstoatzz/prefect-imgs:dev
```

Here `zzstoatzz` is the Docker Hub account, `prefect-imgs` is the repository, and `dev` is the tag. Once the image is built and tagged, `docker push zzstoatzz/prefect-imgs:dev` uploads it to the registry.

## Using the image for a Flow

In order to use the image for a flow, we can create a deployment that uses `DockerFlowRunner` with the image that we just uploaded.

```python
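from prefect.deployments import DeploymentSpec
from prefect.flow_runners import DockerFlowRunner

# `custom_flow` is the flow defined in workflow/custom_flow.py
DeploymentSpec(
    name="docker-example",
    flow=custom_flow,
    # Replace "repo/image" with the image pushed to the registry
    flow_runner=DockerFlowRunner(image="repo/image"),
)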
```

In the next section, we'll look at advanced patterns for workflow orchestration before doing an end-to-end example.