# Controlboard for Data Analytics-Stacks
## Table of Contents
* [Jupyter-Datascience-Notebook](#Jupyter-Datascience-Notebook)    
* [Very helpful: Manipulate your Docker environment](#Manipulate-your-Docker-environment)

Available stacks to plug into Jupyter:
* [Elastic Stack (formerly ELK-Stack)](#Elastic-Stack-(formerly-ELK-Stack))
* [PostgreSQL Database, using SQLAlchemy](#PostgreSQL-Database-using-SQLAlchemy)
    - [Manage the Stack](#Manage-the-Stack)
    - [Connect to Postgres (or any other database!)](#Connect-to-Postgres-(or-any-other-DB))
    - [Read data from the DB into a Pandas Dataframe](#Read-data-into-a-Pandas-Dataframe)
    - [Create an entity diagram to understand the DB schema](#Create-an-entity-diagram-for-the-database)
* [MySQL Database, using SQLAlchemy](#MySQL)
* [Neo4j](#Neo4j)


***
## Jupyter Datascience-Notebook
#### Summary

Your "standard" [Jupyter Notebook for Datascience](https://github.com/jupyter/docker-stacks/tree/master/datascience-notebook) plus some additional libraries and the Jupyterlab extensions
* [jupyterlab-git: git and GitHub integration](https://github.com/jupyterlab/jupyterlab-git)
* [jupyterlab-lsp: Code completion, function definition look-up and more](https://github.com/krassowski/jupyterlab-lsp)
* [Code debugger](https://github.com/jupyterlab/debugger)

#### Set some variables first - be sure to run this code!
Make sure you have edited the file `environment.env.EXAMPLE` to suit your project:
* `COMPOSE_PROJECT_NAME`: name of this project. Will show up in all container names associated with this project. No spaces or special characters allowed
* `DATALAB_SOURCECODE_DIR`: your Windows directory containing all your source code - including this datalab! Will appear as `/home/jovyan/work` in the Jupyter Notebook
* `DATALAB_DATA_DIR`: your Windows directory containing all data. Will be mounted as `/home/jovyan/data` in the Notebook

Save the adapted file as a new file `environment.env`. **Only data in either directory will survive the destruction of the Jupyter Notebook container!**

Then run this code:

In [5]:
from IPython.display import Markdown as md
import time

for line in open('environment.env').readlines():
    if not line.strip() or line.strip().startswith('#'):
        continue
    variable, value = line.strip().split('=', 1)
    if variable.endswith('_DIR') or variable.endswith('_PATH'):
        # Need to convert Windows paths to the Docker VM running linux
        value = value.replace('\\', '/')
        if ':' in value:
            # Absolute path
            value = '//' + value.replace(':', '')
    %env $variable=$value

# Grab port for later
jupyter_port = ! echo $DATALAB_JUPYTER_PORT
jupyter_port = int(jupyter_port[0])

env: COMPOSE_PROJECT_NAME=petproj
env: DATALAB_SOURCECODE_DIR=//C/Users/awk02119/Documents/GitHub/petproj
env: DATALAB_DATA_DIR=//C/Users/awk02119/Documents/data
env: DATALAB_CONTROLBOARD_PORT=12334
env: DATALAB_JUPYTER_PORT=8888
env: DATALAB_POSTGRES_PORT=5432
env: DATALAB_MYSQL_PORT=3306
env: DATALAB_ELK_ELASTICSEARCH_PORT1=9200
env: DATALAB_ELK_ELASTICSEARCH_PORT2=9300
env: DATALAB_ELK_LOGSTASH_PORT1=5000
env: DATALAB_ELK_LOGSTASH_PORT2=9600
env: DATALAB_ELK_ELASTICSEARCH_KIBANA=5601
env: DATALAB_JUPYTER_COMPOSE_PATH=./jupyter/docker-compose.yml
env: DATALAB_ELK_PATH=./elk/docker-compose.yml
env: DATALAB_POSTGRES_PATH=./postgres/docker-compose.yml
env: DATALAB_MYSQL_PATH=./mysql/docker-compose.yml
env: DATALAB_NEO4J_PATH=./neo4j/docker-compose.yml
env: DATALAB_DOCKER_NETWORK=datalab-network
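
The path handling in the loop above can be tried in isolation. The following is a minimal sketch of the same conversion as a plain function; the name `to_docker_path` and the sample path are hypothetical and not part of the datalab.

```python
def to_docker_path(value: str) -> str:
    """Convert a Windows path into the form shown in the output above."""
    # Backslashes become forward slashes, as in the loop above
    value = value.replace('\\', '/')
    if ':' in value:
        # Absolute path with a drive letter: drop the colon and prefix '//'
        value = '//' + value.replace(':', '')
    return value


# Hypothetical sample paths for illustration
print(to_docker_path(r'C:\Users\me\Documents\data'))
print(to_docker_path('./jupyter/docker-compose.yml'))
```

Relative paths without a drive letter, such as the `*_PATH` entries, pass through unchanged apart from the slash direction.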


#### Start the container
Just run this code now:

In [6]:
! sudo -E COMPOSE_FILE=$DATALAB_JUPYTER_COMPOSE_PATH docker-compose up -d

# Check the log file for the URL including a unique token. Display the "correct" URL with a potentially different port
! echo && echo Waiting for 5 seconds for the container to spin up && echo
time.sleep(5)
log = ! sudo -E COMPOSE_FILE=$DATALAB_JUPYTER_COMPOSE_PATH docker-compose logs jupyter
url = 'http://127.0.0.1:8888'
for line in log:
    if url in line:
        break
else:
    print(log)
    raise RuntimeError('Did not find URL in the log above')
url = url[:-4] + str(jupyter_port) + line.split(url, 1)[1]
md(f"**Your Jupyterlab URL is {url}**")

Creating petproj_jupyter_1 ... done
Waiting for 5 seconds for the container to spin up



**Your Jupyterlab URL is http://127.0.0.1:8888/?token=47f596326f066fbfc3ddf20701899051cdb52ba6daff8215**
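
The URL lookup in the cell above can also be written as a small function, which makes it easier to test without a running container. This is only a sketch: the function name `find_lab_url` and the sample log lines are hypothetical, and it assumes, like the original cell, that the log prints the URL with the four-digit default port 8888.

```python
def find_lab_url(log_lines, port, default_url='http://127.0.0.1:8888'):
    """Return the tokenized URL from the log, with the port swapped for `port`."""
    for line in log_lines:
        if default_url in line:
            # Everything after the default URL is the path and token
            suffix = line.split(default_url, 1)[1]
            # Drop the trailing four-digit port and append the mapped port
            return default_url[:-4] + str(port) + suffix
    raise RuntimeError('Did not find URL in the log')


# Hypothetical log excerpt; the token below is a made-up placeholder
sample_log = [
    'jupyter_1  | Jupyter is starting',
    'jupyter_1  |  or http://127.0.0.1:8888/?token=abc123',
]
print(find_lab_url(sample_log, 9999))
```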

#### Stopping and cleaning up
Stop the Jupyter Notebook container. The container won't be deleted.

In [3]:
! sudo -E COMPOSE_FILE=$DATALAB_JUPYTER_COMPOSE_PATH docker-compose stop

Stopping petproj_jupyter_1 ... done

Remove the container. It will be recreated automatically the next time you start the stack.

In [4]:
! sudo -E COMPOSE_FILE=$DATALAB_JUPYTER_COMPOSE_PATH docker-compose down

Removing petproj_jupyter_1 ... 
Network datalab-network is external, skipping


#### How to save your entire computational context if you installed additional packages
You might change your Docker container by installing new **PIP** Python packages, e.g. with `pip install <package name>`. These changes will be lost when the container is removed. To quickly save your entire pip environment, including all packages, copy-paste the following into your notebook:

In [None]:
! pip freeze > /home/jovyan/work/pip-environment.txt

To load your environment again from scratch, e.g. if you re-created your Docker container:

In [None]:
! pip install -r /home/jovyan/work/pip-environment.txt

If you installed additional Python packages with **Anaconda**, `conda install <package name>`, here's how to save the entire conda environment:

In [None]:
! conda env export -n base > /home/jovyan/work/anaconda-environment.yml

To re-install all Anaconda packages from this file, do:

In [None]:
! conda env update --name base --file /home/jovyan/work/anaconda-environment.yml

***
## Elastic Stack (formerly ELK-Stack)

#### Summary

Elasticsearch, Kibana, Beats, and Logstash. Take data from any source, in any format, then search, analyze, and visualize it in real time.

* **Elasticsearch** is a distributed, RESTful search and analytics engine. As the heart of the Elastic Stack, it centrally stores your data for lightning fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.
* **Kibana** lets you visualize your Elasticsearch data and navigate the Elastic Stack so you can do anything from tracking query load to understanding the way requests flow through your apps.
* **Logstash** is a server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite "stash."
* **Beats** is the platform for single-purpose data shippers. They send data from hundreds or thousands of machines and systems to Logstash.

Note that Beats (e.g. Metricbeat or Filebeat) are not included in this stack.

#### Connections once the stack has been started
* Direct Kibana browser access: [http://localhost:5601](http://localhost:5601)
* Elasticsearch access for Windows: [http://localhost:9200](http://localhost:9200)
* Access Elasticsearch from **within** a Jupyter container: [http://elasticsearch:9200](http://elasticsearch:9200)
* Logstash access from **within** a Jupyter container: [http://logstash:9600](http://logstash:9600)
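
To check whether Elasticsearch is reachable on one of the addresses above, a sketch like the following may help. It uses only the standard library and Elasticsearch's `_cluster/health` endpoint; the function names are hypothetical, and the defaults assume you are calling from within a Jupyter container.

```python
import json
import urllib.error
import urllib.request


def cluster_health_url(host='elasticsearch', port=9200):
    """Build the URL of Elasticsearch's cluster health endpoint."""
    return f'http://{host}:{port}/_cluster/health'


def fetch_cluster_health(url, timeout=5):
    """Return the parsed JSON response, or None if the stack is not reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError):
        # Stack not started yet, still booting, or wrong host/port
        return None
```

Calling `fetch_cluster_health(cluster_health_url())` from inside Jupyter should return a dictionary once the stack is up, and `None` while it is still starting. From Windows, pass `host='localhost'` instead.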

#### Create a volume to persist all data

In [None]:
! sudo docker volume create --name=elasticsearch_data

#### Start the stack
Once pull has completed and containers are running, startup might take 1-2 minutes!

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_ELK_PATH docker-compose up -d

#### Stop and remove the stack (Elasticsearch and Kibana data will be retained)

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_ELK_PATH docker-compose down

#### Delete all Elasticsearch and Kibana data

In [None]:
! sudo docker volume rm elasticsearch_data

***
## PostgreSQL Database using SQLAlchemy

* [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL) is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
* [SQLAlchemy](https://www.sqlalchemy.org/) is a GREAT Python wrapper to talk to almost any database.


Check out the [Database getting started Jupyter notebook](database_getting_started.ipynb) for code snippets!
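
As a quick orientation, SQLAlchemy connects through a database URL. The sketch below builds such a URL with the standard library only. The user, password, database name and the host name `postgres` are hypothetical placeholders; check the compose file for the real values. Port 5432 is the `DATALAB_POSTGRES_PORT` default.

```python
from urllib.parse import quote_plus


def database_url(dialect, user, password, host, port, database):
    """Build a SQLAlchemy-style URL: dialect://user:password@host:port/database."""
    # Quote the password so characters such as '@' or '/' do not break the URL
    return f'{dialect}://{user}:{quote_plus(password)}@{host}:{port}/{database}'


# All credentials and names below are hypothetical placeholders
url = database_url('postgresql', 'postgres', 'p@ss/word', 'postgres', 5432, 'postgres')
print(url)
# With SQLAlchemy and pandas installed, the URL could then be used like this:
# engine = sqlalchemy.create_engine(url)
# df = pandas.read_sql('SELECT 1 AS answer', engine)
```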

### Manage the Stack
Create a volume to persist all data

In [None]:
! sudo docker volume create postgres_data

Start the stack

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_POSTGRES_PATH docker-compose up -d

Stop and remove the stack (database will be retained)

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_POSTGRES_PATH docker-compose down

Delete the actual database and thus all PostgreSQL data

In [None]:
! sudo docker volume rm postgres_data

***
## MySQL
* [MySQL](https://www.mysql.com) is another popular database.
* [SQLAlchemy](https://www.sqlalchemy.org/) is a GREAT Python wrapper to talk to almost any database.

Check out the [Database getting started Jupyter notebook](database_getting_started.ipynb) for code snippets!

### Manage the Stack
Create a volume to persist all data

In [None]:
! sudo docker volume create mysql_data

Start the stack

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_MYSQL_PATH docker-compose up -d

Stop and remove the stack (database will be retained)

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_MYSQL_PATH docker-compose down

Delete the actual database and thus all MySQL data

In [None]:
! sudo docker volume rm mysql_data

***
## Neo4j
[Neo4j](https://neo4j.com/) is the leading graph database platform. The two plugins [APOC](https://neo4j.com/developer/neo4j-apoc/) and [Graph Data Science](https://neo4j.com/docs/graph-data-science/current/) are included in the stack. All data is saved into a new directory `neo4j` in your `DATALAB_DATA_DIR`.
* Neo4j web GUI: http://localhost:7474
* Bolt access: bolt://localhost:7687

Neo4j features powerful plugins. You probably want to download [Awesome Procedures APOC](https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases) and/or the [Graph Data Science Library](https://github.com/neo4j/graph-data-science/releases). Simply save the `*.jar` file into the folder `./datalab-stacks/neo4j/plugins` **BEFORE** you start the container.

### Manage the Stack
Start the stack. Note: we assume that you saved the entire datalab in a subfolder `datalab` of your `DATALAB_SOURCECODE_DIR` for plugins to work.

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_NEO4J_PATH docker-compose up -d

Stop and remove the stack (database will be retained)

In [None]:
! sudo -E COMPOSE_FILE=$DATALAB_NEO4J_PATH docker-compose down

***
## Manipulate your Docker environment

Show all running and stopped Docker containers

In [None]:
! sudo docker ps -a

Show all Docker images including their filesizes

In [None]:
! sudo docker images

Show all volumes (i.e. the Docker-managed data volumes used when you choose not to mount a Windows directory, for example):

In [None]:
! sudo docker volume ls

In desperate need to figure out what's eating up your disk space? This command shows where Docker is using disk space:

In [None]:
! sudo docker system df -v

#### Manipulate a container
Set a container name (or CONTAINER ID) first

In [None]:
container = "jupyter"

Stop the container

In [None]:
! sudo docker stop $container

Save the container's logs to the Python variable `logoutput`

In [None]:
logoutput = ! sudo docker logs $container
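
The `!` capture returns the log as a list of strings, so ordinary Python can search it. The following is a minimal sketch for picking out lines of interest; the helper name `matching_lines` and the sample lines are hypothetical, and real log output will look different.

```python
def matching_lines(lines, *needles):
    """Return the lines that contain any of the given substrings, ignoring case."""
    lowered = [needle.lower() for needle in needles]
    return [line for line in lines if any(n in line.lower() for n in lowered)]


# Hypothetical stand-in for `logoutput`; real log lines will look different
sample_logoutput = [
    'Server started',
    'WARNING: config file not found',
    'Error: port already in use',
]
for line in matching_lines(sample_logoutput, 'error', 'warning'):
    print(line)
```

In the notebook, you would pass `logoutput` instead of the sample list.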

Restart an existing (currently stopped) container

In [None]:
! sudo docker start $container

Remove the container completely

In [None]:
! sudo docker rm $container

#### Cleaning up and freeing disk space
Remove an image (give either its name or IMAGE ID)

In [None]:
image = "test"
! sudo docker image rm $image

Remove all stopped containers at once

In [None]:
! sudo docker container prune --force

Remove a volume (=data volume, thus potentially deleting your data!):

In [None]:
volume = "test"
! sudo docker volume rm $volume

**Danger zone**: remove all stopped containers, plus all images and volumes that are not currently used by a **running container**. Type the following manually:
* Delete all stopped containers, all "dangling" images, the build cache, any unattached network: ```! sudo docker system prune --force```
* To also delete all currently unused images: ```! sudo docker system prune --all --force```
* To also delete all currently unused volumes (potentially deleting your data!): ```! sudo docker system prune --volumes --force```