This repository covers the streaming analytics and machine learning module of the DSR batch. DSR is a three-month bootcamp that quickly trains experienced(!) professionals into freshly minted Data Scientists.
Over two days the topics covered are Apache Spark (in particular Spark Streaming) and Apache Kafka.
As streaming use cases are particularly involved in terms of architecture, the slide presentation covers additional material on Big Data architectures.
Last but not least, we discuss how to do Machine Learning in this module.
Batch 11
- switched to Spark 2.2; finally the Structured Streaming API is stable!
The setup is fairly involved. The reason is that the command line, Docker, and Apache Spark are separate modules in DSR; depending on the batch, they are sometimes not given, or given in another form. We therefore go through all of them here.
We work with the command line, the de-facto standard for Data Scientists. On macOS you should install iTerm2, Linux ships a suitable terminal out of the box, and on Windows you may want to google for zsh, the Z-shell.
You should have the Anaconda distribution installed, preferably with Python 3.5. If you have the Python 2 based distribution installed, you can create an environment as follows:
conda create -n YOUR_ENV_NAME python=3.5
source activate YOUR_ENV_NAME
Now you can run pip freeze to see that only a minimal set of packages is installed. As long as the environment is active, any packages you install go into it rather than into the base installation.
Download spark-2.2.0-bin-hadoop2.7.tgz from the Apache Spark homepage, then extract and move it:
tar -xzf ~/Downloads/spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 $HOME/spark
In order to run the Spark tools from the CLI, Spark must be added to the $PATH. Additionally, to be able to import pyspark, Spark's bundled Python libraries must be added to the $PYTHONPATH so that everything works. Note that we run Python programs directly rather than in Jupyter notebooks, so any notebook-related settings from earlier modules must be removed.
My configuration looks as follows (in the file .bashrc in the HOME folder):
export SPARK_HOME=$HOME/spark
export PATH="$PATH:$SPARK_HOME/bin" # adds spark-submit, spark-shell etc. to PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=$(which python)
export PYSPARK_DRIVER_PYTHON_OPTS= # unsets this in case you had it pointing at a notebook
export PYSPARK_PYTHON=$(which ipython)
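After sourcing the file (source ~/.bashrc), you can sanity-check the setup with a tiny script. This is only a minimal sketch assuming the exports above; spark_check.py is just an illustrative name:
# save as e.g. spark_check.py and run with: python spark_check.py
from pyspark.sql import SparkSession  # only importable if $PYTHONPATH is set as above

spark = SparkSession.builder.appName("setup-check").getOrCreate()
print(spark.version)            # should print 2.2.0
print(spark.range(10).count())  # tiny local job to confirm Spark actually runs
spark.stop()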
Kafka is a distributed and fault-tolerant message queue. Kafka depends on Zookeeper for keeping its state. To keep things simple, we run a docker container with both Kafka and Zookeeper. In practice, this is never done, as both services have different requirements.
It is wise, however, to also download Kafka separately, as the distribution ships the CLI tools.
Decompress the archive via tar -xzf kafka_2.12-0.10.2.1.tgz
and set KAFKA_HOME
to the target path. I extracted it to ~/Downloads/kafka_2.12-0.10.2.1
and set
export KAFKA_HOME="$HOME/Downloads/kafka_2.12-0.10.2.1"  # use $HOME here; a quoted ~ does not expand
cd $KAFKA_HOME
We use the Docker image spotify/kafka; run docker pull spotify/kafka once. Then, to start a container:
docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=localhost --env ADVERTISED_PORT=9092 spotify/kafka
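Once the container is running, you can verify from Python that the broker is reachable on localhost:9092. A minimal sketch using the kafka-python package (an assumption; install it with pip install kafka-python if needed):
# connectivity check against the advertised broker
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # set of existing topic names; empty on a fresh broker
consumer.close()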
Starting Zookeeper
bash bin/zookeeper-server-start.sh config/zookeeper.properties
Starting Kafka Broker
bash bin/kafka-server-start.sh config/server.properties
Console Consumer
Useful to test a topic's contents.
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --from-beginning \
  --topic test
Instead of --from-beginning you can pass --offset <offset> together with --partition <partition>.
Producing messages in terminal
note that the test topic gets created automatically
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
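The same round trip also works from Python. A minimal sketch using kafka-python (an assumption; the repository's simple_producer and simple_consumer pin their own dependencies in requirements.txt):
# produce one message to "test" and read the topic back (assumes pip install kafka-python)
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test", b"hello from python")
producer.flush()  # make sure the message leaves the client buffer

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read the topic from the start
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for record in consumer:
    print(record.value)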
Topics
kafka-topics.sh \
--zookeeper localhost:2181 \
--create --topic test-topic \
--partitions 2 \
--replication-factor 1
kafka-topics.sh --zookeeper localhost:2181 --describe
kafka-topics.sh --zookeeper localhost:2181 --delete --topic test-topic
Consumer Groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
Cleaning up
Stop the Kafka broker, then Zookeeper. Under /tmp you can then delete the kafka-logs and zookeeper directories, e.g.
rm -rf /tmp/zookeeper /tmp/kafka-logs
└── day1
├── autodatascientist
│ ├── consumer.py
│ ├── exercise.txt
│ ├── producer.py
│ └── requirements.txt
├── simple_consumer
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── run.sh
│ └── simple_consumer.py
├── simple_producer
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── run.sh
│ └── simple_producer.py
└── simple_sparkstream
├── requirements.txt
├── run.sh
└── simple_sparkstream.py
simple_consumer: A simple Kafka consumer in Python, also packaged as a microservice.
simple_producer: A simple Kafka producer in Python, also packaged as a microservice.
simple_sparkstream: A minimal example to showcase the traditional DStream API of Spark Streaming (see the sketch below).
autodatascientist: An additional "game" to showcase how to send matrix data over Kafka and process it.
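For orientation, a DStream job against the local broker might look roughly like the following. This is only a sketch, not the repository's simple_sparkstream.py, and it assumes the Kafka 0.8 direct API package is supplied at submit time, e.g. spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 your_script.py:
# sketch: word count over the "test" topic with 5-second micro-batches
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="dstream-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# each record arrives as a (key, value) tuple; we only need the message value
stream = KafkaUtils.createDirectStream(ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
counts = (stream.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()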