## Introduction
In this report, we demonstrate how to build a pipeline that lets a data science team obtain insights from game events.  
**TO-DO:** Insert more details here on what can be done in the pipeline, and the business questions we will be answering.

## Setting up the pipeline
First, we show how we built the pipeline. We use Kafka, Hadoop and Spark to transport, transform and store the game events. The configuration for each of these services is defined in the **docker-compose.yml** file; each part of that file is described below.

### Zookeeper & Kafka
Zookeeper coordinates and manages the Kafka instance. Kafka provides the topics that we publish game events to, which are then consumed by Spark.  

We have exposed ports on Zookeeper and referenced its client port (32181) in the Kafka configuration, which allows us to create topics in Kafka via Zookeeper. We have also exposed ports on Kafka that allow us to publish messages to the topics and consume messages from them.

```
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 32181
      ZOOKEEPER_TICK_TIME: 2000
    expose:
      - "2181"
      - "2888"
      - "32181"
      - "3888"
    extra_hosts:
      - "moby:127.0.0.1"

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:32181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    expose:
      - "9092"
      - "29092"
    extra_hosts:
      - "moby:127.0.0.1"
```
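As an illustration of creating a topic via Zookeeper, the sketch below builds the `kafka-topics` command that would be run inside the `kafka` container. It is a minimal sketch under stated assumptions: the topic name `events` and the helper `create_topic_command` are hypothetical, and the `--zookeeper` flag only exists in older Kafka releases (it was removed in Kafka 3.0, where `--bootstrap-server kafka:29092` would be used instead), so whether it works depends on which version `cp-kafka:latest` resolves to.

```python
import shlex


def create_topic_command(topic, zookeeper="zookeeper:32181",
                         partitions=1, replication_factor=1):
    """Build the docker-compose command that creates a Kafka topic via Zookeeper."""
    return [
        "docker-compose", "exec", "kafka",
        "kafka-topics", "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        "--if-not-exists",
        # Assumption: the image's kafka-topics still accepts --zookeeper.
        "--zookeeper", zookeeper,
    ]


# "events" is a hypothetical topic name used only for illustration.
command = create_topic_command("events")
print(" ".join(shlex.quote(part) for part in command))
# To execute it: subprocess.run(command, check=True)
```

The replication factor of 1 matches the single broker defined in the compose file.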

### Hadoop
We have added a Cloudera service so that we can use Hadoop. It exposes the name node port (8020), the name node HTTP port (50070) and the Hue port (8888).

```
  cloudera:
    image: midsw205/cdh-minimal:latest
    expose:
      - "8020" # nn
      - "50070" # nn http
      - "8888" # hue
    extra_hosts:
      - "moby:127.0.0.1"
```

### Spark
We have set up Spark using the MIDS W205 Spark Python base image. We specify that this service depends on Cloudera, and we set the Hadoop name node that Spark will use when writing to HDFS.  
In addition, we have published ports 8888 and 7000, which allow us to connect to Spark from a notebook.

```
  spark:
    image: midsw205/spark-python:0.0.5
    stdin_open: true
    tty: true   
    volumes:
      - "~/w205:/w205"
    command: bash
    depends_on:
      - cloudera
    environment:
      HADOOP_NAMENODE: cloudera
    expose:
      - "8888"
      - "7000" #jupyter notebook      
    ports:
      - "8888:8888"
      - "7000:7000" # map instance:service port   
    extra_hosts:
      - "moby:127.0.0.1"
```

### MIDS Base Image
The `mids` service uses the MIDS W205 base image. It lets us run bash commands and use kafkacat to publish messages to Kafka. We mount the w205 volume so that we have access to our files inside the container.  

```
  mids:
    image: midsw205/base:latest
    stdin_open: true
    tty: true
    expose:
      - "5000"
    ports:
      - "5000:5000"
    volumes:
      - "~/w205:/w205"
    extra_hosts:
      - "moby:127.0.0.1"
```
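To make the kafkacat step concrete, the sketch below builds a command that publishes one JSON event from the `mids` container to the broker at `kafka:29092`. It is a sketch under assumptions: the topic name `events` and the helper `publish_command` are hypothetical, and the event fields are taken from the `sword_purchases` output shown in this report rather than from the actual application code.

```python
import json
import shlex


def publish_command(event, topic="events", broker="kafka:29092"):
    """Build a docker-compose command that pipes one JSON event into kafkacat."""
    payload = json.dumps(event)
    # kafkacat: -P = producer mode, -b = broker address, -t = topic name.
    inner = "echo {} | kafkacat -P -b {} -t {}".format(
        shlex.quote(payload), shlex.quote(broker), shlex.quote(topic)
    )
    return ["docker-compose", "exec", "mids", "bash", "-c", inner]


# Example event; the field names mirror the columns in the analysis output.
event = {"event_type": "purchase_sword", "color": "red", "quantity": 2}
command = publish_command(event)
print(command[-1])
# To execute it: subprocess.run(command, check=True)
```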

## Bring up the pipeline
To start all of the services defined above in the background, we run the following command:

```
docker-compose up -d
```
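Once the containers are running, one way to check the ports published to the host (8888 and 7000 on `spark`, 5000 on `mids`) is a small socket probe. This is a minimal sketch, assuming the services are reached on `localhost`; `port_is_open` is a hypothetical helper, and a successful connection only shows that something accepts TCP connections on that port, not that the service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Host ports published in docker-compose.yml (spark: 8888 and 7000, mids: 5000).
for port in (8888, 7000, 5000):
    state = "open" if port_is_open("localhost", port) else "closed"
    print("localhost:{} is {}".format(port, state))
```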

## Interacting with the pipeline
### Flask app
**TO-DO:** describe the Flask app, how it works and sample code to interact with it (using Apache Bench).

### Extracting events
**TO-DO:** describe the .py files that we run using spark submit along with sample code

## Analyzing the events
**TO-DO:** business analysis goes here

```
sword_purchases = spark.read.parquet('/tmp/sword_purchases')
```

```
sword_purchases.show()
```

```
+--------------------+--------------------+------+-----------------+---------------+--------------+-----+--------+
|           raw_event|           timestamp|Accept|             Host|     User-Agent|    event_type|color|quantity|
+--------------------+--------------------+------+-----------------+---------------+--------------+-----+--------+
|{"event_type": "p...|2021-07-20 03:47:...|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|  red|       2|
|{"event_type": "p...|2021-07-20 03:47:...|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|  red|       2|
|{"event_type": "p...|2021-07-20 03:47:...|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|  red|       2|
|{"event_type": "p...|2021-07-20 03:47:...|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|  red|       2|
|{"event_type": "p...|2021-07-20 03:47:...|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|  red|       2|
+--------------------+--------------------+------+-----------------+---------------+--------------+-----+--------+
```

```
sword_purchases.registerTempTable('sword_purchases')
total_swords = spark.sql("select Host, color, sum(quantity) total from sword_purchases group by Host, color")
total_swords.show()
```

```
+-----------------+-----+-----+
|             Host|color|total|
+-----------------+-----+-----+
|user1.comcast.com|  red| 10.0|
+-----------------+-----+-----+
```
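To show what the `group by Host, color` query computes, here is a plain-Python sketch of the same aggregation over the five rows displayed by `sword_purchases.show()`. It is only an illustration, not part of the pipeline: the helper name `total_by_host_and_color` is hypothetical, and the quantities are summed as floats simply to mirror the `10.0` in the Spark output.

```python
from collections import defaultdict

# The five rows shown by sword_purchases.show(), reduced to the columns
# used in the query.
rows = [
    {"Host": "user1.comcast.com", "color": "red", "quantity": 2}
    for _ in range(5)
]


def total_by_host_and_color(rows):
    """Sum quantity per (Host, color) pair, like the SQL group by."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["Host"], row["color"])] += float(row["quantity"])
    return dict(totals)


totals = total_by_host_and_color(rows)
print(totals)
```

Five purchases of two red swords each from the same host give a single group with a total of 10.0, which agrees with the query result.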

