## Project 3: Understanding User Behavior
### Vaibhav Beohar - UC Berkeley MIDS - Spring 2020 - Section 4

### Quick Introduction to Project 3

The commands in this notebook, run as part of Project 3, do the following:

<li>Invoke a Flask web-server (executed via `game_api.py`) to log events to Kafka</li>

<li>Assemble a data pipeline to catch these events: use Spark to filter selected event types from Kafka and land them in HDFS as Parquet files so they are available for querying</li>

<li>Use Apache Bench to generate test data for our pipeline</li>

<li>Produce a simple set of analytics commands for basic analysis of the events</li>
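As a rough illustration of the Apache Bench step, the sketch below builds the kind of `ab` command lines that could generate the test events. The base URL, port, and `/purchase_a_sword` route are assumptions (they are not shown in this notebook), and `ab_command` is a hypothetical helper. The `Host` values and request counts match the events analyzed later in the notebook.

```python
import shlex

# Assumption: the Flask server listens on localhost:5000 and exposes this route.
BASE_URL = "http://localhost:5000"
PURCHASE_ROUTE = "/purchase_a_sword"


def ab_command(route, host, n_requests):
    """Build an Apache Bench command that sends n_requests with a custom Host header."""
    args = ["ab", "-n", str(n_requests), "-H", "Host: " + host, BASE_URL + route]
    # Quote each argument so the header (which contains a space) stays one token.
    return " ".join(shlex.quote(a) for a in args)


commands = [
    ab_command(PURCHASE_ROUTE, "user1.comcast.com", 10),
    ab_command(PURCHASE_ROUTE, "user2.att.com", 20),
]
for cmd in commands:
    print(cmd)
```

The `-n` flag sets the number of requests and `-H` adds a custom header, which is how each simulated user gets a distinct `Host` value.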


Included in the git commit are the following: 
<li>`docker-compose.yml`</li>
<li>`game_api.py`</li>
<li>`project_3`</li>

The `game_api.py` file (based on week 12) implements the Flask web API server, with code added for the "join guild" and "purchase sword" user events.
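Since `game_api.py` itself is not reproduced in this notebook, here is a minimal sketch of how such an event might be assembled before it is sent to Kafka. It assumes that each event is a flat dictionary combining an `event_type` with the request headers, which is consistent with the columns in the schema printed later. `build_event` and the `join_guild` event name are hypothetical.

```python
import json


def build_event(event_type, headers):
    """Merge an event type with the request headers into one flat dict."""
    event = {"event_type": event_type}
    event.update(headers)
    return event


# Header values as they appear in the Apache Bench generated events.
headers = {
    "Host": "user1.comcast.com",
    "User-Agent": "ApacheBench/2.3",
    "Accept": "*/*",
}

purchase = build_event("purchase_sword", headers)
guild = build_event("join_guild", headers)  # assumed event name

# Kafka messages are bytes, so the event is serialized to JSON and encoded.
message = json.dumps(purchase).encode()
print(message)
```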

This `Project_3.ipynb` Jupyter notebook incorporates the code from the spark-submit job file `filtered_writes.py`, modified to run in a Jupyter notebook with a PySpark kernel.

Additionally, this notebook includes PySpark code (from week 12 of the MIDS Spring 2020 synchronous section 4 class) that reads the Parquet files into a Spark DataFrame, registers it as a temp table, executes SQL queries against the DataFrame in memory, and then converts the result to a Pandas DataFrame.

###### The Git repo also includes the history of the commands executed to perform all operations for this project submission (`Commands_VBeohar_history.txt`)

In [None]:
import json
from pyspark.sql import Row
from pyspark.sql.functions import udf
import matplotlib.pyplot as plt

In [2]:
sc

#### Structured Streaming of topics using Apache Kafka and Spark SQL API

##### The cells below use Spark SQL API calls to consume and transform events from Apache Kafka. Kafka organizes the data into topics.

In [3]:
@udf('boolean')
def is_purchase(event_as_json):
    """Return True when the JSON-encoded event is a sword purchase."""
    event = json.loads(event_as_json)
    return event['event_type'] == 'purchase_sword'
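The predicate above raises an exception if a message is not valid JSON or has no `event_type` key. The following is a sketch of a more tolerant plain-Python variant that can be checked without a Spark session. `is_purchase_safe` is a hypothetical name, and wrapping it with `@udf('boolean')` as above is left as an assumption.

```python
import json


def is_purchase_safe(event_as_json):
    """Return True only for well-formed purchase_sword events."""
    try:
        event = json.loads(event_as_json)
    except (TypeError, ValueError):
        # None input raises TypeError; malformed JSON raises ValueError.
        return False
    # Guard against JSON values that are not objects (lists, numbers, ...).
    return isinstance(event, dict) and event.get("event_type") == "purchase_sword"


samples = [
    '{"event_type": "purchase_sword", "Host": "user1.comcast.com"}',
    '{"event_type": "join_guild"}',
    '{"Host": "user2.att.com"}',
    "not json",
    None,
]
results = [is_purchase_safe(s) for s in samples]
print(results)
```

Treating bad messages as non-purchases keeps one malformed event from failing the whole filter, at the cost of silently dropping it.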

##### Construct a DataFrame that reads from the `events` topic (a batch read via `spark.read`, bounded by `startingOffsets` and `endingOffsets`)

In [4]:
raw_events = spark \
        .read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:29092") \
        .option("subscribe", "events") \
        .option("startingOffsets", "earliest") \
        .option("endingOffsets", "latest") \
        .load()

In [5]:
purchase_events = raw_events \
    .select(raw_events.value.cast('string').alias('raw'),
            raw_events.timestamp.cast('string')) \
    .filter(is_purchase('raw'))

In [6]:
extracted_purchase_events = purchase_events \
    .rdd \
    .map(lambda r: Row(timestamp=r.timestamp, **json.loads(r.raw))) \
    .toDF()
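To make the `map` step above concrete, this plain-Python sketch performs the same kind of merge on a single record: the Kafka timestamp and the keys of the parsed JSON payload end up side by side in one flat record. `flatten` is a hypothetical helper, and the sample payload is modeled on the events shown in this notebook.

```python
import json


def flatten(timestamp, raw):
    """Combine the Kafka timestamp with the fields of the JSON payload."""
    record = {"timestamp": timestamp}
    record.update(json.loads(raw))
    return record


raw = (
    '{"event_type": "purchase_sword", "Host": "user1.comcast.com", '
    '"User-Agent": "ApacheBench/2.3", "Accept": "*/*"}'
)
record = flatten("2020-04-16 23:00:10.39", raw)

# Sorting the keys gives the same column order as the printed schema.
columns = sorted(record)
print(columns)
```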

##### printSchema() reveals the schema of our DataFrame

In [7]:
extracted_purchase_events.printSchema()

root
 |-- Accept: string (nullable = true)
 |-- Host: string (nullable = true)
 |-- User-Agent: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [8]:
extracted_purchase_events.show()

+------+-----------------+---------------+--------------+--------------------+
|Accept|             Host|     User-Agent|    event_type|           timestamp|
+------+-----------------+---------------+--------------+--------------------+
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_s

##### Write the extracted purchase events to Parquet files in HDFS under `/tmp/purchases`

In [9]:
extracted_purchase_events \
    .write \
    .mode('overwrite') \
    .parquet('/tmp/purchases')

##### The following PySpark cells read the Parquet files into a Spark DataFrame, register it as a temp table (`purchases`), execute SQL queries against the DataFrame in memory, and finally convert the result to a Pandas DataFrame for easy visualization

In [10]:
purchases = spark.read.parquet('/tmp/purchases')


In [11]:
purchases.show()

+------+-----------------+---------------+--------------+--------------------+
|Accept|             Host|     User-Agent|    event_type|           timestamp|
+------+-----------------+---------------+--------------+--------------------+
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_sword|2020-04-16 21:57:...|
|   */*|    user2.att.com|ApacheBench/2.3|purchase_s

In [12]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
purchases.createOrReplaceTempView('purchases')

In [13]:
purchases_by_example2 = spark.sql("select * from purchases where Host = 'user1.comcast.com'")

In [14]:
purchases_by_example2.show()

+------+-----------------+---------------+--------------+--------------------+
|Accept|             Host|     User-Agent|    event_type|           timestamp|
+------+-----------------+---------------+--------------+--------------------+
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2020-04-16 23:00:...|
|   */*|user1.comcast.com|ApacheBench/2.3|purchase_s

In [15]:
df = purchases_by_example2.toPandas()

In [17]:
df

Unnamed: 0,Accept,Host,User-Agent,event_type,timestamp
0,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.39
1,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.395
2,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.399
3,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.403
4,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.407
5,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.413
6,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.417
7,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.423
8,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.427
9,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.43


In [16]:
df.describe()

Unnamed: 0,Accept,Host,User-Agent,event_type,timestamp
count,10,10,10,10,10
unique,1,1,1,1,10
top,*/*,user1.comcast.com,ApacheBench/2.3,purchase_sword,2020-04-16 23:00:10.427
freq,10,10,10,10,1


### Analytical Questions:
#### Produce an analytics report where you provide a description of your pipeline and some basic analysis of the events

##### How many purchases were there from user 1 (requested via Comcast)?

In [23]:
# .show() returns None, so the result is not assigned to a variable
spark.sql("select count(*) from purchases where Host = 'user1.comcast.com'").show()

+--------+
|count(1)|
+--------+
|      10|
+--------+



##### How many purchases were requested in total, across all users?

In [25]:
# .show() returns None, so the result is not assigned to a variable
spark.sql("select count(1) from purchases").show()

+--------+
|count(1)|
+--------+
|      30|
+--------+



##### How many purchases were there from user 2 (requested via AT&T)?

In [22]:
# .show() returns None, so the result is not assigned to a variable
spark.sql("select count(*) from purchases where Host = 'user2.att.com'").show()

+--------+
|count(1)|
+--------+
|      20|
+--------+
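
The three counts above could also come from a single `GROUP BY` query. The sketch below demonstrates that query with Python's built-in `sqlite3` module standing in for Spark SQL, so it runs without a cluster. The in-memory table is a stand-in rebuilt from the counts reported above, not the actual Parquet data.

```python
import sqlite3

# Stand-in for the Spark temp table: 10 rows for user 1 and 20 for user 2.
conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (Host text, event_type text)")
rows = (
    [("user1.comcast.com", "purchase_sword")] * 10
    + [("user2.att.com", "purchase_sword")] * 20
)
conn.executemany("insert into purchases values (?, ?)", rows)

# One query returns the per-host counts; the total is their sum.
query = "select Host, count(*) as n from purchases group by Host order by Host"
counts = dict(conn.execute(query).fetchall())
total = sum(counts.values())
conn.close()

print(counts, total)
```

This standard SQL should also work as the argument to `spark.sql(...)` against the `purchases` temp table, though that is not run here.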

