# Understanding User Behavior
### Sophia Skowronski | Project 3
### Summer 2020 | MIDS w205 | Fundamentals of Data Engineering

## Summary

Our mobile game has two events we're interested in tracking: buying a sword and joining a guild. This report summarizes how the data pipeline was set up so that our data analytics team can query it to solve the business problems at hand.

## Annotations

### Summary of the web-app server and streaming files.

#### Web-app server (`game_api.py`): Add `purchase_sword_event` and `join_guild_event`. Code snippet below:
```
    @app.route("/join_a_guild")
    def join_a_guild():
        join_guild_event = {'event_type': 'join_guild'}
        log_to_kafka('events', join_guild_event)
        return "Guild Joined!\n"
```
- In the `game_api.py` file, we import the Flask module.
- Then, we create an instance of the Flask class and call it `app`. This is the web server.
- The app defines several routes: a `default` page, a `purchase_a_sword` page, and an additional `join_a_guild` page.
- When the user goes to the `join_a_guild` page, the function above runs. It passes a dictionary with a single metadata field (`event_type`) to the `log_to_kafka` function, which combines the request headers with the event data. That message is then sent (via the imported `KafkaProducer` class) as encoded JSON to the `events` topic.
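The message-building step described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual `log_to_kafka`: `build_message` and `ListProducer` are hypothetical names introduced here so the example runs without a Kafka broker, and the producer and headers are passed in explicitly instead of coming from Flask's request context.

```python
import json


def build_message(event, headers):
    """Merge request headers into the event and encode it as JSON bytes."""
    enriched = dict(event)      # copy so the caller's dict is not mutated
    enriched.update(headers)    # header fields travel with the event
    return json.dumps(enriched).encode()


class ListProducer:
    """Hypothetical stand-in for KafkaProducer that records sends in memory."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))


def log_to_kafka(producer, topic, event, headers):
    # Assumed shape: the real function reads headers from the Flask request.
    producer.send(topic, build_message(event, headers))


producer = ListProducer()
log_to_kafka(
    producer,
    "events",
    {"event_type": "join_guild"},
    {"Host": "user1.comcast.com", "Accept": "*/*"},
)
print(producer.sent[0])
```

Keeping the merge-and-encode step in its own function makes it easy to check what ends up on the topic without running Kafka.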

#### Spark streaming file (`write_stream.py`): Create separate sinks for the two event types. Code snippet below:
```
    guild_joins = raw_events \
        .filter(is_join_guild_event(raw_events.value.cast('string'))) \
        .select(raw_events.value.cast('string').alias('raw_event'),
                raw_events.timestamp.cast('string'),
                from_json(raw_events.value.cast('string'),
                    event_schema()).alias('json')) \
        .select('raw_event', 'timestamp', 'json.*')
    ...
    guild_sink = guild_joins \
        .writeStream \
        .format("parquet") \
        .option("checkpointLocation", "/tmp/checkpoints_for_guild_joins") \
        .option("path", "/tmp/guild_joins") \
        .trigger(processingTime="10 seconds") \
        .start()
    ...
    spark.streams.awaitAnyTermination()
```
- The code above shows one of the two DataFrame objects that is created, written to a streaming sink, and stored in HDFS.
- The `write_stream.py` file imports libraries, defines an explicit schema describing what the data is supposed to look like (in a function called `event_schema`), defines Boolean `@udf` functions to filter the event data, and initializes a Spark session.
- Within the Spark session, Spark reads from the Kafka `events` stream, filters the source events by calling the `@udf` functions, creates two sinks for the data, and writes those sinks to Parquet. A trigger is defined so Spark processes the data and sends it to HDFS every 10 seconds.
    - Note that no ending offset is defined, so the stream runs continuously until it receives a termination signal.
- Because there are two streams running, the final line of code uses the StreamingQueryManager (`spark.streams`) to manage the two active sinks. `awaitAnyTermination()` blocks until either one of them terminates.
- At the bottom of the file, an if statement calls the `main()` function. `__name__` holds the name of the current module. Python sets it to `__main__` only when the file is executed directly. If the file is imported from another script, `__name__` is the module's own name instead, so the if statement prevents `main()` from running. When we run `write_stream.py` itself, `__name__` is `__main__`, and only then does the if statement trigger.
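The filtering predicate and the `__main__` guard described above can be sketched in plain Python. This is an assumption-laden illustration: in `write_stream.py` the predicate is a Spark `@udf`, while here it is an ordinary function applied to a list so that the example runs without Spark, and the function body and sample messages are made up for the sketch.

```python
import json


def is_join_guild_event(event_as_json):
    """Return True when the raw JSON string describes a join_guild event."""
    event = json.loads(event_as_json)
    return event.get("event_type") == "join_guild"


def main():
    # Illustrative messages, not captured from the real stream.
    raw_events = [
        '{"event_type": "join_guild", "Host": "user1.comcast.com"}',
        '{"event_type": "purchase_sword", "Host": "user2.att.com"}',
    ]
    # Stand-in for the DataFrame .filter(...) step.
    guild_joins = [e for e in raw_events if is_join_guild_event(e)]
    print(len(guild_joins), "guild join event(s)")
    return guild_joins


# Runs only when this file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```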

### Spin up the cluster.
```
docker-compose up -d
```
### Terminal 1: Run flask.
```
docker-compose exec mids env FLASK_APP=/w205/project-3-sophiaski/game_api.py flask run --host 0.0.0.0
```
- This runs the Flask server. Adding `--host 0.0.0.0` to the end tells it to listen on all network interfaces rather than only on localhost, which makes the server reachable from outside the container.
- The `FLASK_APP` environment variable is the name of the module to import at `flask run`.

### Terminal 2: From the MIDS container, use kafkacat to print out the messages from the specified Kafka broker and "events" topic, dumping them into standard output.
```
docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning
```
- This command needs to be run twice: the topic does not exist yet on the first call, but it is available by the time the command is called a second time.
- Omitting `-e` from the command keeps kafkacat running continuously instead of exiting after the last message.

### Terminal 3: Check out HDFS to see if anything has been written to `/tmp/`.
```
docker-compose exec cloudera hadoop fs -ls /tmp/
```
- Nothing yet!

### Terminal 3: Run the Spark stream.
```
docker-compose exec spark spark-submit /w205/project-3-sophiaski/write_stream.py
```
- This runs the `spark-submit` command from within the Spark container to launch the Python application.

### Terminal 4: Again, check out HDFS to see what has been written to `/tmp/`.
```
docker-compose exec cloudera hadoop fs -ls /tmp/
```
```
drwxrwxrwt   - root   supergroup          0 2020-07-29 19:39 /tmp/checkpoints_for_guild_joins
drwxrwxrwt   - root   supergroup          0 2020-07-29 19:39 /tmp/checkpoints_for_sword_purchases
drwxr-xr-x   - root   supergroup          0 2020-07-29 19:39 /tmp/guild_joins
drwxr-xr-x   - root   supergroup          0 2020-07-29 19:39 /tmp/sword_purchases
```
- There are new folders in HDFS. The streaming file adds locations for the streaming data sinks. 
- The newly created folders contain Snappy-compressed Parquet files along with Spark metadata. Spark SQL caches Parquet metadata for better performance.
- Checkpointing provides fault tolerance for the streaming data. The checkpoints are saved as separate folders in HDFS.

### Terminal 4: Set up Hive tables for Presto.
```
docker-compose exec cloudera hive
hive> create external table if not exists default.sword_purchases (
    raw_event string,
    timestamp string,
    Accept string,
    Host string,
    User_Agent string,
    event_type string
  )
  stored as parquet 
  location '/tmp/sword_purchases'
  tblproperties ("parquet.compress"="SNAPPY");
hive> create external table if not exists default.guild_joins (
    raw_event string,
    timestamp string,
    Accept string,
    Host string,
    User_Agent string,
    event_type string
  )
  stored as parquet 
  location '/tmp/guild_joins'
  tblproperties ("parquet.compress"="SNAPPY");
```
- This sets up two Hive external tables via the Hive console in the Cloudera container. Hive stores each table's schema, data location, and table properties, while the data itself stays in HDFS.

### Terminal 4: Query HDFS with Presto 
```
docker-compose exec presto presto --server presto:8080 --catalog hive --schema default
```
- Presto talks to Hive to get the `sword_purchases` and `guild_joins` table definitions, and these point to the HDFS locations where the data for queries is stored.

### Terminal 5: With the Spark stream running, send events to the web-app server via a shell script that automatically generates them.
```
#!/bin/bash

while true
	do
	docker-compose exec mids ab -n $(($RANDOM%25+6)) -H "Host: user1.comcast.com" http://localhost:5000/
	docker-compose exec mids ab -n $(($RANDOM%25+6)) -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword
	docker-compose exec mids ab -n $(($RANDOM%25+6)) -H "Host: user1.comcast.com" http://localhost:5000/join_a_guild
	docker-compose exec mids ab -n $(($RANDOM%50+1)) -H "Host: user2.att.com" http://localhost:5000/
	docker-compose exec mids ab -n $(($RANDOM%50+1)) -H "Host: user2.att.com" http://localhost:5000/purchase_a_sword
	docker-compose exec mids ab -n $(($RANDOM%50+1)) -H "Host: user2.att.com" http://localhost:5000/join_a_guild
	sleep 3
done
```
- This uses Apache Bench (`ab`) to send HTTP requests and simulate data. Each call performs a random number of requests (6 to 30 for `user1`, 1 to 50 for `user2`) and adds host information to the request header.
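To make the request-count arithmetic explicit, here is a rough Python equivalent of one pass through the loop. It is only a sketch: `build_ab_commands`, `USERS`, and `ENDPOINTS` are names introduced for illustration, and the commands are built and printed rather than executed. It relies on bash's `$RANDOM` being an integer from 0 to 32767.

```python
import random
import shlex

BASE_URL = "http://localhost:5000"
ENDPOINTS = ["/", "/purchase_a_sword", "/join_a_guild"]
# (host header, modulus, offset) mirrors $RANDOM%25+6 and $RANDOM%50+1.
USERS = [("user1.comcast.com", 25, 6), ("user2.att.com", 50, 1)]


def build_ab_commands(rng):
    """Build one round of `ab` invocations as argument lists."""
    commands = []
    for host, modulus, offset in USERS:
        for endpoint in ENDPOINTS:
            # Emulate $RANDOM, then apply the same modulus and offset.
            n_requests = rng.randrange(32768) % modulus + offset
            commands.append([
                "docker-compose", "exec", "mids", "ab",
                "-n", str(n_requests),
                "-H", "Host: " + host,
                BASE_URL + endpoint,
            ])
    return commands


for command in build_ab_commands(random.Random(0)):
    print(shlex.join(command))
```

Each argument list could be handed to `subprocess.run` to reproduce what the shell script does, with `time.sleep(3)` between rounds.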

### Terminal 4: Query HDFS with Presto after data has been loaded in.

The events were split into two tables for scalability purposes. As the game develops, there will likely be many more event types to track, each with its own metadata structure and schema that is not shared by other types of actions. The split also matters for storage and security reasons, since our analytics team may not need access to the entire dataset to solve a specific business problem. See below for some example queries and findings that our data analytics team might find useful.
```
presto:default> show tables;
      Table      
-----------------
 guild_joins     
 sword_purchases 
(2 rows)

Query 20200802_224911_00003_x7r64, FINISHED, 1 node
Splits: 2 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
```

## Basic Analytics Questions & Queries

### How many total events were tracked during the session?
```
presto:default> select count(*) as total_events from (select * from guild_joins union select * from sword_purchases as guild_and_swords);
 total_events 
--------------
         1278 
(1 row)

Query 20200802_224939_00004_x7r64, FINISHED, 1 node
Splits: 147 total, 137 done (93.20%)
0:14 [1.04K rows, 130KB] [74 rows/s, 9.34KB/s]
```
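- During this session, there were 1278 events. This query combined the two tables with `union` and counted the total number of non-default events (i.e. sword purchases and guild joins) tracked so far in the current session. Note that `union` drops exact duplicate rows, whereas `union all` would keep every row.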
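The difference between `union` and `union all` when counting combined events can be illustrated with a small, self-contained example. This sketch uses Python's built-in `sqlite3` module instead of Presto, with made-up rows and a reduced two-column schema, so it only demonstrates the standard SQL set semantics, not the project's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table guild_joins (raw_event text, event_type text);
    create table sword_purchases (raw_event text, event_type text);
    -- Two identical rows stand in for events that happen to match exactly.
    insert into guild_joins values ('a', 'join_guild'), ('a', 'join_guild');
    insert into sword_purchases values ('b', 'purchase_sword');
""")


def count_rows(set_operator):
    """Count rows of the combined tables using the given set operator."""
    query = (
        "select count(*) from ("
        "select * from guild_joins {} select * from sword_purchases"
        ") as guild_and_swords"
    ).format(set_operator)
    return conn.execute(query).fetchone()[0]


print("union:", count_rows("union"))          # duplicates collapsed
print("union all:", count_rows("union all"))  # every row kept
```

In the session above the per-table counts add up to the `union` total, so no rows were collapsed, but `union all` avoids the question for event counting.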

### Which host had a higher number of sword purchases?
```
presto:default> select host, count(*) as total_events from sword_purchases group by host;
       host        | total_events 
-------------------+--------------
 user1.comcast.com |          223 
 user2.att.com     |          302 
(2 rows)

Query 20200802_230759_00010_azxik, FINISHED, 1 node
Splits: 94 total, 89 done (94.68%)
0:03 [523 rows, 75.3KB] [197 rows/s, 28.4KB/s]
```
- Querying from the `sword_purchases` table, the AT&T host had a higher number of sword purchases than the Comcast host.

### Which host had the higher number of guilds joined?
```
presto:default> select host, count(*) as total_events from guild_joins group by host;
       host        | total_events 
-------------------+--------------
 user2.att.com     |          465 
 user1.comcast.com |          288 
(2 rows)

Query 20200802_230858_00011_azxik, FINISHED, 1 node
Splits: 96 total, 87 done (90.63%)
0:03 [747 rows, 82KB] [286 rows/s, 31.4KB/s]
```
- Querying from the `guild_joins` table, the AT&T host had a higher number of guild joins than the Comcast host.

### Were more swords purchased than guilds joined?
```
presto:default> select event_type, count(*) as total_events from (select * from guild_joins union select * from sword_purchases as guild_and_swords) group by event_type;
   event_type   | total_events 
----------------+--------------
 join_guild     |          753 
 purchase_sword |          525 
(2 rows)

Query 20200802_231014_00013_azxik, FINISHED, 1 node
Splits: 191 total, 178 done (93.19%)
0:05 [1.21K rows, 155KB] [243 rows/s, 31.2KB/s]
```
- Querying the union of the `sword_purchases` and `guild_joins` tables, we can group by `event_type` to answer this question. In this session, more guilds were joined than swords purchased: guild joins across both hosts made up the majority, with 753 total events.
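As a quick cross-check, the per-host counts reported by the earlier queries can be re-aggregated by event type. This small sketch simply hard-codes the numbers from the query output above; the variable names are illustrative.

```python
# Counts copied from the Presto output for each (host, event_type) pair.
counts = {
    ("user1.comcast.com", "purchase_sword"): 223,
    ("user2.att.com", "purchase_sword"): 302,
    ("user1.comcast.com", "join_guild"): 288,
    ("user2.att.com", "join_guild"): 465,
}

total = sum(counts.values())

# Sum over hosts to get one count per event type.
by_event = {}
for (host, event_type), n in counts.items():
    by_event[event_type] = by_event.get(event_type, 0) + n

for event_type, n in sorted(by_event.items()):
    print("{}: {} ({:.1%} of {})".format(event_type, n, n / total, total))
```

The totals agree with the grouped query: 753 guild joins and 525 sword purchases out of 1278 events.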

### Spin down the cluster.
```
docker-compose down
```