# Tracking User Behavior on "Best Chef" (mobile game) 

### Problem Statement 
"Best Chef" is an RPG-style mobile game. Users are chefs who can join restaurants (similar to teams or clubs) and enter cooking competitions. The game is free on the app stores, and the Game Company generates revenue from in-app purchases (e.g., buying ingredients or extra features). To maximize in-app purchases, it is critical to keep users engaged in the game. Some actions may contribute more than others to user engagement and willingness to purchase extra features, so the company needs to track all user actions during the game, including the metadata associated with each event (e.g., timestamp, ingredient type, restaurant name). With this data, the company's data science team can provide insights to the game development and design teams in order to maximize user engagement on "Best Chef" and revenue for the Game Company.



### Objective
Track user behavior in streaming mode and prepare the infrastructure to land the data in a form and structure that data scientists can query. This process includes the following:
- Instrument the API server to log events into Kafka
- Assemble the data pipeline: use Spark Streaming to filter and select event types from Kafka, then land them in HDFS as Parquet so the data can be analyzed with Presto
- Use Apache Bench to generate test data for the pipeline
- Produce an analytics report for the development and design teams with an analysis of the events

### Tools
This project was executed on Google Cloud Platform (GCP) with the following tools:
- Docker/docker-compose (to set up the cluster of containers)
- Flask (to instrument our web server API)
- Kafka (to publish and consume messages)
- Zookeeper (to coordinate the Kafka broker)
- Spark/pyspark (to extract, filter, flatten and load the messages into HDFS)
- Cloudera/HDFS (to store final data in Parquet format)
- Hive metastore (schema registry)
- Presto (to query the data from HDFS)
- Linux Bash (to run commands and scripts through CLI)
- Apache Bench (to simulate user interactions with the app)
- Python 3 json module (to serialize and deserialize JSON events)

### Pipeline
1. User interacts with mobile app game
2. Mobile app makes API calls to web services
3. API server handles requests: [OUT OF SCOPE FOR THIS PROJECT]
    - e.g., processes game actions, in-game purchases, etc.
    - logs events (user actions) to Kafka
4. Spark pulls events from Kafka, filters events by type, applies data schemas, and writes to HDFS in Parquet format
5. Hive reads the Parquet files and creates tables for queries
6. Presto is used to query the tables for data analysis
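
Step 4 is the core of the pipeline. The filter-by-type idea can be illustrated in plain Python, without Spark: decode each message, drop anything malformed, and route the rest by event type. This is only a sketch, not the project's `filtered_writes_stream.py`; the `event_type` field name and the `split_by_event_type` helper are assumptions made for illustration.

```python
import json
from collections import defaultdict


def split_by_event_type(raw_messages, wanted_types):
    """Group decoded events by type, skipping malformed or unwanted messages.

    `event_type` is an assumed field name; the real events may use another key.
    """
    grouped = defaultdict(list)
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except (TypeError, ValueError):
            continue  # not valid JSON: ignore it, as a streaming filter would
        if not isinstance(event, dict):
            continue  # valid JSON but not an event object
        event_type = event.get("event_type")
        if event_type in wanted_types:
            grouped[event_type].append(event)
    return dict(grouped)


messages = [
    '{"event_type": "purchase_an_ingredient", "ingredient_type": "egg"}',
    '{"event_type": "enter_a_contest", "outcome": "won"}',
    "not json at all",
]
wanted = {"purchase_an_ingredient", "join_a_restaurant", "enter_a_contest"}
for event_type, events in split_by_event_type(messages, wanted).items():
    print(event_type, len(events))
```

In the pipeline itself, Spark plays this role on the Kafka stream, and each event type lands in its own HDFS directory (see the `hadoop fs -ls` checks further down).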


### Tracked events
In this project, 3 types of events are tracked:
1. Purchase an ingredient
2. Join a restaurant
3. Enter a contest
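
Each tracked event reaches Kafka as a JSON message that carries its metadata. Below is a minimal sketch of how such a message might be assembled. The event type names mirror the Flask routes used later in this README; `event_type` and `timestamp` are assumed field names, and `publish` is a hypothetical stand-in for a Kafka producer's send call, not the project's actual `game_api.py`.

```python
import json
from datetime import datetime, timezone

# Assumed event type names, taken from the API routes used in this README.
EVENT_TYPES = {"purchase_an_ingredient", "join_a_restaurant", "enter_a_contest"}


def build_event(event_type, host, **metadata):
    """Return a UTF-8 encoded JSON event with request metadata attached."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "event_type": event_type,  # assumed field name
        "host": host,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field name
    }
    event.update(metadata)  # e.g. ingredient_type, restaurant_name, contest
    return json.dumps(event).encode("utf-8")


def log_to_kafka(publish, topic, event_bytes):
    """Hand the encoded event to `publish`, a hypothetical producer callback."""
    publish(topic, event_bytes)


# Demo with an in-memory list instead of a real Kafka producer.
sent = []
log_to_kafka(
    lambda topic, value: sent.append((topic, value)),
    "events",
    build_event("purchase_an_ingredient", "user1.comcast.com", ingredient_type="egg"),
)
print(sent[0][1].decode("utf-8"))
```

Keeping the producer behind a callback makes the event-building logic easy to check without a running Kafka broker.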

### Bash commands used to create pipeline

**Spin up cluster**
```bash
docker-compose up -d
```

**Perform checks**
```bash
docker-compose ps
```

**Create event topics**
```bash
docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181
```

**Run Flask app**
```bash
docker-compose exec mids env FLASK_APP=/w205/project-3-redcarrott/game_api.py flask run --host 0.0.0.0
```

**Generate random events through Apache Bench**
```bash
chmod +x ab_events_publisher.sh
./ab_events_publisher.sh
```
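
The contents of `ab_events_publisher.sh` are not shown here. As a rough illustration of generating random test traffic, the sketch below builds Apache Bench commands for randomly chosen endpoints and hosts. The endpoint names, the `-n` and `-H` flags, and the `docker-compose exec mids ab` prefix come from the commands in this README; the helper names and the second host are hypothetical.

```python
import random

# Endpoints exposed by the Flask app, as used elsewhere in this README.
ENDPOINTS = ["purchase_an_ingredient", "join_a_restaurant", "enter_a_contest"]


def build_ab_command(endpoint, host, requests=10, base_url="http://localhost:5000"):
    """Return the argument list for one Apache Bench run inside the mids container."""
    return [
        "docker-compose", "exec", "mids",
        "ab", "-n", str(requests),
        "-H", f"Host: {host}",
        f"{base_url}/{endpoint}",
    ]


def random_ab_commands(count, hosts, seed=None):
    """Build `count` commands with a random endpoint and host for each."""
    rng = random.Random(seed)  # a seed makes the sequence reproducible
    return [
        build_ab_command(rng.choice(ENDPOINTS), rng.choice(hosts))
        for _ in range(count)
    ]


# "user2.att.com" is a made-up host used only for this example.
for command in random_ab_commands(3, ["user1.comcast.com", "user2.att.com"], seed=42):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command, check=True)` to actually send the requests.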

**Read from Kafka**
```bash
docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning
```

**Spark stream: run it**
```bash
docker-compose exec spark spark-submit /w205/project-3-redcarrott/filtered_writes_stream.py
```

**Spark stream: feed it**
```bash
# Each loop runs until interrupted (Ctrl+C), so start each one in its own terminal.
while true; do docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_an_ingredient; sleep 10; done

while true; do docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/join_a_restaurant; sleep 10; done

while true; do docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/enter_a_contest; sleep 10; done
```

**Check out results in Hadoop**
```bash
docker-compose exec cloudera hadoop fs -ls /tmp/purchase_ingredient
docker-compose exec cloudera hadoop fs -ls /tmp/join_restaurant
docker-compose exec cloudera hadoop fs -ls /tmp/enter_contest
```

*2 ways to create tables in Hive* \
**1) Create tables in Hive, one per event type**
```bash
docker-compose exec cloudera hive -f /w205/project-3-redcarrott/create_tables.hql
```
**2) Register tables with Hive**
```bash
docker-compose exec spark spark-submit /w205/project-3-redcarrott/stream_and_hive.py
```

**Initiate Presto** \
*run in main prompt*
```bash
docker-compose exec presto presto --server presto:8080 --catalog hive --schema default;
```

**Check tables**
```sql
SHOW TABLES;
DESCRIBE hive.default.purchase_ingredient;
DESCRIBE hive.default.join_restaurant;
DESCRIBE hive.default.enter_contest;
```

**Run queries**

*What proportion of contests do users win?*
```sql
select outcome, count(*) as count from enter_contest where outcome is not null group by outcome;
```
|outcome | count |
|--------|-------|
|lost | 7 |
|won | 3|
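
The query returns raw counts, so the proportion still has to be derived from them. A small sketch of that arithmetic, using the counts from the table above (`outcome_proportions` is a helper name introduced here for illustration):

```python
def outcome_proportions(counts):
    """Convert outcome counts into shares of the total number of contests."""
    total = sum(counts.values())
    if total == 0:
        return {}  # no contests recorded, nothing to divide
    return {outcome: n / total for outcome, n in counts.items()}


# Counts copied from the query result above.
counts = {"lost": 7, "won": 3}
for outcome, share in outcome_proportions(counts).items():
    print(f"{outcome}: {share:.0%}")
# lost: 70%
# won: 30%
```

With 3 wins out of 10 contests, users win 30% of the contests in this test data set.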

*Which contests do users enter most frequently?*
```sql
select contest, count(*) as count from enter_contest where contest is not null group by contest order by count desc;
```
|contest | count |
|--------|-------|
|Le Best Chef | 4 |
|Taste of Home | 3|
|Feast and Field | 3 |
|Call Me Betty Crocker | 2 |
|Bake Off | 1 |
|(3 rows) |  |

*Which ingredient do users purchase most?*
```sql
select ingredient_type, count(*) as count from purchase_ingredient where ingredient_type is not null group by ingredient_type order by count desc;
```
|ingredient_type | count |
|--------|-------|
|egg | 10 |
|beef | 5 |
|salt | 3 |
|butter | 3 |
|(6 rows) | |

*Which restaurant is preferred by Comcast users?*
```sql
select restaurant_name, count(*) as count from join_restaurant where restaurant_name is not null and host like '%.comcast.com' group by restaurant_name order by count desc;
```
|restaurant_name | count |
|--------|-------|
|Promiscuous Fork | 3 |
|The Angry Avocado | 1 |
|Call Your Mother | 1 |
|(4 rows) | |