# Project 2: Tracking User Activity Report

### Docker and Kafka Through Terminal

##### copy docker-compose.yml file from course-content

```
cp ../course-content/07-Sourcing-Data/docker-compose.yml .
```

##### spin up the cluster

```
docker-compose up -d
```

##### pull the data

```
curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp
```

##### create and check the topic

```
docker-compose exec kafka kafka-topics --create --topic assessment --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181
```

```
docker-compose exec kafka kafka-topics --describe --topic assessment --zookeeper zookeeper:32181
```

##### publish some messages

```
docker-compose exec mids bash -c "cat /w205/project-2-shhsieh99/assessment-attempts-20180128-121051-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessment && echo 'Produced 3280 messages.'"
```
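
The `jq '.[]' -c` step above splits the top-level JSON array into one compact JSON object per line, and `kafkacat -P` then publishes each line as its own message. A rough Python equivalent is sketched below for illustration only. The helper name `array_to_json_lines` and the two-record sample are hypothetical and are not taken from the real file.

```python
import json

def array_to_json_lines(raw_text):
    """Return one compact JSON string per element of a top-level JSON array."""
    records = json.loads(raw_text)
    # separators=(",", ":") drops the spaces, similar to jq's -c (compact) flag
    return [json.dumps(r, separators=(",", ":"), ensure_ascii=False) for r in records]

# Hypothetical sample input; the real file holds the full assessment records.
sample = '[{"exam_name": "Learning Git"}, {"exam_name": "Learning Eclipse"}]'
json_lines = array_to_json_lines(sample)
for line in json_lines:
    print(line)
```

Each printed line corresponds to one Kafka message, which is why the message count should match the number of elements in the array.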

### Run Spark

```
docker-compose exec spark pyspark
```

##### Read from kafka

```
messages = (
    spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:29092")
    .option("subscribe", "assessment")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)
```

##### Take a look

```
messages.printSchema()
messages.show()
```

##### Cache to cut back on warnings

```
messages.cache()
```

##### Cast as strings

```
assessments = messages.select(messages.value.cast('string'))
```

##### Using json

```
import json
first_message = json.loads(assessments.select('value').take(1)[0].value)
```

##### Save as parquet

```
assessments.write.parquet("/tmp/assessments")
```

##### Deal with unicode

```
import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
```

##### Unrolling json

```
extracted_assessments = assessments.rdd.map(lambda x: json.loads(x.value)).toDF()
```

##### Save as parquet

```
extracted_assessments.write.parquet("/tmp/extracted_assessments")
```

## Business Questions to answer

#### Assumptions

- I assume that each data point in the dataset is unique, meaning no assessment appears more than once.
- The dataset has 3280 data points, which may not represent every assessment (for example, some data points could be missing).
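
The uniqueness assumption could be checked rather than taken on faith. The sketch below counts how often each identifier occurs in a list of parsed records. It assumes that `keen_id` (a field that appears in the schema) identifies a data point, which the dataset does not confirm. The helper name `duplicated_ids` and the sample records are hypothetical.

```python
from collections import Counter

def duplicated_ids(records, id_field="keen_id"):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(record[id_field] for record in records)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical records; only the id field matters for this check.
sample_records = [{"keen_id": "a1"}, {"keen_id": "b2"}, {"keen_id": "a1"}]
duplicates = duplicated_ids(sample_records)
print(duplicates)
```

An empty result would support the assumption. A non-empty result would mean the 3280 figure overstates the number of distinct assessments.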

#### Business Questions

How many assessments are in the dataset?
- Using the `.count()` function, I found 3280 assessments.

What's the name of your Kafka topic? How did you come up with that name?
- The name of my Kafka topic is "assessment". The data is about assessments, so it is logical to name the topic after the dataset.

What is the schema?

```
extracted_assessments = assessments.rdd.map(lambda x: json.loads(x.value)).toDF()
extracted_assessments.printSchema()
```

```
root
 |-- base_exam_id: string (nullable = true)
 |-- certification: string (nullable = true)
 |-- exam_name: string (nullable = true)
 |-- keen_created_at: string (nullable = true)
 |-- keen_id: string (nullable = true)
 |-- keen_timestamp: string (nullable = true)
 |-- max_attempts: string (nullable = true)
 |-- sequences: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: map (containsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: boolean (valueContainsNull = true)
 |-- started_at: string (nullable = true)
 |-- user_exam_id: string (nullable = true)
```

- The schema contains deeply nested values (`sequences` is a map of arrays of maps), which can make some business questions difficult to answer.
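
One way to get oriented in a nested message is to list the path to every leaf value. The sketch below walks nested dictionaries and lists in plain Python. The helper name `key_paths` is hypothetical, and the inner key names in the sample are made up to mirror the shape of the schema, not copied from the data.

```python
def key_paths(value, prefix=""):
    """Yield a dotted path for every leaf value in nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from key_paths(child, prefix + "." + key if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from key_paths(child, "{}[{}]".format(prefix, index))
    else:
        yield prefix

# Hypothetical message shaped like the schema above; inner key names are made up.
sample_message = {
    "exam_name": "Learning Git",
    "sequences": {"items": [{"flag_a": True, "flag_b": False}]},
}
paths = list(key_paths(sample_message))
print(paths)
```

In the PySpark session, the same helper could be called on `first_message` from the "Using json" step to see which nested fields a real message carries.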

What are the top 10 most common courses?

```
extracted_assessments.registerTempTable('assessments')  # register the table queried below
spark.sql("select exam_name, count(exam_name) from assessments group by exam_name order by count(exam_name) desc").show(10)
```

```
+--------------------+----------------+                                         
|           exam_name|count(exam_name)|
+--------------------+----------------+
|        Learning Git|             394|
|Introduction to P...|             162|
|Introduction to J...|             158|
|Intermediate Pyth...|             158|
|Learning to Progr...|             128|
|Introduction to M...|             119|
|Software Architec...|             109|
|Beginning C# Prog...|              95|
|    Learning Eclipse|              85|
|Learning Apache M...|              80|
+--------------------+----------------+
```

How many people took *Learning Git*?
- From the previous table, there are 394 data points for *Learning Git*.

What are the 10 least common courses?

```
spark.sql("select exam_name, count(exam_name)  from assessments group by exam_name order by count(exam_name)").show(10)
```

```
+--------------------+----------------+                                         
|           exam_name|count(exam_name)|
+--------------------+----------------+
|Native Web Apps f...|               1|
|Nulls, Three-valu...|               1|
|Learning to Visua...|               1|
|Operating Red Hat...|               1|
|Client-Side Data ...|               2|
|The Closed World ...|               2|
|What's New in Jav...|               2|
|Arduino Prototypi...|               2|
|Hibernate and JPA...|               2|
|Understanding the...|               2|
+--------------------+----------------+
```
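
The two ranking queries above group by `exam_name`, count, and sort in opposite directions. The same logic is sketched below in plain Python as a cross-check on a small sample. The helper name `rank_exams` and the sample records are hypothetical, and the counts are not the real ones.

```python
from collections import Counter

def rank_exams(records, n=10, least_common=False):
    """Return up to n (exam_name, count) pairs, most common first by default."""
    counts = Counter(record["exam_name"] for record in records)
    # sorted() is stable, so exams with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=not least_common)
    return ranked[:n]

# Hypothetical sample; counts here are illustrative only.
names = ["Learning Git"] * 3 + ["Learning Eclipse"] * 2 + ["Exam X"]
sample_records = [{"exam_name": name} for name in names]
most = rank_exams(sample_records)
least = rank_exams(sample_records, n=1, least_common=True)
print(most)
print(least)
```

Ties are common at the low end (several courses have a count of 1 or 2), so which tied courses land in a "bottom 10" depends on the tie-breaking order, both here and in the SQL version.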

Was certification acquired?
- The meaning of true or false in the "certification" field is unclear, so I assume true means the certification was acquired after sufficiently completing the course.
- The dataset is titled "assessment-attempts", so I assume these assessments are meant for people to acquire some certification.

```
spark.sql("select * from assessments where certification = true").show(10)
```

- An empty table was returned, so I assume that all certification values are false and thus no certifications were acquired.
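
An empty result is also what the query would return if the field were absent in some records or stored in a form that never compares equal to true. A tally of the distinct values would separate these cases. The sketch below does this in plain Python over parsed records. The helper name `tally_field`, the `<missing>` label, and the sample records are hypothetical.

```python
from collections import Counter

def tally_field(records, field):
    """Count each distinct value of a field, with a bucket for absent keys."""
    return Counter(str(record.get(field, "<missing>")) for record in records)

# Hypothetical records; the real values should be read from the dataset.
sample_records = [
    {"certification": "false"},
    {"certification": "false"},
    {"exam_name": "Exam X"},
]
certification_counts = tally_field(sample_records, "certification")
print(certification_counts)
```

If the real tally showed only "false", the conclusion above would hold. Any other bucket would call for a closer look before concluding that no certifications were acquired.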