# Spark SQL - Part One

In this part you will use the Spark SQL API to load and query JSON data.

Say you need to create a daily report that lists all employees of your company and the number of pushes they’ve made to GitHub. You can implement this using the GitHub archive site ([www.githubarchive.org](www.githubarchive.org)), put together by Ilya Grigorik from Google (with the help of the GitHub folks), where you can download GitHub archives for arbitrary time periods. You can download a single day from the archive and use it as a sample for building the daily report.

Type the following commands in your VM's terminal:

```
mkdir -p $HOME/sia/github-archive
cd $HOME/sia/github-archive
wget http://data.githubarchive.org/2015-03-01-{0..23}.json.gz
gunzip *
```
After a few seconds, you’re left with 24 JSON files, downloaded and decompressed.
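
The `{0..23}` in the `wget` command is bash brace expansion: the shell turns the single argument into 24 URLs, one per hour of the day. As a minimal sketch of what gets requested, the same list can be built in plain Python (the function name `hourly_archive_urls` is made up for this illustration; the host and date come from the command above):

```python
# Build the same 24 URLs that bash expands from 2015-03-01-{0..23}.json.gz.
BASE_URL = "http://data.githubarchive.org"  # host taken from the wget command
DAY = "2015-03-01"                          # date taken from the wget command


def hourly_archive_urls(day=DAY, base_url=BASE_URL):
    """Return one archive URL per hour (0 through 23) for the given day."""
    # The hour is not zero-padded, matching the {0..23} expansion.
    return [f"{base_url}/{day}-{hour}.json.gz" for hour in range(24)]


urls = hourly_archive_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```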

To preview the first record of a file, use the following:
```
head -n 1 2015-03-01-0.json
```
Or to make it more readable:
```
head -n 1 2015-03-01-0.json | jq '.'
```
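
If `jq` is not installed, roughly the same preview can be produced with Python's standard `json` module, since each line of the archive files is one JSON record. The sketch below is only an illustration: the helper name `pretty_first_record` and the sample record are made up, and a temporary file stands in for `2015-03-01-0.json`.

```python
import json
import os
import tempfile


def pretty_first_record(path):
    """Parse the first line of a JSON-lines file and return it indented."""
    with open(path, encoding="utf-8") as f:
        first_line = f.readline()
    return json.dumps(json.loads(first_line), indent=2)


# Made-up record, used only so the sketch runs without the downloaded files.
sample = {"type": "PushEvent", "actor": {"login": "example-user"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    preview = pretty_first_record(path)

print(preview)
```

On the VM you would call `pretty_first_record("2015-03-01-0.json")` from the download directory instead.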

Create a SparkSession by executing the following cell:

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark course - Spark SQL").\
    master("local[*]").enableHiveSupport().getOrCreate()

sc = spark.sparkContext

Now load the JSON files into a DataFrame called `ghLog` using the `spark` object you just created (you can use a `*` wildcard in the path to match all of the files). You can find the documentation here: [SparkSession](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession).

In [2]:
ghLog = spark.read.json("/home/spark/sia/github-archive/*.json")

Examine the schema of the DataFrame containing the parsed JSON data.

In [3]:
ghLog.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nullable = true)
 |    |    |    |    |-- href: strin

Filter the log entries so that only push events remain. Push events have the `type` equal to `PushEvent`.

You can reference a DataFrame's column in PySpark in three ways: by attribute (in this case `ghLog.type`), by indexing (in this case `ghLog['type']`), or, if you don't have a reference to the column's DataFrame, with the `col` function from `pyspark.sql.functions` (`col('type')`).

In [3]:
pushes = ghLog.filter(ghLog.type == "PushEvent")

How many pushes are in the dataset and how many events are in the dataset overall?

In [5]:
print(ghLog.count())
print(pushes.count())

438018
216461
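
Because every line in the archive files is a separate JSON record, the two counts can be cross-checked without Spark. The following is a rough sketch of what `filter` plus `count` computes, not a replacement for the DataFrame code: the helper name `count_events` and the tiny sample files are made up for the illustration.

```python
import glob
import json
import os
import tempfile


def count_events(pattern, event_type="PushEvent"):
    """Return (total_events, matching_events) over all files matching pattern."""
    total = matching = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # ignore blank lines
                total += 1
                if json.loads(line).get("type") == event_type:
                    matching += 1
    return total, matching


# Made-up miniature dataset: two files, three events, two of them pushes.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample-0.json"), "w", encoding="utf-8") as f:
        f.write('{"type": "PushEvent"}\n{"type": "ForkEvent"}\n')
    with open(os.path.join(tmp, "sample-1.json"), "w", encoding="utf-8") as f:
        f.write('{"type": "PushEvent"}\n')
    total, n_pushes = count_events(os.path.join(tmp, "*.json"))

print(total, n_pushes)  # 3 2
```

Calling `count_events("/home/spark/sia/github-archive/*.json")` on the VM should reproduce the two numbers printed by the cell above, only much more slowly, since the files are read sequentially in a single process.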


Examine the first five push events from the push events DataFrame.

In [6]:
pushes.show(5)

+--------------------+--------------------+----------+----+--------------------+------+--------------------+---------+
|               actor|          created_at|        id| org|             payload|public|                repo|     type|
+--------------------+--------------------+----------+----+--------------------+------+--------------------+---------+
|[https://avatars....|2015-03-01T17:00:00Z|2615567685|null|[, c54b6c83c00a98...|  true|[30983183, Chen-H...|PushEvent|
|[https://avatars....|2015-03-01T17:00:01Z|2615567688|null|[, ca98f961baba82...|  true|[27672998, nakal/...|PushEvent|
|[https://avatars....|2015-03-01T17:00:01Z|2615567689|null|[, 4e25e55b8f0791...|  true|[26529161, chemis...|PushEvent|
|[https://avatars....|2015-03-01T17:00:02Z|2615567707|null|[, 8a7835c6f74f0e...|  true|[31498865, adaam2...|PushEvent|
|[https://avatars....|2015-03-01T17:00:02Z|2615567713|null|[, 07631d80df1bd9...|  true|[30685399, biotru...|PushEvent|
+--------------------+--------------------+-----

What other kinds of events exist in the dataset? (Hint: use `distinct`)

In [8]:
ghLog.select(ghLog.type).distinct().show(50, False)

+-----------------------------+
|type                         |
+-----------------------------+
|PushEvent                    |
|GollumEvent                  |
|ReleaseEvent                 |
|CommitCommentEvent           |
|CreateEvent                  |
|PullRequestReviewCommentEvent|
|IssueCommentEvent            |
|DeleteEvent                  |
|IssuesEvent                  |
|ForkEvent                    |
|PublicEvent                  |
|MemberEvent                  |
|WatchEvent                   |
|PullRequestEvent             |
+-----------------------------+



Next, find the number of pushes per username, which is represented by the `actor.login` field. (Hint: use `groupBy`).

In [4]:
grouped = pushes.groupBy("actor.login").count()

See the first five rows of the grouped DataFrame.

In [10]:
grouped.show(5)

+--------------------+-----+
|               login|count|
+--------------------+-----+
|             nmaletm|    5|
|fahadnasrullaharb...|    3|
|        lpsandaruwan|    3|
|            emilysas|   25|
|       jhollingworth|    9|
+--------------------+-----+
only showing top 5 rows



Find the 5 users with the highest number of pushes. (Hint: use `orderBy`).

In [5]:
ordered = grouped.orderBy(grouped['count'].desc())
ordered.show(5)

+------------------+-----+
|             login|count|
+------------------+-----+
|diversify-exp-user| 3202|
|             rydnr| 1862|
|     KenanSulayman| 1727|
|      greatfirebot| 1254|
|    mirror-updates| 1008|
+------------------+-----+
only showing top 5 rows
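
For intuition, the `groupBy` / `count` / `orderBy` pipeline does what `collections.Counter` does for an in-memory list, except that Spark distributes the work. A small sketch with made-up logins (not taken from the dataset):

```python
from collections import Counter

# Made-up logins standing in for the actor.login value of each push event.
logins = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Comparable to pushes.groupBy("actor.login").count()
counts = Counter(logins)

# Comparable to ordering by count descending and showing the first 2 rows
top = counts.most_common(2)
print(top)  # [('alice', 3), ('bob', 2)]
```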



This includes all GitHub users and you only need users from your company. The list of your company's employees' GitHub usernames can be found in `first-edition/ch03/ghEmployees.txt` under your home directory.

Load the lines from that file into a list.

In [6]:
# The with statement closes the file automatically, so no explicit close() is needed.
with open("/home/spark/first-edition/ch03/ghEmployees.txt", 'r') as empf:
    employees = [line.strip() for line in empf]

Now you need to filter out the GitHub pushes of users who are not in the username collection. Just iterating through the pushes and querying the collection would force Spark to send the collection out to the executors several times. A much more efficient method is to broadcast the collection, so that each executor receives it only once. (Note that Spark might broadcast a table automatically when joining two tables if it can determine the table's size and that size is smaller than `spark.sql.autoBroadcastJoinThreshold`.)

So, do that now and save the broadcast variable.

In [7]:
empBroad = sc.broadcast(employees)

Now create a function that takes a single string parameter and returns `True` if the input exists in the broadcast collection and `False` otherwise. Then register that function as a Spark UDF (you will have to specify the return type, too).

In [8]:
from pyspark.sql.types import BooleanType

def isEmp(user):
    return user in empBroad.value

isEmployee = spark.udf.register("isEmpUdf", isEmp, BooleanType())
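
The body of `isEmp` is ordinary Python, so its logic can be tried without Spark. One detail worth noting: `user in some_list` scans the list on every call, whereas a set answers the same question with a hash lookup. The sketch below shows that variant; `make_is_employee` is a hypothetical helper, and the usernames are a few of those appearing in the outputs of this notebook. In the notebook itself, one could presumably pass a set instead of a list to `sc.broadcast`, since broadcast values only need to be picklable.

```python
def make_is_employee(usernames):
    """Return a predicate that checks membership in a frozen set of usernames."""
    # Strip whitespace and drop empty entries (for example, blank lines in a file).
    known = frozenset(name.strip() for name in usernames if name.strip())

    def is_employee(user):
        return user in known

    return is_employee


is_emp = make_is_employee(["KenanSulayman", "lukeis", "  keum \n", ""])
print(is_emp("lukeis"), is_emp("keum"), is_emp("rydnr"))  # True True False
```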

Finally, use the registered UDF to filter the ordered DataFrame (the one in which you found the users with the highest number of pushes). Hint: whereas the plain function takes an ordinary Python value as its argument, the registered UDF takes a column.

In [9]:
filtered = ordered.filter(isEmployee(ordered.login))
filtered.show()

+---------------+-----+
|          login|count|
+---------------+-----+
|  KenanSulayman| 1727|
|direwolf-github|  561|
|         lukeis|  288|
|           keum|  192|
|        chapuni|  184|
|     manuelrp07|  104|
|         shryme|  101|
|            uqs|   90|
|   jeff1evesque|   79|
|        BitKiwi|   68|
|         qeremy|   66|
|        Somasis|   59|
|         jvodan|   57|
|     BhawanVirk|   55|
|       Valicek1|   53|
|      evelynluu|   49|
|  TheRingMaster|   47|
|         digipl|   42|
|   larperdoodle|   42|
|      jmarkkula|   39|
+---------------+-----+
only showing top 20 rows



Save the results in Parquet format in your home directory (use the DataFrame's `write` property, which returns a `DataFrameWriter`).

In [13]:
filtered.write.parquet("/home/spark/jupyter4out")

Use the Linux shell to examine the output.
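
Spark writes Parquet output as a directory rather than a single file: it typically contains one `part-*` file per partition and an empty `_SUCCESS` marker. If you prefer to stay in Python, a sketch like the following lists the visible files and their sizes (`summarize_output` is a made-up helper; it returns an empty list when the directory does not exist):

```python
from pathlib import Path


def summarize_output(directory):
    """Return sorted (name, size_in_bytes) pairs for the visible files in a directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.iterdir()
        if p.is_file() and not p.name.startswith(".")  # skip hidden checksum files
    )


# Path taken from the write.parquet call above.
listing = summarize_output("/home/spark/jupyter4out")
for name, size in listing:
    print(f"{size:>10}  {name}")
```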

Now save the same DataFrame as a Spark table. Choose any table name you wish.

In [10]:
filtered.write.saveAsTable("topemployees")

Using your Linux terminal, examine the contents of the `spark-warehouse` directory (located in the folder from which you started the Jupyter notebook).

Now use a standard SQL query to find the users with more than 100 GitHub pushes.

In [12]:
spark.sql("select * from topemployees where count > 100").show()

+---------------+-----+
|          login|count|
+---------------+-----+
|     manuelrp07|  104|
|  KenanSulayman| 1727|
|direwolf-github|  561|
|        chapuni|  184|
|         lukeis|  288|
|         shryme|  101|
|           keum|  192|
+---------------+-----+
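
Note that the rows above come back in no particular order, because the query has no `ORDER BY` clause. To experiment with the query logic outside Spark, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in. SQLite is not Spark SQL, so treat this only as a way to try the `WHERE` and `ORDER BY` clauses; the four rows are copied from the filtered output shown earlier, and `count` is quoted because it is also the name of an SQL function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE topemployees (login TEXT, "count" INTEGER)')

# A few rows copied from the filtered DataFrame output earlier in this notebook.
rows = [("KenanSulayman", 1727), ("manuelrp07", 104), ("shryme", 101), ("uqs", 90)]
conn.executemany("INSERT INTO topemployees VALUES (?, ?)", rows)

# Same condition as the Spark SQL query, plus an explicit ordering.
result = conn.execute(
    'SELECT login, "count" FROM topemployees '
    'WHERE "count" > 100 ORDER BY "count" DESC'
).fetchall()
conn.close()

print(result)  # [('KenanSulayman', 1727), ('manuelrp07', 104), ('shryme', 101)]
```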

