## Prepare table in raw layer

Let us prepare the table in the raw layer. It will act as the golden copy of our source data.
* The files in the landing folder are in JSON format (gzip-compressed).
* No partitioning strategy is applied when the files are copied to the landing zone.
* Parquet is preferred over other file formats for storing data in the Data Lake storage layer (HDFS in this case).
* Daily partitions are generally preferred so that the data can be processed further incrementally.

Here are the steps we are going to follow. For now, we will run them manually; later, we need to automate and orchestrate them.
* Make sure the files are available in the landing zone.
* Read the data from the JSON files in the landing zone and create a DataFrame.
* Add the columns required by the partitioning strategy. We are going to partition by year, then month, then day, derived from the `created_at` field of the input data.
* Partition the DataFrame by year, month, and day, and then write it to the target table in the **{username}_ghraw_db** database.
* Make sure the data is accessible using Spark SQL queries.
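
The partitioning step can be sketched in plain Python before running it on Spark. This is only an illustration: the helper names `partition_values` and `partition_path` are hypothetical, and the sample timestamp follows the `created_at` format of the input data. Note that Python slices are 0-based, while Spark's `substring` is 1-based.

```python
def partition_values(created_at: str) -> dict:
    """Split a timestamp such as '2021-01-13T15:00:00Z' into partition values."""
    return {
        'year': created_at[0:4],    # same as substring('created_at', 1, 4)
        'month': created_at[5:7],   # same as substring('created_at', 6, 2)
        'day': created_at[8:10],    # same as substring('created_at', 9, 2)
    }


def partition_path(created_at: str) -> str:
    """Build the Hive-style directory suffix that a record is written under."""
    return '/'.join(f'{k}={v}' for k, v in partition_values(created_at).items())


print(partition_path('2021-01-13T15:00:00Z'))  # year=2021/month=01/day=13
```

The values stay as zero-padded strings, which is why the partition directories read `month=01` rather than `month=1`.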

In [1]:
%%sh

hdfs dfs -ls /user/${USER}/itvgithub/landing

Found 96 items
-rw-r--r--   3 itv001477 supergroup   47825349 2021-12-05 01:05 /user/itv001477/itvgithub/landing/2021-01-13-0.json.gz
-rw-r--r--   3 itv001477 supergroup   45560145 2021-12-05 01:06 /user/itv001477/itvgithub/landing/2021-01-13-1.json.gz
-rw-r--r--   3 itv001477 supergroup   71293671 2021-12-05 01:05 /user/itv001477/itvgithub/landing/2021-01-13-10.json.gz
-rw-r--r--   3 itv001477 supergroup   65318647 2021-12-05 01:05 /user/itv001477/itvgithub/landing/2021-01-13-11.json.gz
-rw-r--r--   3 itv001477 supergroup   65044936 2021-12-05 01:06 /user/itv001477/itvgithub/landing/2021-01-13-12.json.gz
-rw-r--r--   3 itv001477 supergroup   77894277 2021-12-05 01:06 /user/itv001477/itvgithub/landing/2021-01-13-13.json.gz
-rw-r--r--   3 itv001477 supergroup   81246956 2021-12-05 01:06 /user/itv001477/itvgithub/landing/2021-01-13-14.json.gz
-rw-r--r--   3 itv001477 supergroup   85821693 2021-12-05 01:06 /user/itv001477/itvgithub/landing/2021-01-13-15.json.gz
-rw-r--r--   3 itv001477 su

In [2]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Analyze GitHub Archive Data'). \
    master('yarn'). \
    getOrCreate()

In [3]:
spark.conf.set('spark.sql.shuffle.partitions', 8)

In [5]:
process_dt = '2021-01-13'
ghdata = spark. \
    read. \
    json(f'/user/{username}/itvgithub/landing/{process_dt}-*.json.gz')
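
The read above relies on a glob pattern to pick up only the files for one day. As a rough sketch, Python's `fnmatch` can stand in for the Hadoop glob matching to show which names the pattern selects. The file list is an assumption for illustration: the first two names follow the landing-zone listing, and the `2021-01-14` entry is hypothetical.

```python
from fnmatch import fnmatch

process_dt = '2021-01-13'
pattern = f'{process_dt}-*.json.gz'

# Sample names; only the ones that start with process_dt should match.
names = ['2021-01-13-0.json.gz', '2021-01-13-10.json.gz', '2021-01-14-0.json.gz']
matched = [name for name in names if fnmatch(name, pattern)]

print(matched)  # ['2021-01-13-0.json.gz', '2021-01-13-10.json.gz']
```

Because the pattern is anchored on `process_dt`, rerunning the cell with a different date reads a different day's files without touching the rest of the code.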

In [6]:
ghdata.show()

+--------------------+--------------------+-----------+--------------------+--------------------+------+--------------------+-----------------+
|               actor|          created_at|         id|                 org|             payload|public|                repo|             type|
+--------------------+--------------------+-----------+--------------------+--------------------+------+--------------------+-----------------+
|[https://avatars....|2021-01-13T15:00:00Z|14806766244|                null|[opened,,,,,,,,,,...|  true|[328163487, heret...| PullRequestEvent|
|[https://avatars....|2021-01-13T15:00:00Z|14806766250|[https://avatars....|[created,, [, COL...|  true|[155705622, conta...|IssueCommentEvent|
|[https://avatars....|2021-01-13T15:00:00Z|14806766257|                null|[,,,, A robot pow...|  true|[329342078, gmone...|      CreateEvent|
|[https://avatars....|2021-01-13T15:00:00Z|14806766262|[https://avatars....|[, bce214068f9be7...|  true|[264190944, Lombi...|        Pus

In [7]:
from pyspark.sql.functions import substring, col

In [8]:
spark.sql(f'CREATE DATABASE IF NOT EXISTS {username}_ghraw_db')

In [9]:
spark.sql(f'DROP TABLE IF EXISTS {username}_ghraw_db.ghactivity')

In [10]:
%%sh

hdfs dfs -ls /user/${USER}/warehouse/${USER}_ghraw_db.db

In [11]:
%%sh

hdfs dfs -rm -R -f -skipTrash /user/${USER}/warehouse/${USER}_ghraw_db.db/ghactivity

In [12]:
ghdata.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- display_login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nul

In [13]:
ghdata. \
    withColumn('year', substring('created_at', 1, 4)). \
    withColumn('month', substring('created_at', 6, 2)). \
    withColumn('day', substring('created_at', 9, 2)). \
    select('repo.*', 'actor.*', 'org.*', 'created_at', 'year', 'month', 'day'). \
    show()

+---------+--------------------+--------------------+--------------------+------------------+-----------+--------+------------------+--------------------+--------------------+-----------+--------+----------+--------------------+--------------------+----+-----+---+
|       id|                name|                 url|          avatar_url|     display_login|gravatar_id|      id|             login|                 url|          avatar_url|gravatar_id|      id|     login|                 url|          created_at|year|month|day|
+---------+--------------------+--------------------+--------------------+------------------+-----------+--------+------------------+--------------------+--------------------+-----------+--------+----------+--------------------+--------------------+----+-----+---+
|328163487|        heretere/hdl|https://api.githu...|https://avatars.g...|      imaginedevMC|           |70789387|      imaginedevMC|https://api.githu...|                null|       null|    null|      nul

In [14]:
ghdata = ghdata. \
    withColumn('year', substring('created_at', 1, 4)). \
    withColumn('month', substring('created_at', 6, 2)). \
    withColumn('day', substring('created_at', 9, 2))

In [15]:
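# Parquet is Spark's default format; it is named explicitly to match the storage choice.
ghdata. \
    write. \
    format('parquet'). \
    partitionBy('year', 'month', 'day'). \
    saveAsTable(f'{username}_ghraw_db.ghactivity')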

In [16]:
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
+-------------------------+
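
A partition string like the one above can be turned back into a date to confirm that the expected day was loaded. This is a minimal sketch in plain Python; `parse_partition` is a hypothetical helper, not part of Spark.

```python
def parse_partition(spec: str) -> dict:
    """Turn 'year=2021/month=01/day=13' into {'year': '2021', 'month': '01', 'day': '13'}."""
    return dict(part.split('=', 1) for part in spec.split('/'))


parts = parse_partition('year=2021/month=01/day=13')
loaded_dt = f"{parts['year']}-{parts['month']}-{parts['day']}"

print(loaded_dt)  # 2021-01-13
```

Comparing `loaded_dt` with `process_dt` is one simple way to check the load once the process is automated.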



In [17]:
%%sh

hdfs dfs -ls -R /user/${USER}/warehouse/${USER}_ghraw_db.db/ghactivity

-rw-r--r--   3 itv001477 supergroup          0 2021-12-06 01:04 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/_SUCCESS
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 00:55 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 00:55 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 01:04 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13
-rw-r--r--   3 itv001477 supergroup  172701438 2021-12-06 00:56 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13/part-00000-6e194789-615a-4cdf-b0dd-68f0e575d685.c000.snappy.parquet
-rw-r--r--   3 itv001477 supergroup  176429914 2021-12-06 00:56 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13/part-00001-6e194789-615a-4cdf-b0dd-68f0e575d685.c000.snappy.parquet
-rw-r--r--

In [19]:
spark.sql(f'''
    SELECT substring(created_at, 1, 10) AS created_dt, count(1)
    FROM {username}_ghraw_db.ghactivity
    GROUP BY created_dt
    ORDER BY created_dt
'''). \
    show()

+----------+--------+
|created_dt|count(1)|
+----------+--------+
|2021-01-13| 2829111|
+----------+--------+
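
The query above groups on `created_at`, so it has to read every partition. Filtering on the partition columns lets Spark read only the matching directories. The sketch below only builds the SQL text; `daily_count_query` is a hypothetical helper, and it assumes the partition columns hold zero-padded strings, as produced by `substring` earlier.

```python
def daily_count_query(username: str, process_dt: str) -> str:
    """Build a count query restricted to one daily partition."""
    year, month, day = process_dt.split('-')
    return f'''
    SELECT count(1) AS event_count
    FROM {username}_ghraw_db.ghactivity
    WHERE year = '{year}' AND month = '{month}' AND day = '{day}'
    '''


query = daily_count_query('itv001477', '2021-01-13')
print(query)
# With a live session: spark.sql(query).show()
```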

