# 1. Ingest and Explore Data

**Goal**: Load the raw GitHub Archive JSON data and explore its complex nested structure.

---

In [1]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("ProjectSpark-Ingest") \
    .getOrCreate()

print("Spark Session Created: ", spark.version)

Spark Session Created:  3.5.0


## Load Raw JSON Data
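We load one archive file from `../data/raw`. The data is in NDJSON (newline-delimited JSON) format: one event object per line, which is the layout `spark.read.json` expects by default.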

In [2]:
input_path = "../data/raw/2015-01-01-15.json"

# Read line-delimited JSON; the schema is inferred from the data
df_raw = spark.read.json(input_path)

print(f"Loaded data from {input_path}")

Loaded data from ../data/raw/2015-01-01-15.json
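
NDJSON stores one complete JSON object per line, with no enclosing array, so a file can be parsed line by line. The plain-Python sketch below illustrates that layout only. The two sample records and the helper `parse_ndjson` are hypothetical and far smaller than real GitHub Archive events.

```python
import json

# Two made-up events, one JSON object per line (values are illustrative only).
sample_ndjson = (
    '{"id": "1", "type": "PushEvent", "actor": {"login": "alice"}}\n'
    '{"id": "2", "type": "WatchEvent", "actor": {"login": "bob"}}\n'
)

def parse_ndjson(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_ndjson(sample_ndjson)
print(len(records), [r["type"] for r in records])
```

Because every line is independent, a reader such as Spark can split the file across tasks without parsing the whole document first.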


## Explore Schema
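Note the nested structs in `actor`, `repo`, and especially `payload`. Spark infers a single `payload` struct that merges the fields of every event type, so most of its fields are null on any given row.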

In [3]:
df_raw.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nullable = true)
 |    |    |    |    |-- href: strin
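
Nested fields like the ones in this schema are addressed with dotted paths such as `actor.login`. As a rough illustration of how those paths map onto the nesting, here is a plain-Python sketch that walks a nested dict and lists the path to every leaf. The `sample_event` record and the `dotted_paths` helper are hypothetical, not part of the dataset or of Spark.

```python
def dotted_paths(record, prefix=""):
    """Yield the dotted path to every leaf field of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the path
            yield from dotted_paths(value, path)
        else:
            yield path

# Made-up event that mimics a small slice of the schema above
sample_event = {
    "id": "1",
    "actor": {"id": 42, "login": "alice"},
    "payload": {"comment": {"_links": {"html": {"href": "https://example.com"}}}},
}

paths = sorted(dotted_paths(sample_event))
print(paths)
```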

In [4]:
record_count = df_raw.count()
print(f"Total Records: {record_count}")

Total Records: 11351


In [5]:
# Show top 5 records (truncated)
df_raw.show(5)

+--------------------+--------------------+----------+--------------------+--------------------+------+--------------------+-----------+
|               actor|          created_at|        id|                 org|             payload|public|                repo|       type|
+--------------------+--------------------+----------+--------------------+--------------------+------+--------------------+-----------+
|{https://avatars....|2015-01-01T15:00:00Z|2489651045|                NULL|{NULL, NULL, NULL...|  true|{28688495, petroa...|CreateEvent|
|{https://avatars....|2015-01-01T15:00:01Z|2489651051|                NULL|{NULL, 437c03652c...|  true|{28671719, rspt/r...|  PushEvent|
|{https://avatars....|2015-01-01T15:00:01Z|2489651053|                NULL|{NULL, 590433109f...|  true|{28270952, izuzer...|  PushEvent|
|{https://avatars....|2015-01-01T15:00:03Z|2489651057|{https://avatars....|{started, NULL, N...|  true|{2871998, visionm...| WatchEvent|
|{https://avatars....|2015-01-01T15:00:03

## Filter for PushEvents
For this project, we are mostly interested in `PushEvent` records, which contain commits.

In [6]:
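# Keep only push events; bracket access avoids clashing with DataFrame attributes
df_push = df_raw.filter(df_raw["type"] == "PushEvent")
print(f"Push Events: {df_push.count()}")
# Dotted paths select fields inside the nested structs
df_push.select("id", "type", "actor.login", "repo.name").show(5)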

Push Events: 5815
+----------+---------+-------------+--------------------+
|        id|     type|        login|                name|
+----------+---------+-------------+--------------------+
|2489651051|PushEvent|         rspt|     rspt/rspt-theme|
|2489651053|PushEvent|      izuzero|izuzero/xe-module...|
|2489651062|PushEvent|     winterbe|   winterbe/streamjs|
|2489651063|PushEvent|hermanwahyudi|hermanwahyudi/sel...|
|2489651064|PushEvent|        jdilt|jdilt/jdilt.githu...|
+----------+---------+-------------+--------------------+
only showing top 5 rows
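
To make the "contains commits" point concrete, here is a small plain-Python sketch of the same filter applied to made-up records. It assumes each push payload carries a `commits` list; that field is not visible in the truncated schema above, so treat the layout and all sample values as illustrative assumptions.

```python
# Hypothetical events; only the fields needed for the illustration are present.
events = [
    {"type": "PushEvent", "actor": {"login": "alice"},
     "payload": {"commits": [{"sha": "a1", "message": "fix"},
                             {"sha": "b2", "message": "docs"}]}},
    {"type": "WatchEvent", "actor": {"login": "bob"},
     "payload": {"action": "started"}},
    {"type": "PushEvent", "actor": {"login": "carol"},
     "payload": {"commits": []}},
]

# Same predicate as the Spark filter: keep rows whose type is PushEvent
push_events = [e for e in events if e["type"] == "PushEvent"]

# Count commits per pushing user (assumed field: payload.commits)
commits_per_push = {
    e["actor"]["login"]: len(e["payload"].get("commits", []))
    for e in push_events
}
print(len(push_events), commits_per_push)
```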

