<a href="https://colab.research.google.com/github/suriarasai/BEAD2025/blob/main/colab/05a_Log_Analytics_Using_RDD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Scenario and Data Set

We will process the well-known NASA web server logs from July 1995. Each line in the log file represents one request made to the server and follows the Common Log Format.

Our goals:

1. Read the raw log file into an RDD.
2. Parse each line to extract the HTTP status code (e.g., 200 for OK, 404 for Not Found) and the request URL.
3. Filter out any malformed log entries that don't parse correctly.
4. Count the number of times each HTTP status code appears in the entire log.

### Setup: Getting the Data

We need to download the public dataset. We can run the following shell commands from a code cell, using the ! prefix, to fetch the data.

In [None]:
# These commands download the compressed log file and decompress it.
# -f lets gunzip overwrite an existing output file if the cell is re-run.
!wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz -O NASA_access_log_Jul95.gz
!gunzip -f NASA_access_log_Jul95.gz

--2025-08-16 03:52:49--  ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
           => ‘NASA_access_log_Jul95.gz’
Resolving ita.ee.lbl.gov (ita.ee.lbl.gov)... 131.243.2.164, 2620:83:8000:102::a4
Connecting to ita.ee.lbl.gov (ita.ee.lbl.gov)|131.243.2.164|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /traces ... done.
==> SIZE NASA_access_log_Jul95.gz ... 20676672
==> PASV ... done.    ==> RETR NASA_access_log_Jul95.gz ... done.
Length: 20676672 (20M) (unauthoritative)


2025-08-16 03:52:53 (7.16 MB/s) - ‘NASA_access_log_Jul95.gz’ saved [20676672]



This will create a file named NASA_access_log_Jul95 in the workspace.
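The decompression step can also be done from Python with the standard library, which may be handy outside Colab where the ! prefix is not available. The sketch below is an illustration only; gunzip_file is a hypothetical helper name, and unlike gunzip it leaves the .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest):
    """Decompress the gzip file at src into dest, overwriting dest if it exists.

    Returns the size of the decompressed file in bytes.
    """
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        # copyfileobj streams in chunks, so the whole log never sits in memory.
        shutil.copyfileobj(compressed, out)
    return Path(dest).stat().st_size


# Usage, assuming the download cell has already produced the .gz file:
# gunzip_file("NASA_access_log_Jul95.gz", "NASA_access_log_Jul95")
```

Because the output file is opened in write mode, re-running the helper simply replaces the previous copy.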

### PySpark Setup

The cell below installs OpenJDK 21, points JAVA_HOME at it, installs pyspark with pip, and then starts a Spark session.

*Note: JAVA_HOME must be set before the Spark session is created, because PySpark uses it to locate the JVM.*


In [None]:
import os

# 1. Install OpenJDK 21 (if not already done in a previous cell)
!apt-get update -qq
!apt-get install -qq openjdk-21-jdk-headless

# 2. Verify where it landed (if needed)
!ls /usr/lib/jvm | grep 21

# 3. Point to JDK 21
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# 4. Install PySpark via pip (JAVA_HOME must be set before Spark starts in step 5)
!pip install pyspark --quiet

# 5. Import and start Spark
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
      .master("local[*]")
      .appName("Log Analytics Spark on Java21")
      .getOrCreate()
)



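In PySpark, a Spark session is created through SparkSession.builder. Because the setup cell above already started a session, getOrCreate() returns that existing session here instead of starting a new one: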

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Log Analytics").getOrCreate()

### Log Processing

Set the Regular Expression Pattern

In [None]:
import re
# A regular expression to parse the Common Log Format.
# Example: 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
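To see what each of the nine capture groups picks up, the pattern can be tried on the example line from the comment above. This is a small sketch; the names in FIELD_NAMES are labels chosen for this illustration and are not part of the pattern, which only uses numbered groups.

```python
import re

LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'

# Illustrative labels for groups 1 to 9 (an assumption, not defined by the pattern).
FIELD_NAMES = [
    "host", "identity", "user", "timestamp",
    "method", "url", "protocol", "status", "size",
]

sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

match = re.search(LOG_PATTERN, sample)
fields = dict(zip(FIELD_NAMES, match.groups()))

# Print each group number next to its label and captured text.
for position, (name, value) in enumerate(fields.items(), start=1):
    print(f"group {position}: {name:<9} = {value}")
```

Groups 6 and 8 are the URL and the status code, which are the two values the rest of the notebook relies on.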

Function to parse a log line. Returns a (status_code, url) tuple, or None if the line does not match the pattern.

In [None]:
def parse_log_line(line):
    match = re.search(LOG_PATTERN, line)
    if match:
        # We are interested in the status code (group 8) and the URL (group 6)
        status_code = int(match.group(8))
        url = match.group(6)
        return (status_code, url)
    else:
        return None
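Before running the parser over the whole file, it can help to try it on a few lines locally. The sketch below repeats the pattern and the function so that it runs on its own; the first sample line is the example from the pattern cell, while the other two are made up for illustration.

```python
import re

LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'


def parse_log_line(line):
    """Return (status_code, url) for a well-formed line, else None."""
    match = re.search(LOG_PATTERN, line)
    if match:
        return (int(match.group(8)), match.group(6))
    return None


well_formed = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
# Hypothetical line whose size field is "-"; the final (\S+) still accepts it.
no_size = 'example.host - - [01/Jul/1995:00:00:02 -0400] "GET /index.html HTTP/1.0" 304 -'
# Hypothetical malformed line that the pattern rejects.
garbage = 'this is not a log line'

results = [parse_log_line(line) for line in (well_formed, no_size, garbage)]
# Same idea as the RDD filter step: drop the entries that failed to parse.
valid = [r for r in results if r is not None]
print(results)
print(valid)
```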

Read the text file into an RDD

In [None]:
# Read Log File
log_file_path = "/content/NASA_access_log_Jul95"
# Each line of the file becomes an element in the RDD.
log_rdd = spark.sparkContext.textFile(log_file_path)

Data Munging

In [None]:
parsed_logs_rdd = log_rdd.map(parse_log_line)
# Filter out the lines that failed to parse (returned None)
valid_logs_rdd = parsed_logs_rdd.filter(lambda x: x is not None)

# The RDD is now structured as (status_code, url).
# For our goal, we just need the status code.
# map() -> (200, 1), (404, 1), (200, 1), ...
status_counts_rdd = valid_logs_rdd.map(lambda x: (x[0], 1))

Log Consolidation

In [None]:
# Count the occurrences of each status code
# reduceByKey() aggregates all values for a given key.
# For key 200, it will compute: (200, 1+1+1+...)
status_counts = status_counts_rdd.reduceByKey(lambda x, y: x + y)
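The map and reduceByKey steps can be mimicked in plain Python to see what they compute. This is only a local analogy on made-up records, not how Spark executes the job: Spark combines values within each partition first and then across partitions, which is why the function passed to reduceByKey should be associative and commutative.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Made-up parsed records in the same (status_code, url) shape as valid_logs_rdd.
records = [(200, "/a"), (404, "/missing"), (200, "/b"), (304, "/a"), (200, "/c")]

# map step: keep only the status code and attach a count of 1.
pairs = [(status, 1) for status, _url in records]


def add(x, y):
    """The same binary function that the notebook passes to reduceByKey."""
    return x + y


# reduceByKey step, emulated locally: sort so equal keys are adjacent,
# group by key, then fold each group's values with add.
pairs.sort(key=itemgetter(0))
status_counts = {
    key: reduce(add, (value for _key, value in group))
    for key, group in groupby(pairs, key=itemgetter(0))
}
print(status_counts)
```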

Collect Results and Print

In [None]:
# Collect and print results
# takeOrdered returns the smallest elements by key, so negating the count
# gives the most frequent status codes first (at most 10 of them).
top_10_status_codes = status_counts.takeOrdered(10, key=lambda x: -x[1])
print("--- Top 10 HTTP Status Code Counts ---")
for status, count in top_10_status_codes:
    print(f"Status Code: {status}, Count: {count}")

--- Top 10 HTTP Status Code Counts ---
Status Code: 200, Count: 1700743
Status Code: 304, Count: 132626
Status Code: 302, Count: 46569
Status Code: 404, Count: 10783
Status Code: 500, Count: 62
Status Code: 403, Count: 54
Status Code: 501, Count: 14
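Only seven distinct status codes occur in the parsed lines, so the top-10 request returns seven rows. As a rough follow-up, the sketch below takes the counts printed above, reproduces the negated-key ordering trick with heapq.nsmallest as a local stand-in for takeOrdered, and computes each code's share of the parsed requests.

```python
import heapq

# Counts copied from the notebook output above.
status_counts = [
    (200, 1700743), (304, 132626), (302, 46569), (404, 10783),
    (500, 62), (403, 54), (501, 14),
]

# Like takeOrdered(n, key=...), nsmallest returns the n smallest items by key,
# so negating the count yields the largest counts first.
top_3 = heapq.nsmallest(3, status_counts, key=lambda x: -x[1])

# Total number of successfully parsed requests.
total = sum(count for _status, count in status_counts)

for status, count in top_3:
    print(f"Status Code: {status}, Share of parsed requests: {count / total:.2%}")
```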


Stop the Spark Session

In [None]:
spark.stop()

End of Use Case