# Analyzing Common Crawl Data with PySpark


## Introduction

This project is a part of my **Big Data showcase** and focuses on enhancing my skills as a **Data Engineer** by leveraging **Apache Spark** to process and analyze large-scale web data. The dataset used comes from **Common Crawl**, a non-profit organization that crawls, archives, and analyzes content from public websites, maintaining **petabytes of web content** for research and educational purposes.

In this project, we focus on working with a small but representative portion of the **Common Crawl domain graph dataset**. Each record in this dataset describes one domain seen in the crawl, along with the count of its associated subdomains. Understanding the structure of domain names is crucial, as every website address is composed of multiple parts, such as:

- **Protocol:** e.g., `https`
- **Sub-domain:** e.g., `www`
- **Second-level domain:** e.g., `commoncrawl`
- **Top-level domain:** e.g., `org`

By analyzing this data, we aim to uncover insights about domain popularity, subdomain distributions, and patterns in web structure.
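
As a rough illustration of the four parts listed above, the sketch below splits a URL with Python's standard `urllib.parse`. The helper name `split_url` is hypothetical, and the label-based split assumes a single-label TLD such as `org`; multi-label suffixes like `co.uk` would need a public-suffix list, which is not handled here.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol, subdomain, second-level domain, and TLD.

    Naive sketch: assumes the TLD is the last dot-separated label.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # None when the URL has no network location
    labels = host.split(".")             # e.g. ['www', 'commoncrawl', 'org']
    if len(labels) < 2:
        raise ValueError(f"expected at least a domain and a TLD in {url!r}")
    return {
        "protocol": parts.scheme,
        "subdomain": ".".join(labels[:-2]),   # empty string when there is none
        "second_level_domain": labels[-2],
        "top_level_domain": labels[-1],
    }


print(split_url("https://www.commoncrawl.org"))
```

The domain graph file used later already stores the second-level domain and the TLD in separate columns, so this kind of parsing is only needed when starting from raw URLs.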

---

## Workflow Process

The workflow for this project follows a structured ETL (Extract, Transform, Load) approach, enhanced by Spark's distributed computing capabilities:

1. **Data Extraction:**
   - Load the Common Crawl domain graph dataset directly into a Spark DataFrame.
   - Inspect the schema and identify key columns: the site ID, the domain, its top-level domain (TLD), and the subdomain count.

2. **Data Cleaning and Transformation:**
   - Handle missing or null values.
   - Parse domain names to extract subdomains, second-level domains, and top-level domains.
   - Convert data types and optimize DataFrame storage for better performance.

3. **Exploratory Data Analysis (EDA):**
   - Calculate the most common second-level domains.
   - Analyze the distribution of subdomains per domain.
   - Identify the most popular top-level domains (TLDs).
   - Visualize domain and subdomain trends using Spark's built-in plotting tools.

4. **Aggregation and Insights:**
   - Use Spark SQL to group and aggregate data by domain and TLD.
   - Rank domains by the number of subdomains.
   - Calculate statistical summaries like mean, median, and max subdomain counts.

5. **Results and Visualization:**
   - Generate clear plots for domain counts, TLD distributions, and subdomain frequencies.
   - Export processed data for further analysis or reporting.
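
The statistical summaries mentioned in step 4 can be sketched on a tiny sample before running them at scale. The example below uses only the standard library and the subdomain counts of the first 12 rows shown later in this notebook; it is an illustration of the calculation, not the Spark aggregation itself.

```python
import statistics

# Subdomain counts of the first 12 rows of the sample file used in this notebook
sample_counts = [1, 1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1]

summary = {
    "mean": statistics.mean(sample_counts),
    "median": statistics.median(sample_counts),
    "max": max(sample_counts),
}
print(summary)
```

Even in this small sample, a single larger value pulls the mean above the median, which is why both are worth reporting for skewed data.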

---

## Insights and Key Findings

From the analysis, we gathered the following insights:

- **Top-level Domains:**
  In the limited sample analyzed here, `edu` (18,547 domains) and `gov` (15,007 domains) were the most common TLDs, followed by `travel`, `coop`, and `jobs`.

- **Domain Popularity:**
  Certain second-level domains had a far higher number of subdomains than most, indicating their large web presence. For example, `nps.gov` has 178 subdomains in this sample.

- **Subdomain Patterns:**
  A significant long-tail distribution was observed, where a few domains had thousands of subdomains, while the majority had fewer than 10.

- **Anomalies:**
  Some domains had suspiciously high subdomain counts, prompting further investigation into potential scraping or bot activities.
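
One simple way to check a long-tail claim like the one above is to bucket the subdomain counts by order of magnitude. The sketch below does this with the standard library on the only counts visible in this notebook (the first 12 sample rows plus the `nps.gov` entry); the bucket labels and the `bucket` helper are hypothetical choices for illustration.

```python
from collections import Counter

# Counts that appear in this notebook: the first 12 sample rows plus nps.gov (178)
counts = [1, 1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 178]


def bucket(n):
    """Label a subdomain count by order of magnitude (hypothetical bucket names)."""
    if n < 10:
        return "1-9"
    if n < 100:
        return "10-99"
    if n < 1000:
        return "100-999"
    return "1000+"


histogram = Counter(bucket(n) for n in counts)
print(histogram)
```

On the full dataset the same bucketing could be expressed as a grouped count, which would show how heavy the tail actually is.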

---

This project is a **work in progress**. We plan to expand it further in the coming days by adding more in-depth analyses, incorporating additional visualizations, and refining the workflow to uncover even deeper insights. Stay tuned for updates!

## Analyzing Common Crawl Data with RDDs

Initialize a new Spark Context and read in the domain graph as an RDD.

In [1]:
# Import required modules
from pyspark.sql import SparkSession

# Create a new SparkSession
spark = SparkSession.builder.getOrCreate()

# Get SparkContext
sc = spark.sparkContext

In [2]:
# Read Domains CSV File into an RDD
common_crawl_domain_counts = sc.textFile('./crawl/cc-main-limited-domains.csv')

# Display first few domains from the RDD
common_crawl_domain_counts.take(12)

['367855\t172-in-addr\tarpa\t1',
 '367856\taddr\tarpa\t1',
 '367857\tamphic\tarpa\t1',
 '367858\tbeta\tarpa\t1',
 '367859\tcallic\tarpa\t1',
 '367860\tch\tarpa\t1',
 '367861\td\tarpa\t1',
 '367862\thome\tarpa\t7',
 '367863\tiana\tarpa\t1',
 '367907\tlocal\tarpa\t1',
 '367908\tnic\tarpa\t1',
 '48987160\t1-20media\tcoop\t1']

Apply `fmt_domain_graph_entry` over `common_crawl_domain_counts` and save the result as a new RDD named `formatted_host_counts`.

In [3]:
def fmt_domain_graph_entry(entry):
    """
    Formats a Common Crawl domain graph entry. Extracts the site_id,
    domain name, top-level domain (tld), and subdomain count as separate items.
    """

    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')        
    return int(site_id), domain, tld, int(num_subdomains)

In [4]:
# Apply `fmt_domain_graph_entry` to the raw data RDD with the `.map()` method
formatted_host_counts = common_crawl_domain_counts.map(fmt_domain_graph_entry)

# Display the first few entries of the new RDD
formatted_host_counts.take(12)

[(367855, '172-in-addr', 'arpa', 1),
 (367856, 'addr', 'arpa', 1),
 (367857, 'amphic', 'arpa', 1),
 (367858, 'beta', 'arpa', 1),
 (367859, 'callic', 'arpa', 1),
 (367860, 'ch', 'arpa', 1),
 (367861, 'd', 'arpa', 1),
 (367862, 'home', 'arpa', 7),
 (367863, 'iana', 'arpa', 1),
 (367907, 'local', 'arpa', 1),
 (367908, 'nic', 'arpa', 1),
 (48987160, '1-20media', 'coop', 1)]
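
`fmt_domain_graph_entry` assumes every line has exactly four tab-separated fields with numeric first and last values. A more defensive variant might look like the sketch below; the name `parse_entry_or_none` and the two malformed rows are hypothetical and only illustrate the idea.

```python
def parse_entry_or_none(entry):
    """Return (site_id, domain, tld, num_subdomains), or None if the line is malformed."""
    fields = entry.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None  # wrong number of columns
    site_id, domain, tld, num_subdomains = fields
    try:
        return int(site_id), domain, tld, int(num_subdomains)
    except ValueError:
        return None  # non-numeric id or count


lines = [
    "367862\thome\tarpa\t7",  # a row from the sample output above
    "367862\thome\tarpa",     # hypothetical truncated row
    "abc\thome\tarpa\t7",     # hypothetical row with a non-numeric id
]
parsed = [parse_entry_or_none(line) for line in lines]
print(parsed)
```

With an RDD, such a function could be applied with `.map()` and followed by `.filter(lambda row: row is not None)` so that bad lines are dropped instead of failing the job.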

Apply `extract_subdomain_counts` over `common_crawl_domain_counts` and save the result as a new RDD named `host_counts`.

In [5]:
def extract_subdomain_counts(entry):
    """
    Extract the subdomain count from a Common Crawl domain graph entry.
    """
    
    # Split the entry on delimiter ('\t') into site_id, domain, tld, and num_subdomains
    site_id, domain, tld, num_subdomains = entry.split('\t')
    
    # return ONLY the num_subdomains
    return int(num_subdomains)


# Apply `extract_subdomain_counts` to the raw data RDD
host_counts = common_crawl_domain_counts.map(extract_subdomain_counts)

# Display the first few entries
host_counts.take(12)

[1, 1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1]

Using `host_counts`, calculate the total number of subdomains across all domains in the dataset and save the result to a variable named `total_host_counts`.

In [6]:
# Reduce the RDD to a single value, the sum of subdomains, with a lambda function
# as the reduce function
total_host_counts = host_counts.reduce(lambda x, y: x + y)

# Display result count
total_host_counts

595466
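
To see why addition is a safe choice for `reduce`, the sketch below mimics the idea with `functools.reduce` on the first 12 counts, split into two lists that stand in for partitions. Spark's `RDD.reduce` expects a commutative and associative function precisely because partial results are computed per partition and then combined; the split point used here is arbitrary.

```python
from functools import reduce

# Subdomain counts of the first 12 rows shown above
sample_counts = [1, 1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1]


def add(x, y):
    return x + y


# Reduce everything in one pass
total = reduce(add, sample_counts)

# Reduce two stand-in "partitions" separately, then combine the partial sums
partitions = [sample_counts[:5], sample_counts[5:]]
partial_sums = [reduce(add, part) for part in partitions]
combined = reduce(add, partial_sums)

print(total, partial_sums, combined)
```

Both routes give the same total, which would not be guaranteed for a non-associative function such as subtraction.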

Stop the current `SparkSession` and `sparkContext` before moving on to analyze the data with Spark SQL.

In [7]:
# Stop the sparkContext and the SparkSession
spark.stop()

## Exploring Domain Counts with PySpark DataFrames and SQL

Create a new `SparkSession` and assign it to a variable named `spark`.

In [8]:
from pyspark.sql import SparkSession

# Create a new SparkSession
spark = SparkSession.builder.getOrCreate()

Read `./crawl/cc-main-limited-domains.csv` into a new Spark DataFrame named `common_crawl`.

In [9]:
# Read the target file (`./crawl/cc-main-limited-domains.csv`) into a DataFrame (`common_crawl`)

common_crawl = spark.read \
    .option('delimiter', '\t') \
    .option('inferSchema', True) \
    .csv('./crawl/cc-main-limited-domains.csv')

# Display the DataFrame to the notebook using show()
common_crawl.show(12, truncate=False)

+--------+-----------+----+---+
|_c0     |_c1        |_c2 |_c3|
+--------+-----------+----+---+
|367855  |172-in-addr|arpa|1  |
|367856  |addr       |arpa|1  |
|367857  |amphic     |arpa|1  |
|367858  |beta       |arpa|1  |
|367859  |callic     |arpa|1  |
|367860  |ch         |arpa|1  |
|367861  |d          |arpa|1  |
|367862  |home       |arpa|7  |
|367863  |iana       |arpa|1  |
|367907  |local      |arpa|1  |
|367908  |nic        |arpa|1  |
|48987160|1-20media  |coop|1  |
+--------+-----------+----+---+
only showing top 12 rows



Rename the DataFrame's columns to the following: 

- site_id
- domain
- top_level_domain
- num_subdomains


In [10]:
# Rename the DataFrame's columns with `withColumnRenamed()`
common_crawl = common_crawl\
    .withColumnRenamed("_c0", "site_id")\
    .withColumnRenamed("_c1", "domain")\
    .withColumnRenamed("_c2", "top_level_domain")\
    .withColumnRenamed("_c3", "num_subdomains")
  
# Display the first few rows of the DataFrame and the new schema in the notebook
common_crawl.show(12, truncate=False)
common_crawl.printSchema()

+--------+-----------+----------------+--------------+
|site_id |domain     |top_level_domain|num_subdomains|
+--------+-----------+----------------+--------------+
|367855  |172-in-addr|arpa            |1             |
|367856  |addr       |arpa            |1             |
|367857  |amphic     |arpa            |1             |
|367858  |beta       |arpa            |1             |
|367859  |callic     |arpa            |1             |
|367860  |ch         |arpa            |1             |
|367861  |d          |arpa            |1             |
|367862  |home       |arpa            |7             |
|367863  |iana       |arpa            |1             |
|367907  |local      |arpa            |1             |
|367908  |nic        |arpa            |1             |
|48987160|1-20media  |coop            |1             |
+--------+-----------+----------------+--------------+
only showing top 12 rows

root
 |-- site_id: integer (nullable = true)
 |-- domain: string (nullable = true)
 |-- top_level_domain: string (nullable = true)
 |-- num_subdomains: integer (nullable = true)
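
An alternative to inferring the schema and renaming columns afterwards is to declare the column names and types up front. The sketch below only builds a DDL-formatted schema string in plain Python; passing it to `spark.read.schema(...)` is shown as a comment because it needs a live `SparkSession`. The `COLUMNS` list is an assumption based on the names and types used in this notebook.

```python
# Column names and Spark SQL types, in file order (assumed from this notebook)
COLUMNS = [
    ("site_id", "INT"),
    ("domain", "STRING"),
    ("top_level_domain", "STRING"),
    ("num_subdomains", "INT"),
]

# Build a DDL-formatted schema string such as "site_id INT, domain STRING, ..."
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in COLUMNS)
print(schema_ddl)

# With a live SparkSession named `spark`, the string could be used like this:
# common_crawl = (
#     spark.read
#     .schema(schema_ddl)
#     .option('delimiter', '\t')
#     .csv('./crawl/cc-main-limited-domains.csv')
# )
```

Declaring the schema avoids the extra pass over the file that `inferSchema` performs and makes the four `withColumnRenamed()` calls unnecessary.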

## Reading and Writing Datasets to Disk

Save the `common_crawl` DataFrame as parquet files in a directory called `./results/common_crawl/`.

In [11]:
# Save the `common_crawl` DataFrame to a series of parquet files in `./results/common_crawl/` 
# with `DataFrame.write.parquet()`

common_crawl\
    .write\
    .parquet('./results/common_crawl/', mode="overwrite")

Read `./results/common_crawl/` into a new DataFrame to confirm our DataFrame was saved properly.

In [12]:
# Read from parquet directory
common_crawl_domains = spark.read.parquet('./results/common_crawl/')

# Display the first few rows of the DataFrame and the schema in the notebook
common_crawl_domains.show(5, truncate=False)

common_crawl_domains.printSchema()

+-------+-----------+----------------+--------------+
|site_id|domain     |top_level_domain|num_subdomains|
+-------+-----------+----------------+--------------+
|367855 |172-in-addr|arpa            |1             |
|367856 |addr       |arpa            |1             |
|367857 |amphic     |arpa            |1             |
|367858 |beta       |arpa            |1             |
|367859 |callic     |arpa            |1             |
+-------+-----------+----------------+--------------+
only showing top 5 rows

root
 |-- site_id: integer (nullable = true)
 |-- domain: string (nullable = true)
 |-- top_level_domain: string (nullable = true)
 |-- num_subdomains: integer (nullable = true)



## Querying Domain Counts with PySpark DataFrames and SQL

Create a local temporary view from `common_crawl_domains`.

In [13]:
# Create a temporary view in the metadata for this `SparkSession` to make the data queryable with `sparkSession.sql()`
common_crawl_domains.createOrReplaceTempView("crawl")

Calculate the total number of domains for each top-level domain in the dataset.

In [14]:
# Aggregate the DataFrame using DataFrame methods


# .groupby('top_level_domain')       -> Aggregate on top_level_domain
# .count()                           -> The aggregate function to apply 
# .orderBy('count', ascending=False) -> Order from highest to lowest number of domains

common_crawl_domains\
    .groupby('top_level_domain')\
    .count()\
    .orderBy('count', ascending=False)\
    .show(12, truncate=False)

+----------------+-----+
|top_level_domain|count|
+----------------+-----+
|edu             |18547|
|gov             |15007|
|travel          |6313 |
|coop            |5319 |
|jobs            |3893 |
|post            |117  |
|map             |34   |
|arpa            |11   |
+----------------+-----+



In [15]:
# Aggregate the DataFrame using SQL's `COUNT`, `GROUP BY`, and `ORDER BY`

spark.sql(
    """
    SELECT top_level_domain, COUNT(domain) AS count
    FROM crawl
    GROUP BY top_level_domain
    ORDER BY COUNT(domain) DESC
    """
).show(12, truncate=False)

+----------------+-----+
|top_level_domain|count|
+----------------+-----+
|edu             |18547|
|gov             |15007|
|travel          |6313 |
|coop            |5319 |
|jobs            |3893 |
|post            |117  |
|map             |34   |
|arpa            |11   |
+----------------+-----+
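
The DataFrame version counts rows per group, while the SQL version uses `COUNT(domain)`, which counts only non-null `domain` values. The two agree here, but they would differ if some rows had a missing domain. The sketch below shows that difference using SQLite from the standard library as a stand-in SQL engine (not Spark); the row with a missing domain is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl (domain TEXT, top_level_domain TEXT)")
conn.executemany(
    "INSERT INTO crawl VALUES (?, ?)",
    [
        ("home", "arpa"),
        ("iana", "arpa"),
        (None, "arpa"),  # hypothetical row with a missing domain
    ],
)

# COUNT(*) counts rows; COUNT(domain) skips rows where domain is NULL
row = conn.execute(
    "SELECT COUNT(*), COUNT(domain) FROM crawl WHERE top_level_domain = 'arpa'"
).fetchone()
print(row)
conn.close()
```

If the intent is to count rows regardless of nulls, `COUNT(*)` in the SQL query would match the DataFrame `.count()` more closely.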



Calculate the total number of subdomains for each top-level domain in the dataset.

In [16]:
# Aggregate the DataFrame using DataFrame methods

# .groupby('top_level_domain')                     -> Aggregate on top_level_domain
# .sum('num_subdomains')                           -> The aggregate function to apply 
# .orderBy('sum(num_subdomains)', ascending=False) -> Order from highest to lowest total subdomains

common_crawl_domains\
    .groupby('top_level_domain')\
    .sum('num_subdomains')\
    .orderBy('sum(num_subdomains)', ascending=False)\
    .show(12, truncate=False)

+----------------+-------------------+
|top_level_domain|sum(num_subdomains)|
+----------------+-------------------+
|edu             |484438             |
|gov             |85354              |
|travel          |10768              |
|coop            |8683               |
|jobs            |6023               |
|post            |143                |
|map             |40                 |
|arpa            |17                 |
+----------------+-------------------+



In [17]:
# Aggregate the DataFrame using SQL's `SUM`, `GROUP BY`, and `ORDER BY`

spark.sql(
    """
    SELECT top_level_domain, SUM(num_subdomains) AS total_count
    FROM crawl
    GROUP BY top_level_domain
    ORDER BY SUM(num_subdomains) DESC
    """
).show(12, truncate=False)

+----------------+-----------+
|top_level_domain|total_count|
+----------------+-----------+
|edu             |484438     |
|gov             |85354      |
|travel          |10768      |
|coop            |8683       |
|jobs            |6023       |
|post            |143        |
|map             |40         |
|arpa            |17         |
+----------------+-----------+
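
The grouped sum can be sanity-checked by hand on the 12 rows displayed earlier. The plain-Python sketch below groups those rows by TLD and sums the subdomain counts; it mirrors the logic of the aggregation and is not the Spark execution path.

```python
from collections import defaultdict

# The first 12 rows of the dataset, as displayed earlier in this notebook
rows = [
    (367855, "172-in-addr", "arpa", 1),
    (367856, "addr", "arpa", 1),
    (367857, "amphic", "arpa", 1),
    (367858, "beta", "arpa", 1),
    (367859, "callic", "arpa", 1),
    (367860, "ch", "arpa", 1),
    (367861, "d", "arpa", 1),
    (367862, "home", "arpa", 7),
    (367863, "iana", "arpa", 1),
    (367907, "local", "arpa", 1),
    (367908, "nic", "arpa", 1),
    (48987160, "1-20media", "coop", 1),
]

# Group by top-level domain and sum num_subdomains
totals = defaultdict(int)
for _site_id, _domain, tld, num_subdomains in rows:
    totals[tld] += num_subdomains

# Order from highest to lowest total, like ORDER BY ... DESC
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The `arpa` total of 17 matches the full result above because all 11 `arpa` rows appear in these first 12 rows; the `coop` total differs because only one of its rows is included.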



How many subdomains does `nps.gov` have? Filter the dataset to that website's entry and display the columns `top_level_domain`, `domain`, and `num_subdomains` in your result.

In [18]:
# Filter the DataFrame using DataFrame Methods `select` and `filter`

common_crawl_domains\
    .select(['top_level_domain', 'domain', 'num_subdomains'])\
    .filter(common_crawl_domains.domain == "nps")\
    .filter(common_crawl_domains.top_level_domain == "gov")\
    .show(12, truncate=False)

+----------------+------+--------------+
|top_level_domain|domain|num_subdomains|
+----------------+------+--------------+
|gov             |nps   |178           |
+----------------+------+--------------+



In [19]:
# Filter the DataFrame using a SQL `WHERE` statement

spark.sql(
    """
    SELECT top_level_domain, domain, num_subdomains
    FROM crawl
    WHERE domain = 'nps'
    AND top_level_domain = 'gov'
    """
).show(truncate=False)

+----------------+------+--------------+
|top_level_domain|domain|num_subdomains|
+----------------+------+--------------+
|gov             |nps   |178           |
+----------------+------+--------------+



Close the `SparkSession` and underlying `sparkContext`.

In [20]:
# Stop the notebook's `SparkSession` and `sparkContext`
spark.stop()

## Conclusion

This project provided a hands-on opportunity to work with real-world web data using **PySpark**. We loaded a sample of the Common Crawl domain graph as both an RDD and a DataFrame, saved it to Parquet, and queried it with DataFrame methods and Spark SQL to summarize domains and subdomains per top-level domain. Because Spark distributes this work across partitions, the same workflow should carry over to larger extracts of the dataset.

This project not only strengthened my data engineering skills but also enhanced my ability to draw actionable insights from complex datasets. As I continue to build my portfolio, projects like this highlight my expertise in **Big Data processing, cloud computing, and data analysis** — essential skills for any aspiring data scientist or data engineer.
