Problem Statement:

You are given a single column, letter_word, that mixes single letters and whole words. The task is to pair each single letter with the words that begin with it.

A single letter matches a word when it equals the word's first character (e.g., "A" matches "Apple"). The comparison is case-sensitive.

Output a table with two columns:

Alphabet: The single letter.
Words: A word that starts with that letter (one row per matching word).

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema
schema = StructType([StructField("letter_word", StringType(), True)])

# Define the data
data = [
    ("D",),
    ("A",),
    ("B",),
    ("C",),
    ("E",),
    ("Elephant",),
    ("Apple",),
    ("Cat",),
    ("Donkey",),
    ("Ball",),
]

# Create a DataFrame
alphabet_df = spark.createDataFrame(data, schema)

# Display the DataFrame (display() is a Databricks notebook helper; use show() elsewhere)
alphabet_df.display()

letter_word
D
A
B
C
E
Elephant
Apple
Cat
Donkey
Ball


In [0]:
from pyspark.sql.functions import col, when, length, substring

# Add a 'stat' join key: the first character for words, an empty string for single letters
cte_df = alphabet_df.withColumn(
    "stat",
    when(length(col("letter_word")) > 1, substring(col("letter_word"), 1, 1)).otherwise(
        ""
    ),
)
cte_df.display()

letter_word,stat
D,
A,
B,
C,
E,
Elephant,E
Apple,A
Cat,C
Donkey,D
Ball,B


In [0]:
# Create or replace a temporary view for the CTE
cte_df.createOrReplaceTempView("cte")

# Execute the main query
result_df = spark.sql(
    """
    SELECT c.letter_word AS Alphabet, c1.letter_word AS Words
    FROM cte AS c
    JOIN cte AS c1
    ON c.letter_word = c1.stat
    ORDER BY c.letter_word
"""
)

# Show the results
result_df.display()

Alphabet,Words
A,Apple
B,Ball
C,Cat
D,Donkey
E,Elephant


In [0]:
# Register the original DataFrame as a temporary view so it can be queried with SQL
alphabet_df.createOrReplaceTempView("AlphabetWords")

In [0]:
# Spark SQL query with CTE
result_df = spark.sql(
    """
WITH cte AS (
    SELECT *,
           CASE 
               WHEN LENGTH(letter_word) > 1 THEN SUBSTRING(letter_word, 1, 1)
               ELSE ''
           END AS stat
    FROM AlphabetWords
)
SELECT c.letter_word AS Alphabet, 
       c1.letter_word AS Words
FROM cte AS c
JOIN cte AS c1
ON c.letter_word = c1.stat
ORDER BY c.letter_word
"""
)

# Show the results
result_df.display()

Alphabet,Words
A,Apple
B,Ball
C,Cat
D,Donkey
E,Elephant


Explanation:

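Temporary View:

The initial alphabet_df is registered as a temporary view named AlphabetWords so that it can be queried with Spark SQL.

CTE Definition:

Adds a stat column using a CASE expression. If letter_word is longer than one character, stat holds its first character; otherwise stat is an empty string, so single letters never match on the word side of the join.

Join Operation:

Joins the cte with itself on the condition c.letter_word = c1.stat. Here c supplies the single letters and c1 supplies the words. Because this is an inner join, a letter with no matching word does not appear in the output.

Ordering:

The result is ordered by c.letter_word, i.e., alphabetically by the single letter.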
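
The same steps can be sketched in plain Python, without Spark, to check the logic on the sample data. This is only an illustration: the names rows, cte, and pairs are hypothetical, and the nested loop stands in for the SQL self-join rather than reproducing how Spark executes it.

```python
# Plain-Python sketch of the CTE + self-join logic (no Spark required).
# The variable names here are illustrative and do not appear in the notebook.
rows = ["D", "A", "B", "C", "E", "Elephant", "Apple", "Cat", "Donkey", "Ball"]

# Step 1 (the CTE): pair every value with its "stat" key.
# Values longer than one character get their first character; single letters get "".
cte = [(value, value[0] if len(value) > 1 else "") for value in rows]

# Step 2 (the self-join): keep every (left value, right value) pair
# where the left value equals the right row's stat.
pairs = [
    (letter, word)
    for letter, _ in cte
    for word, stat in cte
    if letter == stat
]

# Step 3 (ORDER BY c.letter_word): sort by the letter.
pairs.sort(key=lambda pair: pair[0])

for letter, word in pairs:
    print(letter, word)
```

Single letters carry an empty stat, so they can only appear on the left side of a pair, which is why the output contains one row per letter and word combination.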