Problem statement:

Given a travel route dataset containing information about origin and destination cities, you need to identify the travel routes that occur only once in the dataset. Specifically:

Input: A dataset with two columns:

origin: The city from which the travel originates.
destination: The city to which the travel is headed.

Task:

Perform a left self-join on the dataset, matching the origin of one route (t1) against the destination of another (t2).
For each origin and destination pair from t1, count the joined rows. Because the join is a left join, a route with no match still contributes one row.
Filter out the pairs whose count is greater than 1.
Return the remaining routes (origin-destination pairs), ordered by the destination city.
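
Before turning to Spark, the task steps can be sketched in plain Python. This is an illustrative emulation only, not the notebook's implementation: the function name `routes_with_at_most_one_match` is hypothetical, the nested scan stands in for the left self-join, and the data is the four-row sample used in this notebook.

```python
from collections import defaultdict

def routes_with_at_most_one_match(routes):
    """Emulate: left self-join on t1.origin = t2.destination, group, keep cnt <= 1."""
    counts = defaultdict(int)
    for origin, destination in routes:                    # t1 rows
        matches = [r for r in routes if r[1] == origin]   # t2 rows arriving at t1.origin
        # A left join keeps an unmatched t1 row as a single null-padded row.
        counts[(origin, destination)] += max(len(matches), 1)
    kept = [pair for pair, cnt in counts.items() if cnt <= 1]
    # sorted() is stable, so routes with the same destination keep input order.
    return sorted(kept, key=lambda pair: pair[1])

sample = [
    ("Bangalore", "Chennai"),
    ("Chennai", "Bangalore"),
    ("Pune", "Chennai"),
    ("Delhi", "Pune"),
]
print(routes_with_at_most_one_match(sample))
# [('Bangalore', 'Chennai'), ('Pune', 'Chennai'), ('Delhi', 'Pune')]
```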

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

# Create a Spark session
spark = SparkSession.builder.appName("RoutesDataFrame").getOrCreate()

# Define the schema for the DataFrame
schema = StructType([
    StructField("Origin", StringType(), True),
    StructField("Destination", StringType(), True)
])

# Define the data
data = [
    ("Bangalore", "Chennai"),
    ("Chennai", "Bangalore"),
    ("Pune", "Chennai"),
    ("Delhi", "Pune")
]

# Create a DataFrame
routes_df = spark.createDataFrame(data, schema)

# Show the DataFrame
routes_df.display()


Origin,Destination
Bangalore,Chennai
Chennai,Bangalore
Pune,Chennai
Delhi,Pune


In [0]:
routes_df.createOrReplaceTempView('travel')

In [0]:
# Perform a self join to match t1.origin with t2.destination
joined_df = routes_df.alias("t1").join(
    routes_df.alias("t2"),
    F.col("t1.origin") == F.col("t2.destination"),
    how="left"
)

# Group by t1.origin and t1.destination and calculate the count
cte_df = joined_df.groupBy("t1.origin", "t1.destination").agg(F.count("*").alias("cnt"))

# Filter rows where cnt <= 1 and order by destination
result_df = cte_df.filter(F.col("cnt") <= 1).orderBy("t1.destination")

# Show the result
result_df.select("origin", "destination").display()


origin,destination
Bangalore,Chennai
Pune,Chennai
Delhi,Pune


Explanation:

Sample Data: We create a four-row sample DataFrame, routes_df, and register it as the temporary view travel.

Self Left Join: The routes_df DataFrame is joined to itself (aliased as t1 and t2), using a left join on the condition t1.origin = t2.destination.

Group By and Aggregate: We group the results by t1.origin and t1.destination, and compute the count of rows.

Filter and Order: We filter the DataFrame to only include rows where the count (cnt) is less than or equal to 1, and then sort the results by the destination column.

Select and Display: Finally, we select the origin and destination columns and display the result.
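
One detail in the grouping step is easy to miss: `count("*")` counts joined rows, so a route with no match on the t2 side still gets cnt = 1 from its null-padded row. The plain-Python sketch below (hypothetical variable names, no Spark required) lists, for each sample route, the number of real matches next to the row count that the left join produces.

```python
from collections import Counter

routes = [
    ("Bangalore", "Chennai"),
    ("Chennai", "Bangalore"),
    ("Pune", "Chennai"),
    ("Delhi", "Pune"),
]

# How many routes arrive at each city (the t2.destination side of the join).
arrivals = Counter(destination for _, destination in routes)

summary = []
for origin, destination in routes:
    matches = arrivals[origin]       # t2 rows with destination == t1.origin
    joined_rows = max(matches, 1)    # what count(*) sees after a left join
    summary.append((origin, destination, matches, joined_rows))
    print(f"{origin} -> {destination}: matches={matches}, cnt={joined_rows}")
```

Only Chennai -> Bangalore has cnt greater than 1, because two sample routes end in Chennai, so it is the one route the filter removes. Delhi -> Pune has no match at all but still has cnt = 1.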

In [0]:
%sql
with cte as(
  select
    t1.origin,
    t1.destination,
    count(*) as cnt
  from
    travel t1
    left join travel t2 on t1.origin = t2.destination
  group by
    t1.origin,
    t1.destination
)
select
  origin,
  destination
from
  cte
where
  cnt <= 1
order by
  destination

origin,destination
Bangalore,Chennai
Pune,Chennai
Delhi,Pune


Explanation:

Sample Data: The query reads the travel temporary view created from routes_df.

Self Join: Inside the CTE, travel is left-joined to itself (aliased as t1 and t2) on the condition t1.origin = t2.destination.

Group By: We group by t1.origin and t1.destination and count the joined rows as cnt.

Filter and Order: The outer query keeps only the rows where cnt is less than or equal to 1, then orders the result by destination.

The PySpark DataFrame code above replicates the behavior of this SQL CTE query and returns the same rows.
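
To check the query outside Databricks, the same CTE can be run against an in-memory SQLite database with Python's standard `sqlite3` module. This is a sketch under the assumption that SQLite and Spark SQL behave the same for this simple join and aggregate; the secondary sort on `origin` is an addition that is not in the original query, included only to make the order of tied destinations deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE travel (origin TEXT, destination TEXT)")
conn.executemany(
    "INSERT INTO travel VALUES (?, ?)",
    [
        ("Bangalore", "Chennai"),
        ("Chennai", "Bangalore"),
        ("Pune", "Chennai"),
        ("Delhi", "Pune"),
    ],
)

query = """
WITH cte AS (
  SELECT t1.origin, t1.destination, COUNT(*) AS cnt
  FROM travel t1
  LEFT JOIN travel t2 ON t1.origin = t2.destination
  GROUP BY t1.origin, t1.destination
)
SELECT origin, destination
FROM cte
WHERE cnt <= 1
ORDER BY destination, origin  -- origin is an added tie-breaker (assumption)
"""
rows = conn.execute(query).fetchall()
print(rows)
# [('Bangalore', 'Chennai'), ('Pune', 'Chennai'), ('Delhi', 'Pune')]
conn.close()
```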