Problem Statement: 

Identify Overlapping Date Ranges

You are given a dataset containing multiple records, each with a unique id, name, start_date, and end_date. The objective is to identify all pairs of records where the date ranges overlap. A date range is defined by the start_date and end_date fields.

Conditions for Overlap:

Two records A and B overlap when all of the following hold (both endpoints are treated as inclusive):

The start date of A must be less than or equal to the end date of B.
The end date of A must be greater than or equal to the start date of B.
A and B must be different records (i.e., their id values must not match).
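
The two date conditions above can be checked without Spark. The following is a minimal plain-Python sketch, not part of the original notebook; the helper names `ranges_overlap` and `overlapping_pairs` are illustrative assumptions, and the three records are taken from the sample data used below.

```python
from datetime import date
from itertools import combinations


def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True when two inclusive date ranges share at least one day."""
    return start_a <= end_b and end_a >= start_b


def overlapping_pairs(records):
    """Return (id_a, id_b) once per overlapping pair of (id, start, end) records."""
    return [
        (a[0], b[0])
        for a, b in combinations(records, 2)  # never pairs a record with itself
        if ranges_overlap(a[1], a[2], b[1], b[2])
    ]


# Three records from the sample data (ids 1, 2 and 3)
records = [
    (1, date(2009, 1, 16), date(2009, 1, 20)),
    (2, date(2010, 6, 24), date(2010, 6, 26)),
    (3, date(2009, 1, 18), date(2009, 1, 20)),
]

pairs = overlapping_pairs(records)
print(pairs)  # [(1, 3)]
```

Because both comparisons use `<=` and `>=`, two ranges that merely touch on a single day (one ends on the day the other starts) count as overlapping.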

In [0]:
from pyspark.sql.functions import to_date  # Import to_date function
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Extended sample data
data = [
    (1, "Robert", "2009-01-16", "2009-01-20"),
    (2, "JOHN", "2010-06-24", "2010-06-26"),
    (3, "Robert", "2009-01-18", "2009-01-20"),
    (4, "Emily", "2012-03-12", "2012-03-15"),
    (5, "Sarah", "2013-07-01", "2013-07-05"),
    (6, "JOHN", "2011-09-10", "2011-09-12"),
    (7, "Emily", "2015-11-20", "2015-11-22"),
    (8, "Michael", "2018-02-14", "2018-02-18"),
    (9, "Sarah", "2016-08-25", "2016-08-29"),
    (10, "Robert", "2020-01-01", "2020-01-05"),
]

# Define schema
schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("start_date", StringType(), True),  # Temporarily StringType
        StructField("end_date", StringType(), True),  # Temporarily StringType
    ]
)

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Convert to DateType
df = df.withColumn("start_date", to_date(df["start_date"], "yyyy-MM-dd")).withColumn(
    "end_date", to_date(df["end_date"], "yyyy-MM-dd")
)

df.display()

id,name,start_date,end_date
1,Robert,2009-01-16,2009-01-20
2,JOHN,2010-06-24,2010-06-26
3,Robert,2009-01-18,2009-01-20
4,Emily,2012-03-12,2012-03-15
5,Sarah,2013-07-01,2013-07-05
6,JOHN,2011-09-10,2011-09-12
7,Emily,2015-11-20,2015-11-22
8,Michael,2018-02-14,2018-02-18
9,Sarah,2016-08-25,2016-08-29
10,Robert,2020-01-01,2020-01-05


In [0]:
from pyspark.sql.functions import col

# Perform a self-join to compare each row with every other row
overlap_df = df.alias("a").join(
    df.alias("b"),
    (
        (col("a.id") != col("b.id"))  # Exclude self-match
        & (col("a.start_date") <= col("b.end_date"))  # Overlap condition 1
        & (col("a.end_date") >= col("b.start_date"))  # Overlap condition 2
    ),
    "inner",
)

# Keep only the columns from side "a": each record is listed once per record it overlaps
overlap_df = overlap_df.select(
    col("a.id").alias("id1"),
    col("a.name").alias("name1"),
    col("a.start_date").alias("start_date1"),
    col("a.end_date").alias("end_date1"),
)

overlap_df.display()

id1,name1,start_date1,end_date1
1,Robert,2009-01-16,2009-01-20
3,Robert,2009-01-18,2009-01-20


In [0]:
# Register DataFrame as a temporary view
df.createOrReplaceTempView("date_ranges")

In [0]:
# Write the SQL query to find overlapping date ranges
query = """
SELECT 
    a.id AS id1, 
    a.name AS name1, 
    a.start_date AS start_date1, 
    a.end_date AS end_date1
FROM date_ranges a
JOIN date_ranges b
ON a.id != b.id -- Exclude self-match
AND a.start_date <= b.end_date -- Overlap condition 1
AND a.end_date >= b.start_date -- Overlap condition 2
"""

# Execute the query
overlap_df_sql = spark.sql(query)

# Show the results
overlap_df_sql.show()


+---+------+-----------+----------+
|id1| name1|start_date1| end_date1|
+---+------+-----------+----------+
|  1|Robert| 2009-01-16|2009-01-20|
|  3|Robert| 2009-01-18|2009-01-20|
+---+------+-----------+----------+



Explanation of the SQL Query

Self-Join: The date_ranges table is joined with itself to compare every row with every other row.
Excluding Self-Match: The condition a.id != b.id ensures a record is not compared with itself.

Overlap Conditions:

a.start_date <= b.end_date: The start date of record a is less than or equal to the end date of record b.
a.end_date >= b.start_date: The end date of record a is greater than or equal to the start date of record b.

This query returns the same rows as the PySpark DataFrame method; it simply expresses the same self-join in Spark SQL syntax.