Problem Statement:

You are given a dataset of travel records where each record includes a customer ID, a start location, and an end location. The dataset is represented as follows:

customer: An identifier for the customer.
start_location: The location where the customer starts a journey.
end_location: The location where the customer ends a journey.

Your task is to identify and list the following for each customer:

Start Locations: Locations where the customer started a journey but never ended one.
End Locations: Locations where the customer ended a journey but never started one.

You need to produce a result that matches the following criteria:

For each customer, find start locations that do not appear as end locations in any records for that customer.
For each customer, find end locations that do not appear as start locations in any records for that customer.
Return a combined result showing the customer, their unique start location, and the corresponding unique end location where the conditions above are met.
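
The criteria above amount to two set differences per customer. The following is a minimal plain-Python sketch of that logic, not a Spark implementation; the helper name `unmatched_locations` is hypothetical, and the records are the same sample rows used in the notebook cells.

```python
from collections import defaultdict

# Sample records: (customer, start_location, end_location)
records = [
    ("c1", "New York", "Lima"),
    ("c1", "London", "New York"),
    ("c1", "Lima", "Sao Paulo"),
    ("c1", "Sao Paulo", "New Delhi"),
    ("c2", "Mumbai", "Hyderabad"),
    ("c2", "Surat", "Pune"),
    ("c2", "Hyderabad", "Surat"),
    ("c3", "Kochi", "Kurnool"),
    ("c3", "Lucknow", "Agra"),
    ("c3", "Agra", "Jaipur"),
    ("c3", "Jaipur", "Kochi"),
]


def unmatched_locations(rows):
    """Hypothetical helper: map each customer to
    (starts never used as an end, ends never used as a start)."""
    starts = defaultdict(set)
    ends = defaultdict(set)
    for customer, start, end in rows:
        starts[customer].add(start)
        ends[customer].add(end)
    # Set difference, evaluated separately for every customer
    return {
        c: (sorted(starts[c] - ends[c]), sorted(ends[c] - starts[c]))
        for c in starts
    }


result = unmatched_locations(records)
for customer, (only_starts, only_ends) in sorted(result.items()):
    print(customer, only_starts, only_ends)
```

For the sample data each customer ends up with exactly one start and one end, but the sketch returns lists because nothing in the criteria guarantees that in general.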

In [0]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Sample data
data = [
    ("c1", "New York", "Lima"),
    ("c1", "London", "New York"),
    ("c1", "Lima", "Sao Paulo"),
    ("c1", "Sao Paulo", "New Delhi"),
    ("c2", "Mumbai", "Hyderabad"),
    ("c2", "Surat", "Pune"),
    ("c2", "Hyderabad", "Surat"),
    ("c3", "Kochi", "Kurnool"),
    ("c3", "Lucknow", "Agra"),
    ("c3", "Agra", "Jaipur"),
    ("c3", "Jaipur", "Kochi"),
]

# Define the schema
schema = "customer string, start_location string, end_location string"

# Create the DataFrame
df = spark.createDataFrame(data=data, schema=schema)

# Show the DataFrame
df.show()

+--------+--------------+------------+
|customer|start_location|end_location|
+--------+--------------+------------+
|      c1|      New York|        Lima|
|      c1|        London|    New York|
|      c1|          Lima|   Sao Paulo|
|      c1|     Sao Paulo|   New Delhi|
|      c2|        Mumbai|   Hyderabad|
|      c2|         Surat|        Pune|
|      c2|     Hyderabad|       Surat|
|      c3|         Kochi|     Kurnool|
|      c3|       Lucknow|        Agra|
|      c3|          Agra|      Jaipur|
|      c3|        Jaipur|       Kochi|
+--------+--------------+------------+



In [0]:
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import StringType


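def loc(x, y):
    # Return the first element of x that does not appear in y.
    # Return None when there is no such element, since indexing an
    # empty list would raise IndexError inside the UDF.
    a = [i for i in x if i not in y]
    return a[0] if a else None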


loc_udf = udf(loc, StringType())
df1 = df.groupBy("customer").agg(
    collect_list("start_location").alias("start_list"),
    collect_list("end_location").alias("end_list"),
)
display(df1)
df2 = (
    df1.withColumn("start", loc_udf(df1.start_list, df1.end_list))
    .withColumn("end", loc_udf(df1.end_list, df1.start_list))
    .drop("start_list", "end_list")
)
display(df2)

customer,start_list,end_list
c1,"List(New York, London, Lima, Sao Paulo)","List(Lima, New York, Sao Paulo, New Delhi)"
c2,"List(Mumbai, Surat, Hyderabad)","List(Hyderabad, Pune, Surat)"
c3,"List(Kochi, Lucknow, Agra, Jaipur)","List(Kurnool, Agra, Jaipur, Kochi)"


customer,start,end
c1,London,New Delhi
c2,Mumbai,Pune
c3,Lucknow,Kurnool


In [0]:
df.createOrReplaceTempView('travel_data')

In [0]:
# SQL query to get the results
query = """
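-- Each subquery is correlated on customer so that a location is only
-- compared against records belonging to the same customer.
WITH t1 AS (
  SELECT
    s.customer,
    s.start_location AS start
  FROM
    travel_data s
  WHERE
    NOT EXISTS (
      SELECT
        1
      FROM
        travel_data e
      WHERE
        e.customer = s.customer
        AND e.end_location = s.start_location
    )
),
t2 AS (
  SELECT
    e.customer,
    e.end_location AS end
  FROM
    travel_data e
  WHERE
    NOT EXISTS (
      SELECT
        1
      FROM
        travel_data s
      WHERE
        s.customer = e.customer
        AND s.start_location = e.end_location
    )
)
SELECT
  t1.customer,
  t1.start,
  t2.end
FROM
  t2
  JOIN t1 ON t2.customer = t1.customer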
"""

# Execute the SQL query
result_df = spark.sql(query)

# Show the result
result_df.display()

customer,start,end
c1,London,New Delhi
c2,Mumbai,Pune
c3,Lucknow,Kurnool
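
The problem statement scopes both conditions to "records for that customer", which matters once two customers share a location. The sketch below illustrates the difference between a dataset-wide check and a per-customer check. It is an illustration only: it uses Python's built-in `sqlite3` module rather than Spark SQL so it runs without a cluster, and the three rows are hypothetical data chosen so that "Pune" is the final stop for c1 but an intermediate stop for c2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE travel_data "
    "(customer TEXT, start_location TEXT, end_location TEXT)"
)

# Hypothetical rows: c1 ends in Pune, while c2 only passes through Pune.
rows = [
    ("c1", "Mumbai", "Pune"),
    ("c2", "Delhi", "Pune"),
    ("c2", "Pune", "Goa"),
]
conn.executemany("INSERT INTO travel_data VALUES (?, ?, ?)", rows)

# Dataset-wide check: compares against start locations of ALL customers.
global_ends = conn.execute(
    """
    SELECT customer, end_location
    FROM travel_data
    WHERE end_location NOT IN (SELECT start_location FROM travel_data)
    ORDER BY customer
    """
).fetchall()

# Per-customer check: compares only against the same customer's records.
per_customer_ends = conn.execute(
    """
    SELECT t.customer, t.end_location
    FROM travel_data t
    WHERE NOT EXISTS (
        SELECT 1
        FROM travel_data s
        WHERE s.customer = t.customer
          AND s.start_location = t.end_location
    )
    ORDER BY t.customer
    """
).fetchall()

print("dataset-wide:", global_ends)
print("per-customer:", per_customer_ends)
conn.close()
```

With these rows the dataset-wide check loses c1's final stop, because c2 happens to start a journey from Pune, while the per-customer check keeps it. On the notebook's sample data no location is shared between customers, so both forms give the same result there.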


Explanation:

customer: Represents the customer ID.

start_location: Represents the starting location of a journey.

end_location: Represents the ending location of a journey.