# **Shortest Distance**

## **Problem Statement**
The table `Point` holds the **x-coordinates** of points on the x-axis of a plane. All coordinates are **integers**.

### **Table: Point**
| Column Name | Type |
|-------------|------|
| `x`         | int  |

- Write a query to **find the shortest distance** between any two points in this table.

### **Example**
##### **Input Table**
| x   |
|-----|
| -1  |
| 0   |
| 2   |

##### **Expected Output**
| shortest |
|----------|
| 1        |
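
To make the expected output concrete, the sketch below checks every pair of points in plain Python. The function name `shortest_distance_bruteforce` is hypothetical and not part of the original notebook; it only illustrates the definition of the problem, not an efficient solution.

```python
from itertools import combinations


def shortest_distance_bruteforce(points):
    """Return the smallest |a - b| over all pairs, or None with fewer than two points."""
    distances = [abs(a - b) for a, b in combinations(points, 2)]
    return min(distances) if distances else None


# Pairs for the example: (-1, 0) -> 1, (-1, 2) -> 3, (0, 2) -> 2
print(shortest_distance_bruteforce([-1, 0, 2]))  # prints 1
```

The pair `(-1, 0)` gives the smallest distance, which matches the expected value of `1`.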


---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Create a DataFrame from the input**
2. **Sort the points by x**
3. **Create a lagged column to get the previous x**
4. **Compute the absolute difference between consecutive points**
5. **Get the minimum of those differences**
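
These steps rely on one observation: once the points are sorted, the closest pair must be adjacent, because a point lying between two others is at least as close to each of them as they are to each other. A minimal plain-Python sketch of the same sort-then-compare logic follows; `shortest_distance_sorted` is an illustrative name, not part of the original notebook.

```python
def shortest_distance_sorted(points):
    """Sort, then take the minimum gap between neighbours (None if fewer than two points)."""
    ordered = sorted(points)
    # zip pairs each value with its predecessor, much like lag("x", 1) over an ordered window
    gaps = [curr - prev for prev, curr in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else None


print(shortest_distance_sorted([2, -1, 0]))  # prints 1
```

Because the values are already in ascending order, each gap is non-negative, so the absolute value is not strictly needed in this sketch.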

### **Code**

```python
from pyspark.sql import SparkSession
# Alias abs and min so they do not shadow Python's built-ins
from pyspark.sql.functions import abs as spark_abs, col, lag, min as spark_min
from pyspark.sql.window import Window

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("ShortestDistance").getOrCreate()

# Step 2: Create DataFrame
data = [(-1,), (0,), (2,)]
columns = ["x"]
df = spark.createDataFrame(data, columns)

# Step 3: Define Window Spec
# Without partitionBy, Spark moves all rows to a single partition to order them
window_spec = Window.orderBy("x")

# Step 4: Add Previous Point Using lag (prev_x is null for the first row)
df_lagged = df.withColumn("prev_x", lag("x", 1).over(window_spec))

# Step 5: Calculate Absolute Difference (null wherever prev_x is null)
df_diff = df_lagged.withColumn("diff", spark_abs(col("x") - col("prev_x")))

# Step 6: Compute Minimum Non-null Distance
# min() skips nulls; the filter drops the result row if the table has fewer than two points
shortest_distance = df_diff.select(spark_min("diff").alias("shortest")).filter("shortest IS NOT NULL")

# Step 7: Display Output
shortest_distance.show()
```


```text
+--------+
|shortest|
+--------+
|       1|
+--------+
```



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Register the DataFrame as a temporary SQL view**
2. **Use a CTE with ROW_NUMBER to order x**
3. **Self-join each row with the previous row**
4. **Calculate absolute difference**
5. **Get the MIN of that difference**
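
The pairing performed by the self-join can be sketched in plain Python, assuming a dictionary keyed by row number stands in for the ordered CTE. The function name `shortest_distance_rownum_join` is hypothetical; this only illustrates a join condition of the form `a.rn = b.rn + 1`, not how Spark executes the query.

```python
def shortest_distance_rownum_join(points):
    """Emulate ROW_NUMBER plus a self-join on consecutive row numbers."""
    # ROW_NUMBER() OVER (ORDER BY x): number the sorted values starting at 1
    ordered = {rn: x for rn, x in enumerate(sorted(points), start=1)}
    # Inner join on a.rn = b.rn + 1: row 1 has no partner, so it yields no pair
    diffs = [abs(ordered[rn] - ordered[rn - 1]) for rn in ordered if rn - 1 in ordered]
    return min(diffs) if diffs else None


print(shortest_distance_rownum_join([-1, 0, 2]))  # prints 1
```

Duplicate coordinates still receive distinct row numbers, so two identical points are paired and produce a distance of `0`. When there are no pairs, this sketch returns `None`, which loosely corresponds to SQL `MIN` returning `NULL` over an empty set.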

### **Code**

```python
# Step 1: Register DataFrame
df.createOrReplaceTempView("Point")

# Step 2: Execute SQL
sql_query = """
WITH OrderedPoints AS (
    SELECT x, ROW_NUMBER() OVER (ORDER BY x) AS rn
    FROM Point
),
PointPairs AS (
    SELECT a.x AS x1, b.x AS x2, ABS(a.x - b.x) AS diff
    FROM OrderedPoints a
    JOIN OrderedPoints b
    ON a.rn = b.rn + 1
)
SELECT MIN(diff) AS shortest
FROM PointPairs
"""

result = spark.sql(sql_query)
result.show()
```


```text
+--------+
|shortest|
+--------+
|       1|
+--------+
```



---

## **Summary**
| Approach  | Method                      | Key Idea |
|-----------|-----------------------------|----------|
| **Approach 1** | PySpark DataFrame API    | Use `lag()` + `abs()` + `min()` |
| **Approach 2** | SQL Query in PySpark     | Use `ROW_NUMBER` and self-join |
