approxQuantile() in PySpark. This function is useful when you want percentiles or the median of a large dataset without paying for an exact computation. It still reads every row, but it trades a small, bounded rank error for a much cheaper calculation.

DataFrame.approxQuantile(col, probabilities, relativeError)
col → column name (string) or list of column names

probabilities → list of quantile probabilities (values between 0 and 1).

Example: [0.5] → median, [0.25, 0.5, 0.75] → quartiles.

relativeError → target accuracy, a value ≥ 0 (0.0 → exact quantiles but potentially very expensive; larger values → faster but less accurate; values above 1 behave like 1).
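
To make the relativeError guarantee concrete: per the PySpark documentation, for a column with N rows the value x returned for probability p has an exact rank satisfying floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal plain-Python sketch of that check, with no Spark involved. The helper name `rank_within_bound` and the 1-based rank convention for ties are assumptions made for illustration, not part of the Spark API.

```python
import math
from bisect import bisect_left, bisect_right

def rank_within_bound(values, x, p, err):
    """Return True if x is an acceptable approximate p-quantile of values.

    Hypothetical helper (not a Spark function). Ranks are 1-based; when x
    occurs several times, any of its ranks may satisfy the bound.
    """
    ordered = sorted(values)
    n = len(ordered)
    lowest_rank = bisect_left(ordered, x) + 1   # rank of the first copy of x
    highest_rank = bisect_right(ordered, x)     # rank of the last copy of x
    if highest_rank < lowest_rank:
        return False  # x is not a sample from the data
    lower = math.floor((p - err) * n)
    upper = math.ceil((p + err) * n)
    # Acceptable if x's rank interval overlaps [lower, upper]
    return lowest_rank <= upper and highest_rank >= lower

values = list(range(10, 101, 10))  # 10, 20, ..., 100
print(rank_within_bound(values, 50, 0.5, 0.01))  # True: rank 5 lies in [4, 6]
print(rank_within_bound(values, 80, 0.5, 0.01))  # False: rank 8 lies outside [4, 6]
```

With only ten rows the bound is loose enough that both 50 and 60 would count as valid medians at err=0.01; on large columns the same relative error pins the result down far more tightly in relative terms.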

In [1]:
# Sample Data
data = [(10,), (20,), (30,), (40,), (50,), (60,), (70,), (80,), (90,), (100,)]
df = spark.createDataFrame(data, ["values"])

df.show()

+------+
|values|
+------+
|    10|
|    20|
|    30|
|    40|
|    50|
|    60|
|    70|
|    80|
|    90|
|   100|
+------+



In [2]:
# approxQuantile returns a list with one float per requested probability
median = df.approxQuantile("values", [0.5], 0.01)
print("Median:", median[0])

Median: 50.0


In [3]:
quantiles = df.approxQuantile("values", [0.25, 0.5, 0.75], 0.01)
print("25th, 50th, 75th Percentiles:", quantiles)


25th, 50th, 75th Percentiles: [30.0, 50.0, 80.0]
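
Note that the reported median is 50.0, whereas the conventional interpolated median of these ten values is 55. approxQuantile() returns an actual value from the column rather than interpolating between neighbours. The sketch below is plain Python with no Spark involved. It assumes a nearest-rank rule (take the element at rank ceil(p * N)), which happens to reproduce the outputs above but is not a statement about Spark's internal selection logic, and compares it with the interpolating functions in Python's `statistics` module. The name `nearest_rank_quantile` is hypothetical.

```python
import math
import statistics

def nearest_rank_quantile(values, p):
    """Element at 1-based rank ceil(p * N); hypothetical helper, no interpolation."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))  # p = 0 maps to the minimum
    return float(ordered[rank - 1])

values = list(range(10, 101, 10))  # same sample data: 10, 20, ..., 100

print([nearest_rank_quantile(values, p) for p in (0.25, 0.5, 0.75)])
# [30.0, 50.0, 80.0]
print(statistics.median(values))
# 55.0
print(statistics.quantiles(values, n=4, method="inclusive"))
# [32.5, 55.0, 77.5]
```

The gap between the two conventions matters mostly on tiny datasets like this one; on large columns neighbouring values are usually close together, so the difference tends to be negligible.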


In [4]:
data2 = [(10, 100), (20, 200), (30, 300), (40, 400), (50, 500)]
df2 = spark.createDataFrame(data2, ["col1", "col2"])

# With a list of columns, the result is a list of lists: one inner list per column
quantiles_multi = df2.approxQuantile(["col1", "col2"], [0.25, 0.5, 0.75], 0.01)
print(quantiles_multi)


[[20.0, 30.0, 40.0], [200.0, 300.0, 400.0]]
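
The nested list is easier to work with once each inner list is paired with its column name. A small sketch in plain Python, using the result printed above as a literal; the variable names `cols`, `probs`, and `by_column` are illustrative.

```python
cols = ["col1", "col2"]
probs = [0.25, 0.5, 0.75]
# Result printed above: one inner list per column, in the order the columns were passed
quantiles_multi = [[20.0, 30.0, 40.0], [200.0, 300.0, 400.0]]

# Map column name -> {probability -> quantile}
by_column = {
    col: dict(zip(probs, values))
    for col, values in zip(cols, quantiles_multi)
}

print(by_column["col2"][0.5])  # 300.0
```

Keeping `cols` and `probs` in variables and passing the same objects to approxQuantile() avoids the labels drifting out of sync with the call.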


### 🔹 Key Notes

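- approxQuantile() is a DataFrame action that returns plain Python lists to the driver, while percentile_approx() is an aggregate expression that also works in SQL and groupBy() queries. Both compute approximate quantiles, so benchmark on your own data before assuming one is faster.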
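- If you need exact percentiles → set relativeError=0, and expect it to be considerably more expensive on large data.
- A good fit for medians and quantiles of whole columns in big data pipelines where a small rank error is acceptable.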