In PySpark, you can get a statistical summary of a DataFrame with a single call to the built-in describe() or summary() methods. Both approaches are shown below with a sample DataFrame.

### 1. Create a Sample DataFrame

In [1]:
# Assumes a notebook runtime where `spark` and display() are predefined

# Sample data
data = [
    ("A", 10, 1000.5),
    ("B", 15, 1500.8),
    ("C", 20, 1700.0),
    ("D", 25, 2100.3),
    ("E", 30, 2300.7)
]

# Column names (column types are inferred from the data)
columns = ["Category", "Quantity", "Revenue"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

display(df)


### 2. Statistical Summary Using describe()

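The describe() method returns count, mean, stddev, min, and max for every column. For string columns such as Category, mean and stddev are null, and min and max are compared lexicographically.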

In [5]:
display(df.describe())
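
To make the describe() rows concrete, the sketch below recomputes the same five statistics for the Quantity values in plain Python. It assumes that Spark's stddev is the sample standard deviation (n - 1 in the denominator), which is what statistics.stdev computes. The names `quantity` and `stats` are illustrative only.

```python
import statistics

# Quantity values from the sample DataFrame
quantity = [10, 15, 20, 25, 30]

# The five statistics that describe() reports for a numeric column
stats = {
    "count": len(quantity),
    "mean": statistics.mean(quantity),
    # Sample standard deviation (divides by n - 1); assumed to match Spark's stddev
    "stddev": statistics.stdev(quantity),
    "min": min(quantity),
    "max": max(quantity),
}

for name, value in stats.items():
    print(f"{name:>6}: {value}")
```

Comparing these numbers with the Quantity column of the describe() output is a quick sanity check that the summary means what you expect.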



### 3. Extended Statistical Summary Using summary()

The summary() method extends describe() with approximate percentiles. By default it returns:

- count
- mean
- stddev
- min
- 25% (first quartile)
- 50% (median)
- 75% (third quartile)
- max

In [6]:
display(df.summary())


### 4. Quick One-Liner

If you just need the numeric stats quickly, you can select numeric columns only:

In [7]:
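from pyspark.sql.types import NumericType
# Match on the type class: Python ints are inferred as bigint, which a check for 'int' would miss
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
display(df.select(numeric_cols).summary())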


### Comparison Table

| **Function**     | **Columns**          | **Stats Returned**                           |
| ---------------- | -------------------- | -------------------------------------------- |
| `describe()`     | All columns          | count, mean, stddev, min, max                |
| `summary()`      | All columns          | count, mean, stddev, min, 25%, 50%, 75%, max |
| `approxQuantile` | Numeric columns only | Custom quantiles                             |


### Best Practice

Use describe() if you only need the basic statistics.

Use summary() if you also want the median and quartiles.