In PySpark, the freqItems() function quickly identifies frequent items (the most commonly occurring values) in one or more columns of a DataFrame.
It is useful for data exploration, categorical analysis, and feature engineering when you need to detect frequent values in large datasets. The result is approximate, so treat it as an exploratory tool rather than an exact count.

### 📌 Syntax
DataFrame.stat.freqItems(cols, support=0.01)

| Parameter   | Description                                                                                                                                 |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **cols**    | A list of column names on which you want to find frequent items.                                                                            |
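| **support** | Minimum frequency threshold, as a fraction of rows. Default = `0.01` (1%). With support = `0.05`, items appearing in **≥5%** of rows are treated as "frequent"; because the algorithm is approximate, the result may also include less frequent items (false positives). |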

Returns

A single-row DataFrame with one array column per input column, named `<column>_freqItems`, holding the frequent items found in that column.


### 🔹 Sample DataFrame

In [1]:
# Sample data
data = [
    (1, "Apple",  "Red"),
    (2, "Apple",  "Green"),
    (3, "Banana", "Yellow"),
    (4, "Banana", "Yellow"),
    (5, "Orange", "Orange"),
    (6, "Apple",  "Red"),
    (7, "Apple",  "Red"),
    (8, "Banana", "Yellow"),
    (9, "Banana", "Yellow"),
    (10, "Grape", "Green")
]

columns = ["id", "fruit", "color"]
df = spark.createDataFrame(data, columns)

df.show()




+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| Apple|   Red|
|  2| Apple| Green|
|  3|Banana|Yellow|
|  4|Banana|Yellow|
|  5|Orange|Orange|
|  6| Apple|   Red|
|  7| Apple|   Red|
|  8|Banana|Yellow|
|  9|Banana|Yellow|
| 10| Grape| Green|
+---+------+------+



### 🔹 Example 1 — Find Frequent Items in One Column

In [2]:
df.stat.freqItems(["fruit"], support=0.3).show(truncate=False)


+---------------+
|fruit_freqItems|
+---------------+
|[Banana, Apple]|
+---------------+



✅ Apple and Banana each appear in 4 of 10 rows (40%), which is above the 30% threshold, so both are returned.

### 🔹 Example 2 — Find Frequent Items in Multiple Columns

In [3]:
df.stat.freqItems(["fruit", "color"], support=0.3).show(truncate=False)



+---------------+--------------------+
|fruit_freqItems|color_freqItems     |
+---------------+--------------------+
|[Banana, Apple]|[Green, Red, Yellow]|
+---------------+--------------------+



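✅ Here:

Banana and Apple are frequent fruits (40% each).

Yellow (40%) and Red (30%) are frequent colors.

Green is also returned even though it appears in only 2 of 10 rows (20%). freqItems() is approximate and can return false positives like this one.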
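To see which colors actually reach the 30% threshold, the exact shares can be checked in plain Python. This is only a sanity check on the ten sample rows; the `colors` list is copied by hand from the sample data, and the variable names are illustrative.

```python
from collections import Counter

# Color column of the sample DataFrame, copied by hand from the sample data.
colors = ["Red", "Green", "Yellow", "Yellow", "Orange",
          "Red", "Red", "Yellow", "Yellow", "Green"]

support = 0.3
total = len(colors)
counts = Counter(colors)

# Exact share of rows for every distinct color.
shares = {color: n / total for color, n in counts.items()}

# Colors whose exact share reaches the support threshold.
meets_support = sorted(c for c, s in shares.items() if s >= support)

print(shares)
print(meets_support)
```

Only Red and Yellow meet the threshold exactly; Green sits at 20%, so its presence in the `color_freqItems` output comes from the approximation.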

### 🔹 Example 3 — Lowering Support Threshold

In [4]:
df.stat.freqItems(["fruit", "color"], support=0.1).show(truncate=False)



+------------------------------+----------------------------+
|fruit_freqItems               |color_freqItems             |
+------------------------------+----------------------------+
|[Orange, Banana, Grape, Apple]|[Green, Orange, Red, Yellow]|
+------------------------------+----------------------------+



✅ With support=0.1, every distinct value is returned. Orange and Grape (fruit) and Orange (color) now appear as well, each occurring in 1 of 10 rows (10%).

### 🔹 Example 4 — Using with Filter & Select

The array returned by freqItems can be exploded into one row per frequent value, which makes it easier to select, filter, or join on:

In [5]:
frequent_fruits = df.stat.freqItems(["fruit"], support=0.3)
frequent_fruits.selectExpr("explode(fruit_freqItems) as frequent_fruit").show()



+--------------+
|frequent_fruit|
+--------------+
|        Banana|
|         Apple|
+--------------+



### 🔹 When to Use freqItems()

✅ Best for:

- Quickly detecting dominant categories
- Pre-processing before encoding categorical variables
- Identifying skewed data
- Feature engineering for ML models

⚠️ Note:
freqItems() returns only the values, not how often they occur. For exact counts, use:

In [6]:
df.groupBy("fruit").count().orderBy("count", ascending=False).show()


+------+-----+
| fruit|count|
+------+-----+
| Apple|    4|
|Banana|    4|
|Orange|    1|
| Grape|    1|
+------+-----+



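groupBy().count() gives exact counts but has to aggregate every distinct value. freqItems() is usually faster on large datasets because it makes a single approximate pass that tracks only a bounded number of candidate values. In exchange, it returns no counts and may include false positives.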
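
To give a feel for how a bounded-memory approximation can report an item that is below the threshold, here is a simplified counter-based sketch in plain Python. It is an illustration only, not Spark's actual implementation: the function name `approx_frequent_items`, the choice of `int(1 / support)` counters, and the fixed row order are all assumptions made for this example.

```python
def approx_frequent_items(values, support):
    """Simplified counter-based sketch; NOT Spark's actual implementation."""
    # Assumption: track at most 1/support candidate values at a time.
    max_counters = int(1 / support)
    counters = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_counters:
            counters[value] = 1
        else:
            # No free slot: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Whatever survives is reported as "frequent"; no counts are returned.
    return list(counters)


# Columns of the sample DataFrame, copied by hand and processed in row order.
fruits = ["Apple", "Apple", "Banana", "Banana", "Orange",
          "Apple", "Apple", "Banana", "Banana", "Grape"]
colors = ["Red", "Green", "Yellow", "Yellow", "Orange",
          "Red", "Red", "Yellow", "Yellow", "Green"]

print(approx_frequent_items(fruits, 0.3))
print(approx_frequent_items(colors, 0.3))
```

On this data the sketch keeps Apple and Banana for `fruits`, and Yellow, Red, and Green for `colors`. Green survives only because it arrives last and finds a free slot, which is the kind of false positive an exact `groupBy().count()` would never produce.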

| Use Case              | Function               | Speed  | Approx / Exact |
| --------------------- | ---------------------- | ------ | -------------- |
| Approx frequent items | `df.stat.freqItems()`  | Fast   | Approximation  |
| Exact frequency count | `df.groupBy().count()` | Slower | Exact          |
