

# ✅ PySpark Top 100 Methods: **PART 2 (26–50)**



**Category: Aggregations, GroupBy, Joins, Set Operations**

These methods dominate **mid-level and senior data engineer interviews**.



In [1]:
import pathlib

filepath = str(pathlib.Path().cwd().parent / "data" / "Spotify_Songs.csv")
filepath

'd:\\shra1\\github\\pyspark-practice\\data\\Spotify_Songs.csv'

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, expr, to_utc_timestamp, to_date
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DateType, DoubleType, TimestampType

In [3]:
sparksession = SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType([
    StructField("song_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("artist_id", IntegerType(), True),
    StructField("release_date", TimestampType(), True)
])

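# Note: "_corrupt_record" is only populated when the schema contains a
# StringType field with that name; otherwise the raw malformed text is dropped.
df = sparksession.read \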
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .csv(filepath)

In [4]:
df.show(10, truncate=False)

+-------+-------+---------+--------------------------+
|song_id|title  |artist_id|release_date              |
+-------+-------+---------+--------------------------+
|1      |Song_1 |2        |2021-10-15 10:15:47.006571|
|2      |Song_2 |45       |2020-12-07 10:15:47.006588|
|3      |Song_3 |25       |2022-07-11 10:15:47.006591|
|4      |Song_4 |25       |2019-03-09 10:15:47.006593|
|5      |Song_5 |26       |2019-09-07 10:15:47.006596|
|6      |Song_6 |27       |2023-03-25 10:15:47.006598|
|7      |Song_7 |34       |2023-01-07 10:15:47.006602|
|8      |Song_8 |18       |2023-01-30 10:15:47.006604|
|9      |Song_9 |14       |2020-05-21 10:15:47.006606|
|10     |Song_10|1        |2021-09-26 10:15:47.006609|
+-------+-------+---------+--------------------------+
only showing top 10 rows


In [5]:
df = df.withColumn("date", to_date(df.release_date, "yyyy-MM-dd"))

In [6]:
df.schema

StructType([StructField('song_id', IntegerType(), True), StructField('title', StringType(), True), StructField('artist_id', IntegerType(), True), StructField('release_date', TimestampType(), True), StructField('date', DateType(), True)])

In [7]:
df.printSchema()

root
 |-- song_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- release_date: timestamp (nullable = true)
 |-- date: date (nullable = true)





## 2️⃣6️⃣ `df.groupBy()`

### **What**

Groups rows based on column(s).

### **Why**

Required for aggregations.

### **How**

In [8]:
df.groupBy(df.artist_id).count().show(truncate=False)

+---------+-----+
|artist_id|count|
+---------+-----+
|31       |3    |
|34       |2    |
|28       |3    |
|26       |4    |
|27       |2    |
|44       |2    |
|12       |3    |
|22       |1    |
|47       |2    |
|1        |7    |
|13       |3    |
|6        |3    |
|16       |1    |
|3        |1    |
|20       |1    |
|48       |1    |
|5        |1    |
|19       |1    |
|15       |2    |
|43       |3    |
+---------+-----+
only showing top 20 rows


### **Interview**

> `groupBy()` alone computes nothing: it returns a `GroupedData` object and needs an aggregation such as `count()` or `agg()` to produce a DataFrame

---

## 2️⃣7️⃣ `agg()`

### **What**

Applies aggregate functions.

### **Why**

Multiple aggregations in one pass.

### **How**

In [9]:
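# Import the functions module under an alias so that Spark's max/sum/min
# do not shadow Python's built-ins of the same name.
from pyspark.sql import functions as F

df_agg = df.groupBy(df.artist_id).agg(
    F.avg(df.artist_id),
    F.max(df.artist_id),
    F.sum(df.artist_id),
    F.min(df.artist_id),
    F.count(df.artist_id)
)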

In [10]:
df_agg.show(truncate=False)

+---------+--------------+--------------+--------------+--------------+----------------+
|artist_id|avg(artist_id)|max(artist_id)|sum(artist_id)|min(artist_id)|count(artist_id)|
+---------+--------------+--------------+--------------+--------------+----------------+
|31       |31.0          |31            |93            |31            |3               |
|34       |34.0          |34            |68            |34            |2               |
|28       |28.0          |28            |84            |28            |3               |
|26       |26.0          |26            |104           |26            |4               |
|27       |27.0          |27            |54            |27            |2               |
|44       |44.0          |44            |88            |44            |2               |
|12       |12.0          |12            |36            |12            |3               |
|22       |22.0          |22            |22            |22            |1               |
|47       |47.0      

In [11]:
df_agg.printSchema()

root
 |-- artist_id: integer (nullable = true)
 |-- avg(artist_id): double (nullable = true)
 |-- max(artist_id): integer (nullable = true)
 |-- sum(artist_id): long (nullable = true)
 |-- min(artist_id): integer (nullable = true)
 |-- count(artist_id): long (nullable = false)



In [12]:
df_agg.count()

46

---

## 2️⃣8️⃣ `count()`

### **What**

Counts rows per group.

### **How**

In [25]:
df.groupBy("artist_id").count().filter(col("count") > 3).show()

+---------+-----+
|artist_id|count|
+---------+-----+
|       26|    4|
|        1|    7|
|       50|    4|
|       33|    5|
+---------+-----+



---

## 2️⃣9️⃣ `sum()`

### **What**

Computes the sum of a numeric column, here per group.

### **How**

In [31]:
df.groupBy("artist_id").sum("artist_id").filter(col("sum(artist_id)") > 100).show()

+---------+--------------+
|artist_id|sum(artist_id)|
+---------+--------------+
|       26|           104|
|       43|           129|
|       50|           200|
|       33|           165|
|       46|           138|
|       36|           108|
+---------+--------------+





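---

## 3️⃣0️⃣ `avg()`

### **What**

Average (mean) of a numeric column per group.

### **Interview**

> Returns a `double` for integer and floating-point columns (see `avg(artist_id): double` in the schema printed earlier); `decimal` columns yield a `decimal` result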





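---

## 3️⃣1️⃣ `min()` / `max()`

### **What**

Minimum / maximum value per group.

### **How**

In [None]:
# From here on the cells use illustrative names (dept, salary, user_id, df1, df2)
# that are not part of the Spotify dataset loaded above.
df.groupBy("dept").max("salary")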

---

## 3️⃣2️⃣ `countDistinct()`

### **What**

Counts unique values.

### **How**

In [None]:
from pyspark.sql.functions import countDistinct
df.select(countDistinct("user_id"))

### **Interview**

> Expensive: an exact distinct count requires a shuffle so that equal values end up on the same partition

---

## 3️⃣3️⃣ `approx_count_distinct()`

### **What**

Approximate distinct count.

### **Why**

Much cheaper than an exact `countDistinct()` on large, high-cardinality columns, in exchange for a small, bounded error.

### **How**

In [None]:
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("user_id"))

### **Interview**

> Uses **HyperLogLog++**; the optional `rsd` argument sets the maximum allowed relative standard deviation (default 0.05)

---

## 3️⃣4️⃣ `pivot()`

### **What**

Converts rows → columns: each distinct value of the pivot column becomes its own output column.

### **How**

In [None]:
df.groupBy("dept").pivot("year").sum("salary")

### **Use Case**

* Reports
* BI transformations
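
To make the rows-to-columns idea concrete, here is a plain-Python model of what `groupBy("dept").pivot("year").sum("salary")` computes. It is a sketch of the semantics only, not Spark code; the `rows` sample data and the `pivot_sum` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows: (dept, year, salary)
rows = [
    ("eng", 2022, 100),
    ("eng", 2022, 50),
    ("eng", 2023, 120),
    ("ops", 2023, 80),
]

def pivot_sum(rows):
    """Model of groupBy(dept).pivot(year).sum(salary) on plain tuples."""
    # Distinct pivot values become the output columns, in sorted order.
    years = sorted({year for _, year, _ in rows})
    totals = defaultdict(dict)
    for dept, year, salary in rows:
        totals[dept][year] = totals[dept].get(year, 0) + salary
    # A (dept, year) pair with no rows yields None, like a null cell in Spark.
    table = {dept: [cells.get(y) for y in years] for dept, cells in totals.items()}
    return years, table

years, table = pivot_sum(rows)
print(["dept"] + years)
for dept, cells in sorted(table.items()):
    print([dept] + cells)
```

The `ops` row gets `None` for 2022 because no input row has that combination, which mirrors the nulls a pivoted DataFrame shows for missing cells.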

---

## 3️⃣5️⃣ `join()`

### **What**

Combines two DataFrames on a key column or join condition.

### **How**

In [None]:
df1.join(df2, on="id", how="inner")

---

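## 3️⃣6️⃣ Join Types

| Type  | Use                                                           |
| ----- | ------------------------------------------------------------- |
| inner | Only rows with a match on both sides                          |
| left  | All left rows; right columns are null where nothing matches   |
| right | All right rows; left columns are null where nothing matches   |
| full  | All rows from both sides, with nulls where there is no match  |
| semi  | Left rows that have a match in right (left columns only)      |
| anti  | Left rows that have no match in right (left columns only)     |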

In [None]:
df1.join(df2, "id", "left")
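
The table above can be checked against a small plain-Python model. This is a sketch of which join keys survive each join type, not Spark code; `left_keys`, `right_keys` and `surviving_keys` are hypothetical, and the model assumes each key appears once per side (duplicate keys would multiply rows in a real join).

```python
# Hypothetical join keys on each side.
left_keys = [1, 2, 3]
right_keys = [2, 3, 4]

def surviving_keys(left, right, how):
    """Return the join keys present in the result for a given join type."""
    lk, rk = set(left), set(right)
    if how == "inner":
        return sorted(lk & rk)
    if how == "left":
        return sorted(lk)
    if how == "right":
        return sorted(rk)
    if how == "full":
        return sorted(lk | rk)
    if how == "semi":   # left rows with a match; only left columns are kept
        return sorted(lk & rk)
    if how == "anti":   # left rows without a match
        return sorted(lk - rk)
    raise ValueError(f"unknown join type: {how}")

for how in ["inner", "left", "right", "full", "semi", "anti"]:
    print(f"{how:>5}: {surviving_keys(left_keys, right_keys, how)}")
```

`inner` and `semi` keep the same keys; the difference is that a semi join returns only the left DataFrame's columns.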

---

## 3️⃣7️⃣ `broadcast()`

### **What**

Marks a small DataFrame to be broadcast, i.e. copied in full to every executor.

### **Why**

Avoids shuffling the large table across the cluster.

### **How**

In [None]:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id")

### **Interview Gold**

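> Broadcast when the small side is roughly **< 10–50 MB**; Spark broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default)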

---

## 3️⃣8️⃣ `crossJoin()`

### **What**

Cartesian product: every row of one DataFrame paired with every row of the other.

### **Danger**

❌ Extremely expensive: the output has `rows(df1) × rows(df2)` rows.
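
Why a Cartesian product gets out of hand can be seen with plain Python. This is a sketch using `itertools.product`, not Spark code; the `users` and `products` lists and the `cross_join_rows` helper are hypothetical.

```python
from itertools import product

# Hypothetical tiny inputs; real tables are where this becomes dangerous.
users = ["u1", "u2", "u3"]
products = ["p1", "p2"]

# A cross join pairs every left row with every right row.
pairs = list(product(users, products))
print(len(pairs))   # 3 * 2 rows
print(pairs[:3])

def cross_join_rows(left_rows, right_rows):
    """Row count of a Cartesian product: the product of the two sizes."""
    return left_rows * right_rows

# Two modest one-million-row tables already produce a trillion rows.
print(cross_join_rows(1_000_000, 1_000_000))
```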

---

## 3️⃣9️⃣ `union()`

### **What**

Row-wise union: appends the rows of `df2` to `df1`.

### **How**

In [None]:
df1.union(df2)

### **Requirement**

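> Same number of columns in the same order: columns are matched by **position**, not by name. Duplicates are kept (SQL `UNION ALL`)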

---

## 4️⃣0️⃣ `unionByName()`

### **What**

Union by column name.

### **Why**

Safe when column order differs; with `allowMissingColumns=True`, columns missing on one side are filled with nulls.

### **How**

In [None]:
df1.unionByName(df2, allowMissingColumns=True)
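
The difference between positional and by-name matching is easy to get wrong, so here is a plain-Python model of both. It is a sketch of the column-matching rule only (types and `allowMissingColumns` are ignored), not Spark code; the tuple representation and both helper functions are hypothetical.

```python
# Hypothetical DataFrames modelled as (column names, rows).
df1 = (["id", "name"], [(1, "a")])
df2 = (["name", "id"], [("b", 2)])   # same columns, different order

def union_by_position(a, b):
    """Model of union(): keep a's column names, append b's rows as they are."""
    cols, rows = a
    return cols, rows + b[1]

def union_by_name(a, b):
    """Model of unionByName(): reorder b's values to match a's column names."""
    cols, rows = a
    b_cols, b_rows = b
    idx = [b_cols.index(c) for c in cols]
    return cols, rows + [tuple(row[i] for i in idx) for row in b_rows]

print(union_by_position(df1, df2))  # second row has name and id swapped
print(union_by_name(df1, df2))      # second row lines up with the columns
```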

---

## 4️⃣1️⃣ `intersect()`

### **What**

Rows common to both DataFrames, with duplicates removed (SQL `INTERSECT`).

### **Cost**

⚠️ Shuffle required.

---

## 4️⃣2️⃣ `exceptAll()`

### **What**

Rows in DF1 that are not in DF2, keeping duplicates (SQL `EXCEPT ALL`). `subtract()` is the variant that removes duplicates.
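
How duplicates are treated is the usual stumbling block with set operations. The following plain-Python model contrasts `intersect()`, `subtract()` and `exceptAll()`; it is a sketch of the semantics, not Spark code, and the sample rows and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical single-column rows; duplicates matter here.
rows1 = ["a", "a", "a", "b", "c"]
rows2 = ["a", "b", "d"]

def intersect(x, y):
    """Model of intersect(): distinct rows present on both sides."""
    return sorted(set(x) & set(y))

def subtract(x, y):
    """Model of subtract(): distinct rows of x that do not appear in y."""
    return sorted(set(x) - set(y))

def except_all(x, y):
    """Model of exceptAll(): each row in y cancels one matching row in x."""
    return sorted((Counter(x) - Counter(y)).elements())

print(intersect(rows1, rows2))   # ['a', 'b']
print(subtract(rows1, rows2))    # ['c']
print(except_all(rows1, rows2))  # ['a', 'a', 'c']
```

One `"a"` in `rows2` cancels only one of the three `"a"` rows in `rows1`, so two survive under `exceptAll()`, while `subtract()` drops the value entirely.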

---

## 4️⃣3️⃣ `having` (via filter)

### **What**

Filter after aggregation.

### **How**

In [None]:
df.groupBy("dept").count().filter("count > 10")

---

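## 4️⃣4️⃣ `groupingSets()`

### **What**

Aggregates over several explicitly chosen column combinations in one query (SQL `GROUP BY GROUPING SETS`).

### **Why**

Advanced analytics: subtotals at several levels without running and unioning separate `groupBy()` queries.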

---

## 4️⃣5️⃣ `rollup()`

### **What**

Hierarchical aggregation: subtotals for each leading prefix of the column list, plus a grand total.

### **How**

In [None]:
df.rollup("country", "state").sum("sales")

---

## 4️⃣6️⃣ `cube()`

### **What**

Aggregation over all combinations of the given columns.

### **Interview**

> More expensive than `rollup`: n columns give 2^n grouping sets, versus n + 1 for `rollup`
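
The cost gap between `rollup` and `cube` comes from how many grouping sets each one aggregates over. The plain-Python sketch below enumerates them for the `("country", "state")` columns used in the `rollup` cell; it models the grouping sets only and is not Spark code, and `rollup_sets` and `cube_sets` are hypothetical helpers.

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets of rollup: every prefix of the column list, down to ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets of cube: every subset of the column list."""
    return [
        combo
        for size in range(len(cols), -1, -1)
        for combo in combinations(cols, size)
    ]

cols = ["country", "state"]
print(rollup_sets(cols))  # [('country', 'state'), ('country',), ()]
print(cube_sets(cols))    # [('country', 'state'), ('country',), ('state',), ()]

# The gap widens quickly: n + 1 grouping sets versus 2 ** n.
wide = ["a", "b", "c", "d", "e"]
print(len(rollup_sets(wide)), len(cube_sets(wide)))  # 6 32
```

The empty tuple `()` stands for the grand total, which both operations include.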

---

## 4️⃣7️⃣ `repartition()`

### **What**

Changes the partition count, up or down, with a full shuffle.

### **How**

In [None]:
df.repartition(10)

---

## 4️⃣8️⃣ `coalesce()`

### **What**

Reduces the partition count by merging existing partitions (no full shuffle).

### **How**

In [None]:
df.coalesce(5)
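
The practical consequence of "shuffle" versus "no shuffle" is partition balance. The plain-Python sketch below is a deliberately simplified model, not Spark code: it assumes `repartition(n)` spreads rows round-robin and that `coalesce(n)` merges neighbouring partitions, whereas Spark chooses which partitions to merge itself. The `partitions` layout and both helpers are hypothetical.

```python
# Hypothetical starting layout: four partitions with skewed sizes.
partitions = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

def repartition(parts, n):
    """Model of a shuffle: every row is redistributed round-robin into n partitions."""
    rows = [row for part in parts for row in part]
    return [rows[i::n] for i in range(n)]

def coalesce(parts, n):
    """Model of coalesce: whole partitions are merged; rows are never split up."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged

print([len(p) for p in repartition(partitions, 2)])  # [5, 5]
print([len(p) for p in coalesce(partitions, 2)])     # [7, 3]
```

In this model `coalesce` is cheaper because no row is redistributed individually, but it inherits whatever skew the input partitions already had.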

---

## 4️⃣9️⃣ `explain()`

### **What**

Prints the execution plan.

### **How**

In [None]:
df.explain(True)

### **Interview**

> Know **Logical vs Physical Plan**

---

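## 5️⃣0️⃣ `cache()` / `persist()`

### **What**

Keeps a computed DataFrame for reuse. `cache()` uses the default storage level (memory, spilling to disk); `persist()` accepts an explicit `StorageLevel`.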

### **How**

In [None]:
df.cache()

### **Interview**

> Lazy: nothing is stored until the first action runs. Use it when the same DataFrame feeds **multiple actions**

---

# 🔥 Interview Coverage (Part 2)

* ✔ Aggregations
* ✔ Joins (broadcast vs shuffle)
* ✔ Set operations
* ✔ Performance tuning

---

## 👉 Next:

**PART 3 (51–75):**

* Window functions
* UDFs
* JSON / Date / Array / Map functions
* explode, collect_list
* SQL functions vs DataFrame API
