

# ✅ PySpark Top 100 Methods: **PART 2 (26–50)**



**Category: Aggregations, GroupBy, Joins, Set Operations**

These methods dominate **mid-level and senior data engineer interviews**.



In [1]:
import pathlib

filepath = str(pathlib.Path().cwd().parent / "data" / "Spotify_Songs.csv")
filepath

'd:\\shra1\\github\\pyspark-practice\\data\\Spotify_Songs.csv'

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, expr, to_utc_timestamp, to_date, year, month, dayofmonth, hour, minute, second
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DateType, DoubleType, TimestampType

In [3]:
sparksession = SparkSession.builder.appName("MyApp").getOrCreate()

schema = StructType([
    StructField("song_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("artist_id", IntegerType(), True),
    StructField("release_date", TimestampType(), True)
])

df = sparksession.read \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .csv(filepath)

In [4]:
df.show(10, truncate=False)

+-------+-------+---------+--------------------------+
|song_id|title  |artist_id|release_date              |
+-------+-------+---------+--------------------------+
|1      |Song_1 |2        |2021-10-15 10:15:47.006571|
|2      |Song_2 |45       |2020-12-07 10:15:47.006588|
|3      |Song_3 |25       |2022-07-11 10:15:47.006591|
|4      |Song_4 |25       |2019-03-09 10:15:47.006593|
|5      |Song_5 |26       |2019-09-07 10:15:47.006596|
|6      |Song_6 |27       |2023-03-25 10:15:47.006598|
|7      |Song_7 |34       |2023-01-07 10:15:47.006602|
|8      |Song_8 |18       |2023-01-30 10:15:47.006604|
|9      |Song_9 |14       |2020-05-21 10:15:47.006606|
|10     |Song_10|1        |2021-09-26 10:15:47.006609|
+-------+-------+---------+--------------------------+
only showing top 10 rows


In [5]:
df = df.withColumn("date", to_date(df.release_date, "yyyy-MM-dd"))

In [6]:
df.schema

StructType([StructField('song_id', IntegerType(), True), StructField('title', StringType(), True), StructField('artist_id', IntegerType(), True), StructField('release_date', TimestampType(), True), StructField('date', DateType(), True)])

In [7]:
df.printSchema()

root
 |-- song_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- release_date: timestamp (nullable = true)
 |-- date: date (nullable = true)



In [8]:
df = df.withColumn("date", dayofmonth(df.release_date)) \
    .withColumn("month", month(df.release_date)) \
    .withColumn("year", year(df.release_date)) \
    .withColumn("hour", hour(df.release_date)) \
    .withColumn("minute", minute(df.release_date)) \
    .withColumn("second", second(df.release_date))

In [None]:
df.show(5)

+-------+------+---------+--------------------+----+-----+----+----+------+------+
|song_id| title|artist_id|        release_date|date|month|year|hour|minute|second|
+-------+------+---------+--------------------+----+-----+----+----+------+------+
|      1|Song_1|        2|2021-10-15 10:15:...|  15|   10|2021|  10|    15|    47|
|      2|Song_2|       45|2020-12-07 10:15:...|   7|   12|2020|  10|    15|    47|
|      3|Song_3|       25|2022-07-11 10:15:...|  11|    7|2022|  10|    15|    47|
|      4|Song_4|       25|2019-03-09 10:15:...|   9|    3|2019|  10|    15|    47|
|      5|Song_5|       26|2019-09-07 10:15:...|   7|    9|2019|  10|    15|    47|
+-------+------+---------+--------------------+----+-----+----+----+------+------+
only showing top 5 rows




## 2️⃣6️⃣ `df.groupBy()`

### **What**

Groups rows based on column(s).

### **Why**

Required for aggregations.

### **How**

In [9]:
df.groupBy("year").count().sort("year").show(truncate=False)

+----+-----+
|year|count|
+----+-----+
|2018|1    |
|2019|21   |
|2020|25   |
|2021|19   |
|2022|15   |
|2023|19   |
+----+-----+



### **Interview**

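> `groupBy()` alone computes nothing: it returns a `GroupedData` object, and a result only appears once an aggregation such as `count()` or `agg()` is applied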

---

## 2️⃣7️⃣ `agg()`

### **What**

Applies aggregate functions.

### **Why**

Multiple aggregations in one pass.

### **How**

In [10]:
# Note: this import shadows Python's built-in max, sum and min for the rest of the notebook
from pyspark.sql.functions import avg, max, sum, min, count

df_agg = df.groupBy(df.year).agg(
    avg(df.month),
    max(df.month),
    sum(df.month),
    min(df.month),
    count(df.month)
)

In [11]:
df_agg.show(truncate=False)

+----+-----------------+----------+----------+----------+------------+
|year|avg(month)       |max(month)|sum(month)|min(month)|count(month)|
+----+-----------------+----------+----------+----------+------------+
|2018|11.0             |11        |11        |11        |1           |
|2023|4.157894736842105|11        |79        |1         |19          |
|2022|7.066666666666666|12        |106       |1         |15          |
|2019|7.380952380952381|12        |155       |1         |21          |
|2020|7.36             |12        |184       |1         |25          |
|2021|6.473684210526316|10        |123       |1         |19          |
+----+-----------------+----------+----------+----------+------------+
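The "one pass" idea behind `agg()` can be sketched in plain Python. This is only a model of the semantics, not how Spark executes the query, and the `(year, month)` rows are hypothetical rather than the notebook's data.

```python
# Plain-Python model of df.groupBy("year").agg(avg, max, sum, min, count):
# every aggregate is updated inside the same loop over the rows.
# The rows are hypothetical (year, month) pairs.
rows = [(2020, 12), (2020, 5), (2021, 10), (2021, 9), (2021, 1)]

stats = {}  # year -> running aggregates
for year, month in rows:
    s = stats.setdefault(year, {"sum": 0, "count": 0, "min": month, "max": month})
    s["sum"] += month
    s["count"] += 1
    s["min"] = min(s["min"], month)
    s["max"] = max(s["max"], month)

# avg is derived from sum and count once the pass is finished
for s in stats.values():
    s["avg"] = s["sum"] / s["count"]
```

Running five separate `groupBy(...).max(...)`, `groupBy(...).min(...)` queries would scan and group the data five times; putting them in one `agg()` call lets them share the grouping work.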



In [12]:
df_agg.printSchema()

root
 |-- year: integer (nullable = true)
 |-- avg(month): double (nullable = true)
 |-- max(month): integer (nullable = true)
 |-- sum(month): long (nullable = true)
 |-- min(month): integer (nullable = true)
 |-- count(month): long (nullable = false)



In [13]:
df_agg.count()

6

In [14]:
df.groupBy("year").count().filter(col("count") > 3).show()

+----+-----+
|year|count|
+----+-----+
|2023|   19|
|2022|   15|
|2019|   21|
|2020|   25|
|2021|   19|
+----+-----+



---

## 2️⃣9️⃣ `sum()`

### **What**

Sums a numeric column within each group.

### **How**

In [15]:
df.groupBy("year").sum("artist_id").filter(col("sum(artist_id)") > 100).show()

+----+--------------+
|year|sum(artist_id)|
+----+--------------+
|2023|           519|
|2022|           321|
|2019|           611|
|2020|           613|
|2021|           412|
+----+--------------+





## 3️⃣0️⃣ `avg()`

### **What**

Arithmetic mean of a numeric column within each group.

### **Interview**

> For integer and floating-point columns the result is a **double**; decimal columns return a decimal



In [16]:
df.show()

+-------+-------+---------+--------------------+----+-----+----+----+------+------+
|song_id|  title|artist_id|        release_date|date|month|year|hour|minute|second|
+-------+-------+---------+--------------------+----+-----+----+----+------+------+
|      1| Song_1|        2|2021-10-15 10:15:...|  15|   10|2021|  10|    15|    47|
|      2| Song_2|       45|2020-12-07 10:15:...|   7|   12|2020|  10|    15|    47|
|      3| Song_3|       25|2022-07-11 10:15:...|  11|    7|2022|  10|    15|    47|
|      4| Song_4|       25|2019-03-09 10:15:...|   9|    3|2019|  10|    15|    47|
|      5| Song_5|       26|2019-09-07 10:15:...|   7|    9|2019|  10|    15|    47|
|      6| Song_6|       27|2023-03-25 10:15:...|  25|    3|2023|  10|    15|    47|
|      7| Song_7|       34|2023-01-07 10:15:...|   7|    1|2023|  10|    15|    47|
|      8| Song_8|       18|2023-01-30 10:15:...|  30|    1|2023|  10|    15|    47|
|      9| Song_9|       14|2020-05-21 10:15:...|  21|    5|2020|  10|    15|

In [17]:
df.groupBy("year").avg("artist_id").show(5)

+----+------------------+
|year|    avg(artist_id)|
+----+------------------+
|2018|              38.0|
|2023| 27.31578947368421|
|2022|              21.4|
|2019|29.095238095238095|
|2020|             24.52|
+----+------------------+
only showing top 5 rows
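As a quick cross-check, the averages above should equal the earlier `sum(artist_id)` values divided by the per-year counts, computed in double precision. The sketch below redoes that division in plain Python, whose `float` is also an IEEE 754 double; the numbers are copied from the outputs in this notebook.

```python
# (sum(artist_id), count) per year, copied from the sum() and count() outputs above
sums_and_counts = {
    2019: (611, 21),
    2020: (613, 25),
    2022: (321, 15),
    2023: (519, 19),
}

# Python's float is an IEEE 754 double, the same representation as Spark's DoubleType
averages = {year: total / n for year, (total, n) in sums_and_counts.items()}
```

The resulting values match the `avg(artist_id)` column digit for digit, which is a handy way to convince yourself that `avg` is simply `sum / count`.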




## 3️⃣1️⃣ `min()` / `max()`

### **What**

Minimum / maximum value.

### **How**

In [18]:
df.groupby("year").min("artist_id").show(5)

+----+--------------+
|year|min(artist_id)|
+----+--------------+
|2018|            38|
|2023|             7|
|2022|             1|
|2019|             1|
|2020|             1|
+----+--------------+
only showing top 5 rows


In [19]:
df.groupby("year").max("artist_id").show(5)

+----+--------------+
|year|max(artist_id)|
+----+--------------+
|2018|            38|
|2023|            50|
|2022|            50|
|2019|            47|
|2020|            50|
+----+--------------+
only showing top 5 rows


---

## 3️⃣2️⃣ `countDistinct()`

### **What**

Counts unique values.

### **How**

In [20]:
from pyspark.sql.functions import countDistinct

df.select(countDistinct("year")).show()

+--------------------+
|count(DISTINCT year)|
+--------------------+
|                   6|
+--------------------+



### **Interview**

> Expensive: an exact distinct count requires a shuffle so that equal values from different partitions can be deduplicated

---

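## 3️⃣3️⃣ `approx_count_distinct()`

### **What**

Approximate distinct count.

### **Why**

Far cheaper than an exact `countDistinct()` on large, high-cardinality columns, in exchange for a small estimation error (controlled by the `rsd` parameter, default 0.05).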

### **How**

In [22]:
from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("year")).show()

+---------------------------+
|approx_count_distinct(year)|
+---------------------------+
|                          6|
+---------------------------+



In [25]:
df.select("year").distinct().show()

+----+
|year|
+----+
|2018|
|2023|
|2022|
|2019|
|2020|
|2021|
+----+



### **Interview**

> Uses **HyperLogLog++**
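To give a feel for the idea, here is a minimal sketch of the classic HyperLogLog estimator in plain Python. It is a simplification for illustration, not Spark's implementation: it leaves out the bias-correction tables that HyperLogLog++ adds, and the function name `hll_estimate`, the choice of SHA-1 as the hash, and the register count are all assumptions made for this example.

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (SHA-1 is just a convenient stable hash here)
        digest = hashlib.sha1(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # rank = position of the leftmost 1-bit within the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:               # small-range correction
        return m * math.log(m / zeros)         # linear counting
    return raw

# Six distinct years repeated many times: duplicates never change the registers
years = [2018, 2019, 2020, 2021, 2022, 2023] * 100
estimate = hll_estimate(years)
```

The memory used depends only on the number of registers, not on the number of rows, and two register arrays can be merged with an element-wise `max`. That mergeability is what makes this kind of sketch attractive for distributed counting.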

---

## 3️⃣4️⃣ `pivot()`

### **What**

Turns the distinct values of one column into separate output columns (rows → columns).

### **How**

In [27]:
df_stat = df.groupBy("month").pivot("year").count().sort("month").fillna(0)
df_stat.show()

+-----+----+----+----+----+----+----+
|month|2018|2019|2020|2021|2022|2023|
+-----+----+----+----+----+----+----+
|    1|   0|   1|   1|   2|   1|   5|
|    2|   0|   0|   5|   1|   0|   2|
|    3|   0|   3|   0|   1|   0|   2|
|    4|   0|   0|   1|   1|   2|   3|
|    5|   0|   2|   1|   1|   0|   3|
|    6|   0|   2|   2|   0|   1|   0|
|    7|   0|   3|   2|   5|   6|   1|
|    8|   0|   2|   1|   2|   2|   0|
|    9|   0|   2|   3|   4|   0|   1|
|   10|   0|   2|   1|   2|   1|   1|
|   11|   1|   0|   3|   0|   1|   1|
|   12|   0|   4|   5|   0|   1|   0|
+-----+----+----+----+----+----+----+



### **Use Case**

* Reports
* BI transformations
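What `groupBy("month").pivot("year").count().fillna(0)` computes can be modelled in a few lines of plain Python. The `(month, year)` rows are hypothetical, and this is only a sketch of the semantics.

```python
from collections import Counter

# Hypothetical (month, year) pairs standing in for the songs table
rows = [(1, 2021), (1, 2023), (1, 2023), (2, 2020), (7, 2021)]

counts = Counter(rows)                        # like groupBy("month", "year").count()
years = sorted({year for _, year in rows})    # distinct pivot values become columns
months = sorted({month for month, _ in rows})

# One output row per month, one column per year; missing cells are filled with 0
pivoted = {m: {y: counts.get((m, y), 0) for y in years} for m in months}
```

Note the extra step of collecting the distinct years. Spark has to do the same when `pivot()` is called with only a column name; passing the values explicitly, as in `pivot("year", [2018, 2019, 2020, 2021, 2022, 2023])`, avoids that additional pass.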



In [28]:
artists_file = str(pathlib.Path().cwd().parent / "data" / "Spotify_Artists.csv")

spotify_artists_schema = StructType([
    StructField("artist_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("country", StringType(), True)
])

artists_df = sparksession.read \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(spotify_artists_schema) \
    .csv(artists_file)


In [29]:
artists_df.show(5)

+---------+--------+----------+---------+
|artist_id|    name|     genre|  country|
+---------+--------+----------+---------+
|        1|Artist_1|Electronic|   France|
|        2|Artist_2|Electronic|Australia|
|        3|Artist_3|      Jazz|   France|
|        4|Artist_4| Classical|Australia|
|        5|Artist_5|   Hip-Hop|      USA|
+---------+--------+----------+---------+
only showing top 5 rows


In [30]:
listening_file = str(pathlib.Path().cwd().parent / "data" / "Spotify_Listening_Activity.csv")

spotify_listening_schema = StructType([
    StructField("activity_id", IntegerType(), True),
    StructField("song_id", IntegerType(), True),
    StructField("listen_date", TimestampType(), True),
    StructField("listen_duration", IntegerType(), True)
])

listening_df = sparksession.read \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(spotify_listening_schema) \
    .csv(listening_file)

In [31]:
listening_df.show()

+-----------+-------+--------------------+---------------+
|activity_id|song_id|         listen_date|listen_duration|
+-----------+-------+--------------------+---------------+
|          1|     12|2023-06-27 10:15:...|             69|
|          2|     44|2023-06-27 10:15:...|            300|
|          3|     75|2023-06-27 10:15:...|             73|
|          4|     48|2023-06-27 10:15:...|            105|
|          5|     10|2023-06-27 10:15:...|            229|
|          6|     82|2023-06-27 10:15:...|             35|
|          7|     64|2023-06-27 10:15:...|            249|
|          8|     96|2023-06-27 10:15:...|            211|
|          9|     52|2023-06-27 10:15:...|             99|
|         10|     21|2023-06-27 10:15:...|            181|
|         11|      4|2023-06-27 10:15:...|            175|
|         12|      6|2023-06-27 10:15:...|            244|
|         13|     90|2023-06-27 10:15:...|            129|
|         14|     33|2023-06-27 10:15:...|            26

In [78]:
songs_file = str(pathlib.Path().cwd().parent / "data" / "Spotify_Songs.csv")

spotify_songs_schema =  StructType([
    StructField("song_id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("artist_id", IntegerType(), True),
    StructField("release_date", TimestampType(), True)
])

songs_df = sparksession.read \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(spotify_songs_schema) \
    .csv(songs_file)

In [33]:
songs_df.show(5)

+-------+------+---------+--------------------+
|song_id| title|artist_id|        release_date|
+-------+------+---------+--------------------+
|      1|Song_1|        2|2021-10-15 10:15:...|
|      2|Song_2|       45|2020-12-07 10:15:...|
|      3|Song_3|       25|2022-07-11 10:15:...|
|      4|Song_4|       25|2019-03-09 10:15:...|
|      5|Song_5|       26|2019-09-07 10:15:...|
+-------+------+---------+--------------------+
only showing top 5 rows


In [34]:
listening_df = listening_df.withColumn("l_date", dayofmonth("listen_date")) \
    .withColumn("l_month", month("listen_date")) \
    .withColumn("l_year", year("listen_date")) 

listening_df.show(5, truncate=False)

+-----------+-------+--------------------------+---------------+------+-------+------+
|activity_id|song_id|listen_date               |listen_duration|l_date|l_month|l_year|
+-----------+-------+--------------------------+---------------+------+-------+------+
|1          |12     |2023-06-27 10:15:47.008867|69             |27    |6      |2023  |
|2          |44     |2023-06-27 10:15:47.008867|300            |27    |6      |2023  |
|3          |75     |2023-06-27 10:15:47.008867|73             |27    |6      |2023  |
|4          |48     |2023-06-27 10:15:47.008867|105            |27    |6      |2023  |
|5          |10     |2023-06-27 10:15:47.008867|229            |27    |6      |2023  |
+-----------+-------+--------------------------+---------------+------+-------+------+
only showing top 5 rows


In [35]:
songs_df = songs_df.withColumn("r_date", dayofmonth("release_date")) \
    .withColumn("r_month", month("release_date")) \
    .withColumn("r_year", year("release_date")) 
    
songs_df.show(5)

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      1|Song_1|        2|2021-10-15 10:15:...|    15|     10|  2021|
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4|Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
+-------+------+---------+--------------------+------+-------+------+
only showing top 5 rows




## 3️⃣5️⃣ `join()`

### **What**

Combines two DataFrames on a key column or a join condition.

### **How**

In [36]:
artist_songs_df = artists_df.join(songs_df, on="artist_id", how="inner")
artist_songs_df.show(5, truncate=False)

+---------+---------+----------+---------+-------+------+--------------------------+------+-------+------+
|artist_id|name     |genre     |country  |song_id|title |release_date              |r_date|r_month|r_year|
+---------+---------+----------+---------+-------+------+--------------------------+------+-------+------+
|2        |Artist_2 |Electronic|Australia|1      |Song_1|2021-10-15 10:15:47.006571|15    |10     |2021  |
|45       |Artist_45|Electronic|UK       |2      |Song_2|2020-12-07 10:15:47.006588|7     |12     |2020  |
|25       |Artist_25|Classical |Canada   |3      |Song_3|2022-07-11 10:15:47.006591|11    |7      |2022  |
|25       |Artist_25|Classical |Canada   |4      |Song_4|2019-03-09 10:15:47.006593|9     |3      |2019  |
|26       |Artist_26|Jazz      |France   |5      |Song_5|2019-09-07 10:15:47.006596|7     |9      |2019  |
+---------+---------+----------+---------+-------+------+--------------------------+------+-------+------+
only showing top 5 rows


In [37]:
artists_df.join(songs_df, on="artist_id", how="anti").show()

+---------+---------+----------+---------+
|artist_id|     name|     genre|  country|
+---------+---------+----------+---------+
|        8| Artist_8|      Rock|  Germany|
|       29|Artist_29| Classical|Australia|
|       40|Artist_40|Electronic|      USA|
|       41|Artist_41|   Hip-Hop|Australia|
+---------+---------+----------+---------+





## 3️⃣6️⃣ Join Types

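| Type  | Use                                                       |
| ----- | --------------------------------------------------------- |
| inner | Only rows with a match on both sides                      |
| left  | All left rows; unmatched right columns are NULL           |
| right | All right rows; unmatched left columns are NULL           |
| full  | All rows from both sides (alias: `outer`)                 |
| semi  | Left rows that have a match in right; left columns only   |
| anti  | Left rows that have no match in right; left columns only  |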
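The difference between these join types is easiest to see on a tiny example. The sketch below models four of them in plain Python with hypothetical miniature tables; row order in real Spark output is not guaranteed.

```python
# Hypothetical miniature tables keyed by artist_id
artists = {1: "Artist_1", 2: "Artist_2", 8: "Artist_8"}   # artist_id -> name
songs = [(10, "Song_10", 1), (1, "Song_1", 2), (16, "Song_16", 1)]  # (song_id, title, artist_id)

song_artist_ids = {artist_id for _, _, artist_id in songs}

# inner: one output row per matching (artist, song) pair
inner = [(a, artists[a], title) for _, title, a in songs if a in artists]

# left: every artist appears; artists without songs get None for the song columns
left = inner + [(a, name, None) for a, name in artists.items()
                if a not in song_artist_ids]

# semi: artists with at least one song, left columns only, never duplicated
semi = [(a, name) for a, name in artists.items() if a in song_artist_ids]

# anti: artists with no song at all
anti = [(a, name) for a, name in artists.items() if a not in song_artist_ids]
```

Notice that `Artist_1` appears twice in `inner` (once per song) but only once in `semi`. That is why a semi join is often the better choice for an "exists" style filter.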

In [38]:
artists_df.join(songs_df, "artist_id", "outer").show()

+---------+---------+----------+-----------+-------+-------+--------------------+------+-------+------+
|artist_id|     name|     genre|    country|song_id|  title|        release_date|r_date|r_month|r_year|
+---------+---------+----------+-----------+-------+-------+--------------------+------+-------+------+
|        1| Artist_1|Electronic|     France|     10|Song_10|2021-09-26 10:15:...|    26|      9|  2021|
|        1| Artist_1|Electronic|     France|     16|Song_16|2020-12-21 10:15:...|    21|     12|  2020|
|        1| Artist_1|Electronic|     France|     22|Song_22|2021-07-19 10:15:...|    19|      7|  2021|
|        1| Artist_1|Electronic|     France|     29|Song_29|2022-07-06 10:15:...|     6|      7|  2022|
|        1| Artist_1|Electronic|     France|     43|Song_43|2021-04-20 10:15:...|    20|      4|  2021|
|        1| Artist_1|Electronic|     France|     53|Song_53|2022-11-27 10:15:...|    27|     11|  2022|
|        1| Artist_1|Electronic|     France|     90|Song_90|2019



## 3️⃣7️⃣ `broadcast()`

### **What**

Hints Spark to send a full copy of a small table to every executor.

### **Why**

Avoids shuffling the large side of the join.

### **How**

In [39]:
artists_df.count(), listening_df.count(), songs_df.count()

(50, 11779, 100)

In [40]:
songs_df.join(artists_df, on="artist_id", how="inner").show()

+---------+-------+-------+--------------------+------+-------+------+---------+----------+-----------+
|artist_id|song_id|  title|        release_date|r_date|r_month|r_year|     name|     genre|    country|
+---------+-------+-------+--------------------+------+-------+------+---------+----------+-----------+
|        2|      1| Song_1|2021-10-15 10:15:...|    15|     10|  2021| Artist_2|Electronic|  Australia|
|       45|      2| Song_2|2020-12-07 10:15:...|     7|     12|  2020|Artist_45|Electronic|         UK|
|       25|      3| Song_3|2022-07-11 10:15:...|    11|      7|  2022|Artist_25| Classical|     Canada|
|       25|      4| Song_4|2019-03-09 10:15:...|     9|      3|  2019|Artist_25| Classical|     Canada|
|       26|      5| Song_5|2019-09-07 10:15:...|     7|      9|  2019|Artist_26|      Jazz|     France|
|       27|      6| Song_6|2023-03-25 10:15:...|    25|      3|  2023|Artist_27|   Hip-Hop|     Canada|
|       34|      7| Song_7|2023-01-07 10:15:...|     7|      1| 

In [41]:
from pyspark.sql.functions import broadcast

songs_df.join(broadcast(artists_df), on="artist_id", how="inner").show()

+---------+-------+-------+--------------------+------+-------+------+---------+----------+-----------+
|artist_id|song_id|  title|        release_date|r_date|r_month|r_year|     name|     genre|    country|
+---------+-------+-------+--------------------+------+-------+------+---------+----------+-----------+
|        2|      1| Song_1|2021-10-15 10:15:...|    15|     10|  2021| Artist_2|Electronic|  Australia|
|       45|      2| Song_2|2020-12-07 10:15:...|     7|     12|  2020|Artist_45|Electronic|         UK|
|       25|      3| Song_3|2022-07-11 10:15:...|    11|      7|  2022|Artist_25| Classical|     Canada|
|       25|      4| Song_4|2019-03-09 10:15:...|     9|      3|  2019|Artist_25| Classical|     Canada|
|       26|      5| Song_5|2019-09-07 10:15:...|     7|      9|  2019|Artist_26|      Jazz|     France|
|       27|      6| Song_6|2023-03-25 10:15:...|    25|      3|  2023|Artist_27|   Hip-Hop|     Canada|
|       34|      7| Song_7|2023-01-07 10:15:...|     7|      1| 

### **Interview Gold**

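> Broadcast only when one side is small, roughly **< 10–50 MB** as a rule of thumb. Spark broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); `broadcast()` requests it explicitly as a hint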
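Why broadcasting avoids a shuffle can be illustrated with a rough plain-Python model of a broadcast hash join. The data and the two-partition split are hypothetical, and `join_partition` is a name invented for this sketch.

```python
# Plain-Python model of a broadcast hash join (hypothetical miniature data)
small_artists = [(2, "Artist_2"), (25, "Artist_25"), (45, "Artist_45")]
large_songs = [(1, "Song_1", 2), (2, "Song_2", 45), (3, "Song_3", 25), (4, "Song_4", 25)]

# "Broadcast": every worker receives a full copy of the small side as a hash map
lookup = dict(small_artists)

def join_partition(partition):
    """Join one partition of the large side locally, with no data exchange."""
    return [(artist_id, song_id, title, lookup[artist_id])
            for song_id, title, artist_id in partition
            if artist_id in lookup]

# The large side stays where it is; each partition is joined independently
partitions = [large_songs[:2], large_songs[2:]]
joined = [row for part in partitions for row in join_partition(part)]
```

In a shuffle-based join both sides would first be redistributed by `artist_id` so that matching keys meet on the same worker. Here only the small side moves, which is the whole point of the hint.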





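## 3️⃣8️⃣ `crossJoin()`

### **What**

Cartesian product: every row of the left DataFrame paired with every row of the right.

### **Danger**

❌ Extremely expensive: the output has rows(left) × rows(right) rows.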



In [42]:
artists_df.crossJoin(songs_df).count()

5000
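The count is simply the product of the two row counts reported by `artists_df.count()` and `songs_df.count()`. A plain-Python sketch, in which the id ranges are hypothetical and only the sizes matter:

```python
from itertools import product

artist_ids = range(1, 51)     # 50 artists
song_ids = range(1, 101)      # 100 songs

# Every artist paired with every song, like artists_df.crossJoin(songs_df)
pairs = list(product(artist_ids, song_ids))
```

By the same arithmetic, cross joining `artists_df` with the 11,779-row `listening_df` would already produce 588,950 rows, which shows how quickly the result grows.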



## 3️⃣9️⃣ `union()`

### **What**

Row-wise union.

### **How**

In [47]:
songs_sample_df1 = songs_df.sample(fraction=0.7, seed=43).limit(5)
songs_sample_df2 = songs_df.sample(fraction=0.3, seed=40).limit(3)

In [48]:
songs_sample_df1.union(songs_sample_df2).show()  # both DataFrames need the same number of columns; they are matched by position

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4|Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
|      9|Song_9|       14|2020-05-21 10:15:...|    21|      5|  2020|
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
+-------+------+---------+--------------------+------+-------+------+



### **Requirement**

> Same number of columns, in the same order, with compatible types: `union()` matches columns by position, not by name

---

## 4️⃣0️⃣ `unionByName()`

### **What**

Union by column name.

### **Why**

Robust to differences in column order between the two DataFrames.

### **How**

In [50]:
songs_sample_df1.unionByName(songs_sample_df2, allowMissingColumns=True).sort("song_id").show()

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4|Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
|      9|Song_9|       14|2020-05-21 10:15:...|    21|      5|  2020|
+-------+------+---------+--------------------+------+-------+------+



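* `unionByName()` matches columns by name, so column order does not matter; with `allowMissingColumns=True`, columns missing on one side are filled with nulls.
* `union()` matches columns by position, so column order matters; like SQL `UNION ALL`, it keeps duplicates.
* `unionAll()` is an alias for `union()`. It was deprecated in Spark 2.x but is no longer deprecated as of Spark 3.0.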
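The practical risk of position-based matching is silent column mix-ups. A small plain-Python sketch with two hypothetical frames whose columns are in a different order:

```python
# Two hypothetical frames with the same columns in a different order
cols_a, rows_a = ["song_id", "artist_id"], [(1, 2), (2, 45)]
cols_b, rows_b = ["artist_id", "song_id"], [(25, 3)]

# union(): rows are appended by position and the left frame's names are kept,
# so artist_id 25 silently lands in the song_id column
union_by_position = rows_a + rows_b

# unionByName(): the right frame's values are reordered to match the left names
reorder = [cols_b.index(name) for name in cols_a]
union_by_name = rows_a + [tuple(row[i] for i in reorder) for row in rows_b]
```

Because both columns are integers, the positional version raises no error at all, which is exactly why `unionByName()` is the safer default when schemas come from different sources.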

---

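## 4️⃣1️⃣ `intersect()`

### **What**

Rows present in both DataFrames. `intersect()` removes duplicates; `intersectAll()` keeps them.

### **Cost**

⚠️ Requires a shuffle, because matching rows can live in different partitions.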



In [43]:
sample_songs_df = songs_df.sample(fraction=0.5, seed=1).limit(7)
sample_songs_df.show()

+-------+-------+---------+--------------------+------+-------+------+
|song_id|  title|artist_id|        release_date|r_date|r_month|r_year|
+-------+-------+---------+--------------------+------+-------+------+
|      3| Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4| Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      7| Song_7|       34|2023-01-07 10:15:...|     7|      1|  2023|
|      8| Song_8|       18|2023-01-30 10:15:...|    30|      1|  2023|
|      9| Song_9|       14|2020-05-21 10:15:...|    21|      5|  2020|
|     12|Song_12|       50|2020-12-02 10:15:...|     2|     12|  2020|
|     13|Song_13|       46|2019-07-01 10:15:...|     1|      7|  2019|
+-------+-------+---------+--------------------+------+-------+------+



In [51]:
songs_sample_df1.intersect(songs_sample_df2).show() # no duplicates

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
+-------+------+---------+--------------------+------+-------+------+



In [53]:
songs_sample_df1.intersectAll(sample_songs_df).show() # preserves duplicates

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4|Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      9|Song_9|       14|2020-05-21 10:15:...|    21|      5|  2020|
+-------+------+---------+--------------------+------+-------+------+





## 4️⃣2️⃣ `exceptAll()`

### **What**

Rows in DF1 that are not in DF2, keeping duplicates (SQL `EXCEPT ALL`). `subtract()` is the deduplicating counterpart.



In [54]:
songs_sample_df1.exceptAll(sample_songs_df).show()

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
+-------+------+---------+--------------------+------+-------+------+



In [55]:
sample_songs_df.exceptAll(songs_df).show()

+-------+-----+---------+------------+------+-------+------+
|song_id|title|artist_id|release_date|r_date|r_month|r_year|
+-------+-----+---------+------------+------+-------+------+
+-------+-----+---------+------------+------+-------+------+
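The "All" variants follow multiset (bag) semantics, while `intersect()` and `subtract()` follow set semantics. Python's `Counter` models this well. In the sketch below the first two lists are the `song_id` values read off the outputs above; the lists with duplicates are hypothetical.

```python
from collections import Counter

# song_id values of songs_sample_df1 and sample_songs_df, read off the outputs above
df1_ids = [2, 3, 4, 5, 9]
df2_ids = [3, 4, 7, 8, 9, 12, 13]

intersect_all = sorted((Counter(df1_ids) & Counter(df2_ids)).elements())
except_all = sorted((Counter(df1_ids) - Counter(df2_ids)).elements())

# With duplicates (hypothetical values) the multiset and set versions diverge
left, right = [1, 1, 1, 2], [1, 1, 2, 2]
multiset_intersect = sorted((Counter(left) & Counter(right)).elements())  # intersectAll
set_intersect = sorted(set(left) & set(right))                            # intersect
multiset_except = sorted((Counter(left) - Counter(right)).elements())     # exceptAll
set_except = sorted(set(left) - set(right))                               # subtract
```

For the notebook's samples, which contain no duplicate rows, both flavours give the same answer. With duplicates, `exceptAll` removes one left occurrence per matching right occurrence, whereas `subtract` drops the value entirely.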





## 4️⃣3️⃣ `having` (via filter)

### **What**

Filter after aggregation.

### **How**

In [56]:
songs_df.groupBy("r_year").count().filter(col("count") > 20).show()

+------+-----+
|r_year|count|
+------+-----+
|  2019|   21|
|  2020|   25|
+------+-----+
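In SQL terms this is `SELECT r_year, COUNT(*) FROM songs GROUP BY r_year HAVING COUNT(*) > 20` (the table name `songs` is assumed here). The key point is that the predicate runs on the aggregated rows, not on the input rows. A plain-Python sketch using the per-year counts printed earlier in this notebook:

```python
# Per-year song counts, copied from the groupBy("year").count() output earlier
counts = {2018: 1, 2019: 21, 2020: 25, 2021: 19, 2022: 15, 2023: 19}

# HAVING count > 20: the filter is applied after aggregation
having = {year: n for year, n in counts.items() if n > 20}
```

A `filter()` placed before `groupBy()` plays the role of `WHERE` instead: it removes input rows and therefore changes the counts themselves.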





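## 4️⃣4️⃣ `groupingSets()`

### **What**

Aggregates over several explicitly listed groupings in one query.

### **Why**

Produces subtotals and totals for reports without running and unioning separate `groupBy()` queries.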



In [None]:
# Whether the DataFrame API offers groupingSets() depends on the PySpark version; older releases do not have it.
# Spark SQL supports it directly: GROUP BY ... GROUPING SETS ((a), (b), ())
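Independent of which API exposes it, the semantics of grouping sets can be sketched in plain Python: the data is aggregated once per listed grouping, and columns that are not part of the current grouping show up as `None` (NULL in Spark). The rows and the helper name `count_by_grouping_sets` are hypothetical.

```python
from collections import Counter

# Hypothetical (r_year, r_month) rows standing in for songs_df
rows = [(2019, 3), (2019, 9), (2020, 12), (2020, 12)]
columns = ("r_year", "r_month")

def count_by_grouping_sets(rows, columns, grouping_sets):
    """Count rows once per grouping set; columns outside the set become None."""
    result = Counter()
    for keep in grouping_sets:
        for row in rows:
            key = tuple(v if c in keep else None for c, v in zip(columns, row))
            result[key] += 1
    return dict(result)

# GROUPING SETS ((r_year), (r_month), ()): per year, per month, and grand total
totals = count_by_grouping_sets(rows, columns, [("r_year",), ("r_month",), ()])
```

`rollup()` and `cube()` are shorthand for particular lists of grouping sets, so anything they produce could also be written this way.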



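## 4️⃣5️⃣ `rollup()`

### **What**

Hierarchical aggregation: `rollup(a, b)` aggregates by `(a, b)`, by `(a)`, and over all rows. A NULL in the output marks a column that was rolled up into a subtotal.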

### **How**

In [57]:
songs_df.show(5)

+-------+------+---------+--------------------+------+-------+------+
|song_id| title|artist_id|        release_date|r_date|r_month|r_year|
+-------+------+---------+--------------------+------+-------+------+
|      1|Song_1|        2|2021-10-15 10:15:...|    15|     10|  2021|
|      2|Song_2|       45|2020-12-07 10:15:...|     7|     12|  2020|
|      3|Song_3|       25|2022-07-11 10:15:...|    11|      7|  2022|
|      4|Song_4|       25|2019-03-09 10:15:...|     9|      3|  2019|
|      5|Song_5|       26|2019-09-07 10:15:...|     7|      9|  2019|
+-------+------+---------+--------------------+------+-------+------+
only showing top 5 rows


In [64]:
listening_df.rollup("l_month", "l_year").count().show()

+-------+------+-----+
|l_month|l_year|count|
+-------+------+-----+
|      6|  2023| 1485|
|     10|  2023| 1813|
|      5|  NULL| 1950|
|      5|  2023| 1950|
|     10|  NULL| 1813|
|      7|  NULL| 2506|
|      4|  2023|  412|
|     11|  2023|  686|
|      8|  2023| 1213|
|      7|  2023| 2506|
|      9|  NULL| 1714|
|   NULL|  NULL|11779|
|      4|  NULL|  412|
|      8|  NULL| 1213|
|      9|  2023| 1714|
|      6|  NULL| 1485|
|     11|  NULL|  686|
+-------+------+-----+



In [62]:
songs_df.rollup("r_year", "r_month").count().show()

+------+-------+-----+
|r_year|r_month|count|
+------+-------+-----+
|  2020|      9|    3|
|  2019|   NULL|   21|
|  2023|   NULL|   19|
|  2019|      5|    2|
|  2022|      4|    2|
|  2021|      5|    1|
|  2023|      9|    1|
|  2022|      8|    2|
|  2021|      1|    2|
|  2023|      4|    3|
|  2023|      7|    1|
|  2020|     10|    1|
|  2020|      2|    5|
|  2020|   NULL|   25|
|  2021|      7|    5|
|  2022|      6|    1|
|  2021|      9|    4|
|  2020|     12|    5|
|  2020|     11|    3|
|  2019|      9|    2|
+------+-------+-----+
only showing top 20 rows




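## 4️⃣6️⃣ `cube()`

### **What**

Aggregates over every combination of the given columns: `cube(a, b)` produces `(a, b)`, `(a)`, `(b)` and the grand total.

### **Interview**

> More expensive than `rollup`: n columns yield 2^n grouping sets instead of n + 1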



In [69]:
sample_songs_df.cube("r_year").sum("r_date").show()

+------+-----------+
|r_year|sum(r_date)|
+------+-----------+
|  2022|         11|
|  NULL|         81|
|  2023|         37|
|  2020|         23|
|  2019|         10|
+------+-----------+
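The difference between `rollup` and `cube` comes down to which grouping sets each one generates. A small plain-Python sketch, with helper names `rollup_sets` and `cube_sets` invented for illustration:

```python
from itertools import combinations

def rollup_sets(cols):
    """rollup(a, b, c) -> (a, b, c), (a, b), (a), (): one set per prefix."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """cube(a, b, c) -> every subset of the columns, 2**n sets in total."""
    return [subset
            for size in range(len(cols), -1, -1)
            for subset in combinations(cols, size)]

rollup_two = rollup_sets(["r_year", "r_month"])
cube_two = cube_sets(["r_year", "r_month"])
```

With a single column both functions return the same two sets, `(r_year)` and `()`, which is why the one-column `cube("r_year")` call above looks exactly like a rollup. The extra `(r_month)` grouping only appears once a second column is added.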





## 4️⃣7️⃣ `repartition()`

### **What**

Changes the partition count with a full shuffle; it can increase or decrease the number of partitions.

### **How**

In [79]:
songs_df.rdd.getNumPartitions()

1

In [80]:
songs_df = songs_df.repartition(10)
songs_df.show(10)

+-------+-------+---------+--------------------+
|song_id|  title|artist_id|        release_date|
+-------+-------+---------+--------------------+
|     21|Song_21|       21|2022-07-17 10:15:...|
|      2| Song_2|       45|2020-12-07 10:15:...|
|     34|Song_34|       43|2023-05-11 10:15:...|
|     14|Song_14|       36|2019-06-11 10:15:...|
|     67|Song_67|       30|2021-07-08 10:15:...|
|     63|Song_63|       17|2019-10-19 10:15:...|
|     85|Song_85|       39|2020-09-04 10:15:...|
|     18|Song_18|        6|2022-04-29 10:15:...|
|     55|Song_55|       13|2020-08-25 10:15:...|
|     79|Song_79|       36|2023-05-13 10:15:...|
+-------+-------+---------+--------------------+
only showing top 10 rows


In [81]:
songs_df.rdd.getNumPartitions()

10

---

## 4️⃣8️⃣ `coalesce()`

### **What**

Reduces the partition count by merging existing partitions, without a full shuffle. It cannot increase the count.

### **How**

In [83]:
songs_df = songs_df.coalesce(5)
songs_df.show(5)

+-------+-------+---------+--------------------+
|song_id|  title|artist_id|        release_date|
+-------+-------+---------+--------------------+
|     21|Song_21|       21|2022-07-17 10:15:...|
|      2| Song_2|       45|2020-12-07 10:15:...|
|     34|Song_34|       43|2023-05-11 10:15:...|
|     14|Song_14|       36|2019-06-11 10:15:...|
|     67|Song_67|       30|2021-07-08 10:15:...|
+-------+-------+---------+--------------------+
only showing top 5 rows


In [84]:
songs_df.rdd.getNumPartitions()

5
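The contrast between the two methods can be modelled with lists of lists. This is a sketch under simplifying assumptions: the round-robin redistribution mirrors the `RoundRobinPartitioning` that `repartition(n)` uses, but the rule for which partitions `coalesce` merges (`i % n` here) is invented for the example, since Spark picks them based on locality.

```python
def repartition(partitions, n):
    """Model of repartition(n): every row may move (round-robin shuffle)."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]

def coalesce(partitions, n):
    """Model of coalesce(n): whole partitions are merged, rows are never split up."""
    n = min(n, len(partitions))       # coalesce never increases the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

one = [list(range(10))]               # a single partition holding 10 rows
four = repartition(one, 4)            # balanced sizes, but every row was redistributed
two = coalesce(four, 2)               # existing partitions are glued together
```

The trade-off follows directly: `repartition` gives evenly sized partitions at the price of moving all the data, while `coalesce` is cheap but inherits whatever skew the existing partitions have.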

---

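## 4️⃣9️⃣ `explain()`

### **What**

Prints the query's execution plan. `explain(True)` shows the parsed, analyzed and optimized logical plans followed by the physical plan.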

### **How**

In [85]:
songs_df.explain(True)

== Parsed Logical Plan ==
Repartition 5, false
+- Repartition 10, true
   +- Relation [song_id#3906,title#3907,artist_id#3908,release_date#3909] csv

== Analyzed Logical Plan ==
song_id: int, title: string, artist_id: int, release_date: timestamp
Repartition 5, false
+- Repartition 10, true
   +- Relation [song_id#3906,title#3907,artist_id#3908,release_date#3909] csv

== Optimized Logical Plan ==
Repartition 5, false
+- Repartition 10, true
   +- Relation [song_id#3906,title#3907,artist_id#3908,release_date#3909] csv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   ResultQueryStage 1
   +- Coalesce 5
      +- ShuffleQueryStage 0
         +- Exchange RoundRobinPartitioning(10), REPARTITION_BY_NUM, [plan_id=4084]
            +- FileScan csv [song_id#3906,title#3907,artist_id#3908,release_date#3909] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/d:/shra1/github/pyspark-practice/data/Spotify_Songs.csv], PartitionFilter

### **Interview**

> Know **Logical vs Physical Plan**

---

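## 5️⃣0️⃣ `cache()` / `persist()`

### **What**

Stores the DataFrame's computed partitions for reuse. `cache()` is `persist()` with the default storage level, which for DataFrames uses memory and spills to disk when needed.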

### **How**

In [86]:
songs_df.cache()

DataFrame[song_id: int, title: string, artist_id: int, release_date: timestamp]

### **Interview**

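> Use when the same DataFrame feeds **multiple actions**. Caching is lazy: nothing is stored until the first action runs, and `unpersist()` frees the storage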
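Why caching only pays off with more than one action can be shown with a toy model of lazy evaluation. `LazyFrame` is a hypothetical stand-in written for this sketch, not a Spark class; it simply counts how often the underlying computation runs.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: work happens only when an action runs."""

    def __init__(self, compute):
        self._compute = compute      # the "lineage": how to rebuild the data
        self._cached = None
        self._use_cache = False
        self.computations = 0        # how many times the lineage was executed

    def cache(self):
        self._use_cache = True       # lazy: only marks the frame, computes nothing
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):                 # action
        return len(self._rows())

    def collect(self):               # action
        return list(self._rows())

uncached = LazyFrame(lambda: [1, 2, 3])
uncached.count()
uncached.collect()                   # the lineage runs a second time

cached = LazyFrame(lambda: [1, 2, 3]).cache()
cached.count()                       # first action computes and stores the rows
cached.collect()                     # second action reuses them
```

After these calls `uncached.computations` is 2 while `cached.computations` is 1. With a single action the cached version would do the same work plus the cost of storing the result, so caching a DataFrame that is used once gains nothing.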

---

# 🔥 Interview Coverage (Part 2)

* ✔ Aggregations
* ✔ Joins (broadcast vs shuffle)
* ✔ Set operations
* ✔ Performance tuning

---

## 👉 Next:

**PART 3 (51–75):**

* Window functions
* UDFs
* JSON / Date / Array / Map functions
* explode, collect_list
* SQL functions vs DataFrame API

