**Salting means adding randomness to a key column to distribute data more evenly across partitions.**

**It’s like splitting a big group into smaller subgroups by giving them suffixes.**

In [0]:
from pyspark.sql.functions import *

In [0]:
data = [("A", 100), ("A", 200), ("A", 300), ("B", 400), ("C", 500)]
df = spark.createDataFrame(data, ["user_id", "purchase"])


In [0]:
df.display()

user_id,purchase
A,100
A,200
A,300
B,400
C,500


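Let’s say user A has 1.5 GB of data, but an executor has only 1 GB of memory. Because every row with the same key goes to the same partition during a shuffle, this can cause OOM (out of memory) errors and data skew during operations like join or groupBy.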
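To make the skew concrete, here is a small pure-Python sketch (no Spark involved) of how a hash partitioner treats a hot key with and without a salt suffix. The partition count, the salt count, and the `partition_for` helper are all assumptions made for illustration; Spark's real partitioner uses a different hash function.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration
NUM_SALTS = 3       # hypothetical number of salt buckets


def partition_for(key: str) -> int:
    """Stand-in for a hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# 9 rows for the hot key "A", 1 row each for "B" and "C"
keys = ["A"] * 9 + ["B", "C"]

# Without salting: every "A" row is routed to one partition.
unsalted = Counter(partition_for(k) for k in keys)

# With salting: "A" becomes "A-0", "A-1", "A-2", and each salted key is hashed on its own.
salted_keys = [f"{k}-{i % NUM_SALTS}" for i, k in enumerate(keys)]
salted = Counter(partition_for(k) for k in salted_keys)

print("rows per partition without salt:", dict(unsalted))
print("rows per partition with salt:   ", dict(salted))
```

Salting does not guarantee a perfectly even spread, since two salted keys can still hash to the same partition, but it removes the constraint that all rows of the hot key must sit together.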

**Adding Salt Column**

In [0]:
from pyspark.sql.functions import monotonically_increasing_id

# Salt in {0, 1, 2}: row ID modulo 3, so rows sharing a user_id are spread over up to 3 buckets
df = df.withColumn("salt_column", monotonically_increasing_id() % 3)


In [0]:
df.display()

user_id,purchase,salt_column
A,100,0
A,200,2
A,300,1
B,400,0
C,500,1


**Concatenating the original groupBy column with salt_column to create a new groupBy column**

In [0]:
df = df.withColumn("user_id_salt", concat(col("user_id"), lit("-"), col("salt_column")))

df.display()


user_id,purchase,salt_column,user_id_salt
A,100,0,A-0
A,200,2,A-2
A,300,1,A-1
B,400,0,B-0
C,500,1,C-1


**Applying groupBy on user_id_salt**

In [0]:
# sum here is pyspark.sql.functions.sum (from the wildcard import), not Python's built-in sum
df = df.groupBy("user_id_salt").agg(sum("purchase"))
df.display()

user_id_salt,sum(purchase)
A-0,100
A-2,200
A-1,300
C-1,500
B-0,400
