### Different ways to remove duplicates

#### Write PySpark code to remove duplicate rows based on specific columns

In [0]:
# sample data
data = [ (1, "Rohish", 5000), (2, "Melody", 3000), 
         (3, "Rohish", 2000), (4, "Rajesh", 4000), 
         (1, "Rohish", 5000), (6, "Melody", 3000) 
       ]

columns = ["ID", "Name", "Salary"]

df = spark.createDataFrame(data, columns)
df.show()

+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  1|Rohish|  5000|
|  2|Melody|  3000|
|  3|Rohish|  2000|
|  4|Rajesh|  4000|
|  1|Rohish|  5000|
|  6|Melody|  3000|
+---+------+------+



#### In PySpark, you can remove duplicates using multiple methods depending on your requirements.

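**Using dropDuplicates():** This is the simplest way to remove duplicates based on specific columns. One row is kept for each distinct combination of values in the listed columns; which of the duplicate rows survives is not guaranteed.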

In [0]:
print("before dropping duplicates:")
df.show()

# column names should be a list or tuple
df_unique = df.dropDuplicates(["Name"])

print("after dropping duplicates:")
df_unique.show()

before dropping duplicates:
+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  1|Rohish|  5000|
|  2|Melody|  3000|
|  3|Rohish|  2000|
|  4|Rajesh|  4000|
|  1|Rohish|  5000|
|  6|Melody|  3000|
+---+------+------+

after dropping duplicates:
+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  1|Rohish|  5000|
|  2|Melody|  3000|
|  4|Rajesh|  4000|
+---+------+------+



In [0]:
# If you want to drop duplicates based on all columns:
df_unique = df.dropDuplicates()
df_unique.show()

+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  1|Rohish|  5000|
|  2|Melody|  3000|
|  3|Rohish|  2000|
|  4|Rajesh|  4000|
|  6|Melody|  3000|
+---+------+------+



**Using distinct():** This removes rows that are identical in every column, which is equivalent to calling `dropDuplicates()` with no arguments.

In [0]:
df_distinct = df.distinct()
df_distinct.show()

+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  1|Rohish|  5000|
|  2|Melody|  3000|
|  3|Rohish|  2000|
|  4|Rajesh|  4000|
|  6|Melody|  3000|
+---+------+------+



**`distinct()` considers all columns while `dropDuplicates(["col1", "col2"])` removes duplicates based on selected columns.**

**Using Window Functions (row_number()):** Number the rows within each group, then keep only the first row of every group.

In [0]:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

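# Number the rows within each Name group. Ordering by the partition column
# itself leaves ties unresolved, so which row gets rank 1 is arbitrary;
# order by another column (e.g. Salary) to control which record is kept.
window = Window.partitionBy("Name").orderBy("Name")
rank_df = df.withColumn("rank", row_number().over(window))
rank_df.show()

# keep only the first row of each group
rank_df = rank_df.filter(col("rank") == 1).drop("rank")
rank_df.show()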

+---+------+------+----+
| ID|  Name|Salary|rank|
+---+------+------+----+
|  2|Melody|  3000|   1|
|  6|Melody|  3000|   2|
|  4|Rajesh|  4000|   1|
|  1|Rohish|  5000|   1|
|  3|Rohish|  2000|   2|
|  1|Rohish|  5000|   3|
+---+------+------+----+

+---+------+------+
| ID|  Name|Salary|
+---+------+------+
|  2|Melody|  3000|
|  4|Rajesh|  4000|
|  1|Rohish|  5000|
+---+------+------+



**Using `groupBy()` + `agg()`:** This does not remove anything. It counts how many rows share each value, so you can check how many duplicates exist before removing them.

In [0]:
from pyspark.sql.functions import count, col

df_grouped = df.groupBy("Name").agg(count("*").alias("count"))
df_grouped.filter(col("count") > 1).show()

+------+-----+
|  Name|count|
+------+-----+
|Rohish|    3|
|Melody|    2|
+------+-----+



### Which Method to Use?

| **Method**        | **Use Case** |
|------------------|-----------------------------------------------|
| `dropDuplicates()` | Best for quick duplicate removal on specific columns |
| `distinct()`      | Best when you want unique rows based on all columns |
| `row_number()`    | Best when you need to retain a specific record per group based on ordering |
| `groupBy()`       | Best when you need to count or aggregate duplicate records |
