# **Delete Duplicate Emails**

## **Problem Statement**
You are given a table named `Person` that may contain **duplicate emails**. Write a **SQL query to delete duplicate email entries**, **keeping only the one with the smallest Id** for each email.

---

## **Table: Person**

| Column Name | Type    |
|-------------|---------|
| Id          | int     |
| Email       | varchar |

- `Id` is the primary key of this table.

---

## **Example**

### **Input**

| Id | Email            |
|----|------------------|
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

### **Expected Output (After Deletion)**

| Id | Email            |
|----|------------------|
| 1  | john@example.com |
| 2  | bob@example.com  |

Of the two `john@example.com` rows, only the one with the **smallest Id** (Id 1) is kept; the row with Id 3 is deleted.

---

## **Approach 1: PySpark DataFrame API**

### **Steps**
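1. Create a Spark session and a DataFrame with the sample `Person` data.
2. Define a `Window` partitioned by `Email` and ordered by `Id`, and number the rows in each partition with `row_number()`.
3. Keep only the rows numbered 1, which hold the **minimum Id** for each email, and drop the helper column.
4. Show the final DataFrame with duplicates removed.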

### **Code**


```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# Initialize Spark
spark = SparkSession.builder.appName("DeleteDuplicateEmails").getOrCreate()

# Sample data matching the Person table in the example
data = [(1, 'john@example.com'), (2, 'bob@example.com'), (3, 'john@example.com')]
columns = ["Id", "Email"]
person_df = spark.createDataFrame(data, columns)

# Number the rows within each Email partition, ordered by Id (smallest Id gets rn = 1)
window_spec = Window.partitionBy("Email").orderBy("Id")
ranked_df = person_df.withColumn("rn", row_number().over(window_spec))

# Keep only the first row per email and drop the helper column
unique_df = ranked_df.filter(col("rn") == 1).drop("rn")

# Show the result (row order in the output is not guaranteed)
unique_df.show()
```

```text
+---+----------------+
| Id|           Email|
+---+----------------+
|  2| bob@example.com|
|  1|john@example.com|
+---+----------------+
```
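
To make the window logic easier to trace, here is a minimal plain-Python sketch of the same idea that runs without Spark. It is only an illustration of what `row_number()` over a window partitioned by `Email` and ordered by `Id` computes; the helper name `keep_smallest_id_per_email` is hypothetical and not part of the original notebook.

```python
from itertools import groupby
from operator import itemgetter

# Same sample rows as the Person table in the example: (Id, Email).
rows = [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")]


def keep_smallest_id_per_email(rows):
    """Emulate row_number() partitioned by Email, ordered by Id, keeping rn == 1."""
    # Sort by Email (the partition key), then by Id (the ordering key).
    ordered = sorted(rows, key=itemgetter(1, 0))
    kept = []
    for _email, group in groupby(ordered, key=itemgetter(1)):
        # Number the rows of each partition starting at 1, like row_number().
        for rn, row in enumerate(group, start=1):
            if rn == 1:
                kept.append(row)
    # Sort by Id only so the printed result is deterministic.
    return sorted(kept)


print(keep_smallest_id_per_email(rows))
# → [(1, 'john@example.com'), (2, 'bob@example.com')]
```

Unlike the Spark output, this sketch sorts the result by `Id` before returning it, so the row order is stable.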



---

## **Approach 2: SQL Query in PySpark**

### **Steps**
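1. Register the DataFrame as a temporary view named `Person`.
2. Use a subquery to find the **minimum Id** for each email.
3. Select the rows whose Id **is** in that set. A Spark temporary view cannot be modified with `DELETE`, so the query returns the deduplicated rows instead; the rows it leaves out are exactly the ones a `DELETE ... WHERE Id NOT IN (...)` would remove.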


### **Code**

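```python
# Expose the DataFrame to Spark SQL as a temporary view
person_df.createOrReplaceTempView("Person")

# Keep only the rows whose Id is the minimum Id for their email
sql_query = """
    SELECT *
    FROM Person
    WHERE Id IN (
        SELECT MIN(Id)
        FROM Person
        GROUP BY Email
    )
"""

result = spark.sql(sql_query)
result.show()
```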



```text
+---+----------------+
| Id|           Email|
+---+----------------+
|  1|john@example.com|
|  2| bob@example.com|
+---+----------------+
```
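
The problem statement asks for an actual deletion, while the Spark query only selects the rows to keep. As a hedged sketch of the `DELETE` form, the example below uses Python's standard-library `sqlite3` module with an in-memory database. SQLite is an assumption chosen so the snippet runs without a Spark cluster; it is not the engine used in the original notebook, and some engines (MySQL, for example) reject a subquery that reads from the table being deleted from unless it is wrapped in a derived table.

```python
import sqlite3

# In-memory SQLite database standing in for the real Person table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Id INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Person (Id, Email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Delete every row whose Id is not the smallest Id for its email.
conn.execute(
    """
    DELETE FROM Person
    WHERE Id NOT IN (
        SELECT MIN(Id)
        FROM Person
        GROUP BY Email
    )
    """
)
conn.commit()

# Read back what is left, ordered by Id for a deterministic result.
remaining = conn.execute("SELECT Id, Email FROM Person ORDER BY Id").fetchall()
print(remaining)
# → [(1, 'john@example.com'), (2, 'bob@example.com')]
```

The subquery is the same `MIN(Id)` per `Email` used in the Spark query; only the outer statement changes from `SELECT ... WHERE Id IN` to `DELETE ... WHERE Id NOT IN`.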



---

## **Summary**

| Approach       | Method                | Key Technique                    |
|----------------|-----------------------|----------------------------------|
| **Approach 1** | PySpark DataFrame API | `Window` + `row_number()`        |
| **Approach 2** | SQL in PySpark        | `MIN(Id)` + `GROUP BY` subquery  |
