# **Duplicate Emails**

## **Problem Statement**
You are given a table **Person** with the following structure:

### **Table: Person**
| Column Name | Type    |
|------------|--------|
| `Id`       | int    |
| `Email`    | varchar |

### **Objective**
Write a query that returns every email address appearing more than once in the **Person** table. Each duplicate email should be listed once in the result.
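
To make the expected result concrete before bringing in Spark, here is a minimal plain-Python sketch of the same count-and-filter idea. The helper name `find_duplicate_emails` is hypothetical (it does not come from the notebook), and the sample rows mirror the three rows used in the approaches below.

```python
from collections import Counter


def find_duplicate_emails(rows):
    """Return each email that appears in more than one (id, email) row, sorted."""
    # Count how many rows carry each email address.
    counts = Counter(email for _id, email in rows)
    # Keep only the addresses seen more than once.
    return sorted(email for email, n in counts.items() if n > 1)


# Same sample data as the Person table built later in this document.
sample_rows = [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")]
print(find_duplicate_emails(sample_rows))  # ['a@b.com']
```

This is only a reference for the logic; it is not meant to replace the distributed Spark versions.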

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for `Person` Table**
3. **Group by `Email` and count occurrences**
4. **Filter emails that appear more than once**
5. **Select and display the output**

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("DuplicateEmails").getOrCreate()

# Step 2: Create a DataFrame for the Person table
person_data = [
    (1, "a@b.com"),
    (2, "c@d.com"),
    (3, "a@b.com"),
]
person_columns = ["Id", "Email"]

person_df = spark.createDataFrame(person_data, person_columns)

# Step 3: Group by Email and count occurrences
duplicate_emails_df = person_df.groupBy("Email").count()

# Step 4: Keep only emails that appear more than once
duplicate_emails_df = duplicate_emails_df.filter(col("count") > 1)

# Step 5: Select only the Email column and display the output
result_df = duplicate_emails_df.select("Email")
result_df.show()
```

```text
+-------+
|  Email|
+-------+
|a@b.com|
+-------+
```



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Reuse the Spark Session from Approach 1**
2. **Reuse the `person_df` DataFrame for the `Person` Table**
3. **Register it as a SQL View**
4. **Write and Execute SQL Query**
5. **Display the Output**

### **Code**

```python
# Steps 1-2: Reuse `spark` and `person_df` created in Approach 1

# Step 3: Register the DataFrame as a SQL view
person_df.createOrReplaceTempView("Person")

# Step 4: Write and execute the SQL query
sql_query = """
SELECT Email
FROM Person
GROUP BY Email
HAVING COUNT(Email) > 1
"""

result_sql = spark.sql(sql_query)

# Step 5: Display the output
result_sql.show()
```

```text
+-------+
|  Email|
+-------+
|a@b.com|
+-------+
```
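
The `GROUP BY` / `HAVING` query is standard SQL rather than anything Spark-specific, so its logic can be checked without a Spark session. The sketch below is an illustration only: it assumes an in-memory SQLite database (via Python's built-in `sqlite3` module) as a stand-in for the Spark temp view, with the same three sample rows.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view
# (an assumption for illustration; the notebook itself uses Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Id INTEGER, Email TEXT)")
conn.executemany(
    "INSERT INTO Person (Id, Email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")],
)

# Same GROUP BY / HAVING logic as the Spark SQL query.
rows = conn.execute(
    "SELECT Email FROM Person GROUP BY Email HAVING COUNT(Email) > 1"
).fetchall()

# Each row is a 1-tuple; unpack it into a flat list of emails.
duplicates = [email for (email,) in rows]
print(duplicates)  # ['a@b.com']

conn.close()
```

Running the query against a small local table like this can be a quick way to confirm the expected result before executing it on a cluster.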



---


## **Summary**
| Approach | Method | Key Operations |
|----------|--------|----------------|
| **Approach 1** | PySpark DataFrame API | Uses `groupBy().count()` and `filter()` |
| **Approach 2** | SQL Query in PySpark | Uses SQL `GROUP BY` and `HAVING COUNT(Email) > 1` |
