# **Second Highest Salary**

## **Problem Statement**
You are given a table **Employee** with the following structure:

### **Table: Employee**
| Column Name | Type |
|------------|------|
| `Id`      | int  |
| `Salary`  | int  |

- The `Id` column is the **primary key**.
- The table contains information about **employees' salaries**.

### **Objective**
Write a **SQL query** to get the **second highest salary** from the Employee table.
- If there is no second highest salary, return **NULL**.
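
The objective can be pinned down with a small plain-Python reference sketch before any Spark code is written. It assumes that "second highest" means the second highest *distinct* salary, so a tie for the top salary does not count as a second value; the function name `second_highest_salary` is hypothetical and used only for illustration.

```python
from typing import Iterable, Optional


def second_highest_salary(salaries: Iterable[int]) -> Optional[int]:
    """Return the second highest distinct salary, or None if there is none."""
    # Assumption: deduplicate first, so ties for the top salary do not count twice.
    distinct = sorted(set(salaries), reverse=True)
    return distinct[1] if len(distinct) >= 2 else None


print(second_highest_salary([100, 200, 300]))  # 200
print(second_highest_salary([100]))            # None
print(second_highest_salary([300, 300]))       # None
```

Here `None` plays the role of SQL `NULL`. The approaches below should agree with this sketch on the same inputs.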

#### **Example**
##### **Input Table: Employee**
| Id | Salary |
|----|--------|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

##### **Expected Output**
| Salary |
|--------|
| 200    |

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for Employee Table**
3. **Use `dense_rank()` to Rank Salaries**
4. **Filter for the Second Highest Salary**
5. **Handle Cases Where There is No Second Highest Salary**
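
To make step 3 concrete, here is a minimal plain-Python sketch of what a descending dense rank computes. It is not how Spark implements `dense_rank()`; the helper `dense_rank_desc` is a hypothetical name introduced only to illustrate the semantics: tied salaries share a rank, and ranks stay consecutive with no gaps.

```python
from typing import Dict, List, Tuple


def dense_rank_desc(salaries: List[int]) -> List[Tuple[int, int]]:
    """Pair each salary with its dense rank in descending order."""
    # Each distinct salary gets the next consecutive rank; ties share a rank.
    rank_of: Dict[int, int] = {
        salary: rank
        for rank, salary in enumerate(sorted(set(salaries), reverse=True), start=1)
    }
    return [(salary, rank_of[salary]) for salary in salaries]


ranked = dense_rank_desc([300, 300, 200, 100])
print(ranked)  # [(300, 1), (300, 1), (200, 2), (100, 3)]

# Keep rows with rank 2; an empty list is the "no second highest salary" case.
print([salary for salary, rank in ranked if rank == 2])  # [200]
```

This is why `dense_rank()` fits the problem better than `rank()`: with two employees tied at 300, `rank()` would assign 200 the rank 3, and a filter on rank 2 would find nothing.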

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, dense_rank, max as spark_max
from pyspark.sql.window import Window

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("SecondHighestSalary").getOrCreate()

# Step 2: Create DataFrame for Employee Table
employee_data = [
    (1, 100),
    (2, 200),
    (3, 300),
]
employee_columns = ["Id", "Salary"]

employee_df = spark.createDataFrame(employee_data, employee_columns)

# Step 3: Rank salaries with dense_rank(), highest salary first.
# The window has no partitionBy, so Spark ranks the whole table in a single
# partition; that is fine for small data but does not scale to large tables.
window_spec = Window.orderBy(col("Salary").desc())
ranked_df = employee_df.withColumn("rank", dense_rank().over(window_spec))

# Step 4: Filter for the second highest salary (rank 2)
# Step 5: Aggregate with max() so that tied rank-2 rows collapse into one row,
# and an empty filter result still yields a single row containing NULL.
second_highest_salary_df = (
    ranked_df.filter(col("rank") == 2)
    .agg(spark_max("Salary").alias("Salary"))
)

second_highest_salary_df.show()
```

```text
+------+
|Salary|
+------+
|   200|
+------+
```

---

## **Approach 2: SQL Query in PySpark (Using CTE)**
### **Steps**
1. **Create a DataFrame for Employee Table**
2. **Register it as a SQL View**
3. **Use a Common Table Expression (CTE) to Rank Salaries**
4. **Filter for the Second Highest Salary**
5. **Return NULL if There is No Second Highest Salary**
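
Steps 3 to 5 can be tried on edge cases without a Spark cluster. The sketch below is an illustration under stated assumptions, not the Spark code itself: it swaps in Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer), and it wraps the final `SELECT` in `MAX()` so that the query returns one row containing NULL when no salary has rank 2. The helper name `second_highest_via_sql` is hypothetical.

```python
import sqlite3
from typing import List, Optional

QUERY = """
WITH RankedSalaries AS (
    SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) AS rnk
    FROM Employee
)
SELECT MAX(Salary) AS Salary
FROM RankedSalaries
WHERE rnk = 2
"""


def second_highest_via_sql(salaries: List[int]) -> Optional[int]:
    """Run QUERY against an in-memory SQLite table holding the given salaries."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Salary INTEGER)")
        conn.executemany(
            "INSERT INTO Employee (Id, Salary) VALUES (?, ?)",
            list(enumerate(salaries, start=1)),
        )
        # MAX over zero rows still returns one row, holding NULL (None in Python).
        return conn.execute(QUERY).fetchone()[0]
    finally:
        conn.close()


print(second_highest_via_sql([100, 200, 300]))  # 200
print(second_highest_via_sql([100]))            # None
print(second_highest_via_sql([300, 200, 200]))  # 200
```

Without the `MAX()` wrapper, the same query would return zero rows for a one-employee table and two rows when two employees tie for second place.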

### **Code**

```python
# Step 1: Reuse employee_df created in Approach 1
# Step 2: Register the DataFrame as a SQL view
employee_df.createOrReplaceTempView("Employee")

# Steps 3-5: Rank salaries in a CTE, keep rank 2, and wrap the result in MAX()
# so that tied rows collapse into one row and an empty result becomes a
# single row containing NULL.
sql_query = """
WITH RankedSalaries AS (
    SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) AS rnk
    FROM Employee
)
SELECT MAX(Salary) AS Salary
FROM RankedSalaries
WHERE rnk = 2
"""

result_sql = spark.sql(sql_query)

# Display the output
result_sql.show()
```

```text
+------+
|Salary|
+------+
|   200|
+------+
```

---

## **Summary**
| Approach       | Method                | Key Operations                                       |
|----------------|-----------------------|------------------------------------------------------|
| **Approach 1** | PySpark DataFrame API | Uses `dense_rank()` and `filter()`                   |
| **Approach 2** | SQL Query in PySpark  | Uses **CTE with DENSE_RANK()** and **WHERE rnk = 2** |