# **Geography Report**

## **Problem Statement**
A U.S. graduate school has students from **Asia, Europe, and America**. The students' **location information** is stored in the table **Student**.

### **Table: Student**
| Column Name | Type   |
|------------|--------|
| `name`     | String |
| `continent`| String |

- Each student belongs to **one continent**.
- The number of students in **America** is **no less than** the number of students in **Asia** or in **Europe**.
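One consequence of the guarantee above is that the pivoted output has exactly as many rows as there are American students, so the `America` column never contains a gap. A minimal sketch of that check in plain Python, using the sample rows from the example (the variable names are illustrative only):

```python
from collections import Counter

# Sample rows from the example input: (name, continent)
students = [
    ("Jack", "America"),
    ("Pascal", "Europe"),
    ("Xi", "Asia"),
    ("Jane", "America"),
]

# Count students per continent; Counter returns 0 for a missing continent.
counts = Counter(continent for _, continent in students)

# The pivoted table needs one row per entry of the largest group.
expected_rows = max(counts.values())

# Under the stated guarantee, the largest group is always America.
america_is_largest = counts["America"] == expected_rows
print(counts["America"], expected_rows, america_is_largest)
```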

### **Objective**
Pivot the `continent` column so that:
- The names under each **continent** are sorted **alphabetically**.
- The headers in the output should be **America, Asia, and Europe**.

#### **Example**
##### **Input Table: Student**
| name   | continent |
|--------|-----------|
| Jack   | America   |
| Pascal | Europe    |
| Xi     | Asia      |
| Jane   | America   |

##### **Expected Output**
| America | Asia | Europe |
|---------|------|--------|
| Jack    | Xi   | Pascal |
| Jane    |      |        |
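Before turning to Spark, the target transformation can be sketched in plain Python. This is only an illustration of the expected result, not one of the approaches below: `pivot_by_continent` is a hypothetical helper, and it uses `None` rather than an empty cell to mark a column that has run out of names.

```python
from collections import defaultdict
from itertools import zip_longest


def pivot_by_continent(rows, headers=("America", "Asia", "Europe")):
    """Return one tuple per output row, with None where a column has run out."""
    names = defaultdict(list)
    for name, continent in rows:
        names[continent].append(name)
    # Sort each column, then let zip_longest pad the shorter columns with None.
    columns = [sorted(names[header]) for header in headers]
    return list(zip_longest(*columns))


students = [
    ("Jack", "America"),
    ("Pascal", "Europe"),
    ("Xi", "Asia"),
    ("Jane", "America"),
]

pivoted = pivot_by_continent(students)
for row in pivoted:
    print(row)
```

Each tuple corresponds to one row of the expected output table, in header order.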

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for Student Table**
3. **Group by Continent and Collect Names**
4. **Collect to the Driver and Sort Names Alphabetically**
5. **Pad the Shorter Lists So All Columns Have Equal Length**
6. **Use `zip` to Build Rows and Convert to DataFrame**
7. **Display Output**

### **Code**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("GeographyReport").getOrCreate()

# Step 2: Create DataFrame for Student Table
student_data = [
    ("Jack", "America"),
    ("Pascal", "Europe"),
    ("Xi", "Asia"),
    ("Jane", "America")
]
student_columns = ["name", "continent"]

student_df = spark.createDataFrame(student_data, student_columns)

# Step 3: Group by Continent and Collect Names
grouped_df = student_df.groupBy("continent").agg(collect_list("name").alias("names"))

# Step 4: Convert to Dictionary and Sort Names
grouped_dict = {row["continent"]: sorted(row["names"]) for row in grouped_df.collect()}

# Step 5: Pad the Shorter Lists with Empty Strings So All Columns Align
continents = ["America", "Asia", "Europe"]
max_len = max(len(grouped_dict.get(c, [])) for c in continents)

def padded(continent):
    # Return the sorted names for one continent, padded to max_len with "".
    names = grouped_dict.get(continent, [])
    return names + [""] * (max_len - len(names))

america_list = padded("America")
asia_list = padded("Asia")
europe_list = padded("Europe")

# Step 6: Zip the Columns into Rows and Convert to DataFrame
result_df = spark.createDataFrame(list(zip(america_list, asia_list, europe_list)), continents)

# Step 7: Display Output
result_df.show()


+-------+----+------+
|America|Asia|Europe|
+-------+----+------+
|   Jack|  Xi|Pascal|
|   Jane|    |      |
+-------+----+------+



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Reuse the `student_df` DataFrame from Approach 1**
2. **Register it as a SQL View**
3. **Use ROW_NUMBER() to Assign a Unique Index per Continent**
4. **Use Conditional Aggregation (`MAX(CASE WHEN ...)`) Grouped by That Index to Pivot the Data**
5. **Return the Final Table**

### **Code**

In [2]:
# Step 1: Register DataFrame as a SQL View
student_df.createOrReplaceTempView("Student")

# Step 2: Run SQL Query
sql_query = """
WITH RankedStudents AS (
    SELECT 
        name, 
        continent,
        ROW_NUMBER() OVER (PARTITION BY continent ORDER BY name) AS rn
    FROM Student
)
SELECT 
    MAX(CASE WHEN continent = 'America' THEN name END) AS America,
    MAX(CASE WHEN continent = 'Asia' THEN name END) AS Asia,
    MAX(CASE WHEN continent = 'Europe' THEN name END) AS Europe
FROM RankedStudents
GROUP BY rn
ORDER BY rn
"""

result_sql = spark.sql(sql_query)

# Step 3: Display Output
result_sql.show()


+-------+----+------+
|America|Asia|Europe|
+-------+----+------+
|   Jack|  Xi|Pascal|
|   Jane|NULL|  NULL|
+-------+----+------+
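To see why grouping by `rn` lines the names up, it can help to look at the intermediate ranking next to the final result. The sketch below reruns the same query shape outside Spark, using Python's built-in `sqlite3` module as a stand-in engine (an assumption made purely for illustration; it requires a SQLite build with window-function support, version 3.25 or later). It includes an explicit `ORDER BY rn` so the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (name TEXT, continent TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?)",
    [("Jack", "America"), ("Pascal", "Europe"), ("Xi", "Asia"), ("Jane", "America")],
)

# Intermediate step: each name gets its alphabetical position within its continent.
ranked = conn.execute(
    """
    SELECT name, continent,
           ROW_NUMBER() OVER (PARTITION BY continent ORDER BY name) AS rn
    FROM Student
    ORDER BY continent, rn
    """
).fetchall()

# Final step: names sharing the same rn collapse into one output row.
pivoted = conn.execute(
    """
    WITH RankedStudents AS (
        SELECT name, continent,
               ROW_NUMBER() OVER (PARTITION BY continent ORDER BY name) AS rn
        FROM Student
    )
    SELECT
        MAX(CASE WHEN continent = 'America' THEN name END) AS America,
        MAX(CASE WHEN continent = 'Asia' THEN name END) AS Asia,
        MAX(CASE WHEN continent = 'Europe' THEN name END) AS Europe
    FROM RankedStudents
    GROUP BY rn
    ORDER BY rn
    """
).fetchall()
conn.close()

print(ranked)
print(pivoted)
```

Within each `rn` group there is at most one name per continent, so `MAX` simply picks that name, or yields NULL when the continent has no student at that position.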



---

## **Summary**
| Approach       | Method                | Key Operations |
|----------------|-----------------------|----------------|
| **Approach 1** | PySpark DataFrame API | Uses `groupBy()` and `collect_list()`, then sorts, pads, and `zip()`s the lists on the driver; missing cells are empty strings |
| **Approach 2** | SQL Query in PySpark  | Uses **ROW_NUMBER() with CASE WHEN for Pivoting**; missing cells are `NULL` |
