# **Marvel Cities**

## **Problem Statement**
You are given a table **CITY** with the following structure:

### **Table: CITY**
| Column Name  | Type    |
|-------------|--------|
| `ID`         | Number  |
| `Name`       | Varchar |
| `CountryCode`| Varchar |
| `Population` | Number  |

### **Objective**
Write a query to return **all columns** for **cities in Marvel (CountryCode = 'Marv')** where **the population is greater than 100,000**.
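
The objective combines two conditions that must both hold. As a minimal sketch in plain Python (the helper name `matches_objective` is hypothetical and is not part of the solutions that follow), the predicate and its boundary case look like this:

```python
def matches_objective(country_code: str, population: int) -> bool:
    """Hypothetical helper: True when a row satisfies both conditions of the objective."""
    return country_code == "Marv" and population > 100000


print(matches_objective("Marv", 120000))   # True
print(matches_objective("Marv", 100000))   # False: exactly 100,000 is not "greater than"
print(matches_objective("USA", 8500000))   # False: wrong country code
```

Note that the comparison is strict, so a city with a population of exactly 100,000 is excluded.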

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for `CITY` Table**
3. **Filter rows where `CountryCode = 'Marv'` and `Population > 100000`**
4. **Select all columns and display the results**

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("MarvelCities").getOrCreate()

# Step 2: Create DataFrame for CITY Table
city_data = [
    (1, "New York", "USA", 8500000),
    (2, "Los Angeles", "USA", 4000000),
    (3, "Marvel City", "Marv", 120000),
    (4, "Small Town", "Marv", 50000),
    (5, "Mega Marvel", "Marv", 200000),
    (6, "Gotham", "DC", 1500000)
]
city_columns = ["ID", "Name", "CountryCode", "Population"]

city_df = spark.createDataFrame(city_data, city_columns)

# Step 3: Filter Marvel cities with population > 100000
filtered_df = city_df.filter((col("CountryCode") == "Marv") & (col("Population") > 100000))

# Step 4: Display Output
filtered_df.show()
```

```text
+---+-----------+-----------+----------+
| ID|       Name|CountryCode|Population|
+---+-----------+-----------+----------+
|  3|Marvel City|       Marv|    120000|
|  5|Mega Marvel|       Marv|    200000|
+---+-----------+-----------+----------+
```

---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create Spark Session**
2. **Create DataFrame for `CITY` Table**
3. **Register it as a SQL View**
4. **Write and Execute SQL Query**
5. **Display the Output**

Steps 1 and 2 reuse the `spark` session and the `city_df` DataFrame created in Approach 1.

### **Code**

```python
# Step 3: Register DataFrame as a SQL View
city_df.createOrReplaceTempView("CITY")

# Step 4: Write and run the SQL query
# (trailing semicolon omitted: spark.sql() takes a single statement,
# and some older Spark versions reject a trailing semicolon)
sql_query = """
SELECT *
FROM CITY
WHERE CountryCode = 'Marv' AND Population > 100000
"""

result_sql = spark.sql(sql_query)

# Step 5: Display Output
result_sql.show()
```

```text
+---+-----------+-----------+----------+
| ID|       Name|CountryCode|Population|
+---+-----------+-----------+----------+
|  3|Marvel City|       Marv|    120000|
|  5|Mega Marvel|       Marv|    200000|
+---+-----------+-----------+----------+
```

---

## **Summary**
| Approach       | Method                | Key Technique                                   |
|----------------|-----------------------|-------------------------------------------------|
| **Approach 1** | PySpark DataFrame API | Uses `filter()` and `col()` to apply conditions |
| **Approach 2** | SQL Query in PySpark  | Uses SQL `WHERE` clause in PySpark              |
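
Both approaches apply the same predicate, so they should return the same rows. As a rough cross-check that does not need a Spark installation, the sketch below runs the same `WHERE` clause against an in-memory SQLite database using Python's built-in `sqlite3` module. SQLite is only a stand-in here (an assumption for illustration, not part of the original solution), and the `ORDER BY ID` is added to make the row order deterministic.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Spark (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CITY (ID INTEGER, Name TEXT, CountryCode TEXT, Population INTEGER)"
)

# Same sample rows as in Approach 1.
city_data = [
    (1, "New York", "USA", 8500000),
    (2, "Los Angeles", "USA", 4000000),
    (3, "Marvel City", "Marv", 120000),
    (4, "Small Town", "Marv", 50000),
    (5, "Mega Marvel", "Marv", 200000),
    (6, "Gotham", "DC", 1500000),
]
conn.executemany("INSERT INTO CITY VALUES (?, ?, ?, ?)", city_data)

# Same predicate as both approaches; ORDER BY ID fixes the row order.
rows = conn.execute(
    "SELECT * FROM CITY "
    "WHERE CountryCode = 'Marv' AND Population > 100000 "
    "ORDER BY ID"
).fetchall()

for row in rows:
    print(row)

conn.close()
```

With the sample data above, this prints the rows for `Marvel City` and `Mega Marvel`, matching the output of both PySpark approaches.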
