# **Classes with 5 or More Students**

## **Problem Statement**
You are given a table `courses` with the following structure:

| Column Name | Type   |
|-------------|--------|
| student     | string |
| class       | string |

Write a query to **list all classes** that have **five or more students** enrolled.

---

### **Example**

#### **Input Table: courses**
| student | class    |
|---------|----------|
| A       | Math     |
| B       | English  |
| C       | Math     |
| D       | Biology  |
| E       | Math     |
| F       | Computer |
| G       | Math     |
| H       | Math     |
| I       | Math     |

#### **Expected Output**
| class |
|--------|
| Math   |
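
As a quick sanity check on the expected output, the group-count-filter logic can be sketched in plain Python before involving Spark. This is only an illustrative sketch: the helper name `classes_with_min_students` and its `min_students` parameter are assumptions introduced here, not part of the original solution.

```python
from collections import Counter

# Sample rows from the input table: (student, class)
rows = [
    ("A", "Math"), ("B", "English"), ("C", "Math"),
    ("D", "Biology"), ("E", "Math"), ("F", "Computer"),
    ("G", "Math"), ("H", "Math"), ("I", "Math"),
]

def classes_with_min_students(rows, min_students=5):
    """Return, sorted, the classes that appear in at least `min_students` rows."""
    # Count how many rows each class has (hypothetical helper, for illustration)
    counts = Counter(class_name for _student, class_name in rows)
    return sorted(c for c, n in counts.items() if n >= min_students)

print(classes_with_min_students(rows))  # ['Math']
```

Math appears in six rows, so it clears the threshold; every other class appears only once.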

---

## **Approach 1: PySpark DataFrame API**

### **Steps**
1. Initialize a Spark session.
2. Create a DataFrame from the sample data.
3. Group by `class` and count the students in each group.
4. Keep only the classes whose count is >= 5.

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

# Step 1: Initialize Spark
spark = SparkSession.builder.appName("Courses").getOrCreate()

# Step 2: Create DataFrame
data = [
    ("A", "Math"), ("B", "English"), ("C", "Math"),
    ("D", "Biology"), ("E", "Math"), ("F", "Computer"),
    ("G", "Math"), ("H", "Math"), ("I", "Math")
]
columns = ["student", "class"]
df = spark.createDataFrame(data, columns)

# Step 3: Group by class and count students
result = df.groupBy("class").agg(count("student").alias("num_students"))

# Step 4: Filter classes with >= 5 students
result.filter(col("num_students") >= 5).select("class").show()
```

```text
+-----+
|class|
+-----+
| Math|
+-----+
```



---

## **Approach 2: SQL Query in PySpark**

### **Steps**
1. Register the DataFrame from Approach 1 as a SQL temp view named `courses`.
2. Use `GROUP BY` and `HAVING` to keep only the classes with at least 5 students.


### **Code**

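```python
# Reuses `spark` and `df` from Approach 1.
# Register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("courses")

# Group rows by class and keep only groups with at least 5 students
sql_result = spark.sql("""
    SELECT class
    FROM courses
    GROUP BY class
    HAVING COUNT(student) >= 5
""")

sql_result.show()
```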


```text
+-----+
|class|
+-----+
| Math|
+-----+
```
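
`GROUP BY` with `HAVING` is standard SQL, so the same query can also be tried without a Spark session. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; swapping SQLite in for the Spark temp view is an assumption made purely for illustration and is not part of the original notebook.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (student TEXT, class TEXT)")
conn.executemany(
    "INSERT INTO courses (student, class) VALUES (?, ?)",
    [
        ("A", "Math"), ("B", "English"), ("C", "Math"),
        ("D", "Biology"), ("E", "Math"), ("F", "Computer"),
        ("G", "Math"), ("H", "Math"), ("I", "Math"),
    ],
)

# Same GROUP BY / HAVING query as in the Spark SQL cell
query = """
    SELECT class
    FROM courses
    GROUP BY class
    HAVING COUNT(student) >= 5
"""
classes = [row[0] for row in conn.execute(query)]
print(classes)  # ['Math']
conn.close()
```

If the table could contain repeated (student, class) rows, `COUNT(DISTINCT student)` would count each student once per class; the problem statement does not say whether such duplicates occur.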



---

## **Summary**

| Approach         | Method                | Key Functions / Clauses                         |
|------------------|-----------------------|-------------------------------------------------|
| **Approach 1**   | PySpark DataFrame API | `groupBy`, `agg`, `count`, `filter`             |
| **Approach 2**   | SQL in PySpark        | `createOrReplaceTempView`, `GROUP BY`, `HAVING` |