# Student Performance Analytics System - Spark Analytics

This notebook demonstrates how to use **Apache Spark (PySpark)** to perform distributed analytics on the student data.

### Steps:
The notebook first starts a Spark session, then works through three numbered steps:

1.  **Load Data**: Read the CSV files into Spark DataFrames.
2.  **Analyze**: Calculate the average exam score by grade level.
3.  **Insights**: Identify top-performing students.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count
import os

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("StudentPerformanceAnalytics") \
    .getOrCreate()

print("Spark Session Created.")

## Step 1: Load Data
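We define the path to our data directory and load the `students.csv` and `performance.csv` files. The data directory is resolved relative to the current working directory, so the notebook should be run from the project root.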

In [None]:
# Define paths
PROJECT_ROOT = os.getcwd()
DATA_DIR = os.path.join(PROJECT_ROOT, "data")

print(f"Loading data from: {DATA_DIR}")

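try:
    # header=True uses the first row as column names; inferSchema=True makes Spark detect column types
    df_students = spark.read.csv(os.path.join(DATA_DIR, "students.csv"), header=True, inferSchema=True)
    df_performance = spark.read.csv(os.path.join(DATA_DIR, "performance.csv"), header=True, inferSchema=True)

    print("Data Loaded into Spark DataFrames.")
    df_students.show(5)
except Exception as e:
    print(f"Error loading data: {e}")
    # Re-raise so later cells do not run against undefined DataFrames
    raise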
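
A missing input file only surfaces once Spark tries to read it. A minimal sketch of a check that could run before the load, assuming the two file names used in this step; the helper name `find_missing_files` is hypothetical and not part of the notebook:

```python
import os

def find_missing_files(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(data_dir, name))
    ]

# Hypothetical usage with the notebook's data directory layout
missing = find_missing_files(
    os.path.join(os.getcwd(), "data"),
    ["students.csv", "performance.csv"],
)
if missing:
    print(f"Missing input files: {missing}")
```

An empty list means both files were found and the load can proceed.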

## Step 2: Average Performance by Grade Level
We join the students and performance DataFrames on `Student_ID`, then group by `Grade_Level` to calculate the average `Exam_Score` and the number of students in each grade level.

In [None]:
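from pyspark.sql.functions import countDistinct

# Inner join on the shared key; students without performance rows are dropped
df_joined = df_students.join(df_performance, "Student_ID")

# Aggregate per grade level
df_report = df_joined.groupBy("Grade_Level") \
    .agg(
        avg("Exam_Score").alias("Avg_Exam_Score"),
        # Count distinct students: the join can yield several rows per student
        countDistinct("Student_ID").alias("Student_Count")
    ) \
    .orderBy("Avg_Exam_Score", ascending=False)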

print("Average Performance by Grade Level:")
df_report.show()
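
To make the join-and-aggregate semantics concrete, here is a small plain-Python sketch (no Spark required) that mirrors the logic on made-up rows. The sample data and the helper name `report_by_grade` are illustrative assumptions, not the notebook's data. It also shows why the number of joined rows and the number of distinct students in a grade can differ when a student has several exam records.

```python
from collections import defaultdict

# Made-up sample rows, for illustration only
students = [
    {"Student_ID": 1, "Grade_Level": 9},
    {"Student_ID": 2, "Grade_Level": 9},
    {"Student_ID": 3, "Grade_Level": 10},
]
performance = [
    {"Student_ID": 1, "Subject": "Math", "Exam_Score": 80},
    {"Student_ID": 1, "Subject": "Art", "Exam_Score": 90},
    {"Student_ID": 2, "Subject": "Math", "Exam_Score": 70},
    {"Student_ID": 3, "Subject": "Math", "Exam_Score": 95},
]

def report_by_grade(students, performance):
    """Inner-join on Student_ID, then aggregate per Grade_Level."""
    # Assumes Student_ID is unique in `students`
    grade_of = {s["Student_ID"]: s["Grade_Level"] for s in students}
    scores = defaultdict(list)
    ids = defaultdict(set)
    for row in performance:
        grade = grade_of.get(row["Student_ID"])
        if grade is None:
            continue  # inner join: skip records with no matching student
        scores[grade].append(row["Exam_Score"])
        ids[grade].add(row["Student_ID"])
    report = [
        {
            "Grade_Level": g,
            "Avg_Exam_Score": sum(v) / len(v),
            "Record_Count": len(v),
            "Student_Count": len(ids[g]),
        }
        for g, v in scores.items()
    ]
    # Highest average first, like orderBy(..., ascending=False)
    return sorted(report, key=lambda r: r["Avg_Exam_Score"], reverse=True)

for line in report_by_grade(students, performance):
    print(line)
```

In this sample, grade 9 has three exam records but only two distinct students, so a row count and a distinct-student count give different answers.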

## Step 3: Top Performers
We filter the joined data to find exam records with a score strictly above 90.

In [None]:
# Filter for high scores and sort so the highest scores come first
df_top_performers = df_joined.filter(col("Exam_Score") > 90) \
    .select("Full_Name", "Grade_Level", "Subject", "Exam_Score") \
    .orderBy(col("Exam_Score").desc())

print("Top Performing Students:")
df_top_performers.show(5)
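
Since `show(5)` prints only five rows, it can help to sort before limiting so that those rows are the highest scores. A plain-Python sketch of that filter-then-rank idea on made-up records; the function name `top_performers` and the sample rows are hypothetical. Note that a strict `> 90` comparison excludes a score of exactly 90.

```python
def top_performers(records, threshold=90, limit=5):
    """Return up to `limit` records scoring strictly above `threshold`, best first."""
    above = [r for r in records if r["Exam_Score"] > threshold]
    return sorted(above, key=lambda r: r["Exam_Score"], reverse=True)[:limit]

# Made-up sample rows, for illustration only
records = [
    {"Full_Name": "A", "Exam_Score": 90},
    {"Full_Name": "B", "Exam_Score": 97},
    {"Full_Name": "C", "Exam_Score": 91},
]
print(top_performers(records))
```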

In [None]:
# Stop the Spark Session
spark.stop()