# Comprehensive Student Data Analysis with PySpark

## Objective

Leverage PySpark to perform a thorough analysis of student performance data. This exercise covers data loading and manipulation using RDDs and DataFrames, and it culminates in building and evaluating a logistic regression model to predict student success.



## Dataset

**student_data.csv** includes:

- age: Age of the student
- study_time: Weekly study hours
- failures: Number of past class failures
- passed: Course outcome (1: passed, 0: failed)

## Set Up

In [1]:
# !pip install pyspark

In [2]:
import pyspark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("Student Data Analysis").getOrCreate()



## Tasks

### Task 1: Resilient Distributed Dataset (RDD) Operations

1. Load student_data.csv into an RDD and remove the header.

In [7]:
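# YOUR CODE HERE
# Read the file as lines of text, then drop the header and split each line into fields.
lines = spark.sparkContext.textFile("student_data.csv")
header = lines.first()
rdd = lines.filter(lambda line: line != header).map(lambda line: line.split(","))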

2. Filter to include only students older than 20 years.

In [None]:
# YOUR CODE HERE

[['22', '8', '0', '1'], ['23', '8', '1', '1'], ['22', '9', '0', '1'], ['22', '3', '3', '0'], ['23', '3', '1', '0']]


3. Count students older than 20 with past failures.


In [None]:
# YOUR CODE HERE

276


### Task 2: DataFrame Operations

1. Load student_data.csv into a DataFrame.

In [None]:
# YOUR CODE HERE

2. Explore the data by displaying the schema and the first five rows.

In [None]:
# YOUR CODE HERE

root
 |-- age: integer (nullable = true)
 |-- study_time: integer (nullable = true)
 |-- failures: integer (nullable = true)
 |-- passed: integer (nullable = true)

+---+----------+--------+------+
|age|study_time|failures|passed|
+---+----------+--------+------+
| 22|         8|       0|     1|
| 19|         7|       2|     0|
| 23|         8|       1|     1|
| 20|         6|       2|     0|
| 22|         9|       0|     1|
+---+----------+--------+------+
only showing top 5 rows



3. Add a new column, study_time_minutes, that converts weekly study time from hours to minutes.

In [None]:
# YOUR CODE HERE

4. Calculate the average age of students grouped by their pass/fail status.

In [None]:
# YOUR CODE HERE

+------+------------------+
|passed|          avg(age)|
+------+------------------+
|     1|19.997109826589597|
|     0|19.845565749235472|
+------+------------------+



### Task 3: Logistic Regression Model

1. Prepare the data by vectorizing features and splitting into training and test datasets.


In [None]:
# YOUR CODE HERE

2. Build and train a logistic regression model.

In [None]:
# YOUR CODE HERE

3. Evaluate the model using accuracy, precision, recall, F1 score, and the area under the ROC curve.


In [None]:
# YOUR CODE HERE

Model Evaluation:
Area under ROC: 0.9836863890917945
Precision: 0.9247311827956989
Recall: 0.9297297297297298
F1 Score: 0.9087003451598346
