# Student Performance Prediction

## 1. Introduction
In this notebook, we will analyze student performance data and predict the **Average Score** based on demographic and academic factors. We will use **Linear Regression** for this prediction.

### Objectives:
- Load and inspect the dataset.
- Perform Data Cleaning and Exploratory Data Analysis (EDA).
- Engineer the target feature (`AverageScore`).
- Preprocess the data (encode categorical variables).
- Train a Linear Regression model.
- Evaluate the model's performance.

## 2. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Set plot style
sns.set_theme(style="whitegrid")  # sns.set is an older alias for set_theme
%matplotlib inline

## 3. Loading the Dataset

In [None]:
# Load the dataset
file_path = "../data/StudentsPerformance.csv"
df = pd.read_csv(file_path)

# Display the first few rows
df.head()

## 4. Data Inspection and Cleaning
We will check for missing values, data types, and the shape of the dataset.

In [None]:
# Check dataset info
df.info()

In [None]:
# Check for missing values
df.isnull().sum()

*Observation: If `df.isnull().sum()` reports zero for every column, we can proceed directly. Otherwise, the missing values must be handled first, either by imputation or by removing the affected rows.*
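
The two options mentioned in the observation can be sketched on a small toy frame. This is a minimal illustration, not part of the analysis: the values are made up, the `gender` column is assumed for the example, and the choice of median and mode as fill values is one common convention among several.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "gender": ["female", "male", None, "female"],
    "math score": [72.0, np.nan, 90.0, 47.0],
})

# Count missing values per column, as df.isnull().sum() does in the notebook.
missing_counts = toy.isnull().sum()

# Option 1: remove every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: impute. Median for the numeric column, most frequent value for the categorical one.
imputed = toy.copy()
imputed["math score"] = imputed["math score"].fillna(imputed["math score"].median())
imputed["gender"] = imputed["gender"].fillna(imputed["gender"].mode()[0])

print(missing_counts)
print("Rows left after dropna:", len(dropped))
print("Missing values after imputation:", imputed.isnull().sum().sum())
```

Removal is simplest when only a few rows are affected; imputation keeps every row but replaces unknown values with estimates.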

## 5. Feature Engineering
We need a target variable, so we create `AverageScore`, the mean of the math, reading, and writing scores.

In [None]:
df['AverageScore'] = (df['math score'] + df['reading score'] + df['writing score']) / 3
df.head()
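
As an aside, the same target can be computed with a row-wise mean instead of summing the columns by hand. The sketch below uses a toy frame with made-up scores; `score_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy scores (made up); the column names match those used in the notebook.
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# Row-wise mean across the three score columns.
toy["AverageScore"] = toy[score_cols].mean(axis=1)

# The explicit formula from the notebook, for comparison.
manual = (toy["math score"] + toy["reading score"] + toy["writing score"]) / 3

print(toy)
print("Largest difference:", (toy["AverageScore"] - manual).abs().max())
```

One difference to keep in mind: `mean(axis=1)` skips missing values by default, whereas the explicit sum yields `NaN` for any row with a missing score. With no missing values, the two give the same result.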

### Data Visualization
Let's check the distribution of the target variable `AverageScore`.

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(df['AverageScore'], kde=True, bins=30)
plt.title('Distribution of Average Scores')
plt.xlabel('Average Score')
plt.ylabel('Frequency')
plt.show()

## 6. Preprocessing (Encoding Categorical Variables)
Most machine learning models, including Linear Regression, require numerical input. We will convert the categorical columns using One-Hot Encoding.

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
print("Categorical Columns:", categorical_cols)

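# For simplicity, no log transform or standard scaling is applied here; one-hot encoding, however, is required.
# pd.get_dummies converts each categorical feature into numeric indicator columns.
# drop_first=True drops one level per feature so the indicator columns are not redundant.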
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

df_encoded.head()
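
To see what `drop_first=True` does, here is a small sketch on a toy frame. The `lunch` and `test preparation course` columns and their values are assumptions made for the example and may not match the real dataset exactly.

```python
import pandas as pd

# Toy frame with two categorical columns (assumed names and values) and one numeric column.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none"],
    "math score": [72, 69, 90],
})

categorical_cols = ["lunch", "test preparation course"]

# One indicator column per category level, minus the first level (alphabetically) of each feature.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(encoded.columns.tolist())
# ['math score', 'lunch_standard', 'test preparation course_none']
```

Each feature here has two levels, so one indicator column per feature is enough: a row where `lunch_standard` is false must belong to the dropped level, `free/reduced`.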

## 7. Train-Test Split

In [None]:
# Define features (X) and target (y)
# Note: drop the three original score columns. AverageScore is computed directly from them,
# so keeping them as features would leak the target into the inputs (data leakage).
X = df_encoded.drop(columns=['math score', 'reading score', 'writing score', 'AverageScore'])
y = df_encoded['AverageScore']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training shape: {X_train.shape}")
print(f"Testing shape: {X_test.shape}")

## 8. Model Training
We will train a Linear Regression model.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

print("Model training complete.")
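
After fitting, the learned coefficients can be inspected to see how each feature contributes to the prediction. The sketch below uses synthetic data with a known relationship, so the recovered values can be checked; `X_toy`, `y_toy`, and `toy_model` are names introduced for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, noise-free relationship: y = 5 + 2*a + 3*b
X_toy = pd.DataFrame({"a": [0, 1, 0, 1, 2], "b": [0, 0, 1, 1, 1]})
y_toy = 5 + 2 * X_toy["a"] + 3 * X_toy["b"]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its feature name for readability.
coefficients = pd.Series(toy_model.coef_, index=X_toy.columns)

print(coefficients)
print("Intercept:", toy_model.intercept_)
```

Applied to the notebook's model, the same pattern would be `pd.Series(model.coef_, index=X_train.columns)`. For a one-hot encoded feature, each coefficient is the estimated difference in average score relative to the dropped level.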

## 9. Model Evaluation
We will evaluate the model using R-squared (R²) and Mean Absolute Error (MAE).

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared Score: {r2:.2f}")
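
To make the two metrics concrete, the sketch below computes them by hand on four made-up values, using the standard definitions: MAE is the mean of the absolute errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The numbers are illustrative only and are not results from this dataset.

```python
import numpy as np

# Made-up actual and predicted scores, for illustration only.
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_hat = np.array([65.0, 68.0, 82.0, 85.0])

# MAE: average absolute distance between prediction and actual value.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Reference point: a "model" that always predicts the mean of y_true.
baseline_mae = np.mean(np.abs(y_true - y_true.mean()))

print(f"MAE: {mae_manual:.2f}")                 # MAE: 3.50
print(f"R-squared: {r2_manual:.3f}")            # R-squared: 0.884
print(f"Mean-only baseline MAE: {baseline_mae:.2f}")  # Mean-only baseline MAE: 10.00
```

By this definition, always predicting the mean gives an R² of exactly 0 on the same data, so an R² near 0 means the features add little over that trivial baseline.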

### Visualization of Results

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', lw=2)
plt.xlabel('Actual Average Score')
plt.ylabel('Predicted Average Score')
plt.title('Actual vs Predicted Scores')
plt.show()

## 10. Conclusion
- **R² Score interpretation**: The proportion of the variance in students' average scores that the model explains using the demographic features.
- **MAE interpretation**: How far, on average, our predictions are from the actual average score, measured in score points.

This model gives us a baseline for understanding how factors like parental education, race/ethnicity group, and test preparation courses are associated with student performance.