# Student Performance Prediction

This notebook develops regression models to predict student exam scores using lifestyle, health, and background variables.
We use four models: Linear Regression, Random Forest, Gradient Boosting, and K-Nearest Neighbors.

## Project Objective and Overarching Question
The central question driving this project is:
**To what extent can student exam scores be predicted from lifestyle habits, wellness factors, and socioeconomic background?**

We aim to identify which features contribute most to academic performance and explore predictive models that can help estimate student outcomes.

## Data Preparation

We import the required libraries (models, pipelines, and metrics) and load the dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

df = pd.read_csv('../../student_habits_performance.csv')

In [None]:
# Drop ID and separate features/target
X = df.drop(columns=['student_id', 'exam_score'])
y = df['exam_score']

The target is exam_score. We drop student_id from the features because it is an identifier with no predictive value.

In [None]:
# Separate categorical and numeric features
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
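
To illustrate how `select_dtypes` splits columns by type, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not the real dataset schema. It uses `include="number"`, which matches any numeric dtype rather than only `int64` and `float64`.

```python
import pandas as pd

# Toy frame with hypothetical columns (assumed for illustration only)
toy = pd.DataFrame({
    "study_hours": [2.5, 4.0, 1.0],
    "attendance": [90, 75, 60],
    "diet_quality": pd.Series(["Good", "Fair", "Poor"], dtype="object"),
})

# Text columns stored as object dtype are treated as categorical
toy_categorical = toy.select_dtypes(include="object").columns.tolist()
# "number" covers every integer and float dtype
toy_numeric = toy.select_dtypes(include="number").columns.tolist()

print(toy_categorical)
print(toy_numeric)
```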

## Preprocessing Pipeline

Numeric and categorical features are processed differently: numeric columns are standardized, and categorical columns are one-hot encoded.

In [None]:
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The preprocessor scales numeric values and one-hot encodes categoricals to prepare them for machine learning. The split holds out 20% of the rows as a test set.

## Models Used

In [None]:
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5)
}

The train-test split performed earlier ensures that these models are evaluated on data they did not see during training.

## Model Evaluation

We compare the four regression models, all familiar techniques from class, on the held-out test set using R², RMSE, and MAE.

### Interpretation of Modeling Results
The test-set metrics printed for each model in the sections below show the following:
- **Random Forest** and **Gradient Boosting** had the highest R² scores, indicating better generalization and predictive accuracy.
- **Linear Regression** performed reasonably well, suggesting a roughly linear relationship between some variables and exam scores.
- **K-Nearest Neighbors** had the lowest R² and highest error. Numeric features are already standardized in the pipeline, so the likely cause is KNN's sensitivity to how one-hot encoded categories enter the distance metric and to local variation in the data.

These results suggest that tree-based models handle this mix of features well, especially when nonlinear patterns exist.
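
One way to collect these metrics into a single comparison table is to loop over a dictionary of models, similar to the `models` dictionary defined earlier. The sketch below is an assumption about how that could look, not the notebook's actual evaluation code. It runs on synthetic data from `make_regression`, so its numbers say nothing about the student dataset. In the notebook, `preprocessor`, `X_train`, `X_test`, `y_train`, and `y_test` would take the place of the stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replaces the real features and target)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
}

rows = []
for name, model in demo_models.items():
    # Same pipeline shape as the notebook: preprocessing step, then the model
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_tr, y_tr)
    preds = pipe.predict(X_te)
    rows.append({
        "Model": name,
        "R2": r2_score(y_te, preds),
        "RMSE": np.sqrt(mean_squared_error(y_te, preds)),
        "MAE": mean_absolute_error(y_te, preds),
    })

# One row per model, best R² first
results = pd.DataFrame(rows).set_index("Model").sort_values("R2", ascending=False)
print(results.round(3))
```

Keeping all four models in one loop guarantees that each is fitted and scored on exactly the same split, which makes the rows directly comparable.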

## Linear Regression Model

In [None]:
# Linear Regression
lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])
lr_pipe.fit(X_train, y_train)
lr_preds = lr_pipe.predict(X_test)
print("Linear Regression R²:", r2_score(y_test, lr_preds))
# np.sqrt of MSE works across scikit-learn versions (the squared= argument was removed in 1.6)
print("RMSE:", np.sqrt(mean_squared_error(y_test, lr_preds)))
print("MAE:", mean_absolute_error(y_test, lr_preds))

## Random Forest Regressor

In [None]:
# Random Forest Regressor
rf_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])
rf_pipe.fit(X_train, y_train)
rf_preds = rf_pipe.predict(X_test)
print("Random Forest R²:", r2_score(y_test, rf_preds))
# np.sqrt of MSE works across scikit-learn versions (the squared= argument was removed in 1.6)
print("RMSE:", np.sqrt(mean_squared_error(y_test, rf_preds)))
print("MAE:", mean_absolute_error(y_test, rf_preds))

## Gradient Boosting Regressor

In [None]:
# Gradient Boosting Regressor
gb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(random_state=42))
])
gb_pipe.fit(X_train, y_train)
gb_preds = gb_pipe.predict(X_test)
print("Gradient Boosting R²:", r2_score(y_test, gb_preds))
# np.sqrt of MSE works across scikit-learn versions (the squared= argument was removed in 1.6)
print("RMSE:", np.sqrt(mean_squared_error(y_test, gb_preds)))
print("MAE:", mean_absolute_error(y_test, gb_preds))

## K-Nearest Neighbors Regressor

In [None]:
# K-Nearest Neighbors Regressor
knn_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', KNeighborsRegressor(n_neighbors=5))
])
knn_pipe.fit(X_train, y_train)
knn_preds = knn_pipe.predict(X_test)
print("KNN R²:", r2_score(y_test, knn_preds))
# np.sqrt of MSE works across scikit-learn versions (the squared= argument was removed in 1.6)
print("RMSE:", np.sqrt(mean_squared_error(y_test, knn_preds)))
print("MAE:", mean_absolute_error(y_test, knn_preds))


## Final Interpretation and Key Takeaways

The Random Forest and Gradient Boosting models performed best in our evaluation. Their ability to capture non-linear patterns and feature interactions makes them well suited to this dataset.

The most influential features across models were:
- Study hours per day
- Class attendance
- Mental health rating
- Sleep hours
- Diet and exercise frequency

This suggests that academic performance isn't just about studying longer: wellness and environmental factors also appear to matter. These results can help educators support students holistically.
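
The cells above do not show how feature influence was measured. One possible approach is permutation importance, sketched below on synthetic data with placeholder column names (`f0` to `f3`, assumed for illustration). It shuffles one column at a time on the test set and records how much the score drops. In the notebook, passing `rf_pipe` with the raw `X_test` and `y_test` should give importances per original column.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only 2 of the 4 features carry signal
X_arr, y_demo = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Mean drop in R² when each column is shuffled, averaged over 5 repeats
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X_te.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Because the importances are computed on held-out data, they reflect what the model relies on for generalization rather than what it memorized during training.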
