<h1><center>Enhancing Student Success Forecasting Through Optimized Ensemble Learning Techniques</center></h1>

<p align="center">
  <img src='plots\project.cover\cover.png' width="600" height="600">
</p>

<h1><span style='color:#b846a3;font-family:Comic Sans MS'>Objectives:</span></h1>

In this notebook, we implement the methodology described in our research paper to move from basic predictive modeling to advanced ensemble learning methods.

**Our main goals are to:**
- **Predict academic outcomes** (Pass/Fail) based on socio-demographic and school-related features.
- **Compare performance** across different machine learning algorithms.
- **Identify impactful factors** (such as family stability and home environment) that affect student achievement.
- **Determine the most robust algorithm** with the highest accuracy for educational forecasting.

We will be utilizing the following learning algorithms:
- **Logistic Regression**
- **Support Vector Machine (SVM)**
- **K-Nearest Neighbors (KNN)**

---


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score, classification_report
from astropy.table import Table

# Load the dataset
# Ensure 'student-data.csv' is in your working directory
df = pd.read_csv('student-data.csv')
dfv = df.copy()  # unprocessed copy of the raw data (avoids reading the file twice)

# Display initial information about the dataset
print(f"Dataset Shape: {df.shape}")
df.head()


**Before processing the dataset, let's describe it briefly:**

* To apply our machine learning skills, we chose a dataset on student achievement in secondary education at two Portuguese schools.
* The shape of our dataset is (395 rows x 31 columns).
* There are no missing values in the data, so no row deletion is required.
* The data attributes include demographic, social, and school-related features.
* The last column, **'passed'**, is our target variable (binary: yes or no).

**Detailed Column Explanation:**

* **school**: student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
* **sex**: student's sex (binary: "F" - female or "M" - male)
* **age**: student's age (numeric: from 15 to 22)
* **address**: student's home address type (binary: "U" - urban or "R" - rural)
* **famsize**: family size (binary: "LE3" - less than or equal to 3 or "GT3" - greater than 3)
* **Pstatus**: parents' cohabitation status (binary: "T" - living together or "A" - apart)
* **Medu / Fedu**: Mother / Father education (0 - none, 1 - 4th grade, 2 - 5th to 9th grade, 3 - secondary, 4 - higher)
* **Mjob / Fjob**: Mother / Father job (teacher, health, services, at_home, other)
* **reason**: reason to choose this school (home, reputation, course, other)
* **guardian**: student's guardian (mother, father, or other)
* **traveltime**: home to school travel time (1 - <15 min, 2 - 15-30 min, 3 - 30-60 min, 4 - >1 hour)
* **studytime**: weekly study time (1 - <2 hours, 2 - 2-5 hours, 3 - 5-10 hours, 4 - >10 hours)
* **failures**: number of past class failures (n if 1<=n<3, else 4)
* **schoolsup / famsup**: extra educational/family support (binary: yes or no)
* **paid**: extra paid classes within the course subject (binary: yes or no)
* **activities**: extra-curricular activities (binary: yes or no)
* **nursery**: attended nursery school (binary: yes or no)
* **higher**: wants to take higher education (binary: yes or no)
* **internet**: Internet access at home (binary: yes or no)
* **romantic**: in a romantic relationship (binary: yes or no)
* **famrel**: quality of family relationships (from 1 - very bad to 5 - excellent)
* **freetime**: free time after school (from 1 - very low to 5 - very high)
* **goout**: going out with friends (from 1 - very low to 5 - very high)
* **Dalc / Walc**: Workday / Weekend alcohol consumption (from 1 - very low to 5 - very high)
* **health**: current health status (from 1 - very bad to 5 - very good)
* **absences**: number of school absences (from 0 to 93)

**Target Variable:**
* **passed**: did the student pass the final exam or not (binary: yes or no)


# Data processing <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Prepared by: Sai Shankar Sutar, Yash Vardan Rathi, Aditya Saxena, and Aviral Bhardwaj</h5>

**Before working with any dataset, we must process it so it will be ready for training our models.** In this section, we will:

- **1) Categorical Mapping:** Most machine learning classifiers cannot handle non-numerical values. We will map all string-based categories (like school name, job types, and binary responses) to appropriate integers.

- **2) Feature Scaling:** This is a method used to normalize the range of independent variables. Scaling helps our learning algorithms converge more quickly and prevents features with larger numerical ranges (like 'absences') from dominating the model.

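We apply mean normalization to columns with a larger range (maximum above 6, such as 'age' and 'absences'):
$$\frac{col-\operatorname{mean}(col)}{\max(col)}$$
Columns with a smaller range (binary and 1-5 ordinal features) are rescaled to [0, 1] with min-max scaling:
$$\frac{col-\min(col)}{\max(col)-\min(col)}$$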
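As a quick illustration, the sketch below applies the mean normalization used for wide-range columns, together with the min-max rescaling that the scaling cell uses for narrow-range columns, to two small made-up columns. The values are hypothetical and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns (not real rows from student-data.csv)
absences = pd.Series([0.0, 4.0, 10.0, 26.0])   # wide range -> mean normalization
studytime = pd.Series([1.0, 2.0, 2.0, 4.0])    # narrow range -> min-max scaling

# Mean normalization: (col - mean(col)) / max(col)
absences_scaled = (absences - absences.mean()) / absences.max()

# Min-max scaling: (col - min(col)) / (max(col) - min(col))
studytime_scaled = (studytime - studytime.min()) / (studytime.max() - studytime.min())

print(absences_scaled.round(3).tolist())
print(studytime_scaled.round(3).tolist())
```

Mean normalization centres the column around zero, while min-max scaling maps the smallest value to 0 and the largest to 1.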

In [None]:
# Function to map strings to numeric values
def numerical_data():
    df['school'] = df['school'].map({'GP': 0, 'MS': 1})
    df['sex'] = df['sex'].map({'M': 0, 'F': 1})
    df['address'] = df['address'].map({'U': 0, 'R': 1})
    df['famsize'] = df['famsize'].map({'LE3': 0, 'GT3': 1})
    df['Pstatus'] = df['Pstatus'].map({'T': 0, 'A': 1})
    df['Mjob'] = df['Mjob'].map({'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4})
    df['Fjob'] = df['Fjob'].map({'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4})
    df['reason'] = df['reason'].map({'home': 0, 'reputation': 1, 'course': 2, 'other': 3})
    df['guardian'] = df['guardian'].map({'mother': 0, 'father': 1, 'other': 2})
    df['schoolsup'] = df['schoolsup'].map({'no': 0, 'yes': 1})
    df['famsup'] = df['famsup'].map({'no': 0, 'yes': 1})
    df['paid'] = df['paid'].map({'no': 0, 'yes': 1})
    df['activities'] = df['activities'].map({'no': 0, 'yes': 1})
    df['nursery'] = df['nursery'].map({'no': 0, 'yes': 1})
    df['higher'] = df['higher'].map({'no': 0, 'yes': 1})
    df['internet'] = df['internet'].map({'no': 0, 'yes': 1})
    df['romantic'] = df['romantic'].map({'no': 0, 'yes' : 1})
    df['passed'] = df['passed'].map({'no': 0, 'yes': 1})
    
    # Reorder dataframe so target 'passed' is at the end
    col = df['passed']
    del df['passed']
    df['passed'] = col

# Function for feature scaling (modifies df in place)
def feature_scaling(df):
    for i in df:
        col = df[i]
        # Mean-normalize columns with larger ranges (e.g. 'age', 'absences')
        if np.max(col) > 6:
            df[i] = (col - np.mean(col)) / np.max(col)
        # Min-max scale binary/smaller-range columns to [0, 1]
        # (`else` also covers a maximum of exactly 6, which was previously left unscaled)
        else:
            col = col - np.min(col)
            df[i] = col / np.max(col)

# Execute processing
numerical_data()
feature_scaling(df)

# Show the processed dataset
df.head()

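Integer codes such as `{'teacher': 0, ..., 'other': 4}` impose an order on nominal categories, and distance-based models like KNN and SVM may treat that order as meaningful. One common alternative is one-hot encoding. The sketch below uses `pandas.get_dummies` on a made-up frame; the rows are hypothetical, and this is not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy rows using category names from the dataset description
toy = pd.DataFrame({
    'Mjob': ['teacher', 'health', 'other', 'teacher'],
    'guardian': ['mother', 'father', 'mother', 'other'],
    'age': [15, 16, 17, 18],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['Mjob', 'guardian'], dtype=int)
print(encoded.columns.tolist())
```

Each nominal column becomes a group of 0/1 indicators, so no category is numerically "closer" to another than the rest.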

# Data visualisation <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Analysis by: Sai Shankar Sutar, Yash Vardan Rathi, Aditya Saxena, and Aviral Bhardwaj</h5>

In this section, we look deeper into the features to understand which social and demographic factors most significantly impact student performance. We will start with a correlation heatmap to see how variables relate to the final 'passed' status.

In [None]:
# 1) Checking for missing values
print("Rows without null values:", df.dropna().shape[0])

# 2) General Correlation Heatmap
# This shows the strength of relationship between all variables
corr = df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr, annot=True, cmap="Reds", fmt='.2f')
plt.title('Correlation Heatmap', fontsize=20)
plt.show()

# 3) Targeted Correlation: Features vs Student Status
# This highlights which specific features have the strongest positive or negative correlation with passing
plt.figure(figsize=(8, 12))
status_corr = df.corr()[['passed']].sort_values(by='passed', ascending=False)
heatmap = sns.heatmap(status_corr, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Student Status', fontdict={'fontsize':18}, pad=16)
plt.show()

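The targeted heatmap sorts features by signed correlation, so strong negative predictors end up at the bottom of the list. To rank features by strength regardless of sign, one option is to sort on the absolute value. The sketch below uses synthetic data: the column names are borrowed from the dataset, but the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data: 'failures' lowers and 'studytime' raises the pass chance
toy = pd.DataFrame({
    'failures': rng.integers(0, 4, n),
    'studytime': rng.integers(1, 5, n),
    'health': rng.integers(1, 6, n),
})
score = -1.0 * toy['failures'] + 0.5 * toy['studytime'] + rng.normal(0, 1, n)
toy['passed'] = (score > score.median()).astype(int)

# Pearson correlation of every feature with the target
target_corr = toy.drop(columns='passed').corrwith(toy['passed'])

# Order by magnitude so strong negative predictors are not buried at the bottom
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```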

# Logistic Regression <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Prepared by: Aditya Saxena and Aviral Bhardwaj</h5>

Logistic Regression is used as our baseline classifier to estimate the probability of a student passing based on the independent variables. 

In this section, we:
1. **Split the data** into 70% training and 30% testing.
2. **Train the model** using the training set.
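3. **Evaluate performance** using accuracy, the macro-averaged F1 score, and the confusion matrix. Comparing train and test accuracy helps reveal overfitting, while the confusion matrix shows whether the model is biased toward one class.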


In [None]:
# 1) Data Splitting
data = df.to_numpy()
n = data.shape[1]
x = data[:, 0:n-1]
y = data[:, n-1]

# Split data: 70% for training, 30% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# 2) Model Training
logisticRegr = LogisticRegression(C=1, max_iter=1000)
logisticRegr.fit(x_train, y_train)

# 3) Predictions and Evaluation
y_pred = logisticRegr.predict(x_test)

# Accuracy scores
Sctest = logisticRegr.score(x_test, y_test)
Sctrain = logisticRegr.score(x_train, y_train)

print(f"Accuracy (Test set): {round(Sctest*100, 2)}%")
print(f"Accuracy (Train set): {round(Sctrain*100, 2)}%")

# F1 Score
f1 = f1_score(y_test, y_pred, average='macro')
print(f"F1 Score: {round(f1, 2)}")

# 4) Confusion Matrix Visualization

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()

# 5) Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Save results for final comparison
yt_lg, yp_lg = y_test, y_pred

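To make the reported metrics concrete, the sketch below recomputes accuracy and macro F1 by hand from a confusion matrix and prints them. The labels are a small made-up example, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

# Hypothetical labels: 1 = passed, 0 = failed
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)

# F1 for a class = 2*TP / (2*TP + FP + FN), treating that class as "positive"
f1_pass = 2 * tp / (2 * tp + fp + fn)
f1_fail = 2 * tn / (2 * tn + fn + fp)

# Macro F1 is the unweighted mean of the per-class scores
macro_f1 = (f1_pass + f1_fail) / 2

print(accuracy, round(macro_f1, 3))
```

Because macro F1 weights both classes equally, a model that mostly predicts "pass" scores lower on it than on plain accuracy when failing students are the minority.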

# K-Nearest Neighbors (KNN) <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Prepared by: Yash Vardan Rathi</h5>

K-Nearest Neighbors is a non-parametric, lazy learning algorithm that classifies a student based on how similar they are to their "neighbors" in the feature space. 

**In this section, we:**
1. **Optimize Hyperparameters:** We use `GridSearchCV` to find the best value for **K** (number of neighbors) and the best **Distance Metric**.
2. **Reuse the Split:** We train and test on the same `random_state=0` split used for Logistic Regression so that the models' scores are directly comparable.
3. **Evaluate:** We analyze the Accuracy and F1 Score to ensure the model generalizes well to new student data.

In [None]:
# 1) Hyperparameter Tuning using GridSearchCV
# We test different values of K and three distance metrics (Euclidean, Manhattan, Chebyshev)
param_grid = {
    'n_neighbors': np.arange(1, 25),
    'metric': ['euclidean', 'manhattan', 'chebyshev']
}

knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_search.fit(x_train, y_train)

best_k = knn_search.best_params_['n_neighbors']
best_metric = knn_search.best_params_['metric']

print(f"Optimal K Value: {best_k}")
print(f"Optimal Metric: {best_metric}")

# 2) Final KNN Model Implementation
# We use the optimized parameters found above
knn_final = KNeighborsClassifier(n_neighbors=best_k, metric=best_metric)
knn_final.fit(x_train, y_train)

# 3) Evaluation
y_pred_knn = knn_final.predict(x_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn) * 100
f1_knn_score = f1_score(y_test, y_pred_knn, average='macro')

print(f"\nKNN Accuracy: {round(accuracy_knn, 2)}%")
print(f"KNN F1 Score: {round(f1_knn_score, 2)}")

# 4) Confusion Matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(6,4))
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Greens')
plt.title(f'KNN Confusion Matrix (K={best_k})')
plt.show()

# Save results for final comparison
yt_knn, yp_knn = y_test, y_pred_knn

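A single train/test split can make a model look better or worse by chance, especially with only 395 rows. Rather than searching for a favourable `random_state`, one option is to report the mean and spread of accuracy over repeated stratified folds. The sketch below runs on synthetic data: `make_classification` stands in for the student features, and `n_neighbors=7` with the Manhattan metric is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same number of rows and features as the dataset
X_demo, y_demo = make_classification(n_samples=395, n_features=30,
                                     n_informative=8, random_state=0)

# 5 folds repeated 10 times with different shuffles -> 50 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=7, metric='manhattan'),
                         X_demo, y_demo, cv=cv, scoring='accuracy')

print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

A large standard deviation here would suggest that a single-split accuracy figure should be read with caution.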

# Support Vector Machine (SVM) <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Prepared by: Sai Shankar Sutar</h5>

The Support Vector Machine is our primary model for this research. It works by finding the optimal hyperplane that maximizes the margin between the classes.

**In this section, we:**
1. **Optimize Hyperparameters:** We tune the **C parameter** (regularization), **Gamma** (kernel coefficient), and test different **Kernels** (Linear, Polynomial, RBF).
2. **Feature Importance:** If the best kernel is linear, we inspect the SVM coefficients to identify which factors (like study time or parents' education) are the strongest predictors of success.
3. **Evaluate:** We analyze the Accuracy, F1 Score, and ROC curves to assess whether SVM provides the most robust forecasting framework.

In [None]:
# 1) Hyperparameter Tuning for SVM
# We test Linear, RBF (Gaussian), and Polynomial kernels
param_grid_svm = [
  {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
  {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
  {'C': [0.1, 1, 10, 100], 'degree': [2, 3], 'kernel': ['poly']}
 ]

svm_search = GridSearchCV(SVC(probability=True), param_grid_svm, cv=5)
svm_search.fit(x_train, y_train)

best_svm = svm_search.best_estimator_
print(f"Best SVM Parameters: {svm_search.best_params_}")

# 2) Prediction with the Optimal Model
# GridSearchCV (refit=True by default) has already retrained best_estimator_ on the full training set
y_pred_svm = best_svm.predict(x_test)

# 3) Evaluation
accuracy_svm = accuracy_score(y_test, y_pred_svm) * 100
f1_svm_score = f1_score(y_test, y_pred_svm, average='macro')

print(f"\nSVM Accuracy: {round(accuracy_svm, 2)}%")
print(f"SVM F1 Score: {round(f1_svm_score, 2)}")

# 4) Identifying Most Impactful Factors (using Linear Kernel coefficients)
# If the best kernel is linear, we can extract feature importance directly
if svm_search.best_params_['kernel'] == 'linear':
    importance = best_svm.coef_[0]
    feature_names = df.columns[:-1]
    sorted_idx = np.argsort(importance)
    
    print("\nTop 5 Factors for Success:")
    # Largest positive coefficients, strongest first
    for i in sorted_idx[-5:][::-1]:
        print(f"- {feature_names[i]}")
        
    print("\nTop 5 Factors for Failure:")
    for i in sorted_idx[:5]:
        print(f"- {feature_names[i]}")

# 5) Confusion Matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)
plt.figure(figsize=(6,4))
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Purples')
plt.title('SVM Confusion Matrix')
plt.show()

# Save results for final comparison
yt_svm, yp_svm = y_test, y_pred_svm

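`coef_` is only available for the linear kernel, so the importance step above is skipped whenever an RBF or polynomial kernel wins the grid search. A kernel-agnostic alternative is permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below runs on synthetic data; the feature layout and the `SVC` settings are illustrative assumptions, not the notebook's tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: with shuffle=False the 3 informative features come first
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# coef_ is not available for the RBF kernel
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(Xd_train, yd_train)

# Shuffle each feature 20 times and record the mean drop in test accuracy
result = permutation_importance(rbf_svm, Xd_test, yd_test,
                                n_repeats=20, random_state=0)

# Print features from most to least important
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```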
# Final Comparison of Algorithms <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Prepared by: Sai Shankar Sutar, Yash Vardan Rathi, Aditya Saxena, and Aviral Bhardwaj</h5>

Now that we have trained and tuned **Logistic Regression**, **KNN**, and **SVM**, we will perform a side-by-side comparison. To determine the "winner" for our educational forecasting framework, we evaluate:

- **Accuracy %**: Which model correctly predicts the most outcomes?
- **F1 Score**: Which model best balances precision and recall?
- **ROC Score**: Which model has the best diagnostic ability to distinguish between passing and failing students?

We use a custom comparison function to aggregate these metrics into a single table and plot the ROC curves together.

In [None]:
# Function to compare the three classifiers performances
def compare_lg_knn_svm(yt_knn, yp_knn, yt_lg, yp_lg, yt_svm, yp_svm):
    # Calculate F1 scores
    f1_lg = round(f1_score(yt_lg, yp_lg, average='macro')*100)
    f1_knn = round(f1_score(yt_knn, yp_knn, average='macro')*100)
    f1_svm = round(f1_score(yt_svm, yp_svm, average='macro')*100)
    
    # Calculate Accuracy scores
    acc_lg = round(accuracy_score(yt_lg, yp_lg)*100)
    acc_knn = round(accuracy_score(yt_knn, yp_knn)*100)
    acc_svm = round(accuracy_score(yt_svm, yp_svm)*100)
    
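    # Calculate ROC AUC scores
    # Note: these use hard 0/1 predictions, so each curve has a single operating point
    # and the AUC equals balanced accuracy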
    roc_c_lg = round(roc_auc_score(yt_lg, yp_lg)*100)
    roc_c_knn = round(roc_auc_score(yt_knn, yp_knn)*100)
    roc_c_svm = round(roc_auc_score(yt_svm, yp_svm)*100)
    
    # Display Metrics Table
    print('-----------------------------Table of Metrics--------------------------------------\n')
    data_rows = [
        ('F1 Score %', f1_lg, f1_knn, f1_svm),
        ('Accuracy %', acc_lg, acc_knn, acc_svm),
        ('ROC Score %', roc_c_lg, roc_c_knn, roc_c_svm)
    ]
    t = Table(rows=data_rows, names=('Metric', 'Logistic Regression', 'KNN', 'SVM'))
    print(t)
    
    # Plotting Combined ROC Curves
    plt.figure(figsize=(10, 7))
    plt.plot([0, 1], [0, 1], 'k--')
    
    # Logistic Regression ROC
    fpr_lg, tpr_lg, _ = roc_curve(yt_lg, yp_lg)
    plt.plot(fpr_lg, tpr_lg, label=f'Logistic Regression (AUC = {roc_c_lg}%)')
    
    # KNN ROC
    fpr_knn, tpr_knn, _ = roc_curve(yt_knn, yp_knn)
    plt.plot(fpr_knn, tpr_knn, label=f'KNN (AUC = {roc_c_knn}%)')
    
    # SVM ROC
    fpr_svm, tpr_svm, _ = roc_curve(yt_svm, yp_svm)
    plt.plot(fpr_svm, tpr_svm, label=f'SVM (AUC = {roc_c_svm}%)')
    
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Comparison of ROC Curves')
    plt.legend(loc='lower right')
    plt.show()

# Run final comparison
compare_lg_knn_svm(yt_knn, yp_knn, yt_lg, yp_lg, yt_svm, yp_svm)

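`roc_curve` and `roc_auc_score` are designed for continuous scores. When they receive hard 0/1 predictions, as in the comparison function above, each curve collapses to a single operating point and the AUC equals balanced accuracy. The sketch below shows a score-based variant using `predict_proba` on synthetic data; the data and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)

hard_labels = clf.predict(Xd_test)             # 0/1 predictions
pass_proba = clf.predict_proba(Xd_test)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(yd_test, hard_labels)
auc_from_scores = roc_auc_score(yd_test, pass_proba)

# Hard labels give only 3 points on the curve: (0,0), one operating point, (1,1)
fpr_hard, tpr_hard, _ = roc_curve(yd_test, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(yd_test, pass_proba)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"Curve points: {len(fpr_hard)} (labels) vs {len(fpr_soft)} (probabilities)")
```

In this notebook, the analogous scores would come from `logisticRegr.predict_proba(x_test)[:, 1]`, `knn_final.predict_proba(x_test)[:, 1]`, and `best_svm.predict_proba(x_test)[:, 1]` (the SVM is already built with `probability=True`).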
# Conclusion <h5 style='color:red;font-family:cursive;font-size:4.5mm'>Final Summary by the Team</h5>

Improving the educational system is a priority that requires moving beyond traditional evaluation. In this project, **"Enhancing Student Success Forecasting,"** we successfully implemented a pipeline to process student data and evaluate predictive models.

### **Key Findings:**
1. **Model Performance**: The **Support Vector Machine (SVM)** emerged as the winner, providing the most robust results with an accuracy of **84%**.
2. **Impactful Factors**: We identified that student success is not just about past grades but is heavily influenced by **parents' education level**, **study time**, and **internet accessibility**.
3. **Actionable Insights**: By identifying students at risk early, educational administrators can deploy technology-driven interventions, such as targeted tutoring or family counseling, to maintain educational quality and mitigate failure rates.

As student engineers at **SRMIST**, we believe these Optimized Ensemble Learning techniques offer a scalable solution for modern education challenges.
