# 🎓 Student Performance Prediction
Predicting academic outcomes using behavioral, demographic, and academic features.

## 📂 Overview
This project analyzes a dataset of high school students to predict final academic performance (grades A to F) using machine learning. The dataset includes features like parental education, study habits, attendance, and extracurricular participation.

## 🎯 Objective
Build a multiclass classification model to predict `GradeClass` (A–F) and identify key drivers of academic performance to support data-driven interventions.

## 🛠️ Tools & Technologies
- Python  
- Pandas, NumPy  
- Scikit-learn  
- Matplotlib, Seaborn  
- SMOTE (from imbalanced-learn)

## 🧪 Methodology
### 1. Data Cleaning & Preprocessing
- Dropped `StudentID` (an identifier with no predictive value) and `GPA` (grade classes are derived from GPA, so keeping it would leak the target)  
- Handled class imbalance using SMOTE  
- Scaled numeric features (`Absences`, `StudyTimeWeekly`) with `StandardScaler`
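
To make the SMOTE step less of a black box, here is a minimal from-scratch sketch of its core idea: a synthetic sample is placed somewhere on the line segment between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_like_sample` and the toy rows are hypothetical and used only for illustration. This is not the imbalanced-learn implementation, which handles many more details.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (toy illustration)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical scaled (Absences, StudyTimeWeekly) rows for a rare grade class.
minority_rows = [[-1.2, 1.5], [-1.0, 1.1], [-0.8, 1.8], [-1.4, 1.3]]
synthetic = smote_like_sample(minority_rows, k=2, rng=random.Random(42))
print(synthetic)
```

Because the new point is an interpolation, it always stays inside the range spanned by the existing minority rows, which is why SMOTE is usually applied after scaling and only to training data.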

In [1]:
import pandas as pd
df = pd.read_csv("student_data.csv")  # hypothetical filename; point this at your copy of the dataset
df.isnull().sum()  # Our dataset has no missing values.

In [None]:
df.drop(['StudentID', 'GPA'], axis=1, inplace=True)

df.columns

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

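# Assumes GradeClass codes 0-4 correspond to grades A, B, C, D, F, in that order.
labels = ["A", "B", "C", "D", "F"]
ticks = range(len(labels))

sns.countplot(data=df, x='GradeClass', hue='GradeClass')
plt.xticks(ticks, labels)  # show letter grades instead of numeric codes
plt.title('Distribution of Grade Classes')
plt.show()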


In [None]:
grade_counts = df['GradeClass'].value_counts().sort_index()

# Create labels and colors
labels = ['Grade 0', 'Grade 1', 'Grade 2', 'Grade 3', 'Grade 4']


# Plot pie chart
plt.figure(figsize=(5,5))
plt.pie(grade_counts,
        autopct='%1.1f%%', 
        labels = labels,
        colors = sns.color_palette("pastel"),
        startangle=90 )      

plt.title('Grade Distribution')
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(), cmap='plasma', vmin=-1, vmax=1)
plt.show()


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

num_columns =  ['Absences', 'StudyTimeWeekly']
df[num_columns] = scaler.fit_transform(df[num_columns]) #scaling numerical columns.


In [None]:
%pip install imbalanced-learn


In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('GradeClass', axis=1)
y = df['GradeClass']

# Split first so the test set contains only real, untouched student records.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE  # Synthetic Minority Over-sampling Technique

# Oversample the training set only: this balances the grade classes (so the
# majority class does not dominate predictions) without leaking synthetic rows
# into evaluation.
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

### 2. Exploratory Data Analysis (EDA)
- Used countplots, pie charts, and correlation heatmaps  
- Explored relationships between support systems, study habits, and outcomes

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)  # fixed seed for reproducible trees
dtree.fit(X_train,y_train)
y_pred_dtree = dtree.predict(X_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

models = {'Tree': y_pred_dtree, 'KNN': y_pred_knn}
for name, pred in models.items():
    print(f"== {name} ==")
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))

In [None]:
import pandas as pd

# Rank features by the decision tree's impurity-based importance.
importances = pd.Series(dtree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

plt.figure(figsize=(7,5))
importances.head(10).plot(kind='barh', color='orange')
plt.title("Top 10 Important Features - Decision Tree")
plt.xlabel("Feature Importance Score")
plt.gca().invert_yaxis()  # put the most important feature at the top
plt.tight_layout()
plt.show()

### 3. Modeling
- Trained classification models  
- Split dataset (70% train / 30% test)  
- Evaluated using accuracy, precision, recall, and confusion matrix
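
As a rough illustration of how the evaluation metrics listed above relate to each other, the sketch below derives accuracy and per-class precision and recall from (true, predicted) label pairs, which are the cells of a confusion matrix. The helper `per_class_report` and the toy labels are hypothetical. In the notebook itself, `classification_report` and `confusion_matrix` from scikit-learn do this work.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return overall accuracy and per-class precision/recall."""
    labels = sorted(set(y_true) | set(y_pred))
    # Each (true, predicted) pair count is one cell of the confusion matrix.
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = pairs[(c, c)]
        predicted = sum(pairs[(t, c)] for t in labels)  # column total
        actual = sum(pairs[(c, p)] for p in labels)     # row total
        report[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    return accuracy, report

# Toy labels (not results from this project).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
accuracy, report = per_class_report(y_true, y_pred)
print(accuracy)
print(report)
```

Per-class recall is the number to watch for minority grade classes, since overall accuracy can look good even when a rare class is almost never predicted.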

## 📊 Results
- Achieved an accuracy of XX% (*update with actual result*) on test data  
- SMOTE improved performance on minority classes like Grade A and F  
- Found strong correlation between study hours and grade class

## 🧠 Key Insights
- Parental education and tutoring had a significant impact on high performance  
- Students who took part in extracurricular activities and had fewer absences tended to score better  
- Balancing classes improved the model’s fairness across all grade groups

## 🔒 Data Privacy, Security & Ethical Concerns
- **Privacy concern:** Retaining identifiers such as `StudentID`, or recombining features, can allow individual students to be re-identified. Academic results could then be traced back to specific students, leading to stigmatization instead of constructive support.
- **Ethical concern:** If the data used for modeling is not representative (for example, if certain races or groups are underrepresented), the model may learn biased patterns and produce unfair predictions, reinforcing inequality rather than promoting fairness.

### Possible Solutions
- **Anonymization of student IDs:** Replace identifiable student information (such as `StudentID`) with randomly generated codes that cannot be traced back to individual students.
- **Fairness in modeling:** Apply techniques such as SMOTE (Synthetic Minority Oversampling Technique) to address class imbalance, and check that underrepresented student groups are fairly represented in the data and in the model's predictions.
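
One simple way to check the fairness concern is to compare model accuracy across student groups. The sketch below is an illustration only: the helper `accuracy_by_group`, the group names `g1` and `g2`, and the toy labels are all hypothetical, and in practice the group values would come from a demographic column in the dataset.

```python
from collections import defaultdict

def accuracy_by_group(groups, y_true, y_pred):
    """Return the share of correct predictions within each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}

# Toy data (not results from this project): group g2 is underrepresented.
groups = ["g1", "g1", "g1", "g1", "g2", "g2"]
y_true = [0, 1, 2, 4, 3, 4]
y_pred = [0, 1, 2, 4, 3, 1]
by_group = accuracy_by_group(groups, y_true, y_pred)
print(by_group)
```

A large gap between groups is a signal to look more closely at how each group is represented in the training data before the model is used to guide interventions.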

## 👥 Contributors
- Hafsa Yahya  
- Naiserian Kyama  
- Eryca Wacuka  
- Mercy Waruguru  
- **Salima Ali Zeid**