A machine learning project to predict student academic performance and identify key factors influencing success, helping educational institutions provide proactive support.
This project addresses the challenge of identifying at-risk students by building predictive models based on demographic, social, and academic data. The goal is to provide educational institutions with a data-driven tool for early intervention. The project involves two main tasks:
- Regression: Predicting a student's final numeric grade.
- Classification: Predicting whether a student will pass or fail.

Key features:
- Data Exploration (EDA): In-depth analysis of student data to uncover initial trends and correlations.
- Data Preprocessing: A complete pipeline for cleaning data, encoding categorical variables, and scaling features.
- Dual-Task Modeling: Implements both regression and classification models to provide a comprehensive performance analysis.
- Performance Evaluation: Uses a wide range of metrics (R², RMSE, Accuracy, Precision, Recall, F1-Score) for robust model assessment.
- Feature Importance Analysis: Identifies the key drivers of academic success from the dataset.
The project follows a standard machine learning workflow:
- Data Understanding: The UCI Student Performance Dataset was used. Initial analysis was performed to understand its structure, quality, and statistical properties.
- Exploratory Data Analysis (EDA): Visualizations such as histograms and a correlation heatmap were created to identify relationships between variables, especially their impact on the final grade (`G3`).
- Data Preprocessing:
  - A binary `pass_fail` feature was engineered from the `G3` grade.
  - Categorical features were converted to a numerical format using one-hot encoding.
  - All features were scaled using `StandardScaler` to prepare the data for modeling.
- Model Building:
  - Regression Task: A Multiple Linear Regression model was trained to predict the final grade.
  - Classification Task: Logistic Regression and Decision Tree models were trained to predict the pass/fail outcome.
- Model Evaluation: The models were evaluated on a held-out test set (20% of the data) to estimate how well they generalize to unseen students.
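
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a small synthetic stand-in rather than the UCI file, the pass mark of 10 is an assumption (the threshold is not stated here), and names such as `PASS_MARK` and `X_train_s` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data (hypothetical values, not the UCI file).
rng = np.random.default_rng(42)
n = 200
g2 = rng.integers(0, 20, size=n)
df = pd.DataFrame({
    "school": rng.choice(["GP", "MS"], size=n),
    "absences": rng.integers(0, 30, size=n),
    "G2": g2,
    "G3": np.clip(g2 + rng.integers(-2, 3, size=n), 0, 20),
})

# Assumption: a final grade of 10 or more (out of 20) counts as a pass.
PASS_MARK = 10
df["pass_fail"] = (df["G3"] >= PASS_MARK).astype(int)

# One-hot encode categorical columns; keep both targets out of the features.
X = pd.get_dummies(df.drop(columns=["G3", "pass_fail"]), dtype=float)
y_grade = df["G3"]
y_pass = df["pass_fail"]

# Hold out 20% of the rows as an unseen test set.
X_train, X_test, yg_train, yg_test, yp_train, yp_test = train_test_split(
    X, y_grade, y_pass, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# One regression model and two classifiers, as described in the list above.
reg = LinearRegression().fit(X_train_s, yg_train)
log_clf = LogisticRegression(max_iter=1000).fit(X_train_s, yp_train)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train_s, yp_train)

print("R^2 (regression):", round(reg.score(X_test_s, yg_test), 3))
print("Accuracy (logistic):", round(log_clf.score(X_test_s, yp_test), 3))
print("Accuracy (tree):", round(tree_clf.score(X_test_s, yp_test), 3))
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model, so the held-out scores stay an honest estimate.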
On the held-out test set, the models achieved the following scores:
Model / Task | Metric | Score |
---|---|---|
Linear Regression | R² Score | 0.72 |
Classification Models (Logistic & Decision Tree) | Accuracy | 90% |
Classification Models (Logistic & Decision Tree) | F1-Score | 0.92 |
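
For reference, metrics of this kind can be computed with scikit-learn along the following lines. The predictions below are hypothetical toy values chosen to keep the arithmetic easy to check; they are not the project's results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression metrics on hypothetical final grades.
grade_true = [10, 12, 8, 15]
grade_pred = [11, 12, 9, 13]
r2 = r2_score(grade_true, grade_pred)
# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(grade_true, grade_pred))

# Classification metrics on hypothetical pass (1) / fail (0) labels.
pass_true = [1, 1, 1, 0, 0, 1, 0, 1]
pass_pred = [1, 1, 0, 0, 1, 1, 0, 1]
accuracy = accuracy_score(pass_true, pass_pred)
precision = precision_score(pass_true, pass_pred)
recall = recall_score(pass_true, pass_pred)
f1 = f1_score(pass_true, pass_pred)

print(f"R^2={r2:.3f} RMSE={rmse:.3f}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```

Reporting precision and recall alongside accuracy matters when passes and fails are unevenly represented, since accuracy alone can look good for a model that rarely predicts the minority class.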
Key Insight: The feature importance analysis revealed that a student's second-period grade (`G2`) is by far the most significant predictor of the final academic outcome, accounting for over 70% of the total feature importance in the model. Other important factors include the mother's education (`Medu`), student absences, and social habits.
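
A feature importance ranking like this can be read from a fitted tree model. The sketch below assumes the importances come from the Decision Tree classifier (the model is not named here) and uses synthetic data in which the outcome is built to depend on `G2`, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic features (hypothetical values); only G2 influences the label.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "G2": rng.integers(0, 20, size=n),
    "Medu": rng.integers(0, 5, size=n),
    "absences": rng.integers(0, 30, size=n),
})
g3 = X["G2"] + rng.integers(-1, 2, size=n)
y = (g3 >= 10).astype(int)  # assumed pass mark of 10

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per column, summing to 1.
importances = (
    pd.Series(tree.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances describe how the fitted tree splits the data, not causal effects, so a dominant `G2` mostly reflects how close the second-period grade is to the final one.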
To run this project on your local machine, follow these steps:
- Clone the repository: run `git clone https://github.com/YourUsername/YourRepositoryName.git`, then `cd YourRepositoryName`.
- Create a virtual environment (recommended): run `python -m venv venv`, then `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`).
- Install the required libraries: Create a `requirements.txt` file listing the following packages, one per line: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `jupyter`. Then run the installation command `pip install -r requirements.txt`.
- Download the dataset:
  - Download the `student-mat.csv` file from the UCI repository.
  - Place the `student-mat.csv` file in the root directory of the project.
- Launch Jupyter Notebook: run `jupyter notebook`, then open the `.ipynb` notebook file and run the cells.
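
Once `student-mat.csv` is in the project root, it can be loaded along these lines. This is a small sketch: the helper name `load_student_data` is hypothetical, and it relies on the UCI file being semicolon-separated, so adjust `sep` if your copy differs.

```python
from pathlib import Path

import pandas as pd

DATA_PATH = "student-mat.csv"  # expected in the project root


def load_student_data(source=DATA_PATH):
    """Read the student CSV and check that the grade columns are present."""
    # The UCI file separates fields with semicolons rather than commas.
    df = pd.read_csv(source, sep=";")
    missing = {"G1", "G2", "G3"} - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df


# Only attempt the load when the file has actually been downloaded.
if Path(DATA_PATH).exists():
    students = load_student_data()
    print(students.shape)
    print(students[["G1", "G2", "G3"]].describe())
```

If the file is read with the default comma separator, pandas returns a single wide column, which the column check above turns into a clear error.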
Technology | Description |
---|---|
Python | Core programming language for the project. |
Pandas | Data manipulation and analysis library. |
NumPy | For numerical operations and array handling. |
Scikit-learn | For building and evaluating machine learning models. |
Matplotlib & Seaborn | For data visualization and creating plots. |
Jupyter Notebook / Colab | For interactive development and documentation. |