This project implements a machine learning system to predict student performance in mathematics based on various demographic and academic factors. The system uses multiple regression algorithms to find the best-performing model for predicting student math scores.
- Data ingestion pipeline with automatic train-test splitting
- Advanced data preprocessing for both numerical and categorical features
- Multiple regression models comparison (Random Forest, XGBoost, CatBoost, etc.)
- Automated model selection based on R2 score
- Web interface for real-time predictions
- Comprehensive logging and error handling
- Python 3.10+
- Scikit-learn
- Pandas
- NumPy
- XGBoost
- CatBoost
- Flask
- HTML/CSS
Student-Performance-Prediction-System/
├── artifacts/
├── notebook/
│   └── data/
│       └── student_data.csv
├── src/
│   ├── components/
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   └── model_trainer.py
│   ├── pipeline/
│   │   ├── predict_pipeline.py
│   │   └── train_pipeline.py
│   ├── utils.py
│   ├── logger.py
│   └── exception.py
├── templates/
│   ├── home.html
│   └── index.html
├── app.py
├── setup.py
└── requirements.txt
- Gender
- Race/Ethnicity
- Parental Level of Education
- Lunch Type
- Test Preparation Course
- Reading Score
- Writing Score
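As a rough illustration, the input features listed above could be gathered into a single record before being handed to the prediction pipeline. The class name `StudentRecord`, the field names, and the sample values are assumptions made for this sketch and may not match the column names used in the dataset or in `predict_pipeline.py`.

```python
from dataclasses import asdict, dataclass


@dataclass
class StudentRecord:
    """Hypothetical container for one student's input features."""

    gender: str
    race_ethnicity: str
    parental_level_of_education: str
    lunch: str
    test_preparation_course: str
    reading_score: float
    writing_score: float

    def to_row(self) -> dict:
        # A plain dict is easy to turn into a one-row table later on.
        return asdict(self)


# Sample values are illustrative only.
record = StudentRecord(
    gender="female",
    race_ethnicity="group B",
    parental_level_of_education="bachelor's degree",
    lunch="standard",
    test_preparation_course="none",
    reading_score=72.0,
    writing_score=74.0,
)
row = record.to_row()
```

The math score is deliberately absent from the record, since it is the value the system predicts.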
- Clone the repository:
git clone <repository-url>
cd Student-Performance-Prediction-System
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # For Linux/Mac
venv\Scripts\activate # For Windows
- Install dependencies:
pip install -r requirements.txt
- Run the training pipeline:
python src/pipeline/train_pipeline.py
- Start the Flask application:
python app.py
- Training:
  - The system automatically handles data preprocessing
  - Trains multiple models and selects the best performer
  - Saves the model and preprocessor in the artifacts directory
- Prediction:
  - Access the web interface at http://localhost:5000
  - Input student details
  - Get predicted math score
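The step of saving the model and preprocessor during training and reloading them for prediction might look roughly like the sketch below. It assumes the objects are serialized with the standard-library `pickle` module. The helper names `save_object` and `load_object` are hypothetical, and the actual code in `src/utils.py` may differ.

```python
import pickle
from pathlib import Path


def save_object(file_path, obj):
    """Serialize obj to file_path, creating parent directories if needed."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_object(file_path):
    """Load a previously pickled object from file_path."""
    with Path(file_path).open("rb") as file:
        return pickle.load(file)
```

With helpers like these, training would call `save_object` once for the fitted preprocessor and once for the best model, and the prediction pipeline would call `load_object` on the same paths. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.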
The system trains and compares the following models:
- Random Forest Regressor
- Decision Tree Regressor
- Gradient Boosting Regressor
- Linear Regression
- XGBoost Regressor
- CatBoost Regressor
- AdaBoost Regressor
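A minimal sketch of such a comparison is shown below. It uses synthetic data rather than the student dataset and includes only the scikit-learn models, leaving out XGBoost and CatBoost to keep dependencies small. The variable names and hyperparameters are assumptions, not the settings used in `model_trainer.py`.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

# Fit every model and record its R2 score on the held-out split.
report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    report[name] = r2_score(y_test, model.predict(X_test))

best_name = max(report, key=report.get)
print(f"Best model: {best_name} (R2 = {report[best_name]:.3f})")
```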
- Numerical Features:
  - Median imputation for missing values
  - Standard scaling
- Categorical Features:
  - Most frequent imputation for missing values
  - One-hot encoding
  - Standard scaling
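The preprocessing steps above can be sketched with scikit-learn pipelines as follows. The column names and the tiny sample frame are assumptions for illustration. Passing `with_mean=False` to the categorical scaler is also an assumption, chosen here because one-hot encoding typically yields a sparse matrix that cannot be mean-centered. The real `data_transformation.py` may be configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; only a subset of the features is used here.
numerical_columns = ["reading_score", "writing_score"]
categorical_columns = ["gender", "lunch"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    # Scale without centering so sparse one-hot output is accepted.
    ("scaler", StandardScaler(with_mean=False)),
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, numerical_columns),
    ("cat", cat_pipeline, categorical_columns),
])

# Small illustrative frame with one missing value of each kind.
df = pd.DataFrame({
    "gender": ["female", "male", np.nan, "female"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "reading_score": [72.0, 69.0, np.nan, 90.0],
    "writing_score": [74.0, 88.0, 93.0, 95.0],
})

features = preprocessor.fit_transform(df)
```

Fitting the preprocessor on the training split only, then reusing it on the test split and at prediction time, keeps information from unseen data out of the imputation and scaling statistics.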
The system uses R2 score as the primary metric for model evaluation. Models with scores below 0.6 are rejected to ensure prediction quality.
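For illustration, the R2 computation and the 0.6 cutoff could be expressed as below. This sketch assumes that "rejected" means training fails with an error when even the best model scores below the threshold. The function names are hypothetical, and the project itself may rely on `sklearn.metrics.r2_score` instead of a hand-written formula.

```python
def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def select_best_model(scores, threshold=0.6):
    """Return (name, score) of the top model, or raise if it is too weak."""
    best_name = max(scores, key=scores.get)
    best_score = scores[best_name]
    if best_score < threshold:
        raise ValueError(
            f"Best model {best_name!r} scored {best_score:.3f}, "
            f"below the required {threshold}"
        )
    return best_name, best_score
```

A perfect prediction gives an R2 of 1, while always predicting the mean of the targets gives 0, so a cutoff of 0.6 filters out models that explain well under two thirds of the variance in the scores.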
The system implements custom exception handling throughout the pipeline for better error tracking and debugging.
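One way such an exception could work is sketched below: it wraps the original error and records the file and line where it was raised. The names `PipelineError` and `describe_error` are hypothetical, and the class defined in `src/exception.py` may be structured differently.

```python
def describe_error(error):
    """Build a message with the file and line where error originated."""
    tb = error.__traceback__
    # Walk to the innermost frame, where the error was actually raised.
    while tb is not None and tb.tb_next is not None:
        tb = tb.tb_next
    if tb is None:
        return f"{type(error).__name__}: {error}"
    filename = tb.tb_frame.f_code.co_filename
    return f"{type(error).__name__} in {filename}, line {tb.tb_lineno}: {error}"


class PipelineError(Exception):
    """Hypothetical wrapper that keeps the original error and its location."""

    def __init__(self, error):
        super().__init__(describe_error(error))
        self.original = error
```

A pipeline step would then catch a low-level error with `except Exception as exc:` and re-raise it using `raise PipelineError(exc) from exc`, so the log shows where the failure began.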
Comprehensive logging is implemented for all major operations:
- Data ingestion
- Transformation
- Model training
- Predictions
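A logging setup along these lines could be built with the standard `logging` module, as in the sketch below. The function name `build_logger`, the timestamped file name, and the message format are assumptions and may not match `src/logger.py`.

```python
import logging
from datetime import datetime
from pathlib import Path


def build_logger(log_dir, name="student_performance"):
    """Return (logger, log_file) writing INFO-level messages to a new file."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / f"{datetime.now():%Y_%m_%d_%H_%M_%S}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Close and drop old handlers so repeated calls do not duplicate output.
    for old_handler in list(logger.handlers):
        old_handler.close()
        logger.removeHandler(old_handler)

    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s] %(lineno)d %(name)s - %(levelname)s - %(message)s"
        )
    )
    logger.addHandler(handler)
    return logger, log_file
```

Each pipeline stage could then call, for example, `logger.info("Data ingestion started")` after obtaining the logger with `build_logger("logs")`.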
- The dataset focuses on student performance metrics
- Thanks to all contributors and maintainers