This project is designed to predict student performance in math, reading, and writing based on various demographic and educational factors such as gender, parental level of education, race/ethnicity, lunch type, and test preparation course. The project leverages Python, data processing pipelines, and machine learning algorithms to build predictive models.
The main goal of this project is to transform raw student performance data and build machine learning models that can predict a student's scores in math, reading, or writing based on their demographic and educational backgrounds. The project uses feature engineering, data preprocessing techniques, and machine learning algorithms like linear regression and random forests to accomplish these predictions.
- Data Transformation Pipelines: Includes preprocessing steps such as missing value imputation, one-hot encoding for categorical variables, and scaling for numerical features.
- Machine Learning Models: Regression models like Linear Regression, Random Forest, and Gradient Boosting to predict student scores.
- Exploratory Data Analysis (EDA): Visualization and analysis of relationships between features and target variables (student scores).
- Automated Pipelines: Uses Scikit-Learn's `Pipeline` and `ColumnTransformer` to streamline the preprocessing and modeling process.
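
The preprocessing and modeling steps listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the column names, the median and most-frequent imputation strategies, the choice of `math_score` as the target, and the synthetic data from the hypothetical `make_toy_data` helper are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; the real dataset's schema may differ.
CATEGORICAL = ["gender", "race_ethnicity", "parental_level_of_education",
               "lunch", "test_preparation_course"]
NUMERICAL = ["reading_score", "writing_score"]
TARGET = "math_score"


def build_pipeline():
    """Impute, encode, and scale the features, then fit a linear regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERICAL),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


def make_toy_data(n=200, seed=0):
    """Generate synthetic rows (hypothetical helper, not real student data)."""
    rng = np.random.default_rng(seed)
    reading = rng.normal(70, 10, n)
    writing = reading + rng.normal(0, 5, n)
    math = 0.5 * reading + 0.4 * writing + rng.normal(0, 3, n)
    df = pd.DataFrame({
        "gender": rng.choice(["female", "male"], n),
        "race_ethnicity": rng.choice(["group A", "group B", "group C"], n),
        "parental_level_of_education": rng.choice(["high school", "bachelor's degree"], n),
        "lunch": rng.choice(["standard", "free/reduced"], n),
        "test_preparation_course": rng.choice(["none", "completed"], n),
        "reading_score": reading,
        "writing_score": writing,
        "math_score": math,
    })
    # Blank out a few numeric values so the imputation step has work to do.
    df.loc[df.index[:5], "reading_score"] = np.nan
    return df


if __name__ == "__main__":
    data = make_toy_data()
    features = data[CATEGORICAL + NUMERICAL]
    pipeline = build_pipeline()
    pipeline.fit(features, data[TARGET])
    print("Training R^2:", pipeline.score(features, data[TARGET]))
```

Because the estimator is the last step of the `Pipeline`, swapping `LinearRegression()` for `RandomForestRegressor()` or `GradientBoostingRegressor()` from `sklearn.ensemble` leaves the preprocessing untouched.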
- Programming Languages: Python
- Libraries:
  - Data Manipulation: Pandas, NumPy
  - Data Visualization: Matplotlib, Seaborn
  - Machine Learning: Scikit-Learn, XGBoost
  - Others: OS, Logging, Dataclasses
- Tools:
  - Git, GitHub
  - Jupyter Notebooks (for EDA and development)
  - Visual Studio Code