ybacoder/project-3

PROJECT 3: Understanding Student Success

Project Team

Project Goal

Our goal was to use survey data gathered by the National Center for Education Statistics (NCES) to develop a model that predicts student success in high school. More specifically, the input parameters to the model are the responses to a series of questions posed to the student (e.g., student's demographics, socioeconomic status, habits, and goals), and the model result is the student's predicted range of final grade point average (GPA) upon graduation from high school.

Machine Learning Model

To determine the most suitable model for our dataset, we trained and tested logistic regression, decision tree classifier, random forest classifier, and multi-layer perceptron classifier models with multiple under- and over-sampling methods for imbalanced data; we also employed several imbalanced-learn classifiers. We developed these models using the Python packages scikit-learn and imbalanced-learn. Although the models did not yield a high degree of correlation between the input parameters and the results, they did perform better than the null model. The final model selected was a Random Forest Classifier with imbalanced-learn Random Over Sampling, along with an iterative imputer to fill in missing values in the data. An excerpt of our code along with the imbalanced-learn classification report is provided below.

Data Source for Model

As mentioned, our model is based on data provided by the NCES; the longitudinal study that we selected to obtain data for our model is the Education Longitudinal Study of 2002 (ELS, 2002). ELS (2002) is the fourth study in a series of school-based longitudinal studies by NCES. It is a nationally representative, longitudinal study of 10th graders in 2002 and 12th graders in 2004. The goal of the study is to follow the students' trajectories from the beginning of high school into postsecondary education, the workforce, and beyond. Within the NCES dataset, our model focuses on high school statistics. After cleaning the NCES dataset, our model training dataset contained almost 15,000 rows (i.e., students) with 32 columns (31 input parameters and one result parameter, GPA). This cleaned dataset can be accessed via our web app (see below). [Note: the dataset also contains two more columns, one indicating whether the student graduated or was on track to graduate high school (or equivalent) and another indicating whether the student ever dropped out of high school.]

Predictive Questionnaire

As part of the NCES studies, students are asked to answer a broad and extensive list of questions. We reviewed the NCES questionnaire and selected 31 questions that we deemed most likely to be correlated with a student's final high school GPA. Our app reproduces these questions, serves the responses to a model, and then returns a prediction of the student's final high school GPA range (as well as the equivalent letter grade) and the predicted probability of that GPA range.
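As a rough illustration of that last step, the sketch below turns a vector of class probabilities into a response with the same three fields as the example JSON response in the API section. It is an assumption about how such a mapping could look, not the app's actual code: `format_prediction` is a hypothetical helper, the neighbouring GPA bins and the probabilities are invented for the example, and the only range-to-letter pair used is the one that appears in the documented response ("1.51 - 2.00" maps to "C").

```python
from typing import Mapping, Sequence


def format_prediction(
    classes: Sequence[str],
    probabilities: Sequence[float],
    letter_grades: Mapping[str, str],
) -> dict:
    """Pick the most probable GPA range and package it as a response dict."""
    # Index of the class with the highest predicted probability.
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    gpa_range = classes[best]
    return {
        "gpa_range": gpa_range,
        # Fall back to "unknown" if no letter grade is defined for the range.
        "equivalent_letter_grade": letter_grades.get(gpa_range, "unknown"),
        "probability_of_gpa_range": round(float(probabilities[best]), 2),
    }


# Hypothetical inputs: illustrative bins and probabilities only.
classes = ["1.01 - 1.50", "1.51 - 2.00", "2.01 - 2.50"]
probabilities = [0.10, 0.72, 0.18]
letter_grades = {"1.51 - 2.00": "C"}

result = format_prediction(classes, probabilities, letter_grades)
print(result)
```

With a scikit-learn classifier, `classes` would typically come from the fitted model's `classes_` attribute and `probabilities` from one row of `predict_proba`.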

Final Thoughts

Ultimately, there are a multitude of factors that can contribute to a student's success in high school and their final GPA. In selecting 31 parameters that we deemed most likely to affect a student's overall final GPA range, our hope was to develop a more accurate model. However, there are clearly many more factors that affect a student's GPA that are not captured in our model's dataset or in the NCES dataset.

Using the App

App Deployment

The Student Success web app is deployed online. It can be found here!

Run App Locally

pipenv install
pipenv shell
python app.py

Access Our API Directly

You may access our API directly and get student success prediction results by submitting a POST request to https://student-success-owz2yc537q-uc.a.run.app/predict. Please review our input parameters keyfile and the example input and output JSONs below.

Example Body to Submit as JSON
{
    "BYSEX": 0,
    "BYRACE": 0,
    "BYSTLANG": 0,
    "BYPARED": 1,
    "BYINCOME": 1,
    "BYURBAN": 0,
    "BYREGION": 3,
    "BYRISKFC": 1,
    "BYS34A": 1,
    "BYS34B": 0,
    "BYWRKHRS": 1,
    "BYS42": 0,
    "BYS43": 1,
    "BYTVVIGM": 0,
    "BYS46B": 0,
    "BYS44C": 1,
    "BYS20E": 2,
    "BYS87C": 4,
    "BYS20D": 1,
    "BYS23C": 0,
    "BYS37": 3,
    "BYS27I": 2,
    "BYS90D": 2,
    "BYS38A": 2,
    "BYS20J": 3,
    "BYS24C": 2,
    "BYS24D": 1,
    "BYS54I": 3,
    "BYS84D": 0,
    "BYS84I": 1,
    "BYS85A": 1
}
Example JSON Response
{
    "gpa_range": "1.51 - 2.00",
    "equivalent_letter_grade": "C",
    "probability_of_gpa_range": 0.72
}
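
The request above can be sketched in Python using only the standard library. This is a minimal example under a few assumptions: that the endpoint accepts a `Content-Type: application/json` header, that the service is still reachable at the URL given above, and that the helper names `build_request` and `predict` (which are not part of the project) are acceptable stand-ins. The network call itself is left commented out so the snippet can be run offline.

```python
import json
import urllib.request

API_URL = "https://student-success-owz2yc537q-uc.a.run.app/predict"

# Same 31 answers as the example body above.
example_body = {
    "BYSEX": 0, "BYRACE": 0, "BYSTLANG": 0, "BYPARED": 1, "BYINCOME": 1,
    "BYURBAN": 0, "BYREGION": 3, "BYRISKFC": 1, "BYS34A": 1, "BYS34B": 0,
    "BYWRKHRS": 1, "BYS42": 0, "BYS43": 1, "BYTVVIGM": 0, "BYS46B": 0,
    "BYS44C": 1, "BYS20E": 2, "BYS87C": 4, "BYS20D": 1, "BYS23C": 0,
    "BYS37": 3, "BYS27I": 2, "BYS90D": 2, "BYS38A": 2, "BYS20J": 3,
    "BYS24C": 2, "BYS24D": 1, "BYS54I": 3, "BYS84D": 0, "BYS84I": 1,
    "BYS85A": 1,
}


def build_request(body: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the answers as a JSON body."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def predict(body: dict, url: str = API_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(body, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Uncomment to call the live service:
# print(predict(example_body))
```

If the call succeeds, `predict` should return a dictionary with the `gpa_range`, `equivalent_letter_grade`, and `probability_of_gpa_range` keys shown in the example response.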

Repo Overview

  1. /data contains downloaded data from the NCES Codebook.

  2. The Jupyter notebooks contain our code to import the NCES ELS (2002) data, clean the data, and test various machine learning models.

    • "clean_student_data.csv" is the final cleaned version of the data set we used to train our model.

    • "rus_clf.joblib" is a joblib dump of our final deployed model: Random Forest Classifier with Imbalanced Learn Random Over Sampling.

    • "project-3.postman_collection.json" contains several example requests to test our app.