# Dropout Prediction with LASSO-Selected Features

LASSO regression is used for dimensionality reduction: predictors with nonzero coefficients in the fitted model are carried forward to the classification models. Because the features are multicollinear, as seen in the EDA file, shrinking the feature set is needed to limit that multicollinearity. A smaller feature set also makes the models simpler to interpret and lowers their variance, at the cost of some bias.
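
As a rough illustration of why LASSO produces exact zeros, consider the special case of an orthonormal design with the objective $\frac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$. There the LASSO solution is the soft-thresholded least-squares estimate. The helper name `soft_threshold` and the toy coefficients are illustrative assumptions, not part of this analysis.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries with |z| <= lam become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# toy least-squares estimates (hypothetical values)
ols_coefs = np.array([-3.0, -0.5, 0.2, 2.0])
lasso_coefs = soft_threshold(ols_coefs, lam=1.0)

# indices of the predictors that survive the penalty
selected = np.flatnonzero(lasso_coefs)
print(lasso_coefs)
print(selected)
```

The small coefficients are removed entirely rather than merely shrunk, which is what makes the nonzero-coefficient rule usable as a feature selector.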

The performance of the LASSO regression model is measured by mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). MAPE is the average of the absolute deviation between each prediction and the true response, divided by the absolute value of the true response. The encoded response for dropout is $0$, so MAPE is undefined for those cases. scikit-learn's `mean_absolute_percentage_error` avoids dividing by zero by flooring the denominator at machine epsilon, which makes the ratio arbitrarily large rather than dividing by $1$. MAPE should therefore be read with caution under this encoding.
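
The three regression metrics can be sketched directly in NumPy. This is a minimal illustration on toy numbers, with the denominator floor mirroring the one scikit-learn applies. The function name `regression_errors` is hypothetical.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, MAE and MAPE (as a ratio, not a percentage)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    eps = np.finfo(np.float64).eps  # floor that prevents division by zero
    return {
        "MSE": np.mean((y_true - y_pred) ** 2),
        "MAE": np.mean(abs_err),
        "MAPE": np.mean(abs_err / np.maximum(np.abs(y_true), eps)),
    }

# toy labels without zeros: MAPE is well behaved
errs = regression_errors([1, 2, 2, 1], [1.5, 2.0, 1.0, 1.0])

# a single zero label with a small error dominates the average
errs_with_zero = regression_errors([0, 2], [0.1, 2.0])
print(errs, errs_with_zero["MAPE"])
```

The second call shows how one zero-valued label can push MAPE to an enormous value even when the absolute error is small.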

Both ensemble and baseline machine learning models are used to predict whether a given student dropped out of, is enrolled in, or graduated from a degree program. The ensemble models are random forest and AdaBoost. The baseline models are a decision tree, a support vector machine (SVM), k-nearest neighbors (KNN), and logistic regression.

Five-fold cross-validation is used to tune the hyperparameters of each model that has a search grid; the decision tree is fit with its default settings.
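
As a rough sketch of how five-fold cross-validation partitions the training rows, the generator below shuffles the row indices and holds out each fold once. `kfold_indices` is a hypothetical helper written for illustration; it is not scikit-learn's `KFold`, and the sample size of 23 is arbitrary.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=1):
    """Yield (train_idx, val_idx) pairs; each row is held out exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(23))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

A grid search fits every candidate setting on each training part, scores it on the matching held-out fold, and keeps the setting with the best average score.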

The performance of each classification model is measured by accuracy, precision (the probability that a point truly belongs to a class given that the model assigned it to that class), recall (the true positive rate: the probability that the model assigns a point to its class given the true classification), and F1 score (the harmonic mean of precision and recall, which is high only when both are high). Each metric was calculated with scikit-learn's `metrics` module in Python. Precision, recall, and F1 score are weighted by class frequency because the response variable is imbalanced.
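
The weighted metrics can be sketched from a confusion matrix. This is a minimal NumPy illustration on a made-up 3-class matrix; `weighted_scores` is a hypothetical helper, and classes that are never predicted are given a precision of 0 here.

```python
import numpy as np

def weighted_scores(conf):
    """conf[i, j] = number of points with true class i that were predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    support = conf.sum(axis=1)    # true count per class
    predicted = conf.sum(axis=0)  # predicted count per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    weights = support / support.sum()  # weight each class by its true frequency
    return {
        "accuracy": tp.sum() / conf.sum(),
        "precision": float(np.sum(weights * precision)),
        "recall": float(np.sum(weights * recall)),
        "f1": float(np.sum(weights * f1)),
    }

toy_confusion = [[8, 1, 1],
                 [2, 3, 5],
                 [0, 1, 19]]
scores = weighted_scores(toy_confusion)
print(scores)
```

With support-based weights, the weighted recall reduces algebraically to the overall accuracy, which is why the two values coincide in every result reported in this notebook.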

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_absolute_percentage_error, mean_squared_error, mean_absolute_error
import warnings

In [2]:
df = pd.read_csv("data.csv", delimiter=";")
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [3]:
df.drop(columns=["Previous qualification"], inplace=True)

In [4]:
categorical_cols = ["Mother's qualification", "Father's qualification", "Mother's occupation", "Father's occupation",
                    "Marital status", "Nacionality", "Application mode", "Course", "Gender", "Displaced", "Educational special needs",
                    "Debtor", "Tuition fees up to date", "Scholarship holder", "International", "Daytime/evening attendance\t"]

df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df_encoded["Target"] = df["Target"].map({"Dropout": 0, "Enrolled": 1, "Graduate": 2})

df_encoded.head()

Unnamed: 0,Application order,Previous qualification (grade),Admission grade,Age at enrollment,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),...,Course_9853,Course_9991,Gender_1,Displaced_1,Educational special needs_1,Debtor_1,Tuition fees up to date_1,Scholarship holder_1,International_1,Daytime/evening attendance\t_1
0,5,122.0,127.3,20,0,0,0,0,0.0,0,...,False,False,True,True,False,False,True,False,False,True
1,1,160.0,142.5,19,0,6,6,6,14.0,0,...,False,False,True,True,False,False,False,False,False,True
2,5,122.0,124.8,19,0,6,0,0,0.0,0,...,False,False,True,True,False,False,False,False,False,True
3,2,122.0,119.6,20,0,6,8,6,13.428571,0,...,False,False,False,True,False,False,True,False,False,True
4,1,100.0,141.5,45,0,6,9,5,12.333333,0,...,False,False,False,False,False,False,True,False,False,False


In [5]:
X = df_encoded.drop(columns=["Target"])
y = df_encoded["Target"]

# Dimension Reduction with LASSO Regression

In [6]:
np.random.seed(1)

# 5-fold cross validation
cv = KFold(n_splits=5, shuffle=True)

# standardize data
X_std = StandardScaler().fit_transform(X)

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, shuffle=True)
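
The cell above standardizes all rows before splitting, so the test rows contribute to the means and standard deviations. A hedged alternative, sketched here on synthetic data, estimates the scaling statistics from the training rows only and reuses them for the test rows. `X_toy` and the index names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical feature matrix

# 80/20 split on shuffled row indices
idx = rng.permutation(len(X_toy))
train_idx, test_idx = idx[:80], idx[80:]

# scaling statistics come from the training rows only
mu = X_toy[train_idx].mean(axis=0)
sigma = X_toy[train_idx].std(axis=0)

X_train_std = (X_toy[train_idx] - mu) / sigma
X_test_std = (X_toy[test_idx] - mu) / sigma  # reuse the training statistics

print(X_train_std.mean(axis=0).round(6), X_test_std.mean(axis=0).round(3))
```

With scikit-learn the same idea is to call `StandardScaler().fit` on the training rows and `transform` on both parts.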


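The LASSO model reports a MAPE of 8.079. scikit-learn returns MAPE as a ratio rather than a percentage, so this corresponds to roughly 808\%, not 8.079\%; the value is inflated by near-zero denominators and is not a reliable summary here. The MAE and MSE of 1.20118 and 1.70576, respectively, are high for labels that range from 0 to 2: on average, a prediction is off by more than a full label, so rounding the LASSO predictions would not give an accurate classifier. A likely contributor is that the model is fit without an intercept on standardized features, which centers the predictions near 0 while the labels are not centered. The LASSO model is therefore used here only to select predictors.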

According to the LASSO model, the five most important predictors of whether a student drops out, is enrolled, or graduates, ranked by absolute coefficient, are the number of curricular units approved in the second semester (positive coefficient, by far the largest), the numbers of curricular units enrolled and credited in the second semester (negative coefficients), whether tuition fees are up to date (positive coefficient), and whether the student is a scholarship holder (positive coefficient). The number of curricular units approved in the first semester (positive coefficient) ranks sixth. Hence, students who have more units approved, whose tuition fees are up to date, or who hold a scholarship are less likely to drop out of an online program. This makes sense: when tuition fees are up to date or a student holds a scholarship, there is typically less financial stress on the student, and the money already invested is a motivation to finish. Conversely, dropping out for financial reasons is quite common.

Holding the other predictors fixed, the more units a student is enrolled in or credited for in the second semester, the more likely the student is to drop out of an online program. One possible explanation is that the workload from the second semester onward is typically harder than in the first, which usually consists of introductory courses that are easier than their more advanced counterparts. The LASSO model also kept several course indicators, some of which appear to be considered "weed out" courses.

Moreover, the LASSO model suggests that parental qualification and occupation do not play a large role in whether a student drops out: none of the parent-related predictors were selected. For some people, genetics and pressure from their parents may be enough to keep them in school, but across this dataset these variables add little predictive value beyond the selected features. LASSO selection is not a test of statistical significance, so this should be read as a statement about predictive value only.

In [7]:
np.random.seed(1) # for reproducibility

# 5-fold CV grid search
lasso_params = {"alpha": np.arange(0.01, 1, 0.01)}
lasso_gs = GridSearchCV(Lasso(fit_intercept=False), param_grid=lasso_params, cv=cv).fit(X_train, y_train)

# fit LASSO model
lasso_model = Lasso(**lasso_gs.best_params_, fit_intercept=False).fit(X_train, y_train)
predictions = lasso_model.predict(X_test)
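# scikit-learn's signature is metric(y_true, y_pred), but the predictions are passed first here.
# MSE and MAE are symmetric in their arguments; MAPE is not, so it is scaled by |prediction|.
print(f"MAPE: {mean_absolute_percentage_error(predictions, y_test)}")
print(f"MSE: {mean_squared_error(predictions, y_test)}")
print(f"MAE: {mean_absolute_error(predictions, y_test)}")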

lasso_model.coef_

MAPE: 8.079146616416843
MSE: 1.7057611140219981
MAE: 1.2011788540103814


array([ 0.00000000e+00,  1.93973063e-03,  2.60331017e-03, -3.43620872e-03,
       -0.00000000e+00, -0.00000000e+00, -3.87091715e-02,  5.14202704e-02,
        0.00000000e+00,  0.00000000e+00, -1.23238266e-01, -1.74525985e-01,
       -1.31215419e-03,  5.93540422e-01,  1.36766426e-02, -0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
       -0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -

In [8]:
selected_predictors = np.where(lasso_model.coef_ != 0)[0]
(X.columns[selected_predictors], lasso_model.coef_[selected_predictors])

(Index(['Previous qualification (grade)', 'Admission grade',
        'Age at enrollment', 'Curricular units 1st sem (evaluations)',
        'Curricular units 1st sem (approved)',
        'Curricular units 2nd sem (credited)',
        'Curricular units 2nd sem (enrolled)',
        'Curricular units 2nd sem (evaluations)',
        'Curricular units 2nd sem (approved)',
        'Curricular units 2nd sem (grade)', 'Nacionality_26', 'Course_9085',
        'Course_9119', 'Course_9238', 'Course_9853', 'Course_9991', 'Debtor_1',
        'Tuition fees up to date_1', 'Scholarship holder_1'],
       dtype='object'),
 array([ 1.93973063e-03,  2.60331017e-03, -3.43620872e-03, -3.87091715e-02,
         5.14202704e-02, -1.23238266e-01, -1.74525985e-01, -1.31215419e-03,
         5.93540422e-01,  1.36766426e-02,  6.74032714e-05,  2.24452211e-03,
        -4.06986479e-04,  2.55129453e-02, -3.72521646e-02, -5.41010925e-03,
        -2.62655665e-02,  1.22012024e-01,  6.88781228e-02]))
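
To make the ranking of the selected predictors easier to read, the names and coefficients printed above can be sorted by absolute value. This is a small sketch that copies the values from the output of the previous cell; the variable names are illustrative.

```python
# names and coefficients copied from the cell output above
names = [
    "Previous qualification (grade)", "Admission grade", "Age at enrollment",
    "Curricular units 1st sem (evaluations)", "Curricular units 1st sem (approved)",
    "Curricular units 2nd sem (credited)", "Curricular units 2nd sem (enrolled)",
    "Curricular units 2nd sem (evaluations)", "Curricular units 2nd sem (approved)",
    "Curricular units 2nd sem (grade)", "Nacionality_26", "Course_9085",
    "Course_9119", "Course_9238", "Course_9853", "Course_9991", "Debtor_1",
    "Tuition fees up to date_1", "Scholarship holder_1",
]
coefs = [
    1.93973063e-03, 2.60331017e-03, -3.43620872e-03, -3.87091715e-02,
    5.14202704e-02, -1.23238266e-01, -1.74525985e-01, -1.31215419e-03,
    5.93540422e-01, 1.36766426e-02, 6.74032714e-05, 2.24452211e-03,
    -4.06986479e-04, 2.55129453e-02, -3.72521646e-02, -5.41010925e-03,
    -2.62655665e-02, 1.22012024e-01, 6.88781228e-02,
]

# sort by absolute coefficient, largest first
ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked[:6]:
    print(f"{coef:+.4f}  {name}")
```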

In [9]:
X_train = X_train[:, selected_predictors]
X_test = X_test[:, selected_predictors]

In [10]:
# responses are imbalanced, so weight each class by the inverse of its training count
class_weights = {c: 1.0/np.sum(y_train == c) for c in np.unique(y_train)}
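
As a side note on the weighting scheme, inverse-count weights are proportional to the "balanced" heuristic `n_samples / (n_classes * count)`. The sketch below checks this on hypothetical labels; `y_toy` and the class counts are made up for illustration.

```python
import numpy as np

# hypothetical imbalanced labels: 30 dropouts, 20 enrolled, 50 graduates
y_toy = np.array([0] * 30 + [1] * 20 + [2] * 50)
classes, counts = np.unique(y_toy, return_counts=True)

# inverse-count weights, as in the cell above
inverse_count = {int(c): 1.0 / n for c, n in zip(classes, counts)}

# "balanced" heuristic: n_samples / (n_classes * count)
balanced = {int(c): len(y_toy) / (len(classes) * n) for c, n in zip(classes, counts)}

# the two schemes differ only by the constant factor n_samples / n_classes
ratios = [balanced[c] / inverse_count[c] for c in inverse_count]
print(ratios)
```

The relative emphasis on each class is the same under both schemes. The overall scale can still matter for models in which the class weight multiplies a regularization parameter, such as the `C` parameter of `SVC`.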

# Fit Ensemble Classifiers

## Random Forest

In [11]:
np.random.seed(1) # for reproducibility

# 5-fold CV grid search
rf_params = {"n_estimators": range(100, 400, 100), "warm_start": [True, False]}
rf_gs = GridSearchCV(RandomForestClassifier(class_weight=class_weights), param_grid=rf_params, cv=cv).fit(X_train, y_train)

# fit random forest model and make predictions
rf_model = RandomForestClassifier(**rf_gs.best_params_, class_weight=class_weights).fit(X_train, y_train)
predictions = rf_model.predict(X_test)

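# performance metrics
# note: scikit-learn expects (y_true, y_pred); with the predictions passed first, per-class
# precision and recall trade places and the weights come from the predicted class counts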
print("Random Forest")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")

Random Forest
Accuracy: 0.7887005649717514
Precision: 0.8290663851535305
Recall: 0.7887005649717514
F1 Score: 0.8021695950575256


## AdaBoost

In [12]:
np.random.seed(1) # for reproducibility

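# 5-fold CV grid search
# note: np.arange(0.001, 10, 100) has a step of 100, so the learning-rate grid is the single value 0.001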
adaboost_params = {"n_estimators": range(100, 400, 100), "learning_rate": np.arange(0.001, 10, 100)}
adaboost_gs = GridSearchCV(AdaBoostClassifier(algorithm="SAMME"), param_grid=adaboost_params, cv=cv).fit(X_train, y_train)

# fit AdaBoost model and make predictions
adaboost_model = AdaBoostClassifier(**adaboost_gs.best_params_, algorithm="SAMME").fit(X_train, y_train)
predictions = adaboost_model.predict(X_test)

# performance metrics
print("AdaBoost")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")


AdaBoost
Accuracy: 0.7152542372881356
Precision: 0.8932832374623789
Recall: 0.7152542372881356
F1 Score: 0.7914654860112544


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# Baseline Classifiers

## Classification Tree

In [None]:
np.random.seed(1) # for reproducibility

# fit tree model and make predictions
cart_model = DecisionTreeClassifier(class_weight=class_weights).fit(X_train, y_train)
predictions = cart_model.predict(X_test)

# performance metrics
print("Classification Tree")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")

Classification Tree
Accuracy: 0.6790960451977401
Precision: 0.6773377160068086
Recall: 0.6790960451977401
F1 Score: 0.6781836598777795


## Support Vector Machine

In [None]:
np.random.seed(1) # for reproducibility

# 5-fold CV grid search
svm_params = {"kernel": ["linear", "poly", "rbf"]}
svm_gs = GridSearchCV(SVC(class_weight=class_weights), param_grid=svm_params, cv=cv).fit(X_train, y_train)

# fit SVM model and make predictions
# note: unlike in the grid search above, class_weight is not passed to the final model
svm_model = SVC(**svm_gs.best_params_).fit(X_train, y_train)
predictions = svm_model.predict(X_test)

# performance metrics
print("Support Vector Machine")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")

Support Vector Machine
Accuracy: 0.7751412429378531
Precision: 0.8188324061741398
Recall: 0.7751412429378531
F1 Score: 0.7886778451733745


## k-Nearest Neighbors

In [None]:
np.random.seed(1) # for reproducibility

# 5-fold CV grid search
knn_params = {"n_neighbors": range(3, 31, 2), "metric": ["manhattan", "cosine", "euclidean"]}
knn_gs = GridSearchCV(KNeighborsClassifier(), param_grid=knn_params, cv=cv).fit(X_train, y_train)

# fit KNN model and make predictions
knn_model = KNeighborsClassifier(**knn_gs.best_params_).fit(X_train, y_train)
predictions = knn_model.predict(X_test)

# performance metrics
print("k-Nearest Neighbors")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")

k-Nearest Neighbors
Accuracy: 0.7694915254237288
Precision: 0.8397855418827208
Recall: 0.7694915254237288
F1 Score: 0.7910876893130133


## Logistic Regression

In [None]:
np.random.seed(1) # for reproducibility

warnings.filterwarnings("ignore")

# 5-fold CV grid search
lreg_params = {"solver": ["lbfgs", "liblinear", "saga", "sag", "newton-cholesky"]}
lreg_gs = GridSearchCV(LogisticRegression(class_weight=class_weights), param_grid=lreg_params, cv=cv).fit(X_train, y_train)

# fit logistic regression model and make predictions
# note: unlike in the grid search above, class_weight is not passed to the final model
lreg_model = LogisticRegression(**lreg_gs.best_params_).fit(X_train, y_train)
predictions = lreg_model.predict(X_test)

# performance metrics
print("Logistic Regression")
print(f"Accuracy: {accuracy_score(predictions, y_test)}\nPrecision: {precision_score(predictions, y_test, average='weighted')}")
print(f"Recall: {recall_score(predictions, y_test, average='weighted')}\nF1 Score: {f1_score(predictions, y_test, average='weighted')}")

Logistic Regression
Accuracy: 0.7593220338983051
Precision: 0.8259023106222929
Recall: 0.7593220338983051
F1 Score: 0.7831675967436951


Random forest with LASSO-selected features performed best in accuracy and recall among all classification models in this project. This is expected, as random forest is a strong general-purpose classifier: it combines the predictions of many decision trees, each trained on a bootstrap sample and restricted to a random subset of the predictors at each split. Random forest also had the highest F1 score, and thus the best balance of precision and recall.

AdaBoost had the highest precision (0.893). However, its F1 score (0.791) is lower than that of random forest (0.802) because its recall is much lower, so the gap between its precision and recall is wider and the two are less balanced. AdaBoost also had the second-lowest accuracy and recall, ahead of only the classification tree, which is surprising for an ensemble model; SVM, KNN, and logistic regression all outperformed it on both. One possible contributor is the learning-rate grid: `np.arange(0.001, 10, 100)` contains only the value 0.001, so no larger learning rate was tried.
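
For a side-by-side view, the reported test metrics can be collected and sorted. This is a small sketch using the values printed by the cells above, rounded to four decimals; the dictionary layout and the helper `best` are illustrative.

```python
# test-set metrics copied from the cell outputs above, rounded to four decimals
results = {
    "Random Forest":       {"accuracy": 0.7887, "precision": 0.8291, "recall": 0.7887, "f1": 0.8022},
    "AdaBoost":            {"accuracy": 0.7153, "precision": 0.8933, "recall": 0.7153, "f1": 0.7915},
    "Classification Tree": {"accuracy": 0.6791, "precision": 0.6773, "recall": 0.6791, "f1": 0.6782},
    "SVM":                 {"accuracy": 0.7751, "precision": 0.8188, "recall": 0.7751, "f1": 0.7887},
    "KNN":                 {"accuracy": 0.7695, "precision": 0.8398, "recall": 0.7695, "f1": 0.7911},
    "Logistic Regression": {"accuracy": 0.7593, "precision": 0.8259, "recall": 0.7593, "f1": 0.7832},
}

def best(metric):
    """Return the model with the highest value of the given metric."""
    return max(results, key=lambda model: results[model][metric])

# models ordered from highest to lowest accuracy
by_accuracy = sorted(results, key=lambda model: results[model]["accuracy"], reverse=True)

for model in by_accuracy:
    row = results[model]
    print(f"{model:<20} {row['accuracy']:.4f} {row['precision']:.4f} {row['recall']:.4f} {row['f1']:.4f}")
```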