# Predicting Students' Dropout and Academic Success

This project aims to predict student dropout rates and academic success using various machine learning techniques. The dataset used for this analysis is sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success).

## Project Outline

1. **Download and Load the Data:**
   - Download the dataset and load it into a Pandas DataFrame.

2. **Explore and Preprocess the Data:**
   - Understand the features and target variable.
   - Handle missing values and outliers.
   - Encode categorical variables.
   - Normalize/standardize numerical features.
   
   See the [EDA Notebook](./Predict_Students_Dropout_EDA.ipynb) for this step.

3. **Split the Data:**
   - Split the dataset into training and testing sets.

4. **Build Classification Models:**
   - Train multiple classification models such as Logistic Regression, Random Forest, and SVM.
   - Evaluate their performance using metrics like accuracy, precision, recall, and F1-score.

5. **Feature Selection and Hyperparameter Tuning:**
   - Perform feature selection to identify important features.
   - Tune hyperparameters using Grid Search or Random Search.

6. **Evaluate and Compare Models:**
   - Compare the models based on their performance metrics.
   - Choose the best-performing model.

7. **Model Interpretation and Insights:**
   - Interpret the model to understand which features are most influential.
   - Provide insights and recommendations based on the findings.

To prepare the dataset for analysis, we first install the required Python packages, including numpy, pandas, seaborn, and matplotlib for data processing and analysis. For the machine learning model, we install the Extreme Gradient Boosting (XGBoost) library and use its binary logistic objective. We then load the dataset with pandas and assign it to a DataFrame named df.

In [29]:
# Install the packages required for this project, including ucimlrepo,
# which fetches the dataset from the UCI Machine Learning Repository.

#%pip install ucimlrepo
#%pip install xgboost
#%pip install cmake
#%pip install catboost

In [26]:
#Import required packages

from ucimlrepo import fetch_ucirepo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb
from sklearn.preprocessing import RobustScaler
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample

from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

In [5]:
# Load the dataset
df = pd.read_csv('data/prep_data.csv')

### Data Modelling

#### Converting Target Variable into Numeric Form
To predict students' academic success and dropout, we will fit a binary logistic model that predicts the target variable from the feature variables. Since the target also contains students who are still enrolled, we drop them from the dataset and keep only the students who dropped out or graduated. We then transform the target variable into numeric form using LabelEncoder, a preprocessing utility from the scikit-learn library. The labels Dropout and Graduate become 0 and 1, respectively.

In [8]:
df_student = df.drop(df[df['Target']=='Enrolled'].index)
df_student.head()

Unnamed: 0,Marital_Status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_(grade),Nationality,Mothers_qualification,Fathers_qualification,...,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,Target,age_admission_ratio
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0.0,0,10.8,1.4,1.74,Dropout,-0.462655
1,1,15,1,9254,1,1,160.0,1,1,3,...,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate,-0.002746
2,1,1,5,9070,1,1,122.0,1,37,37,...,6,0,0,0.0,0,10.8,1.4,1.74,Dropout,0.101311
3,1,17,2,9773,1,1,122.0,1,38,37,...,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate,0.030682
4,2,39,1,8014,0,1,100.0,1,37,38,...,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate,0.079806


In [10]:
encoder = LabelEncoder()

In [12]:
df_student['Target'] = encoder.fit_transform(df_student['Target'])

In [15]:
# Check encoding worked on target column
df_student['Target'].value_counts()


Target
1    2209
0    1421
Name: count, dtype: int64
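
The mapping of Dropout to 0 and Graduate to 1 follows from `LabelEncoder` assigning integer codes to the class labels in sorted order. A minimal pure-Python sketch of that rule, using a small made-up list of labels (the `labels` list and the `encode_labels` helper are illustrative and not part of the notebook):

```python
def encode_labels(labels):
    """Assign integer codes to labels in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


# Made-up target values for illustration only.
labels = ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"]
codes, mapping = encode_labels(labels)
print(mapping)  # {'Dropout': 0, 'Graduate': 1}
print(codes)    # [0, 1, 0, 1, 1]
```

In the notebook itself, `encoder.classes_` holds the same sorted labels, and `encoder.inverse_transform` converts numeric predictions back to the original names.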

In [16]:
df_student.head()


Unnamed: 0,Marital_Status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_(grade),Nationality,Mothers_qualification,Fathers_qualification,...,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,Target,age_admission_ratio
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0.0,0,10.8,1.4,1.74,0,-0.462655
1,1,15,1,9254,1,1,160.0,1,1,3,...,6,6,6,13.666667,0,13.9,-0.3,0.79,1,-0.002746
2,1,1,5,9070,1,1,122.0,1,37,37,...,6,0,0,0.0,0,10.8,1.4,1.74,0,0.101311
3,1,17,2,9773,1,1,122.0,1,38,37,...,6,10,5,12.4,0,9.4,-0.8,-3.12,1,0.030682
4,2,39,1,8014,0,1,100.0,1,37,38,...,6,6,6,13.0,0,13.9,-0.3,0.79,1,0.079806


#### Splitting Features and Target Variables into X and Y
We set X to the feature variables and Y to the target variable. We drop the Nationality and International columns because they are highly correlated with each other and a single nationality dominates the data, so they add little information and could bias the model.

In [19]:
# Set the target variable as Y and the remaining variables, except Nationality and International, as X
X = df_student.drop(columns=['Nationality', 'International', 'Target'])
Y = df_student['Target']
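
One way to check whether a single value dominates a column is to compute the share of rows covered by its most frequent value. A minimal sketch on made-up values (the `nationality` list below is hypothetical and does not reproduce the dataset):

```python
from collections import Counter


def dominant_share(values):
    """Return the most frequent value and the fraction of rows it covers."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Hypothetical codes: 1 stands for the dominant nationality.
nationality = [1] * 97 + [41, 6, 26]
value, share = dominant_share(nationality)
print(value, round(share, 2))  # 1 0.97
```

With pandas, `df['Nationality'].value_counts(normalize=True)` gives the same shares for every value at once.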

In [20]:
print(X, X.shape)

      Marital_Status  Application_mode  Application_order  Course  \
0                  1                17                  5     171   
1                  1                15                  1    9254   
2                  1                 1                  5    9070   
3                  1                17                  2    9773   
4                  2                39                  1    8014   
...              ...               ...                ...     ...   
4419               1                 1                  6    9773   
4420               1                 1                  2    9773   
4421               1                 1                  1    9500   
4422               1                 1                  1    9147   
4423               1                10                  1    9773   

      Daytime_evening_attendance  Previous_qualification  \
0                              1                       1   
1                              1                   

In [21]:
print(Y, Y.shape)


0       0
1       1
2       0
3       1
4       1
       ..
4419    1
4420    0
4421    0
4422    1
4423    1
Name: Target, Length: 3630, dtype: int64 (3630,)


#### Splitting Data into Training and Testing Data
Before fitting the model, we split the data into training and testing sets: 80% of the records are used for training and the remaining 20% for testing. Setting random_state=3 fixes the seed of the random shuffle so that the split is reproducible.

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [24]:
print(X.shape, X_train.shape, X_test.shape)

(3630, 35) (2904, 35) (726, 35)
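
The value passed as `random_state` acts as a seed for the shuffle that precedes the split, so the same value always yields the same partition. The idea can be sketched in pure Python, assuming a simple shuffle-and-cut split (the `split_indices` helper is hypothetical and is not scikit-learn's actual algorithm):

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(3630, 0.2, seed=3)
train_b, test_b = split_indices(3630, 0.2, seed=3)
print(len(train_a), len(test_a))  # 2904 726
print(test_a == test_b)           # True: same seed, same split
```

If preserving the Dropout/Graduate proportions in both sets matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=Y`).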


#### XGB Logistic Regression
The model is implemented with Extreme Gradient Boosting (XGBoost), an open-source gradient boosting library. Note that XGBClassifier is a boosted ensemble of decision trees rather than a classical logistic regression: the binary:logistic objective means the ensemble is trained with logistic loss and outputs a probability for the positive class. We set n_estimators to 1000, which is the number of boosting rounds, i.e. the number of decision trees built in sequence from the feature variables.

In [27]:
bin_log = xgb.XGBClassifier(objective='binary:logistic', n_estimators=1000)
bin_log.fit(X_train, Y_train)
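
With the `binary:logistic` objective, the summed output of all trees is a raw score that is passed through the logistic (sigmoid) function to give a probability; `predict` then reports class 1 when that probability exceeds 0.5. A minimal sketch of this last step, with made-up raw scores standing in for the ensemble output:

```python
import math


def sigmoid(raw_score):
    """Map a raw additive score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical summed tree outputs for three students.
raw_scores = [-2.0, 0.0, 3.0]
probabilities = [sigmoid(s) for s in raw_scores]
classes = [int(p > 0.5) for p in probabilities]
print([round(p, 3) for p in probabilities])  # [0.119, 0.5, 0.953]
print(classes)                               # [0, 0, 1]
```

In the notebook, `bin_log.predict_proba(X_test)` returns these probabilities for both classes, which is useful when a threshold other than 0.5 is preferred.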

#### Data Prediction and Evaluation of the Model
We now apply the trained model to the test data and compare its predictions with the true labels.

In [29]:
target_prediction = bin_log.predict(X_test)
print(target_prediction)

[1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0
 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1
 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1
 0 0 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0
 1 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 0 1
 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1
 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 0
 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 0
 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1
 1 1 0 1 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 0 0 1
 1 0 1 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1
 0 1 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0 0 0
 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1
 1 1 0 1 0 1 0 1 0 1 1 1 

In [30]:
data_accuracy = accuracy_score(Y_test, target_prediction)
print("Accuracy:", data_accuracy)

Accuracy: 0.8953168044077136
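
Accuracy alone does not show how the errors are distributed between the two classes, which is why the outline also lists precision, recall, and F1-score. A minimal sketch of how these are derived from the confusion counts, on made-up labels (the `binary_metrics` helper is illustrative; in the notebook, the imported `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` compute the same quantities from `Y_test` and `target_prediction`):

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and derived metrics, treating class 1 as positive.

    Assumes at least one predicted and one actual positive (no zero division).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up labels: 1 = Graduate, 0 = Dropout.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics["confusion"])           # [[2, 1], [1, 4]]
print(round(metrics["accuracy"], 2))  # 0.75
print(round(metrics["f1"], 2))        # 0.8
```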


In [31]:
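# The input data is the 192nd record in the student_data dataset, disregarding the Nationality and International columns.
# Note: this tuple holds 32 values, but the model was trained on the 35 columns of X, so predict() raises the error shown below.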
input_data = (1, 1, 2, 14, 1, 1, 1, 3, 5, 4, 0, 0, 0, 1, 0, 0, 19, 0, 5, 5, 5, 13, 0, 0, 5, 5, 5, 13.2, 0, 9.4, -0.8, -3.12) 
input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
prediction = bin_log.predict(input_data_reshaped)
print(prediction)
print("The initial value is ",prediction[0])

ValueError: Feature shape mismatch, expected: 35, got 32
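
The mismatch arises because the hand-typed tuple has 32 values while the model expects one value per training column (35). One way to avoid this, sketched here with a hypothetical four-column feature list, is to build the input from named values and order it by the training columns, failing early if anything is missing (`build_input_row` and the sample values are illustrative):

```python
def build_input_row(record, feature_columns):
    """Order a record's values by the training columns, rejecting mismatches."""
    missing = [c for c in feature_columns if c not in record]
    extra = [c for c in record if c not in feature_columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    return [[record[c] for c in feature_columns]]  # one row, n_features values


# In the notebook this would be list(X.columns); shortened here for illustration.
feature_columns = ["Marital_Status", "Application_mode", "Application_order", "Course"]
record = {"Course": 9254, "Marital_Status": 1, "Application_order": 1, "Application_mode": 15}

row = build_input_row(record, feature_columns)
print(row)  # [[1, 15, 1, 9254]]
```

In the notebook, selecting an existing row with `X_test.iloc[[0]]` is another way to obtain an input that already has the 35 expected columns in the right order.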

#### Creating a System for Prediction