# Classification model to predict students' dropout and academic success

The dataset contains information collected from a higher education institution related to students undertaking different degree programs.
The original dataset contains information known at the time of student enrollment and the student's academic performance at the end of the 1st and 2nd semesters.

The target is split into three distinct categories: Dropout, Enrolled, and Graduate.

I shall build classification models using various architectures to predict students' dropout and academic success. The models can then be used to identify, at an early stage, which students are most likely to drop out, so that strategies can be put in place to counter this.

That would help reduce the rate of academic dropout and failure.

## Import Libraries and Datasets

### Libraries

Below, I import all the libraries needed for this competition.

In [None]:
!pip install catboost
!pip install optuna
!pip install optuna_distributed
!pip install openfe

In [None]:
#hide
#! [ -e /content ]

#hide
#This imports and sets up everything you will need for this notebook
#
#!pip install -Uqq fastbook
#import fastbook
#fastbook.setup_book()

#from fastbook import *
#!pip install ucimlrepo
#from ucimlrepo import fetch_ucirepo

from fastai.tabular.all import *
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from numpy import random

from fastai.imports import *
np.set_printoptions(linewidth=130)


from pathlib import Path
import os


from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.metrics import roc_auc_score,accuracy_score,mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,r2_score
#from sklearn.metrics import root_mean_squared_error

import xgboost as xgb
from xgboost import plot_importance

import lightgbm as lgb

from catboost import CatBoostClassifier,CatBoostRegressor,Pool, metrics, cv

from ipywidgets import interact


import matplotlib  # explicit import so matplotlib.rc does not rely on a star import
matplotlib.rc('image', cmap='Greys')

#from fastkaggle import setup_comp

import optuna
from openfe import OpenFE, transform

from IPython.display import FileLink

#from lightgbm import LGBMClassifier



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
path = Path('/kaggle/input/playground-series-s4e6/')
path

## Import Datasets

In [None]:
train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')
sub_df = pd.read_csv(path/'sample_submission.csv')
original_df = pd.read_csv('/kaggle/input/academic-success-dataset/data.csv', delimiter=';')  # the original file is semicolon-delimited

In [None]:
#X = train_df.drop(columns=["Target"], axis=1)
#y = train_df["Target"]

#y.shape, X.shape

# Baseline

Previously, I had built a baseline model using the AutoML solution AutoGluon without presets, which gave me an initial submission score of 0.83434. Find the notebook [here](https://www.kaggle.com/code/rubanzasilva/autogluon-starter).

In this notebook, I test out different model architectures and data transformations to try to improve on the baseline score.

# Without original dataset

First I shall try out the models using only the competition data, without the original dataset.

Below I use the fastai tabular methods to preprocess and prepare my data for machine learning, creating a training set and a validation set.

I use fastai's cont_cat_split to separate my dataset variables into categorical and continuous variables.

I then use RandomSplitter to do a random split and create a validation set of about 20% of the initial dataset.



In [None]:
cont_names,cat_names = cont_cat_split(train_df, dep_var='Target')
splits = RandomSplitter(valid_pct=0.2)(range_of(train_df))
to = TabularPandas(train_df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names='Target',
                   y_block=CategoryBlock(),
                   splits=splits)

X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

dls = to.dataloaders(bs=64)
test_dl = dls.test_dl(test_df)
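
To make the 20% hold-out concrete, here is a minimal plain-Python sketch of the idea behind a random split: shuffle the row indices and reserve the first share of them for validation. The function name `random_split` and the `seed` value are illustrative assumptions, not part of the notebook, and this is not fastai's implementation.

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=42):
    """Shuffle row indices and hold out `valid_pct` of them for validation."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)   # seeded so the split is reproducible
    cut = int(valid_pct * n_rows)
    return idxs[cut:], idxs[:cut]       # (train indices, validation indices)

# Hypothetical dataset of 1000 rows
train_idx, valid_idx = random_split(1000, valid_pct=0.2)
print(len(train_idx), len(valid_idx))
```

Without a fixed seed the validation rows change on every run, which makes validation scores harder to compare between runs. fastai's RandomSplitter also accepts a seed argument for this purpose.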

# Trying out different model architectures.

Here I start with decision-tree ensembles, specifically random forests, then I try out the gradient boosting models CatBoost, XGBoost and LightGBM.

Later on I try out neural networks and an ensemble of several neural networks using the fastai library.

## Random Forests

In [None]:
%%time
rf = RandomForestClassifier(100, min_samples_leaf=3)
rf_model_a = rf.fit(X_train, y_train);

rf_preds = tensor(rf_model_a.predict(test_dl.xs))

rf_preds_x = tensor(rf_model_a.predict(X_test))

#mse = mean_absolute_error(y_test, rf_preds_x)
#rmse = np.sqrt(mse)

accuracy_score(y_test,rf_preds_x)

## CatBoost

In [None]:
%%time
cat_model = CatBoostClassifier(iterations=2000, depth=8, learning_rate=  0.08, random_strength=10)
cat_model = cat_model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)

#test set preds
cat_preds = tensor(cat_model.predict(test_dl.xs))


cat_preds_final = cat_preds.squeeze(1)

#validation set preds
cat_preds_x = tensor(cat_model.predict(X_test))

cat_preds_x_final = cat_preds_x.squeeze(1)

accuracy_score(y_test,cat_preds_x)

## XGBoost

In [None]:
xgb_model = xgb.XGBClassifier(n_estimators = 197, max_depth=4, learning_rate=0.1818695751227044, subsample= 0.39774994666482544)
xgb_model = xgb_model.fit(X_train, y_train)

xgb_preds = tensor(xgb_model.predict(test_dl.xs))

xgb_preds_x = tensor(xgb_model.predict(X_test))

accuracy_score(y_test,xgb_preds_x)

## LGBM

In [None]:
lgb_model = lgb.LGBMClassifier(num_leaves=251, learning_rate=0.02956613668999794, n_estimators=483, max_depth=82, boosting_type='gbdt',min_child_samples=90, random_state=27)
lgb_model = lgb_model.fit(X_train, y_train)

#test set preds
lgb_preds = tensor(lgb_model.predict(test_dl.xs))

#validation set preds
lgb_preds_x = tensor(lgb_model.predict(X_test))

lgb_score = accuracy_score(y_test,lgb_preds_x)
lgb_score

In [None]:
model_preds = {
    "random forests":accuracy_score(y_test,rf_preds_x),
    "cat boost":accuracy_score(y_test,cat_preds_x),
    "lgbm":lgb_score,
    "xgboost":accuracy_score(y_test,xgb_preds_x),   
}

#model_preds_a = model_preds.sort()
print(model_preds)

{'random forests': 0.826439260275763, 'cat boost': 0.8316016467359342, 'lgbm': 0.8306214467751422, 'xgboost': 0.8299679801346141}
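
The commented-out `model_preds.sort()` would not work, since dictionaries have no `sort` method. A small sketch of one way to rank the models, using the validation accuracies printed above (the variable name `ranked` is an illustrative choice):

```python
# Validation accuracies copied from the printed output above
model_preds = {
    "random forests": 0.826439260275763,
    "cat boost": 0.8316016467359342,
    "lgbm": 0.8306214467751422,
    "xgboost": 0.8299679801346141,
}

# Sort the (name, score) pairs by score, best model first
ranked = sorted(model_preds.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:15s} {score:.4f}")
```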

In [None]:
mapping = dict(enumerate(dls.vocab))
mapping

## Submission

In [None]:
lgb_preds.shape,cat_preds.shape,cat_preds_final.shape

In [None]:
mapping = dict(enumerate(dls.vocab))
predicted_labels = [mapping[value.item()] for value in cat_preds_final]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = predicted_labels
submit.to_csv('submission.csv',index=False)
submit

In [None]:
!ls

In [None]:
#!kaggle competitions submit -c kagglex-cohort4 -f submission.csv -m "general_preds baseline"

In [None]:
!rm submission.csv

### Neural Network

In [None]:
learn = tabular_learner(dls, metrics=accuracy)
learn.lr_find(suggest_funcs=(slide,valley))

In [None]:
%%time
learn.fit_one_cycle(20,0.02)

In [None]:
dl = learn.dls.test_dl(test_df)

In [None]:
%%time
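# get_preds returns a (predictions, targets) tuple; the predictions are class probabilities
nn_preds = learn.get_preds(dl=dl)      # test set
nn_preds_x = learn.get_preds()[0]      # validation set probabilities
a_preds, _ = learn.get_preds(dl=dl)
# pick the most probable class index for each test row
nn_preds_y = a_preds.argmax(dim=1)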

In [None]:
cat_preds_final = cat_preds.squeeze(1)

In [None]:
# nn_preds is a (predictions, targets) tuple, so index it before asking for the shape
nn_preds_x.shape,nn_preds[0].shape,nn_preds_y.shape

In [None]:
!ls

In [None]:
!rm submission.csv

In [None]:
mapping = dict(enumerate(dls.vocab))
predicted_labels = [mapping[value.item()] for value in cat_preds_final]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = predicted_labels
submit.to_csv('submission.csv',index=False)
submit

### Ensemble

For testing with our accuracy_score metric, we use the validation-set predictions: lgb_preds_x, xgb_preds_x, rf_preds_x, and cat_preds_x_final.

For submission, we use lgb_preds, xgb_preds, rf_preds, and cat_preds_final, as these are the result of running the models on the competition test set and have the same shape as our expected submission.
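
Because the predictions being combined are class indices rather than probabilities, one common way to combine them is a majority vote; arithmetic averaging of indices can produce values such as 0.5 that do not correspond to any class. The sketch below illustrates this on toy data. The helper names `majority_vote` and `accuracy` and all the numbers are assumptions made for the illustration.

```python
from collections import Counter

def majority_vote(*model_preds):
    """Return, for each row, the class index predicted by the most models.

    Ties go to the class that appears first among the models' predictions.
    """
    return [Counter(row).most_common(1)[0][0] for row in zip(*model_preds)]

def accuracy(y_true, y_pred):
    """Share of rows where the prediction equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy class indices (0, 1, 2) standing in for the four models' predictions
rf   = [0, 2, 1, 2]
cat  = [0, 2, 2, 2]
lgbm = [2, 2, 1, 0]
xgb  = [0, 1, 1, 2]
y    = [0, 2, 1, 2]   # toy ground truth

votes = majority_vote(rf, cat, lgbm, xgb)
averaged = [(a + b + c + d) / 4 for a, b, c, d in zip(rf, cat, lgbm, xgb)]

print(votes)      # valid class indices
print(averaged)   # fractional values, not usable as class indices
print(accuracy(y, votes))
```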

#### For testing

In [None]:
cat_preds_x_final = cat_preds_x.squeeze(1)

In [None]:
rf_preds_x.shape,cat_preds_x_final.shape,lgb_preds_x.shape,xgb_preds_x.shape

In [None]:
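# Majority vote across the four models: the predictions are class indices, so averaging them would not give valid labels
general_preds = torch.stack([p.long() for p in (rf_preds_x, cat_preds_x_final, lgb_preds_x, xgb_preds_x)]).mode(dim=0).values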
general_preds

In [None]:
rf_preds_x.shape,cat_preds_x_final.shape,lgb_preds_x.shape,xgb_preds_x.shape,general_preds.shape

In [None]:
accuracy_score(y_test,general_preds)

#### For Submission

In [None]:
rf_preds.shape,cat_preds_final.shape,lgb_preds.shape,xgb_preds.shape

In [None]:
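# Majority vote across the four models: the predictions are class indices, so averaging them would not give valid labels
general_preds_sub = torch.stack([p.long() for p in (rf_preds, cat_preds_final, lgb_preds, xgb_preds)]).mode(dim=0).values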
general_preds_sub

In [None]:
rf_preds.shape,cat_preds_final.shape,lgb_preds.shape,xgb_preds.shape,general_preds_sub.shape

In [None]:
mapping = dict(enumerate(dls.vocab))
predicted_labels = [mapping[value.item()] for value in general_preds_sub]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = predicted_labels
submit.to_csv('submission.csv',index=False)
submit

# Adding original dataset

In [None]:
original_df = pd.read_csv('/kaggle/input/academic-success-dataset/data.csv', delimiter=';')

In [None]:
original_df.rename(columns={'Daytime/evening attendance\t':'Daytime/evening attendance'}, inplace=True)

In [None]:
train_final = pd.concat([train_df,original_df], axis=0)

In [None]:
cont_names,cat_names = cont_cat_split(train_final, dep_var='Target')
splits = RandomSplitter(valid_pct=0.2)(range_of(train_final))
to = TabularPandas(train_final, procs=[Categorify, FillMissing,Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names='Target',
                   y_block=CategoryBlock(),
                   splits=splits)

X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

dls = to.dataloaders(bs=64)
#test_dl = dls.test_dl(test_df)

In [None]:
test_dl = dls.test_dl(test_df)

In [None]:
%%time
rf = RandomForestClassifier(100, min_samples_leaf=3)
rf_model = rf.fit(X_train, y_train);

rf_preds = tensor(rf_model.predict(test_dl.xs))

rf_preds_x = tensor(rf_model.predict(X_test))

#mse = mean_absolute_error(y_test, rf_preds_x)
#rmse = np.sqrt(mse)

accuracy_score(y_test,rf_preds_x)

{'random forests': 0.8236293537214925, 'cat boost': 0.8265699536038685, 'lgbm': 0.8253283669868653, 'xgboost': 0.8268313402600798}

In [None]:
%%time
cat_model = CatBoostClassifier(iterations=2000, depth=8, learning_rate=  0.08, random_strength=10)
cat_model = cat_model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)

#test set preds
cat_preds = tensor(cat_model.predict(test_dl.xs))


cat_preds_final = cat_preds.squeeze(1)

#validation set preds
cat_preds_x = tensor(cat_model.predict(X_test))

cat_preds_x_final = cat_preds_x.squeeze(1)

accuracy_score(y_test,cat_preds_x)

In [None]:
%%time
xgb_model = xgb.XGBClassifier(n_estimators = 197, max_depth=4, learning_rate=0.1818695751227044, subsample= 0.39774994666482544)
xgb_model = xgb_model.fit(X_train, y_train)
xgb_preds = tensor(xgb_model.predict(test_dl.xs))

xgb_preds_x = tensor(xgb_model.predict(X_test))

accuracy_score(y_test,xgb_preds_x)

In [None]:
lgb_model = lgb.LGBMClassifier(num_leaves=251, learning_rate=0.02956613668999794, n_estimators=483, max_depth=82, boosting_type='gbdt',min_child_samples=90, random_state=27)
lgb_model = lgb_model.fit(X_train, y_train)

#test set preds
lgb_preds = tensor(lgb_model.predict(test_dl.xs))

#validation set preds
lgb_preds_x = tensor(lgb_model.predict(X_test))

lgb_score = accuracy_score(y_test,lgb_preds_x)
lgb_score

In [None]:
!rm submission.csv

In [None]:
mapping = dict(enumerate(dls.vocab))
predicted_labels = [mapping[value.item()] for value in xgb_preds]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = predicted_labels
submit.to_csv('submission.csv',index=False)
submit

# Adding Base Features

from https://www.kaggle.com/code/trupologhelper/ps4e5-openfe-blending-explain#Creating-New-Features-%F0%9F%93%8A

In [None]:
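# Assumption: test_final is not defined earlier in this notebook, so a copy of test_df is used here
test_final = test_df.copy()
BASE_FEATURES = test_final.columns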
initial_features = BASE_FEATURES
initial_features

In [None]:
%%time
for df in [train_final, test_final]:
    print('computing f_sum')
    df['fsum'] = df[initial_features].sum(axis=1) # for tree models
    print('computing f_std')
    df['f_std']  = df[initial_features].std(axis=1)
    print('computing f_mean')
    df['f_mean'] = df[initial_features].mean(axis=1)
    print('computing f_max')
    df['f_max']  = df[initial_features].max(axis=1)
    print('computing f_min')
    df['f_min']  = df[initial_features].min(axis=1)
    print('computing f_mode')
    df['f_mode'] = df[initial_features].mode(axis=1)[0]
    print('computing f_median')
    df['f_median'] = df[initial_features].median(axis=1)
    print('computing f_25th')
    df['f_25th'] = df[initial_features].quantile(0.25, axis=1)
    print('computing f_75th')
    df['f_75th'] = df[initial_features].quantile(0.75, axis=1)
    print('computing f_skew')
    df['f_skew'] = df[initial_features].skew(axis=1)
    print('computing f_kurt')
    df['f_kurt'] = df[initial_features].kurt(axis=1)
    df['special1'] = df['fsum'].isin(np.arange(72, 76)) # for linear models
    for i in range(10,100,10):
        print(f'computing f_{i}th')
        df[f'f_{i}th'] = df[initial_features].quantile(i/100, axis=1)
    print('computing f_harmonic')
    df['f_harmonic'] = len(initial_features) / df[initial_features].apply(lambda x: (1/x).mean(), axis=1)
    print('computing f_geometric')
    df['f_geometric'] = df[initial_features].apply(lambda x: x.prod()**(1/len(x)), axis=1)
    print('computing f_zscore')
    df['f_zscore'] = df[initial_features].apply(lambda x: (x - x.mean()) / x.std(), axis=1).mean(axis=1)
    print('computing Coefficient of Variation ')
    df['f_cv'] = df[initial_features].std(axis=1) / df[initial_features].mean(axis=1)
    print('computing f_Quantile Coefficients of Skewness_75')
    df['f_Quantile Coefficients of Skewness_75'] = (df[initial_features].quantile(0.75, axis=1) - df[initial_features].mean(axis=1)) / df[initial_features].std(axis=1)
    print('computing f_Quantile Coefficients of Skewness_25')
    df['f_Quantile Coefficients of Skewness_25'] = (df[initial_features].quantile(0.25, axis=1) - df[initial_features].mean(axis=1)) / df[initial_features].std(axis=1)
    print('computing f_2ndMoment')
    df['f_2ndMoment'] = df[initial_features].apply(lambda x: (x**2).mean(), axis=1)
    print('computing f_3rdMoment')
    df['f_3rdMoment'] = df[initial_features].apply(lambda x: (x**3).mean(), axis=1)
    print('computing f_entropy')
    df['f_entropy'] = df[initial_features].apply(lambda x: -1*(x*np.log(x)).sum(), axis=1)
    #print('computing f_mad') probably has negative impact
    #df['f_mad'] = df[initial_features].apply(lambda x: (x - x.median()).abs().median(), axis=1)
    #print('computing f_iqr') probably has negative impact
    #df['f_iqr'] = df[initial_features].quantile(0.75, axis=1) - df[initial_features].quantile(0.25, axis=1)
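
As a sanity check on one of these row-wise features, the sketch below evaluates the `f_harmonic` expression on a single hypothetical row using only the standard library. The row values are made up for illustration. Under that assumption, the expression used in the cell above comes out as the number of features times the textbook harmonic mean, which is worth keeping in mind when interpreting the feature.

```python
import statistics

row = [1.0, 2.0, 4.0]          # hypothetical feature values for one student
n = len(row)

mean_inv = sum(1 / x for x in row) / n     # what (1/x).mean() computes for this row
f_harmonic_as_coded = n / mean_inv          # same expression as in the cell above
true_harmonic = statistics.harmonic_mean(row)

print(f_harmonic_as_coded, true_harmonic)
```

Reciprocals and logarithms are undefined at zero, so features such as f_harmonic and f_entropy can come out infinite or missing for rows that contain zeros.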

In [None]:
train_final.head()

## Neural Network Ensemble

In [None]:
def ensemble():
    learn = tabular_learner(dls, metrics=accuracy)
    with learn.no_bar(),learn.no_logging(): learn.fit(6, 0.02)
    return learn.get_preds(dl=dl)[0]

In [None]:
learns = [ensemble() for _ in range(5)]

In [None]:
ens_preds = torch.stack(learns).mean(0)
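
Averaging the stacked predictions is a form of soft voting: each network outputs one probability per class, the probabilities are averaged across networks, and the most probable class is then chosen per row. A minimal plain-Python sketch of that idea follows; the helper names and the toy probabilities are assumptions for illustration only.

```python
def average_probs(per_model_probs):
    """Average class-probability rows across models (soft voting)."""
    n_models = len(per_model_probs)
    n_rows = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    return [
        [sum(m[r][c] for m in per_model_probs) / n_models for c in range(n_classes)]
        for r in range(n_rows)
    ]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Two hypothetical models, two rows, three classes
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.4, 0.4, 0.2], [0.1, 0.2, 0.7]]

mean_probs = average_probs([model_a, model_b])
labels = [argmax(row) for row in mean_probs]   # one class index per row
print(labels)
```

The averaged output still has one column per class, so a final argmax step is needed before the class indices can be mapped back to label names.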

In [None]:
nn_preds_x.shape,ens_preds.shape

In [None]:
# Assuming ens_preds is a PyTorch tensor with shape [51012, 3]
# Select predictions for the first class (index 0)
selected_class_preds = ens_preds[:, 0]

# Now selected_class_preds has a shape of torch.Size([51012])
print(selected_class_preds.shape)


In [None]:
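# ens_preds holds one probability per class; argmax picks the predicted class index for each row
ens_preds_final = ens_preds.argmax(dim=1)
ens_preds_final.shape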

In [None]:
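# Accuracy on the learner's own validation set (r2_score is a regression metric, and y_test now belongs to the later split)
val_probs, val_targs = learn.get_preds()
accuracy_score(val_targs.flatten(), val_probs.argmax(dim=1))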

In [None]:
# Turn the averaged class probabilities into one predicted class index per test row
ens_labels = ens_preds.argmax(dim=1)

In [None]:
mapping = dict(enumerate(dls.vocab))
predicted_labels = [mapping[value.item()] for value in ens_labels]
submit = pd.read_csv(path/'sample_submission.csv')
submit.Target = predicted_labels
submit.to_csv('submission.csv',index=False)
submit

### Original Dataset

@misc{misc_predict_students'_dropout_and_academic_success_697,
  author       = {Realinho,Valentim, Vieira Martins,Mónica, Machado,Jorge, and Baptista,Luís},
  title        = {{Predict Students' Dropout and Academic Success}},
  year         = {2021},
  howpublished = {UCI Machine Learning Repository},
  note         = {{DOI}: https://doi.org/10.24432/C5MC89}
}