# Playground S4E6 - Classification with an Academic Success Dataset

The goal of this notebook is to predict `Target`, a three-class academic outcome for each student (Graduate, Dropout, or Enrolled). The competition dataset (both train and test) was generated by a deep learning model trained on the [Predict Students' Dropout and Academic Success dataset](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success).

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

## Load Data

- `id`: A unique identifier for each student.
- `Marital status`: Marital status of the student (1 = single, 2 = married, 3 = widower, 4 = divorced, 5 = de facto union, 6 = legally separated).
- `Application mode`: Application mode for the student.
- `Application order`: Order of the application.
- `Course`: Course enrolled by the student.
- `Daytime evening attendance`: Whether the student attends daytime or evening classes. (1 = daytime, 0 = evening)
- `Previous qualification`: Previous qualification of the student.
- `Previous qualification (grade)`: Grade of the student's previous qualification.
- `Nationality`: Nationality of the student.
- `Mother qualification`: Qualification of the student's mother.
- `Father qualification`: Qualification of the student's father.
- `Mother_occupation`: Occupation of the student's mother.
- `Father_occupation`: Occupation of the student's father.
- `Displaced`: Whether the student is displaced. (1 = yes, 0 = no)
- `Educational_special_needs`: Whether the student has educational special needs. (1 = yes, 0 = no)
- `Debtor`: Whether the student is a debtor. (1 = yes, 0 = no)
- `Tuition_fees_up_to_date`: Whether the student's tuition fees are up to date. (1 = yes, 0 = no)
- `Gender`: Gender of the student (0 = female, 1 = male)
- `Scholarship_holder`: Whether the student is a scholarship holder. (1 = yes, 0 = no)
- `Age_at_enrollment`: Age of the student at enrollment.
- `International`: Whether the student is international. (1 = yes, 0 = no)
- `Curricular units 1st sem (credited)`: Number of curricular units credited in the first semester.
- `Curricular_units_1st_sem_enrolled`: Number of curricular units enrolled in the first semester.
- `Curricular_units_1st_sem_evaluations`: Number of evaluations in the first semester.
- `Curricular_units_1st_sem_approved`: Number of approved units in the first semester.
- `Curricular_units_1st_sem_grade`: Grade in the first semester.
- `Curricular units 1st sem (without evaluations)`: Number of curricular units in the first semester without evaluations.
- `Curricular units 2nd sem (credited)`: Number of curricular units credited in the second semester.
- `Curricular_units_2nd_sem_enrolled`: Number of curricular units enrolled in the second semester.
- `Curricular_units_2nd_sem_evaluations`: Number of evaluations in the second semester.
- `Curricular_units_2nd_sem_approved`: Number of approved units in the second semester.
- `Curricular_units_2nd_sem_grade`: Grade in the second semester.
- `Curricular units 2nd sem (without evaluations)`: Number of curricular units in the second semester without evaluations.
- `Unemployment_rate`: Unemployment rate.
- `Inflation_rate`: Inflation rate.
- `GDP`: Gross Domestic Product.
- `Target`: The target variable indicating the student's outcome (Graduate, Dropout, Enrolled).


In [8]:
train = pd.read_csv('/kaggle/input/playground-series-s4e6/train.csv', index_col='id')
test = pd.read_csv('/kaggle/input/playground-series-s4e6/test.csv', index_col='id')
original = pd.read_csv("/kaggle/input/playgrounds4e06originaldata/original.csv", index_col='id')
original.rename(columns={'Daytime/evening attendance\t': 'Daytime/evening attendance'}, inplace=True)
submission = pd.read_csv('/kaggle/input/playground-series-s4e6/sample_submission.csv')
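Before the competition data and the original data are concatenated, it can help to confirm that both tables expose the same column names, since a stray tab or space in a header would silently create a second, mostly empty column. The sketch below uses two tiny hypothetical frames (`train_demo`, `original_demo`, and their values are made up for illustration) rather than the real files.

```python
import pandas as pd

# Toy stand-ins for the competition and original tables (hypothetical values).
train_demo = pd.DataFrame({
    "Daytime/evening attendance": [1, 0],
    "Course": [9500, 9238],
    "Target": ["Graduate", "Dropout"],
})
original_demo = pd.DataFrame({
    "Daytime/evening attendance\t": [1],  # header with a trailing tab
    "Course": [9147],
    "Target": ["Enrolled"],
})

# Strip stray whitespace (tabs included) from every column name.
original_demo.columns = original_demo.columns.str.strip()

# Columns present in one frame but not the other; an empty set means
# the two frames can be concatenated without creating NaN-filled columns.
mismatched = set(train_demo.columns) ^ set(original_demo.columns)
print("Mismatched columns:", mismatched)

combined = pd.concat([train_demo, original_demo], ignore_index=True)
print("Missing values after concat:", int(combined.isna().sum().sum()))
```

Stripping all headers is a slightly broader fix than renaming one known column; either works as long as the symmetric difference comes back empty.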

In [9]:
X = pd.concat([train, original], ignore_index=True)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(X.pop('Target'))
num_classes = len(set(y))
X.shape, y.shape, test.shape

((80942, 36), (80942,), (51012, 36))
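`LabelEncoder` assigns integer codes in sorted order of the class names, which matters later when predicted class indices are mapped back to strings. A minimal illustration on a hand-written label array (the array is an assumption for the example, not taken from the data):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hand-written labels using the three outcome names from the dataset.
labels = np.array(["Graduate", "Dropout", "Enrolled", "Graduate"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are stored in sorted (alphabetical) order.
print(encoder.classes_)   # ['Dropout' 'Enrolled' 'Graduate']
print(encoded)            # [2 0 1 2]

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```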

## Modelling

In [10]:
params = {
    'n_estimators': 9000,
    'num_class': 3,
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'verbosity': -1,
    'random_state': 42,
    'subsample': 0.70,
    'learning_rate': 0.05,
    'max_depth': 25,
    'num_leaves': 80,
    'min_child_samples': 50,
    'min_data_per_group': 18  # only used for features declared as categorical
}
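One detail worth checking in the parameters above: in LightGBM, row subsampling (`subsample`, an alias of `bagging_fraction`) is documented as taking effect only when `subsample_freq` (alias of `bagging_freq`) is set to a non-zero value, and the default is 0. The fragment below is a hedged sketch of how bagging could be switched on; `params_demo` and `params_with_bagging` are hypothetical names, and enabling bagging would change the scores relative to the logs in this notebook.

```python
# Hypothetical, shortened copy of the parameter dictionary.
params_demo = {
    "n_estimators": 9000,
    "learning_rate": 0.05,
    "subsample": 0.70,
    "random_state": 42,
}

# Assumption: resampling rows at every boosting iteration is desired.
# subsample_freq=1 activates bagging with the fraction given by `subsample`.
params_with_bagging = {**params_demo, "subsample_freq": 1}

print(params_with_bagging)
```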

In [11]:
def train_model(X_train, y_train, X_valid, y_valid, params):
    """Fit a LightGBM classifier, stopping early on the validation log loss."""
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train,
              eval_set=[(X_valid, y_valid)],
              callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(100)]
             )
    return model

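def cross_val_train(X_train, y_train, X_test, params, n_splits=10):
    """Run K-fold CV; return fold accuracies, out-of-fold probabilities, and fold-averaged test probabilities."""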
    test_preds_avg = np.zeros((len(X_test), num_classes))
    valid_preds = np.zeros((len(X_train), num_classes))
    valid_accuracies = []

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    for fold, (train_indices, valid_indices) in enumerate(kf.split(X_train)):
        X_train_fold, y_train_fold = X_train.iloc[train_indices], y_train[train_indices]
        X_valid_fold, y_valid_fold = X_train.iloc[valid_indices], y_train[valid_indices]
        
        model = train_model(X_train_fold, y_train_fold, X_valid_fold, y_valid_fold, params)
        
        y_pred_valid = model.predict_proba(X_valid_fold)
        valid_accuracy = accuracy_score(y_valid_fold, np.argmax(y_pred_valid, axis=1))
        print(f"Fold: {fold}, Validation Accuracy: {valid_accuracy:.5f}")
        
        test_preds_avg += model.predict_proba(X_test) / n_splits
        valid_preds[valid_indices] = y_pred_valid
        valid_accuracies.append(valid_accuracy)
        print("-" * 50)
        
    return valid_accuracies, valid_preds, test_preds_avg
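The mechanics of `cross_val_train` (out-of-fold probabilities for every training row, plus test probabilities averaged over folds) can be reproduced on synthetic data. The sketch below is only an illustration: it swaps LightGBM for scikit-learn's `LogisticRegression` so that it runs without extra dependencies, and all data and names in it are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic three-class problem standing in for the real data.
X_all, y_all = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, y_tr = X_all[:240], y_all[:240]   # "training" rows
X_te = X_all[240:]                      # "test" rows (labels unused)

n_splits, n_classes = 5, 3
oof = np.zeros((len(X_tr), n_classes))       # out-of-fold probabilities
test_avg = np.zeros((len(X_te), n_classes))  # fold-averaged test probabilities

kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for tr_idx, va_idx in kf.split(X_tr):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[tr_idx], y_tr[tr_idx])
    # Each training row is predicted exactly once, by a model that never saw it.
    oof[va_idx] = model.predict_proba(X_tr[va_idx])
    # Every fold model contributes 1/n_splits of the final test prediction.
    test_avg += model.predict_proba(X_te) / n_splits

oof_accuracy = (oof.argmax(axis=1) == y_tr).mean()
print(f"Out-of-fold accuracy: {oof_accuracy:.3f}")
```

Because each averaged row is a mean of valid probability vectors, it still sums to 1, so taking `argmax` over it is a form of soft voting across the fold models.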

In [12]:
validation_accuracies, validation_predictions, test_predictions = cross_val_train(X, y, test, params)
validation_predictions_out = np.argmax(validation_predictions, axis=1)
overall_validation_accuracy = accuracy_score(y, validation_predictions_out)
print("Overall Validation Accuracy:", overall_validation_accuracy)

Training until validation scores don't improve for 50 rounds
[100]	valid_0's multi_logloss: 0.438773
[200]	valid_0's multi_logloss: 0.434197
Early stopping, best iteration is:
[247]	valid_0's multi_logloss: 0.433497
Fold: 0, Validation Accuracy: 0.83323
--------------------------------------------------
Training until validation scores don't improve for 50 rounds
[100]	valid_0's multi_logloss: 0.426415
[200]	valid_0's multi_logloss: 0.423861
Early stopping, best iteration is:
[161]	valid_0's multi_logloss: 0.423317
Fold: 1, Validation Accuracy: 0.83422
--------------------------------------------------
Training until validation scores don't improve for 50 rounds
[100]	valid_0's multi_logloss: 0.432494
[200]	valid_0's multi_logloss: 0.427308
Early stopping, best iteration is:
[241]	valid_0's multi_logloss: 0.426772
Fold: 2, Validation Accuracy: 0.83988
--------------------------------------------------
Training until validation scores don't improve for 50 rounds
[100]	valid_0's multi_lo... (output truncated)
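Overall accuracy hides how the errors are distributed across the three classes. A confusion matrix over the out-of-fold predictions is one way to look at that; the sketch below uses small hand-written arrays (not results from this notebook) with the integer codes that `LabelEncoder` would assign: 0 = Dropout, 1 = Enrolled, 2 = Graduate.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["Dropout", "Enrolled", "Graduate"]

# Hand-written example labels and predictions (illustrative only).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class recall: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
for name, value in zip(class_names, recall):
    print(f"{name}: recall = {value:.2f}")
```

In the notebook itself, the same call could be applied to `y` and `validation_predictions_out`.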

In [14]:
test_preds_labels = np.argmax(test_predictions, axis=1)
test_preds_labels = label_encoder.inverse_transform(test_preds_labels)

## Submission

In [16]:
submission['Target'] = test_preds_labels
submission.to_csv("submission.csv", index=False)
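A quick sanity check on the written file can catch mismatched lengths or unexpected label values before uploading. Note that assigning `test_preds_labels` directly assumes the rows of `test.csv` and `sample_submission.csv` are in the same order. The sketch below works on a tiny hypothetical submission frame (the `id` values and predictions are made up) and writes to an in-memory buffer instead of disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample_submission.csv.
sample = pd.DataFrame({"id": [1001, 1002, 1003], "Target": ["Graduate"] * 3})

# Hypothetical decoded predictions, one per row of the sample file.
predicted = ["Dropout", "Graduate", "Enrolled"]
assert len(predicted) == len(sample), "prediction count must match the sample file"
sample["Target"] = predicted

# Round-trip through CSV to check exactly what would be uploaded.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

valid_labels = {"Dropout", "Enrolled", "Graduate"}
columns_ok = list(check.columns) == ["id", "Target"]
labels_ok = set(check["Target"]) <= valid_labels
no_missing = not check.isna().any().any()
print(columns_ok, labels_ok, no_missing)
```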