
### About Data

This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, socioeconomic factors, and academic performance information that can be used to analyze possible predictors of student dropout and academic success.

Feature column names: Marital status, Application mode, Application order, Course, Daytime/evening attendance, Previous qualification, Nationality, Mother's qualification, Father's qualification, Mother's occupation, Father's occupation, Displaced, Educational special needs, Debtor, Tuition fees up to date, Gender, Scholarship holder, Age at enrollment, International, Curricular units 1st sem (credited), Curricular units 1st sem (enrolled), Curricular units 1st sem (evaluations), Curricular units 1st sem (approved)

Note: Only the features relevant to the case study are selected.

Target column name: Target


#### Read data from dataset

In [29]:
import pandas as pd
df = pd.read_csv("dataset.csv")
df.head(5)

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,12,1,3,0,1,1,22,28,10,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate
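
The `Unnamed: 0` column in the output above is what pandas produces when a CSV was saved together with its row index. The notebook keeps this column, so it later ends up among the features. As a minimal sketch of an alternative, assuming the first CSV column really is just a saved index, `index_col=0` reads it as the DataFrame index instead. The CSV text below is a toy stand-in, not the real `dataset.csv`.

```python
import io
import pandas as pd

# Toy CSV standing in for dataset.csv (hypothetical content);
# the first, unnamed column holds the saved row index.
csv_text = ",Marital status,Target\n0,1,Dropout\n1,1,Graduate\n2,2,Enrolled\n"

# Default read: the unnamed column becomes a regular column called 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Whether to drop the column is a modelling decision; a row counter usually carries no predictive meaning, but removing it would change the scores reported later in this notebook.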


Data pre-processing: data cleaning, scaling, etc.


In [30]:
# Drop the columns that are not relevant to the case study
df = df.drop(columns=['Unemployment rate', 'Inflation rate', 'GDP'])

# Rename the misspelled column
df.rename(columns={'Nacionality': 'Nationality'}, inplace=True)

# Convert the string labels to numerical values
df['Target'] = df['Target'].map({'Dropout': 0, 'Enrolled': 1, 'Graduate': 2})

# Check for null values (not displayed here, because a cell only shows its last expression)
df.isnull().sum()

df.head(5)


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nationality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0.0,0,0,0,0,0,0.0,0,0
1,1,6,1,11,1,1,1,1,3,4,...,6,14.0,0,0,6,6,6,13.666667,0,2
2,1,1,5,5,1,1,1,22,27,10,...,0,0.0,0,0,6,0,0,0.0,0,0
3,1,8,2,15,1,1,1,23,27,6,...,6,13.428571,0,0,6,10,5,12.4,0,2
4,2,12,1,3,0,1,1,22,28,10,...,5,12.333333,0,0,6,6,6,13.0,0,2
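
`Series.map` with a dictionary returns `NaN` for any value that is not a key of the dictionary, so a misspelled or unexpected label in `Target` would silently turn into a missing value. The sketch below shows one way to check for that after the mapping. The toy frame and the `Unknown` label are hypothetical and only serve to trigger the case.

```python
import pandas as pd

# Toy frame; 'Unknown' is a hypothetical label that is not in the mapping.
toy = pd.DataFrame({"Target": ["Dropout", "Enrolled", "Graduate", "Unknown"]})
label_map = {"Dropout": 0, "Enrolled": 1, "Graduate": 2}

encoded = toy["Target"].map(label_map)

# Labels missing from label_map become NaN, so count them explicitly.
n_unmapped = int(encoded.isnull().sum())
print(n_unmapped)  # 1
```

On the real data the same count should be zero if the three labels are the only ones present.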


In [31]:
# Remove low-correlation features from the DataFrame
df = df.drop(columns=['Nationality', "Mother's qualification", 'Educational special needs', "Father's qualification", 'International', 'Curricular units 1st sem (without evaluations)'])
df

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Mother's occupation,Father's occupation,Displaced,Debtor,...,Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Target
0,1,8,5,2,1,1,6,10,1,0,...,0,0,0.000000,0,0,0,0,0.000000,0,0
1,1,6,1,11,1,1,4,4,1,0,...,6,6,14.000000,0,6,6,6,13.666667,0,2
2,1,1,5,5,1,1,10,10,1,0,...,0,0,0.000000,0,6,0,0,0.000000,0,0
3,1,8,2,15,1,1,6,4,1,0,...,8,6,13.428571,0,6,10,5,12.400000,0,2
4,2,12,1,3,0,1,10,10,0,0,...,9,5,12.333333,0,6,6,6,13.000000,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,15,1,1,6,5,0,0,...,7,5,13.600000,0,6,8,5,12.666667,0,2
4420,1,1,2,15,1,1,10,10,1,1,...,6,6,12.000000,0,6,6,2,11.000000,0,0
4421,1,1,1,12,1,1,10,10,1,0,...,8,7,14.912500,0,8,9,1,13.500000,0,0
4422,1,1,1,9,1,1,8,5,1,0,...,5,5,13.800000,0,5,6,5,12.000000,0,2
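
The notebook does not show how the low-correlation features were identified. One plausible approach, sketched here under the assumption that absolute Pearson correlation with `Target` was the criterion, is to compute that correlation for every feature and drop those below a cut-off. The toy data, the column names `informative` and `noise`, and the `threshold` value are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 3, size=n)

# Toy data: one feature that tracks the target and one that is pure noise.
toy = pd.DataFrame({
    "informative": target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "Target": target,
})

# Absolute Pearson correlation of every feature with the target.
corr = toy.corr()["Target"].drop("Target").abs()

threshold = 0.2  # hypothetical cut-off, not taken from the notebook
low_corr = corr[corr < threshold].index.tolist()

reduced = toy.drop(columns=low_corr)
print(low_corr)
print(list(reduced.columns))
```

Pearson correlation only captures linear relationships, and most columns in this dataset are integer-coded categories, so a low value does not prove a feature is uninformative.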


In [47]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier


dt = DecisionTreeClassifier(random_state=0)
rf = RandomForestClassifier(random_state=2)
lo = LogisticRegression(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0)
# 'gpu_hist' needs a CUDA-capable GPU; XGBoost 2.0+ deprecates it in favour of tree_method='hist' with device='cuda'
xbc = XGBClassifier(tree_method='gpu_hist')
# SVC is imported directly so that the variable `svm` does not shadow the sklearn.svm module when the cell is re-run
svm = SVC(kernel='linear', probability=True)

X = df.drop('Target', axis=1)
y = df['Target']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)


dt.fit(X_train,y_train)
rf.fit(X_train,y_train)
lo.fit(X_train,y_train)
knn.fit(X_train,y_train)
abc.fit(X_train, y_train)
xbc.fit(X_train, y_train)
svm.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
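
The warning above comes from `LogisticRegression`: its solver hit the default iteration limit before converging, and the message itself suggests raising `max_iter` or scaling the data. A minimal sketch of both suggestions combined is shown below, using a scikit-learn pipeline so the scaler is fitted on the training split only. The synthetic data from `make_classification`, the `max_iter=1000` value, and the name `scaled_lo` are assumptions for illustration, not part of the notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the student dataset.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Standardize the features, then fit logistic regression with a higher iteration limit.
scaled_lo = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Record any warnings raised during fitting so convergence can be checked.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lo.fit(X_tr, y_tr)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print(n_convergence_warnings)
print(scaled_lo.score(X_te, y_te))
```

Scaling would also affect the KNeighbors and SVM models, which are distance- or margin-based, so applying it changes the comparison reported below and the scores would need to be recomputed.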


In [55]:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score,confusion_matrix,roc_auc_score
import numpy as np



# Initialize a dictionary to hold the evaluation metrics
metrics_dict = {
    "Model": [],
    "Accuracy": [],
    "F1 Score": [],
    "Precision": [],
    "Recall": [],
    "confusion matrix":[]
}

# Define a function to evaluate and store the metrics
def evaluate_model(model, model_name, X_test, y_test, metrics_dict):
    y_pred = model.predict(X_test)
    metrics_dict["Model"].append(model_name)
    # Accuracy is stored as a percentage; the other metrics stay in the 0-1 range
    metrics_dict["Accuracy"].append(accuracy_score(y_test, y_pred) * 100)
    metrics_dict["F1 Score"].append(f1_score(y_test, y_pred, average='weighted'))
    metrics_dict["Precision"].append(precision_score(y_test, y_pred, average='weighted'))
    metrics_dict["Recall"].append(recall_score(y_test, y_pred, average='weighted'))
    metrics_dict["confusion matrix"].append(confusion_matrix(y_test, y_pred))

# Assuming dt, rf, lo, knn, abc, xbc, and svm are already defined and trained models
evaluate_model(dt, "Decision Tree", X_test, y_test, metrics_dict)
evaluate_model(rf, "Random Forest", X_test, y_test, metrics_dict)
evaluate_model(lo, "Logistic Regression", X_test, y_test, metrics_dict)
evaluate_model(knn, "KNeighbors", X_test, y_test, metrics_dict)
evaluate_model(abc, "AdaBoost", X_test, y_test, metrics_dict)
evaluate_model(xbc, "XGBoost", X_test, y_test, metrics_dict)
evaluate_model(svm, "SVM", X_test, y_test, metrics_dict)

# Convert the dictionary to a DataFrame
metrics_df = pd.DataFrame(metrics_dict)

# Display the DataFrame
print(metrics_df)


                 Model   Accuracy  F1 Score  Precision    Recall  \
0        Decision Tree  69.717514  0.702120   0.708549  0.697175   
1        Random Forest  79.322034  0.783529   0.782652  0.793220   
2  Logistic Regression  78.079096  0.765749   0.764116  0.780791   
3           KNeighbors  69.378531  0.683683   0.677123  0.693785   
4             AdaBoost  77.627119  0.768653   0.764695  0.776271   
5              XGBoost  78.983051  0.783210   0.780088  0.789831   
6                  SVM  76.949153  0.757834   0.756862  0.769492   

                               confusion matrix  
0  [[206, 44, 34], [44, 65, 42], [41, 63, 346]]  
1  [[219, 26, 39], [33, 65, 53], [15, 17, 418]]  
2  [[218, 24, 42], [43, 52, 56], [12, 17, 421]]  
3  [[201, 35, 48], [50, 42, 59], [45, 34, 371]]  
4  [[220, 32, 32], [39, 61, 51], [17, 27, 406]]  
5  [[221, 29, 34], [40, 67, 44], [12, 27, 411]]  
6   [[200, 39, 45], [38, 57, 56], [8, 18, 424]]  
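
The weighted averages above hide how differently the three classes behave. As a worked example, the sketch below recomputes accuracy and derives per-class recall and precision directly from the Random Forest confusion matrix printed in the table. It relies on the scikit-learn convention that rows are true classes and columns are predicted classes, and on the label encoding used earlier (0 = Dropout, 1 = Enrolled, 2 = Graduate).

```python
import numpy as np

# Random Forest confusion matrix copied from the table above
# (rows = true class, columns = predicted class).
cm = np.array([
    [219, 26, 39],   # true Dropout
    [33, 65, 53],    # true Enrolled
    [15, 17, 418],   # true Graduate
])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row sums (all samples that truly belong to the class).
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision per class: diagonal over column sums (all samples predicted as the class).
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy * 100, 2))  # 79.32
print(recall_per_class.round(3))
print(precision_per_class.round(3))
```

The recomputed accuracy matches the 79.32 reported for Random Forest. Recall for the Enrolled class is 65/151, roughly 0.43, far below the other two classes, which the weighted scores alone do not reveal.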
