# Mini LA Assignment 1 - Classification and Prediction
#### **Authors**: Yutong Shen, Jingfei Chen, Yiran Wang, Simon Chen

**Tasks:**

- Use multiple algorithms to attempt to predict which students drop out of courses
- Evaluate the performance of the algorithms
- Discuss the advantages/disadvantages of each algorithm

**Following the task instructions, in Mini LA 1, we defined our classifier as follows:**

**Feature Selection:** 
* We performed feature selection using *Extra Trees Classifier* and chose multiple features to include in our feature set: "student_id", "years", "courses_taken", "enroll_date_time", and "course_id".

**Algorithms:** Logistic Regression, Support Vector Machine, Naive Bayes

## Problem & Purpose
During registration season, many universities face the dilemma of students over-enrolling in courses at the beginning of the semester and then dropping most of them once they finish adjusting their class schedules. This makes it difficult to plan for the semester and allocate resources. Without limiting students' choice of courses, we decided to predict which students are likely to drop out of which courses and to use these predictions as a reference for future curriculum planning.

## Literature Review
One similar study we found built a dropout early warning system that enabled schools to identify students at risk of dropping out and to take timely action to help them continue their studies (Lee & Chung, 2019). The researchers used attendance, behavior, and course performance as the key indicators for their predictions. For the model implementation, they chose four classifiers: random forest (RF), boosted decision tree (BDT), random forest with SMOTE (SMOTE + RF), and boosted decision tree with SMOTE (SMOTE + BDT). To evaluate classification performance, they used the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. Because the dataset was imbalanced, the PR curves were more informative than the ROC curves. The boosted decision tree performed best, with an AUC of 0.898.

## Feature Selection
The Extra Trees (extremely randomized trees) classifier is an ensemble learning technique that combines the predictions of many decision trees.

### 1. Import Training Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# read the data
data = pd.read_csv("drop-out.csv")
data

| | student_id | years | entrance_test_score | courses_taken | complete | enroll_date_time | course_id | international | online | gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172777 | 0 | 47.0 | 4 | yes | 159227767 | 807728 | no | no | 1 |
| 1 | 172777 | 0 | 47.0 | 4 | yes | 159227782 | 658434 | no | no | 1 |
| 2 | 172777 | 0 | 47.0 | 4 | yes | 159227866 | 658463 | no | no | 1 |
| 3 | 172777 | 0 | 47.0 | 4 | yes | 159227948 | 658498 | no | no | 1 |
| 4 | 175658 | 0 | 92.8 | 22 | yes | 157446419 | 807728 | no | no | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5856 | 295097 | 0 | 0.0 | 3 | yes | 159550455 | 807728 | no | no | 2 |
| 5857 | 295097 | 0 | 0.0 | 3 | no | 159550594 | 658434 | no | no | 2 |
| 5858 | 295097 | 0 | 0.0 | 3 | no | 159551590 | 658463 | no | no | 2 |
| 5859 | 299198 | 0 | 0.0 | 2 | yes | 160838118 | 807728 | no | no | 2 |


### 2. Descriptive Analysis

In [None]:
# get some descriptive data 
data['complete'].value_counts()

yes    4137
no     1724
Name: complete, dtype: int64

In [None]:
data.describe()

| | student_id | years | entrance_test_score | courses_taken | enroll_date_time | course_id | gender |
|---|---|---|---|---|---|---|---|
| count | 5861.0 | 5861.0 | 5861.0 | 5861.0 | 5861.0 | 5861.0 | 5861.0 |
| mean | 252955.885856 | 0.643917 | 11.429688 | 16.417847 | 156255700.0 | 697634.106808 | 1.664904 |
| std | 26979.616759 | 1.535799 | 19.173021 | 12.326306 | 3630621.0 | 65677.872812 | 0.843351 |
| min | 172777.0 | 0.0 | 0.0 | 1.0 | 149369500.0 | 658434.0 | 1.0 |
| 25% | 235106.0 | 0.0 | 0.0 | 7.0 | 153066300.0 | 658439.0 | 1.0 |
| 50% | 247301.0 | 0.0 | 1.7 | 13.0 | 156135400.0 | 658467.0 | 2.0 |
| 75% | 282338.0 | 0.0 | 19.1 | 22.0 | 159923600.0 | 807717.0 | 2.0 |
| max | 299198.0 | 10.0 | 121.7 | 60.0 | 161511100.0 | 807758.0 | 5.0 |


### 3. Data Processing

The columns international, online, and complete hold yes/no strings, so we need to convert each of them into a 0/1 dummy variable.

In [None]:
# create a dummy variable for international
dummyInternational = pd.get_dummies(data['international'], prefix = 'international') 
data = pd.concat([data, dummyInternational], axis=1)
data = data.drop(['international', 'international_no'], axis=1)
data

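| | student_id | years | entrance_test_score | courses_taken | complete | enroll_date_time | course_id | online | gender | international_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172777 | 0 | 47.0 | 4 | yes | 159227767 | 807728 | no | 1 | 0 |
| 1 | 172777 | 0 | 47.0 | 4 | yes | 159227782 | 658434 | no | 1 | 0 |
| 2 | 172777 | 0 | 47.0 | 4 | yes | 159227866 | 658463 | no | 1 | 0 |
| 3 | 172777 | 0 | 47.0 | 4 | yes | 159227948 | 658498 | no | 1 | 0 |
| 4 | 175658 | 0 | 92.8 | 22 | yes | 157446419 | 807728 | no | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5856 | 295097 | 0 | 0.0 | 3 | yes | 159550455 | 807728 | no | 2 | 0 |
| 5857 | 295097 | 0 | 0.0 | 3 | no | 159550594 | 658434 | no | 2 | 0 |
| 5858 | 295097 | 0 | 0.0 | 3 | no | 159551590 | 658463 | no | 2 | 0 |
| 5859 | 299198 | 0 | 0.0 | 2 | yes | 160838118 | 807728 | no | 2 | 0 |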


In [None]:
# create a dummy variable for online
dummyOnline = pd.get_dummies(data['online'], prefix = 'online') 
data = pd.concat([data, dummyOnline], axis=1)
data = data.drop(['online', 'online_no'], axis=1)
data

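| | student_id | years | entrance_test_score | courses_taken | complete | enroll_date_time | course_id | gender | international_yes | online_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172777 | 0 | 47.0 | 4 | yes | 159227767 | 807728 | 1 | 0 | 0 |
| 1 | 172777 | 0 | 47.0 | 4 | yes | 159227782 | 658434 | 1 | 0 | 0 |
| 2 | 172777 | 0 | 47.0 | 4 | yes | 159227866 | 658463 | 1 | 0 | 0 |
| 3 | 172777 | 0 | 47.0 | 4 | yes | 159227948 | 658498 | 1 | 0 | 0 |
| 4 | 175658 | 0 | 92.8 | 22 | yes | 157446419 | 807728 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5856 | 295097 | 0 | 0.0 | 3 | yes | 159550455 | 807728 | 2 | 0 | 0 |
| 5857 | 295097 | 0 | 0.0 | 3 | no | 159550594 | 658434 | 2 | 0 | 0 |
| 5858 | 295097 | 0 | 0.0 | 3 | no | 159551590 | 658463 | 2 | 0 | 0 |
| 5859 | 299198 | 0 | 0.0 | 2 | yes | 160838118 | 807728 | 2 | 0 | 0 |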


In [None]:
# create a dummy variable for complete
dummyComplete = pd.get_dummies(data['complete'], prefix = 'complete') 
data = pd.concat([data, dummyComplete], axis=1)
data = data.drop(['complete', 'complete_no'], axis=1)
data

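| | student_id | years | entrance_test_score | courses_taken | enroll_date_time | course_id | gender | international_yes | online_yes | complete_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172777 | 0 | 47.0 | 4 | 159227767 | 807728 | 1 | 0 | 0 | 1 |
| 1 | 172777 | 0 | 47.0 | 4 | 159227782 | 658434 | 1 | 0 | 0 | 1 |
| 2 | 172777 | 0 | 47.0 | 4 | 159227866 | 658463 | 1 | 0 | 0 | 1 |
| 3 | 172777 | 0 | 47.0 | 4 | 159227948 | 658498 | 1 | 0 | 0 | 1 |
| 4 | 175658 | 0 | 92.8 | 22 | 157446419 | 807728 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5856 | 295097 | 0 | 0.0 | 3 | 159550455 | 807728 | 2 | 0 | 0 | 1 |
| 5857 | 295097 | 0 | 0.0 | 3 | 159550594 | 658434 | 2 | 0 | 0 | 0 |
| 5858 | 295097 | 0 | 0.0 | 3 | 159551590 | 658463 | 2 | 0 | 0 | 0 |
| 5859 | 299198 | 0 | 0.0 | 2 | 160838118 | 807728 | 2 | 0 | 0 | 1 |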
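
The three encoding cells above repeat the same pattern, so they could be condensed into one loop. The following is a minimal sketch on a made-up three-row frame, not the assignment data; the names `toy`, `binary_cols`, and `encoded` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same yes/no columns as the dataset (values are made up)
toy = pd.DataFrame({
    "international": ["no", "yes", "no"],
    "online": ["no", "no", "yes"],
    "complete": ["yes", "no", "yes"],
})

# Map each yes/no column to 1/0 and name it with the *_yes convention
binary_cols = ["international", "online", "complete"]
encoded = toy.copy()
for col in binary_cols:
    encoded[f"{col}_yes"] = (encoded[col] == "yes").astype(int)

# Drop the original string columns once their 0/1 versions exist
encoded = encoded.drop(columns=binary_cols)

print(encoded)
```

Comparing against the string "yes" directly yields the same 0/1 columns as dropping the `*_no` dummy, without the intermediate `concat` and `drop` steps.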


### 4. Feature Selection

We performed feature selection using the Extra Trees Classifier because it is fast and computationally cheap. At each split point of a decision tree, it considers a random subset of the features and draws the candidate split thresholds at random.

In [None]:
# Feature importance with Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

# Candidate features: every column except the target
X = data[['student_id', 'years', 'entrance_test_score', 'courses_taken', 'enroll_date_time', 'course_id',
          'gender', 'international_yes', 'online_yes']]
# Target: 1 if the student completed the course, 0 if they dropped it
y = np.ravel(data[['complete_yes']])

model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, y)
print(model.feature_importances_)

[0.07832154 0.39335218 0.03245077 0.09725056 0.15067535 0.2122657
 0.019558   0.00811721 0.00800868]


The computation above gave us an importance score for each attribute, in the same order as the columns of X; the larger the score, the more important the attribute. The five highest-scoring attributes were years (0.39), course_id (0.21), enroll_date_time (0.15), courses_taken (0.10), and student_id (0.08), so we selected these as the features for our model.
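
The printed array is easier to read when each score is paired with its column name. The sketch below copies the scores from the output above and labels them in the order the columns were passed to the model; the variable names `feature_names`, `scores`, `importances`, and `top5` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Column order used when fitting the Extra Trees model above
feature_names = ["student_id", "years", "entrance_test_score", "courses_taken",
                 "enroll_date_time", "course_id", "gender",
                 "international_yes", "online_yes"]

# Importance scores copied from the printed output above
scores = np.array([0.07832154, 0.39335218, 0.03245077, 0.09725056, 0.15067535,
                   0.2122657, 0.019558, 0.00811721, 0.00800868])

# Label each score with its feature and sort from most to least important
importances = pd.Series(scores, index=feature_names).sort_values(ascending=False)
print(importances)

# The five highest-scoring features
top5 = importances.head(5).index.tolist()
print(top5)
```

In the notebook itself, the same table could presumably be built directly with `pd.Series(model.feature_importances_, index=X.columns)`. Because the model is fitted without a fixed `random_state`, the exact scores may change slightly from run to run.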

In [None]:
X = data[['student_id', 'years', 'courses_taken', 'enroll_date_time', 'course_id']]
y = np.ravel(data[['complete_yes']])

### 5. Model Implementation & Validation

For classifier selection, we applied three approaches for comparison: logistic regression, support vector machine, and Naive Bayes.

- Logistic regression is the most common approach for classification and is easy to compute.
- Support vector machine is a margin-based algorithm that is often used for comparison with logistic regression. With the RBF kernel that scikit-learn's SVC uses by default, it can also learn non-linear boundaries.
- Naive Bayes is a probabilistic classification algorithm that learns how each feature is distributed given the label.

We compare the accuracy of the three approaches at the end.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training sets and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
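
The split above is reshuffled on every run, so the reported accuracies will vary slightly between executions. A minimal sketch of a reproducible, stratified split follows. It uses synthetic data, and the names `X_demo`, `y_demo` and the seed 42 are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1,000 rows with roughly 70% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.7).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The share of positive labels should be nearly identical in both parts
print(y_tr.mean(), y_te.mean())
```

Stratifying matters here because the target is imbalanced (4137 completions against 1724 drops), so an unlucky random split could shift the class ratio between the training and validation sets.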

**5.1. Implementing Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

# Load LogisticRegression() and call for LogisticRegression.fit() 
LogitModel = LogisticRegression(max_iter=2000).fit(X_train, y_train)
y_pred = LogitModel.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

# Compare the predicted Ys with what is actually in the testing dataset and obtain the confusion matrix
print(confusion_matrix(y_test, y_pred))

[[  29  485]
 [  42 1203]]


In [None]:
from sklearn.metrics import accuracy_score

# Obtaining accuracy scores
ac_logit = accuracy_score(y_test, y_pred)
print("The accuracy for logistic regression in sklearn is", ac_logit*100, "%")

The accuracy for logistic regression in sklearn is 70.03979533826038 %


In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred, zero_division=0))

              precision    recall  f1-score   support

           0       0.41      0.06      0.10       514
           1       0.71      0.97      0.82      1245

    accuracy                           0.70      1759
   macro avg       0.56      0.51      0.46      1759
weighted avg       0.62      0.70      0.61      1759
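
The figures in the classification report can be recovered by hand from the confusion matrix printed earlier, which also shows how the model reaches 70% accuracy while finding very few dropouts. The sketch below copies the matrix from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[29, 485],
               [42, 1203]])

tn, fp = cm[0]   # true class 0 (did not complete)
fn, tp = cm[1]   # true class 1 (completed)

accuracy = (tn + tp) / cm.sum()
recall_0 = tn / (tn + fp)        # share of non-completers the model caught
precision_0 = tn / (tn + fn)     # share of "0" predictions that were right
recall_1 = tp / (tp + fn)
precision_1 = tp / (tp + fp)
majority_rate = (fn + tp) / cm.sum()   # accuracy of always predicting 1

print(round(accuracy, 4), round(recall_0, 2), round(precision_0, 2))
print(round(recall_1, 2), round(precision_1, 2), round(majority_rate, 4))
```

The model labels only 71 of the 1759 test rows as dropouts and catches 29 of the 514 actual ones, so its accuracy comes almost entirely from predicting the majority class.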



**5.2. Implementing Support Vector Machine**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svcModel = make_pipeline(StandardScaler(), SVC(gamma='auto')).fit(X_train, y_train)
y_pred = svcModel.predict(X_test)

In [None]:
# Obtaining accuracy scores
ac_svc = accuracy_score(y_test, y_pred)
print("The accuracy for support vector classification in sklearn is", ac_svc*100, "%")

The accuracy for support vector classification in sklearn is 87.15179079022171 %


**5.3. Implementing Naive Bayes**

In [None]:
from sklearn.naive_bayes import GaussianNB

NBModel = GaussianNB(priors=[0.5, 0.5]).fit(X_train, y_train)
y_pred = NBModel.predict(X_test)

In [None]:
# Obtaining accuracy scores
ac_nb = accuracy_score(y_test, y_pred)
print("The accuracy for Naive Bayes in sklearn is", ac_nb*100, "%")

The accuracy for Naive Bayes in sklearn is 55.713473564525295 %


Based on the accuracy scores, we got 70.04% accuracy using logistic regression, 87.15% using support vector machine, and 55.71% using Naive Bayes. The support vector machine performed far better than the other two models at predicting which students drop out of which courses. For reference, always predicting completion would score about 70.8% on this test set (1245 of 1759 rows), so logistic regression did not beat the majority-class baseline.
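
Because about 70% of the rows are completions, raw accuracy can flatter a model that mostly predicts the majority class. One way to make the comparison more robust is to score every model, plus a majority-class baseline, with cross-validated balanced accuracy. The following is a minimal sketch on synthetic data, not a rerun of our analysis: the data set, the shared scaling step, and the choice of metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the dropout data (about 70% class 1)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.3, 0.7], random_state=0,
)

# Every real model gets the same scaling step so the comparison is like for like
models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "support vector machine": make_pipeline(StandardScaler(), SVC(gamma="auto")),
    "naive Bayes": make_pipeline(StandardScaler(), GaussianNB()),
}

# 5-fold cross-validated balanced accuracy; a majority-only model scores 0.5
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Balanced accuracy averages the recall of the two classes, so a model has to find dropouts as well as completions to score above 0.5.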

## Discussion

In this assignment, our team utilized three kinds of algorithms: Logistic Regression, Support Vector Machine, and Naive Bayes.

- Logistic regression is the most common approach for classification and is easy to compute. It is a relatively conservative algorithm and a good choice for setting up a baseline to compare against. It works well when the relationship between the features and the log-odds of the outcome is roughly linear.
- Support vector machine is a margin-based algorithm that is often used for comparison with logistic regression. It goes a step further than logistic regression by looking for the boundary that separates the groups with the widest margin. With the RBF kernel that scikit-learn's SVC uses by default, that boundary can be non-linear.
- Naive Bayes is a probabilistic classification algorithm that learns how each feature is distributed given the label, then infers labels from features. It works very quickly and works well with categorical variables, but it faces the zero-probability problem when the training sample is limited: a feature value that never appears with a label in training gets zero probability for that label. Thus, the labels may not be associated with all the features.

As reported above, support vector machine reached 87.15% accuracy, compared with 70.04% for logistic regression and 55.71% for Naive Bayes. Several factors might explain the gap. Among our selected features ("student_id", "years", "courses_taken", "enroll_date_time", and "course_id"), student_id and course_id are identifiers rather than quantities, so treating them as numbers might add noise to the data, and enroll_date_time is stored as a raw numeric code. The support vector machine was also the only model trained on standardized features (through StandardScaler), which may have helped given how different the feature scales are. The low accuracy of Naive Bayes might be due to the zero-probability problem, where labels are not associated with the relevant features. Forcing equal class priors (priors=[0.5, 0.5]) on a dataset that is roughly 70% completions may also have contributed.
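
One setting in the Naive Bayes cell that may matter is `priors=[0.5, 0.5]`, which overrides the class frequencies the model would otherwise estimate from the training labels. The sketch below illustrates the effect on synthetic data with a 30/70 class split. It is not the assignment data, and the names `nb_learned` and `nb_equal` are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic, overlapping classes with a 30/70 split (not the assignment data)
rng = np.random.default_rng(0)
n0, n1 = 300, 700
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(n0, 2)),
                    rng.normal(1.0, 1.0, size=(n1, 2))])
y_demo = np.array([0] * n0 + [1] * n1)

# Default: class priors are estimated from the label frequencies
nb_learned = GaussianNB().fit(X_demo, y_demo)
# Forced equal priors, as in the Naive Bayes cell above
nb_equal = GaussianNB(priors=[0.5, 0.5]).fit(X_demo, y_demo)

pred_learned = nb_learned.predict(X_demo)
pred_equal = nb_equal.predict(X_demo)

# Equal priors shift predictions toward the minority class (label 0)
print("predicted 0 with learned priors:", int((pred_learned == 0).sum()))
print("predicted 0 with equal priors:  ", int((pred_equal == 0).sum()))
print("accuracy, learned priors:", (pred_learned == y_demo).mean())
print("accuracy, equal priors:  ", (pred_equal == y_demo).mean())
```

With equal priors the model labels at least as many rows as class 0 as it does with learned priors. This can raise recall on dropouts, but it can also lower overall accuracy when most rows belong to class 1.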

## Reference
Lee, S., & Chung, J. Y. (2019). The machine learning-based dropout early warning system for improving the performance of dropout prediction. Applied Sciences, 9(15), 3093.