
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail or pass. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a binary classification problem since the output for prediction (i.e: 'passed') has 2 values 'Yes' and 'No'. 

### Question-2
load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Read student data
data = pd.read_csv('./datasets/student-data.csv')
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# Calculate number of students
n_students = data.shape[0]

In [4]:
# Calculate number of features
n_features = data.shape[1]

In [5]:
# Calculate passing students
n_passed = len(data[data['passed']=='yes'])

In [6]:
# Calculate failing students
n_failed = len(data[data['passed']=='no'])

In [7]:
# Calculate graduation rate
grad_rate = round((n_passed/(n_students))*100, 2)

In [8]:
# Print the results
print(f'The total number of students, {n_students}')
print(f'The total number of features for each student, {n_features}')
print(f'The number of those students who passed, {n_passed}')
print(f'The number of those students who failed, {n_failed}')
print(f'The graduation rate of the class, {grad_rate}, in percent (%).')

The total number of students, 395
The total number of features for each student, 31
The number of those students who passed, 265
The number of those students who failed, 130
The graduation rate of the class, 67.09, in percent (%).


## Preparing the Data
you will prepare the data for modeling, training and testing.

### Question-4 Identify feature and target columns


separate the student data into feature and target columns to see if any features are non-numeric.

In [9]:
# Extract feature columns

In [10]:
data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],
      dtype='object')

In [11]:
features = ['school', 'sex', 'age', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences']

Address column is removed as it has no effect in calculation

In [12]:
# Extract target column 'passed'

In [13]:
target = ['passed']

In [14]:
# Separate the data into feature data and target data (X and y, respectively)

In [15]:
X = data[features]
y = data[target]
print('Feature size :: ', X.shape)

Feature size ::  (395, 29)


In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 29 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   famsize     395 non-null    object
 4   Pstatus     395 non-null    object
 5   Medu        395 non-null    int64 
 6   Fedu        395 non-null    int64 
 7   Mjob        395 non-null    object
 8   Fjob        395 non-null    object
 9   reason      395 non-null    object
 10  guardian    395 non-null    object
 11  traveltime  395 non-null    int64 
 12  studytime   395 non-null    int64 
 13  failures    395 non-null    int64 
 14  schoolsup   395 non-null    object
 15  famsup      395 non-null    object
 16  paid        395 non-null    object
 17  activities  395 non-null    object
 18  nursery     395 non-null    object
 19  higher      395 non-null    object
 20  internet  

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [17]:
non_numeric_features = X.select_dtypes(include = ['object'])

In [18]:
non_numeric_features.nunique()

school        2
sex           2
famsize       2
Pstatus       2
Mjob          5
Fjob          5
reason        4
guardian      3
schoolsup     2
famsup        2
paid          2
activities    2
nursery       2
higher        2
internet      2
romantic      2
dtype: int64

In [19]:
non_numeric_features.columns

Index(['school', 'sex', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic'],
      dtype='object')

In [20]:
binary_columns = ['school', 'sex','famsize', 'Pstatus', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery','higher', 
                  'internet', 'romantic']
one_hot_columns = ['Mjob', 'Fjob', 'reason', 'guardian']

In [21]:
#Binary Conversion
one_binary_value = []
for column in binary_columns:
    one_binary_value.append(X[column][0]) #adding the first value in each binary column to list
print(one_binary_value)
pd.options.mode.chained_assignment = None 
# Convert columns to Binary
for column, first_value in zip(binary_columns, one_binary_value):
    X[column] = (X[column] == first_value).astype(int)
y['passed'] = (y['passed'] == 'yes').astype(int)

['GP', 'F', 'GT3', 'A', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no']


In [22]:
# OneHot encoding
X = pd.get_dummies(data = X, columns = one_hot_columns)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   school             395 non-null    int32
 1   sex                395 non-null    int32
 2   age                395 non-null    int64
 3   famsize            395 non-null    int32
 4   Pstatus            395 non-null    int32
 5   Medu               395 non-null    int64
 6   Fedu               395 non-null    int64
 7   traveltime         395 non-null    int64
 8   studytime          395 non-null    int64
 9   failures           395 non-null    int64
 10  schoolsup          395 non-null    int32
 11  famsup             395 non-null    int32
 12  paid               395 non-null    int32
 13  activities         395 non-null    int32
 14  nursery            395 non-null    int32
 15  higher             395 non-null    int32
 16  internet           395 non-null    int32
 17  romantic        

### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. you will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [23]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=40, test_size=95)

In [24]:
# Show the results of the split
print(f'X_train = {X_train.shape}')
print(f'y_train = {y_train.shape}')
print(f'X_test = {X_test.shape}')
print(f'y_test = {y_test.shape}')

X_train = (300, 42)
y_train = (300, 1)
X_test = (95, 42)
y_test = (95, 1)


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

In [25]:
corr = X.corr()
corr[corr>0.5].notna().sum()

school               1
sex                  1
age                  1
famsize              1
Pstatus              1
Medu                 2
Fedu                 2
traveltime           1
studytime            1
failures             1
schoolsup            1
famsup               1
paid                 1
activities           1
nursery              1
higher               1
internet             1
romantic             1
famrel               1
freetime             1
goout                1
Dalc                 2
Walc                 2
health               1
absences             1
Mjob_at_home         1
Mjob_health          1
Mjob_other           1
Mjob_services        1
Mjob_teacher         1
Fjob_at_home         1
Fjob_health          1
Fjob_other           1
Fjob_services        1
Fjob_teacher         1
reason_course        1
reason_home          1
reason_other         1
reason_reputation    1
guardian_father      1
guardian_mother      1
guardian_other       1
dtype: int64

<b>Model 1 : Logistic Regression<br>
Model 2 : K-Nearest Neighbours<br>
Model 3 : Random Forest Classification<br>
</b>

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

*#explaination*<br>
**Model 1 : Logistic Regression**<br>
This probality based model is best suited for cases in which 1. minimum multicollinearity, 2. No null values, 3. No categorical values.<br>
**Model 2 : kNN Classifier**<br>
Simple, fast and efficient<br>
**Model 3 : Random Forest Classifier**<br>
Powerfull, accurate and good performance on many problems

In [27]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [28]:
# fit model-1  on traning data 

In [39]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [40]:
import warnings
warnings.filterwarnings("ignore") 
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [41]:
# predict on the test data 

In [42]:
lr_pred = lr.predict(X_test)

In [43]:
# calculate the accuracy score

In [44]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix
def check_model_metrices(y_test, y_pred):
    print('Model Accuracy = ', accuracy_score(y_test, y_pred))
    print('Model Precision = ', precision_score(y_test, y_pred))
    print('Model Recall = ', recall_score(y_test, y_pred))
    print('Model F1 Score = ', f1_score(y_test, y_pred))
    print('Confusion Matrix = \n', confusion_matrix(y_test, y_pred))
check_model_metrices(y_test, lr_pred)

Model Accuracy =  0.7473684210526316
Model Precision =  0.7837837837837838
Model Recall =  0.8787878787878788
Model F1 Score =  0.8285714285714285
Confusion Matrix = 
 [[13 16]
 [ 8 58]]


In [45]:
# fit the model-2 on traning data and predict on the test data and measure the accuracy

In [46]:
def generate_kNN_model(x_train, y_train, x_test, k):
    knn_model = KNeighborsClassifier(n_neighbors=k, metric='minkowski')
    knn_model.fit(x_train, y_train)
    return knn_model.predict(x_test)

accur_dict = dict()
for k in np.arange(3,16):
    y_pred = generate_kNN_model(X_train, y_train, X_test, k)
    accur_dict[k] = accuracy_score(y_test, y_pred)
optimal_k = max(accur_dict, key = lambda x: accur_dict[x])
print('Best k value = ', optimal_k)
knn_y_pred = generate_kNN_model(X_train, y_train, X_test, optimal_k)
check_model_metrices(y_test, knn_y_pred)

Best k value =  13
Model Accuracy =  0.7473684210526316
Model Precision =  0.75
Model Recall =  0.9545454545454546
Model F1 Score =  0.84
Confusion Matrix = 
 [[ 8 21]
 [ 3 63]]


In [35]:
# fit the model-3 on traning data and predict on the test data and measure the accuracy

In [53]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

In [54]:
check_model_metrices(y_test, rf_pred)

Model Accuracy =  0.7684210526315789
Model Precision =  0.775
Model Recall =  0.9393939393939394
Model F1 Score =  0.8493150684931509
Confusion Matrix = 
 [[11 18]
 [ 4 62]]
