# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

The goal of this project is to identify students who might need early intervention. Since we want to predict whether a given student will pass or
fail, given information about their life and habits, this is a classification problem with two classes: pass and fail.

In [3]:
# Import libraries
import numpy as np
import pandas as pd

from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns

Student data read successfully!


We explore the following features in the data:

- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

In [15]:
student_data.columns[-1]

'passed'

In [5]:
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1
n_passed = sum([1 for y in student_data['passed'] if y == 'yes'])
n_failed = sum([1 for n in student_data['passed'] if n == 'no'])
grad_rate = 100.*n_passed/(n_passed + n_failed)

print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%
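
The same counts can also be read off with pandas' `value_counts`. The following is a minimal sketch, assuming Python 3; the `toy` frame is a hypothetical stand-in for `student_data` and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for student_data; only the target column matters here.
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes", "no"]})

counts = toy["passed"].value_counts()   # label -> number of students
n_passed = int(counts.get("yes", 0))
n_failed = int(counts.get("no", 0))
grad_rate = 100.0 * n_passed / len(toy)

print("passed={}, failed={}, rate={:.2f}%".format(n_passed, n_failed, grad_rate))
```

Using `counts.get(label, 0)` keeps the sketch from failing if one of the two labels happens to be absent from the data.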


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that data sets contain non-numeric features. This can be a problem, as most machine learning algorithms expect numeric
data to perform computations with.

First we separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [8]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted. Many of them are simply `yes`/`no`, e.g. `internet`. These can
be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such
a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of
them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the
[`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies)
function to perform this transformation.
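
To make the transformation concrete, here is a small sketch of both conversions on a hypothetical three-student frame (the `toy` data is invented for illustration, and Python 3 with a recent pandas is assumed). It is not the notebook's own preprocessing function, only an illustration of what the result looks like.

```python
import pandas as pd

# Hypothetical sample with one yes/no column and one categorical column.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Map yes/no to 1/0 explicitly, then expand the categorical column into dummies.
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})
encoded = pd.get_dummies(toy, columns=["Fjob"], prefix="Fjob", dtype=int)

print(list(encoded.columns))
print(encoded)
```

Each row ends up with exactly one `1` among the `Fjob_*` columns, which is the "one-hot" property described above.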

In [9]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding
labels) into training and test sets.

In [10]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# Select features (X) and corresponding labels (y) for the training and test sets
# Note: train_test_split shuffles the data, which avoids any bias due to ordering in the dataset
from sklearn.cross_validation import train_test_split

# Pass the test set size explicitly so the split matches num_train/num_test above
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=num_test, random_state=0)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
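
In recent scikit-learn releases `train_test_split` lives in `sklearn.model_selection`, and it accepts a `stratify` argument that keeps the pass/fail ratio similar in both parts. The sketch below assumes Python 3 and a current scikit-learn, and uses synthetic arrays (`X_demo`, `y_demo`) with the same size and class balance as the student data; stratification is an addition for illustration, not something the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and class balance as the student data.
rng = np.random.RandomState(0)
X_demo = rng.rand(395, 48)
y_demo = np.array(["yes"] * 265 + ["no"] * 130)

# stratify=y_demo preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, stratify=y_demo, random_state=0)

print("Training set: {} samples".format(X_tr.shape[0]))
print("Test set: {} samples".format(X_te.shape[0]))
print("Pass rate, train: {:.3f}, test: {:.3f}".format(
    np.mean(y_tr == "yes"), np.mean(y_te == "yes")))
```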


## 4. Training and Evaluating Models
Next we perform a baseline comparison of 5 different classification models out of the box.

 - DecisionTreeClassifier
 - SVC
 - KNeighborsClassifier
 - GaussianNB
 - AdaBoostClassifier

We fit each model to the training data, predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Then we
repeat this process with different training set sizes (50 to 300, in steps of 50), keeping the test set constant.
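
For reference, the F<sub>1</sub> score is the harmonic mean of precision and recall for the positive class (`'yes'` here). The sketch below computes it by hand for a hypothetical set of eight labels and compares it with `sklearn.metrics.f1_score`; the labels are invented for illustration, and Python 3 division is assumed.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels for eight students.
y_true = ["yes", "yes", "yes", "no", "no", "yes", "no", "yes"]
y_pred = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "yes" and p == "yes" for t, p in pairs)  # true positives
fp = sum(t == "no" and p == "yes" for t, p in pairs)   # false positives
fn = sum(t == "yes" and p == "no" for t, p in pairs)   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred, pos_label="yes")

print("precision={:.2f} recall={:.2f} F1={:.2f}".format(precision, recall, f1_manual))
```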

* * *
**Model Training**  

K-fold cross-validation can be used to get the most out of a small data set like the one here. To get my bearings without imposing any
prejudgment, I wanted to try most of the classifiers seen in class out of the box. I excluded neural nets, since their recommended use requires
the features to be scaled (one of the disadvantages of that algorithm).  After separating the data into a training set (size 300) and a test set
(size 95), I used 10-fold cross-validation on the training set to get a basic handle on how the models perform on
average. The results are shown in the table below. The decision tree trails all others in its F1 test score while scoring perfectly
on the training set (which no other model does), so it appears to be overfitting.  The SVC tops all others with an average test F1 of 0.808. KNN
does not trail far behind with 0.781. Naive Bayes and AdaBoost are essentially tied, with test F1 scores of 0.768 and 0.767 respectively.

In [34]:
# Helper Functions
import time
from IPython.display import display, HTML 
from sklearn.metrics import f1_score

# Return the classifier's training time
def timeTraining(clf, X_train, y_train):
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    return (end - start)

# Return the classifier's predictions and prediction time
def predictAndTime(clf, features):
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    return y_pred, (end - start)

# Return the f1 score for the target values and predictions
def F1(target, prediction):
    return f1_score(target.values, prediction, pos_label='yes')



In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

In [82]:
#Adding Kfold validation to try out the Pro Tip from CodeReview2
from sklearn.cross_validation import KFold  

# Setting up KFold cross_validation object
kf = KFold(X_train.shape[0], 10)

# Array of classifiers
clfs = [DecisionTreeClassifier(criterion = "entropy"),
        SVC(C = 1.0, kernel="rbf"),
        GaussianNB(),
        AdaBoostClassifier(),
        KNeighborsClassifier(n_neighbors = 3)]
 
#Gathering Table column and index labels
classifier_names = [clf.__class__.__name__ for clf in clfs]
benchmarks = ["Training time",  "F1 score training set","Prediction time", "F1 score test set"]
table = pd.DataFrame(columns = classifier_names, index = benchmarks)

# Fit Classifiers and average the times and f1 scores resulting from KFold (10 folds)
for clf in clfs: 
    classifier   = clf.__class__.__name__
    t_test  = 0.0 
    t_train = 0.0
    F1_test = 0.0
    F1_train =0.0
    
    #Averaging scores and seconds across the folds
    for tr_i ,t_i in kf:
        #Train (k-1 buckets)
        t_train += timeTraining(clf, X_train.iloc[tr_i], y_train.iloc[tr_i])
        pred_train_set = predictAndTime(clf, X_train.iloc[tr_i])[0]
        F1_train += F1(y_train.iloc[tr_i], pred_train_set)
        #Test (kth bucket)
        pred_test_set, t_t = predictAndTime(clf,X_train.iloc[t_i])
        t_test += t_t
        F1_test += F1(y_train.iloc[t_i], pred_test_set)
        
    #Filling table 
    table[classifier]['Training time']         = "{:10.4f} s".format(t_train/10)
    table[classifier]['F1 score training set'] = F1_train/10
    table[classifier]['Prediction time']       = "{:10.4f} s".format(t_test/10)
    table[classifier]['F1 score test set']     = F1_test/10
    
from IPython.display import display, HTML  
display(table)

| | DecisionTreeClassifier | SVC | GaussianNB | AdaBoostClassifier | KNeighborsClassifier |
|---|---|---|---|---|---|
| Training time | 0.0035 s | 0.0069 s | 0.0009 s | 0.1022 s | 0.0007 s |
| F1 score training set | 1 | 0.874614 | 0.799737 | 0.870058 | 0.889301 |
| Prediction time | 0.0002 s | 0.0007 s | 0.0003 s | 0.0053 s | 0.0011 s |
| F1 score test set | 0.702098 | 0.807571 | 0.767603 | 0.76675 | 0.780717 |
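
With a current scikit-learn, the fold-averaged scores can be obtained more compactly with `cross_validate` from `sklearn.model_selection`. This is a sketch under several assumptions: Python 3, synthetic data from `make_classification` with 0/1 labels in place of the student data, and `StratifiedKFold` with shuffling instead of the plain `KFold` used above, so the numbers it prints are not comparable to the table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the 300-sample training set (labels are 0/1 here).
X_demo, y_demo = make_classification(n_samples=300, n_features=48, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(SVC(C=1.0, kernel="rbf"), X_demo, y_demo,
                        cv=cv, scoring="f1", return_train_score=True)

# Each entry is an array with one value per fold.
print("mean fit time:  {:.4f} s".format(scores["fit_time"].mean()))
print("mean train F1:  {:.3f}".format(scores["train_score"].mean()))
print("mean test F1:   {:.3f}".format(scores["test_score"].mean()))
```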


__DecisionTree:__ Generates a tree representation of a decision function, where each node in the tree represents an if-then-else decision rule,
and each leaf finally places the input in a given category (pass/fail in our case).

__Pros:__  

* Simple to understand and interpret. The decision function can explicitly be drawn out as a tree structure.  

* The cost of predicting is logarithmic in the number of training samples.  

* It can handle categorical and numerical data equally well.  

__Cons:__

* They are prone to overfitting the data (as can be seen in the table) unless pruned and/or tuned through their various parameters (e.g. minimum samples per leaf, maximum depth). This can be costly.  

* They can be unstable; small variations in the data can generate completely different trees.  

* They can create biased trees if some classes dominate.  

__Chosen/rejected:__ Chosen  

__Reasoning:__ Before we manipulated it, the data was strongly categorical. Even age, though ordered, can basically be thought of as buckets
between 15 and 22.  Even though the decision tree can handle both categorical and numerical data equally well, overall the structure of this
data seems like a good fit for a tree-like decision function. We can easily see how if-then rules can be deduced from the attributes, such
as: if (has internet connection) then … , or if (father's job == Teacher) then … else if (father's job == Healthcare worker) then … , and so
on. The data is also not too badly out of balance (approximately a 2:1 pass-to-fail ratio), which mitigates one of the cons. Tuning the
tree to improve its performance may prove costly, and there is already some evidence of overfitting above. Using the model for prediction is
fast, which would be advantageous in a situation of limited computational resources.
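
The overfitting mentioned above can be illustrated by comparing an unrestricted tree with a depth-limited one. The sketch below assumes Python 3 and a recent scikit-learn, and uses noisy synthetic data (`make_classification` with `flip_y=0.2`) rather than the student data, so it only shows the general pattern of a perfect training score that does not carry over to held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data with the same shape as the processed student features.
X_demo, y_demo = make_classification(n_samples=395, n_features=48, n_informative=5,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=95, random_state=0)

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                 random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full), ("max_depth=3", shallow)]:
    print("{:12s} depth={:2d} train acc={:.3f} test acc={:.3f}".format(
        name, tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te)))
```

Limiting `max_depth` is only one of the pruning-style parameters; `min_samples_leaf` and `min_samples_split` act in a similar spirit.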

* * *  

__SVC:__ Seeks the decision boundary that maximizes the margin between classes by solving a quadratic programming problem.  

__Pros:__ 

* Effective for data with a high number of attributes  

* Still effective when there are more attributes than data points  

* It is memory efficient because only a subset of the training data (the support vectors) is needed to make predictions  

* Highly tunable (also a con). sklearn's implementation offers 4 built-in kernel functions (plus the ability to define a custom
  kernel) and around a dozen parameters.
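
The memory-efficiency point can be checked directly: after fitting, an `SVC` keeps only its support vectors. The following sketch assumes Python 3 and uses two well-separated synthetic clusters from `make_blobs` (an illustrative data set, not the student data), where only a small fraction of points end up as support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated hypothetical clusters of 100 points each.
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=0)

clf = SVC(C=1.0, kernel="rbf").fit(X_demo, y_demo)
n_sv = clf.support_vectors_.shape[0]

print("{} of {} training points are support vectors".format(n_sv, X_demo.shape[0]))
```

On data where the classes overlap heavily, a much larger share of the training points can become support vectors, so the saving depends on the problem.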

__Cons:__  

* Harder to conceptualize relative to other models. The entire mechanism of the algorithm is a rather abstract linear algebra/analysis
  problem. (Though this con only really applies when seeking to interpret results in an intuitive way)

* Though both its training time and prediction time were relatively fast out of the box, the model can be costly to tune due to its sheer number
  of configurations (can also be a pro).

* Can only expensively provide probability estimates. (Not so relevant here) 

__Chosen/Rejected:__ Chosen  

__Reasoning:__ For this particular problem, since we are not required to estimate any probabilities, there seems to be little downside to using
SVC. On the other hand, some of its strengths are not particularly relevant either, since the number of attributes is small compared to the
size of our data set. Its memory efficiency definitely fits a scenario where we have limited resources. Its F1 score on the test set was
also the best of all the models out of the box.

* * *  

__GaussianNB:__ Applies Bayes' theorem with the assumption of independence between every pair of attributes, given the class label. From the
training data it estimates the likelihoods P(__x__|y) (i.e. given a label y, how probable is the input point __x__) by fitting a Gaussian
distribution to each attribute within each class. It then uses that model, applying Bayes' theorem, to approximate P(y|__x__) for new inputs __x__.
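
The computation described above can be written out in a few lines of NumPy. This is a sketch, assuming Python 3 and synthetic Gaussian data; `gaussian_nb_posterior` is a hypothetical helper written for illustration, and sklearn's `GaussianNB` additionally applies a tiny variance smoothing term, so the two agree only up to a small tolerance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Two hypothetical classes drawn from Gaussians with different means.
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(1.5, 1.0, size=(50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)

def gaussian_nb_posterior(X_fit, y_fit, X_new):
    """Posterior P(y|x) from per-class, per-feature Gaussians and Bayes' theorem."""
    log_joint = []
    for c in np.unique(y_fit):
        Xc = X_fit[y_fit == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(Xc.shape[0] / X_fit.shape[0])
        # Independence assumption: log P(x|y) is a sum over the features.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1)
        log_joint.append(log_prior + log_lik)
    log_joint = np.array(log_joint).T
    log_joint -= log_joint.max(axis=1, keepdims=True)  # for numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)       # normalize over classes

manual = gaussian_nb_posterior(X_demo, y_demo, X_demo[:5])
reference = GaussianNB().fit(X_demo, y_demo).predict_proba(X_demo[:5])
print(np.round(manual, 4))
```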

__Pros:__  

* Requires only a small amount of training data to estimate the necessary parameters. 

* They can be very fast compared to more sophisticated methods (evidence of this can be seen in the table)  

__Cons:__  

*  Bad estimator of probabilities. (Not so relevant here)  

* Sklearn's implementation has little in the way of tune-ability.  

__Chosen/Rejected:__ Rejected  

__Reasoning:__ GaussianNB was one of the fastest to train and had a decent average F1 test score. Two of Naive Bayes' advantages that are
relevant to this problem are its capacity to train on small amounts of data and its training speed. In this sense it is well suited to
the situation at hand. The main reason I pass on it is the lack of tunable parameters in sklearn's GaussianNB implementation, so I don't
have a real chance of improving its performance from here: what I see now is what I get (I may be wrong on this).

* * * 

__AdaBoostClassifier:__ It trains multiple weak classifiers (weak meaning they are better than guessing i.e. generalization error < 0.5), and
then combines them into a single boosted classifier using a weighted voting scheme.

__Pros:__  

* Computationally efficient (taken from Intro to boosting pdf)  

* No difficult parameters to set  

* Versatile – a wide range of weak learners can be used  

__Cons:__  

* The weak learner should not be too complex, to avoid overfitting  

* There needs to be enough data so that the weak learner requirement is satisfied.  

__Chosen/Rejected:__ Rejected  

__Reasoning:__ From the above table, AdaBoost seems expensive to train: its average training time of 0.1022 s is approximately 15 times
the next slowest time of 0.0069 s by SVC. Also, from the Intro to Boosting reading, it seems that choosing the weak learner properly is
particularly important for this algorithm to perform well. Simply using the default DecisionTree weak learner from sklearn's implementation
resulted in the somewhat costly and not particularly strong performance seen in the table above. I experimented briefly with grid search over
a parameter set (leaving the DecisionTree as the weak learner) but did not achieve significant gains in performance given the
computational cost. This leads me to believe that experimenting with the choice of weak learner would be more fruitful, but I opted to postpone
such a study for later.

* * *  

__KNeighborsClassifier:__ Given an input point, __x__, it finds the k closest points to __x__ in the training set and then applies a majority
voting scheme regarding their labels, to determine the label for __x__.
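
The voting scheme can be sketched from scratch in a few lines. The example below assumes Python 3, uses plain Euclidean distance with a brute-force search, and runs on six hypothetical two-dimensional points; `knn_predict` is an illustrative helper, not sklearn's implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_fit, y_fit, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_fit - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                       # indices of the k closest points
    votes = Counter(y_fit[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training points: three near the origin ("no"), three near (1, 1) ("yes").
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(["no", "no", "no", "yes", "yes", "yes"])

print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0])))
print(knn_predict(X_toy, y_toy, np.array([0.05, 0.1])))
```

An odd `k` avoids ties in a two-class problem such as pass/fail.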

__Pros:__  

* Relatively fast prediction time in general when a tree-based index is used, O[D log(N)], where D is the number of attributes and N the number
  of training examples (D is unlikely to change much in this scenario, so it is really more like O[log(N)])

* Easier to conceptualize and reason about than other more sophisticated models.  

__Cons:__  

* Not memory efficient, since it makes predictions by directly using the data as a “model”. Must therefore keep the data stored.  

__Chosen/Rejected:__ Chosen  

__Reasoning:__ Upon closer reading of section 1.6.4 of sklearn's documentation, _"Nearest Neighbor Algorithms"_, KNN, as I used it here, chooses
its algorithm based on the data passed to fit(). The choice is between brute force, K-D tree, and ball tree. This data set is too
large for brute force (N = 300 >> 30), so we can consider KNN here as either K-D tree or ball tree. Now, since the dimension of the feature space
is, for computational purposes, D = 48 > 20 (after adding all the dummy columns), it is likely choosing ball tree. In any case, its time complexity
at prediction is O[D log(N)], which explains why it is slower than the decision tree. Since the number of student features is not likely to change
much if KNN is put into practice in our scenario (suppose we choose it in the end for use by the school board), this can be considered a pro for
this algorithm, since it is essentially logarithmic, just like the decision tree.
Also from sklearn:  

“Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a smaller intrinsic
dimensionality leads to faster query times.”

If I'm understanding correctly (I may very well not be), since sparsity of the data set “refers to the degree to which the data fills the
parameter space”, it seems to me that the data set here is somewhat sparse. The reason I say this is that when we added the dummies,
we essentially added a lot of zero components to each student “vector”. Since the student's mother, say, cannot be partially between being a
Teacher and a Health care worker, there are regions of the feature space, along those categorical variables, that are empty, because no
vectors will ever have non-zero components there. If this is indeed the case, then it would also be a pro for KNN.  However, if our
training data set were to grow very large, this would come at a significant memory cost.

Next I try each of these models with varying training set sizes, from 50 students to 300 in increments of 50, for a total of 6 training set sizes. 


In [42]:
# Helper function makeTable
def makeTable(clf, index_list, X_train, X_test, y_train, y_test):
    
    #Gathering column and row labels for the table
    benchmarks = ["Training time",  "F1 score training set","Prediction time", "F1 score test set"]
    size_labels = ["Training samples: {}".format(len(indices)) for indices in index_list]
    table = pd.DataFrame(columns = benchmarks, index = size_labels)
    
    for i, ind in enumerate(index_list): 
        #Cutting training data to len(ind) samples
        X_tr = X_train.iloc[ind]
        y_tr = y_train.iloc[ind]
        #Compute benchmarks
        t_train    = timeTraining(clf, X_tr, y_tr)
        pred_train_set = predictAndTime(clf, X_tr)[0]
        pred_test_set, t_test = predictAndTime(clf,X_test)  
        
        #fill table
        table['Training time'][i]    = t_train
        table['F1 score training set'][i] = F1(y_tr, pred_train_set)
        table['Prediction time'][i]  = t_test
        table['F1 score test set'][i] = F1(y_test, pred_test_set)
        
    return table 


In [38]:
# Helper function: return a list of k unique random indices in the range 0 to N (inclusive)
from random import sample

def getIndices(k, N):
    # sample() draws without replacement, so no duplicate check or retry loop is needed
    return sample(range(N + 1), k)

In [45]:
# Chosen Classifiers
chosen_clfs = [DecisionTreeClassifier(criterion = "entropy"),
        SVC(C = 1.0, kernel="rbf"),
        KNeighborsClassifier(n_neighbors = 3)]

# Test Classifiers with increasing data set size
training_sizes = [50,100,150,200,250,300]
# Choosing random index sets for each size in training_sizes 
index_list = [getIndices(size, X_train.shape[0]-1) for size in training_sizes]

for clf in chosen_clfs:
    print clf.__class__.__name__
    table = makeTable(clf, index_list, X_train, X_test, y_train, y_test)
    display(table)



DecisionTreeClassifier


| | Training time | F1 score training set | Prediction time | F1 score test set |
|---|---|---|---|---|
| Training samples: 50 | 0.00134015 | 1 | 0.00022912 | 0.655738 |
| Training samples: 100 | 0.00187397 | 1 | 0.000470161 | 0.738462 |
| Training samples: 150 | 0.00248003 | 1 | 0.000349045 | 0.7 |
| Training samples: 200 | 0.00266218 | 1 | 0.000222921 | 0.692913 |
| Training samples: 250 | 0.00324202 | 1 | 0.000225067 | 0.694215 |
| Training samples: 300 | 0.00353885 | 1 | 0.000210047 | 0.744186 |


SVC


| | Training time | F1 score training set | Prediction time | F1 score test set |
|---|---|---|---|---|
| Training samples: 50 | 0.00126004 | 0.868421 | 0.00067687 | 0.746667 |
| Training samples: 100 | 0.00251102 | 0.853801 | 0.00161004 | 0.766234 |
| Training samples: 150 | 0.0035069 | 0.867769 | 0.00148201 | 0.791946 |
| Training samples: 200 | 0.00478506 | 0.870662 | 0.00180602 | 0.773333 |
| Training samples: 250 | 0.00570989 | 0.874036 | 0.001755 | 0.772414 |
| Training samples: 300 | 0.00794792 | 0.869198 | 0.00203085 | 0.758621 |


KNeighborsClassifier


| | Training time | F1 score training set | Prediction time | F1 score test set |
|---|---|---|---|---|
| Training samples: 50 | 0.000905991 | 0.84507 | 0.000964165 | 0.706767 |
| Training samples: 100 | 0.000568867 | 0.860759 | 0.00172305 | 0.723077 |
| Training samples: 150 | 0.000732183 | 0.908257 | 0.00194407 | 0.68254 |
| Training samples: 200 | 0.000803947 | 0.865772 | 0.00178885 | 0.729927 |
| Training samples: 250 | 0.000707865 | 0.878453 | 0.00239015 | 0.677165 |
| Training samples: 300 | 0.000725985 | 0.880361 | 0.00236201 | 0.731343 |


SVC showed consistently higher f1 test scores. The DecisionTree was very fast at predicting. KNN was very fast at training.
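
A related experiment is available in current scikit-learn as `learning_curve`, which trains on increasing subsets and scores each with cross-validation. The sketch below is an alternative to the tables above rather than a reproduction of them: it assumes Python 3, synthetic 0/1-labelled data from `make_classification`, and 5-fold cross-validation instead of a single fixed test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the processed student data.
X_demo, y_demo = make_classification(n_samples=395, n_features=48, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    SVC(C=1.0, kernel="rbf"), X_demo, y_demo,
    train_sizes=[50, 100, 150, 200, 250, 300], cv=5, scoring="f1",
    shuffle=True, random_state=0)

# Rows correspond to training sizes, columns to cross-validation folds.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print("Training samples: {:3d}  train F1={:.3f}  test F1={:.3f}".format(n, tr, te))
```

Averaging over folds smooths out some of the run-to-run variation that a single random subset per size can show.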

## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best
  model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you
  chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?

In [72]:
# TODO: Fine-tune your model and report the best F1 score
from sklearn import grid_search
from sklearn.metrics import make_scorer

scorer = make_scorer(F1)

# tree_param = { 'criterion': ["entropy", "gini"], 'max_features':["sqrt", "log2"], 'max_depth': range(2,11),
#               'min_samples_split':range(2,9), 'min_samples_leaf':range(1,9) }

neigh_param = {'n_neighbors' : [3,5,10,20,25,30,40], 'weights' : ['uniform', 'distance'], 'p':[1,2,3,5,10],
              'algorithm' : ['ball_tree', 'kd_tree', 'brute']}

#Perform grid Search
def gridIt(clf, params):
    #Grid search folds = 10, for consistency with previous computations
    grid_clf = grid_search.GridSearchCV(clf, params, scorer, cv = 10)
    print clf.__class__.__name__
    print "Grid search time:", timeTraining(grid_clf, X_train, y_train)
    print "Parameters of tuned model: ", grid_clf.best_params_
    y_pred, predict_t = predictAndTime(grid_clf, X_test)
    print "f1_score and prediction time on X_test, y_test: "
    print F1(y_test, y_pred), predict_t
    print '------------------\n'
    
gridIt(KNeighborsClassifier(), neigh_param)


KNeighborsClassifier
Grid search time: 25.9852719307
Parameters of tuned model:  {'n_neighbors': 25, 'weights': 'uniform', 'algorithm': 'ball_tree', 'p': 2}
f1_score and prediction time on X_test, y_test: 
0.786666666667 0.00278496742249
------------------
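
For readers on a current scikit-learn, where `sklearn.grid_search` has been replaced by `sklearn.model_selection`, a comparable search might look like the sketch below. It assumes Python 3, synthetic data with `'yes'`/`'no'` labels in place of the student data, and a smaller hypothetical parameter grid than the one used above; the scorer is built directly from `f1_score` with `pos_label='yes'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set, relabelled to 'yes'/'no'.
X_demo, y_num = make_classification(n_samples=300, n_features=48, random_state=0)
y_demo = np.where(y_num == 1, "yes", "no")

# F1 scorer that treats 'yes' as the positive class.
f1_yes = make_scorer(f1_score, pos_label="yes")
param_grid = {"n_neighbors": [3, 5, 10, 25], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring=f1_yes, cv=10)
search.fit(X_demo, y_demo)

print("Parameters of tuned model:", search.best_params_)
print("Mean cross-validated F1: {:.3f}".format(search.best_score_))
```

Note that `best_score_` is the cross-validated score on the training data; the final figure should still be measured on the held-out test set, as done above.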



## Conclusions
**Algorithm Selection:**  
* * *
The decision to use KNeighborsClassifier was mostly due to its low computational cost. Given a battery of
possible tuning parameters, it was the only one of the algorithms that was feasibly tunable with limited resources.
It performed better than other cheap alternatives like Naive Bayes, so it performed reasonably
well for its cost. As seen in the benchmark table in cell [82], out of the box it was outperformed only
by the SVC, which is more costly to train. Still, its improvement after tuning was only marginal. 

**Layman Explanation: KNeighborsClassifier**  
* * *
The mechanism with which our model decides whether a current student will pass or fail is very intuitive. Each student has 30 attributes
associated with them. These range from basic descriptors like their age, sex, and health, to behavioral descriptors like whether they are in
a romantic relationship, how much time they devote to study, and whether they take part in any extracurricular activities. Some of the attributes
describe things outside their control, like whether they have internet access, the size of their family, and what kind of neighborhood they live
in. What our model does is assign a number to every one of these attributes, just as you can take a house and assign it a
longitude, a latitude, and, if it sits on a hill, an altitude. In the same way that longitude, latitude and altitude
allow us to decide how far apart two houses are, we can decide how "far away" two particular students are from each other, based
on all those attributes like free time, age and so forth. Since we have information about which students have failed and which have passed,
our model basically answers the question: how "close" is this student to other students who have passed? Maybe they are "closer" to students who
failed. We could compare the student to the _single_ closest student to predict whether they will pass or fail. Usually, however, we tune
the model to find a _group_ of closest students, maybe the closest 4 students, or maybe the closest 10. The exact number is
determined while tuning (25 for our final model). Since we know the outcome for every student in this group, the student in question
is predicted to pass or fail according to what the majority of the group did.

**Final F1 score**  
0.787