# 09 Machine Learning

This notebook builds on **08 pandas**, so please make sure you have run and understood that one.<br>
This notebook will scratch the surface of machine learning by introducing some techniques and algorithms. It demonstrates the idea behind the objects representing the algorithms and how to apply them. Your task will be to beat the presented, dramatically simplified algorithm.

The scipy stack includes the scikit-learn module, which covers most of what one needs for solving everyday machine learning tasks. The basic idea is usually to *train* a purely data-driven algorithm on a *training dataset*. This dataset includes different *predictors* that are used to predict the state or value of a *target* variable. After training, the algorithm is applied to a *test dataset* in order to rate its performance. Usually, the prediction accuracy is used as a measure.<br>
Before data can be handed to an algorithm, it has to be preprocessed so that the algorithm classes interpret it correctly.

In [1]:
import pandas as pd
import numpy as np
import sklearn
from pprint import pprint
from sklearn import preprocessing

In [2]:
# open the file from last week (08 pandas)
df = pd.read_csv('data/train_corrected.csv')
df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmountMedian,ApplicantIncomeMedian
0,LP001003,-1,True,1,True,False,4583,1508.0,128.0,360.0,True,Rural,N,False,True
1,LP001005,-1,True,0,True,True,3000,0.0,66.0,360.0,True,Urban,Y,False,False
2,LP001006,-1,True,0,False,False,2583,2358.0,120.0,360.0,True,Urban,Y,False,False
3,LP001008,-1,False,0,True,False,6000,0.0,141.0,360.0,True,Urban,Y,True,True
4,LP001011,-1,True,2,True,True,5417,4196.0,267.0,360.0,True,Urban,Y,True,True
5,LP001013,-1,True,0,False,False,2333,1516.0,95.0,360.0,True,Urban,Y,False,False
6,LP001014,-1,True,3+,True,False,3036,2504.0,158.0,360.0,False,Semiurban,N,True,False
7,LP001018,-1,True,2,True,False,4006,1526.0,168.0,360.0,True,Urban,Y,True,True
8,LP001020,-1,True,1,True,False,12841,10968.0,349.0,360.0,True,Semiurban,N,True,True
9,LP001024,-1,True,2,True,False,3200,700.0,70.0,360.0,True,Urban,Y,False,False


## Preprocessing

Before we can continue, we have to split the dataset into the predictors and the target values. In the language of machine learning, these are called *data* and *target*. In this example, we will only use the Gender, Education and Self_Employed predictors for a first guess. To beat an algorithm based only on that data, you'll have to redo all these steps with your own predictors.

In [3]:
target = df.Loan_Status.values
data = df[['Gender', 'Education', 'Self_Employed']].values
assert len(target) == len(data)

The preprocessing toolbox offers several classes that can convert labels and ranges into data that the algorithms actually understand. All predictors chosen here hold binary information, therefore the Binarizer class is the correct one. Parameters are always passed to the objects on instantiation. Before the input can be transformed (e.g. into binary information), the transformer has to be fitted to the input data. Most classes offer a fit_transform method that does both steps in one.<br>

In [4]:
data = preprocessing.Binarizer().fit_transform(data)
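To see what the Binarizer does to individual values, here is a minimal sketch on a small toy array. The values are made up for illustration and are not taken from the loan dataset; the sketch relies on the default threshold of 0.0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# toy values (illustrative only): -1 / 1 codes and 0 / 1 flags
toy = np.array([[-1, 1, 0],
                [ 1, 0, 1]])

# with the default threshold of 0.0, values > 0 become 1, all others become 0
binary = Binarizer().fit_transform(toy)
print(binary)
```

Note that a code like -1 ends up as 0, so a predictor coded as -1 / 1 is mapped to 0 / 1.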

Before we use any of the predictors, we should check the correlations between them. There is no sense in using two highly correlated predictors.

In [5]:
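from scipy.stats import spearmanr, pearsonr
# note: data[0] and data[1] are the first two rows (samples) with 3 values each;
# to correlate two predictors, compare columns instead, e.g. data[:, 0] and data[:, 1]
print(spearmanr(data[0], data[1]))
print(pearsonr(data[0], data[1]))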

SpearmanrResult(correlation=0.50000000000000011, pvalue=0.66666666666666674)
(0.5, 0.66666666666666663)
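Correlating two predictors means comparing two columns of the data array. The following is a minimal sketch of such a column-wise check on synthetic binary data; the arrays `a`, `b` and `c` are invented for illustration, where `b` is a noisy copy of `a` and `c` is independent of both.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)

# synthetic binary predictors (illustrative, not the loan data)
a = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.1          # flip roughly 10% of the entries
b = np.where(flip, 1 - a, a)          # noisy copy of a
c = rng.integers(0, 2, size=200)      # drawn independently of a
toy = np.column_stack([a, b, c])

# compare columns (predictors), not rows (samples)
rho_ab, p_ab = spearmanr(toy[:, 0], toy[:, 1])
r_ac, p_ac = pearsonr(toy[:, 0], toy[:, 2])
print('a vs b: rho=%.2f, p=%.3g' % (rho_ab, p_ab))
print('a vs c: r=%.2f, p=%.3g' % (r_ac, p_ac))
```

A pair like `a` and `b` would be a candidate for dropping one of the two predictors, while `a` and `c` carry separate information.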


The target variable holds label information. We can use the LabelEncoder to turn the labels into integers. Here, we could also use the Binarizer, as there are only two different target classes. Unlike the Binarizer, however, the LabelEncoder also works on more than two classes.

In [6]:
target = preprocessing.LabelEncoder().fit_transform(target)
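A short sketch of what the LabelEncoder does, using a handful of made-up 'Y' / 'N' labels rather than the real target column. The fitted encoder remembers the classes, so the integers can be turned back into labels later.

```python
from sklearn.preprocessing import LabelEncoder

# made-up labels in the style of Loan_Status
labels = ['N', 'Y', 'Y', 'N', 'Y']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # the classes are stored in sorted order
print(encoded)                             # 'N' -> 0, 'Y' -> 1
print(encoder.inverse_transform(encoded))  # back to the original labels
```

Keeping the encoder object around (instead of instantiating it inline) is what makes the inverse transformation possible.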

Next we need a test and a train dataset. In this very specific case, we downloaded only the train dataset last week and could now download the test.csv as well. Then you would have to apply the cleanup to the test dataset in exactly the same way. However, scikit-learn offers a convenient function to split data into a test and a train dataset: *train_test_split*. In the machine learning world, the data is usually denoted by a capital X and the target by a lowercase y. Usually you would split into $\frac{1}{3}$ test and $\frac{2}{3}$ train size, but the function can take any ratio you need. Lastly, a random state can be set as a seed, which makes the random split reproducible.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=1337)
print('Train data:', X_train.shape)
print('Test data:', X_test.shape)
print('Train targets:', y_train.shape)
print('Test targets:', y_test.shape)

Train data: (328, 3)
Test data: (162, 3)
Train targets: (328,)
Test targets: (162,)
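The effect of the test_size and random_state arguments can be checked on placeholder arrays. The sketch below uses dummy arrays with the same number of rows (490) and columns (3) as our data; the contents are arbitrary and only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder arrays with the same shape as our data (contents are arbitrary)
X = np.arange(490 * 3).reshape(490, 3)
y = np.arange(490) % 2

# two splits with the same seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1337)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.33, random_state=1337)

print(X_tr.shape, X_te.shape)
print('identical split:', np.array_equal(X_te, X_te2))
```

With a fixed seed, both calls select exactly the same rows, which is why the scores in this notebook are reproducible.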


## Decision Tree 

We will use one of the simplest algorithms for predicting the Loan Status: the Decision Tree. A decision tree builds up branches of successive decisions (nodes) based on the predictor values until it reaches the target classes. That means the decision tree we use here is a classifier, which is exactly what we need. For each data point, it answers the question represented by a node based on that data point's predictor values and moves on until it reaches a target leaf.<br>
One important parameter for decision trees is the stopping criterion. You can specify in many ways when the algorithm shall stop building branches and nodes and assign the targets. If it didn't stop, it might create one branch for each unique data point (with as many nodes as are needed to make this data point unique). This would be overfitting, as the tree is 100% accurate on the train dataset but most likely very bad on test datasets.

One can now either set the maximum length (depth) of branches (*max_depth*), the minimum number of samples that have to end up in a leaf (*min_samples_leaf*) or the minimum number of samples needed at a node to actually split it into new branches (*min_samples_split*). min_samples_leaf defaults to 1 and min_samples_split defaults to 2. With these defaults, a node reached by only two data points can still be split, with each of them going down its own branch.<br>
Often it is a good choice to set a max_depth and to increase this number step by step while validating that no overfitting takes place.
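One way to validate this is to compare the accuracy on the train and the test dataset for a few depths. The sketch below does that on synthetic data from make_classification, not on the loan data; the dataset parameters and the list of depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic, noisy classification problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

scores = {}
for depth in (1, 3, 5, 10, None):       # None lets the tree grow without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print('max_depth=%s  train: %.2f  test: %.2f' % ((depth,) + scores[depth]))
```

A train accuracy that keeps rising while the test accuracy stagnates or drops is the typical sign of overfitting.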

In [8]:
from sklearn.tree import DecisionTreeClassifier

# instantiate a Decision Tree
clf = DecisionTreeClassifier(criterion='gini', max_depth=3)

# build the tree based upon the train samples
clf = clf.fit(X_train, y_train)

# score the predictions on the test dataset (mean accuracy)
clf.score(X_test, y_test)

0.64814814814814814

In [9]:
clf.predict(X_test)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1])

### k-fold cross-validation

The result we had above is highly dependent on the split we made. Generally, a machine learning algorithm performs better when the training dataset contains all value ranges and predictor combinations that are possible, or at least those present in the test dataset. With a simple Decision Tree like the one we used, the model will end up always predicting '1' if we use too few predictors or unsuitable training dataset sizes. Remember that we find more '1's in the target than '0's.

In [10]:
np.histogram(target, bins=2)[0]

array([154, 336])
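These class counts also give a useful reference point: the accuracy of a trivial model that always predicts the majority class '1'. A minimal sketch of that arithmetic, with the two counts copied from the histogram output above (the variable names are just for illustration):

```python
# class counts taken from the histogram output above
n_zero, n_one = 154, 336

# accuracy of a model that always predicts the majority class
baseline = max(n_zero, n_one) / (n_zero + n_one)
print('Majority baseline: %.1f%%' % (baseline * 100))  # 336 / 490 -> 68.6%
```

Any model should be judged against this number, since an accuracy close to it means the predictors add very little.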

If you remove the random_state from train_test_split and rerun the splitting and prediction several times, you will notice that the accuracy changes by about 10%! A better way to get a score that depends less on the split is to use k-fold cross-validation.<br>
Here, the dataset is split into $k$ slices (here called folds). Then the Decision Tree is trained on $k - 1$ folds and tested on the remaining fold. This is repeated until each fold has served as the test fold once. The score is then the mean score of all runs.

In [11]:
from sklearn.model_selection import cross_val_score   # for scoring
from sklearn.model_selection import cross_val_predict # for predicting

# build the decision tree
clf = DecisionTreeClassifier(max_depth=3)

# k-fold cross validation for different numbers of folds
print(' 3-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=3).mean() * 100))
print(' 5-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=5).mean() * 100))
print(' 7-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=7).mean() * 100))
print('10-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=10).mean()* 100))

 3-Fold accuracy: 68.4%
 5-Fold accuracy: 68.8%
 7-Fold accuracy: 68.8%
10-Fold accuracy: 68.2%
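To make the procedure less of a black box, the folds can also be looped over by hand. The sketch below does this with KFold on synthetic data (not the loan data) and compares the result to cross_val_score. Note that it passes a plain KFold object explicitly; when cv is given as an integer for a classifier, scikit-learn uses stratified folds instead, so the numbers would not have to match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

kfold = KFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # train on k - 1 folds, test on the remaining fold
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    fold_scores.append(tree.score(X[test_idx], y[test_idx]))

manual_mean = np.mean(fold_scores)
auto_mean = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                            X, y, cv=KFold(n_splits=5)).mean()
print('manual: %.3f  cross_val_score: %.3f' % (manual_mean, auto_mean))
```

The spread of the individual fold scores is also worth a look, as it shows how much a single split can vary.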


## Random Forest

There is actually another weakness in the model. Besides the dependence on the folds / test dataset, there is also a dependence on the distribution of the classes. The 'Y' class outweighs the 'N' class by far. Both weaknesses can be overcome to some extent by the RandomForest algorithm.<br>
As the name already states, a RandomForest is a set of DecisionTrees. Here, subsets of the data are chosen from the dataset and passed to a number of DecisionTrees. Overfitting is then reduced and the mean accuracy improved by averaging over the trees. The scikit-learn RandomForest can also bootstrap the sub-samples, which means the data points are drawn from the dataset with replacement.
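The idea can be sketched by hand with a few DecisionTrees. This is a simplified illustration on synthetic data, not how scikit-learn implements its RandomForest internally: each tree is fitted on a bootstrap sample, and the trees then vote on the class. The number of trees, the seeds and the dataset are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(10):
    # bootstrap: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=3, max_features='sqrt', random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# every tree predicts every sample; the majority decides (ties count as class 0)
votes = np.array([t.predict(X) for t in trees])      # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print('accuracy of the voting trees: %.2f' % (forest_pred == y).mean())
```

Because every tree sees a slightly different sample and a random subset of predictors at each split, their individual errors partly cancel out in the vote.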

In [12]:
from sklearn.ensemble import RandomForestClassifier

# 'sqrt' is what 'auto' meant for classifiers; newer scikit-learn versions no longer accept 'auto'
clf = RandomForestClassifier(criterion='entropy', n_estimators=10, max_depth=3, 
                             bootstrap=True, max_features='sqrt')

# fit the forest
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

0.64814814814814814

On this split, the RandomForest reaches the same score as the Decision Tree on its own (64.8%). You could try to adapt some of the settings to see the influence on the result. Of course, we can again cross-validate the RandomForest like we did with the DecisionTree. 
Now, we would need a suitable test scenario, altering only one of the parameters at a time or doing a sensitivity analysis, rather than blindly guessing good parameter choices.

In [13]:
# 'sqrt' is what 'auto' meant for classifiers; newer scikit-learn versions no longer accept 'auto'
clf = RandomForestClassifier(criterion='entropy', n_estimators=50, max_depth=3, 
                             bootstrap=True, max_features='sqrt')

print(' 3-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=3).mean() * 100))
print(' 5-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=5).mean() * 100))
print(' 7-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=7).mean() * 100))
print('10-Fold accuracy: %.1f%%' % (cross_val_score(clf, data, target, cv=10).mean()* 100))

 3-Fold accuracy: 68.8%
 5-Fold accuracy: 68.8%
 7-Fold accuracy: 68.4%
10-Fold accuracy: 68.2%
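Altering one parameter at a time can be as simple as a loop around cross_val_score. The following sketch varies only n_estimators and keeps everything else fixed; it runs on synthetic data, and the parameter values are arbitrary examples rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

results = {}
for n in (5, 20, 50):
    # only n_estimators changes; the seed and all other settings stay fixed
    forest = RandomForestClassifier(n_estimators=n, max_depth=3, random_state=0)
    results[n] = cross_val_score(forest, X, y, cv=5).mean()
    print('n_estimators=%2d  5-Fold accuracy: %.1f%%' % (n, results[n] * 100))
```

The same loop works for max_depth or any other parameter, as long as the remaining settings and the seed are held constant.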


It seems like there is nothing we can do to improve a RandomForest or DecisionTree based on only these 3 predictors.<br><br>
<div class="alert alert-success"><br>**TASK:** Now it's your turn! Beat my 68.8% accuracy!<br><br></div>