# Basic Supervised Learning - Classification

This notebook practices basic classification algorithms and common evaluation techniques and metrics.
The notebook includes:

* [Iris dataset description](#dataset_description)
* [Train / Test](#train_test)
* [Decision tree](#decision_tree)
* [Confusion Matrix](#confusion_matrix)
* [Recall and precision](#recall_precision)
* [ROC and AUC](#roc_auc)
* [Cross validation](#cross_validation)
* [K-Nearest neighbors](#knn)

In [13]:
import sklearn # sklearn is the most common Python package for machine learning
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn import tree
import pandas as pd
import numpy as np
import seaborn as sns

## <a name="dataset_description"></a>Iris Dataset Description:

The iris dataset describes iris flowers. 
The explanatory features are:

* sepal length in cm
* sepal width in cm
* petal length in cm
* petal width in cm   

There are 3 different iris species: Setosa, Versicolor and Virginica.
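
As a quick cross-check of the description above, here is a minimal sketch that inspects the copy of the iris data bundled with scikit-learn. Using `load_iris` is an assumption made for illustration; the notebook itself reads `Data/iris.csv`, whose column names differ slightly.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled scikit-learn copy of iris, not the notebook's Data/iris.csv
iris_bunch = load_iris()

n_rows, n_features = iris_bunch.data.shape
feature_names = list(iris_bunch.feature_names)
species_names = list(iris_bunch.target_names)

print(n_rows, n_features)   # number of flowers and number of explanatory features
print(feature_names)        # the four measurements, in cm
print(species_names)        # the three species labels
```

The bundled copy has the same four measurements and three species, so it can stand in for the CSV file when the file is not available.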

In [6]:
iris = pd.read_csv('Data/iris.csv')
print(iris.shape)
print(iris.columns.values)

iris.head(10)

(150, 5)
['Sepal.Length' 'Sepal.Width' 'Petal.Length' 'Petal.Width' 'Species']


,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


## <a name="cross_validation"></a>Cross Validation

In [11]:
from sklearn.model_selection import KFold

# Explanatory features: every column except the label
iris_X = iris.drop(['Species'], axis=1)
# Target labels: the species of each flower
iris_Y = iris.Species

In [16]:
# 10 shuffled folds; no random_state is set, so the folds differ between runs
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(iris_X,iris_Y):
    print("Test indices:", test_index)

Test indices: [ 19  26  30  31  44  63  70  71  76  87  95 114 118 122 125]
Test indices: [  7   9  23  24  36  40  56  66  75  79  80  91  97 107 116]
Test indices: [  3  18  22  38  46  55  59  60  65  69  72 108 133 137 143]
Test indices: [  2   4  13  21  33  34  43  54 109 112 119 120 135 136 145]
Test indices: [ 14  17  32  37  47  52  64  73  83  93  94 127 132 139 148]
Test indices: [  1   5  16  29  35  45  77  88  99 100 105 128 140 142 149]
Test indices: [ 10  15  42  50  61  74  81  82  86  89  90  98 126 131 147]
Test indices: [  0  11  20  28  39  48  78 102 104 110 115 117 130 134 144]
Test indices: [  6  12  49  51  57  58  67  84  96 101 103 111 123 141 146]
Test indices: [  8  25  27  41  53  62  68  85  92 106 113 121 124 129 138]
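
The printed indices suggest that the ten test folds split the 150 samples into disjoint groups of 15. A minimal sketch that checks this property is shown here; the placeholder array `X_dummy` and the fixed `random_state=0` are assumptions added for reproducibility and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 150
X_dummy = np.zeros((n_samples, 4))  # placeholder features; only the row count matters

# Assumption: random_state is fixed here so the check is reproducible
kf = KFold(n_splits=10, shuffle=True, random_state=0)
test_folds = [test_idx for _, test_idx in kf.split(X_dummy)]

fold_sizes = [len(fold) for fold in test_folds]
all_test = np.sort(np.concatenate(test_folds))
is_partition = np.array_equal(all_test, np.arange(n_samples))

print(fold_sizes)     # size of each test fold
print(is_partition)   # True if every sample is tested exactly once
```

With K-fold, every sample is used for testing exactly once and for training in the other nine folds.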


In [17]:
# Constrained entropy-based tree: depth at most 4, at least 10 samples per leaf
clf = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", min_samples_leaf=10, max_depth=4)

scores = cross_val_score(clf, iris_X, iris_Y, cv=kf, scoring='accuracy')
scores

array([0.8       , 0.93333333, 0.93333333, 1.        , 0.86666667,
       1.        , 0.93333333, 1.        , 1.        , 1.        ])
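
The per-fold accuracies are usually summarized by their mean and standard deviation. The sketch below shows one way to do that; it assumes the scikit-learn copy of iris (`load_iris`) instead of `Data/iris.csv` and a fixed `random_state` for `KFold`, so its numbers will not match the output above exactly.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score

# Assumption: bundled iris data and a seeded KFold, for a reproducible illustration
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

clf = tree.DecisionTreeClassifier(
    random_state=0, criterion="entropy", min_samples_leaf=10, max_depth=4
)
scores = cross_val_score(clf, X, y, cv=kf, scoring="accuracy")

# Summarize the ten fold accuracies in a single estimate with its spread
mean_acc = scores.mean()
std_acc = scores.std()
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread alongside the mean makes it clear how much the estimate depends on which samples land in each fold.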

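Another quick way to perform CV using sklearn is `ShuffleSplit`, which draws repeated random train/test splits (here 10 splits, each holding out 30% of the samples for testing):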

In [98]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

# Nearly unconstrained entropy-based tree: single-sample leaves allowed, depth up to 20
clf = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", min_samples_leaf=1, max_depth=20)

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(clf, iris_explain_with_noise, iris_target, cv=cv, scoring='accuracy')
print(scores)

[0.68888889 0.73333333 0.62222222 0.6        0.8        0.66666667
 0.55555556 0.73333333 0.64444444 0.8       ]
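
Unlike K-fold, `ShuffleSplit` draws each test set independently, so the same sample can appear in several test sets and some samples may never be tested. A minimal sketch of that difference follows; the placeholder array `X_dummy` is an assumption used only to supply a row count of 150.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 150
X_dummy = np.zeros((n_samples, 4))  # placeholder features; only the row count matters

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
test_sets = [test_idx for _, test_idx in cv.split(X_dummy)]

test_sizes = [len(t) for t in test_sets]
# How many times each sample index was placed in a test set across the 10 splits
counts = np.bincount(np.concatenate(test_sets), minlength=n_samples)

print(test_sizes)
print("samples tested more than once:", int((counts > 1).sum()))
print("samples never tested:", int((counts == 0).sum()))
```

Because the test sets overlap, the ten scores are not independent estimates, which is worth keeping in mind when comparing them with the K-fold scores.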


ROC with cross-validation