<img align="right" width="250" src="https://nullpointerexception1.files.wordpress.com/2017/11/decision-tree-e1513448957591.jpg?w=1400&h=9999">

# Classification with Python

This notebook gives an overview of basic Python functionality for classification using the [sklearn](http://scikit-learn.org/stable/) library.  
Note: this notebook is purposely not comprehensive; it only covers the basics you need to get started.

Import the basic packages.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
FOLDER = 'Dataset/Dataset Visti a Lezione/'

<img align="right" width="150" src="https://archive.ics.uci.edu/ml/assets/MLimages/Large53.jpg">

## Iris Dataset  
[Link](https://archive.ics.uci.edu/ml/datasets/iris) to the dataset on the UCI Machine Learning Repository.  
As a first step, we load the whole Iris dataset and get familiar with its features.  

In [None]:
df = pd.read_csv(FOLDER+"iris.csv")
df.head()

In [None]:
df

In [None]:
df.info()

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.  
The attribute to predict is the class of the iris plant. 

Features:
* sepal length (in cm)
* sepal width (in cm)
* petal length (in cm) 
* petal width (in cm) 
* class: Iris-setosa, Iris-versicolour, Iris-virginica

Since classification is a ***supervised*** task, we are interested in the distribution of the target class.

In [None]:
df['class'].value_counts()

Sometimes it is useful to map a set of strings to a set of integers.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
num_classes = le.fit_transform(df['class'])
print(num_classes[0:5])

In [None]:
len(num_classes)
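
To see which integer was assigned to which class, the encoder can be inspected after fitting. The following is a minimal sketch on a toy list of labels (the names `labels`, `le_demo`, `codes`, `mapping` and `decoded` are illustrative and not part of the notebook); it relies on `LabelEncoder` storing the sorted class names in `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['class'] (assumption: same three class names).
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted; the position of each class is its integer code.
mapping = {name: int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original strings.
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

The same `inverse_transform` call can be used later to turn numeric predictions back into class names.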

## Data Understanding

We inspect the distributions of the attributes and their pairwise relationships; in the plots below, points are colored by class.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(df, figsize=(10, 10), c=num_classes, s=50)
plt.show()

In [None]:
plt.scatter(df['petal_length'], df['petal_width'], s=20, c=num_classes)
plt.tick_params(axis='both', which='major', labelsize=22)
plt.show()

## Classification Objective

We are given a collection of records called the ***training set***, where each record contains a set of ***attributes*** and one of the attributes is the ***target class***. The objective of classification is to find a model for the class attribute as a function of the values of the other attributes.

The ***goal*** is to assign previously unseen records to a class as accurately as possible.
A ***test set*** is used to determine the accuracy of the model. 

Usually, the given data set is divided into training and test sets, with the training set used to build
the model and the test set used to validate it.

<img align="center" width="650" src="http://images.slideplayer.com/15/4732696/slides/slide_4.jpg"> 

## Classification Techniques
* ***Decision Tree***
* ***Instance-based methods***
* Rule-based methods
* Neural Networks
* Naïve Bayes and Bayesian Belief Networks
* Support Vector Machines (SVM)

## Evaluating the Performance of a Classifier

Several measures exist to evaluate the quality of a classification, all of them built upon the concept of the **Confusion Matrix**.

**Confusion Matrix**
In machine learning, a confusion matrix is a table layout that visualizes the performance of an algorithm. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). In sklearn's `confusion_matrix`, rows correspond to the actual classes and columns to the predicted ones.

<img align="right" width="300" src="https://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png">

Given a target class:
* ***True Positive (TP)***: instances correctly predicted to be True
* ***False Positive (FP)***: instances incorrectly predicted to be True
* ***True Negative (TN)***: instances correctly predicted to be False
* ***False Negative (FN)***: instances incorrectly predicted to be False 

Several indicators are built upon these four groups.
Among others, two scores characterize the outcome of a predictive model: ***precision*** and ***recall***

* **Precision**: how many of the instances predicted to be True are really True? $\mathit{precision} = \frac{TP}{TP+FP}$
* **Recall**: how many of the True instances were correctly predicted? $\mathit{recall} = \frac{TP}{TP+FN}$

To summarize the overall performance of a model we can also use the ***accuracy*** and the ***f1-score***: 

* The **accuracy**  $=\frac{TP+TN}{TP+TN+FP+FN}$ is the fraction of instances correctly classified over all instances.
* $1-\mathit{accuracy}$ gives the error rate, i.e., the fraction of instances misclassified by the classifier.
* The **f1-score** $=\frac{2TP}{2TP+FP+FN}$ is the harmonic mean of precision and recall.

All these indicators are provided by [sklearn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).
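
As a worked example of the formulas above, the sketch below computes each indicator from four counts. The counts are made-up numbers chosen only for illustration; they do not come from the Iris data.

```python
# Hypothetical counts for one target class (made-up numbers for illustration).
TP, FP, TN, FN = 40, 10, 45, 5

precision = TP / (TP + FP)                   # 40 / 50
recall = TP / (TP + FN)                      # 40 / 45
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 85 / 100
error_rate = 1 - accuracy
f1 = 2 * TP / (2 * TP + FP + FN)             # 80 / 95

# The f1-score is the harmonic mean of precision and recall.
f1_harmonic = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f} error rate={error_rate:.3f} f1={f1:.3f}")
```

Both ways of computing the f1-score give the same value, which is a quick check that the two formulas agree.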

![image.png](attachment:image.png)

# Decision Tree

## Example of Decision Tree and Application
<img align="left" width="490" src="http://images.slideplayer.com/15/4732696/slides/slide_10.jpg">
<img align="right" width="490" src="http://images.slideplayer.com/15/4732696/slides/slide_13.jpg">

## The Algorithm in a Nutshell

**Objective:** Build the most accurate decision tree.

Let $D_x$ be the set of training records that reach a node $x$.  
* If $D_x$ contains records that all belong to the same class $y$, then $x$ is a leaf node labeled as $y$;
* If $D_x$ contains records that belong to more than one class, use the **best attribute** to split the data into smaller subsets $D_1, \dots, D_k$.
* Recursively apply the procedure to each subset.

How to determine the best split: nodes with ***homogeneous*** class distribution are preferred.  
Thus, a measure of node ***impurity*** is required. Examples of impurity measures:
* Gini Index
* Entropy
* Misclassification error
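
The three measures can be sketched in a few lines of plain Python, using their textbook definitions on the class counts of a node. The function names and the example counts below are illustrative assumptions, not part of sklearn or of the notebook.

```python
import math

def class_probabilities(counts):
    """Turn per-class record counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def gini(counts):
    return 1.0 - sum(p * p for p in class_probabilities(counts))

def entropy(counts):
    return -sum(p * math.log2(p) for p in class_probabilities(counts) if p > 0)

def misclassification_error(counts):
    return 1.0 - max(class_probabilities(counts))

def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)

parent = [5, 5]            # 5 records of each class: maximally impure
split = [[4, 1], [1, 4]]   # a candidate split into two children

for measure in (gini, entropy, misclassification_error):
    print(measure.__name__, measure(parent), weighted_impurity(split, measure))
```

A split is good when the weighted impurity of the children is lower than the impurity of the parent; a pure node (all records in one class) has impurity 0 under all three measures.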

How to determine when to stop splitting: there are various ***stopping criteria***:
* Stop expanding a node when all the records belong to the same class
* Stop expanding a node when all the records have similar attribute values
* Early termination (to be discussed later) 

> Tan, P. N. (2006). Introduction to data mining. Pearson Education India.

Running [example](http://matlaspisa.isti.cnr.it:5055/Decision%20Tree)
Wikipedia [link](https://en.wikipedia.org/wiki/Decision_tree)

## Classification Problems

* Missing values: sophisticated techniques are required to handle missing values.
* Many sklearn estimators do not accept missing values, so these usually have to be handled before training.
* Overfitting: the model is very accurate on the training data but performs poorly on the test data.
* For a Decision Tree, this means that the tree is more complex and deeper than necessary.
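
Overfitting can be made visible by comparing training and test accuracy for a fully grown tree and for a shallow one. The sketch below uses a synthetic noisy dataset from `make_classification` rather than Iris (an assumption made so that the effect has a chance to show up); all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 20% of the labels are assigned at random (label noise).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, flip_y=0.2,
                                     random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=0, stratify=y_demo)

results = {}
for depth in (None, 3):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(Xtr, ytr)
    # (actual depth, training accuracy, test accuracy)
    results[depth] = (model.get_depth(),
                      model.score(Xtr, ytr),
                      model.score(Xte, yte))
    print(depth, results[depth])
```

The unrestricted tree fits the training records perfectly, noise included; a large gap between its training and test accuracy is the typical symptom of overfitting.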

## Decision Tree in Python  ([sklearn](http://scikit-learn.org/stable/modules/tree.html))

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
df.columns

In [None]:
predictors = [col for col in df.columns if col != 'class']

In [None]:
predictors

In [None]:
df[predictors].values[:5]

In [None]:
predictors = [col for col in df.columns if col != 'class']
X = df[predictors].values
y = df['class']

Split the dataset into train and test

Remember that **stratification** is important to maintain proportions among classes

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=100, 
                                                    stratify=y)
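
The effect of `stratify` is easier to see on imbalanced labels than on Iris, where the classes are balanced. The following sketch uses toy labels (90 records of one class, 10 of the other; the names `X_toy`, `y_toy`, `y_te_strat`, `y_te_plain` are illustrative) and compares the class counts in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 records of class 0 and 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts.
_, _, _, y_te_strat = train_test_split(X_toy, y_toy, test_size=0.3,
                                       random_state=0, stratify=y_toy)
# Plain split: proportions depend on the random shuffle.
_, _, _, y_te_plain = train_test_split(X_toy, y_toy, test_size=0.3,
                                       random_state=0)

print("stratified test counts:", np.bincount(y_te_strat, minlength=2))
print("plain test counts:     ", np.bincount(y_te_plain, minlength=2))
```

With stratification the test set keeps the 90/10 proportion of the full data; without it, the minority class can end up over- or under-represented.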

What if I also want a validation set?

In [None]:
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
    #test_size=0.25, random_state= 8) 

# validation share of the whole dataset: 0.25 x 0.7 = 0.175
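
A runnable sketch of the two-step split follows. It uses stand-in arrays with the same size as Iris (150 records, 3 balanced classes) instead of the notebook's `X` and `y`, and adds `stratify` to the second split as well, which is a choice made here and not something the commented cell does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same size as Iris (150 records, 3 balanced classes).
y_all = np.repeat([0, 1, 2], 50)
X_all = np.arange(150).reshape(-1, 1)

# Step 1: hold out 30% of the data as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(X_all, y_all, test_size=0.3,
                                            random_state=100, stratify=y_all)
# Step 2: hold out 25% of the remaining records as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                            random_state=8, stratify=y_tmp)

print(len(X_tr), len(X_val), len(X_te))
```

On 150 records this gives 78 training, 27 validation and 45 test records, i.e. roughly 52%, 18% and 30% of the data.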

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Build the decision tree

Parameters:
* **criterion** (default 'gini'): The function to measure the quality of a split. Available: gini, entropy.
* **max_depth** (default None): The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
* **min_samples_split** (default 2): The minimum number of samples required to split an internal node.
* **min_samples_leaf** (default 1): The minimum number of samples required to be at a leaf node.

In [None]:
clf = DecisionTreeClassifier(criterion='gini', max_depth=None, 
                             min_samples_split=2, min_samples_leaf=1)
clf

In [None]:
clf.fit(X_train, y_train)

Output:
* **feature\_importances_**: The feature importances. The higher, the more important the feature.
* **tree_**: The underlying Tree object.

Feature importance

In [None]:
for col, imp in zip(predictors, clf.feature_importances_):
    print(col, imp)
print(clf.classes_)
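
The `tree_` attribute can be explored in a similar way. The sketch below fits a separate tree on sklearn's bundled copy of Iris (`load_iris`, used here instead of the notebook's CSV so that the block is self-contained) and reads a few of the arrays that `tree_` exposes; `demo_tree` and `t` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
demo_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# feature_importances_ is normalized, so the values sum to 1.
for name, imp in zip(iris.feature_names, demo_tree.feature_importances_):
    print(f"{name}: {imp:.3f}")

# tree_ stores the fitted structure as parallel arrays indexed by node id.
t = demo_tree.tree_
print("nodes:", t.node_count, "depth:", t.max_depth)
print("root splits on:", iris.feature_names[t.feature[0]], "<=", t.threshold[0])
print("leaves:", demo_tree.get_n_leaves())
```

Node 0 is the root; `t.feature` and `t.threshold` give, for each internal node, the attribute index and the split value.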

Visualize the decision tree

In [None]:
import pydotplus
from sklearn import tree
from IPython.display import Image

In [None]:
#import os
#os.environ['PATH'] += os.pathsep + 'C:/Users/Username/Anaconda3/Library/bin/graphviz'

In [None]:
dot_data = tree.export_graphviz(clf, out_file=None,  
                                feature_names=predictors, 
                                class_names=clf.classes_,  
                                filled=True, rounded=True,  
                                special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())
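
If Graphviz or pydotplus is not installed, a text rendering of the tree is a possible fallback. This is a minimal sketch using `sklearn.tree.export_text` on a small tree fitted on sklearn's bundled Iris copy (an assumption made to keep the block self-contained; `small_tree` and `rules` are illustrative names).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# One line per node: split conditions for internal nodes, classes for leaves.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

The output is an indented list of split conditions, which is often enough to read off the decision rules without drawing the tree.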

Apply the decision tree on the training set

In [None]:
y_pred = clf.predict(X_train)

Evaluate the performance

In [None]:
print('Accuracy %s' % accuracy_score(y_train, y_pred))
print('F1-score %s' % f1_score(y_train, y_pred, average=None))

In [None]:
print(classification_report(y_train, y_pred))

In [None]:
confusion_matrix(y_train, y_pred)

Apply the decision tree on the test set and evaluate the performance

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))
confusion_matrix(y_test, y_pred)

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import roc_curve, auc, roc_auc_score

In [None]:
lb = LabelBinarizer()
lb.fit(y_test)
lb.classes_.tolist()

In [None]:
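fpr = dict()
tpr = dict()
roc_auc = dict()
# One-vs-rest encoding: one 0/1 column per class
by_test = lb.transform(y_test)
by_pred = lb.transform(y_pred)

# ROC points and AUC of each class against the rest
for i in range(len(lb.classes_)):
    fpr[i], tpr[i], _ = roc_curve(by_test[:, i], by_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# The same per-class AUC values in a single call (replaces the dict above)
roc_auc = roc_auc_score(by_test, by_pred, average=None)
roc_auc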

In [None]:
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

class_of_interest = "Iris-virginica"
class_id = np.flatnonzero(lb.classes_ == class_of_interest)[0]

RocCurveDisplay.from_predictions(
    by_test[:, class_id],
    by_pred[:, class_id],
    name=f"{class_of_interest} vs the rest",
    color="darkorange",
)
plt.plot([0, 1], [0, 1], "k--", label="chance level (AUC = 0.5)")
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nVirginica vs (Setosa & Versicolor)")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
for i in range(3):
    plt.plot(fpr[i], tpr[i], 
             label='%s ROC curve (area = %0.2f)' % (lb.classes_.tolist()[i], roc_auc[i]))
    
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=20)
plt.ylabel('True Positive Rate', fontsize=20) 
plt.tick_params(axis='both', which='major', labelsize=22)
plt.legend(loc="lower right", fontsize=14, frameon=False)
plt.show()
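
The curves above are built from hard class predictions, so each one has a single operating point between (0, 0) and (1, 1). A possible variant, sketched below under the assumption that class probabilities are preferred, uses `predict_proba` as the score. It is fitted on sklearn's bundled Iris copy, and `min_samples_leaf=5` is an arbitrary choice that allows some leaves to contain mixed classes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          test_size=0.3, random_state=100,
                                          stratify=iris.target)

# Larger leaves may hold mixed classes, giving probabilities other than 0 or 1.
model = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
model.fit(X_tr, y_tr)

scores = model.predict_proba(X_te)                    # one column per class
y_bin = label_binarize(y_te, classes=model.classes_)  # one-vs-rest targets

aucs = {}
for i, name in enumerate(iris.target_names):
    fpr_i, tpr_i, _ = roc_curve(y_bin[:, i], scores[:, i])
    aucs[name] = auc(fpr_i, tpr_i)
    print(name, round(aucs[name], 3))
```

The arrays `fpr_i` and `tpr_i` can be plotted exactly like the curves above.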

### Cross Validation  
![image.png](attachment:image.png)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

In [None]:
scores = cross_val_score(clf, X_train, y_train, cv=10)
print('Accuracy: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='f1_macro')
print('F1-score: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

In [None]:
scores

In [None]:
scoring = ['precision_macro', 'recall_macro']
scores = cross_validate(clf, X_train, y_train, scoring=scoring, cv=10)
sorted(scores.keys())

In [None]:
scores
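
To make explicit what `cross_val_score` does, the loop below reproduces it by hand: for each fold, a fresh copy of the model is trained on the other folds and scored on the held-out one. This is a sketch on sklearn's bundled Iris copy with an explicit `StratifiedKFold` splitter (both are assumptions made so that the block is self-contained and reproducible).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)
base = DecisionTreeClassifier(random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in folds.split(X_cv, y_cv):
    # Train a fresh copy on 9 folds, score it on the held-out fold.
    model = clone(base).fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))

library_scores = cross_val_score(base, X_cv, y_cv, cv=folds)
print(np.round(manual_scores, 3))
print(np.allclose(manual_scores, library_scores))
```

Because the splitter and the model both have a fixed `random_state`, the manual loop and `cross_val_score` return the same ten accuracies.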

### Tuning the hyper-parameters

- **Search Space**: the volume to be searched, where each dimension represents a hyperparameter and each point represents one model configuration.
- **Random Search**: define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
- **Grid Search**: define a search space as a grid of hyperparameter values and evaluate every position in the grid.

More options at [link](http://scikit-learn.org/stable/modules/grid_search.html#grid-search)
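
The difference between the two strategies can be seen by listing the configurations they would try. The sketch below uses sklearn's `ParameterGrid` and `ParameterSampler` on a small illustrative search space (the dictionary `space` is an example, not a recommendation).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space: two hyperparameters with four candidate values each.
space = {'min_samples_split': [2, 5, 10, 20],
         'min_samples_leaf': [1, 5, 10, 20]}

# Grid search evaluates every combination: 4 x 4 = 16 configurations.
grid = list(ParameterGrid(space))

# Random search evaluates only n_iter randomly sampled configurations.
sampled = list(ParameterSampler(space, n_iter=5, random_state=0))

print(len(grid))
for params in sampled:
    print(params)
```

Grid search grows multiplicatively with every added hyperparameter, while random search keeps the number of evaluated models fixed at `n_iter`.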

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
param_list = {'min_samples_split': [2, 5, 10, 20],
              'min_samples_leaf': [1, 5, 10, 20],
             }

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(clf, param_grid=param_list, cv=cv)
grid_search.fit(X_train, y_train)
clf = grid_search.best_estimator_

In [None]:
clf

In [None]:
report(grid_search.cv_results_, n_top=3)

In [None]:
param_list = {'max_depth': [None] + list(np.arange(2, 20)),
              'min_samples_split': [2, 5, 10, 20, 30, 50, 100],
              'min_samples_leaf': [1, 5, 10, 20, 30, 50, 100],
             }

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
random_search = RandomizedSearchCV(clf, param_distributions=param_list, 
                                   n_iter=100, cv=cv)
random_search.fit(X_train, y_train)
clf = random_search.best_estimator_

In [None]:
clf

In [None]:
report(random_search.cv_results_, n_top=3)

## Any other Sklearn classifier can be used in the same way

Let's see two examples: Random Forest and K-Nearest Neighbors

# Random Forest

Sklearn [link](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for more details.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(n_estimators=100, 
                             criterion='gini', 
                             max_depth=None, 
                             min_samples_split=2, 
                             min_samples_leaf=1, 
                             class_weight=None)

In [None]:
scores = cross_val_score(clf, X_train, y_train, cv=10)
print('Accuracy: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='f1_macro')
print('F1-score: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

# K-Nearest Neighbors

Sklearn [link](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for more details.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(clf, X, y, cv=10)
print('Accuracy: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

scores = cross_val_score(clf, X, y, cv=10, scoring='f1_macro')
print('F1-score: %0.4f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

# XGBoost

In [None]:
import xgboost

In [None]:
xgb = xgboost.XGBClassifier()

In [None]:
le = LabelEncoder()
num_classes = le.fit_transform(y_train)
print(num_classes[0:5], len(num_classes))

In [None]:
xgb.fit(X_train, num_classes)

In [None]:
res = xgb.predict(X_test)

In [None]:
# Reuse the mapping learned on y_train instead of refitting on the test labels
accuracy_score(le.transform(y_test), res)