<a href="https://colab.research.google.com/github/sakuronohana/my_datascience/blob/master/udemy/mlaz/Part%203%20-%20Classification/Section%2010%20-%20Model%20Selection%20Classification/classifier_model_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification Model Selection

In this notebook we perform a classic model selection: we train several classifiers on the same dataset and use performance metrics to pick the best model.

## Importing the libraries

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [9]:
datloc = ('https://raw.githubusercontent.com/sakuronohana/my_datascience/master/udemy/mlaz/Part%203%20-%20Classification/Section%2010%20-%20Model%20Selection%20Classification/Data.csv')

dataset = pd.read_csv(datloc)
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

## Splitting the dataset into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25, random_state=0)

## Feature Scaling

In [33]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
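
To make the scaling step concrete, here is a minimal sketch on a small made-up array (an assumption for illustration, not the data from `Data.csv`). It checks that `StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation of the training data, and that the test set is transformed with the training statistics rather than its own. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative (not taken from Data.csv)
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_demo = np.array([[4.0, 30.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean/std from the training data
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses the training mean/std

# Manual z-score with the training statistics (population std, ddof=0)
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
manual_train = (X_train_demo - mu) / sigma
manual_test = (X_test_demo - mu) / sigma

print(np.allclose(X_train_scaled, manual_train))
print(np.allclose(X_test_scaled, manual_test))
```

Fitting only on the training set keeps information about the test set out of the preprocessing step.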

## Train and Test different Classification models

### 1. Model - Linear Logistic Regression Classifier

In [34]:
# Train
from sklearn.linear_model import LogisticRegression
clf_logreg = LogisticRegression(random_state=0)
clf_logreg.fit(X_train,y_train)

# Test 
y_pred_logreg = clf_logreg.predict(X_test)

### 2. Model - KNeighbors Model

In [13]:
# Train
from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier(n_neighbors=10, p=2)
clf_knn.fit(X_train, y_train)

# Test
y_pred_knn = clf_knn.predict(X_test)

### 3. Model - Linear Support Vector Classification (SVC)

In [14]:
# Train
from sklearn.svm import SVC
clf_linsvc = SVC(random_state=0, kernel='linear')
clf_linsvc.fit(X_train,y_train)

# Test
y_pred_linsvc = clf_linsvc.predict(X_test)

### 4. Model - RBF Support Vector Classifier

In [15]:
# Train
from sklearn.svm import SVC
clf_rbfsvc = SVC(random_state=0, kernel='rbf')
clf_rbfsvc.fit(X_train,y_train)

# Test
y_pred_rbfsvc = clf_rbfsvc.predict(X_test)

### 5. Model - Naive Bayes Classifier

In [16]:
# Train
from sklearn.naive_bayes import GaussianNB
clf_gausnb = GaussianNB()
clf_gausnb.fit(X_train, y_train)

# Test
y_pred_gausnb = clf_gausnb.predict(X_test)

### 6. Model - Decision Tree Classifier

In [35]:
# Train
from sklearn.tree import DecisionTreeClassifier
clf_dtree = DecisionTreeClassifier(random_state=0, criterion='entropy')
clf_dtree.fit(X_train,y_train)

# Test
y_pred_dtree = clf_dtree.predict(X_test)

### 7. Model - Random Forest Classifier

In [18]:
# Train
from sklearn.ensemble import RandomForestClassifier
clf_ranfor = RandomForestClassifier(n_estimators=10,random_state=0, criterion='entropy')
clf_ranfor.fit(X_train,y_train)

# Test
y_pred_ranfor = clf_ranfor.predict(X_test)

## Measure classification performance.

In [48]:
from sklearn.metrics import confusion_matrix

print('Results with confusion matrix:')
print('Linear Logistic Regression: \n',confusion_matrix(y_test,y_pred_logreg))
print('KNeighbors: \n',confusion_matrix(y_test,y_pred_knn))
print('Linear Support Vector (SVC): \n',confusion_matrix(y_test,y_pred_linsvc))
print('SVM RBF-Kernel:\n',confusion_matrix(y_test,y_pred_rbfsvc))
print('Naive Bayes:\n',confusion_matrix(y_test,y_pred_gausnb))
print('Decision Tree: \n', confusion_matrix(y_test,y_pred_dtree))
print('Random Forest:\n',confusion_matrix(y_test,y_pred_ranfor))

Results with confusion matrix:
Linear Logistic Regression: 
 [[103   4]
 [  5  59]]
KNeighbors: 
 [[103   4]
 [  6  58]]
Linear Support Vector (SVC): 
 [[102   5]
 [  5  59]]
SVM RBF-Kernel:
 [[102   5]
 [  3  61]]
Naive Bayes:
 [[99  8]
 [ 2 62]]
Decision Tree: 
 [[103   4]
 [  3  61]]
Random Forest:
 [[102   5]
 [  6  58]]
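
The confusion matrices can be condensed into one number per model. The sketch below copies the seven matrices from the output above and derives plain accuracy (correct predictions divided by all predictions) for each. The helper `accuracy_from_cm` and the dictionary names are hypothetical and exist only for this illustration.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: true class, columns: predicted class)
matrices = {
    'Linear Logistic Regression': [[103, 4], [5, 59]],
    'KNeighbors': [[103, 4], [6, 58]],
    'Linear Support Vector (SVC)': [[102, 5], [5, 59]],
    'SVM RBF-Kernel': [[102, 5], [3, 61]],
    'Naive Bayes': [[99, 8], [2, 62]],
    'Decision Tree': [[103, 4], [3, 61]],
    'Random Forest': [[102, 5], [6, 58]],
}

def accuracy_from_cm(cm):
    """Accuracy = sum of the diagonal / sum of all entries."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

accuracies = {name: accuracy_from_cm(cm) for name, cm in matrices.items()}

# Print the models from best to worst accuracy
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print('{}: {:.3f}'.format(name, acc))
```

Accuracy treats both kinds of error alike, so on imbalanced data it is worth comparing it with a metric that takes all four cells of the matrix into account.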


### Measure Performance with Matthews Correlation Coefficient (MCC)

An explanation of how the MCC works can be found [here](https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a).

In [44]:
from sklearn.metrics import confusion_matrix, matthews_corrcoef

print('MCC scores:')
print('Linear Logistic Regression:',round(matthews_corrcoef(y_test,y_pred_logreg),3))
print('KNeighbors:',round(matthews_corrcoef(y_test,y_pred_knn),3))
print('Linear Support Vector (SVC):',round(matthews_corrcoef(y_test,y_pred_linsvc),3))
print('SVM RBF-Kernel:',round(matthews_corrcoef(y_test,y_pred_rbfsvc),3))
print('Naive Bayes:',round(matthews_corrcoef(y_test,y_pred_gausnb),3))
print('Decision Tree:',round(matthews_corrcoef(y_test,y_pred_dtree),3))
print('Random Forest:',round(matthews_corrcoef(y_test,y_pred_ranfor),3))

MCC scores:
Linear Logistic Regression: 0.887
KNeighbors: 0.875
Linear Support Vector (SVC): 0.875
SVM RBF-Kernel: 0.901
Naive Bayes: 0.88
Decision Tree: 0.913
Random Forest: 0.862
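
As a cross-check, the MCC can be recomputed by hand from a binary confusion matrix with the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below uses the decision tree matrix reported earlier and treats the second class as the positive one. The variable names are illustrative.

```python
import math

# Decision tree confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 103, 4, 3, 61

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator

print(round(mcc, 3))  # 0.913, matching the Decision Tree value above
```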


### Measure with Cross Validation

In [42]:
from sklearn.model_selection import cross_val_score
dtree_cv = cross_val_score(estimator=clf_dtree, X = X_train, y = y_train)
print('Mean CV accuracy: {:.2f}%'.format(dtree_cv.mean()*100))
print('CV standard deviation: {:.2f}%'.format(dtree_cv.std()*100))


Mean CV accuracy: 92.97%
CV standard deviation: 0.37%
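
The same cross-validation could be applied to all seven classifiers instead of only the decision tree. The following is a minimal sketch under stated assumptions. It uses synthetic stand-in data from `make_classification` so that it runs without downloading `Data.csv`, and it wraps each model in a pipeline so the scaler is refit on every training fold. The pipeline, the choice of `cv=10`, and the names `X_demo`, `y_demo`, `models` and `cv_results` are assumptions of this sketch, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the unscaled training data to use it for real
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Same hyperparameters as in the model cells above
models = {
    'Linear Logistic Regression': LogisticRegression(random_state=0),
    'KNeighbors': KNeighborsClassifier(n_neighbors=10, p=2),
    'Linear Support Vector (SVC)': SVC(random_state=0, kernel='linear'),
    'SVM RBF-Kernel': SVC(random_state=0, kernel='rbf'),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=0, criterion='entropy'),
    'Random Forest': RandomForestClassifier(n_estimators=10, random_state=0, criterion='entropy'),
}

cv_results = {}
for name, model in models.items():
    # The scaler lives inside the pipeline, so each fold is scaled with its own training statistics
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
    cv_results[name] = (scores.mean(), scores.std())
    print('{}: {:.2f}% (+/- {:.2f}%)'.format(name, scores.mean() * 100, scores.std() * 100))
```

Comparing mean and standard deviation across folds gives a more stable ranking than a single train/test split, where differences of one or two test samples can reorder the models.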


In [40]:
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred_dtree, average='macro'))

0.9564362921716344
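
The macro-averaged F1 score is the unweighted mean of the per-class F1 scores. The sketch below recomputes it by hand from the decision tree confusion matrix reported earlier, using F1 = 2·TP / (2·TP + FP + FN) for each class in turn. The variable names are illustrative.

```python
# Decision tree confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 103, 4, 3, 61

# F1 with the second class as positive
f1_pos = 2 * tp / (2 * tp + fp + fn)
# F1 with the first class as positive: the roles of TP/TN and FP/FN swap
f1_neg = 2 * tn / (2 * tn + fn + fp)

macro_f1 = (f1_pos + f1_neg) / 2
print(macro_f1)
```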


In [41]:
from sklearn.metrics import matthews_corrcoef
print(matthews_corrcoef(y_test, y_pred_dtree))

0.9129464705674794
