# 3. A Tour of Machine Learning Classifiers Using scikit-learn
- Introduction to popular and powerful classification algorithms: logistic regression, support vector machines, and decision trees
- Examples and explanations using the scikit-learn machine learning library
- Strengths and weaknesses of classifiers with linear and nonlinear decision boundaries

## 3.1 Choosing a Classification Algorithm

The predictive and computational performance of a classification model depends heavily on the data available for training  
Training a machine learning algorithm involves five main steps:
1. Select features and collect training samples
2. Choose a performance metric
3. Choose a classification model and an optimization algorithm
4. Evaluate the model's performance
5. Tune the algorithm

## 3.2 First Steps with scikit-learn: Training a Perceptron

For visualization purposes, use only two features from the iris dataset  
The petal length and petal width of the 150 flower samples are assigned to the feature matrix X, and the class labels of the corresponding flower species are assigned to the vector y  

In [3]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:,[2, 3]]
y = iris.target
print('Class Label:', np.unique(y))

Class Label: [0 1 2]


The np.unique(y) function returns the three unique class labels stored in iris.target  
As you can see, the class names Iris-setosa, Iris-versicolor, and Iris-virginica are already stored as integers (here: 0, 1, 2)  
Integer labels are recommended because they avoid subtle technical errors and use less memory, which improves computational performance  
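
If class labels arrive as strings rather than integers, they can be encoded first. The following is a minimal sketch, assuming scikit-learn's `LabelEncoder` and a small hand-written array of label strings (the array is hypothetical and used only for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, for illustration only
labels = np.array(['Iris-setosa', 'Iris-virginica',
                   'Iris-versicolor', 'Iris-setosa'])

le = LabelEncoder()
# Unique labels are sorted and then numbered 0, 1, 2, ...
y_int = le.fit_transform(labels)

print('Classes:', le.classes_)
print('Encoded labels:', y_int)

# The mapping can be reversed to recover the original names
names = le.inverse_transform([0, 1, 2])
print('Decoded labels:', names)
```

The fitted encoder keeps the mapping, so integer predictions can be converted back to readable class names later.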

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

Randomly split the X and y arrays using the train_test_split function from scikit-learn's model_selection module  
30% of the samples (45) become test data and 70% (105) become training data  


The train_test_split function shuffles the dataset internally before splitting  
Without shuffling, all samples of class 0 and class 1 would end up in the training set, and the test set would consist only of 45 samples of class 2  
The random_state parameter passes a fixed random seed (random_state=1) to the pseudorandom number generator that shuffles the dataset before splitting  
Fixing random_state makes the results reproducible  
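
To see why shuffling matters, the split can be repeated with shuffling switched off. This is a small sketch, assuming the `shuffle=False` option of `train_test_split` (which cannot be combined with `stratify`) and relying on the fact that the iris samples are stored sorted by class:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# shuffle=False keeps the original, class-sorted order:
# the first 105 samples go to training, the last 45 to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, shuffle=False)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print('label count for y_tr', train_counts)
print('label count for y_te', test_counts)
```

With this split the test set contains only class 2, so a model evaluated on it would say nothing about classes 0 and 1.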

Finally, stratification is enabled via stratify=y  
Stratification means that train_test_split returns training and test sets whose class label proportions match those of the input dataset  
You can count how many times each value occurs in an array using NumPy's bincount function  


Let's check the stratification

In [5]:
print('label count for y', np.bincount(y))

label count for y [50 50 50]


In [6]:
print('label count for y_train', np.bincount(y_train))

label count for y_train [35 35 35]


In [7]:
print('label count for y_test', np.bincount(y_test))

label count for y_test [15 15 15]


We will standardize the features using the StandardScaler class from scikit-learn's preprocessing module

In [8]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

The fit method of StandardScaler estimates the sample mean $\mu$ and standard deviation $\sigma$ of each feature dimension from the training set  
Calling the transform method then standardizes the training set using the estimated $\mu$ and $\sigma$  
The test set is standardized with the same $\mu$ and $\sigma$ so that the values in the training and test sets are comparable  
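
The transformation itself is $x' = (x - \mu) / \sigma$ applied per feature. As a check, the sketch below recomputes it by hand with NumPy and compares the result with StandardScaler; the variable names `mu`, `sigma`, and the `*_manual` arrays are introduced here only for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)

# Per-feature statistics, estimated from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training-set mu and sigma to both sets
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

train_match = np.allclose(X_train_manual, sc.transform(X_train))
test_match = np.allclose(X_test_manual, sc.transform(X_test))
print('train matches:', train_match)
print('test matches: ', test_match)
```

Note that the standardized test set will generally not have exactly zero mean and unit variance, because $\mu$ and $\sigma$ come from the training set.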

With the training data standardized, we can now train a perceptron model  
Most of scikit-learn's algorithms support multiclass classification using the one-versus-rest (OvR) method  
This lets us feed all three iris classes to the perceptron at once

In [10]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=40, eta0=0.1, tol=1e-3, random_state=1)
ppn.fit(X_train_std, y_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=40, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=1, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

Load the Perceptron class from the linear_model module, create a new Perceptron object, and train the model using the fit method  
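
One way to see the OvR strategy at work is to inspect the fitted model: it holds one weight vector and one bias per class, and the predicted class is the one with the highest decision score. The sketch below repeats the preparation steps so that it runs on its own:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(max_iter=40, eta0=0.1, tol=1e-3, random_state=1)
ppn.fit(X_train_std, y_train)

# One row of weights and one bias per class (3 classes, 2 features)
print('coef_ shape:', ppn.coef_.shape)
print('intercept_ shape:', ppn.intercept_.shape)

# Each column holds the score of one binary "class k vs. rest" classifier
scores = ppn.decision_function(X_test_std)
pred_from_scores = ppn.classes_[scores.argmax(axis=1)]
same = np.array_equal(pred_from_scores, ppn.predict(X_test_std))
print('argmax of scores equals predict:', same)
```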

Finding an appropriate learning rate (eta0) requires some experimentation  
If the learning rate is too large, the algorithm overshoots the global cost minimum  
If the learning rate is too small, learning is slow and many epochs are needed to converge, especially on large datasets  
Setting random_state makes the shuffling of the training set in each epoch reproducible  
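
The effect of the learning rate can be illustrated on a toy problem. The sketch below is not the perceptron itself: it assumes plain gradient descent on the one-dimensional cost $J(w) = w^2$, whose minimum is at $w = 0$, and the function name `gradient_descent` is made up for this example:

```python
def gradient_descent(eta, w0=1.0, n_steps=20):
    """Minimize the toy cost J(w) = w**2 with a fixed learning rate eta."""
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * w       # dJ/dw
        w = w - eta * grad   # each step multiplies w by (1 - 2 * eta)
    return w

# Too small: w shrinks by a factor of 0.998 per step and barely moves
w_small = gradient_descent(eta=0.001)
# Reasonable: w shrinks by a factor of 0.2 per step and converges quickly
w_good = gradient_descent(eta=0.4)
# Too large: w is multiplied by -1.2 per step, overshooting further each time
w_large = gradient_descent(eta=1.1)

print('eta=0.001:', w_small)
print('eta=0.4:  ', w_good)
print('eta=1.1:  ', w_large)
```

After 20 steps the small rate is still close to the starting point, the moderate rate is essentially at the minimum, and the large rate has moved away from it.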

You can make predictions with the predict method

In [13]:
y_pred = ppn.predict(X_test_std)
print('Misclassified Sample Count: %d' % (y_test != y_pred).sum())

Misclassified Sample Count: 1
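
One misclassified sample out of 45 corresponds to an error of about 2.2%, or an accuracy of about 97.8%. The sketch below computes the accuracy directly, assuming `accuracy_score` from scikit-learn's metrics module and the classifier's own `score` method; the exact numbers may differ slightly between scikit-learn versions:

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(max_iter=40, eta0=0.1, tol=1e-3, random_state=1)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)

n_errors = (y_test != y_pred).sum()
# Accuracy = 1 - (misclassified samples / total test samples)
acc = accuracy_score(y_test, y_pred)
# score() combines predict and accuracy_score in a single call
acc_from_score = ppn.score(X_test_std, y_test)

print('Misclassified Sample Count: %d' % n_errors)
print('Accuracy: %.3f' % acc)
print('Accuracy via score: %.3f' % acc_from_score)
```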
