*Python Machine Learning 2nd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/LICENSE.txt)

# Python Machine Learning - Code Examples

# Chapter 3 - A Tour of Machine Learning Classifiers Using Scikit-Learn

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

In [1]:
from sklearn import __version__ as sklearn_version
from packaging.version import Version  # third-party 'packaging'; distutils was removed in Python 3.12

if Version(sklearn_version) < Version('0.18'):
    raise ValueError('Please use scikit-learn 0.18 or newer')

*The use of `watermark` is optional. You can install this IPython extension via "`pip install watermark`". For more information, please see: https://github.com/rasbt/watermark.*

### Overview

- [Choosing a classification algorithm](#Choosing-a-classification-algorithm)
- [Modeling class probabilities via logistic regression](#Modeling-class-probabilities-via-logistic-regression)
    - [Logistic regression intuition and conditional probabilities](#Logistic-regression-intuition-and-conditional-probabilities)
    - [Learning the weights of the logistic cost function](#Learning-the-weights-of-the-logistic-cost-function)
    - [Training a logistic regression model with scikit-learn](#Training-a-logistic-regression-model-with-scikit-learn)
    - [Tackling overfitting via regularization](#Tackling-overfitting-via-regularization)
- [Maximum margin classification with support vector machines](#Maximum-margin-classification-with-support-vector-machines)
    - [Maximum margin intuition](#Maximum-margin-intuition)
    - [Dealing with the nonlinearly separable case using slack variables](#Dealing-with-the-nonlinearly-separable-case-using-slack-variables)
    - [Alternative implementations in scikit-learn](#Alternative-implementations-in-scikit-learn)
- [Solving nonlinear problems using a kernel SVM](#Solving-nonlinear-problems-using-a-kernel-SVM)
    - [Using the kernel trick to find separating hyperplanes in higher dimensional space](#Using-the
- [Summary](#Summary)

<br>
<br>

In [1]:
from IPython.display import Image
%matplotlib inline


# Choosing a classification algorithm

...

# First steps with scikit-learn

Loading the Iris dataset from scikit-learn. Here, the third column represents the petal length, and the fourth column the petal width of the flower samples. The classes are already converted to integer labels where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.

In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

print('Class labels:', np.unique(y))

Looking at the shape of the arrays and the data.

In [1]:
X.shape

In [1]:
X[0:10,:]

In [1]:
y[0:10]

Checking the distribution of classes in the dataset.

In [1]:
print(np.sum(y==0))
print(np.sum(y==1))
print(np.sum(y==2))

There is an equal number of flowers in each class. As an exercise, check whether the first 50 values of y belong to class 0, the next 50 to class 1, and the last 50 to class 2.

In [1]:
# Check if the first 50 values in y are from Class 0 (setosa), the next 50 from Class 1 (Versicolor) and the last 50 from Class 2 (Virginica)
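One possible way to approach this exercise is sketched below. The helper `is_ordered_by_class` is a hypothetical name introduced only for illustration, and the two label arrays are stand-ins rather than the actual Iris targets.

```python
import numpy as np


def is_ordered_by_class(labels, n_per_class=50):
    """Return True if labels come in blocks: n_per_class of the lowest
    class, then n_per_class of the next class, and so on."""
    labels = np.asarray(labels)
    expected = np.repeat(np.unique(labels), n_per_class)
    return bool(np.array_equal(labels, expected))


# Stand-in label arrays (assumptions, not loaded from scikit-learn).
y_blocks = np.repeat([0, 1, 2], 50)
y_shuffled = np.random.default_rng(0).permutation(y_blocks)

print(is_ordered_by_class(y_blocks))
print(is_ordered_by_class(y_shuffled))
```

In the notebook, calling `is_ordered_by_class(y)` would run the same check on the Iris labels loaded earlier.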

Splitting data into 70% training and 30% test data:

Note that the `train_test_split` function shuffles the dataset internally before splitting; otherwise, all class 0 and class 1 samples would end up in the training set, and the test set would consist of 45 samples from class 2. Via the `random_state` parameter, we provide a fixed random seed (`random_state=1`) for the internal pseudo-random number generator that is used for shuffling the dataset prior to splitting. Using such a fixed `random_state` ensures that our results are reproducible.

Lastly, we take advantage of the built-in support for stratification via `stratify=y`. In this context, stratification means that the `train_test_split` function returns training and test subsets that have the same proportions of class labels as the input dataset.
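To see why shuffling matters, the sketch below mimics an unshuffled 70/30 split with plain NumPy. The label array is a stand-in that assumes the 150 labels are stored in class order (50 per class); it does not call `train_test_split`.

```python
import numpy as np

# Stand-in for the labels, assuming they are ordered by class
# (50 x class 0, then 50 x class 1, then 50 x class 2).
y_ordered = np.repeat([0, 1, 2], 50)

# Naive split without shuffling: first 105 samples train, last 45 test.
n_train = 105
y_train_naive = y_ordered[:n_train]
y_test_naive = y_ordered[n_train:]

train_counts = np.bincount(y_train_naive, minlength=3)
test_counts = np.bincount(y_test_naive, minlength=3)

print('Train label counts (no shuffling):', train_counts)
print('Test label counts (no shuffling):', test_counts)
```

Under this assumption the test subset contains only class 2, which is the situation that shuffling and stratification avoid.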

In [1]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

In [1]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Standardizing the features:

Using the `fit` method, `StandardScaler` estimates the parameters μ (sample mean) and σ (standard deviation) for each feature dimension from the training data. By calling the `transform` method, we then standardize the training data using those estimated parameters μ and σ. The same parameters are used to standardize the test data, so that the values in both sets are comparable.
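As a rough illustration of what standardization does, the sketch below applies the same steps by hand with NumPy. The toy arrays are hypothetical and stand in for the training and test features; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy data standing in for the training and test features.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0], [5.0, 50.0]])

# Estimate mu and sigma from the training data only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training-set parameters to both subsets.
X_train_toy_std = (X_train_toy - mu) / sigma
X_test_toy_std = (X_test_toy - mu) / sigma

print(X_train_toy_std.mean(axis=0))  # approximately 0 per feature
print(X_train_toy_std.std(axis=0))   # approximately 1 per feature
print(X_test_toy_std)
```

Note that the test data is not guaranteed to have zero mean or unit variance after the transform, because μ and σ come from the training data.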

In [1]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Redefining the `plot_decision_regions` function from chapter 2:

In [1]:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx], 
                    label=cl, 
                    edgecolor='black')

    # highlight test samples
    if test_idx:
        # select the test samples by index
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100, 
                    label='test set')

In [1]:
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

<br>
<br>

# Modeling class probabilities via logistic regression

...

Logistic regression is a classification method that predicts the probability that an input X belongs to class y. The probability P(y = 1 | X) is computed and then thresholded to 0 or 1 to classify y.

It is named for the logistic (sigmoid) function, an S-shaped curve that maps any real number to a value strictly between 0 and 1.

$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$

where $b_0$ and $b_1$ are the weights.

$$P(y = 1 \mid X) = p(X) = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$$

This can be written as

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = b_0 + b_1 X$$

The log odds of the default class are a linear function of the input X. The coefficients are estimated using maximum likelihood estimation. The predicted class follows from thresholding the probability:

$$p(X) \geq 0.5 \Rightarrow y = 1$$

$$p(X) < 0.5 \Rightarrow y = 0$$

https://machinelearningmastery.com/logistic-regression-for-machine-learning/
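The relationship between the probability and the log odds can be checked numerically. The sketch below uses hypothetical weights and inputs chosen only for illustration; the function name `logistic` is likewise an assumption.

```python
import numpy as np


def logistic(x, b0, b1):
    """Probability p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return np.exp(z) / (1.0 + np.exp(z))


# Hypothetical weights and inputs, chosen only for illustration.
b0_demo, b1_demo = -1.0, 2.0
x_demo = np.array([-2.0, 0.0, 0.5, 3.0])

p_demo = logistic(x_demo, b0_demo, b1_demo)
log_odds = np.log(p_demo / (1.0 - p_demo))

# The log odds recover the linear combination b0 + b1 * x.
print(np.allclose(log_odds, b0_demo + b1_demo * x_demo))

# Thresholding the probability at 0.5 yields the class label.
y_pred_demo = np.where(p_demo >= 0.5, 1, 0)
print(p_demo.round(3), y_pred_demo)
```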

### Logistic regression intuition and conditional probabilities

In [1]:
import matplotlib.pyplot as plt
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)

plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi (z)$')  # raw string avoids an invalid '\p' escape warning

# y axis ticks and gridline
plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)

plt.tight_layout()
#plt.savefig('images/03_02.png', dpi=300)
plt.show()

In [1]:
Image('../input/python-ml-ch03-images/03_03.png',width=700)

<br>
<br>

### Learning the weights of the logistic cost function
Cost function

https://sebastianraschka.com/faq/docs/probablistic-logistic-regression.html

In [1]:
Image(filename='../input/regularization/LR-cost.png')

where z = b0 + b1 * x

In [1]:
def cost_1(z):
    return - np.log(sigmoid(z))


def cost_0(z):
    return - np.log(1 - sigmoid(z))

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)

c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')

c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')

plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel(r'$\phi$(z)')  # raw string avoids an invalid '\p' escape warning
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
#plt.savefig('images/03_04.png', dpi=300)
plt.show()
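The two curves plotted by this cell are the two branches of the per-sample logistic cost. As a sketch in the notation of the code (the standard negative log-likelihood form; the symbols in the referenced image may differ):

$$
J(\phi(z), y; w) = -y \log\big(\phi(z)\big) - (1 - y) \log\big(1 - \phi(z)\big)
$$

For $y = 1$ this reduces to $-\log(\phi(z))$, which is `cost_1`, and for $y = 0$ it reduces to $-\log(1 - \phi(z))$, which is `cost_0`. The cost approaches zero when the predicted probability matches the true label and grows without bound as the prediction approaches the wrong extreme.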

<br>
<br>

### Training a logistic regression model with scikit-learn

Parameters

`class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)`

C : float, default: 1.0
Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.

In [1]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=100,penalty='l2', random_state=1)
lr.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_06.png', dpi=300)
plt.show()

In [1]:
#Probability estimate
lr.predict_proba(X_test_std[:3, :])

In [1]:
lr.predict_proba(X_test_std[:3, :]).sum(axis=1)

In [1]:
lr.predict_proba(X_test_std[:3, :]).argmax(axis=1)

In [1]:
lr.predict(X_test_std[:3, :])

In [1]:
lr.predict(X_test_std[0, :].reshape(1, -1))

In [1]:
#Returns the mean accuracy on the given test data and labels.
lr.score(X_test_std, y_test)


In [1]:
lr.score(X_train_std,y_train)
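The mean accuracy reported by `score` is the fraction of samples whose predicted label matches the true label. A minimal sketch of that computation, using hypothetical label arrays rather than the notebook's predictions:

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_hat_demo = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Mean accuracy = fraction of predictions that match the true labels.
accuracy_demo = np.mean(y_true_demo == y_hat_demo)
n_misclassified = int(np.sum(y_true_demo != y_hat_demo))

print('Accuracy:', accuracy_demo)
print('Misclassified samples:', n_misclassified)
```

In the notebook, `np.mean(lr.predict(X_test_std) == y_test)` should give the same value as `lr.score(X_test_std, y_test)`.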

In [1]:
# Model parameters: coef_ holds the feature weights (b1, one row per class),
# intercept_ holds the bias terms (b0).
print(lr.coef_)
print(lr.intercept_)

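Exercise: Recalculate the scores for different values of C, for example C = 0.1, 1, 10, 100.


Exercise: Change the penalty from the default "l2" to "l1" and note the change in the weights. Try C = 0.1 and compare the scores. Depending on the scikit-learn version, this may also require a solver that supports the L1 penalty, such as `solver='liblinear'`.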

<br>
<br>

### Tackling overfitting via regularization

What is overfitting?
The model performs well on the training data but not on unseen data (test data). It has high variance, which can be caused by too many parameters yielding an overly complex model. Variance measures the stability or consistency of the model's predictions if we rebuild the model multiple times with different subsets of the training data.

What is underfitting?
The model does not capture the pattern in the training data and hence performs poorly on both training and test data. It is a low-complexity, high-bias model.
Bias measures how far off the predictions are from the true values if we train the model multiple times with different subsets of the training data.

Solution: Tune the complexity of the model using regularization. It handles high collinearity, filters out noise from the data, and prevents overfitting. Regularization adds a penalty term to the cost function that penalizes extreme weight values. Feature scaling such as standardization is very important for regularization to work properly.

The original solution to minimize the cost function is below.

In [1]:
#Image(filename='../input/regularization/04_04.png',width=700)

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

https://www.knime.com/blog/regularization-for-logistic-regression-l1-l2-gauss-or-laplace

http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

L1 regularization (LASSO) adds the sum of the absolute values of the weights, scaled by lambda, as the penalty term to the loss/cost function. Because the absolute value is not differentiable at zero, it is typically more expensive to optimize than L2. It yields sparse weights (the coefficients of less important features shrink to exactly zero) and hence has feature selection built in.

In [1]:
#Image(filename='../input/regularization/04_06.png',width=700)

L2 regularization adds the squared magnitude of the weights as the penalty term to the cost function. It is computationally efficient, yields non-sparse weights, and does not perform feature selection.
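The two penalty terms can be compared on a small example. The sketch below uses a hypothetical weight vector and an assumed value for lambda, and it assumes the common convention of scaling the L2 term by lambda / 2; the images referenced in this section may use a different scaling.

```python
import numpy as np

# Hypothetical weight vector, chosen only for illustration.
w_demo = np.array([0.5, -2.0, 0.0, 3.0])
lam = 0.1            # regularization strength lambda (assumed value)
C_equiv = 1.0 / lam  # C in scikit-learn is the inverse of lambda

l1_penalty = lam * np.sum(np.abs(w_demo))       # lambda * ||w||_1
l2_penalty = (lam / 2.0) * np.sum(w_demo ** 2)  # (lambda / 2) * ||w||_2^2

# Gradients of the penalties with respect to the weights:
# L1 pushes every nonzero weight toward zero by a constant amount,
# L2 shrinks each weight in proportion to its size.
l1_grad = lam * np.sign(w_demo)
l2_grad = lam * w_demo

print('C equivalent to lambda:', C_equiv)
print('L1 penalty:', l1_penalty, 'gradient:', l1_grad)
print('L2 penalty:', l2_penalty, 'gradient:', l2_grad)
```

The constant-size L1 gradient is one way to see why L1 tends to drive small weights exactly to zero, while the proportional L2 gradient only makes them smaller.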

In [1]:
Image(filename='../input/regularization/l2-term.png', width=700)

In [1]:
#Image(filename='../input/regularization/04_05.png',width=300)

In [1]:
weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.**c, random_state=1)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)

weights = np.array(weights)
plt.plot(params, weights[:, 0],
         label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
         label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
#plt.savefig('images/03_08.png', dpi=300)
plt.show()
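The shrinking effect of a small C can also be reproduced without scikit-learn. The following is a minimal sketch, not the solver scikit-learn uses: plain batch gradient descent on a binary logistic cost with an L2 penalty. The synthetic data, the function name `fit_logreg_l2`, the learning rate, and the step count are all assumptions made for illustration.

```python
import numpy as np


def fit_logreg_l2(X, y, C, learning_rate=0.1, n_steps=2000):
    """Batch gradient descent for binary logistic regression with L2 penalty.

    Minimizes (1/n) * [sum of log-losses + (1 / (2 * C)) * ||w||^2].
    """
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1 | x)
        grad_w = (X.T @ (p - y) + w / C) / n_samples
        grad_b = np.sum(p - y) / n_samples
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic two-feature data (an assumption; not the Iris data).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.5, size=100)
y_syn = (X_syn[:, 0] + 0.5 * X_syn[:, 1] + noise > 0).astype(float)

weight_norms = []
for C_value in (0.01, 1.0, 100.0):
    w_fit, _ = fit_logreg_l2(X_syn, y_syn, C=C_value)
    weight_norms.append(np.linalg.norm(w_fit))
    print(f'C={C_value:g}: w={w_fit.round(3)}, ||w||={weight_norms[-1]:.3f}')
```

Smaller values of C correspond to a stronger penalty, so the fitted weight vector should have a smaller norm, mirroring the trend in the plot.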

<br>
<br>

# Maximum margin classification with support vector machines
A support vector machine chooses the decision boundary that maximizes the margin. The margin is the distance between the decision boundary and the training samples closest to it, the support vectors.

In [1]:
Image(filename='../input/python-ml-ch03-images/03_09.png', width=700) 
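The distance that defines the margin can be computed directly for a linear boundary. The sketch below uses a hypothetical hyperplane and toy samples chosen only for illustration; nothing here is fitted by an SVM.

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 and toy samples, for illustration only.
w_hyp = np.array([3.0, 4.0])
b_hyp = -5.0
X_toy = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

# Perpendicular distance of each sample from the decision boundary.
distances = np.abs(X_toy @ w_hyp + b_hyp) / np.linalg.norm(w_hyp)

# The sample closest to the boundary plays the role of a support vector.
closest = int(np.argmin(distances))

print('Distances:', distances)
print('Closest sample index:', closest)
```

For a hyperplane scaled so that the closest samples satisfy |w·x + b| = 1, the full width of the margin is 2 / ||w||, which is the quantity an SVM maximizes.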

## Maximum margin intuition

...

## Dealing with the nonlinearly separable case using slack variables

In [1]:
#Image(filename='../input/python-ml-ch03-images/03_10.png', width=600) 

In [1]:
from sklearn.svm import SVC

svm = SVC(kernel='linear', C=0.1, random_state=1)
svm.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, 
                      y_combined,
                      classifier=svm, 
                      test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_11.png', dpi=300)
plt.show()

In [1]:
svm.score(X_test_std,y_test)

In [1]:
svm.score(X_train_std,y_train)
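The role of slack variables can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the linear model, the toy samples, and their labels in {-1, +1} are hypothetical, and the slack is computed with the standard hinge form max(0, 1 - y(w·x + b)).

```python
import numpy as np

# Hypothetical linear model and toy samples with labels in {-1, +1}.
w_lin = np.array([1.0, -1.0])
b_lin = 0.0
X_slack = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-3.0, 0.0]])
y_slack = np.array([1, 1, 1, -1])

# Functional margin y * (w.x + b): values >= 1 lie outside the margin
# on the correct side of the boundary.
functional_margin = y_slack * (X_slack @ w_lin + b_lin)

# Slack xi = max(0, 1 - margin): zero for well-classified samples,
# between 0 and 1 inside the margin, above 1 when misclassified.
slack = np.maximum(0.0, 1.0 - functional_margin)

C_demo = 0.1  # same value as in the SVC cell above; larger C penalizes slack more
slack_penalty = C_demo * slack.sum()

print('Slack per sample:', slack)
print('Total slack penalty:', slack_penalty)
```

A large C makes margin violations expensive and leads to a narrower margin, while a small C tolerates more violations.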

## Alternative implementations in scikit-learn

In [1]:
from sklearn.linear_model import SGDClassifier

ppn = SGDClassifier(loss='perceptron', n_iter=1000)
lr = SGDClassifier(loss='log', n_iter=1000)
svm = SGDClassifier(loss='hinge', n_iter=1000)

**Note**

- You can replace `n_iter` with `max_iter` in scikit-learn >= 0.19. The `n_iter` parameter is used here deliberately, because some people still use scikit-learn 0.18. Later scikit-learn releases removed `n_iter`, so `max_iter` is required there.

<br>
<br>

# Solving nonlinear problems using a kernel SVM

In [1]:
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)

plt.scatter(X_xor[y_xor == 1, 0],
            X_xor[y_xor == 1, 1],
            c='b', marker='x',
            label='1')
plt.scatter(X_xor[y_xor == -1, 0],
            X_xor[y_xor == -1, 1],
            c='r',
            marker='s',
            label='-1')

plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()
#plt.savefig('images/03_12.png', dpi=300)
plt.show()

In [1]:
Image(filename='../input/python-ml-ch03-images/03_13.png', width=700) 

<br>
<br>

## Using the kernel trick to find separating hyperplanes in higher dimensional space

In [1]:
svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
                      classifier=svm)

plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_14.png', dpi=300)
plt.show()

In [1]:
svm = SVC(kernel='rbf', random_state=1, gamma=100.0, C=1.0)
svm.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined, 
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_16.png', dpi=300)
plt.show()

In [1]:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1.0)
svm.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_15.png', dpi=300)
plt.show()
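The effect of `gamma` in the cells above can be made concrete by evaluating the RBF kernel by hand. The sketch below assumes the usual form exp(-gamma * ||x_a - x_b||^2); the helper name `rbf_kernel_value` and the two points are hypothetical.

```python
import numpy as np


def rbf_kernel_value(x_a, x_b, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x_a - x_b||^2)."""
    diff = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))


x_a = np.array([0.0, 0.0])
x_b = np.array([1.0, 1.0])  # squared distance to x_a is 2

for gamma in (0.1, 0.2, 100.0):  # the gamma values used in the cells above
    print(gamma, rbf_kernel_value(x_a, x_b, gamma))
```

With a small `gamma`, distant samples still count as similar, which gives a soft decision boundary. With `gamma=100.0` the similarity between these two points is practically zero, so each training sample only influences its immediate neighborhood and the boundary wraps tightly around the training data.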

<br>
<br>

# Summary

...

---

Readers may ignore the next cell.

In [1]:
! python ../.convert_notebook_to_script.py --input ch03.ipynb --output ch03.py