## Try 9.4.1: Support vector machine classification in Python.

**The Python code below predicts a penguin's species from its body measurements using support vector machines with several kernels.**

* **Click the double-right arrow to restart the kernel and run all cells.**
* **Examine the code below.**
* **Change the parameters for each type of kernel. Changing by factors of ten is more effective for most parameters. Ex: 0.01, 0.1, 1, 10, 100.**
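
One way to try the factor-of-ten changes systematically is to loop over powers of ten. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the penguins data) and sweeps `C` for a linear kernel.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the penguins dataset)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Powers of ten from 0.01 to 100
C_values = np.logspace(-2, 2, num=5)

# Fit one linear SVM per value of C and record the test accuracy
accuracies = {}
for C in C_values:
    model = svm.SVC(kernel='linear', C=C)
    model.fit(X_train_scaled, y_train)
    accuracies[float(C)] = model.score(X_test_scaled, y_test)

for C, acc in accuracies.items():
    print(f"C={C:g}: test accuracy = {acc:.3f}")
```

The same loop shape works for `gamma` or `degree` by changing which argument is swept.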

In [1]:
# Load packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

In [3]:
# Load and view data
penguins = sns.load_dataset('penguins')
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


In [4]:
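# Remove the penguins with missing measurements.
# Filtering on body_mass_g is enough here: rows missing body mass (such as 3 and 339 above) are missing the other measurements too.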
penguinsClean = penguins[~penguins['body_mass_g'].isna()]
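
As a small illustration of the filtering step, the sketch below compares the boolean-mask approach with `DataFrame.dropna` on a tiny made-up table. The table and its values are hypothetical and stand in for the penguins data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical table; the middle row is missing all measurements
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'bill_length_mm': [39.1, np.nan, 46.8],
    'body_mass_g': [3750.0, np.nan, 4850.0],
})

# Approach used in the cell above: keep rows where body_mass_g is present
mask_version = df[~df['body_mass_g'].isna()]

# Alternative: drop rows with a missing value in any listed column
dropna_version = df.dropna(subset=['bill_length_mm', 'body_mass_g'])

print(mask_version.equals(dropna_version))
```

The two approaches agree whenever the measurements are missing together. `dropna` with `subset` is the safer choice if a row could be missing only some measurements.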

In [5]:
# Use only the numeric features. Categorical features (island, sex) could be included by encoding them as dummy variables.

X = penguinsClean[
    ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
]
Y = penguinsClean['species']

# Split the data into training and testing sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=20220621)

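# Scale the input features. SVM relies on distances, so features measured on larger scales would otherwise dominate.
# The scaler is fit on the training set only and then applied to the test set.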
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
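
To make the scaling step concrete, the sketch below reproduces what `StandardScaler` does by hand: subtract the training mean and divide by the training standard deviation. The data are randomly generated stand-ins (an assumption) with two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (roughly mm and g)
X_train = rng.normal(loc=[45.0, 4200.0], scale=[5.0, 800.0], size=(100, 2))
X_test = rng.normal(loc=[45.0, 4200.0], scale=[5.0, 800.0], size=(30, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

# Manual equivalent: the test set is scaled with the TRAINING statistics
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(manual, X_test_scaled))
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Reusing the training statistics for the test set keeps information about the test data out of model fitting.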

## Linear SVM

In [12]:
# Define and fit the model.
# Adjust C from 0.01 to 100 in factors of ten.
# C weights the hinge-loss penalty on margin violations. Larger values penalize misclassified training points more heavily, which narrows the margin.

penguinsSVMlinear = svm.SVC(kernel='linear', C=0.01)
penguinsSVMlinear.fit(X_train_scaled, Y_train)

In [14]:
# Predict for the test set
Y_pred = penguinsSVMlinear.predict(X_test_scaled)

In [16]:
# Display the confusion matrix
confusion_matrix(Y_test, Y_pred)

array([[31,  0,  0],
       [ 7, 13,  0],
       [ 0,  0, 35]], dtype=int64)
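
In a scikit-learn confusion matrix, rows are the true classes and columns are the predicted classes, in sorted label order (here Adelie, Chinstrap, Gentoo). As a sketch, the summary numbers for the matrix above can be computed directly from the array:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[31, 0, 0],
               [7, 13, 0],
               [0, 0, 35]])

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over the true count (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy = {accuracy:.3f}")
print("recall    =", recall.round(3))
print("precision =", precision.round(3))
```

The second row shows the weak spot at `C=0.01`: 7 of the 20 test penguins in the second class are predicted as the first class.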

## Radial basis function

In [18]:
# Adjust gamma and C in factors of ten.
# gamma sets how far the influence of a single training point reaches: smaller values of gamma let the influence spread farther.

penguinsSVMrbf = svm.SVC(kernel='rbf', C=10, gamma=0.01)
penguinsSVMrbf.fit(X_train_scaled, Y_train)

In [20]:
# Predict for the test set
Y_pred = penguinsSVMrbf.predict(X_test_scaled)

In [22]:
# Display the confusion matrix
confusion_matrix(Y_test, Y_pred)

array([[30,  1,  0],
       [ 1, 19,  0],
       [ 0,  0, 35]], dtype=int64)
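
The effect of `gamma` can be seen in the RBF kernel itself, K(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates the kernel for one pair of made-up points (an assumption for illustration) and checks the hand computation against `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points whose squared distance is 2
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])

kernel_values = {}
for gamma in [0.01, 0.1, 1, 10]:
    # Kernel computed by hand from the formula
    manual = np.exp(-gamma * np.sum((x - z) ** 2))
    # Same kernel computed by scikit-learn
    library = rbf_kernel(x, z, gamma=gamma)[0, 0]
    kernel_values[gamma] = (manual, library)
    print(f"gamma={gamma:g}: K(x, z) = {manual:.6f}")
```

As `gamma` grows, the kernel value for the same pair of points shrinks toward zero, so each training point influences only its close neighbors.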

## Polynomial

In [24]:
# Adjust C in factors of ten and change degree in steps of 1.
# degree sets the degree of the polynomial kernel. Higher degrees allow more flexible decision boundaries.

penguinsSVMpoly = svm.SVC(kernel='poly', C=0.1, degree=5)
penguinsSVMpoly.fit(X_train_scaled, Y_train)

In [26]:
# Predict for the test set
Y_pred = penguinsSVMpoly.predict(X_test_scaled)

In [28]:
# Display the confusion matrix
confusion_matrix(Y_test, Y_pred)

array([[31,  0,  0],
       [17,  3,  0],
       [ 4,  0, 31]], dtype=int64)
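
For reference, scikit-learn's polynomial kernel has the form K(x, z) = (gamma * <x, z> + coef0)^degree. The sketch below evaluates it for one pair of made-up points with illustrative values of `gamma` and `coef0` (both assumptions; the model above uses scikit-learn's defaults for them) and compares against `sklearn.metrics.pairwise.polynomial_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two hypothetical points; their dot product is 0.5 - 2.0 = -1.5
x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])

# Illustrative kernel settings (assumptions, not the notebook's values)
gamma, coef0 = 0.5, 1.0

poly_values = {}
for degree in [1, 2, 3, 5]:
    # Kernel computed by hand from the formula
    manual = (gamma * (x @ z.T)[0, 0] + coef0) ** degree
    # Same kernel computed by scikit-learn
    library = polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0)[0, 0]
    poly_values[degree] = (manual, library)
    print(f"degree={degree}: K(x, z) = {manual:.6f}")
```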

## Accessing information

In [31]:
# The number of support vectors for each class
penguinsSVMrbf.n_support_

array([21, 21,  6])

In [33]:
# Indices of the training instances that are support vectors
penguinsSVMrbf.support_

array([ 12,  18,  38,  65,  99, 111, 114, 116, 120, 122, 140, 155, 157,
       165, 180, 212, 217, 218, 220, 241, 245,  16,  24,  36,  45,  56,
        58,  72,  78,  83,  89,  95, 113, 137, 183, 186, 187, 200, 226,
       227, 234, 237,  43,  86,  92, 112, 196, 247])
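
The two attributes above are linked: `support_` holds row indices into the training data, grouped by class, and `n_support_` counts how many of those indices belong to each class. The sketch below checks this on synthetic data from `make_blobs` (a stand-in, not the penguins data).

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic three-class stand-in data (an assumption)
X, y = make_blobs(n_samples=150, centers=3, random_state=0)
model = svm.SVC(kernel='rbf', C=10, gamma=0.01).fit(X, y)

# The per-class counts add up to the total number of support vectors
print(model.n_support_, model.n_support_.sum(), len(model.support_))

# Indexing the training data with support_ recovers the support vectors
same_vectors = np.allclose(X[model.support_], model.support_vectors_)

# Counting the labels of the support vectors reproduces n_support_
label_counts = np.bincount(y[model.support_], minlength=3)

print(same_vectors, label_counts)
```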

In [35]:
# The coefficients of the separating hyperplane for each pair of classes.
# Each hyperplane has the form coefficient1*variable1 + coefficient2*variable2 + ... + intercept = 0.
# coef_ is only available for the linear kernel.
penguinsSVMlinear.coef_

array([[-0.62622598,  0.05401353, -0.108782  ,  0.08399517],
       [-0.24717395,  0.49072732, -0.33804034, -0.25984323],
       [ 0.08731435,  0.50450852, -0.36236078, -0.36043294]])

In [37]:
# The intercepts of the hyperplanes for each pair of classes.
penguinsSVMlinear.intercept_

array([0.44537772, 0.30524277, 0.23923495])
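
To show how `coef_` and `intercept_` define a hyperplane, the sketch below rebuilds the decision scores of a linear SVM by hand. For simplicity it uses a two-class synthetic dataset from `make_blobs` (an assumption), so there is a single hyperplane rather than one per pair of classes.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic two-class stand-in data (an assumption)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
model = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Score = coefficient1*variable1 + coefficient2*variable2 + intercept
manual_scores = X @ model.coef_.ravel() + model.intercept_[0]
library_scores = model.decision_function(X)

# Positive scores map to the second class, the rest to the first
manual_pred = model.classes_[(manual_scores > 0).astype(int)]

print(np.allclose(manual_scores, library_scores, atol=1e-6))
print(np.array_equal(manual_pred, model.predict(X)))
```

Points with a score of exactly zero lie on the hyperplane. The sign of the score determines the predicted class.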

## Challenge activity 9.4.2: Support vector machine classification using scikit-learn.

## 1)
**The US Forest Service regularly monitors weather conditions to predict which areas are at risk of wildfires. Data scientists working with the US Forest Service would like to predict whether a wildfire will occur based on humidity, wind speed, and moisture content below ground level.**

* **Apply the pre-defined scaler function to the matrix of input features, X. Assign the scaled inputs to XScaled.**
  
**The code contains all imports, loads the dataset, and prints the scaled inputs.**

In [None]:
# Import dataset
fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y
X = fires[['humidity', 'wind', 'below_moisture']]
y = np.ravel(fires['fire'])

# Define a scaling function
scaler = StandardScaler()
XScaled = # Your code goes here

# Print the scaled input features
print(XScaled)

## 2)
**The US Forest Service regularly monitors weather conditions to predict which areas are at risk of wildfires. Data scientists working with the US Forest Service would like to predict whether a wildfire will occur based on daily rainfall, moisture content at ground level, and wind speed.**

* **Initialize SVCModel, a support vector machine classifier, with a linear kernel.**
* **Fit the support vector machine classifier to the scaled inputs, XScaled.**

**The code contains all imports, loads the dataset, and prints the predicted values.**

In [None]:

# Create input matrix X and output matrix y
X = fires[['rain', 'ground_moisture', 'wind']]
y = np.ravel(fires['fire'])

# Define and apply a scaling function
scaler = StandardScaler()
XScaled = scaler.fit_transform(X)

# Initialize and fit the linear SVC model
# Your code goes here

# Print the predicted values
print(SVCModel.predict(XScaled))

## 3)
**The US Forest Service regularly monitors weather conditions to predict which areas are at risk of wildfires. Data scientists working with the US Forest Service would like to predict whether a wildfire will occur based on moisture content below ground level, drought level, and wind speed.**

* **Initialize SVCModel, a support vector machine classifier, with a radial basis function kernel.**
* **Fit the support vector machine classifier to the scaled inputs, XScaled.**

**The code contains all imports, loads the dataset, and prints the predicted values.**

In [None]:




# Create input matrix X and output matrix y
X = fires[['below_moisture', 'drought', 'wind']]
y = np.ravel(fires['fire'])

# Define and apply a scaling function
scaler = StandardScaler()
XScaled = scaler.fit_transform(X)

# Initialize and fit the radial basis function SVC model
# Your code goes here

# Print the predicted values
print(SVCModel.predict(XScaled))

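## Solutions:

These solutions assume `SVC` has been imported with `from sklearn.svm import SVC`. With the `from sklearn import svm` import used in the Try section, write `svm.SVC` instead.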

**1)** XScaled = scaler.fit_transform(X)

**2)** SVCModel = SVC(kernel='linear')
       SVCModel = SVCModel.fit(XScaled, y)

**3)** SVCModel = SVC(kernel='rbf')
       SVCModel = SVCModel.fit(XScaled, y)
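
The sketch below shows how the solution lines fit together in one runnable cell. Because `fires.csv` is not included here, it builds a small randomly generated stand-in table. The values and the rule used to create the `fire` labels are hypothetical and exist only so that the code runs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for fires.csv (values and label rule are made up)
rng = np.random.default_rng(0)
n = 60
fires = pd.DataFrame({
    'humidity': rng.uniform(10, 90, n),
    'wind': rng.uniform(0, 40, n),
    'below_moisture': rng.uniform(0, 300, n),
})
fires['fire'] = (fires['humidity'] < 40).astype(int)

# Create input matrix X and output vector y
X = fires[['humidity', 'wind', 'below_moisture']]
y = np.ravel(fires['fire'])

# Solution 1: scale the inputs
scaler = StandardScaler()
XScaled = scaler.fit_transform(X)

# Solution 2: initialize and fit a linear SVC (use kernel='rbf' for solution 3)
SVCModel = SVC(kernel='linear')
SVCModel.fit(XScaled, y)

# Print the predicted values
predictions = SVCModel.predict(XScaled)
print(predictions)
```

`fit` returns the fitted estimator itself, so `SVCModel = SVCModel.fit(XScaled, y)` and a bare `SVCModel.fit(XScaled, y)` leave `SVCModel` in the same state.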