## Decision Trees and Support Vector Machines

In this notebook, we'll look at decision trees and support vector machines for classification.  Decision trees split the data on a single variable at a time in a binary way, that is, into two parts.  The goal is to reach leaves (terminal nodes) that are relatively homogeneous, because homogeneity makes for better prediction.  The trees here use the Gini impurity index to choose splits.
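To make the idea of homogeneity concrete, here is a minimal sketch of the Gini impurity, assuming the usual definition $1 - \sum_k p_k^2$ where $p_k$ is the proportion of class $k$ in a node. The helper names `gini_impurity` and `weighted_split_impurity` and the toy labels are made up for illustration; scikit-learn computes this internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Size-weighted average impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy nodes (hypothetical labels, not taken from the bcancer data)
pure = ['B'] * 10               # perfectly homogeneous node
mixed = ['B'] * 5 + ['M'] * 5   # evenly mixed node

print(gini_impurity(pure))    # 0.0
print(gini_impurity(mixed))   # 0.5

# A candidate split of a parent node with 6 B and 4 M
parent = ['B'] * 6 + ['M'] * 4
left = ['B'] * 5 + ['M'] * 1
right = ['B'] * 1 + ['M'] * 3
print(gini_impurity(parent), weighted_split_impurity(left, right))
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent, as it is for the toy split above.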

Support vector machines (or support vector classifiers) are classifiers that try to find decision boundaries that leave some separation between the classes.  This is similar to what LDA or QDA do, but SVMs make a tradeoff: they accept some errors on the training data in exchange for a more robust, wider region of separation (the margin) between the classes.  The hope is that the wider margin will yield better out-of-sample or cross-validation performance.

In [None]:
# import libraries that we need

import pandas as pd

import matplotlib.pyplot as plt
import numpy as np

from matplotlib import colors
import seaborn as sns

import scipy.stats as stats

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score


from sklearn import tree


# Support vector classifier and cross-validation helper

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


In [None]:
# Helper for plotting a confusion matrix as a heatmap
def plot_confusion_matrix(y_true, y_pred, title='Confusion Matrix', labels=('B', 'M')):
    # labels are the class names shown on the axes, in the same (sorted) order
    # that confusion_matrix uses; passing them in avoids relying on a global
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=list(labels), yticklabels=list(labels))
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

In [None]:
# read in the bcancer data
bcancer = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/BreastCancer.csv", na_values=['NA'])
bcancer.info()

In [None]:
# Choose the features X and target y
X=bcancer[['Concavity','Texture','Radius','Area']]
y = bcancer['Diagnosis']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier
dtree = DecisionTreeClassifier(random_state=420)

# Train the classifier
dtree.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtree.predict(X_test)

# get the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Evaluate the model with out of sample prediction
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Let's make a visualization of the performance results that is a bit easier to digest.  

In [None]:
feature_names = ['Concavity','Texture','Radius','Area']
target_names = ['B','M']
plot_confusion_matrix(y_test, y_pred, 'Decision Tree Confusion Matrix')

Now let's visualize the tree itself.

In [None]:
# create a plot of the decision tree
fig = plt.figure(figsize=(25,20))
tree.plot_tree(dtree,
                   feature_names=['Concavity','Texture','Radius','Area'],
                   class_names=['B','M'],
                   filled=True)

The tree above is quite deep: it appears to have a depth of 10, which is a lot of splits.  It is also possible, maybe even likely, that the tree is overfit, given how many branches it has and how deep it goes.
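Rather than reading the depth off the plot, a fitted tree can report it directly. The sketch below uses scikit-learn's built-in breast cancer dataset as a stand-in (an assumption, since it is not the same file as BreastCancer.csv) and compares training and test accuracy, which is one rough way to check for overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's built-in breast cancer dataset,
# not the BreastCancer.csv file read earlier in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unrestricted tree keeps splitting until its leaves are (nearly) pure
full_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

print("Depth:", full_tree.get_depth())
print("Number of leaves:", full_tree.get_n_leaves())

# A large gap between these two numbers suggests overfitting
train_acc = full_tree.score(Xd_train, yd_train)
test_acc = full_tree.score(Xd_test, yd_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The same two method calls, `get_depth()` and `get_n_leaves()`, should work on the `dtree` object fitted above.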

Let's look at a version that is pruned.  We'll start with a tree that has *max_depth* of 4.  

In [None]:
dt_pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=5, min_samples_leaf=2)

# Train the model
dt_pre_pruned.fit(X_train, y_train)

Here's a plot of that tree.

In [None]:
fig = plt.figure(figsize=(25,20))
tree.plot_tree(dt_pre_pruned,
                   feature_names=['Concavity','Texture','Radius','Area'],
                   class_names=['B','M'],
                   filled=True)

And here's the test performance.

In [None]:
y_pred = dt_pre_pruned.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

plot_confusion_matrix(y_test, y_pred, 'Decision Tree Confusion Matrix (pruned)')

Performance has gone up slightly.  Let's try pruning further.

In [None]:
dt_pre_pruned2 = DecisionTreeClassifier(max_depth=2, min_samples_split=5, min_samples_leaf=2)

# Train the model
dt_pre_pruned2.fit(X_train, y_train)

fig = plt.figure(figsize=(25,20))
tree.plot_tree(dt_pre_pruned2,
                   feature_names=['Concavity','Texture','Radius','Area'],
                   class_names=['B','M'],
                   filled=True)

y_pred = dt_pre_pruned2.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

plot_confusion_matrix(y_test, y_pred, 'Decision Tree Confusion Matrix (max_depth=2)')

This is a far simpler tree than the previous one, and if its test accuracy is comparable it is the better choice.  We can also clearly see which features are important for this classification: they are the ones used in the splits.
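Feature importance can also be read from the fitted model instead of the plot. This is a minimal sketch using the first four columns of scikit-learn's built-in breast cancer dataset as stand-in data (an assumption, since these are not the notebook's Concavity, Texture, Radius and Area columns). It prints the `feature_importances_` attribute and a text version of the tree rules.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): first four columns of the built-in dataset
data = load_breast_cancer()
demo_features = [str(name) for name in data.feature_names[:4]]
X_demo = data.data[:, :4]
y_demo = data.target

# A shallow tree, similar in spirit to the max_depth=2 tree above
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Importances are normalized impurity reductions; unused features get 0
importances = shallow.feature_importances_
for name, imp in sorted(zip(demo_features, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {imp:.3f}")

# Plain-text view of the splitting rules
print(export_text(shallow, feature_names=demo_features))
```

Features that never appear in a split have an importance of zero, which matches what the tree plot shows.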

In [None]:

# 1. Linear Kernel SVM

# Create a linear kernel SVM model
linear_svm = SVC(kernel='linear')

# Train the model on the training data
linear_svm.fit(X_train, y_train)

# Make predictions on the test set
y_pred_linear = linear_svm.predict(X_test)

# Calculate accuracy
accuracy_linear = accuracy_score(y_test, y_pred_linear)
print(f"Linear SVM accuracy: {accuracy_linear * 100:.2f}%")

plot_confusion_matrix(y_test, y_pred_linear, 'Linear SVM Confusion Matrix')

The support vector machine we used above creates a linear decision boundary.  
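In scikit-learn, the tradeoff between a wide margin and training errors is controlled by the `C` parameter of `SVC`: a small `C` tolerates more margin violations, while a large `C` penalizes them heavily. The sketch below illustrates this on synthetic two-feature data from `make_classification` (an assumption chosen for speed, not the bcancer data) by counting support vectors for a few values of `C`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (assumption: illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    class_sep=1.0, random_state=0
)

support_counts = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors for each class
    support_counts[C] = int(model.n_support_.sum())
    train_acc = model.score(X_demo, y_demo)
    print(f"C={C:>6}: {support_counts[C]} support vectors, training accuracy {train_acc:.3f}")
```

A smaller `C` typically gives a wider margin and therefore more support vectors. Cross-validation is a reasonable way to choose among values of `C`.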

Below we will look at the radial basis function (RBF) kernel, which allows for nonlinear, distance-based decision boundaries.  

In [None]:

# Create an RBF kernel SVM model
rbf_svm = SVC(kernel='rbf')

# Train the model on the training data
rbf_svm.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rbf = rbf_svm.predict(X_test)

# Calculate accuracy
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"RBF Kernel SVM accuracy: {accuracy_rbf * 100:.2f}%")

plot_confusion_matrix(y_test, y_pred_rbf, 'RBF Kernel SVM Confusion Matrix')

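So far we have used just a single train/test split to evaluate the performance of our models.  Cross-validation repeats the evaluation over several splits and averages the results, which gives a more stable estimate of out-of-sample performance.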
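To show what k-fold cross-validation does under the hood, the sketch below runs the folds by hand and compares the result with `cross_val_score`. It uses the first four columns of scikit-learn's built-in breast cancer dataset as stand-in data (an assumption). For a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so the manual loop uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Stand-in data (assumption): first four columns of the built-in dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :4]

# Manual 5-fold cross-validation: fit on four folds, score on the held-out fold
folds = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model = SVC(kernel='rbf')
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The same procedure in one call
auto_scores = cross_val_score(SVC(kernel='rbf'), X_demo, y_demo, cv=5)

print("Manual fold accuracies:  ", np.round(fold_scores, 3))
print("cross_val_score results: ", np.round(auto_scores, 3))
print(f"Mean accuracy: {np.mean(fold_scores) * 100:.2f}%")
```

Every observation is used for testing exactly once, so the mean of the fold accuracies depends less on one lucky or unlucky split.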

In [None]:

# Cross-validation for linear kernel
cv_scores_linear = cross_val_score(linear_svm, X, y, cv=5)  # 5-fold cross-validation
print(f"Linear Kernel SVM cross-validation accuracy: {cv_scores_linear.mean() * 100:.2f}%")

# Cross-validation for RBF kernel
cv_scores_rbf = cross_val_score(rbf_svm, X, y, cv=5)  # 5-fold cross-validation
print(f"RBF Kernel SVM cross-validation accuracy: {cv_scores_rbf.mean() * 100:.2f}%")


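The *rbf* kernel that we use here is a radial basis function. As the name suggests, it is a function of the distance between two vectors: $$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$ where $\lVert x - x' \rVert$ is the Euclidean (radial) distance between $x$ and $x'$, and $\gamma > 0$ controls how quickly the kernel value decays as that distance grows.  This allows for the creation of decision boundaries that are based upon the distance between observations.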


[<https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html>]

[<https://en.wikipedia.org/wiki/Radial_basis_function_kernel>]
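As a small numerical check of the RBF kernel, the sketch below assumes the standard form $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ and compares a hand-written version against scikit-learn's `rbf_kernel`. The function name `rbf`, the example points, and the value of `gamma` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


# Hypothetical points: b is close to a, c is far from a
a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])
c = np.array([4.0, 6.0])
gamma = 0.5

k_aa = rbf(a, a, gamma)  # identical points: distance 0, kernel value 1
k_ab = rbf(a, b, gamma)  # nearby points: value close to 1
k_ac = rbf(a, c, gamma)  # distant points: value close to 0
print(k_aa, k_ab, k_ac)

# scikit-learn's implementation expects 2D arrays of shape (n_samples, n_features)
k_ab_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
print(np.isclose(k_ab, k_ab_sklearn))
```

The kernel value acts as a similarity score that shrinks with distance, which is why the resulting decision boundaries can curve around clusters of points.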

### Tasks 

1.   Using Texture, Radius, Area, Compactness and Smoothness, run 8-fold cross-validation for the following models: logistic regression, SVM with linear kernel, SVM with RBF kernel, decision tree with depth of 3, decision tree with depth of 5.  Report which method performed the best.

2. Write a paragraph summarizing the analysis you did in Task 1 and explaining the model to a classmate.

