## Importance and SHAP values

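In this Jupyter Notebook, we will look at two ways of assessing how much each feature contributes to a model: feature importance and SHAP values.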

### Feature Importance

We will start with feature importance.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

We'll start with the _iris_ data.

In [None]:

# Load sample dataset (Iris Dataset for classification)
data = load_iris()
X = data.data
y = data.target

In [None]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the dataset for models that are sensitive to feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Initialize classifiers
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(kernel='linear', random_state=42),  # Linear SVM for simplicity
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(random_state=42)
}

In [None]:

# Function to calculate feature importance for each classifier
def calculate_feature_importance(classifiers, X_train, y_train):
    importance_results = {}

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        
        if name == "Decision Tree":
            importance_results[name] = clf.feature_importances_
        
        elif name == "Logistic Regression":
            # Absolute coefficients on the standardized features; coef_[0] is the first row only,
            # so with more than two classes it describes class 0, not every class
            importance_results[name] = np.abs(clf.coef_[0])
        
        elif name == "SVM":
            # Absolute coefficients of the linear SVM; for multiclass data coef_[0] is the
            # first one-vs-one pair (class 0 vs. class 1). coef_ exists only for linear kernels
            importance_results[name] = np.abs(clf.coef_[0]) if hasattr(clf, 'coef_') else None
        
        elif name == "LDA":
            # Absolute LDA coefficients; coef_[0] is the first row only
            # (class 0 when there are more than two classes)
            importance_results[name] = np.abs(clf.coef_[0])
    
    return importance_results

# Calculate feature importance
importance_values = calculate_feature_importance(classifiers, X_train_scaled, y_train)


In [None]:


# Plot feature importance
def plot_feature_importance(importance_values, feature_names):
    plt.figure(figsize=(12, 8))
    
    for i, (name, importance) in enumerate(importance_values.items()):
        plt.subplot(2, 2, i+1)
        plt.barh(feature_names, importance)
        plt.title(f'Feature Importance - {name}')
        plt.xlabel('Importance')
        plt.ylabel('Feature')

    plt.tight_layout()
    plt.show()

# Feature names from the Iris dataset
feature_names = data.feature_names

# Plotting the feature importances
plot_feature_importance(importance_values, feature_names)



Above we can see the impact of the four features on these four model types.

For the decision tree model, petal length is clearly the most important feature. For SVM, petal length and petal width seem to be more important than sepal width and sepal length. For LDA, petal length is more important than petal width, which is more important than sepal width, which is more important than sepal length. Finally, all of the features seem relatively important for the logistic regression, with petal length being the most important.
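The measures plotted above are not on a common scale across model types: the tree reports impurity-based importances, while the other three report absolute coefficients. One model-agnostic alternative is the `permutation_importance` function imported at the top of the notebook, which shuffles one feature at a time and records how much the score drops. The following is a minimal sketch on the iris data; the choice of logistic regression, `n_repeats=10`, and the variable names are assumptions made for illustration, not part of the analysis above.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data, split, and scaling as the iris cells above (separate variable names
# so the notebook's own X_train, X_test, etc. are not overwritten)
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
iris_scaler = StandardScaler()
X_tr_s = iris_scaler.fit_transform(X_tr)
X_te_s = iris_scaler.transform(X_te)

perm_clf = LogisticRegression(random_state=42).fit(X_tr_s, y_tr)

# Shuffle each feature n_repeats times on the held-out data and
# measure the resulting drop in accuracy
perm = permutation_importance(perm_clf, X_te_s, y_te, n_repeats=10, random_state=42)

for feat, mean, std in zip(iris.feature_names, perm.importances_mean, perm.importances_std):
    print(f"{feat:20s} {mean:.3f} +/- {std:.3f}")
```

Because the importance is measured as a change in accuracy, the same call can be repeated for each classifier and the results compared directly.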

In [None]:
# Read in the breast cancer data
bcancer = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/BreastCancer.csv", na_values=['NA'])
bcancer.info()

In [None]:
X = bcancer[['Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness',
             'Concavity', 'Concave Points', 'Symmetry']]

y = bcancer['Diagnosis']
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the dataset for models that are sensitive to feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:

# Initialize classifiers
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(kernel='linear', random_state=42),  # Linear SVM for simplicity
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(random_state=42)
}

In [None]:
# Calculate feature importance
importance_values = calculate_feature_importance(classifiers, X_train_scaled, y_train)

In [None]:
# Feature names from the breast cancer dataset
feature_names = ['Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness',
                 'Concavity', 'Concave Points', 'Symmetry']

# Plotting the feature importances
plot_feature_importance(importance_values, feature_names)

### SHAP

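The SHAP approach to determining the impact of a feature is to consider all possible combinations of features and how the prediction would change if a feature were included or excluded. A feature's SHAP value is its weighted average contribution to the prediction across those combinations.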
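To make the "all possible combinations" idea concrete, here is a brute-force sketch that computes exact Shapley values for a tiny, hypothetical three-feature model. The toy model, the explained point, and the all-zeros baseline are assumptions chosen for illustration. Treating an excluded feature as "set to its baseline value" is also a simplification, and the `shap` library uses more efficient algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    features = list(range(n_features))
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # Weight for subsets S of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it is added to S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model f(x) = 2*x0 + 3*x1 + x0*x2, explained at x = (1, 1, 1)
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)


def toy_model(v):
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


def value_fn(subset):
    # Features in the subset take their real value; the rest take the baseline
    v = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return toy_model(v)


phi = shapley_values(value_fn, len(x))
print(phi)
```

Up to floating-point rounding, the three values are 2.5, 3.0, and 0.5. They sum to 6, the difference between the prediction at `x` and the prediction at the baseline, and the interaction term `x0*x2` is split evenly between features 0 and 2.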


In [None]:
# You will likely need to install shap first:
#   pip3 install shap
# If that doesn't work, try:
#   pip3 install shap --pre
import shap

In [None]:
# Train the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Create a SHAP explainer object
explainer = shap.Explainer(model, X_train_scaled)

# Get SHAP values for the test set
shap_values = explainer(X_test_scaled)

# Plot the SHAP summary plot (global explanation of the model)
shap.summary_plot(shap_values, X_test)

The order of impact for these features seems to be 2, 0, 7, 1, 8, and 4, followed by 5, 3, and 6. The latter three seem to have a good bit less impact than the former six. Recall that the order of the feature names was ['Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness', 'Concavity', 'Concave Points', 'Symmetry'], so Perimeter, feature 2, would be the most impactful here.
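Translating the feature numbers back to names by hand is error-prone, so a short lookup can help. The sketch below assumes the ordering read off the logistic regression summary plot as described in the text; `plot_order` and `ranked_names` are names introduced here for illustration.

```python
feature_names = ['Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness',
                 'Compactness', 'Concavity', 'Concave Points', 'Symmetry']

# Feature indices as read off the summary plot, most impactful first
plot_order = [2, 0, 7, 1, 8, 4, 5, 3, 6]

# Look up the name that corresponds to each index
ranked_names = [feature_names[i] for i in plot_order]
print(ranked_names)
```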

In [None]:
# Train the Support Vector Machine Classifier
model = SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

# Create a SHAP explainer object
explainer = shap.Explainer(model, X_train_scaled)

# Get SHAP values for the test set
shap_values = explainer(X_test_scaled)
print(shap_values)
# Plot the SHAP summary plot (global explanation of the model)
shap.summary_plot(shap_values, X_test)


For the above plot, we can see that features 2, 0, 1, 7, and 8 seem to be important, with positive relationships with the response. Features 5, 6, and 3 seem to have less impact, with features 5 and 3 having a negative relationship with our response here.

### Tasks

1. Choose one of the model types from the Feature Importance section above and fit the model with all of the predictors. Assess that model's cross-validation accuracy.

2. For the model you chose in the previous task, look at the Feature Importance plot and remove some of the variables from the full model. Fit this new, smaller model and assess it using cross-validation. Did the cross-validation accuracy improve?

3. Get the SHAP values for the features in a linear discriminant analysis.

4. Using the output you got from the previous task, create a reduced model with fewer predictors and evaluate that model via cross-validation.
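As a reminder of the mechanics, here is a minimal sketch of comparing a full and a reduced model by cross-validation. It uses the iris data rather than the breast cancer data, and the choice of LDA, five folds, and the two petal columns are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Putting the scaler inside a pipeline means it is refit on each training fold
pipe = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())

# Full model: all four predictors
full_scores = cross_val_score(pipe, iris.data, iris.target, cv=5, scoring="accuracy")

# Reduced model: petal length and petal width only (columns 2 and 3)
reduced_scores = cross_val_score(
    pipe, iris.data[:, [2, 3]], iris.target, cv=5, scoring="accuracy"
)

print(f"Full model mean accuracy:    {full_scores.mean():.3f}")
print(f"Reduced model mean accuracy: {reduced_scores.mean():.3f}")
```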