_todo: Replicate the table on slide 19 of the session 4 slides. This involves calculating Pearson correlations, standardized regression coefficients, "usefulness", Shapley values for a linear regression, Johnson's relative weights, and the mean decrease in the Gini coefficient from a random forest. You may use packages built into R or Python._

_If you want a challenge, either (1) implement one or more of the measures yourself ("usefulness" is rather easy to program up; Shapley values for linear regression are a bit more work), or (2) add additional measures to the table, such as the importance scores from XGBoost._
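
As a starting point for challenge (1), here is a minimal sketch of "usefulness", assuming the common definition: the drop in R² when a predictor is removed from the full linear model. The helper names `r_squared` and `usefulness` are illustrative and do not come from a package, and the course slides may define the measure differently.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def usefulness(X, y):
    """Drop in R^2 when each predictor is left out of the full model."""
    X = np.asarray(X, dtype=float)
    full_r2 = r_squared(X, y)
    return np.array([
        full_r2 - r_squared(np.delete(X, j, axis=1), y)
        for j in range(X.shape[1])
    ])
```

Calling `usefulness(X, y)` on the perception columns and the satisfaction column should return one value per predictor, in column order. Because every predictor is judged against a model that already contains all the others, correlated predictors tend to receive small values.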

In [1]:
import pandas as pd
import numpy as np
import pyrsm as rsm 
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from scipy.stats import pearsonr
import shap
from sklearn.inspection import permutation_importance
from xgboost import XGBRegressor

data = pd.read_csv('/home/jovyan/Desktop/MGTA495-2/projects/Project 4/data_for_drivers_analysis.csv')

In [2]:
data.head()


<bound method NDFrame.head of       brand     id  satisfaction  trust  build  differs  easy  appealing  \
0         1     98             3      1      0        1     1          1   
1         1    179             5      0      0        0     0          0   
2         1    197             3      1      0        0     1          1   
3         1    317             1      0      0        0     0          1   
4         1    356             4      1      1        1     1          1   
...     ...    ...           ...    ...    ...      ...   ...        ...   
2548     10  17800             5      1      1        0     1          0   
2549     10  17808             3      1      0        0     1          0   
2550     10  17893             5      0      1        1     0          0   
2551     10  17984             3      1      1        0     1          0   
2552     10  18073             4      0      1        0     1          0   

      rewarding  popular  service  impact  
0            

In [20]:
# Define the independent variables (perceptions) and the dependent variable (satisfaction)
X = data[['trust', 'build', 'differs', 'easy', 'appealing', 'rewarding', 'popular', 'service', 'impact']]
y = data['satisfaction']

# Standardize the independent variables
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit a linear regression model
reg = LinearRegression().fit(X_scaled, y)

# Standardized regression coefficients
standardized_coefficients = reg.coef_

# Permutation importance: mean drop in R^2 when one column is shuffled.
# Used here as a rough stand-in for Shapley values; it is not a Shapley decomposition of R^2.
perm_importance = permutation_importance(reg, X_scaled, y, n_repeats=30, random_state=42)
perm_importance_mean = perm_importance.importances_mean

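# Fit a random forest and take its impurity-based importances.
# For a regressor the impurity is variance (MSE) rather than Gini; no random_state is set,
# so this column changes slightly from run to run.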
rf = RandomForestRegressor()
rf.fit(X_scaled, y)
gini_importance = rf.feature_importances_

# Fit an XGBoost model and get feature importance
xgb_model = XGBRegressor()
xgb_model.fit(X_scaled, y)
xgb_importance = xgb_model.feature_importances_

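# Proxy for Johnson's relative weights: each absolute standardized coefficient as a share of their sum.
# (True relative weights first transform the predictors into an orthogonal set.)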
johnson_weights = np.abs(standardized_coefficients) / np.sum(np.abs(standardized_coefficients))

# Calculate Pearson Correlations
pearson_correlations = {col: pearsonr(X[col], y)[0] for col in X.columns}

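# Ad hoc "usefulness" score: the average of |standardized coefficient|, permutation importance,
# and the relative-weight proxy. (Usefulness is commonly defined as the drop in R^2 when a predictor is removed.)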
usefulness = (np.abs(standardized_coefficients) + perm_importance_mean + johnson_weights) / 3

# Create the table
results = pd.DataFrame({
    'Perception': [
        'Is offered by a brand I trust', 'Helps build credit quickly', 'Is different from other cards',
        'Is easy to use', 'Has appealing benefits or rewards', 'Rewards me for responsible usage',
        'Is used by a lot of people', 'Provides outstanding customer service', 'Makes a difference in my life'
    ],
    'Pearson Correlations (%)': [round(pearson_correlations[col] * 100, 1) for col in X.columns],
    'Standardized Multiple Regression Coefficients (%)': [round(coef * 100, 1) for coef in standardized_coefficients],
    'Shapley Values (%)': [round(value * 100, 1) for value in perm_importance_mean],
    'Johnson\'s Epsilon (%)': [round(weight * 100, 1) for weight in johnson_weights],
    'Mean Decrease in RF Gini Coefficient (%)': [round(value * 100, 1) for value in gini_importance],
    'XGBoost Importance (%)': [round(value * 100, 1) for value in xgb_importance],
    'Usefulness (%)': [round(value * 100, 1) for value in usefulness]
})

# Display the final results
results = results[['Perception', 'Pearson Correlations (%)', 'Standardized Multiple Regression Coefficients (%)',
                   'Shapley Values (%)', 'Johnson\'s Epsilon (%)', 'Mean Decrease in RF Gini Coefficient (%)',
                   'XGBoost Importance (%)', 'Usefulness (%)']]
results.index = results.index + 1

styled_df = results.style.background_gradient(cmap='Greens', axis=None, vmin=0, vmax=29)
def format_func(val):
    if isinstance(val, (int, float)):
        return f"{val:.1f}"
    return val

styled_df = styled_df.format(format_func)


styled_df




| | Perception | Pearson Correlations (%) | Standardized Multiple Regression Coefficients (%) | Shapley Values (%) | Johnson's Epsilon (%) | Mean Decrease in RF Gini Coefficient (%) | XGBoost Importance (%) | Usefulness (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | Is offered by a brand I trust | 25.6 | 13.6 | 2.7 | 25.3 | 15.1 | 29.0 | 13.8 |
| 2 | Helps build credit quickly | 19.2 | 2.3 | 0.1 | 4.4 | 10.2 | 7.9 | 2.3 |
| 3 | Is different from other cards | 18.5 | 3.3 | 0.2 | 6.1 | 8.9 | 5.6 | 3.2 |
| 4 | Is easy to use | 21.3 | 2.6 | 0.1 | 4.8 | 10.2 | 7.1 | 2.5 |
| 5 | Has appealing benefits or rewards | 20.8 | 4.0 | 0.2 | 7.4 | 8.4 | 6.6 | 3.9 |
| 6 | Rewards me for responsible usage | 19.5 | 0.6 | 0.0 | 1.1 | 10.3 | 6.5 | 0.6 |
| 7 | Is used by a lot of people | 17.1 | 1.9 | 0.1 | 3.6 | 9.5 | 7.6 | 1.9 |
| 8 | Provides outstanding customer service | 25.1 | 10.4 | 1.6 | 19.3 | 13.7 | 11.1 | 10.4 |
| 9 | Makes a difference in my life | 25.5 | 15.0 | 3.4 | 28.0 | 13.6 | 18.5 | 15.5 |
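
The "Shapley Values (%)" column above is filled with permutation importance. For comparison, the sketch below computes Shapley values for a linear regression directly, assuming the usual Shapley-value-regression definition: each predictor's share of R² is its marginal R² gain averaged over all orderings of the predictors. The function names are illustrative, and the approach fits a model for every subset of predictors (2^9 = 512 fits for the nine perceptions), so it is only practical for small predictor sets.

```python
import itertools
import math

import numpy as np


def subset_r_squared(X, y, cols):
    """R^2 of an OLS fit (with intercept) using only the columns in `cols`."""
    if len(cols) == 0:
        return 0.0  # an intercept-only model explains no variance
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def shapley_r_squared(X, y):
    """Shapley decomposition of the full-model R^2 across predictors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Cache R^2 for every subset of predictors, keyed by a sorted tuple of column indices.
    r2 = {
        subset: subset_r_squared(X, y, subset)
        for size in range(p + 1)
        for subset in itertools.combinations(range(p), size)
    }
    shapley = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for coalitions of this size that exclude predictor j.
            weight = (math.factorial(size) * math.factorial(p - size - 1)
                      / math.factorial(p))
            for subset in itertools.combinations(others, size):
                with_j = tuple(sorted(subset + (j,)))
                shapley[j] += weight * (r2[with_j] - r2[subset])
    return shapley
```

The returned values sum to the R² of the full model, so dividing by that sum and multiplying by 100 gives percentage shares comparable to the other columns.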


In [12]:

styled_df = results.style.background_gradient(cmap='Greens', axis=None, vmin=0, vmax=29.0)

styled_df

| | Perception | Pearson Correlations (%) | Standardized Multiple Regression Coefficients (%) | Shapley Values (%) | Johnson's Epsilon (%) | Mean Decrease in RF Gini Coefficient (%) | XGBoost Importance (%) | Usefulness (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | Is offered by a brand I trust | 25.6 | 13.6 | 2.7 | 25.3 | 14.9 | 29.0 | 13.8 |
| 2 | Helps build credit quickly | 19.2 | 2.3 | 0.1 | 4.4 | 10.1 | 7.9 | 2.3 |
| 3 | Is different from other cards | 18.5 | 3.3 | 0.2 | 6.1 | 9.4 | 5.6 | 3.2 |
| 4 | Is easy to use | 21.3 | 2.6 | 0.1 | 4.8 | 10.1 | 7.1 | 2.5 |
| 5 | Has appealing benefits or rewards | 20.8 | 4.0 | 0.2 | 7.4 | 8.4 | 6.6 | 3.9 |
| 6 | Rewards me for responsible usage | 19.5 | 0.6 | 0.0 | 1.1 | 9.9 | 6.5 | 0.6 |
| 7 | Is used by a lot of people | 17.1 | 1.9 | 0.1 | 3.6 | 9.5 | 7.6 | 1.9 |
| 8 | Provides outstanding customer service | 25.1 | 10.4 | 1.6 | 19.3 | 13.0 | 11.1 | 10.4 |
| 9 | Makes a difference in my life | 25.5 | 15.0 | 3.4 | 28.0 | 14.8 | 18.5 | 15.5 |
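
The "Johnson's Epsilon (%)" column above is computed from normalized absolute coefficients. A sketch of the relative-weights calculation itself follows, assuming the standard formulation: take the symmetric square root of the predictor correlation matrix, regress the outcome on the resulting orthogonal variables, and map the squared coefficients back to the original predictors. The function name `johnson_relative_weights` is illustrative, and the code assumes the predictor correlation matrix is non-singular.

```python
import numpy as np


def johnson_relative_weights(X, y):
    """Return (raw weights, weights rescaled to sum to 1) for each predictor."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Correlations among predictors (r_xx) and between predictors and outcome (r_xy).
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    r_xx, r_xy = corr[:-1, :-1], corr[:-1, -1]
    # Symmetric square root of r_xx via its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(r_xx)
    lam = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    # Standardized coefficients of y on the orthogonalized predictors.
    beta = np.linalg.solve(lam, r_xy)
    # Raw weights sum to the R^2 of the full model.
    raw = (lam ** 2) @ (beta ** 2)
    return raw, raw / raw.sum()
```

Multiplying the rescaled weights by 100 gives percentage shares of R². Unlike the coefficient-based proxy, these weights are non-negative by construction and split shared variance between correlated predictors.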
