# A brief overview of feature engineering and feature selection

### How Insurance Companies Work

- Insurance companies assess the risk and charge premiums for various types of insurance coverage. If an insured event occurs and you suffer damages, the insurance company pays you up to the agreed amount of the insurance policy. The way insurance companies work, they can pay this and still make a profit.

Evaluating Risk

- Customers that buy insurance policies transfer their risk to the insurance company in return for paying their premiums. The insurance company has to define the insurance risk it is taking on. It asks questions, each of which is designed to evaluate a particular risk. Depending on your answers to the questions, the insurance company quotes you a premium. If your risk is higher than usual (for example, if you are not near a fire hydrant), your fire insurance premium will be higher. If you don't answer the questions honestly, the insurance company may refuse to pay if there are damages, according to the Insurance Institute of Michigan.

Shared Risk

- Your premiums are much lower than the possible damages, but the insurance company can afford to pay them because it receives premiums from many customers. Insurance companies operate on the principle of shared risk. All the customers pay small amounts and share the risk that way. A fire or other covered event only happens rarely. The insurance company has to calculate the premiums so the total premiums it receives from its many customers cover the few damage claims, with some money left over for administration and profit.


Re-Insurance

- Insurance companies have to consider that, if they have a lot of policies in one area and there is a natural disaster, many customers will make a claim. The insurance company may not have collected enough premiums to cover so many claims. To prevent such a problem, insurance companies pass on some of the risk to other large financial firms that offer re-insurance, meaning they may be protected in a worst case scenario.
The large firms take over the extra risk from the insurance company that holds the policies, and it pays for this service. For major natural disasters, the re-insurance companies pay for some of the damages through the local insurance companies that sold the policies.

Investment Income

- Over time, insurance companies receive lots of small amounts in premiums and have to occasionally pay out large amounts. Before paying out the damages, they may have large surpluses which they invest, according to Obrella. Because they don't want to take much additional risk, they typically place this money in safe investments, but it still generates a substantial income. This income increases the revenue of the insurance companies, and they can use it to reduce the premiums they charge or to increase their profits.

Source: https://smallbusiness.chron.com/insurance-companies-work-60269.html

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import missingno as msno
import numpy as np

plt.style.use("fivethirtyeight")
%matplotlib inline

In [None]:
def jupyter_settings():
    # HTML is not available by default, so import it for the width override below
    from IPython.display import display, HTML

    %matplotlib inline
    %pylab inline
    
    sns.set(font_scale=1.6)
    
    plt.style.use("fivethirtyeight")
    # sns.set(style='whitegrid')
    # plt.style.use('seaborn-darkgrid')
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 16
    
    # widen the notebook container and show every column and row of a DataFrame
    display(HTML('<style>.container {width:100% !important; }</style>'))
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option('display.expand_frame_repr', False)
    
jupyter_settings()

In [None]:
train = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
plt.figure(figsize=(8,6))

sns.countplot(x="Response", data = train, palette ="husl" ,edgecolor="black")
plt.ylabel('count', fontsize=15)
plt.xlabel('Response', fontsize=15)
plt.title('Balance of the output variable', fontsize=16)
plt.show()

# Feature Engineering

### Age range

In [None]:
# Bins: under 30, 30 to 39, 40 to 64, 65 and over (boundary ages such as 30 and 40 are included)
train['age_range'] = train['Age'].apply(lambda x: 'Adult 1' if x < 30 else ('Adult 2' if x < 40 else ('Adult 3' if x < 65 else 'Elderly')))

### Monthly premium

In [None]:
train['monthly_premium'] = round(train['Annual_Premium']/12, 2)

### Percentage of total premium

In [None]:
train['percentage_total_premium'] = train['Annual_Premium']/train['Annual_Premium'].sum()

In [None]:
# One-hot encode Vehicle_Damage into integer columns Vehicle_Damage_No and Vehicle_Damage_Yes
df = pd.get_dummies(train['Vehicle_Damage'], prefix='Vehicle_Damage', dtype=int)

In [None]:
train  = pd.concat([train, df], axis=1)

In [None]:
train['insured_with_no_damage'] = train['Previously_Insured']*train['Vehicle_Damage_No']

In [None]:
train["not_insured_with_damage"] = train["Previously_Insured"].apply(lambda x: 1 if x == 0 else 0) * train["Vehicle_Damage_Yes"]

In [None]:
train["vehicle_age_<_1_year"] = train["Vehicle_Age"].apply(lambda x: 1 if x=='< 1 Year' else 0)

In [None]:
train["new_damage_no_insurance"] = train["vehicle_age_<_1_year"]*train["not_insured_with_damage"]

In [None]:
train.head()

# Feature analysis and selection

In [None]:
categorical_features = train.select_dtypes(exclude=[np.number])

# Encoder

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
train['Gender'] = le.fit_transform(train['Gender'])
train['Vehicle_Age'] = le.fit_transform(train['Vehicle_Age'])
train['Vehicle_Damage'] = le.fit_transform(train['Vehicle_Damage'])
train['age_range'] = le.fit_transform(train['age_range'])

In [None]:
y = train['Response'].copy()
X = train.drop('Response', axis=1).copy()

# Oversampling (ADASYN)

In [None]:
from imblearn.over_sampling import ADASYN

In [None]:
adasyn = ADASYN()

In [None]:
X_adasyn, y_adasyn = adasyn.fit_resample(X,y)

In [None]:
print('The number of lines before oversampling : {}'.format(X.shape[0]))
print('The number of lines after oversampling : {}'.format(X_adasyn.shape[0]))

In [None]:
import matplotlib.pyplot as plt

In [None]:
print("Now the training data is larger and the classes are balanced")

# sets the plot size
plt.figure(figsize=(8,6))

# counts each class for the target var
ax = sns.countplot(x=y_adasyn, palette ="husl", edgecolor="black")

# sets plot features
plt.title("Balancing of the output variable")
plt.xlabel("Response")
plt.ylabel("Count")
plt.xticks(ticks=[0,1], labels=['No','Yes'])

# displays the plot
plt.show()

# Feature Importance (Random Forest)

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X_adasyn, y_adasyn, test_size=0.3, random_state=72)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(random_state=1)

In [None]:
rf.fit(X_adasyn, y_adasyn)

In [None]:
importances = rf.feature_importances_

In [None]:
importance = list(importances)

In [None]:
colum = list(X_adasyn.columns)

In [None]:
feature_importance = pd.DataFrame(zip(colum, importance), columns=['Feature', 'Importance']).sort_values('Importance')

In [None]:
feature_importance = feature_importance.set_index('Feature')

In [None]:
feature_importance.plot(kind='barh', figsize=(12,10))
plt.title('Feature Importance', fontsize=16)
plt.legend(bbox_to_anchor=(0.95, 0.1), fontsize=14)
plt.ylabel('Features', fontsize=14)
plt.xlabel('Importance', fontsize=14)
plt.show()

# Permutation Importance

Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined as the decrease in a model score when the values of a single feature are randomly shuffled. This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on the feature. The technique is model agnostic and can be calculated many times with different permutations of the feature.
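
The procedure can be written out by hand in a few lines. The following is only a minimal sketch of the idea, not the scikit-learn implementation: `toy_model`, `accuracy` and `permutation_importance_manual` are hypothetical names, the data is made up, and the "fitted model" is a fixed rule that looks only at the first feature.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance_manual(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Made-up data: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]

# Hypothetical "fitted" model: a fixed rule that only looks at feature 0.
def toy_model(row):
    return int(row[0] > 0.5)

importances = permutation_importance_manual(toy_model, rows, labels)
print(importances)
```

Shuffling the feature the model relies on should cost a large share of the accuracy, while shuffling the ignored feature costs nothing.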

In [None]:
from sklearn.linear_model import Ridge

In [None]:
model = Ridge(alpha=1e-2).fit(x_train, y_train)

In [None]:
model.score(x_val, y_val)

In [None]:
feature_names = x_train.columns

In [None]:
from sklearn.inspection import permutation_importance
r = permutation_importance(model, x_val, y_val, n_repeats=30,random_state=0)

permutation_importance_name = []
permutation_importance_mean = []

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        print(f"{feature_names[i]:<8}"
        f"  {r.importances_mean[i]:.3f}"
        f" +/- {r.importances_std[i]:.3f}")

        permutation_importance_name.append(feature_names[i])
        permutation_importance_mean.append(r.importances_mean[i]) 

# Boruta

A common approach to feature selection is to choose a model of convenience that can capture non-linear relationships and interactions (e.g. a random forest) and fit it on X and y. You then extract the importance of each feature from this model and keep only the features that are above a given importance threshold.

In Boruta, features do not compete among themselves. Instead, they compete with randomized (shuffled) versions of themselves, called shadow features: a feature scores a hit in a trial only if its importance exceeds the highest importance among the shadow features.

The trial is repeated many times, and the number of hits per feature is compared against a binomial distribution to decide whether the feature is kept or dropped. As often happens in machine learning, the key is iteration: 20 trials are more reliable than 1 trial, and 100 trials are more reliable than 20.
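
The idea of competing against shuffled copies ("shadow features") and counting hits over repeated trials can be sketched as follows. This is a simplified illustration, not the BorutaPy algorithm: as an assumption, absolute Pearson correlation stands in for the random forest importance, the data is made up, and `abs_corr`, `count_hits` and `binomial_tail` are hypothetical helper names.

```python
import math
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def count_hits(columns, y, n_trials=20, seed=0):
    """Count, per feature, the trials in which it beats the best shadow feature."""
    rng = random.Random(seed)
    real_scores = {name: abs_corr(values, y) for name, values in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(n_trials):
        shadow_scores = []
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling destroys any link with the target
            shadow_scores.append(abs_corr(shadow, y))
        threshold = max(shadow_scores)
        for name, score in real_scores.items():
            if score > threshold:
                hits[name] += 1
    return hits

def binomial_tail(hits, n_trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(n_trials, p): how likely that many hits are by luck."""
    return sum(math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
               for k in range(hits, n_trials + 1))

# Made-up data: "signal" tracks the target, "noise" does not.
rng = random.Random(1)
y = [rng.randint(0, 1) for _ in range(200)]
columns = {
    "signal": [t + rng.gauss(0, 0.5) for t in y],
    "noise": [rng.gauss(0, 1) for _ in y],
}

n_trials = 20
hits = count_hits(columns, y, n_trials=n_trials)
for name, h in hits.items():
    print(name, h, binomial_tail(h, n_trials))
```

A feature whose hit count would be very unlikely under a fair coin (a tiny tail probability) is a candidate to keep. The real algorithm refits the model on real and shadow features in every trial and also tests for rejection.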

In [None]:
!pip install Boruta==0.3

### Feature selection using Boruta (In case you need to run Boruta again)

In [None]:
#from boruta import BorutaPy
#from sklearn.ensemble import RandomForestClassifier

###initialize Boruta (the target is a class label, so a classifier is used)
#forest = RandomForestClassifier(
#   n_jobs = -1, 
#   max_depth = 5
#)

#boruta = BorutaPy(
#   estimator = forest, 
#   n_estimators = 'auto',
#   max_iter = 20 # number of trials to perform
#)
### fit Boruta (it accepts np.array, not pd.DataFrame)
#boruta.fit(np.array(X_adasyn), np.array(y_adasyn))
### print results
#green_area = X_adasyn.columns[boruta.support_].to_list()
#blue_area = X_adasyn.columns[boruta.support_weak_].to_list()
#print('features in the green area:', green_area)
#print('features in the blue area:', blue_area)

Features selected by Boruta

id, Age, Region_Code, Previously_Insured, Policy_Sales_Channel, Vintage, age_range, Vehicle_Damage_No, Vehicle_Damage_Yes, insured_with_no_damage, not_insured_with_damage, vehicle_age_<_1_year

## Summary

___
- Feature Importance using Random Forest (Top 5 features)

percentage_total_premium, Policy_Sales_Channel, Previously_Insured, Vehicle_Damage_Yes and Age.
___

- Permutation Importance (excluding features with high standard deviation)

Top 5 features

Previously_Insured, percentage_total_premium, Vehicle_Damage_Yes, vehicle_age_<_1_year and insured_with_no_damage.
___

- Boruta

Features selected

id, Age, Region_Code, Previously_Insured, Policy_Sales_Channel, Vintage, age_range, Vehicle_Damage_No, Vehicle_Damage_Yes, insured_with_no_damage, not_insured_with_damage, vehicle_age_<_1_year
___
- Mutual info(Top 5 features)

Features selected

Policy_Sales_Channel, Region_Code, Vehicle_Damage_No, Previously_Insured and not_insured_with_damage
___
**Conclusions**

- age_range was derived from Age; features related to customer age were indicated in two analyses (Feature Importance and Boruta)
- percentage_total_premium was indicated in two analyses (Feature Importance and Permutation Importance)
- Features related to vehicle damage were indicated in all analyses (ANOVA, Feature Importance, Permutation Importance and Boruta)
- Policy_Sales_Channel was indicated in two analyses (Feature Importance and Boruta)
- Previously_Insured was indicated in three analyses (Feature Importance, Permutation Importance and Boruta)
___


In [None]:
x_train.columns

In [None]:
x_train_selected = x_train[['id', 'Age', 'Region_Code', 'Previously_Insured', 'Policy_Sales_Channel', 'Vintage', 'age_range', 'Vehicle_Damage_No', 'Vehicle_Damage_Yes', 'insured_with_no_damage', 'not_insured_with_damage']]

In [None]:
x_val_selected = x_val[['id', 'Age', 'Region_Code', 'Previously_Insured', 'Policy_Sales_Channel', 'Vintage', 'age_range', 'Vehicle_Damage_No', 'Vehicle_Damage_Yes', 'insured_with_no_damage', 'not_insured_with_damage']]

___

# Model Building

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import f1_score, recall_score, precision_score

- Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()
dt.fit(x_train_selected, y_train)
y_pred_dt = dt.predict(x_val_selected)
acc_dt = accuracy_score(y_val, y_pred_dt)
f1_score_dt = f1_score(y_val, y_pred_dt)
recall_score_dt = recall_score(y_val, y_pred_dt)
precision_score_dt = precision_score(y_val, y_pred_dt)

In [None]:
print('The average accuracy is: {}'.format(acc_dt))

- Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(x_train_selected, y_train)
y_pred_rf = rf.predict(x_val_selected)
acc_rf = accuracy_score(y_val, y_pred_rf)
f1_score_rf = f1_score(y_val, y_pred_rf)
recall_score_rf = recall_score(y_val, y_pred_rf)
precision_score_rf = precision_score(y_val, y_pred_rf)

In [None]:
print('The average accuracy is: {}'.format(acc_rf))

- XGBoost

In [None]:
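from xgboost import XGBClassifier

In [None]:
# The class is imported directly so that the estimator name `xgb` does not shadow the xgboost module
xgb = XGBClassifier()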
xgb.fit(x_train_selected, y_train)
y_pred_xgb = xgb.predict(x_val_selected)
acc_xgb = accuracy_score(y_val, y_pred_xgb)
f1_score_xgb = f1_score(y_val, y_pred_xgb)
recall_score_xgb = recall_score(y_val, y_pred_xgb)
precision_score_xgb = precision_score(y_val, y_pred_xgb)

In [None]:
print('The average accuracy is: {}'.format(round(acc_xgb,3)))

- LGBM

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lgbm = LGBMClassifier()
lgbm.fit(x_train_selected, y_train)
y_pred_lgbm = lgbm.predict(x_val_selected)
acc_lgbm = accuracy_score(y_val, y_pred_lgbm)
f1_score_lgbm = f1_score(y_val, y_pred_lgbm)
recall_score_lgbm = recall_score(y_val, y_pred_lgbm)
precision_score_lgbm = precision_score(y_val, y_pred_lgbm)

In [None]:
print('The average accuracy is: {}'.format(round(acc_lgbm,3)))

- K Nearest Neighbor

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier() 
knn.fit(x_train_selected, y_train)  
y_pred_knn = knn.predict(x_val_selected)  
acc_knn = accuracy_score(y_val, y_pred_knn)
f1_score_knn = f1_score(y_val, y_pred_knn)
recall_score_knn = recall_score(y_val, y_pred_knn)
precision_score_knn = precision_score(y_val, y_pred_knn)

In [None]:
print('The average accuracy is: {}'.format(round(acc_knn,3)))

- Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log = LogisticRegression()
log.fit(x_train_selected, y_train)  
y_pred_log = log.predict(x_val_selected)  
acc_log = accuracy_score(y_val, y_pred_log)
f1_score_log = f1_score(y_val, y_pred_log)
recall_score_log = recall_score(y_val, y_pred_log)
precision_score_log = precision_score(y_val, y_pred_log)

In [None]:
print('The average accuracy is: {}'.format(round(acc_log,3)))

- Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier

In [None]:
bag = BaggingClassifier()
bag.fit(x_train_selected, y_train)  
y_pred_bag = bag.predict(x_val_selected)  
acc_bag = accuracy_score(y_val, y_pred_bag)
f1_score_bag = f1_score(y_val, y_pred_bag)
recall_score_bag = recall_score(y_val, y_pred_bag)
precision_score_bag = precision_score(y_val, y_pred_bag)

In [None]:
print('The average accuracy is: {}'.format(round(acc_bag,3)))

- Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gbst = GradientBoostingClassifier()
gbst.fit(x_train_selected, y_train)  
y_pred_gbst = gbst.predict(x_val_selected)  
acc_gbst = accuracy_score(y_val, y_pred_gbst)
f1_score_gbst = f1_score(y_val, y_pred_gbst)
recall_score_gbst = recall_score(y_val, y_pred_gbst)
precision_score_gbst = precision_score(y_val, y_pred_gbst)

In [None]:
print('The average accuracy is: {}'.format(round(acc_gbst,3)))

In [None]:
results = pd.DataFrame({
    'Model': ['Decision tree', 'Random Forest', 'XGBoost', 'LGBM', 'K Nearest Neighbor', 'Logistic Regression', 'Bagging Classifier', 'Gradient Boosting Classifier'],
    'Accuracy': [acc_dt, acc_rf, acc_xgb, acc_lgbm, acc_knn, acc_log, acc_bag, acc_gbst],
    'Recall': [recall_score_dt, recall_score_rf, recall_score_xgb, recall_score_lgbm, recall_score_knn, recall_score_log, recall_score_bag, recall_score_gbst],
    'Precision': [precision_score_dt, precision_score_rf, precision_score_xgb, precision_score_lgbm, precision_score_knn, precision_score_log, precision_score_bag, precision_score_gbst],    
    'F1-score': [f1_score_dt, f1_score_rf, f1_score_xgb, f1_score_lgbm, f1_score_knn, f1_score_log, f1_score_bag, f1_score_gbst]})
result = results.sort_values(by='F1-score', ascending=False)
result = result.set_index('Model')
display(result.head(10))

## Choosing the best hyperparameters using GridSearchCV - Fine tuning

In [None]:
param_grid = {"n_estimators": [200,300,400],
              "max_depth": [4,5,6],
              "learning_rate": [0.001, 0.01, 0.05]}
# Recent xgboost releases expect eval_metric on the estimator rather than in fit()
xgb.set_params(eval_metric=["error", "logloss"])
xgb_grid_selected = GridSearchCV(xgb, cv=KFold(n_splits = 5, shuffle=True), param_grid=param_grid, scoring='accuracy')
# eval_set is only used to record the training and validation curves plotted in the next section
eval_set = [(x_train_selected, y_train), (x_val_selected, y_val)]
xgb_grid_selected.fit(x_train_selected, y_train, eval_set=eval_set, verbose=True)
best_xgb_selected = xgb_grid_selected.best_estimator_

print(best_xgb_selected)

### Plotting the loss and error to check for overfitting

In [None]:
# retrieve performance metrics
results = xgb_grid_selected.best_estimator_.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()

In [None]:
y_pred_xgb_best_model = best_xgb_selected.predict(x_val_selected)

In [None]:
print(classification_report(y_val, y_pred_xgb_best_model))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

### Confusion matrix

In [None]:
# ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(best_xgb_selected, x_val_selected, y_val)
plt.title('Confusion matrix')
plt.yticks(ticks=[0,1], labels=['Not accepted','Accepted'])
plt.xticks(ticks=[0,1], labels=['Not accepted','Accepted'])
plt.grid(False)
plt.show()

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

### Comparison of the different models

In [None]:
ns_probs = [0 for _ in range(len(y_val))]


# fit a model
SEED=1

dt_clf = DecisionTreeClassifier(random_state=SEED)
rf_clf = RandomForestClassifier(random_state=SEED)
xgb_clf = XGBClassifier(random_state=SEED)
lgbm_clf = LGBMClassifier(random_state=SEED)
knn_clf = KNeighborsClassifier() 
log_clf = LogisticRegression(random_state=SEED)
bag_clf = BaggingClassifier(random_state=SEED)
gbst_clf = GradientBoostingClassifier(random_state=SEED)


# trains the classifiers
dt_clf.fit(x_train_selected, y_train)
rf_clf.fit(x_train_selected, y_train)
xgb_clf.fit(x_train_selected, y_train)
lgbm_clf.fit(x_train_selected, y_train)
knn_clf.fit(x_train_selected, y_train)
log_clf.fit(x_train_selected, y_train)
bag_clf.fit(x_train_selected, y_train)
gbst_clf.fit(x_train_selected, y_train)


# predict probabilities

dt_probs = dt_clf.predict_proba(x_val_selected)
rf_probs = rf_clf.predict_proba(x_val_selected)
xgb_probs = xgb_clf.predict_proba(x_val_selected)
lgbm_probs = lgbm_clf.predict_proba(x_val_selected)
knn_probs = knn_clf.predict_proba(x_val_selected)
log_probs = log_clf.predict_proba(x_val_selected)
bag_probs = bag_clf.predict_proba(x_val_selected)
gbst_probs = gbst_clf.predict_proba(x_val_selected)

# keep probabilities for the positive outcome only

dt_probs = dt_probs[:, 1]
rf_probs = rf_probs[:, 1]
xgb_probs = xgb_probs[:, 1]
lgbm_probs = lgbm_probs[:, 1]
knn_probs = knn_probs[:, 1]
log_probs = log_probs[:, 1]
bag_probs =  bag_probs[:, 1]
gbst_probs =  gbst_probs[:, 1]

# calculate scores

ns_auc = roc_auc_score(y_val, ns_probs)
dt_auc = roc_auc_score(y_val, dt_probs)
rf_auc = roc_auc_score(y_val, rf_probs)
xgb_auc = roc_auc_score(y_val, xgb_probs)
lgbm_auc = roc_auc_score(y_val, lgbm_probs)
knn_auc = roc_auc_score(y_val, knn_probs)
log_auc = roc_auc_score(y_val, log_probs)
bag_auc = roc_auc_score(y_val, bag_probs)
gbst_auc = roc_auc_score(y_val, gbst_probs)


# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Decision Tree: ROC AUC=%.3f' % (dt_auc))
print('Random Forest: ROC AUC=%.3f' % (rf_auc))
print('XGBoost: ROC AUC=%.3f' % (xgb_auc))
print('LGBM: ROC AUC=%.3f' % (lgbm_auc))
print('KNN: ROC AUC=%.3f' % (knn_auc))
print('Logistic Regression: ROC AUC=%.3f' % (log_auc))
print('Bagging Classifier: ROC AUC=%.3f' % (bag_auc))
print('Gradient Boosting Classifier: ROC AUC=%.3f' % (gbst_auc))


# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_val, ns_probs)
dt_fpr, dt_tpr, _ = roc_curve(y_val, dt_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_val, rf_probs)
xgb_fpr, xgb_tpr, _ = roc_curve(y_val, xgb_probs)
lgbm_fpr, lgbm_tpr, _ = roc_curve(y_val, lgbm_probs)
knn_fpr, knn_tpr, _ = roc_curve(y_val, knn_probs)
log_fpr, log_tpr, _ = roc_curve(y_val, log_probs)
bag_fpr, bag_tpr, _ = roc_curve(y_val, bag_probs)
gbst_fpr, gbst_tpr, _ = roc_curve(y_val, gbst_probs)



# plot the roc curve for the model
plt.figure(figsize=(16,8), dpi=100)

plt.plot(ns_fpr, ns_tpr, linestyle='dashed', linewidth=2, color= 'black', label='No Skill (auc = %0.3f)' % ns_auc)
plt.plot(dt_fpr, dt_tpr, linestyle='-', linewidth=2, color= 'red', label='Decision Tree (auc = %0.3f)' % dt_auc)
plt.plot(rf_fpr, rf_tpr, linestyle='-', linewidth=2, color= 'blue', label='Random Forest (auc = %0.3f)' % rf_auc)
plt.plot(xgb_fpr, xgb_tpr, marker='.', linewidth=2, color= 'green', label='XGBoost (auc = %0.3f)' % xgb_auc)
plt.plot(lgbm_fpr, lgbm_tpr, linestyle='-', linewidth=2, color= 'yellow', label='LGBM (auc = %0.3f)' % lgbm_auc)
plt.plot(knn_fpr, knn_tpr, linestyle='-', linewidth=2, color= 'orange', label='KNN (auc = %0.3f)' % knn_auc)
plt.plot(log_fpr, log_tpr, linestyle='-', linewidth=2, color= 'magenta', label='Logistic Regression (auc = %0.3f)' % log_auc)
plt.plot(bag_fpr, bag_tpr, linestyle='-', linewidth=2, color= 'gray', label='Bagging Classifier (auc = %0.3f)' % bag_auc)
plt.plot(gbst_fpr, gbst_tpr, linestyle='-', linewidth=2, color= 'pink', label='Gradient Boosting Classifier (auc = %0.3f)' % gbst_auc)


# axis labels
plt.xlabel('False Positive Rate -->')
plt.ylabel('True Positive Rate -->')
plt.title("AUC-ROC Curve")
plt.legend()

plt.show()

In [None]:
import scikitplot as skplt

ROC curve of the tuned XGBoost model

In [None]:
xgb_probs2 = best_xgb_selected.predict_proba(x_val_selected)

skplt.metrics.plot_roc(y_val, xgb_probs2, figsize=(12,8))
plt.xlim(-0.01,1.01)
plt.ylim(-0.01,1.05)
plt.show()

Calibration curves

Plotting the calibration curves of a classifier is useful for determining whether or not you can interpret 
its predicted probabilities directly as a confidence level. For instance, a well-calibrated binary classifier 
should classify the samples such that, among the samples to which it gave a score of 0.8, around 80% actually 
belong to the positive class.

In [None]:
# fit a model
SEED=1

dt_clf = DecisionTreeClassifier(random_state=SEED)
rf_clf = RandomForestClassifier(random_state=SEED)
xgb_clf = XGBClassifier(random_state=SEED)
lgbm_clf = LGBMClassifier(random_state=SEED)
knn_clf = KNeighborsClassifier() 
log_clf = LogisticRegression(random_state=SEED)
bag_clf = BaggingClassifier(random_state=SEED)
gbst_clf = GradientBoostingClassifier(random_state=SEED)


# trains the classifiers
dt_clf.fit(x_train_selected, y_train)
rf_clf.fit(x_train_selected, y_train)
xgb_clf.fit(x_train_selected, y_train)
lgbm_clf.fit(x_train_selected, y_train)
knn_clf.fit(x_train_selected, y_train)
log_clf.fit(x_train_selected, y_train)
bag_clf.fit(x_train_selected, y_train)
gbst_clf.fit(x_train_selected, y_train)


# predict probabilities

dt_probs = dt_clf.predict_proba(x_val_selected)
rf_probs = rf_clf.predict_proba(x_val_selected)
xgb_probs = xgb_clf.predict_proba(x_val_selected)
lgbm_probs = lgbm_clf.predict_proba(x_val_selected)
knn_probs = knn_clf.predict_proba(x_val_selected)
log_probs = log_clf.predict_proba(x_val_selected)
bag_probs = bag_clf.predict_proba(x_val_selected)
gbst_probs = gbst_clf.predict_proba(x_val_selected)


probas_list = [dt_probs, rf_probs, xgb_probs, lgbm_probs, knn_probs, log_probs, bag_probs, gbst_probs]

clf_names = ['Decision tree', 'Random Forest', 'XGBoost', 'LGBM', 'K Nearest Neighbor', 'Logistic Regression', 'Bagging Classifier', 'Gradient Boosting Classifier']

skplt.metrics.plot_calibration_curve(y_val, probas_list, clf_names, figsize=(16,12))
plt.show()
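
What a calibration curve measures (samples scored around 0.8 should be positive about 80% of the time) can also be checked by hand. The following is a minimal sketch on made-up scores, not the scikit-plot implementation; `reliability_table` is a hypothetical helper name and the equal-width binning is an assumption.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed positive rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((t, p))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(t for t, _ in members) / len(members)
        table.append((round(mean_pred, 3), round(frac_pos, 3), len(members)))
    return table

# Made-up scores: ten samples scored 0.8 (eight positive), ten scored 0.2 (two positive).
y_prob = [0.8] * 10 + [0.2] * 10
y_true = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

for mean_pred, frac_pos, count in reliability_table(y_true, y_prob):
    print(f"mean predicted={mean_pred:.2f}  observed positive rate={frac_pos:.2f}  n={count}")
```

When the two numbers in each row agree, the points lie on the diagonal of the calibration plot; a gap between them means the scores should not be read directly as probabilities.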

### Gains curve to check the quality of the model against the baseline (no machine learning)

In [None]:
# get what the predicted probabilities are to use creating cumulative gains chart
probs = xgb_clf.predict_proba(x_val_selected)

skplt.metrics.plot_cumulative_gain(
    y_val, probs, figsize=(10, 8), title_fontsize=20, text_fontsize=18
)
plt.ylim(0,1.05)
plt.show()

### Lift curve to check the quality of the model against the baseline (no machine learning)

In [None]:
skplt.metrics.plot_lift_curve(
    y_val, probs, figsize=(10, 8), title_fontsize=20, text_fontsize=18
)
plt.legend(bbox_to_anchor=(1, 1), fontsize=14)
plt.show()
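
The quantities behind the gains and lift curves can be computed by hand for a single cut-off. This is a minimal sketch on made-up data, not the scikit-plot implementation; `cumulative_gain` and `lift` are hypothetical helper names, and the baseline is assumed to be contacting customers at random.

```python
def cumulative_gain(y_true, y_score, fraction):
    """Share of all positives captured in the top `fraction` of samples,
    ranked by predicted score (highest first)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = int(round(fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:n_top])
    return captured / sum(y_true)

def lift(y_true, y_score, fraction):
    """Gain divided by the random baseline, which captures `fraction` of the positives."""
    return cumulative_gain(y_true, y_score, fraction) / fraction

# Made-up example: 10 customers, 4 of whom are interested (label 1).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gain_30 = cumulative_gain(y_true, y_score, 0.3)
lift_30 = lift(y_true, y_score, 0.3)
print(f"gain at 30%: {gain_30:.2f}, lift at 30%: {lift_30:.2f}")
```

A lift above 1 at a given cut-off means that contacting the top-ranked customers reaches more interested customers than contacting the same number at random.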

# References


https://scikit-learn.org/stable/modules/permutation_importance.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/
https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a