# Predictive model for credit approval
Credit score models estimate the probability of default and are one of the main tools companies use to approve or deny credit.

Description:

Each row represents a customer and the columns represent the data (information) for those customers.
The response variable is the ```inadimplente``` column, which indicates whether the customer defaulted (1) or not (0).
The variables are described below:


- ```idade```: The age of the customer.
- ```numero_de_dependentes```: The number of people dependent on the customer.
- ```salario_mensal```: Monthly salary of the customer.
- ```numero_emprestimos_imobiliarios```: Number of real estate loans that the customer has open.
- ```numero_vezes_passou_90_dias```: Number of times the customer was more than 90 days overdue.
- ```util_linhas_inseguras```: How much the customer is using in relation to their credit limit, on lines that are not secured by personal assets, such as real estate and cars.
- ```vezes_passou_de_30_59_dias```: Number of times the customer delayed the payment of a loan by between 30 and 59 days.
- ```razao_debito```: Ratio between the borrower's debts and equity (debt ratio = debts / equity).
- ```numero_linhas_crdto_aberto```: Number of loans outstanding by the customer.
- ```numero_de_vezes_que_passou_60_89_dias```: Number of times the customer delayed the payment of a loan by between 60 and 89 days.


Acknowledgments

LigthGBM Simple fe by [@caesarlupum](https://www.kaggle.com/caesarlupum/ashrae-ligthgbm-simple-fe), Brazil against the advance of Covid-19 by [@caesarlupum](https://www.kaggle.com/caesarlupum/brazil-against-the-advance-of-covid-19), eda and prediction by [@gpreda Introduction](https://www.kaggle.com/gpreda/santander-eda-and-prediction).

### Loading Required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.metrics import roc_auc_score,precision_recall_curve,roc_curve
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import itertools
from datetime import datetime
from numpy import interp  # scipy.interp was an alias of numpy.interp and has been removed from SciPy
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold


pd.set_option('display.max_columns', 100)

In [None]:
# Training data
train = pd.read_csv('/kaggle/input/risco-de-credito/treino.csv')
print('Training data shape: ', train.shape)
train.head()

# Exploratory Data Analysis

In [None]:
# Test data
test = pd.read_csv('/kaggle/input/risco-de-credito/teste.csv')
print('Test data shape: ', test.shape)
test.head()

In [None]:
df_lgb_ = train.copy()
target = train['inadimplente']
df_lgb = train.drop(['inadimplente'], axis=1)
train_df = train.copy()
x = train.copy()

### Examine the Distribution of the Target Column

In [None]:
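# Count each class, ordered by label (0, 1), and pass x/y as keywords
counts = train['inadimplente'].value_counts().sort_index()
sns.barplot(x=counts.index, y=counts.values)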
plt.title('Target variable count')

print("There are {}% target values with 1".format(100 * train['inadimplente'].value_counts()[1]/train.shape[0]))

The data is imbalanced with respect to the target value.
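To make the imbalance concrete, here is a minimal sketch on synthetic labels (not the real data; the 7% positive rate is an assumption chosen for illustration). It measures the share of positives, computes the common negatives-per-positive ratio that is often used as a starting point for a positive-class weight, and shows that a stratified split keeps the class ratio almost identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target column (about 7% positives); not the real data
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.07).astype(int)
X = rng.normal(size=(10_000, 3))

prevalence = y.mean()                                # share of class 1
neg_per_pos = (y == 0).sum() / (y == 1).sum()        # negatives per positive

# stratify=y keeps the class ratio (almost) identical in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.30, random_state=2020, stratify=y
)
print(f"prevalence: {prevalence:.3f}, train: {y_tr.mean():.3f}, val: {y_va.mean():.3f}")
print(f"negatives per positive: {neg_per_pos:.1f}")
```

The ratio is only a heuristic starting point; the notebook derives its own weight later on.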

### Checking missing data in train

Number and percentage of missing values in each column.

In [None]:
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head(10)

Missing values: ```salario_mensal``` (19.78%) and ```numero_de_dependentes``` (2.61%).
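With roughly a fifth of one column missing, it can help to keep track of which values were imputed. The sketch below uses a toy matrix with made-up values and scikit-learn's `SimpleImputer` with `add_indicator=True`; whether a missingness flag actually helps on this dataset is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: column 0 has a missing value, column 1 is complete (illustrative values)
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [np.nan, 30.0],
              [5.0, 40.0]])

# add_indicator=True appends a 0/1 column for each feature that had missing values at fit time
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The missing entry is replaced by the column median, and the extra column marks the row that was imputed.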

### Column Types

In [None]:
train.dtypes.value_counts()

All variables are numeric: 7 ```int64``` and 4 ```float64``` (which can be either discrete or continuous).

### Duplicate values
Let's now check how many duplicate values exist per column.

In [None]:
features = train.columns.values[1:11]
unique_max_train = []
unique_max_test = []
for feature in features:
    values = train[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])


In [None]:
np.transpose((pd.DataFrame(unique_max_train, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(10))

In [None]:
np.transpose((pd.DataFrame(unique_max_test, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(10))

The same columns in the train and test sets have a very similar number of duplicates of the same or very close values.
This is an interesting pattern that we might be able to use in the future.

### Correlations

The correlation coefficient is not the best way to represent the "relevance" of a feature, but it does give us an idea of possible relationships within the data. Some general interpretations of the absolute value of the correlation coefficient are:

- .00-.19 “very weak”
- .20-.39 “weak”
- .40-.59 “moderate”
- .60-.79 “strong”
- .80-1.0 “very strong”
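The scale above can be written as a small helper, which is handy for labelling the correlations printed in the next cell. This is a minimal sketch; the function name `correlation_strength` is hypothetical and not part of the notebook.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal scale listed above."""
    a = abs(r)
    if a > 1:
        raise ValueError("correlation must lie in [-1, 1]")
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.60:
        return "moderate"
    if a < 0.80:
        return "strong"
    return "very strong"

for r in (0.05, -0.25, 0.45, -0.7, 0.9):
    print(r, correlation_strength(r))
```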

In [None]:
# Find correlations with the target and sort
correlations = train.corr()['inadimplente'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(5))
print('\nMost Negative Correlations:\n', correlations.head(5))

Most significant correlation: **vezes_passou_de_30_59_dias** has the most positive correlation with the target.


In [None]:
corr_train = train.corr()
plt.figure(figsize = (14, 10))
# Heatmap of correlations
sns.heatmap(corr_train, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

```numero_vezes_passou_90_dias```, ```vezes_passou_de_30_59_dias``` and ```numero_de_vezes_que_passou_60_89_dias``` are highly correlated with one another, as expected.

```numero_emprestimos_imobiliarios``` has a medium correlation with ```salario_mensal```, which suggests that people with a higher salary have more real estate loans.

##### Age informative plots

In [None]:
plt.style.use('fivethirtyeight')
# Plot the distribution of ages in years
plt.hist(train['idade'], edgecolor = 'k')
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
print('min age {} max age {}'.format(train['idade'].min(), train['idade'].max()))
# Counts are passed in the same order as the labels in the message
print('age <20 {}, age >99 {}'.format(len(train[train['idade']<20]),len(train[train['idade']>99])))

We have some inconsistent values, for example an age of 0. We need to drop these rows.
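A minimal sketch of how such rows could be dropped, shown on a toy frame with made-up values. The bounds `MIN_AGE` and `MAX_AGE` are assumptions for illustration, not values taken from the notebook, and the target would need to be filtered with the same mask.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; values are made up
df = pd.DataFrame({"idade": [0, 19, 35, 62, 101],
                   "inadimplente": [0, 1, 0, 0, 1]})

MIN_AGE, MAX_AGE = 18, 99   # assumed plausibility bounds, not taken from the notebook

valid_age = df["idade"].between(MIN_AGE, MAX_AGE)   # inclusive on both ends
print(f"dropping {(~valid_age).sum()} of {len(df)} rows")
df_clean = df.loc[valid_age].reset_index(drop=True)
print(df_clean)
```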

In [None]:
plt.figure(figsize = (10, 8))
# KDE plot of loans that were repaid on time
sns.kdeplot(train.loc[train['inadimplente'] == 0, 'idade'], label = 'target == 0')
# KDE plot of loans which were not repaid on time
sns.kdeplot(train.loc[train['inadimplente'] == 1, 'idade'], label = 'target == 1')
# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');

#### Average failure to repay loans by age bracket.

In [None]:
# Age information into a separate dataframe
age_data = train[['inadimplente', 'idade']].copy()  # copy so the new column is not written to a view of train
# Bin the age data
age_data['age_binned'] = pd.cut(age_data['idade'], bins = np.linspace(20, 60, num = 6))
age_data.head(10)

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('age_binned').mean()
age_groups

In [None]:
plt.figure(figsize = (8, 8))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['inadimplente'])
# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

There is a clear trend: younger applicants are more likely not to repay the loan. The rate of failure to repay is above 10% for the two youngest age groups, (20, 28] and (28, 36].


This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time.

In [None]:
plt.figure(figsize = (10, 8))
sns.kdeplot(train.loc[train['inadimplente'] == 0, 'numero_linhas_crdto_aberto'], label = 'target == 0')
sns.kdeplot(train.loc[train['inadimplente'] == 1, 'numero_linhas_crdto_aberto'], label = 'target == 1')
plt.xlabel('number of open credit lines'); plt.ylabel('Density'); plt.title('Distribution of number of open credit lines');

In [None]:
plt.figure(figsize = (10, 8))
sns.kdeplot(train.loc[train['inadimplente'] == 0, 'numero_emprestimos_imobiliarios'], label = 'target == 0')
sns.kdeplot(train.loc[train['inadimplente'] == 1, 'numero_emprestimos_imobiliarios'], label = 'target == 1')
plt.xlabel('number real estate loans'); plt.ylabel('Density'); plt.title('Distribution of number real estate loans');

### Train Test Distribution Analysis

In [None]:
def plot_dist_col(column, train, test):
    '''plot dist curves for train and test  data for the given column name'''
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.distplot(train[column].dropna(), color='green', ax=ax).set_title(column, fontsize=10)
    sns.distplot(test[column].dropna(), color='purple', ax=ax).set_title(column, fontsize=10)
    plt.xlabel(column, fontsize=12)
    plt.legend(['train', 'test'])
    plt.show()

In [None]:
plot_dist_col('util_linhas_inseguras', train, test)    

In [None]:
plot_dist_col('idade', train, test) 

In [None]:
test.dtypes

In [None]:
plot_dist_col('razao_debito', train, test) 

In [None]:

plot_dist_col('salario_mensal', train, test)

In [None]:
plot_dist_col('numero_emprestimos_imobiliarios', train, test)

In [None]:
plot_dist_col('numero_de_dependentes', train, test) 

In general the train and test distributions are similar, but **util_linhas_inseguras**, **numero_emprestimos_imobiliarios** and **salario_mensal** have higher values in the test data.
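The visual comparison can be complemented with a two-sample Kolmogorov-Smirnov test, which is a technique added here for illustration and not used in the notebook itself. The sketch below runs on synthetic samples; on the real data one would pass the non-missing values of a column from `train` and `test`.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Synthetic stand-ins for one feature in train and test (not the real data)
train_col = rng.normal(loc=0.0, scale=1.0, size=5000)
test_same = rng.normal(loc=0.0, scale=1.0, size=5000)
test_shifted = rng.normal(loc=0.5, scale=1.0, size=5000)

for name, sample in [("same distribution", test_same), ("shifted", test_shifted)]:
    stat, p_value = ks_2samp(train_col, sample)
    print(f"{name}: KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```

A larger KS statistic means the two empirical distributions differ more; with large samples even small shifts yield tiny p-values, so the statistic itself is usually the more useful number.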

# Baseline

In [None]:
train = train.drop(columns = ['inadimplente'])
# Feature names
features = list(train.columns)
# Median imputation of missing values
imputer = SimpleImputer(strategy = 'median')
# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)
# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(test)
# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)

### Logistic Regression

In [None]:
# Confusion matrix 
def plot_confusion_matrix(cm, classes,
                          normalize = False,
                          title = 'Confusion matrix',
                          cmap = plt.cm.Blues) :
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])) :
        plt.text(j, i, cm[i, j],
                 horizontalalignment = 'center',
                 color = 'white' if cm[i, j] > thresh else 'black')

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def auc_score(y_true, y_pred):
    """
    Calculates the Area Under ROC Curve (AUC)
    """
    return roc_auc_score(y_true, y_pred)
def plot_curve(y_true_train, y_pred_train, y_true_val, y_pred_val, model_name):
    """
    Plots the ROC Curve given predictions and labels
    """
    fpr_train, tpr_train, _ = roc_curve(y_true_train, y_pred_train, pos_label=1)
    fpr_val, tpr_val, _ = roc_curve(y_true_val, y_pred_val, pos_label=1)
    plt.figure(figsize=(8, 8))
    plt.plot(fpr_train, tpr_train, color='black',
             lw=2, label=f"ROC train curve (AUC = {round(roc_auc_score(y_true_train, y_pred_train), 4)})")
    plt.plot(fpr_val, tpr_val, color='darkorange',
             lw=2, label=f"ROC validation curve (AUC = {round(roc_auc_score(y_true_val, y_pred_val), 4)})")
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.title(f'ROC Plot for {model_name}', weight="bold", fontsize=20)
    plt.legend(loc="lower right", fontsize=16)
def plot_pre_curve(y_test,probs):
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
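    # No-skill baseline: on imbalanced data precision equals the share of positives, not 0.5
    no_skill = np.mean(y_test)
    plt.plot([0, 1], [no_skill, no_skill], linestyle='--')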
    # plot the precision-recall curve for the model
    plt.plot(recall, precision, marker='.')
    plt.title("precision recall curve")
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    # show the plot
    plt.show()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train, target,
                                                  test_size=0.30, 
                                                  random_state=2020, 
                                                  stratify=target)

In [None]:
# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)
# Train on the training data
log_reg.fit(X_train, y_train)

In [None]:
# Get score on training set and validation set for Logistic Regression
train_preds = log_reg.predict_proba(X_train)[:, 1]
val_preds = log_reg.predict_proba(X_val)[:, 1]
train_score = auc_score(y_train, train_preds)
val_score = auc_score(y_val, val_preds)

##### Evaluation Logistic Regression ROC_AUC

In [None]:
# Plot ROC curve
plot_curve(y_train, train_preds, y_val, val_preds, "Logistic Regression Baseline")

In [None]:
plot_pre_curve(y_val ,val_preds)

#### Classification report train

In [None]:
train_cm = confusion_matrix(y_train,train_preds.round())
print('Confusion matrix: \n',train_cm)
print('Classification report: \n',classification_report(y_train, train_preds.round()))

In [None]:
# visualize with seaborn library
sns.heatmap(train_cm,annot=True,fmt="d") 
plt.show()

### Improved Model: Random Forest

In [None]:

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 2020, verbose = 1, n_jobs = -1)
# Train on the training data
random_forest.fit(X_train,y_train)
    # Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Get score on training set and validation set for random forest
train_preds = random_forest.predict_proba(X_train)[:, 1]
val_preds = random_forest.predict_proba(X_val)[:, 1]
train_score = auc_score(y_train, train_preds)
val_score = auc_score(y_val, val_preds)

In [None]:
# Plot ROC curve
plot_curve(y_train, train_preds, y_val, val_preds, "Random Forest Baseline")

In [None]:
plot_pre_curve(y_val ,val_preds)

#### Classification report train

In [None]:
train_cm = confusion_matrix(y_train,train_preds.round())
print('Confusion matrix: \n',train_cm)
print('Classification report: \n',classification_report(y_train, train_preds.round()))

In [None]:
# visualize with seaborn library
sns.heatmap(train_cm,annot=True,fmt="d") 
plt.show()

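On the training set the model makes very few errors: just 5 FP out of 71863 and 48 FN out of 5084. These counts come from the same data the forest was fitted on, so they mostly show how well it memorizes the training set; the validation ROC curve above is the better guide to how it generalizes.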
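The gap between training and validation performance can be checked directly. The following is a minimal sketch on synthetic, imbalanced data with label noise (a stand-in, not the credit data); it only illustrates that a fully grown random forest tends to score close to perfectly on the rows it was trained on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data with label noise; a stand-in, not the credit data
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
auc_val = roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1])
print(f"train AUC = {auc_train:.4f}, validation AUC = {auc_val:.4f}")
```

Reporting the confusion matrix on `X_val` as well as on `X_train` would give the same kind of comparison for the notebook's model.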

# Feature Selection

In [None]:
# Lgbm
import lightgbm as lgb

In [None]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(df_lgb.shape[1])
train_weight = 1-y_train.replace(y_train.value_counts()/len(y_train))
positive_weight = train_weight[y_train==1].values[0]
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 5000, class_weight = 'balanced', scale_pos_weight= positive_weight)
    
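# Train using early stopping (callback API; the early_stopping_rounds and verbose
# arguments of fit were removed in LightGBM 4.0)
model.fit(X_train, y_train, eval_set = [(X_val, y_val)], eval_metric = 'auc',
          callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])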
    
    # Record the feature importances
feature_importances += model.feature_importances_

In [None]:
# Get score on training set and validation set for the LightGBM classifier
train_preds = model.predict_proba(X_train)[:, 1]
val_preds = model.predict_proba(X_val)[:, 1]
train_score = auc_score(y_train, train_preds)
val_score = auc_score(y_val, val_preds)

In [None]:
# Plot ROC curve
plot_curve(y_train, train_preds, y_val, val_preds, "LGBM Classifier")

#### Classification report train

In [None]:
train_cm = confusion_matrix(y_train,train_preds.round())
print('Confusion matrix: \n',train_cm)
print('Classification report: \n',classification_report(y_train, train_preds.round()))

In [None]:
# visualize with seaborn library
sns.heatmap(train_cm,annot=True,fmt="d") 
plt.show()

Feature importances

In [None]:
# Only one model was fitted above, so there is nothing to average here
# (dividing by a constant would not change the ranking anyway)
feature_importances = pd.DataFrame({'feature': list(df_lgb.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
feature_importances.head()

**razao_debito** and **util_linhas_inseguras** are the most important features for our model!

Find the features with zero importance

In [None]:
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
feature_importances.tail()

There are no features with zero importance. Nice!


In [None]:
def plot_feature_importances(df, threshold = 0.9):
    """
    Plots 10 most important features and the cumulative importance of features.
    Prints the number of features needed to reach threshold cumulative importance.
    
    Parameters
    --------
    df : dataframe
        Dataframe of feature importances. Columns must be feature and importance
    threshold : float, default = 0.9
        Threshold for printing information about cumulative importances
        
    Return
    --------
    df : dataframe
        Dataframe ordered by feature importances with a normalized column (sums to 1)
        and a cumulative importance column
    
    """
    
    plt.rcParams['font.size'] = 15
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = np.cumsum(df['importance_normalized'])

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:10]))), 
            df['importance_normalized'].head(10), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:10]))))
    ax.set_yticklabels(df['feature'].head(10))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    # Cumulative importance plot
    plt.figure(figsize = (8, 6))
    plt.plot(list(range(len(df))), df['cumulative_importance'], 'r-')
    plt.xlabel('Number of Features'); plt.ylabel('Cumulative Importance'); 
    plt.title('Cumulative Feature Importance');
    plt.show();
    
    importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
    print('%d features required for %0.2f of cumulative importance' % (importance_index + 1, threshold))
    
    return df

In [None]:
norm_feature_importances = plot_feature_importances(feature_importances)

# Modeling

In [None]:
import itertools
from numpy import interp  # scipy.interp was an alias of numpy.interp and has been removed from SciPy
def gradient_boosting_model(params, folds, test_df, model='LGB',stack = False):    
    print(str(model)+' modeling...')
    start_time = timer(None)
    plt.rcParams["axes.grid"] = True
    nfold = folds
    # random_state only takes effect with shuffle=True; recent scikit-learn raises an error otherwise
    skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=44400)

    oof = np.zeros(len(train_df))
    mean_fpr = np.linspace(0,1,100)
    cms= []
    tprs = []
    aucs = []
    y_real = []
    y_proba = []
    recalls = []
    roc_aucs = []
    f1_scores = []
    accuracies = []
    precisions = []
    feature_importance_df = pd.DataFrame()
    predictions = np.zeros(len(test_df))

    i = 1
    for train_idx, valid_idx in skf.split(train_df, train_df['inadimplente'].values):
        print("\nfold {}".format(i))
        
        if model == 'LGB':
        
            trn_data = lgb.Dataset(train_df.iloc[train_idx][features].values,
                                   label=train_df.iloc[train_idx]['inadimplente'].values
                                   )
            val_data = lgb.Dataset(train_df.iloc[valid_idx][features].values,
                                   label=train_df.iloc[valid_idx]['inadimplente'].values
                                   )   

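            # Use the params argument rather than the global param_lgb; early stopping and
            # logging go through callbacks, as required since LightGBM 4.0
            clf = lgb.train(params, trn_data, num_boost_round=1000, valid_sets = [trn_data, val_data],
                            callbacks=[lgb.early_stopping(stopping_rounds=10000), lgb.log_evaluation(period=800)])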
            oof[valid_idx] = clf.predict(train_df.iloc[valid_idx][features].values) 
  
            predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / skf.n_splits
    
        if model == 'XGB':

            trn_data = xgb.DMatrix(train_df.iloc[train_idx][features], 
                                   label=train_df.iloc[train_idx]['inadimplente'].values)
            val_data = xgb.DMatrix(train_df.iloc[valid_idx][features], 
                                   label=train_df.iloc[valid_idx]['inadimplente'].values)

            watchlist = [(trn_data, 'train'), (val_data, 'valid')]

            clf = xgb.train(params, dtrain = trn_data, evals=watchlist, early_stopping_rounds=1000, maximize=True, verbose_eval=800)
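            # ntree_limit was removed in XGBoost 2.0; iteration_range is its replacement
            best_range = (0, clf.best_iteration + 1)
            oof[valid_idx] = clf.predict(val_data, iteration_range=best_range)
            
            test_xgb = xgb.DMatrix(test_df[features])
            predictions += clf.predict(test_xgb, iteration_range=best_range) / skf.n_splits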
        
        # Scores 
        roc_aucs.append(roc_auc_score(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx]))
        accuracies.append(accuracy_score(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx].round()))
        recalls.append(recall_score(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx].round()))
        precisions.append(precision_score(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx].round()))
        f1_scores.append(f1_score(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx].round()))

        # Roc curve by folds
        f = plt.figure(1)
        fpr, tpr, t = roc_curve(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx])
        tprs.append(interp(mean_fpr, fpr, tpr))
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.4f)' % (i,roc_auc))

        # Precision recall by folds
        g = plt.figure(2)
        precision, recall, _ = precision_recall_curve(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx])
        y_real.append(train_df.iloc[valid_idx]['inadimplente'].values)
        y_proba.append(oof[valid_idx])
        plt.plot(recall, precision, lw=2, alpha=0.3, label='P|R fold %d' % (i))  

        i= i+1
        
        # Confusion matrix by folds
        cms.append(confusion_matrix(train_df.iloc[valid_idx]['inadimplente'].values, oof[valid_idx].round()))
        
        # Features imp
        fold_importance_df = pd.DataFrame()
        fold_importance_df["Feature"] = features
        if model == 'LGB':
            fold_importance_df["importance"] = clf.feature_importance()
        fold_importance_df["fold"] = i - 1  # i was already incremented above, so i - 1 is the current fold
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)    

    # Metrics
    print(
            '\nCV roc score        : {0:.4f}, std: {1:.4f}.'.format(np.mean(roc_aucs), np.std(roc_aucs)),
            '\nCV accuracy score   : {0:.4f}, std: {1:.4f}.'.format(np.mean(accuracies), np.std(accuracies)),
            '\nCV recall score     : {0:.4f}, std: {1:.4f}.'.format(np.mean(recalls), np.std(recalls)),
            '\nCV precision score  : {0:.4f}, std: {1:.4f}.'.format(np.mean(precisions), np.std(precisions)),
            '\nCV f1 score         : {0:.4f}, std: {1:.4f}.'.format(np.mean(f1_scores), np.std(f1_scores))
    )
    
    # Roc plt
    f = plt.figure(1)
    plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'grey')
    mean_tpr = np.mean(tprs, axis=0)
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color='blue',
             label=r'Mean ROC (AUC = %0.4f)' % (np.mean(roc_aucs)),lw=2, alpha=1)

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(str(model)+' ROC curve by folds')
    plt.legend(loc="lower right")
    
    # PR plt
    g = plt.figure(2)
    plt.plot([0,1],[1,0],linestyle = '--',lw = 2,color = 'grey')
    y_real = np.concatenate(y_real)
    y_proba = np.concatenate(y_proba)
    precision, recall, _ = precision_recall_curve(y_real, y_proba)
    plt.plot(recall, precision, color='blue',
             label=r'Mean P|R')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(str(model)+' P|R curve by folds')
    plt.legend(loc="lower left")

    # Confusion matrix
    plt.rcParams["axes.grid"] = False
    cm = np.average(cms, axis=0)
    class_names = [0,1]
    plt.figure()
    plot_confusion_matrix(cm, 
                          classes=class_names, 
                          title= str(model).title()+' Confusion matrix [averaged/folds]')
    
    # Feat imp plt
    if model != 'XGB':
        cols = (feature_importance_df[["Feature", "importance"]]
            .groupby("Feature")
            .mean()
            .sort_values(by="importance", ascending=False)[:30].index)
        best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

        plt.figure(figsize=(10,10))
        sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False),
                edgecolor=('white'), linewidth=2, palette="rocket")
        plt.title(str(model)+' Features importance (averaged/folds)', fontsize=18)
        plt.tight_layout()
        
    # Timer end    
    timer(start_time)
    
    return predictions
    
#Timer
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('Time taken for Modeling: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [None]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix, accuracy_score, roc_auc_score, f1_score, roc_curve, auc,precision_recall_curve
param_xgb = {
            'objective': 'binary:logistic',  # xgb.train defaults to squared-error regression otherwise
            'n_jobs' : -1, 'n_estimators' : 500, 'seed' : 4040,
            'random_state':404, 'eval_metric':'auc' }
# Test data
sub_df = pd.read_csv('/kaggle/input/risco-de-credito/teste.csv')
preds_xgb = gradient_boosting_model(param_xgb, 10, sub_df, 'XGB')



## Gradient Boosting Model function

In [None]:
param_lgb = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'class_weight':'unbalanced',
    'scale_pos_weight':positive_weight,    
    'objective': 'binary', 
    'verbosity': 1
}

# Test data
sub_df = pd.read_csv('/kaggle/input/risco-de-credito/teste.csv')

predictions = gradient_boosting_model(param_lgb, 10, sub_df, 'LGB')
sub_df["inadimplente"] = predictions.round()
sub_df["inadimplente score"] = predictions

#### Creating the final test dataset

In [None]:
sub_df.to_csv("final_test.csv", index=False)
sub_df.head()

### Defaulting-Score for customers

In [None]:
def severity_validation(df):
    # Vectorized labelling; avoids chained assignment inside iterrows(),
    # which triggers SettingWithCopyWarning and may not write to df at all
    conditions = [df['inadimplente'] < 0.5, df['inadimplente'] <= 0.75]
    labels = ["low-defaulting-score", "medium-defaulting-score"]
    df['defaulting-score'] = np.select(conditions, labels, default="high-defaulting-score")
    return df

customer_df= pd.DataFrame(predictions, columns=['inadimplente'])

customer_score=severity_validation(customer_df)
customer_score['inadimplente']
# Quartiles of the predicted probability, computed in one call
q25, q50, q75, q100 = np.percentile(customer_score['inadimplente'], [25, 50, 75, 100])

iqr_50 = q50 - q25
iqr_75 = q75 - q50
iqr_100 = q100 - q75
iqr_75_25 = q75 - q25

In [None]:
customer_score['inadimplente result'] = predictions.round()
print("minimum defaulting prob ",customer_score[customer_score['defaulting-score']=='high-defaulting-score']['inadimplente'].min())
print("maximum defaulting prob ",customer_score[customer_score['defaulting-score']=='high-defaulting-score']['inadimplente'].max())

customer_score[customer_score['defaulting-score']=='high-defaulting-score'].head(10)

Now we can identify high-defaulting customers. Each customer is assigned a "defaulting-score" based on the predicted probability:

- Low-defaulting-score for customers with a predicted probability < 0.50
- Medium-defaulting-score for customers with a predicted probability between 0.50 and 0.75
- High-defaulting-score for customers with a predicted probability > 0.75

### Factor Analysis in defaulting customers


In [None]:
# Install library 
!pip install factor_analyzer
# import factor analyzer library
from factor_analyzer import FactorAnalyzer

In [None]:
fa = FactorAnalyzer()
fa.fit(sub_df)  # the second positional argument of fit is y, which is ignored

ev, v = fa.get_eigenvalues()

# Create scree plot using matplotlib
plt.figure(figsize=(25,10))
plt.scatter(range(1,sub_df.shape[1]+1),ev)
plt.plot(range(1,sub_df.shape[1]+1),ev)
plt.hlines(1, 0, sub_df.shape[1], colors='r')
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

Based on the scree plot, we keep 3 factors.
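The red line at 1 in the scree plot corresponds to the Kaiser criterion: keep the factors whose eigenvalue is greater than 1. As a minimal sketch, the same count can be computed with NumPy from the eigenvalues of the correlation matrix. The data below is synthetic (6 variables driven by 2 latent factors, an assumption for illustration), so the resulting number says nothing about the credit data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.5 * rng.normal(size=(1000, 6))

corr = np.corrcoef(X, rowvar=False)            # 6 x 6 correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted from largest to smallest
n_factors = int((eigenvalues > 1).sum())       # Kaiser criterion: keep eigenvalues > 1
print(eigenvalues.round(3), n_factors)
```

The eigenvalues of a correlation matrix sum to the number of variables, which is why 1 (the average) is the usual cut-off.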

In [None]:
# Perform Factor Analysis
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(sub_df)
loads = fa.loadings_
loads = pd.DataFrame(loads, index=sub_df.T.index)

In [None]:
#Heatmap of loadings
plt.figure(figsize=(30,25))
sns.heatmap(loads, annot=True, cmap="YlGnBu")

### Factors 
The factors represent groups of highly correlated variables. We only consider factor loadings > 0.35 in each factor.
Let's analyze defaulting customers and group them by the highest scores (factor loadings) for each factor.

- factor 1 (**customer delaying payment.**) - vezes_passou_de_30_59_dias, numero_vezes_passou_90_dias, numero_de_vezes_que_passou_60_89_dias.
- factor 2 (**customer with many open loans**) - numero_linhas_crdto_aberto,numero_emprestimos_imobiliarios.
- factor 3 (**young customer with few dependents**) - idade, numero_de_dependentes.
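A factor score of this kind is a loading-weighted sum of the variables that belong to the factor, kept only for the rows of interest. The sketch below shows the idea for factor 2 with made-up loadings and made-up customers; none of the numbers come from the fitted model.

```python
import pandas as pd

# Made-up loadings for two variables on three factors (columns 0, 1, 2)
loads = pd.DataFrame({0: [0.10, 0.05], 1: [0.70, 0.60], 2: [0.02, 0.01]},
                     index=["numero_linhas_crdto_aberto", "numero_emprestimos_imobiliarios"])
# Made-up customers
df = pd.DataFrame({"numero_linhas_crdto_aberto": [10, 2],
                   "numero_emprestimos_imobiliarios": [3, 0],
                   "inadimplente": [1, 0]})

cols = list(loads.index)
factor = 1   # column of `loads` holding the loadings for "factor 2"
# Weighted sum of the variables, one score per row
scores = df[cols].mul(loads.loc[cols, factor], axis=1).sum(axis=1)
# Keep the score only for predicted defaulters; other rows become NaN
df["score_factor2_target"] = scores.where(df["inadimplente"] == 1)
print(df)
```

Computing the score for every row first and masking afterwards avoids overwriting one group's scores when the function is called for another group.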


In [None]:
test

In [None]:
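def score_factor1(df, df_factor_analysis, target, name_target):
    # Loading-weighted sum of the delay variables for rows where name_target == target;
    # all other rows get NaN, and each call overwrites the same column
    df['score_factor1_target'] = df[df[name_target]==target]['vezes_passou_de_30_59_dias'] * df_factor_analysis.T['vezes_passou_de_30_59_dias'][0] \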
    + df[df[name_target]==target]['numero_vezes_passou_90_dias'] * df_factor_analysis.T['numero_vezes_passou_90_dias'][0] \
    + df[df[name_target]==target]['numero_de_vezes_que_passou_60_89_dias'] * df_factor_analysis.T['numero_de_vezes_que_passou_60_89_dias'][0]        

score_factor1(sub_df, loads, 0, 'inadimplente')       
score_factor1(sub_df, loads, 1, 'inadimplente')


In [None]:
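def score_factor2(df, df_factor_analysis, target, name_target):
    # Loading-weighted sum of the open-loan variables for rows where name_target == target;
    # all other rows get NaN, and each call overwrites the same column
    df['score_factor2_target'] = df[df[name_target]==target]['numero_linhas_crdto_aberto'] * df_factor_analysis.T['numero_linhas_crdto_aberto'][1] \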
    + df[df[name_target]==target]['numero_emprestimos_imobiliarios'] * df_factor_analysis.T['numero_emprestimos_imobiliarios'][1]  

score_factor2(sub_df, loads, 0, 'inadimplente')       
score_factor2(sub_df, loads, 1, 'inadimplente')


In [None]:
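def score_factor3(df, df_factor_analysis, target, name_target):
    # Loading-weighted sum of age and dependents for rows where name_target == target;
    # all other rows get NaN, and each call overwrites the same column
    df['score_factor3_target'] = df[df[name_target]==target]['idade'] * df_factor_analysis.T['idade'][2] \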
    + df[df[name_target]==target]['numero_de_dependentes'] * df_factor_analysis.T['numero_de_dependentes'][2]   

score_factor3(sub_df, loads, 0, 'inadimplente')       
score_factor3(sub_df, loads, 1, 'inadimplente')


Customers with a high 'inadimplente score' and a high score for Factor 1.

In [None]:
sub_df

In [None]:
sub_df.columns

In [None]:
sub_df[['score_factor1_target', 'inadimplente score' ]].sort_values(by=['score_factor1_target', 'inadimplente score'],ascending=False).head(10)

Customers with a high 'inadimplente score' and a high score for Factor 2.

In [None]:
sub_df[['score_factor2_target', 'inadimplente score' ]].sort_values(by=['score_factor2_target', 'inadimplente score'],ascending=False).head(10)

Customers with a high 'inadimplente score' and a high score for Factor 3.

In [None]:
sub_df[['score_factor3_target', 'inadimplente score' ]].sort_values(by=['score_factor3_target', 'inadimplente score'],ascending=False).head(10)

# End Notebook