# German Credit Data Analysis

Loans form an integral part of banking operations. However, not all loans are repaid promptly, so it is important for a bank to closely monitor its loan applications. This project is an analysis of the German credit data. It contains details of 1000 loan applicants with 20 attributes, together with a classification of each applicant as a Good or a Bad credit risk.

In this project, the relationship between the credit risk and various attributes will be explored through basic statistical techniques and presented through visualizations.

### **Contents**

1. Import data
2. Data preparation and cleaning
3. Exploratory data analysis and visualization
4. Encode categorical variables
5. Feature engineering
6. Data split to train and test datasets
7. Models
8. Summary

I will be using the Python 3.8 AzureML kernel.

## 1. Import data

Let's begin by downloading the data from the [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/about.html). 

The dataset has been downloaded and extracted.

## 2. Data Preparation and Cleaning

In this step, we do data preparation and cleaning, making the data suitable for subsequent analysis. 

**2.1 Load data into dataframe**

The data file is in `.data` format, space-delimited, and has no header row.

In [None]:
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data', delimiter=' ',header=None)

In [None]:
display(df)

In [None]:
#Based on the description, we name the columns.

df.columns=['account_bal','duration','payment_status','purpose',
                   'credit_amount','savings_bond_value','employed_since',
                   'intallment_rate','sex_marital','guarantor','residence_since',
                   'most_valuable_asset','age','concurrent_credits','type_of_housing',
                   'number_of_existcr','job','number_of_dependents','telephone',
                   'foreign','target']
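# Recode selected category codes into readable labels, following the dataset description
# (A121 = real estate, A122 = life insurance, A123 = car, A124 = unknown / no property)
df= df.replace(['A11','A12','A13','A14', 'A171','A172','A173','A174','A121','A122','A123','A124'],
                  ['neg_bal','positive_bal','positive_bal','no_acc','unskilled','unskilled','skilled','highly_skilled',
                   'real_estate','life_insurance','car','none'])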

## 3. Exploratory Data Analysis and Visualization

In [None]:
# import libraries for visualizations
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'


## 3.1 Examine missing values

In [None]:
# check for missing values
df.isna().any().any()


In [None]:
# 3.2 Examining distribution of target column

df.target.unique()


In [None]:
from sklearn.preprocessing import LabelEncoder

le= LabelEncoder()
le.fit(df.target)
df.target=le.transform(df.target)
df.target.head(5)

In [None]:
# percentage share of each class (after label encoding: 0 = good, 1 = bad)
loans_good_bad=round(((df.target.value_counts()/df.target.count())*100))
plt.pie(loans_good_bad,labels=['Good', 'Bad'], autopct='%1.0f%%', startangle=90)
plt.title('Percentage of Good vs. Bad loans');

In [None]:
df[['credit_amount','duration','age']].describe()

In [None]:
df['credit_amount']=np.log(df['credit_amount'])

In [None]:
df[['credit_amount','duration','age']].describe()

In [None]:
# histograms of continuous variables
# (x-axis: the variable, y-axis: number of observations)

fig, axes = plt.subplots(1,3, figsize=(16,8))
plt.suptitle('Histogram of continuous variables')
axes[0].hist(df['duration'])
axes[0].set_xlabel('Duration (months)')
axes[0].set_ylabel('No. of observations')
axes[0].set_title('Histogram of loan duration');

# credit_amount was log-transformed above, so the axis is on a log scale
axes[1].hist(df['credit_amount'])
axes[1].set_xlabel('Log of credit amount')
axes[1].set_ylabel('No. of observations')
axes[1].set_title('Histogram of Credit amount');

axes[2].hist(df['age'])
axes[2].set_xlabel('Age (years)')
axes[2].set_ylabel('No. of observations')
axes[2].set_title('Histogram of Age');

In [None]:
# box-plots of continuous variables

fig, ax = plt.subplots(1,3,figsize=(20,5))
plt.suptitle('BOX PLOTS')
# pass the series as x= explicitly; newer seaborn versions require keyword arguments
sns.boxplot(x=df['credit_amount'], ax=ax[0]);
sns.boxplot(x=df['duration'], ax=ax[1], color='salmon');
sns.boxplot(x=df['age'], ax=ax[2], color='darkviolet');

**3.3 Relationship between the credit amount and repayment duration**

* scatter plot

**Observations**

The scatter plot shows that, in general, larger loans have longer repayment durations. Large loans granted with a short repayment period have tended to turn out to be bad loans.

In [None]:
sns.scatterplot(data=df, x='duration', y='credit_amount', hue='target', s=100);

**3.4 Exploration of categorical variables**

**Relationship between credit risk and skills of loan applicant**

* Bar-graph

**Observations**

The graph shows that applicants who are unemployed/unskilled pose a high risk.

In [None]:
df.groupby('job')['target'].value_counts().unstack(level=1).plot.barh(stacked=True, figsize=(10, 6))

**3.5 Relationship between credit amount and duration of the loan**

* Line graph

**Observation**

The credit amount (log-transformed above) tends to increase with the duration: the larger the credit amount, the longer the repayment duration.

In [None]:
sns.lineplot(data=df, x='duration', y='credit_amount', hue='target', palette='deep');

**3.6 Relationship between the most valuable asset of the candidate and the credit amount, credit risk**

* stacked bar chart
* scatter plot

The categorical coding used in the graphs is :

* A121 : real estate
* A122 : if not A121 : building society savings agreement/life insurance
* A123 : if not A121/A122 : car or other, not in attribute 6
* A124 : unknown / no property

**Observations**

The graphs show that applicants with no known property (code A124) are the riskiest group.

In [None]:
df.groupby('most_valuable_asset')['target'].value_counts().unstack(level=1).plot.barh(stacked=True, figsize=(10, 6))

In [None]:
sns.scatterplot(y=df.credit_amount, 
                x=df.most_valuable_asset, 
                hue=df.target, 
                s=100, 
                );

## 4. Encode categorical variables

Most machine learning models cannot handle categorical variables directly, so we need to encode the 13 categorical variables in the German dataset.

In [None]:
# Number of unique classes in each object column
df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

We have categorical variables with 2 to 10 categories. We use label encoding for variables with only two categories and one-hot encoding for variables with more than two categories. In label encoding, we assign an integer to each unique category in a categorical variable. No new columns are created. In one-hot encoding, we create a new column for each unique category in a categorical variable.

The main downside of one-hot encoding is that the number of features (dimensions of the data) can explode for categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or another dimensionality reduction method to reduce the number of dimensions (while still trying to preserve information).

For label encoding, we use the Scikit-Learn LabelEncoder and for one-hot encoding, the pandas get_dummies(df) function.
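
To make the difference between the two encodings concrete, here is a minimal sketch on a toy dataframe. The values and the name `toy` are made up for illustration and are not rows from the German credit data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not taken from the German credit data)
toy = pd.DataFrame({
    'telephone': ['yes', 'no', 'no', 'yes'],              # two categories
    'purpose': ['car', 'radio_tv', 'car', 'education'],   # three categories
})

# Label encoding: one integer per category, no new columns
toy['telephone'] = LabelEncoder().fit_transform(toy['telephone'])

# One-hot encoding: one new 0/1 column per category of the remaining text column
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

The two-category column stays a single column of 0s and 1s, while the three-category column is replaced by three indicator columns.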

In [None]:
# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le_count = 0

for col in df:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# one-hot encoding of categorical variables
df = pd.get_dummies(df)

print('Encoded Features shape: ', df.shape)

Now that we have encoded the variables, let's continue with the EDA. 

**4.1 Correlation between the variables**

Let's look at correlations between the features and the target using the Pearson correlation coefficient. In this case, a positive correlation represents correlation with credit default while a negative correlation represents correlation with credit repayment.

**Observations:**

Positive correlation:
* People with checking accounts with a negative balance (`account_bal_neg_bal`) are more likely to default on the loan.

* Longer duration loans (`duration`) tend to be defaulted on.

Negative correlation:

* People with no checking account (`account_bal_no_acc`) are more likely to repay the loan.

In [None]:
# Find correlations with the target and sort
correlations = df.corr()['target'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

In [None]:
# Extract the significantly correlated variables
corr_data = df[['target', 'account_bal_neg_bal','duration','account_bal_no_acc']]
corr_data_corrs = corr_data.corr()


# Heatmap of correlations
sns.heatmap(corr_data_corrs,  vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

## 5. Feature engineering
Feature engineering refers to creating the most useful features out of the data. This represents one of the patterns in machine learning: feature engineering often has a greater return on investment than model building and hyperparameter tuning. [[Source]](https://www.featurelabs.com/blog/secret-to-data-science-success/)

Feature engineering refers to a general process and can involve both **feature construction**: adding new features from the existing data, and **feature selection**: choosing only the most important features or other methods of dimensionality reduction. There are many techniques we can use to both create features and select features.

For this problem, we will try to construct polynomial features.

### Polynomial Features
Here, we find interactions between the significant features and check the correlation between the interaction features and the target. If the interaction features have a greater correlation with the target than the original features, they are included in the machine learning model, as they can help the model learn better.

In [None]:
# Make a new dataframe for polynomial features
poly_features = df[['duration','account_bal_neg_bal','account_bal_no_acc']]
poly_target=df['target']

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 2)
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
print('Polynomial Features shape: ', poly_features.shape)

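This expands the three input features into ten columns (a constant term, the original features, their squares and their pairwise products). To get the names, we use the transformer's `get_feature_names_out` method (scikit-learn 1.0 and later; older releases call it `get_feature_names`).

In [None]:
poly_transformer.get_feature_names_out(input_features = ['duration','account_bal_neg_bal','account_bal_no_acc'])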

Now, we can see whether any of these new features are correlated with the target.

In [None]:
# Create a dataframe for polynomial features
# (get_feature_names_out replaces get_feature_names in scikit-learn 1.0 and later)
poly_features = pd.DataFrame(
    poly_features, columns = poly_transformer.get_feature_names_out(
        ['duration','account_bal_neg_bal','account_bal_no_acc']))

# Add in the target
poly_features['target'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['target'].sort_values()

# Display the correlations
poly_corrs

Some of the new variables have a greater (in terms of absolute magnitude) correlation with the target than the original features; others, such as the square of a 0/1 dummy column, are redundant copies of existing columns.
We will add these features to a copy of the German dataset and then evaluate models with and without the features.

In [None]:
list(poly_features)

In [None]:
# drop columns of poly_features that already exist in df (including 'target'),
# plus the constant column '1'

duplicate_cols = [col for col in poly_features.columns if col in df.columns]
poly_features = poly_features.drop(columns=duplicate_cols + ['1'])
list(poly_features)

In [None]:
# Print shape of the original dataframe
print('Original features shape: ', df.shape)

# Merge polynomial features into the dataframe
df_poly = df.merge(poly_features, left_index=True, right_index=True, how = 'left')

# Print out the new shapes
print('Merged polynomial features shape: ', df_poly.shape)

In [None]:
df_poly.isna().any().any()

## 6. Data split to train and test datasets

In [None]:
from sklearn.model_selection import train_test_split
x, y = df.drop('target', axis=1), df['target']
x.shape, y.shape

x_train, x_test, y_train, y_test= train_test_split(x,y, test_size=.2, random_state=42)
x_train.shape, x_test.shape

In [None]:
# Let's normalize the features so that no feature has undue influence on the model.

from sklearn.preprocessing import MinMaxScaler

# scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# fit on the training features only, then apply the same scaling to both sets
# (fitting on the test set would overwrite the training fit and leak test information)
scaler.fit(x_train)
x_train= scaler.transform(x_train)
x_test= scaler.transform(x_test)
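
The same concern applies inside cross-validation. Below is a minimal sketch, on synthetic data rather than the German credit data, of one way to handle it: wrapping the scaler and the model in a scikit-learn pipeline, so that the scaler is refitted on the training part of each fold. The names `X_demo`, `y_demo`, `pipe` and `skf_demo` are illustrative assumptions, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the credit data (roughly 70/30 class balance)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.7, 0.3], random_state=42)

# The scaler is refitted on the training part of every fold, never on the held-out part
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=skf_demo, scoring='roc_auc')

print(scores.mean())
```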

## 7. Models

**Evaluation criteria**

Let's have a look at the different options available.

|Evaluation criteria|Description|
|:---|:---|
|Accuracy|(true positives + true negatives) / total observations|
|Precision|true positives / total predicted positives|
|Recall|true positives / total actual positives|
|F1|2 * precision * recall / (precision + recall)|
|AUC ROC|Area under the ROC curve (TPR vs. FPR across all classification thresholds)|

* Accuracy: The German dataset is imbalanced. A model can score high accuracy by always predicting the majority class while failing to identify the minority class, the defaulters. Hence, accuracy is not a suitable metric for this dataset.

* Precision: Precision is a good metric when the cost of a false positive is high, for example in email spam detection.

* Recall: This metric is suitable when the cost of a false negative is high, for example when a defaulter is predicted as a non-defaulter, which causes a large loss for the bank. Hence, this is a suitable metric for our case.

* F1: The harmonic mean of precision and recall.

* AUC ROC: The area under the ROC curve, which plots TPR against FPR. The other criteria discussed here are computed at a single decision threshold (0.5 by default), which may not always be the best choice. The AUC evaluates the performance of the model across all classification thresholds. The higher the AUC, the better the model.
  * True positive rate (TPR) = TP / total actual positives
  * False positive rate (FPR) = FP / total actual negatives

We will use Recall and AUC ROC as evaluation metrics.
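
As a worked example of the definitions above, the sketch below computes the threshold-based metrics for ten made-up predictions (the labels are hypothetical, with 1 standing for a defaulter). It shows how a model can reach 70% accuracy while catching only one of three defaulters.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (1 = bad loan / defaulter, 0 = good loan); hypothetical values
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))
```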

In [None]:
import numpy as np
y.value_counts(normalize=True)

This means that the baseline accuracy is 70%, i.e., even if we classify all the samples as non-defaulters (the majority class), we will be 70% accurate.
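
A rough sketch of this baseline, using scikit-learn's `DummyClassifier` on a toy target with the same 70/30 split. The arrays `X_toy` and `y_toy` are assumptions made for illustration, not the dataset itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy target with a 70/30 split (0 = good, 1 = bad)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)
baseline_recall = recall_score(y_toy, y_hat)
print(baseline_accuracy, baseline_recall)
```

The baseline reaches 70% accuracy but identifies none of the bad loans, which is why recall is tracked alongside AUC.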

**Models without tuning**

In [None]:
# import packages, functions, and classes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import roc_auc_score, recall_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

In [None]:
# prepare models
models = []
models.append(('DT', DecisionTreeClassifier(random_state=42)))
models.append(('LR', LogisticRegression(random_state=42)))
models.append(('RF', RandomForestClassifier(random_state=42)))
models.append(('NB', GaussianNB())) 
models.append(('XGB', XGBClassifier(random_state=42)))
models.append(('KNN', KNeighborsClassifier())) 
models.append(('SVM', SVC(gamma='auto',random_state=42)))
# Models currently left out of the comparison:
# models.append(('LDA', LinearDiscriminantAnalysis()))
# models.append(('CART', DecisionTreeClassifier()))

# evaluate each model in turn
results_recall = []
results_roc_auc= []
names = []
# recall= tp/ (tp+fn). Best value=1, worst value=0
scoring = ['recall', 'roc_auc']

for name, model in models:
        # split dataset into k folds. use one fold for validation and remaining k-1 folds for training
        skf= StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
        # Evaluate a score by cross-validation. Returns array of scores of the model for each run of the cross validation.
        #cv_results = cross_val_score(model, x_train, y_train, cv=skf, scoring=scoring)
        cv_results = cross_validate(model, x_train, y_train, cv=skf, scoring=scoring)
        results_recall.append(cv_results['test_recall'])
        results_roc_auc.append(cv_results['test_roc_auc'])
        names.append(name)
        msg = "%s- recall:%f roc_auc:%f" % (name, cv_results['test_recall'].mean(),cv_results['test_roc_auc'].mean())
        print(msg)
        
# boxplot algorithm comparison
fig = plt.figure(figsize=(11,6))
fig.suptitle('Recall scoring Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_recall, showmeans=True)
ax.set_xticklabels(names)
plt.show();

fig = plt.figure(figsize=(11,6))
fig.suptitle('AUC scoring Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_roc_auc, showmeans=True)
ax.set_xticklabels(names)
plt.show();


The Gaussian NB model has the highest `recall` score. However, logistic regression, random forests, XGBoost and SVM have better AUC scores than Gaussian NB. Now let us tune hyperparameters for each of these models.

In [None]:
cv_results

In [None]:
results_roc_auc

In [None]:
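# ROC AUC plot
from sklearn.metrics import roc_curve, roc_auc_score

# Refit a few of the untuned models and score the test set.
# The stacked model is not defined at this point, so it is left out.
roc_models = {'LR': LogisticRegression(random_state=42),
              'SVC': SVC(gamma='auto', probability=True, random_state=42),
              'NB': GaussianNB(),
              'XGB': XGBClassifier(random_state=42)}

rows = []
for name, model in roc_models.items():
    # probability of the positive class (bad loan)
    y_score = model.fit(x_train, y_train).predict_proba(x_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    rows.append({'models': name, 'fpr': fpr, 'tpr': tpr, 'auc': roc_auc_score(y_test, y_score)})
results_table = pd.DataFrame(rows).set_index('models')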

for i in results_table.index:
    plt.plot(results_table.loc[i]['fpr'], 
             results_table.loc[i]['tpr'], 
             label = "{}, AUC={:.3f}".format(i, results_table.loc[i]['auc']))

plt.plot([0,1], [0,1], color = 'black', linestyle = '--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop = {'size':13}, loc = 'lower right')
plt.show()

### 7.2 Logistic regression

In [None]:
tuned_models_test=[]
tuned_models_train=[]

# Create the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001, random_state=42)

# Train on the training data
log_reg.fit(x_train, y_train)

# Evaluate on test dataset
recall_test= recall_score(y_test,log_reg.predict(x_test))
roc_test=roc_auc_score(y_test,log_reg.predict_proba(x_test)[:, 1])
print('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2))
tuned_models_test.append(('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2)))

# Evaluate on train dataset
roc_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='roc_auc').mean()
recall_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='recall').mean()
print('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2))
tuned_models_train.append(('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2)))
print(classification_report(y_test, log_reg.predict(x_test)))
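
The value of `C` above is fixed by hand. As a hedged sketch of how it could instead be chosen by cross-validated grid search scored on recall, the example below uses synthetic data; the candidate values in `param_grid` and the names `X_demo`, `y_demo` and `cv_demo` are illustrative assumptions, not tuned settings for the German credit data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the credit data (roughly 70/30 class balance)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.7, 0.3], random_state=42)

param_grid = {'C': [0.0001, 0.01, 1, 100]}  # illustrative candidates only
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Try every candidate C and keep the one with the best mean recall across folds
search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42),
                      param_grid, scoring='recall', cv=cv_demo)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```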

In [None]:
from sklearn.metrics import roc_curve

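# Generate ROC curve values: fpr, tpr, thresholds
# roc_curve needs predicted scores for the positive class, not the fitted model itself
fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(x_test)[:, 1])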
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
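
Each point on a ROC curve corresponds to one decision threshold. The sketch below, again on synthetic data with assumed names (`X_demo`, `clf`, `recalls`), illustrates how moving the threshold away from the default 0.5 changes recall: lowering it flags more applicants as bad risks, so recall can only stay the same or go up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (roughly 70/30 class balance)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (proba >= threshold).astype(int)  # classify as positive above the threshold
    recalls[threshold] = recall_score(y_te, y_hat)
    print(threshold, round(recalls[threshold], 2))
```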

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
model_knn = KNeighborsClassifier()
model_knn.fit(x_train, y_train)
# evaluate: cross-validated accuracy on the training set, then accuracy on the test set
print('Train accuracy:', cross_val_score(model_knn, x_train, y_train, cv=skf).mean())
print('Test accuracy:', accuracy_score(y_test, model_knn.predict(x_test)))

### XGBoost

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
model_xg = XGBClassifier(random_state=42)
model_xg.fit(x_train, y_train)
# evaluate: cross-validated accuracy on the training set, then accuracy on the test set
print('Train accuracy:', cross_val_score(model_xg, x_train, y_train, cv=skf).mean())
print('Test accuracy:', accuracy_score(y_test, model_xg.predict(x_test)))

In [None]:
# Initialize the set of parameters for exhaustive search and fit 
parameters = {'max_features': [7, 10, 16, 18], 
              'min_samples_leaf': [1, 3, 5, 7], 
              'max_depth': [15, 20, 24, 27]}
rf = XGBClassifier(n_estimators=50, random_state=42, n_jobs=-1)
gcv = GridSearchCV(rf, parameters, n_jobs=-1, cv=skf, verbose=1)
gcv.fit(x_train, y_train)