# Project Report: German Credit Risk Analysis

## Introduction
The German Credit Risk Analysis project aims to predict credit risk using machine learning techniques. The dataset used in this project contains information about customers' credit applications, including demographic features, credit amount, duration, and risk classification.

## Libraries
- Pandas
- NumPy
- Matplotlib
- Seaborn
- LightGBM
- Scikit-learn
- Imbalanced-learn

## Data Exploration and Preprocessing
- Imported the dataset and preprocessed column names.
- Explored data dimensions and target class balance.
- Visualized numerical and categorical features.
- Preprocessed the data by handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.
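
The preprocessing steps above can be sketched on a small synthetic frame. The column values below are invented for illustration, and `train_test_split` with `stratify` is only one common way to split; the notebook itself relies on a custom `SplitData` transformer whose internals are not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real CSV (all values are made up)
raw = pd.DataFrame({
    'Age': [22, 35, 47, 51, 29, 63, 41, 38, 26, 55],
    'Credit amount': [1200, 5400, 800, 2300, 7600, 1500, 3100, 950, 4200, 2700],
    'Saving accounts': ['little', None, 'rich', 'little', 'moderate',
                        None, 'little', 'little', 'moderate', 'rich'],
    'Risk': ['good', 'bad', 'good', 'good', 'bad',
             'good', 'good', 'bad', 'good', 'bad'],
})

# Normalize column names the same way the notebook does
raw.columns = [col.lower().strip().replace(' ', '_') for col in raw.columns]

# Make missing categories explicit and encode the target as 0/1
raw['saving_accounts'] = raw['saving_accounts'].fillna('unknown')
raw['target'] = (raw['risk'] == 'bad').astype(int)

# One-hot encode the categorical column
X = pd.get_dummies(raw.drop(columns=['risk', 'target']))
y = raw['target']

# Stratify so both splits keep roughly the same good/bad ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying matters here because the target is imbalanced; a purely random split of a small sample can leave very few "bad" cases in the test set.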

## Model Training and Evaluation
- Trained multiple machine learning models, including Logistic Regression, Decision Trees, Random Forest, and LightGBM.
- Tuned hyperparameters using RandomizedSearchCV.
- Evaluated models using various metrics such as accuracy, precision, recall, F1 score, and ROC AUC.
- Plotted ROC curves, confusion matrices, and learning curves to analyze model performance.
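
As a rough sketch of the tuning and evaluation loop described above, the example below runs `RandomizedSearchCV` over a logistic regression on synthetic data. The data, the number of iterations, and the choice of `roc_auc` as the search metric are assumptions made for illustration; the notebook wraps these steps in its own `BinaryClassifiersAnalysis` class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, mildly imbalanced data standing in for the credit dataset
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate values sampled at random by the search
param_distributions = {
    'C': np.linspace(0.1, 10, 20),
    'penalty': ['l1', 'l2'],
    'class_weight': ['balanced', None],
}

search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_distributions=param_distributions,
    n_iter=10, scoring='roc_auc', cv=5, random_state=42,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred),
    'roc_auc': roc_auc_score(y_test, y_score),
}
print(search.best_params_)
print(metrics)
```

Note that ROC AUC is computed from predicted probabilities, while accuracy and F1 use the hard class predictions.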

## Feature Importance Analysis
- Conducted feature importance analysis using LightGBM.
- Visualized feature importance to identify significant predictors in the model.
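
The idea behind this step can be illustrated with a minimal sketch. To keep the example dependency-light it swaps in scikit-learn's `GradientBoostingClassifier` for LightGBM and uses synthetic data with hypothetical feature names; LightGBM's scikit-learn wrapper exposes a `feature_importances_` attribute as well, although by default it reports split counts rather than normalized values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names for a synthetic problem
feature_names = [f'feature_{i}' for i in range(6)]
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Stand-in for the LightGBM model used in the notebook
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features from most to least important
importance = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importance)
```

A horizontal bar chart of `importance` (for instance with `importance.plot(kind='barh')`) gives the kind of visualization referred to above.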

## Model Deployment
- Built a preprocessing pipeline to handle data transformations.
- Applied the trained model to new data for credit risk prediction.
- Appended predicted probabilities and classes to the dataset for further analysis.
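
A minimal sketch of this scoring flow is shown below, assuming a tiny invented dataset and a plain `LogisticRegression` in place of the tuned model. The key point is that the pipeline is fitted once on training data and only applied (never refitted) to new records.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training data and new applications to score
train = pd.DataFrame({
    'age': [25, 40, 33, 58, 47, 29],
    'housing': ['own', 'rent', 'own', 'free', 'rent', 'own'],
    'target': [0, 1, 0, 0, 1, 1],
})
new_data = pd.DataFrame({'age': [36, 52], 'housing': ['rent', 'other']})

prep = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    # handle_unknown='ignore' encodes unseen categories as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['housing']),
])
scoring_pipeline = Pipeline([('prep', prep), ('model', LogisticRegression())])

# Fit once, on training data only
scoring_pipeline.fit(train[['age', 'housing']], train['target'])

# New data is only transformed and scored, never refitted
scored = new_data.copy()
scored['y_score'] = scoring_pipeline.predict_proba(new_data)[:, 1]
scored['y_pred'] = scoring_pipeline.predict(new_data)
scored['y_class'] = scored['y_pred'].map({1: 'Bad', 0: 'Good'})
print(scored)
```

Bundling preprocessing and model in one `Pipeline` means a single object can be persisted and reloaded for scoring, which avoids mismatches between training-time and scoring-time transformations.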

## Conclusion
The German Credit Risk Analysis project successfully developed machine learning models to predict credit risk based on customer attributes. The LightGBM model demonstrated the best performance with an accuracy of XX% and an ROC AUC of XX%. Feature importance analysis identified key predictors influencing credit risk. The deployed model can be used for real-time credit risk assessment in financial institutions.


### Libraries:

This cell imports various standard libraries and modules necessary for data analysis, visualization, machine learning, and pipeline construction. Here's a breakdown of what each import statement does:

- **pandas**: Used for data manipulation and analysis.
- **numpy**: Provides support for large, multi-dimensional arrays and matrices, along with mathematical functions.
- **matplotlib.pyplot**: A plotting library for creating static, interactive, and animated visualizations in Python.
- **seaborn**: Built on top of matplotlib, seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
- **datetime**: Provides classes for manipulating dates and times.
- **time**: Provides various time-related functions.
- **matplotlib.gridspec**: Allows the creation of subplots with different sizes and alignments.

Additionally, the code imports utility functions, custom transformers, and machine learning models:

- **viz_utils**: Contains utility functions for data visualization.
- **ml_utils**: Contains utility functions for machine learning tasks.
- **custom_transformers**: Contains custom transformer classes for data preprocessing.
- **Pipeline**: Enables constructing pipelines in scikit-learn for sequential execution of multiple data processing steps.
- **ColumnTransformer**: Allows applying different transformations to different columns or subsets of data.
- **StandardScaler**: Scales features by removing the mean and scaling to unit variance.
- **SimpleImputer**: Imputes missing values using a specified strategy.
- **LogisticRegression**: Implements logistic regression for binary classification.
- **DecisionTreeClassifier**: Implements decision tree-based classifiers.
- **RandomForestClassifier**: Implements random forest classifiers.
- **lightgbm**: A gradient boosting framework that uses tree-based learning algorithms.
- **train_test_split**: Splits data into random train and test subsets.
- **RandomizedSearchCV**: Performs hyperparameter optimization using random search.
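- **cross_val_score**: Evaluates a score by cross-validation.
- **cross_val_predict**: Generates cross-validated predictions for each input sample.
- **learning_curve**: Computes training and validation scores for increasing training set sizes.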
- **classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score, f1_score**: Functions for evaluating classification model performance metrics.
- **RandomUnderSampler, SMOTE**: Techniques for handling imbalanced datasets by undersampling the majority class or oversampling the minority class, respectively.

These imports set up the environment for data analysis, preprocessing, modeling, and evaluation. They provide the necessary tools and functionality to perform various tasks in a machine learning project.


In [None]:
# Standard libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
import time
from matplotlib.gridspec import GridSpec

# Utilities
from utils.viz_utils import *
from utils.ml_utils import *
from utils.custom_transformers import *

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Modeling
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans  # used by the clustering cells further down
import lightgbm as lgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, cross_val_predict, \
                                    learning_curve
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, \
    accuracy_score, precision_score, recall_score, f1_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

### Custom Functions:

The `catplot_percentage_analysis` function is designed to visualize categorical data by plotting the percentage distribution of each category with respect to a specified target variable (`hue`). Here's an overview of what the function does:

- **Purpose**: To analyze the distribution and proportion of categorical variables in a dataset concerning a specific target variable.

- **Steps**:
  1. Retrieve the categorical variables from the dataset.
  2. Set up parameters for plotting, such as the number of columns for the matplotlib figure (`fig_cols`).
  3. Apply loops to generate plots and format them.

- **Arguments**:
  - `df_categorical`: The dataset containing categorical variables to be analyzed (pandas.DataFrame).
  - `hue`: The target variable used for stratification and color differentiation in the plots.
  - `fig_cols`: The number of columns in the matplotlib figure (integer, default is 2).
  - `palette`: The color palette to be used in the plots (string, default is 'viridis').
  - `figsize`: The size of the matplotlib figure (tuple, default is (16, 10)).

- **Return**: None (The function generates and displays matplotlib plots directly).

The function uses seaborn for styling and the pandas plotting interface for drawing. For each categorical variable it builds a cross-tabulation against the target variable (`hue`), normalizes each row so that it sums to 1, and draws the result as stacked horizontal bars. The plots are arranged in a grid with `fig_cols` columns, and any unused subplot slots are hidden to keep the figure clean.

Overall, this function provides a convenient way to explore the relationship between categorical variables and the target variable in a dataset.
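
The row normalization at the heart of the function can be seen in isolation with a small made-up example (the `housing` and `risk` values below are invented):

```python
import pandas as pd

# Invented sample: three 'own', two 'rent', one 'free'
toy = pd.DataFrame({
    'housing': ['own', 'own', 'own', 'rent', 'rent', 'free'],
    'risk': ['good', 'good', 'bad', 'bad', 'good', 'bad'],
})

# Counts of each risk class per housing category
counts = pd.crosstab(toy['housing'], toy['risk'])

# Divide each row by its total so every category sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

The same table can be obtained directly with `pd.crosstab(toy['housing'], toy['risk'], normalize='index')`.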


In [None]:
def catplot_percentage_analysis(df_categorical, hue, fig_cols=2, palette='viridis', figsize=(16, 10)):
    """Plot, for each categorical column, the share of each `hue` class per category."""

    sns.set(style='white', palette='muted', color_codes=True)
    sns.set_palette(palette)
    cat_features = list(df_categorical.drop(hue, axis=1).columns)
    fig_rows = int(np.ceil(len(cat_features) / fig_cols))

    # squeeze=False always returns a 2D array of axes, so no try/except indexing is needed
    fig, axs = plt.subplots(nrows=fig_rows, ncols=fig_cols, figsize=figsize, squeeze=False)
    axes = axs.flatten()

    for ax, col in zip(axes, cat_features):
        # Row-normalized crosstab: the bar of each category sums to 1
        col_to_hue = pd.crosstab(df_categorical[col], df_categorical[hue])
        col_to_hue.div(col_to_hue.sum(axis=1).astype(float), axis=0).plot(kind='barh', stacked=True, ax=ax)

        format_spines(ax, right_border=False)
        ax.set_title(col)
        ax.set_ylabel('')

    # Hiding the subplots that were left without a feature
    for ax in axes[len(cat_features):]:
        ax.axis('off')

    plt.tight_layout()
    plt.show()


In [None]:
# Data path
df_ori = pd.read_csv('german_credit_data.csv')
df = df_ori.iloc[:, 1:].copy()  # drop the index column; copy so later column assignments do not warn
df.columns = [col.lower().strip().replace(' ', '_') for col in df.columns]

# Results
print(f'Data dimension: {df.shape}')
df.head()

In [None]:
# Target class balance
fig, ax = plt.subplots(figsize=(7, 7))
label_names = ['Good', 'Bad']
color_list = ['navy', 'mediumvioletred']
text = f'Total\n{len(df_ori)}'
title = 'Target Class Balance'

# Visualizing it through a donut chart
donut_plot(df, col='risk', ax=ax, label_names=label_names, colors=color_list, title=title, text=text)

In [None]:
# Overview of the data
df_overview = data_overview(df)
df_overview

In [None]:
num_cols = ['age', 'credit_amount', 'duration']
color_sequence = ['navy', 'mediumseagreen', 'navy']
numplot_analysis(df[num_cols], color_sequence=color_sequence, hist=True)
plt.show()

In [None]:
num_cols += ['risk']
numplot_analysis(df[num_cols], hue='risk', color_hue=color_list)

In [None]:
boxenplot(df, features=['age', 'credit_amount', 'duration'], hue='risk', fig_cols=3, figsize=(15, 5), 
          palette=color_list)

In [None]:
cat_features = [col for col, dtype in df.dtypes.items() if dtype == 'object']
catplot_analysis(df[cat_features], palette='plasma')

In [None]:
catplot_analysis(df[cat_features], hue='risk', palette=color_list)

In [None]:
rev_color_list = ['mediumvioletred', 'navy']
catplot_percentage_analysis(df[cat_features], hue='risk', palette=rev_color_list)

In [None]:
mean_sum_analysis(df, group_col='purpose', value_col='credit_amount')

In [None]:
mean_sum_analysis(df, group_col='purpose', value_col='duration')

In [None]:
gender_palette = ['cornflowerblue', 'salmon']
mean_sum_analysis(df, group_col='sex', value_col='credit_amount', orient='horizontal', 
                  palette=gender_palette, figsize=(12, 4))

In [None]:
sns.pairplot(df[num_cols], hue='risk', palette=color_list)
plt.show()

In [None]:
# Summing only credit_amount avoids aggregating the non-numeric columns
amount_risk = df.groupby(by='risk', as_index=False)['credit_amount'].sum()
amount_risk['percentage'] = amount_risk['credit_amount'] / amount_risk['credit_amount'].sum()
amount_risk.style.background_gradient(cmap='Reds_r')

In [None]:
# Creating figure
fig, ax = plt.subplots(figsize=(7, 7))

# Defining useful elements for the donut chart
values = amount_risk['credit_amount']
labels = amount_risk['risk']
center_circle = plt.Circle((0, 0), 0.8, color='white')

# Plotting the pie chart and the center circle
ax.pie(values, labels=labels, colors=['darkred', 'cadetblue'], autopct=make_autopct(values))
ax.add_artist(center_circle)

kwargs = dict(size=20, fontweight='bold', va='center')
ax.text(0, 0, f'Total Amount\n${values.sum()}', ha='center', **kwargs)
ax.set_title('Credit Amount Made Available to Customers by Risk', size=14, color='dimgrey')
plt.show()

In [None]:
# Creating figure
fig, axs = plt.subplots(ncols=2, figsize=(15, 5))

# Scatterplot
sns.scatterplot(x='age', y='credit_amount', data=df, hue='housing', ax=axs[0], palette='magma', alpha=.8)
sns.scatterplot(x='age', y='credit_amount', data=df, hue='job', ax=axs[1], palette='YlGnBu')

# Customizing plot
format_spines(axs[0], right_border=False)
format_spines(axs[1], right_border=False)
axs[0].set_title('Credit Amount and Age Distribution by Housing', size=12, color='dimgrey')
axs[1].set_title('Credit Amount and Age Distribution by Job', size=12, color='dimgrey')
plt.show()

In [None]:
g = (sns.jointplot(x='credit_amount', y='duration', data=df, color='seagreen', kind='hex'))

In [None]:
# Creating new categories for duration col
bins = [0, 10, 30, 50, np.inf]
labels = ['<= 10', 'between 10 and 30', 'between 30 and 50', '> 50']
df['cat_duration'] = pd.cut(df['duration'], bins=bins, labels=labels)

# Creating figure
fig, ax = plt.subplots(figsize=(10, 5))

# Scatterplot
sns.scatterplot(x='age', y='credit_amount', data=df, hue='cat_duration', palette='YlGnBu')

# Customizing plot
format_spines(ax, right_border=False)
ax.set_title('Credit Amount and Age Distribution by Duration Category', size=14, color='dimgrey')
ax.legend(loc='upper right', fancybox=False, framealpha=0.2)
df.drop('cat_duration', axis=1, inplace=True)
plt.show()

In [None]:
# Finding the best number of clusters
columns = ['age', 'duration']
cluster_data = df.loc[:, columns]
K_min, K_max = 1, 8

# Call the elbow_method_kmeans1 function
elbow_method_kmeans1(cluster_data, K_min, K_max)

In [None]:
# Fitting K-Means with 3 clusters and plotting the result
model = KMeans(n_clusters=3)
model.fit(cluster_data)
plot_kmeans_clusters_2d(cluster_data, model)

In [None]:
# Finding the best number of clusters
columns = ['credit_amount', 'duration']
cluster_data = df.loc[:, columns]
K_min, K_max = 1, 8
elbow_method_kmeans1(cluster_data, K_min, K_max)

In [None]:
# Fitting K-Means with 2 clusters and plotting the result
model = KMeans(n_clusters=2)
model.fit(cluster_data)
plot_kmeans_clusters_2d(cluster_data, model)

In [None]:
# Creating a target column
df['target'] = df['risk'].apply(lambda x: 1 if x == 'bad' else 0)
df.drop('risk', axis=1, inplace=True)
df.head()

In [None]:
# Building the preprocessing Pipeline
preprocessing_pipeline = Pipeline([
    ('dup_dropped', DropDuplicates()),
    ('data_splitter', SplitData(target='target'))
])

# Applying this pipeline
X_train, X_test, y_train, y_test = preprocessing_pipeline.fit_transform(df)
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of X_test: {X_test.shape}')

In [None]:
# Splitting the data by dtype
num_features = [col for col, dtype in X_train.dtypes.items() if dtype != 'object']
cat_features = [col for col, dtype in X_train.dtypes.items() if dtype == 'object']

# Building a numerical pipeline
num_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Building a categorical pipeline
cat_pipeline = Pipeline([
    ('encoder', DummiesEncoding(dummy_na=True))
])

# Building a complete pipeline
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

In [None]:
# Applying the data prep pipeline
X_train_prep = full_pipeline.fit_transform(X_train)
X_test_prep = full_pipeline.transform(X_test)  # transform only: reuse what was fitted on the training set

print(f'Shape of X_train_prep: {X_train_prep.shape}')
print(f'Shape of X_test_prep: {X_test_prep.shape}')

In [None]:
# Returning the final features of the dataset
encoded_features = full_pipeline.named_transformers_['cat']['encoder'].features_after_encoding
model_features = num_features + encoded_features
df_train_prep = pd.DataFrame(X_train_prep, columns=model_features)
df_train_prep.head()

In [None]:
# Logistic Regression hyperparameters
logreg_param_grid = {
    'C': np.linspace(0.1, 10, 20),
    'penalty': ['l1', 'l2'],
    'class_weight': ['balanced', None],
    'random_state': [42],
    'solver': ['liblinear']
}

# Decision Trees hyperparameters
tree_param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 5, 10, 20],
    'max_features': np.arange(1, X_train.shape[1]),
    'class_weight': ['balanced', None],
    'random_state': [42]
}

# Random Forest hyperparameters
forest_param_grid = {
    'bootstrap': [True, False],
    'max_depth': [3, 5, 10, 20, 50],
    'n_estimators': [50, 100, 200, 500],
    'random_state': [42],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
    'class_weight': ['balanced', None]
}

# LightGBM hyperparameters
lgbm_param_grid = {
    'num_leaves': list(range(8, 92, 4)),
    'min_data_in_leaf': [10, 20, 40, 60, 100],
    'max_depth': [3, 4, 5, 6, 8, 12, 16],
    'learning_rate': [0.1, 0.05, 0.01, 0.005],
    'bagging_freq': [3, 4, 5, 6, 7],
    'bagging_fraction': np.linspace(0.6, 0.95, 10),
    'reg_alpha': np.linspace(0.1, 0.95, 10),
    'reg_lambda': np.linspace(0.1, 0.95, 10),
}

lgbm_fixed_params = {
    'objective': 'binary',  # 'application' is an alias of 'objective', so only one is kept
    'metric': 'auc',
    'is_unbalance': True,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

In [None]:
# Setting up classifiers
set_classifiers = {
    'LogisticRegression': {
        'model': LogisticRegression(),
        'params': logreg_param_grid
    },
    'DecisionTrees': {
        'model': DecisionTreeClassifier(),
        'params': tree_param_grid
    },
    'RandomForest': {
        'model': RandomForestClassifier(),
        'params': forest_param_grid
    },
    'LightGBM': {
        'model': lgb.LGBMClassifier(**lgbm_fixed_params),
        'params': lgbm_param_grid
    }
}

In [None]:
# Creating an instance of the custom class BinaryClassifiersAnalysis
clf_tool = BinaryClassifiersAnalysis()
clf_tool.fit(set_classifiers, X_train_prep, y_train, random_search=True, cv=5, verbose=5)

In [None]:
# Evaluating metrics
df_performances = clf_tool.evaluate_performance(X_train_prep, y_train, X_test_prep, y_test, cv=5)
df_performances.reset_index(drop=True).style.background_gradient(cmap='Blues')

In [None]:
fig, ax = plt.subplots(figsize=(13, 11))
lgbm_feature_importance = clf_tool.feature_importance_analysis(model_features, specific_model='LightGBM', ax=ax)
plt.show()

In [None]:
clf_tool.plot_roc_curve()

In [None]:
clf_tool.plot_confusion_matrix(classes=['Good', 'Bad'])

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
clf_tool.plot_learning_curve('LightGBM', ax=ax)
plt.show()

In [None]:
clf_tool.plot_score_distribution('LightGBM', shade=True)

In [None]:
clf_tool.plot_score_bins(model_name='LightGBM', bin_range=0.1)

In [None]:
# Libs
import numpy as np
import pandas as pd
from utils.custom_transformers import *
from sklearn.pipeline import Pipeline

# Reading the raw data
df_ori = import_data('german_credit_data.csv', optimized=True)
df = df_ori.iloc[:, 1:]
df.columns = [col.lower().strip().replace(' ', '_') for col in df.columns]
df['target'] = df['risk'].apply(lambda x: 1 if x == 'bad' else 0)
df.drop('risk', axis=1, inplace=True)

# Applying the already-fitted data prep pipeline (it could also be loaded from a pkl file)
scoring_data = full_pipeline.transform(df)

# Using the trained model for predicting (the pkl file could be read from a specific path)
model = clf_tool.classifiers_info['LightGBM']['estimator']
y_pred = model.predict(scoring_data)
y_score = model.predict_proba(scoring_data)[:, 1]

# Appending the predictions to the data
df['y_score'] = y_score
df['y_pred'] = y_pred
df['y_class'] = df['y_pred'].apply(lambda x: 'Bad' if x == 1 else 'Good')

# Creating score decile bins ('faixa' is Portuguese for 'band'); 'Faixa 1' holds the highest scores
bins = df['y_score'].quantile(np.arange(0, 1.01, 0.1)).values
labels = ['Faixa ' + str(i) for i in range(len(bins)-1, 0, -1)]
df['faixa'] = pd.cut(df['y_score'], bins=bins, labels=labels, include_lowest=True)
df.head()