# Credit Card Application Approval

This project works with a dataset of credit card applications. Based on the features given in the dataset, the task is to predict whether a person's request for a credit card is approved or denied.

## Dataset

Information on the "Credit Approval" dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) can be found here:

* Download URL: https://archive.ics.uci.edu/static/public/27/credit+approval.zip
* DOI: https://doi.org/10.24432/C5FS30
* Dataset creators: J. R. Quinlan
* License: Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

## Tasks

Below is a summary of the individual subtasks you are required to work on during this project.

### Exploratory Data Analysis (EDA)

Perform a thorough analysis of the data. Preferably, use well-established tools from the Python package ecosystem such as [Pandas](https://pandas.pydata.org/docs), [Matplotlib](https://matplotlib.org/stable/index.html) / [Seaborn](https://seaborn.pydata.org/). Another helpful tool is [Ydata Profiling](https://docs.profiling.ydata.ai/).

Things to consider for the analysis:

* Visualise as much as possible. Make your visualisations easy to understand by using, e.g., axis labels and titles.
* Take into account differences regarding the features such as categorical vs. continuous.
* Consider correlations between different features. Also analyse how single features are correlated with the target.
* Check for missing values.

### Machine Learning (ML)

Apply machine learning models of your choice to solve this classification task. Again, use appropriate tools such as those found in the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library. You may also consider using tools such as [XGBoost](https://xgboost.readthedocs.io/en/latest/python/) or a neural network based on [PyTorch](https://pytorch.org/docs/stable/index.html) or [TensorFlow](https://www.tensorflow.org/api_docs).

Things to consider:

* Make sure to split your data into train and test data before using any ML model.
* Think about how to handle missing values and how to deal with features of different types (categorical and continuous). This also pertains to techniques such as feature encoding (e.g., refer to [this link from the Scikit-Learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html)) and feature engineering (e.g., frequency / count encoding or target encoding for categorical features).
* Use data processing pipelines to have a clean way of preparing your data for a particular ML model. Note that different types of models (e.g., Logistic Regression vs. Gradient Boosted Trees) may require different preparation steps for the data.
* Choose a proper metric (or several if appropriate) to evaluate a given model.
* Optimise the hyper-parameters of your ML models to achieve the best possible performance on the data.
* Compare different ML models.

### Comments

Document your workflow appropriately. If you choose to work with Jupyter Notebooks, this can be achieved by having dedicated notebooks for different parts of the project (e.g., EDA and ML models). Within a single notebook, use sections and comments to document important decisions and the intent of your analysis.

Your notebooks will look much cleaner and become a lot easier to comprehend if you avoid code duplication. That is, before using many code snippets that only differ slightly, consider finding a common abstraction and have a single dedicated place for this code (e.g., inside a function or a class) that enables easy reuse. It is oftentimes suitable to move code to a Python module. This module can then be readily imported in your Jupyter notebooks.

It should be possible to (easily) reproduce your results by re-executing your notebooks.

If you are working in groups, it must be obvious which group member has conducted which part of the work. Hence, please make sure to add annotations inside the docstrings of functions / classes or appropriate comments in the sections of your Jupyter notebooks.

## Presentation of Results

### Oral Presentation

In the presentation you are meant to present the workflow during the project as well as the main results (in total 20 - 40 minutes for *all* members of the group combined, *not* per group member). Outline which tools you have used (e.g., Pandas, Scikit-Learn) and how you have approached the data to arrive at certain results. Also discuss the choice / usage of your ML models in relation to the EDA.

Choose a suitable medium such as MS Office-style slides or Jupyter notebooks. If you are using the latter, please pay special attention to conciseness and a clean structure. Present your results comprehensibly by using, e.g., flow charts for representing workflows and figures / tables for summarizing quantitative results. Please pay special attention to the legibility of axis labels, titles and legends in plots as well as to colors and line types.

### Comments

If you are working in groups it must be obvious from your presentation which group member has conducted which part of the work.

## Grading

The grade is determined 100% by the presentation.

In the case of group work, *every group member will get an individual grade*. It therefore must be obvious from your presentation which group member is responsible for which part of the work. Group members may also, for example, conduct different quantitative analyses of the data (by considering different ML models).

# import & fetch data

In [None]:
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import xgboost as xgb

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, FunctionTransformer
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split, StratifiedKFold, cross_validate
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

#from ydata_profiling import ProfileReport

In [None]:
%matplotlib qt

In [None]:
credit_approval = fetch_ucirepo(id=27)

X = credit_approval.data.features
y = credit_approval.data.targets

# EDA

In [None]:
credit_approval.variables

In [None]:
X

In [None]:
y

In [None]:
y.value_counts()

In [None]:
# replace with Marcel's toolbox
from ydata_profiling import ProfileReport  # this import is commented out in the first cell

profile = ProfileReport(X, title="Profiling Report")
profile

## correlations

A4 and A5 have a correlation of 1 -> discard A5

A6 and A7 have a correlation of 0.57

A9 and A16 (target) have a correlation of .72
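
An association of 1 between two categorical columns such as A4 and A5 can be cross-checked by hand. The following is a minimal sketch, assuming Cramér's V as the association measure (the profiling report may use a different one). The toy columns and the helper name `cramers_v` are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V (without bias correction) for two categorical series."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical toy columns: `right` is a pure relabelling of `left`.
toy = pd.DataFrame(
    {
        "left": ["u", "u", "y", "y", "l", "u"],
        "right": ["g", "g", "p", "p", "gg", "g"],
        "other": ["a", "b", "a", "b", "a", "b"],
    }
)
v_same = cramers_v(toy["left"], toy["right"])
v_other = cramers_v(toy["left"], toy["other"])
print(f"relabelled pair: {v_same:.3f}, unrelated pair: {v_other:.3f}")
```

A value of 1 means that one column fully determines the other, which is the usual justification for dropping one of the two.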

## univariate

A1 binary, 12 missing values
replace missing values with most frequent category (b)

A2 continuous, 12 missing values
replace missing values with median

A3 let be

A4 ternary, 6 missing values
merge categories y and l into a 'not-u' category, replace missing values with most frequent

A5 discarded, correlation of 1 with A4

A6 relatively uniformly distributed over 13 categories, not really sure what to do
maybe discard, because correlated by >.5 with A7
(maybe PCA with A7) (alternatively: frequency encode)

A7 9 categories
summarise to categories v, h, 'not-v-or-h', encode missing values as most frequent

A8 let be

A9 let be (note: highly correlated with target)

A10 let be

A11 continuous, replace 67 and 40 with median
(maybe bin to zero and non-zero)

A12 let be

A13 summarise to g and not-g

A14 replace 2000 with median (maybe bin into a more-than-500-feature), replace with median

A15 encode as zero and non-zero

A16 let be

min-max scale continuous variables
one-hot encode categorical variables
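
Several of the notes above come down to three operations on a categorical column: imputing the most frequent value, collapsing rare categories, and frequency encoding. A minimal pandas sketch follows, using a hypothetical stand-in column rather than the real A7 values. In the actual pipeline the mode and the frequencies would have to be learnt on the training data only.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as A7.
col = pd.Series(["v", "h", "bb", "v", None, "ff", "v", "h"], name="A7")

# 1) Impute missing values with the most frequent category.
most_frequent = col.mode().iloc[0]
filled = col.fillna(most_frequent)

# 2) Collapse every category outside the kept set into one bucket.
keep = {"v", "h"}
collapsed = filled.where(filled.isin(keep), "not-v-or-h")

# 3) Frequency encoding: replace each category by its relative frequency.
freq = filled.value_counts(normalize=True)
freq_encoded = filled.map(freq)

print(pd.DataFrame({"filled": filled, "collapsed": collapsed, "freq": freq_encoded}))
```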

# data preprocessing
baseline classifier: 82% accuracy

## train-test split

In [None]:
# train/test split (fixed seed and stratification for reproducible, balanced splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)


# further split the training set into training and validation sets
X_train_actual, X_valid, y_train_actual, y_valid = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42, stratify=y_train
)

## Feature engineering

Apply the transformations identified during EDA.

Evaluate the effectiveness of the feature engineering by comparing classifier performance with and without the transformations applied.
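
One way to run such a comparison is to build two otherwise identical pipelines and score both with the same cross-validation splitter. The sketch below uses a hypothetical toy frame (columns `cat` and `num`) and hypothetical helpers `collapse_to_u` and `build`; it illustrates the procedure only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.default_rng(0)
n = 200
# Hypothetical toy frame: one categorical and one numeric feature.
toy_X = pd.DataFrame(
    {
        "cat": rng.choice(["u", "y", "l"], size=n, p=[0.7, 0.25, 0.05]),
        "num": rng.normal(size=n),
    }
)
toy_y = ((toy_X["cat"] == "u") ^ (toy_X["num"] > 1.0)).astype(int)


def collapse_to_u(values):
    """Map every category except 'u' to 'not-u'."""
    return np.where(np.asarray(values) == "u", "u", "not-u")


def build(cat_steps):
    """Wrap the given categorical steps in a full model pipeline."""
    pre = make_column_transformer(
        (make_pipeline(*cat_steps), ["cat"]),
        ("passthrough", ["num"]),
    )
    return make_pipeline(pre, LogisticRegression(max_iter=1000))


# Same splitter for both variants so that the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
variants = {
    "baseline": build([OneHotEncoder(handle_unknown="ignore")]),
    "collapsed": build(
        [FunctionTransformer(collapse_to_u), OneHotEncoder(handle_unknown="ignore")]
    ),
}
scores = {
    name: cross_val_score(model, toy_X, toy_y, cv=cv, scoring="accuracy").mean()
    for name, model in variants.items()
}
print(scores)
```

Keeping the transformation inside the pipeline means it is re-fitted on every training fold, so the comparison is not affected by leakage from the validation fold.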

In [None]:
feat_eng_transformer = make_column_transformer(
    (
       make_pipeline(
           SimpleImputer(strategy='most_frequent'),
           OneHotEncoder(drop="first")
       ),
        ['A1']
    ),
    
    (
       make_pipeline(
           SimpleImputer(strategy='median'),
       ),
        ['A2']
    ),
    # A5
    remainder="drop",
)

In [None]:
feat_eng_transformer

In [None]:
X.A4.value_counts()

In [None]:
# Impute missing A4 values, then collapse every category except 'u' into 'n'.
pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    FunctionTransformer(lambda col: np.where(col == 'u', 'u', 'n'))
)
pipe

In [None]:
data = X.A4.values.reshape(-1,1)

In [None]:
r = pipe.fit_transform(data)

In [None]:
r

In [None]:
feat_eng_transformer = make_column_transformer(
    (
       make_pipeline(
           SimpleImputer(strategy='most_frequent'),
           OneHotEncoder(drop='first', categories=[['u', 'y', 'l']]),  # one list of categories per column
       ),
        ['A4']
    ),
    remainder="drop",
    sparse_threshold=0
)

In [None]:
feat_eng_transformer.fit_transform(X)

In [None]:
cca_pipeline =   make_column_transformer(
    # categorical
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False),
        ),
        credit_approval.variables[(credit_approval.variables.type=='Categorical') & (credit_approval.variables.role =='Feature')].name.values
    ),
    # continuous
    (
        make_pipeline(
            SimpleImputer(strategy="median"), MinMaxScaler()
        ),
        credit_approval.variables[credit_approval.variables.type=='Continuous'].name.values
    ),
    remainder="passthrough",
    verbose=True,
    verbose_feature_names_out=True,
)
cca_pipeline

### PCA of A6 and A7
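
A minimal sketch of what a joint PCA of two categorical columns could look like, assuming the columns are one-hot encoded first and the indicator matrix is then projected onto a few components. The toy values are hypothetical. `pd.get_dummies` is used for brevity here; inside a pipeline an encoder that is fitted on the training data would be the safer choice.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the two correlated categorical columns.
toy = pd.DataFrame(
    {
        "A6": ["c", "q", "w", "c", "q", "w", "c", "i"],
        "A7": ["v", "h", "v", "v", "h", "v", "v", "bb"],
    }
)

# One indicator column per category of A6 and A7.
indicators = pd.get_dummies(toy[["A6", "A7"]]).astype(float)

# Project the indicator matrix onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(indicators)
explained = float(pca.explained_variance_ratio_.sum())
print(components.shape, round(explained, 3))
```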

## Feature selection

### Analysis of variance

### Sequential Feature Selection

In [None]:
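# NOTE: `df_train_actual` and the `custom_feature_selection` module used below
# are not defined in this notebook and must be provided before running this cell.
estimator = RandomForestClassifier(n_estimators=2, random_state=42)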
cv = StratifiedKFold(n_splits=5, shuffle=False)

sfs = SequentialFeatureSelector(
    estimator=clone(estimator),
    n_features_to_select=15,
    direction="forward",
    scoring=make_scorer(accuracy_score),
    n_jobs=-1,
    cv=cv,
).fit(df_train_actual, y_train_actual)

sfs_custom = custom_feature_selection.SequentialFeatureSelector(
    estimator=clone(estimator),
    n_features_to_select=15,
    scorer=make_scorer(accuracy_score),
    direction="forward",
    verbose=1,
    n_jobs=-1,
    cv=cv,
).fit(df_train_actual, y_train_actual)

# Machine Learning

In [None]:
"""
Combining categorical / numerical data:
- Separate classifiers, e.g. decision tree + regressor
- Encoding of categorical data.


Feature Engineering:
- Binarising highly imbalanced features
- Introducing "unknown" category for missing values
- Summarise categories
- Binning continuous variables
- Frequency encoding
- Target encoding


Machine Learning Models:
- LogisticRegression
- RandomForest (incl. missing values support)
- LightGBM
- XGBoost (doesn't work, also not in lecture script 02.3)
- LinearSupportVectorClassifier

Feature selection:
- Forward / backward feature selection
- Recursive / sequential feature selection


Compare
- LinearSVC vs SVC vs Adaboost


Metrics:
- Cross validation
- Model evaluation metrics (FDR, TPR), precision/recall, ROC_AUC
- Graphics where all models are compared

"""

In [None]:
"""

Index

EDA
 encoding
 split
 PCA
 Dimension reduction
Classifier
Hyperparameter tuning

"""

In [None]:
cca_pipeline = make_pipeline(
    make_column_transformer(
        # categorical
        (
            make_pipeline(
                SimpleImputer(strategy="most_frequent"),
                OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False),
            ),
            credit_approval.variables[(credit_approval.variables.type=='Categorical') & (credit_approval.variables.role =='Feature')].name.values
        ),
        # continuous
        (
            make_pipeline(
                SimpleImputer(strategy="median"), MinMaxScaler()
            ),
            credit_approval.variables[credit_approval.variables.type=='Continuous'].name.values
        ),
        remainder="passthrough",
        verbose=True,
        verbose_feature_names_out=True,
    ),
    #xgb.XGBRegressor(),#objective="reg:linear", random_state=42),
    LogisticRegression(max_iter=100_000),
)
cca_pipeline

In [None]:
"""
Current status

XGBoost somehow does not work with the current data types, so for now I have put LogisticRegression into the pipeline instead.

With logistic regression you also get the warning shown below that some unknown categories are still encoded as zeros. This is odd,
because the SimpleImputer should actually ensure that no unknown categories remain in the data (they should all be replaced by the most frequent category).

"""

In [None]:
cross_validate(
    estimator=cca_pipeline,
    X=X_train_actual,
    y=y_train_actual.values.ravel(), 
    cv=StratifiedKFold(n_splits=7, shuffle=True, random_state=42),
    scoring="accuracy")
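
One plausible cause of the warning mentioned in the note above (an assumption, not verified against the actual run): `SimpleImputer` only replaces *missing* values. A rare category that is absent from a training fold but present in the corresponding validation fold is still unknown to the fitted `OneHotEncoder`. The sketch below reproduces this with two hypothetical folds; `drop="first"` is left out so that the all-zero row is easy to see.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical folds: the rare category "l" occurs only in the validation fold.
train_fold = np.array([["u"], ["y"], [np.nan], ["u"]], dtype=object)
valid_fold = np.array([["u"], ["l"]], dtype=object)

encoder = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
encoder.fit(train_fold)

# Categories learnt from the training fold only.
learned = list(encoder[-1].categories_[0])
# The unseen category is not imputed; it is encoded as a row of zeros.
encoded_valid = encoder.transform(valid_fold).toarray()
print(learned)
print(encoded_valid)
```

If this is indeed the cause, merging rare categories before encoding, as planned in the EDA notes, would be one way to avoid it.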

In [None]:
cv_cca = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

cv_results = run_cv(
    cca_transform_xgb,
    X_train_actual,
    y_train_actual.replace({'-': 1, '+': 0}),
    cv=cv_cca,
)
print_test_scores(cv_results)
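
The helpers `run_cv` and `print_test_scores` are not shown in this notebook. A comparable evaluation with several metrics can be sketched with scikit-learn's own `cross_validate`. The data below is synthetic and the model dictionary is hypothetical; the sketch only shows how several models and metrics can be collected in one structure for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data standing in for the preprocessed features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "roc_auc"]
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean test score per model and metric across the folds.
summary = {}
for name, model in models.items():
    results = cross_validate(model, X_toy, y_toy, cv=cv, scoring=scoring)
    summary[name] = {m: float(results[f"test_{m}"].mean()) for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```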

## Hyperparameter tuning

In [None]:
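from sklearn.pipeline import Pipeline  # not part of the imports in the first cell

pipe = Pipeline(
    steps=[
        #("encoder", ce.OneHotEncoder()),
        # "reg:squarederror" replaces the deprecated "reg:linear" objective
        ('xgb', xgb.XGBRegressor(objective="reg:squarederror", random_state=42))
    ]
)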
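
For the tuning itself, a grid search over a pipeline is one common option. The following is a minimal sketch on synthetic data, assuming logistic regression as the model and its regularisation strength `C` as the only tuned parameter. The grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the training split.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    model,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_toy, y_toy)
best_C = search.best_params_["logisticregression__C"]
print(best_C, round(search.best_score_, 3))
```

The search should only ever see the training split; the test split stays untouched until the final evaluation.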

In [None]:
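X = credit_approval.data.features
# encode the target numerically (same mapping as in the cross-validation cell above)
y = credit_approval.data.targets['A16'].map({'-': 1, '+': 0})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)#, random_state=42)


#xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

# use the numeric columns only; the categorical ones are not encoded here
pipe.fit(X_train.select_dtypes(include="number"), y_train)

y_pred = pipe.predict(X_test.select_dtypes(include="number"))

mse = mean_squared_error(y_test, y_pred)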

In [None]:
1 - mse  # 1 - MSE of the regression output; not the same as classification accuracy
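
Note that `1 - mse` on 0/1 labels is not an accuracy. A small sketch with hypothetical labels and regression-style scores shows the difference; the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical 0/1 labels and regression-style scores.
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.4, 0.6, 0.3, 0.2])

one_minus_mse = 1 - mean_squared_error(y_true, y_score)

# Threshold the scores at 0.5 to obtain class predictions.
y_label = (y_score >= 0.5).astype(int)
accuracy = accuracy_score(y_true, y_label)

print(f"1 - MSE: {one_minus_mse:.3f}, accuracy: {accuracy:.3f}")
```

For a classification task, a classifier together with metrics such as accuracy or ROC AUC is easier to interpret than a regressor scored by MSE.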

## model evaluation