# Credit Card Application Approval

This project is concerned with a dataset dealing with credit card applications. Based on the features given in the dataset, the task is to predict whether a person's request for a credit card is approved or denied.

## Dataset

Information on the "Credit Approval" dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) can be found here:

* Download URL: https://archive.ics.uci.edu/static/public/27/credit+approval.zip
* DOI: https://doi.org/10.24432/C5FS30
* Dataset creators: J. R. Quinlan
* License: Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

## Tasks

Below you can find a summary of the individual subtasks you are required to work on during this project.

### Exploratory Data Analysis (EDA)

Perform a thorough analysis of the data. Preferably, use well-established tools from the Python package ecosystem such as [Pandas](https://pandas.pydata.org/docs), [Matplotlib](https://matplotlib.org/stable/index.html) / [Seaborn](https://seaborn.pydata.org/). Another helpful tool is [Ydata Profiling](https://docs.profiling.ydata.ai/).

Things to consider for the analysis:

* Visualise as much as possible. Make your visualisation easy to understand by using, e.g., labels for the axes or titles.
* Take into account differences regarding the features such as categorical vs. continuous.
* Consider correlations between different features. Also analyse how single features are correlated with the target.
* Check for missing values.
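
The points above can be sketched on a small toy frame. This is only an illustration: the column names `num_a`, `cat_b` and `target` and all values are made up and do not come from the Credit Approval data. It counts missing values per column, correlates a continuous feature with a 0/1 encoded target, and compares approval rates across the levels of a categorical feature.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real features and target.
toy = pd.DataFrame(
    {
        "num_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
        "cat_b": ["x", "y", "x", None, "y", "y"],
        "target": ["+", "-", "+", "-", "-", "+"],
    }
)

# Missing values per column.
missing_per_column = toy.isna().sum()

# Continuous feature vs. binary target: Pearson correlation with the
# 0/1 encoded target (pairs containing NaN are ignored by pandas).
target01 = (toy["target"] == "+").astype(int)
corr_num_target = toy["num_a"].corr(target01)

# Categorical feature vs. target: share of "+" per category
# (rows where cat_b is missing are dropped by groupby).
approval_rate = target01.groupby(toy["cat_b"]).mean()

print(missing_per_column)
print(f"corr(num_a, target) = {corr_num_target:.3f}")
print(approval_rate)
```

For two categorical columns, a contingency table (`pd.crosstab`) is usually a more natural starting point than a correlation coefficient.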

### Machine Learning (ML)

Apply machine learning models of your choice to solve this classification task. Again, use appropriate tools such as those found in the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library. You may also consider using tools such as [XGBoost](https://xgboost.readthedocs.io/en/latest/python/) or a neural network based on [PyTorch](https://pytorch.org/docs/stable/index.html) or [TensorFlow](https://www.tensorflow.org/api_docs).

Things to consider:

* Make sure to split your data into train and test data before using any ML model.
* Think about how to handle missing values and how to deal with features of different types (categorical and continuous). This also pertains to techniques such as feature encoding (e.g., refer to [this link from the Scikit-Learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html)) and feature engineering (e.g., frequency / count encoding or target encoding for categorical features).
* Use data processing pipelines to have a clean way of preparing your data for a particular ML model. Note that different types of models (e.g., Logistic Regression vs. Gradient Boosted Trees) may require different preparation steps for the data.
* Choose a proper metric (or several if appropriate) to evaluate a given model.
* Optimise the hyper-parameters of your ML models to achieve the best possible performance on the data.
* Compare different ML models.
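
As a rough sketch of how these points can fit together, the example below builds a preprocessing pipeline, splits the data before fitting, and tunes one hyper-parameter with cross-validation. The toy data, the column names `income` and `status`, the choice of logistic regression and the parameter grid are all assumptions made for illustration, not a recommended setup for this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one continuous and one categorical feature.
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame(
    {
        "income": rng.normal(size=n),
        "status": rng.choice(["a", "b", "c"], size=n),
    }
)
signal = toy_X["income"] + (toy_X["status"] == "a") + rng.normal(scale=0.5, size=n)
toy_y = (signal > 0.3).astype(int)
toy_X.loc[::17, "income"] = np.nan  # inject a few missing values

# Split first, so that imputation and scaling are learned on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["income"]),
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        ["status"],
    ),
)
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Tune the regularisation strength with 5-fold CV on the training data.
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_tr, y_tr)
test_auc = search.score(X_te, y_te)  # ROC AUC on the held-out test set
print(search.best_params_, f"test ROC AUC = {test_auc:.3f}")
```

A tree-based model would typically not need the scaling step, which is one reason to keep a separate pipeline per model type.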

### Comments

Document your workflow appropriately. If you choose to work with Jupyter Notebooks, this can be achieved by having dedicated notebooks for different parts of the project (e.g., EDA and ML models). Within a single notebook, use sections and comments to document important decisions and the intent of your analysis.

Your notebooks will look much cleaner and become a lot easier to comprehend if you avoid code duplication. That is, before using many code snippets that only differ slightly, consider finding a common abstraction and have a single dedicated place for this code (e.g., inside a function or a class) that enables easy reuse. It is oftentimes suitable to move code to a Python module. This module can then be readily imported in your Jupyter notebooks.
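
A minimal sketch of this idea: a small helper that could live in a module and be imported by several notebooks. The module name `eda_utils.py` and the function `summarise_missing` are hypothetical and only illustrate the pattern.

```python
# Contents of a hypothetical module, e.g. eda_utils.py
import pandas as pd


def summarise_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values per column, most affected first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    return summary.sort_values("n_missing", ascending=False)


# In a notebook this would become: from eda_utils import summarise_missing
example = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": ["x", "y", None, "z"]})
print(summarise_missing(example))
```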

It should be possible to (easily) reproduce your results by re-executing your notebooks.

If you are working in groups, it must be obvious which group member has conducted which part of the work. Hence, please make sure to add annotations inside the docstrings of functions / classes or appropriate comments in the sections of your Jupyter notebooks.

## Presentation of Results

### Oral Presentation

In the presentation you are meant to present the workflow during the project as well as the main results (in total 20 - 40 minutes for *all* members of the group combined, *not* per group member). Outline which tools you have used (e.g., Pandas, Scikit-Learn) and how you have approached the data to arrive at certain results. Also discuss the choice / usage of your ML models in relation to the EDA.

Choose a suitable medium such as MS-Office-like slides or Jupyter notebooks. If you are using the latter, please pay special attention to conciseness and a clean structure. Prepare your results comprehensibly by using, e.g., flow-charts for representing workflows and figures / tables for summarizing quantitative results. Please pay special attention to the legibility of axes labels, titles and legends in plots as well as to colors and line types.

### Comments

If you are working in groups it must be obvious from your presentation which group member has conducted which part of the work.



In [156]:
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import xgboost as xgb
import category_encoders as ce

import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import uniform, randint

from sklearn.pipeline import Pipeline, make_pipeline, make_union
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, MinMaxScaler
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split, StratifiedKFold, cross_validate
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

#from ydata_profiling import ProfileReport

In [4]:
%matplotlib qt

# fetch dataset

In [5]:
credit_approval = fetch_ucirepo(id=27)

# data (as pandas dataframes)

In [6]:
X = credit_approval.data.features
y = credit_approval.data.targets

# metadata

In [7]:
print(credit_approval.metadata)

{'uci_id': 27, 'name': 'Credit Approval', 'repository_url': 'https://archive.ics.uci.edu/dataset/27/credit+approval', 'data_url': 'https://archive.ics.uci.edu/static/public/27/data.csv', 'abstract': 'This data concerns credit card applications; good mix of attributes', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 690, 'num_features': 15, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['A16'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1987, 'last_updated': 'Wed Aug 23 2023', 'dataset_doi': '10.24432/C5FS30', 'creators': ['J. R. Quinlan'], 'intro_paper': None, 'additional_info': {'summary': 'This file concerns credit card applications.  All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.\r\n  \r\nThis dataset is interesting because there is a good mix of attributes --

# variable information

In [8]:
credit_approval.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,A16,Target,Categorical,,,,no
1,A15,Feature,Continuous,,,,no
2,A14,Feature,Continuous,,,,yes
3,A13,Feature,Categorical,,,,no
4,A12,Feature,Categorical,,,,no
5,A11,Feature,Continuous,,,,no
6,A10,Feature,Categorical,,,,no
7,A9,Feature,Categorical,,,,no
8,A8,Feature,Continuous,,,,no
9,A7,Feature,Categorical,,,,yes


## Grading

The grade is determined 100% by the presentation.

In the case of group work, *every group member will get an individual grade*. It must therefore be obvious from your presentation which group member is responsible for which part of the work. Group members may also, for example, conduct different quantitative analyses of the data (by considering different ML models).

In [9]:
"""
Useful tools
- pipelines
- feature union
-

- confusion matrix


"""

'\nUseful tools\n- pipelines\n- feature union\n-\n\n- confusion matrix\n\n\n'

# Exploratory Data Analysis

## Dataset general overview

In [10]:
X

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,202.0,g,f,1,t,t,1.25,v,w,g,u,0.000,30.83,b
1,560,43.0,g,f,6,t,t,3.04,h,q,g,u,4.460,58.67,a
2,824,280.0,g,f,0,f,t,1.50,h,q,g,u,0.500,24.50,a
3,3,100.0,g,t,5,t,t,3.75,v,w,g,u,1.540,27.83,b
4,0,120.0,s,f,0,f,t,1.71,v,w,g,u,5.625,20.17,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,260.0,g,f,0,f,f,1.25,h,e,p,y,10.085,21.08,b
686,394,200.0,g,t,2,t,f,2.00,v,c,g,u,0.750,22.67,a
687,1,200.0,g,t,1,t,f,2.00,ff,ff,p,y,13.500,25.25,a
688,750,280.0,g,f,0,f,f,0.04,v,aa,g,u,0.205,17.92,b


In [11]:
y.value_counts()

A16
-      383
+      307
dtype: int64
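
The two classes are fairly balanced (383 denied vs. 307 approved), so accuracy is a usable metric here. As a reference point, the sketch below computes the accuracy of always predicting the majority class, using only the counts shown above. The placeholder feature matrix is an assumption needed to call the scikit-learn API; `DummyClassifier` ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output above.
n_denied, n_approved = 383, 307
y_counts = np.array(["-"] * n_denied + ["+"] * n_approved)
X_placeholder = np.zeros((len(y_counts), 1))  # ignored by DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_accuracy = baseline.score(X_placeholder, y_counts)
print(f"majority-class accuracy: {baseline_accuracy:.3f}")  # 383 / 690
```

Any real model should clearly beat this value of roughly 0.555.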

In [None]:
# replace with Marcel's toolbox
from ydata_profiling import ProfileReport  # commented out in the import cell above

profile = ProfileReport(X, title="Profiling Report")
profile

## Conclusion report:

## Categorical features

In [None]:
"""
Histogram of all individual features on a grid
- Maybe remove some sparse values.
- How much information is shared between different features?
- How many values are there in each combination of the categorical types?


Information measurement of single features (is this useful if we have so many features?)


"""

In [12]:
credit_approval.variables[credit_approval.variables.type=='Categorical']

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,A16,Target,Categorical,,,,no
3,A13,Feature,Categorical,,,,no
4,A12,Feature,Categorical,,,,no
6,A10,Feature,Categorical,,,,no
7,A9,Feature,Categorical,,,,no
9,A7,Feature,Categorical,,,,yes
10,A6,Feature,Categorical,,,,yes
11,A5,Feature,Categorical,,,,yes
12,A4,Feature,Categorical,,,,yes
15,A1,Feature,Categorical,,,,yes


In [189]:
cat_cols = ["A13", "A12", "A10", "A9", "A7", "A6", "A5", "A4", "A1"]
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(15, 6))
# Bar chart of category counts per feature; NaN is kept as its own bar.
for col, axis in zip(cat_cols, axes.flat):
    X[col].value_counts(dropna=False).plot.bar(ax=axis, title=col)
    axis.set_ylabel("count")
axes.flat[-1].remove()  # nine features on a 2x5 grid: drop the unused axis
fig.tight_layout()

Are there any obvious strong dependencies?
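
One way to approach this question for categorical features is Cramér's V, computed from the chi-squared statistic of a contingency table. The sketch below is an assumption about how this could be done, not part of the original analysis, and it uses made-up toy columns. In the data preview above, A4 and A5 appear to move together (`u` with `g`, `y` with `p`), so that pair might be a natural first candidate to check.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V of two categorical series (0 = independent, 1 = fully dependent).

    Assumes both series take at least two distinct values.
    """
    table = pd.crosstab(a, b)  # rows with NaN in either series are dropped
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical toy columns: "q" is a relabelled copy of "p", "r" is unrelated.
toy = pd.DataFrame(
    {
        "p": ["u", "u", "y", "y", "u", "y", "u", "y"],
        "q": ["g", "g", "p", "p", "g", "p", "g", "p"],
        "r": ["t", "f", "t", "f", "f", "t", "t", "f"],
    }
)
v_dependent = cramers_v(toy["p"], toy["q"])
v_independent = cramers_v(toy["p"], toy["r"])
print(v_dependent, v_independent)
```

Applied to every pair of categorical columns, the values can be collected in a matrix and shown as a heatmap.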

## Numerical features

In [None]:
"""
- No obvious strong correlations.
- Standardize features.

Further analysis:
- Principal component analysis.


- Outlier removal/replacement with mean/median (gaussian distribution?)

"""

In [None]:
sns.pairplot(X.dropna())

In [None]:
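# PCA on the numeric columns only; rows containing NaN are dropped first.
numeric = X.select_dtypes(include="number").dropna()
pca = PCA(n_components=5)
pca.fit(numeric)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)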
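
The notes above list standardising the features. This matters for PCA in particular, because without scaling the component directions are dominated by whichever column has the largest variance. A small sketch on made-up data (two independent columns on very different scales, not the credit data) illustrates the effect:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two independent columns on very different scales.
rng = np.random.default_rng(0)
toy = np.column_stack(
    [rng.normal(scale=1000.0, size=500), rng.normal(scale=1.0, size=500)]
)

# Without scaling, the first component captures almost all variance.
raw_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_

# With standardisation, both columns contribute roughly equally.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(toy)
scaled_ratio = scaled_pca[-1].explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```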

## NaN analysis

In [None]:
"""
- Drop rows with much missing data
- Data imputation for rest of NaN's, look at distribution of column values to decide to replace with median/mean.

"""

In [None]:
# Number of NaN's per feature
X.isna().sum()

In [None]:
# Total share of rows with at least one NaN
X.isna().any(axis=1).mean()

In [None]:
# How often do NaNs occur together in a row? (How much data would we lose if we simply dropped every row with a NaN?)
X.isna().sum(axis=1).value_counts()

Simply dropping every row that contains any NaN removes only ~5% of the data.\
This is a loss we are willing to accept in the first run. We will come back later and try different ways of handling NaNs to optimize performance.
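
As a sketch of the alternative mentioned in the notes, the example below contrasts dropping rows with imputing them: the median for a continuous column and the most frequent value for a categorical one. The toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one continuous and one categorical column.
toy = pd.DataFrame(
    {
        "amount": [1.0, 2.0, np.nan, 100.0],
        "kind": ["a", "b", "a", np.nan],
    }
)

dropped = toy.dropna(how="any")  # loses every row with at least one NaN

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed = toy.copy()
imputed["amount"] = num_imputer.fit_transform(toy[["amount"]]).ravel()
imputed["kind"] = cat_imputer.fit_transform(toy[["kind"]]).ravel()

print(len(dropped), len(imputed))
print(imputed)
```

The median is less sensitive to the outlier (100.0) than the mean would be. When a model is evaluated, the imputers should be fitted on the training data only, which is easiest to guarantee by placing them inside a pipeline.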

In [None]:
X_clean = X.dropna(how='any')

# Machine Learning

In [None]:
"""
Approaches for combining categorical / numerical data:
- Separate classifiers, e.g. decision tree + regressor
- Encoding of categorical data.

Feature selection:
- Forward / backward feature selection
- Recursive / sequential feature selection

Models:
- Regression
- XGBoost
- Neural network
- Random forest with missing data imputation
- LightGBM

Evaluating classifier performance:
- Cross validation
- Model evaluation metrics (FDR, TPR), precision/recall, ROC_AUC
- Graphics where all models are compared

"""

## Data preparation

## Baseline classifier: 82% accuracy

In [151]:
# train / test split (stratified and seeded so that results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)


# further split the training data into training and validation sets
X_train_actual, X_valid, y_train_actual, y_valid = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42, stratify=y_train
)

In [187]:
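# Select feature columns only. The variables table also lists the target A16 as
# categorical; since A16 is not a column of X, including it makes fitting fail with
# "ValueError: A given column is not a column of the dataframe".
variables = credit_approval.variables
features = variables[variables.role == "Feature"]
categorical_cols = features[features.type == "Categorical"].name.values
continuous_cols = features[features.type == "Continuous"].name.values

cca_transform_xgb = make_pipeline(
    make_union(
        # categorical: impute with the mode, then one-hot encode
        make_column_transformer(
            (
                make_pipeline(
                    SimpleImputer(strategy="most_frequent"),
                    OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False),
                ),
                categorical_cols,
            ),
            remainder="drop",
            verbose=False,
            verbose_feature_names_out=False,
        ),
        # continuous: impute with the median, then scale to [0, 1]
        make_column_transformer(
            (
                make_pipeline(
                    SimpleImputer(strategy="median"), MinMaxScaler()
                ),
                continuous_cols,
            ),
            remainder="drop",
            verbose=False,
            verbose_feature_names_out=False,
        )
    ),
    # Accuracy and average precision are classification metrics, so use a
    # classifier rather than XGBRegressor.
    xgb.XGBClassifier(random_state=42),
    #LogisticRegression(max_iter=100_000),
)
cca_transform_xgb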

In [154]:
def print_test_scores(
    cv_results,
    scorings=(
        "test_accuracy",
        "train_accuracy",
        "test_average_precision",
        "train_average_precision",
    ),
):
    for score in scorings:
        print(
            f"mean {score:30s} score = {cv_results[score].mean():.5f} +/- {cv_results[score].std():.5f}"
        )

def run_cv(
    estimator, X, y, cv, scoring=("accuracy", "average_precision"), verbosity_level=0
):
    return cross_validate(
        estimator=estimator,
        X=X,
        y=y,
        cv=cv,
        scoring=scoring,
        return_estimator=True,
        return_indices=True,
        return_train_score=True,
        verbose=verbosity_level,
    )
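
Besides the `accuracy` and `average_precision` scorers used above, the notes list TPR, FDR, precision/recall and ROC AUC. The following sketch shows how these relate to the confusion matrix. The labels and scores are made up, the 0.5 threshold is an arbitrary choice, and 1 is assumed to be the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, roc_auc_score

# Hypothetical labels and predicted probabilities (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # threshold chosen for illustration

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (recall)
fdr = fp / (fp + tp)  # false discovery rate = 1 - precision
precision = precision_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)  # threshold-independent

print(f"TPR={tpr:.2f} FDR={fdr:.2f} precision={precision:.2f} ROC AUC={roc_auc:.4f}")
```

ROC AUC and average precision are computed from the scores directly, so they do not depend on the chosen threshold, whereas accuracy, TPR and FDR do.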

In [186]:
y_train_actual.replace({'-': 1, '+': 0})

Unnamed: 0,A16
411,1
339,1
433,1
313,1
320,0
...,...
29,0
536,1
329,1
441,1


In [188]:
cross_validate(
    estimator=cca_transform_xgb,
    X=X_train_actual,
    y=y_train_actual.replace({'-': 1, '+': 0}),
    cv=StratifiedKFold(n_splits=7, shuffle=True, random_state=42),
    scoring="accuracy")

ValueError: 
All the 7 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
7 fits failed with the following error:
Traceback (most recent call last):
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'A16'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/__init__.py", line 447, in _get_column_indices
    col_idx = all_columns.get_loc(col)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'A16'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 416, in fit
    Xt = self._fit(X, y, **fit_params_steps)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 370, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/joblib/memory.py", line 353, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 950, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 1255, in fit_transform
    results = self._parallel_func(X, y, fit_params, _fit_transform_one)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 1277, in _parallel_func
    return Parallel(n_jobs=self.n_jobs)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 65, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/joblib/parallel.py", line 1863, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/joblib/parallel.py", line 1792, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 127, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/pipeline.py", line 950, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 740, in fit_transform
    self._validate_column_callables(X)
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 448, in _validate_column_callables
    transformer_to_input_indices[name] = _get_column_indices(X, columns)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/__init__.py", line 455, in _get_column_indices
    raise ValueError("A given column is not a column of the dataframe") from e
ValueError: A given column is not a column of the dataframe
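
The traceback reports `KeyError: 'A16'` followed by `ValueError: A given column is not a column of the dataframe`. This means a column transformer was asked for a column that `X` does not contain; here that is the target `A16`, which is listed as categorical in the variables table but is not part of the features. A minimal sketch of how such a mismatch can be detected before fitting, using a made-up two-column frame:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: features only, the target "A16" is kept separately.
toy_X = pd.DataFrame({"A1": ["a", "b", "a"], "A9": ["t", "f", "t"]})
requested = ["A1", "A9", "A16"]  # "A16" is not a column of toy_X

# Cheap check before fitting: which requested columns are missing?
missing = [c for c in requested if c not in toy_X.columns]
print("not in X:", missing)

# Fitting with the full list reproduces the failure.
failed = False
try:
    make_column_transformer((OneHotEncoder(), requested)).fit(toy_X)
except ValueError as err:
    failed = True
    print("fit failed:", err)

# Restricting the list to columns that exist makes the transformer fit.
present = [c for c in requested if c in toy_X.columns]
encoded = make_column_transformer((OneHotEncoder(), present)).fit_transform(toy_X)
print(encoded.shape)
```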


In [176]:
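# Assumption: same CV scheme and label encoding as in the cross_validate call above.
cv_cca = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

cv_results = run_cv(
    cca_transform_xgb,
    X_train_actual,
    y_train_actual.replace({'-': 1, '+': 0}),
    cv=cv_cca,
)
print_test_scores(cv_results)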

mean test_accuracy                  score = nan +/- nan
mean train_accuracy                 score = nan +/- nan
mean test_average_precision         score = nan +/- nan
mean train_average_precision        score = nan +/- nan


Traceback (most recent call last):
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 136, in __call__
    score = scorer._score(
            ^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 353, in _score
    y_pred = method_caller(estimator, "predict", X)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/_response.py", line 109, in _get_response_values
    y_pred, pos_label = estimator.predict(X), None
                        ^^^^^^^^^^^^^^^^^
  File "/home/max/miniconda3/envs/ki2/lib/python3.11/site-packages/sklearn/utils/_available_if.py", line 31, in __get__
    if not self.check(obj):
     

### Sequential feature selection

In [None]:
estimator = RandomForestClassifier(n_estimators=2, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=False)

sfs = SequentialFeatureSelector(
    estimator=clone(estimator),
    n_features_to_select=15,
    direction="forward",
    scoring=make_scorer(accuracy_score),
    n_jobs=-1,
    cv=cv,
).fit(df_train_actual, y_train_actual)

sfs_custom = custom_feature_selection.SequentialFeatureSelector(
    estimator=clone(estimator),
    n_features_to_select=15,
    scorer=make_scorer(accuracy_score),
    direction="forward",
    verbose=1,
    n_jobs=-1,
    cv=cv,
).fit(df_train_actual, y_train_actual)

In [None]:
pd.DataFrame(tmp)

In [None]:
targets = credit_approval.data.targets.replace({'+':1,'-':0})
targets

In [None]:
pipe=Pipeline(
    steps = [
        #("encoder", ce.OneHotEncoder()),
        # "reg:linear" was deprecated in XGBoost in favour of "reg:squarederror"
        ('xgb', xgb.XGBRegressor(objective="reg:squarederror", random_state=42))

    ]
)

In [None]:
X = credit_approval.data.features
y = targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)#, random_state=42)

#xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

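# Numeric columns only; XGBoost handles NaN natively, so no imputation is needed here.
pipe.fit(X_train.select_dtypes(include="number"), y_train)

y_pred = pipe.predict(X_test.select_dtypes(include="number"))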

mse=mean_squared_error(y_test, y_pred)

In [None]:
# 1 - MSE of the regression output against the 0/1 targets (not a classification accuracy)
1-mse
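
Note that `1 - mse` is not the same as accuracy: it measures how close the continuous regression outputs are to the 0/1 targets, while accuracy counts correct class decisions. The sketch below contrasts the two on made-up numbers. The 0.5 threshold used to turn regression outputs into class labels is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical 0/1 targets and regressor outputs.
y_true = np.array([1, 0, 1, 0, 1])
y_reg = np.array([0.9, 0.2, 0.4, 0.6, 0.8])

mse = mean_squared_error(y_true, y_reg)
# Turn the regression output into class labels with an assumed 0.5 threshold.
accuracy = accuracy_score(y_true, (y_reg >= 0.5).astype(int))

print(f"1 - MSE  = {1 - mse:.3f}")
print(f"accuracy = {accuracy:.3f}")
```

Here `1 - MSE` is 0.838 while only 3 of 5 predictions are on the correct side of the threshold, so the two numbers should not be compared directly with the accuracy of a classifier.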