# Credit Card Application Approval

This project works with a dataset of credit card applications. Based on the features given in the dataset, the task is to predict whether a person's request for a credit card is approved or denied.

## Dataset

Information on the "Credit Approval" dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) can be found here: 

* Download URL: https://archive.ics.uci.edu/static/public/27/credit+approval.zip
* DOI: https://doi.org/10.24432/C5FS30
* Dataset creators: J. R. Quinlan
* License: Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

## Tasks

Below is a summary of the individual subtasks you are required to work on during this project.

### Exploratory Data Analysis (EDA)

Perform a thorough analysis of the data. Preferably, use well-established tools from the Python package ecosystem such as [Pandas](https://pandas.pydata.org/docs) and [Matplotlib](https://matplotlib.org/stable/index.html) / [Seaborn](https://seaborn.pydata.org/). Another helpful tool is [YData Profiling](https://docs.profiling.ydata.ai/).

Things to consider for the analysis:

* Visualise as much as possible. Make your visualisations easy to understand by using, e.g., axis labels and titles.
* Take into account differences between feature types, such as categorical vs. continuous.
* Consider correlations between different features. Also analyse how individual features are correlated with the target.
* Check for missing values.
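
The checks listed above can be sketched in a few lines of Pandas. The frame `toy` and its column names (`age`, `debt`, `employed`, `approved`) are hypothetical stand-ins invented for illustration; they are not the columns of the Credit Approval dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
toy = pd.DataFrame({
    "age": [22.0, 35.5, np.nan, 48.0, 51.0, 29.0],
    "debt": [0.5, 4.0, 1.5, 7.0, 6.5, 1.0],
    "employed": ["t", "f", "t", "t", None, "f"],
    "approved": [0, 1, 0, 1, 1, 0],
})
features = toy.drop(columns="approved")

# Separate continuous from categorical features by dtype.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Missing values per feature.
missing_per_column = features.isna().sum()

# Correlation of each numeric feature with the binary target.
target_corr = features[numeric_cols].corrwith(toy["approved"])

# Approval rate per category (rows with a missing category are skipped).
approval_rate = toy.groupby("employed")["approved"].mean()
```

For the real data the same calls would be applied to the feature frame and the encoded target; which columns count as categorical should be checked against the dataset's variable description rather than inferred from dtypes alone.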

### Machine Learning (ML)

Apply machine learning models of your choice to solve this classification task. Again, use appropriate tools such as those found in the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library. You may also consider using tools such as [XGBoost](https://xgboost.readthedocs.io/en/latest/python/) or a neural network based on [PyTorch](https://pytorch.org/docs/stable/index.html) or [TensorFlow](https://www.tensorflow.org/api_docs).

Things to consider:

* Make sure to split your data into train and test data before using any ML model.
* Think about how to handle missing values and how to deal with features of different types (categorical and continuous). This also pertains to techniques such as feature encoding (e.g., refer to [this page from the Scikit-Learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html)) and feature engineering (e.g., frequency / count encoding or target encoding for categorical features).
* Use data processing pipelines to have a clean way of preparing your data for a particular ML model. Note that different types of models (e.g., Logistic Regression vs. Gradient Boosted Trees) may require different preparation steps for the data.
* Choose a proper metric (or several if appropriate) to evaluate a given model.
* Optimise the hyper-parameters of your ML models to achieve the best possible performance on the data.
* Compare different ML models.
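
A minimal sketch of how these points can fit together in Scikit-Learn is shown here. It runs on synthetic data (`demo_X`, `demo_y`, and the column names `num_a`, `num_b`, `cat_a` are assumptions made up for this example), and logistic regression with a small grid over `C` is just one possible model and search space, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data, not the Credit Approval dataset.
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["u", "y", "l"], size=n),
})
signal = demo_X["num_a"] + (demo_X["cat_a"] == "u") + rng.normal(scale=0.5, size=n)
demo_y = (signal > 0.5).astype(int)
# Knock out a few values so the imputers have something to do.
demo_X.loc[::17, "num_b"] = np.nan
demo_X.loc[::23, "cat_a"] = np.nan

# Split first, so that no preprocessing step ever sees the test data.
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.33, stratify=demo_y, random_state=0
)

# Separate preparation for continuous and categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["num_a", "num_b"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["cat_a"]),
])
pipe = Pipeline([("preprocess", preprocess),
                 ("model", LogisticRegression(max_iter=1000))])

# Hyper-parameter search with cross-validation on the training data only.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```

Because imputation, scaling and encoding live inside the pipeline, they are re-fitted on each cross-validation fold, which avoids leaking information from the validation folds into the preprocessing.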

### Comments

Document your workflow appropriately. If you choose to work with Jupyter notebooks, this can be achieved by having dedicated notebooks for different parts of the project (e.g., EDA and ML models). Within a single notebook, use sections and comments to document important decisions and the intent of your analysis.

Your notebooks will look much cleaner and become a lot easier to comprehend if you avoid code duplication. That is, before writing many code snippets that differ only slightly, consider finding a common abstraction and keeping this code in a single dedicated place (e.g., inside a function or a class) that enables easy reuse. It is often suitable to move such code to a Python module, which can then be imported in your Jupyter notebooks.

It should be possible to (easily) reproduce your results by re-executing your notebooks.

If you are working in groups, it must be obvious which group member has conducted which part of the work. Hence, please make sure to add annotations inside the docstrings of functions / classes or appropriate comments in the sections of your Jupyter notebooks.
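
As an illustration of both points (one reusable function instead of repeated snippets, and an author annotation in the docstring), a helper could look like the sketch below. The function name `compare_models`, the `Author:` line and the demo data are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_models(models, X, y, cv=5, scoring="accuracy"):
    """Return the mean cross-validated score for each model.

    Author: <group member name>  (placeholder for the attribution)
    """
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring=scoring).mean())
        for name, model in models.items()
    }


# Hypothetical demo data: the label depends only on the first feature.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

scores = compare_models(
    {
        "dummy": DummyClassifier(strategy="most_frequent"),
        "logreg": LogisticRegression(),
    },
    X_demo,
    y_demo,
)
```

Placed in a module (for example a file next to the notebooks), such a function can be imported wherever models are compared, so the evaluation logic exists in exactly one place.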

## Presentation of Results

### Oral Presentation

In the presentation you are meant to present the workflow of the project as well as the main results (20 to 40 minutes in total for *all* members of the group combined, *not* per group member). Outline which tools you have used (e.g., Pandas, Scikit-Learn) and how you have approached the data to arrive at your results. Also discuss the choice / usage of your ML models in relation to the EDA.

Choose a suitable medium such as MS Office-style slides or Jupyter notebooks. If you are using the latter, please pay special attention to conciseness and a clean structure. Prepare your results comprehensibly by using, e.g., flow charts for representing workflows and figures / tables for summarizing quantitative results. Please pay special attention to the legibility of axis labels, titles and legends in plots, as well as to colors and line types.

### Comments

If you are working in groups, it must be obvious from your presentation which group member has conducted which part of the work.



In [None]:
from ucimlrepo import fetch_ucirepo 
import numpy as np
import pandas as pd
import xgboost as xgb
import category_encoders as ce

import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import uniform, randint

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from sklearn.decomposition import PCA

from dataprep.eda import create_report

In [None]:
%matplotlib qt

# fetch dataset 

In [None]:
credit_approval = fetch_ucirepo(id=27) 

# data (as pandas dataframes) 

In [None]:
X = credit_approval.data.features
y = credit_approval.data.targets

# metadata 

In [None]:
print(credit_approval.metadata) 

# variable information

In [None]:
credit_approval.variables

## Grading

The grade is determined 100% by the presentation.

In the case of group work, *every group member will get an individual grade*. It must therefore be obvious from your presentation which group member is responsible for which part of the work. Group members may also, for example, conduct different quantitative analyses of the data (by considering different ML models).

In [None]:
"""
Useful tools
- pipelines
- feature union
- 

- confusion matrix


"""

# Exploratory Data Analysis

## Dataset general overview

In [None]:
X

In [None]:
y.value_counts()

In [None]:
# replace with Marcel's toolbox
create_report(X.dropna())

## Categorical features

In [None]:
"""
Histogram of all individual features on a grid
- Maybe remove some sparse values.
- How much information is shared between different features?
- How many values are there in each combination of all categorical types?


Information measurement of single features (is this useful if we have so many features?)


"""

In [None]:
credit_approval.variables[credit_approval.variables.type=='Categorical']

In [None]:
# Histogram of the category counts for each categorical feature, one panel each
categorical_cols = ["A13", "A12", "A10", "A9", "A7", "A6", "A5", "A4", "A1"]
_, ax = plt.subplots(nrows=2, ncols=5, figsize=(10, 10))
for axis, col in zip(ax.flat, categorical_cols):
    X[col].hist(ax=axis)
    axis.set_title(col)  # label each panel with its feature name
ax[1][4].remove()  # the tenth panel is unused

Are there any obvious strong dependencies?
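
One possible way to quantify dependencies between categorical features is Cramér's V, which is derived from the chi-squared statistic of the contingency table. The helper `cramers_v` and the toy Series below are hypothetical and only sketch the idea; this is not an analysis of the actual features.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Cramér's V for two categorical Series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(a, b)  # rows with missing values are dropped
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")


# Hypothetical toy columns: `b` is a relabelling of `a`, `c` is unrelated to `a`.
a = pd.Series(["u", "u", "y", "y", "u", "y", "u", "y"])
b = a.map({"u": "g", "y": "p"})
c = pd.Series(["t", "f", "t", "f", "f", "t", "t", "f"])

v_dependent = cramers_v(a, b)
v_independent = cramers_v(a, c)
```

Applied to every pair of categorical columns, the resulting matrix can be shown as a heatmap to spot redundant features. For small samples the uncorrected statistic tends to overestimate the association, so values should be read with some caution.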

## Numerical features

In [None]:
"""
- No obvious strong correlations. 
- Standardize features.

Further analysis:
- Principal component analysis.


- Outlier removal / replacement with mean or median (Gaussian distribution?)

"""

In [None]:
sns.pairplot(X.dropna())

In [None]:
pca = PCA(n_components=5)
# select_dtypes is the public equivalent of the private _get_numeric_data()
pca.fit(X.select_dtypes(include="number").dropna())
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
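
The note on standardizing features matters for PCA in particular: without scaling, the component directions are dominated by whichever feature has the largest variance. The sketch below illustrates this on hypothetical synthetic data (`demo`) whose three independent columns differ only in scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three independent features on very different scales.
rng = np.random.default_rng(0)
demo = rng.normal(size=(300, 3)) * np.array([1.0, 1000.0, 0.01])

# PCA on the raw data: the large-scale column absorbs almost all variance.
raw_ratio = PCA(n_components=3).fit(demo).explained_variance_ratio_

# PCA after standardization: the variance is spread across the components.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=3)).fit(demo)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
```

If the raw explained-variance ratios are concentrated in one component, it is worth checking whether that reflects real structure or merely the units of a single feature.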

## NaN analysis

In [None]:
"""
- Drop rows with many missing values.
- Impute the remaining NaNs; look at the distribution of the column values to decide whether to replace with the median or the mean.

"""

In [None]:
# Number of NaNs per feature
X.isna().sum()

In [None]:
# Share of rows with at least one NaN (avoids hard-coding the row count)
X.isna().any(axis=1).mean()

In [None]:
# How often do NaNs occur together? (How much data would we lose if we simply dropped every row with a NaN?)
X.isna().sum(axis=1).value_counts()

Simply dropping every row that contains a NaN removes only ~5% of the data.\
This is a loss we are willing to accept in the first run. We will come back later and try different ways of handling NaNs to optimize performance.

In [None]:
X_clean = X.dropna(how='any')
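
As an alternative to dropping rows, the imputation idea from the notes above could be sketched with Scikit-Learn's `SimpleImputer`. The frame `toy` and its columns are hypothetical; the median is used for the numeric column because it is less sensitive to the outlier than the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 100.0],
    "cat": ["a", "b", "a", np.nan],
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

imputed = toy.copy()
# Numeric column: fill with the median of the observed values.
imputed[["num"]] = num_imputer.fit_transform(toy[["num"]])
# Categorical column: fill with the most frequent category.
imputed[["cat"]] = cat_imputer.fit_transform(toy[["cat"]])
```

In a modelling workflow the imputers should be fitted on the training split only and then applied to the test split, for instance by placing them inside a pipeline.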

# Machine Learning

In [None]:
"""
Approaches for combining categorical / numerical data:
- Separate classifiers, e.g. decision tree + regressor
- Encoding of categorical data.

Feature selection:
- Forward / backward feature selection

Models:
- Regression
- XGBoost
- Neural network
- Random forest with missing data imputation
- LightGBM

Evaluating classifier performance:
- Cross validation
- Model evaluation metrics (FDR, TPR), precision/recall, ROC AUC
- Plots where all models are compared

"""

## Encode categorical data
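
One of the encodings mentioned in the task description, frequency encoding, can be sketched with plain Pandas. The frames `train` and `test` and the column `cat` are hypothetical; the frequencies are learned on the training data only, and categories unseen during training are mapped to 0 here, which is one possible convention among several.

```python
import pandas as pd

# Hypothetical train / test split of a single categorical column.
train = pd.DataFrame({"cat": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"cat": ["a", "c", "z"]})

# Relative frequency of each category in the training data.
freq = train["cat"].value_counts(normalize=True)

# Replace each category by its training frequency.
train_enc = train["cat"].map(freq)
# Categories not seen during training ("z") get frequency 0.
test_enc = test["cat"].map(freq).fillna(0.0)
```

Tree-based models can often use such a single numeric column directly, whereas linear models usually work better with one-hot encoding.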

## Baseline classifier: 82% accuracy

In [None]:
X

In [None]:
pd.DataFrame(tmp)

In [None]:
targets = credit_approval.data.targets.replace({'+':1,'-':0})
targets

In [None]:
pipe = Pipeline(
    steps=[
        # ("encoder", ce.OneHotEncoder()),
        # "reg:squarederror" is the current name of the deprecated "reg:linear" objective
        ("xgb", xgb.XGBRegressor(objective="reg:squarederror", random_state=42)),
    ]
)

In [None]:
X = credit_approval.data.features
y = targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)#, random_state=42)

#xgb_model = xgb.XGBRegressor(objective="reg:linear", random_state=42)

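# Baseline uses the numeric features only (public API instead of _get_numeric_data())
pipe.fit(X_train.select_dtypes(include="number"), y_train)

y_pred = pipe.predict(X_test.select_dtypes(include="number"))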

mse=mean_squared_error(y_test, y_pred)


In [None]:
1-mse
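
Note that `1 - mse` computed on continuous regressor outputs is not the same quantity as classification accuracy; the two coincide only when the predictions are hard 0/1 labels. The sketch below shows the difference on hypothetical scores (`y_true` and `scores` are made up) and how thresholded predictions can be evaluated with `accuracy_score` and `confusion_matrix`, both of which are already imported at the top of this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Hypothetical true labels and continuous regressor outputs.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7])

# The quantity computed in the baseline above.
one_minus_mse = 1 - mean_squared_error(y_true, scores)

# Threshold the scores at 0.5 to obtain class predictions.
labels = (scores >= 0.5).astype(int)
accuracy = accuracy_score(y_true, labels)
cm = confusion_matrix(y_true, labels)  # rows: true class, columns: predicted class

# For hard 0/1 labels, 1 - MSE and accuracy agree.
one_minus_mse_labels = 1 - mean_squared_error(y_true, labels)
```

On this toy example `1 - mse` is 0.775 while the accuracy is 4/6, so the baseline figure is better reported from thresholded predictions or from a classifier.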