# Credit Risk Analysis

## Import packages

1. `sys`: System-specific parameters and functions.
2. `reload` (from `importlib`): Reload previously imported modules.
3. `matplotlib.pyplot`: Data visualization.
4. `numpy`: Numerical computing.
5. `pandas`: Data manipulation and analysis.
6. `seaborn`: Statistical data visualization.
7. `SimpleImputer` (from `sklearn.impute`): Handling missing data.
8. `LogisticRegression` (from `sklearn.linear_model`): Logistic regression for classification.

In [None]:
import sys

sys.path.append("..")

from importlib import reload  # `imp` is deprecated and was removed in Python 3.12

import numpy as np
import pandas as pd

from helper_functions import data_utils, preprocessing
# from helper_functions import config, data_utils, evaluation, plot, preprocessing

# import lightgbm as lgb

# Ignore warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

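As a small illustration of why `reload` is imported, the sketch below writes a throwaway module to a temporary directory, edits it, and reloads it without restarting the interpreter. The module name `demo_helper` and its `VALUE` attribute are hypothetical and exist only for this example; they are not part of `helper_functions`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module on disk (hypothetical name, for illustration only).
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_helper.py"
module_path.write_text("VALUE = 1\n")

# Make the directory importable and refresh the import system's caches.
sys.path.append(tmp_dir)
importlib.invalidate_caches()

import demo_helper

first_value = demo_helper.VALUE

# Simulate editing the module's source while the session is still running.
module_path.write_text("VALUE = 22\n")

# reload() re-executes the module's source in the existing module object.
importlib.reload(demo_helper)
second_value = demo_helper.VALUE

print(first_value, second_value)
```

In the notebook, the same pattern would presumably be `reload(preprocessing)` after editing the helper module.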

## Load normalized data set


In this notebook, we encode a previously normalized dataset and then build the ML models.

In [None]:
app_normalized = data_utils.get_normalized_model()
# Pop the target column and re-assign it so that it becomes the last column.
app_normalized['TARGET_LABEL_BAD=1'] = app_normalized.pop('TARGET_LABEL_BAD=1')
app_normalized = preprocessing.categorical_columns(app_normalized)


In [None]:
app_normalized.head()


In [None]:
# info() prints its summary directly and returns None, so print() is not needed.
app_normalized.info()


### Encoding

- We do the encoding process for....

- Some of the encoding techniques offered by `category_encoders` are:
    - `One-Hot Encoding:` Creates one binary indicator column per category.
    - `Ordinal Encoding:` Assigns an integer label to each category.
    - `Binary Encoding:` Converts each category to an integer and splits its base-2 digits across columns, which reduces dimensionality for variables with many categories.
    - `BaseN Encoding:` Generalizes binary encoding to an arbitrary base N; higher bases give fewer columns.
    - `Target Encoding:` Replaces each category with a statistic of the target variable (typically a smoothed mean) for that category.
    - `CatBoost Encoding:` An ordered variant of target encoding, borrowed from the CatBoost algorithm, that computes each row's value from preceding rows only to limit target leakage. It can be used with any model.
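
To make three of these ideas concrete, here is a minimal pandas-only sketch of one-hot, ordinal, and (unsmoothed) target encoding. The toy frame and its column names (`SEX`, `STATE`, `TARGET`) are made up for illustration; this is not how `preprocessing.encoding` is implemented, and it does not use `category_encoders`.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "SEX": ["F", "M", "F", "M", "F", "M"],
    "STATE": ["SP", "RJ", "SP", "BA", "RJ", "SP"],
    "TARGET": [1, 0, 0, 1, 0, 1],
})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy[["SEX", "STATE"]], columns=["SEX", "STATE"], dtype=int)

# Ordinal encoding: integer codes assigned in sorted category order.
ordinal = toy["STATE"].astype("category").cat.codes

# Naive target encoding: mean of the target per category, with no smoothing.
# Computed on all rows here, which leaks the target; in practice it should be
# fitted on the training split only.
target_mean = toy.groupby("STATE")["TARGET"].transform("mean")

print(one_hot.columns.tolist())
print(ordinal.tolist())
print(target_mean.round(3).tolist())
```

One-hot encoding grows with the number of categories (five columns here for two variables), whereas ordinal and target encoding keep one column per variable at the cost of imposing an order or depending on the target.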

In [None]:
app_dum = preprocessing.encoding(app_normalized, True) # True for pandas get_dummies
# app_enc = preprocessing.encoding(app_normalized, False) # False for different encoder


In [None]:
print(app_dum.columns)


In [None]:
# using get_dummies
lr_model_enc = preprocessing.model_logistic_regression(app_dum, True)


In [None]:
preprocessing.model_catboost_classifier(app_dum)


### Comparing different models

#### MSE vs. R²

In [None]:
# reload(preprocessing)
preprocessing.basic_models(app_dum)


```
 |                   Model|       MSE|        R²
0|       Linear Regression|  0.239810| -0.250364
1|     Logistic Regression|  0.425053| -1.216218
2|     KNeighborsRegressor|  0.286204| -0.492264
3|    Gaussian Naive Bayes|  0.471012| -1.455852
4|  Multi Layer Perceptron|  0.306286| -0.596973
5|                CatBoost|  0.187573|  0.022000
6|        Ridge Regression|  0.239738| -0.249988
7|        LASSO Regression|  0.250000| -0.303496
8|           Decision Tree|  0.467938| -1.439824
9|           Random Forest|  0.270495| -0.410358
```
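
For reference, the two columns in the table can be computed as in the sketch below. It uses NumPy only and a made-up 0/1 target; the arrays are illustrative and are not taken from the credit data, and `preprocessing.basic_models` may compute its metrics differently.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is the error of the mean baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up binary target with 25% positives.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Baseline that always predicts the mean of the target.
mean_baseline = np.full(len(y_true), y_true.mean())

# Hard 0/1 predictions with three mistakes (positions 1, 5 and 7).
hard_labels = np.array([0, 1, 0, 1, 0, 0, 0, 1])

baseline_mse, baseline_r2 = mse(y_true, mean_baseline), r2(y_true, mean_baseline)
labels_mse, labels_r2 = mse(y_true, hard_labels), r2(y_true, hard_labels)

print(baseline_mse, baseline_r2)
print(labels_mse, labels_r2)
```

The mean baseline gets an R² of exactly 0, and any model with a larger squared error than that baseline gets a negative R². With hard 0/1 predictions on a binary target, the MSE is simply the misclassification rate.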

- The Mean Squared Error (MSE) measures the average of squared errors between the predicted values and the actual values. A lower MSE indicates better accuracy, as it means the model's predictions are closer to the actual values.

- R² (R-squared) is the proportion of variance in the target variable that the model explains, computed as one minus the ratio of the model's squared error to that of a baseline that always predicts the mean of the target. A value of 1 is a perfect fit, 0 matches the mean baseline, and a negative value means the model performs worse than that baseline on the evaluated data.

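- Among the models listed, CatBoost performs best: it has the lowest MSE (0.187573) and the highest R² (0.022000), and it is the only model with a positive R². Even so, an R² of 0.022 means it explains only about 2% of the variance in the target, so it is just marginally better than always predicting the mean. The two metrics also rank the ten models in exactly the same order here, so they do not provide independent evidence.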

- These metrics depend on the specific dataset and on the train/test split. It is good practice to cross-validate the models and to weigh other factors, such as interpretability and computational cost, before choosing a model.
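
As a rough sketch of the cross-validation suggested above, the snippet below scores two scikit-learn classifiers with 5-fold stratified cross-validation. The data is synthetic (`make_classification`), and the model choices and parameters are assumptions made for illustration, not the configuration used in `preprocessing.basic_models`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is exposed as its negative.
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[name] = -neg_mse
    print(f"{name}: mean MSE = {cv_mse[name].mean():.3f} (std {cv_mse[name].std():.3f})")
```

Reporting the spread across folds alongside the mean shows whether a difference between two models is larger than the variation caused by the split itself.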







### Neural networks

In [None]:
# import numpy as np
# import pandas as pd
# from sklearn.metrics import classification_report
# from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# from sklearn.metrics import make_scorer
# from sklearn.pipeline import Pipeline
# import matplotlib.pyplot as plt
# import tensorflow as tf
# from tensorflow.keras import Sequential, layers
# from tensorflow.keras.optimizers import Adam
# from tensorflow.keras import regularizers
# from src import config, data_utils, evaluation, plot
