# Credit Risk Analysis

## Import packages

1. `sys`: System-specific parameters and functions.
2. `reload` (from `importlib`): Reload previously imported modules.
3. `matplotlib.pyplot`: Data visualization.
4. `numpy`: Numerical computing.
5. `pandas`: Data manipulation and analysis.
6. `seaborn`: Statistical data visualization.
7. `SimpleImputer` (from `sklearn.impute`): Handling missing data.
8. `LogisticRegression` (from `sklearn.linear_model`): Logistic regression for classification.

In [1]:
import sys

# Add the parent directory to the import path
sys.path.append("..")

# importlib.reload replaces the deprecated imp.reload
from importlib import reload

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

from helper_functions import config, data_utils, evaluation, plot, preprocessing


In [2]:
# Suppress FutureWarning messages
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

## Load normalized data set


In this notebook, we encode a previously normalized dataset and then build the ML model.

In [3]:
app_normalized = data_utils.get_normalized_model()
# Pop the target and re-assign it so that it becomes the last column
app_normalized['TARGET_LABEL_BAD=1'] = app_normalized.pop('TARGET_LABEL_BAD=1')

In [4]:
app_normalized = preprocessing.categorical_columns(app_normalized)
app_normalized.head()

| | PAYMENT_DAY | APPLICATION_SUBMISSION_TYPE | SEX | MARITAL_STATUS | QUANT_DEPENDANTS | RESIDENCIAL_STATE | FLAG_RESIDENCIAL_PHONE | MONTHS_IN_RESIDENCE | FLAG_EMAIL | COMPANY | ... | PRODUCT | AGE | HAS_DEPENDANTS | HAS_RESIDENCE | MONTHLY_INCOMES_TOT | HAS_CARDS | HAS_BANKING_ACCOUNTS | HAS_PERSONAL_ASSETS | HAS_CARS | TARGET_LABEL_BAD=1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 - 14 | Web | F | other | 1 | RN | Y | + 1 year | 1 | N | ... | 1 | 26 - 35 | True | True | [650 - 1320] | True | False | False | False | 1 |
| 1 | 15 - 30 | Carga | F | married | 0 | RJ | Y | 0 - 6 months | 1 | Y | ... | 1 | 26 - 35 | False | True | [650 - 1320] | False | False | False | False | 1 |
| 2 | 1 - 14 | Web | F | married | 0 | RN | Y | + 1 year | 1 | N | ... | 1 | 26 - 35 | False | True | [0 - 650] | False | False | False | False | 0 |
| 3 | 15 - 30 | Web | F | married | 0 | PE | N | + 1 year | 1 | N | ... | 1 | > 60 | False | False | [0 - 650] | False | False | False | False | 0 |
| 4 | 1 - 14 | Web | M | married | 0 | RJ | Y | 6 months - 1 year | 1 | N | ... | 1 | 46 - 60 | False | True | [650 - 1320] | False | False | False | False | 1 |
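
A helper such as `preprocessing.categorical_columns` presumably casts columns to pandas' `category` dtype, but its source is not shown here. The sketch below is only one way to do that: `to_categorical` is a hypothetical stand-in, the toy frame merely reuses a few column names from the table above, and leaving boolean columns untouched is an assumption.

```python
import pandas as pd


def to_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every non-boolean column cast to 'category'."""
    out = df.copy()
    for col in out.columns:
        # Assumption: boolean flags are kept as plain booleans.
        if not pd.api.types.is_bool_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


# Toy frame with made-up rows (hypothetical data, not the real dataset).
toy = pd.DataFrame(
    {
        "SEX": ["F", "F", "M"],
        "PAYMENT_DAY": ["1 - 14", "15 - 30", "1 - 14"],
        "HAS_CARS": [False, False, True],
    }
)
toy_cat = to_categorical(toy)
print(toy_cat.dtypes)
```

Storing low-cardinality string columns as `category` usually reduces memory use and makes the set of valid levels explicit before encoding.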


In [5]:
print(app_normalized.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49935 entries, 0 to 49934
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   PAYMENT_DAY                  49935 non-null  category
 1   APPLICATION_SUBMISSION_TYPE  49935 non-null  category
 2   SEX                          49935 non-null  category
 3   MARITAL_STATUS               49935 non-null  category
 4   QUANT_DEPENDANTS             49935 non-null  category
 5   RESIDENCIAL_STATE            49935 non-null  category
 6   FLAG_RESIDENCIAL_PHONE       49935 non-null  category
 7   MONTHS_IN_RESIDENCE          49935 non-null  category
 8   FLAG_EMAIL                   49935 non-null  category
 9   COMPANY                      49935 non-null  category
 10  FLAG_PROFESSIONAL_PHONE      49935 non-null  category
 11  PRODUCT                      49935 non-null  category
 12  AGE                          49935 non-null  category
 13  H

### Encoding

Some of the encoding techniques offered by `category_encoders` are:

- One-Hot Encoding: creates one binary indicator column per category.
- Ordinal Encoding: assigns an ordinal (integer) label to each category.
- Binary Encoding: base-2 encoding, which reduces dimensionality for categorical variables with many categories.
- BaseN Encoding: base-N encoding, which reduces dimensionality for categorical variables with many categories.
- Target Encoding: uses the target variable to assign a value to each category.
- CatBoost Encoding: the target-based encoding scheme used by the CatBoost algorithm.
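
To make the difference between two of these encodings concrete, here is a small sketch that uses plain pandas instead of `category_encoders` (a substitution made only to keep the example dependency-light). The toy column reuses the `MONTHS_IN_RESIDENCE` bins seen earlier; the category order is an assumption made for illustration.

```python
import pandas as pd

# Toy column with made-up rows (hypothetical data).
toy = pd.DataFrame(
    {"MONTHS_IN_RESIDENCE": ["+ 1 year", "0 - 6 months", "6 months - 1 year", "+ 1 year"]}
)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["MONTHS_IN_RESIDENCE"], prefix="MONTHS_IN_RESIDENCE")

# Ordinal encoding: map each category to its rank in an explicit order.
order = ["0 - 6 months", "6 months - 1 year", "+ 1 year"]
ordinal = pd.Categorical(
    toy["MONTHS_IN_RESIDENCE"], categories=order, ordered=True
).codes

print(one_hot.shape)   # 4 rows, 3 indicator columns
print(list(ordinal))   # rank of each row's category within `order`
```

One-hot encoding adds a column per level, while ordinal encoding keeps a single column but imposes an order, so it suits binned variables like this one better than unordered ones such as `RESIDENCIAL_STATE`.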

In [55]:
# Pass False to encode with get_dummies, True to use the one-hot and category encoders
X, y = preprocessing.encoding(app_normalized, False)

In [45]:
# Print the shape of each split
# print("Input train data shape: ", train_set.shape)
# print("Input val data shape: ", val_set.shape)
# print("Input test data shape: ", test_set.shape, "\n")

Input train data shape:  (33955, 21)
Input val data shape:  (5993, 21)
Input test data shape:  (9987, 21) 
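
The three sizes above (33955, 9987 and 5993 rows out of 49935) are consistent with a two-stage split: 20% of the rows held out first, then 15% of the remainder. The splitting code itself is not shown in this notebook, so the sketch below is an assumption built on scikit-learn's `train_test_split`; `split_three_way`, the fractions, the seed and the synthetic data are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_way(df, test_size=0.2, val_size=0.15, seed=42):
    """Split df into train, validation and test frames in two stages."""
    # Stage 1: hold out the test set.
    train_val, test = train_test_split(df, test_size=test_size, random_state=seed)
    # Stage 2: carve the validation set out of what remains.
    train, val = train_test_split(train_val, test_size=val_size, random_state=seed)
    return train, val, test


# Synthetic stand-in for the normalized application data (hypothetical).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "feature": rng.normal(size=1000),
        "TARGET_LABEL_BAD=1": rng.integers(0, 2, size=1000),
    }
)
train_set, val_set, test_set = split_three_way(toy)
print("Input train data shape: ", train_set.shape)
print("Input val data shape: ", val_set.shape)
print("Input test data shape: ", test_set.shape)
```

For an imbalanced target, passing `stratify=df["TARGET_LABEL_BAD=1"]` at each stage would keep the class ratio similar across the three splits.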



In [52]:
# x = pd.get_dummies(train_set)
# print(len(x.columns))

73


In [53]:
# x = pd.get_dummies(train_set, drop_first = True)
# len(x.columns)

58
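
The drop from 73 to 58 columns is what `drop_first=True` would produce if 15 columns were dummy-encoded, since it removes one indicator per encoded column. A small self-contained illustration on made-up data (the toy values are not taken from the dataset):

```python
import pandas as pd

# Toy frame with made-up rows (hypothetical data).
toy = pd.DataFrame(
    {
        "SEX": ["F", "M", "F"],
        "AGE": ["26 - 35", "> 60", "46 - 60"],
        "HAS_CARS": [False, True, False],  # boolean columns are not dummy-encoded
    }
)

full = pd.get_dummies(toy)
reduced = pd.get_dummies(toy, drop_first=True)

# Each of the two encoded columns loses exactly one indicator.
print(len(full.columns), len(reduced.columns))
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters most for unregularized linear models.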

In [54]:
# x.head()

| | HAS_DEPENDANTS | HAS_RESIDENCE | HAS_CARDS | HAS_BANKING_ACCOUNTS | HAS_PERSONAL_ASSETS | HAS_CARS | PAYMENT_DAY_15 - 30 | APPLICATION_SUBMISSION_TYPE_Web | SEX_M | MARITAL_STATUS_other | ... | AGE_26 - 35 | AGE_36 - 45 | AGE_46 - 60 | AGE_< 18 | AGE_> 60 | MONTHLY_INCOMES_TOT_[1320 - 3323] | MONTHLY_INCOMES_TOT_[3323 - 8560] | MONTHLY_INCOMES_TOT_[650 - 1320] | MONTHLY_INCOMES_TOT_[> 8560] | TARGET_LABEL_BAD=1_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15631 | False | True | True | False | False | False | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1480 | False | True | False | True | False | True | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 22852 | True | True | False | False | False | False | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 24418 | True | False | False | True | False | False | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7103 | False | True | False | True | False | True | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [None]:
# X_train, y_train, X_test, y_test, X_val, y_val = xxx(train, test, val)