# Loading the Processed Dataset

In this step we load the previously cleaned dataset (`application_train_limpo.csv`).

This dataset is the basis for:

- Feature engineering
- Additional transformations
- Final preparation for modeling

In [44]:
import pandas as pd

In [45]:
# Load the cleaned dataset for analysis and feature engineering
df_train = pd.read_csv('../data/processed/application_train_limpo.csv')
df_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,DAYS_BIRTH_YEARS
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,25.920548
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,45.931507
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,52.180822
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,,,,,,,52.068493
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,54.608219


# Feature Engineering: Financial Ratios

We create new variables based on financial relationships that are relevant to credit risk analysis.

1. ANUITY_INCOME_RATIO  
   The income commitment ratio:
   
   AMT_ANNUITY / AMT_INCOME_TOTAL

   Measures how much of the client's income is committed to the loan installment.

2. CREDIT_INCOME_RATIO  
   The client's leverage level:
   
   AMT_CREDIT / AMT_INCOME_TOTAL

   Indicates the size of the credit relative to total income.

These variables matter because they capture repayment capacity and risk exposure.


In [46]:
# Create ratio features between financial variables to capture important relationships
# Income commitment ratio (installment / total income)
df_train['ANUITY_INCOME_RATIO'] = df_train['AMT_ANNUITY'] / df_train['AMT_INCOME_TOTAL']
# Credit-to-income ratio (total credit / total income)
df_train['CREDIT_INCOME_RATIO'] = df_train['AMT_CREDIT'] / df_train['AMT_INCOME_TOTAL']
print('New features created: ANUITY_INCOME_RATIO, CREDIT_INCOME_RATIO')
df_train[['ANUITY_INCOME_RATIO', 'CREDIT_INCOME_RATIO']].head()

New features created: ANUITY_INCOME_RATIO, CREDIT_INCOME_RATIO


Unnamed: 0,ANUITY_INCOME_RATIO,CREDIT_INCOME_RATIO
0,0.121978,2.007889
1,0.132217,4.79075
2,0.1,2.0
3,0.2199,2.316167
4,0.179963,4.222222
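
A plain division returns `inf` whenever the denominator is zero. The sketch below shows one way to guard against that, assuming a zero income should produce a missing value rather than an infinite ratio. The helper name `safe_ratio` and the third row of the toy data are hypothetical; the first two rows reuse values shown in the outputs above.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN where the denominator is zero."""
    # Replacing 0 with NaN makes the division yield NaN instead of inf.
    return numerator / denominator.replace(0, np.nan)


# Toy data: rows 0 and 1 reuse values shown above, row 2 is a
# hypothetical client with zero reported income.
toy = pd.DataFrame({
    "AMT_ANNUITY": [24700.5, 6750.0, 10000.0],
    "AMT_INCOME_TOTAL": [202500.0, 67500.0, 0.0],
})

toy["ANUITY_INCOME_RATIO"] = safe_ratio(toy["AMT_ANNUITY"], toy["AMT_INCOME_TOTAL"])
print(toy)
```

Whether this guard is needed depends on whether the cleaned dataset still contains zero incomes, which the notebook does not state.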


# Creating the SOURCES_MEAN Feature

The EXT_SOURCE_1 variable was removed earlier because of its high percentage of null values.

We therefore create a new variable called:

SOURCES_MEAN

It is the mean of:

- EXT_SOURCE_2
- EXT_SOURCE_3

These variables typically represent external credit scores, and their mean gives a single summary of the client's risk quality. Because `DataFrame.mean(axis=1)` skips missing values by default, a row with only one of the two scores available receives that score as its mean.


In [47]:
# EXT_SOURCE_1 was removed because of its high percentage of null values
# Create the mean of the remaining external sources
df_train['SOURCES_MEAN'] = df_train[['EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
df_train[['SOURCES_MEAN']].head()

Unnamed: 0,SOURCES_MEAN
0,0.201162
1,0.622246
2,0.642739
3,0.650442
4,0.322738
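
The row-wise mean behaves differently depending on how many scores are present. The sketch below uses made-up values to illustrate this, assuming some missing values may remain in EXT_SOURCE_2 or EXT_SOURCE_3. The `SOURCES_COUNT` column is a hypothetical addition, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy scores (invented values) covering the four possible cases.
toy = pd.DataFrame({
    "EXT_SOURCE_2": [0.6, 0.4, np.nan, np.nan],
    "EXT_SOURCE_3": [0.2, np.nan, 0.5, np.nan],
})
sources = ["EXT_SOURCE_2", "EXT_SOURCE_3"]

# mean(axis=1) skips NaN by default, so a single available score is
# returned unchanged; a row with no scores stays NaN.
toy["SOURCES_MEAN"] = toy[sources].mean(axis=1)

# Hypothetical helper column: how many scores the mean is based on.
toy["SOURCES_COUNT"] = toy[sources].notna().sum(axis=1)
print(toy)
```

Keeping a count next to the mean lets a model distinguish a mean built from two scores from one built from a single score.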


# Creating the GOODS_PRICE_RATIO Feature

We create the variable:

GOODS_PRICE_RATIO = AMT_GOODS_PRICE / AMT_CREDIT

Interpretation:

- Value close to 1 → the credit covers practically the full price of the goods
- Value greater than 1 → the goods cost more than the credit granted
- Value less than 1 → the credit exceeds the price of the goods

This variable helps describe how the credit operation is structured.


In [48]:
# Ratio between the price of the financed goods and the credit amount granted
# A value greater than 1 means the goods cost more than the credit
df_train['GOODS_PRICE_RATIO'] = df_train['AMT_GOODS_PRICE'] / df_train['AMT_CREDIT']
df_train[['GOODS_PRICE_RATIO']].head()

Unnamed: 0,GOODS_PRICE_RATIO
0,0.863262
1,0.873211
2,1.0
3,0.949845
4,1.0
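
The three interpretation bands can also be attached to each row as a label. This is a minimal sketch under stated assumptions: the tolerance `tol` that defines "close to 1", the `RATIO_BAND` column, and the label texts are all hypothetical choices, and the third toy row is invented.

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 reuse values shown above, row 2 is invented.
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [351000.0, 135000.0, 120000.0],
    "AMT_CREDIT": [406597.5, 135000.0, 100000.0],
})
toy["GOODS_PRICE_RATIO"] = toy["AMT_GOODS_PRICE"] / toy["AMT_CREDIT"]

tol = 0.01  # hypothetical tolerance for "close to 1"
ratio = toy["GOODS_PRICE_RATIO"].to_numpy()

# Conditions are checked in order; the first match wins.
conditions = [
    np.isclose(ratio, 1.0, rtol=0, atol=tol),
    ratio > 1.0,
]
labels = ["credit close to goods price", "goods cost more than credit"]

# Remaining rows have a ratio below 1. Rows with a NaN ratio would also
# land here, so they should be handled separately on real data.
toy["RATIO_BAND"] = np.select(conditions, labels, default="credit exceeds goods price")
print(toy)
```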


# Data Type Analysis

In this step we check:

- The number of columns per data type
- The number of categorical variables (`object` dtype)

This diagnosis informs:

- The transformation strategy
- Whether encoding is needed
- Preparation for tree-based or regression models


In [49]:
# Inspect column dtypes to understand the data structure and spot required transformations
tipos_colunas = df_train.dtypes.value_counts()
print("Column types and their counts:")
print(tipos_colunas)

# Columns stored as object hold the categorical (text) variables
colunas_texto = df_train.select_dtypes(include=['object']).columns.tolist()
print(f'Text (categorical) columns: {len(colunas_texto)}')

Column types and their counts:
int64      41
float64    32
object     13
Name: count, dtype: int64
Text (categorical) columns: 13
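
Counting distinct values per categorical column gives an estimate of how many dummy columns one-hot encoding will create. The sketch below uses a toy frame with invented rows; the variable names are hypothetical. It assumes the default `pd.get_dummies` behaviour, where missing values do not get a column of their own.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": pd.Series(
        ["Cash loans", "Revolving loans", "Cash loans"], dtype="object"
    ),
    "CODE_GENDER": pd.Series(["M", "F", "M"], dtype="object"),
    "AMT_CREDIT": [406597.5, 135000.0, 513000.0],
})

cat_cols = toy.select_dtypes(include=["object"]).columns

# nunique() ignores NaN by default, matching get_dummies' default behaviour.
cardinality = toy[cat_cols].nunique()
print(cardinality)

# Each category becomes one column; non-categorical columns pass through.
expected_columns = (toy.shape[1] - len(cat_cols)) + int(cardinality.sum())
actual_columns = pd.get_dummies(toy).shape[1]
print(expected_columns, actual_columns)
```

High-cardinality columns dominate the increase in width, so this check is a cheap way to see which variables drive it before encoding.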


# Transforming Categorical Variables

We use One-Hot Encoding with `pd.get_dummies()` to convert categorical variables into numeric ones.

This step is needed because:

- Most Machine Learning models do not work directly with text
- Categories must be turned into numeric representations

We also compare:

- The number of columns before encoding
- The number of columns after encoding

The column count grows because each category of each categorical variable becomes its own dummy column.


In [50]:
# One-hot encode every object column; numeric columns pass through unchanged.
# astype(float) then casts all columns (including the dummies) to float.
df_final = pd.get_dummies(df_train).astype(float)
print(f'Number of columns before one-hot encoding: {df_train.shape[1]}')
print(f'Number of columns after one-hot encoding: {df_final.shape[1]}')

df_final.head()

Number of columns before one-hot encoding: 86
Number of columns after one-hot encoding: 199


Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
0,100002.0,1.0,0.0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461.0,-637.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,100003.0,0.0,0.0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765.0,-1188.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,100004.0,0.0,0.0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046.0,-225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006.0,0.0,0.0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005.0,-3039.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,100007.0,0.0,0.0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932.0,-3038.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
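
In the output above, rows 2 to 4 have 0.0 in both `EMERGENCYSTATE_MODE_No` and `EMERGENCYSTATE_MODE_Yes`. That pattern is consistent with a missing category, since `pd.get_dummies` by default encodes a missing value as all zeros, although the notebook does not confirm it. The sketch below uses invented toy data to show the default behaviour next to `dummy_na=True`, which adds an explicit indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Toy data with one missing category (invented rows).
toy = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 135000.0, 513000.0],
    "EMERGENCYSTATE_MODE": pd.Series(["No", np.nan, "Yes"], dtype="object"),
})

# Default: the row with a missing category gets 0 in every dummy column.
default = pd.get_dummies(toy).astype(float)
print(default)

# dummy_na=True: an extra indicator column marks the missing values.
with_nan = pd.get_dummies(toy, dummy_na=True).astype(float)
print(with_nan)
```

Whether the extra column helps depends on the model and on whether missingness itself carries signal, so this is an option to evaluate rather than a recommendation.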


# Saving the Final Dataset

After feature engineering and categorical encoding, we save the final dataset to:

../data/processed/application_train_final.csv

This file will be used in the modeling and training stage of the Machine Learning model.

We keep each stage of the data in its own file:

- Raw data
- Cleaned data
- Final data for modeling

In [51]:
# Save the final dataset with the new features and encoded categorical columns
df_final.to_csv('../data/processed/application_train_final.csv', index=False)
print('Final dataset saved to ../data/processed/application_train_final.csv')

Final dataset saved to ../data/processed/application_train_final.csv
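
A quick round-trip check can confirm that the saved file reloads with the same shape and without a stray index column. This is a sketch on a small invented frame written to a temporary directory, not to the project path; the variable names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the final dataset (invented rows).
toy_final = pd.DataFrame({
    "SK_ID_CURR": [100002.0, 100003.0],
    "TARGET": [1.0, 0.0],
    "CODE_GENDER_F": [0.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "application_train_final.csv"
    # index=False keeps the row index out of the file.
    toy_final.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy_final.shape
same_columns = list(reloaded.columns) == list(toy_final.columns)
# An "Unnamed: 0" column would indicate that an index was written by mistake.
no_index_column = not any(col.startswith("Unnamed") for col in reloaded.columns)
print(same_shape, same_columns, no_index_column)
```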
