<a href="https://colab.research.google.com/github/sergiocarvalho-estatistico/credit_risk_with_python/blob/master/dataexplorer/dataprep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Granting

##### Data Preparation, Manipulation, and Exploration

## Objectives

* Variable types
    * Integer or discrete
    * Numeric or double
    * Categorical
    * Qualitative
    * Data quality
        * Number of NAs per variable
* Creation of new variables, if needed
* Transformation of variables, if needed


## Test Dataset

Importing a dataset with pandas

In [0]:
import pandas as pd
import pandas_profiling

In [97]:
# Read dataset
file_name = "https://raw.githubusercontent.com/sergiocarvalho-estatistico/credit_risk_with_python/master/dados/dataset.csv"
df = pd.read_csv(file_name)
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


Removing the **Loan_ID** column

In [98]:
df = df.drop(columns=['Loan_ID'])  # drop the identifier column by name
df.head(5)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## Data Quality

Are there any variables with null values?

In [99]:
df.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
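
The counts above can also be expressed as a share of rows, which makes the variables easier to compare. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls per column, worst first."""
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(frame)).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, np.nan, 120.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

print(missing_summary(toy))
```

On the notebook's data the same helper would be called as `missing_summary(df)`.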

## Missing Data

### Categorical Variables:

* Imputation of the missing values
* We will use a multi-class classification model.
* Variables
  * Gender
  * Married
  * Dependents
  * Self_Employed
  * Credit_History

In [100]:
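names = ['Gender','Married','Dependents','Self_Employed','Credit_History']

# Split the rows by whether column x is observed (train) or missing (test).
# Plain variables replace the DataFrame attributes (df.train, df.test),
# which pandas warns about. Each pass overwrites the previous split,
# so only the split for the last column in names is kept.
for x in names:
    df_train = df[df[x].notnull()]
    df_test = df[df[x].isnull()]

In [102]:
df_train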

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


## Distribution Analysis under Missing Values

Here we want to observe how the response variable *Loan_Status* behaves when the explanatory variables have missing values.

### Credit_History Variable

Interestingly, Credit_History has the characteristics of a categorical variable, yet it is stored as a numeric type (float64).

In [88]:
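# Label the missing entries so that they are counted as their own category
df['Credit_History'].fillna('No Credit_History').value_counts()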

1.0                  475
0.0                   89
No Credit_History     50
Name: Credit_History, dtype: int64
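
One way to make the categorical nature of this column explicit is to convert it to a pandas categorical and give the missing entries their own level. This is a minimal sketch on a made-up series; the `credit` data is hypothetical, and the `"No Credit_History"` label simply reuses the one shown in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the notebook's df["Credit_History"] is float64.
credit = pd.Series([1.0, 1.0, 0.0, np.nan, 1.0, np.nan], name="Credit_History")

# Treat the two observed codes as categories instead of numbers.
credit_cat = credit.astype("category")

# Add an explicit level for the missing entries, then fill them with it.
credit_cat = credit_cat.cat.add_categories("No Credit_History")
credit_cat = credit_cat.fillna("No Credit_History")

print(credit_cat.value_counts())
```

Keeping the missing entries as a separate level, rather than dropping them, lets later steps compare them against the observed classes.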

#### **Numeric Variables**

* LoanAmount

In [85]:
df.loc[df['LoanAmount'].isnull(), 'Loan_Status'].value_counts(normalize=True)

N    0.5
Y    0.5
Name: Loan_Status, dtype: float64

Among the rows where LoanAmount is missing, the N and Y classes of Loan_Status are split evenly, so the missing values change the class proportions of Loan_Status.

* Loan_Amount_Term

In [86]:
df.loc[df['Loan_Amount_Term'].isnull(), 'Loan_Status'].value_counts(normalize=True)

Y    0.571429
N    0.428571
Name: Loan_Status, dtype: float64

Note that for the rows where Loan_Amount_Term is missing, the class proportions of Loan_Status also change.
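
To judge whether the proportions really change, it helps to see the missing and observed rows side by side. The sketch below does this on made-up data; the `status_share_by_missing` helper and the `toy` frame are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd


def status_share_by_missing(frame, column, target="Loan_Status"):
    """Share of each target class among rows where `column` is missing or observed."""
    is_missing = frame[column].isnull().map({True: "missing", False: "observed"})
    # normalize="index" makes each row of the table sum to 1.
    return pd.crosstab(is_missing, frame[target], normalize="index")


# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan, 141.0, 120.0],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

result = status_share_by_missing(toy, "LoanAmount")
print(result)
```

With few missing rows, as here, differences between the two groups should be read with caution.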

## Handling the Missing Values

To build a model that imputes the variables with missing values, we remove every variable that has some percentage of null values and keep only one of them in the data frame for each model.
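
The column selection described above could be sketched as follows. The `imputation_split` helper and the `toy` frame are hypothetical names introduced for illustration, and the sketch stops before any model is fitted.

```python
import pandas as pd


def imputation_split(frame: pd.DataFrame, target: str):
    """Keep the fully observed columns plus `target`, then split the rows by
    whether `target` is observed (train) or missing (to be imputed)."""
    complete_cols = [
        c for c in frame.columns
        if c != target and frame[c].notnull().all()
    ]
    subset = frame[complete_cols + [target]]
    train = subset[subset[target].notnull()]
    to_impute = subset[subset[target].isnull()].drop(columns=target)
    return train, to_impute


# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "No", None, "Yes"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

train, to_impute = imputation_split(toy, "Gender")
print(train.columns.tolist())
print(len(train), len(to_impute))
```

A classifier fitted on `train` could then predict the target for the rows in `to_impute`; the choice of classifier is left open here.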

## Variable Types

Which variables in the dataframe are categorical and which are numeric?

In [14]:
df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

The variables shown as objects

* Gender
* Married
* Dependents
* Education
* Self_Employed
* Property_Area
* Loan_Status

will be converted to categorical variables.


In [0]:
df['Gender'] = pd.Categorical(df.Gender)
df['Married'] = pd.Categorical(df.Married)
df['Dependents'] = pd.Categorical(df.Dependents)
df['Education'] = pd.Categorical(df.Education)
df['Self_Employed'] = pd.Categorical(df.Self_Employed)
df['Property_Area'] = pd.Categorical(df.Property_Area)
df['Loan_Status'] = pd.Categorical(df.Loan_Status)
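
The same conversion can be written more compactly by passing a column-to-dtype mapping to `DataFrame.astype`. This is a minimal sketch on made-up data; the `toy` frame and the `categorical_cols` list are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000],
})

# Columns to convert; in the notebook this would be the seven listed above.
categorical_cols = ["Gender", "Property_Area"]

# astype accepts a {column: dtype} mapping and leaves other columns untouched.
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

Listing the columns explicitly keeps the conversion independent of how pandas infers string dtypes.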

In [57]:
df.describe().T 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ApplicantIncome,614.0,5403.459283,6109.041673,150.0,2877.5,3812.5,5795.0,81000.0
CoapplicantIncome,614.0,1621.245798,2926.248369,0.0,0.0,1188.5,2297.25,41667.0
LoanAmount,592.0,146.412162,85.587325,9.0,100.0,128.0,168.0,700.0
Loan_Amount_Term,600.0,342.0,65.12041,12.0,360.0,360.0,360.0,480.0
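
By default `describe()` summarises only the numeric columns, so the categorical variables do not appear in the table above. A sketch of how they could be summarised as well, using a made-up `toy` frame as a stand-in for `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "Gender": pd.Categorical(["Male", "Female", "Male", "Male"]),
    "Loan_Status": pd.Categorical(["Y", "N", "Y", "Y"]),
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# include="category" restricts the summary to categorical columns and
# reports count, number of unique levels, the most frequent level, and its frequency.
cat_summary = toy.describe(include="category").T
print(cat_summary)
```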
