Import the modules for the project

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Connect to Google Drive to access the file

In [3]:
credito_df = pd.read_csv('/content/drive/MyDrive/Bootcamp/Proyecto Produbanco - Coding para procesamiento/creditos_historicos.csv', low_memory=False)

Check the size of the dataframe to get the number of observations and columns: 2260668 observations, 23 variables

In [4]:
credito_df.shape

(2260668, 23)

#**Definition of a bad payer**

Check the loan status of each customer to create risk categories. E.g. Fully Paid represents low risk, Late is high risk

Create a function that classifies these statuses as high or low risk

In [5]:
credito_df['loan_status'].value_counts()

loan_status
Fully Paid                                             1041952
Current                                                 919695
Charged Off                                             261655
Late (31-120 days)                                       21897
In Grace Period                                           8952
Late (16-30 days)                                         3737
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     31
Name: count, dtype: int64

In [6]:
def status_riesgo(status):
    # Statuses treated as low risk; every other status counts as high risk
    if status in ['Fully Paid', 'Current', 'In Grace Period']:
        return 'low risk'
    else:
        return 'high risk'

Create a new risk column based on the loan status

In [7]:
credito_df['risk_status'] = credito_df['loan_status'].apply(status_riesgo)

In this new series there are 1970599 low-risk and 290069 high-risk observations

In [8]:
credito_df['risk_status'].value_counts()

risk_status
low risk     1970599
high risk     290069
Name: count, dtype: int64
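As a rough sketch, the same classification can be written without a row-by-row `apply`, which tends to be faster on a frame of this size. The `demo` frame and the `LOW_RISK` name are hypothetical and used only for illustration; the status strings come from the `value_counts()` output above. Note that under this rule the "Does not meet the credit policy" statuses fall into high risk, as they do in `status_riesgo`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the status strings match the notebook output.
demo = pd.DataFrame({'loan_status': ['Fully Paid', 'Current', 'In Grace Period',
                                     'Charged Off', 'Late (31-120 days)', 'Default']})

# Same low-risk list as in status_riesgo.
LOW_RISK = ['Fully Paid', 'Current', 'In Grace Period']

# isin() builds a boolean mask in one pass; np.where maps it to the two labels.
demo['risk_status'] = np.where(demo['loan_status'].isin(LOW_RISK), 'low risk', 'high risk')

print(demo['risk_status'].value_counts())
```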

Define a function to flag bad payers based on three parameters: risk_status, the months since the last delinquency, and the number of delinquencies in the past two years.

E.g. if the payer is low risk but fewer than 6 months have passed since the last delinquency, bad payer. If they were delinquent more than 2 times in the past two years, bad payer. High-risk payers are always flagged.

In [12]:
def marca(riesgo, meses, record2yrs):
    # 1 = bad payer, 0 = otherwise. If meses is NaN, meses < 6 is False.
    if (riesgo == 'high risk') or (riesgo == 'low risk' and meses < 6) or (record2yrs > 2):
        return 1
    else:
        return 0

Create a series by applying the function defined above

In [13]:
credito_df['marca_mal_pagador'] = credito_df.apply(lambda x: marca(riesgo=x['risk_status'],meses=x['mths_since_last_delinq'], record2yrs=x['delinq_2yrs']), axis =1)

384384 customers are flagged as bad payers

In [14]:
credito_df['marca_mal_pagador'].value_counts()

marca_mal_pagador
0    1876284
1     384384
Name: count, dtype: int64
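A minimal sketch of the same rule with boolean masks instead of a row-wise `apply`, on hypothetical toy data (`demo` and `is_bad` are illustrative names). It also shows how missing values behave: a comparison such as `NaN < 6` evaluates to False, so a missing `mths_since_last_delinq` does not trigger the flag by itself.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each branch of the rule.
demo = pd.DataFrame({
    'risk_status': ['low risk', 'low risk', 'low risk', 'high risk', 'low risk'],
    'mths_since_last_delinq': [3.0, 24.0, np.nan, np.nan, 40.0],
    'delinq_2yrs': [0.0, 0.0, 0.0, 0.0, 3.0],
})

is_bad = (
    (demo['risk_status'] == 'high risk')
    | ((demo['risk_status'] == 'low risk') & (demo['mths_since_last_delinq'] < 6))
    | (demo['delinq_2yrs'] > 2)
)

# Convert the boolean mask to the 0/1 flag used in the notebook.
demo['marca_mal_pagador'] = is_bad.astype(int)
print(demo)
```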

#**Data Quality**

*Duplicates*

To check for duplicates in this df we look at id_cliente, since the other variables can repeat without indicating a duplicate record

In [15]:
credito_df['id_cliente'].duplicated().value_counts()

id_cliente
False    2260668
Name: count, dtype: int64

There are no duplicates, so no additional processing is needed
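For reference, a small sketch of a few equivalent ways to run this check, using a hypothetical frame (`demo`) that does contain a repeated id so that each call returns something visible.

```python
import pandas as pd

# Hypothetical data: id 102 appears twice.
demo = pd.DataFrame({'id_cliente': [101, 102, 103, 102],
                     'loan_amnt': [5000, 7500, 5000, 7500]})

# Number of rows whose id already appeared earlier in the column.
n_dup = demo['id_cliente'].duplicated().sum()

# Single boolean answer: True only when every id is distinct.
all_unique = demo['id_cliente'].is_unique

# keep=False marks every member of a duplicated group, useful for inspection.
dup_rows = demo[demo['id_cliente'].duplicated(keep=False)]

print(n_dup, all_unique)
print(dup_rows)
```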

*Nulls*

Check for nulls (NaN values)

- isnull().any() reports, for each column in the df, whether it contains any nulls

In [16]:
credito_df.isnull().any()

id_cliente                 False
loan_status                False
loan_amnt                  False
installment                False
term                       False
emp_title                   True
emp_length                  True
home_ownership             False
annual_inc                  True
verification_status        False
purpose                    False
addr_state                 False
delinq_2yrs                 True
next_pymnt_d                True
earliest_cr_line            True
mths_since_last_delinq      True
total_pymnt                False
recoveries                 False
collection_recovery_fee    False
last_pymnt_d                True
settlement_status           True
application_type           False
tot_hi_cred_lim             True
risk_status                False
marca_mal_pagador          False
dtype: bool

Create a view to check the number of nulls in each of the identified series, using *isnull().value_counts()*

- emp_title: 166969
- emp_length: 146907
- annual_inc: 4
- delinq_2yrs: 29
- next_pymnt_d: 1303607
- earliest_cr_line: 29
- last_pymnt_d: 2426
- mths_since_last_delinq: 1158473
- settlement_status: 2227583
- tot_hi_cred_lim: 70247
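The per-column counts can also be gathered in a single call instead of one view per column. This is a sketch on a hypothetical frame (`demo`), not the cell used to produce the numbers above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with nulls in two of three columns.
demo = pd.DataFrame({
    'emp_title': ['nurse', None, 'teacher', None],
    'annual_inc': [52000.0, 61000.0, np.nan, 48000.0],
    'loan_amnt': [5000, 7500, 12000, 3000],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
null_counts = demo.isnull().sum()

# Keep only the columns that actually have nulls, largest first.
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```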

In [17]:
view = credito_df[credito_df['last_pymnt_d'].isnull()]
view.shape

(2426, 25)

Create a new df by dropping the observations with NaN in the columns with the fewest nulls: annual_inc, delinq_2yrs and earliest_cr_line.

Use *dropna*

In [18]:
creditos2 = credito_df.dropna(subset=['annual_inc','delinq_2yrs','earliest_cr_line'])
creditos2.shape

(2260639, 25)
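The shape above implies that 2260668 - 2260639 = 29 rows were dropped. A small sketch of how `dropna(subset=...)` behaves, on hypothetical data (`demo`, `clean`): only nulls in the listed columns remove a row, while nulls elsewhere are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 have a null in one of the subset columns.
demo = pd.DataFrame({
    'annual_inc': [50000.0, np.nan, 42000.0, 39000.0],
    'delinq_2yrs': [0.0, 1.0, np.nan, 0.0],
    'emp_title': [None, 'clerk', 'nurse', None],
})

# .copy() makes the result an independent frame for later column assignments.
clean = demo.dropna(subset=['annual_inc', 'delinq_2yrs']).copy()

dropped = len(demo) - len(clean)
print(dropped)                             # rows removed
print(clean['emp_title'].isnull().sum())   # nulls outside the subset survive
```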

Inspect the identified nulls to see which loans they belong to

- The nulls in last payment belong to charged-off loans, loans more than 30 days late, and defaults

- The nulls in next payment are associated with fully paid or charged-off loans. Assign a placeholder date that precedes all real payment dates

In [19]:
nulos = creditos2.next_pymnt_d.isnull()

In [20]:
creditos2.loc[nulos, 'loan_status'].value_counts()

loan_status
Fully Paid     1041952
Charged Off     261655
Name: count, dtype: int64
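A possible variation, sketched on hypothetical data (`demo`, `nulls_by_status`): counting the nulls per status in one step also lists the statuses that have no nulls at all, which makes the pattern easier to read.

```python
import pandas as pd

# Hypothetical data: closed loans have no next payment date.
demo = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Current', 'Charged Off', 'Current', 'Fully Paid'],
    'next_pymnt_d': [None, 'Apr-2019', None, 'Apr-2019', None],
})

# Group the boolean null mask by status and count the True values.
nulls_by_status = demo['next_pymnt_d'].isnull().groupby(demo['loan_status']).sum()
print(nulls_by_status)
```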

#*Treatment of NaN values*

- The nulls in the **emp title, emp length and settlement** columns are meaningful, so the records cannot be dropped. Replace them with the value 'unknown'

In [21]:
# Reassign the whole frame instead of calling fillna(inplace=True) on a column,
# which raised SettingWithCopyWarning and may not modify creditos2 at all
creditos2 = creditos2.fillna({'emp_title': 'unknown',
                              'emp_length': 'unknown',
                              'settlement_status': 'unknown'})


- Dates in **next and last payment d**: replace the nulls with Dec-2000 so they can be told apart from the other dates in the series, whose values range from 2007 to 2015

In [22]:
# Reassign the frame to avoid SettingWithCopyWarning from column-level inplace fills
creditos2 = creditos2.fillna({'last_pymnt_d': 'Dec-2000',
                              'next_pymnt_d': 'Dec-2000'})


- To treat the nulls in **months since last delinq**, first check the existing values; if there are no zeros, zero can be used as the fill value

In [23]:
#creditos2.mths_since_last_delinq.value_counts().to_dict()

There are 2400 observations with zero, so zero cannot be used as the fill value; fill with -999 instead

In [24]:
# Reassign the frame to avoid SettingWithCopyWarning from a column-level inplace fill
creditos2 = creditos2.fillna({'mths_since_last_delinq': -999})
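The reasoning for choosing the sentinel can be sketched as a quick collision check: a fill value is only safe if it does not already occur as real data. The series `s` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 0 is real data (delinquent this month), NaN is missing.
s = pd.Series([0.0, 12.0, np.nan, 45.0, np.nan], name='mths_since_last_delinq')

# Zero already appears, so it cannot double as the "missing" marker.
zero_is_taken = bool((s == 0).any())

# -999 is outside the valid range (months cannot be negative) and unused.
sentinel = -999
sentinel_is_free = not bool((s == sentinel).any())

filled = s.fillna(sentinel).astype(int)
print(zero_is_taken, sentinel_is_free)
print(filled.tolist())
```

One thing to keep in mind with this approach: after the fill, any numeric comparison such as `< 6` will treat -999 as a very small number, so rules based on this column should exclude the sentinel explicitly or be computed before the fill.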


- For **tot hi cred lim**, fill NaN with the mean of the remaining values. Store the mean in a variable and then apply *fillna*

In [25]:
prom_cred_mas_alto = (creditos2['tot_hi_cred_lim'].mean()).astype(int)
prom_cred_mas_alto

178242

In [26]:
# Reassign the frame to avoid SettingWithCopyWarning from a column-level inplace fill
creditos2 = creditos2.fillna({'tot_hi_cred_lim': prom_cred_mas_alto})
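A worked example of mean imputation on a hypothetical series (`s`), to make the arithmetic explicit. `Series.mean()` skips NaN by default, so the mean is taken over the three known values only.

```python
import numpy as np
import pandas as pd

# Hypothetical credit limits with one missing value.
s = pd.Series([100000.0, np.nan, 250000.0, 190000.5], name='tot_hi_cred_lim')

# (100000 + 250000 + 190000.5) / 3 = 180000.1666..., truncated to an integer.
mean_val = int(s.mean())

filled = s.fillna(mean_val)
print(mean_val)
print(filled.tolist())
```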


The next two cells confirm that no nulls remain in the df and that the NaN values were filled correctly in the processed series

In [27]:
creditos2.isnull().any()

id_cliente                 False
loan_status                False
loan_amnt                  False
installment                False
term                       False
emp_title                  False
emp_length                 False
home_ownership             False
annual_inc                 False
verification_status        False
purpose                    False
addr_state                 False
delinq_2yrs                False
next_pymnt_d               False
earliest_cr_line           False
mths_since_last_delinq     False
total_pymnt                False
recoveries                 False
collection_recovery_fee    False
last_pymnt_d               False
settlement_status          False
application_type           False
tot_hi_cred_lim            False
risk_status                False
marca_mal_pagador          False
dtype: bool

In [28]:
creditos2.settlement_status.value_counts()

settlement_status
unknown     2227583
ACTIVE        14811
COMPLETE      13517
BROKEN         4728
Name: count, dtype: int64

Change data types to make the analysis easier

In [29]:
# Convert all three columns in one call; reassigning the frame avoids SettingWithCopyWarning
creditos2 = creditos2.astype({'annual_inc': 'int64',
                              'delinq_2yrs': 'int64',
                              'mths_since_last_delinq': 'int64'})

In [30]:
creditos2.dtypes

id_cliente                   int64
loan_status                 object
loan_amnt                    int64
installment                float64
term                        object
emp_title                   object
emp_length                  object
home_ownership              object
annual_inc                   int64
verification_status         object
purpose                     object
addr_state                  object
delinq_2yrs                  int64
next_pymnt_d                object
earliest_cr_line            object
mths_since_last_delinq       int64
total_pymnt                float64
recoveries                 float64
collection_recovery_fee    float64
last_pymnt_d                object
settlement_status           object
application_type            object
tot_hi_cred_lim            float64
risk_status                 object
marca_mal_pagador            int64
dtype: object

**Convert date columns to datetime**

The columns are overwritten with their datetime versions; the format is abbreviated month and year (%b-%Y)

In [31]:
# assign() returns a new frame, which avoids SettingWithCopyWarning
creditos2 = creditos2.assign(
    next_pymnt_d=pd.to_datetime(creditos2['next_pymnt_d'], format='%b-%Y'),
    last_pymnt_d=pd.to_datetime(creditos2['last_pymnt_d'], format='%b-%Y'),
    earliest_cr_line=pd.to_datetime(creditos2['earliest_cr_line'], format='%b-%Y'),
)
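A short sketch of what the `%b-%Y` format produces, on hypothetical strings (`s`). Since no day is given, each value is parsed as the first day of the month, and the Dec-2000 placeholder can later be recognized by comparing against that timestamp. This assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Hypothetical date strings in the same format as the dataset.
s = pd.Series(['Dec-2000', 'Mar-2019', 'Nov-2015'])

# %b = abbreviated month name, %Y = four-digit year; the day defaults to 1.
dates = pd.to_datetime(s, format='%b-%Y')

# The placeholder used for missing dates becomes 2000-12-01.
placeholder = pd.Timestamp('2000-12-01')
is_placeholder = dates == placeholder

print(dates.tolist())
print(is_placeholder.tolist())
```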

# **Feature Transformation**


**Creating categorical variables**

*Customer vintage*

After converting earliest_cr_line to datetime, sort the values and check the oldest and the most recent ones in order to define the limits of the customer vintage categories

In [32]:
creditos2.earliest_cr_line.sort_values()

596125    1933-03-01
1613074   1934-02-01
1543630   1934-04-01
1543216   1934-04-01
501898    1941-08-01
             ...    
34228     2015-11-01
6028      2015-11-01
11305     2015-11-01
26955     2015-11-01
16037     2015-11-01
Name: earliest_cr_line, Length: 2260639, dtype: datetime64[ns]

- Extract only the year from the datetime and define the bin edges for the segmentation, creating the categories very long term, long term, established, mid-term and new

In [33]:
# assign() returns a new frame, which avoids SettingWithCopyWarning;
# the stray trailing space in the 'mid-term' label is removed
creditos2 = creditos2.assign(
    vintage_cliente=pd.cut(creditos2.earliest_cr_line.dt.year,
                           [1900, 1969, 1970, 2000, 2010, 2020],
                           labels=['very long term', 'long term', 'established', 'mid-term', 'new'])
)
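To see how these edges behave, here is a sketch that applies the same bins to a few hypothetical years (`years`). `pd.cut` uses right-inclusive intervals by default, so the edges 1969 and 1970 produce the interval (1969, 1970], which means 'long term' covers only the year 1970. Whether that is intended is not stated in the notebook.

```python
import pandas as pd

# Hypothetical years chosen to sit on or near each bin edge.
years = pd.Series([1933, 1969, 1970, 1971, 2000, 2005, 2015])

bins = [1900, 1969, 1970, 2000, 2010, 2020]
labels = ['very long term', 'long term', 'established', 'mid-term', 'new']

# Intervals are (1900, 1969], (1969, 1970], (1970, 2000], (2000, 2010], (2010, 2020].
vintage = pd.cut(years, bins=bins, labels=labels)

print(pd.DataFrame({'year': years, 'vintage': vintage}))
```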


*Social Class*

Category based on the customer's income level

- It will be used to compute an average income per category and simplify the evaluation of the loan/income ratio

In [34]:
# assign() returns a new frame, which avoids SettingWithCopyWarning
creditos2 = creditos2.assign(
    soc_class=pd.cut(creditos2.annual_inc,
                     [-1, 13000, 35000, 65000, 130000, 110000000],
                     labels=['lower', 'working', 'middle', 'upper-middle', 'upper'])
)


Create a column with the average income for each category

In [35]:
# transform('mean') broadcasts each class mean back to every row of that class
creditos2 = creditos2.assign(
    avg_annual_inc=creditos2.groupby('soc_class', observed=True)['annual_inc'].transform('mean').astype(int)
)
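A small sketch of the binning plus `transform('mean')` step on hypothetical incomes (`demo`), showing that every row receives the mean of its own class rather than one row per class.

```python
import pandas as pd

# Hypothetical incomes spread over four of the five classes.
demo = pd.DataFrame({'annual_inc': [10000, 12000, 30000, 50000, 60000, 200000]})

demo['soc_class'] = pd.cut(demo['annual_inc'],
                           [-1, 13000, 35000, 65000, 130000, 110000000],
                           labels=['lower', 'working', 'middle', 'upper-middle', 'upper'])

# transform keeps the original row count, unlike agg, which returns one row per group.
demo['avg_annual_inc'] = (
    demo.groupby('soc_class', observed=True)['annual_inc'].transform('mean').astype(int)
)

print(demo)
```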


Create the loan/income ratio to check the level of debt relative to the average income of the customer's class

In [36]:
# 'loan/income' is not a valid keyword name, so it is passed to assign() through a dict
creditos2 = creditos2.assign(
    **{'loan/income': (creditos2['loan_amnt'] / creditos2['avg_annual_inc']).round(2)}
)
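A worked example of the ratio on hypothetical values (`demo`). Because the denominator is the class average rather than the individual income, two customers in the same class with the same loan amount get the same ratio.

```python
import pandas as pd

# Hypothetical loans paired with the average income of each customer's class.
demo = pd.DataFrame({'loan_amnt': [5000, 20000, 11000],
                     'avg_annual_inc': [11000, 55000, 55000]})

# 5000/11000 = 0.4545..., 20000/55000 = 0.3636..., 11000/55000 = 0.2
demo['loan/income'] = (demo['loan_amnt'] / demo['avg_annual_inc']).round(2)

print(demo)
```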


#**Correlation Analysis**