# <center>**Machine Learning for Massive Data**</center>
<center>**Instructor: Juan David Velásquez, PhD**</center>
<center>**Universidad Nacional de Colombia - Sede Medellín**</center>
<center>**Semester 2016-01**</center>

## <center>**Identifying Risky Loans Using Decision Trees**</center>
<center>**Part 2 -Python-**</center>


### <div align="right">_**Presented by:**_</div>
<div align="right">Ibeth Karina Vergara Baquero</div>
<div align="right">José Manuel Osorio Restrepo</div>
<div align="right">Christian R. Ortiz Jiménez</div>

<hr style="height:1px">

## Contents
* [6. Development using Python](#6.-Desarrollo-empleando-Python)
    * [6.1. Step 1: Data collection](#6.1.-Paso-1:-Recolección-de-datos)
    * [6.2. Step 2: Data exploration and preparation](#6.2.-Paso-2:-Exploración-y-Preparación-de-datos)
        * [6.2.1. Data pre-processing](#6.2.1.-Pre-procesamiento-de-datos)
    * [6.3. Step 3: Model training](#6.3.-Paso-3:-Entrenamiento-del-modelo)
    * [6.4. Step 4: Evaluating model performance](#6.4.-Paso-4:-Evaluación-del-desempeño-del-modelo)
    * [6.5. Step 5: Improving model performance](#6.5.-Paso-5:-Mejorar-el-desempeño-del-modelo)



## <a id="6.-Desarrollo-empleando-Python"></a>6. Development using Python

In [92]:
# Package imports
from __future__ import print_function

import os
import subprocess

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

## <a id="6.1.-Paso-1:-Recolección-de-datos"></a>6.1. Step 1: Data collection

In [93]:
# Read the file
df_credit = pd.read_csv("credit.csv")

In [94]:
# Head of the file (first 6 rows)
df_head = df_credit.head(6)
df_head

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | furniture/appliances | 1169 | unknown | > 7 years | 4 | 4 | 67 | none | own | 2 | skilled | 1 | yes | no |
| 1 | 1 - 200 DM | 48 | good | furniture/appliances | 5951 | < 100 DM | 1 - 4 years | 2 | 2 | 22 | none | own | 1 | skilled | 1 | no | yes |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 years | 2 | 3 | 49 | none | own | 1 | unskilled | 2 | no | no |
| 3 | < 0 DM | 42 | good | furniture/appliances | 7882 | < 100 DM | 4 - 7 years | 2 | 4 | 45 | none | other | 1 | skilled | 2 | no | no |
| 4 | < 0 DM | 24 | poor | car | 4870 | < 100 DM | 1 - 4 years | 3 | 4 | 53 | none | other | 2 | skilled | 2 | no | yes |
| 5 | unknown | 36 | good | education | 9055 | unknown | 1 - 4 years | 2 | 4 | 35 | none | other | 1 | unskilled | 2 | yes | no |


## <a id="6.2.-Paso-2:-Exploración-y-Preparación-de-datos"></a>6.2. Step 2: Data exploration and preparation

In [95]:
# Overall characteristics of the file
df_credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
checking_balance        1000 non-null object
months_loan_duration    1000 non-null int64
credit_history          1000 non-null object
purpose                 1000 non-null object
amount                  1000 non-null int64
savings_balance         1000 non-null object
employment_duration     1000 non-null object
percent_of_income       1000 non-null int64
years_at_residence      1000 non-null int64
age                     1000 non-null int64
other_credit            1000 non-null object
housing                 1000 non-null object
existing_loans_count    1000 non-null int64
job                     1000 non-null object
dependents              1000 non-null int64
phone                   1000 non-null object
default                 1000 non-null object
dtypes: int64(7), object(10)
memory usage: 132.9+ KB


In [96]:
# Value counts for the "checking_balance" attribute
# DM: Deutsche Marks
# Option 1
df_credit["checking_balance"].value_counts()

unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64

In [97]:
# Value counts for the "checking_balance" attribute
# DM: Deutsche Marks
# Option 2
pd.Categorical(df_credit.checking_balance).describe()

| categories | counts | freqs |
|---|---|---|
| 1 - 200 DM | 269 | 0.269 |
| < 0 DM | 274 | 0.274 |
| > 200 DM | 63 | 0.063 |
| unknown | 394 | 0.394 |
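
The two options above report counts and relative frequencies separately or through `pd.Categorical`. As a small sketch, `value_counts(normalize=True)` gives the same frequencies directly. The series below is a stand-in rebuilt from the counts reported above, not the real `credit.csv`, and the names `balance` and `summary` are illustrative only.

```python
import pandas as pd

# Stand-in series rebuilt from the counts reported above (not the real file).
balance = pd.Series(
    ["unknown"] * 394 + ["< 0 DM"] * 274 + ["1 - 200 DM"] * 269 + ["> 200 DM"] * 63,
    name="checking_balance",
)

# Absolute counts and relative frequencies, aligned on the category label.
summary = pd.DataFrame({
    "counts": balance.value_counts(),
    "freqs": balance.value_counts(normalize=True),
})
print(summary)
```
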


In [98]:
# Value counts for the "savings_balance" attribute
# Option 1
df_credit["savings_balance"].value_counts()

< 100 DM         603
unknown          183
100 - 500 DM     103
500 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64

In [99]:
# Value counts for the "savings_balance" attribute
# Option 2
pd.Categorical(df_credit.savings_balance).describe()

| categories | counts | freqs |
|---|---|---|
| 100 - 500 DM | 103 | 0.103 |
| 500 - 1000 DM | 63 | 0.063 |
| < 100 DM | 603 | 0.603 |
| > 1000 DM | 48 | 0.048 |
| unknown | 183 | 0.183 |


In [100]:
# Summary statistics for the "months_loan_duration" attribute
df_credit.months_loan_duration.describe()

count    1000.000000
mean       20.903000
std        12.058814
min         4.000000
25%        12.000000
50%        18.000000
75%        24.000000
max        72.000000
Name: months_loan_duration, dtype: float64

In [101]:
# Summary statistics for the "amount" attribute
df_credit.amount.describe()

count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64

In [102]:
# Counts for the "default" attribute
# Option 1
df_credit["default"].value_counts()

no     700
yes    300
Name: default, dtype: int64

In [103]:
# Counts for the "default" attribute
# Option 2
pd.Categorical(df_credit.default).describe()

| categories | counts | freqs |
|---|---|---|
| no | 700 | 0.7 |
| yes | 300 | 0.3 |


## <a id="6.2.1.-Pre-procesamiento-de-datos"></a>6.2.1. Data pre-processing

To build a decision tree in Python with scikit-learn, two main inputs must be defined: the features (attributes) and the target class. Here the goal is to determine whether a loan is risky (it defaults) or not, so the target is the `default` column and the features are the other 16 columns.

In [197]:
# Feature selection (first 16 columns of the "df_credit" DataFrame)
# df_credit.drop() removes the specified columns, in this case the 17th.
# Remember that indexing is zero-based, so index 16 refers to column 17.
X_Atrib = df_credit.drop(df_credit.columns[[16]], axis=1)
X_Atrib.head()

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | furniture/appliances | 1169 | unknown | > 7 years | 4 | 4 | 67 | none | own | 2 | skilled | 1 | yes |
| 1 | 1 - 200 DM | 48 | good | furniture/appliances | 5951 | < 100 DM | 1 - 4 years | 2 | 2 | 22 | none | own | 1 | skilled | 1 | no |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 years | 2 | 3 | 49 | none | own | 1 | unskilled | 2 | no |
| 3 | < 0 DM | 42 | good | furniture/appliances | 7882 | < 100 DM | 4 - 7 years | 2 | 4 | 45 | none | other | 1 | skilled | 2 | no |
| 4 | < 0 DM | 24 | poor | car | 4870 | < 100 DM | 1 - 4 years | 3 | 4 | 53 | none | other | 2 | skilled | 2 | no |


In [105]:
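# Selection of the target class
# Here columns 1 to 16 are dropped, using a NumPy array of positions.
# The array indexes df_credit.columns directly; an extra list around it yields a 2-D indexer, which recent pandas versions may reject.
Y_Clases = df_credit.drop(df_credit.columns[np.arange(0, 16)], axis=1)
Y_Clases.head()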

| | default |
|---|---|
| 0 | no |
| 1 | yes |
| 2 | no |
| 3 | no |
| 4 | yes |


Note: scikit-learn estimators only accept numeric data, so the categorical variables must be transformed first.

In [106]:
# The "no" categories are replaced by zero (0) and the "yes" categories by one (1).
Y_Clases_r = Y_Clases.default.replace({"no": 0, "yes": 1})
Y_Clases_r.head()

0    0
1    1
2    0
3    0
4    1
Name: default, dtype: int64
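
The selection above works by column position. As an alternative sketch, the same split can be written by column name, which keeps working if the column order changes. The three-row `toy` frame is a stand-in built from a few values of the head shown earlier, and `.map` is used here instead of `.replace` purely as an illustrative choice.

```python
import pandas as pd

# Stand-in for df_credit: three rows and a subset of columns from the head above.
toy = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "unknown"],
    "amount": [1169, 5951, 2096],
    "default": ["no", "yes", "no"],
})

# Features: every column except the target, selected by name.
X = toy.drop(columns=["default"])

# Target: the "default" column, encoded as 0 (no) / 1 (yes).
y = toy["default"].map({"no": 0, "yes": 1})

print(X.columns.tolist())
print(y.tolist())
```
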

In [107]:
# For variables with more than a simple yes/no answer, one option is to use dummy (one-hot) variables
X_Atrib_r = pd.get_dummies(X_Atrib)
X_Atrib_r.head()

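| | months_loan_duration | amount | percent_of_income | years_at_residence | age | existing_loans_count | dependents | checking_balance_1 - 200 DM | checking_balance_< 0 DM | checking_balance_> 200 DM | ... | other_credit_store | housing_other | housing_own | housing_rent | job_management | job_skilled | job_unemployed | job_unskilled | phone_no | phone_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1169 | 4 | 4 | 67 | 2 | 1 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 48 | 5951 | 2 | 2 | 22 | 1 | 1 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 12 | 2096 | 2 | 3 | 49 | 1 | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 42 | 7882 | 2 | 4 | 45 | 1 | 2 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 24 | 4870 | 3 | 4 | 53 | 2 | 2 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |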
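
To make the dummy encoding easier to follow, here is a minimal sketch on a three-row stand-in frame (the values are taken from the head shown earlier; the names `toy`, `encoded` and `encoded_compact` are illustrative). Numeric columns pass through unchanged, and each categorical column becomes one indicator column per category. The optional `drop_first=True` variant, which the notebook does not use, drops one redundant indicator per variable.

```python
import pandas as pd

# Stand-in frame: one numeric column and two categorical ones.
toy = pd.DataFrame({
    "amount": [1169, 5951, 2096],
    "housing": ["own", "own", "other"],
    "phone": ["yes", "no", "no"],
})

# One indicator column per category; "amount" is left as it is.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Optional variant: drop the first category of each variable,
# since phone_no and phone_yes carry the same information.
encoded_compact = pd.get_dummies(toy, drop_first=True)
print(encoded_compact.columns.tolist())
```
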


Next, the specific packages needed to build the tree are imported:

In [108]:
# Some scikit-learn submodules are imported in order to:
from sklearn.model_selection import train_test_split  # Split the data into training and test sets (sklearn.cross_validation was removed in later releases)
from sklearn.tree import DecisionTreeClassifier       # Build the tree

*`train_test_split`* does two things here:
 * It splits the data into training and test sets (900 and 100 rows respectively).
 * Its `random_state` argument fixes the seed of the pseudo-random shuffle, so the split, and therefore this exercise, can be reproduced.

In [109]:
# A single statement defines both the training and the test data:
X_train, X_test, Y_train, Y_test = train_test_split(X_Atrib_r, Y_Clases_r, test_size=0.1, random_state=123)
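
The sketch below checks the two properties of the split on synthetic data with the same shape of problem (1000 rows, 700/300 classes): the 900/100 sizes and the reproducibility given a fixed `random_state`. The arrays `X` and `y` are placeholders, not the credit data. The `stratify=y` call at the end is an optional variant that the notebook does not use; it keeps the 70/30 class ratio in the test set, whereas the notebook's plain split ends up with 34 defaults among its 100 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 2 features, 700 "no" (0) and 300 "yes" (1).
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 700 + [1] * 300)

# Same arguments as in the notebook: 10% test set, fixed seed.
split_a = train_test_split(X, y, test_size=0.1, random_state=123)
split_b = train_test_split(X, y, test_size=0.1, random_state=123)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# With the same random_state, both calls return identical partitions.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# Optional variant (not in the notebook): preserve the class ratio in both parts.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.1, random_state=123, stratify=y
)
print(y_test_strat.mean())
```
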

In [110]:
# Head of the test data (attributes, i.e. candidate features)
X_test.head()

| | months_loan_duration | amount | percent_of_income | years_at_residence | age | existing_loans_count | dependents | checking_balance_1 - 200 DM | checking_balance_< 0 DM | checking_balance_> 200 DM | ... | other_credit_store | housing_other | housing_own | housing_rent | job_management | job_skilled | job_unemployed | job_unskilled | phone_no | phone_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 131 | 36 | 6887 | 4 | 3 | 29 | 1 | 1 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 203 | 12 | 902 | 4 | 4 | 21 | 1 | 1 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50 | 24 | 2333 | 4 | 2 | 29 | 1 | 1 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 585 | 18 | 2039 | 1 | 4 | 20 | 1 | 1 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 138 | 15 | 2728 | 4 | 2 | 35 | 3 | 1 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [111]:
# Head of the test data (outputs, i.e. the class)
Y_test.head()

131    1
203    1
50     0
585    1
138    0
Name: default, dtype: int64

## <a id="6.3.-Paso-3:-Entrenamiento-del-modelo"></a>6.3. Step 3: Model training

In [200]:
# Tree model specification
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=123)

In [201]:
# Fit the tree on the training data
Arbol = tree.fit(X_train, Y_train)
Arbol

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best')
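
With `criterion='entropy'`, candidate splits are scored by how much they reduce the Shannon entropy of the class labels (information gain). As a rough illustration only, the sketch below computes the entropy of the 700/300 class counts reported for the full dataset; the tree itself is fitted on the 900 training rows, whose exact class counts are not shown here. The child counts of the example split are hypothetical and the helper names are illustrative.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Class counts of the full dataset: 700 "no" and 300 "yes".
root_entropy = entropy([700, 300])
print(round(root_entropy, 4))  # 0.8813

# Hypothetical split of those 1000 rows into two nodes (counts are made up).
gain = information_gain([700, 300], [[400, 60], [300, 240]])
print(round(gain, 4))
```
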

In [202]:
# Feature importances
# DataFrame.sort() was deprecated and later removed from pandas; sort_values() replaces it
pd.DataFrame(tree.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values("Imp", ascending=False)


| | Imp |
|---|---|
| checking_balance_unknown | 0.429509 |
| months_loan_duration | 0.113746 |
| age | 0.107053 |
| amount | 0.086868 |
| other_credit_none | 0.080336 |
| credit_history_very good | 0.048271 |
| purpose_furniture/appliances | 0.039873 |
| savings_balance_< 100 DM | 0.038806 |
| checking_balance_> 200 DM | 0.038488 |
| percent_of_income | 0.017051 |
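
Because the categorical variables were expanded into dummy columns, one original attribute can appear several times in the table above. One possible way to read the result, sketched here under that assumption, is to add up the importances per original attribute. The values are copied from the table, the column list comes from the `df_credit.info()` output, and the helper `original_name` is hypothetical.

```python
import pandas as pd

# Importances copied from the table above.
importances = pd.Series({
    "checking_balance_unknown": 0.429509,
    "months_loan_duration": 0.113746,
    "age": 0.107053,
    "amount": 0.086868,
    "other_credit_none": 0.080336,
    "credit_history_very good": 0.048271,
    "purpose_furniture/appliances": 0.039873,
    "savings_balance_< 100 DM": 0.038806,
    "checking_balance_> 200 DM": 0.038488,
    "percent_of_income": 0.017051,
})

# Original attribute names, as listed by df_credit.info() (target excluded).
original_columns = [
    "checking_balance", "months_loan_duration", "credit_history", "purpose",
    "amount", "savings_balance", "employment_duration", "percent_of_income",
    "years_at_residence", "age", "other_credit", "housing",
    "existing_loans_count", "job", "dependents", "phone",
]

def original_name(dummy_name):
    """Map a dummy column such as 'housing_own' back to 'housing'."""
    matches = [c for c in original_columns
               if dummy_name == c or dummy_name.startswith(c + "_")]
    return max(matches, key=len)

# Sum the importances of all dummy columns that share an original attribute.
grouped = importances.groupby(original_name).sum().sort_values(ascending=False)
print(grouped)
```
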


## <a id="6.4.-Paso-4:-Evaluación-del-desempeño-del-modelo"></a>6.4. Step 4: Evaluating model performance

In [204]:
# Predicted classes for the test data, using the fitted tree
Arbol.predict(X_test)

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1])

In [177]:
# Some metrics are imported to evaluate performance
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import classification_report

In [178]:
# Mean absolute error (with 0/1 labels this equals the misclassification rate, 1 - accuracy)
mean_absolute_error(Y_test, Arbol.predict(X_test))

0.23999999999999999

In [189]:
# Confusion matrix (rows: true class, columns: predicted class)
#labels=["no", "yes"]
confusion_matrix(Y_test, Arbol.predict(X_test))

array([[58,  8],
       [16, 18]])

In [167]:
# Prediction accuracy
accuracy_score(Y_test, Arbol.predict(X_test))

0.76000000000000001

In [192]:
# Classification report
target_names = ['class 0 -n', 'class 1 -y']
print(classification_report(Y_test, Arbol.predict(X_test),target_names=target_names))

             precision    recall  f1-score   support

 class 0 -n       0.78      0.88      0.83        66
 class 1 -y       0.69      0.53      0.60        34

avg / total       0.75      0.76      0.75       100
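
The figures in the report can be traced back to the confusion matrix shown earlier. The sketch below recomputes them by hand from its four cells (rows are true classes, columns are predicted classes, following scikit-learn's convention); the variable names are illustrative.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = [[58, 8],
      [16, 18]]

tn, fp = cm[0]  # true "no":  58 predicted "no", 8 predicted "yes"
fn, tp = cm[1]  # true "yes": 16 predicted "no", 18 predicted "yes"
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # 76 / 100
error_rate = (fp + fn) / total    # equals the mean absolute error for 0/1 labels

# Class 1 ("yes", the loan defaults)
precision_1 = tp / (tp + fp)      # 18 / 26
recall_1 = tp / (tp + fn)         # 18 / 34
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 ("no")
precision_0 = tn / (tn + fn)      # 58 / 74
recall_0 = tn / (tn + fp)         # 58 / 66
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(accuracy, 2), round(error_rate, 2))
print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

Read this way, the model recovers 18 of the 34 risky loans in the test set (recall 0.53 for class 1), which is the figure that matters most when the goal is to flag defaults.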



___

**<div align="right">To return to part 1, go to:  [ADM_Arboles de Decisión - R - [1 de 2]](ADM_Arboles de Decisión - R - [1 de 2].ipynb)</div>**