Classification
---------------

The goal of this exercise is to predict whether a person's income exceeds $50,000 (variable `income`).

Files: `census_train.csv`, `census_test.csv`

This dataset is a modified version of the one used in the paper ["Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid"](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf) by Ron Kohavi. The original data can be found in the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).

Thirteen features are available for the prediction:


| Variable | Description | Values |
|-----|-------|-------|
| **age** | Age | numeric |
| **workclass** | Type of employment | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| **education_level** | Education level | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
| **education-num** | Number of years of education completed | numeric |
| **marital-status** | Marital status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
| **occupation** | Occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
| **relationship** | Family relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
| **race** | Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
| **sex** | Gender | Female, Male |
| **capital-gain** | Capital gains | numeric |
| **capital-loss** | Capital losses | numeric |
| **hours-per-week** | Average hours worked per week | numeric |
| **native-country** | Country of origin | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holand-Netherlands |
| **income** | Income (target variable) | `<=50K`, `>50K` |




# Loading libraries and helper functions
First we load the libraries we will work with and the helper functions we need.

In [39]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline

cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing

In [40]:
def plot_confusion_matrix(confmat):
    # Draw the confusion matrix as a shaded grid and write each count in its cell
    fig, ax = plt.subplots(figsize=(7, 7))
    ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.5)
    for i in range(confmat.shape[0]):
        for j in range(confmat.shape[1]):
            ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')

    plt.xlabel('predicted label')
    plt.ylabel('true label')

    plt.tight_layout()
    plt.show()
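
The helper above expects a two-dimensional array of counts. As a minimal sketch of how such an array could be produced, the example below uses `confusion_matrix` from `sklearn.metrics` on a handful of invented labels (`y_true` and `y_pred` are illustrative and do not come from the census data). The notebook section shown here does not yet train a model, so this only illustrates the expected input.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration (0 = "<=50K", 1 = ">50K")
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

# Rows correspond to the true class, columns to the predicted class
confmat = confusion_matrix(y_true, y_pred)
print(confmat)

# The result can then be passed to the helper defined above:
# plot_confusion_matrix(confmat)
```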

# Loading and transforming the data

In [47]:
data = pd.read_csv('./data/census_train.csv',sep=',', decimal='.')
data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,49,Private,Masters,14.0,Divorced,Sales,Unmarried,Other,Female,0.0,0.0,20.0,Peru,<=50K
1,43,Private,Assoc-acdm,12.0,Divorced,Craft-repair,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
2,53,Private,Doctorate,16.0,Married-civ-spouse,Prof-specialty,Wife,White,Female,99999.0,0.0,37.0,United-States,>50K
3,23,Private,HS-grad,9.0,Married-civ-spouse,Adm-clerical,Wife,White,Female,3908.0,0.0,40.0,United-States,<=50K
4,32,Private,Some-college,10.0,Divorced,Handlers-cleaners,Unmarried,Black,Male,0.0,0.0,40.0,Nicaragua,<=50K


We encode the categorical variables as integers so that we have more options when choosing the model's algorithm.
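
As a small illustration of what this encoding does, the sketch below fits a `LabelEncoder` on a few values from the `sex` column. The list of values is a toy sample, not the real data. The encoder sorts the distinct categories and assigns each one an integer code, and `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of values from the 'sex' column (illustrative only)
le = LabelEncoder()
codes = le.fit_transform(['Male', 'Female', 'Female', 'Male'])

print(list(le.classes_))                    # distinct categories, sorted
print(list(codes))                          # integer code of each input value
print(list(le.inverse_transform([0, 1])))   # map codes back to labels
```

Note that the integer codes carry an arbitrary order. Tree-based models usually tolerate this, while linear or distance-based models may work better with a one-hot encoding.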

In [48]:
# Encoding of the categorical variables: one LabelEncoder per column

wrk_LE = preprocessing.LabelEncoder() # for workclass
edu_LE = preprocessing.LabelEncoder() # for education_level
mar_LE = preprocessing.LabelEncoder() # for marital-status
ocu_LE = preprocessing.LabelEncoder() # for occupation
rel_LE = preprocessing.LabelEncoder() # for relationship
rac_LE = preprocessing.LabelEncoder() # for race
sex_LE = preprocessing.LabelEncoder() # for sex
nat_LE = preprocessing.LabelEncoder() # for native-country
inc_LE = preprocessing.LabelEncoder() # for income

wrk_LE.fit(data['workclass'])
edu_LE.fit(data['education_level'])
mar_LE.fit(data['marital-status'])
ocu_LE.fit(data['occupation'])
rel_LE.fit(data['relationship'])
rac_LE.fit(data['race'])
sex_LE.fit(data['sex'])
nat_LE.fit(data['native-country'])
inc_LE.fit(data['income'])

data['workclass']       = wrk_LE.transform(data['workclass'])
data['education_level'] = edu_LE.transform(data['education_level'])
data['marital-status']  = mar_LE.transform(data['marital-status'])
data['occupation']      = ocu_LE.transform(data['occupation'])
data['relationship']    = rel_LE.transform(data['relationship'])
data['race']            = rac_LE.transform(data['race'])
data['sex']             = sex_LE.transform(data['sex'])
data['native-country']  = nat_LE.transform(data['native-country'])
data['income']          = inc_LE.transform(data['income'])

# Convert the float columns to integers
float_col = data.select_dtypes(include = ['float64']) # Select the float columns
for col in float_col.columns.values:
    data[col] = data[col].astype('int64')       # Cast to int64
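
The same steps can be written more compactly by keeping the fitted encoders in a dictionary, which also makes it easy to apply the identical mapping to a second file such as `census_test.csv`. The sketch below is an assumption about how that could look, not the notebook's own code: the frames `train` and `test`, the list `cat_cols`, and the dictionary `encoders` are hypothetical names, and the values are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for the train and test files (values invented)
train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private'],
                      'sex': ['Male', 'Female', 'Female'],
                      'hours-per-week': [40.0, 20.0, 37.0]})
test = pd.DataFrame({'workclass': ['State-gov', 'Private'],
                     'sex': ['Female', 'Male'],
                     'hours-per-week': [50.0, 40.0]})

cat_cols = ['workclass', 'sex']  # hypothetical list of categorical columns

# Fit one encoder per column on the training data only and keep it for reuse
encoders = {col: LabelEncoder().fit(train[col]) for col in cat_cols}

for col, le in encoders.items():
    train[col] = le.transform(train[col])
    # transform raises ValueError if the test data holds an unseen category
    test[col] = le.transform(test[col])

# Cast the float columns to integers, as in the cell above
float_cols = train.select_dtypes(include=['float64']).columns
train = train.astype({col: 'int64' for col in float_cols})
test = test.astype({col: 'int64' for col in float_cols})

print(train)
print(test)
```

Fitting on the training data and only transforming the test data keeps the integer codes consistent between the two files.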


In [49]:
data.dtypes

age                int64
workclass          int64
education_level    int64
education-num      int64
marital-status     int64
occupation         int64
relationship       int64
race               int64
sex                int64
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country     int64
income             int64
dtype: object

In [50]:
data.isnull().any()

age                False
workclass          False
education_level    False
education-num      False
marital-status     False
occupation         False
relationship       False
race               False
sex                False
capital-gain       False
capital-loss       False
hours-per-week     False
native-country     False
income             False
dtype: bool

In [51]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,36177.0,38.598751,13.229011,17.0,28.0,37.0,47.0,90.0
workclass,36177.0,2.205877,0.963834,0.0,2.0,2.0,2.0,6.0
education_level,36177.0,10.314786,3.824565,0.0,9.0,11.0,12.0,15.0
education-num,36177.0,10.113663,2.553911,1.0,9.0,10.0,13.0,16.0
marital-status,36177.0,2.579678,1.498088,0.0,2.0,2.0,4.0,6.0
occupation,36177.0,5.964701,4.026909,0.0,2.0,6.0,9.0,13.0
relationship,36177.0,1.408519,1.596411,0.0,0.0,1.0,3.0,5.0
race,36177.0,3.681842,0.830472,0.0,4.0,4.0,4.0,4.0
sex,36177.0,0.677088,0.467596,0.0,0.0,1.0,1.0,1.0
capital-gain,36177.0,1088.861735,7506.099972,0.0,0.0,0.0,0.0,99999.0


In [38]:
data.head(10)

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,49,2,12,14.0,0,11,4,3,0,0.0,0.0,20.0,28,0
1,43,2,7,12.0,0,2,1,4,1,0.0,0.0,40.0,38,0
2,53,2,10,16.0,2,9,5,4,0,99999.0,0.0,37.0,38,1
3,23,2,11,9.0,2,0,5,4,0,3908.0,0.0,40.0,38,0
4,32,2,15,10.0,0,5,4,2,1,0.0,0.0,40.0,26,0
5,29,0,15,10.0,3,0,3,4,0,0.0,0.0,40.0,38,0
6,51,2,15,10.0,4,0,3,4,0,0.0,0.0,40.0,38,0
7,33,3,11,9.0,2,3,0,4,1,0.0,0.0,50.0,38,0
8,35,1,12,14.0,2,9,0,4,1,0.0,0.0,40.0,38,0
9,39,2,15,10.0,2,11,0,4,1,0.0,0.0,40.0,38,0
