## DATA PROCESSING

This code is intended to:

    1- Fill in missing data.
    2- Process temporal variables (same as in the previous script).
    3- Transform continuous variables (same as in the previous script).
    4- Regroup infrequent labels.
    5- Recode categorical variables.
    6- Standardize (rescale) the variables.
    
<u><b>Important note:</b></u>
Set the seed in every operation that introduces randomness, so that the processing is <b>reproducible</b>.
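
As a minimal sketch of what fixing the seed buys (the toy frame and the names `toy`, `sample_a`, `sample_b` are illustrative assumptions, not part of this notebook), drawing the same random sample twice with a fixed `random_state` should return identical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"x": np.arange(10)})

# Same seed -> same rows, in the same order
sample_a = toy.sample(frac=0.5, random_state=0)
sample_b = toy.sample(frac=0.5, random_state=0)
same_with_seed = sample_a.index.equals(sample_b.index)

# NumPy generators behave the same way when seeded
draw_a = np.random.default_rng(0).normal(size=3)
draw_b = np.random.default_rng(0).normal(size=3)
same_draws = np.array_equal(draw_a, draw_b)

print(same_with_seed, same_draws)
```

Without `random_state`, each call may select different rows, so any result that depends on the sample would change between runs.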

In [1]:
# Data handling
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Plotting
import matplotlib.pyplot as plt

# Splitting the dataset: training and test
from sklearn.model_selection import train_test_split

# Display every column of the dataset
pd.set_option('display.max_columns', None)

# Silence warnings
import warnings
warnings.simplefilter(action='ignore')

In [2]:
# Load the data, the same file as in the previous script
data = pd.read_csv('houseprice.csv')
#print(data.shape)
#data.head()

## 0. Splitting the data into a training set and a test set

<b>The splitting algorithm requires a seed.</b>

In [3]:
# The full frame (including SalePrice) is passed as X because
# later steps need the target column inside X_train and X_test.
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.1,
                                                    random_state=0)  # SEED!

print('Training set shape: ', X_train.shape)
print('Test set shape: ', X_test.shape)

Training set shape:  (1313, 81)
Test set shape:  (146, 81)


## 1. Missing data

The process is split into categorical variables and numerical variables.

### 1.1. Categorical variables
Decision: missing values are replaced by the label "Missing".
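
A small sketch of the idea on made-up data (the column values below are hypothetical): the missing entries become one more category, so no rows are dropped and the absence itself stays visible to later steps.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"Alley": ["Grvl", np.nan, "Pave", np.nan]})

# Missing values become their own category instead of being dropped
toy["Alley"] = toy["Alley"].fillna("Missing")
counts = toy["Alley"].value_counts()
print(counts)
```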

In [4]:
# List of categorical variables with missing values in the training set

vars_with_na = [
    var for var in data.columns
    if X_train[var].isnull().sum() > 0 and X_train[var].dtypes == 'O'
]

# Proportion of missing data
#X_train[vars_with_na].isnull().mean()

In [5]:
# Replacement value: "Missing"

X_train[vars_with_na] = X_train[vars_with_na].fillna('Missing')
X_test[vars_with_na] = X_test[vars_with_na].fillna('Missing')

#### Checks

In [6]:
# C-1: no missing values remain in the processed variables (training set)

print (
    'C-1: Number of missing values in the processed variables:',
    X_train[vars_with_na].isnull().sum().sum()
)

# List and count per variable
# X_train[vars_with_na].isnull().sum()

C-1: Number of missing values in the processed variables: 0


In [7]:
# C-2: no processed variable still has missing values in the test set

print (
    'C-2: Processed variables with missing values in the test set:',
    [var for var in vars_with_na if X_test[var].isnull().sum() > 0]
)

C-2: Processed variables with missing values in the test set: []


### 1.2. Numerical variables

Decision:

    - A binary indicator variable is added: 1 if the value is missing, 0 otherwise
    - The missing value is filled with the mode of the variable in the training set
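
The two decisions can be sketched on toy data as follows (the frames and values are made up for illustration). The points to note are the order, flag first and fill afterwards, and that the mode is learned from the training set and reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test columns
train = pd.DataFrame({"LotFrontage": [60.0, 60.0, 80.0, np.nan]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# The fill value is learned from the training set only
mode_val = train["LotFrontage"].mode()[0]

for df in (train, test):
    # Flag first, then fill; otherwise the flag would always be 0
    df["LotFrontage_na"] = df["LotFrontage"].isnull().astype(int)
    df["LotFrontage"] = df["LotFrontage"].fillna(mode_val)

print(test)
```

Computing the mode on the test set instead would leak information from data that is supposed to stay unseen.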


In [8]:
# List of numerical variables with missing values in the training set
vars_with_na = [
    var for var in data.columns
    if X_train[var].isnull().sum() > 0 and X_train[var].dtypes != 'O'
]

# Proportion of missing data
# X_train[vars_with_na].isnull().mean()

In [9]:
# Replacement

for var in vars_with_na:

    # Mode computed on the training set
    mode_val = X_train[var].mode()[0]

    # Binary indicator variables (created before filling)
    X_train[var+'_na'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var+'_na'] = np.where(X_test[var].isnull(), 1, 0)

    # Fill the missing values with the training-set mode
    X_train[var] = X_train[var].fillna(mode_val)
    X_test[var] = X_test[var].fillna(mode_val)

#### Checks

In [10]:
# C-1: no missing values remain in the processed variables (training set)

print (
    'C-1: Number of missing values in the processed variables:',
    X_train[vars_with_na].isnull().sum().sum()
)

# List and count per variable
# X_train[vars_with_na].isnull().sum()

C-1: Number of missing values in the processed variables: 0


In [11]:
# C-2: no processed variable still has missing values in the test set

print (
    'C-2: Processed variables with missing values in the test set:',
    [var for var in vars_with_na if X_test[var].isnull().sum() > 0]
)

C-2: Processed variables with missing values in the test set: []


In [12]:
# C-3: binary indicator variables created

X_train[['LotFrontage_na', 'MasVnrArea_na', 'GarageYrBlt_na']].head()

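|      | LotFrontage_na | MasVnrArea_na | GarageYrBlt_na |
|------|----------------|---------------|----------------|
| 45   | 1              | 0             | 0              |
| 1347 | 1              | 0             | 0              |
| 55   | 0              | 0             | 0              |
| 381  | 0              | 0             | 0              |
| 776  | 0              | 0             | 0              |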


## 2. Temporal variables

Decision: each temporal variable is replaced by the number of years elapsed between it and the year of sale.
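
For illustration only (the two rows below are invented), the transformation turns a calendar year into an age at the time of sale:

```python
import pandas as pd

# Hypothetical rows: a house built in 1998 and sold in 2008,
# and a house built and sold in the same year
toy = pd.DataFrame({"YrSold": [2008, 2010], "YearBuilt": [1998, 2010]})

# Age of the house at the time of sale, in years
toy["YearBuilt"] = toy["YrSold"] - toy["YearBuilt"]
print(toy["YearBuilt"].tolist())
```

The column keeps its original name, so after this step `YearBuilt` holds an age, not a year.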

In [13]:
def elapsed_years(df, var):
    # Replaces var with the number of years between var and the year of sale
    
    df[var] = df['YrSold'] - df[var]
    return df

In [14]:
# Apply the function
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

## 3. Continuous variables

Decision: a logarithmic transformation is applied.
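
A brief sketch of the transformation on invented values. It assumes, as the notebook implicitly does, that the variables are strictly positive: `np.log` returns `-inf` for zero and `NaN` for negative inputs, so a positivity check beforehand is a reasonable precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly right-skewed values
toy = pd.DataFrame({"LotArea": [1000.0, 10000.0, 100000.0]})

# np.log is only finite for strictly positive inputs
all_positive = bool((toy["LotArea"] > 0).all())

toy["LotArea_log"] = np.log(toy["LotArea"])

# Equal ratios in the raw values become equal gaps after the log
gaps = toy["LotArea_log"].diff().dropna()
print(all_positive, gaps.round(4).tolist())
```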

In [15]:
for var in ['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']:
    X_train[var] = np.log(X_train[var])
    X_test[var] = np.log(X_test[var])

#### Checks

In [16]:
print (

    # C-1: the processed variables have no missing values in the training set
    'Variables with missing values, training set:',

    [var for var in ['LotFrontage', 'LotArea', '1stFlrSF',
                 'GrLivArea', 'SalePrice'] if X_train[var].isnull().sum() > 0],
    # C-2: the processed variables have no missing values in the test set
    'Variables with missing values, test set:',

    [var for var in ['LotFrontage', 'LotArea', '1stFlrSF',
                 'GrLivArea', 'SalePrice'] if X_test[var].isnull().sum() > 0]

)

Variables with missing values, training set: [] Variables with missing values, test set: []


## 4. Categorical variables

### 4.1. Infrequent categories

Decision: within each variable, the categories that account for less than 1% of the cases are grouped into a new category called "Rare".
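
The grouping rule can be sketched on a made-up column as follows (labels and counts are hypothetical; `value_counts(normalize=True)` is used here as a compact way to get relative frequencies):

```python
import numpy as np
import pandas as pd

# Hypothetical column: 'C' appears once in 200 rows (0.5%)
toy = pd.DataFrame({"Street": ["A"] * 150 + ["B"] * 49 + ["C"]})

# Relative frequency of each label
freq = toy["Street"].value_counts(normalize=True)
frequent = freq[freq > 0.01].index

# Keep frequent labels, send everything else to 'Rare'
toy["Street"] = np.where(toy["Street"].isin(frequent), toy["Street"], "Rare")
print(toy["Street"].value_counts().to_dict())
```

Labels that never appear in the training set also end up as "Rare" when the same list of frequent labels is applied to the test set.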

In [17]:
# List of categorical variables

cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

In [18]:
def find_frequent_labels(df, var, rare_perc):
    
    # Returns the labels of var whose relative frequency exceeds rare_perc
    
    df = df.copy()
    tmp = df.groupby(var)['SalePrice'].count() / len(df)
    return tmp[tmp > rare_perc].index

In [19]:
# Apply the function
for var in cat_vars:
    
    # Identify the frequent categories (learned on the training set)
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    
    # Replace every other category with "Rare"
    X_train[var] = np.where(X_train[var].isin(
        frequent_ls), X_train[var], 'Rare')
    
    X_test[var] = np.where(X_test[var].isin(
        frequent_ls), X_test[var], 'Rare')

### 4.2. Encoding

Each label is replaced by an integer according to its relationship with the target variable: categories with a higher mean sale price (log-transformed at this point) receive a larger value.
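
As an illustration on invented data (the column `Zone` and its prices are hypothetical), the encoding ranks the labels by their mean target value and uses the rank as the code:

```python
import pandas as pd

# Hypothetical training data
train = pd.DataFrame({
    "Zone": ["a", "a", "b", "b", "c"],
    "SalePrice": [100.0, 120.0, 300.0, 320.0, 200.0],
})

# Labels sorted by mean target: a (110) < c (200) < b (310)
ordered = train.groupby("Zone")["SalePrice"].mean().sort_values().index
mapping = {label: rank for rank, label in enumerate(ordered)}

train["Zone"] = train["Zone"].map(mapping)
print(mapping)
```

Because the mapping is built from the training set only, a label that exists only in the test set has no entry in the dictionary.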

In [20]:
def replace_categories(train, test, var, target):
    # Assigns an integer to each label of a string variable, so that
    # the smallest value corresponds to the category with the lowest
    # mean target value (computed on the training set)
    
    ordered_labels = train.groupby([var])[target].mean().sort_values().index

    # Build the dictionary of ordered categories
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

    # Use the dictionary to replace categories with integers.
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)

In [21]:
# Apply the function:
for var in cat_vars:
    replace_categories(X_train, X_test, var, 'SalePrice')

#### Checks

In [30]:
print (

    # C-1: no variable has missing values in the training set
    'Variables with missing values, training set:',
    [var for var in X_train.columns if X_train[var].isnull().sum() > 0],
    
    # C-2: no variable has missing values in the test set
    'Variables with missing values, test set:',
    [var for var in X_test.columns if X_test[var].isnull().sum() > 0] 

)

# A NaN slips into 'Fence' in the test set here; this needs to be reviewed.

Variables with missing values, training set: [] Variables with missing values, test set: ['Fence']
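
One plausible explanation, offered as an assumption and not verified against the data: if every `Fence` label is frequent in the training set, no "Rare" key exists in the training mapping, while an infrequent or unseen test label is still rewritten to "Rare" and then has nothing to map to. The toy example below reproduces that mechanism; the labels, the mapping, and the `-1` fallback code are all illustrative choices.

```python
import pandas as pd

# Hypothetical mapping learned on a training column with no 'Rare' label
mapping = {"Missing": 0, "MnPrv": 1, "GdPrv": 2}

# Hypothetical test column where one label was regrouped as 'Rare'
test_col = pd.Series(["Missing", "Rare", "GdPrv"])

# map() returns NaN for any key absent from the dictionary
encoded = test_col.map(mapping)
n_missing = int(encoded.isnull().sum())

# Possible workaround: give unseen labels an explicit fallback code
encoded_filled = encoded.fillna(-1).astype(int)
print(n_missing, encoded_filled.tolist())
```

Whether a fallback code, the most frequent training label, or another strategy is appropriate depends on the model used downstream.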


## 5. Normalization
Decision: the variables are rescaled between their minimum and maximum values.
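
With its default settings, `MinMaxScaler` maps each variable to `(x - min) / (max - min)`, with `min` and `max` taken from the data it was fitted on. A plain NumPy sketch on made-up numbers shows the consequence of fitting on the training set only:

```python
import numpy as np

# Hypothetical values of one variable
train = np.array([10.0, 20.0, 30.0])
test = np.array([15.0, 40.0])

# Parameters are learned from the training data only
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled, test_scaled)
```

The training values land exactly in [0, 1], whereas a test value beyond the training range (40 here) is mapped outside that interval.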

In [23]:
# List of variables, excluding the identifier and the target

train_vars = [var for var in X_train.columns if var not in ['Id', 'SalePrice']]

# Number of variables
print ('Number of variables:',
       len(train_vars))

Number of variables: 90


In [24]:
# Scaler
scaler = MinMaxScaler()

# Fit the scaler on the training set
scaler.fit(X_train[train_vars]) 

# Rescale the variables in the training set and the test set
X_train[train_vars] = scaler.transform(X_train[train_vars])
X_test[train_vars] = scaler.transform(X_test[train_vars])

In [25]:
# Inspect the result
X_train.head()

Unnamed: 0,Id,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,BsmtFinSF1_na,BsmtFinSF2_na,BsmtUnfSF_na,TotalBsmtSF_na,BsmtFullBath_na,BsmtHalfBath_na,GarageYrBlt_na,GarageCars_na,GarageArea_na
45,1506,12.167484,0.0,1.0,0.465802,0.537401,1.0,1.0,0.333333,0.333333,0.0,0.25,0.5,0.95,0.0,0.0,0.5,0.2,0.555556,0.625,0.338462,0.725806,1.0,0.0,0.7,0.8,0.75,0.093023,0.666667,0.0,0.75,0.5,0.75,0.25,0.0,0.126185,0.5,0.0,0.618224,0.358979,0.0,0.25,1.0,0.666667,0.594615,0.0,0.0,0.594615,0.333333,0.0,0.5,0.0,0.666667,0.5,0.75,0.416667,0.0,0.0,0.2,0.666667,0.385965,0.666667,0.4,0.359543,0.666667,0.666667,1.0,0.0,0.102426,0.0,0.0,0.0,0.0,1.0,0.666667,0.0,0.0,0.363636,1.0,0.75,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1347,2808,12.13115,0.0,1.0,0.465802,0.559318,1.0,1.0,0.333333,0.333333,0.0,0.25,0.5,0.6,0.0,0.0,0.5,0.2,0.444444,0.625,0.307692,0.66129,1.0,0.0,0.7,0.8,0.75,0.236434,0.666667,0.0,0.75,0.5,0.75,0.0,0.666667,0.194264,0.5,0.0,0.133178,0.208832,0.0,0.5,1.0,0.666667,0.380254,0.0,0.0,0.380254,0.0,0.5,0.25,0.0,0.5,0.5,0.75,0.25,0.0,0.25,0.8,0.666667,0.350877,0.0,0.4,0.354839,0.666667,0.666667,1.0,0.220506,0.06469,0.0,0.0,0.0,0.0,1.0,0.666667,0.0,0.0,0.454545,0.0,0.75,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,1516,12.145904,0.176471,1.0,0.773349,0.676274,1.0,1.0,0.0,0.333333,0.0,0.5,0.5,0.45,0.166667,0.0,0.5,0.4,0.555556,0.625,0.476923,1.0,0.0,0.0,0.2,0.3,0.25,0.263566,0.666667,0.333333,0.75,0.5,0.75,0.25,0.333333,0.074813,0.5,0.0,0.225234,0.153484,0.0,0.5,1.0,0.666667,0.363154,0.2884,0.0,0.53065,0.0,0.0,0.5,0.0,0.5,0.5,0.75,0.25,0.0,0.25,0.6,0.666667,0.54386,0.0,0.4,0.31586,0.666667,0.666667,1.0,0.0,0.0,0.148221,0.0,0.0,0.0,1.0,0.666667,0.0,0.0,0.0,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
381,1842,12.153656,0.176471,1.0,0.465802,0.519093,1.0,1.0,0.0,0.333333,0.0,0.25,0.5,0.4,0.0,0.0,0.5,0.4,0.444444,0.25,0.584615,0.983871,0.0,0.0,0.2,0.3,0.0,0.0,0.666667,0.0,0.5,0.5,0.75,0.25,0.833333,0.0,0.5,0.0,0.273364,0.114818,0.0,0.5,0.0,0.333333,0.207154,0.228249,0.0,0.397714,0.333333,0.0,0.5,0.0,0.666667,0.5,0.75,0.25,0.0,0.0,0.2,0.333333,0.666667,0.0,0.2,0.151882,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.666667,0.0,0.0,0.181818,0.75,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
776,2237,12.122712,0.0,1.0,0.620344,0.562204,1.0,1.0,0.333333,0.333333,0.0,0.25,0.5,0.75,0.0,0.0,0.5,0.2,0.777778,0.5,0.030769,0.064516,0.0,0.0,0.6,0.7,0.75,0.522481,0.0,0.0,0.5,1.0,0.75,0.0,0.0,0.391771,0.5,0.0,0.183645,0.385476,0.0,0.75,1.0,0.666667,0.622794,0.0,0.0,0.622794,0.333333,0.0,0.5,0.5,0.5,0.5,0.5,0.333333,0.0,0.25,0.6,0.666667,0.035088,1.0,0.6,0.599462,0.666667,0.666667,1.0,0.0,0.357143,0.0,0.0,0.0,0.0,1.0,0.666667,0.0,0.0,0.454545,0.5,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 6. Saving the training and test sets

In [26]:
# Save the training and test sets for the next notebook

X_train.to_csv('xtrain.csv', index=False)
X_test.to_csv('xtest.csv', index=False)