# Introduction
This is a series that I am very excited about. We are going to explore some unsupervised models; the main idea is to understand the X-space, that is, the space of the features.

Everything we have seen so far explores how features help us make certain kinds of predictions. We then looked deeper at the model itself and tried to explain how those features impact it (SHAP and permutation analysis).

But we haven't stopped to analyze with greater care how the features behind those predictions move together. That is what we are going to explore here. Graph theory was the only exception, because with graph theory we studied the underlying pattern of connections.

We already saw PCA and SVD. Now we are going to do something more subtle with the categorical variables, hopefully increasing their predictive power.

In this notebook we are going to explore Target Encoding, which replaces each category with a smoothed average of the target for that category.
You can read more here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
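
To make the idea concrete, here is a minimal sketch of target encoding done by hand with pandas. The toy data and the fixed smoothing strength `m` are assumptions made up for illustration; the notebook itself uses scikit-learn's TargetEncoder with smooth="auto", which chooses the amount of shrinkage from the data and also applies cross fitting, so its numbers would differ from this sketch.

```python
import pandas as pd

# Toy data (made up for illustration): one categorical feature and a numeric target.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "SalePrice": [100.0, 120.0, 110.0, 200.0, 220.0, 300.0],
})

m = 2.0  # hypothetical smoothing strength; larger m pulls encodings toward the global mean

global_mean = df["SalePrice"].mean()
stats = df.groupby("Neighborhood")["SalePrice"].agg(["mean", "count"])

# Shrunk category mean: weight n / (n + m) on the category mean,
# and the remaining weight on the global mean.
weight = stats["count"] / (stats["count"] + m)
encoding = weight * stats["mean"] + (1 - weight) * global_mean

# Replace each category label with its encoding.
df["Neighborhood_encoded"] = df["Neighborhood"].map(encoding)
print(df)
```

In this toy example category C appears only once, so it keeps just one third of its own mean and takes the rest from the global mean, while category A (three rows) keeps three fifths of its own mean.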

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
# new class: we haven't used this one before in this repo (requires scikit-learn 1.3 or later)
from sklearn.preprocessing import TargetEncoder

In [2]:
# Read the data
# You can find this data here: https://www.kaggle.com/c/home-data-for-ml-course/data

X_full = pd.read_csv('train.csv', index_col='Id')

# SalePrice is the target; drop the rows where it is missing
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X = X_full.copy()
X.drop(['SalePrice'], axis=1, inplace=True)

In [3]:
def model_pipeline_score(X,
                         y,
                         n_estimators=369,
                         cv=5,
                         scoring='neg_mean_absolute_error',
                         target_encoding='yes'):

    
    # Numerical columns, and the subset with missing values that needs imputing
    numerical_col = [col for col in X.columns if str(X[col].dtypes) != 'object']
    numerical_col_imputed = [col for col in numerical_col if X[col].isnull().any()]

    # Categorical (object) columns
    categorical_col = [col for col in X.columns if str(X[col].dtypes) == 'object']

    # Scale first so that KNN distances are comparable across columns, then impute
    numerical_transformer = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("imputer", KNNImputer(n_neighbors=3))
    ])

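    if target_encoding == "no":
        # Baseline: fill missing categories with the most frequent one, then one-hot encode
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(missing_values=pd.NA, strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])
    elif target_encoding == "yes":
        # TargetEncoder treats missing values as their own category, so no imputer is needed
        categorical_transformer = Pipeline(steps=[
            ('encoder', TargetEncoder(smooth="auto", target_type='continuous'))
        ])
    else:
        raise ValueError("target_encoding must be 'yes' or 'no'")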
        

    preprocessor = ColumnTransformer(transformers=
        [("numerical_transformer", numerical_transformer, numerical_col_imputed),
        ("categorical_transformer", categorical_transformer, categorical_col)],remainder='passthrough')

    # Define model
    model = RandomForestRegressor(n_estimators=n_estimators,random_state=0,n_jobs=-1)

    # Bundle preprocessing and modeling code in a pipeline
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                         ])

    
    # With the default scoring, cross_val_score returns negative MAE, so flip the sign
    scores = -1 * cross_val_score(pipe, X, y, cv=cv, scoring=scoring)

    return scores,pipe

In [4]:
scores,pipe = model_pipeline_score(X,y,target_encoding='yes')

In [5]:
scores.mean()

17332.533775104872

In [20]:
#pipe

In [7]:
scores,pipe  = model_pipeline_score(X,y,target_encoding='no')

In [8]:
scores.mean()

17572.00848461224

In [21]:
#pipe

This was interesting! We used target encoding on all the categorical columns, and it seems to work better than one-hot encoding (an MAE of about 17,333 versus 17,572). Next we are going to play a little with the cardinality threshold to try to lower the MAE even more: we will use target encoding only on the categorical columns with high cardinality and one-hot encoding on the others.
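
As a quick illustration of what "high cardinality" means here, the following sketch counts the distinct values of each categorical column and splits the columns around a cutoff. The toy frame and the `threshold` value are made up for the example; on the real data you would run the same idea on X.

```python
import pandas as pd

# Toy frame (made up). String columns are built with dtype=object to match
# the notebook's own check for categorical columns.
X_toy = pd.DataFrame({
    "Neighborhood": pd.Series(["A", "B", "C", "D", "E", "F"], dtype=object),
    "Street": pd.Series(["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl"], dtype=object),
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

threshold = 3  # hypothetical cutoff for this toy example

# Distinct values per categorical column. Note that nunique() ignores NaN,
# whereas len(X[col].unique()) counts NaN as one extra value.
cardinality = X_toy.select_dtypes(include="object").nunique()

high_cardinality = cardinality[cardinality > threshold].index.tolist()
low_cardinality = cardinality[cardinality <= threshold].index.tolist()
print(high_cardinality, low_cardinality)
```

Columns in the first list would go to the target encoder and columns in the second list to the one-hot encoder.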

In [10]:
def model_pipeline_score(X,
                         y,
                         n_estimators=369,
                         cv=5,
                         scoring='neg_mean_absolute_error',
                         cardinaity_target=13):

    
    numerical_col = [col for col in X.columns if str(X[col].dtypes) != 'object']
    numerical_col_imputed = [col for col in numerical_col if X[col].isnull().any()]

    # Categorical columns with more distinct values than the threshold are target encoded
    all_categories = [col for col in X.columns if str(X[col].dtypes) == 'object']
    enc_target = [col for col in all_categories if len(X[col].unique()) > cardinaity_target]

    # The remaining categorical columns are one-hot encoded
    col_encoding = [col for col in all_categories if col not in enc_target]
    
   
    numerical_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), 
           ("imputer", KNNImputer(n_neighbors=3))
          ])



    target_transformer =  Pipeline(steps=[
        ('encoder', TargetEncoder(smooth="auto",target_type='continuous'))
    ])
        
    categorical_transformer =  Pipeline(
        steps=[('imputer', SimpleImputer(missing_values=pd.NA, strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])    

    preprocessor = ColumnTransformer(transformers=
        [("numerical_transformer", numerical_transformer, numerical_col_imputed),
         ("target_transformer", target_transformer, enc_target),
        ("categorical_transformer", categorical_transformer, col_encoding)
        ], remainder='passthrough')

    # Define model
    model = RandomForestRegressor(n_estimators=n_estimators,random_state=0,n_jobs=-1)

    # Bundle preprocessing and modeling code in a pipeline
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                         ])

    
    scores = -1 * cross_val_score(pipe, X, y,cv=cv,scoring=scoring)

    return scores,pipe

In [11]:
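# Start from the one-hot baseline MAE and keep the best threshold seen so far.
# These trackers must be initialised outside the loop, otherwise every iteration resets them.
best_number = 100
min_score = 17572
for number in range(5,20,3):
    scores,pipe = model_pipeline_score(X,y,cardinaity_target=number)
    if scores.mean() < min_score:
        min_score = scores.mean()
        best_number = number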

In [12]:
min_score

17245.58469948398

In [19]:
best_number

17

With a threshold of 17 we obtain an MAE of about 17,246, the best we have got so far!
Notice that this result comes from using TargetEncoder only on the columns with high cardinality and one-hot encoding on the rest.
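
When comparing several thresholds, it can help to keep every score instead of only a running minimum, so the whole curve can be inspected afterwards. The sketch below shows that pattern. `evaluate_threshold` and its numbers are hypothetical stand-ins, not results from this notebook; in the notebook the call would be `model_pipeline_score(X, y, cardinaity_target=number)[0].mean()`.

```python
# Hypothetical stand-in for the cross-validated MAE of one threshold.
# The numbers are made up and are NOT results from this notebook.
def evaluate_threshold(threshold):
    made_up_mae = {5: 3.0, 8: 2.5, 11: 2.0, 14: 2.2, 17: 2.4}
    return made_up_mae[threshold]

# Store the score of every threshold tried.
results = {}
for number in range(5, 20, 3):
    results[number] = evaluate_threshold(number)

# Pick the threshold with the lowest score.
best_number = min(results, key=results.get)
min_score = results[best_number]
print(best_number, min_score)
```

Keeping the full `results` dictionary also makes it easy to plot the score against the threshold.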

In [23]:
# Recompute the column split for a threshold of 17 to see which columns are target encoded
numerical_col = [col for col in X.columns if str(X[col].dtypes) != 'object']
numerical_col_imputed = [col for col in numerical_col if X[col].isnull().any()]

all_categories = [col for col in X.columns if str(X[col].dtypes) == 'object']
enc_target = [col for col in all_categories if len(X[col].unique()) > 17]
col_encoding = [col for col in all_categories if col not in enc_target]
enc_target

['Neighborhood']

Notice that with this threshold only one column is target encoded: Neighborhood, which has more than 17 distinct values.