<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What-do-you-when-you-have-both-numeric-and-categorical-features?" data-toc-modified-id="What-do-you-when-you-have-both-numeric-and-categorical-features?-1">What do you when you have both numeric and categorical features?</a></span></li><li><span><a href="#Sources-of-Inspiration" data-toc-modified-id="Sources-of-Inspiration-2">Sources of Inspiration</a></span></li><li><span><a href="#`make_column_selector`-allows-you-specific-columns" data-toc-modified-id="`make_column_selector`-allows-you-specific-columns-3">`make_column_selector` allows you specific columns</a></span></li></ul></div>

<center><h2>What do you when you have both numeric and categorical features?</h2></center>

Put them together with ColumnTransformer

ColumnTransformer does column-by-column preprocessing

Let's read the documentation   
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

In [23]:
reset -fs

In [24]:
import numpy as np
import pandas as pd

In [25]:
data = pd.read_csv("../data/adult.csv", index_col=0)
data.tail() # Data analysis protip - Always look at the last rows. It is sometimes the most recent data. Sometimes the first rows are mock data.

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [26]:
target        = data.income
data_features = data.drop("income", axis=1)

In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_features, target)

In [28]:
# Find the categorical columns
categorical_columns = (data_features.dtypes == object)

# Setup two preprocessing pipelines
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute          import SimpleImputer

con_pipe = Pipeline([('scaler', StandardScaler()),
                      ('imputer', SimpleImputer(strategy='median', add_indicator=True))])

cat_pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore')),
                     ('imputer', SimpleImputer(strategy='most_frequent', add_indicator=True))])

from sklearn.compose         import ColumnTransformer

preprocessing = ColumnTransformer([('categorical', cat_pipe,  categorical_columns),
                                   ('continuous',  con_pipe, ~categorical_columns),
                                   ])

from sklearn.linear_model    import LogisticRegression

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(X_train, y_train)
pipe.predict(X_test, y_test)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore')),
                                                                  ('imputer',
                                                                   SimpleImputer(add_indicator=True,
                                                                                 strategy='most_frequent'))]),
                                                  age               False
workclass          True
education          True
education-num     False
marital-status     True
occupation         True
relationship       True
race               True
gender             True
capital-gain      Fals...
                                                  Pipeline(steps=[('scaler',
                                                            

<center><h2>Sources of Inspiration</h2></center>

- https://github.com/amueller/ml-workshop-1-of-4

<center><h2>`make_column_selector` allows you specific columns</h2></center>

Use make_column_selector with make_column_transformer to apply different preprocessing to different columns according to their data types (integers, categories) or column names.

```python
ct = make_column_transformer(
 (StandardScaler(), make_column_selector(dtype_include=np.number)), # ages
 (OneHotEncoder(), make_column_selector(dtype_include=object)), # genders
)
```

<br>
<br> 
<br>

----