# Transform Your Data Conveniently With Sklearn
> "Transform and standardize your dataset with this sklearn trick!"

- toc: false
- categories: [data science]
- branch: master
- badges: false
- sticky_rank: 2
- comments: true
- author: Rafael Macalaba
- image: images/data-transformation.png
- hide: false
- search_exclude: true

Have you ever standardized and transformed a dataset by calling each function by hand and looping over your dataframe columns just to get the job done? Ugh, it's so frustrating. You might remember doing something like this.

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from tqdm import tqdm
# copied from https://github.com/avsolatorio
def transform_data(train, test):
    train = train.copy()
    test = test.copy()

    cols = set(train.columns)
    cat_cols = []
    
    # Target is of bool type so it will not be transformed.
    
    numeric = train.select_dtypes(include=['int64', 'float64'])
    numeric_fill = numeric.mean()
    
    numeric = numeric.fillna(numeric_fill)
    
    train[numeric.columns] = numeric
    test[numeric.columns] = test[numeric.columns].fillna(numeric_fill)

    sc = StandardScaler()
    mx = MinMaxScaler()

    train = pd.concat(
        [train, pd.DataFrame(
            sc.fit_transform(numeric),
            columns=['sc_{}'.format(i) for i in numeric.columns],
            index=train.index
        )], axis=1)
    
    test = pd.concat(
        [test, pd.DataFrame(
            sc.transform(test[numeric.columns].fillna(numeric_fill)),
            columns=['sc_{}'.format(i) for i in numeric.columns],
            index=test.index
        )], axis=1)
    
    train = pd.concat(
        [train, pd.DataFrame(
            mx.fit_transform(numeric),
            columns=['mx_{}'.format(i) for i in numeric.columns],
            index=train.index
        )], axis=1)
    
    test = pd.concat(
        [test, pd.DataFrame(
            mx.transform(test[numeric.columns].fillna(numeric_fill)),
            columns=['mx_{}'.format(i) for i in numeric.columns],
            index=test.index
        )], axis=1)
    
    
    num_cols = set(numeric.columns)
    
    for col in tqdm(cols):
        if train[col].dtype == 'object':
            train[col] = train[col].fillna('N/A')
            test[col] = test[col].fillna('N/A')

            train[col] = train[col].apply(str)
            test[col] = test[col].apply(str)

            le = LabelEncoder()
            ohe = OneHotEncoder()

            train_vals = list(train[col].unique())
            test_vals = list(test[col].unique())
            le.fit(train_vals + test_vals)
            train[col] = le.transform(train[col])
            test[col] = le.transform(test[col])
            
            cat_cols.append(col)

    train_ohe = pd.get_dummies(train[cat_cols].astype(str))
    test_ohe = pd.get_dummies(test[cat_cols].astype(str))

    ohe_common = train_ohe.columns.intersection(test_ohe.columns)

    train = pd.concat([train, train_ohe], axis=1)
    test = pd.concat([test, test_ohe], axis=1)
    
    return train, test

Familiar, isn't it? This is the usual way we deal with this kind of job when transforming a dataframe. Now let's try it on the Boston housing dataset to see it in action.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")

Change some of the columns to categorical so that we can see the categorical handling in action as well.

In [3]:
df['tax'] = df['tax'].apply(lambda x: 'cat_tax_' + str(x))
df['rad'] = df['rad'].apply(lambda x: 'cat_rad_' + str(x))
df['target'] = np.random.randint(0,2,df.shape[0])

We split the data into train and test sets.

In [4]:
from sklearn.model_selection import train_test_split
X = df.copy()
y = X.pop('target')
train, test, y_train, y_test = \
    train_test_split(X, y, stratify=y, train_size=0.75)

Check the head of the training data.

In [5]:
train.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
271,0.16211,20.0,6.96,0,0.464,6.24,16.3,4.429,cat_rad_3,cat_tax_223,18.6,396.9,6.59,25.2
419,11.8123,0.0,18.1,0,0.718,6.824,76.5,1.794,cat_rad_24,cat_tax_666,20.2,48.45,22.74,8.4
69,0.12816,12.5,6.07,0,0.409,5.885,33.0,6.498,cat_rad_4,cat_tax_345,18.9,396.9,8.79,20.9
493,0.17331,0.0,9.69,0,0.585,5.707,54.0,2.3817,cat_rad_6,cat_tax_391,19.2,396.9,12.01,21.8
17,0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,cat_rad_4,cat_tax_307,21.0,386.75,14.67,17.5


In [6]:
train_transformed, test_transformed = transform_data(train, test)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 2834.67it/s]


In [7]:
train_transformed.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,...,tax_9,rad_0,rad_1,rad_2,rad_3,rad_4,rad_5,rad_6,rad_7,rad_8
271,0.16211,20.0,6.96,0,0.464,6.24,16.3,4.429,3,6,...,0,0,0,0,1,0,0,0,0,0
419,11.8123,0.0,18.1,0,0.718,6.824,76.5,1.794,2,64,...,0,0,0,1,0,0,0,0,0,0
69,0.12816,12.5,6.07,0,0.409,5.885,33.0,6.498,4,47,...,0,0,0,0,0,1,0,0,0,0
493,0.17331,0.0,9.69,0,0.585,5.707,54.0,2.3817,6,54,...,0,0,0,0,0,0,0,1,0,0
17,0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,4,38,...,0,0,0,0,0,1,0,0,0,0


In [8]:
test_transformed.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,...,tax_9,rad_0,rad_1,rad_2,rad_3,rad_4,rad_5,rad_6,rad_7,rad_8
390,6.96215,0.0,18.1,0,0.7,5.713,97.0,1.9265,2,64,...,0,0,0,1,0,0,0,0,0,0
396,5.87205,0.0,18.1,0,0.693,6.405,96.0,1.6768,2,64,...,0,0,0,1,0,0,0,0,0,0
352,0.07244,60.0,1.69,0,0.411,5.884,18.5,10.7103,4,58,...,0,0,0,0,0,1,0,0,0,0
45,0.17142,0.0,6.91,0,0.448,5.682,33.8,5.1004,3,9,...,1,0,0,0,1,0,0,0,0,0
342,0.02498,0.0,1.89,0,0.518,6.54,59.7,6.2669,0,59,...,0,1,0,0,0,0,0,0,0,0


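To recap, we standardized our numeric data with StandardScaler and MinMaxScaler. We also transformed our categorical data using LabelEncoder and one-hot encoding (the function above creates a OneHotEncoder but actually builds the one-hot columns with `pd.get_dummies`).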

However, this process is really codeful (codely-mouthful, hehe): it takes a lot of code just to transform our data. But say no more, I'll share a trick for using sklearn functions to do this in an easier and more convenient way!

#### Introducing make_column_transformer & make_column_selector

[sklearn.compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) provides a higher-level API with these two convenient functions for transforming your data. They eliminate most of the boilerplate we wrote above, such as selecting the numeric and categorical columns by hand.

Now, let's see it in action.

In [9]:
from sklearn.compose import make_column_transformer, make_column_selector

# define our preprocessor that will handle most of the work for us!
# you can add more transformations if you'd like; here we only use StandardScaler and OneHotEncoder
preprocessor = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2; use `sparse=False` on older versions
    (OneHotEncoder(sparse_output=False),
     make_column_selector(dtype_include=object)),
)
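
To make it clearer what `make_column_selector` does on its own, here is a minimal sketch on a made-up toy dataframe (the `toy` frame and its column names are assumptions for illustration, not part of the Boston data). The selector is just a callable that returns the matching column names when you hand it a dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "price": [1.0, 2.5, 3.0],
    "rooms": [2, 3, 4],
    # force object dtype so the example behaves the same across pandas versions
    "city": pd.Series(["a", "b", "a"], dtype=object),
})

# Each selector is a callable that picks columns by dtype.
select_num = make_column_selector(dtype_include=np.number)
select_cat = make_column_selector(dtype_include=object)

num_cols = select_num(toy)  # names of the numeric columns
cat_cols = select_cat(toy)  # names of the object (string) columns
print(num_cols)
print(cat_cols)
```

Inside `make_column_transformer`, the same call happens for you at fit time, which is why we never had to list the columns ourselves.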

In [10]:
X_transformed = preprocessor.fit_transform(X)

In [11]:
type(X_transformed)

numpy.ndarray

In [12]:
X_transformed_df = pd.DataFrame(X_transformed)
# now let's check our DataFrame
X_transformed_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,77,78,79,80,81,82,83,84,85,86
0,-0.419782,0.28483,-1.287909,-0.272599,-0.144217,0.413672,-0.120013,0.140214,-1.459,0.441052,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.417339,-0.487722,-0.593381,-0.272599,-0.740262,0.194274,0.367166,0.55716,-0.303094,0.441052,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.417342,-0.487722,-0.593381,-0.272599,-0.740262,1.282714,-0.265812,0.55716,-0.303094,0.396427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.41675,-0.487722,-1.306878,-0.272599,-0.835284,1.016303,-0.809889,1.077737,0.113032,0.416163,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.412482,-0.487722,-1.306878,-0.272599,-0.835284,1.228577,-0.51118,1.077737,0.113032,0.441052,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As we can see, our data is now standardized and transformed. One thing to note is that the result is a numpy.ndarray, which is ready for machine-learning training, so we convert it to a pandas dataframe by calling `pd.DataFrame`. You can recover the column names by calling `preprocessor.get_feature_names_out()` (scikit-learn 1.0 and later). The older `preprocessor.get_feature_names` method did not support StandardScaler, only the categorical encoder, and has since been removed from scikit-learn.

#### There's more!

Imputing nulls/NaNs is another preprocessing step that we almost always do when tackling a new dataset, for example filling missing numeric data with a constant and ignoring unknown categories in categorical data.

If you want a more advanced version of this, the next example will do.

We'll be using a couple of new functions here, namely `SimpleImputer` and `make_pipeline`.
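
Before wiring it into a pipeline, here is a small sketch of what `SimpleImputer` does by itself, on made-up arrays (`toy_num` and `toy_cat` are assumptions for illustration). It shows that `strategy="constant"` without a `fill_value` fills numeric data with 0, while `strategy="mean"` is one alternative you could swap in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value each.
toy_num = np.array([[1.0], [np.nan], [3.0]])
toy_cat = np.array([["a"], [np.nan], ["b"]], dtype=object)

# strategy="constant" with no fill_value uses 0 for numeric data
filled_constant = SimpleImputer(strategy="constant").fit_transform(toy_num)

# strategy="mean" uses the column mean learned during fit (here 2.0)
filled_mean = SimpleImputer(strategy="mean").fit_transform(toy_num)

# for string data we pass the placeholder explicitly
filled_cat = SimpleImputer(strategy="constant", fill_value="NA").fit_transform(toy_cat)

print(filled_constant.ravel())
print(filled_mean.ravel())
print(filled_cat.ravel())
```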

In [13]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# select numerics and categoricals
features_num = X.select_dtypes(np.number).columns.tolist()
# `np.object` was removed in NumPy 1.24, so we use the builtin `object` instead
features_cat = X.select_dtypes(object).columns.tolist()

# transformer for numeric
transformer_num = make_pipeline(
    SimpleImputer(strategy="constant"),  # there are a few missing values; with no fill_value, numeric data is filled with 0
    StandardScaler(),
)

#transformer for categorical
transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore'),
)

# initialize preprocessor
preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
)

In [14]:
# fit on the training split only, then reuse the fitted preprocessor on the test split
processed_train = preprocessor.fit_transform(train)
processed_test = preprocessor.transform(test)
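
Here is a minimal sketch of why the preprocessor is usually fitted on the training split only and then reused on the test split. The toy frames and the name `toy_preprocessor` are made up for illustration, and `sparse_threshold=0` is only there to get dense arrays that are easy to inspect. Note how the category `"c"`, which never appears in training, is simply encoded as all zeros thanks to `handle_unknown="ignore"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits; the test split has a NaN and an unseen category "c".
toy_train = pd.DataFrame({
    "size": [1.0, 2.0, 3.0],
    "kind": pd.Series(["a", "b", "a"], dtype=object),
})
toy_test = pd.DataFrame({
    "size": [2.0, np.nan],
    "kind": pd.Series(["b", "c"], dtype=object),
})

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="constant"), StandardScaler()), ["size"]),
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="NA"),
                   OneHotEncoder(handle_unknown="ignore")), ["kind"]),
    sparse_threshold=0,  # always return a dense array
)

# Learn the scaling statistics and the categories from the training split...
out_train = toy_preprocessor.fit_transform(toy_train)
# ...and apply exactly the same mapping to the test split.
out_test = toy_preprocessor.transform(toy_test)

print(out_train.shape, out_test.shape)  # same number of columns in both
print(out_test)
```

Calling `fit_transform` on the test split instead would relearn the scaling and the categories from the test data, so the two outputs could end up with different columns and different scales.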

And that's how you can use these tricks to make your data transformation and standardization more convenient.

Please feel free to comment or reach out to me if you have any questions or suggestions.

Happy Learning!