# Exercise 1

* Load **sample_dataset.csv** and select only these features: `mean radius`, `area error`, `mean perimeter`
* Apply the following transformations using ColumnTransformer and Pipeline:
    * Numerical features:
        * Imputing missing values with the mean
        * Transformation using the Yeo-Johnson transformation
    * Categorical features:
        * Imputing missing values with the most frequent value
        * One-hot encoding with dense output

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder
data_directory = os.path.join("..", "..", "Datasets")

In [3]:
# Load the CSV and keep only the three columns named in the exercise
columns = ["mean radius", "area error", "mean perimeter"]
df = pd.read_csv(os.path.join(data_directory, "sample_dataset.csv"))[columns]

In [8]:
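# Numerical features: mean imputation, then Yeo-Johnson (PowerTransformer's default method)
num_pipe = Pipeline([
    ('cleaning', SimpleImputer(strategy="mean")),
    ('transform', PowerTransformer(method="yeo-johnson"))
])
# Categorical features: most-frequent imputation, then one-hot encoding.
# sparse_output=False requests a dense array (scikit-learn >= 1.2; older releases use sparse=False).
cat_pipe = Pipeline([
    ('cleaning', SimpleImputer(strategy="most_frequent")),
    ('encoding', OneHotEncoder(sparse_output=False))
])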

In [9]:
# Route columns by dtype: object columns are treated as categorical, everything else as numerical
transformer = ColumnTransformer([
    ('numerical', num_pipe, make_column_selector(dtype_exclude="object")),
    ('categorical', cat_pipe, make_column_selector(dtype_include="object"))
])

In [11]:
transformer.fit_transform(df)

array([[ 0.14898925,  1.31305907,  1.        ,  0.        ,  0.        ],
       [ 1.77124867,  1.61013246,  1.        ,  0.        ,  0.        ],
       [ 1.59585762,  1.52800056,  1.        ,  0.        ,  0.        ],
       ...,
       [ 0.88394456,  0.82431697,  1.        ,  0.        ,  0.        ],
       [ 1.77704664,  1.80409019,  1.        ,  0.        ,  0.        ],
       [-2.82849658, -2.89045254,  1.        ,  0.        ,  0.        ]])

# Exercise 2

* Modify the transformations from the previous exercise using set_params, with these settings:
    * Numerical features: impute with the median instead of the mean
    * Categorical features: impute with the constant value 'N'

In [12]:
# Nested parameter names follow <transformer name>__<step name>__<parameter>
transformer.set_params(
    numerical__cleaning__strategy="median",
    categorical__cleaning__strategy="constant",
    categorical__cleaning__fill_value="N",
)
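
The keyword names accepted by set_params are built by joining the ColumnTransformer entry name, the Pipeline step name, and the estimator parameter with double underscores. A small sketch of how those names can be discovered with get_params, assuming a hypothetical column called `label` (nothing here reads the exercise dataset):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipe = Pipeline([
    ("cleaning", SimpleImputer(strategy="most_frequent")),
    ("encoding", OneHotEncoder()),
])
# "label" is a hypothetical column name used only for illustration.
transformer = ColumnTransformer([("categorical", cat_pipe, ["label"])])

# List every settable parameter of the imputer step inside the nested pipeline.
prefix = "categorical__cleaning__"
keys = sorted(k for k in transformer.get_params() if k.startswith(prefix))
print(keys)

# set_params updates the nested estimator in place; nothing is refitted.
transformer.set_params(
    categorical__cleaning__strategy="constant",
    categorical__cleaning__fill_value="N",
)
imputer = cat_pipe.named_steps["cleaning"]
print(imputer.strategy, imputer.fill_value)
```

Because set_params only changes the configuration, the transformer has to be fitted again before the new settings affect the output.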

In [13]:
transformer.fit_transform(df)

array([[-0.05595844,  1.33240916,  1.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.77695743,  1.61842447,  1.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.61194827,  1.53959676,  1.        ,  0.        ,  0.        ,
         0.        ],
       ...,
       [ 0.93017209,  0.85656353,  1.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.78239188,  1.80382232,  1.        ,  0.        ,  0.        ,
         0.        ],
       [-2.91245376, -2.95836499,  1.        ,  0.        ,  0.        ,
         0.        ]])