# Data Transformations

## What will we accomplish

In this notebook we will:
- Introduce some common data transformations
- Introduce the concept of a pipeline
- Review `sklearn`'s `Pipeline`s
- Give some example pipelines.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style
set_style("whitegrid")

In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
X

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States


In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


In [4]:
y

Unnamed: 0,income
0,<=50K
1,<=50K
2,<=50K
3,<=50K
4,<=50K
...,...
48837,<=50K.
48838,<=50K.
48839,<=50K.
48840,<=50K.


In [5]:
# Most sklearn estimators expect a pandas series as the target, not an (n,1) pandas dataframe
y = y.income

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)

- Preprocess features
  - Impute: [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html), [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html)
    - e.g. `SimpleImputer` replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value. Using the most frequent strategy $\begin{bmatrix} 2 \\ 3 \\ \textrm{NaN} \\ 3\end{bmatrix} \mapsto \begin{bmatrix} 2 \\ 3 \\ 3 \\ 3\end{bmatrix}$.

  - Encode: [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for nominal, [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) for ordinal
    - e.g. `OneHotEncoder` would take a column containing the distinct strings `dog`, `cat`, `parrot` and transform them into 3 binary columns (which you could think of as `is_dog`, `is_cat`, and `is_parrot` columns). $\begin{bmatrix} \textrm{dog} \\ \textrm{cat} \\ \textrm{parrot} \\ \textrm{dog}\end{bmatrix} \mapsto \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0& 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$


  - Scale and transform: [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html), [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html), [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)
    - e.g. `StandardScaler` will recenter so the mean of a column is $0$ and scale so that variance is $1$.
    - Some machine learning algorithms are sensitive to scale (e.g. Ridge Regression) and some are not (e.g. Random Forest Regression).
  - Feature generation: [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html), [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html), [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)
    - e.g. `PolynomialFeatures` will generate new columns which are monomials of specified degrees of the old columns.
    - Could be used, for example, when you have reason to believe the target varies quadratically with one of the input variables.
  - Dimensionality reduction: [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)
    - Useful when you have too many features relative to observations and cannot eliminate features based on domain knowledge alone.
  - Feature selection: [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html), [VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html)
    - These automated feature selection tools are rarely a good idea, but they can do in a pinch.
  - Text vectorization: [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
    - We won't do much with NLP in this course and there are more robust text vectorization tools out there now.  However, this is a good quick baseline.

- Apply different transforms to different columns
  - [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html), [make_column_selector](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html)

- Parallel feature branches and concatenation
  - [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html), [`make_union`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_union.html)

- Model training and post processing
  - Any estimator as the final step: see the [Supervised learning](https://scikit-learn.org/stable/supervised_learning.html) overview

- Target side transforms
  - [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html)
  - e.g. you might want to regress on $\log(y)$ but report predictions in the original units of the target.

- Multioutput and multilabel wrappers
  - [MultiOutputRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html), [MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html), [ClassifierChain](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html), [RegressorChain](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html)

- The pipelines themselves
  - [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and the user guide on [Composing estimators](https://scikit-learn.org/stable/modules/compose.html)

The concept of a <i>pipeline</i> is a nice framework for combining all of those steps and fitting a model all into one container. Here's a simple visualization that helps explain this concept:

<img src="lecture_assets/pipe.png" style="width:85%"></img>

Here is a basic example of a pipeline:

In [None]:
## make the Pipeline object
## Pipeline objects take in a list as an argument
## that list contains tuples of the steps you want in your pipeline
## Each tuple has a name for the step as its first entry,
## then the python object as its second entry

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

numeric_features = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

pipe = Pipeline([('poly', PolynomialFeatures(3,
                                                interaction_only=False,
                                                include_bias=False)),
                ('scale', StandardScaler()),
                ('reg', LogisticRegression(max_iter = 1000))])

# Only doing this now to show the pipeline structure.  
# You shouldn't usually fit a model on all of the training data immediately:  you need to cross-validate.
pipe.fit(X_train[numeric_features], y_train)

0,1,2
,steps,"[('poly', ...), ('scale', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,degree,3
,interaction_only,False
,include_bias,False
,order,'C'

0,1,2
,copy,True
,with_mean,True
,with_std,True

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'lbfgs'
,max_iter,1000


In [15]:
from sklearn.model_selection import StratifiedKFold, cross_validate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=216)
cross_validate(pipe, X_train[numeric_features], y_train, scoring=('neg_log_loss', 'accuracy'))

{'fit_time': array([1.55043888, 1.41606808, 1.69568014, 1.47184992, 1.56545305]),
 'score_time': array([0.0176599 , 0.01562285, 0.01521206, 0.01566696, 0.0155561 ]),
 'test_neg_log_loss': array([-1.02013469, -1.01744865, -1.02620769, -1.02330894, -1.02376998]),
 'test_accuracy': array([0.5468814 , 0.54750205, 0.54941305, 0.54818455, 0.55105105])}

We can also make more complex pipelines:

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

numeric_features = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = ["workclass", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]

log_transformer = FunctionTransformer(lambda x: np.log1p(x))

numeric_transformer = Pipeline(steps=[
    ('log', ColumnTransformer(
        transformers=[('log', log_transformer, ["capital-gain", "capital-loss"])],
        remainder='passthrough'
    )),
    ('scaler', StandardScaler())
])

categorical_transformer = OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=1)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

In [8]:
preprocessor.fit_transform(X_train)[:5].toarray()

array([[-0.2998939 , -0.21856599, -0.4112016 ,  1.13584592, -0.03174566,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0. 

In [9]:
# Only doing this now to show the pipeline structure.  
# You shouldn't usually fit a model on all of the training data immediately:  you need to cross-validate.
clf.fit(X_train, y_train)

0,1,2
,steps,"[('preprocessor', ...), ('classifier', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,transformers,"[('num', ...), ('cat', ...)]"
,remainder,'drop'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,transformers,"[('log', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,func,<function <la...t 0x17749eca0>
,inverse_func,
,validate,False
,accept_sparse,False
,check_inverse,True
,feature_names_out,
,kw_args,
,inv_kw_args,

0,1,2
,copy,True
,with_mean,True
,with_std,True

0,1,2
,categories,'auto'
,drop,
,sparse_output,True
,dtype,<class 'numpy.float64'>
,handle_unknown,'infrequent_if_exist'
,min_frequency,1
,max_categories,
,feature_name_combiner,'concat'

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'lbfgs'
,max_iter,1000


In [10]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.dummy import DummyClassifier

In [11]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=216)
cross_validate(clf, X_train, y_train, scoring=('neg_log_loss', 'accuracy'))

{'fit_time': array([0.69908118, 0.7102201 , 0.755337  , 0.70361423, 0.74871397]),
 'score_time': array([0.02927303, 0.02985287, 0.02702498, 0.02659488, 0.03054094]),
 'test_neg_log_loss': array([-0.9226455 , -0.92013161, -0.92739418, -0.92017882, -0.9288136 ]),
 'test_accuracy': array([0.5852327 , 0.58927109, 0.58558559, 0.58640459, 0.58149058])}

In [12]:
cross_validate(DummyClassifier(), X_train, y_train, scoring=('neg_log_loss', 'accuracy'))

{'fit_time': array([0.01566792, 0.00952816, 0.00782609, 0.00777102, 0.00783515]),
 'score_time': array([0.01779819, 0.01042819, 0.01015186, 0.01045394, 0.01126194]),
 'test_neg_log_loss': array([-1.18832389, -1.18823302, -1.18823302, -1.18823302, -1.18839278]),
 'test_accuracy': array([0.50484509, 0.5047775 , 0.5047775 , 0.5047775 , 0.5047775 ])}

So the model outperforms the simple strategy of predicting the most frequent class.