# Data Transformation with Scikit-Learn

## Loading some data to work with

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()  # load the dataset once and reuse it
X = iris['data']    # feature vectors
y = iris['target']  # label vector

In [3]:
print("First 10 lines of X:\n", X[0:10])
print("\nLabels\n",y)

First 10 lines of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Labels
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [4]:
# Randomly split into train and test data (corresponds to "Partitioning" node in KNIME)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
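The split above is purely random. As a small sketch of a variant, the following block passes `stratify=y` so that the class proportions are preserved in both partitions. The variable names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here only to avoid overwriting the variables used in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in the train and test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)  # sizes of the two partitions
print(np.bincount(y_te))       # number of test samples per class
```

Whether stratification is needed depends on the data; it mainly matters for small or imbalanced datasets.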

## Some general notes on data transformation in sklearn

* All transformers implement the method ``fit``, which learns the parameters of the transformation from a training set. (**Make sure you never fit a transformer on the test dataset!**)
* Furthermore, all transformers have a method ``transform``, which applies the fitted transformation to data, including unseen data.
* In addition, the method ``fit_transform`` fits the transformer and transforms the training data in a single call.
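The three methods listed above can be sketched on a tiny example. The block uses `MinMaxScaler` and two made-up arrays (`train`, `test`) purely for illustration; any other transformer would follow the same pattern.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumption for illustration only)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

mm_scaler = MinMaxScaler()
mm_scaler.fit(train)                       # learns min and max from the training data only
train_scaled = mm_scaler.transform(train)  # training data is mapped to [0, 1]
test_scaled = mm_scaler.transform(test)    # reuses the min and max learned from train

# fit_transform is a shortcut for fit followed by transform on the same data
same_as_above = MinMaxScaler().fit_transform(train)

print(train_scaled.ravel())
print(test_scaled.ravel())  # 4.0 lies outside the training range, so it is mapped above 1
```

Note that the test data is never passed to `fit`; it is only transformed with the parameters learned from the training data.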



## Data Scaling and Normalization

***sklearn*** provides a wide range of pre-processing methods for ***NumPy*** arrays and other array-like input.

In [5]:
#example scaling data
from sklearn import preprocessing
X_scaled = preprocessing.scale(X_train)

X_scaled[0:10, :]

array([[-0.4134164 , -1.46200287, -0.09951105, -0.32339776],
       [ 0.55122187, -0.50256349,  0.71770262,  0.35303182],
       [ 0.67180165,  0.21701605,  0.95119225,  0.75888956],
       [ 0.91296121, -0.02284379,  0.30909579,  0.2177459 ],
       [ 1.63643991,  1.41631528,  1.30142668,  1.70589097],
       [-0.17225683, -0.26270364,  0.19235097,  0.08245999],
       [ 2.11875905, -0.02284379,  1.59328871,  1.16474731],
       [-0.29283662, -0.02284379,  0.36746819,  0.35303182],
       [-0.89573553,  1.17645543, -1.44207638, -1.40568508],
       [ 2.23933883, -0.50256349,  1.65166111,  1.0294614 ]])

### Scaling
One problem with scaling, as with all other pre-processing methods, is that we need to determine the "right" processing parameters from the **train data** and then apply the same transformation to the **test data**. ***Sklearn*** provides ***Scaler*** models to do this:


#### Step 1: Instantiate the class
Check the API documentation for the available parameters.

In [6]:
scaler = preprocessing.StandardScaler()

#### Step 2: Fit to training data

In [7]:
scaler.fit(X_train)

# Now the mean and the standard deviation have been determined for all features and 
# are stored in the scaler instance. We can retrieve them if we want. 
print("Mean = ", scaler.mean_)
print("Standard dev. = ", scaler.scale_)

Mean =  [5.84285714 3.00952381 3.87047619 1.23904762]
Standard dev. =  [0.82932642 0.41691013 1.71313824 0.73917525]
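To make clear what the stored parameters are used for, here is a minimal sketch that standardizes a small made-up array by hand and compares the result with the scaler's own output. The array `data` and the name `demo_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption for illustration only)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

demo_scaler = StandardScaler().fit(data)

# Standardization by hand: subtract the mean and divide by the standard deviation
manual = (data - demo_scaler.mean_) / demo_scaler.scale_

print(np.allclose(manual, demo_scaler.transform(data)))
# scale_ is the population standard deviation (ddof=0), as computed by np.std
print(np.allclose(demo_scaler.scale_, data.std(axis=0)))
```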


#### Step 3: Apply transformation to data                                   

In [8]:
transformed = scaler.transform(X_train)                           

# The data being transformed does not have to be the same as the data the
# scaler was fitted on. If the data used for fitting also has to be
# transformed, the method fit_transform() performs both steps at once.

# transformed = scaler.fit_transform(X_train)  # in this case fit() does not have to be called separately

print(transformed)

[[-0.4134164  -1.46200287 -0.09951105 -0.32339776]
 [ 0.55122187 -0.50256349  0.71770262  0.35303182]
 [ 0.67180165  0.21701605  0.95119225  0.75888956]
 [ 0.91296121 -0.02284379  0.30909579  0.2177459 ]
 [ 1.63643991  1.41631528  1.30142668  1.70589097]
 [-0.17225683 -0.26270364  0.19235097  0.08245999]
 [ 2.11875905 -0.02284379  1.59328871  1.16474731]
 [-0.29283662 -0.02284379  0.36746819  0.35303182]
 [-0.89573553  1.17645543 -1.44207638 -1.40568508]
 [ 2.23933883 -0.50256349  1.65166111  1.0294614 ]
 [-0.05167705 -0.74242333  0.13397857 -0.32339776]
 [-0.77515575  0.93659559 -1.44207638 -1.40568508]
 [-1.01631531  1.17645543 -1.50044878 -1.27039917]
 [-0.89573553  1.89603497 -1.15021435 -1.13511325]
 [-1.01631531 -2.42144225 -0.21625586 -0.32339776]
 [ 0.55122187 -0.74242333  0.60095781  0.75888956]
 [-1.25747488  0.93659559 -1.15021435 -1.40568508]
 [-1.01631531 -0.02284379 -1.32533157 -1.40568508]
 [-0.89573553  0.69673574 -1.26695916 -0.99982734]
 [-0.29283662 -0.74242333  0.19

### Scaler
There are many different ***Scalers*** available. See the [examples here](https://scikit-learn.org/stable/modules/preprocessing.html).
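As a rough sketch of how some of these scalers differ, the following block applies `StandardScaler`, `MinMaxScaler` and `RobustScaler` to the same made-up column containing one outlier. The data and the `results` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (toy data, assumption for illustration only)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for s in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    results[type(s).__name__] = s.fit_transform(values).ravel()
    print(type(s).__name__, results[type(s).__name__].round(2))
```

`MinMaxScaler` maps the data to the range [0, 1], so the outlier squeezes all other values close to 0. `RobustScaler` centers on the median and scales by the interquartile range, which makes it less sensitive to the outlier.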

##  Encoding categorical features as numbers

Most **sklearn** estimators cannot deal with categorical data given as strings. Therefore, categories have to be encoded as numbers.

In [9]:
# Example data
import pandas as pd

def initializeData(): 
    df = pd.DataFrame([
                ['green', 'M', 10.1, 'class2'],
                ['red', 'L', 13.5, 'class1'],
                ['blue', 'XL', 15.3, 'class2']])
    df.columns = ['color', 'size', 'price', 'classlabel']
    return df

df = initializeData()
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


We can get a list of all columns holding categorical variables by checking which columns have the dtype ``object``. (Note that this does not find categorical variables that have already been converted to numerical values or that have been assigned a different data type!)

In [10]:
# Get list of categorical variables
s = (df.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['color', 'size', 'classlabel']
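A possible alternative to comparing dtypes by hand is the pandas method `select_dtypes`. The sketch below selects every column that is not numeric; the small DataFrame `demo` is an assumption that mirrors the example data.

```python
import pandas as pd

# Toy data (assumption, mirrors the example data above)
demo = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'price': [10.1, 13.5, 15.3],
    'classlabel': ['class2', 'class1', 'class2'],
})

# Select all columns whose dtype is not numeric
non_numeric_cols = demo.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```

The same caveat applies: categorical variables that are already stored as numbers are not detected this way.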


### **Mapping categorical target variables to numbers**

The LabelEncoder can be used to transform non-numerical labels to numerical labels.  
The distinct labels are sorted alphabetically to define the mapping. 

In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['classlabel'] = label_encoder.fit_transform(df['classlabel'])

print("Numbers correspond to alphabetic order of labels: ", label_encoder.classes_)
df

Numbers correspond to alphabetic order of labels:  ['class1' 'class2']


Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,1
1,red,L,13.5,0
2,blue,XL,15.3,1
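The fitted encoder can also translate the numbers back into the original labels with `inverse_transform`. A minimal sketch, using a made-up label list `labels`:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption for illustration only)
labels = ['class2', 'class1', 'class2']

le = LabelEncoder()
encoded = le.fit_transform(labels)       # labels -> numbers
decoded = le.inverse_transform(encoded)  # numbers -> original labels

print(encoded)
print(decoded)
```

This is handy for turning the numerical predictions of a classifier back into readable class names.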


### **Mapping categorical features to numbers**

The OrdinalEncoder is used to transform categorical features to numbers.  
Again, the numbers are assigned in the alphabetic order of the categories of the feature.  
Note that in contrast to the LabelEncoder the OrdinalEncoder expects a two-dimensional input (which means that it also accepts multiple features at once).

In [12]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

ordinal_encoder = OrdinalEncoder()

df['color'] = ordinal_encoder.fit_transform(df[['color']])


print(ordinal_encoder.categories_)
df

[array(['blue', 'green', 'red'], dtype=object)]


Unnamed: 0,color,size,price,classlabel
0,1.0,M,10.1,1
1,2.0,L,13.5,0
2,0.0,XL,15.3,1


### **Mapping ordinal feature to numbers**

##### Step 1: Define the mapping
An algorithm cannot "guess" what a meaningful order looks like, so we have to provide this information!

In [13]:
size_mapping = {'XL': 3, 
               'L':2, 
               'M':1}

##### Step 2: Apply the mapping

In [14]:
df['size'] = df['size'].map(size_mapping)
df


Unnamed: 0,color,size,price,classlabel
0,1.0,1,10.1,1
1,2.0,2,13.5,0
2,0.0,3,15.3,1


In [15]:
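# If needed, a reverse mapping can be implemented as well
inv_size_mapping = {v: k for k, v in size_mapping.items()}
# Keep the result in its own variable so that the DataFrame df is not overwritten
size_labels = df['size'].map(inv_size_mapping)
size_labels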

0     M
1     L
2    XL
Name: size, dtype: object
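Instead of a hand-written dictionary, the order can also be passed to the `OrdinalEncoder` through its `categories` parameter. The following is only a sketch with a made-up array `sizes`; note that the resulting codes start at 0, unlike the manual mapping above, which starts at 1.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (assumption for illustration only): one column with sizes
sizes = np.array([['M'], ['L'], ['XL']], dtype=object)

# One list of categories per feature, given in the desired order
size_encoder = OrdinalEncoder(categories=[['M', 'L', 'XL']])
encoded_sizes = size_encoder.fit_transform(sizes)

# The reverse mapping comes for free
restored_sizes = size_encoder.inverse_transform(encoded_sizes)

print(encoded_sizes.ravel())
print(restored_sizes.ravel())
```

Using an encoder object instead of a dictionary has the advantage that it can be used inside a ColumnTransformer or a Pipeline.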

### **One-Hot Encoding**
One issue with the OrdinalEncoder is that ML algorithms will assume that two nearby values are more similar than two distant values. However, this is not always the case.   
Another possibility to convert categorical features to numerical features is to use a ***one-hot*** or dummy encoding. This transforms each categorical feature with **$n$ possible categories** into **$n$ binary features**, with one of them 1 and all others 0. 

In [16]:
from sklearn.preprocessing import OneHotEncoder
df2 = initializeData()
one_hot_encoder = OneHotEncoder()
# The encoder expects two-dimensional input, so the column is selected as a DataFrame
transformed = one_hot_encoder.fit_transform(df2[['color']])


In [17]:
# Note that the output is a SciPy sparse matrix, not a NumPy array! 
type(transformed)

scipy.sparse.csr.csr_matrix

In [18]:
# Transformed to a dense NumPy array it looks like this
transformed.toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [19]:
# Adding the new columns to the data frame
#df2 = pd.concat([df2,pd.DataFrame(transformed.toarray())], axis = 1)
df2 = df2.join(pd.DataFrame(transformed.toarray()))
df2

Unnamed: 0,color,size,price,classlabel,0,1,2
0,green,M,10.1,class2,0.0,1.0,0.0
1,red,L,13.5,class1,0.0,0.0,1.0
2,blue,XL,15.3,class2,1.0,0.0,0.0


In [20]:
# Delete the original column
df2 = df2.drop("color", axis = 1)
df2

Unnamed: 0,size,price,classlabel,0,1,2
0,M,10.1,class2,0.0,1.0,0.0
1,L,13.5,class1,0.0,0.0,1.0
2,XL,15.3,class2,1.0,0.0,0.0
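Two practical details are worth a short sketch: the new columns can be named after the learned categories (stored in `categories_`), and with `handle_unknown='ignore'` a category that was not seen during fitting is encoded as a row of zeros instead of raising an error. The arrays `train_colors` and `new_colors` and the `color_` prefix are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption for illustration only)
train_colors = np.array([['green'], ['red'], ['blue']], dtype=object)
new_colors = np.array([['red'], ['yellow']], dtype=object)  # 'yellow' was never seen

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train_colors)

# Unknown categories become all-zero rows
encoded_new = ohe.transform(new_colors).toarray()

# Readable column names built from the learned categories
column_names = ['color_' + c for c in ohe.categories_[0]]

print(column_names)
print(encoded_new)
```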


An alternative way to get a one-hot encoding is to use the pandas function ``get_dummies()``. One advantage of this approach is that you can specify a prefix for the new columns.   
More information: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html 


In [21]:
df3 = initializeData()
df3 = pd.get_dummies(df3, columns = ["color"], prefix="color" )
df3

Unnamed: 0,size,price,classlabel,color_blue,color_green,color_red
0,M,10.1,class2,0,1,0
1,L,13.5,class1,0,0,1
2,XL,15.3,class2,1,0,0
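One thing to keep in mind is that `get_dummies()` has no separate fit step: it only creates columns for the categories present in the data it is given. The sketch below, with two made-up DataFrames `train_df` and `test_df`, shows one possible way to align the test columns with the training columns using `reindex`.

```python
import pandas as pd

# Toy data (assumption for illustration only)
train_df = pd.DataFrame({'color': ['green', 'red', 'blue']})
test_df = pd.DataFrame({'color': ['red', 'red']})

train_dummies = pd.get_dummies(train_df, columns=['color'], prefix='color')
test_dummies = pd.get_dummies(test_df, columns=['color'], prefix='color')

# test_dummies only contains color_red; add the missing columns filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

If train and test data have to be encoded consistently, the OneHotEncoder shown above is often the more convenient choice, because it remembers the categories it was fitted on.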


## Discretization

Binning continuous data into intervals.

In [22]:
# Discretize each feature (column) separately into 5 bins
from sklearn.preprocessing import KBinsDiscretizer

disc = KBinsDiscretizer(n_bins = 5, encode='ordinal').fit(X_train)
disc.transform(X_test)

array([[3., 1., 3., 2.],
       [2., 4., 1., 1.],
       [4., 0., 4., 4.],
       [2., 2., 2., 3.],
       [4., 1., 3., 2.],
       [1., 4., 1., 1.],
       [2., 2., 1., 2.],
       [4., 3., 3., 4.],
       [3., 0., 2., 3.],
       [2., 1., 1., 2.],
       [3., 3., 3., 4.],
       [0., 3., 0., 0.],
       [1., 4., 0., 1.],
       [0., 3., 1., 0.],
       [1., 4., 1., 1.],
       [3., 3., 3., 3.],
       [3., 3., 4., 4.],
       [2., 0., 1., 1.],
       [2., 1., 2., 2.],
       [3., 1., 4., 4.],
       [0., 3., 1., 1.],
       [3., 3., 3., 3.],
       [1., 4., 1., 1.],
       [3., 1., 4., 4.],
       [4., 4., 4., 4.],
       [4., 3., 3., 4.],
       [4., 0., 4., 3.],
       [4., 3., 4., 4.],
       [0., 3., 0., 1.],
       [0., 3., 1., 1.],
       [0., 4., 0., 1.],
       [2., 4., 1., 1.],
       [4., 3., 2., 2.],
       [0., 4., 1., 1.],
       [0., 3., 0., 1.],
       [3., 0., 3., 4.],
       [3., 3., 2., 3.],
       [1., 4., 1., 1.],
       [1., 4., 0., 1.],
       [1., 4., 1., 0.],
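To see where the bin boundaries lie, the fitted discretizer exposes them in the attribute `bin_edges_`. The following sketch uses a made-up column with the values 0 to 9 and `strategy='uniform'` (equal-width bins) so that the edges are easy to check by hand; the cell above uses the default strategy instead.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy data (assumption for illustration only): one feature with the values 0.0 ... 9.0
values = np.arange(10, dtype=float).reshape(-1, 1)

# Three bins of equal width
uniform = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
binned = uniform.fit_transform(values)

print(uniform.bin_edges_[0])  # bin edges of the first (and only) feature
print(binned.ravel())         # bin index assigned to each value
```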


## Column Transformers

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. The ColumnTransformer applies different transformations to different columns. It can also be included in a Pipeline. 

https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data

In [28]:
from sklearn.compose import ColumnTransformer
df4 = initializeData()
num_attributes = ['price']
cat_attributes = ['color', 'size', 'classlabel']

full_pipeline = ColumnTransformer([
    ("num", scaler, num_attributes), 
    ("cat", OrdinalEncoder(), cat_attributes)
])

prepared = full_pipeline.fit_transform(df4)
prepared

color          object
size           object
price         float64
classlabel     object
dtype: object


array([[-1.32954369,  1.        ,  1.        ,  1.        ],
       [ 0.24735697,  2.        ,  0.        ,  0.        ],
       [ 1.08218672,  0.        ,  2.        ,  1.        ]])
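As a sketch of how a ColumnTransformer might be included in a Pipeline, the following block combines the preprocessing with a classifier. The DataFrame `toy`, the step names and the choice of `KNeighborsClassifier(n_neighbors=1)` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption for illustration only)
toy = pd.DataFrame({
    'color': ['green', 'red', 'blue', 'red'],
    'price': [10.1, 13.5, 15.3, 12.0],
    'classlabel': ['class2', 'class1', 'class2', 'class1'],
})
features = toy[['color', 'price']]
target = toy['classlabel']

# Scale the numerical column, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['price']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['color']),
])

# Preprocessing and classifier are fitted together in one call
model = Pipeline([
    ('preprocess', preprocess),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])
model.fit(features, target)
predictions = model.predict(features)
print(predictions)
```

Here the target column is kept out of the ColumnTransformer and passed to `fit` separately, which is the usual setup when a classifier follows the preprocessing.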

## Pipelines
***Pipeline*** can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

* Convenience and encapsulation
    
* The parameters of the transformers can be included in the parameter selection
   
* Avoiding data leakage: Ensures that statistics from the validation data are not incorporated into the preprocessors, which would make cross-validation scores unreliable. 
   

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).

Docs: https://scikit-learn.org/stable/modules/compose.html#pipeline

#### Instantiating, filling and using a pipeline

In [24]:
# Instantiating and filling a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaler = StandardScaler()
knn = KNeighborsClassifier()

pipeline = Pipeline([
    ('scaler', scaler),
    ('nearest neighbor', knn)
])

If you don't want to name the steps of the pipeline yourself, you can use ``make_pipeline`` instead.

In [25]:
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, knn)
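The steps still get names, which `make_pipeline` derives from the lowercased class names. A short sketch (the name `auto_named` is chosen here to leave the `pipeline` variable untouched):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Each step is a (name, estimator) tuple; the names are generated automatically
step_names = [name for name, _ in auto_named.steps]
print(step_names)
```

These generated names are needed when parameters of individual steps are addressed, for example in a parameter search.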

Use the pipeline

In [26]:
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0])
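The points on parameter selection and data leakage listed earlier can be sketched with `GridSearchCV`. Parameters of a pipeline step are addressed as `<step name>__<parameter name>`. The parameter grid, the step names and the variable names in this block are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Search over the number of neighbors of the step named 'knn'
param_grid = {'knn__n_neighbors': [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5)

# In every cross-validation split the scaler is fitted on the training folds only
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the best pipeline on the test data
```

Because the scaler is part of the pipeline, it is refitted inside each fold, so no statistics from the validation folds leak into the preprocessing.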