## Preprocessing, Feature Extraction and Pipelines


In [None]:
# Standard Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns; sns.set()

# New submodules
from sklearn import preprocessing
from sklearn import feature_extraction
from sklearn import pipeline

### Scaling

Features of a dataset can be on very different scales (e.g. salary vs. age). This can hurt many machine learning algorithms, especially those that rely on distances or gradient-based optimization. Transforming features to have similar scales is therefore a common preprocessing step. Some methods for scaling:

* **Z-Normalization:** Transform each dimension of the data to have 0 mean and 1 standard deviation.
* **Min-Max Scaling:** Transform each dimension of the data so that it falls between two values. Usually this range is chosen as [0,1] or [-1,1].
* **Max Absolute Value Scaling:** Transform each dimension of the data so that its absolute value does not exceed a certain value. Usually this value is 1.
* **Robust Scaling:** *Outlier* values in the data can adversely affect scaling operations. A scaling approach using the median and quartile ranges may be preferable when outliers are present.
* **Whitening:** Scaling and decorrelating multiple dimensions to have a 0-vector mean and identity covariance. We will see this when we go over Principal Component Analysis. ([Wikipedia entry](https://en.wikipedia.org/wiki/Whitening_transformation))

**Note**: Care should be taken when preprocessing data that contains many 0s (sparse data). It is easy to destroy the sparse structure, which increases memory use and computation time. Methods that change the *center* of the data are the main culprits (e.g. z-normalization). Methods that keep the sparse structure intact should be chosen instead (e.g. max absolute value scaling).
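
The note above can be illustrated with a small sketch. The matrix `X_sparse` is made up for illustration and assumes SciPy is installed. It shows that `MaxAbsScaler` keeps the zeros in place, while `StandardScaler` refuses to center sparse input unless `with_mean=False` is passed.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# A small, mostly-zero matrix stored in sparse (CSR) format (illustrative data)
X_sparse = sparse.csr_matrix(np.array([[ 0., 0., 4.],
                                       [ 0., 2., 0.],
                                       [-1., 0., 0.],
                                       [ 0., 0., 2.]]))

# MaxAbsScaler only divides each column by its maximum absolute value,
# so zeros stay zeros and the result is still sparse
X_ma = preprocessing.MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(X_ma), X_ma.nnz)

# Centering would turn the zeros into non-zeros, so StandardScaler
# raises an error on sparse input when centering is requested (the default)
try:
    preprocessing.StandardScaler().fit_transform(X_sparse)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("StandardScaler:", err)

# Scaling without centering keeps the sparse structure
X_std = preprocessing.StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_std), X_std.nnz)
```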

In [None]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.],
              [ 1.,  0.,  1.]])
print("Original:")
print(X)

Z-Normalization:

In [None]:
Xzn = preprocessing.scale(X)
print("Z-Normalized:")
print(Xzn)

In [None]:
print("Before: Mean and Standard Deviation")
print(X.mean(axis = 0), X.std(axis = 0))
print()
print("After: Mean and Standard Deviation")
print(Xzn.mean(axis = 0), Xzn.std(axis = 0))
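
As a cross-check, z-normalization can be written out by hand. This is a minimal sketch that repeats the example matrix under the name `X_demo` (a name chosen here) and compares the manual result with `preprocessing.scale`.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.],
                   [1.,  0.,  1.]])

# Column-wise mean and population standard deviation (ddof=0),
# which is the convention scikit-learn uses as well
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)

X_manual = (X_demo - mu) / sigma
print(np.allclose(X_manual, preprocessing.scale(X_demo)))
```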

In [None]:
# Keeping the scaling information:
scaler = preprocessing.StandardScaler()

# Calculate the scaling values:
scaler.fit(X) 
print("Scaling Class:")
print(scaler)
print()
print("Scaling Values:")
print(scaler.mean_, scaler.scale_)

In [None]:
print("Z-Normalized:")

# Scale the incoming data
print(scaler.transform(X))

# TRANSFORMER API!!!

# fit(train) + transform(train) + transform(test)
# fit_transform(train) + transform(test)
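
The train/test pattern in the comments above might look as follows in practice. This is a sketch on randomly generated data (`X_all` is an assumption, not part of the lecture data). The scaling statistics are learned from the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up data

X_train, X_test = train_test_split(X_all, test_size=0.25, random_state=0)

scaler_demo = StandardScaler()
X_train_s = scaler_demo.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler_demo.transform(X_test)        # reuse the train statistics

# The train split is exactly centered; the test split is only approximately so
print(X_train_s.mean(axis=0))
print(X_test_s.mean(axis=0))
```

Calling `fit` on the test split instead would leak information about the test data into the preprocessing step.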

Min-Max Scaling:

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()

# Calculate the scaling values and scale the original data in a single line
Xmm = min_max_scaler.fit_transform(X)
print("Original")
print(X)
print()
print("Range [0, 1]:")
print(Xmm)
print()

print("Scaling Values:")
print(min_max_scaler.scale_, min_max_scaler.min_)

In [None]:
min_max_scaler.inverse_transform(Xmm)

In [None]:
Y = np.array([[1.5, -1., 2.5]])
print("When new data comes: ") 
print(Y)
print(min_max_scaler.transform(Y))
print()
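
New data can fall outside the range seen during `fit`, so the transformed values can leave [0, 1]. The sketch below repeats the example matrix as `X_fit` (a name chosen here) and contrasts the default behaviour with the `clip=True` option of `MinMaxScaler`, assuming a scikit-learn version that provides it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_fit = np.array([[1., -1.,  2.],
                  [2.,  0.,  0.],
                  [0.,  1., -1.],
                  [1.,  0.,  1.]])
Y_new = np.array([[1.5, -1., 2.5]])  # 2.5 exceeds the fitted maximum of 2

plain = MinMaxScaler().fit(X_fit)
clipped = MinMaxScaler(clip=True).fit(X_fit)

print(plain.transform(Y_new))    # last entry is larger than 1
print(clipped.transform(Y_new))  # last entry is clipped to 1
```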

In [None]:
print("Original")
print(X)
print()
print("Range [-1, 1] ")
min_max_scaler_range = preprocessing.MinMaxScaler(feature_range = (-1, 1))

print(min_max_scaler_range.fit_transform(X)) 

Max Absolute Value Scaling (Better for Sparse Data)

In [None]:
max_abs_scaler = preprocessing.MaxAbsScaler()
Xma = max_abs_scaler.fit_transform(X)
print("Original")
print(X)
print()
print("Absolute Value to 1:")
print(Xma)
print()
print("Scaling:")
print(max_abs_scaler.scale_)
print()

In [None]:
print("With different data")
Y = np.array([[ -3., -1.,  4.]])
print(Y)
print(max_abs_scaler.transform(Y))

Scaling with Outliers

In [None]:
# Random data generation
Xo = np.random.standard_t(2, (50,2))
plt.boxplot(Xo)
plt.title("Data ")
down, up = plt.ylim()
plt.show()

z_scaler = preprocessing.StandardScaler()
Xz = z_scaler.fit_transform(Xo)
plt.boxplot(Xz)
plt.title("Z-Normalized")
plt.ylim(down, up)
plt.show()

robust_scaler = preprocessing.RobustScaler()
Xrs = robust_scaler.fit_transform(Xo)
plt.boxplot(Xrs)
plt.title("Robust")
plt.ylim(down, up)
plt.show()
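
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The sketch below writes this out by hand on freshly generated heavy-tailed data (`X_heavy` is an assumption, not the `Xo` used above) and compares the result with the class.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_heavy = rng.standard_t(2, size=(50, 2))  # heavy-tailed, so outliers are likely

# Per-column median and 25th/75th percentiles
median = np.median(X_heavy, axis=0)
q25, q75 = np.percentile(X_heavy, [25, 75], axis=0)

# Center by the median, scale by the interquartile range
X_manual = (X_heavy - median) / (q75 - q25)
print(np.allclose(X_manual, RobustScaler().fit_transform(X_heavy)))
```

Because the median and quartiles barely move when a few extreme values are added, the bulk of the data ends up on a similar scale regardless of the outliers.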


In [None]:
from sklearn import datasets
bc_dataset = datasets.load_breast_cancer()

In [None]:
plt.boxplot(bc_dataset["data"][:,-7])

In [None]:
plt.boxplot(preprocessing.MinMaxScaler().fit_transform(bc_dataset["data"][:,-7].reshape(-1,1)))

In [None]:
plt.boxplot(preprocessing.RobustScaler().fit_transform(bc_dataset["data"][:,-7].reshape(-1,1)))

Some data is inherently multi-dimensional (e.g. orientation of a rigid body), so it is not appropriate to scale each dimension by itself; the dimensions need to be scaled together. One such method is data whitening. Another is unit normalization, where we normalize the data points (or a feature subset of the data points) to have unit norm.

In [None]:
print("Original:")
print(X)
print()

Xn = preprocessing.normalize(X, norm = "l2")
print("Normalize to Unit Vectors:")
print(Xn)
print()

# when you need a class (e.g. to put in a pipeline)
normalizer = preprocessing.Normalizer()
print(normalizer)

In [None]:
normalizer.transform(X)

In [None]:
# fit() doesn't do anything here, but it exists so that the Normalizer class
# fully implements the Transformer API (useful for using it in a pipeline)
normalizer.fit(X)
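
Unlike the scalers, which work column by column, unit normalization works row by row. A minimal sketch, repeating the example matrix under the name `X_rows` (chosen here), divides each row by its L2 norm and compares the result with `preprocessing.normalize`.

```python
import numpy as np
from sklearn import preprocessing

X_rows = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.],
                   [1.,  0.,  1.]])

# L2 norm of every row, kept as a column so that broadcasting divides row-wise
row_norms = np.linalg.norm(X_rows, axis=1, keepdims=True)
X_unit = X_rows / row_norms

print(np.allclose(X_unit, preprocessing.normalize(X_rows, norm="l2")))
print(np.linalg.norm(X_unit, axis=1))  # every row now has unit length
```

Since each row is handled independently, nothing has to be learned from the training data, which is why `fit` has nothing to do.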

We will talk about Data Whitening when we cover Principal Component Analysis

### Categorical Data Encoding

Many learning algorithms expect numerical values (vector, matrix, etc.) as input. Therefore, it is necessary to encode categorical data this way. For example, consider customer data: `[gender, city, occupation]`. Let there be `[female, male]` for gender, `[ankara, istanbul, izmir]` for city, and `[private, public, freelance, retired]` for occupation.

How can we convert categories to numbers?

In [None]:
enc = preprocessing.OneHotEncoder()

M = [["male", "ankara", "public"], 
     ["female", "istanbul", "private"],
     ["female", "izmir", "retired"],
     ["male", "istanbul", "freelance"]]

# Getting the categories from the data
enc.fit(M)
print("Encoder:")
print(enc)
print("Encoding:")
# 2 (gender) + 3 (city) + 4 (occupation) = 9 dimensional
print(enc.transform(M).toarray())

# New data
print(enc.transform([["female", "istanbul", "freelance"],
                     ["male", "ankara", "retired"]]).toarray())

print("Categories:")
print(enc.categories_)

**Note:** It is often a better idea to map binary categories to a single dimension (0/1) instead of a two-dimensional one-hot vector.

In [None]:
enc.transform([["female", "izmir", "engineer"]]).toarray()

In [None]:
# Giving categories by hand
gender = ["female", "male"] 
city = ["ankara", "istanbul", "izmir", "kocaeli"]
occupation = ["private", "public", "freelance", "retired"]

enc2 = preprocessing.OneHotEncoder(categories = [gender, city, occupation])
print("Encoder:")
print(enc2)

In [None]:
print(enc2.transform([["female", "istanbul", "freelance"],
                      ["male", "kocaeli", "retired"]]).toarray())

In [None]:
# We still need to call the fit
enc2.fit(M)
print("Encoding:")
print(enc2.transform([["female", "istanbul", "freelance"],
                      ["male", "kocaeli", "retired"]]).toarray())
print()

In [None]:
# If there is a chance of getting an unexpected category (e.g. missing data)
enc3 = preprocessing.OneHotEncoder(handle_unknown = "ignore")
enc3.fit(M) 
print("Encoder:")
print(enc3)
print("Unknown category is mapped to a 0-vector of appropriate size:")
print(enc3.transform([["female", "bursa", "public"]]).toarray())

Sometimes data comes in a dictionary format (e.g. JSON). The `DictVectorizer` class handles these cases.

In [None]:
from sklearn.feature_extraction import DictVectorizer
dictData = [
    {"price": 1200000, "room": 4, "neighborhood": "Maslak", "purpose": "business"},
    {"price": 1400000, "room": 3, "neighborhood": "Etiler", "purpose": "house"},
    {"price":  500000, "room": 3, "neighborhood": "Tuzla",  "purpose": "house"},
    {"price":  900000, "room": 2, "purpose": "business", "neighborhood": "Etiler"}]
vec = DictVectorizer(sparse = False, dtype = int)
print(vec.fit_transform(dictData))

In [None]:
vec.feature_names_

**Note:** It only converts strings to one-hot vectors. If the categorical data is given with integers, we need to use the `OneHotEncoder` class.

**Other Category Encoders**

When there are a lot of categories and/or imbalanced category distribution, one-hot encoding does not work very well. Some domains also like to encode continuous variables. For a wide variety of categorical encoders (including the famous weight of evidence in banking): https://contrib.scikit-learn.org/category_encoders/.


In [None]:
import category_encoders as ce

encoder = ce.WOEEncoder(cols = [...])

### Filling Missing Data: Imputation

Real-world data often has missing portions. There are different ways to deal with this. The simplest and most commonly used methods fill the missing entries with the mean, median, mode, or a constant value.

Ignoring data points with missing features or using machine learning to fill them are also common.
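
The two simplest options can be sketched with plain NumPy before turning to the scikit-learn classes. The array `X_miss` is illustrative. The first option drops incomplete rows, the second fills each gap with the column mean of the observed values.

```python
import numpy as np

X_miss = np.array([[1.,     2.],
                   [np.nan, 3.],
                   [7.,     6.],
                   [5.,     3.],
                   [4.,     4.]])

# Option 1: ignore (drop) every row that has at least one missing value
complete_rows = ~np.isnan(X_miss).any(axis=1)
X_dropped = X_miss[complete_rows]
print(X_dropped)

# Option 2: fill each missing entry with the mean of its column,
# computed from the observed values only
col_means = np.nanmean(X_miss, axis=0)
X_filled = np.where(np.isnan(X_miss), col_means, X_miss)
print(X_filled)
```

Dropping rows is easy but throws away the observed values in those rows, which matters when data is scarce.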

In [None]:
from sklearn.impute import SimpleImputer
X = np.array([[1, 2], 
              [np.nan, 3], 
              [7, 6], 
              [5, 3], 
              [4, 4]])
print("Initial Data")
print(X) 

In [None]:
# Mean
imp = SimpleImputer(missing_values = np.nan, strategy = "mean")
imp.fit_transform(X)

In [None]:
Xe = np.array([[np.nan, 2], 
               [6, np.nan], 
               [7, 6]])
print("New Data")
print(Xe)

print("Mean:")
print(imp.transform(Xe))  

In [None]:
imp.statistics_

In [None]:
# Median
imp2 = SimpleImputer(missing_values = -999, strategy = "median")
X = np.array([[1,    2], 
              [-999, 3], 
              [7,    6]])

print("Initial Data")
print(X)
print()

Xe = np.array([[-999,   2], 
               [6,   -999], 
               [7,      6]])
print("New Data")
print(Xe)
print()

imp2.fit(X) 
print("Median:")
print(imp2.transform(Xe))

In [None]:
Xt = imp2.fit_transform(X)
print(Xt)

In [None]:
df = pd.DataFrame([["a", "y"],
                   [np.nan, "x"],
                   ["a", np.nan],
                   ["b", "y"]], dtype = "category")

display(df)

In [None]:
print("Most common based on available categories in the Dataframe:")
imp3 = SimpleImputer(strategy = "most_frequent")
tmp = imp3.fit_transform(df)
print(tmp)
print(type(tmp))

In [None]:
print("Pandas Dataframe")
print(df)
print()

print("Fill with Constant:")
imp4 = SimpleImputer(strategy = "constant", fill_value = "c")
print(imp4.fit_transform(df))

In [None]:
# Back-fill from the next valid row; fillna(method = "bfill") is deprecated in recent pandas
df.bfill()

In [None]:
# IterativeImputer is experimental in scikit-learn; the import below enables it
# and has to come before importing IterativeImputer itself
from sklearn.experimental import enable_iterative_imputer  # noqa: F401

from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Choose the predictor:
ite_imp = IterativeImputer(estimator = LinearRegression())
X = np.array([[7, 2, 3], 
              [4, np.nan, 6], 
              [10, 5, 9]])
print(X)
print()
ite_imp.fit_transform(X)

In [None]:
Xe = np.array([[np.nan, 2, 3], 
               [4, np.nan, 6], 
               [10, np.nan, 9]])
print(Xe)
print()
ite_imp.transform(Xe)

In [None]:
Xe2 = np.array([[np.nan, 4, 5], 
               [2, np.nan, 8], 
               [12, np.nan, 6]])

ite_imp.transform(Xe2)

In [None]:
# ite_imp.

### Polynomial Features

Some problems benefit from non-linearity. One way to get this is to create non-linear features from data. Linear estimators can be turned into non-linear ones this way. One of the most common ways of doing this is to add polynomials of the inputs. We can do this with the `PolynomialFeatures` class. 

Another common method is to use RBF functions but we will cover it in the regression part of the lecture. There are also others. Nowadays however, we rely more on the non-linearity of the methods instead of extracting non-linear features.
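
To see how polynomial features turn a linear estimator into a non-linear one, here is a small sketch on a made-up, noise-free quadratic target. A plain `LinearRegression` cannot fit it, but the same model fitted on degree-2 features can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: y = x^2 on a symmetric interval
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2

# A straight line cannot capture the curvature
linear_only = LinearRegression().fit(x, y)
r2_linear = linear_only.score(x, y)

# The same linear model on the features [x, x^2] fits the target
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
with_poly = LinearRegression().fit(x_poly, y)
r2_poly = with_poly.score(x_poly, y)

print("R^2 without polynomial features:", r2_linear)
print("R^2 with polynomial features:   ", r2_poly)
```

The model is still linear in its parameters; only the inputs it sees have become non-linear functions of the original feature.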

In [None]:
from sklearn.preprocessing import PolynomialFeatures

Xp = np.arange(6).reshape(3, 2)
print("Original:")
print(Xp)

In [None]:
poly = PolynomialFeatures(2) #exhaustive to 2nd degree
print("With polynomial features added: 1 x1 x2 (x1)^2 x1*x2 (x2)^2")
print(poly.fit_transform(Xp))

In [None]:
poly2 = PolynomialFeatures(degree = 2, 
                           interaction_only = True, 
                           include_bias = False)
print("With interaction-only polynomial features added: x1 x2 x1*x2")
print(poly2.fit_transform(Xp))

### Custom Functions for Preprocessing

Sometimes we want to use custom functions for preprocessing (e.g. taking the logarithm to make a feature more symmetric). There are multiple ways of doing this, but the one below is recommended because it can easily be included in Pipelines.

In [None]:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate = True)
Xt = np.array([[0, 1], [2, 3]])
print(Xt)
print()
print(np.log1p(Xt))
print()
print(transformer.transform(Xt))
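
`FunctionTransformer` also accepts an `inverse_func`, which is handy when the transformation has to be undone later. The sketch below pairs `np.log1p` with `np.expm1`; the names `log_tf` and `X_counts` are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p and expm1 are exact inverses of each other
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, validate=True)

X_counts = np.array([[0., 1.], [2., 3.]])
X_log = log_tf.fit_transform(X_counts)
X_back = log_tf.inverse_transform(X_log)

print(X_log)
print(X_back)  # recovers the original values
```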

### Text Features

We need to convert text data into numerical values so that we can use it in ML algorithms. There are many approaches to this. `scikit-learn` provides implementations of two older but still useful methods. The first is the `bag-of-words` model, which uses word counts directly. The other is the `tf-idf` model, which trades off the count of a word against how common it is across documents (if a word is common everywhere, it may not be important).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
example = ["It is hot today", 
           "The hot hot and sour soup", 
           "Air, water, road and electricity", 
           "On the road today"]

print("Count Based")
vec = CountVectorizer()
Xbow = vec.fit_transform(example)
display(pd.DataFrame(Xbow.toarray(), columns = vec.get_feature_names_out()))


In [None]:
test_example = ["Hot air soup"]
display(pd.DataFrame(vec.transform(test_example).toarray(), columns = vec.get_feature_names_out()))


In [None]:
test_example2 = ["Hot air balloon"]
display(pd.DataFrame(vec.transform(test_example2).toarray(), columns = vec.get_feature_names_out()))

In [None]:
vec1sw = CountVectorizer(stop_words=['it','and'])
Xbow1sw = vec1sw.fit_transform(example)
display(pd.DataFrame(Xbow1sw.toarray(), columns = vec1sw.get_feature_names_out()))

In [None]:
print("TF-IDF")
vec2 = TfidfVectorizer()
Xtfidf = vec2.fit_transform(example)
display(pd.DataFrame(Xtfidf.toarray(), columns = vec2.get_feature_names_out()))

In [None]:
[np.linalg.norm(Xtfidf[i,:].toarray()) for i in range(Xtfidf.shape[0])]
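
The unit norms above come from the last step of the tf-idf computation. As a sketch, assuming the default `TfidfVectorizer` settings (smoothed idf and L2 row normalization), the same matrix can be rebuilt from the raw counts. The example sentences are repeated as `docs`, a name chosen here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["It is hot today",
        "The hot hot and sour soup",
        "Air, water, road and electricity",
        "On the road today"]

# Term frequencies: the raw bag-of-words counts
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: rarer words get larger weights
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and normalize every row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```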

We will use these features in a little bit.

### Pipelining Pre-Processing, Feature Extraction and Prediction Steps

In most machine learning applications, we perform several "operations" on the data before feeding it to a learning algorithm.

For example:  
1. Fill the missing values with means 
2. Scale the data to the [0,1] range
3. Add second degree polynomial features
4. Fit a linear model to the data

The data enters the learning process after undergoing several transformations. We have already seen the Transformer API, which the transformation algorithms implement. This API defines the `fit()` and `transform()` functions, along with `fit_transform()`, which does both in a single step.
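
To make the Transformer API concrete, here is a minimal sketch of a custom transformer. The class `MeanCenterer` is hypothetical and written only for illustration. It implements `fit()` and `transform()`, and inherits `fit_transform()` from scikit-learn's `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the column means seen in fit()."""

    def fit(self, X, y=None):
        # Learn the statistics from the data passed to fit (the training data)
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the learned statistics to any data, train or test
        return np.asarray(X, dtype=float) - self.mean_

X_train = np.array([[1., 10.], [3., 30.]])
X_test = np.array([[2., 20.]])

centerer = MeanCenterer()
X_train_c = centerer.fit_transform(X_train)  # provided by TransformerMixin
X_test_c = centerer.transform(X_test)        # uses the train means

print(X_train_c)
print(X_test_c)
```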

`scikit-learn` provides the `pipeline` module to abstract and simplify this process. A pipeline represents data passing through multiple transformers and finally into an estimator. Let's see how it is used:



In [None]:
from sklearn.preprocessing import MinMaxScaler
# Input
Xpp = np.array([[ np.nan, 0,   3  ],
                [ 3,      7,   9  ],
                [ 3,      5,   2  ],
                [ 4, np.nan,   6  ],
                [ 8,      8,   1  ]])

# Target
ypp = np.array([14.2, 15.9, -1.01,  7.93, -5.2])

#Fill the missing values with means
simple_imp = SimpleImputer(strategy="mean")
Ximp = simple_imp.fit_transform(Xpp)

#Scale the data to the [0,1] range
mm_scaler = MinMaxScaler()
Xsca = mm_scaler.fit_transform(Ximp)

#Add second degree polynomial features
pf = PolynomialFeatures(degree=2)
Xfeats = pf.fit_transform(Xsca)

#Fit a linear model to the data
model = LinearRegression()

model.fit(Xfeats, ypp)

In [None]:
print("Training")
print(model.predict(Xfeats))

In [None]:
print(Xfeats.shape)

In [None]:
XppTest = np.array([[     2,      6, np.nan],
                    [np.nan, np.nan, np.nan]])

XTimp = simple_imp.transform(XppTest)
XTsca = mm_scaler.transform(XTimp)
XTfeats = pf.transform(XTsca)

print("Testing")
print(model.predict(XTfeats))

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

                     # 1. Fill the missing values with means 
pipe = make_pipeline(SimpleImputer(missing_values = np.nan, strategy = "mean"),
                     # 2. Scale the data to the [0,1] range                     
                     preprocessing.MinMaxScaler(),
                     # 3. Add second degree polynomial features
                     PolynomialFeatures(degree = 2),
                     # 4. Fit a linear model to the data
                     LinearRegression())

print(type(pipe))
print(pipe)

In [None]:
# We can also label the steps (useful for later)
from sklearn.pipeline import Pipeline
steps = [("impute", SimpleImputer(missing_values = np.nan, strategy = "mean")), 
         ("scale", preprocessing.MinMaxScaler()),
         ("poly", PolynomialFeatures(degree = 2)),
         ("learn", LinearRegression())]

pipe2 = Pipeline(steps)
print(pipe2)

The resulting Pipeline object implements `fit` and `predict` functions.

In [None]:
# Fitting
pipe.fit(Xpp, ypp)

In [None]:
print("Training Targets:")
print(ypp)
print()
print("Training Predictions:")
print(pipe.predict(Xpp))

In [None]:
print("Test Prediction:")
print(pipe.predict(XppTest))

In [None]:
stepsNoLearn = [("impute", SimpleImputer(missing_values = np.nan, strategy = "mean")), 
                ("scale", preprocessing.MinMaxScaler()),
                ("poly", PolynomialFeatures(degree = 2))]

pipeNL = Pipeline(stepsNoLearn)
Xfeats2 = pipeNL.fit_transform(Xpp)
Xfeats2

In [None]:
Xfeats

In [None]:
# Look at the individual steps:
tmp = pipe.named_steps['polynomialfeatures']
tmp.powers_

We can create pipelines with any number of transformers and optionally a single predictor at the end.

In effect, pipelines perform back-to-back `fit_transform` operations, feeding the output of each transformer as the input of the next. Most transformers are multi-input and multi-output and apply the same operation to every feature. (Some only accept and output 1D data. You can also write custom transformers that apply different operations to different dimensions.)
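
The chaining described above can be written out as a loop. This is a simplified sketch of what `fit` and `predict` do on a pipeline, not scikit-learn's actual implementation (which also handles caching, parameters and more). The helper `make_steps` and the `_demo` names are chosen here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

X_demo = np.array([[np.nan, 0,      3],
                   [3,      7,      9],
                   [3,      5,      2],
                   [4,      np.nan, 6],
                   [8,      8,      1]])
y_demo = np.array([14.2, 15.9, -1.01, 7.93, -5.2])

def make_steps():
    # Fresh, unfitted copies of the four steps used in this section
    return [SimpleImputer(strategy="mean"), MinMaxScaler(),
            PolynomialFeatures(degree=2), LinearRegression()]

# Fitting: fit_transform through every transformer, then fit the predictor
*transformers, predictor = make_steps()
Xt = X_demo
for step in transformers:
    Xt = step.fit_transform(Xt)
predictor.fit(Xt, y_demo)

# Predicting: transform only (no refitting), then predict
Xp = X_demo
for step in transformers:
    Xp = step.transform(Xp)
manual_pred = predictor.predict(Xp)

pipe_pred = make_pipeline(*make_steps()).fit(X_demo, y_demo).predict(X_demo)
print(np.allclose(manual_pred, pipe_pred))
```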

**Is this good enough?**

In [None]:
# Input
Xpp2 = np.array([[ np.nan, 'a',   3  ],
                 [ 3,      'b',   9  ],
                 [ 3,      'a',   2  ],
                 [ 4, np.nan,   6  ],
                 [ 8,      'c',   1  ]])

# Target
ypp2 = np.array([14.2, 15.9, -1.01,  7.93, -5.2])

In [None]:
pipe2.fit(Xpp2,ypp2)

**Not Enough:**  
* What if we have different data types? The transformers in question are mostly designed for a single type of data (e.g. scalers for numerical variables vs. encoders for categorical variables).
* What if we want to apply different pre-processing steps to the same type of data? (e.g. log transform the count variables and keep the others as is before scaling)
* What if we want to extract multiple types of features in parallel? (e.g. polynomial and radial basis features for regression)

For these, we are going to use the `ColumnTransformer` and `FeatureUnion` classes. Before those, let's do an exercise.