# DSA5101 - Introduction to Big Data for Industry


**Prepared by *Dr Li Xiaoli*** 

# Preprocessing

## Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of machine learning models) assume that all features are centered around zero and have variance in the same order. 

## If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

## The function **scale** provides a quick and easy way to perform this operation on a single array-like dataset:

# 1. Standardization

> Sklearn provides data preprocessing capabilities; we will learn a few important ones.

> We will also use numpy to generate array data

## 1.1 Scaled  features - zero mean and unit variance

In [1]:
# Import the preprocessing package
from sklearn import preprocessing
import numpy as np

In [2]:
#Generate array data
X = np.array([[ 1., -1.,  2.],
               [ 2.,  0.,  0.],
               [ 0.,  1., -1.]])

X

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

### Perform scaling on data using preprocessing.scale (data)

### 1.1.1 Scaling on Columns/Features

In [3]:
X_scaled_features = preprocessing.scale(X) # default scaling on columns/features, i.e. axis=0
X_scaled_features   

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [4]:
# Below we get the same results, i.e. with or without axis=0 is the same
X_scaled_features = preprocessing.scale(X, axis=0) # default scaling on columns/features, i.e. axis=0
X_scaled_features   

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

#### Scaled data has zero mean and unit variance, in terms of 3 columns or features

### 1.1.1.1 Let us check/verify X_scaled_features' mean and variance

In [5]:
X_scaled_features.mean(axis=0)
# axis : int (0 by default)
# The axis along which the means and standard deviations are computed.
# If axis=0, independently standardize each feature (vertical);
# otherwise (if axis=1) standardize each sample (horizontal).

array([0., 0., 0.])

In [6]:
X_scaled_features.std(axis=0)

array([1., 1., 1.])

In [7]:
X_scaled_features   

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [8]:
# Let us take a look at the mean of third column
(1.33630621-0.26726124-1.06904497)/3

0.0

#### Compute the variance of the first column according to the variance formula, given that the mean is 0
#### $\frac{\sum (\text{value}-\text{mean})^2}{n} = \frac{\sum (\text{value}-0)^2}{3}$

In [9]:
firstcolumnvariance=(((0-0)**2+(1.22474487-0)**2+(-1.22474487-0)**2)/3)
print ("unit variance %.0f" % firstcolumnvariance)

unit variance 1


#### This means every column has zero mean and unit variance
#### We can also see it directly from the mean and std outputs above
#### The key advantage is that we do not want some features to dominate later computations, e.g. similarity, distance, objective function, etc.
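
As a rough sketch of what `preprocessing.scale` computes for columns, the same result can be reproduced with plain numpy by subtracting each column's mean and dividing by its standard deviation. The name `X_manual` is introduced here only for illustration.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Standardize each column: z = (x - column mean) / column standard deviation.
# numpy's std() uses the population formula (divide by n) by default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Compare with sklearn's result
print(np.allclose(X_manual, preprocessing.scale(X)))
```

Replacing `axis=0` with `axis=1` (and adding `keepdims=True`) would standardize rows instead of columns.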

#### However, what about means for rows?

### 1.1.1.2 Let us check/verify X_scaled_features' rows' mean and variance

In [10]:
X_scaled_features      

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [11]:
X_scaled_features.mean(axis=1)

array([ 0.03718711,  0.31916121, -0.35634832])

In [12]:
# manually compute the average of the first row [ 0. , -1.22474487,  1.33630621]
firstrowave=(0-1.22474487+1.33630621)/3
print("The average of first row is %.8f" % firstrowave)

The average of first row is 0.03718711


So the rows do not have zero mean

In [13]:
# help(X_scaled_features.mean)

In [14]:
X_scaled_features.std(axis=1)
# Standard deviation of each row (sample), since axis=1

array([1.04587533, 0.64957343, 1.11980724])

The rows do not have unit variance either

#### Now, let us scale on rows

### 1.1.2 Scaling on Rows

In [8]:
# Using the same data
X = np.array([[ 1., -1.,  2.],
               [ 2.,  0.,  0.],
               [ 0.,  1., -1.]])

X

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [9]:
X_scaled_rows = preprocessing.scale(X, axis=1)  #axis=1, indicating we are handling/scaling rows
X_scaled_rows   

array([[ 0.26726124, -1.33630621,  1.06904497],
       [ 1.41421356, -0.70710678, -0.70710678],
       [ 0.        ,  1.22474487, -1.22474487]])

In [10]:
X_scaled_rows.mean(axis=1)

array([1.48029737e-16, 7.40148683e-17, 0.00000000e+00])

In [11]:
X_scaled_rows.std(axis=1)

array([1., 1., 1.])

Each row now has zero mean (up to floating-point rounding error) and unit variance

In [19]:
# You can get more information about scale from 
# ? preprocessing.scale

## 1.2 Scaling features to a range

### An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. 

### This can be achieved using preprocessing.MinMaxScaler or MaxAbsScaler, respectively.

In [13]:
# The following three features have range [0, 2], [-1, 1] and [-1,2] respectively
X_train = np.array([[ 1., -1.,  2.],
                     [ 2.,  0.,  0.],
                     [ 0.,  1., -1.]])

# perform MinMax scaler to make them into the same range [0,1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train) # fit training data and transform training data
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

#### The same instance of the transformer can then be applied to new _test data_ that was unseen during the fit call: the same scaling operations will be applied, consistent with the transformation performed on the training data!
#### We learn the parameters from the training data and then apply them to the test data

In [15]:
#Suppose we have a test example [ -3., -1.,  4.], which is different from training examples
X_test = np.array([[ -3., -1.,  4.]])

# min_max_scaler is obtained from training data
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

#### Note that the range of test feature values may not be [0,1]  
#### as we are obtaining min-max scaling (parameters, i.e. original minimal and maximal values for each feature) based on training data, 
#### which can only make each feature  in training data within range [0, 1] instead of test data

#### Test data features could have bigger or smaller values than the parameters obtained from training data. For example, for feature 1, the two parameters are 0 (minimal) and 2 (maximal) [0, 2] in training data, but now in test data, the first feature value is -3, much smaller than minimal value 0 in training data. So in the end, it has been scaled to -1.5.
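
To make the train/test behaviour concrete, here is a minimal sketch that applies the min-max formula by hand, using only the minimum and maximum learned from the training data. It also shows `MaxAbsScaler`, mentioned at the start of this subsection, on the same training data. The names `train_min`, `train_max` and `X_test_manual` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Parameters learned from the training data only
train_min = X_train.min(axis=0)   # [ 0., -1., -1.]
train_max = X_train.max(axis=0)   # [ 2.,  1.,  2.]

# Min-max formula: (x - min) / (max - min), applied to the test data
X_test_manual = (X_test - train_min) / (train_max - train_min)
print(X_test_manual)

# MaxAbsScaler divides each feature by its maximum absolute training value
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)
```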

# 2. Normalization

### Normalization is the process of _scaling individual samples to have unit norm_. 
### This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

### The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the L1 or L2 norms:

https://rorasa.wordpress.com/2012/05/13/l0-norm-l1-norm-l2-norm-l-infinity-norm/

In [16]:
X = [[ 1., -1.,  2.],
      [ 2.,  0.,  0.],
      [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2') 

X_normalized 
# Here the sum of squares of the elements in each normalized row is 1

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

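In general, given a vector [$x_1, x_2, ..., x_n$], L2 normalization divides every element by the vector's length $\sqrt{x_1^2+ x_2^2+...+ x_n^2}$,
so the normalized vector [$x'_1, x'_2, ..., x'_n$] satisfies $\sqrt{{x'_1}^2+ {x'_2}^2+...+ {x'_n}^2}=1$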

#### Let us verify it

In [23]:
from math import sqrt   # only the sqrt function is needed
normalized_length = sqrt(0.40824829**2 + (-0.40824829)**2 + 0.81649658**2)
print ("unit length %.0f" % normalized_length)

unit length 1


#### What is the difference between scaling and normalization here?
#### Scaling can be performed on columns (features) or rows (records), e.g. scale to zero mean and unit variance
#### Normalization is performed for rows only, to make sure each row (sample/vector) has unit length (say for L-2 norm)
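
The following sketch, under the assumption that the same 3x3 array is used, reproduces L2 normalization by hand and shows the L1 variant for comparison. `X_l2_manual` and `X_l1` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# L2: divide each row by its Euclidean length
l2_lengths = np.linalg.norm(X, axis=1, keepdims=True)
X_l2_manual = X / l2_lengths

# L1: each row is divided by the sum of its absolute values
X_l1 = preprocessing.normalize(X, norm='l1')

print(np.allclose(X_l2_manual, preprocessing.normalize(X, norm='l2')))
print(np.abs(X_l1).sum(axis=1))   # every row sums to 1 in absolute value
```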

# 3. Imputation of missing values
> Many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders.

> A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). 

> A better strategy is to impute the missing values, i.e., to infer them from the known parts of the data.

> The **SimpleImputer class** provides basic strategies for imputing missing values, using the mean, the median or the most frequent value of the column in which the missing values are located. 

> This class also allows for *different* missing values encodings.

> The following snippet demonstrates how to replace missing values, encoded as np.nan, using the **mean value** of the columns that contain the missing values:

### 3.1 Univariate feature imputation

#### 3.1.1. Using mean value

In [17]:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# SimpleImputer works column by column, i.e. each feature is imputed separately

In [18]:
# A simple Data: 3 records with 2 dimension
imp.fit([[1, 2], [np.nan, 3], [7, 6]])  #Fit training data

SimpleImputer()

#### The $\underline{first}$ feature, using (1+7)/2=4
#### The $\underline{second}$ feature, using (2+3+6)/3=3.66666667
#### We need to perform fit to compute mean score for each feature with 'NaN' values
#### From training data, we will compute the mean value for each feature. 

In [19]:
# Now, given a new test example, how to handle/transform its missing value?
# We use the mean value computed from training data to replace missing values in test/future data
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))                       

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
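
As a small check of what `fit` learned, the sketch below reads the fitted column means from the imputer's `statistics_` attribute and compares them with `np.nanmean`, which ignores NaN entries. The variable name `X_train` is an assumption introduced for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1, 2], [np.nan, 3], [7, 6]])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)

# Column means learned during fit (NaN entries are skipped)
print(imp.statistics_)

# The same numbers computed directly with numpy
print(np.nanmean(X_train, axis=0))
```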


#### 3.1.2. Using most frequent value

In [20]:
import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
# Data Frame is more like database table

In [22]:
df

| |0|1|
|------|------|------|
|0|a|x|
|1|NaN|y|
|2|a|NaN|
|3|b|y|


In [24]:
imp = SimpleImputer(strategy="most_frequent")


#### $\underline{First}$ column, NaN has been changed to the most frequent value $a$
#### $\underline{Second}$ column, NaN has been changed to the most frequent value $y$

In [30]:
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]


### 3.2 Multivariate feature imputation

> IterativeImputer models each feature with missing values as a function of other features, and uses that estimate for imputation. 

> It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. 

> A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. 

> This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

In [25]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]) # the second feature is double of the first

IterativeImputer(random_state=0)

In [26]:
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
# the model learns that the second feature is double the first
print(np.round(imp.transform(X_test)))

[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]
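
To give a feel for a single step of this procedure, the sketch below fits an ordinary linear regression on the complete training rows and uses it to estimate one feature from the other. This is a simplification for illustration, not the library's actual algorithm: `IterativeImputer` uses a `BayesianRidge` regressor by default and repeats such steps over several rounds. The names `reg_first` and `reg_second` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training rows where both features are known
complete = np.array([[1., 2.], [3., 6.], [4., 8.]])

# Step A: treat feature 2 as output y and feature 1 as input
reg_second = LinearRegression().fit(complete[:, [0]], complete[:, 1])
pred_second = reg_second.predict([[6.]])        # feature 2 when feature 1 = 6
print(pred_second)

# Step B: treat feature 1 as output y and feature 2 as input
reg_first = LinearRegression().fit(complete[:, [1]], complete[:, 0])
pred_first = reg_first.predict([[2.], [6.]])    # feature 1 when feature 2 = 2 or 6
print(pred_first)
```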


### 3.3 Nearest neighbors imputation

> The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach. 

> By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. 

> Each missing feature is imputed using values from n_neighbors nearest neighbors that have a value for the feature. The neighbors' values for that feature are averaged uniformly or weighted by distance to each neighbor. 

In [27]:
import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
X

[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

In [34]:
imputer = KNNImputer(n_neighbors=2, weights="uniform") 
# Use the mean feature value of the two nearest neighbors of each sample with missing values:
imputer.fit_transform(X)
# Given [1, 2, nan] (third column missing), we find its 2 nearest neighbors: [3, 4, 3] and [nan, 6, 5]
# (second and third examples), and average their third column: (3+5)/2=4
# Given [nan, 6, 5] (first column missing), we find its 2 nearest neighbors: [3, 4, 3] and [8, 8, 7]
# (second and fourth examples), and average their first column: (3+8)/2=5.5

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
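
The neighbor selection for the first sample can be traced by hand with `nan_euclidean_distances`, the metric named above. This is only a sketch of the idea for one missing value; in this small example every other row has a value for the third feature, so no filtering of candidate neighbors is needed. The names `D`, `order` and `imputed` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# Pairwise distances that skip missing coordinates and rescale the rest
D = nan_euclidean_distances(X)
print(np.round(D, 2))

# Row 0 misses feature 3; rank the other rows by their distance to row 0
order = np.argsort(D[0])[1:]          # drop row 0 itself (distance 0)
print(order)

# Average feature 3 over the two nearest rows
imputed = float(np.mean([X[i][2] for i in order[:2]]))
print(imputed)
```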

# 4. Generating polynomial features

### Often it might be useful to add complexity to the model by considering _nonlinear features_ of the input data. 

### A simple and common method is to use polynomial features, which add high-order and interaction terms of the original features. 

### It is implemented in PolynomialFeatures

In [28]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [29]:
#generate 3*2 matrix
X = np.arange(6).reshape(3, 2)
X                                                         

array([[0, 1],
       [2, 3],
       [4, 5]])

In [37]:
poly = PolynomialFeatures(2)
# degree : integer
# The degree of the polynomial features. Default = 2.

poly.fit_transform(X)    
# The features of X have been transformed from (X1, X2) to (1, X1, X2, X1*X1, X1*X2, X2*X2).
# The leading 1 is the constant term

array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
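
As a sanity check on the column order, the degree-2 features can be built by hand and compared with the transformer's output. This is a minimal sketch; `x1`, `x2` and `X_manual` are names introduced here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
x1, x2 = X[:, 0], X[:, 1]

# Build (1, X1, X2, X1*X1, X1*X2, X2*X2) column by column
X_manual = np.column_stack([np.ones(len(X)), x1, x2, x1 * x1, x1 * x2, x2 * x2])

print(np.allclose(X_manual, PolynomialFeatures(2).fit_transform(X)))
```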

### In some cases, only interaction terms among features are required; they can be obtained by setting interaction_only=True:

In [30]:
X = np.arange(9).reshape(3, 3)
X                                                  

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [31]:
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly.fit_transform(X)  
# The features of X have been transformed from (X1, X2, X3)
# to (1, X1, X2, X3, X1*X2, X1*X3, X2*X3, X1*X2*X3).

array([[  1.,   0.,   1.,   2.,   0.,   0.,   2.,   0.],
       [  1.,   3.,   4.,   5.,  12.,  15.,  20.,  60.],
       [  1.,   6.,   7.,   8.,  42.,  48.,  56., 336.]])
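
The interaction-only features are products of distinct input features, so they can be sketched with `itertools.combinations`. The loop below is an illustrative reconstruction under that assumption, not the library's internal code; `columns` and `X_manual` are hypothetical names.

```python
from itertools import combinations

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(9).reshape(3, 3)

columns = [np.ones(len(X))]                      # constant term 1
for size in range(1, 4):                         # products of 1, 2 or 3 features
    for idx in combinations(range(3), size):     # distinct features only
        columns.append(np.prod(X[:, list(idx)], axis=1))
X_manual = np.column_stack(columns)

poly = PolynomialFeatures(degree=3, interaction_only=True)
print(np.allclose(X_manual, poly.fit_transform(X)))
```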

# 5. Custom transformers

### Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. 
### You can implement a transformer from an arbitrary function with FunctionTransformer.
### For example, we can build a transformer that applies a log transformation.
### It is then easy to swap in a different function to obtain a different transformer.

In [32]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
# log1p computes log(1 + x)     # log with base e
# Return the natural logarithm of one plus the input array, element-wise.

In [33]:
X = np.array([[0, 1], [2, 3]])
X

array([[0, 1],
       [2, 3]])

In [34]:
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])

In [43]:
import math
print (math.log(1+1), math.log(2+1), math.log(3+1))

0.6931471805599453 1.0986122886681098 1.3862943611198906
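
`FunctionTransformer` also accepts an optional `inverse_func`. The sketch below pairs `np.log1p` with `np.expm1` (which computes exp(x) - 1) so the transformation can be undone; the names `X_log` and `X_back` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# expm1 computes exp(x) - 1, which undoes log1p
transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X = np.array([[0, 1], [2, 3]])
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)
print(X_back)   # recovers the original values up to rounding
```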


# 6. Encoding categorical features

## 6.1 OrdinalEncoder

* Often features are not given as continuous values but categorical.

* For example a person could have features: 

> feature 1: ["female", "male"], 

> feature 2: ["from Asia", "from Europe", "from US"], 

> feature 3: ["uses Chrome", "uses Firefox", "uses Internet Explorer", "uses Safari"].

* Such features can be efficiently coded as integers, for instance ["male", "from US", "uses Internet Explorer"] could be expressed as [1, 2, 2], while ["female", "from Asia", "uses Chrome"] would be [0, 0, 0].
* Indices start from 0

In [35]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder() # enc denotes the encoder variable
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'], 
     ['female', 'from Asia', 'uses Chrome'],
     ['male', 'from US', 'uses Internet Explorer']]
X
# X has 4 examples/records

[['male', 'from US', 'uses Safari'],
 ['female', 'from Europe', 'uses Firefox'],
 ['female', 'from Asia', 'uses Chrome'],
 ['male', 'from US', 'uses Internet Explorer']]

### Let us sort each feature alphabetically

### ["female", "male"], ["from Asia", "from Europe", "from US"], 
### ["uses Chrome", "uses Firefox", "uses Internet Explorer", "uses Safari"].

In [45]:
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari']])

array([[0., 2., 3.]])

In [46]:
enc.transform([['female', 'from Europe', 'uses Internet Explorer']])

array([[0., 1., 2.]])
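
The alphabetical ordering described above can be inspected on the fitted encoder through its `categories_` attribute, and `inverse_transform` maps integer codes back to labels. This is a short sketch on the same four records; `decoded` is an illustrative name.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'],
     ['female', 'from Asia', 'uses Chrome'],
     ['male', 'from US', 'uses Internet Explorer']]
enc = OrdinalEncoder().fit(X)

# One sorted array of categories per feature; the position is the integer code
for cats in enc.categories_:
    print(list(cats))

# Map integer codes back to the original labels
decoded = enc.inverse_transform([[0., 2., 3.]])
print(decoded)
```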

## 6.2 OneHotEncoder

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.

In [36]:
from sklearn.preprocessing import OneHotEncoder
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Firefox']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 1., 0.]])

We create a vocabulary _alphabetically_ for each feature, leading to a 6-dimensional vector

['female', 'male', 'from Europe', 'from US', 'uses Firefox', 'uses Safari']

It is possible to specify this explicitly using the parameter categories. There are two genders, four possible continents and four Web browsers in our dataset:

In [37]:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])

In [38]:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

|F1|F2|F3|F4|F5|F6|F7|F8|F9|F10|
|------|------|------|------|------|------|------|------|------|------|
|'female'| 'male' | 'from Africa'| 'from Asia'| 'from Europe'| 'from US'|'uses Chrome'| 'uses Firefox'| 'uses IE'|'uses Safari'|
|1|0|0|1|0|0|1|0|0|0|


Each category value becomes a binary feature: 1 if the value occurs, 0 otherwise. 
There are three 1s (F1, F4, F7) in the one-hot encoded vector because of 'female', 'from Asia' and 'uses Chrome' respectively.
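
The way the 10-dimensional vector is assembled can be sketched by hand: build one indicator vector per feature from its category list and concatenate them. The helper `one_hot_row` and the name `vocabularies` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
vocabularies = [genders, locations, browsers]

def one_hot_row(row):
    """Hypothetical helper: concatenate one indicator vector per feature."""
    parts = []
    for value, vocab in zip(row, vocabularies):
        indicator = np.zeros(len(vocab))
        indicator[vocab.index(value)] = 1.0
        parts.append(indicator)
    return np.concatenate(parts)

sample = ['female', 'from Asia', 'uses Chrome']
manual = one_hot_row(sample)
print(manual)

# Compare with the encoder configured with the same explicit categories
enc = OneHotEncoder(categories=vocabularies)
enc.fit([['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']])
print(np.allclose(manual, enc.transform([sample]).toarray()[0]))
```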

# 7. Discretization

* Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. 

* Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

* One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.

In [39]:
import numpy as np
from sklearn import preprocessing
X = np.array([[ -3., 5., 15 ],
               [  0., 6., 14 ],
               [  6., 3., 11 ]])
X

array([[-3.,  5., 15.],
       [ 0.,  6., 14.],
       [ 6.,  3., 11.]])

### We have three features
* Feature 1 has value -3, 0, 6
* Feature 2 has value 3, 5, 6
* Feature 3 has value 11, 14, 15

We will partition each feature into multiple bins, e.g. 
* Feature 1 into _3_ bins
* Feature 2 into _2_ bins
* Feature 3 into _2_ bins

In [40]:
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

In [41]:
est

KBinsDiscretizer(encode='ordinal', n_bins=[3, 2, 2])

#### By default the output is one-hot encoded into a sparse matrix and this can be configured with the encode parameter. For each feature, the bin edges are computed during fit and together with the number of bins, they will define the intervals. Therefore, for the current example, these intervals are defined as:

feature 1: $(-\infty, -1)$, $[-1, 2)$, $[2, \infty)$

feature 2: $(-\infty, 5)$, $[5, \infty)$

feature 3: $(-\infty, 14)$, $[14, \infty)$

In [42]:
est.transform(X)  

array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])

In [43]:
X

array([[-3.,  5., 15.],
       [ 0.,  6., 14.],
       [ 6.,  3., 11.]])

* First feature: -3=>0, 0=>1, 6=>2
* Second feature: 5=>1, 6=>1, 3=>0
* Third feature: 15=>1, 14=>1, 11=>0
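
The assignment of values to bins can be reproduced with `np.digitize`, using the interior bin edges from the intervals listed earlier (-1 and 2 for feature 1, 5 for feature 2, 14 for feature 3). This is a sketch of the idea rather than the estimator's internal code; `interior_edges` and `X_binned` are illustrative names.

```python
import numpy as np

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])

# Interior bin edges per feature, taken from the intervals in the text
interior_edges = [[-1., 2.], [5.], [14.]]

# np.digitize returns the index of the interval each value falls into;
# a value equal to an edge goes to the bin on its right
X_binned = np.column_stack([np.digitize(X[:, j], interior_edges[j])
                            for j in range(X.shape[1])])
print(X_binned)
```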

### Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting features which fall into particular bins, whereas discretization focuses on assigning feature values to these bins.

* KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. 

>  The ‘uniform’ strategy uses constant-width bins. 

> The ‘quantile’ strategy uses the quantile values to have equally populated bins in each feature.

> The ‘kmeans’ strategy defines bins based on a k-means clustering procedure performed on each feature independently.

### Strategy : {'uniform', 'quantile', 'kmeans'}, (default='quantile')
Strategy used to define the widths of the bins.

* uniform: All bins in each feature have identical widths.
* quantile: All bins in each feature have the same number of points.
* kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.
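
As a small illustration of the 'uniform' strategy, the sketch below fits a discretizer on a single feature with the values -3, 0 and 6 and inspects the fitted `bin_edges_`. With equal-width bins the edges should match `np.linspace` between the minimum and maximum, whereas the quantile-based edges for the same values were -1 and 2 in the example above. The names `x`, `uniform` and `uniform_edges` are introduced here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A single feature with the values -3, 0 and 6
x = np.array([[-3.], [0.], [6.]])

uniform = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(x)
uniform_edges = uniform.bin_edges_[0]

print(uniform_edges)            # equal-width bins between min and max
print(np.linspace(-3, 6, 4))    # the same edges computed by hand
```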

In [55]:
help(preprocessing.KBinsDiscretizer)

Help on class KBinsDiscretizer in module sklearn.preprocessing._discretization:

class KBinsDiscretizer(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)
 |  
 |  Bin continuous data into intervals.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_discretization>`.
 |  
 |  .. versionadded:: 0.20
 |  
 |  Parameters
 |  ----------
 |  n_bins : int or array-like of shape (n_features,), default=5
 |      The number of bins to produce. Raises ValueError if ``n_bins < 2``.
 |  
 |  encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'
 |      Method used to encode the transformed result.
 |  
 |      onehot
 |          Encode the transformed result with one-hot encoding
 |          and return a sparse matrix. Ignored features are always
 |          stacked to the right.
 |      onehot-dense
 |          Encode the transformed result with one-hot encoding
 |          and return a 