<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/demos/02_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Feature Engineering
Conceptually, feature engineering is the process of transforming raw data into features that better represent the underlying business logic, thereby improving the performance of machine learning models.

<img src = '../images/fe_pipeline.png'>

As shown in the figure above, it is an important step in data mining. In fact, it is also the most time-consuming one: see <a href="https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/">Data Scientists Spend Most of Their Time Cleaning Data</a>.

<img src = '../images/pie.png'>

In this demo we show different transformations and how you can use them in the DataFactory.

# How to Use in the DataFactory

## Import Package

In [1]:
import sys
if 'google.colab' in sys.modules:
    ! git clone https://github.com/sdsc-bw/DataFactory.git # clone repository for colab
    ! ls

In [2]:
import warnings # ignore irrelevant warnings
warnings.filterwarnings('ignore')

In [3]:
import matplotlib.pyplot as plt # library used for visualization
import numpy as np # library for efficient list calculations
import pandas as pd # library for creating tables
import seaborn as sns # library for plotting statistical data visualization
from abc import ABCMeta, abstractmethod # library to create abstract methods
from sklearn.preprocessing import LabelEncoder # method to encode features
from sklearn.model_selection import train_test_split # method to split data into training and test data and separate data and targets

## add path to import datafactory 
if 'google.colab' in sys.modules: 
    root = 'DataFactory/'
else:
    root = '../'
sys.path.append(root)

from datafactory.ts.feature_engineering.transforms_binary import * # binary transformations
from datafactory.ts.feature_engineering.transforming import * # transforming methods
from datafactory.ts.plotting.dataset_plotting import plot_density_for_each_column_in_df # density visualization
from datafactory.ts.preprocessing.loading import evaluate # method to evaluate the dataset
from datafactory.ts.preprocessing.cleaning import clean_data # method to clean data


## Load test dataset: diabetes

<a href = 'https://www.openml.org/d/37'>Diabetes</a> is an open dataset hosted on OpenML. It records the physical condition of a total of 768 residents living in Phoenix, Arizona, USA. The information includes:
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Has diabetes (the prediction target)


In [None]:
df = pd.read_csv(root + 'data/dataset_37_diabetes.csv')

In [None]:
df.head()

We can see that the prediction target is a string column, which most models cannot process, so we first convert it to a numeric type with `LabelEncoder`.

In [None]:
le = LabelEncoder()
df['class'] = le.fit_transform(df['class'])
df.head()
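
To make the encoding step concrete, here is a minimal sketch on a hand-made series. The two label strings are made up for illustration and are not read from the dataset. `LabelEncoder` assigns integer codes following the sorted order of the class labels, and `inverse_transform` recovers the original strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column; the label strings are illustrative assumptions.
target = pd.Series(['positive', 'negative', 'negative', 'positive'])

le = LabelEncoder()
encoded = le.fit_transform(target)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}

# The encoding is reversible, which is useful when reporting predictions.
decoded = le.inverse_transform(encoded)

print(mapping)
print(list(encoded))
```

Keeping the fitted encoder around lets you translate model predictions back into readable class names later.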

# Feature engineering
Depending on whether the transformation requires the target value ('class' in the diabetes example), feature engineering transformations can be divided into supervised and unsupervised transformations. Depending on the number of input attributes per transformation, unsupervised transformations can be further classified as unary, binary and multivariate.
<img src = '../images/transform.png'>

## Unsupervised Transformation
### Unary Feature transformation

<img width="350" height="360" src = '../images/unary.png'>

In [None]:
class UnaryOpt(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, value: pd.Series) -> pd.Series:
        pass

All unary transforms are wrapped as subclasses of 'UnaryOpt', which has a single fixed method, 'fit'. The following is a brief overview of the unary transformations:
* abs: |x|
* add: x + e
* negative: -x
* log: log(x)
* exp: e^x
* reciprocal: 1/x
* square: x*x
* sqrt: sqrt(x)
* cos: cos(x)
* sin: sin(x)
* degree: Convert angles from radians to degrees
* radians: Convert angles from degrees to radians
* sigmoid: 1 / (1 + exp(-x))
* tanh: sinh(x)/cosh(x)
* relu: x * (x > 0)
* binning: clustering for a single feature
* ktermfreq: value counts
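
As an illustration of the pattern, the sketch below implements two of the listed operations as subclasses of the abstract base class shown above. The class names `Sigmoid` and `Relu` are hypothetical and chosen for this example; they are not taken from the DataFactory source.

```python
from abc import ABCMeta, abstractmethod

import numpy as np
import pandas as pd


class UnaryOpt(metaclass=ABCMeta):
    """Same abstract interface as in the cell above."""

    @abstractmethod
    def fit(self, value: pd.Series) -> pd.Series:
        pass


class Sigmoid(UnaryOpt):  # hypothetical name, for illustration only
    def fit(self, value: pd.Series) -> pd.Series:
        # sigmoid: 1 / (1 + exp(-x))
        return 1.0 / (1.0 + np.exp(-value))


class Relu(UnaryOpt):  # hypothetical name, for illustration only
    def fit(self, value: pd.Series) -> pd.Series:
        # relu: x * (x > 0), the boolean mask zeroes out negative entries
        return value * (value > 0)


x = pd.Series([-2.0, 0.0, 3.0], name='x')

# Collect the results of several operators into one DataFrame of new features.
features = pd.DataFrame({
    'sigmoid(x)': Sigmoid().fit(x),
    'relu(x)': Relu().fit(x),
})
print(features)
```

Because every operator exposes the same `fit` method, a helper can loop over a dictionary of operators and build one output column per entry.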

To simplify the programming, SDSC researchers developed the DataFactory (its specifics are described in other notebooks), which can apply all unary transformations with a single function.
- apply_unary_transforms_to_series(value: pd.Series) -> pd.DataFrame
    - It takes a series (feature) as input and outputs a dataframe containing the transformed features

In [None]:
tmp_df = apply_unary_transforms_to_series(df['preg'])
tmp_df

One noteworthy thing is that feature transformation can lead to the generation of implausible values, such as the fifth value of log(preg) in the table above. Since the original value is 0, the corresponding log value is negative infinity.
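
The effect can be reproduced with plain NumPy and pandas. The sketch below uses a small toy series (the values are an assumption for illustration) and shows two common ways to deal with the problem: marking the infinite value as missing, or using `log1p`, which computes log(1 + x) and is finite at zero. Neither is claimed to be what the DataFactory does internally.

```python
import numpy as np
import pandas as pd

# Toy count feature whose fifth value is 0, chosen for illustration.
counts = pd.Series([6, 1, 8, 1, 0], name='preg')

# log(0) is -inf; errstate only silences NumPy's divide-by-zero warning.
with np.errstate(divide='ignore'):
    logged = np.log(counts)

n_inf = int(np.isinf(logged).sum())

# Option 1: treat infinite values as missing so they can be cleaned later.
cleaned = logged.replace([np.inf, -np.inf], np.nan)

# Option 2: log1p(x) = log(1 + x) stays finite for x = 0.
shifted = np.log1p(counts)

print(logged.tolist())
print(cleaned.isna().sum(), shifted.tolist())
```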

The next few plots show the density distribution of the attribute before and after the transformation. The distributions differ considerably, and changing the distribution is the main reason for applying these transformations.

In [None]:
plt.figure(figsize=(20,6))
sns.kdeplot(df['preg'])

In [None]:
plot_density_for_each_column_in_df(tmp_df)

### Binary feature transformation

<img width="350" height="360" src = '../images/binary.png'>

In [None]:
class BinaryOpt(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, value1: pd.Series, value2: pd.Series) -> pd.Series:
        pass

All binary transforms are wrapped as subclasses of 'BinaryOpt', which has a single fixed method, 'fit'. It takes two series as input and outputs the transformed series. Only four binary transformations are defined in the DataFactory.

In [None]:
operators = {'div': Div(), 'minus': Minus(), 'add': Add(), 'product': Product()}

In [None]:
tmp_df = apply_binary_transforms_to_series(df['preg'], df['plas'])
tmp_df
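
For illustration, a binary operator following the interface above could look like the sketch below. `SafeDiv` and `ToyMinus` are hypothetical names invented for this example; they are not the library's `Div` and `Minus`, whose handling of zero denominators is not shown in this notebook.

```python
from abc import ABCMeta, abstractmethod

import numpy as np
import pandas as pd


class BinaryOpt(metaclass=ABCMeta):
    """Same abstract interface as in the cell above."""

    @abstractmethod
    def fit(self, value1: pd.Series, value2: pd.Series) -> pd.Series:
        pass


class SafeDiv(BinaryOpt):  # hypothetical, not the library's Div
    def fit(self, value1: pd.Series, value2: pd.Series) -> pd.Series:
        # Float division by zero gives inf (or NaN for 0/0) in pandas;
        # map infinities to NaN so they can be cleaned in one place.
        return (value1 / value2).replace([np.inf, -np.inf], np.nan)


class ToyMinus(BinaryOpt):  # hypothetical, not the library's Minus
    def fit(self, value1: pd.Series, value2: pd.Series) -> pd.Series:
        return value1 - value2


a = pd.Series([6.0, 1.0, 0.0])
b = pd.Series([2.0, 0.0, 0.0])

ratio = SafeDiv().fit(a, b)
difference = ToyMinus().fit(a, b)
print(ratio.tolist(), difference.tolist())
```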

### Multiple feature transformation

<img width="650" height="660" src = '../images/multiple.png'>

All multivariate transforms are wrapped as subclasses of 'MultiOpt', which also has a single fixed method, 'fit'. A multivariate transformation takes several features as input (generally the whole dataset) and outputs one or more new features as a DataFrame. The transformation functions can be broadly classified into clustering, normalization, dimensionality reduction and time series feature extraction. At present, not every possible transformation method is implemented. For example, in the clustering category only k-means is available, while other common clustering methods are not.

- Clustering: 
    - cluster the dataset with k-means and use the result as a new feature.
    - Because most datasets contain dozens of features, and k-means does not perform well on high-dimensional data, SDSC staff incorporated a sliding window mechanism into the clustering, so that values are clustered only over features in the same window.
- Normalization:
    - minmax 
    - zscore
- Dimension_reduction:
    - isomap
- Time_series feature extraction: each item in the dataset is treated as a time series
    - Diff: diff between the columns
    - WinAgg
        - apply sliding window to it and aggregate
        - aggregation functions include: max, the .25, .50 and .75 quantiles, std
- Other:
    - LeakyInfo
    - KernelApproxRBF: use rbf to approximate kernel
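
Some of the simpler entries in this list can be sketched with plain pandas. The example below is an interpretation under stated assumptions, not the DataFactory implementation: it assumes column-wise min-max and z-score normalization, reads "Diff" as the difference between neighbouring columns within each row, and uses a window of two columns for the aggregation. The toy data is made up.

```python
import pandas as pd

# Toy dataset, values chosen for illustration.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
    'c': [5.0, 5.0, 6.0, 8.0],
})

# Min-max normalization: rescale every column to the range [0, 1].
minmax = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean and unit (population) standard deviation.
zscore = (df - df.mean()) / df.std(ddof=0)

# Diff: treat each row as a short series and subtract the previous column.
# The first column has no predecessor, so it is dropped.
col_diff = df.diff(axis=1).iloc[:, 1:]

# Window aggregation: slide a window of 2 columns along each row and take
# the maximum. Rolling works along rows, hence the transposes.
win_max = df.T.rolling(window=2).max().T.iloc[:, 1:]

print(minmax.round(2))
print(col_diff)
print(win_max)
```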


In [None]:
class MultiOpt(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, df: pd.DataFrame):
        pass

In [None]:
tmp_df = apply_multiple_transforms_to_dataframe(df.iloc[:, :-1])
tmp_df

## Supervised Transformation

This category is more complex and it mainly uses some existing models, such as decision trees, k-nearest neighbors, etc., to assist in generating new features. Depending on the type of dataset, it can be broadly classified into two main types: classification and regression.

Another special property of this category is that, since the target value of the test set is unknown, new features cannot be generated from the test set alone. The features of the test set are instead produced by the model that was fitted on the training set.

Taking DecisionTreeClassifier as an example, the following three pieces of information can be extracted:
- regard each leaf node of the tree as a cluster and extract the cluster assignment
- use the prediction of the model
- compute the distance between the prediction and the ground truth
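
The idea can be sketched directly with scikit-learn. This is a minimal illustration on synthetic data, not the code behind `apply_supervised_transforms_to_dataframe`; the helper `tree_features` and its column names are hypothetical. The tree is fitted on the training split only and then reused to derive features for the test split.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training split only; the test targets are never used.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)


def tree_features(model, X, y=None):
    """Hypothetical helper: derive new features from a fitted tree."""
    feats = pd.DataFrame({
        'leaf_id': model.apply(X),        # index of the leaf each row falls into
        'prediction': model.predict(X),   # predicted class label
    })
    if y is not None:
        # Distance to the ground truth, only available where targets are known.
        feats['distance'] = np.abs(feats['prediction'] - np.asarray(y))
    return feats


train_feats = tree_features(tree, X_train, y_train)
test_feats = tree_features(tree, X_test)
print(train_feats.head())
print(test_feats.head())
```

Note that the distance column exists only for the training split in this sketch, which mirrors the constraint described above.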

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1])
tmp_df, tmp_df2 = apply_supervised_transforms_to_dataframe(X_train, X_test, y_train, y_test, 'C')
tmp_df

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1])
tmp_df, tmp_df2 = apply_supervised_transforms_to_dataframe(X_train, X_test, y_train, y_test, 'R')
tmp_df

# Only Apply Certain Transformations

To apply only certain transformations, pass the dataframe together with a list of transformations. Each entry names the transformation and the columns it should be applied to:

In [None]:
trfms = [('ln', 'age'), ('cos', 'age'), ('exp', 'skin'), ('add', 'pres', 'age'), ('minmaxnorm', 'insu', 'mass', 'age'), 'dfCla']
tmp_df = apply_transforms(df, trfms)
tmp_df

# Evaluate
In this section we will briefly test the effect of the feature engineering by comparing the accuracy before and after using the transformation.

SDSC engineers have integrated an evaluate method into the DataFactory. This method takes the data and the prediction target as input and outputs the corresponding cross-validation results: the weighted F1 score for classification tasks and 1 - RAE (relative absolute error) for regression tasks.
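
To clarify the two metrics, the sketch below computes a cross-validated weighted F1 score with scikit-learn and defines 1 - RAE by hand. The model choice, the number of folds and the helper `one_minus_rae` are assumptions made for this example; the internals of `evaluate` are not shown in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Weighted F1 averaged over 5 cross-validation folds (model is an assumption).
model = DecisionTreeClassifier(max_depth=3, random_state=0)
f1_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
mean_f1 = float(f1_scores.mean())


def one_minus_rae(y_true, y_pred):
    """Hypothetical helper: 1 - relative absolute error.

    RAE divides the total absolute error of the predictions by the total
    absolute error of always predicting the mean of y_true.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rae = np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()
    return 1.0 - rae


print(round(mean_f1, 3))
print(one_minus_rae([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(one_minus_rae([1, 2, 3], [2, 2, 2]))  # no better than the mean
```

With this definition a score of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values mean it is worse.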

It is worth noting that we only apply the multivariate transformation once in this test. In addition, the transformation may generate NaN or inf values, which we handle with the clean_data method of the DataFactory.

In [None]:
evaluate(df.iloc[:, :-1], df.iloc[:, -1])

In [None]:
tmp_multi = apply_multiple_transforms_to_dataframe((df.iloc[:, :-1]))

In [None]:
df = pd.concat([tmp_multi, df], axis = 1)

In [None]:
df = clean_data(df)

In [None]:
evaluate(df.iloc[:, :-1], df.iloc[:, -1])

Comparing the two results, we find a significant improvement after the transformation.