# Feature Engineering

## Definition
Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. 

## Why?
Engineering features has two main objectives:
* Preparing the proper input dataset, compatible with the machine learning algorithm requirements
* Improving the machine learning model's performance

According to a survey reported in Forbes, data scientists spend roughly 80% of their time on data preparation:
![Forbes Data](forbes.jpg)
Source: [https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1594bda36f63](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1594bda36f63)

## Techniques
There are many feature engineering techniques. This notebook covers a few of the key ones, listed below. Some work better with particular algorithms or datasets, while others are beneficial in nearly all cases. The best way to build expertise in feature engineering is to practice different techniques on various datasets and observe their effect on model performance.
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Feature Split
8. Scaling
9. Extracting Date

In [99]:
# Import Numpy and Pandas
import numpy as np 
import pandas as pd 

In [100]:
train = pd.read_csv('train.csv')

In [101]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [102]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [103]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


## Imputation

Missing values are one of the most common problems you encounter when preparing data for machine learning. They can be caused by human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models. The simplest solution is to drop the affected rows or the entire column. There is no optimal threshold for dropping, but as an example you can use 70% and drop the rows and columns whose share of missing values exceeds it.
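
As a rough illustration of the threshold idea above, the sketch below drops columns and then rows whose share of missing values reaches 70%. The toy DataFrame and the names `threshold`, `kept_cols` and `cleaned` are assumptions made for this example; they are not part of the Titanic notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): column "c" is 80% missing, the last row is fully missing
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [10.0, np.nan, 30.0, 40.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

threshold = 0.7  # example value from the text, not a recommended default

# Keep only the columns whose fraction of missing values is below the threshold
col_missing = df.isnull().mean()
kept_cols = df.loc[:, col_missing < threshold]

# Then keep only the rows whose fraction of missing values is below the threshold
row_missing = kept_cols.isnull().mean(axis=1)
cleaned = kept_cols.loc[row_missing < threshold]

print(cleaned)
```

Columns are filtered first here so that a mostly empty column does not inflate the per-row missing ratio; the opposite order is just as valid and may give a different result.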

Imputation is usually preferable to dropping because it preserves the size of the dataset. What you impute matters, however. Unless the column has a natural default value for missing entries, a common choice for numerical columns is the column median: averages are sensitive to outliers, while medians are more robust in this respect.

In the dataset above, the train.info() output shows missing values in the Age, Cabin and Embarked features. Let's perform imputation on the Age feature.

In [104]:
train[train.Age.isnull()].Age

5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
       ..
859   NaN
863   NaN
868   NaN
878   NaN
888   NaN
Name: Age, Length: 177, dtype: float64

In [105]:
train.Age.dropna().median()

28.0

In [106]:
imputed_train = train.copy()
imputed_train.Age = train.Age.fillna(train.Age.dropna().median())

In [107]:
imputed_train[imputed_train.Age.isnull()].Age

Series([], Name: Age, dtype: float64)

## Categorical Imputation
Replacing missing values with the most frequent value in a column is a good option for categorical columns. But if the values are distributed roughly uniformly and there is no dominant value, imputing a category like “Other” might be more sensible, because in that case imputing the most frequent value is little better than a random selection.

In [108]:
train[train.Embarked.isnull()].Embarked

61     NaN
829    NaN
Name: Embarked, dtype: object

In [109]:
train.Embarked.dropna().mode()

0    S
dtype: object

In [110]:
imputed_train['Embarked'] = train['Embarked'].fillna(train.Embarked.dropna().mode()[0])

In [111]:
imputed_train[imputed_train.Embarked.isnull()].Embarked

Series([], Name: Embarked, dtype: object)

## Handling Outliers

The best way to detect outliers is to explore the data visually. Statistical methods can make mistakes, whereas visualizing the data lets you decide with more confidence. That said, there are statistical methods available for outlier detection:
* If a value's distance from the mean is greater than x * standard deviation, it can be treated as an outlier. What should x be? There is no universal answer, but a value between 2 and 4 is usually practical.
* Another method is to use percentiles: values above or below chosen percentiles (for example, the top and bottom 5%) are treated as outliers.

So how do we handle outliers?
1. Drop the rows that contain outliers
2. Cap the outliers - this can affect the distribution, so use this approach sparingly
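
The two detection rules and the two handling options above can be sketched as follows. The sample values, the factor `x = 2` and the 5%/95% percentile limits are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (500)
values = pd.Series([12, 15, 14, 13, 16, 15, 14, 13, 15, 500], dtype=float)

# 1) Standard deviation rule: flag values farther than x * std from the mean
x = 2  # assumed factor; the text suggests something between 2 and 4
mean, std = values.mean(), values.std()
std_mask = (values - mean).abs() > x * std

# 2) Percentile rule: flag values outside the chosen percentiles
lower, upper = values.quantile(0.05), values.quantile(0.95)
pct_mask = (values < lower) | (values > upper)

# Handling option 1: drop the flagged rows
dropped = values[~std_mask]

# Handling option 2: cap values at the percentile limits instead of dropping them
capped = values.clip(lower=lower, upper=upper)

print(dropped.tolist())
print(capped.tolist())
```

Note that in a very small sample a single extreme value inflates the standard deviation itself, so a large factor such as 3 may flag nothing; this is one reason to also inspect the data visually.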

## Binning

The main motivation for binning is to make the model more robust and prevent overfitting. However, it can cost some model performance, because binning discards information.
Binning can be performed on both numerical and categorical data.
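
The cells below bin a numerical column with pd.cut. For categorical data, a minimal sketch is to map detailed categories to coarser groups. The country values and the `continent_of` mapping are hypothetical and only serve as an example.

```python
import pandas as pd

# Hypothetical country column
countries = pd.Series(["Spain", "Chile", "Australia", "Italy", "Brazil", "Japan"])

# Illustrative mapping from detailed categories to coarser bins
continent_of = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

# Categories without an explicit mapping fall into a catch-all bin
binned = countries.map(continent_of).fillna("Other")

print(binned.tolist())
```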

In [112]:
imputed_train.Age.value_counts()

28.00    202
24.00     30
22.00     27
18.00     26
19.00     25
        ... 
55.50      1
74.00      1
0.92       1
70.50      1
12.00      1
Name: Age, Length: 88, dtype: int64

In [113]:
imputed_train['AgeGroup'] = pd.cut(imputed_train['Age'], bins=[0,17,59,100], labels=['Kids','Adults', 'Seniors'])
imputed_train[['Age','AgeGroup']].sample(10)

Unnamed: 0,Age,AgeGroup
605,36.0,Adults
786,18.0,Adults
253,30.0,Adults
248,37.0,Adults
181,28.0,Adults
273,37.0,Adults
232,59.0,Adults
125,12.0,Kids
23,28.0,Adults
434,50.0,Adults


In [114]:
imputed_train[imputed_train['Age']==60]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C,Seniors
587,588,1,1,"Frolicher-Stehli, Mr. Maxmillian",male,60.0,1,1,13567,79.2,B41,C,Seniors
684,685,0,2,"Brown, Mr. Thomas William Solomon",male,60.0,1,1,29750,39.0,,S,Seniors
694,695,0,1,"Weir, Col. John",male,60.0,0,0,113800,26.55,,S,Seniors


## Log Transform
Logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. 
### Benefits
* It helps to handle skewed data; after the transformation, the distribution is closer to normal.
* It decreases the effect of outliers by compressing differences in magnitude, which makes the model more robust.

*Note: The logarithm is only defined for positive numbers, so the feature values must be strictly positive (no zeros or negatives) to use this transformation.*
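
When a feature contains zeros or negative values, two common workarounds are sketched below: `np.log1p`, which computes log(1 + x), and shifting the column so its minimum becomes 1 before taking the log. The sample values are made up, and whether a shift is appropriate depends on the feature.

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative feature that contains a zero
values = pd.Series([0, 1, 9, 99])

# log1p computes log(1 + x), so zero maps to zero instead of -inf
log1p_values = np.log1p(values)

# Hypothetical feature with negative values: shift it so the minimum
# becomes 1, then apply the ordinary log
with_negatives = pd.Series([-5, 0, 5, 95])
shifted_log = np.log(with_negatives - with_negatives.min() + 1)

print(log1p_values.tolist())
print(shifted_log.tolist())
```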

In [115]:
data = pd.DataFrame([100, 1, 9, 50, 23, 25])
data

Unnamed: 0,0
0,100
1,1
2,9
3,50
4,23
5,25


In [116]:
data.transform(np.log)

Unnamed: 0,0
0,4.60517
1,0.0
2,2.197225
3,3.912023
4,3.135494
5,3.218876


## One-hot Encoding
One-hot encoding is one of the most common encoding methods in machine learning. It spreads the values of a column across multiple flag columns and assigns 0 or 1 to each. Each flag column indicates whether the row takes the corresponding value.

If the column has N distinct values, N-1 binary columns are enough, because the remaining value can be deduced from the others.
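
The N-1 idea can be tried with the `drop_first` option of `pd.get_dummies`. This is a small sketch on a hypothetical three-value column, not the notebook's `Sex` column; `dtype=int` is passed only so the flags print as 0/1.

```python
import pandas as pd

# Hypothetical column with N = 3 distinct values
ports = pd.Series(["S", "C", "Q", "S", "C"], name="Embarked")

# N flag columns: C, Q, S
full = pd.get_dummies(ports, dtype=int)

# N-1 flag columns: the first category in sorted order ("C") is dropped
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

# The dropped category is implied wherever all remaining flags are 0
implied_c = reduced.sum(axis=1) == 0

print(reduced)
print(implied_c.tolist())
```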

In [117]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [118]:
encoded_columns = pd.get_dummies(train['Sex'])

In [119]:
encoded_columns

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
886,0,1
887,1,0
888,1,0
889,0,1


## Grouping
In most machine learning algorithms, every instance is represented by a row in the training dataset, where every column shows a different feature of the instance. There can be scenarios where several rows together define a single training sample. In these scenarios we need to group the rows to form a single row.

The key point of group by operations is to decide the aggregation functions of the features. For numerical features, average and sum functions are usually convenient options, whereas for categorical features it is more complicated.

### Categorical Grouping
* Aggregate by frequency, select the highest frequency as label (data.groupby('id').agg(lambda x: x.value_counts().index[0]))
* Pivot Table (data.pivot_table(index='column_to_group', columns='column_to_encode', values='aggregation_column', aggfunc=np.sum, fill_value = 0))
* Apply a group by function after applying one-hot encoding
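
The three options above can be sketched on a small frame. The visit log below, including the column names `user`, `city` and `visit_days`, is hypothetical and chosen only to make the snippets runnable.

```python
import pandas as pd

# Hypothetical visit log: several rows per user
data = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "city": ["Rome", "Rome", "Madrid", "Paris", "Paris", "Madrid"],
    "visit_days": [2, 3, 1, 4, 1, 5],
})

# 1) Most frequent category per group
top_city = data.groupby("user")["city"].agg(lambda x: x.value_counts().index[0])

# 2) Pivot table: one column per city, summing visit_days per user
pivot = data.pivot_table(index="user", columns="city", values="visit_days",
                         aggfunc="sum", fill_value=0)

# 3) One-hot encode first, then sum the flags per user
one_hot = pd.get_dummies(data["city"], dtype=int)
city_counts = one_hot.groupby(data["user"]).sum()

print(top_city)
print(pivot)
print(city_counts)
```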

### Numerical Grouping
Numerical columns are grouped using sum and mean functions in most of the cases. 

In [120]:
# Group passengers by port of embarkation and aggregate Fare with sum and mean
grouped = imputed_train.groupby('Embarked')

# rename() names each result column; add_suffix() on a Series would change the
# index labels (C -> C_sum) and the two results would no longer align in concat
sums = grouped['Fare'].sum().rename('Fare_sum')
avgs = grouped['Fare'].mean().rename('Fare_avg')

new_df = pd.concat([sums, avgs], axis=1)
new_df

Embarked,Fare_sum,Fare_avg
C,10072.2962,59.954144
Q,1022.2543,13.27603
S,17599.3988,27.243651


## Feature Split
Splitting features is a good way to make them useful for machine learning. Datasets often contain string columns that need further processing. By extracting the usable parts of a column into new features, we:
* Enable machine learning algorithms to make use of them.
* Make it possible to bin and group them.
* Improve model performance by uncovering potential information.

In [121]:
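# Extract the title: the word that follows a space and ends with a period
# (the raw string keeps the backslash in the regex intact)
imputed_train['Title'] = train.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
# Display the extracted titles; call imputed_train['Title'].value_counts() to count each title
imputed_train['Title']

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Title, Length: 891, dtype: object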

## Scaling

In most cases, the numerical features of a dataset do not share a common range and differ from each other in scale.
Scaling solves this problem: after scaling, the continuous features have comparable ranges. Scaling is not mandatory for many algorithms, but it can still be useful. However, algorithms based on distance calculations, such as k-NN or k-Means, need scaled continuous features as input.

* Normalization (or min-max normalization) scales all values into a fixed range between 0 and 1. This transformation does not change the shape of the feature's distribution, but because the minimum and maximum define the range, a single extreme value squeezes the remaining values into a narrow interval, so the effect of outliers increases. It is therefore recommended to handle outliers before normalization.
* Standardization (or z-score normalization) scales the values while taking the standard deviation into account: each value is expressed as its distance from the mean in units of standard deviation. The resulting ranges are not fixed and can differ between features. Compared with min-max normalization, this reduces the effect of outliers.

In [122]:
data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})

data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())

data

Unnamed: 0,value,normalized
0,2,0.231481
1,45,0.62963
2,-23,0.0
3,85,1.0
4,28,0.472222
5,2,0.231481
6,35,0.537037
7,-12,0.101852


In [123]:
data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()
data

Unnamed: 0,value,normalized,standardized
0,2,0.231481,-0.518878
1,45,0.62963,0.703684
2,-23,0.0,-1.22967
3,85,1.0,1.840952
4,28,0.472222,0.220346
5,2,0.231481,-0.518878
6,35,0.537037,0.419367
7,-12,0.101852,-0.916922
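
The remark about outliers in the normalization bullet above can be checked with a small experiment. The sketch below uses made-up values and two hypothetical helper functions, `min_max` and `z_score`, that repeat the formulas from the cells above; note that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

def min_max(s):
    """Min-max normalization into the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

def z_score(s):
    """Standardization with the sample standard deviation (pandas default, ddof=1)."""
    return (s - s.mean()) / s.std()

clean = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
with_outlier = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

# How much of the [0, 1] range the five regular points occupy after min-max scaling
scaled_clean = min_max(clean)
scaled_outlier = min_max(with_outlier).iloc[:5]
spread_clean = scaled_clean.max() - scaled_clean.min()
spread_outlier = scaled_outlier.max() - scaled_outlier.min()

# Standardized values always have mean 0 and standard deviation 1
standardized = z_score(with_outlier)

print(spread_clean, spread_outlier)
print(standardized.round(3).tolist())
```

With the outlier present, the five regular points are squeezed into about 4% of the range, which is why handling outliers before min-max normalization is usually worthwhile.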


## Extracting Date

- Extracting the parts of the date into different columns: Year, month, day, etc.
- Extracting the time period between the current date and columns in terms of years, months, days, etc.
- Extracting some specific features from the date: Name of the weekday, Weekend or not, holiday or not, etc.

In [124]:
from datetime import date

data = pd.DataFrame({'date':
['01-01-2017',
'04-12-2008',
'23-06-1988',
'25-08-1999',
'20-02-1993',
]})

#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

data

Unnamed: 0,date,year,month,passed_years,passed_months,day_name
0,2017-01-01,2017,1,4,56,Sunday
1,2008-12-04,2008,12,13,153,Thursday
2,1988-06-23,1988,6,33,399,Thursday
3,1999-08-25,1999,8,22,265,Wednesday
4,1993-02-20,1993,2,28,343,Saturday
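
The third bullet above (weekend or not) is not covered by the cell, so here is a minimal sketch. It also computes elapsed days against a fixed reference date, which is an assumption made here so the result is reproducible; the notebook itself uses today's date.

```python
import pandas as pd

data = pd.DataFrame({"date": ["01-01-2017", "04-12-2008", "23-06-1988"]})
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y")

# dayofweek counts Monday as 0 and Sunday as 6, so 5 and 6 are the weekend
data["is_weekend"] = data["date"].dt.dayofweek >= 5

# Elapsed days relative to a fixed, assumed reference date
reference = pd.Timestamp("2021-01-01")
data["passed_days"] = (reference - data["date"]).dt.days

print(data)
```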


## Upsampling & Downsampling

When a dataset is imbalanced, balancing the class representation can help reduce Type II errors (false negatives).
To balance the classes we can:
* Decrease the frequency of the majority class (Downsampling)
* Increase the frequency of the minority class (Upsampling)

Resampling is applied only to the training set; the validation and test data are left unchanged. The imbalanced-learn (imblearn) library in Python is handy for resampling.

### Upsampling
Upsampling is a procedure where additional data points for the minority class are added to the dataset, either synthetically generated or duplicated. There are two techniques:
* SMOTE (Synthetic Minority Oversampling Technique) - It is based on the k-nearest neighbours algorithm and synthetically generates data points close to existing minority-class points. The input records must not contain null values. (from imblearn.over_sampling import SMOTE, or SMOTENC when some features are categorical)
* Data duplication - Existing data points of the minority class are randomly selected and duplicated. (from sklearn.utils import resample)
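
The duplication approach can be sketched with pandas alone. Note that this swaps in `DataFrame.sample` with replacement instead of `sklearn.utils.resample`, purely to keep the example dependency-free; the toy training frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced training set: 6 rows of class 0, 2 rows of class 1
train_df = pd.DataFrame({
    "feature": [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 3.0, 3.2],
    "label":   [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train_df[train_df["label"] == 0]
minority = train_df[train_df["label"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

print(balanced["label"].value_counts())
```

Because the new rows are exact copies, this adds no new information and can encourage overfitting to the duplicated points; it should be applied to the training split only.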

### Downsampling
Downsampling reduces the number of training samples in the majority class, which helps even out the counts of the target categories. Because it removes collected data, it can discard valuable information.
There are two techniques:
* Tomek links (T-Links) - A Tomek link is a pair of data points from different classes that are each other's nearest neighbours. The objective is to drop the majority-class sample of each pair, reducing the count of the dominant label. This also widens the border between the two classes, which can improve classification performance. (from imblearn.under_sampling import TomekLinks)
* Cluster centroids - The algorithm finds homogeneous clusters in the majority class and retains only their centroids, using the same logic as KMeans clustering. This removes a large share of the majority samples, but a lot of useful information is lost. (from imblearn.under_sampling import ClusterCentroids)
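
To make the Tomek link definition concrete, here is a brute-force NumPy sketch. It is not how imblearn implements `TomekLinks`; the function name `tomek_link_majority_indices` and the one-dimensional toy data are hypothetical, and the pairwise distance matrix makes it suitable for small datasets only.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class samples that belong to a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a point must not be its own neighbour
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_drop = []
    for i, j in enumerate(nearest):
        # A Tomek link: i and j are mutual nearest neighbours with different labels
        is_mutual = nearest[j] == i
        if is_mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return np.array(to_drop, dtype=int)

# Hypothetical 1-D data: class 0 is the majority, and the points at 2.9 and 3.0
# sit on the class border
X = np.array([[0.0], [1.0], [2.5], [2.9], [3.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

drop_idx = tomek_link_majority_indices(X, y, majority_label=0)
X_resampled = np.delete(X, drop_idx, axis=0)
y_resampled = np.delete(y, drop_idx)

print(drop_idx, y_resampled)
```

Only the majority sample on the class border is removed, which illustrates why Tomek links clean the boundary rather than fully balance the classes.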

In [126]:
# Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database?select=diabetes.csv
data = pd.read_csv("diabetes.csv")
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [129]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [137]:
from imblearn.over_sampling import SMOTENC

In [128]:
data.groupby(["Outcome"]).count()

Unnamed: 0_level_0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,500,500,500,500,500,500,500,500
1,268,268,268,268,268,268,268,268


In [130]:
X = data.drop('Outcome', axis=1)
Y = data['Outcome']

In [131]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 48.1 KB


In [133]:
Y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [134]:
print("Before UpSampling, counts of label '1': {}".format(sum(Y==1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(Y==0)))

Before UpSampling, counts of label '1': 268
Before UpSampling, counts of label '0': 500 



In [135]:
Y.ravel()

array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,

In [138]:
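# categorical_features=[0, 1] makes SMOTENC treat columns 0 and 1
# (Pregnancies and Glucose) as categorical features
sm = SMOTENC(categorical_features=[0, 1], random_state=100)
X_smote, Y_smote = sm.fit_resample(X, Y)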

In [143]:
print("After UpSampling, counts of label '1': {}".format(sum(Y_smote==1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(Y_smote==0)))

After UpSampling, counts of label '1': 500
After UpSampling, counts of label '0': 500 



In [142]:
X_smote.sample(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
480,3,158,70,30,328,35.5,0.344,35
283,7,161,86,0,0,30.4,0.165,47
951,3,149,67,27,115,29.430468,0.209973,36
316,3,99,80,11,64,19.3,0.284,30
752,3,108,62,24,0,26.0,0.223,25


In [144]:
# Down Sampling
from imblearn.under_sampling import TomekLinks
undersample = TomekLinks()
X_tklinks, Y_tklinks  = undersample.fit_resample(X, Y)

In [145]:
print("After DownSampling, counts of label '1': {}".format(sum(Y_tklinks==1)))
print("After DownSampling, counts of label '0': {} \n".format(sum(Y_tklinks==0)))

After DownSampling, counts of label '1': 268
After DownSampling, counts of label '0': 445 



## References
1. [https://en.wikipedia.org/wiki/Feature_engineering](https://en.wikipedia.org/wiki/Feature_engineering)
2. [https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)
3. [https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/](https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/)