# Feature Engineering

Feature engineering is one of the most important aspects of the machine learning pipeline. It is the practice of creating and modifying features, or variables, for the purposes of improving model performance.

## What are Features?

Features are measurable characteristics of the phenomenon being observed. They are the granular elements of the data that models operate on to make predictions. Examples include age, income, a timestamp, longitude, value, and almost anything else that can be measured or represented in some form.

There are different feature types, the main ones being:

1. **Numerical Features**: Continuous or discrete numeric types (e.g. age, salary)
2. **Categorical Features**: Qualitative values representing categories (e.g. gender, shoe type)
3. **Text Features**: Words or strings of words (e.g. "this" or "that" or "even this")
4. **Time Series Features**: Data that is ordered by time (e.g. stock prices)

Features are crucial in machine learning because they directly influence a model's ability to make predictions. Well-constructed features improve model performance, while bad features make it harder for a model to produce strong predictions. Feature selection and feature engineering are preprocessing steps in the machine learning process that are used to prepare the data for use by learning algorithms.

A distinction is made between feature selection and feature engineering, though both are crucial in their own right:

1. **Feature Selection**: Choosing the most relevant features from the full set of available features, which reduces dimensionality and can improve model performance
2. **Feature Engineering**: Creating new features and transforming existing ones so that a model performs better

By keeping only the most important features, feature selection helps isolate the signal in the data, while feature engineering creates new features that help to model the outcome better.
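
As a rough illustration of feature selection, the sketch below scores each feature against the target with scikit-learn's `SelectKBest` and keeps the two highest-scoring ones. The data is synthetic and the column names (`signal_a`, `noise_a`, and so on) are hypothetical, chosen only to make the result easy to read. Univariate scoring is just one of several possible selection strategies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): two informative
# columns and two columns of pure noise
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame({
    'signal_a': rng.normal(size=n_rows),
    'signal_b': rng.normal(size=n_rows),
    'noise_a': rng.normal(size=n_rows),
    'noise_b': rng.normal(size=n_rows),
})
# The target depends only on the two signal columns, plus a little noise
y = 3.0 * X['signal_a'] - 2.0 * X['signal_b'] + rng.normal(scale=0.1, size=n_rows)

# Score every feature against the target and keep the best two
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
```

Because the target is built from the two signal columns only, those are the columns the selector should retain here. On real data the scores are rarely this clear-cut.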

## Handling Missing Values
It is common for datasets to contain missing information. This can be detrimental to a model's performance, which is why it is important to implement strategies for dealing with missing data. There are a handful of common methods for rectifying this issue:

1. **Mean/Median Imputation**: Filling missing areas in a dataset with the mean or median of the column
2. **Mode Imputation**: Filling missing spots in a dataset with the most common entry in the same column
3. **Interpolation**: Estimating missing values from the neighboring data points

The choice of method should depend on the nature of the data and on the effect the method might have on the final model.

Dealing with missing information is crucial for keeping the integrity of the dataset intact. Here is an example Python code snippet that demonstrates mean and median imputation using pandas and scikit-learn's `SimpleImputer`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample DataFrame
data = {'age': [25, 30, np.nan, 35, 40], 'salary': [50000, 60000, 55000, np.nan, 65000]}
df = pd.DataFrame(data)

print(df)

    age   salary
0  25.0  50000.0
1  30.0  60000.0
2   NaN  55000.0
3  35.0      NaN
4  40.0  65000.0


In [4]:
df.isna().sum()

age       1
salary    1
dtype: int64

In [76]:
# Fill in missing ages using the mean
mean_imputer = SimpleImputer(strategy='mean')
df['age'] = mean_imputer.fit_transform(df[['age']])

# Fill in the missing salaries using the median
median_imputer = SimpleImputer(strategy='median')
df['salary'] = median_imputer.fit_transform(df[['salary']])

print(df)

    age   salary
0  25.0  50000.0
1  30.0  60000.0
2  32.5  55000.0
3  35.0  57500.0
4  40.0  65000.0
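
Mode imputation and interpolation, the other two methods listed above, can be sketched with pandas alone. The `city` and `temperature` columns are made-up sample data. Linear interpolation is an assumption that only makes sense when the rows have a meaningful order, for example by time.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df_missing = pd.DataFrame({
    'city': ['Paris', 'Lyon', np.nan, 'Paris', 'Paris'],
    'temperature': [20.0, np.nan, 24.0, np.nan, 28.0],
})

# Mode imputation: fill gaps with the most frequent value in the column
most_common_city = df_missing['city'].mode()[0]
df_missing['city'] = df_missing['city'].fillna(most_common_city)

# Linear interpolation: estimate each gap from the values on either side
df_missing['temperature'] = df_missing['temperature'].interpolate(method='linear')

print(df_missing)
```

After this runs, the missing city becomes `Paris` and the two temperature gaps are filled with the midpoints of their neighbors, 22.0 and 26.0.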


## Encoding of Categorical Variables
Most machine learning algorithms work best with (or only accept) numeric data, so categorical variables must often be mapped to numerical values before a model can use them. The most common encoding schemes are the following:

1. **One-Hot Encoding**: Producing separate columns for each category
2. **Label Encoding**: Assigning an integer to each category
3. **Target Encoding**: Replacing each category with the average of the target variable for that category

Many machine learning models cannot use categorical data until it has been encoded. The right encoding method depends on the specific situation, including both the algorithm in use and the dataset.

Below is an example Python script for the encoding of categorical features using pandas and elements of scikit-learn.

In [5]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Sample DataFrame
data = {'color': ['red', 'blue', 'green', 'blue', 'red']}
df = pd.DataFrame(data)
print(df)

   color
0    red
1   blue
2  green
3   blue
4    red


In [6]:
# Implementing one-hot encoding
one_hot_encoder = OneHotEncoder()
one_hot_encoding = one_hot_encoder.fit_transform(df[['color']]).toarray()
df_one_hot = pd.DataFrame(one_hot_encoding, columns=one_hot_encoder.get_feature_names_out(['color']))
print(df_one_hot)

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         1.0          0.0        0.0
4         0.0          0.0        1.0


In [7]:
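# Implementing label encoding (scikit-learn intends LabelEncoder for target
# labels; OrdinalEncoder is its counterpart for input features)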
label_encoder = LabelEncoder()
df['color_label'] = label_encoder.fit_transform(df['color'])
print(df)

   color  color_label
0    red            2
1   blue            0
2  green            1
3   blue            0
4    red            2


In [8]:
# Sample DataFrame
data = {'brand': ['iphone', 'samsung', 'vivo', 'xiaomi'], 'damage': ['cracked screen', 'dent', 'water', 'dent']}
df = pd.DataFrame(data)
print(df)

     brand          damage
0   iphone  cracked screen
1  samsung            dent
2     vivo           water
3   xiaomi            dent


In [9]:
# Manual label encoding with an explicit category-to-integer mapping
df['damage'] = df['damage'].map({'cracked screen': 0, 'dent': 1, 'water': 2})
print(df)

     brand  damage
0   iphone       0
1  samsung       1
2     vivo       2
3   xiaomi       1
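
Target encoding, the third scheme listed above, can be sketched with a pandas group-by. The `purchased` target column is hypothetical and exists only for this illustration. In practice the category averages would usually be computed on training data only, often with smoothing, to avoid leaking the target into the features.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target
df_target = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red'],
    'purchased': [1, 0, 1, 1, 0, 1],
})

# Average target value for each category
category_means = df_target.groupby('color')['purchased'].mean()

# Replace each category with its average target value
df_target['color_target'] = df_target['color'].map(category_means)

print(df_target)
```

Here `blue` maps to 0.5, `green` to 1.0, and `red` to roughly 0.667, the share of positive outcomes observed for each color.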


## Scaling and Normalizing Data
For many machine learning methods to perform well, the data needs to be scaled and normalized. There are several methods for scaling and normalizing data, such as:

1. **Standardization**: Transforming data so that it has a mean of 0 and a standard deviation of 1
2. **Min-Max Scaling**: Scaling data to a fixed range, such as [0, 1]
3. **Robust Scaling**: Centering data on the median and scaling it by the interquartile range, which makes the result less sensitive to outliers

Scaling and normalization keep features with large numeric ranges from dominating those with small ranges, so that each feature contributes to the model on a comparable scale.

Below is an implementation, using scikit-learn, that shows how to scale and normalize data.

In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Sample DataFrame
data = {'age': [25, 30, 35, 40, 45], 'salary': [50000, 60000, 55000, 65000, 70000]}
df = pd.DataFrame(data)

print(df)

   age  salary
0   25   50000
1   30   60000
2   35   55000
3   40   65000
4   45   70000


In [13]:
# Standardize data
scaler_standard = StandardScaler()
df['age_standard'] = scaler_standard.fit_transform(df[['age']])

# Min-Max Scaling
scaler_minmax = MinMaxScaler()
df['salary_minmax'] = scaler_minmax.fit_transform(df[['salary']])

# Robust Scaling
scaler_robust = RobustScaler()
df['salary_robust'] = scaler_robust.fit_transform(df[['salary']])

print(df)

   age  salary  age_standard  salary_minmax  salary_robust
0   25   50000     -1.414214           0.00           -1.0
1   30   60000     -0.707107           0.50            0.0
2   35   55000      0.000000           0.25           -0.5
3   40   65000      0.707107           0.75            0.5
4   45   70000      1.414214           1.00            1.0
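
To make the three transformations concrete, the following sketch recomputes them by hand with NumPy for the same salary values. It assumes the scalers' default settings: `StandardScaler` divides by the population standard deviation, and `RobustScaler` uses the 25th to 75th percentile range.

```python
import numpy as np

salary = np.array([50000, 60000, 55000, 65000, 70000], dtype=float)

# Standardization: subtract the mean, divide by the population std (ddof=0)
standardized = (salary - salary.mean()) / salary.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(salary, [25, 50, 75])
robust = (salary - median) / (q3 - q1)

print(standardized)
print(minmax)
print(robust)
```

The `minmax` and `robust` arrays should match the `salary_minmax` and `salary_robust` columns printed above.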


## Advanced Techniques in Feature Engineering


We now turn our attention to more advanced feature engineering techniques, and include some sample Python code for implementing these concepts.


### Feature Creation
With feature creation, new features are generated or modified to fashion a model with better performance. Some techniques for creating new features include:

1. **Polynomial Features**: Creation of higher-order features with existing features to capture more complex relationships
2. **Interaction Terms**: Features generated by combining several features to derive interactions between them
3. **Domain-Specific Feature Generation**: Features designed using knowledge of the problem domain


In [14]:
# Sample DataFrame
data = {'x1': [1, 2, 3, 4, 5], 'x2': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
print(df)

   x1  x2
0   1  10
1   2  20
2   3  30
3   4  40
4   5  50


In [15]:
# Polynomial feature: the square of x1
df['x1_squared'] = df['x1'] ** 2
# Interaction term: the product of x1 and x2
df['x1_x2_interaction'] = df['x1'] * df['x2']
print(df)

   x1  x2  x1_squared  x1_x2_interaction
0   1  10           1                 10
1   2  20           4                 40
2   3  30           9                 90
3   4  40          16                160
4   5  50          25                250


Another example is extracting part of an existing feature, or transforming it, to produce a brand new feature. In this example, the model_name column contains the RAM of the phone.

In [16]:
# Sample DataFrame
data = {'brand': ['Apple', 'Samsung'], 'model_name': ['Iphone Pro Max 16GB Titanium', 'Samsung Galaxy S24 8GB'], 'year_released':[2023, 2024]}
df = pd.DataFrame(data)
print(df)

     brand                    model_name  year_released
0    Apple  Iphone Pro Max 16GB Titanium           2023
1  Samsung        Samsung Galaxy S24 8GB           2024


We can create a new memory column that contains the amount of RAM of the phone. To extract it, we can use a regular expression:

In [17]:
# Capture the digits immediately followed by "GB"; expand=False returns a Series
df['memory'] = df['model_name'].str.extract(r'(\d+)(?=GB)', expand=False).astype(float)
print(df)

     brand                    model_name  year_released  memory
0    Apple  Iphone Pro Max 16GB Titanium           2023    16.0
1  Samsung        Samsung Galaxy S24 8GB           2024     8.0


We can also create a new 'phone_age' column by subtracting the release year of the phone from the current year. The result depends on when the code is run; the output below corresponds to a run in 2025.

In [18]:
from datetime import datetime

current_year = datetime.now().year
df['phone_age'] = current_year - df['year_released']
print(df)

     brand                    model_name  year_released  memory  phone_age
0    Apple  Iphone Pro Max 16GB Titanium           2023    16.0          2
1  Samsung        Samsung Galaxy S24 8GB           2024     8.0          1


## References

https://www.kdnuggets.com/feature-engineering-for-beginners