In [1]:
import pandas as pd

# Feature Engineering

**Feature engineering** is the process of transforming raw data into useful inputs for machine learning models.
In this notebook, we’ll cover techniques like encoding categorical features, normalizing and standardizing data, handling missing values, and creating polynomial features.

Table of contents:

- [Encoding categorical features](#1.-Encoding-categorical-features)
- [Normalization and Standardization](#2.-Normalization-and-Standardization)
- [Data imputation](#3.-Data-imputation)
- [Polynomial features](#4.-Polynomial-features)

Let’s load the Titanic dataset, which contains information about passengers aboard the Titanic.

In [2]:
# Load the Titanic dataset, using PassengerId as the index
path = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/titanic.csv'
df = pd.read_csv(path, index_col='PassengerId')
df

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


| Column Name    | Description |
|----------------|-------------|
| **PassengerId** | Unique ID for each passenger |
| **Survived**    | Survival (0 = No, 1 = Yes) |
| **Pclass**      | Passenger class (1st, 2nd, 3rd) |
| **Name**        | Name of the passenger |
| **Sex**         | Gender of the passenger |
| **Age**         | Age of the passenger |
| **SibSp**       | Number of siblings/spouses aboard |
| **Parch**       | Number of parents/children aboard |
| **Ticket**      | Ticket number |
| **Fare**        | Ticket fare |
| **Cabin**       | Cabin number |
| **Embarked**    | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

Features can be classified as numerical or categorical:

- **Numerical features** represent quantities or measurements. Examples include `Age`, `Fare`, `SibSp`, and `Parch`.
- **Categorical features** represent categories or groups. Examples include `Sex`, `Pclass`, and `Embarked`.

Numerical features are used directly in most models, while categorical features often need to be encoded into numerical values for the model to process them effectively.
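
As a rough illustration of this split, the sketch below separates columns by dtype on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data). Dtype is only a first guess: `Pclass` is stored as integers even though it is treated as categorical here.

```python
import pandas as pd

# Hypothetical miniature frame with Titanic-like columns (values are illustrative)
toy = pd.DataFrame({
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
    'Sex': ['male', 'female'],
    'Pclass': [3, 1],
})

# Columns with a numeric dtype vs. everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)  # Pclass lands here because it is stored as integers
print(other_cols)
```

In practice, the final decision about which columns are categorical usually relies on knowledge of the data rather than on the dtype alone.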

## 1. Encoding categorical features

- [Ordinal encoding](#1.1.-Ordinal-encoding)
- [One hot encoding](#1.2.-One-hot-encoding)

Often, features are not continuous values but categorical.  
For example, a person might have features like `["male", "female"]`, `["from Europe", "from US", "from Asia"]`, `["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]`.
These categories can be turned into numbers.

### 1.1. Ordinal encoding

An **ordinal feature** is a categorical feature where the values have a natural order or ranking. 
For example, in the Titanic dataset, `Pclass` (passenger class) is an ordinal feature because the classes—1st, 2nd, and 3rd—have a meaningful order (1st class is higher than 3rd class).

**Ordinal encoding** is the process of converting these ordered categories into numbers that reflect their ranking.
For example, 1st class might be encoded as 1, 2nd class as 2, and 3rd class as 3. 
This preserves the ranking, so the model can make use of the order of the categories.
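
A minimal sketch of this idea, using a hypothetical `comfort` column with made-up labels (not part of the Titanic data). When the categories are strings, it helps to pass the order explicitly, because `OrdinalEncoder` otherwise sorts the labels alphabetically. Note also that the encoder numbers the categories starting from 0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature; the labels are assumptions for illustration
toy = pd.DataFrame({'comfort': ['low', 'high', 'medium', 'low']})

# Explicit order: low < medium < high
explicit = OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(toy)

# Default behaviour: categories are sorted alphabetically (high, low, medium)
default = OrdinalEncoder().fit_transform(toy)

print(explicit.ravel())  # codes follow the intended ranking
print(default.ravel())   # codes follow alphabetical order instead
```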

In [4]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()

We will first ordinal encode the `Sex` feature. 
Although there is no natural order between categories like "male" and "female," ordinal encoding can still be applied to convert these categories into numerical values. 
Keep in mind, for unordered features like this, ordinal encoding may not always be the best choice, and other encoding techniques like one-hot encoding could be more appropriate.

In [14]:
df.Sex.value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [11]:
# Instantiate the OrdinalEncoder and specify the category order
encoder = OrdinalEncoder(categories=[['male','female']])  # Explicitly set the order: male < female
# Fit and transform the Sex column
encoder.fit_transform(df[['Sex']])

array([[0.],
       [1.],
       [1.],
       ...,
       [1.],
       [0.],
       [0.]])

Notice that "male" has been encoded as 0 and "female" as 1.
If we set the `categories` parameter the other way around using `categories=[['female', 'male']]`, the encoded values would be reversed, with "female" encoded as 0 and "male" as 1.
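
The effect of the category order can be checked on a toy column. This is a small sketch in which `toy` is a made-up stand-in for `df[['Sex']]`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for df[['Sex']] (values are made up for illustration)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

male_first = OrdinalEncoder(categories=[['male', 'female']]).fit_transform(toy)
female_first = OrdinalEncoder(categories=[['female', 'male']]).fit_transform(toy)

print(male_first.ravel())    # male -> 0, female -> 1
print(female_first.ravel())  # female -> 0, male -> 1
```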

Let’s look at another example of ordinal encoding. This time, we’ll encode the `Embarked` feature. Again, keep in mind that the categories ("C" for Cherbourg, "Q" for Queenstown, and "S" for Southampton) don’t have a natural order.

In [23]:
df.Embarked.value_counts(dropna=False)

Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64

Notice that there are a few NaN values in the Embarked feature. 
Before we apply encoding, we need to decide how to handle these missing values. 
In this case, we will simply drop the rows with missing values.

In [28]:
# Drop the rows where Embarked is missing (modifies df in place)
df.dropna(axis=0, subset=['Embarked'], inplace=True)

In [31]:
df.Embarked.value_counts(dropna=False)

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

We will encode 'S' as 0, 'C' as 1, and 'Q' as 2.

In [33]:
encoder = OrdinalEncoder(categories=[['S','C','Q']])
encoder.fit_transform(df[['Embarked']])

array([[0.],
       [1.],
       [0.],
       ...,
       [0.],
       [1.],
       [2.]])

We can encode multiple features at the same time. For example, if we want to encode both the `Sex` and `Embarked` features together, we can pass both columns to the encoder and define the categories for each feature.

In [34]:
categories = [
    ['male','female'],
    ['S','C','Q']
]
encoder = OrdinalEncoder(categories=categories)
encoder.fit_transform(df[['Sex','Embarked']])

array([[0., 0.],
       [1., 1.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 2.]])

We get a matrix with two columns. 
The first column contains the encoding for `Sex`.
The second column contains the encoding for `Embarked`.
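
Because the encoder returns a plain NumPy array, the column names are lost. One possible way to keep them, sketched here on a made-up three-row frame (`toy` is an assumption for illustration), is to wrap the result in a DataFrame. `inverse_transform` maps the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows standing in for df[['Sex', 'Embarked']]
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

toy_encoder = OrdinalEncoder(categories=[['male', 'female'], ['S', 'C', 'Q']])
codes = toy_encoder.fit_transform(toy)

# Re-attach the column names and index to the encoded values
encoded_df = pd.DataFrame(codes, columns=toy.columns, index=toy.index)
print(encoded_df)

# Recover the original labels from the numeric codes
recovered = toy_encoder.inverse_transform(codes)
print(recovered)
```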

### 1.2. One hot encoding

**One-hot encoding** is a technique used to convert categorical features into a binary format.
Instead of assigning a single number to each category (as in ordinal encoding), one-hot encoding creates a new binary column for each category.
Each row contains a 1 in the column corresponding to its category and 0s in the other columns. 
This method is particularly useful when there’s no natural order between categories, ensuring the model doesn’t assume any ranking among them.

In the code below, we’re using `OneHotEncoder` to convert the `Sex` feature into a one-hot encoded format. 
By default, one-hot encoding generates a sparse matrix, which is memory-efficient because it only stores non-zero values.
However, sparse matrices can be difficult to visualize directly.
By setting `sparse_output=False`, we ensure the encoder outputs a dense matrix, which is easier to view and interpret.
We don't usually need `sparse_output=False`; here we use it only to see the one-hot encoded values clearly.
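
To make the sparse/dense distinction concrete, the sketch below keeps the default sparse output and converts it afterwards with `.toarray()`. The `toy` frame is a made-up stand-in for `df[['Sex']]`.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for df[['Sex']]
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Default settings: the result is a SciPy sparse matrix
sparse_result = OneHotEncoder().fit_transform(toy)
print(sparse.issparse(sparse_result))

# Convert to a dense NumPy array for inspection
dense_result = sparse_result.toarray()
print(dense_result)  # columns follow the sorted categories: female, male
```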

In [37]:
# One-hot encoding of the "Sex" feature
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoder.fit_transform(df[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

The result is a matrix with two columns, where each column represents one of the categories in the `Sex` feature.
One column corresponds to "male" and the other to "female". 
For each row, a 1 indicates the presence of that category, and a 0 indicates its absence.

If you're unsure of the order of the categories after encoding, you can use `encoder.categories_` to check.
This attribute shows the categories that were detected and used for the encoding.

In [38]:
encoder.categories_

[array(['female', 'male'], dtype=object)]

In [39]:
# One-hot encoding of the "Embarked" feature
encoder.fit_transform(df[['Embarked']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [19]:
encoder.categories_

[array(['C', 'Q', 'S'], dtype=object)]

This time, we get a matrix with three columns, each representing a category in the `Embarked` feature: "C" (Cherbourg), "Q" (Queenstown), and "S" (Southampton). 
For each row, a 1 is placed in the column for the passenger’s embarkation point, with 0s in the other columns.

We can encode multiple columns at once.

In [40]:
# One-hot encoding of the "Sex" and "Embarked" features
encoder.fit_transform(df[['Sex','Embarked']])

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])

In [41]:
encoder.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

This time, we get a matrix with five columns.
The first two columns represent the one-hot encoding of the `Sex` feature, while the remaining three columns represent the `Embarked` feature.

## 2. Normalization and Standardization 

- [Normalization](#2.1.-Normalization)
- [Standardization](#2.2.-Standardization)

In [22]:
X = df[['Pclass','Age','Fare']] # feature matrix
X

| PassengerId | Pclass | Age | Fare |
|---|---|---|---|
| 1 | 3 | 22.0 | 7.2500 |
| 2 | 1 | 38.0 | 71.2833 |
| 3 | 3 | 26.0 | 7.9250 |
| 4 | 1 | 35.0 | 53.1000 |
| 5 | 3 | 35.0 | 8.0500 |
| ... | ... | ... | ... |
| 887 | 2 | 27.0 | 13.0000 |
| 888 | 1 | 19.0 | 30.0000 |
| 889 | 3 | NaN | 23.4500 |
| 890 | 1 | 26.0 | 30.0000 |


### 2.1. Normalization

**Normalization** rescales the values of a numerical feature to a standard range, typically the interval [0,1] or [-1,1].
These ranges can be obtained with `MinMaxScaler` and `MaxAbsScaler`, respectively.

Normalizing the data is not a strict requirement. 
However, in practice, it often speeds up training. 
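
As a sketch of what the two scalers compute, the snippet below reproduces them by hand on a small made-up array (`X_toy` is an assumption for illustration). Min-max scaling uses `(x - min) / (max - min)` per column, and max-abs scaling divides each column by its largest absolute value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Made-up data: two features on different scales, one with a negative value
X_toy = np.array([[1.0, -10.0],
                  [2.0, 20.0],
                  [3.0, 40.0]])

# Min-max scaling by hand: (x - min) / (max - min), column by column
col_min, col_max = X_toy.min(axis=0), X_toy.max(axis=0)
manual_minmax = (X_toy - col_min) / (col_max - col_min)

# Max-abs scaling by hand: divide by the largest absolute value in each column
manual_maxabs = X_toy / np.abs(X_toy).max(axis=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X_toy)))
print(np.allclose(manual_maxabs, MaxAbsScaler().fit_transform(X_toy)))
```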

In [23]:
# min-max scaler scales data to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
scaler.transform(X)

array([[1.        , 0.27117366, 0.01415106],
       [0.        , 0.4722292 , 0.13913574],
       [1.        , 0.32143755, 0.01546857],
       ...,
       [1.        ,        nan, 0.04577135],
       [0.        , 0.32143755, 0.0585561 ],
       [1.        , 0.39683338, 0.01512699]])

In [24]:
# max-abs scaler scales data to the [-1, 1] range 
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(X)
scaler.transform(X)

array([[1.        , 0.275     , 0.01415106],
       [0.33333333, 0.475     , 0.13913574],
       [1.        , 0.325     , 0.01546857],
       ...,
       [1.        ,        nan, 0.04577135],
       [0.33333333, 0.325     , 0.0585561 ],
       [1.        , 0.4       , 0.01512699]])

### 2.2. Standardization

**Standardization (or mean removal and variance scaling)** rescales each feature so that it has mean 0 and standard deviation 1, like a standard normal distribution. It shifts and scales the values but does not change the shape of their distribution.
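
A minimal sketch of the computation on a made-up array (`X_toy` is an assumption for illustration): each column has its mean subtracted and is then divided by its standard deviation. `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on different scales
X_toy = np.array([[1.0, -10.0],
                  [2.0, 20.0],
                  [3.0, 40.0]])

# Standardization by hand: z = (x - mean) / std, column by column
manual_z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_z = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual_z, scaled_z))
print(scaled_z.mean(axis=0))  # approximately 0 for each column
print(scaled_z.std(axis=0))   # approximately 1 for each column
```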

In [25]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X)

array([[ 0.82520863, -0.52766856, -0.50023975],
       [-1.57221121,  0.57709388,  0.78894661],
       [ 0.82520863, -0.25147795, -0.48664993],
       ...,
       [ 0.82520863,         nan, -0.17408416],
       [-1.57221121, -0.25147795, -0.0422126 ],
       [ 0.82520863,  0.16280796, -0.49017322]])

## 3. Data imputation

In [26]:
X.isnull().sum()

Pclass      0
Age       177
Fare        0
dtype: int64

The typical approaches to dealing with missing values for a feature include:
- removing rows with missing values from the dataset (this can be done if your dataset is big enough)
- using a **data imputation** technique

The `SimpleImputer` class provides basic strategies for imputing missing values. 
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. 
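
The strategies can be compared side by side on a small made-up column (`X_toy` and the fill value 0.0 are assumptions for illustration). The snippet records the value each strategy puts in place of the single missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with one missing value at row index 3
X_toy = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ['mean', 'median', 'most_frequent']:
    toy_imputer = SimpleImputer(strategy=strategy)
    filled[strategy] = toy_imputer.fit_transform(X_toy)[3, 0]

# The 'constant' strategy takes the replacement from fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0.0)
filled['constant'] = constant_imputer.fit_transform(X_toy)[3, 0]

print(filled)
```

The mean is pulled up by the large value 10.0, while the median and the most frequent value are not, which is one reason the median is often preferred for skewed features.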

In [27]:
from sklearn.impute import SimpleImputer

In [28]:
df.Age.mean()

29.64209269662921

In [29]:
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)
imputed_X = imputer.transform(X)
imputed_X

array([[ 3.       , 22.       ,  7.25     ],
       [ 1.       , 38.       , 71.2833   ],
       [ 3.       , 26.       ,  7.925    ],
       ...,
       [ 3.       , 29.6420927, 23.45     ],
       [ 1.       , 26.       , 30.       ],
       [ 3.       , 32.       ,  7.75     ]])

## 4. Polynomial features

Often it's useful to add complexity to the model by considering nonlinear features of the input data. 
A simple and common method is **polynomial features**, which adds higher-order powers of the features and interaction terms between them. 
It is implemented in `PolynomialFeatures`.

In [30]:
from sklearn.preprocessing import PolynomialFeatures

In [31]:
poly = PolynomialFeatures(degree=2)
poly.fit(imputed_X)
poly.transform(imputed_X)

array([[1.00000000e+00, 3.00000000e+00, 2.20000000e+01, ...,
        4.84000000e+02, 1.59500000e+02, 5.25625000e+01],
       [1.00000000e+00, 1.00000000e+00, 3.80000000e+01, ...,
        1.44400000e+03, 2.70876540e+03, 5.08130886e+03],
       [1.00000000e+00, 3.00000000e+00, 2.60000000e+01, ...,
        6.76000000e+02, 2.06050000e+02, 6.28056250e+01],
       ...,
       [1.00000000e+00, 3.00000000e+00, 2.96420927e+01, ...,
        8.78653659e+02, 6.95107074e+02, 5.49902500e+02],
       [1.00000000e+00, 1.00000000e+00, 2.60000000e+01, ...,
        6.76000000e+02, 7.80000000e+02, 9.00000000e+02],
       [1.00000000e+00, 3.00000000e+00, 3.20000000e+01, ...,
        1.02400000e+03, 2.48000000e+02, 6.00625000e+01]])

In [None]:
# get_feature_names was removed in recent scikit-learn releases; use get_feature_names_out
poly.get_feature_names_out(X.columns)