# Feature Engineering Exercise (Solution)

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Feature engineering is a crucial step in developing effective Machine Learning systems, blending domain expertise with mathematical transformations. It focuses on processing diverse data types and variables, with each Machine Learning problem demanding tailored feature engineering strategies. This module explores techniques for engineering both **numeric** and **categorical** features.

## Learning Objectives

- Transform and engineer numeric features
  - Apply raw measures and counts
  - Implement binarization techniques
  - Perform rounding operations
  - Create feature interactions
- Transform and engineer categorical features
  - Convert nominal features to numeric representations
  - Transform ordinal features with preserved ordering
  - Apply encoding schemes for categorical data
    - One Hot Encoding
    - Dummy Coding

### Tasks to complete

- Implement numeric feature engineering techniques
- Transform categorical variables
- Apply various encoding schemes
- Analyze transformed features

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

- Select the `conda_python3` kernel in your SageMaker notebook instance.

### Import necessary libraries


In [86]:
# Import necessary dependencies

# Matplotlib for plotting and visualization
import matplotlib as mpl
import matplotlib.pyplot as plt

# NumPy for numerical operations and array manipulations
import numpy as np

# Pandas for data manipulation and analysis
import pandas as pd

# SciPy statistical functions for advanced statistical analysis
import scipy.stats as spstats

# Scikit-learn preprocessing tools for data transformation
from sklearn.preprocessing import (
    Binarizer,           # Converts numerical values into binary (0 or 1) based on a threshold
    LabelEncoder,        # Encodes categorical labels as integers (useful for classification tasks)
    OneHotEncoder,       # Encodes categorical variables as one-hot (dummy) variables
    PolynomialFeatures,  # Generates polynomial features for regression models
)


# Enable inline plotting in Jupyter Notebook
%matplotlib inline

# Reload Matplotlib's style library to ensure the latest settings are applied
mpl.style.reload_library()

# Set the Matplotlib style to "classic" for a traditional look
mpl.style.use("classic")

# Set the background color of figures to transparent (white with 0 alpha)
mpl.rcParams["figure.facecolor"] = (1, 1, 1, 0)

# Define the default figure size as 6 inches by 4 inches
mpl.rcParams["figure.figsize"] = [6.0, 4.0]

# Set the figure resolution to 100 dots per inch (DPI) for better clarity
mpl.rcParams["figure.dpi"] = 100

## Feature Engineering on Numeric Data


Although numeric data can be fed directly to Machine Learning models, it is often necessary to engineer features that suit the specific scenario, problem, and domain before building a model. Key considerations for numeric features include their **scale** and **distribution**. In some cases, a transformation is needed to adjust the scale of the values; in others, the shape of the distribution needs to change, for example by transforming a skewed distribution so that it is closer to normal.
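As a rough illustration of changing a distribution's shape, the sketch below applies a log transform to a small, made-up right-skewed sample and compares the skewness before and after. The sample values and the helper `skewness` are hypothetical and are not part of the exercise data.

```python
import numpy as np


def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


# Hypothetical right-skewed sample (a few very large values)
skewed = np.array([1, 2, 2, 3, 3, 4, 5, 8, 40, 120], dtype=float)

# log1p computes log(1 + x), which stays defined when values include 0
transformed = np.log1p(skewed)

print(round(skewness(skewed), 2), round(skewness(transformed), 2))
```

The transformed sample should show a smaller positive skew; whether a log transform is appropriate depends on the feature and the model.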

### Raw Measures

Raw measures refer to the direct use of numeric variables as features without any transformation or engineering. These features typically represent values or counts in their original form.

#### Values

Scalar values in their raw form typically represent a specific measurement, metric, or observation associated with a particular variable or field. The meaning of the field is usually derived from its name or, if available, a data dictionary.

### Ecoli Dataset

The Ecoli dataset is used to predict protein localization sites in E. coli.

```
Number of Instances:  336
Number of Attributes: 8 ( 7 predictive, 1 name )
Attribute Information.
  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score (Binary attribute).
  5. chg: Presence of charge on N-terminus of predicted lipoproteins (Binary attribute).
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
Missing Attribute Values: None.
Class Distribution. The class is the localization site.
  cp  (cytoplasm)                                    143
  im  (inner membrane without signal sequence)        77
  pp  (periplasm)                                     52
  imU (inner membrane, uncleavable signal sequence)   35
  om  (outer membrane)                                20
  omL (outer membrane lipoprotein)                     5
  imL (inner membrane lipoprotein)                     2
  imS (inner membrane, cleavable signal sequence)      2
```

You can learn more about the dataset here:

- Ecoli Dataset ([ecoli.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
- Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))


In [87]:
# Define the file path to the Ecoli dataset (relative path)
ecoli_data = "../../Data/ecoli.csv"

# Load the dataset into a Pandas DataFrame
ecoli_df = pd.read_csv(ecoli_data)

# Display the first 10 rows of the dataset to inspect its structure
ecoli_df.head(10)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


In [88]:
# Display the first few rows of the selected feature columns ("mcg", "gvh", "chg") 
# from the ecoli dataset
ecoli_df[["mcg", "gvh", "chg"]].head()

Unnamed: 0,mcg,gvh,chg
0,0.06,0.61,0.5
1,0.49,0.29,0.5
2,0.85,0.53,0.5
3,0.07,0.4,0.5
4,0.78,0.68,0.5


In [89]:
# Compute basic statistical measures (count, mean, std, min, max, and quartiles)
# for the numerical columns 'mcg', 'gvh', and 'chg' in the DataFrame 'ecoli_df'
ecoli_df[["mcg", "gvh", "chg"]].describe()

Unnamed: 0,mcg,gvh,chg
count,336.0,336.0,336.0
mean,0.50006,0.5,0.501488
std,0.194634,0.148157,0.027277
min,0.0,0.16,0.5
25%,0.34,0.4,0.5
50%,0.5,0.47,0.5
75%,0.6625,0.57,0.5
max,0.89,1.0,1.0


### Counts

Raw numeric measures can also represent counts, frequencies, or occurrences of specific attributes.


### Diabetes Dataset

The dataset records, for each patient, whether diabetes onset occurred within five years.

```
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))


In [90]:
# Load Diabetes dataset from a CSV file
diabetes_data = "../../Data/pima-indians-diabetes.csv"

# Read the CSV file into a pandas DataFrame, specifying no header row (header=None)
diabetes_df = pd.read_csv(diabetes_data, header=None)

# Assign column names to the dataset for better readability
diabetes_df.columns = [
    "pregnancy",  # Number of times pregnant
    "glucose",    # Plasma glucose concentration
    "bp",         # Diastolic blood pressure (mm Hg)
    "triceps",    # Triceps skinfold thickness (mm)
    "insulin",    # 2-Hour serum insulin (mu U/ml)
    "bmi",        # Body Mass Index (weight in kg/(height in m)^2)
    "pedigree",   # Diabetes pedigree function (genetic risk factor)
    "age",        # Age in years
    "diabetes",   # Diabetes diagnosis (1 = positive, 0 = negative)
]

# Display the first 10 rows of the dataset
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [91]:
diabetes_df.describe()

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


### Binarization

Sometimes only whether a value crosses a threshold matters, not its exact magnitude. For example, if the focus is on whether a patient is older than 50 rather than on the exact age, a binary feature is more suitable than the raw numeric value.


In [92]:
# Convert the 'age' column to a NumPy array for easier manipulation
age = np.array(diabetes_df["age"])

# Create a copy of the 'age' array to store the binarized values
old = np.array(diabetes_df["age"])

# Assign 1 to individuals older than 50
old[age > 50] = 1

# Assign 0 to individuals aged 50 or younger
old[age <= 50] = 0

# Add the binarized 'old' column back to the DataFrame
diabetes_df["old"] = old

# Display the first 10 rows of the updated DataFrame
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old
0,6,148,72,35,0,33.6,0.627,50,1,0
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0
3,1,89,66,23,94,28.1,0.167,21,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0
5,5,116,74,0,0,25.6,0.201,30,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0
7,10,115,0,0,0,35.3,0.134,29,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1


In [93]:
# Binarize 'age' field using Binarizer
# This transformation converts numerical values into binary (0 or 1) based on a given threshold.

# Initialize the Binarizer with a threshold of 50
# Any age value greater than 50 will be mapped to 1, while 50 and below will be mapped to 0.
bn = Binarizer(threshold=50)

# Apply the transformation on the 'age' column of the diabetes dataset
# Note: `Binarizer.transform()` expects a 2D array, so we wrap the column inside a list.
bn_old = bn.transform([diabetes_df["age"]])[0]  

# Store the binarized values in a new column 'bn_old' in the DataFrame
diabetes_df["bn_old"] = bn_old

# Display the first 10 rows of the updated DataFrame to verify the transformation
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old
0,6,148,72,35,0,33.6,0.627,50,1,0,0
1,1,85,66,29,0,26.6,0.351,31,0,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0,0
3,1,89,66,23,94,28.1,0.167,21,0,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0,0
5,5,116,74,0,0,25.6,0.201,30,0,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0,0
7,10,115,0,0,0,35.3,0.134,29,0,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1,1
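Both approaches above can be written more compactly. The sketch below is one possible variant on a small made-up frame (`ages` is a stand-in for `diabetes_df`): a boolean comparison cast to integers, and `Binarizer` applied to a one-column 2D array instead of a list-wrapped row.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Small made-up sample standing in for diabetes_df["age"]
ages = pd.DataFrame({"age": [50, 31, 53, 54, 21]})

# Boolean comparison cast to integers: 1 if age > 50, otherwise 0
ages["old"] = (ages["age"] > 50).astype(int)

# Binarizer expects 2D input; a one-column array of shape (n, 1) satisfies that
bn_demo = Binarizer(threshold=50)
ages["bn_old"] = bn_demo.transform(ages[["age"]].to_numpy()).ravel()

print(ages)
```

Both columns should agree; note that an age of exactly 50 maps to 0 because the comparison is strict.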


### Rounding

When working with numeric attributes such as proportions or percentages, high precision is often unnecessary. In such cases, it is practical to round these values to whole numbers. These rounded integers can then be used directly as raw numeric values or even as categorical (discrete class-based) features.


In [94]:
# Create a new column 'pedigree_scale_10' by multiplying the 'pedigree' column by 10 
# and rounding the values to the nearest integer, then converting them to integers
diabetes_df["pedigree_scale_10"] = np.array(
    np.round((diabetes_df["pedigree"] * 10)), dtype="int"
)

# Create a new column 'pedigree_scale_100' by multiplying the 'pedigree' column by 100 
# and rounding the values to the nearest integer, then converting them to integers
diabetes_df["pedigree_scale_100"] = np.array(
    np.round((diabetes_df["pedigree"] * 100)), dtype="int"
)

# Display the updated DataFrame
diabetes_df

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old,pedigree_scale_10,pedigree_scale_100
0,6,148,72,35,0,33.6,0.627,50,1,0,0,6,63
1,1,85,66,29,0,26.6,0.351,31,0,0,0,4,35
2,8,183,64,0,0,23.3,0.672,32,1,0,0,7,67
3,1,89,66,23,94,28.1,0.167,21,0,0,0,2,17
4,0,137,40,35,168,43.1,2.288,33,1,0,0,23,229
...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0,1,1,2,17
764,2,122,70,27,0,36.8,0.340,27,0,0,0,3,34
765,5,121,72,23,112,26.2,0.245,30,0,0,0,2,24
766,1,126,60,0,0,30.1,0.349,47,1,0,0,3,35
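One detail worth knowing when rounding with NumPy: `np.round` sends values that lie exactly halfway between two integers to the nearest even integer, not always upward. The short sketch below uses made-up values to show this; scaled features that land exactly on `.5` follow the same rule.

```python
import numpy as np

# Values exactly halfway between two integers
halves = np.array([0.5, 1.5, 2.5, 3.5])

# Round-half-to-even: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2, 3.5 -> 4
rounded = np.round(halves)
print(rounded)  # → [0. 2. 2. 4.]
```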


### Interactions

In real-world datasets and scenarios, it is often beneficial to capture interactions between feature variables and include them as part of the input feature set.


In [95]:
# Select the "gvh" (von Heijne's signal sequence recognition score) and "lip" (Signal Peptidase II consensus sequence score) columns from the ecoli_df DataFrame
gvh_lip = ecoli_df[["gvh", "lip"]]

# Display the first 5 rows of the selected subset to inspect the data
gvh_lip.head()

Unnamed: 0,gvh,lip
0,0.61,0.48
1,0.29,0.48
2,0.53,0.48
3,0.4,0.48
4,0.68,0.48


In [96]:
# Create an instance of PolynomialFeatures with the following parameters:
# degree=2: Generate features up to the second degree (squared features and interactions).
# interaction_only=False: Allow both interaction terms (e.g., feature1*feature2) and polynomial terms (e.g., feature1^2).
# include_bias=False: Exclude the bias column (column of ones) that represents the intercept term.
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit the PolynomialFeatures transformer on the dataset (gvh_lip) and then transform it.
# This will generate new features that are combinations of the original features up to the specified degree.
res = pf.fit_transform(gvh_lip)

# Output the transformed feature set (the original features and their polynomial interactions).
res

array([[0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.29  , 0.48  , 0.0841, 0.1392, 0.2304],
       [0.53  , 0.48  , 0.2809, 0.2544, 0.2304],
       ...,
       [0.6   , 0.48  , 0.36  , 0.288 , 0.2304],
       [0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.74  , 0.48  , 0.5476, 0.3552, 0.2304]])

We now have a total of five features, including the new polynomial and interaction features.


We can see the degree of each feature in the matrix.


In [97]:
# Convert the pf.powers_ attribute to a Pandas DataFrame
# 'pf.powers_' is a NumPy array in which each row holds the exponents applied to the input features.
# The columns are labeled as 'gvh_degree' and 'lip_degree' for clarity.
pd.DataFrame(pf.powers_, columns=["gvh_degree", "lip_degree"])

Unnamed: 0,gvh_degree,lip_degree
0,1,0
1,0,1
2,2,0
3,1,1
4,0,2
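To make the exponent table concrete, the following sketch rebuilds the five features for the first `gvh`/`lip` row by hand with NumPy. It is only an illustration of what the table means, not how scikit-learn computes the result internally.

```python
import numpy as np

# Exponent table in the same layout as pf.powers_ (columns: gvh, lip)
powers = np.array([[1, 0], [0, 1], [2, 0], [1, 1], [0, 2]])

# First row of the gvh/lip subset shown earlier
x = np.array([0.61, 0.48])

# Each output feature is the product of the inputs raised to that row's exponents
features = np.prod(x ** powers, axis=1)
print(np.round(features, 4))
```

The values should match the first row of the transformed array: `gvh`, `lip`, `gvh^2`, `gvh x lip`, and `lip^2`.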


Now that we know what each feature represents from the degrees shown above, we can assign a name to each feature to get the updated feature set.


In [98]:
# Create a DataFrame from the array 'res' with specific column names
intr_features = pd.DataFrame(res, columns=["gvh", "lip", "gvh^2", "gvh x lip", "lip^2"])

# Display the first 5 rows of the DataFrame to preview the data
intr_features.head(5)

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.61,0.48,0.3721,0.2928,0.2304
1,0.29,0.48,0.0841,0.1392,0.2304
2,0.53,0.48,0.2809,0.2544,0.2304
3,0.4,0.48,0.16,0.192,0.2304
4,0.68,0.48,0.4624,0.3264,0.2304


Transforming new data in the future (during predictions)


In [99]:
# Create a new DataFrame containing sample observations for 'gvh' and 'lip'
# Each row represents the features of a new protein sequence
new_df = pd.DataFrame(
    [[0.35, 0.49], [0.46, 0.38], [0.25, 0.48]],  # Sample feature values for 3 new observations
    columns=["gvh", "lip"]  # Same column names as the data used to fit pf
)

# Display the newly created DataFrame
new_df

Unnamed: 0,gvh,lip
0,0.35,0.49
1,0.46,0.38
2,0.25,0.48


In [100]:
# Use the pf object that we created earlier to transform the input features 
# and generate interaction features from the new data (new_df).
new_res = pf.transform(new_df)

# Convert the resulting interaction features (new_res) into a DataFrame 
# with column names representing the specific interaction terms and transformations.
new_intr_features = pd.DataFrame(
    new_res,  # The transformed features
    columns=["gvh", "lip", "gvh^2", "gvh x lip", "lip^2"]  # Assigning column names to the features
)

# Output the DataFrame containing the new interaction features
new_intr_features

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.35,0.49,0.1225,0.1715,0.2401
1,0.46,0.38,0.2116,0.1748,0.1444
2,0.25,0.48,0.0625,0.12,0.2304


## Feature Engineering on Categorical Data

Any categorical attribute or feature represents discrete values that fall within a specific, finite set of categories or classes. These category or class labels can be either text or numeric. Categorical variables are typically divided into two types: **nominal** and **ordinal**.

### Transforming Nominal Features

Nominal features or attributes are categorical variables with a finite set of distinct discrete values. These values are often represented as strings or text, which Machine Learning algorithms cannot process directly. As a result, it is usually necessary to transform these features into a numeric format that algorithms can interpret.

In [101]:
# Display the first 11 rows of the dataframe 'ecoli_df'
ecoli_df.head(11)  # This will show the first 11 entries in the dataframe, including column names

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


The dataset in this dataframe describes various attributes of E. coli proteins. The `site` column, which holds the protein localization site, is a nominal categorical variable.


In [102]:
# Extract unique values from the "site" column in the ecoli_df DataFrame
sites = np.unique(ecoli_df["site"])

# Display the unique site values
sites

array(['cp', 'im', 'imL', 'imS', 'imU', 'om', 'omL', 'pp'], dtype=object)

This output tells us that the Ecoli dataset has 8 distinct sites.


In [103]:
# Initialize the LabelEncoder
sle = LabelEncoder()

# Encode the 'site' column in ecoli_df, converting categorical values into numeric labels
site_labels = sle.fit_transform(ecoli_df["site"])

# Create a dictionary mapping each numeric label back to its original category
site_mappings = {index: label for index, label in enumerate(sle.classes_)}

# Display the mapping of encoded values to original categories
site_mappings

{0: 'cp', 1: 'im', 2: 'imL', 3: 'imS', 4: 'imU', 5: 'om', 6: 'omL', 7: 'pp'}

A mapping scheme has been generated in which each site value is mapped to a number by the `LabelEncoder` object `sle`. The transformed labels are stored in `site_labels`.


In [104]:
# Assign the list/array 'site_labels' to a new column 'siteLabel' in the dataframe 'ecoli_df'
ecoli_df["siteLabel"] = site_labels  

# Display the first 11 rows of the dataframe to check the assigned labels
ecoli_df.head(11)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site,siteLabel
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im,1
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp,0
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS,3
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp,0
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om,5
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL,2
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL,6
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp,0
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU,4
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp,7


The `siteLabel` field shows the mapped numeric label for each site, and these values match the mappings generated earlier.
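A fitted `LabelEncoder` can also map the integer codes back to the original strings, which is handy when reporting predictions. The sketch below uses a handful of site values and a separate encoder (`le_demo`, a hypothetical name); because it sees only four classes, its codes differ from the full-dataset mapping above.

```python
from sklearn.preprocessing import LabelEncoder

# A few site values copied from the rows shown above
sites_sample = ["im", "cp", "imS", "cp", "om"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(sites_sample)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le_demo.classes_), list(codes))

# inverse_transform maps integer codes back to the original strings
restored = le_demo.inverse_transform(codes)
print(list(restored))
```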


### Transforming Ordinal Features

Ordinal features are similar to nominal features, but with one key difference: the order of values matters and is an inherent property that provides meaning to these features. Like nominal features, ordinal features may also be represented as text, requiring transformation into a numeric format for Machine Learning algorithms to interpret them effectively.

### Create Generation based on 'age'


In [105]:
# Convert the "age" column of the diabetes dataframe to a NumPy array
age = np.array(diabetes_df["age"])

# Create a new column "Generation" based on age groups using a lambda function
diabetes_df["Generation"] = diabetes_df["age"].apply(
    lambda value: (
        "Gen Z" if value <= 25  # Ages 25 and below belong to Generation Z
        else (
            "Millennials" if value <= 41  # Ages 26-41 belong to Millennials
            else (
                "Gen X" if value <= 57  # Ages 42-57 belong to Generation X
                else (
                    "Boomers II" if value <= 67  # Ages 58-67 belong to Boomers II
                    else (
                        "Boomers I" if value <= 76  # Ages 68-76 belong to Boomers I
                        else (
                            "Post WWII" if value <= 94  # Ages 77-94 belong to Post-WWII generation
                            else "WWII"  # Ages 95+ belong to WWII generation
                        )
                    )
                )
            )
        )
    )
)

# Display the first 10 rows of the "age" and "Generation" columns
diabetes_df[["age", "Generation"]].head(10)

Unnamed: 0,age,Generation
0,50,Gen X
1,31,Millennials
2,32,Millennials
3,21,Gen Z
4,33,Millennials
5,30,Millennials
6,26,Millennials
7,29,Millennials
8,53,Gen X
9,54,Gen X
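The nested conditional above can also be expressed as binning. The sketch below is an alternative using `pd.cut` on a few ages copied from the output; the bin edges mirror the thresholds in the lambda, and the names `edges` and `labels` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Ages copied from the first rows shown above
ages = pd.Series([50, 31, 32, 21, 33, 26, 53])

# Upper edges mirror the lambda's thresholds; by default intervals are (lower, upper]
edges = [-np.inf, 25, 41, 57, 67, 76, 94, np.inf]
labels = ["Gen Z", "Millennials", "Gen X", "Boomers II", "Boomers I", "Post WWII", "WWII"]

generation = pd.cut(ages, bins=edges, labels=labels)
print(generation.tolist())
```

Because the intervals are closed on the right, an age of exactly 25 falls in "Gen Z", matching the `<=` comparisons in the lambda.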


In [106]:
# Get unique values from the "Generation" column of the diabetes_df DataFrame
unique_generations = np.unique(diabetes_df["Generation"])

# Display the unique generation values
print(unique_generations)

['Boomers I' 'Boomers II' 'Gen X' 'Gen Z' 'Millennials' 'Post WWII']


From this output, we can observe that there are six distinct generations of people. This attribute is clearly ordinal, as the generations have a natural sense of order.

However, a generic encoder such as `LabelEncoder` assigns integers in alphabetical order, which does not match this natural order. We therefore define the mapping explicitly, as shown in the following code snippet.

In [107]:
# Define a dictionary to map generation labels to ordinal values
gen_ord_map = {
    "Gen Z": 1,         # Youngest generation in the dataset
    "Millennials": 2,   # Followed by Millennials
    "Gen X": 3,         # Middle-aged generation
    "Boomers II": 4,    # Late Boomers
    "Boomers I": 5,     # Early Boomers
    "Post WWII": 6,     # Oldest generation in the dataset
}

# Map the 'Generation' column in the diabetes dataset to corresponding ordinal values
diabetes_df["GenerationLabel"] = diabetes_df["Generation"].map(gen_ord_map)

# Display selected columns (age, original generation label, and mapped generation label)
# for rows 4 to 9 (since slicing is exclusive of the end index)
diabetes_df[["age", "Generation", "GenerationLabel"]].iloc[4:10]

Unnamed: 0,age,Generation,GenerationLabel
4,33,Millennials,2
5,30,Millennials,2
6,26,Millennials,2
7,29,Millennials,2
8,53,Gen X,3
9,54,Gen X,3
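As an alternative to a hand-written dictionary, the order can be declared once in an ordered pandas categorical. The sketch below uses a few made-up generation values; the `+ 1` shifts the zero-based codes so that they line up with `gen_ord_map`. scikit-learn's `OrdinalEncoder` accepts an explicit `categories` list for the same purpose.

```python
import pandas as pd

# Generation order from youngest to oldest, as in gen_ord_map
order = ["Gen Z", "Millennials", "Gen X", "Boomers II", "Boomers I", "Post WWII"]

sample = pd.Series(["Millennials", "Gen X", "Gen Z", "Post WWII"])

# An ordered categorical stores the order once; codes start at 0, so add 1
as_ordered = pd.Categorical(sample, categories=order, ordered=True)
ordinal = pd.Series(as_ordered.codes + 1, index=sample.index)
print(ordinal.tolist())  # → [2, 3, 1, 6]
```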


### Create BMI Class based on 'bmi'


In [108]:
# Extract the 'bmi' column from the diabetes DataFrame and store it as a numpy array
bmi = np.array(diabetes_df["bmi"])

# Create a new 'BMI' column in the dataframe by applying a function to the 'bmi' values
diabetes_df["BMI"] = diabetes_df["bmi"].apply(
    lambda value: (
        "Underweight"  # BMI up to 18.5 (rows with a bmi of 0.0 also land here)
        if value <= 18.5
        else (
            "Normal"  # BMI above 18.5 and up to 22.9
            if value <= 22.9
            else (
                "Pre-obese"  # BMI above 22.9 and up to 24.9
                if value <= 24.9
                else (
                    "Class I obesity"  # BMI above 24.9 and up to 29.9
                    if value <= 29.9
                    else "Class II obesity"  # BMI above 29.9; no higher class is defined here
                )
            )
        )
    )
)

# Display the first 10 rows of 'bmi' and the newly created 'BMI' column for review
diabetes_df[["bmi", "BMI"]].head(10)

Unnamed: 0,bmi,BMI
0,33.6,Class II obesity
1,26.6,Class I obesity
2,23.3,Pre-obese
3,28.1,Class I obesity
4,43.1,Class II obesity
5,25.6,Class I obesity
6,31.0,Class II obesity
7,35.3,Class II obesity
8,30.5,Class II obesity
9,0.0,Underweight
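Row 9 has a `bmi` of 0.0 and is classified as "Underweight". The sketch below assumes that such zeros are placeholders for missing measurements (an assumption, since the dataset description only states that missing values exist) and shows one way to keep them out of the classes. It uses `pd.cut` with the same thresholds as the lambda; `edges` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

# BMI values copied from the rows shown above, including the 0.0 entry
bmi_sample = pd.Series([33.6, 26.6, 23.3, 0.0])

# Assumption: 0.0 marks a missing measurement, not a real BMI
bmi_clean = bmi_sample.replace(0.0, np.nan)

# Same thresholds as the lambda; NaN inputs stay NaN instead of becoming "Underweight"
edges = [-np.inf, 18.5, 22.9, 24.9, 29.9, np.inf]
labels = ["Underweight", "Normal", "Pre-obese", "Class I obesity", "Class II obesity"]
bmi_class = pd.cut(bmi_clean, bins=edges, labels=labels)
print(bmi_class.tolist())
```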


In [109]:
# Get the unique values in the 'BMI' column of the diabetes_df DataFrame
# 'diabetes_df["BMI"]' selects the BMI column from the dataframe
# np.unique() returns the sorted unique values in the specified array
unique_bmi_values = np.unique(diabetes_df["BMI"])

# Output the unique BMI values
print(unique_bmi_values)

['Class I obesity' 'Class II obesity' 'Normal' 'Pre-obese' 'Underweight']


From this output, we can observe that there are five distinct BMI classes. This attribute is clearly ordinal, as the classes follow a natural order.

However, a generic encoder such as `LabelEncoder` assigns integers in alphabetical order, which does not match this natural order. We therefore define the mapping explicitly, as shown in the following code snippet.

In [110]:
# Dictionary mapping BMI categories to ordinal values
bmi_ord_map = {
    "Underweight": 1,       # "Underweight" corresponds to 1
    "Normal": 2,            # "Normal" corresponds to 2
    "Pre-obese": 3,         # "Pre-obese" corresponds to 3
    "Class I obesity": 4,   # "Class I obesity" corresponds to 4
    "Class II obesity": 5,  # "Class II obesity" corresponds to 5
}

# Map the 'BMI' column in diabetes_df to its corresponding ordinal value using the bmi_ord_map dictionary
diabetes_df["BMILabel"] = diabetes_df["BMI"].map(bmi_ord_map)

# Display a subset of the dataframe (rows 4 to 9) showing 'bmi', 'BMI', and 'BMILabel' columns
diabetes_df[["bmi", "BMI", "BMILabel"]].iloc[4:10]

Unnamed: 0,bmi,BMI,BMILabel
4,43.1,Class II obesity,5
5,25.6,Class I obesity,4
6,31.0,Class II obesity,5
7,35.3,Class II obesity,5
8,30.5,Class II obesity,5
9,0.0,Underweight,1


The `BMILabel` column now holds the ordinal value of each BMI class, and the values match the `bmi_ord_map` dictionary defined above.

## Encoding Categorical Features

If we feed these integer representations of categorical features directly into an algorithm, the model will treat them as raw numeric features. For nominal features in particular, this introduces a false notion of order and magnitude, because the integers were assigned arbitrarily.

Models built directly on such features can therefore perform poorly. To address this, several strategies exist for creating dummy features, where each unique value or label from the distinct categories is represented separately. In the following sections, we will explore some of these strategies, including **one-hot encoding**, **dummy coding**, **effect coding**, and **feature hashing schemes**.


### One Hot Encoding Scheme

For a categorical feature with **m** unique labels, the one-hot encoding scheme transforms the feature into **m** binary features, each of which can only take a value of **1** or **0**. Each observation in the categorical feature is converted into a vector of size **m**, where only one element is **1** (indicating the active category) and the rest are **0**.
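The definition can be checked by hand on a tiny example. The sketch below builds one-hot vectors with plain NumPy for a made-up feature with m = 3 categories; it only illustrates the scheme and is separate from the scikit-learn encoder used in the cells that follow.

```python
import numpy as np

# Made-up labels for a feature with m = 3 unique categories
labels = np.array(["Gen X", "Gen Z", "Millennials", "Gen Z"])

# np.unique returns the sorted categories and, with return_inverse, each label's index
categories, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of that observation's category.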


In [111]:
# Select a subset of the DataFrame 'diabetes_df' with specific columns
# - 'diabetes': The target variable indicating whether the person has diabetes (1 or 0)
# - 'Generation': A categorical variable for the generation group (e.g., Gen X, Millennials)
# - 'BMI': A categorical variable for the BMI class (e.g., Normal, Pre-obese)

# Use 'iloc' to select rows at positions 4 through 9 (position 4 is included, position 10 is excluded)
diabetes_df[["diabetes", "Generation", "BMI"]].iloc[4:10]

Unnamed: 0,diabetes,Generation,BMI
4,1,Millennials,Class II obesity
5,0,Millennials,Class I obesity
6,1,Millennials,Class II obesity
7,0,Millennials,Class II obesity
8,1,Gen X,Class II obesity
9,1,Gen X,Underweight


In [112]:
# Initialize the LabelEncoder for "Generation"
gen_le = LabelEncoder()

# Fit the LabelEncoder on the "Generation" column and transform it into numerical labels
gen_labels = gen_le.fit_transform(diabetes_df["Generation"])

# Add the transformed "Generation" labels as a new column in the dataframe
diabetes_df["Gen_Label"] = gen_labels

# Initialize the LabelEncoder for "BMI"
bmi_le = LabelEncoder()

# Fit the LabelEncoder on the "BMI" column and transform it into numerical labels
bmi_labels = bmi_le.fit_transform(diabetes_df["BMI"])

# Add the transformed "BMI" labels as a new column in the dataframe
diabetes_df["BMI_Label"] = bmi_labels

# Create a new dataframe subset with only relevant columns: 
# "diabetes" (target variable), "Generation", "Gen_Label", "BMI", and "BMI_Label"
diabetes_df_sub = diabetes_df[
    ["diabetes", "Generation", "Gen_Label", "BMI", "BMI_Label"]
]

# Display rows 4 to 9 (5th to 10th) from the new dataframe subset
diabetes_df_sub.iloc[4:10]

Unnamed: 0,diabetes,Generation,Gen_Label,BMI,BMI_Label
4,1,Millennials,4,Class II obesity,1
5,0,Millennials,4,Class I obesity,0
6,1,Millennials,4,Class II obesity,1
7,0,Millennials,4,Class II obesity,1
8,1,Gen X,2,Class II obesity,1
9,1,Gen X,2,Underweight,4


In [113]:
# Encode generation labels using a one-hot encoding scheme
# Initialize the OneHotEncoder for 'Gen_Label'
gen_ohe = OneHotEncoder()
# Apply the encoder to the "Gen_Label" column and convert the result into an array
gen_feature_arr = gen_ohe.fit_transform(diabetes_df[["Gen_Label"]]).toarray()
# Extract the unique categories from the 'Gen_Label' encoding and convert to a list
gen_feature_labels = list(gen_ohe.categories_[0])
# Create a DataFrame with the encoded features, with the appropriate column labels
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# Encode BMI labels using a one-hot encoding scheme
# Initialize the OneHotEncoder for 'BMI_Label'
bmi_ohe = OneHotEncoder()
# Apply the encoder to the "BMI_Label" column and convert the result into an array
bmi_feature_arr = bmi_ohe.fit_transform(diabetes_df[["BMI_Label"]]).toarray()
# Create BMI feature labels by prepending "BMI_" to the class labels of the BMI categories
bmi_feature_labels = ["BMI_" + str(cls_label) for cls_label in bmi_ohe.categories_[0]]
# Create a DataFrame with the encoded BMI features, with the appropriate column labels
bmi_features = pd.DataFrame(bmi_feature_arr, columns=bmi_feature_labels)

In [114]:
# Concatenate the dataframes: diabetes_df_sub, gen_features, and bmi_features along columns (axis=1)
# This will combine the features from these different sources into a single dataframe
diabetes_df_ohe = pd.concat([diabetes_df_sub, gen_features, bmi_features], axis=1)

# Create the column names list by combining predefined column labels and the feature labels
# "diabetes", "Generation", and "Gen_Label" are predefined columns
# gen_feature_labels and bmi_feature_labels are dynamically created lists based on features
columns = sum(
    [
        ["diabetes", "Generation", "Gen_Label"],  # Predefined column names
        gen_feature_labels,  # One-hot feature labels for the Generation categories
        ["BMI", "BMI_Label"],  # Predefined BMI-related column names
        bmi_feature_labels,  # Feature labels for BMI-related data
    ],
    [],  # Flatten the list of lists into a single list
)

# Display a slice (rows 4 to 9) of the concatenated dataframe with the newly created columns
diabetes_df_ohe[columns].iloc[4:10]

Unnamed: 0,diabetes,Generation,Gen_Label,0,1,2,3,4,5,BMI,BMI_Label,BMI_0,BMI_1,BMI_2,BMI_3,BMI_4
4,1,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
5,0,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class I obesity,0,1.0,0.0,0.0,0.0,0.0
6,1,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
7,0,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
8,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
9,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Underweight,4,0.0,0.0,0.0,0.0,1.0


The output shows the new one-hot encoded features for **Gen_Label** (columns 0 to 5) and **BMI_Label** (columns BMI_0 to BMI_4). Each of these features is binary: it takes the value **1** when the corresponding category is active for an observation and **0** otherwise, so exactly one feature in each group equals 1 in every row.
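
The one-hot property can be checked directly. The following minimal sketch uses a toy `Generation` column (an assumption for illustration, not the notebook's dataset) and relies on the fact that recent scikit-learn versions let `OneHotEncoder` accept string columns without a prior `LabelEncoder` step. It confirms that every row has exactly one active feature and that the original labels can be recovered.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only.
toy = pd.DataFrame({"Generation": ["Millennials", "Gen X", "Millennials", "Boomers II"]})

# Fit the encoder on the string column and convert the sparse result to a dense array.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(toy[["Generation"]]).toarray()

# Each row should contain exactly one 1, so every row sums to 1.
row_sums = encoded.sum(axis=1)

# inverse_transform maps the one-hot rows back to the original labels.
recovered = ohe.inverse_transform(encoded).ravel()

print(list(ohe.categories_[0]))
print(encoded)
print(row_sums, recovered)
```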

In [115]:
# Create a small example DataFrame with two data points representing new diabetes cases.
new_diabetes_df = pd.DataFrame(
    # Data: A list of lists, where each list represents a data point
    [["1", "Gen X", "Pre-obese"],  # First data point: diabetes (1), Generation (Gen X), BMI (Pre-obese)
     ["0", "Boomers II", "Class I obesity"]],  # Second data point: diabetes (0), Generation (Boomers II), BMI (Class I obesity)
    
    # Columns: Names of the columns for the DataFrame
    columns=["diabetes", "Generation", "BMI"],  # Define the names for each column (diabetes, Generation, BMI)
)

# Display the DataFrame to show the created data points
new_diabetes_df

Unnamed: 0,diabetes,Generation,BMI
0,1,Gen X,Pre-obese
1,0,Boomers II,Class I obesity


In [116]:
# Converting the text categories into numeric representations using our previously built LabelEncoder objects

# Transforming the 'Generation' column values to numeric using the previously fitted LabelEncoder (gen_le)
new_gen_labels = gen_le.transform(new_diabetes_df["Generation"])

# Adding the transformed numeric labels as a new column called "Gen_Label" in the DataFrame
new_diabetes_df["Gen_Label"] = new_gen_labels

# Transforming the 'BMI' column values to numeric using the previously fitted LabelEncoder (bmi_le)
new_bmi_labels = bmi_le.transform(new_diabetes_df["BMI"])

# Adding the transformed numeric labels as a new column called "BMI_Label" in the DataFrame
new_diabetes_df["BMI_Label"] = new_bmi_labels

# Displaying the relevant columns to inspect the new encoded labels
new_diabetes_df[["diabetes", "Generation", "Gen_Label", "BMI", "BMI_Label"]]

Unnamed: 0,diabetes,Generation,Gen_Label,BMI,BMI_Label
0,1,Gen X,2,Pre-obese,3
1,0,Boomers II,1,Class I obesity,0


In [117]:
# Transform the 'Gen_Label' column into one-hot encoded features using the previously fitted OneHotEncoder
# 'gen_ohe' was fitted on the 'Gen_Label' column of the original dataframe.
new_gen_feature_arr = gen_ohe.transform(new_diabetes_df[["Gen_Label"]]).toarray()  
# Convert the resulting array into a DataFrame with appropriate column names from 'gen_feature_labels'
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)  

# Transform the 'BMI_Label' column into one-hot encoded features using the previously fitted OneHotEncoder
# 'bmi_ohe' was fitted on the 'BMI_Label' column of the original dataframe.
new_bmi_feature_arr = bmi_ohe.transform(new_diabetes_df[["BMI_Label"]]).toarray()  
# Convert the resulting array into a DataFrame with appropriate column names from 'bmi_feature_labels'
new_bmi_features = pd.DataFrame(new_bmi_feature_arr, columns=bmi_feature_labels)  

# Concatenate the original dataframe 'new_diabetes_df' with the newly generated one-hot encoded features
# This will add the new columns from 'new_gen_features' and 'new_bmi_features' to the original data
new_diabetes_ohe = pd.concat(
    [new_diabetes_df, new_gen_features, new_bmi_features], axis=1
)

# Define the desired column order, starting with diabetes-related columns, then adding one-hot encoded features
columns = sum(
    [
        ["diabetes", "Generation", "Gen_Label"],  # The original columns from the dataset
        gen_feature_labels,  # Columns generated from one-hot encoding of 'Gen_Label'
        ["BMI", "BMI_Label"],  # The BMI-related columns
        bmi_feature_labels,  # Columns generated from one-hot encoding of 'BMI_Label'
    ],
    [],
)

# Display the dataframe with the new column order
new_diabetes_ohe[columns]

Unnamed: 0,diabetes,Generation,Gen_Label,0,1,2,3,4,5,BMI,BMI_Label,BMI_0,BMI_1,BMI_2,BMI_3,BMI_4
0,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Pre-obese,3,0.0,0.0,0.0,1.0,0.0
1,0,Boomers II,1,0.0,1.0,0.0,0.0,0.0,0.0,Class I obesity,0,1.0,0.0,0.0,0.0,0.0
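
Reusing fitted encoders works here because both new rows contain categories seen during fitting. With an unseen category, `LabelEncoder.transform` raises a `ValueError`. The sketch below illustrates one possible safeguard, `OneHotEncoder(handle_unknown="ignore")`, which encodes an unseen category as a row of all zeros. The toy data and the label "Gen Alpha" are hypothetical and are not part of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy training data and new data; "Gen Alpha" is a hypothetical unseen category.
train = pd.DataFrame({"Generation": ["Gen X", "Millennials", "Boomers II"]})
new = pd.DataFrame({"Generation": ["Gen X", "Gen Alpha"]})

# LabelEncoder cannot transform labels it has not seen during fit.
le = LabelEncoder().fit(train["Generation"])
try:
    le.transform(new["Generation"])
    label_encoder_failed = False
except ValueError:
    label_encoder_failed = True

# With handle_unknown="ignore", unseen categories become all-zero rows.
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[["Generation"]])
new_encoded = ohe.transform(new[["Generation"]]).toarray()

print(label_encoder_failed)
print(new_encoded)
```

Whether an all-zeros row is an acceptable representation depends on the model and the problem; in some settings, failing loudly on unseen categories is the safer choice.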


In [118]:
# Pandas provides the 'get_dummies()' function that can help us easily perform one-hot encoding.
# It converts a categorical column into multiple binary columns, each representing a category in the original column.
gen_onehot_features = pd.get_dummies(diabetes_df["Generation"])

# Concatenate the original dataframe with the one-hot encoded columns, while keeping the "diabetes" and "Generation" columns.
# We use `axis=1` to concatenate along columns (horizontally).
# The 'iloc[4:10]' selects rows 4 through 9 (i.e., 6 rows) from the resulting dataframe.
# This gives a glimpse of the encoded features for a specific subset of rows.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_onehot_features], axis=1).iloc[
    4:10
]

Unnamed: 0,diabetes,Generation,Boomers I,Boomers II,Gen X,Gen Z,Millennials,Post WWII
4,1,Millennials,False,False,False,False,True,False
5,0,Millennials,False,False,False,False,True,False
6,1,Millennials,False,False,False,False,True,False
7,0,Millennials,False,False,False,False,True,False
8,1,Gen X,False,False,True,False,False,False
9,1,Gen X,False,False,True,False,False,False
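
As the output above shows, `get_dummies()` returns boolean columns here. The following minimal sketch, on toy data assumed for illustration, uses the `columns`, `prefix`, and `dtype` parameters of `pd.get_dummies` to encode a column in place, add a prefix to the new column names, and obtain 0/1 integers instead of booleans.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {"diabetes": [1, 0, 1], "Generation": ["Millennials", "Gen X", "Millennials"]}
)

# Encode "Generation" in place: the original column is replaced by prefixed 0/1 columns.
toy_ohe = pd.get_dummies(toy, columns=["Generation"], prefix="Gen", dtype=int)

print(toy_ohe)
```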


### Dummy Coding Scheme

The dummy coding scheme is similar to one-hot encoding, with one key difference: when applied to a categorical feature with **m** distinct labels, it generates **m-1** binary features, so each value of the categorical variable is converted into a vector of size **m-1**. The feature for one category is omitted entirely, and that category is instead represented by a vector of all zeros (**0**). If the category values range over {**0, 1, ..., m-1**}, the omitted category is typically the 0th or the (m-1)th.
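
As a minimal sketch of this idea in scikit-learn, the `drop="first"` option of `OneHotEncoder` omits the first category, which then appears as an all-zeros row. The toy data below is an assumption for illustration, with m = 3 categories and therefore m-1 = 2 output features.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with m = 3 distinct categories, one row per category.
toy = pd.DataFrame({"Generation": ["Boomers I", "Gen X", "Millennials"]})

# drop="first" removes the first (alphabetically sorted) category: "Boomers I".
dummy_enc = OneHotEncoder(drop="first")
dummy_arr = dummy_enc.fit_transform(toy[["Generation"]]).toarray()

# The remaining m-1 categories label the output columns.
kept_labels = list(dummy_enc.categories_[0][1:])

print(pd.DataFrame(dummy_arr, columns=kept_labels, index=toy["Generation"]))
```

The row for "Boomers I" contains only zeros, while each of the other two categories has a single 1 in its own column.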

In [119]:
# Create dummy-coded variables for the "Generation" column in the diabetes dataset.
# drop_first=True drops the first category (Boomers I), which becomes the all-zeros baseline
# and avoids the dummy variable trap (multicollinearity).
gen_dummy_features = pd.get_dummies(diabetes_df["Generation"], drop_first=True)

# Concatenate the original "diabetes" and "Generation" columns with the newly created dummy variables.
# Select and display rows 4 to 9 for inspection.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,diabetes,Generation,Boomers II,Gen X,Gen Z,Millennials,Post WWII
4,1,Millennials,False,False,False,True,False
5,0,Millennials,False,False,False,True,False
6,1,Millennials,False,False,False,True,False
7,0,Millennials,False,False,False,True,False
8,1,Gen X,False,True,False,False,False
9,1,Gen X,False,True,False,False,False


In [120]:
# Perform one-hot encoding on the "Generation" column, creating binary features for each unique category.
gen_onehot_features = pd.get_dummies(diabetes_df["Generation"])

# Drop the last column (Post WWII) instead of the first to avoid the dummy variable trap (multicollinearity issue).
# The dropped category becomes the all-zeros baseline, so no information is lost.
gen_dummy_features = gen_onehot_features.iloc[:, :-1]

# Concatenate the original "diabetes" and "Generation" columns with the encoded dummy variables.
# Then, display rows 4 to 9 of the resulting DataFrame.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,diabetes,Generation,Boomers I,Boomers II,Gen X,Gen Z,Millennials
4,1,Millennials,False,False,False,False,True
5,0,Millennials,False,False,False,False,True
6,1,Millennials,False,False,False,False,True
7,0,Millennials,False,False,False,False,True
8,1,Gen X,False,False,True,False,False
9,1,Gen X,False,False,True,False,False
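
The dummy variable trap mentioned in the comments can be made concrete with a rank check. In the sketch below, which uses toy data assumed for illustration, the full one-hot columns always sum to 1 and are therefore linearly dependent on an intercept column, whereas the dummy-coded columns are not.

```python
import numpy as np
import pandas as pd

# Toy data with three distinct categories.
gen = pd.Series(["Gen X", "Millennials", "Boomers II", "Gen X", "Millennials"])

onehot = pd.get_dummies(gen, dtype=float)                  # m = 3 columns
dummy = pd.get_dummies(gen, drop_first=True, dtype=float)  # m - 1 = 2 columns

# Design matrices with an intercept column of ones, as in a linear model.
intercept = np.ones((len(gen), 1))
X_onehot = np.hstack([intercept, onehot.to_numpy()])
X_dummy = np.hstack([intercept, dummy.to_numpy()])

# A rank below the number of columns means the columns are linearly dependent.
rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(X_onehot.shape[1], rank_onehot)
print(X_dummy.shape[1], rank_dummy)
```

The one-hot design matrix has four columns but rank three, while the dummy-coded matrix has three columns and full rank. This matters mainly for models with an intercept and no regularization, such as ordinary least squares.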


## Conclusion

Through this module, we learned essential feature engineering techniques for both numeric and categorical data, including:

- Converting raw data into machine learning-ready features
- Applying appropriate transformations based on data type
- Understanding and implementing different encoding schemes
- Creating meaningful feature interactions
- Handling both nominal and ordinal categorical variables

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
