# Feature Engineering Exercise (Solution)

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

Feature engineering is a crucial step in developing effective Machine Learning systems, blending domain expertise with mathematical transformations. It focuses on processing diverse data types and variables, with each Machine Learning problem demanding tailored feature engineering strategies. This module explores techniques for engineering both **numeric** and **categorical** features.

## Learning Objectives

- Transform and engineer numeric features
  - Apply raw measures and counts
  - Implement binarization techniques
  - Perform rounding operations
  - Create feature interactions
- Transform and engineer categorical features
  - Convert nominal features to numeric representations
  - Transform ordinal features with preserved ordering
  - Apply encoding schemes for categorical data
    - One Hot Encoding
    - Dummy Coding

### Tasks to complete

- Implement numeric feature engineering techniques
- Transform categorical variables
- Apply various encoding schemes
- Analyze transformed features

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

- Select the "conda_python3" kernel in your SageMaker notebook instance.

### Import necessary libraries


In [None]:
# Import necessary dependencies

# Matplotlib for plotting and visualization
import matplotlib as mpl
import matplotlib.pyplot as plt

# NumPy for numerical operations and array manipulations
import numpy as np

# Pandas for data manipulation and analysis
import pandas as pd

# SciPy statistical functions for advanced statistical analysis
import scipy.stats as spstats

# Scikit-learn preprocessing tools for data transformation
from sklearn.preprocessing import (
    Binarizer,           # Converts numerical values into binary (0 or 1) based on a threshold
    LabelEncoder,        # Encodes categorical labels as integers (useful for classification tasks)
    OneHotEncoder,       # Encodes categorical variables as one-hot (dummy) variables
    PolynomialFeatures,  # Generates polynomial features for regression models
)


# Enable inline plotting in Jupyter Notebook
%matplotlib inline

# Reload Matplotlib's style library to ensure the latest settings are applied
mpl.style.reload_library()

# Set the Matplotlib style to "classic" for a traditional look
mpl.style.use("classic")

# Set the background color of figures to transparent (white with 0 alpha)
mpl.rcParams["figure.facecolor"] = (1, 1, 1, 0)

# Define the default figure size as 6 inches by 4 inches
mpl.rcParams["figure.figsize"] = [6.0, 4.0]

# Set the figure resolution to 100 dots per inch (DPI) for better clarity
mpl.rcParams["figure.dpi"] = 100

## Feature Engineering on Numeric Data


While machine learning algorithms can process raw numerical data directly, effective modeling typically requires deliberate feature engineering to create meaningful representations aligned with the problem domain. For numerical features, two properties demand attention: scale (the relative magnitude of values) and distribution (the underlying statistical shape). Scaling keeps features with large ranges from dominating distance-based or margin-based methods (e.g., k-NN, SVM), while distribution adjustments, such as reducing skew with log or power transforms, can help models whose assumptions are better met by roughly symmetric data (e.g., linear regression, where strongly skewed variables often lead to non-Gaussian residuals). These transformations are domain-specific design choices rather than mere algorithmic prerequisites; for instance, financial models may intentionally preserve scale for interpretability, whereas image processing pipelines routinely normalize pixel intensities. The art of feature engineering lies in balancing mathematical soundness with problem context to extract as much signal as possible from numerical data.
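
As a minimal sketch of the two adjustments described above, the snippet below standardizes a feature and applies a log transform to it. The `income` values are hypothetical and chosen only to show a right-skewed variable; they are not part of the datasets used in this exercise.

```python
import numpy as np

# Hypothetical right-skewed feature (illustrative values only)
income = np.array([20_000.0, 35_000.0, 50_000.0, 80_000.0, 1_000_000.0])

# Scale: standardize to zero mean and unit standard deviation
income_std = (income - income.mean()) / income.std()

# Distribution: log1p computes log(1 + x) and compresses the long right tail
income_log = np.log1p(income)

print(income_std.round(2))
print(income_log.round(2))
```

Whether either step helps depends on the model and the domain, so treat both as options to evaluate, not as mandatory preprocessing.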

### Raw Measures

Raw measures constitute the most fundamental form of feature representation, where numeric variables are incorporated into machine learning models in their original, untransformed state. These features preserve the exact values as recorded in the source data - whether as continuous measurements, discrete counts, or absolute quantities - without undergoing normalization, scaling, or other engineering processes. While this approach maintains maximum fidelity to the initial data collection, it may introduce challenges when variables operate on vastly different scales or units, potentially biasing algorithms sensitive to feature magnitudes. The use of raw measures is particularly common in domains requiring strict interpretability of input variables, or when the native scales themselves carry meaningful information for the predictive task.

#### Values

Scalar values in their raw form represent individual measurements, metrics, or observations tied to specific variables, where the meaning and context of each value are typically inferred from the field name or, when available, through reference to a comprehensive data dictionary that provides formal definitions, units of measurement, and other relevant metadata.

### Ecoli Dataset

The Ecoli dataset is used to predict protein localization sites in E. coli.

```
Number of Instances:  336
Number of Attributes: 8 ( 7 predictive, 1 name )
Attribute Information.
  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score (Binary attribute).
  5. chg: Presence of charge on N-terminus of predicted lipoproteins (Binary attribute).
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
Missing Attribute Values: None.
Class Distribution. The class is the localization site.
  cp  (cytoplasm)                                    143
  im  (inner membrane without signal sequence)        77
  pp  (periplasm)                                     52
  imU (inner membrane, uncleavable signal sequence)   35
  om  (outer membrane)                                20
  omL (outer membrane lipoprotein)                     5
  imL (inner membrane lipoprotein)                     2
  imS (inner membrane, cleavable signal sequence)      2
```

You can learn more about the dataset here:

- Ecoli Dataset ([ecoli.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
- Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))


In [None]:
# Define the file path to the Ecoli dataset (relative path)
ecoli_data = "../../Data/ecoli.csv"

# Load the dataset into a Pandas DataFrame
ecoli_df = pd.read_csv(ecoli_data)

# Display the first 10 rows of the dataset to inspect its structure
ecoli_df.head(10)

In [None]:
# Display the first few rows of the selected feature columns ("mcg", "gvh", "chg") 
# from the ecoli dataset
ecoli_df[["mcg", "gvh", "chg"]].head()

In [None]:
# Compute basic statistical measures (count, mean, std, min, max, and quartiles)
# for the numerical columns 'mcg', 'gvh', and 'chg' in the DataFrame 'ecoli_df'
ecoli_df[["mcg", "gvh", "chg"]].describe()

### Counts

Numeric variables often directly encode quantitative measurements of events or characteristics, serving as fundamental representations of counts (e.g., customer transactions), frequencies (e.g., word occurrences in documents), or binary occurrences (e.g., presence/absence of symptoms). These raw numerical measures provide objective, machine-readable data that capture discrete phenomena without requiring transformation. However, their interpretation often requires contextual understanding - a count of '5' might represent 5 products purchased (a meaningful magnitude) or simply a binary 'yes' coded as 1 (where only presence matters). Proper documentation of what these numbers represent is essential for accurate analysis, as the same numerical format can convey fundamentally different types of information depending on the underlying measurement paradigm.


### Diabetes Dataset

The dataset classifies patient data as
either an onset of diabetes within five years or not.

```
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))


In [None]:
# Load Diabetes dataset from a CSV file
diabetes_data = "../../Data/pima-indians-diabetes.csv"

# Read the CSV file into a pandas DataFrame, specifying no header row (header=None)
diabetes_df = pd.read_csv(diabetes_data, header=None)

# Assign column names to the dataset for better readability
diabetes_df.columns = [
    "pregnancy",  # Number of times pregnant
    "glucose",    # Plasma glucose concentration
    "bp",         # Diastolic blood pressure (mm Hg)
    "triceps",    # Triceps skinfold thickness (mm)
    "insulin",    # 2-Hour serum insulin (mu U/ml)
    "bmi",        # Body Mass Index (weight in kg/(height in m)^2)
    "pedigree",   # Diabetes pedigree function (genetic risk factor)
    "age",        # Age in years
    "diabetes",   # Diabetes diagnosis (1 = positive, 0 = negative)
]

# Display the first 10 rows of the dataset
diabetes_df.head(10)

In [None]:
diabetes_df.describe()

### Binarization

Binarization is a preprocessing technique that converts numerical values into binary outputs (0 or 1) based on a specified threshold. The two cells below apply it to the 'age' feature with 50 as the decision boundary, first manually with NumPy (column 'old') and then with scikit-learn's Binarizer (column 'bn_old'):
* Ages ≤ 50 become 0 (representing "not old")
* Ages > 50 become 1 (representing "old")

Both approaches leave the original 'age' column unchanged and add the binary result as a new column.
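
The following sketch shows the same thresholding on a small, hypothetical DataFrame (`toy_df` stands in for the diabetes data). It passes a one-column DataFrame to Binarizer, which already has the 2D shape the transformer expects, and compares the result with a plain pandas comparison.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Hypothetical ages standing in for diabetes_df["age"]
toy_df = pd.DataFrame({"age": [21, 50, 51, 63]})

# Binarizer maps values strictly greater than the threshold to 1, all others to 0
binarizer = Binarizer(threshold=50)

# A one-column DataFrame has shape (n_samples, 1), so no extra wrapping is needed
toy_df["bn_old"] = binarizer.fit_transform(toy_df[["age"]]).ravel()

# Equivalent result with a boolean comparison in pandas
toy_df["old"] = (toy_df["age"] > 50).astype(int)

print(toy_df)
```

Note that an age of exactly 50 maps to 0 in both columns, because the comparison is strict.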

In [None]:
# Convert the 'age' column to a NumPy array for easier manipulation
age = np.array(diabetes_df["age"])

# Create a copy of the 'age' array to store the binarized values
old = np.array(diabetes_df["age"])

# Assign 1 to individuals older than 50
old[age > 50] = 1

# Assign 0 to individuals aged 50 or younger
old[age <= 50] = 0

# Add the binarized 'old' column back to the DataFrame
diabetes_df["old"] = old

# Display the first 10 rows of the updated DataFrame
diabetes_df.head(10)

In [None]:
# Binarize 'age' field using Binarizer
# This transformation converts numerical values into binary (0 or 1) based on a given threshold.

# Initialize the Binarizer with a threshold of 50
# Any age value greater than 50 will be mapped to 1, while 50 and below will be mapped to 0.
bn = Binarizer(threshold=50)

# Apply the transformation on the 'age' column of the diabetes dataset
# Note: `Binarizer.transform()` expects a 2D array, so we wrap the column inside a list.
bn_old = bn.transform([diabetes_df["age"]])[0]  

# Store the binarized values in a new column 'bn_old' in the DataFrame
diabetes_df["bn_old"] = bn_old

# Display the first 10 rows of the updated DataFrame to verify the transformation
diabetes_df.head(10)

### Rounding

For numeric attributes representing proportions or percentages, excessive precision frequently offers diminishing returns. A pragmatic approach involves rounding these values to whole numbers, which serves dual purposes: the simplified integers can function either as (1) streamlined continuous variables that reduce computational noise or as (2) discrete categorical features that capture meaningful value bands. This transformation not only improves data manageability but may also enhance model interpretability without significant loss of predictive power, particularly when the original decimal precision exceeds measurement accuracy or business requirements.
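
A small sketch of this idea, using hypothetical proportions in place of a real column: the values are scaled to a 0-10 or 0-100 range, rounded, and stored as integers.

```python
import numpy as np

# Hypothetical proportions between 0 and 1 (illustrative values only)
ratios = np.array([0.078, 0.351, 0.672, 0.167])

# Scale, round to the nearest integer, and cast to int
scale_10 = np.round(ratios * 10).astype(int)
scale_100 = np.round(ratios * 100).astype(int)

print(scale_10)   # [1 4 7 2]
print(scale_100)  # [ 8 35 67 17]
```

The coarser `scale_10` version behaves more like a set of value bands, while `scale_100` keeps more of the original resolution. Be aware that `np.round` rounds exact halves to the nearest even integer.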

In [None]:
# Create a new column 'pedigree_scale_10' by multiplying the 'pedigree' column by 10 
# and rounding the values to the nearest integer, then converting them to integers
diabetes_df["pedigree_scale_10"] = np.array(
    np.round((diabetes_df["pedigree"] * 10)), dtype="int"
)

# Create a new column 'pedigree_scale_100' by multiplying the 'pedigree' column by 100 
# and rounding the values to the nearest integer, then converting them to integers
diabetes_df["pedigree_scale_100"] = np.array(
    np.round((diabetes_df["pedigree"] * 100)), dtype="int"
)

# Display the updated DataFrame
diabetes_df

### Interactions

In practical machine learning applications, explicitly creating interaction terms between features can significantly enhance model performance by capturing synergistic relationships that individual variables alone cannot express. These engineered features often reveal hidden patterns in real-world data where predictors combine non-additively, particularly in domains like healthcare (drug interactions), finance (portfolio effects), or engineering (system synergies). While some algorithms like deep neural networks can implicitly learn interactions, creating explicit cross-product features or using techniques like polynomial expansion often improves interpretability and boosts simpler models' predictive power.
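
To make the idea concrete, here is a minimal sketch with two hypothetical features, `x1` and `x2`. It builds their cross-product by hand and checks it against the interaction column produced by scikit-learn's PolynomialFeatures with `interaction_only=True`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical features (illustrative values only)
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.0, 2.0]})

# Explicit interaction term: element-wise product of the two features
toy["x1_x_x2"] = toy["x1"] * toy["x2"]

# interaction_only=True yields x1, x2, and x1*x2 (no squared terms)
pf_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = pf_inter.fit_transform(toy[["x1", "x2"]])

# The third output column should match the hand-built product
same = np.allclose(expanded[:, 2], toy["x1_x_x2"])
print(same)
```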

In [None]:
# Select the "gvh" (von Heijne's signal sequence recognition score) and "lip" (Signal Peptidase II consensus sequence score) columns from the ecoli_df DataFrame
gvh_lip = ecoli_df[["gvh", "lip"]]

# Display the first 5 rows of the selected subset to inspect the data
gvh_lip.head()

In [None]:
# Create an instance of PolynomialFeatures with the following parameters:
# degree=2: Generate features up to the second degree (squared features and interactions).
# interaction_only=False: Allow both interaction terms (e.g., feature1*feature2) and polynomial terms (e.g., feature1^2).
# include_bias=False: Exclude the bias column (column of ones) that represents the intercept term.
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit the PolynomialFeatures transformer on the dataset (gvh_lip) and then transform it.
# This will generate new features that are combinations of the original features up to the specified degree.
res = pf.fit_transform(gvh_lip)

# Output the transformed feature set (the original features and their polynomial interactions).
res

After this transformation, the feature matrix has five columns: the two original variables plus three degree-2 terms (the square of each variable and their product). The product term captures a multiplicative relationship between gvh and lip that a purely additive model cannot express.

The powers_ attribute of the fitted transformer shows how each output column is built. It has one row per output feature and one column per input feature, and each entry is the exponent applied to that input. For example, a row of (1, 1) corresponds to gvh × lip, and a row of (2, 0) corresponds to gvh².
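
As a sketch of how the exponents stored in `powers_` relate to the generated columns, the example below fits PolynomialFeatures on a single hypothetical row with inputs named `a` and `b`, derives a readable name for each output column from the exponents, and recomputes the outputs directly from them.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two input features, a=2 and b=3
X = np.array([[2.0, 3.0]])
input_names = ["a", "b"]

pf_demo = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
out = pf_demo.fit_transform(X)

# Build a name for each output column from its row of exponents
names = []
for exponents in pf_demo.powers_:
    parts = [
        f"{name}^{power}" if power > 1 else name
        for name, power in zip(input_names, exponents)
        if power > 0
    ]
    names.append(" * ".join(parts))

# Each output is the product of the inputs raised to the listed exponents
recomputed = np.prod(X ** pf_demo.powers_, axis=1)

print(names)       # ['a', 'b', 'a^2', 'a * b', 'b^2']
print(out[0])      # [2. 3. 4. 6. 9.]
print(recomputed)  # [2. 3. 4. 6. 9.]
```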

In [None]:
# Convert the pf.powers_ attribute to a Pandas DataFrame
# pf.powers_ is a NumPy array of shape (n_output_features, n_input_features);
# entry [i, j] is the exponent of input feature j in output feature i.
# The columns are labeled 'gvh_degree' and 'lip_degree' for clarity.
pd.DataFrame(pf.powers_, columns=["gvh_degree", "lip_degree"])

With the feature degrees now clearly mapped to their actual representations, we can assign meaningful names to each feature, transforming the generic dataset into an interpretable feature set. This naming convention enables clearer analysis and domain-specific insights while preserving all computational properties of the data.

In [None]:
# Create a DataFrame from the NumPy array 'res' with descriptive column names
intr_features = pd.DataFrame(res, columns=["gvh", "lip", "gvh^2", "gvh x lip", "lip^2"])

# Display the first 5 rows of the DataFrame to preview the data
intr_features.head(5)

In [None]:
# Create a new DataFrame containing sample observations for 'gvh' and 'lip' 
new_df = pd.DataFrame(
    [[0.35, 0.49], [0.46, 0.38], [0.25, 0.48]], 
    columns=["gvh", "lip"]  # Define column names as 'gvh' and 'lip' 
)

# Display the newly created DataFrame
new_df

In [None]:
# Use the pf object that we created earlier to transform the input features 
# and generate interaction features from the new data (new_df).
new_res = pf.transform(new_df)

# Convert the resulting interaction features (new_res) into a DataFrame 
# with column names representing the specific interaction terms and transformations.
new_intr_features = pd.DataFrame(
    new_res,  # The transformed features
    columns=["gvh", "lip", "gvh^2", "gvh x lip", "lip^2"]  # Assigning column names to the features
)

# Output the DataFrame containing the new interaction features
new_intr_features

## Feature Engineering on Categorical Data

Categorical features represent discrete values from a finite set of possible categories, which may be expressed as either text labels or numeric codes. These variables are fundamentally classified into two types: **nominal** (unordered categories like colors or brands) and **ordinal** (ordered categories with inherent ranking like education levels or survey scales). The discrete nature of categorical data distinguishes it from continuous numerical values and requires specialized handling in statistical analysis and machine learning.

### Transforming Nominal Features

Nominal features represent categorical variables with distinct, non-ordinal values (e.g., colors or cities). Since machine learning algorithms require numerical inputs, these string-based categories must be encoded into numeric representations. Common transformation techniques include one-hot encoding for low-cardinality features and target encoding for high-cardinality variables, each preserving different aspects of the categorical information while making it algorithm-compatible.
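
Since one-hot encoding is covered later in this notebook, the sketch below illustrates only the target encoding idea mentioned above, in its simplest form: each category is replaced by the mean of the target for that category. The `city` and `target` columns are hypothetical, and this naive version is for illustration only; in practice the means are usually computed on training data alone (often with smoothing) to avoid leaking the target.

```python
import pandas as pd

# Hypothetical nominal feature and binary target (illustrative values only)
toy = pd.DataFrame(
    {
        "city": ["A", "B", "A", "C", "B", "A"],
        "target": [1, 0, 0, 1, 1, 1],
    }
)

# Mean of the target within each category
city_means = toy.groupby("city")["target"].mean()

# Replace each category with its target mean
toy["city_te"] = toy["city"].map(city_means)

print(toy)
```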

In [None]:
# Display the first 11 rows of the dataframe 'ecoli_df'
ecoli_df.head(11)  # This will show the first 11 entries in the dataframe, including column names

The output displays the first 11 rows of the DataFrame ecoli_df, where each row represents a protein from E. coli, identified by its accession name (e.g., EMRA_ECOLI, AAT_ECOLI). The DataFrame includes 8 columns in total: 7 numerical features—mcg, gvh, lip, chg, aac, alm1, and alm2—which are used for classification, and 1 categorical target column, site, which indicates the predicted protein localization site (e.g., cp, im, om, etc.).

In [None]:
# Extract unique values from the "site" column in the ecoli_df DataFrame
sites = np.unique(ecoli_df["site"])

# Display the unique site values
sites

The output shows that the site column of the E. coli dataset contains 8 unique values, one for each protein localization site listed in the class distribution above. Because some sites have very few examples (e.g., imL and imS with 2 each), this class imbalance should be kept in mind in subsequent modeling.

In [None]:
# Initialize the LabelEncoder
sle = LabelEncoder()

# Encode the 'site' column in ecoli_df, converting categorical values into numeric labels
site_labels = sle.fit_transform(ecoli_df["site"])

# Create a dictionary mapping each numeric label back to its original category
site_mappings = {index: label for index, label in enumerate(sle.classes_)}

# Display the mapping of encoded values to original categories
site_mappings

Using scikit-learn's LabelEncoder (*sle*), we've encoded the categorical site values as integers, stored in the site_labels array. The encoder assigns codes in sorted order of the class names, so the integers serve only as identifiers; they do not imply any ranking among sites.
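
A short sketch on a hypothetical list of site names shows how the codes are assigned and how they can be mapped back to the original labels with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical site values (a small subset of the real class names)
sites_demo = ["om", "cp", "im", "cp"]

le = LabelEncoder()
codes = le.fit_transform(sites_demo)

# classes_ holds the sorted unique labels; the code is the index into it
print(list(le.classes_))  # ['cp', 'im', 'om']
print(codes)              # [2 0 1 0]

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(codes)
print(list(decoded))      # ['om', 'cp', 'im', 'cp']
```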

In [None]:
# Assign the list/array 'site_labels' to a new column 'siteLabel' in the dataframe 'ecoli_df'
ecoli_df["siteLabel"] = site_labels  

# Display the first 11 rows of the dataframe to check the assigned labels
ecoli_df.head(11)

The siteLabel column contains the integer code for each localization site. Comparing it with the site column confirms that each value matches the mapping shown in site_mappings.

### Transforming Ordinal Features

Ordinal features share similarities with nominal features in their categorical nature, but differ critically through their meaningful value ordering. While both types often appear as text data requiring numerical encoding, ordinal features uniquely preserve this inherent order during transformation - a crucial distinction since the sequence itself carries predictive information that machine learning algorithms can leverage.

### Create Generation based on 'age'


In [None]:
# Convert the "age" column of the diabetes dataframe to a NumPy array
age = np.array(diabetes_df["age"])

# Create a new column "Generation" based on age groups using a lambda function
diabetes_df["Generation"] = diabetes_df["age"].apply(
    lambda value: (
        "Gen Z" if value <= 25  # Ages 25 and below belong to Generation Z
        else (
            "Millennials" if value <= 41  # Ages 26-41 belong to Millennials
            else (
                "Gen X" if value <= 57  # Ages 42-57 belong to Generation X
                else (
                    "Boomers II" if value <= 67  # Ages 58-67 belong to Boomers II
                    else (
                        "Boomers I" if value <= 76  # Ages 68-76 belong to Boomers I
                        else (
                            "Post WWII" if value <= 94  # Ages 77-94 belong to Post-WWII generation
                            else "WWII"  # Ages 95+ belong to WWII generation
                        )
                    )
                )
            )
        )
    )
)

# Display the first 10 rows of the "age" and "Generation" columns
diabetes_df[["age", "Generation"]].head(10)

In [None]:
# Get unique values from the "Generation" column of the diabetes_df DataFrame
unique_generations = np.unique(diabetes_df["Generation"])

# Display the unique generation values
print(unique_generations)

The data contains six distinct generations, representing an ordinal attribute with inherent sequential ordering. 

A plain LabelEncoder would assign integers in alphabetical order and lose this sequence, so here we define the mapping manually with a dictionary. (scikit-learn's OrdinalEncoder with an explicit category order, or an ordered pandas Categorical, are alternatives.) The following code maps the generational categories to integers that preserve the progression from youngest to oldest.
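
For comparison with a hand-written dictionary, the sketch below encodes a few hypothetical generation values using two library options: scikit-learn's OrdinalEncoder with an explicit `categories` list, and an ordered pandas Categorical. Both produce zero-based codes, so they differ by one from a mapping that starts at 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Intended order, youngest to oldest
gen_order = ["Gen Z", "Millennials", "Gen X", "Boomers II", "Boomers I", "Post WWII"]

# Hypothetical sample of generation labels
toy = pd.DataFrame({"Generation": ["Gen X", "Gen Z", "Post WWII", "Millennials"]})

# Option 1: OrdinalEncoder with the category order given explicitly
enc = OrdinalEncoder(categories=[gen_order])
toy["gen_code_sklearn"] = enc.fit_transform(toy[["Generation"]]).ravel().astype(int)

# Option 2: ordered pandas Categorical; .codes gives the position in gen_order
cat = pd.Categorical(toy["Generation"], categories=gen_order, ordered=True)
toy["gen_code_pandas"] = cat.codes

print(toy)
```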

In [None]:
# Define a dictionary to map generation labels to ordinal values
gen_ord_map = {
    "Gen Z": 1,         # Youngest generation in the dataset
    "Millennials": 2,   # Followed by Millennials
    "Gen X": 3,         # Middle-aged generation
    "Boomers II": 4,    # Late Boomers
    "Boomers I": 5,     # Early Boomers
    "Post WWII": 6,     # Oldest generation in the dataset
}

# Map the 'Generation' column in the diabetes dataset to corresponding ordinal values
diabetes_df["GenerationLabel"] = diabetes_df["Generation"].map(gen_ord_map)

# Display selected columns (age, original generation label, and mapped generation label)
# for rows 4 to 9 (since slicing is exclusive of the end index)
diabetes_df[["age", "Generation", "GenerationLabel"]].iloc[4:10]

### Create BMI Class based on 'bmi'


In [None]:
# Extract the 'bmi' column from the diabetes DataFrame and store it as a numpy array
bmi = np.array(diabetes_df["bmi"])

# Create a new 'BMI' column in the dataframe by applying a function to the 'bmi' values
diabetes_df["BMI"] = diabetes_df["bmi"].apply(
    lambda value: (
        "Underweight"  # If the BMI is less than or equal to 18.5, classify as Underweight
        if value <= 18.5
        else (
            "Normal"  # If BMI is above 18.5 and at most 22.9, classify as Normal
            if value <= 22.9
            else (
                "Pre-obese"  # If BMI is above 22.9 and at most 24.9, classify as Pre-obese
                if value <= 24.9
                else (
                    "Class I obesity"  # If BMI is above 24.9 and at most 29.9, classify as Class I obesity
                    if value <= 29.9
                    else "Class II obesity"  # If BMI is above 29.9, classify as Class II obesity
                )
            )
        )
    )
)

# Display the first 10 rows of 'bmi' and the newly created 'BMI' column for review
diabetes_df[["bmi", "BMI"]].head(10)

In [None]:
# Get the unique values in the 'BMI' column of the diabetes_df DataFrame
# 'diabetes_df["BMI"]' selects the BMI column from the dataframe
# np.unique() returns the sorted unique values in the specified array
unique_bmi_values = np.unique(diabetes_df["BMI"])

# Output the unique BMI values
print(unique_bmi_values)

The output reveals five naturally ordered BMI classes, confirming their ordinal nature. 

LabelEncoder would assign codes alphabetically and lose this order, so we define the numeric mapping manually. (scikit-learn's OrdinalEncoder can also take an explicit category order, but a small dictionary is enough here.) The following code creates this custom mapping while preserving the class hierarchy.
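
As an alternative sketch, binning and ordinal coding can be done in one step with `pandas.cut`, which returns an ordered categorical. The bin edges below mirror the thresholds used in this notebook (right-inclusive intervals), and the `toy_bmi` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Class names in increasing order and the matching right-inclusive bin edges
bmi_class_names = ["Underweight", "Normal", "Pre-obese", "Class I obesity", "Class II obesity"]
bmi_bins = [-np.inf, 18.5, 22.9, 24.9, 29.9, np.inf]

# Hypothetical BMI values (illustrative only)
toy_bmi = pd.Series([17.0, 21.5, 23.4, 25.6, 43.1])

# pd.cut assigns each value to its interval and labels it
bmi_class = pd.cut(toy_bmi, bins=bmi_bins, labels=bmi_class_names)

# Category codes are zero-based positions in bmi_class_names; add 1 to start at 1
bmi_code = bmi_class.cat.codes + 1

print(pd.DataFrame({"bmi": toy_bmi, "BMI": bmi_class, "BMILabel": bmi_code}))
```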

In [None]:
# Dictionary mapping BMI categories to ordinal values
bmi_ord_map = {
    "Underweight": 1,       # "Underweight" corresponds to 1
    "Normal": 2,            # "Normal" corresponds to 2
    "Pre-obese": 3,         # "Pre-obese" corresponds to 3
    "Class I obesity": 4,   # "Class I obesity" corresponds to 4
    "Class II obesity": 5,  # "Class II obesity" corresponds to 5
}

# Map the 'BMI' column in diabetes_df to its corresponding ordinal value using the bmi_ord_map dictionary
diabetes_df["BMILabel"] = diabetes_df["BMI"].map(bmi_ord_map)

# Display a subset of the dataframe (rows 4 to 9) showing 'bmi', 'BMI', and 'BMILabel' columns
diabetes_df[["bmi", "BMI", "BMILabel"]].iloc[4:10]

The results highlight two points about the BMI classification. Most entries (4/6) fall into Class II obesity (BMI ≥ 30). First, the BMI value of 0.0 (ID 9) is physiologically impossible, which suggests a data entry error or a missing value coded as zero; the binning nevertheless labels it Underweight. Second, ID 5 (BMI = 25.6) is classified as Class I obesity because the thresholds in the code above start the obesity classes at 25, lower than the commonly used cut-off of BMI ≥ 30 (the thresholds resemble those proposed for Asia-Pacific populations). The ordinal labels (BMILabel column) correctly reflect the progression from Underweight (1) to Class II obesity (5). Before modeling, verify both the raw data quality (particularly the zero values) and that the chosen thresholds suit the target population.

For both the generation and BMI features, the ordinal codes come from a hand-written dictionary, which makes the intended order explicit instead of relying on an encoder's default alphabetical ordering.

## Encoding Categorical Features

If we directly feed these integer codes into an algorithm, the model will interpret them as raw numeric features. For nominal features this introduces a spurious notion of order and magnitude, and even for ordinal features the spacing between codes is arbitrary.

As a result, models built using these features directly would be suboptimal and inaccurate. To address this, several strategies exist for creating dummy features, where each unique value or label from the distinct categories is represented separately. In the following sections, we will explore some of these strategies, including **one-hot encoding**, **dummy coding**, **effect coding**, and **feature hashing schemes**.


### One Hot Encoding Scheme

For a categorical feature with **m** unique labels, the one-hot encoding scheme transforms the feature into **m** binary features, each of which can only take a value of **1** or **0**. Each observation in the categorical feature is converted into a vector of size **m**, where only one element is **1** (indicating the active category) and the rest are **0**.
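
The definition above can be sketched directly with NumPy: for integer labels in the range 0 to m-1, indexing the rows of an m × m identity matrix gives one vector of size m per observation. The `labels` array is hypothetical.

```python
import numpy as np

# Hypothetical integer-coded categorical feature with m = 3 unique labels
labels = np.array([2, 0, 1, 2])
m = 3

# Row i of the identity matrix is the one-hot vector for label i
one_hot = np.eye(m, dtype=int)[labels]

print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]
```

Each row contains exactly one 1, in the position of the active category.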


In [None]:
# Select a subset of the DataFrame 'diabetes_df' with specific columns
# - 'diabetes': The target variable indicating if the person has diabetes (e.g., 1 or 0)
# - 'Generation': A categorical variable representing the generation group (e.g., Gen X, Millennial)
# - 'BMI': A categorical variable holding the BMI class derived from 'bmi' (e.g., Normal, Pre-obese)

# Use 'iloc' to filter rows between index 4 and 9 (remember, Python is 0-indexed, so row 4 is included, row 10 is excluded)
diabetes_df[["diabetes", "Generation", "BMI"]].iloc[4:10]

In [None]:
# Initialize the LabelEncoder for "Generation"
gen_le = LabelEncoder()

# Fit the LabelEncoder on the "Generation" column and transform it into numerical labels
gen_labels = gen_le.fit_transform(diabetes_df["Generation"])

# Add the transformed "Generation" labels as a new column in the dataframe
diabetes_df["Gen_Label"] = gen_labels

# Initialize the LabelEncoder for "BMI"
bmi_le = LabelEncoder()

# Fit the LabelEncoder on the "BMI" column and transform it into numerical labels
bmi_labels = bmi_le.fit_transform(diabetes_df["BMI"])

# Add the transformed "BMI" labels as a new column in the dataframe
diabetes_df["BMI_Label"] = bmi_labels

# Create a new dataframe subset with only relevant columns: 
# "diabetes" (target variable), "Generation", "Gen_Label", "BMI", and "BMI_Label"
diabetes_df_sub = diabetes_df[
    ["diabetes", "Generation", "Gen_Label", "BMI", "BMI_Label"]
]

# Display rows 4 to 9 (5th to 10th) from the new dataframe subset
diabetes_df_sub.iloc[4:10]

In [None]:
# Encode generation labels using one-hot encoding scheme
# Initialize the OneHotEncoder for 'Gen_Label'
gen_ohe = OneHotEncoder()
# Apply the encoder to the "Gen_Label" column and convert the result into an array
gen_feature_arr = gen_ohe.fit_transform(diabetes_df[["Gen_Label"]]).toarray()
# Extract the unique categories from the 'Gen_Label' encoding and convert to a list
gen_feature_labels = list(gen_ohe.categories_[0])
# Create a DataFrame with the encoded features, with the appropriate column labels
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# Encode BMI labels using one-hot encoding scheme
# Initialize the OneHotEncoder for 'BMI_Label'
bmi_ohe = OneHotEncoder()
# Apply the encoder to the "BMI_Label" column and convert the result into an array
bmi_feature_arr = bmi_ohe.fit_transform(diabetes_df[["BMI_Label"]]).toarray()
# Create BMI feature labels by prepending "BMI_" to the class labels of the BMI categories
bmi_feature_labels = ["BMI_" + str(cls_label) for cls_label in bmi_ohe.categories_[0]]
# Create a DataFrame with the encoded BMI features, with the appropriate column labels
bmi_features = pd.DataFrame(bmi_feature_arr, columns=bmi_feature_labels)

In [None]:
# Concatenate the dataframes: diabetes_df_sub, gen_features, and bmi_features along columns (axis=1)
# This will combine the features from these different sources into a single dataframe
diabetes_df_ohe = pd.concat([diabetes_df_sub, gen_features, bmi_features], axis=1)

# Create the column names list by combining predefined column labels and the feature labels
# "diabetes", "Generation", and "Gen_Label" are predefined columns
# gen_feature_labels and bmi_feature_labels are dynamically created lists based on features
columns = sum(
    [
        ["diabetes", "Generation", "Gen_Label"],  # Predefined column names
        gen_feature_labels,  # Feature labels for the one-hot encoded generation columns
        ["BMI", "BMI_Label"],  # Predefined BMI-related column names
        bmi_feature_labels,  # Feature labels for BMI-related data
    ],
    [],  # Flatten the list of lists into a single list
)

# Display a slice (rows 4 to 9) of the concatenated dataframe with the newly created columns
diabetes_df_ohe[columns].iloc[4:10]

The output now includes one-hot encoded representations of both **Gen_Label** and **BMI_Label**. Each generated feature is a binary indicator that takes the value 1 when the category is present for a given observation and 0 when it is absent, so exactly one column per encoded feature is active in each row. This transformation converts the labels into a format suitable for machine learning algorithms while removing the artificial ordering implied by the integer labels, which some models would otherwise treat as magnitudes. Because the encoders were fitted on the integer label columns, the new columns are named after those labels (for example `BMI_0`, `BMI_1`): a 1 in the column whose label corresponds to 'Class II obesity' indicates that BMI classification for the record, while all other BMI category columns show 0.
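As a minimal, self-contained sketch of the same two-step idea (label encoding followed by one-hot encoding), the example below uses a hypothetical four-row toy dataframe rather than the diabetes data. The names `toy_df`, `toy_le`, `toy_ohe`, and `toy_features` are illustrative assumptions. Unlike the cells above, this sketch names the indicator columns after the original category strings (via `classes_` of the label encoder), purely to make the output easier to read.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data; the rows are illustrative and not taken from the dataset
toy_df = pd.DataFrame(
    {"BMI": ["Normal", "Pre-obese", "Class II obesity", "Normal"]}
)

# Step 1: map each category string to an integer label (classes are sorted)
toy_le = LabelEncoder()
toy_df["BMI_Label"] = toy_le.fit_transform(toy_df["BMI"])

# Step 2: one-hot encode the integer labels into a dense array
toy_ohe = OneHotEncoder()
toy_arr = toy_ohe.fit_transform(toy_df[["BMI_Label"]]).toarray()

# Name each indicator column after the original category (assumption: readability)
toy_cols = ["BMI_" + str(cls) for cls in toy_le.classes_]
toy_features = pd.DataFrame(toy_arr, columns=toy_cols)

# Each row should contain exactly one active indicator
row_sums = toy_features.sum(axis=1)
print(toy_features)
print(row_sums)
```

Checking that every row sums to 1 is a quick sanity test that the indicator columns are mutually exclusive and cover all observed categories.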

In [None]:
# The following code creates a dummy DataFrame with two data points representing new diabetes cases.
new_diabetes_df = pd.DataFrame(
    # Data: A list of lists, where each list represents a data point
    [["1", "Gen X", "Pre-obese"],  # First data point: diabetes (1), Generation (Gen X), BMI (Pre-obese)
     ["0", "Boomers II", "Class I obesity"]],  # Second data point: diabetes (0), Generation (Boomers II), BMI (Class I obesity)
    
    # Columns: Names of the columns for the DataFrame
    columns=["diabetes", "Generation", "BMI"],  # Define the names for each column (diabetes, Generation, BMI)
)

# Display the DataFrame to show the created data points
new_diabetes_df

In [None]:
# Converting the text categories into numeric representations using our previously built LabelEncoder objects

# Transforming the 'Generation' column values to numeric using the previously fitted LabelEncoder (gen_le)
new_gen_labels = gen_le.transform(new_diabetes_df["Generation"])

# Adding the transformed numeric labels as a new column called "Gen_Label" in the DataFrame
new_diabetes_df["Gen_Label"] = new_gen_labels

# Transforming the 'BMI' column values to numeric using the previously fitted LabelEncoder (bmi_le)
new_bmi_labels = bmi_le.transform(new_diabetes_df["BMI"])

# Adding the transformed numeric labels as a new column called "BMI_Label" in the DataFrame
new_diabetes_df["BMI_Label"] = new_bmi_labels

# Displaying the relevant columns to inspect the new encoded labels
new_diabetes_df[["diabetes", "Generation", "Gen_Label", "BMI", "BMI_Label"]]

In [None]:
# Transform the 'Gen_Label' column into one-hot encoded features
# using 'gen_ohe', the OneHotEncoder fitted earlier on the 'Gen_Label' column.
new_gen_feature_arr = gen_ohe.transform(new_diabetes_df[["Gen_Label"]]).toarray()
# Convert the resulting array into a DataFrame with appropriate column names from 'gen_feature_labels'
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)

# Transform the 'BMI_Label' column into one-hot encoded features
# using 'bmi_ohe', the OneHotEncoder fitted earlier on the 'BMI_Label' column.
new_bmi_feature_arr = bmi_ohe.transform(new_diabetes_df[["BMI_Label"]]).toarray()
# Convert the resulting array into a DataFrame with appropriate column names from 'bmi_feature_labels'
new_bmi_features = pd.DataFrame(new_bmi_feature_arr, columns=bmi_feature_labels)

# Concatenate the original dataframe 'new_diabetes_df' with the newly generated one-hot encoded features
# This will add the new columns from 'new_gen_features' and 'new_bmi_features' to the original data
new_diabetes_ohe = pd.concat(
    [new_diabetes_df, new_gen_features, new_bmi_features], axis=1
)

# Define the desired column order, starting with diabetes-related columns, then adding one-hot encoded features
columns = sum(
    [
        ["diabetes", "Generation", "Gen_Label"],  # The original columns from the dataset
        gen_feature_labels,  # Columns generated from one-hot encoding of 'Gen_Label'
        ["BMI", "BMI_Label"],  # The BMI-related columns
        bmi_feature_labels,  # Columns generated from one-hot encoding of 'BMI_Label'
    ],
    [],
)

# Display the dataframe with the new column order
new_diabetes_ohe[columns]

In [None]:
# Pandas provides the 'get_dummies()' function that can help us easily perform one-hot encoding.
# It converts a categorical column into multiple binary columns, each representing a category in the original column.
gen_onehot_features = pd.get_dummies(diabetes_df["Generation"])

# Concatenate the original dataframe with the one-hot encoded columns, while keeping the "diabetes" and "Generation" columns.
# We use `axis=1` to concatenate along columns (horizontally).
# The 'iloc[4:10]' selects rows 4 through 9 (i.e., 6 rows) from the resulting dataframe.
# This gives a glimpse of the encoded features for a specific subset of rows.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_onehot_features], axis=1).iloc[
    4:10
]

### Dummy Coding Scheme

The dummy coding scheme is similar to one-hot encoding, with one key difference: when applied to a categorical feature with **m** distinct labels, it generates **m-1** binary features. As a result, each value of the categorical variable is converted into a vector of size **m-1**. One category is omitted and serves as the baseline: it is represented by a vector of all zeros (**0**). If the category values range over {**0, 1, ..., m-1**}, the omitted category is typically the 0th or the (m-1)th.
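The difference between the two schemes can be sketched on a hypothetical toy column with m = 3 labels. The series `toy_gen` and the variable names below are illustrative assumptions, not part of the dataset. The sketch compares `pd.get_dummies` with and without `drop_first=True` and picks out the row belonging to the dropped baseline category.

```python
import pandas as pd

# Hypothetical toy column with m = 3 distinct labels
toy_gen = pd.Series(
    ["Gen X", "Boomers II", "Millennials", "Gen X"], name="Generation"
)

# One-hot encoding: one indicator column per label (m columns)
onehot = pd.get_dummies(toy_gen)

# Dummy coding: the first label in sorted order is dropped (m - 1 columns)
dummy = pd.get_dummies(toy_gen, drop_first=True)

# The dropped label is the baseline; its rows contain only zeros (or False)
baseline = onehot.columns[0]
baseline_rows = dummy[toy_gen == baseline]

print(onehot.shape[1], dummy.shape[1])
print(baseline)
print(baseline_rows)
```

No information is lost by dropping a column: a row with all indicators off can only belong to the baseline category.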

In [None]:
# Create dummy-coded variables for the "Generation" column in the diabetes dataset.
# drop_first=True drops the first category (Boomers I), which becomes the all-zero baseline;
# this avoids the dummy variable trap (multicollinearity).
gen_dummy_features = pd.get_dummies(diabetes_df["Generation"], drop_first=True)

# Concatenate the original "diabetes" and "Generation" columns with the newly created dummy variables.
# Select and display rows 4 to 9 for inspection.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_dummy_features], axis=1).iloc[4:10]

In [None]:
# Perform one-hot encoding on the "Generation" column, creating binary features for each unique category.
gen_onehot_features = pd.get_dummies(diabetes_df["Generation"])

# Drop the last column instead of the first to avoid the dummy variable trap (multicollinearity).
# The dropped category becomes the all-zero baseline, so no information is lost.
gen_dummy_features = gen_onehot_features.iloc[:, :-1]

# Concatenate the original "diabetes" and "Generation" columns with the encoded dummy variables.
# Then, display rows 4 to 9 of the resulting DataFrame.
pd.concat([diabetes_df[["diabetes", "Generation"]], gen_dummy_features], axis=1).iloc[4:10]

## Conclusion

Through this module, we learned essential feature engineering techniques for both numeric and categorical data, including:

- Converting raw data into machine learning-ready features
- Applying appropriate transformations based on data type
- Understanding and implementing different encoding schemes
- Creating meaningful feature interactions
- Handling both nominal and ordinal categorical variables

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
