# Feature Engineering and Preprocessing

This notebook prepares the analytics-ready dataset for machine learning modeling. Building on insights from the exploratory data analysis (EDA) phase, this step focuses on transforming raw attributes into meaningful features, handling data quality issues, and ensuring compatibility with machine learning algorithms.

The main objectives of this notebook are to:
- Select relevant features based on EDA findings
- Remove redundant, constant, or non-informative columns
- Handle missing values and skewed distributions
- Encode categorical variables into numerical form
- Prepare a clean, model-ready dataset for training and evaluation


## Feature Engineering Workflow

The preprocessing pipeline in this notebook follows these stages:

1. Reload the joined dataset created from fact and dimension tables
2. Identify and remove non-predictive or redundant columns
3. Define the target variable for modeling
4. Handle missing values and data inconsistencies
5. Transform skewed numerical features
6. Encode categorical variables
7. Split the data into training and testing sets

Each step is informed by findings from the EDA phase and is documented to ensure transparency and reproducibility.


In [1]:
import os
import pandas as pd
import numpy as np

To avoid copying and pasting the same code across notebooks, shared helpers live in `data_utils.py`, which provides two functions: `data_dir()` and `load_csv()`.
We use `data_dir()` below to resolve the relative path to the data.

In [2]:
import sys

# Ensure workspace root is on sys.path so `src` package is importable
workspace_root = os.path.abspath('..')

if workspace_root not in sys.path:
    sys.path.insert(0, workspace_root)
    
import src.data_utils as du

# Resolve DATA_DIR from the package helper and expose as variable
DATA_DIR = du.data_dir()
DATA_DIR

'../data/raw'

Below we use the other function, `load_csv()`, defined in `data_utils.py`.

In [4]:
# Load the fact table (joined with the dimension tables in the next cell)
fact_sales = du.load_csv(DATA_DIR, "FactInternetSales.csv", "utf-8-sig")

# Load key dimension tables
dim_product = du.load_csv(DATA_DIR, "DimProduct.csv")
dim_customer = du.load_csv(DATA_DIR, "DimCustomer.csv")
dim_date = du.load_csv(DATA_DIR, "DimDate.csv")
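
The contents of `data_utils.py` are not shown in this notebook. As a rough, hypothetical sketch only (the real implementations may differ), helpers compatible with the calls above could look like the following. The hard-coded path, the default encoding, and the error handling are all assumptions.

```python
import os

import pandas as pd


def data_dir() -> str:
    """Return the relative path to the raw data folder (assumed location)."""
    return os.path.join("..", "data", "raw")


def load_csv(directory: str, filename: str, encoding: str = "utf-8") -> pd.DataFrame:
    """Read one CSV file from `directory`, failing early with a clear message."""
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected data file not found: {path}")
    return pd.read_csv(path, encoding=encoding)
```

Passing `"utf-8-sig"` as the encoding, as done for `FactInternetSales.csv`, strips a leading byte-order mark so that the first column name is read cleanly.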


In [5]:
# Join fact with product dimension
sales_df = fact_sales.merge(
    dim_product,
    on="ProductKey",
    how="left"
)

# Join with customer dimension
sales_df = sales_df.merge(
    dim_customer,
    on="CustomerKey",
    how="left"
)

# Join with date dimension
sales_df = sales_df.merge(
    dim_date,
    left_on="OrderDateKey",
    right_on="DateKey",
    how="left"
)

# Validate final dataset
sales_df.shape

(60398, 100)
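
The row count matches the fact table, which is what a left join against unique dimension keys should produce. A minimal sketch of how such a join can be checked more strictly is shown below, using the `validate` and `indicator` parameters of `DataFrame.merge`. The two small tables are toy stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for a fact table and a dimension table (not the real data)
fact = pd.DataFrame({
    "OrderDateKey": [20110101, 20110102, 20110103],
    "SalesAmount": [10.0, 20.0, 30.0],
})
dim = pd.DataFrame({
    "DateKey": [20110101, 20110102],
    "CalendarYear": [2011, 2011],
})

merged = fact.merge(
    dim,
    left_on="OrderDateKey",
    right_on="DateKey",
    how="left",
    validate="many_to_one",  # raises MergeError if DateKey is duplicated in dim
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# A left join on a unique dimension key must not add or remove fact rows
assert len(merged) == len(fact)

# Share of fact rows that found no matching dimension row
unmatched_share = (merged["_merge"] == "left_only").mean()
print(f"Unmatched fact rows: {unmatched_share:.1%}")
```

A large unmatched share can point to a key mismatch between the two tables (different formats or date ranges) rather than to genuinely missing attributes.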

In [6]:
# Inspect columns and data types
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60398 entries, 0 to 60397
Data columns (total 100 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ProductKey             60398 non-null  int64  
 1   OrderDateKey           60398 non-null  int64  
 2   DueDateKey             60398 non-null  int64  
 3   ShipDateKey            60398 non-null  int64  
 4   CustomerKey            60398 non-null  int64  
 5   PromotionKey           60398 non-null  int64  
 6   CurrencyKey            60398 non-null  int64  
 7   SalesTerritoryKey      60398 non-null  int64  
 8   SalesOrderNumber       60398 non-null  object 
 9   SalesOrderLineNumber   60398 non-null  int64  
 10  RevisionNumber         60398 non-null  int64  
 11  OrderQuantity          60398 non-null  int64  
 12  UnitPrice              60398 non-null  float64
 13  ExtendedAmount         60398 non-null  float64
 14  UnitPriceDiscountPct   60398 non-null  int64  
 15  D

## Dataset Reload Validation

The dataset has been reloaded to ensure this notebook is self-contained and reproducible. Column names, data types, and row counts are inspected before any transformations are applied.

Subsequent steps will explicitly document which features are retained, transformed, or removed, along with the rationale for each decision.


## Feature Categorization — How We Decide


Before performing any transformations, features are grouped into categories based on their role in the business process and their availability at prediction time.

The guiding principle used for this categorization is:

> *A feature may only be used for prediction if it would be known at the time the prediction is made.*

Each category below documents **why** certain columns belong to it.

---

### 1. Identifier and Technical Columns

**Definition:**  
Columns that uniquely identify records or entities but do not carry predictive meaning.

**Examples:**  
- ProductKey  
- CustomerKey  
- SalesOrderNumber  

**Reasoning:**  
These columns act as database identifiers rather than informative attributes. Their values are arbitrary labels, so they are unlikely to carry predictive signal. Including them risks the model memorizing individual records or learning spurious patterns that do not generalize.

**Action:**  
Excluded from modeling.

---

### 2. Leakage-Prone (Post-Outcome) Columns

**Definition:**  
Columns that are calculated after the transaction completes or are derived directly from the target variable.

**Examples:**  
- ExtendedAmount  
- TotalProductCost  
- TaxAmt  

**Reasoning:**  
These values are only known *after* the sale has occurred and the final SalesAmount is determined. Including them would leak future information into the model, resulting in unrealistically high performance that would not generalize to real-world predictions.

**Action:**  
Excluded to prevent data leakage.
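
One rough way to spot candidates for this category is to look for columns that are almost perfectly correlated with the target. The sketch below uses toy data in which `ExtendedAmount` equals the target by construction; the 0.999 cutoff is an arbitrary assumption, and a high correlation is only a prompt for manual review, not proof of leakage.

```python
import pandas as pd

# Toy data (not the real dataset); no discounts, so the amounts coincide
toy = pd.DataFrame({
    "OrderQuantity": [1, 2, 1, 3, 2],
    "UnitPrice": [10.0, 20.0, 5.0, 8.0, 12.5],
})
toy["ExtendedAmount"] = toy["OrderQuantity"] * toy["UnitPrice"]
toy["SalesAmount"] = toy["ExtendedAmount"]

target = "SalesAmount"

# Absolute Pearson correlation of every candidate column with the target
corr_with_target = toy.drop(columns=[target]).corrwith(toy[target]).abs()

# Columns that track the target almost exactly deserve a closer look
suspects = corr_with_target[corr_with_target > 0.999].index.tolist()
print(suspects)
```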

---

### 3. Numerical Predictive Features

**Definition:**  
Quantitative attributes that are known **before** or **at the time of the sale**.

**Examples:**  
- OrderQuantity  
- UnitPrice  
- UnitPriceDiscountPct  

**Reasoning:**  
These features directly influence the transaction outcome and are available at prediction time. They represent legitimate inputs that the model can reasonably use to estimate SalesAmount.

**Action:**  
Retained for modeling.

---

### 4. Categorical Descriptive Features

**Definition:**  
Non-numeric attributes describing products, customers, or time periods.

**Examples:**  
- ProductCategory  
- EnglishMonthName  
- CustomerGender  

**Reasoning:**  
Although not numerical, these features encode meaningful contextual information. They will be transformed into numerical representations using encoding techniques suitable for machine learning models.

**Action:**  
Retained and encoded in later steps.


## Target Variable Definition

The objective of this modeling task is to predict **SalesAmount** at the transaction (line-item) level based on product, customer, and time-related attributes.

`SalesAmount` is selected as the target variable because:
- It represents the primary business outcome (revenue)
- It is continuous and suitable for regression-based models
- It showed meaningful patterns and skewness during EDA
- It aligns naturally with retail and sales forecasting use cases

All remaining features will be evaluated as potential predictors of `SalesAmount`.


In [7]:
# Define target variable
TARGET_COL = "SalesAmount"

# Separate target variable
y = sales_df[TARGET_COL]

# Create initial feature set by dropping target
X = sales_df.drop(columns=[TARGET_COL])

# Basic validation
X.shape, y.shape

((60398, 99), (60398,))

## Feature Categorization — Applying the Decision Rules

The categories below illustrate how features are grouped based on the decision rules described above. At this stage, we identify representative examples rather than exhaustively classifying all columns. Final feature selection will be performed programmatically in subsequent steps.

To prepare the dataset for machine learning, features are grouped into the following categories:

### 1. Identifier and Technical Columns
These columns uniquely identify records or serve operational purposes and do not carry predictive value. Examples include:
- SalesOrderNumber
- SalesOrderLineNumber
- Surrogate keys (e.g., ProductKey, CustomerKey)

These will be excluded from modeling.

### 2. Leakage-Prone Columns
These columns are derived directly from or mathematically linked to the target variable (`SalesAmount`) and would cause data leakage if included. Examples include:
- ExtendedAmount
- TotalProductCost
- TaxAmt

These are excluded from the feature set prior to modeling.

### 3. Numerical Predictive Features
Continuous or discrete numerical attributes that may explain variations in sales, such as:
- OrderQuantity
- UnitPrice
- Discount-related fields
- Product cost attributes

These features may require transformation due to skewness or scale differences.

### 4. Categorical Features
Non-numerical attributes describing products, customers, and time, such as:
- Product categories
- Customer demographics
- Calendar attributes (Year, Month, Quarter)

These will be encoded into numerical form in later steps.

This structured categorization ensures transparent, defensible feature selection.
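
As an illustration, the categories can be recorded as an explicit mapping so that exclusions are derived from it rather than scattered across cells. This is only a sketch: the `feature_roles` dictionary and `EXCLUDED_ROLES` set are hypothetical names, and the column lists contain just the representative examples named above plus two customer attributes from the dataset.

```python
# Hypothetical role mapping built from the representative examples above
feature_roles = {
    "identifier": ["ProductKey", "CustomerKey", "SalesOrderNumber", "SalesOrderLineNumber"],
    "leakage": ["ExtendedAmount", "TotalProductCost", "TaxAmt"],
    "numerical": ["OrderQuantity", "UnitPrice", "UnitPriceDiscountPct"],
    "categorical": ["Gender", "MaritalStatus"],
}
EXCLUDED_ROLES = {"identifier", "leakage"}

# Each column should be assigned to exactly one role
all_cols = [col for cols in feature_roles.values() for col in cols]
duplicates = sorted({col for col in all_cols if all_cols.count(col) > 1})
assert not duplicates, f"Columns assigned to more than one role: {duplicates}"

# Columns to drop before modeling, in a stable order
cols_to_exclude = [col for role in sorted(EXCLUDED_ROLES) for col in feature_roles[role]]
print(cols_to_exclude)
```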


## Missing Value Analysis

Before applying any preprocessing steps, we assess the presence and extent of missing values across all features. This analysis helps determine whether missing data is negligible, requires imputation, or warrants column removal.

Understanding missingness patterns ensures that subsequent preprocessing decisions are data-driven and defensible.


In [8]:
# Calculate missing values per column
missing_counts = sales_df.isna().sum()

# Calculate percentage of missing values
missing_percent = (missing_counts / len(sales_df)) * 100

# Combine into a summary DataFrame
missing_summary = (
    pd.DataFrame({
        "missing_count": missing_counts,
        "missing_percent": missing_percent
    })
    .query("missing_count > 0")
    .sort_values(by="missing_percent", ascending=False)
)

missing_summary

| Column | missing_count | missing_percent |
|---|---|---|
| CarrierTrackingNumber | 60398 | 100.0 |
| CustomerPONumber | 60398 | 100.0 |
| Suffix | 60392 | 99.990066 |
| FullDateAlternateKey | 60384 | 99.97682 |
| DateKey | 60384 | 99.97682 |
| DayNumberOfWeek | 60384 | 99.97682 |
| FiscalQuarter | 60384 | 99.97682 |
| FiscalYear | 60384 | 99.97682 |
| FiscalSemester | 60384 | 99.97682 |
| CalendarSemester | 60384 | 99.97682 |


## Missing Value Interpretation and Handling Strategy

The missing value analysis reveals distinct patterns driven by the data warehouse schema and business logic rather than random data quality issues.

Key observations:

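- Several columns exhibit near-total missingness (≈ 100%), particularly operational fields (e.g., `CarrierTrackingNumber`, `CustomerPONumber`) and the attributes joined from `DimDate`. The `DimDate` columns are populated for only 14 of 60,398 rows, which suggests that `OrderDateKey` rarely matches `DateKey` in the loaded files and may be worth checking at the source. In their current state these columns provide no usable signal and will be removed.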

- Product-related attributes such as size, weight, style, and class show high but systematic missingness, reflecting optional or category-specific characteristics. These features will be evaluated individually and either dropped or encoded with explicit missing indicators.

- A small number of attributes exhibit moderate missingness and may be retained through imputation or categorical encoding.

Based on these findings, features with extreme missingness will be removed prior to modeling, while others will be handled using controlled preprocessing techniques.


In [9]:
# Drop columns with extremely high missingness (>= 99%)
missing_threshold = 0.99

cols_to_drop = (
    missing_summary[missing_summary["missing_percent"] >= missing_threshold * 100]
    .index
    .tolist()
)

# Drop from feature set only (not target)
X = X.drop(columns=cols_to_drop)

X.shape

(60398, 75)

## Handling Remaining Missing Values

After removing columns with extreme missingness, the remaining missing values represent meaningful absence rather than data corruption.

These missing values are handled differently depending on feature type:

- **Numerical features**: Missing values are imputed using the median, which is less sensitive than the mean to skewed distributions and outliers.
- **Categorical features**: Missing values are treated as an explicit category ("Unknown") to preserve information about absence.

This approach ensures no rows are dropped while maintaining model compatibility and interpretability.


In [10]:
# Identify remaining missing values after column removal
remaining_missing = X.isna().sum()
remaining_missing = remaining_missing[remaining_missing > 0]

remaining_missing.sort_values(ascending=False)


AddressLine2             59293
EndDate                  54970
WeightUnitMeasureCode    45193
Weight                   45193
SizeUnitMeasureCode      45193
Class                    38946
SizeRange                37549
Size                     37549
Style                    36092
Color                    28919
MiddleName               25495
FrenchProductName        13135
SpanishProductName       13135
Status                    5428
dtype: int64

In [11]:
# Separate numerical and categorical columns with missing values
num_missing_cols = X.select_dtypes(include=[np.number]).columns
num_missing_cols = [col for col in num_missing_cols if X[col].isna().any()]

cat_missing_cols = X.select_dtypes(exclude=[np.number]).columns
cat_missing_cols = [col for col in cat_missing_cols if X[col].isna().any()]

num_missing_cols, cat_missing_cols

(['Weight'],
 ['WeightUnitMeasureCode',
  'SizeUnitMeasureCode',
  'SpanishProductName',
  'FrenchProductName',
  'Color',
  'Size',
  'SizeRange',
  'Class',
  'Style',
  'EndDate',
  'Status',
  'MiddleName',
  'AddressLine2'])

### Missing Value Treatment Strategy

Remaining missing values fall into two categories:

**Numerical Features**
- Missingness is handled using median imputation.
- Median is preferred over mean due to skewed distributions observed in EDA.

**Categorical Features**
- Missing values are replaced with the category "Unknown".
- This preserves information that the value was absent rather than discarding records.

This strategy avoids data loss while maintaining statistical stability.


In [12]:
# Impute numerical features with median
for col in num_missing_cols:
    X[col] = X[col].fillna(X[col].median())

# Impute categorical features with explicit label
for col in cat_missing_cols:
    X[col] = X[col].fillna("Unknown")

# Validate no remaining missing values
X.isna().sum().sum()

np.int64(0)
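
Median imputation erases the information that a value was originally missing. Where that information may matter (for example `Weight`, which is missing for most rows), a missing indicator can be recorded before imputing. The sketch below shows the idea on toy data; the `Weight_missing` column name is an assumption, and this step is not part of the pipeline above.

```python
import numpy as np
import pandas as pd

# Toy data (not the real dataset)
toy = pd.DataFrame({"Weight": [2.5, np.nan, 10.0, np.nan, 4.0]})

# Record missingness first, because imputation removes the NaN markers
toy["Weight_missing"] = toy["Weight"].isna().astype(int)

# Then impute with the median of the observed values (2.5, 4.0, 10.0)
toy["Weight"] = toy["Weight"].fillna(toy["Weight"].median())

print(toy)
```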

In [13]:
# Drop identifier, surrogate key, and leakage-prone columns
cols_to_drop = [
    # Leakage-prone
    "ExtendedAmount",
    "TotalProductCost",
    "TaxAmt",

    # Identifiers / surrogate keys
    "ProductKey",
    "CustomerKey",
    "PromotionKey",
    "CurrencyKey",
    "SalesTerritoryKey",
    "SalesOrderLineNumber",

    # Date keys (integer surrogate keys; the DimDate attributes were dropped above for missingness)
    "OrderDateKey",
    "DueDateKey",
    "ShipDateKey"
]

X = X.drop(columns=[col for col in cols_to_drop if col in X.columns])

X.shape


(60398, 63)

## Numerical Feature Transformation

Many numerical features in the dataset exhibit right-skewed distributions, as observed during the exploratory data analysis (EDA) phase.

To improve model performance and numerical stability, skewed numerical features are transformed using a logarithmic transformation.

Log transformation:
- Reduces skewness
- Compresses extreme values
- Improves compatibility with linear and distance-based models

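The transformation uses `log1p`, i.e. log(1 + x), which is defined at zero, so zero-valued counts such as `NumberChildrenAtHome` do not cause errors. It assumes the selected features are non-negative; negative values would need separate handling.

Identifier fields, surrogate keys, and leakage-prone variables are dropped before this step, since their numeric values do not represent meaningful quantities. Two key columns, `ProductSubcategoryKey` and `GeographyKey`, remain in the numerical set but fall below the skewness threshold and are therefore not transformed.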
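
To illustrate the effect, the sketch below applies `np.log1p` to a small right-skewed toy series and compares skewness before and after. The values are made up for illustration. It also shows that `np.expm1` inverts the transformation, which is useful if transformed values need to be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values (not the real data): a few large outliers
prices = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 50, 200], dtype=float)

# log1p computes log(1 + x) and is defined at x = 0
logged = np.log1p(prices)

print(f"skew before: {prices.skew():.2f}, after: {logged.skew():.2f}")

# expm1 computes exp(x) - 1 and recovers the original scale
restored = np.expm1(logged)
assert np.allclose(restored, prices)
```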


In [14]:
# Identify numerical features
numerical_cols = X.select_dtypes(include=[np.number]).columns

len(numerical_cols), numerical_cols

(21,
 Index(['RevisionNumber', 'OrderQuantity', 'UnitPrice', 'UnitPriceDiscountPct',
        'DiscountAmount', 'ProductStandardCost', 'Freight',
        'ProductSubcategoryKey', 'StandardCost', 'SafetyStockLevel',
        'ReorderPoint', 'ListPrice', 'Weight', 'DaysToManufacture',
        'DealerPrice', 'GeographyKey', 'YearlyIncome', 'TotalChildren',
        'NumberChildrenAtHome', 'HouseOwnerFlag', 'NumberCarsOwned'],
       dtype='object'))

In [15]:
# Calculate skewness of numerical features
skewness = X[numerical_cols].skew().sort_values(ascending=False)

skewness

ProductStandardCost      1.950547
StandardCost             1.950547
Freight                  1.927515
ListPrice                1.927515
UnitPrice                1.927515
DealerPrice              1.927515
NumberChildrenAtHome     1.281429
DaysToManufacture        1.144009
ReorderPoint             1.126946
SafetyStockLevel         1.126946
YearlyIncome             0.783957
GeographyKey             0.734562
Weight                   0.550952
TotalChildren            0.463057
NumberCarsOwned          0.402790
OrderQuantity            0.000000
RevisionNumber           0.000000
DiscountAmount           0.000000
UnitPriceDiscountPct     0.000000
ProductSubcategoryKey   -0.688218
HouseOwnerFlag          -0.823695
dtype: float64

### Skewness-Based Transformation Criteria

Numerical features with an absolute skewness greater than 1 are considered highly skewed and are selected for transformation.

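This threshold targets strongly skewed distributions while leaving approximately symmetric features untouched. Because a log transformation only reduces right skew, it matters that every feature exceeding the threshold here is right-skewed; the two left-skewed features fall below it.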


In [16]:
# Select highly skewed numerical features
skewed_cols = skewness[abs(skewness) > 1].index.tolist()

len(skewed_cols), skewed_cols

(10,
 ['ProductStandardCost',
  'StandardCost',
  'Freight',
  'ListPrice',
  'UnitPrice',
  'DealerPrice',
  'NumberChildrenAtHome',
  'DaysToManufacture',
  'ReorderPoint',
  'SafetyStockLevel'])

In [17]:
# Apply log1p transformation to skewed numerical features
for col in skewed_cols:
    X[col] = np.log1p(X[col])

In [18]:
# Recalculate skewness after transformation
post_skewness = X[skewed_cols].skew().sort_values(ascending=False)

post_skewness

Freight                 1.169011
DaysToManufacture       1.144009
StandardCost            0.807867
ProductStandardCost     0.807867
NumberChildrenAtHome    0.807117
DealerPrice             0.774139
UnitPrice               0.740682
ListPrice               0.740682
ReorderPoint            0.296094
SafetyStockLevel        0.289817
dtype: float64

## Categorical Feature Encoding

Most machine learning models require numerical inputs, so categorical features must be converted into numerical representations.

This step encodes categorical variables while preserving their informational content and avoiding unintended ordinal relationships.

Encoding strategy:
- Categorical features are one-hot encoded using pandas
- High-cardinality identifiers, raw date strings, and free-text fields are removed before encoding
- Missing values have already been handled using explicit categories
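
A minimal sketch of this strategy on toy data is shown below. It also estimates the number of dummy columns in advance from each column's cardinality, which helps anticipate how wide the encoded matrix will become. The toy columns and values are illustrative only.

```python
import pandas as pd

# Toy data (not the real dataset); "Unknown" stands in for an imputed category
toy = pd.DataFrame({
    "OrderQuantity": [1, 2, 1, 3],
    "Color": ["Red", "Blue", "Unknown", "Red"],
    "Gender": ["M", "F", "F", "M"],
})
cat_cols = ["Color", "Gender"]

# With drop_first=True each column contributes (number of levels - 1) dummies
expected_dummy_cols = int((toy[cat_cols].nunique() - 1).sum())

encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

print(expected_dummy_cols, list(encoded.columns))
```

Checking `nunique()` first makes it easier to notice columns with many distinct values, since each of those adds a large number of sparse columns.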



In [19]:
# Identify categorical features
categorical_cols = X.select_dtypes(exclude=[np.number]).columns

len(categorical_cols), categorical_cols


(42,
 Index(['SalesOrderNumber', 'OrderDate', 'DueDate', 'ShipDate',
        'ProductAlternateKey', 'WeightUnitMeasureCode', 'SizeUnitMeasureCode',
        'EnglishProductName', 'SpanishProductName', 'FrenchProductName',
        'FinishedGoodsFlag', 'Color', 'Size', 'SizeRange', 'ProductLine',
        'Class', 'Style', 'ModelName', 'EnglishDescription', 'StartDate',
        'EndDate', 'Status', 'CustomerAlternateKey', 'FirstName', 'MiddleName',
        'LastName', 'NameStyle', 'BirthDate', 'MaritalStatus', 'Gender',
        'EmailAddress', 'EnglishEducation', 'SpanishEducation',
        'FrenchEducation', 'EnglishOccupation', 'SpanishOccupation',
        'FrenchOccupation', 'AddressLine1', 'AddressLine2', 'Phone',
        'DateFirstPurchase', 'CommuteDistance'],
       dtype='object'))

In [20]:
# Drop high-cardinality identifiers, raw date strings, and free-text fields
cols_to_drop_cat = [
    "SalesOrderNumber",
    "ProductAlternateKey",
    "CustomerAlternateKey",
    "EmailAddress",
    "Phone",
    "OrderDate",
    "DueDate",
    "ShipDate",
    "StartDate",
    "EndDate",
    "BirthDate",
    "DateFirstPurchase",
    "EnglishDescription",
    "ModelName",
    "AddressLine1",
    "AddressLine2",
    "FirstName",
    "MiddleName",
    "LastName"
]

X = X.drop(columns=cols_to_drop_cat)
X.shape

(60398, 44)

In [21]:
# Re-identify categorical columns after cleanup
categorical_cols = X.select_dtypes(exclude=[np.number]).columns

len(categorical_cols), categorical_cols


(23,
 Index(['WeightUnitMeasureCode', 'SizeUnitMeasureCode', 'EnglishProductName',
        'SpanishProductName', 'FrenchProductName', 'FinishedGoodsFlag', 'Color',
        'Size', 'SizeRange', 'ProductLine', 'Class', 'Style', 'Status',
        'NameStyle', 'MaritalStatus', 'Gender', 'EnglishEducation',
        'SpanishEducation', 'FrenchEducation', 'EnglishOccupation',
        'SpanishOccupation', 'FrenchOccupation', 'CommuteDistance'],
       dtype='object'))

In [22]:
# One-hot encode categorical features
X_encoded = pd.get_dummies(
    X,
    columns=categorical_cols,
    drop_first=True
)

X_encoded.shape

(60398, 464)

## Train–Test Split

The final step in preprocessing is to split the dataset into training and testing subsets.

This ensures that model performance is evaluated on unseen data, which makes overfitting detectable rather than hidden.

A fixed random state is used to ensure reproducibility.

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((48318, 464), (12080, 464), (48318,), (12080,))
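
The split sizes follow directly from the row count. The arithmetic below is a small sanity check, assuming the test-set size is rounded up when `test_size` is a fraction, which is consistent with the shapes shown above.

```python
import math

n_rows = 60398
test_size = 0.2

# 60398 * 0.2 = 12079.6, rounded up to a whole number of rows
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```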

## Persist Preprocessed Data

To ensure reproducibility and separation of concerns, the final training and testing datasets are saved to disk and reloaded in the modeling notebook.


In [24]:
# Create processed data directory if it doesn't exist
processed_dir = os.path.join(DATA_DIR, "..", "processed")
os.makedirs(processed_dir, exist_ok=True)

# Save train-test splits
X_train.to_csv(os.path.join(processed_dir, "X_train.csv"), index=False)
X_test.to_csv(os.path.join(processed_dir, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(processed_dir, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(processed_dir, "y_test.csv"), index=False)
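
For reference, a sketch of how these files might be read back in the modeling notebook is shown below, using toy data and a temporary directory instead of the real `processed` folder. Since the target is saved as a one-column CSV, `squeeze("columns")` is used here to turn it back into a Series; whether the modeling notebook does this is an assumption.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real training split
X_toy = pd.DataFrame({"OrderQuantity": [1, 2, 3], "Color_Red": [True, False, True]})
y_toy = pd.Series([10.0, 20.0, 30.0], name="SalesAmount")

with tempfile.TemporaryDirectory() as tmp_dir:
    X_path = os.path.join(tmp_dir, "X_train.csv")
    y_path = os.path.join(tmp_dir, "y_train.csv")

    # Same saving convention as above: no index column
    X_toy.to_csv(X_path, index=False)
    y_toy.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(X_path)
    # y is stored as a one-column CSV; squeeze turns it back into a Series
    y_loaded = pd.read_csv(y_path).squeeze("columns")

# Shapes and the target name should survive the round trip
assert X_loaded.shape == X_toy.shape
assert y_loaded.name == "SalesAmount"
```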

## Preprocessing Summary and Next Steps

This notebook transformed the raw, joined sales dataset into a clean, model-ready feature matrix suitable for supervised learning.

### Key preprocessing outcomes:

- **Initial dataset**: 60,398 transaction-level records with 100 columns (99 candidate features plus the target)
- **Target variable**: `SalesAmount` (continuous, line-item revenue)
- **Feature selection**:
    - Removed identifier and surrogate key columns
    - Excluded leakage-prone, post-outcome variables
    - Dropped columns with extreme missingness (≥ 99%)
    - Removed high-cardinality and non-predictive text fields
- **Missing value handling**:
    - Numerical features imputed using median values
    - Categorical features imputed using an explicit `"Unknown"` category
    - No rows were dropped during preprocessing
- **Numerical transformations**:
    - Skewed numerical features transformed using `log1p`
    - Transformation applied selectively based on skewness diagnostics
- **Categorical encoding**:
    - One-hot encoding applied to categorical features
    - `drop_first=True` used to reduce multicollinearity
    - Final encoded feature space contains **464 features**
- **Train–test split**:
    - 80% training / 20% testing
    - Fixed random state for reproducibility

### Final dataset shapes:

- Training set: `(48318, 464)`
- Test set: `(12080, 464)`