# 02 – Feature Engineering

This notebook performs feature engineering for the Retail Sales Dataset project.

We will generate all relevant features needed for our classification and regression models, including:
- Categorical encodings
- Temporal features
- Interaction-ready variables
- Target variable for classification

At the end, we’ll export a clean `processed_data.csv` to be used for modeling.


In [2]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("data/raw/retail_sales_dataset.csv")
df['Date'] = pd.to_datetime(df['Date'])

# Preview
df.head()


| | Transaction ID | Date | Customer ID | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | CUST002 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | CUST003 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | CUST004 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | CUST005 | Male | 30 | Beauty | 2 | 50 | 100 |


In [3]:
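# Age Group Feature
# We segment customer ages into 4 groups: `<25`, `25-40`, `40-60`, and `60+`.
# With right=False each bin is left-closed: [0, 25), [25, 40), [40, 60), [60, 100),
# so an age of exactly 40 lands in `40-60`, and ages of 100 or more become NaN.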

bins = [0, 25, 40, 60, 100]
labels = ['<25', '25-40', '40-60', '60+']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
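
The labels `25-40` and `40-60` overlap at 40, so it helps to confirm where boundary ages land. The sketch below reuses the same `bins` and `labels` on a handful of made-up edge-case ages (not taken from the dataset); `edge_check` is a name introduced here for illustration.

```python
import pandas as pd

# Toy ages chosen to sit on the bin edges (illustrative, not from the dataset)
ages = pd.Series([24, 25, 39, 40, 59, 60])

bins = [0, 25, 40, 60, 100]
labels = ['<25', '25-40', '40-60', '60+']

# right=False makes each interval left-closed: [0, 25), [25, 40), [40, 60), [60, 100)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Map each age to the label it received
edge_check = dict(zip(ages.tolist(), age_group.astype(str).tolist()))
print(edge_check)
```

Each boundary age falls into the bin that starts at that age, for example 25 goes to `25-40` and 40 goes to `40-60`.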


In [4]:
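# High Spender Target (for classification)
# We label transactions whose `Total Amount` is at or above the 75th percentile as "high spenders".
# Because the comparison is >=, ties at the threshold can push the flagged share above 25%.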

threshold = df['Total Amount'].quantile(0.75)
df['High Spender'] = (df['Total Amount'] >= threshold).astype(int)
print(f"High Spender threshold: ${threshold:.2f}")



High Spender threshold: $900.00
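
When many transactions share the same `Total Amount`, a `>=` comparison against the 75th percentile can flag more than a quarter of the rows. The following sketch shows the effect on made-up amounts (illustrative values only, not the real data); on the real dataset, checking `df['High Spender'].mean()` would give the actual class balance.

```python
import pandas as pd

# Made-up amounts with repeated values, for illustration only
amounts = pd.Series([30, 100, 150, 500, 900, 900, 900, 1000])

# Same rule as in the notebook: flag everything at or above the 75th percentile
threshold = amounts.quantile(0.75)
high_spender = (amounts >= threshold).astype(int)

# Share of rows flagged as high spenders
share = high_spender.mean()
print(f"threshold={threshold}, share flagged={share:.2f}")
```

Here half of the toy rows are flagged, because three of them tie exactly at the threshold.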


In [6]:
# Temporal Features

df['Month'] = df['Date'].dt.month
df['Day of Week'] = df['Date'].dt.dayofweek  # 0 = Monday, 6 = Sunday
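
Since `Day of Week` is meant to help with weekday/weekend trends, one possible follow-up is an explicit weekend flag. This is only a sketch: the `is_weekend` variable is hypothetical and is not part of the exported dataset. The dates are three of the preview rows shown earlier.

```python
import pandas as pd

# Three dates taken from the preview rows: a Friday, a Sunday, and a Saturday
dates = pd.to_datetime(pd.Series(["2023-11-24", "2023-05-21", "2023-05-06"]))

# dayofweek: 0 = Monday, ..., 6 = Sunday
day_of_week = dates.dt.dayofweek

# Hypothetical weekend flag: Saturday (5) and Sunday (6) map to 1
is_weekend = (day_of_week >= 5).astype(int)
print(pd.DataFrame({"date": dates, "day_of_week": day_of_week, "is_weekend": is_weekend}))
```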

In [7]:
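# Average Price per Item
# Note: in the preview rows above, Total Amount = Quantity * Price per Unit,
# so this column matches `Price per Unit` there.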

df['Avg Price per Item'] = df['Total Amount'] / df['Quantity']
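
Before feeding both `Avg Price per Item` and `Price per Unit` into a model, it is worth checking whether they carry the same information. The sketch below runs that check on the five preview rows only; whether it holds for all 1,000 rows would need to be verified on the full `df`. The `is_redundant` name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The five preview rows from the raw dataset
preview = pd.DataFrame({
    "Quantity": [3, 2, 1, 1, 2],
    "Price per Unit": [50, 500, 30, 500, 50],
    "Total Amount": [150, 1000, 30, 500, 100],
})

avg_price = preview["Total Amount"] / preview["Quantity"]

# True when the derived column equals Price per Unit in every row
is_redundant = bool(np.isclose(avg_price, preview["Price per Unit"]).all())
print("Avg Price per Item duplicates Price per Unit:", is_redundant)
```

If the same check passes on the full dataset, keeping only one of the two columns avoids perfect collinearity in linear models.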


In [9]:
# One-Hot Encoding
# We create one-hot encoded variables for `Gender`, `Product Category`, and `Age Group`.

df_encoded = pd.get_dummies(df, columns=['Gender', 'Product Category', 'Age Group'], drop_first=True)
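
To see what `drop_first=True` does, the sketch below encodes a three-row toy frame (made-up rows using the dataset's category names). For string columns `pd.get_dummies` orders the levels alphabetically and drops the first one, which becomes the implicit reference level.

```python
import pandas as pd

# Toy rows using the dataset's category names (illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Product Category": ["Beauty", "Clothing", "Electronics"],
})

encoded = pd.get_dummies(toy, columns=["Gender", "Product Category"], drop_first=True)

# 'Female' and 'Beauty' have no column of their own: they are the reference levels
print(encoded.columns.tolist())
print(encoded)
```

A row whose `Product Category_*` columns are all `False` therefore represents a `Beauty` purchase.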


In [10]:
# Numeric Encodings (for Regression)

# Gender numeric
df_encoded['Gender_Num'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Age group numeric
age_map = {'<25': 1, '25-40': 2, '40-60': 3, '60+': 4}
df_encoded['AgeGroup_Num'] = df['Age Group'].map(age_map)

# Product category numeric
product_map = {'Clothing': 1, 'Electronics': 2, 'Beauty': 3}
df_encoded['ProductCategory_Num'] = df['Product Category'].map(product_map)



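To simplify linear regression modeling, we also create integer encodings for:
- Gender: Male = 0, Female = 1
- Age Group: `<25` = 1, `25-40` = 2, `40-60` = 3, `60+` = 4
- Product Category: Clothing = 1, Electronics = 2, Beauty = 3

Gender is binary and Age Group is ordered, so their codes are meaningful to a linear model. Product Category has no natural order, so its codes impose an arbitrary ranking; the one-hot columns are the safer choice for that variable.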
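
One thing to watch with dictionary-based encodings: `Series.map` returns `NaN` for any value missing from the dictionary, without raising an error. The sketch below shows a simple guard; the `'Toys'` category is hypothetical and does not appear in the dataset preview.

```python
import pandas as pd

product_map = {'Clothing': 1, 'Electronics': 2, 'Beauty': 3}

# 'Toys' is a hypothetical category used only to trigger the unmapped case
categories = pd.Series(['Clothing', 'Beauty', 'Toys'])

codes = categories.map(product_map)

# Collect the values that had no entry in the mapping
unmapped = categories[codes.isna()].unique().tolist()
print("Unmapped categories:", unmapped)
```

Running the same check on `df['Product Category']` after mapping would confirm that every category in the real data is covered.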


In [11]:
# Preview of the processed dataset
print("Dataset Shape:", df_encoded.shape)
print("\nColumn Names:\n", df_encoded.columns.tolist())

# Show first 5 rows
df_encoded.head()


Dataset Shape: (1000, 20)

Column Names:
 ['Transaction ID', 'Date', 'Customer ID', 'Age', 'Quantity', 'Price per Unit', 'Total Amount', 'High Spender', 'Month', 'Day of Week', 'Avg Price per Item', 'Gender_Male', 'Product Category_Clothing', 'Product Category_Electronics', 'Age Group_25-40', 'Age Group_40-60', 'Age Group_60+', 'Gender_Num', 'AgeGroup_Num', 'ProductCategory_Num']


| | Transaction ID | Date | Customer ID | Age | Quantity | Price per Unit | Total Amount | High Spender | Month | Day of Week | Avg Price per Item | Gender_Male | Product Category_Clothing | Product Category_Electronics | Age Group_25-40 | Age Group_40-60 | Age Group_60+ | Gender_Num | AgeGroup_Num | ProductCategory_Num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | 34 | 3 | 50 | 150 | 0 | 11 | 4 | 50.0 | True | False | False | True | False | False | 0 | 2 | 3 |
| 1 | 2 | 2023-02-27 | CUST002 | 26 | 2 | 500 | 1000 | 1 | 2 | 0 | 500.0 | False | True | False | True | False | False | 1 | 2 | 1 |
| 2 | 3 | 2023-01-13 | CUST003 | 50 | 1 | 30 | 30 | 0 | 1 | 4 | 30.0 | True | False | True | False | True | False | 0 | 3 | 2 |
| 3 | 4 | 2023-05-21 | CUST004 | 37 | 1 | 500 | 500 | 0 | 5 | 6 | 500.0 | True | True | False | True | False | False | 0 | 2 | 1 |
| 4 | 5 | 2023-05-06 | CUST005 | 30 | 2 | 50 | 100 | 0 | 5 | 5 | 50.0 | True | False | False | True | False | False | 0 | 2 | 3 |


## Feature Engineering Decisions

| Feature | Type | Reason |
|--------|------|--------|
| `Age Group` | Categorical | Groups ages into `<25`, `25-40`, `40-60`, `60+` (left-closed bins) so that age-by-price interaction effects can be evaluated in regression. |
| `High Spender` | Binary | Target variable for classification. Set to 1 if `Total Amount` is at or above its 75th percentile ($900.00). |
| `Month` | Numeric | Temporal feature to explore monthly patterns or seasonality. |
| `Day of Week` | Numeric | Encoded 0 = Monday to 6 = Sunday. Helps identify weekday/weekend trends. |
| `Avg Price per Item` | Numeric | Computed as `Total Amount / Quantity`. Matches `Price per Unit` in the preview rows, so check for redundancy before using both. |
| `Gender_*`, `Product Category_*`, `Age Group_*` | One-hot encoded categorical variables | Useful for classification models and non-linear ML algorithms. `drop_first=True` drops one reference level per variable (`Female`, `Beauty`, `<25`) to avoid the dummy variable trap. |
| `Gender_Num`, `AgeGroup_Num`, `ProductCategory_Num` | Numeric (label encoded) | Added as compact single-column alternatives for regression models. The `Product Category` codes are arbitrary and imply an order that does not exist. |

---

The processed dataset now includes all the features needed for modeling:

- **Targets:**  
  - `High Spender`: Binary target for classification  
  - `Quantity`: Target for regression  

- **Numeric Features:**  
  `Price per Unit`, `Avg Price per Item`, `Month`, `Day of Week` (`Month` and `Day of Week` are discrete calendar values stored as integers)

- **Categorical Features:**  
  Both one-hot encoded variables and numeric-encoded versions of gender, product category, and age group for flexibility across modeling approaches.
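
As a rough sketch of how these columns might be split for modeling, the example below builds a feature matrix and the two targets from two preview rows. The choice of `feature_cols` is an assumption made for illustration, not the project's final feature set. `Total Amount` is left out because `High Spender` is derived from it directly, and in the preview rows it is also the product of `Quantity` and `Price per Unit`, so using it as a predictor would leak either target.

```python
import pandas as pd

# Two preview rows from df_encoded, restricted to a few columns
sample = pd.DataFrame({
    "Price per Unit": [50, 500],
    "Month": [11, 2],
    "Day of Week": [4, 0],
    "Quantity": [3, 2],
    "Total Amount": [150, 1000],
    "High Spender": [0, 1],
})

# Illustrative feature choice (an assumption, not the notebook's final list)
feature_cols = ["Price per Unit", "Month", "Day of Week"]

X = sample[feature_cols]          # predictors, without Total Amount
y_class = sample["High Spender"]  # classification target
y_reg = sample["Quantity"]        # regression target

print(X.shape, y_class.tolist(), y_reg.tolist())
```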


In [12]:
# Export the final processed data
output_path = "data/processed/processed_data.csv"
df_encoded.to_csv(output_path, index=False)
print(f"Processed data saved to {output_path}")


Processed data saved to data/processed/processed_data.csv
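
CSV files do not store dtypes, so the `Date` column will not come back as a datetime unless it is parsed again on load. The sketch below demonstrates this with a tiny stand-in frame written to a temporary directory (the frame and path are illustrative, not the real export).

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for df_encoded (illustrative values)
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2023-11-24", "2023-02-27"]),
    "High Spender": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_data.csv")
    frame.to_csv(path, index=False)

    # Without parse_dates, the Date column is read back as plain text
    plain = pd.read_csv(path)
    # With parse_dates, the Date column is converted to datetime again
    parsed = pd.read_csv(path, parse_dates=["Date"])

print("Without parse_dates:", plain["Date"].dtype)
print("With parse_dates:", parsed["Date"].dtype)
```

The modeling notebook can apply the same `parse_dates=["Date"]` argument when it loads `processed_data.csv`.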
