# 04 - Feature Engineering Basics

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, which often improves accuracy on unseen data.

## Learning Objectives

By the end of this notebook, you will be able to:

- Understand what feature engineering is and why it matters
- Create interaction features using polynomial expansion and manual multiplication
- Apply log transforms to handle skewed distributions
- Bin continuous features using `pd.cut` and `pd.qcut`
- Extract useful features from datetime columns
- Derive basic text features (word count, string length)

## Prerequisites

- Python fundamentals (lists, dictionaries, functions)
- NumPy and Pandas basics
- Basic understanding of train/test splits (Notebooks 01-03)
- Familiarity with Matplotlib for plotting

## Table of Contents

1. [What Is Feature Engineering?](#1-what-is-feature-engineering)
2. [Interaction Features](#2-interaction-features)
3. [Log Transforms for Skewed Data](#3-log-transforms-for-skewed-data)
4. [Binning Continuous Features](#4-binning-continuous-features)
5. [Date-Time Feature Extraction](#5-date-time-feature-extraction)
6. [Text Feature Basics](#6-text-feature-basics)
7. [Putting It All Together](#7-putting-it-all-together)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercise](#9-exercise)

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')  # this style name requires Matplotlib 3.6 or newer

print("Setup complete.")

---

## 1. What Is Feature Engineering?

**Feature engineering** is the art and science of creating new input variables from existing data to improve model performance.

**Why it matters:**
- Raw data is rarely in the optimal format for a model
- Good features can make simple models outperform complex ones
- Domain knowledge encoded as features gives models a head start
- It is often the single most impactful step in an ML pipeline

**Types of feature engineering:**
- **Interaction features** - combining two or more features
- **Transformations** - log, sqrt, Box-Cox to fix distributions
- **Binning** - converting continuous values to categories
- **Temporal extraction** - pulling year, month, day from dates
- **Text extraction** - word counts, lengths, pattern matching

---

## 2. Interaction Features

Interaction features capture relationships **between** existing features that the model might not discover on its own (especially for linear models).

For two features $x_1$ and $x_2$, interactions include:
- $x_1 \times x_2$ (product)
- $x_1^2$, $x_2^2$ (polynomial terms)
- $x_1^2 \times x_2$, $x_1 \times x_2^2$ (higher-order)
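
To see why a product term can matter for a linear model, the sketch below fits `LinearRegression` twice on synthetic data whose target is, by construction, the product of two inputs. The data, the noise level, and the names `X_raw`, `X_inter`, `r2_raw`, and `r2_inter` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends only on x1 * x2
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 2))
y = X_raw[:, 0] * X_raw[:, 1] + 0.1 * rng.normal(size=500)

# Model 1: raw features only
r2_raw = LinearRegression().fit(X_raw, y).score(X_raw, y)

# Model 2: raw features plus the manual interaction column x1 * x2
X_inter = np.column_stack([X_raw, X_raw[:, 0] * X_raw[:, 1]])
r2_inter = LinearRegression().fit(X_inter, y).score(X_inter, y)

print(f"R^2 without interaction: {r2_raw:.3f}")
print(f"R^2 with interaction:    {r2_inter:.3f}")
```

The linear model has no way to represent the product from `x1` and `x2` alone, so its fit should improve sharply once the product is supplied as a column. A local `default_rng` is used so the notebook's global random seed is left untouched.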

### 2.1 Manual Interaction Features

In [None]:
# Create a small dataset: house features
df_house = pd.DataFrame({
    'length': [30, 40, 35, 50, 45],
    'width':  [20, 25, 22, 30, 28],
    'floors': [1, 2, 1, 3, 2]
})

# Manual interaction: area = length * width
df_house['area'] = df_house['length'] * df_house['width']

# Manual interaction: total_living_space = area * floors
df_house['total_living_space'] = df_house['area'] * df_house['floors']

# Ratio feature: aspect_ratio = length / width
df_house['aspect_ratio'] = df_house['length'] / df_house['width']

print("House dataset with engineered features:")
df_house

### 2.2 Polynomial Features with sklearn

`PolynomialFeatures` automatically generates all polynomial combinations up to a specified degree.

In [None]:
# Original features
X = df_house[['length', 'width']].values
print("Original features shape:", X.shape)
print("Original features (first 3 rows):")
print(X[:3])
print()

# Degree 2 polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Polynomial features shape:", X_poly.shape)
print("Feature names:", poly.get_feature_names_out())
print()
print("Polynomial features (first 3 rows):")
print(X_poly[:3])

In [None]:
# interaction_only=True: only cross-products, no powers
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X)

print("Interaction-only features:", poly_interact.get_feature_names_out())
print("Shape:", X_interact.shape)
print()
print("First 3 rows:")
print(X_interact[:3])

---

## 3. Log Transforms for Skewed Data

Many real-world features (income, house prices, populations) follow a **right-skewed** distribution. Log transforms can:
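- Reduce skewness and make distributions more symmetric (closer to normal)
- Stabilize variance
- Help linear models, which tend to fit better when relationships are closer to linear and residuals are roughly normal (the features themselves do not need to be normal)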

Common transforms:
- $\log(x)$ - natural log (requires $x > 0$)
- $\log(x + 1)$ - handles zeros (`np.log1p`; requires $x > -1$)
- $\sqrt{x}$ - milder than log

In [None]:
# Generate right-skewed data (simulating income)
np.random.seed(42)
income = np.random.lognormal(mean=10.5, sigma=0.8, size=2000)

print("Original income stats:")
print(f"  Mean:   ${income.mean():,.0f}")
print(f"  Median: ${np.median(income):,.0f}")
print(f"  Skew:   {pd.Series(income).skew():.2f}")
print()

# Apply log transform
income_log = np.log1p(income)

print("Log-transformed stats:")
print(f"  Mean:   {income_log.mean():.2f}")
print(f"  Median: {np.median(income_log):.2f}")
print(f"  Skew:   {pd.Series(income_log).skew():.2f}")

In [None]:
# Visualize before and after
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Before: original skewed distribution
axes[0].hist(income, bins=50, color='salmon', edgecolor='black', alpha=0.7)
axes[0].axvline(income.mean(), color='red', linestyle='--', label=f'Mean: ${income.mean():,.0f}')
axes[0].axvline(np.median(income), color='blue', linestyle='--', label=f'Median: ${np.median(income):,.0f}')
axes[0].set_title('Before: Original Income (Right-Skewed)', fontsize=12)
axes[0].set_xlabel('Income ($)')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# After: log-transformed distribution
axes[1].hist(income_log, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[1].axvline(income_log.mean(), color='red', linestyle='--', label=f'Mean: {income_log.mean():.2f}')
axes[1].axvline(np.median(income_log), color='blue', linestyle='--', label=f'Median: {np.median(income_log):.2f}')
axes[1].set_title('After: Log-Transformed Income', fontsize=12)
axes[1].set_xlabel('log(Income + 1)')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

**Key observations:**
- The original distribution has a long right tail (high skew)
- After the log transform, the distribution is approximately normal (skew near 0), as expected for data drawn from a lognormal distribution
- Mean and median nearly coincide after the transform, which is consistent with a symmetric distribution
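
To put a number on "milder than log", the following sketch compares the sample skewness of raw, square-root, and `log1p` versions of the same kind of synthetic income. It uses a local random generator, so the values will differ slightly from the cell above; the names `sample` and `skews` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic lognormal income, same parameters as the earlier cell
rng = np.random.default_rng(42)
sample = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=2000))

skews = {
    'raw': sample.skew(),
    'sqrt': np.sqrt(sample).skew(),
    'log1p': np.log1p(sample).skew(),
}
for name, value in skews.items():
    print(f"{name:>5}: skew = {value:.2f}")
```

On data like this, the square root should reduce skewness only partially, while the log brings it close to zero.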

---

## 4. Binning Continuous Features

**Binning** (discretization) converts continuous features into categorical ones. This can:
- Capture non-linear relationships for linear models
- Reduce the impact of outliers
- Create interpretable categories (e.g., age groups)

Two main approaches:
- `pd.cut` - bins defined on the value scale: equal-width when `bins` is an integer, or custom edges when `bins` is a list
- `pd.qcut` - equal-frequency bins (roughly the same number of observations per bin)

In [None]:
# Create sample data: ages
np.random.seed(42)
ages = np.random.randint(18, 80, size=100)
df_age = pd.DataFrame({'age': ages})

# pd.cut with explicit, domain-defined edges (these bins are not equal-width);
# intervals are right-inclusive by default: (0, 25], (25, 35], ...
df_age['age_bin_equal_width'] = pd.cut(
    df_age['age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['18-25', '26-35', '36-50', '51-65', '66+']
)

print("pd.cut (custom bin edges):")
print(df_age['age_bin_equal_width'].value_counts().sort_index())
print()

In [None]:
# pd.qcut: equal-frequency bins (quantile-based)
df_age['age_bin_quantile'] = pd.qcut(
    df_age['age'], 
    q=4,  # 4 bins = quartiles
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

print("pd.qcut (equal-frequency bins, quartiles):")
print(df_age['age_bin_quantile'].value_counts().sort_index())
print()
print("Note: each bin has roughly the same number of observations.")

In [None]:
# Visualize both binning strategies
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df_age['age_bin_equal_width'].value_counts().sort_index().plot(
    kind='bar', ax=axes[0], color='coral', edgecolor='black'
)
axes[0].set_title('pd.cut: Custom Bin Edges', fontsize=12)
axes[0].set_xlabel('Age Group')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

df_age['age_bin_quantile'].value_counts().sort_index().plot(
    kind='bar', ax=axes[1], color='steelblue', edgecolor='black'
)
axes[1].set_title('pd.qcut: Equal-Frequency Bins', fontsize=12)
axes[1].set_xlabel('Quantile Group')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

**When to use each:**
- `pd.cut` - when domain-specific boundaries matter (e.g., age groups, tax brackets)
- `pd.qcut` - when you want balanced groups regardless of data distribution
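
The sketch below contrasts the two ways of calling `pd.cut` and shows how the `right` parameter decides which bin a boundary value falls into. The ages and the variable names are made up for illustration.

```python
import pandas as pd

ages_demo = pd.Series([18, 25, 26, 35, 50, 79])  # made-up ages
edges = [0, 25, 35, 50, 65, 100]
labels = ['18-25', '26-35', '36-50', '51-65', '66+']

# Integer `bins`: pandas splits the observed range into equal-width intervals
equal_width = pd.cut(ages_demo, bins=3)

# Explicit edges, default right=True: intervals are (0, 25], (25, 35], ...
right_closed = pd.cut(ages_demo, bins=edges, labels=labels)

# right=False closes the left side instead: [0, 25), [25, 35), ...
# (the labels above were written for right-closed bins, so they no longer fit)
left_closed = pd.cut(ages_demo, bins=edges, labels=labels, right=False)

print(pd.DataFrame({
    'age': ages_demo,
    'equal_width': equal_width,
    'right_closed': right_closed,
    'left_closed': left_closed,
}))
```

An age of exactly 25 lands in `18-25` with the default setting but in the next bin with `right=False`, so labels and the closed side should be chosen together.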

---

## 5. Date-Time Feature Extraction

Most models cannot use raw datetime values directly, so we extract meaningful components:
- **Year, month, day** - capture seasonality and trends
- **Day of week** - distinguish weekdays from weekends
- **Hour** - time-of-day patterns
- **Is weekend** - binary flag for weekend behavior
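
The order data used in this section is sampled at daily resolution and carries no time of day, so the **Hour** component needs timestamps with a time part. A minimal sketch with three made-up timestamps (the names `stamps` and `hourly` are illustrative):

```python
import pandas as pd

# Made-up timestamps that include a time of day
stamps = pd.Series(pd.to_datetime([
    '2023-03-04 08:15',   # Saturday morning
    '2023-03-06 13:40',   # Monday afternoon
    '2023-03-10 22:05',   # Friday night
]))

hourly = pd.DataFrame({
    'timestamp': stamps,
    'hour': stamps.dt.hour,                      # 0-23
    'day_of_week': stamps.dt.dayofweek,          # 0=Monday, 6=Sunday
    'is_weekend': stamps.dt.dayofweek.isin([5, 6]).astype(int),
})
print(hourly)
```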

In [None]:
# Create sample datetime data (simulating order timestamps)
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', periods=365, freq='D')
df_dates = pd.DataFrame({
    'order_date': np.random.choice(dates, size=200),
    'amount': np.random.lognormal(mean=4, sigma=0.5, size=200)
})

# Ensure the column is datetime type
df_dates['order_date'] = pd.to_datetime(df_dates['order_date'])

print("Raw data (first 5 rows):")
df_dates.head()

In [None]:
# Extract datetime features
df_dates['year'] = df_dates['order_date'].dt.year
df_dates['month'] = df_dates['order_date'].dt.month
df_dates['day'] = df_dates['order_date'].dt.day
df_dates['day_of_week'] = df_dates['order_date'].dt.dayofweek  # 0=Monday, 6=Sunday
df_dates['day_name'] = df_dates['order_date'].dt.day_name()
df_dates['is_weekend'] = df_dates['day_of_week'].isin([5, 6]).astype(int)
df_dates['quarter'] = df_dates['order_date'].dt.quarter

print("With extracted features:")
df_dates.head(10)

In [None]:
# Analyze: average order amount by day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_by_day = df_dates.groupby('day_name')['amount'].mean().reindex(day_order)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

colors = ['steelblue'] * 5 + ['coral'] * 2  # weekdays blue, weekends coral
avg_by_day.plot(kind='bar', ax=axes[0], color=colors, edgecolor='black')
axes[0].set_title('Average Order Amount by Day of Week', fontsize=12)
axes[0].set_ylabel('Average Amount ($)')
axes[0].tick_params(axis='x', rotation=45)

monthly = df_dates.groupby('month')['amount'].mean()
monthly.plot(kind='line', ax=axes[1], marker='o', color='steelblue')
axes[1].set_title('Average Order Amount by Month', fontsize=12)
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Average Amount ($)')

plt.tight_layout()
plt.show()

---

## 6. Text Feature Basics

Even simple text features can add predictive power. Here are a few quick examples (full NLP is a separate topic):
- **Word count** - number of words in a text field
- **Character length** - total string length
- **Contains keyword** - binary flag for specific words

In [None]:
# Sample product reviews
df_text = pd.DataFrame({
    'review': [
        'Great product, love it!',
        'Terrible quality. Broke after one day. Very disappointed with this purchase.',
        'OK',
        'Amazing value for money. Would definitely recommend to friends and family.',
        'Not worth it.',
        'Exceeded expectations! The build quality is superb and shipping was fast.'
    ],
    'rating': [5, 1, 3, 5, 2, 5]
})

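# Extract text features
df_text['word_count'] = df_text['review'].str.split().str.len()
df_text['char_length'] = df_text['review'].str.len()
# regex=False matches '!' as a literal character rather than a pattern
df_text['has_exclamation'] = df_text['review'].str.contains('!', regex=False).astype(int)
# Characters per word, counting spaces and punctuation (a rough proxy for word length)
df_text['avg_word_length'] = df_text['char_length'] / df_text['word_count']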

print("Text features:")
df_text

These simple features can be surprisingly useful. In this toy sample, the three longest reviews all carry extreme ratings (1 or 5), while the shortest one ("OK") is rated 3. Check whether such a pattern holds in your own data before relying on it.
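
The "contains keyword" flag from the list above can be sketched as follows. The keyword list is hypothetical, and the missing review is included to show one way of keeping the features numeric when text is absent.

```python
import pandas as pd

reviews = pd.Series([
    'Great product, love it!',
    'Terrible quality. Broke after one day.',
    'Would RECOMMEND to friends.',
    None,   # missing review
])

keywords = ['great', 'recommend']   # hypothetical keyword list

# Treat missing text as an empty string so every feature stays numeric
clean = reviews.fillna('')

features = pd.DataFrame({'word_count': clean.str.split().str.len()})
for kw in keywords:
    # case=False ignores capitalization; regex=False treats the keyword literally
    features[f'has_{kw}'] = clean.str.contains(kw, case=False, regex=False).astype(int)

print(features)
```

Plain substring matching also fires inside longer words (for example `great` inside `greatest`), so a real keyword list may need word-boundary handling.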

---

## 7. Putting It All Together

Let us create a synthetic dataset and apply multiple feature engineering techniques.

In [None]:
# Create synthetic e-commerce dataset
np.random.seed(42)
n = 500

dates = pd.date_range(start='2022-01-01', end='2023-12-31', periods=n)

df = pd.DataFrame({
    'order_date': dates,
    'quantity': np.random.randint(1, 20, size=n),
    'unit_price': np.random.lognormal(mean=3, sigma=0.7, size=n),  # skewed!
    'customer_age': np.random.randint(18, 75, size=n),
})

print("Original dataset:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nunit_price skewness: {df['unit_price'].skew():.2f}")

In [None]:
# --- Apply feature engineering ---

# 1. Interaction feature: total_revenue
df['total_revenue'] = df['quantity'] * df['unit_price']

# 2. Log transform of skewed columns
df['log_unit_price'] = np.log1p(df['unit_price'])
df['log_total_revenue'] = np.log1p(df['total_revenue'])

# 3. Binning: age groups
df['age_group'] = pd.cut(
    df['customer_age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['18-25', '26-35', '36-50', '51-65', '66+']
)

# 4. Datetime features
df['month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['order_date'].dt.quarter

print("Engineered dataset:")
print(df.head())
print(f"\nNew shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

In [None]:
# Visualize the impact of feature engineering
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# unit_price before/after log
axes[0, 0].hist(df['unit_price'], bins=40, color='salmon', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('unit_price (Original - Skewed)', fontsize=11)

axes[0, 1].hist(df['log_unit_price'], bins=40, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('log_unit_price (Log-Transformed)', fontsize=11)

# Age group distribution
df['age_group'].value_counts().sort_index().plot(
    kind='bar', ax=axes[1, 0], color='mediumpurple', edgecolor='black'
)
axes[1, 0].set_title('Age Group Distribution', fontsize=11)
axes[1, 0].tick_params(axis='x', rotation=0)

# Weekend vs weekday revenue
df.groupby('is_weekend')['total_revenue'].mean().plot(
    kind='bar', ax=axes[1, 1], color=['steelblue', 'coral'], edgecolor='black'
)
axes[1, 1].set_title('Avg Revenue: Weekday vs Weekend', fontsize=11)
axes[1, 1].set_xticklabels(['Weekday', 'Weekend'], rotation=0)

plt.tight_layout()
plt.show()

---

## 8. Common Mistakes

### Mistake 1: Overfitting with Too Many Engineered Features

Creating too many features (especially high-degree polynomials) can lead to overfitting. The model memorizes noise rather than learning signal.

**Rule of thumb:** Start simple. Only add features that have a plausible domain reason to help.
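
The number of polynomial terms can be computed before building anything: for $n$ features and degree $d$ there are $\binom{n+d}{d}$ monomials of degree at most $d$, or one fewer when the constant (bias) column is excluded. The helper `n_poly_features` below is a hypothetical name introduced for this sketch.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    """Column count for PolynomialFeatures(degree, include_bias=False)."""
    return comb(n_features + degree, degree) - 1


for degree in [2, 3, 4, 5]:
    print(f"degree {degree}: {n_poly_features(5, degree)} features")

# Cross-check against scikit-learn on a tiny placeholder array
check = PolynomialFeatures(degree=5, include_bias=False).fit(np.zeros((1, 5)))
print("sklearn count:", check.n_output_features_)
```

Running this count first gives a quick sense of whether the expanded matrix will have more columns than the dataset has rows.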

In [None]:
# Demonstration: feature explosion with high-degree polynomials
X_small = np.random.randn(100, 5)  # 5 original features

for degree in [2, 3, 4, 5]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_expanded = poly.fit_transform(X_small)
    print(f"Degree {degree}: {X_small.shape[1]} features -> {X_expanded.shape[1]} features")

print("\nWith 5 features, degree 5 creates 251 features from 100 samples!")
print("This is a recipe for overfitting.")

### Mistake 2: Not Applying the Same Transforms to Test Data

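If you engineer features on the training set, you **must** apply the exact same transformations to the test set. Otherwise, the model sees different feature spaces at train and test time. Any parameter a transform learns from data (bin edges, scaling statistics) should be estimated on the training set only and then reused unchanged on the test set.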

In [None]:
from sklearn.model_selection import train_test_split

# Split first, then engineer features identically
np.random.seed(42)
data = pd.DataFrame({
    'price': np.random.lognormal(5, 1, 200),
    'quantity': np.random.randint(1, 50, 200)
})
y = (data['price'] * data['quantity'] > 5000).astype(int)

train_data, test_data, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=42
)

# CORRECT: Apply same transforms to both
def engineer_features(df):
    """Apply identical feature engineering to any dataframe."""
    df = df.copy()
    df['log_price'] = np.log1p(df['price'])
    df['total'] = df['price'] * df['quantity']
    df['log_total'] = np.log1p(df['total'])
    return df

train_fe = engineer_features(train_data)
test_fe = engineer_features(test_data)

print("Train columns:", list(train_fe.columns))
print("Test columns: ", list(test_fe.columns))
print("\nColumns match:", list(train_fe.columns) == list(test_fe.columns))
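
A scikit-learn `Pipeline` is another way to keep train and test processing identical, and it also covers transforms that learn parameters. The sketch below is one possible arrangement, not the only one: it regenerates similar synthetic data with a local generator, and the choice of `StandardScaler` and `LogisticRegression` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data following the same recipe as the cell above
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'price': rng.lognormal(5, 1, 200),
    'quantity': rng.integers(1, 50, 200),
})
target = (demo['price'] * demo['quantity'] > 5000).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=42)

pipe = make_pipeline(
    FunctionTransformer(np.log1p),   # stateless: same formula for any data
    StandardScaler(),                # stateful: mean/std are learned in fit()
    LogisticRegression(),
)
pipe.fit(X_tr, y_tr)                 # scaling statistics come from training rows only
print(f"Test accuracy: {pipe.score(X_te, y_te):.2f}")
```

Calling `pipe.score` or `pipe.predict` on the test set reuses the statistics learned during `fit`, so there is no separate step in which test data could be transformed differently.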

### Summary of Common Mistakes

| Mistake | Why It Is Bad | Fix |
|---------|---------------|-----|
| Too many polynomial features | Overfitting, curse of dimensionality | Keep degree low (2-3), use regularization |
| Different transforms on train/test | Model sees inconsistent features | Use a function or Pipeline for transforms |
| Log transform without handling zeros | `log(0)` is undefined (`np.log(0)` returns `-inf`) | Use `np.log1p(x)` instead of `np.log(x)` when values can be zero |
| Binning with test-set-derived boundaries | Data leakage | Compute bin edges from train set only |
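
The last row of the table can be sketched with `pd.qcut(..., retbins=True)`, which returns the quantile edges so they can be reused on other data. The ages and all variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_age = pd.Series(rng.integers(18, 80, size=200))   # made-up training ages
test_age = pd.Series(rng.integers(18, 80, size=50))     # made-up test ages

# Learn quartile edges from the training set only
train_bin, edges = pd.qcut(train_age, q=4, labels=False, retbins=True)

# Open the outer edges so test values outside the training range still get a bin
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Reuse the training edges on the test set (no quantiles computed from test data)
test_bin = pd.cut(test_age, bins=open_edges, labels=False)

print("Edges learned on train:", edges)
print(test_bin.value_counts().sort_index())
```

Because the edges come from the training set, the test bins will not be perfectly balanced, which is expected and is the price of avoiding leakage.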

---

## 9. Exercise

**Task:** Given the dataset below, engineer at least 5 new features. Then split into train/test and verify that both sets have the same columns.

Suggested features to create:
1. Log-transform the `salary` column
2. Bin `years_experience` into groups (junior, mid, senior, lead)
3. Extract `month` and `is_weekend` from `hire_date`
4. Create an interaction feature: `salary_per_year_exp` = salary / (years_experience + 1)
5. Compute `name_length` from the `name` column

In [None]:
# Exercise starter code
np.random.seed(42)
n = 300

exercise_df = pd.DataFrame({
    'name': [f'Employee_{i}' for i in range(n)],
    'salary': np.random.lognormal(mean=11, sigma=0.5, size=n),
    'years_experience': np.random.randint(0, 30, size=n),
    'hire_date': pd.date_range('2015-01-01', periods=n, freq='5D'),
})

print("Exercise dataset:")
print(exercise_df.head())
print(f"\nShape: {exercise_df.shape}")
print(f"salary skewness: {exercise_df['salary'].skew():.2f}")

# YOUR CODE HERE
# 1. Log-transform salary
# exercise_df['log_salary'] = ...

# 2. Bin years_experience
# exercise_df['experience_level'] = ...

# 3. Extract datetime features
# exercise_df['hire_month'] = ...
# exercise_df['hire_is_weekend'] = ...

# 4. Interaction feature
# exercise_df['salary_per_year_exp'] = ...

# 5. Text feature
# exercise_df['name_length'] = ...

# 6. Split and verify columns match
# train_df, test_df = train_test_split(exercise_df, test_size=0.2, random_state=42)
# assert list(train_df.columns) == list(test_df.columns), "Column mismatch!"
# print("Columns match between train and test.")