# Feature Scaling and Engineering

This notebook explores the importance of feature scaling and feature engineering in machine learning. We'll cover:

1. Different scaling techniques (Min-Max, Standardization)
2. The effect of feature scaling on gradient descent convergence
3. Feature engineering techniques
4. How these techniques affect model performance

Let's begin!

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.preprocessing import StandardScaler as SklearnStandardScaler
from sklearn.preprocessing import MinMaxScaler as SklearnMinMaxScaler

# Add the parent directory to sys.path to import our custom modules
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our implementations
from models.linear_regression import LinearRegression
from utils.preprocessing import StandardScaler, MinMaxScaler
from utils.plotting import plot_learning_curve
from datasets.data_utils import train_test_split

# Set random seed for reproducibility
np.random.seed(42)

## 1. Creating Data with Different Scales

Let's first create some data with features on very different scales to demonstrate the importance of feature scaling.
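
To see why mismatched scales matter before training anything, here is a minimal sketch (not part of the notebook's modules) that measures the size of the mean-squared-error gradient for each feature at the starting point `w = 0`. The names `X_demo`, `y_demo`, and `grad_at_zero` are illustrative, the bias term is left out for brevity, and the data mimics the three ranges used in this section without noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, independent of np.random.seed above
n = 1000
X_demo = np.column_stack((rng.random(n), rng.random(n) * 100, rng.random(n) * 10000))
y_demo = 5 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + 0.0005 * X_demo[:, 2]

# Gradient of the MSE (1/n) * ||Xw - y||^2 at w = 0 is -(2/n) * X^T y
grad_at_zero = -2.0 / n * (X_demo.T @ y_demo)
grad_ratio = np.abs(grad_at_zero).max() / np.abs(grad_at_zero).min()

print("Gradient magnitude per feature:", np.abs(grad_at_zero))
print(f"Largest / smallest: {grad_ratio:.0f}")
```

The gradient component for the large-range feature is several orders of magnitude bigger than the one for the small-range feature, so a single learning rate cannot suit all three weights at once.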

In [None]:
# Generate data with features on different scales
n_samples = 1000

# Feature 1: Small range (0-1)
X1 = np.random.rand(n_samples)

# Feature 2: Medium range (0-100)
X2 = np.random.rand(n_samples) * 100

# Feature 3: Large range (0-10000)
X3 = np.random.rand(n_samples) * 10000

# Combine features
X = np.column_stack((X1, X2, X3))

# Generate target: each term spans roughly 0-5, so all three features contribute comparably despite their very different scales
y = 5 * X1 + 0.05 * X2 + 0.0005 * X3 + np.random.randn(n_samples) * 0.5

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Let's look at the ranges of each feature
print("Feature ranges:")
for i in range(X.shape[1]):
    print(f"Feature {i+1}: Min = {X[:, i].min():.4f}, Max = {X[:, i].max():.4f}, Range = {X[:, i].max() - X[:, i].min():.4f}")

# Visualize feature distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(X[:, 0], bins=30, alpha=0.7)
plt.title('Feature 1 Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.hist(X[:, 1], bins=30, alpha=0.7)
plt.title('Feature 2 Distribution')
plt.xlabel('Value')

plt.subplot(1, 3, 3)
plt.hist(X[:, 2], bins=30, alpha=0.7)
plt.title('Feature 3 Distribution')
plt.xlabel('Value')

plt.tight_layout()
plt.show()

## 2. Implementing Feature Scaling

Let's use our custom scalers to normalize the features and compare with scikit-learn's implementations.
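
As a reference for what the scalers compute, here is a minimal NumPy sketch of both transformations on a tiny hand-made array. It assumes our `utils.preprocessing` classes follow the same per-column formulas as scikit-learn (scikit-learn's `StandardScaler` uses the population standard deviation, `ddof=0`); the array `X_small` is purely illustrative.

```python
import numpy as np

X_small = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])

# Min-max scaling: (x - min) / (max - min), computed per column
col_min, col_max = X_small.min(axis=0), X_small.max(axis=0)
X_minmax = (X_small - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, computed per column (population std, ddof=0)
col_mean, col_std = X_small.mean(axis=0), X_small.std(axis=0)
X_standard = (X_small - col_mean) / col_std

print(X_minmax)    # each column runs from 0 to 1
print(X_standard)  # each column has mean 0 and standard deviation 1
```

In practice the statistics (`min`, `max`, `mean`, `std`) are computed on the training set only and then reused for the test set, which is what `fit_transform` followed by `transform` does in the cell below.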

In [None]:
# Our min-max scaler
our_mm_scaler = MinMaxScaler()
X_train_mm_our = our_mm_scaler.fit_transform(X_train)
X_test_mm_our = our_mm_scaler.transform(X_test)

# Our standard scaler
our_std_scaler = StandardScaler()
X_train_std_our = our_std_scaler.fit_transform(X_train)
X_test_std_our = our_std_scaler.transform(X_test)

# scikit-learn's min-max scaler
sk_mm_scaler = SklearnMinMaxScaler()
X_train_mm_sk = sk_mm_scaler.fit_transform(X_train)
X_test_mm_sk = sk_mm_scaler.transform(X_test)

# scikit-learn's standard scaler
sk_std_scaler = SklearnStandardScaler()
X_train_std_sk = sk_std_scaler.fit_transform(X_train)
X_test_std_sk = sk_std_scaler.transform(X_test)

# Verify that our implementation matches sklearn's
mm_diff = np.abs(X_train_mm_our - X_train_mm_sk).mean()
std_diff = np.abs(X_train_std_our - X_train_std_sk).mean()

print(f"Average difference for MinMaxScaler: {mm_diff:.10f}")
print(f"Average difference for StandardScaler: {std_diff:.10f}")

# Let's visualize the scaled features
# Use a 2x3 grid: our implementations on the top row, scikit-learn's directly below them
feature_labels = ['Feature 1', 'Feature 2', 'Feature 3']
panels = [
    (1, X_train, 'Original Features'),
    (2, X_train_mm_our, 'Min-Max Scaled (Our Implementation)'),
    (3, X_train_std_our, 'Standard Scaled (Our Implementation)'),
    (5, X_train_mm_sk, 'Min-Max Scaled (scikit-learn)'),
    (6, X_train_std_sk, 'Standard Scaled (scikit-learn)'),
]

plt.figure(figsize=(15, 10))

for position, data, title in panels:
    plt.subplot(2, 3, position)
    plt.boxplot(data)
    plt.title(title)
    plt.xticks([1, 2, 3], feature_labels)
    # Label the y-axis only on the leftmost plot of each row
    if position in (1, 5):
        plt.ylabel('Value')

plt.tight_layout()
plt.show()

## 3. Effect of Scaling on Gradient Descent

Let's compare how gradient descent behaves with and without feature scaling.
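
One way to anticipate the outcome is to look at the largest learning rate for which gradient descent stays stable. The sketch below assumes the cost is `J(w) = ||Xw - y||^2 / (2n)` with a bias column, in which case plain gradient descent converges only if the learning rate is below `2 / lambda_max` of the Hessian `X^T X / n`. Our `LinearRegression` class may normalize its cost differently, so treat the numbers as indicative only; `learning_rate_bound`, `X_raw`, and `X_std` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X_raw = np.column_stack((rng.random(n), rng.random(n) * 100, rng.random(n) * 10000))
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def learning_rate_bound(X):
    """Return (2 / lambda_max, condition number) of the Hessian X_b^T X_b / n."""
    X_b = np.column_stack((np.ones(len(X)), X))   # prepend a bias column
    hessian = X_b.T @ X_b / len(X)
    eigenvalues = np.linalg.eigvalsh(hessian)     # ascending order, real for symmetric input
    return 2.0 / eigenvalues[-1], eigenvalues[-1] / eigenvalues[0]

for name, data in [("raw", X_raw), ("standardized", X_std)]:
    lr_bound, condition_number = learning_rate_bound(data)
    print(f"{name:>12}: learning rate < {lr_bound:.2e}, condition number = {condition_number:.2e}")
```

Under these assumptions the raw features force a learning rate many orders of magnitude smaller than the standardized ones, and the large condition number means progress along the small-scale directions is very slow even when the run is stable.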

In [None]:
# Function to train model and track performance
def train_and_evaluate(X_train, y_train, X_test, y_test, learning_rate=0.01, max_iterations=500, description="Model"):
    model = LinearRegression(learning_rate=learning_rate, max_iterations=max_iterations, store_history=True)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    
    print(f"{description}:")
    print(f"  Weights: {model.weights}")
    print(f"  Bias: {model.bias:.4f}")
    print(f"  Training MSE: {train_mse:.4f}")
    print(f"  Test MSE: {test_mse:.4f}")
    print(f"  Iterations: {len(model.cost_history)}")
    
    return model, train_mse, test_mse

# Train models with different scaling
print("Training models with different scaling approaches...")

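# No scaling: needs a tiny learning rate and may still diverge or converge very slowly.
# NumPy reports overflow as a warning, so divergence usually shows up as inf/NaN costs rather than an exception.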
try:
    model_no_scale, train_mse_no_scale, test_mse_no_scale = train_and_evaluate(
        X_train, y_train, X_test, y_test, learning_rate=0.0000001, max_iterations=1000, 
        description="No scaling (very small learning rate)")
except Exception as e:
    print(f"No scaling model failed: {e}")
    model_no_scale = None

# Min-Max scaling
model_mm, train_mse_mm, test_mse_mm = train_and_evaluate(
    X_train_mm_our, y_train, X_test_mm_our, y_test, learning_rate=0.1, max_iterations=1000, 
    description="Min-Max scaling")

# Standard scaling
model_std, train_mse_std, test_mse_std = train_and_evaluate(
    X_train_std_our, y_train, X_test_std_our, y_test, learning_rate=0.1, max_iterations=1000, 
    description="Standard scaling")

# Plot learning curves
plt.figure(figsize=(12, 6))

if model_no_scale is not None:
    plt.plot(model_no_scale.cost_history, label='No scaling')
    
plt.plot(model_mm.cost_history, label='Min-Max scaling')
plt.plot(model_std.cost_history, label='Standard scaling')

plt.title('Learning Curves with Different Scaling Methods')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.grid(True)
plt.show()

## 4. Feature Engineering: Creating New Features

Feature engineering involves creating new features from existing ones to improve model performance.
Let's explore some common feature engineering techniques.
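
As a small illustration of why an engineered feature can help a linear model, the sketch below fits a purely quadratic relationship twice with ordinary least squares: once using `x` alone and once using `x` together with `x^2`. It uses `np.linalg.lstsq` rather than our gradient-descent model to keep the example short, and the data and the helper `least_squares_mse` are made up for this purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(scale=0.1, size=200)   # quadratic relationship plus a little noise

def least_squares_mse(features, target):
    """Fit a linear model with a bias term by least squares and return its training MSE."""
    A = np.column_stack((np.ones(len(target)), features))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.mean((A @ coef - target) ** 2)

mse_linear = least_squares_mse(x, y)                               # feature: x only
mse_squared = least_squares_mse(np.column_stack((x, x ** 2)), y)   # features: x and x^2

print(f"MSE with x only:    {mse_linear:.3f}")
print(f"MSE with x and x^2: {mse_squared:.3f}")
```

The model stays linear in its weights; only the inputs change. That is the idea behind the polynomial, interaction, and log features added in the next cell.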

In [None]:
# Load California Housing dataset
california = fetch_california_housing()
X_cal = california.data
y_cal = california.target
feature_names = california.feature_names

print("California Housing Dataset:")
print(f"Number of samples: {X_cal.shape[0]}")
print(f"Number of features: {X_cal.shape[1]}")
print(f"Feature names: {feature_names}")

# Split data
X_train_cal, X_test_cal, y_train_cal, y_test_cal = train_test_split(X_cal, y_cal, test_size=0.2, random_state=42)

# Scale data
scaler = StandardScaler()
X_train_cal_scaled = scaler.fit_transform(X_train_cal)
X_test_cal_scaled = scaler.transform(X_test_cal)

# Train baseline model
model_baseline, train_mse_baseline, test_mse_baseline = train_and_evaluate(
    X_train_cal_scaled, y_train_cal, X_test_cal_scaled, y_test_cal, 
    learning_rate=0.1, max_iterations=1000, description="Baseline model")

# Feature engineering techniques
def add_polynomial_features(X, degree=2):
    """Append squared, cubic, and pairwise interaction terms, up to the given degree (at most 3)."""
    n_features = X.shape[1]
    columns = [X]
    
    # Add squared terms (x^2)
    if degree >= 2:
        columns.extend(X[:, i]**2 for i in range(n_features))
    
    # Add cubic terms (x^3)
    if degree >= 3:
        columns.extend(X[:, i]**3 for i in range(n_features))
    
    # Add interaction terms (x_i * x_j); these are degree-2 terms, so skip them for degree 1
    if degree >= 2:
        for i in range(n_features):
            for j in range(i+1, n_features):
                columns.append(X[:, i] * X[:, j])
    
    # Stack once at the end instead of copying the growing array on every iteration
    return np.column_stack(columns)

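def add_log_features(X, columns=None):
    """Append log-transformed copies of the given columns.

    If columns is None, every strictly positive column of X is used. Pass the
    columns selected on the training set when transforming the test set so that
    both end up with the same features.
    """
    if columns is None:
        columns = [i for i in range(X.shape[1]) if np.all(X[:, i] > 0)]
    return np.column_stack([X] + [np.log(X[:, i]) for i in columns])

# Create new feature sets
X_train_poly = add_polynomial_features(X_train_cal, degree=2)
X_test_poly = add_polynomial_features(X_test_cal, degree=2)

# Choose the log columns from the training set only and reuse them for the test set
log_columns = [i for i in range(X_train_cal.shape[1]) if np.all(X_train_cal[:, i] > 0)]
X_train_log = add_log_features(X_train_cal, log_columns)
X_test_log = add_log_features(X_test_cal, log_columns)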

# Scale new feature sets
scaler_poly = StandardScaler()
X_train_poly_scaled = scaler_poly.fit_transform(X_train_poly)
X_test_poly_scaled = scaler_poly.transform(X_test_poly)

scaler_log = StandardScaler()
X_train_log_scaled = scaler_log.fit_transform(X_train_log)
X_test_log_scaled = scaler_log.transform(X_test_log)

# Train models with engineered features
model_poly, train_mse_poly, test_mse_poly = train_and_evaluate(
    X_train_poly_scaled, y_train_cal, X_test_poly_scaled, y_test_cal, 
    learning_rate=0.1, max_iterations=1000, description="Polynomial features model")

model_log, train_mse_log, test_mse_log = train_and_evaluate(
    X_train_log_scaled, y_train_cal, X_test_log_scaled, y_test_cal, 
    learning_rate=0.1, max_iterations=1000, description="Log features model")

## 5. Comparing Model Performance with Different Feature Sets

In [None]:
# Compare MSE values
models = ['Baseline', 'Polynomial Features', 'Log Features']
train_mse_values = [train_mse_baseline, train_mse_poly, train_mse_log]
test_mse_values = [test_mse_baseline, test_mse_poly, test_mse_log]

# Plot MSE comparison
plt.figure(figsize=(12, 6))

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_mse_values, width, label='Training MSE')
plt.bar(x + width/2, test_mse_values, width, label='Test MSE')

plt.xlabel('Model Type')
plt.ylabel('Mean Squared Error')
plt.title('MSE Comparison with Different Feature Engineering Techniques')
plt.xticks(x, models)
plt.legend()
plt.grid(True, axis='y')
plt.show()

# Display feature counts
print("Feature count comparison:")
print(f"Original features: {X_train_cal.shape[1]}")
print(f"With polynomial features: {X_train_poly.shape[1]}")
print(f"With log features: {X_train_log.shape[1]}")

## 6. Feature Importance and Selection

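Let's analyze which features are most important for our model. Because the baseline model was trained on standardized features, the absolute value of each weight is comparable across features and can serve as a rough measure of importance.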
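
Reading importance off the weights only makes sense when the features share a common scale. The following sketch, on made-up data where two features contribute equally to the target, compares the relative weight magnitudes obtained from raw and from standardized inputs. It uses `np.linalg.lstsq` instead of our gradient-descent model, and the names `a`, `b`, and `fit_weights` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.random(n)            # range 0-1
b = rng.random(n) * 10000    # range 0-10000
target = 5 * a + 0.0005 * b  # both terms span roughly 0-5

def fit_weights(X, y):
    """Least-squares fit with a bias column; returns the feature weights only."""
    A = np.column_stack((np.ones(len(y)), X))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

X_ab = np.column_stack((a, b))
X_ab_std = (X_ab - X_ab.mean(axis=0)) / X_ab.std(axis=0)

relative_importance = {}
for name, X_variant in [("raw", X_ab), ("standardized", X_ab_std)]:
    w = np.abs(fit_weights(X_variant, target))
    relative_importance[name] = w / w.sum()
    print(f"{name:>12}: {relative_importance[name]}")
```

On the raw data nearly all of the "importance" lands on the small-range feature simply because its weight has to be numerically larger. After standardization the two features receive roughly equal shares, which matches how the target was built.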

In [None]:
# Display baseline model weights
baseline_weights = np.abs(model_baseline.weights)
feature_importance = baseline_weights / np.sum(baseline_weights)

# Create a dataframe with feature names and importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance,
    'Weight': model_baseline.weights
})

# Sort by importance
importance_df = importance_df.sort_values('Importance', ascending=False)

# Display feature importance
print("Feature importance for baseline model:")
display(importance_df)

# Plot feature importance
plt.figure(figsize=(12, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Relative Importance')
plt.ylabel('Feature')
plt.title('Feature Importance for California Housing Price Prediction')
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()

## 7. Visualizing Feature Relationships

Let's create scatter plots to visualize the relationship between important features and the target variable.

In [None]:
# Get the top 4 important features
top_features = importance_df['Feature'].values[:4]
# feature_names is a plain Python list, so list.index can be used directly
top_indices = [feature_names.index(feature) for feature in top_features]

# Create scatter plots
plt.figure(figsize=(15, 10))

for i, (feature, idx) in enumerate(zip(top_features, top_indices)):
    plt.subplot(2, 2, i+1)
    plt.scatter(X_cal[:, idx], y_cal, alpha=0.5, s=10)
    
    # Add linear regression line for visual reference
    z = np.polyfit(X_cal[:, idx], y_cal, 1)
    p = np.poly1d(z)
    x_range = np.linspace(X_cal[:, idx].min(), X_cal[:, idx].max(), 100)
    plt.plot(x_range, p(x_range), 'r--')
    
    plt.title(f'Relationship of {feature} with Median House Value')
    plt.xlabel(feature)
    plt.ylabel('Median House Value (100,000 USD)')
    plt.grid(True)

plt.tight_layout()
plt.show()

## 8. Combining Feature Scaling and Engineering

Let's build a model that uses both feature scaling and feature engineering to get the best performance.
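
The order of the two steps matters: features are engineered from the raw values first and scaled afterwards. The sketch below shows one reason, using a made-up positive, right-skewed column (the name `income` is hypothetical). The log transform needs positive inputs, which standardization does not preserve.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=1.0, sigma=0.5, size=1000)   # strictly positive, right-skewed

# Engineer first (the log needs the raw, positive values), then standardize the result
log_income = np.log(income)
engineered = np.column_stack((income, log_income))
engineered_scaled = (engineered - engineered.mean(axis=0)) / engineered.std(axis=0)

# Standardizing first centres the column at zero, so a large share of the values
# becomes negative and np.log is no longer defined for them
income_scaled = (income - income.mean()) / income.std()
negative_share = np.mean(income_scaled < 0)
print(f"Share of negative values after scaling: {negative_share:.2f}")
```

The cell below follows the same order: `engineer_features` works on the unscaled arrays, and the scaler is fitted on the engineered training set.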

In [None]:
# Create a function for our best feature engineering approach
def engineer_features(X):
    """Apply a combined feature engineering approach based on previous results.

    Relies on the global top_indices computed in section 7.
    """
    # Start with original features
    X_engineered = X.copy()
    
    # Add polynomial features for the most important features
    for idx in top_indices[:2]:  # Only use top 2 features for simplicity
        X_engineered = np.column_stack((X_engineered, X[:, idx]**2))
    
    # Add interaction between the top 2 features
    X_engineered = np.column_stack((X_engineered, X[:, top_indices[0]] * X[:, top_indices[1]]))
    
    # Add log transformations for positive features
    for i in range(X.shape[1]):
        if np.all(X[:, i] > 0):
            X_engineered = np.column_stack((X_engineered, np.log(X[:, i])))
    
    return X_engineered

# Apply our engineering function
X_train_best = engineer_features(X_train_cal)
X_test_best = engineer_features(X_test_cal)

# Scale the features
scaler_best = StandardScaler()
X_train_best_scaled = scaler_best.fit_transform(X_train_best)
X_test_best_scaled = scaler_best.transform(X_test_best)

# Train the model
model_best, train_mse_best, test_mse_best = train_and_evaluate(
    X_train_best_scaled, y_train_cal, X_test_best_scaled, y_test_cal, 
    learning_rate=0.1, max_iterations=1000, description="Best engineered model")

# Compare all models
models = ['Baseline', 'Polynomial Features', 'Log Features', 'Best Engineered']
train_mse_values = [train_mse_baseline, train_mse_poly, train_mse_log, train_mse_best]
test_mse_values = [test_mse_baseline, test_mse_poly, test_mse_log, test_mse_best]

# Plot final comparison
plt.figure(figsize=(14, 6))

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, train_mse_values, width, label='Training MSE')
plt.bar(x + width/2, test_mse_values, width, label='Test MSE')

plt.xlabel('Model Type')
plt.ylabel('Mean Squared Error')
plt.title('Final MSE Comparison with Different Feature Engineering Techniques')
plt.xticks(x, models)
plt.legend()
plt.grid(True, axis='y')
plt.show()

## 9. Summary of Key Findings

In this notebook, we've explored the importance of feature scaling and engineering:

1. **Feature Scaling Findings**:
   - Without scaling, gradient descent needs a very small learning rate and converges slowly (or diverges) when features have very different scales
   - Both min-max scaling and standardization greatly improve convergence
   - Scaling allows a much larger learning rate (0.1 instead of 0.0000001 in section 3), making training faster

2. **Feature Engineering Findings**:
   - Adding polynomial terms can capture non-linear relationships
   - Log transformations can be useful for features with skewed distributions
   - Feature interaction terms can capture relationships between features
   - Engineered features can improve model performance; compare the test MSE against the baseline to confirm the gain

3. **Important Principles**:
   - Scale features before applying gradient descent, fitting the scaler on the training set only and reusing it for the test set
   - Understand your data and feature relationships before engineering new features
   - Not all engineered features will be useful; evaluate their impact on model performance
   - Combining scaling with well-chosen engineered features tends to give the best results, but confirm this on held-out data
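
To make the point about log transformations and skewed distributions concrete, here is a minimal sketch that measures skewness before and after taking the log of a synthetic right-skewed sample. The `skewness` helper is an illustrative implementation of the mean cubed z-score, and the lognormal sample merely stands in for a skewed real-world feature.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (close to 0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_before = skewness(skewed)
skew_after = skewness(np.log(skewed))

print(f"Skewness before log: {skew_before:.2f}")
print(f"Skewness after log:  {skew_after:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric feature, which a linear model and a standard scaler tend to handle better than a long right tail.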