# Feature Engineering + Multicollinearity Check


This notebook prepares the dataset for modeling by:

- **Feature engineering**  
  Creates new derived features (ratios, squared terms, interactions, and normalized measures) to capture nonlinear relationships and improve predictive power.

- **Scaling features for consistency**  
  Ensures all predictors are on the same scale, which is important for models sensitive to feature magnitude.

- **Checking multicollinearity using Variance Inflation Factor (VIF)**  
  Identifies redundant predictors that are highly correlated with each other, which can destabilize linear models.

- **Pruning redundant features**  
  Removes or consolidates features with excessive collinearity while preserving clinically meaningful predictors, improving model stability and interpretability.
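
As a minimal sketch of what the scaling step does, the snippet below standardizes each column to zero mean and unit variance with plain NumPy. The helper name `standardize` and the toy array are assumptions made for illustration; the notebook itself relies on scikit-learn's `StandardScaler`, which also uses the population standard deviation (`ddof=0`).

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: subtract the column mean, divide by the column std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population std (ddof=0)
    std[std == 0] = 1.0          # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std

# Toy data (hypothetical): two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
Z = standardize(X_toy)
print(Z.mean(axis=0), Z.std(axis=0))
```

After standardization both columns carry the same weight, regardless of their original units.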



## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Load Dataset

In [None]:
data_path = "/content/drive/My Drive/Portfolio/DataSciencePortfolio/Projects/Breast-Cancer/data/raw/Breast_cancer_dataset.csv"
df = pd.read_csv(data_path)


# Encode diagnosis: malignant (M) = 1, benign (B) = 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})


# Drop irrelevant columns
df.drop(columns=["id", "Unnamed: 32"], errors="ignore", inplace=True)

# Check result
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Variance Inflation Factor (VIF)
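Compute the VIF of each feature to detect multicollinearity. VIF measures how well a feature can be predicted from all the other features.  
  
A VIF above 10 is a common rule-of-thumb sign of redundancy. We will prune features iteratively while protecting clinically meaningful predictors.

Note that statsmodels' `variance_inflation_factor` does not add an intercept to the regression. The values in this first pass are computed on the raw, unscaled features, so they are not directly comparable to the usual centered VIF and tend to be much larger. The pruning step later in the notebook runs on standardized features.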
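
To make the definition concrete, here is a minimal sketch that computes VIF as `1 / (1 - R²)`, where `R²` comes from regressing one feature on all the others. The helper `vif_from_r2` and the synthetic columns are assumptions for illustration only, and this version includes an intercept in each regression.

```python
import numpy as np

def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic data (hypothetical): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X_demo = np.column_stack([x1, x2, x3])

vifs = [vif_from_r2(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

The two near-duplicate columns should show a VIF far above 10, while the independent column stays close to 1.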

In [None]:
def calculate_vif(df_input):
    features_df = df_input.copy()

    if "diagnosis" in features_df.columns:
        features_df = features_df.drop(columns=["diagnosis"])

    # Handle case where no features are left after dropping 'diagnosis'
    if features_df.empty:
        print("No features to calculate VIF for after dropping 'diagnosis'.")
        return pd.DataFrame(columns=["feature", "VIF"])

    vif_data = pd.DataFrame()
    vif_data["feature"] = features_df.columns # Assign feature names
    vif_data["VIF"] = [
        variance_inflation_factor(features_df.values, i) # Calculate VIF only for features
        for i in range(features_df.shape[1])
    ]
    return vif_data.sort_values(by="VIF", ascending=False)

vif_results = calculate_vif(df)
print("Initial VIF calculated.")
print(vif_results)

Initial VIF calculated.
                    feature           VIF
0               radius_mean  63306.172036
2            perimeter_mean  58123.586079
20             radius_worst   9674.742602
22          perimeter_worst   4487.781270
3                 area_mean   1287.262339
23               area_worst   1138.759252
9    fractal_dimension_mean    629.679874
29  fractal_dimension_worst    423.396723
4           smoothness_mean    393.398166
24         smoothness_worst    375.597155
21            texture_worst    343.004387
1              texture_mean    251.047108
10                radius_se    236.665738
28           symmetry_worst    218.919805
12             perimeter_se    211.396334
5          compactness_mean    200.980354
8             symmetry_mean    184.426558
6            concavity_mean    157.855046
7       concave points_mean    154.241268
27     concave points_worst    148.673180
25        compactness_worst    132.884276
26          concavity_worst     86.310362
13        

## Feature Engineering


Feature engineering creates new predictors from existing ones to capture more complex relationships.

**Purpose:**  
Consolidate redundant raw predictors into more stable, interpretable features.  
When we later prune with VIF, this lets us keep predictive signal in engineered form while reducing collinearity.

In this notebook, we apply several transformations:

- **Ratios**  
  Combine related features to highlight proportional relationships (e.g., `perimeter_radius_ratio`, `concave_perimeter_ratio`).  
  These consolidate overlapping predictors into interpretable measures of tumor irregularity relative to size.

- **Squared terms**  
  Add polynomial features (e.g., `area_worst_squared`, `concavity_mean_squared`) to capture non-linear effects.  
  These help the model detect curved or extreme patterns in tumor size and concavity.

- **Normalized features**  
  Divide a feature by tumor size, here `radius_mean` (e.g., `normalized_texture`, `normalized_symmetry`).  
  These express texture and symmetry relative to tumor size rather than as absolute values.

- **Log transforms**  
  Apply `log1p` to size features (e.g., `log_area_worst`, `log_perimeter_mean`) to compress their range.

- **Interaction terms**  
  Multiply features to reflect clinically meaningful interactions (e.g., `concavity_compactness_interaction`, `smoothness_symmetry_interaction`).  
  These highlight how shape descriptors interact to influence malignancy risk.
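
Ratio features divide one column by another, so a zero in the denominator produces `inf` or `NaN`. The sketch below shows one way to guard against that; `safe_ratio` and the toy values are hypothetical and not part of the notebook, which divides the columns directly and checks for `NaN`/`Inf` afterwards.

```python
import numpy as np

def safe_ratio(numerator, denominator) -> np.ndarray:
    """Element-wise division that yields NaN wherever the denominator is zero."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(np.broadcast(num, den).shape, np.nan)
    # Only divide where the denominator is non-zero; other slots keep NaN
    np.divide(num, den, out=out, where=den != 0)
    return out

# Toy values (hypothetical), with one zero denominator
concavity = [0.30, 0.00, 0.20]
compactness = [0.28, 0.00, 0.16]
ratio = safe_ratio(concavity, compactness)
print(ratio)
```

With pandas columns, the same helper could be called as `safe_ratio(X['a'].to_numpy(), X['b'].to_numpy())`; the resulting `NaN` values then need an explicit imputation or drop decision before modeling.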





### Separate Features and Target

In [None]:
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

### Ratios / Proportions

In [None]:
X['perimeter_radius_ratio'] = X['perimeter_mean'] / X['radius_mean']
X['concavity_compactness_ratio'] = X['concavity_mean'] / X['compactness_mean']
X['concave_perimeter_ratio'] = X['concave points_worst'] / X['perimeter_worst']
X['symmetry_fractal_ratio'] = X['symmetry_mean'] / X['fractal_dimension_mean']

### Polynomial Terms

In [None]:
# Size features
X['radius_mean_squared'] = X['radius_mean'] ** 2
X['area_worst_squared'] = X['area_worst'] ** 2

# Concavity / Compactness
X['concavity_mean_squared'] = X['concavity_mean'] ** 2

### Interaction Terms

In [None]:
# Concavity x Compactness
X['concavity_compactness_interaction'] = X['concavity_mean'] * X['compactness_mean']

# Smoothness x Symmetry
X['smoothness_symmetry_interaction'] = X['smoothness_mean'] * X['symmetry_mean']

### Domain-Informed Transformations


In [None]:
# Log transforms: log1p(x) = log(1 + x) compresses the range of size features
X['log_area_worst'] = np.log1p(X['area_worst'])
X['log_perimeter_mean'] = np.log1p(X['perimeter_mean'])

# Normalization by tumor size (radius_mean)
X['normalized_texture'] = X['texture_mean'] / X['radius_mean']
X['normalized_symmetry'] = X['symmetry_mean'] / X['radius_mean']

### Verify All Engineered Features
Preview the new columns and check for NaN values, infinite values, and correlations with the base features.

In [None]:
engineered_cols = [
    # Ratios
    'perimeter_radius_ratio','concavity_compactness_ratio',
    'concave_perimeter_ratio','symmetry_fractal_ratio',
    # Polynomial
    'area_worst_squared','concavity_mean_squared',
    # Interaction
    'concavity_compactness_interaction','smoothness_symmetry_interaction',
    # Log/Normalization
    'log_area_worst','log_perimeter_mean','normalized_texture','normalized_symmetry'
]

# Preview first 5 rows
print("Preview of engineered features:")
print(X[engineered_cols].head(), "\n")

# Check for NaN values
print("NaN counts:")
print(X[engineered_cols].isna().sum(), "\n")

# Check for Inf values
print("Inf counts:")
print(np.isinf(X[engineered_cols]).sum(), "\n")

# Correlation with originals
print("Correlation with base features:")
print(X[['radius_mean','area_worst','concavity_mean','compactness_mean',
         'texture_mean','smoothness_mean','symmetry_mean','fractal_dimension_mean'] + engineered_cols].corr())

Preview of engineered features:
   perimeter_radius_ratio  concavity_compactness_ratio  \
0                6.826014                     1.081052   
1                6.460865                     1.105036   
2                6.602336                     1.234522   
3                6.793345                     0.850299   
4                6.658452                     1.490964   

   concave_perimeter_ratio  symmetry_fractal_ratio  area_worst_squared  \
0                 0.001438                3.073307          4076361.00   
1                 0.001171                3.197459          3825936.00   
2                 0.001593                3.448908          2920681.00   
3                 0.002604                2.665230           322283.29   
4                 0.001068                3.074962          2480625.00   

   concavity_mean_squared  concavity_compactness_interaction  \
0                0.090060                           0.083308   
1                0.007552                     

## VIF Pruning
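After standardizing the features, we iteratively drop the non-protected feature with the highest VIF until every non-protected feature is at or below 10. Protected predictors (`concave points_worst`, `area_worst`, `concavity_mean`, `radius_worst`, `perimeter_worst`) are kept regardless of their VIF.  
The result is a reduced feature set with manageable collinearity that balances predictive strength with statistical stability.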
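
To see how a protected list changes the outcome, here is a toy run of the same greedy idea on synthetic data. It is a sketch under stated assumptions: `vif`, `greedy_prune`, and the columns `a`, `b`, `c` are hypothetical, and the VIF here is computed with an intercept using plain NumPy rather than statsmodels.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j, i.e. 1 / (1 - R^2), from a regression with an intercept."""
    y = X[:, j]
    design = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return y.var() / resid.var()

def greedy_prune(X, names, threshold=10.0, protected=()):
    """Repeatedly drop the non-protected column with the highest VIF above threshold."""
    names = list(names)
    while True:
        vifs = {n: vif(X, i) for i, n in enumerate(names)}
        candidates = {n: v for n, v in vifs.items() if n not in protected}
        if not candidates:
            return names
        worst = max(candidates, key=candidates.get)
        if candidates[worst] <= threshold:
            return names
        keep = [i for i, n in enumerate(names) if n != worst]
        X, names = X[:, keep], [names[i] for i in keep]

# Synthetic data (hypothetical): b nearly duplicates a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = a + 0.05 * rng.normal(size=300)
c = rng.normal(size=300)
X_toy = np.column_stack([a, b, c])

kept = greedy_prune(X_toy, ["a", "b", "c"], protected=["a"])
print(kept)  # ['a', 'c']
```

Because `a` is protected, the near-duplicate `b` is the one removed, after which the remaining columns are no longer collinear.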

In [None]:
def vif_prune(df, threshold=10.0, protected_features=None):
    """
    Iteratively drop features with VIF above threshold until all non-protected are stable.
    Protected features are kept regardless of VIF.
    """
    if protected_features is None:
        protected_features = []

    while True:
        vif_data = calculate_vif(df)

        # Filter out protected features
        non_protected = vif_data[~vif_data["feature"].isin(protected_features)]
        max_vif = non_protected["VIF"].max()

        if max_vif > threshold:
            drop_feature = non_protected.loc[non_protected["VIF"].idxmax(), "feature"]
            print(f"Dropping '{drop_feature}' with VIF={max_vif:.2f}")
            df = df.drop(columns=[drop_feature])
        else:
            print("VIF pruning complete. All non-protected features below threshold.")
            return df, vif_data

# Instantiate StandardScaler
scaler = StandardScaler()

# Scale the features (keep the original row index so the target can be re-attached safely)
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Define the highly predictive features you want to protect
protected = [
    "concave points_worst",
    "area_worst",
    "concavity_mean",
    "radius_worst",
    "perimeter_worst"
]

# Run pruning but keep protected features
X_pruned_df, vif_final = vif_prune(X_scaled_df, threshold=10.0, protected_features=protected)

print("Final VIF results:")
print(vif_final)

Dropping 'radius_mean' with VIF=35605.16
Dropping 'area_mean' with VIF=6885.08
Dropping 'perimeter_mean' with VIF=5220.10
Dropping 'log_perimeter_mean' with VIF=505.37
Dropping 'log_area_worst' with VIF=357.04
Dropping 'symmetry_mean' with VIF=257.44
Dropping 'concavity_compactness_interaction' with VIF=100.74
Dropping 'smoothness_symmetry_interaction' with VIF=83.54
Dropping 'area_worst_squared' with VIF=79.28
Dropping 'radius_se' with VIF=74.67
Dropping 'compactness_mean' with VIF=71.39
Dropping 'concave points_mean' with VIF=63.13
Dropping 'concave_perimeter_ratio' with VIF=38.85
Dropping 'normalized_symmetry' with VIF=36.97
Dropping 'concavity_mean_squared' with VIF=32.96
Dropping 'perimeter_se' with VIF=32.04
Dropping 'concavity_worst' with VIF=30.61
Dropping 'texture_mean' with VIF=26.06
Dropping 'compactness_worst' with VIF=25.40
Dropping 'radius_mean_squared' with VIF=22.76
Dropping 'perimeter_radius_ratio' with VIF=15.41
Dropping 'fractal_dimension_mean' with VIF=11.01
VIF pru

## Save Pruned Data

In [None]:
import os

# Define the full path for the file
file_path = "/content/drive/My Drive/Portfolio/DataSciencePortfolio/Projects/Breast-Cancer/data/preprocessed/breast_cancer_pruned.csv"

# Extract the directory path from the file path
directory_path = os.path.dirname(file_path)

# Create the directory if it does not exist
os.makedirs(directory_path, exist_ok=True)

# Re-attach the target by position and write the pruned dataset to disk
X_pruned_df["diagnosis"] = y.values
X_pruned_df.to_csv(file_path, index=False)