# Step 7: Feature Engineering Based on SHAP
Engineer new high-impact features and drop weak ones identified from SHAP analysis.

## Objectives
- Use insights from our SHAP analysis (step6) to eliminate low-impact features
- Create new engineered features focused on high-impact feature interactions
- Generate an improved feature set (v2) for model retraining
- Focus on creating features that leverage nonlinear relationships identified by SHAP

## Background
SHAP analysis from the previous notebook identified several key opportunities:
- `Duration` and `Heart_Rate` have the strongest impact on predictions
- Some features like `BMI` and `Height` contribute minimal predictive power
- Interaction effects between variables (e.g., temperature and heart rate) have high potential
- Age and intensity show moderately strong impacts worth exploring further
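
One common way to turn per-sample SHAP values into a ranking like the one above is to average their absolute values per feature. The sketch below assumes the SHAP values are already available as a 2-D array (rows are samples, columns are features). The synthetic matrix, the `rank_by_mean_abs_shap` helper, and the 5% cutoff are illustrative assumptions, not the values or method used in step6.

```python
import numpy as np
import pandas as pd


def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return mean |SHAP| per feature, sorted from most to least important."""
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    return pd.Series(importance, index=feature_names).sort_values(ascending=False)


# Synthetic SHAP matrix for illustration only (NOT the values from step6)
rng = np.random.default_rng(0)
feature_names = ["Duration", "Heart_Rate", "Age", "BMI", "Height"]
scales = np.array([50.0, 30.0, 10.0, 0.5, 0.3])  # hypothetical per-feature magnitudes
shap_values = rng.normal(size=(200, 5)) * scales

ranking = rank_by_mean_abs_shap(shap_values, feature_names)

# Flag features whose share of total importance falls below a hypothetical 5% cutoff
share = ranking / ranking.sum()
low_impact = share[share < 0.05].index.tolist()
print(ranking.round(2))
print("Candidates to drop:", low_impact)
```

The cutoff is a judgment call; a feature with low average impact can still matter for a small subset of samples, so it is worth inspecting the per-sample values before dropping it.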


In [None]:
import pandas as pd

# Load the feature-engineered training and test datasets from step2b
train = pd.read_csv("datasets/train_fe.csv")
test = pd.read_csv("datasets/test_fe.csv")

# Use the existing 'id' column if present; otherwise fall back to row positions
# This is a safeguard to ensure we can track samples through our pipeline
train_id = train.pop("id") if "id" in train.columns else pd.Series(range(len(train)), name="id")
test_id = test.pop("id") if "id" in test.columns else pd.Series(range(len(test)), name="id")

# Split features and target variable for processing
y_train = train['Calories']  # Target variable (what we're predicting)
X_train = train.drop(columns=['Calories'])  # Feature matrix for training
X_test = test.copy()  # Test features (no target available)

In [None]:
# Drop features identified as low-impact by SHAP analysis (step6)
# These features showed minimal contribution to the model predictions:
# - 'BMI': Despite being intuitively useful, SHAP values were nearly zero
# - 'Height': Low impact unless paired with other features
features_to_drop = ['BMI', 'Height']  

# Remove these features from both training and test datasets
X_train.drop(columns=features_to_drop, inplace=True)
X_test.drop(columns=features_to_drop, inplace=True)


In [None]:
def create_new_features(df):
    """
    Create new engineered features based on SHAP insights to capture important relationships.
    
    Key feature engineering strategies applied:
    1. Ratio features - capture relationships between important variables
    2. Interaction features - multiply high-impact features together
    3. Composite features - combine multiple variables in meaningful ways
    
    Args:
        df: DataFrame containing the original features
        
    Returns:
        DataFrame with additional engineered features
    """
    # Duration per Age: workout length scaled by age
    # Lets the model treat the same duration differently for younger and older people
    df['Duration_per_Age'] = df['Duration'] / df['Age']
    
    # Heart Rate × Duration: proxy for total cardiac effort over the workout
    # (roughly total heartbeats, assuming Heart_Rate is per minute and Duration is in minutes)
    # Combines the two highest-impact features identified by SHAP
    df['HRxDuration'] = df['Heart_Rate'] * df['Duration']
    
    # Temperature-Heart interaction normalized by Intensity
    # Captures efficiency of the body's heat response relative to effort
    df['TempHeart_per_Intensity'] = df['TempHeart'] / (df['Intensity'] + 1e-5)  # Add small constant to avoid division by zero
    
    # Heart Rate relative to Body Temperature
    # Captures cardiovascular efficiency relative to thermal response
    df['HR_per_BodyTemp'] = df['Heart_Rate'] / (df['Body_Temp'] + 1e-5)
    
    # Complex interaction: (Temperature × Heart Rate × Age) / Weight
    # Captures age-adjusted thermal-cardiac response relative to body mass
    df['TempHeart_Age_per_Weight'] = (df['TempHeart'] * df['Age']) / (df['Weight'] + 1e-5)
    
    return df

# Apply feature engineering to both training and test sets
X_train = create_new_features(X_train)
X_test = create_new_features(X_test)
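
The `1e-5` guard used in `create_new_features` prevents infinite values, but a zero denominator still yields a very large finite ratio that a missing-value check will not flag. The toy example below illustrates the difference; the data values and the `safe_ratio` helper are hypothetical and only show one possible alternative (returning NaN for zero denominators so they can be handled explicitly).

```python
import numpy as np
import pandas as pd

# Toy data for illustration; the second row has a zero denominator
toy = pd.DataFrame({"TempHeart": [4000.0, 3900.0], "Intensity": [5.0, 0.0]})

# Unguarded division: a zero denominator produces inf
unguarded = toy["TempHeart"] / toy["Intensity"]

# Epsilon guard (same pattern as create_new_features): finite, but extremely large
guarded = toy["TempHeart"] / (toy["Intensity"] + 1e-5)


def safe_ratio(num, den):
    """Hypothetical alternative: return NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)


masked = safe_ratio(toy["TempHeart"], toy["Intensity"])

print(unguarded.tolist())
print(guarded.tolist())
print(masked.tolist())
```

Which behavior is preferable depends on the model: tree-based models such as XGBoost can handle NaN natively, while a huge finite value may act as an outlier.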


In [None]:
# Restore the target variable to the training dataset before saving
X_train['Calories'] = y_train

# Insert ID columns at the beginning of both dataframes
X_train.insert(0, 'id', train_id)
X_test.insert(0, 'id', test_id)

# Save the enhanced feature sets as new files with '_v2' suffix
# These will be used in subsequent modeling steps
X_train.to_csv("datasets/train_fe_v2.csv", index=False)
X_test.to_csv("datasets/test_fe_v2.csv", index=False)
print("Saved: train_fe_v2.csv and test_fe_v2.csv")


Saved: train_fe_v2.csv and test_fe_v2.csv
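
Before retraining, it can help to confirm that the train and test files share the same feature columns in the same order, with `Calories` present only in train. The sketch below is one way to check this; the `check_column_alignment` helper and the toy frames are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def check_column_alignment(train_df, test_df, target="Calories"):
    """Return True if test columns equal train columns minus the target, in order."""
    expected = [c for c in train_df.columns if c != target]
    return expected == list(test_df.columns)


# Toy frames for illustration; in the notebook these would be X_train and X_test
toy_train = pd.DataFrame({
    "id": [0, 1],
    "Age": [30, 45],
    "HRxDuration": [1500.0, 1620.0],
    "Calories": [120.0, 150.0],
})
toy_test = pd.DataFrame({
    "id": [0, 1],
    "Age": [52, 28],
    "HRxDuration": [1480.0, 1710.0],
})

aligned = check_column_alignment(toy_train, toy_test)
print("Columns aligned:", aligned)
```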


In [None]:
# Reload the newly created files to verify their contents
# Reading from disk checks the saved CSVs themselves, not just the in-memory frames
train_v2 = pd.read_csv("datasets/train_fe_v2.csv")
test_v2 = pd.read_csv("datasets/test_fe_v2.csv")

# Perform validation checks on the new datasets:
# 1. Check for any missing values that might have been introduced
missing_train = train_v2.isnull().sum()
missing_test = test_v2.isnull().sum()

# 2. Generate summary statistics to verify the new features look reasonable
# This helps catch any anomalies like extreme outliers or unexpected distributions
summary_train = train_v2.describe()

# Display the validation results
missing_train, missing_test, summary_train


(id                          0
 Age                         0
 Weight                      0
 Duration                    0
 Heart_Rate                  0
 Body_Temp                   0
 Sex_male                    0
 Intensity                   0
 TempHeart                   0
 Duration2                   0
 Duration_per_Age            0
 HRxDuration                 0
 TempHeart_per_Intensity     0
 HR_per_BodyTemp             0
 TempHeart_Age_per_Weight    0
 Calories                    0
 dtype: int64,
 id                          0
 Age                         0
 Weight                      0
 Duration                    0
 Heart_Rate                  0
 Body_Temp                   0
 Sex_male                    0
 Intensity                   0
 TempHeart                   0
 Duration2                   0
 Duration_per_Age            0
 HRxDuration                 0
 TempHeart_per_Intensity     0
 HR_per_BodyTemp             0
 TempHeart_Age_per_Weight    0
 dtype: int64,
         

### Engineered Feature Summary

Based on the summary statistics, the engineered features have plausible ranges:

| Feature                    | Mean | Notes                                            |
| -------------------------- | ---- | ------------------------------------------------ |
| `Duration_per_Age`         | 0.43 | Good spread (0.01 to 1.5)                        |
| `HRxDuration`              | 1541 | Consistent with the product of the source ranges |
| `TempHeart_per_Intensity`  | 623  | Ratio feature, nonlinear in `Intensity`          |
| `HR_per_BodyTemp`          | 2.38 | Tight spread, no outliers                        |
| `TempHeart_Age_per_Weight` | 2177 | Wide spread; signal to be confirmed by retraining |

All features show reasonable distributions with no extreme outliers or anomalies.
The next step will be to retrain our XGBoost model on this enhanced feature set
and evaluate whether these changes improve our prediction performance.
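
The "no extreme outliers" observation can also be checked programmatically rather than by eye. The sketch below is one possible check, assuming a 1.5×IQR rule is an acceptable outlier definition for this data; the `outlier_report` helper and the toy values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def outlier_report(df, columns, k=1.5):
    """Count non-finite values and values outside [Q1 - k*IQR, Q3 + k*IQR] per column."""
    rows = {}
    for col in columns:
        values = df[col]
        finite = values[np.isfinite(values)]
        q1, q3 = finite.quantile(0.25), finite.quantile(0.75)
        iqr = q3 - q1
        outside = (finite < q1 - k * iqr) | (finite > q3 + k * iqr)
        rows[col] = {
            "non_finite": int((~np.isfinite(values)).sum()),
            "iqr_outliers": int(outside.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frame for illustration; in the notebook this would be train_v2
toy = pd.DataFrame({
    "HR_per_BodyTemp": [2.3, 2.4, 2.35, 2.45, 2.38, 2.41],
    "TempHeart_per_Intensity": [600.0, 620.0, 610.0, 630.0, 615.0, 3.9e8],
})
report = outlier_report(toy, toy.columns)
print(report)
```

In the notebook, the same call could be applied to `train_v2` with the five engineered column names. The `non_finite` count complements `isnull()`, which does not flag `inf` under pandas' default settings.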


## Summary & Next Steps

### What We've Accomplished:
- Removed low-impact features identified by SHAP analysis
- Created 5 new engineered features focusing on high-impact interactions
- Generated and validated enhanced feature sets (v2)
- Kept an `id` column in both datasets so rows can be tracked across steps

### Next Steps:
1. Retrain XGBoost model using these enhanced features (step7b)
2. Evaluate if the feature engineering improved model performance
3. Consider additional SHAP analysis on the new model for further insights
4. Prepare final submission with best-performing model