# Step 2B: Advanced Feature Engineering
Create new informative features that may reveal hidden patterns.

## Overview:
This notebook focuses on creating domain-specific derived features to enhance the predictive power of our models. By leveraging domain knowledge in exercise physiology and calorie expenditure factors, we engineer features that capture important relationships between existing variables.



In [11]:
# Import essential libraries for data manipulation and numerical operations
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations

# Load preprocessed datasets from Step 2, which have:
# - Encoded categorical features (Sex_male)
# - Log-transformed target variable (Calories, train only)
# - ID columns removed
train = pd.read_csv('datasets/train_preprocessed.csv')  # Training dataset with target variable
test = pd.read_csv('datasets/test_preprocessed.csv')    # Test dataset for predictions

## 1. Add BMI (Body Mass Index)

BMI is a widely used weight-for-height index that serves as a rough proxy for body composition.
Higher BMI may affect calorie expenditure during exercise through:
- Metabolic rate differences
- Effort required to move a larger body mass
- Potential differences in exercise efficiency

**Formula**: BMI = Weight (kg) / (Height (m))²

In [12]:
# Calculate Body Mass Index (BMI) for both training and test datasets
# Formula: BMI = Weight (kg) / (Height (m))²
# Height is given in cm, so we divide by 100 to convert to meters before squaring
train['BMI'] = train['Weight'] / ((train['Height'] / 100) ** 2)  # Calculate BMI for training data
test['BMI'] = test['Weight'] / ((test['Height'] / 100) ** 2)     # Calculate BMI for test data
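
As a quick sanity check on the unit conversion, the same formula can be wrapped in a small helper and run on hand-checkable values. This is a minimal sketch: the `add_bmi` helper and the two-row `sample` frame are hypothetical and not part of the original notebook or dataset.

```python
import pandas as pd

def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a BMI column (Weight in kg, Height in cm)."""
    out = df.copy()
    height_m = out["Height"] / 100          # convert cm to m
    out["BMI"] = out["Weight"] / height_m ** 2
    return out

# Hypothetical sample rows chosen so the result is easy to verify by hand
sample = pd.DataFrame({"Height": [175.0, 160.0], "Weight": [70.0, 64.0]})
sample_bmi = add_bmi(sample)
print(sample_bmi["BMI"].round(2).tolist())  # [22.86, 25.0]
```

Using one helper for both `train` and `test` keeps the two transformations from drifting apart if the formula is edited later.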

## 2. Add Intensity (Heart Rate / Duration)

Exercise intensity is a critical factor in determining calorie expenditure. By dividing heart rate by duration, we create a feature that captures:
- Exercise effort level per unit time
- Efficiency of energy expenditure
- Potential for higher caloric burn rates during shorter, more intense workouts

**Formula**: Intensity = Heart Rate (bpm) / Duration (minutes)

In [None]:
# Create Intensity feature by dividing Heart Rate by Duration
# This expresses heart rate relative to workout length (bpm per minute of exercise)
# Higher values suggest more intense exercise that may burn calories at a faster rate
train['Intensity'] = train['Heart_Rate'] / train['Duration']  # Calculate for training data
test['Intensity'] = test['Heart_Rate'] / test['Duration']     # Calculate for test data
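
Because this feature is a ratio, its size is driven largely by the denominator: the same heart rate yields a far larger value for a short session than for a long one. The sketch below illustrates this on a hypothetical three-row frame (`demo` is not from the dataset). It also shows that a zero duration produces `inf`, which `isnull()` does not flag. The zero-duration row is an assumption added for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: same heart rate at 1 and 30 minutes, plus a zero-duration edge case
demo = pd.DataFrame({
    "Heart_Rate": [108.0, 108.0, 90.0],
    "Duration": [1.0, 30.0, 0.0],
})

# Plain division: the zero-duration row becomes inf rather than NaN
demo["Intensity_raw"] = demo["Heart_Rate"] / demo["Duration"]
n_inf = int(np.isinf(demo["Intensity_raw"]).sum())
n_null = int(demo["Intensity_raw"].isnull().sum())
print(f"inf values: {n_inf}, values flagged by isnull(): {n_null}")

# Guarded division: treat zero durations as missing so they surface in isnull() checks
safe_duration = demo["Duration"].replace(0, np.nan)
demo["Intensity"] = demo["Heart_Rate"] / safe_duration
print(demo)
```

If the data contains no zero durations the guard changes nothing, but it keeps a later `isnull().sum()` check meaningful for this column.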

## 3. Add TempHeart Interaction (Body Temp × Heart Rate)

Body temperature and heart rate both independently correlate with calorie expenditure, but their interaction may provide additional insights:
- Higher heart rates combined with higher body temperatures may indicate more strenuous exercise

**Formula**: TempHeart = Body Temp × Heart Rate (bpm)

In [None]:
# Create interaction feature between Body Temperature and Heart Rate
# This captures the relationship between two physiological parameters that both 
# indicate exercise intensity and metabolic rate
# Higher values may indicate more intense physiological stress and higher calorie burn
train['TempHeart'] = train['Body_Temp'] * train['Heart_Rate']  # Calculate for training data
test['TempHeart'] = test['Body_Temp'] * test['Heart_Rate']    # Calculate for test data
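
One thing worth checking for any product feature is how much new information it carries. When one factor varies over a narrow range, the product can end up almost proportional to the other factor. The sketch below demonstrates the check on synthetic data. The distribution parameters are made-up placeholders, not estimates from this dataset, so only the method carries over, not the numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-in data: a narrow-range factor and a wide-range factor (assumed values)
synthetic = pd.DataFrame({
    "Body_Temp": rng.normal(40.0, 0.8, n),
    "Heart_Rate": rng.normal(95.0, 9.0, n),
})
synthetic["TempHeart"] = synthetic["Body_Temp"] * synthetic["Heart_Rate"]

# Correlation of the interaction with each of its two inputs
corr_with_hr = synthetic["TempHeart"].corr(synthetic["Heart_Rate"])
corr_with_temp = synthetic["TempHeart"].corr(synthetic["Body_Temp"])
print(f"corr(TempHeart, Heart_Rate) = {corr_with_hr:.3f}")
print(f"corr(TempHeart, Body_Temp)  = {corr_with_temp:.3f}")
```

Running the same two `.corr()` calls on `train` would show whether `TempHeart` is largely redundant with one of its inputs in the real data.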

## 4. Add Squared Duration (nonlinear feature)

Duration shows a strong correlation with calorie expenditure, but this relationship may not be perfectly linear:
- The dominant energy system shifts from anaerobic toward aerobic metabolism as exercise continues
- Fatigue factors may affect efficiency in longer workouts
- Including a quadratic term allows models to capture potential non-linear relationships

**Formula**: Duration2 = (Duration (minutes))²

In [None]:
# Create squared duration feature to capture potential nonlinear relationships
# Square transformation can help models fit curved relationships between duration and calories
# This is a common polynomial feature engineering technique to represent nonlinear patterns
train['Duration2'] = train['Duration'] ** 2  # Square duration for training data
test['Duration2'] = test['Duration'] ** 2   # Square duration for test data
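
To illustrate why a squared term helps a linear model, the sketch below fits ordinary least squares to a deliberately curved synthetic target, once with duration alone and once with duration plus its square. The target formula and the `rmse_of_fit` helper are assumptions made for this demo and do not describe the real relationship in the data.

```python
import numpy as np

duration = np.arange(1.0, 31.0)                   # 1 to 30 minutes
target = 2.0 * duration + 0.1 * duration ** 2     # synthetic curved target (assumed)

# Design matrices: intercept + duration, then the same with a squared column added
X_lin = np.column_stack([np.ones_like(duration), duration])
X_quad = np.column_stack([X_lin, duration ** 2])

def rmse_of_fit(X: np.ndarray, y: np.ndarray) -> float:
    """Fit least squares and return the root mean squared error on the same data."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - y) ** 2)))

rmse_lin = rmse_of_fit(X_lin, target)
rmse_quad = rmse_of_fit(X_quad, target)
print(f"RMSE without squared term: {rmse_lin:.3f}")
print(f"RMSE with squared term:    {rmse_quad:.3e}")
```

The benefit applies mainly to linear models. Tree-based models can approximate curvature from `Duration` alone, so the extra column may matter less there.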

In [None]:
# Save the feature-engineered datasets to CSV files
# These will be used in subsequent modeling notebooks
# The '_fe' suffix indicates these datasets contain the additional engineered features
train.to_csv('datasets/train_fe.csv', index=False)  # Save training data with new features
test.to_csv('datasets/test_fe.csv', index=False)    # Save test data with new features
print("Feature-engineered datasets saved.")

Feature-engineered datasets saved.


In [None]:
# Load the saved feature-engineered datasets to verify their contents
# This step confirms that the data was saved correctly and allows us to perform quality checks
train_fe = pd.read_csv("datasets/train_fe.csv")  # Load training data with engineered features
test_fe = pd.read_csv("datasets/test_fe.csv")    # Load test data with engineered features

# Display basic information about the datasets
# info() prints dtypes and non-null counts directly and returns None,
# so its result is not assigned to a variable
train_fe.info()  # Print training data structure
test_fe.info()   # Print test data structure

# Check for missing values in both datasets
# Missing values could cause problems in modeling and should be addressed if present
print("missing values in train dataset:")
missing_train = train_fe.isnull().sum()  # Count missing values per column in training data
print(missing_train)

print("missing values in test dataset:")
missing_test = test_fe.isnull().sum()  # Count missing values per column in test data
print(missing_test)

# Summarize each numeric column to screen for potential outliers
# Min/max values far from the quartiles might indicate data issues or special cases to handle
print("summary statistics for train dataset:")
stats_train = train_fe.describe()  # Get statistical summary for all numeric columns

# Return the results for display in the notebook
missing_train, missing_test, stats_train


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Age         750000 non-null  int64  
 1   Height      750000 non-null  float64
 2   Weight      750000 non-null  float64
 3   Duration    750000 non-null  float64
 4   Heart_Rate  750000 non-null  float64
 5   Body_Temp   750000 non-null  float64
 6   Calories    750000 non-null  float64
 7   Sex_male    750000 non-null  bool   
 8   BMI         750000 non-null  float64
 9   Intensity   750000 non-null  float64
 10  TempHeart   750000 non-null  float64
 11  Duration2   750000 non-null  float64
dtypes: bool(1), float64(10), int64(1)
memory usage: 63.7 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Age         250000 non-null  int64  
 1 

(Age           0
 Height        0
 Weight        0
 Duration      0
 Heart_Rate    0
 Body_Temp     0
 Calories      0
 Sex_male      0
 BMI           0
 Intensity     0
 TempHeart     0
 Duration2     0
 dtype: int64,
 Age           0
 Height        0
 Weight        0
 Duration      0
 Heart_Rate    0
 Body_Temp     0
 Sex_male      0
 BMI           0
 Intensity     0
 TempHeart     0
 Duration2     0
 dtype: int64,
                  Age         Height         Weight       Duration  \
 count  750000.000000  750000.000000  750000.000000  750000.000000   
 mean       41.420404     174.697685      75.145668      15.421015   
 std        15.175049      12.824496      13.982704       8.354095   
 min        20.000000     126.000000      36.000000       1.000000   
 25%        28.000000     164.000000      63.000000       8.000000   
 50%        40.000000     174.000000      74.000000      15.000000   
 75%        52.000000     185.000000      87.000000      23.000000   
 max        79.0000

### ⚠️ Spotlight on Intensity (Potential Outliers)
- Mean: 10.55
- Max: 108.0
- 75th percentile: 10.75

This means 75% of the values fall at or below 10.75, yet the maximum reaches 108, roughly ten times higher, so the upper tail is a potential outlier range.

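**Insight:** The large gap between the 75th percentile and the maximum value suggests extreme outliers in the Intensity feature. Because Intensity divides by Duration, the largest values most likely come from very short sessions (the minimum Duration in the summary above is 1 minute). These outliers could be:
- Legitimate high-intensity workout data points
- Measurement errors
- Data entry issues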

**Action needed:** Consider handling these outliers in modeling steps through techniques such as:
- Robust scaling
- Winsorization
- Using algorithms less sensitive to outliers (e.g., tree-based models)
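
As one possible illustration of winsorization, the sketch below clips a column at quantile bounds computed on the training data only, then applies the same bounds to the test data. The helper name, the 5%/95% cut-offs, and the sample values are all assumptions for this demo. Suitable cut-offs for the real `Intensity` column would need to be chosen by inspection.

```python
import pandas as pd

def winsorize_with_train_bounds(train_col, test_col, lower_q=0.05, upper_q=0.95):
    """Clip both columns to quantile bounds estimated from the training column."""
    lower = train_col.quantile(lower_q)
    upper = train_col.quantile(upper_q)
    return train_col.clip(lower, upper), test_col.clip(lower, upper), (lower, upper)

# Hypothetical values: mostly moderate, with one extreme point in each set
train_intensity = pd.Series([float(v) for v in range(1, 20)] + [108.0])
test_intensity = pd.Series([0.5, 12.0, 200.0])

train_w, test_w, (lower, upper) = winsorize_with_train_bounds(train_intensity, test_intensity)
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print("clipped test values:", test_w.tolist())
```

Estimating the bounds on `train` alone avoids leaking information from the test set into preprocessing.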

### ✅ Clean Features Summary

- **All columns are complete**:
  - No missing values detected
  - No need for imputation techniques
  
- **Feature quality**:
  - Reasonable ranges for most features
  - Intensity feature has some extreme values (addressed above)
  
- **Target variable**:
  - Calories is log-transformed and well-scaled
  - This should help machine learning algorithms converge more effectively
  
**Next Steps:**
- Proceed to modeling with these engineered features
- Consider creating separate versions with and without the squared duration feature to evaluate its impact
- Monitor the influence of Intensity feature in models to determine if outlier handling is necessary
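
A lightweight way to prepare the with/without comparison is to define the feature lists up front and select columns at fit time, rather than saving separate CSV files. The sketch below uses the column names shown in the `info()` output. The `feature_sets` helper is hypothetical, and the modeling step itself is left out.

```python
TARGET = "Calories"
BASE = ["Age", "Height", "Weight", "Duration", "Heart_Rate", "Body_Temp", "Sex_male"]
ENGINEERED = ["BMI", "Intensity", "TempHeart", "Duration2"]

def feature_sets(base, engineered, optional="Duration2"):
    """Return two column lists: all features, and all features minus one optional column."""
    full = list(base) + list(engineered)
    return {
        f"with_{optional}": full,
        f"without_{optional}": [col for col in full if col != optional],
    }

variants = feature_sets(BASE, ENGINEERED)
for name, cols in variants.items():
    print(name, len(cols), "features")
```

Each list can then be used as `train[cols]` when fitting, so both variants are scored on identical rows and the same target.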