# Step 2: Preprocessing
Handle categorical variables, transform the target, and drop non-informative features.

## Preprocessing Steps:
1. Load the raw datasets
2. Drop redundant ID columns
3. One-hot encode the categorical feature (Sex)
4. Transform the target variable (Calories) with a log1p transformation, log(x + 1)
5. Save the preprocessed datasets

## References:
- Log transformation for right-skewed data: Hastie, T., Tibshirani, R., & Friedman, J. (2009). "The Elements of Statistical Learning" (2nd ed.). Springer.
- One-hot encoding: Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research, 12, 2825-2830.

In [None]:
# Import essential libraries for data manipulation
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations and mathematical functions

# Load the raw training and test datasets from CSV files
train = pd.read_csv('datasets/train.csv')  # Training data includes the target variable (Calories)
test = pd.read_csv('datasets/test.csv')    # Test data is used for making predictions

# Keep track of original test IDs for submission file creation later
# We'll need these IDs to match our predictions with the correct test instances
test_ids = test['id']

In [None]:
# Drop 'id' column from both datasets as it's not useful for modeling
# The 'id' is just a row identifier and has no predictive power
train.drop(columns=['id'], inplace=True)  # Remove from training data
test.drop(columns=['id'], inplace=True)   # Remove from test data

In [None]:
# One-hot encode the categorical 'Sex' variable
# This converts the categorical variable into indicator columns that machine learning models can use
# We use drop_first=True to avoid multicollinearity (the "dummy variable trap")
# This encoding creates a single boolean column 'Sex_male' where:
# - True (1) indicates male
# - False (0) indicates female (the dropped first category)
# Note: encoding train and test separately assumes both contain every 'Sex' category
train = pd.get_dummies(train, columns=['Sex'], drop_first=True)  # Apply to training data
test = pd.get_dummies(test, columns=['Sex'], drop_first=True)    # Apply to test data
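
Encoding the two datasets separately only yields matching columns if every category appears in both. The sketch below shows one possible way to guard against that by fixing the category list before encoding. The toy frames and the `categories` list are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frames standing in for train/test (hypothetical values, not the real datasets).
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [36, 64, 20]})
toy_test = pd.DataFrame({"Sex": ["male", "male"], "Age": [51, 38]})  # no "female" rows

# Assumed full set of categories; declaring it up front keeps the encoding stable
# even when one dataset is missing a category.
categories = ["female", "male"]
for frame in (toy_train, toy_test):
    frame["Sex"] = pd.Categorical(frame["Sex"], categories=categories)

# get_dummies creates a column for every declared category (observed or not),
# then drop_first removes the first one ("female") in both frames.
train_enc = pd.get_dummies(toy_train, columns=["Sex"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["Sex"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Without the explicit categories, the toy test frame would contain only "male", and `drop_first=True` would drop that sole category, leaving no `Sex_male` column at all.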

In [None]:
# Log-transform the target variable (Calories) to address its right-skewed distribution
# We use log1p (log(x + 1)) instead of log(x) so that zero values remain defined
# The transformation makes the distribution more symmetric, which can improve model performance
# It also matches RMSLE (Root Mean Squared Logarithmic Error) evaluation:
# RMSE on the log1p-transformed target equals RMSLE on the original scale
# Predictions made on this scale must be converted back with np.expm1 before submission
# Reference: https://www.kaggle.com/code/carlolepelaars/understanding-the-metric-rmsle
train['Calories'] = np.log1p(train['Calories'])  # log1p = log(x + 1)
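
As a minimal sketch of why the log1p target pairs with RMSLE, the example below inverts the transform with `np.expm1` and checks that RMSE on the log scale equals RMSLE on the original scale. The `rmsle` helper and the sample arrays are hypothetical illustrations, not values from the dataset or code from the original notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, computed on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical calorie values and hypothetical model outputs on the log1p scale.
y_true = np.array([150.0, 34.0, 29.0, 140.0])
pred_log = np.array([5.0, 3.5, 3.4, 4.9])

# expm1 (exp(x) - 1) is the exact inverse of log1p, so this recovers calories.
pred_calories = np.expm1(pred_log)

# RMSE measured directly on the log1p scale.
rmse_log = np.sqrt(np.mean((pred_log - np.log1p(y_true)) ** 2))

print(rmse_log, rmsle(y_true, pred_calories))
```

The two printed numbers agree up to floating-point error, so a model trained to minimize RMSE on the transformed target is effectively minimizing RMSLE on calories.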

In [None]:
# Display the first few rows of the preprocessed training data
# This allows us to verify that:
# 1. The 'id' column has been removed
# 2. 'Sex' has been one-hot encoded (now appears as 'Sex_male')
# 3. 'Calories' has been log-transformed (values should be smaller and less skewed)
train.head()


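|   | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories | Sex_male |
|---|-----|--------|--------|----------|------------|-----------|----------|----------|
| 0 | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 5.01728 | True |
| 1 | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 3.555348 | False |
| 2 | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 3.401197 | False |
| 3 | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 4.94876 | True |
| 4 | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 4.990433 | False |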


In [None]:
# Save preprocessed datasets for later use in modeling steps
# This allows us to quickly load these preprocessed datasets in future notebooks
# without repeating the preprocessing steps
train.to_csv('datasets/train_preprocessed.csv', index=False)  # Save preprocessed training data
test.to_csv('datasets/test_preprocessed.csv', index=False)    # Save preprocessed test data
pd.DataFrame({'id': test_ids}).to_csv('datasets/test_ids.csv', index=False)  # Save test IDs separately for submission file creation
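
A later notebook will need the saved test features and IDs to line up row for row. The sketch below is one way to sanity-check that round trip. It writes toy data to a temporary directory rather than `datasets/`, and the frame contents and ID values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the preprocessed test set and its IDs (hypothetical values).
toy_test = pd.DataFrame({"Age": [51, 38], "Sex_male": [False, True]})
toy_ids = pd.Series([750000, 750001], name="id")

with tempfile.TemporaryDirectory() as tmp:
    test_path = Path(tmp) / "test_preprocessed.csv"
    ids_path = Path(tmp) / "test_ids.csv"

    # Same save pattern as the cell above: no index column in either file.
    toy_test.to_csv(test_path, index=False)
    pd.DataFrame({"id": toy_ids}).to_csv(ids_path, index=False)

    # Reload while the temporary files still exist.
    reloaded_test = pd.read_csv(test_path)
    reloaded_ids = pd.read_csv(ids_path)["id"]

# Row order is preserved by to_csv/read_csv, so IDs and features stay aligned
# as long as both files have the same number of rows.
rows_match = len(reloaded_test) == len(reloaded_ids)
columns_match = list(reloaded_test.columns) == list(toy_test.columns)
print(rows_match, columns_match)
```

Because both files are written with `index=False`, no extra index column appears on reload, and the feature columns come back under the same names.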
