# Obesity Factor Analysis (30 Features Total)

**Project Objective**: Analyze lifestyle factors influencing obesity  
**Author**: Lee Ji-hyun  
**Date**: 2025-11-19

**Important**: The final dataset has 30 columns in total, including the target variable  
- X (Features): 29 columns
- y (Target): 1 column
- **Total: 30 columns**

## 1. Setup and Import Libraries

In [None]:
# Install required libraries
!pip install pandas numpy matplotlib seaborn scikit-learn scipy -q

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy.stats import f_oneway
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

print("Libraries imported successfully!")

## 2. Upload Data

In [None]:
from google.colab import files
uploaded = files.upload()
print("\nUploaded files:", list(uploaded.keys()))

In [None]:
# Load data
df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')

print("=" * 80)
print("Obesity Factor Analysis: Lifestyle Factors Only")
print("Excluding Weight and Height (data leakage) and Gender (not lifestyle-related)")
print("=" * 80)
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
display(df.head())

## 3. Feature Selection

### Excluded Variables (Data Leakage Prevention)
- Weight (direct indicator of obesity)
- Height (used in BMI calculation)
- Gender (not lifestyle-related)
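
To see why Weight and Height leak the target: BMI is weight (kg) divided by height (m) squared, and obesity levels are conventionally defined as BMI ranges. The sketch below uses made-up rows and assumes WHO-style adult cutoffs (18.5, 25, 30). Whether `NObeyesdad` was derived with exactly these cutoffs is an assumption, and the helper names `bmi` and `coarse_class` are hypothetical.

```python
# Hypothetical rows for illustration only; not taken from the dataset.
people = [
    {"Weight": 50.0, "Height": 1.75},
    {"Weight": 68.0, "Height": 1.75},
    {"Weight": 82.0, "Height": 1.75},
    {"Weight": 100.0, "Height": 1.75},
]

def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2

def coarse_class(bmi_value):
    """Assumed WHO-style adult cutoffs; coarser than the 7 target levels."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25:
        return "Normal"
    if bmi_value < 30:
        return "Overweight"
    return "Obese"

# Two columns are enough to recover a weight class without any lifestyle data.
labels = [coarse_class(bmi(p["Weight"], p["Height"])) for p in people]
for person, label in zip(people, labels):
    print(person, "->", label)
```

Because the class follows almost directly from these two columns, any model that sees them would learn the BMI formula rather than lifestyle patterns.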

### Selected Variables (13 lifestyle factors)
- Eating Habits (5): FAVC, FCVC, NCP, CAEC, CH2O
- Physical Activity (2): FAF, TUE
- Substance Use (2): CALC, SMOKE
- Genetic/Personal (2): family_history_with_overweight, Age
- Environmental (2): MTRANS, SCC

In [None]:
# Select lifestyle features only
lifestyle_features = [
    # Eating habits
    'FAVC', 'FCVC', 'NCP', 'CAEC', 'CH2O',
    # Physical activity
    'FAF', 'TUE',
    # Substance use
    'CALC', 'SMOKE',
    # Genetic/Personal
    'family_history_with_overweight', 'Age',
    # Environmental
    'MTRANS', 'SCC'
]

target_col = 'NObeyesdad'

# Category definitions (English)
feature_categories = {
    'Eating Habits (5)': ['FAVC', 'FCVC', 'NCP', 'CAEC', 'CH2O'],
    'Physical Activity (2)': ['FAF', 'TUE'],
    'Substance Use (2)': ['CALC', 'SMOKE'],
    'Genetic/Personal (2)': ['family_history_with_overweight', 'Age'],
    'Environmental (2)': ['MTRANS', 'SCC']
}

print("Selected lifestyle variables:")
print("=" * 60)
for category, features in feature_categories.items():
    print(f"\n{category}:")
    for feat in features:
        print(f"  - {feat}")

print(f"\nTotal features: {len(lifestyle_features)}")
print("Planned total after feature engineering: 30 columns (29 X + 1 y)")

# Filter data
df_filtered = df[lifestyle_features + [target_col]].copy()
print(f"\nFiltered data shape: {df_filtered.shape}")

## 4. Data Preprocessing

In [None]:
# 4-1. Label Encoding
print("[4-1] Label Encoding")
print("=" * 60)

df_processed = df_filtered.copy()
le_dict = {}
categorical_features = df_processed.select_dtypes(include=['object']).columns

for col in categorical_features:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])
    le_dict[col] = le
    print(f"{col}: {dict(zip(le.classes_, le.transform(le.classes_)))}")

print(f"\nEncoding complete: {len(categorical_features)} categorical variables")

# Separate features and target
X = df_processed.drop(target_col, axis=1)
y = df_processed[target_col]

print(f"\nX shape: {X.shape}")
print(f"y shape: {y.shape}")
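
One caveat about the encoding above: `LabelEncoder` sorts class labels before assigning codes, so the integer codes follow alphabetical order rather than severity. The sketch below compares that with an explicit ordinal mapping. The names `severity_order`, `ordinal_map`, and `y_ordinal` are hypothetical and not part of the pipeline above, and the toy `labels` series stands in for `df['NObeyesdad']`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed severity order of the seven target classes (lowest to highest).
severity_order = [
    'Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I',
    'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III'
]
labels = pd.Series(severity_order)  # stand-in for df['NObeyesdad']

# LabelEncoder: codes follow the sorted (alphabetical) class names.
alphabetical = LabelEncoder().fit_transform(labels)

# Explicit mapping: codes follow severity.
ordinal_map = {name: code for code, name in enumerate(severity_order)}
y_ordinal = labels.map(ordinal_map)

print(pd.DataFrame({
    'label': labels,
    'LabelEncoder': alphabetical,
    'ordinal': y_ordinal,
}))
```

With alphabetical codes, `Overweight_Level_I` ends up above `Obesity_Type_III`, which matters for any statistic that treats the codes as magnitudes, such as a Pearson correlation.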

In [None]:
# 4-2. Standardization
print("[4-2] Standardization (StandardScaler)")
print("=" * 60)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

print("\nBefore vs After Standardization:")
comparison = pd.DataFrame({
    'Metric': ['Original Mean', 'Original Std', 'Scaled Mean', 'Scaled Std'],
    'Age': [X['Age'].mean(), X['Age'].std(), X_scaled_df['Age'].mean(), X_scaled_df['Age'].std()],
    'FAF': [X['FAF'].mean(), X['FAF'].std(), X_scaled_df['FAF'].mean(), X_scaled_df['FAF'].std()]
})
display(comparison.round(3))
print("\nStandardization complete!")
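
A note on reading the comparison table: `StandardScaler` computes z = (x - mean) / std with the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`). The "Scaled Std" row is therefore expected to sit slightly above 1, by a factor of sqrt(n / (n - 1)). A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration; not taken from the dataset.
ages = pd.DataFrame({'Age': [18.0, 21.0, 23.0, 26.0, 32.0]})

# Manual z-score using the population standard deviation (ddof=0).
z_manual = (ages['Age'] - ages['Age'].mean()) / ages['Age'].std(ddof=0)
z_sklearn = StandardScaler().fit_transform(ages)[:, 0]

print("Manual matches StandardScaler:", np.allclose(z_manual, z_sklearn))
print("Std with ddof=0:", pd.Series(z_sklearn).std(ddof=0))
print("Std with pandas default (ddof=1):", pd.Series(z_sklearn).std())
```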

## 5. Feature Engineering (16 Derived Features)

**Goal**: 29 features in total (13 original + 16 derived)  
**Total columns**: 30 (29 X + 1 y)

In [None]:
print("[5] Creating 16 derived features")
print("=" * 60)

X_augmented = X_scaled_df.copy()

# 1. Unhealthy eating score
X_augmented['Unhealthy_Eating_Score'] = (
    X_scaled_df['FAVC'] + X_scaled_df['CAEC'] - X_scaled_df['FCVC']
)
print("  1. Unhealthy_Eating_Score")

# 2. Healthy hydration score
X_augmented['Healthy_Hydration_Score'] = (
    X_scaled_df['CH2O'] + X_scaled_df['FCVC'] - X_scaled_df['CALC']
)
print("  2. Healthy_Hydration_Score")

# 3. Sedentary lifestyle index
X_augmented['Sedentary_Lifestyle'] = X_scaled_df['TUE'] - X_scaled_df['FAF']
print("  3. Sedentary_Lifestyle")

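# 4. Irregular eating pattern
# Uses the unscaled (label-encoded) columns from X: after standardization NCP is
# centered on zero, so a standardized denominator would make this ratio unstable.
X_augmented['Irregular_Eating'] = X['CAEC'] / (X['NCP'] + 1e-5)
print("  4. Irregular_Eating")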

# 5. Genetic-lifestyle risk
X_augmented['Genetic_Lifestyle_Risk'] = (
    X_scaled_df['family_history_with_overweight'] * 
    (X_scaled_df['FAVC'] + X_scaled_df['CAEC'])
)
print("  5. Genetic_Lifestyle_Risk")

# 6. Exercise-diet balance
X_augmented['Exercise_Diet_Balance'] = (
    X_scaled_df['FAF'] - X_scaled_df['FAVC'] + X_scaled_df['FCVC']
)
print("  6. Exercise_Diet_Balance")

# 7. Age-lifestyle risk
X_augmented['Age_Lifestyle_Risk'] = (
    X_scaled_df['Age'] * (X_scaled_df['CAEC'] + X_scaled_df['TUE'])
)
print("  7. Age_Lifestyle_Risk")

# 8. Alcohol-snack combined
X_augmented['Alcohol_Snack_Combined'] = (
    X_scaled_df['CALC'] + X_scaled_df['CAEC']
)
print("  8. Alcohol_Snack_Combined")

# 9. Active lifestyle score
X_augmented['Active_Lifestyle_Score'] = (
    X_scaled_df['FAF'] + (4 - X_scaled_df['MTRANS']) - X_scaled_df['TUE']
)
print("  9. Active_Lifestyle_Score")

# 10. Vegetable-exercise synergy
X_augmented['Vegetable_Exercise_Synergy'] = (
    X_scaled_df['FCVC'] * X_scaled_df['FAF']
)
print(" 10. Vegetable_Exercise_Synergy")

# 11. Water-exercise effect
X_augmented['Water_Exercise_Effect'] = (
    X_scaled_df['CH2O'] * X_scaled_df['FAF']
)
print(" 11. Water_Exercise_Effect")

# 12. Total unhealthy habits
X_augmented['Total_Unhealthy_Habits'] = (
    X_scaled_df['SMOKE'] + X_scaled_df['CALC'] + X_scaled_df['CAEC']
)
print(" 12. Total_Unhealthy_Habits")

# 13. Meal regularity
X_augmented['Meal_Regularity'] = X_scaled_df['NCP'] - X_scaled_df['CAEC']
print(" 13. Meal_Regularity")

# 14. Age squared
X_augmented['Age_Squared'] = X_scaled_df['Age'] ** 2
print(" 14. Age_Squared")

# 15. Exercise deficit
X_augmented['Exercise_Deficit'] = (4 - X_scaled_df['FAF']) * X_scaled_df['TUE']
print(" 15. Exercise_Deficit")

# 16. Family-age interaction
X_augmented['Family_Age_Interaction'] = (
    X_scaled_df['family_history_with_overweight'] * X_scaled_df['Age']
)
print(" 16. Family_Age_Interaction")

print(f"\nTotal features after engineering: {X_augmented.shape[1]}")
print(f"  - Original features: {len(lifestyle_features)}")
print(f"  - Derived features: 16")
print(f"  - Total X features: {X_augmented.shape[1]}")

# Final standardization
scaler_final = StandardScaler()
X_final = scaler_final.fit_transform(X_augmented)
X_final_df = pd.DataFrame(X_final, columns=X_augmented.columns)

print(f"\nFinal dataset shape:")
print(f"  - X (features): {X_final_df.shape}")
print(f"  - y (target): {y.shape}")
print(f"  - Total columns: {X_final_df.shape[1] + 1} (X + y)")
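
Ratio features deserve extra care when their inputs are standardized. Z-scores are centered on zero, so a denominator can land arbitrarily close to zero or turn negative, and a small epsilon only guards against an exact zero. The sketch below uses made-up values to compare a ratio built on a raw denominator with the same ratio built on a standardized one.

```python
import numpy as np

# Made-up values for illustration; not taken from the dataset.
ncp_raw = np.array([2.0, 2.99, 3.0, 3.01, 4.0])   # meals per day, never near zero
caec_code = np.array([1.0, 2.0, 2.0, 2.0, 3.0])   # encoded snacking frequency

# Standardize the denominator: values near the mean land near zero.
ncp_z = (ncp_raw - ncp_raw.mean()) / ncp_raw.std()

ratio_raw = caec_code / (ncp_raw + 1e-5)
ratio_z = caec_code / (ncp_z + 1e-5)

print("raw ratio range:         ", ratio_raw.min(), ratio_raw.max())
print("standardized ratio range:", ratio_z.min(), ratio_z.max())
```

The raw ratio stays within a narrow band, while the standardized version flips sign and spikes for rows close to the mean.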

## 6. Factor Influence Analysis

In [None]:
print("=" * 80)
print("Factor Influence Analysis (Weight/Height Excluded)")
print("=" * 80)

# Correlation analysis: absolute Pearson correlation on the label-encoded columns
correlation_with_target = df_processed.corr()[target_col].drop(target_col).abs().sort_values(ascending=False)

print("\nTop 13 Lifestyle Factors (Correlation with Obesity):")
print("=" * 60)
for i, (feature, corr) in enumerate(correlation_with_target.items(), 1):
    if corr >= 0.15:
        impact = "Strong"
    elif corr >= 0.10:
        impact = "Moderate"
    elif corr >= 0.05:
        impact = "Weak"
    else:
        impact = "Minimal"
    
    print(f"{i:2d}. {feature:35s} | {corr:.3f} | {impact}")
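
Pearson correlation treats the integer codes as magnitudes, which is questionable for nominal variables such as `MTRANS`. One possible cross-check for categorical pairs is Cramér's V, which is derived from the chi-square statistic and ranges from 0 (no association) to 1 (perfect association). The helper `cramers_v` below is a hypothetical sketch, and the toy series are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (no bias correction)."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return np.sqrt(chi2 / (n * k))

# Made-up data: one pair perfectly associated, one pair independent.
x = pd.Series(['Walking', 'Walking', 'Automobile', 'Automobile'])
y_same = pd.Series(['Normal', 'Normal', 'Obese', 'Obese'])
y_mixed = pd.Series(['Normal', 'Obese', 'Normal', 'Obese'])

v_perfect = cramers_v(x, y_same)
v_none = cramers_v(x, y_mixed)
print("Perfect association:", v_perfect)
print("No association:", v_none)
```

On the real data, a call such as `cramers_v(df['MTRANS'], df['NObeyesdad'])` would give a ranking that does not depend on how the categories happen to be coded.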

In [None]:
# Category-wise analysis
category_impacts = {}
for category, features in feature_categories.items():
    vars_in_category = [v for v in features if v in correlation_with_target.index]
    if vars_in_category:
        avg_corr = correlation_with_target[vars_in_category].mean()
        category_impacts[category] = avg_corr

category_impacts_sorted = sorted(category_impacts.items(), key=lambda x: x[1], reverse=True)

print("\nAverage Impact by Category:")
print("=" * 60)
for i, (category, avg_impact) in enumerate(category_impacts_sorted, 1):
    print(f"{i}. {category:25s} | {avg_impact:.3f}")

In [None]:
# ANOVA test
numeric_features = ['FCVC', 'NCP', 'CH2O', 'FAF', 'TUE', 'Age']
anova_results = []

print("\nANOVA Test Results:")
print("=" * 60)

for feature in numeric_features:
    groups = [df[df[target_col] == level][feature].values 
              for level in df[target_col].unique()]
    f_stat, p_value = f_oneway(*groups)
    anova_results.append({
        'Feature': feature,
        'F-statistic': f_stat,
        'P-value': p_value,
        'Significant': 'Yes' if p_value < 0.05 else 'No'
    })

anova_df = pd.DataFrame(anova_results).sort_values('F-statistic', ascending=False)
display(anova_df)
print("\n* P-value < 0.05: Statistically significant")
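
With a large sample, ANOVA p-values can fall below 0.05 even when group differences are small, so an effect size is a useful companion to the F-statistic. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) on made-up groups. The helper name `eta_squared` is hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

def eta_squared(groups):
    """Share of total variance explained by group membership."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Made-up groups for illustration; not taken from the dataset.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

f_stat, p_value = f_oneway(*groups)
eta2 = eta_squared(groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, eta squared = {eta2:.4f}")
```

The same `groups` list built inside the ANOVA loop could be passed to `eta_squared` to add an effect-size column to `anova_df`.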

## 7. Visualizations

In [None]:
# Prepare data for visualization
y_labels = le_dict[target_col].inverse_transform(y)
obesity_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 
                 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']

In [None]:
# Visualization 1: Factor Ranking
plt.figure(figsize=(12, 8))
top_factors = correlation_with_target
colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.9, len(top_factors)))
plt.barh(range(len(top_factors)), top_factors.values, color=colors, edgecolor='black')
plt.yticks(range(len(top_factors)), top_factors.index, fontsize=11)
plt.xlabel('Absolute Correlation with Obesity Level', fontsize=13, fontweight='bold')
plt.title('Lifestyle Factors Influencing Obesity\n(Weight & Height Excluded)', 
         fontsize=15, fontweight='bold', pad=20)
plt.grid(axis='x', alpha=0.3)
plt.axvline(x=0.15, color='red', linestyle='--', alpha=0.5, label='Strong (>=0.15)')
plt.axvline(x=0.10, color='orange', linestyle='--', alpha=0.5, label='Moderate (>=0.10)')
plt.legend(loc='lower right', fontsize=9)
plt.tight_layout()
plt.show()
print("[Visualization 1] Factor Influence Ranking")

In [None]:
# Visualization 2: Category Impact
plt.figure(figsize=(10, 6))
categories = [item[0] for item in category_impacts_sorted]
impacts = [item[1] for item in category_impacts_sorted]
colors_cat = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#96CEB4']
plt.bar(range(len(categories)), impacts, color=colors_cat, edgecolor='black', alpha=0.8)
plt.xticks(range(len(categories)), categories, rotation=20, ha='right', fontsize=10)
plt.ylabel('Average Correlation Coefficient', fontsize=12, fontweight='bold')
plt.title('Average Impact by Lifestyle Category\n(Pure Lifestyle Factors Only)', 
         fontsize=14, fontweight='bold', pad=15)
plt.grid(axis='y', alpha=0.3)

for i, (cat, imp) in enumerate(zip(categories, impacts)):
    plt.text(i, imp + 0.005, f'{imp:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()
print("[Visualization 2] Category Impact")

In [None]:
# Visualization 3: Top 3 Factors Distribution
top_3_factors = correlation_with_target.head(3).index.tolist()

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, factor in enumerate(top_3_factors):
    # Use the encoded (numeric) column so categorical factors such as CAEC can be plotted
    plot_data = pd.DataFrame({factor: df_processed[factor], 'Obesity': y_labels})
    sns.violinplot(data=plot_data, x='Obesity', y=factor, 
                  palette='viridis', order=obesity_order, ax=axes[idx])
    axes[idx].set_xticklabels(obesity_order, rotation=45, ha='right', fontsize=9)
    axes[idx].set_xlabel('Obesity Level', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel(factor, fontsize=11, fontweight='bold')
    axes[idx].set_title(f'{factor}\n(Correlation: {correlation_with_target[factor]:.3f})', 
                       fontsize=12, fontweight='bold')
    axes[idx].grid(axis='y', alpha=0.3)

plt.suptitle('Top 3 Lifestyle Factors Distribution (Weight/Height Excluded)', 
            fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print("[Visualization 3] Top 3 Factors Distribution")

In [None]:
# Visualization 4: Family History Impact
# Note: the rate below counts every class above Normal_Weight (overweight and obesity levels)
obesity_classes = ['Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 
                   'Obesity_Type_II', 'Obesity_Type_III']

family_data = pd.DataFrame({
    'Family_History': le_dict['family_history_with_overweight'].inverse_transform(
        df_processed['family_history_with_overweight']),
    'Obesity': y_labels
})

family_yes = family_data[family_data['Family_History'] == 'yes']
family_no = family_data[family_data['Family_History'] == 'no']

obesity_rate_yes = (family_yes['Obesity'].isin(obesity_classes).sum() / len(family_yes)) * 100
obesity_rate_no = (family_no['Obesity'].isin(obesity_classes).sum() / len(family_no)) * 100

plt.figure(figsize=(12, 6))
categories_fh = ['Family History: YES', 'Family History: NO']
rates = [obesity_rate_yes, obesity_rate_no]
colors_fh = ['#FF6B6B', '#4ECDC4']

plt.bar(categories_fh, rates, color=colors_fh, edgecolor='black', width=0.6, alpha=0.8)
plt.ylabel('Overweight/Obesity Rate (%)', fontsize=13, fontweight='bold')
plt.title('Impact of Family History on Overweight/Obesity Rate\n(Lifestyle Factors Analysis)', 
         fontsize=15, fontweight='bold', pad=15)
plt.ylim(0, 100)
plt.grid(axis='y', alpha=0.3)

for i, (cat, rate) in enumerate(zip(categories_fh, rates)):
    plt.text(i, rate + 2, f'{rate:.1f}%', ha='center', fontsize=14, fontweight='bold')

diff = obesity_rate_yes - obesity_rate_no
# Signed difference in percentage points (avoids a hardcoded '+' if the gap is negative)
plt.text(0.5, max(rates) - 10, f'Difference: {diff:+.1f} pp', 
        ha='center', fontsize=12, fontweight='bold', 
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()
print("[Visualization 4] Family History Impact")
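
The two rates in this chart can be cross-checked with a short group-by. The sketch below uses a made-up eight-row table with the same column names as `family_data`. The column `Above_Normal` and the list `above_normal` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    'Family_History': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no'],
    'Obesity': ['Obesity_Type_I', 'Overweight_Level_I', 'Normal_Weight', 'Obesity_Type_II',
                'Normal_Weight', 'Insufficient_Weight', 'Overweight_Level_II', 'Normal_Weight'],
})

# Every class above Normal_Weight counts toward the rate.
above_normal = ['Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I',
                'Obesity_Type_II', 'Obesity_Type_III']
toy['Above_Normal'] = toy['Obesity'].isin(above_normal)

# Mean of a boolean column is the share of True values.
rates_check = toy.groupby('Family_History')['Above_Normal'].mean() * 100
gap_pp = rates_check['yes'] - rates_check['no']
print(rates_check)
print(f"Gap: {gap_pp:+.1f} percentage points")
```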

In [None]:
# Visualization 5: Correlation Heatmap
plt.figure(figsize=(12, 10))
lifestyle_corr = df_processed[lifestyle_features].corr()
mask = np.triu(np.ones_like(lifestyle_corr, dtype=bool))
sns.heatmap(lifestyle_corr, mask=mask, annot=True, fmt='.2f', 
            cmap='coolwarm', center=0, square=True, linewidths=0.5,
            cbar_kws={"shrink": 0.8}, annot_kws={'size': 9})
plt.title('Lifestyle Factors Correlation Matrix\n(Weight & Height Excluded)', 
         fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
print("[Visualization 5] Correlation Heatmap")

## 8. Final Results

In [None]:
print("=" * 80)
print("Final Analysis Results (Lifestyle Factors Only)")
print("=" * 80)

print("\n[1] Top 5 Lifestyle Factors")
print("=" * 60)
for i, (feature, corr) in enumerate(correlation_with_target.head(5).items(), 1):
    print(f"{i}. {feature:35s} | Correlation: {corr:.3f}")

print("\n[2] Category Impact Ranking")
print("=" * 60)
for i, (category, impact) in enumerate(category_impacts_sorted, 1):
    print(f"{i}. {category:25s} | Average: {impact:.3f}")

print("\n[3] Key Findings")
print("=" * 60)
print("- Weight/Height excluded to prevent data leakage")
print(f"- {correlation_with_target.index[0]} is the top lifestyle factor (correlation: {correlation_with_target.iloc[0]:.3f})")
print(f"- Family history remains a strong predictor (correlation: {correlation_with_target['family_history_with_overweight']:.3f})")
print("- Exercise (FAF) may be protective, but the absolute correlations above do not show direction")
print("- Lifestyle factors alone carry signal about obesity level; predictive power is not yet tested here")

print("\n[4] Data Leakage Prevention")
print("=" * 60)
print("Excluded: Weight, Height, Gender")
print("Features are lifestyle behaviors, not direct obesity indicators")
print("Practical implication: lifestyle factors are candidate targets for obesity prevention (associations, not causal proof)")

print("\n[5] Dataset Summary")
print("=" * 60)
print(f"Total columns: {X_final_df.shape[1] + 1} (including target)")
print(f"  - X (features): {X_final_df.shape[1]} columns")
print(f"  - y (target): 1 column")
print(f"  - Original: 13 lifestyle factors")
print(f"  - Derived: 16 engineered features")
print(f"  - Total samples: {len(y)}")

print("\n" + "=" * 80)
print("Analysis Complete!")
print("=" * 80)

## 9. Save Processed Data

In [None]:
# Save processed data
X_final_df.to_csv('X_lifestyle_29features.csv', index=False)
pd.Series(y).to_csv('y_obesity_target.csv', index=False)

print("Data saved successfully!")
print(f"  - X_lifestyle_29features.csv: {X_final_df.shape}")
print(f"  - y_obesity_target.csv: {y.shape}")
print(f"\nTotal columns: {X_final_df.shape[1]} + 1 = {X_final_df.shape[1] + 1}")

# Download files
from google.colab import files
files.download('X_lifestyle_29features.csv')
files.download('y_obesity_target.csv')

## Summary

### Completed Steps
1. Weight and Height excluded (data leakage prevention); Gender excluded (not lifestyle-related)
2. 13 lifestyle factors selected
3. 16 derived features created
4. **Total: 30 columns (29 X + 1 y)**
5. Factor influence analysis
6. Statistical validation (ANOVA)
7. 5 visualizations created

### Key Results
- **Top factor**: CAEC (Snacking), absolute correlation 0.327
- **2nd**: Family history, absolute correlation 0.314
- **3rd**: Age, absolute correlation 0.236

### Next Steps
- Stage 4: Random Forest modeling + XAI analysis
- n_estimators=128
- Expected accuracy: 70-85%
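
A minimal sketch of what the Stage 4 baseline could look like, assuming a stratified 80/20 split. Only `n_estimators=128` comes from the plan above. The synthetic data from `make_classification` stands in for `X_final_df` and `y`, and the split ratio and `random_state` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape idea: 29 features, 7 classes.
X_demo, y_demo = make_classification(
    n_samples=700, n_features=29, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)

# Stratify so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=128, random_state=42)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(f"Hold-out accuracy on synthetic data: {acc:.3f}")
print(f"Feature importances available: {len(model.feature_importances_)}")
```

The accuracy printed here says nothing about the obesity data; it only confirms that the pipeline runs end to end.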

---

**Author**: Lee Ji-hyun  
**Date**: 2025-11-19  
**Version**: V2 (30 features total)