# Gestational Diabetes Meal Risk Predictor
## Notebook 1: Data Cleaning and Exploratory Data Analysis

**Author:** Sanjay Kumar Chhetri  
**Date:** December 25, 2025  
**Project:** Springboard Capstone - Predicting Post-Meal Glucose Risk

---

### Objectives:
1. Load nutritional data from USDA FoodData Central
2. Load glycemic index reference tables
3. Merge and clean the datasets
4. Perform exploratory data analysis
5. Understand feature distributions and relationships
6. Prepare data for feature engineering

### Data Sources:
- **USDA FoodData Central**: Nutritional composition of foods
- **Glycemic Index Tables**: GI/GL values from published research

## 1. Import Required Libraries

Import all necessary libraries for data processing, analysis, and visualization.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Utilities
import os
import warnings
from pathlib import Path

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("✓ Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Set Up Data Paths

Define paths to the data and report directories, and create any that do not exist yet.

In [None]:
# Define project root and data directories
PROJECT_ROOT = Path(os.getcwd()).parent
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
REPORTS_FIGURES = PROJECT_ROOT / 'reports' / 'figures'

# Create directories if they don't exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
REPORTS_FIGURES.mkdir(parents=True, exist_ok=True)

print(f"Project Root: {PROJECT_ROOT}")
print(f"Raw Data Path: {DATA_RAW}")
print(f"Processed Data Path: {DATA_PROCESSED}")
print(f"\nDirectory structure created/verified ✓")

## 3. Data Collection Instructions

### Next Steps: Download Data Files

Before proceeding with the analysis, you need to download the following datasets:

#### A. USDA FoodData Central
1. Visit: https://fdc.nal.usda.gov/download-datasets.html
2. Download: **"FoodData Central CSV"** (Foundation Foods or SR Legacy)
3. Extract and save `food.csv` and `food_nutrient.csv` to `data/raw/`
4. Alternative: Use the FoodData Central API for programmatic access

#### B. Glycemic Index Reference Table
1. Option 1: Download from University of Sydney GI Database
   - Visit: https://glycemicindex.com/
2. Option 2: Use compiled GI tables from published research
   - Example: Atkinson et al. (2008) International Tables of GI and GL
3. Save as `gi_table.csv` in `data/raw/`

Expected columns in GI table:
- `food_name`: Name of the food item
- `glycemic_index`: GI value (0-100+)
- `glycemic_load`: GL value (optional, can be calculated)
- `category`: Food category (e.g., grains, fruits, dairy)
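
When `glycemic_load` is absent from the table, it can be derived from the GI and the available carbohydrate in a serving using the standard definition GL = GI × available carbohydrate (g) / 100. The sketch below assumes that available carbohydrate is total carbohydrate minus fiber and that both amounts refer to the same serving; the function name and arguments are illustrative and not part of the project code.

```python
def glycemic_load(glycemic_index, total_carbs_g, fiber_g=0.0):
    """Glycemic load for one serving: GI x available carbohydrate (g) / 100."""
    # Assumption: available carbohydrate = total carbohydrate - fiber.
    # Clamp at zero so noisy inputs (fiber > carbs) cannot yield a negative GL.
    available_carbs_g = max(total_carbs_g - fiber_g, 0.0)
    return glycemic_index * available_carbs_g / 100.0


# Worked example: GI 50, 30 g carbohydrate, 5 g fiber -> 50 * 25 / 100
example_gl = glycemic_load(glycemic_index=50, total_carbs_g=30.0, fiber_g=5.0)
print(f"GL = {example_gl:.1f}")
```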

#### File Structure Expected:
```
data/raw/
├── food.csv              # USDA food items
├── food_nutrient.csv     # USDA nutritional values
└── gi_table.csv          # Glycemic index reference
```

**Note:** Once these files are in `data/raw/`, run `python scripts/process_usda_data.py` to build the processed dataset that the loading cells below read.
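
Before running the loading cells, it may help to confirm that the three expected files are actually present. The following is a minimal sketch; `find_missing_files` and `EXPECTED_RAW_FILES` are hypothetical names introduced here, and the relative path `data/raw` assumes the working directory is the project root.

```python
from pathlib import Path

# File names taken from the expected structure listed above
EXPECTED_RAW_FILES = ("food.csv", "food_nutrient.csv", "gi_table.csv")


def find_missing_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


missing_files = find_missing_files(Path("data") / "raw")
if missing_files:
    print(f"Missing from data/raw/: {', '.join(missing_files)}")
else:
    print("All expected raw files are present")
```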

## 4. Load USDA Nutritional Data

Load and explore the USDA FoodData Central dataset containing nutritional information.

In [None]:
# Load processed USDA food data
# The data has been pre-processed from the raw USDA files
# Run scripts/process_usda_data.py to regenerate if needed

usda_foods = pd.read_csv(DATA_PROCESSED / 'usda_foods_with_nutrition.csv')

print("✓ Loaded USDA FoodData Central (processed)")
print(f"\nDataset Shape: {usda_foods.shape}")
print(f"\nColumns available:")
print(usda_foods.columns.tolist())
print(f"\nData types:")
print(usda_foods.dtypes)
print(f"\nFirst 5 foods:")
print(usda_foods.head())
print(f"\nBasic statistics:")
print(usda_foods.describe())

## 5. Load Glycemic Index Reference Data

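The processed dataset already includes a glycemic index column, so this section examines the GI values and assigns each food a GI category instead of loading a separate table.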

In [None]:
# The glycemic index is already included in the processed dataset
# Let's examine the GI distribution

print("Glycemic Index Statistics:")
print(usda_foods['glycemic_index'].describe())

# Categorize GI values
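def categorize_gi(gi_value):
    """Map a GI value to 'Low' (<55), 'Medium' (55 to <70), or 'High' (>=70)."""
    # Missing GI values stay missing instead of falling through to 'High'
    if pd.isna(gi_value):
        return np.nan
    if gi_value < 55:
        return 'Low'
    elif gi_value < 70:
        return 'Medium'
    else:
        return 'High'

usda_foods['gi_category'] = usda_foods['glycemic_index'].apply(categorize_gi)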

print(f"\nGI Category Distribution:")
print(usda_foods['gi_category'].value_counts())

print(f"\nSample foods by GI category:")
for category in ['Low', 'Medium', 'High']:
    print(f"\n{category} GI foods (sample):")
    sample = usda_foods[usda_foods['gi_category'] == category][['food_name', 'glycemic_index', 'total_carbs_g', 'fiber_g']].head(3)
    print(sample.to_string(index=False))
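
The same thresholds can also be applied in one vectorized step. The sketch below uses `pd.cut` with left-closed bins so that 55 falls into Medium and 70 into High, matching the cutoffs in `categorize_gi`; missing GI values stay missing. The helper name `categorize_gi_series` is an illustration and not part of the notebook.

```python
import numpy as np
import pandas as pd


def categorize_gi_series(gi):
    """Bin GI values into Low (<55), Medium (55 to <70), High (>=70)."""
    return pd.cut(
        gi,
        bins=[-np.inf, 55, 70, np.inf],
        labels=["Low", "Medium", "High"],
        right=False,  # intervals are [lower, upper)
    )


# Boundary values plus one missing entry
demo_gi = pd.Series([54.9, 55.0, 69.9, 70.0, np.nan])
demo_categories = categorize_gi_series(demo_gi)
print(demo_categories.tolist())
```

The result is a categorical Series, which keeps the Low/Medium/High order for later sorting and plotting.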

## 6. Data Inspection and Quality Check

Examine data structure, missing values, and data types.

In [None]:
def inspect_dataframe(df, name):
    """Comprehensive inspection of a dataframe"""
    print(f"\n{'='*80}")
    print(f"Dataset: {name}")
    print(f"{'='*80}")
    print(f"\nShape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"\nData Types:\n{df.dtypes}")
    print(f"\nMissing Values:")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing,
        'Percentage': missing_pct
    }).sort_values('Missing Count', ascending=False)
    print(missing_df[missing_df['Missing Count'] > 0])
    print(f"\nFirst Few Rows:")
    print(df.head(3))
    
    # Only describe numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\nBasic Statistics (Numeric Columns):")
        print(df[numeric_cols].describe())
    
# Inspect the real USDA data
inspect_dataframe(usda_foods, "USDA FoodData Central (Real Data)")
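
Beyond missing values and data types, a quality check can also look for physically implausible values. The sketch below flags entries outside a given range per column. The 0 to 100 bounds assume that gram-based columns are expressed per 100 g of food, which this notebook does not confirm for the processed file; the helper `flag_out_of_range` and the toy data are illustrative only.

```python
import pandas as pd


def flag_out_of_range(df, bounds):
    """Return a boolean DataFrame marking values outside [low, high] per column."""
    flags = pd.DataFrame(False, index=df.index, columns=list(bounds))
    for col, (low, high) in bounds.items():
        # Missing values are not flagged here; they are counted separately
        flags[col] = df[col].notna() & ~df[col].between(low, high)
    return flags


# Toy data standing in for the real table
toy_foods = pd.DataFrame({
    "total_carbs_g": [12.0, 130.0, None],
    "fiber_g": [2.0, -1.0, 3.0],
})
# Assumed bounds: grams per 100 g of food
range_flags = flag_out_of_range(
    toy_foods, {"total_carbs_g": (0, 100), "fiber_g": (0, 100)}
)
print(range_flags.sum())
```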

## 7. Create Comprehensive Nutritional Dataset

Verify that each food item has a complete nutritional profile, including all key features needed for modeling.

In [None]:
# We now have real USDA data with comprehensive nutritional profiles
# Let's verify the data quality and completeness

print("=" * 80)
print("USDA FOODS DATASET SUMMARY")
print("=" * 80)

print(f"\nTotal foods: {len(usda_foods):,}")
print(f"Features: {len(usda_foods.columns)}")

print(f"\nAvailable nutritional features:")
for col in usda_foods.columns:
    if col not in ['fdc_id', 'food_name', 'data_type', 'gi_category']:
        missing = usda_foods[col].isna().sum()
        missing_pct = (missing / len(usda_foods)) * 100
        print(f"  • {col:20s}: {missing:5d} missing ({missing_pct:5.2f}%)")

print(f"\nData type distribution:")
print(usda_foods['data_type'].value_counts())

print(f"\nSample of available foods:")
print(usda_foods[['food_name', 'total_carbs_g', 'protein_g', 'fat_g', 'energy_kcal', 'glycemic_index']].head(10))

## 8. Exploratory Data Analysis: Feature Distributions

Visualize the distributions of key nutritional features.

In [None]:
# Plot distributions of key nutritional features from real USDA data
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Distribution of Nutritional Features (Real USDA Data)', fontsize=16, fontweight='bold')

features = ['total_carbs_g', 'fiber_g', 'sugar_g', 'protein_g', 
            'fat_g', 'saturated_fat_g', 'energy_kcal', 'glycemic_index']

for idx, feature in enumerate(features):
    row = idx // 3
    col = idx % 3
    ax = axes[row, col]
    
    # Filter outliers for better visualization
    data = usda_foods[feature].dropna()
    q99 = data.quantile(0.99)
    data_filtered = data[data <= q99]
    
    ax.hist(data_filtered, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    ax.set_title(f'{feature.replace("_", " ").title()}', fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)
    
    # Add statistics
    mean_val = data.mean()
    median_val = data.median()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=1, label=f'Mean: {mean_val:.1f}')
    ax.axvline(median_val, color='orange', linestyle='--', linewidth=1, label=f'Median: {median_val:.1f}')
    ax.legend(fontsize=8)

# Remove empty subplot
axes[2, 2].axis('off')

plt.tight_layout()
plt.savefig(REPORTS_FIGURES / 'nutritional_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Distribution plots created and saved")
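
The histograms above drop values beyond the 99th percentile for display, while the mean and median lines are computed on the full data. A small summary like the one sketched below can show how much each plot hides; `tail_summary` is a hypothetical helper and the numbers are toy values, not results from the USDA data.

```python
import pandas as pd


def tail_summary(series, quantile=0.99):
    """Summarize the plotting cutoff and the values it removes."""
    data = series.dropna()
    cutoff = data.quantile(quantile)
    return {
        "cutoff": float(cutoff),
        "n_trimmed": int((data > cutoff).sum()),  # rows hidden from the histogram
        "mean": float(data.mean()),               # computed on untrimmed data
        "median": float(data.median()),
    }


toy_feature = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
toy_summary = tail_summary(toy_feature)
print(toy_summary)
```

A large gap between mean and median, as in the toy series, points to a heavy right tail that the trimmed histogram will understate.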

## 9. Correlation Analysis

Examine correlations between nutritional features to identify multicollinearity and relationships.

In [None]:
# Select numerical features for correlation analysis
numeric_features = ['total_carbs_g', 'fiber_g', 'sugar_g', 'protein_g', 
                    'fat_g', 'saturated_fat_g', 'energy_kcal', 'glycemic_index']

# Calculate correlation matrix
correlation_matrix = usda_foods[numeric_features].corr()

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, 
            annot=True, 
            fmt='.2f', 
            cmap='coolwarm', 
            center=0,
            square=True,
            linewidths=1,
            cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix: Nutritional Features (Real USDA Data)', 
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig(REPORTS_FIGURES / 'correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Identify high correlations
print("\nHighly Correlated Feature Pairs (|r| > 0.7):")
correlation_found = False
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.7:
            print(f"  {correlation_matrix.columns[i]} <-> {correlation_matrix.columns[j]}: {corr_val:.3f}")
            correlation_found = True

if not correlation_found:
    print("  No feature pairs with |r| > 0.7 found")

print("\n✓ Correlation analysis complete")
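
The nested loop above can also be written without explicit indexing by masking the upper triangle of the correlation matrix. This is a sketch of that alternative on toy data; `high_corr_pairs` is a hypothetical helper name, and the 0.7 threshold simply mirrors the one used in the cell above.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair is reported once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy features: b is an exact multiple of a, c is only weakly related to both
toy_features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
toy_pairs = high_corr_pairs(toy_features)
print(toy_pairs)
```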

## 10. Glycemic Index Analysis

Analyze the distribution of glycemic index values and their relationship with other features.

In [None]:
# Reuse the gi_category column assigned in Section 5 with categorize_gi
gi_values = usda_foods['glycemic_index'].dropna()

# Plot GI distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# GI Distribution
axes[0].hist(gi_values, bins=20, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].axvline(x=55, color='orange', linestyle='--', linewidth=2, label='Low/Medium threshold')
axes[0].axvline(x=70, color='red', linestyle='--', linewidth=2, label='Medium/High threshold')
axes[0].set_title('Glycemic Index Distribution', fontweight='bold', fontsize=12)
axes[0].set_xlabel('Glycemic Index')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# GI Categories (fixed Low/Medium/High order instead of sorting by count)
gi_counts = usda_foods['gi_category'].value_counts().reindex(['Low', 'Medium', 'High'], fill_value=0)
axes[1].bar(gi_counts.index, gi_counts.values, edgecolor='black', alpha=0.7)
axes[1].set_title('GI Category Distribution', fontweight='bold', fontsize=12)
axes[1].set_xlabel('GI Category')
axes[1].set_ylabel('Count')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\nGI Category Distribution:")
print(usda_foods['gi_category'].value_counts())
print(f"\nGI Statistics:")
print(usda_foods['glycemic_index'].describe())
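
To look at how GI relates to the other features, one simple option is to compare nutrient medians across GI categories. The sketch below runs on a toy frame whose column names follow the notebook (`gi_category`, `total_carbs_g`, `fiber_g`); the values are invented for illustration and say nothing about the USDA data.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy_gi = pd.DataFrame({
    "gi_category": ["Low", "Low", "Medium", "High", "High"],
    "total_carbs_g": [10.0, 20.0, 40.0, 70.0, 80.0],
    "fiber_g": [6.0, 4.0, 3.0, 1.0, 1.0],
})

# Median nutrient content per GI category, in a fixed Low/Medium/High order
gi_summary = (
    toy_gi.groupby("gi_category")[["total_carbs_g", "fiber_g"]]
    .median()
    .reindex(["Low", "Medium", "High"])
)
print(gi_summary)
```

Medians are used here because nutrient columns tend to be right-skewed, so a few extreme foods would otherwise dominate a mean.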

## 11. Key Insights and Data Quality Summary

Summarize findings from the exploratory analysis and note any data quality issues.

In [None]:
print("=" * 80)
print("KEY FINDINGS FROM EXPLORATORY DATA ANALYSIS")
print("=" * 80)

print("\n📊 Data Quality:")
print(f"   • Total food items: {len(usda_foods):,}")
print(f"   • Features available: {len(usda_foods.columns)}")
print(f"   • Missing values: {usda_foods.isnull().sum().sum()}")
print(f"   • Duplicate rows: {usda_foods.duplicated().sum()}")
print(f"   • Data source: USDA FoodData Central (Foundation + SR Legacy)")

print("\n🥗 Nutritional Feature Ranges:")
for feature in ['total_carbs_g', 'fiber_g', 'protein_g', 'fat_g', 'energy_kcal']:
    print(f"   • {feature}: {usda_foods[feature].min():.1f} - {usda_foods[feature].max():.1f}")

print(f"\n📈 Glycemic Index:")
print(f"   • Range: {usda_foods['glycemic_index'].min():.1f} - {usda_foods['glycemic_index'].max():.1f}")
print(f"   • Mean: {usda_foods['glycemic_index'].mean():.1f}")
print(f"   • Low GI foods (<55): {(usda_foods['gi_category'] == 'Low').sum():,}")
print(f"   • Medium GI foods (55 to <70): {(usda_foods['gi_category'] == 'Medium').sum():,}")
print(f"   • High GI foods (>=70): {(usda_foods['gi_category'] == 'High').sum():,}")

print("\n🔍 Data Insights:")
print(f"   • Most foods have GI estimates (not all measured)")
print(f"   • Fiber values range widely, important for GI impact")
print(f"   • Energy correlates with fat and carbs as expected")
print(f"   • Ready for feature engineering phase")

print("\n✅ Next Steps:")
print("   1. ✓ Real USDA data successfully loaded and analyzed")
print("   2. Proceed to Notebook 02: Feature Engineering")
print("   3. Create derived features (glycemic load, ratios)")
print("   4. Generate risk labels for supervised learning")
print("   5. Train and evaluate ML models")

print("\n" + "=" * 80)

## 12. Save Cleaned Data

The cleaned dataset already lives in `data/processed/`, so this cell only confirms what is available for the next notebook.

In [None]:
# The data is already cleaned and saved in the processed folder
# This notebook now uses the pre-processed USDA data

print("=" * 80)
print("DATA PREPARATION COMPLETE")
print("=" * 80)

print(f"\n✓ Working dataset: data/processed/usda_foods_with_nutrition.csv")
print(f"✓ Total foods: {len(usda_foods):,}")
print(f"✓ Nutritional features: {len([col for col in usda_foods.columns if col.endswith('_g') or col.endswith('_kcal')])}")
print(f"✓ Glycemic index included: Yes")
print(f"✓ Visualizations saved to: reports/figures/")

print(f"\n📊 Dataset ready for:")
print(f"   • Feature engineering (Notebook 02)")
print(f"   • Model training (Notebook 03)")
print(f"   • Application deployment")

print(f"\n📝 To regenerate from raw USDA files:")
print(f"   python scripts/process_usda_data.py")

print("\n" + "=" * 80)
print("✓ Ready for Notebook 02: Feature Engineering!")
print("=" * 80)
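
If this notebook's additions (for example the `gi_category` column) should carry over to the next notebook, the frame would need to be written back to disk. The sketch below shows one way to save a frame and confirm that it reloads with the same shape; `save_and_verify` and the file name `foods_cleaned.csv` are hypothetical, and the demo writes to a temporary directory instead of `data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df, path):
    """Write df to CSV without the index and check that it reloads intact."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape changed on reload: {df.shape} -> {reloaded.shape}")
    return reloaded


# Demo on a toy frame in a throwaway directory
toy_clean = pd.DataFrame({"food_name": ["apple", "rice"], "glycemic_index": [36.0, 73.0]})
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded_clean = save_and_verify(toy_clean, Path(tmp_dir) / "processed" / "foods_cleaned.csv")
print(reloaded_clean)
```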