# Fitbit Health Tracker Data Analysis - Exploratory Data Analysis

This notebook explores Fitbit activity, sleep, and heart rate data from 33 users across 30 days.

## Objectives
- Explore data distributions and patterns
- Identify health and behavioral patterns
- Analyze user clusters based on lifestyle metrics
- Generate insights and visualizations


In [None]:
# Import libraries
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
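import warnings
# Suppress all warnings to keep output tidy; note this also hides deprecation and convergence warnings
warnings.filterwarnings('ignore')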

# Add the project root (the parent of the notebook's directory) to the path so `src` is importable
sys.path.append(str(Path().resolve().parent))

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")


## 1. Data Loading and Preprocessing


In [None]:
# Import preprocessing module
from src.preprocess import preprocess_pipeline

# Load and preprocess data
print("Loading and preprocessing data...")
df = preprocess_pipeline(data_dir='../data')

print(f"\nData shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()


## 2. Feature Engineering


In [None]:
# Import feature engineering module
from src.feature_engineering import feature_engineering_pipeline

# Create features
print("Creating features...")
daily_features, user_features = feature_engineering_pipeline(df)

print(f"\nDaily features shape: {daily_features.shape}")
print(f"\nUser features shape: {user_features.shape}")
print("\nUser features sample:")
user_features.head()

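The internals of `feature_engineering_pipeline` are not shown in this notebook. As a minimal sketch of how daily rows might be rolled up into per-user features, assuming a hypothetical `user_id` column and made-up values, a grouped mean with an `avg_` prefix yields the kind of column names used later (`avg_steps`, `avg_sleep_efficiency`):

```python
import pandas as pd

# Toy daily records; `user_id` and all values are hypothetical.
daily = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "steps": [8000, 12000, 3000, 5000, 4000],
    "sleep_efficiency": [0.90, 0.94, 0.80, None, 0.84],
})

# Mean per user (missing values are skipped by default), with an `avg_` prefix.
user_level = (
    daily.groupby("user_id")[["steps", "sleep_efficiency"]]
    .mean()
    .add_prefix("avg_")
    .reset_index()
)
print(user_level)
```

Because the mean skips missing values, a user with few recorded nights still gets an `avg_sleep_efficiency`; counting valid days per user alongside the mean would make that visible.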

## 3. Exploratory Data Analysis

### 3.1 Data Overview and Summary Statistics


In [None]:
# Summary statistics
print("Daily Features Summary:")
daily_features.describe()


In [None]:
# User Features Summary
print("User Features Summary:")
user_features.describe()


### 3.2 Distribution Analysis


In [None]:
# Distribution plots for key metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# (column, title, x-axis label) for each panel, in row-major order
panels = [
    ('steps', 'Distribution of Daily Steps', 'Steps'),
    ('sleep_efficiency', 'Distribution of Sleep Efficiency', 'Sleep Efficiency'),
    ('avg_resting_hr', 'Distribution of Resting Heart Rate', 'Resting HR (bpm)'),
    ('lifestyle_score', 'Distribution of Lifestyle Score', 'Lifestyle Score'),
]

for ax, (col, title, xlabel) in zip(axes.flat, panels):
    if col not in daily_features.columns:
        ax.set_visible(False)  # hide the panel instead of leaving an empty axes
        continue
    # Drop missing values so days without a reading do not affect the bins
    ax.hist(daily_features[col].dropna(), bins=30, edgecolor='black')
    ax.set_title(title, fontweight='bold')
    ax.set_xlabel(xlabel)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

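Histograms can be complemented with a few numbers that describe distribution shape. The sketch below uses made-up step counts, not the notebook's data, to show how the gap between mean and median and the sample skewness flag a long right tail:

```python
import pandas as pd

# Toy step counts (hypothetical values) with one very active day.
steps = pd.Series([2000, 3000, 3500, 4000, 4500, 5000, 25000], name="steps")

summary = {
    "median": steps.median(),
    "mean": steps.mean(),
    "skew": steps.skew(),  # > 0 indicates a longer right tail
}
print(summary)
```

On the real data, the same calls could be applied to `daily_features['steps']` or any other column plotted above.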

### 3.3 Correlation Analysis


In [None]:
# Correlation heatmap
from src.visualization import plot_correlation_heatmap

# Select key columns for correlation
corr_columns = ['steps', 'calories_burned', 'sleep_efficiency', 
                'avg_resting_hr', 'high_intensity_minutes', 'lifestyle_score']
available_corr_cols = [col for col in corr_columns if col in daily_features.columns]

if available_corr_cols:
    fig = plot_correlation_heatmap(daily_features, available_corr_cols, 
                                   title='Feature Correlation Matrix')
    plt.show()

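`plot_correlation_heatmap` lives in `src.visualization` and its implementation is not shown here. As a rough sketch of the underlying computation, assuming it is based on Pearson correlation, the matrix can be built directly with pandas. The data below is synthetic; only the column names mirror this notebook:

```python
import numpy as np
import pandas as pd

# Synthetic data; column names mirror the notebook, values are made up.
rng = np.random.default_rng(0)
steps = rng.normal(8000, 2500, size=200)
toy = pd.DataFrame({
    "steps": steps,
    "calories_burned": 1800 + 0.05 * steps + rng.normal(0, 60, size=200),
    "sleep_efficiency": rng.uniform(0.7, 1.0, size=200),
})

# Pairwise Pearson correlation; missing values are excluded pair by pair.
corr = toy.corr(method="pearson")

# Hide the redundant upper triangle (and diagonal) before printing or plotting.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)
print(lower.round(2))
```

The same `mask` can be passed to `sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1)` to draw only the lower triangle.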

## 4. User Clustering Analysis


In [None]:
# Import clustering module
from src.clustering import clustering_pipeline

# Perform clustering
print("Performing KMeans clustering...")
user_features_clustered, cluster_profiles, kmeans, scaler, X_scaled, feature_names = clustering_pipeline(
    user_features, 
    n_clusters=3,
    random_state=42
)

print(f"\nClustering features used: {feature_names}")
print("\nCluster distribution:")
print(user_features_clustered['cluster'].value_counts().sort_index())

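The values returned above (`kmeans`, `scaler`, `X_scaled`) suggest that `clustering_pipeline` standardizes the features before fitting KMeans, though its code is not shown here. The sketch below illustrates that general approach on synthetic data and adds a silhouette comparison as one common way to sanity-check a choice such as `n_clusters=3`. All values are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic user-level features: [avg_steps, avg_resting_hr]; values are made up.
rng = np.random.default_rng(42)
centers = np.array([[4000, 75], [8000, 68], [13000, 60]])
X = np.vstack([c + rng.normal(0, [400, 1.5], size=(11, 2)) for c in centers])

# Standardize so steps (thousands) do not dominate heart rate (tens).
X_scaled = StandardScaler().fit_transform(X)

# Compare candidate cluster counts by silhouette score (higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only 33 users, silhouette scores are noisy, so they are better read as a rough check than as a definitive selection rule.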

### 4.1 Cluster Visualization


In [None]:
# PCA visualization of clusters
from src.visualization import plot_cluster_pca

fig, pca = plot_cluster_pca(
    user_features_clustered,
    user_features_clustered['cluster'].values,
    feature_names
)
plt.show()

print(f"\nPCA Explained Variance Ratio: {pca.explained_variance_ratio_}")
print(f"Total Variance Explained: {pca.explained_variance_ratio_.sum():.2%}")

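For reference, here is a minimal sketch of the kind of projection a helper like `plot_cluster_pca` might perform, assuming it uses scikit-learn's `PCA` on standardized features (the `explained_variance_ratio_` attribute printed above is part of that API). The input matrix is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic user-level matrix: 33 users x 4 features (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(33, 4))
X[:, 1] += 0.9 * X[:, 0]  # make two of the features correlated

X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance for a 2-D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)                         # (33, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance kept in 2-D
```

If the total explained variance is low, clusters that overlap in the 2-D plot may still be well separated in the full feature space.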

### 4.2 Cluster Profile Analysis


In [None]:
# Display cluster profiles
print("Cluster Profiles:")
print(cluster_profiles.round(2))

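How `cluster_profiles` is built is not shown in this notebook. A common construction, sketched here on toy values under the assumption that a profile is the per-cluster feature mean, is a grouped mean with the cluster size added for context:

```python
import pandas as pd

# Toy clustered users (hypothetical values); `cluster` holds KMeans labels.
users = pd.DataFrame({
    "avg_steps": [3000, 3500, 9000, 9500, 14000],
    "avg_resting_hr": [74, 76, 66, 68, 58],
    "cluster": [0, 0, 1, 1, 2],
})

# One row per cluster: feature means plus the number of users in each cluster.
profiles = users.groupby("cluster").mean()
profiles["n_users"] = users.groupby("cluster").size()
print(profiles.round(2))
```

Keeping the user count next to the means helps when a cluster contains only one or two users, since its mean then says little about a group.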

In [None]:
# Cluster profile comparison plot
from src.visualization import plot_cluster_profiles

fig = plot_cluster_profiles(cluster_profiles, feature_names)
plt.show()


## 5. Insights and Interpretation

### 5.1 Cluster Characteristics


With `n_clusters=3`, the clustering separates users into three groups. KMeans cluster IDs are arbitrary, so use the cluster profiles from Section 4.2 to match each ID to one of these descriptions:

1. **High Activity Users**: Users with high step counts, calories burned, and active minutes
2. **Moderate Activity Users**: Users with balanced activity and rest
3. **Low Activity Users**: Users with lower activity levels and potentially more sedentary behavior

Points to verify against the cluster profiles:
- Whether sleep efficiency differs across clusters
- Whether resting heart rate falls as activity level rises
- Whether lifestyle scores track overall activity and sleep behavior

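Since KMeans assigns cluster IDs arbitrarily, the activity labels above have to be attached by inspecting the profiles. One way to do that, sketched on hypothetical profile values and assuming exactly three clusters ranked by `avg_steps`, is:

```python
import pandas as pd

# Toy cluster profiles (hypothetical values), indexed by arbitrary KMeans IDs.
profiles = pd.DataFrame(
    {"avg_steps": [9200.0, 3100.0, 13800.0]},
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Rank clusters by mean steps and attach labels from lowest to highest.
names = ["Low Activity", "Moderate Activity", "High Activity"]
order = profiles["avg_steps"].sort_values().index
label_map = dict(zip(order, names))
print(label_map)
```

The resulting mapping could then be applied with `user_features_clustered['cluster'].map(label_map)`. Ranking on steps alone is a simplification; a composite of several activity features may be more appropriate.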

In [None]:
# Detailed cluster analysis
print("Detailed Cluster Analysis:\n")
for cluster_id in sorted(user_features_clustered['cluster'].unique()):
    cluster_data = user_features_clustered[user_features_clustered['cluster'] == cluster_id]
    print(f"\n{'='*50}")
    print(f"CLUSTER {cluster_id} - {len(cluster_data)} users")
    print(f"{'='*50}")
    print(f"Average Steps: {cluster_data['avg_steps'].mean():.0f}")
    print(f"Average Sleep Efficiency: {cluster_data['avg_sleep_efficiency'].mean():.3f}")
    print(f"Average Resting HR: {cluster_data['avg_resting_hr'].mean():.1f} bpm")
    print(f"Average Lifestyle Score: {cluster_data['avg_lifestyle_score'].mean():.2f}")


## 6. Conclusions

This analysis has revealed:

1. **Activity Patterns**: Clear distinctions between user activity levels
2. **Sleep Quality**: Correlation between sleep efficiency and overall health metrics
3. **Lifestyle Clusters**: Users can be grouped into distinct lifestyle categories
4. **Health Metrics**: Heart rate and activity levels show meaningful relationships

These insights can be used for personalized health recommendations and tracking.
