# 📊 Exploratory Data Analysis – HealthTrack Insights

This notebook explores and visualizes the behavioral patterns and demographic composition of HealthTrack users. The data has already been cleaned and split into five related tables: users, daily activity, sleep logs, meal tracking, and engagement events.

## 1. Load Cleaned Data

In [None]:
import pandas as pd

# Load data from clean CSVs
users = pd.read_csv("clean_data/clean_users.csv", sep=";")
activity = pd.read_csv("clean_data/clean_daily_activity.csv")
sleep = pd.read_csv("clean_data/clean_sleep_logs.csv")
meals = pd.read_csv("clean_data/clean_meal_tracking.csv")
events = pd.read_csv("clean_data/clean_engagement_events.csv")
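
The users file is read with `sep=";"` while the other four files use the default comma. When the separator does not match the file, pandas usually does not raise; it returns a frame with a single wide column. A minimal sketch of a guard against that, assuming a hypothetical `load_table` helper and a made-up in-memory sample (the `gender` and `region` column names are illustrative, not confirmed by the data):

```python
import io
import pandas as pd

def load_table(source, sep=","):
    """Read a CSV and fail fast if the separator looks wrong (hypothetical helper)."""
    df = pd.read_csv(source, sep=sep)
    # A delimited file parsed with the wrong separator collapses into one column.
    if df.shape[1] < 2:
        raise ValueError(f"Only {df.shape[1]} column parsed; check sep={sep!r}")
    return df

# Made-up sample standing in for a semicolon-delimited file
sample = "user_id;gender;region\n1;F;Europe\n2;M;APAC\n"

users_demo = load_table(io.StringIO(sample), sep=";")
print(users_demo.shape)

try:
    load_table(io.StringIO(sample))  # default comma separator
except ValueError as err:
    print("Caught:", err)
```

The same helper accepts a file path in place of the `StringIO` object, since both are passed straight to `pd.read_csv`.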

## 2. Initial Exploration

In [None]:
# Shape (rows, columns) of each table
print("Users table shape:", users.shape)
print("Activity logs shape:", activity.shape)
print("Sleep logs shape:", sleep.shape)
print("Meals shape:", meals.shape)
print("Engagement events shape:", events.shape)
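
Because the tables are related, it can also help to confirm that every log row points at a known user. The sketch below uses made-up rows and assumes `users` carries the same `user_id` key that the log tables use; `orphan_ids` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up rows; the `user_id` key on `users` is an assumption
users_demo = pd.DataFrame({"user_id": [1, 2, 3]})
activity_demo = pd.DataFrame({"user_id": [1, 1, 2, 9], "sessions": [3, 2, 4, 1]})

def orphan_ids(child, parent, key="user_id"):
    """Return key values present in `child` but absent from `parent`."""
    missing = ~child[key].isin(parent[key])
    return sorted(child.loc[missing, key].unique().tolist())

print("Orphan user_ids in activity:", orphan_ids(activity_demo, users_demo))
```

An empty list would suggest the child table joins cleanly to `users`.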

In [None]:
# Data types and first rows
users.info()
users.head()

## 3. Missing Values Analysis

In [None]:
users.isna().sum().sort_values(ascending=False)
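
Raw counts are hard to compare across tables of different sizes, so a percentage view may be easier to read. A minimal sketch on a made-up frame (the column names are illustrative only):

```python
import pandas as pd

# Made-up frame; column names are illustrative only
demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "gender": ["F", None, "M", None],
    "region": ["Europe", "APAC", None, "LATAM"],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = demo.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```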

## 4. Key Visualizations (Monochrome Style)

### 4.1 Gender Distribution

In [None]:
from IPython.display import Image
Image('eda_images/gender_distribution.png')

**Observations:**
- Gender distribution is mostly binary (M/F).
- A notable portion is unspecified or 'O'.
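
To put a number on "notable", the shares behind the chart could be computed directly. This is a sketch on made-up values; the `gender` column name and its codes are assumptions about the users table.

```python
import pandas as pd

# Made-up values; the `gender` column name and its codes are assumptions
gender_demo = pd.Series(["M", "F", "F", "O", None, "M", "F", None], name="gender")

# dropna=False keeps unspecified entries as their own category
shares = gender_demo.value_counts(normalize=True, dropna=False).mul(100).round(1)
print(shares)
```

Keeping missing entries in the tally matters here, because the default `dropna=True` would hide the unspecified group and inflate the other shares.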

### 4.2 Region Distribution

In [None]:
Image('eda_images/region_distribution.png')

**Observations:**
- Most users are concentrated in North America and Europe.
- LATAM and APAC are underrepresented.

### 4.3 Sleep Score Distribution

In [None]:
Image('eda_images/sleep_score_distribution.png')

**Observations:**
- Sleep scores are mostly between 65 and 90.
- The distribution is slightly left-skewed: a tail of low scores pulls the mean down, so more than half of users likely score above it.

### 4.4 Sleep Hours vs Activity Duration

In [None]:
Image('eda_images/sleep_vs_activity.png')

**Observations:**
- There is a mild positive trend between hours slept and activity duration.
- Users who sleep more tend to be more active.

### 4.5 User Engagement Events

In [None]:
Image('eda_images/user_engagement_events.png')

**Observations:**
- The most common events are 'notification_clicked' and 'challenge_joined'.
- Engagement strategy can leverage these behaviors.
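
The event ranking reported in Section 4.5 could be reproduced from the `events` table with a frequency count. The sketch below uses made-up rows; the `event_type` column name and the `badge_earned` value are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the `event_type` column name is an assumption
events_demo = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "event_type": [
        "notification_clicked", "challenge_joined", "notification_clicked",
        "badge_earned", "notification_clicked", "challenge_joined",
    ],
})

# Frequency of each event type, most common first
event_counts = events_demo["event_type"].value_counts()
print(event_counts.head(2))
```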

In [None]:
# Missing values per table (extends the users-only check in Section 3)
tables = {"Users": users, "Sleep logs": sleep, "Meals": meals, "Activity": activity}
for name, table in tables.items():
    print(f"\n{name} missing values:")
    print(table.isna().sum())

## 5. Summary of Key Metrics

This section calculates high-level metrics to summarize user behavior and wellness trends.

In [None]:
# KPI summary (means over all rows of each table)
print("Average daily sessions:", activity['sessions'].mean())
print("Average session duration (min):", activity['duration_min'].mean())
print("Average sleep score:", sleep['sleep_score'].mean())
print("Average hours slept:", sleep['hours_slept'].mean())
# Per-user averages cover only users with at least one row in the table
print("Average meal logs per user:", meals.groupby('user_id').size().mean())
print("Average engagement events per user:", events.groupby('user_id').size().mean())
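
One caveat on the per-user figures: `groupby('user_id').size().mean()` averages only over users who appear in the table, so users with no logs are left out and the result can overstate typical usage. A minimal sketch of both variants on made-up rows, assuming `users` has a `user_id` column:

```python
import pandas as pd

# Made-up rows; assumes `users` carries a `user_id` key
users_demo = pd.DataFrame({"user_id": [1, 2, 3, 4]})
meals_demo = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3]})

logs_per_user = meals_demo.groupby("user_id").size()

# Mean over users who logged at least one meal
active_mean = logs_per_user.mean()

# Mean over all users, counting users without logs as zero
all_mean = logs_per_user.reindex(users_demo["user_id"], fill_value=0).mean()

print("Per active user:", active_mean)
print("Per registered user:", all_mean)
```

Which denominator is appropriate depends on the question being asked; reporting both makes the difference explicit.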