# Predictive Model - CardioSense
### A Resting Heart Rate Prediction Model
---
---

# Executive Summary
**Goal:** This project explores the relationship between daily physical activity and cardiovascular health, specifically Resting Heart Rate (RHR).

**Method:** Using personal Apple Watch data and public Fitbit data, I analyzed the impact of Steps, Active Calories, and Sleep on RHR using Linear Regression and Random Forest models.

**Key Findings:**
1.  **Long-Term vs. Short-Term:** Consistent activity over months lowers RHR, but high-intensity days can temporarily raise RHR due to recovery demands.
2.  **The Role of Sleep:** For the general population (Fitbit), sleep duration was the strongest predictor of improved heart health.
3.  **Model Performance:** While daily physiological noise limits prediction accuracy ($R^2 \approx 0.25$), the models successfully identified the key drivers of heart health.

# Project 1

## Problem Definition
---
I aim to build a predictive model that estimates resting heart rate from workout intensity and volume measures. The purpose of the model is to determine whether increased physical activity leads to improved cardiovascular health, as reflected in resting heart rate. The model can help identify whether higher workout intensity or frequency is correlated with a lower resting heart rate over time, which is often considered an indicator of improved fitness and heart health.

This is interesting to me because I'm really into fitness and have personally noticed how higher-intensity workouts seem to affect my resting heart rate over the course of a week. By exploring this through data, I would like to move beyond casual observation and express the relationship in a measurable, quantifiable way rather than just looking at raw numbers.

## Describing the population
---
The population of interest is fitness enthusiasts who track daily workouts and health metrics using wearable devices such as the Apple Watch, Fitbit, or similar tools. For this project I'll analyze a hybrid sample composed of:
```
(1) My personal Apple Watch exports (daily heart rate, workouts, calories, sleep)
(2) A public multi-user wearable dataset from Kaggle
```
The combined sample lets me present a personal case study while also evaluating whether my personal patterns generalize across a larger group of wearable users with varied fitness levels and demographics.

## Variables
---
#### Independent variables (IV) - predictors related to workout intensity and volume
    Workout duration (minutes per day)
    Active calories burned (kcal per day)
    Average heart rate during workouts (bpm)
    Number of workouts per week (count)
    Steps per day (optional activity proxy)
    Sleep duration (hours per night) - treated as a predictor and as a confounder control.
#### Dependent variable (DV) - the target variable to predict/measure
    Resting heart rate (RHR), measured as the daily (or weekly) average resting HR in beats per minute (bpm)
#### Confounding variables
    A confounder is an external factor that influences both the IVs and the DV and can therefore create a biased association if not accounted for.
Potential confounders in this project:
- Sleep quantity/duration (poor sleep raises RHR and may reduce workout intensity)
- Stress or psychological load (raises RHR and may affect workout patterns)
- Illness or medication (temporarily raises RHR and alters activity)
- Caffeine/alcohol intake or dehydration (affects HR measurements)
- Age, sex, and baseline fitness level (since the Kaggle data is multi-user)
- Measurement differences across devices (Apple Watch vs. Fitbit sensor characteristics)
#### Dealing with confounders
- Measure and include where possible: include sleep duration from the Apple Health/Kaggle sleep fields as a covariate in analyses.
- Log/flag anomalies: create binary flags for days with illness, travel, or medication if recorded, and remove or mark extreme outliers in the public dataset.
- Control between-subject factors: when using the Kaggle multi-user data, include user-level covariates (age, sex) where available, and either run per-user analyses, stratify by fitness level, or use mixed-effects models to account for baseline differences.
- Device harmonisation: align units, aggregate to daily summaries, and note the device source as a covariate.
- Smoothing: apply rolling averages to resting HR to reduce day-to-day noise and focus on underlying trends.
- Sensitivity analyses: run models with and without certain covariates to show robustness of results and explicitly discuss remaining limitations.
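
The smoothing step in the list above can be sketched as follows. This is a minimal illustration on made-up values; the 7-day window, the `min_periods` choice, and the column names `date`, `resting`, and `resting_7d` are assumptions, not settings taken from the project.

```python
import pandas as pd

# Hypothetical daily resting heart rate values (bpm); not taken from the real export.
rhr = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "resting": [58, 61, 57, 63, 59, 60, 66, 58, 57, 59],
})

# 7-day trailing mean; min_periods=3 leaves the first two days as NaN
# instead of averaging over only one or two readings.
rhr["resting_7d"] = rhr["resting"].rolling(window=7, min_periods=3).mean()

print(rhr)
```

A trailing window only uses past days, so the smoothed value for a given day never depends on later readings.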

## Hypothesis
---
- Changes in workout intensity and volume are not associated with changes in resting heart rate over time (null hypothesis)
- Increased workout intensity or volume (higher duration, higher active calories, higher average workout HR) is associated with a decrease in resting heart rate over time (alternative hypothesis)

### Lecture style hypothesis
- If workout intensity increases, as measured by increased duration, increased active calories, and higher average workout HR, then resting heart rate will decrease over time.

## Data Collection
---
- Collected data
    1. Personal Apple Watch data
        - Exported from the Apple Health app, using Health Auto Export to create CSVs
        - Extract and standardise fields: date, resting heart rate, workout start/end, duration, average workout HR, active calories, steps, sleep duration.
    2. Public Kaggle wearable dataset
        - Download a Kaggle Fitbit-style dataset that includes heart rate, activity, and sleep fields
        - Inspect the schema and aggregate to a daily summary per user (resting HR, total active calories, total workout minutes, steps, sleep hours)
- Ensuring representativeness when using the sample
    1. The personal dataset represents a single-case study and is explicitly presented as such: valuable for authenticity but not generalizable on its own.
    2. The Kaggle dataset provides multi-user coverage to improve generalizability. I'll describe its sample size, age/sex distribution (if available), and device types.
    3. To avoid overgeneralizing, I'll
        - Run separate analyses on the personal dataset and the public dataset, then compare results.
        - Use per-user aggregation and mixed-effects modeling on the public dataset to capture within- and between-person effects.
        - Transparently report limitations: the hybrid sample may still be biased toward wearable users, who are not a random sample of the general population.
- Methods I'll use to collect data
    1. Personal: direct extract from Apple Health, converted to CSV. I'll log any relevant notes (illness, travel, unusual sleep, caffeine/alcohol) in a simple diary CSV to merge with the physiological records.
    2. Public: download the official CSVs from Kaggle and inspect the schema. If the data is minute-level, aggregate to daily or per-workout summaries (average workout HR, calories, duration).
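
Merging a diary CSV with the physiological records, as described above, could look roughly like this. It is a sketch under assumptions: the diary layout (one `note` per `date`) and the `flagged` column are hypothetical, and the values are invented.

```python
import pandas as pd

# Hypothetical daily records and diary notes; column names are illustrative.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
    "resting": [58, 64, 59],
})
diary = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-02"]),
    "note": ["illness"],
})

# Left merge keeps every physiological record; days without a diary entry get NaN.
merged = daily.merge(diary, on="date", how="left")

# Binary flag that can later serve as a covariate or an exclusion filter.
merged["flagged"] = merged["note"].notna().astype(int)

print(merged)
```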

## Choosing a Dataset
---
1. For this project, I'll be using a hybrid dataset
    - My personal Apple Watch health export, which has daily resting heart rate, workout minutes, active calories, steps, and sleep duration.
    - Fitbit fitness tracker data from Kaggle, which contains multiple CSV files with daily activity, heart rate, calories, and sleep information from multiple users.
2. Why this dataset is interesting to me:
    - I'm personally very passionate about fitness and have seen how my workout intensity affects my resting heart rate over time. By combining my own wearable health data with a larger, publicly available dataset of Fitbit users, I can explore my individual patterns and compare them to a broader population. This makes the project authentic and personally meaningful while also being more generalizable to a larger group. It also lets me answer the research question with quantifiable insights into the relationship between exercise intensity and cardiovascular health.
3. What is in the dataset?
    - Apple Watch data (personal export)
        - Resting heart rate (daily average bpm)
        - Workout duration (minutes)
        - Active calories burned (kcal)
        - Average heart rate during workouts (bpm)
        - Steps per day
        - Sleep duration
    - Fitbit dataset (Möbius / Kaggle):
        - Daily activity: total steps, calories burned, active minutes
        - Heart rate: minute-level and daily average heart rate per user.
        - Sleep data: total sleep time, sleep records, time in bed.
        - Participants: tracked via user ID, with multiple participants
This dataset has multiple variables and can be aggregated into daily records suitable for the current project's analysis.

4. Where is the dataset from?
    - My Apple Watch data was exported directly from the Apple Health app
    - The Fitbit dataset was obtained from Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit
    
The Kaggle data is public-facing and free to use: the Fitbit dataset is available under Kaggle's public dataset license, and my personal health export is my own data, which I have full permission to use.

5. When is the dataset from?
    - My Apple Health dataset covers my recent daily health metrics (collected over several years)
    - The Fitbit dataset was originally collected in 2016 as part of a personal fitness tracker study and made publicly available on Kaggle for educational and analytical purposes.








## Importing Data into Colab
- CSV files from both sources are copied into Google Drive.
- I'll import them into Colab and print the first 5 rows of each file.

The first cell below imports the libraries (the Google Drive mount is commented out for local runs). The second cell sets the path to the Apple Health data, which is in CSV format, reads it with pandas into a new DataFrame, and previews the first 5 records.

In [None]:
import os  # used by the os.path.exists checks in the loading cells

import pandas as pd
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:
# AppleHealthData = '/content/drive/MyDrive/Datasets/AppleHealthExport.csv'


AppleHealthData = '../data/AppleHealthExport.csv'

if os.path.exists(AppleHealthData):
    df_appleHealthData = pd.read_csv(AppleHealthData)
    print("Apple Health Data Loaded Successfully")
else:
    print(f"Warning: {AppleHealthData} not found. Please ensure data is in the 'data' directory.")


df_appleHealthData.head()

Importing the Kaggle data into Colab

Providing the paths for the activity, sleep, and heart rate CSVs, then reading each file and printing the first 5 rows.

In [None]:
# Define file paths
# activity_path = '/content/drive/MyDrive/Datasets/FitbitData/dailyActivity_merged.csv'
# sleep_path = '/content/drive/MyDrive/Datasets/FitbitData/minuteSleep_merged.csv'
# heart_path = '/content/drive/MyDrive/Datasets/FitbitData/heartrate_seconds_merged.csv'

activity_path = '../data/dailyActivity_merged.csv'
sleep_path = '../data/minuteSleep_merged.csv'
heart_path = '../data/heartrate_seconds_merged.csv'


# Load the datasets into pandas DataFrames
if os.path.exists(activity_path):
    df_activity = pd.read_csv(activity_path)
    print("Fitbit Activity Data Loaded Successfully")
else:
    print(f"Warning: {activity_path} not found. Please ensure data is in the 'data' directory.")


if os.path.exists(sleep_path):
    df_sleep = pd.read_csv(sleep_path)
    print("Fitbit Sleep Data Loaded Successfully")
else:
    print(f"Warning: {sleep_path} not found. Please ensure data is in the 'data' directory.")


# heart_path is the heart rate file path defined at the top of this cell
if os.path.exists(heart_path):
    df_heart_rate = pd.read_csv(heart_path)
    print("Fitbit Heart Rate Data Loaded Successfully")
else:
    print(f"Warning: {heart_path} not found. Please ensure data is in the 'data' directory.")


In [None]:
# Preview the first few rows of Activity Dataset
print("Activity Data:")
df_activity.head()

In [None]:
# Preview the first few rows of Sleep Dataset in minute format (will clean and pre process the data in project 2)
print("\nSleep Data:")
df_sleep.head()

In [None]:
print("\nHeart Rate Data:")
df_heart_rate.head()

# Project 2

#### **Data Cleaning Strategy**

> In this section, I'll prepare the data for analysis. The raw data from Apple Health and Fitbit contains formatting inconsistencies and missing values that must be addressed.

**Key Actions**
* **Merging:** I'll combine separate CSV files (steps, calories, heart rate) into a single master dataframe based on the 'Date' index.

* **Filtering:** I will remove days with biologically impossible values (e.g., resting heart rate of 0) to prevent skewing the model.

* **Formatting:** I will convert all date strings into datetime objects to allow for time-series analysis.

### 2. Getting the **preliminary information** about the dataset

#### i) Getting the shape using the .shape attribute for the dataframes below
  - Personal data from Apple
  - Fitbit activity data
  - Fitbit sleep data
  - Fitbit heart rate

In [None]:
#Shape of the datasets

print('The shape of personal dataset is', df_appleHealthData.shape)
print('The shape of public activity dataset is', df_activity.shape)
print('The shape of public sleep dataset is', df_sleep.shape)
print('The shape of public heart rate dataset is', df_heart_rate.shape)

#### ii) Checking the data types of each column using the .dtypes attribute for the dataframes below
  - Personal data from Apple
  - Fitbit activity data
  - Fitbit sleep data
  - Fitbit heart rate

In [None]:
# Data types in the data
print('The datatypes of personal dataset are\n', df_appleHealthData.dtypes)
print('\nThe datatypes of public activity dataset are\n', df_activity.dtypes)
print('\nThe datatypes of public sleep dataset are\n', df_sleep.dtypes)
print('\nThe datatypes of public heart rate dataset are\n', df_heart_rate.dtypes)

#### iii) Listing all columns from both my personal and public dataframes using the .columns attribute. The dataframes are:
  - Personal data from Apple
  - Fitbit activity data
  - Fitbit sleep data
  - Fitbit heart rate

In [None]:
# Columns/variables in the data
print('The columns of personal dataset are\n', df_appleHealthData.columns)
print('\nThe columns of public activity dataset are\n', df_activity.columns)
print('\nThe columns of public sleep dataset are\n', df_sleep.columns)
print('\nThe columns of public heart rate dataset are\n', df_heart_rate.columns)

#### iv) Listing the number of unique elements in each column using the .nunique() method.
  - Personal data from Apple
  - Fitbit activity data
  - Fitbit sleep data
  - Fitbit heart rate

In [None]:
# Unique elements in the data
print('\nThe unique elements in personal dataset are\n', df_appleHealthData.nunique())
print('\nThe unique elements in public activity dataset are\n', df_activity.nunique())
print('\nThe unique elements in public sleep dataset are\n', df_sleep.nunique())
print('\nThe unique elements in public heart rate dataset are\n', df_heart_rate.nunique())

#### v) Describing the dataframes using the .describe() method for the following dataframes
  - Personal data from Apple
  - Fitbit activity data
  - Fitbit sleep data
  - Fitbit heart rate  

In [None]:
#describing the datasets
print('\nThe description of personal dataset is\n', df_appleHealthData.describe())
print('\nThe description of public activity dataset is\n', df_activity.describe())
print('\nThe description of public sleep dataset is\n', df_sleep.describe())
print('\nThe description of public heart rate dataset is\n', df_heart_rate.describe())

### 3. Specific data needs for this project
  - I have multiple datasets that need to be merged.
  - One is a personal dataset from my fitness app and the other is a public dataset from Kaggle.
  - I have an export of my personal dataset in a dataframe called df_appleHealthData, plus multiple CSV files from the Kaggle dataset, from which I'm taking only the activity, sleep, and heart rate data, named df_activity, df_sleep, and df_heart_rate respectively.


### Preprocessing the data and standardising it

#### 1. Standardizing date/time columns across the board using pandas' to_datetime function.

In [None]:
df_appleHealthData['date'] = pd.to_datetime(df_appleHealthData['date']).dt.date
df_activity['ActivityDate'] = pd.to_datetime(df_activity['ActivityDate']).dt.date
df_sleep['date'] = pd.to_datetime(df_sleep['date']).dt.date
df_heart_rate['Time'] = pd.to_datetime(df_heart_rate['Time']).dt.date
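
One detail worth knowing about the cell above: `.dt.date` returns plain Python `date` objects, so the column no longer has a datetime dtype. If a datetime dtype is wanted later (for example for resampling), `.dt.normalize()` is an alternative that keeps it. The sketch below compares the two on hypothetical timestamps; the timestamp format shown is an assumption about how the raw file looks.

```python
import pandas as pd

# Hypothetical timestamps in the style of a second-level heart rate log.
times = pd.Series(["4/12/2016 7:21:00 AM", "4/12/2016 11:05:30 PM"])
parsed = pd.to_datetime(times, format="%m/%d/%Y %I:%M:%S %p")

as_date = parsed.dt.date             # Python date objects (generic object column)
as_midnight = parsed.dt.normalize()  # time set to 00:00, datetime dtype preserved

print(as_date.dtype, as_midnight.dtype)
```

Both versions work as group-by and merge keys, as long as the same representation is used on both sides of a merge.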

In [None]:
df_activity.head()

In [None]:
df_appleHealthData.head()

In [None]:
df_heart_rate.head()

In [None]:
df_sleep.head()

#### 2. Aggregate Sleep and Heart Rate Data
Aggregate sleep data to get total minutes asleep per day for each user


In [None]:
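# A 'value' of 1 in the sleep data represents one minute of sleep.
# Summing 'value' equals minutes asleep only if every row is coded 1; if the column
# also uses other codes (e.g. restless or awake), filter to value == 1 before aggregating.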
df_sleep_daily = df_sleep.groupby(['Id', 'date']).agg(
    TotalMinutesAsleep=('value', 'sum')
).reset_index()
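
If the `value` column turns out to contain more than one code, counting only the rows coded as asleep is a safer aggregation than summing. The sketch below uses invented rows, and the coding 1 = asleep, 2 = restless, 3 = awake is an assumption that should be checked against the dataset's documentation.

```python
import pandas as pd

# Hypothetical minute-level sleep log; the value coding is an assumption.
sleep = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "date": ["2016-04-12"] * 6,
    "value": [1, 1, 2, 3, 1, 1],
})

# Count only the minutes coded as asleep instead of summing the raw codes.
asleep = (
    sleep[sleep["value"] == 1]
    .groupby(["Id", "date"])
    .size()
    .reset_index(name="TotalMinutesAsleep")
)

# For comparison: summing the codes inflates user 1's total.
summed = sleep.groupby(["Id", "date"])["value"].sum().reset_index()

print(asleep)
print(summed)
```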

Aggregate heart rate data to get a proxy for resting heart rate (RHR)

I choose to take the minimum heart rate value recorded for that day.

In [None]:
df_rhr_daily = df_heart_rate.groupby(['Id', 'Time']).agg(
    RestingHeartRate=('Value', 'min')
).reset_index()
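
The daily minimum is sensitive to a single spurious low reading. As an alternative proxy (not the method used in this notebook), a low percentile of the day's readings is less tied to one value. The sketch below compares the two on invented data; the 5th percentile and the column names `rhr_min` and `rhr_p05` are illustrative choices.

```python
import pandas as pd

# Hypothetical readings for one user-day; 38 mimics a single sensor glitch.
values = [38] + list(range(60, 100))
hr = pd.DataFrame({
    "Id": [1] * len(values),
    "Time": ["2016-04-12"] * len(values),
    "Value": values,
})

daily = hr.groupby(["Id", "Time"]).agg(
    rhr_min=("Value", "min"),                          # proxy used in the notebook
    rhr_p05=("Value", lambda s: s.quantile(0.05)),     # alternative: 5th percentile
).reset_index()

print(daily)
```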

#### 3. Merge the Fitbit Datasets

Merge activity with daily sleep

In [None]:
# Using a left merge to keep all activity records
df_fitbit_merged = pd.merge(df_activity, df_sleep_daily,
                            left_on=['Id', 'ActivityDate'],
                            right_on=['Id', 'date'],
                            how='left')

# Merging the result with daily resting heart rate
df_fitbit_merged = pd.merge(df_fitbit_merged, df_rhr_daily,
                            left_on=['Id', 'ActivityDate'],
                            right_on=['Id', 'Time'],
                            how='left')

# Drop redundant date columns from the merges
df_fitbit_merged = df_fitbit_merged.drop(columns=['date', 'Time'])

print("Merged Fitbit data. Here is the preview:")
df_fitbit_merged.head()
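
To see how many activity days actually found a match in a left merge, pandas offers the `indicator` and `validate` parameters of `merge`. The sketch below applies them to toy frames shaped like the activity and daily sleep tables; the values are invented.

```python
import pandas as pd

# Toy frames shaped like the activity and daily sleep tables.
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["2016-04-12", "2016-04-13", "2016-04-12"],
    "TotalSteps": [8000, 9500, 4000],
})
sleep_daily = pd.DataFrame({
    "Id": [1],
    "date": ["2016-04-12"],
    "TotalMinutesAsleep": [410],
})

merged = activity.merge(
    sleep_daily,
    left_on=["Id", "ActivityDate"],
    right_on=["Id", "date"],
    how="left",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="one_to_one",   # raises if either side has duplicate keys
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are the activity days that will carry NaN for the merged columns.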

In [None]:
df_heart_rate.head()

In [None]:
df_rhr_daily.head()

In [None]:
df_fitbit_merged.head()

In [None]:
df_fitbit_merged['RestingHeartRate'].value_counts()

In [None]:
df_fitbit_merged['ActivityDate'].value_counts()

In [None]:
df_heart_rate.info()

In [None]:
df_heart_rate['Id'].value_counts()

In [None]:
df_fitbit_merged['Id'].value_counts()

#### Importing seaborn and matplotlib to help visualize the data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

### 4. Potential issues in the data.
#### Investigating the personal data and checking for any potential issues  

In [None]:
print("Investigating Apple Health Data")
# Checking for missing or null values
print('=' * 60)
print("\nMissing Values (Apple): ")
print(df_appleHealthData.isnull().sum())

# Checking for Duplicates in the personal dataframe
print('=' * 60)
print(f"\nNumber of duplicate rows (Apple): {df_appleHealthData.duplicated().sum()}")

# Checking for outliers
print("\nChecking for Outliers (Apple)...")
# The 'resting' column has a '0' value which is impossible.
print("Description of 'resting' column shows a min of 0, which is an outlier:")
print(df_appleHealthData['resting'].describe())


In [None]:
# Visualizing the outliers with a boxplot
plt.figure(figsize=(10, 4))
sns.boxplot(x=df_appleHealthData['resting'])
plt.title('Boxplot of Resting Heart Rate (Apple Data)')
plt.show()

We can see a couple of outliers, but the main issue is a resting heart rate of 0, which is not possible. I will remove this value from the dataframe while cleaning the data.
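
The points a boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule, and the same rule can be applied numerically to list the flagged values. The sketch below uses invented readings, not the real export.

```python
import pandas as pd

# Hypothetical resting heart rate readings, including an impossible 0.
resting = pd.Series([0, 55, 57, 58, 59, 60, 61, 62, 64, 90], name="resting")

q1, q3 = resting.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = resting[(resting < lower) | (resting > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```

The rule flags statistical outliers only; a value such as 0 is also removed for a separate reason, namely that it is physiologically impossible.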

#### Investigating the public dataset from Kaggle, which I merged into a single dataframe called df_fitbit_merged, and checking for any potential issues.

In [None]:
# Investigating Merged Fitbit Data
print("\n--- Investigating Merged Fitbit Data ---")
# Checking for missing or null values
print("\nMissing Values (Fitbit):")
print(df_fitbit_merged.isnull().sum())

# Checking duplicates
print(f"\nNumber of duplicate rows (Fitbit): {df_fitbit_merged.duplicated().sum()}")

# Checking for outliers
print("\nChecking for Outliers (Fitbit)...")
# Checking for days with 0 calories burned but with steps recorded, which is illogical data.
illogical_data = df_fitbit_merged[(df_fitbit_merged['Calories'] == 0) & (df_fitbit_merged['TotalSteps'] > 0)]
print(f"Found {len(illogical_data)} rows with 0 calories but > 0 steps.")

The left joins likely introduced NaNs where sleep or heart rate data was missing for a given activity day.

### 5. Reorganisation
My workflow will be to clean the data first, addressing issues like unwanted columns. Then, I'll perform reorganization tasks like renaming the remaining necessary columns for better clarity and consistency. This seems more efficient than renaming potentially temporary columns.

### 6. Developing a detailed plan for cleaning the data
  - Apple Health Dataset(df_appleHealthData)
    - Issue 1: Several columns (systolic, diastolic, bpreadtime, glucose, glucose_read_time, glucose_meal) are entirely null.
      - Action: These columns will be dropped.
      - Technique: df.drop(columns=[...])
      - Why: They contain no information and are irrelevant to the analysis.
    - Issue 2: The blood oxygen column is not needed and has 1 null value.
      - Action: This column will be dropped
      - Technique: df.drop(columns=[...])
      - Why: It is not part of core variables(Unneeded variable for our current analysis)
    - Issue 3: Columns like minimum, maximum, average, activity, variability, and weight provide detailed heart rate stats or other metrics not central to the primary hypothesis
      - Action: These columns will be dropped to simplify the dataset.
      - Technique: df.drop(columns=[...]).
      - Why: To focus on the key predictors (steps, calories, sleep) and the target (resting heart rate).
    - Issue 4: The date column is of object type
      - Action: Convert to datetime objects
      - Technique: pd.to_datetime(df['date']).
      - Why: Essential for time-series analysis and proper data handling.
    - Issue 5: The resting heart rate column contains outlier values of 0.
      - Action: Remove rows where resting is 0 (or less than a realistic minimum, e.g., 30 bpm).
      - Technique: Boolean indexing: df_apple_cleaned = df_apple_cleaned[df_apple_cleaned['resting'] > 0].
      - Why: These values are physiologically impossible and represent data errors.
    - Reorganization (Post-Cleaning):
      - Action: Rename columns for clarity and consistency with the Fitbit data.
      - Technique: df.rename(columns={'date': 'Date', 'resting': 'RestingHeartRate', 'calories': 'ActiveCalories', 'sleep': 'TotalMinutesAsleep', 'steps': 'TotalSteps'})
      - Why: Improves readability and prepares for potential comparison or merging.

  - Merged Fitbit Dataset (df_fitbit_merged)

    - Issue 1: TotalMinutesAsleep and RestingHeartRate columns contain missing values (NaN), likely due to the merging process and users lacking data for certain days.
      - Action: Fill these missing values
      - Technique: Use a two-step imputation:
        - Fill NaNs using the median value for that specific user (Id). Use df[column] = df.groupby('Id')[column].transform(lambda x: x.fillna(x.median())).
        - Fill any remaining NaNs (for users with no data at all) using the overall median of the entire column. Use df[column] = df[column].fillna(df[column].median()).
      - Why: This preserves user-specific patterns where available while ensuring no data rows are lost solely due to missing sleep/RHR. The median is robust to outliers.

    - Issue 2: Some rows might have 0 Calories despite having TotalSteps > 0.
      - Action: Remove rows where Calories is 0.
      - Technique: Boolean indexing: df_fitbit_cleaned = df_fitbit_cleaned[df_fitbit_cleaned['Calories'] > 0].
      - Why: These represent impossible scenarios, likely data logging errors.

    - Issue 3: The dataset contains many activity-related columns (TotalDistance, TrackerDistance, various active distance/minute columns) that are redundant or not needed for the core analysis.
      - Action: Keeping only the essential columns to match the simplified Apple data: Id, Date, TotalSteps, Calories, TotalMinutesAsleep, RestingHeartRate.
      - Technique: df_fitbit_cleaned = df_fitbit_cleaned[['Id', 'Date', ...]].
      - Why: Creates a focused, consistent dataset structure across both sources.

    - Reorganization (Post-Cleaning):
      - Action: Rename columns for consistency.
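      - Technique: df.rename(columns={'ActivityDate': 'Date'}). Calories keeps its own name because it is likely total calories rather than active calories.
      - Why: Aligns the date column with the Apple dataset without mislabeling total calories as active calories.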
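
The two-step imputation planned for the Fitbit data can be tried on toy values first. In this sketch the data is invented, and the `rhr_imputed` flag is an addition that is not part of the plan above; it simply records which rows were filled.

```python
import numpy as np
import pandas as pd

# Toy data: user 1 has one gap, user 3 has no readings at all.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "RestingHeartRate": [60.0, np.nan, 64.0, 70.0, 72.0, np.nan],
})
col = "RestingHeartRate"

# Hypothetical helper column: remember which rows were originally missing.
df["rhr_imputed"] = df[col].isna()

# Step 1: fill gaps with each user's own median (stays NaN if the user has no data).
df[col] = df.groupby("Id")[col].transform(lambda s: s.fillna(s.median()))

# Step 2: fill whatever is left with the overall median of the column.
df[col] = df[col].fillna(df[col].median())

print(df)
```

Keeping a flag like this makes it possible to rerun later models with and without the imputed rows.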

### 7. Cleaning the data

#### Cleaning the personal dataset as per the created plan

In [None]:
# Cleaning my personal health dataset
print("Cleaning Apple Health Data...")
df_apple_cleaned = df_appleHealthData.copy()
#Solving issue 1, issue 2, issue 3
cols_to_drop_apple = [
    'systolic', 'diastolic', 'bpreadtime', 'glucose', 'glucose_read_time',
    'glucose_meal', 'bloodoxygen', 'minimum', 'maximum', 'average',
    'activity', 'variability', 'weight'
]
df_apple_cleaned = df_apple_cleaned.drop(columns=cols_to_drop_apple)
# Solving issue 4
df_apple_cleaned['date'] = pd.to_datetime(df_apple_cleaned['date'])
# Solving issue 5: keep only rows with a positive resting heart rate
df_apple_cleaned = df_apple_cleaned[df_apple_cleaned['resting'] > 0]
# Reorganization: rename columns for consistency with the Fitbit data
df_apple_cleaned = df_apple_cleaned.rename(columns={
    'date': 'Date', 'resting': 'RestingHeartRate', 'calories': 'ActiveCalories',
    'sleep': 'TotalMinutesAsleep', 'steps': 'TotalSteps'
})

print(f"Apple data cleaned.\nNew shape is now: {df_apple_cleaned.shape}")


#### Cleaning the Fitbit dataset as per the plan

In [None]:
# Cleaning the Merged Fitbit Dataset
print("\nCleaning Merged Fitbit Data...")
df_fitbit_cleaned = df_fitbit_merged.copy()

#Solving issue 2
df_fitbit_cleaned = df_fitbit_cleaned[df_fitbit_cleaned['Calories'] > 0].copy()

# Solving issue 1: fill missing values with the per-user median, then the overall median
df_fitbit_cleaned['TotalMinutesAsleep'] = df_fitbit_cleaned.groupby('Id')['TotalMinutesAsleep'].transform(lambda x: x.fillna(x.median()))
df_fitbit_cleaned['RestingHeartRate'] = df_fitbit_cleaned.groupby('Id')['RestingHeartRate'].transform(lambda x: x.fillna(x.median()))
df_fitbit_cleaned['TotalMinutesAsleep'] = df_fitbit_cleaned['TotalMinutesAsleep'].fillna(
    df_fitbit_cleaned['TotalMinutesAsleep'].median()
)

df_fitbit_cleaned['RestingHeartRate'] = df_fitbit_cleaned['RestingHeartRate'].fillna(
    df_fitbit_cleaned['RestingHeartRate'].median()
)

# Renaming the date column to match the personal data
df_fitbit_cleaned = df_fitbit_cleaned.rename(columns={'ActivityDate': 'Date'})

# Selecting and reordering columns to match the personal dataset
essential_cols = [
    'Id',
    'Date',
    'TotalSteps',
    'Calories',
    'TotalMinutesAsleep',
    'RestingHeartRate'
]
df_fitbit_cleaned = df_fitbit_cleaned[essential_cols]

print(f"Fitbit data cleaned and simplified. Shape is now: {df_fitbit_cleaned.shape}")

# Verifying that all missing values have been handled.
print("\nMissing values after FINAL cleaning of Fitbit data:")
print(df_fitbit_cleaned.isnull().sum())

### 8. Saving the clean data

In [None]:
# Defining file paths for the cleaned data
apple_cleaned_path = '../data/AppleHealthData_cleaned.csv'
fitbit_cleaned_path = '../data/FitbitData_cleaned.csv'

# Saving the dataframes to CSV
df_apple_cleaned.to_csv(apple_cleaned_path, index=False)
df_fitbit_cleaned.to_csv(fitbit_cleaned_path, index=False)

print(f"Cleaned Apple data saved to: {apple_cleaned_path}")
print(f"Cleaned Fitbit data saved to: {fitbit_cleaned_path}")

### 9. Importing the cleaned files

In [None]:
# Importing the cleaned files
df_apple_final = pd.read_csv(apple_cleaned_path)
df_fitbit_final = pd.read_csv(fitbit_cleaned_path)

print("Successfully re-imported the cleaned data files.")
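
A CSV file stores dates as text, so a plain `pd.read_csv` does not restore the datetime dtype that existed before saving. Passing `parse_dates` is one way to get it back on import. The sketch below shows the difference on an in-memory CSV; the two rows are invented.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a cleaned export.
csv_text = "Date,RestingHeartRate\n2024-01-01,58\n2024-01-02,60\n"

plain = pd.read_csv(io.StringIO(csv_text))                         # Date read as text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])  # Date parsed to datetime

print(plain["Date"].dtype, parsed["Date"].dtype)
```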

### 10. Printing the first and last 5 rows of both datasets

In [None]:
#Printing personal data
print("=" * 60)
print("Final Cleaned Apple Data")
print("First 5 entries:")
print(df_apple_final.head())
print("=" * 60)
print("\nLast 5 entries:")
print(df_apple_final.tail())


#Printing Fitbit data
print("=" * 60)
print("--- Final Cleaned Fitbit Data ---")
print("First 5 entries:")
print(df_fitbit_final.head())
print("=" * 60)
print("\nLast 5 entries:")
print(df_fitbit_final.tail())

In [None]:
df_apple_final.info()

In [None]:
df_fitbit_final.info()

---
---

# Project 3: Data Exploration and Visualization

#### 2. Capture Initial Thoughts

> - Do you think you have the right data? Yes, I believe I have the right data. My goal is to see if workout intensity (my independent variable) affects resting heart rate (RHR), my dependent variable. Both my personal Apple Health data and the public Fitbit data contain a daily timestamp, measures of intensity (Active Calories, Total Steps), and the target variable (Resting Heart Rate).

> - What are your initial questions?
>   1. Is there a clear negative correlation between daily ActiveCalories and RestingHeartRate?
>   2. Is this correlation stronger in my personal data compared to the aggregated public data?
>   3. Does TotalMinutesAsleep (my main confounder) have a strong correlation with RestingHeartRate?
>   4. Can I see a long-term trend of my RHR decreasing over time in my personal data?
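
Questions 1 and 2 come down to comparing correlation coefficients. One point to keep in mind for the multi-user data is that a pooled correlation mixes differences between people with day-to-day changes within a person, so computing it per user as well is informative. The sketch below shows both on invented numbers; the values and the two-user setup are purely illustrative.

```python
import pandas as pd

# Invented data for two users with different baselines.
df = pd.DataFrame({
    "Id": [1] * 4 + [2] * 4,
    "ActiveCalories": [300, 450, 500, 650, 200, 260, 320, 400],
    "RestingHeartRate": [62, 60, 59, 57, 74, 73, 71, 70],
})

# Pearson correlation over all rows, ignoring who each row belongs to.
pooled_r = df["ActiveCalories"].corr(df["RestingHeartRate"])

# The same correlation computed separately for each user.
per_user_r = pd.Series({
    user_id: group["ActiveCalories"].corr(group["RestingHeartRate"])
    for user_id, group in df.groupby("Id")
})

print(f"pooled r = {pooled_r:.2f}")
print(per_user_r.round(2))
```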





#### 3. Explore Characteristics of the Data

Most of this preliminary check was completed in Project 2, but I'll re-verify the final cleaned files.

> What does each record/row in the dataset represent?
> - For df_apple_final: Each row represents a single day of my personal health data, including resting heart rate, steps, active calories, and sleep.
> - For df_fitbit_final: Each row represents a single day of health data for one of the 35 unique users in the public dataset.

> What variables/columns do you have?
> - The code below will use .info() to list all columns, their data types, and their non-null counts. This confirms the columns I selected in Project 2.

> Are there any duplicates? How do you know?
> - I will use the .duplicated().sum() method on each DataFrame. This command checks every row against all other rows and provides a count of any exact duplicates.

> Handling Duplicates:
> - Duplicate checks were run during the pre-processing phase in Project 2. The code below confirms that the final, cleaned dataframes have zero duplicate rows, so no further handling should be necessary.
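
`.duplicated()` with no arguments only catches rows that match in every column. For per-user daily data it can also be useful to check for repeated keys, meaning the same user and date appearing twice with different values. The sketch below contrasts the two checks on invented rows; using `Id` and `Date` as the key is an assumption about what should be unique.

```python
import pandas as pd

# Invented rows: user 1 appears twice on the same date with different step counts.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "Date": ["2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12"],
    "TotalSteps": [8000, 8100, 9000, 5000],
})

exact_dupes = df.duplicated().sum()                      # identical in every column
key_dupes = df.duplicated(subset=["Id", "Date"]).sum()   # same user and date

print(f"exact duplicates: {exact_dupes}, key duplicates: {key_dupes}")
```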

In [None]:
print("--- Apple Data: Characteristics ---")
print("\nShape (Rows, Columns):")
print(df_apple_final.shape)

print("\nColumns and Data Types:")
df_apple_final.info()

print("\nDuplicate Rows:")
print(f"Found {df_apple_final.duplicated().sum()} duplicate rows.")


print("\n\n--- Fitbit Data: Characteristics ---")
print("\nShape (Rows, Columns):")
print(df_fitbit_final.shape)

print("\nColumns and Data Types:")
df_fitbit_final.info()

print("\nDuplicate Rows:")
print(f"Found {df_fitbit_final.duplicated().sum()} duplicate rows.")

Analysis of Characteristics
After running the code, I can confirm:

> Shape:
> - My personal Apple dataset (df_apple_final) has 728 rows and 5 columns.
> - The public Fitbit dataset (df_fitbit_final) has 452 rows and 6 columns (the extra column is the Id for each user).

> Variables:
> - The df_apple_final.info() output confirms the 5 columns are: Date, RestingHeartRate, TotalSteps, ActiveCalories, and TotalMinutesAsleep. All are numeric except Date, which is read back from the CSV as an object column (I'll re-verify that it is a datetime in Step 5).
> - The df_fitbit_final.info() output confirms the 6 columns are: Id, Date, TotalSteps, Calories, TotalMinutesAsleep, and RestingHeartRate.

> Duplicates:
> - The output confirms that there are 0 duplicate rows in both df_apple_final and df_fitbit_final, so no further action is needed.

#### 4. Additional Transformations/Manipulations

> During Project 2, I already:
>1. Converted all date columns to datetime objects.
>2. Filled missing RHR and Sleep values in the Fitbit data using a user-by-user median, followed by a global median.
>3. Removed the impossible 0 RHR entry from my Apple data.

One key difference I noted is that df_apple_final has ActiveCalories, while df_fitbit_final has Calories (which is likely total calories, including BMR). This is not an apples-to-apples comparison.

For this exploration, I will proceed using Calories as a proxy for the Fitbit users' activity, but I will rely more on TotalSteps (which is present in both) for a more direct comparison.

####5. Explore Every Variable in the Dataset

Descriptive Statistics

>Here are the summary statistics for all numeric variables in both final dataframes, as required.

In [None]:
# Descriptive Statistics for Apple Data
print("--- Apple Data: Descriptive Statistics ---")
print(df_apple_final.describe())

# Descriptive Statistics for Fitbit Data
print("\n--- Fitbit Data: Descriptive Statistics ---")
print(df_fitbit_final.describe())

> Here, I'll analyze the distributions of my key independent variables (intensity measures) and the dependent variable (RHR).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Visualize personal Data Distributions
print("Distributions for Personal Health Data")
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Distributions of Key Metrics (Personal Data)')

# Resting Heart Rate
sns.histplot(df_apple_final['RestingHeartRate'], kde=True, ax=axes[0], color='blue')
axes[0].set_title('Resting Heart Rate')

# Active Calories
sns.histplot(df_apple_final['ActiveCalories'], kde=True, ax=axes[1], color='red')
axes[1].set_title('Active Calories Burned')

# Total Steps
sns.histplot(df_apple_final['TotalSteps'], kde=True, ax=axes[2], color='green')
axes[2].set_title('Total Steps per Day')

plt.tight_layout()
plt.show()


# Visualize Fitbit Data Distributions
print("\nDistributions for Public Fitbit Data")
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Distributions of Key Metrics (Fitbit Data)')

# Resting Heart Rate
sns.histplot(df_fitbit_final['RestingHeartRate'], kde=True, ax=axes[0], color='blue')
axes[0].set_title('Resting Heart Rate')

# Active Calories (using 'Calories' column)
sns.histplot(df_fitbit_final['Calories'], kde=True, ax=axes[1], color='red')
axes[1].set_title('Total Calories Burned')

# Total Steps
sns.histplot(df_fitbit_final['TotalSteps'], kde=True, ax=axes[2], color='green')
axes[2].set_title('Total Steps per Day')

plt.tight_layout()
plt.show()

#####Visual Analysis (Histograms):

- Resting Heart Rate (Apple): My personal RHR distribution is roughly normal, centered around 60 bpm. This looks very reasonable and clean.

- Active Calories (Apple): This distribution is right-skewed. Most days have 500-1500 active calories, with a long tail representing days with very intense or long workouts.

- Total Steps (Apple): Also right-skewed, with a primary peak around 10,000-12,000 steps.

- Resting Heart Rate (Fitbit): This distribution is multi-modal, with several distinct peaks (e.g., ~53, ~60, ~68 bpm). This is expected, as it represents an aggregation of 35 different individuals, each with their own baseline RHR.

- Total Calories (Fitbit): This distribution is much more "normal" (less skewed) than my 'Active Calories' plot. This supports the idea that it includes Basal Metabolic Rate (BMR), which makes the daily total more consistent.

- Total Steps (Fitbit): This is also right-skewed, but the main peak is lower than my personal data, centered around 6,000-8,000 steps.
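
The "right-skewed" readings above come from eyeballing the histograms. A rough way to put a number on skew is `Series.skew()`, sketched here on invented values (the real columns would be, for example, `df_apple_final['ActiveCalories']`). A positive value indicates a longer right tail, and a value near zero indicates a roughly symmetric distribution.

```python
import pandas as pd

# Invented values: mostly moderate days plus two very heavy days (long right tail).
active_calories = pd.Series([600, 700, 750, 800, 850, 900, 950, 1000, 2500, 3500])

# Invented symmetric values for contrast.
symmetric = pd.Series([52, 56, 58, 60, 60, 62, 64, 68])

skew_right = active_calories.skew()
skew_sym = symmetric.skew()

print(f"Skew (long right tail): {skew_right:.2f}")
print(f"Skew (symmetric):       {skew_sym:.2f}")

# In a right-skewed sample the mean is pulled above the median.
print(f"Mean {active_calories.mean():.0f} vs median {active_calories.median():.0f}")
```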

5b. Initial Thoughts vs. Analysis

> My findings from exploring the variables (Step 5) and their relationships (Step 6) generally confirmed my initial questions (Step 2).
> - I questioned if a negative correlation existed, and the scatter plots and heatmaps in Step 6 confirmed that it does, although it is weak (e.g., -0.22 for ActiveCalories vs. RestingHeartRate in my data).
> - I also questioned the role of sleep, and the correlation matrix confirmed TotalMinutesAsleep has a negative correlation with RHR in both datasets (-0.12 for Apple, -0.26 for Fitbit).

> Below is a variable-by-variable analysis of the columns central to my hypothesis.

In [None]:
# Ensure Date columns are in datetime format for plotting
df_apple_final['Date'] = pd.to_datetime(df_apple_final['Date'])
df_fitbit_final['Date'] = pd.to_datetime(df_fitbit_final['Date'])

# --- Step 5: Explore Every Variable ---

print("\n--- [Step 5] Descriptive Statistics ---")
print("\nApple Data Statistics:")
print(df_apple_final.describe())
print("\nFitbit Data Statistics:")
print(df_fitbit_final.describe())

# --- Distributions (Histograms) ---
print("\n--- [Step 5] Plotting Variable Distributions ---")

# Set style for all plots
sns.set_style("whitegrid")

# Plot 1: Resting Heart Rate Distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(df_apple_final['RestingHeartRate'], kde=True, ax=axes[0], color='blue')
axes[0].set_title('Apple: RHR Distribution')
sns.histplot(df_fitbit_final['RestingHeartRate'], kde=True, ax=axes[1], color='cyan')
axes[1].set_title('Fitbit: RHR Distribution (Aggregated)')
plt.savefig('P3_RHR_distributions.png')
plt.show()

# Plot 2: Calories Distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(df_apple_final['ActiveCalories'], kde=True, ax=axes[0], color='red')
axes[0].set_title('Apple: Active Calories Distribution')
sns.histplot(df_fitbit_final['Calories'], kde=True, ax=axes[1], color='orange')
axes[1].set_title('Fitbit: Total Calories Distribution')
plt.savefig('P3_Calories_distributions.png')
plt.show()

# Plot 3: Total Steps Distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(df_apple_final['TotalSteps'], kde=True, ax=axes[0], color='green')
axes[0].set_title('Apple: Total Steps Distribution')
sns.histplot(df_fitbit_final['TotalSteps'], kde=True, ax=axes[1], color='lime')
axes[1].set_title('Fitbit: Total Steps Distribution')
plt.savefig('P3_TotalSteps_distributions.png')
plt.show()

# Plot 4: Total Minutes Asleep Distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(df_apple_final['TotalMinutesAsleep'], kde=True, ax=axes[0], color='purple')
axes[0].set_title('Apple: TotalMinutesAsleep Distribution')
sns.histplot(df_fitbit_final['TotalMinutesAsleep'], kde=True, ax=axes[1], color='magenta')
axes[1].set_title('Fitbit: TotalMinutesAsleep Distribution')
plt.savefig('P3_TotalMinutesAsleep_distributions.png')
plt.show()


Variable 1: RestingHeartRate (Dependent Variable)
> - Datatype: int64 (Apple) and float64 (Fitbit), as seen in the .info() output.
> - Units: Beats per minute (bpm). I know this as it is the standard medical unit for measuring heart rate.
> - Represents: The target variable. It measures cardiovascular health and fitness. My hypothesis is that this value will decrease as intensity increases.
> - Transformations: None needed. The variable is numeric and ready for modeling.
> - Different from initial thoughts?: No, this is exactly what I expected.
> - Missing Data or Outliers:
>   - Apple: In Project 2, I identified and removed one extreme outlier (0 bpm), which is physiologically impossible.
>   - Fitbit: In Project 2, I identified missing (NaN) values, which were the result of the merge operation.

> Handling:
> - The 0 bpm value in the Apple data was removed.
> - The NaN values in the Fitbit data were imputed using the median RHR for that specific user, which is a robust way to handle missing data without skewing the results for that individual.

> Descriptive Statistics:
> - Apple: The mean is 60.96 bpm, with a standard deviation of 7.3. The values range from a min of 41 to a max of 88 (after cleaning), which is a very reasonable and healthy range.
> - Fitbit: The mean is 54.9 bpm with a std of 6.5. This mean is lower than my personal average.

> Distribution & Visuals:
> - The P3_RHR_distributions.png plot shows the two distributions.
> - Apple (blue): The distribution is very clean and roughly normal, centered just over 60 bpm.
> - Fitbit (cyan): The distribution is multi-modal (it has multiple peaks). This is expected, as it is a combination of 35 different people. Each peak (e.g., around 53, 60, 68 bpm) likely represents the average RHR for a different cluster of users.

Variable 2: ActiveCalories (Apple) / Calories (Fitbit) (Independent Variable)

> -  Datatype: int64 for both.
> - Units: Kilocalories (kcal). This is the standard unit for energy expenditure.
> - Represents: This is my primary proxy for workout intensity.
> - Transformations: None needed for exploration.
> - Different from initial thoughts?: Yes. My Apple data is Active Calories (energy burned from exercise), while the Fitbit data is Total Calories (Active + BMR). This makes a direct comparison difficult, as the Fitbit data is "inflated" by baseline metabolic calories.
> - Missing Data or Outliers: In Project 2, I removed 5 rows from the Fitbit data where Calories was 0, which was illogical. The Apple data had no missing values.

> Descriptive Statistics:
> - Apple: The mean is 1004 kcal (Active), with a max of 3574 kcal.
> - Fitbit: The mean is 2189 kcal (Total), with a max of 4562 kcal.

> Distribution & Visuals:
> - The P3_Calories_distributions.png plot shows the two distributions.
> - Apple (red): The distribution is highly right-skewed. This makes sense; most days have a moderate workout, and a few days have very long/intense workouts (e.g., a long run or hike), creating a long tail.
> - Fitbit (orange): The distribution is much more normal and less skewed. This confirms that it includes BMR, which is fairly constant day-to-day and "centers" the data.

Variable 3: TotalSteps (Independent Variable)
> - Datatype: int64 for both.
> - Units: Steps (count).
> - Represents: A secondary proxy for workout intensity and daily activity level. This is a better variable for comparing the two datasets, as it's measured the same way.
> - Transformations: None needed.
> - Different from initial thoughts?: No, this is as expected.
> - Missing Data or Outliers: No missing values in either cleaned dataset.

> Descriptive Statistics:
> - Apple: The mean is 9090 steps.
> - Fitbit: The mean is 6546 steps. This indicates that I am, on average, more active than the typical user in the Fitbit pool.

> Distribution & Visuals:
> - The P3_TotalSteps_distributions.png plot shows the two distributions.
> - Apple (green): Right-skewed, with a large peak around 10,000-12,000 steps.
> - Fitbit (lime): Also right-skewed, but the main peak is lower, around 6,000-8,000 steps. This confirms the finding from the descriptive statistics.

Variable 4: TotalMinutesAsleep (Confounding Variable)
> - Datatype: int64 (Apple) and float64 (Fitbit).
> - Units: Minutes.
> - Represents: A key confounding variable. Lack of sleep can raise RHR, independently of exercise.
> - Transformations: None needed.
> - Different from initial thoughts?: No.
> - Missing Data or Outliers:
>   - Apple: No missing data.
>   - Fitbit: Had NaN values from the merge, which were handled in Project 2.

> Handling: The NaN values in the Fitbit data were imputed using the user-specific median.

> Descriptive Statistics:
> - Apple: The mean is 446 minutes (~7.4 hours) with a large range (min 0, max 1404). The 0-minute days are likely days the watch wasn't worn to sleep, but I will leave them as they are not physiologically impossible (unlike a 0 RHR).
> - Fitbit: The mean is 460 minutes (~7.7 hours).

> Distribution & Visuals:
> - The P3_TotalMinutesAsleep_distributions.png plot shows the two distributions.
> - Apple (purple): Roughly normal distribution centered around 450 minutes, but with a tail of low-sleep days.
> - Fitbit (magenta): A very tight, normal distribution, also centered around 450 minutes. This data appears cleaner, likely due to the imputations smoothing it out.

####6. Explore Relationships Between Variables
> This is the core of my hypothesis. I'll check for correlations, pairwise relationships, and time-based (periodicity) patterns.

#####6a. Scatter Plots (Direct Hypothesis Test)
> First, I'll plot my primary independent variable (ActiveCalories for Apple, Calories for Fitbit) against my dependent variable (RestingHeartRate) for both datasets.

In [None]:
# Relationship Scatter Plot for Apple Data
plt.figure(figsize=(10, 6))
sns.regplot(x='ActiveCalories', y='RestingHeartRate', data=df_apple_final, scatter_kws={'alpha':0.5})
plt.title('Workout Intensity vs. Resting Heart Rate (Apple Data)')
plt.xlabel('Active Calories Burned')
plt.ylabel('Resting Heart Rate (bpm)')
plt.show()


# Relationship Scatter Plot for Fitbit Data
plt.figure(figsize=(10, 6))
sns.regplot(x='Calories', y='RestingHeartRate', data=df_fitbit_final, scatter_kws={'alpha':0.5})
plt.title('Workout Intensity vs. Resting Heart Rate (Fitbit Data)')
plt.xlabel('Total Calories Burned')
plt.ylabel('Resting Heart Rate (bpm)')
plt.show()

Visual Analysis (Scatter Plots):

- Apple Data: The regression line slopes downward, indicating a negative relationship between ActiveCalories and RestingHeartRate: as my daily active calories go up, my RHR tends to go down. This is the first piece of visual evidence supporting my alternative hypothesis. The data is noisy, but the trend is visible.

- Fitbit Data: The trend is also negative but appears much weaker (flatter). This could be because 'Total Calories' is a poor proxy for intensity, or because the effect is being washed out by aggregating 35 different people (an example of Simpson's Paradox, which I'll look at later).

#####6b. Correlation Matrix
>A scatter plot is good for two variables, but a correlation heatmap gives a quick overview of all linear relationships in the data.

In [None]:
print("--- Apple Data Correlations ---")
plt.figure(figsize=(8, 6))
# We must drop the 'Date' column as corr() only works on numeric data
sns.heatmap(df_apple_final.drop('Date', axis=1).corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix (Apple Data)')
plt.show()

print("\n--- Fitbit Data Correlations ---")
plt.figure(figsize=(8, 6))
# Drop non-numeric 'Date' and user 'Id' for the correlation
df_fitbit_corr = df_fitbit_final.drop(['Date', 'Id'], axis=1).rename(columns={'Calories': 'TotalCalories'})
sns.heatmap(df_fitbit_corr.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix (Fitbit Data)')
plt.show()

Visual Analysis (Heatmaps):

> Apple Data:
>  - RestingHeartRate vs. ActiveCalories: -0.22. This confirms the slight negative correlation from the scatter plot.
> - RestingHeartRate vs. TotalSteps: -0.21. A very similar strength, suggesting both are decent proxies for intensity.
> - RestingHeartRate vs. TotalMinutesAsleep: -0.12. A weaker, but still negative, relationship.

>Fitbit Data:
> - RestingHeartRate vs. TotalCalories: -0.12. A very weak negative correlation, confirming the flat scatter plot.
> - RestingHeartRate vs. TotalSteps: -0.23. This is roughly twice as strong as the TotalCalories correlation (though still weak in absolute terms), suggesting TotalSteps is a better predictor of RHR than TotalCalories in this dataset.
> - RestingHeartRate vs. TotalMinutesAsleep: -0.26. This is the strongest correlation with RHR in the Fitbit set, highlighting the importance of sleep as a key variable.
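
The heatmaps report Pearson correlations, which measure linear association and can be pulled around by the long tails seen in calories and steps. A rank-based (Spearman) check is one possible cross-check. The sketch below uses invented values and relies on the fact that Spearman's rho is the Pearson correlation of the ranks, so it needs nothing beyond pandas.

```python
import pandas as pd

# Invented data: RHR falls steadily with steps, with one extreme 30,000-step day.
toy = pd.DataFrame({
    "TotalSteps": [2000, 4000, 6000, 8000, 10000, 30000],
    "RestingHeartRate": [66, 64, 63, 61, 60, 59],
})

# Pearson: sensitive to the single extreme day.
pearson = toy.corr(method="pearson").loc["TotalSteps", "RestingHeartRate"]

# Spearman: Pearson correlation computed on the ranks of each column.
spearman = toy.rank().corr(method="pearson").loc["TotalSteps", "RestingHeartRate"]

print(f"Pearson r:    {pearson:.2f}")
print(f"Spearman rho: {spearman:.2f}")
```

If the two values differ a lot on the real data, that would suggest the relationship is monotonic but not linear, or that a few extreme days are driving the Pearson value.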

#####6c. Time Series Analysis (Periodicity)
> Finally, I'll look for patterns over time. The Fitbit data is only for one month, but my Apple data spans two years, which is perfect for seeing long-term trends.

In [None]:
# Time Series Plot for Apple Data
# Ensuring the 'Date' column is in datetime format
df_apple_final['Date'] = pd.to_datetime(df_apple_final['Date'])

# Sort by date to ensure the plot is correct
df_apple_final_sorted = df_apple_final.sort_values('Date')

plt.figure(figsize=(15, 6))
# Plot the 30-day rolling average to see the long-term trend
plt.plot(df_apple_final_sorted['Date'], df_apple_final_sorted['RestingHeartRate'].rolling(window=30).mean(), label='30-Day Rolling Average')
plt.title('Long-Term Resting Heart Rate Trend (Apple Data)')
plt.xlabel('Date')
plt.ylabel('Average Resting Heart Rate (bpm)')
plt.legend()
plt.show()

In [None]:
#Time Series Plot for Fitbit Data
# Ensure the 'Date' column is in datetime format
df_fitbit_final['Date'] = pd.to_datetime(df_fitbit_final['Date'])

# Calculate the average resting heart rate across all users for each day
df_fitbit_daily_avg = df_fitbit_final.groupby('Date')['RestingHeartRate'].mean().reset_index()

# Sort by date
df_fitbit_daily_avg_sorted = df_fitbit_daily_avg.sort_values('Date')

plt.figure(figsize=(15, 6))
# Plot the daily average and a 5-day rolling average to see the short-term trend
plt.plot(df_fitbit_daily_avg_sorted['Date'], df_fitbit_daily_avg_sorted['RestingHeartRate'], label='Daily Average RHR', alpha=0.5)
plt.plot(df_fitbit_daily_avg_sorted['Date'], df_fitbit_daily_avg_sorted['RestingHeartRate'].rolling(window=5).mean(), label='5-Day Rolling Average', linewidth=2)
plt.title('Short-Term Resting Heart Rate Trend (Fitbit Data - Averaged Across All Users)')
plt.xlabel('Date')
plt.ylabel('Average Resting Heart Rate (bpm)')
plt.legend()
plt.show()

Visual Analysis (Time Series):

> **Apple Data (30-day Rolling Avg):** This plot is the most compelling evidence so far. My average RHR shows a clear downward trend over the two-year period, from roughly 65 bpm in late 2023 to roughly 58 bpm by late 2025. This suggests a long-term improvement in cardiovascular fitness and is consistent with my hypothesis that consistent activity (tracked by calories/steps) leads to a lower RHR.

> **Fitbit Data (5-day Rolling Avg):** This plot shows the daily average RHR for all users over one month. There is a lot of daily fluctuation, but the rolling average oscillates between 55-58 bpm without a clear upward or downward trend. This is expected for such a short, aggregated dataset.
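
Two details of the rolling-average approach are worth making explicit. A window of 30 produces no value until 30 observations are available, and a visual trend can be summarized as a slope in bpm per year. The sketch below uses a synthetic, perfectly linear series (invented, not my actual readings) and `np.polyfit` for the least-squares slope.

```python
import numpy as np
import pandas as pd

# Synthetic series: RHR falling linearly from 65 to 58 bpm over 731 days (invented).
dates = pd.date_range("2023-11-01", periods=731, freq="D")
rhr = pd.Series(np.linspace(65.0, 58.0, len(dates)), index=dates)

# A 30-day window yields NaN for the first 29 days.
rolling = rhr.rolling(window=30).mean()
n_missing = int(rolling.isna().sum())

# Least-squares slope in bpm per day, with days since the first reading as x.
days = (rhr.index - rhr.index[0]).days.to_numpy()
slope, intercept = np.polyfit(days, rhr.to_numpy(), deg=1)

print(f"Leading NaNs from the rolling window: {n_missing}")
print(f"Trend: {slope * 365:.2f} bpm per year")
```

Passing `min_periods=1` to `rolling()` would fill the leading gap, at the cost of noisier early values.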

#####6d. Simpson's Paradox (Optional Extra Credit)
> In my initial analysis (Section 6a), I noted that the RHR vs. Calories scatter plot for the aggregated Fitbit data showed a very weak trend. I suspected this might be an example of Simpson's Paradox, where the trend for the total group is different from the trends for the subgroups (in this case, the individual users).

> Let's test this by plotting the relationship between TotalSteps and RestingHeartRate but using color to separate each of the 35 users.

In [None]:
import altair as alt

#googled on how to use altair chart and reused the code from stack overflow

# We need to make sure the User ID is treated as a category ('nominal'), not a number.
# Using .copy() to avoid SettingWithCopyWarning
df_fitbit_final_altair = df_fitbit_final.copy()
df_fitbit_final_altair['Id_str'] = df_fitbit_final_altair['Id'].astype(str)

# Create the scatter plot using Altair
chart = alt.Chart(df_fitbit_final_altair).mark_circle(opacity=0.8).encode(
    # Use TotalSteps and RestingHeartRate on the axes
    x=alt.X('TotalSteps', title='Total Steps per Day', scale=alt.Scale(zero=False)),
    y=alt.Y('RestingHeartRate', title='Resting Heart Rate (bpm)', scale=alt.Scale(zero=False)),

    # Color each dot by the unique User ID
    color=alt.Color('Id_str', title="User ID", legend=None),

    # Add tooltips to see the user ID and values on hover
    tooltip=['Id_str', 'TotalSteps', 'RestingHeartRate']
).properties(
    title='RHR vs. Total Steps (Subgroups per User)'
).interactive() # Make the chart zoomable and pannable

chart

Visual Analysis (Simpson's Paradox Plot):

> This plot illustrates the aggregation effect behind Simpson's Paradox: the pooled trend is weaker than the trends within individual users.
> - Aggregated Trend: A single regression line through all the points (like the one I drew for Calories in 6a) is fairly flat. The pooled correlation between TotalSteps and RestingHeartRate is only -0.23.
> - Subgroup Trends: By looking at the individual colored clusters (each representing one person), we can see many strong negative trends. For example, the light-blue user (top-left) shows a clear drop in RHR on days they have more steps. The same is true for the dark-red user (bottom-middle) and many others.

> Conclusion: The weak correlation in the aggregated data is misleading. It masks the stronger negative correlation that exists for many of the individuals within the dataset. This confirms my hypothesis is likely true for individuals, but that relationship is lost when you combine them all without accounting for their different baseline fitness levels.
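
The visual argument can also be checked numerically by comparing the pooled correlation with per-user correlations. The sketch below uses three invented users whose baselines differ. Within each user the relationship is perfectly negative, yet the pooled correlation is much weaker. On the real data, `toy` would be replaced by `df_fitbit_final`.

```python
import pandas as pd

# Invented data: three users with different baseline RHR and activity levels.
toy = pd.DataFrame({
    "Id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "TotalSteps": [4000, 5000, 6000, 8000, 9000, 10000, 12000, 13000, 14000],
    "RestingHeartRate": [61, 60, 59, 66, 65, 64, 58, 57, 56],
})

# Pooled correlation: ignores which user each row belongs to.
pooled_r = toy["TotalSteps"].corr(toy["RestingHeartRate"])

# Per-user correlation: one value for each Id.
per_user_r = pd.Series({
    user_id: grp["TotalSteps"].corr(grp["RestingHeartRate"])
    for user_id, grp in toy.groupby("Id")
})

print(f"Pooled r: {pooled_r:.2f}")
print("Per-user r:")
print(per_user_r.round(2))
```

With real data, users who have only a few days of readings or a constant RHR would produce NaN or unstable values, so the per-user results are best read alongside each user's row count.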

####7. Do You Trust This Data?
Yes, I trust this data for this analysis, with some caveats.

>Apple Data: This is my own personal data, so I trust its source completely. The cleaning I did (removing one impossible 0 RHR value) makes it reliable. The trends (skewed activity, long-term RHR improvement) align perfectly with my personal experience.

>Fitbit Data: This is a reliable public dataset, but its usefulness is limited by two factors I discovered during exploration:
> - The Calories variable is a poor proxy for intensity. TotalSteps is better.
> - The data aggregates 35 users. The multi-modal RHR distribution and weak correlations suggest that per-user trends are being hidden by this aggregation (Simpson's Paradox).

Handling: I will proceed using both datasets, but I will prioritize the findings from my personal data as the primary source for testing my hypothesis. I'll use the Fitbit data as a general (and weaker) point of comparison.

####8. Wrap Up
>Overview of What I Learned: The exploration process confirmed that my data is clean and usable. The most important insight was discovering the difference between my longitudinal (Apple) data and the cross-sectional, aggregated (Fitbit) data. The distributions and correlations were very different, highlighting that how data is aggregated is just as important as what is collected.

>Did this affect your hypothesis? My alternative hypothesis (Increased workout intensity... is associated with a decrease in resting heart rate over time) is supported by this exploration.
> - The Apple data showed a negative correlation (-0.22) between ActiveCalories and RestingHeartRate.
> - The Apple time series plot showed a clear long-term decrease in my RHR over two years.
> - The Fitbit data also showed weak-to-moderate negative correlations between RHR and both TotalSteps (-0.23) and TotalMinutesAsleep (-0.26).

> Summarize Key Findings:
> - Finding 1: The relationship between exercise and RHR is present but noisy in day-to-day data. The long-term trend is much clearer.
> - Finding 2: TotalSteps appears to be a more stable predictor for RHR than TotalCalories when comparing aggregated data.
> - Finding 3: Sleep (TotalMinutesAsleep) is negatively correlated with RHR in both datasets (-0.12 for Apple, -0.26 for Fitbit) and must be included in my model as a key confounder.

###**Project 4 : Analysis, Hypothesis Testing, and ML**


####**Modeling Strategy**
To test my hypothesis, I'll implement two different Machine Learning Algorithms:
1.  **Linear Regression:** This serves as my baseline. It assumes a straight-line relationship between activity and heart rate. It will help me see the direct positive or negative correlation of each variable.
2.  **Random Forest Regressor:** This is a more complex, non-linear model. I am using it to capture interactions between variables (e.g., how sleep quality might amplify the benefits of exercise) that a simple linear model might miss.

####**2. What kind of ML task is presented by your hypothesis, and what type of learning is it?**

> **ML Task:** The hypothesis presents a Regression task.

> **Learning Type:** This is Supervised Learning.

> **Description:** I intend to build a regression model to predict the daily RestingHeartRate (target variable) based on workout intensity and volume metrics. Since the target variable is continuous (bpm), regression is the appropriate task. The model will be trained on labeled data where the input features (activity metrics) and the output (RHR) are known, making it a supervised learning problem.

####**3. What features will you use?**

**Features to Use:**
> Target (Label): RestingHeartRate

>Predictors (Features):
> - ActiveCalories (Apple) / Calories (Fitbit): Primary measure of workout intensity.
> - TotalSteps: Secondary measure of daily activity volume.
> - TotalMinutesAsleep: Key confounding variable to control for recovery and sleep quality.
> - Date (processed): Will be engineered into numeric features (e.g., ordinal or time-since-start) to capture the long-term trend observed in the time-series analysis.


**Feature Engineering:**

> Date Transformation: The Date column is currently a datetime object, which most regression algorithms cannot handle directly. I will convert it to an ordinal number or "Days Since Start" to allow the model to learn the time-dependent downward trend in RHR.
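
The two date encodings mentioned above can be sketched side by side. The dates are invented and `DaysSinceStart` is a hypothetical column name. The point of the comparison is that the two encodings differ only by a constant offset, so after standard scaling they would give a linear model the same information.

```python
import datetime as dt
import pandas as pd

# Invented dates for illustration.
toy = pd.DataFrame({"Date": pd.to_datetime(["2023-11-01", "2023-11-02", "2023-12-01"])})

# Option 1: proleptic Gregorian ordinal (a large integer per day).
toy["Date_Ordinal"] = toy["Date"].map(dt.datetime.toordinal)

# Option 2: days elapsed since the first observation (starts at 0).
toy["DaysSinceStart"] = (toy["Date"] - toy["Date"].min()).dt.days

# The two columns differ by the same constant on every row.
offset = toy["Date_Ordinal"] - toy["DaysSinceStart"]
print(toy)
print(f"Constant offset between encodings: {offset.iloc[0]}")
```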

> Scaling/Normalization: Since features like TotalSteps (thousands) and TotalMinutesAsleep (hundreds) have vastly different scales, I will apply standard scaling (z-score normalization) to ensure the algorithm treats them equally.
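
As a small worked example of z-score scaling, the sketch below standardizes two invented columns by hand with NumPy. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. After scaling, both columns are on the same footing even though their raw units differ by an order of magnitude.

```python
import numpy as np

# Invented values; columns are TotalSteps and TotalMinutesAsleep.
X = np.array([
    [4000.0, 380.0],
    [8000.0, 450.0],
    [12000.0, 520.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population std (ddof=0)
X_scaled = (X - mean) / std

print("Column means after scaling:", X_scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", X_scaled.std(axis=0).round(6))
print(X_scaled.round(4))
```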

> Dimensionality Reduction:
> - Reduction: I will likely not apply PCA or heavy dimensionality reduction because I only have 3-4 core features. Reducing dimensions further might lose interpretability, which is key for my hypothesis (I need to know which specific factor drives RHR).
> - Resulting Dimensionality: The final dataset will have 3 predictor columns: ActiveCalories, TotalSteps, TotalMinutesAsleep, plus the transformed Date feature.


**Assumptions:**

> The resulting dataset assumes that the relationship between activity and RHR is roughly linear (or can be approximated as such).

> Independence: We assume daily observations are independent (though time-series data violates this, we will treat it as independent samples for this basic regression task).
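
One way to gauge how badly the independence assumption is violated is the lag-1 autocorrelation of the target, which is close to 0 for independent days and close to 1 when each day resembles the previous one. The sketch below uses an invented drifting series. It also shows a chronological 80/20 split, a possible alternative to a shuffled split for time-ordered data (this is an assumption on my part, not what the implementation further down does).

```python
import numpy as np
import pandas as pd

# Invented series: slow downward drift plus a small alternating wobble.
rhr = pd.Series(np.linspace(65.0, 58.0, 60) + np.tile([0.5, -0.5], 30))

# Correlation between each day and the day before it.
lag1 = rhr.autocorr(lag=1)
print(f"Lag-1 autocorrelation: {lag1:.2f}")

# Chronological split: train on the first 80% of days, test on the last 20%.
split = int(len(rhr) * 0.8)
train, test = rhr.iloc[:split], rhr.iloc[split:]
print(f"Train days: {len(train)}, test days: {len(test)}")
```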


**Selected Features Type:**

> - ActiveCalories: Continuous
> - TotalSteps: Continuous (discrete count, treated as continuous)
> - TotalMinutesAsleep: Continuous
> - Date (transformed): Continuous

####**4. Algorithm Selection**

**Algorithm: Linear Regression (or optionally Random Forest Regressor for comparison).**


**Why this algorithm?**

> - Linear Regression is the simplest and most interpretable algorithm for testing the relationship between continuous variables. It directly provides coefficients that tell us how much RHR decreases for every unit increase in calories or sleep, which perfectly answers the hypothesis.
> - Since the scatter plots in Project 3 showed a linear-looking negative trend, a linear model is a strong candidate.


**Assumptions of Linear Regression:**

> - Linearity: The relationship between X and Y is linear.
> - Normality: The residuals (errors) should be normally distributed.
> - Homoscedasticity: The variance of error terms should be constant.
> - No Multicollinearity: Features should not be highly correlated with each other (I will check the correlation between Steps and Calories).


**Known Issues & Mitigations:**

> - Outliers: Linear regression is sensitive to outliers. I have already cleaned 0 RHR values, but I will double-check for extreme calorie counts.
> - Multicollinearity: Steps and Calories are likely correlated. If the correlation is too high (>0.8), I may drop one or use Ridge/Lasso regression (regularization) to mitigate this.
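
The multicollinearity check described above can be sketched as a scan of the feature correlation matrix for pairs above the 0.8 cut-off. The data is invented (steps and calories are deliberately built to move together), and `THRESHOLD` and `flagged` are illustrative names.

```python
import pandas as pd

# Invented features: calories rise almost in lockstep with steps; sleep does not.
toy = pd.DataFrame({
    "TotalSteps": [3000, 5000, 7000, 9000, 11000, 13000],
    "ActiveCalories": [400, 520, 700, 880, 1010, 1300],
    "TotalMinutesAsleep": [430, 470, 410, 480, 440, 455],
})

THRESHOLD = 0.8  # cut-off named in the text
corr = toy.corr().abs()

# Visit each feature pair once (upper triangle) and keep those above the cut-off.
flagged = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]

for a, b, r in flagged:
    print(f"{a} vs {b}: |r| = {r} (above {THRESHOLD})")
```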

####**5. Hyperparameter Selection**

**For Linear Regression:**

> - There are usually no hyperparameters to tune for standard Least Squares Linear Regression.
> - If using Ridge/Lasso (to handle multicollinearity): The hyperparameter is alpha (regularization strength). I would choose it using cross-validation or a simple grid search (e.g., trying alpha = 0.1, 1.0, 10.0).
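
A minimal sketch of the alpha search, assuming scikit-learn's `GridSearchCV` with the scaler and Ridge model wrapped in a pipeline so that scaling is re-fit inside each fold. The data is synthetic, and the three candidate values are the ones listed above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: calories are deliberately correlated with steps.
rng = np.random.default_rng(42)
n = 120
steps = rng.normal(9000, 2500, n)
calories = 0.1 * steps + rng.normal(0, 60, n)
X = np.column_stack([steps, calories])
y = 70 - 0.0008 * steps + rng.normal(0, 2, n)

# make_pipeline names the Ridge step "ridge", hence the "ridge__alpha" key.
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
grid.fit(X, y)

best_alpha = grid.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}")
print(f"Best mean CV R²: {grid.best_score_:.3f}")
```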

**For Random Forest (if used as alternate):**
> - n_estimators (number of trees): I would start with 100.
> - max_depth: To prevent overfitting, I might limit this to 5 or 10.

####**6. Post-processing**

**Techniques:**

> - Residual Analysis: After training, I will plot the residuals (actual minus predicted values) against the predicted values to check for patterns. If the residuals scatter randomly around zero, the model is adequate. If there's a pattern (e.g., a curve), a linear model might be insufficient.
> - De-scaling: If I scale the target as well as the features, I will need to inverse-transform the predictions to get the actual heart rate (bpm) for readability. If only the features are scaled, the predictions are already in bpm.


**Why?**

> - To validate that the model's assumptions were met and to present the results in understandable units (bpm) rather than standardized z-scores.
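
A residual check can be sketched in a few lines. The example below fits a straight line to invented data with `np.polyfit` and computes residuals as actual minus predicted. For a least-squares fit with an intercept, the residuals average to zero by construction, so the thing to look for is structure (a curve or a funnel shape) when they are plotted against the predictions.

```python
import numpy as np

# Invented data: RHR falls roughly linearly with active calories.
calories = np.array([400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0])
rhr_actual = np.array([64.0, 63.5, 62.0, 61.5, 60.0, 59.5])

# Fit a straight line and generate predictions in bpm.
slope, intercept = np.polyfit(calories, rhr_actual, deg=1)
rhr_pred = slope * calories + intercept

# Residual = actual - predicted.
residuals = rhr_actual - rhr_pred

print(f"Slope: {slope:.5f} bpm per kcal")
print("Residuals:", residuals.round(3))
print(f"Mean residual: {residuals.mean():.2e}")
```

Because the target here is left in bpm and only the inputs would be scaled, no inverse transform is needed in this sketch.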

####**7. ML Code Implementation**
**Steps to Code:**
1. Data Splitting: Split the cleaned Apple dataset (and Fitbit dataset separately) into Training (80%) and Testing (20%) sets using train_test_split.
2. Preprocessing Pipeline: Apply StandardScaler to the features.
3. Train: Fit the LinearRegression model on the training set.
4. Predict: Generate predictions on the test set.
5. Evaluate: Calculate and print the model's performance. Accuracy applies to classification, so for regression I'll print the $R^2$ score or Mean Squared Error.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import datetime as dt

# ==========================================
# 2. ANALYSIS 1: APPLE HEALTH DATA (Longitudinal)
# ==========================================
print("--- ANALYSIS 1: APPLE HEALTH DATA ---")

# Setup Data
df_ml_apple = df_apple_final.copy()
# Drop rows with missing values
df_ml_apple = df_ml_apple.dropna(subset=['RestingHeartRate', 'ActiveCalories', 'TotalSteps'])

# Feature Engineering: Convert 'Date' (Capitalized) to ordinal
df_ml_apple['Date_Ordinal'] = pd.to_datetime(df_ml_apple['Date']).map(dt.datetime.toordinal)

# Define X and y
# Added TotalSteps to features for better accuracy
features_apple = ['ActiveCalories', 'TotalSteps', 'Date_Ordinal']
target_apple = 'RestingHeartRate'

X_a = df_ml_apple[features_apple]
y_a = df_ml_apple[target_apple]

# Split Data (80% Train, 20% Test)
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_a, y_a, test_size=0.2, random_state=42)

# Scale Data
scaler_a = StandardScaler()
X_train_a_scaled = scaler_a.fit_transform(X_train_a)
X_test_a_scaled = scaler_a.transform(X_test_a)

# --- Model A: Linear Regression ---
lr_model_a = LinearRegression()
lr_model_a.fit(X_train_a_scaled, y_train_a)
y_pred_lr_a = lr_model_a.predict(X_test_a_scaled)
r2_lr_a = r2_score(y_test_a, y_pred_lr_a)

# --- Model B: Random Forest (Comparison) ---
rf_model_a = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model_a.fit(X_train_a_scaled, y_train_a)
y_pred_rf_a = rf_model_a.predict(X_test_a_scaled)
r2_rf_a = r2_score(y_test_a, y_pred_rf_a)

# Output Results
print(f"Linear Regression R²: {r2_lr_a:.4f}")
print(f"Random Forest R²:     {r2_rf_a:.4f}")
print("-" * 30)
print("Coefficients (Linear Model):")
for f, c in zip(features_apple, lr_model_a.coef_):
    print(f"  {f}: {c:.4f}")

# Visualization: Actual vs Predicted (Linear Model)
plt.figure(figsize=(10, 5))
plt.scatter(y_test_a, y_pred_lr_a, alpha=0.6, color='blue', label='Predictions')
plt.plot([y_a.min(), y_a.max()], [y_a.min(), y_a.max()], 'r--', lw=2, label='Perfect Fit')
plt.title("Apple Data: Actual vs Predicted RHR (Linear Regression)")
plt.xlabel("Actual RHR")
plt.ylabel("Predicted RHR")
plt.legend()
plt.grid(True)
plt.show()

print("\n" + "="*50 + "\n")

# ==========================================
# 3. ANALYSIS 2: FITBIT DATA (Activity & Sleep)
# ==========================================
print("--- ANALYSIS 2: FITBIT DATA ---")

# Setup Data
df_ml_fitbit = df_fitbit_final.copy()
# Check for standard Fitbit column names
# If you get a KeyError here, verify your Fitbit column names using df_fitbit_final.columns
df_ml_fitbit = df_ml_fitbit.dropna(subset=['RestingHeartRate', 'TotalSteps', 'TotalMinutesAsleep', 'Calories'])

# Define X and y
features_fitbit = ['TotalSteps', 'TotalMinutesAsleep', 'Calories']
target_fitbit = 'RestingHeartRate'

X_f = df_ml_fitbit[features_fitbit]
y_f = df_ml_fitbit[target_fitbit]

# Split Data
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.2, random_state=42)

# Scale Data
scaler_f = StandardScaler()
X_train_f_scaled = scaler_f.fit_transform(X_train_f)
X_test_f_scaled = scaler_f.transform(X_test_f)

# --- Model A: Linear Regression ---
lr_model_f = LinearRegression()
lr_model_f.fit(X_train_f_scaled, y_train_f)
y_pred_lr_f = lr_model_f.predict(X_test_f_scaled)
r2_lr_f = r2_score(y_test_f, y_pred_lr_f)

# --- Model B: Random Forest (Comparison) ---
rf_model_f = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model_f.fit(X_train_f_scaled, y_train_f)
y_pred_rf_f = rf_model_f.predict(X_test_f_scaled)
r2_rf_f = r2_score(y_test_f, y_pred_rf_f)

# Output Results
print(f"Linear Regression R²: {r2_lr_f:.4f}")
print(f"Random Forest R²:     {r2_rf_f:.4f}")
print("-" * 30)
print("Coefficients (Linear Model):")
for f, c in zip(features_fitbit, lr_model_f.coef_):
    print(f"  {f}: {c:.4f}")

# Visualization: Feature Importance (Random Forest)
plt.figure(figsize=(8, 5))
importances = rf_model_f.feature_importances_
sns.barplot(x=importances, y=features_fitbit, palette='viridis')
plt.title("Fitbit Data: Feature Importance (Random Forest)")
plt.xlabel("Importance")
plt.show()

### **8. Model Interpretation & Conclusion**

**Comparison of Models:**

>I ran both Linear Regression and Random Forest on my datasets.
> - **Apple Data:** The Linear Regression model performed **better** than Random Forest (**R²: 0.20 vs 0.16**). This makes sense because the dominant factor in my Apple data was the simple linear trend over time (my RHR decreasing over 2 years). Random Forest can overfit when the underlying trend is this simple.
> - **Fitbit Data:** The Random Forest model performed better here (**R²: 0.17 vs 0.08**), likely because the relationship between daily sleep, steps, and heart rate is non-linear and complex.

**Key Drivers (Coefficients):**

>1. **Time/Date (Apple):** Had a coefficient of **-0.4772**, supporting my main hypothesis that my cardiovascular health has improved over the long term.

>2. **Active Calories:** Results were mixed.
     - In the **Apple** model, this was positive (+1.37), suggesting that on days of high exertion, my RHR might be slightly elevated due to recovery stress.
     - In the **Fitbit** model, Calories had a negative coefficient (**-0.58**), supporting the idea that higher activity generally relates to better heart health.

>3. **Sleep (Fitbit):** The model confirmed that lack of sleep is a major factor. The coefficient was **-1.16** (per standard deviation of sleep, since the features were standardized), meaning more minutes of sleep strongly correlate with a lower (better) resting heart rate.
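
Because the features were passed through `StandardScaler` before fitting, each coefficient above is in bpm per one standard deviation of the feature, not per step, calorie, or minute. A minimal sketch of the conversion back to original units, using synthetic sleep data (the numbers below are assumptions for illustration, not values from the Fitbit dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example (not the Fitbit data): sleep in minutes, RHR in bpm.
sleep_min = rng.normal(420, 60, size=200)   # assumed mean of 7 h, SD of about 1 h
rhr = 75.0 - 0.02 * sleep_min               # true slope: -0.02 bpm per minute

# Standardize the feature the way StandardScaler does (population SD, ddof=0).
mean, sd = sleep_min.mean(), sleep_min.std()
sleep_scaled = (sleep_min - mean) / sd

# Fit rhr = intercept + coef_scaled * sleep_scaled by least squares.
A = np.column_stack([np.ones_like(sleep_scaled), sleep_scaled])
intercept, coef_scaled = np.linalg.lstsq(A, rhr, rcond=None)[0]

# Convert back to original units: bpm per minute of sleep.
coef_per_minute = coef_scaled / sd

print(f"bpm per 1 SD of sleep:   {coef_scaled:.3f}")
print(f"bpm per minute of sleep: {coef_per_minute:.4f}")
```

In the notebook, the equivalent conversion would presumably be `lr_model_f.coef_ / scaler_f.scale_`, which gives one per-unit coefficient for each entry in `features_fitbit`.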

**Final Verdict:**
>While the day-to-day prediction is noisy, the **Linear Regression** model successfully captured the general downward trend in my Resting Heart Rate over time, and the **Fitbit** analysis highlighted the critical importance of Sleep for daily recovery.

---
---
###Project 5 : Model Evaluation, Insights & Policy Decision

####**2. Evaluate Machine Learning Algorithm**




**What metrics will most effectively measure the performance of your model? Why?**
> Since the target variable, **Resting Heart Rate (RHR)**, is a continuous numerical value, this is a Regression problem. The most effective metrics to evaluate performance are:
>- R-squared: Why it's effective: This represents the "Goodness of Fit". It tells us what percentage of the variation in my daily heart rate is explained by my activity and sleep data. For health data, which is very noisy, an $R^2$ even around 0.20-0.30 is meaningful. It allows us to see if the model is finding any signal amidst the noise.
>- Mean Absolute Error (MAE): Why it's effective: This is the most interpretable metric for a non-technical audience. If the MAE is 2.5, it means "On any given day, the model's prediction is typically off by about 2.5 beats per minute." This helps contextualize if the model is useful for daily tracking.
>- Root Mean Squared Error (RMSE): Why it's effective: RMSE penalizes larger errors more heavily than MAE. In a health context, predicting a heart rate of 50 when it is actually 90 is a dangerous error. RMSE helps us identify if our model has many of these "large misses".
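
To make the difference between MAE and RMSE concrete, here is a small sketch with made-up RHR values (not taken from either dataset). The helper `regression_metrics` is a hypothetical name introduced only for this example; it computes the same three quantities as the scikit-learn functions used in this notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

actual = [58, 60, 62, 64, 66]          # made-up RHR values in bpm

steady = [60, 62, 64, 66, 68]          # every prediction is off by 2 bpm
one_big_miss = [58, 60, 62, 64, 76]    # four exact predictions, one off by 10 bpm

r2_s, mae_s, rmse_s = regression_metrics(actual, steady)
r2_b, mae_b, rmse_b = regression_metrics(actual, one_big_miss)

print(f"steady:       R2={r2_s:.2f}  MAE={mae_s:.2f}  RMSE={rmse_s:.2f}")
print(f"one big miss: R2={r2_b:.2f}  MAE={mae_b:.2f}  RMSE={rmse_b:.2f}")
```

Both prediction sets have an MAE of 2.0 bpm, but the RMSE is 2.0 for the steady errors and about 4.47 for the single large miss, which is the penalty on large errors described above.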

####**Model Comparison**

**Do you need to compare these metrics across models? Why or why not?**

> Yes, comparison is necessary.
> - Comparing Algorithms (Linear vs Random Forest): Comparing metrics helps determine if the relationship is linear (simple) or non-linear (complex). If Random Forest significantly outperforms Linear Regression, it suggests that the physiological relationship is non-linear.
> - Comparing Data Sources (Apple vs. Fitbit): Comparing the Apple (Personal) metrics against the Fitbit (Population) metrics helps assess generalizability. It tells us if the model works better for a specific individual (customized) or if it can work for the general population.

**How will you do it?**

>I will calculate $R^2$, MAE, and RMSE for all four model combinations (Apple Linear, Apple RF, Fitbit Linear, Fitbit RF) and print them side-by-side.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
import pandas as pd

def get_metrics(y_true, y_pred, model_name):
  r2 = r2_score(y_true, y_pred)
  mae = mean_absolute_error(y_true, y_pred)
  rmse = np.sqrt(mean_squared_error(y_true, y_pred))
  return [model_name, r2, mae, rmse]


# Gather metrics for all existing models
metrics_data = []
metrics_data.append(get_metrics(y_test_a, y_pred_lr_a, "Apple Linear Reg"))
metrics_data.append(get_metrics(y_test_a, y_pred_rf_a, "Apple Random Forest"))
metrics_data.append(get_metrics(y_test_f, y_pred_lr_f, "Fitbit Linear Reg"))
metrics_data.append(get_metrics(y_test_f, y_pred_rf_f, "Fitbit Random Forest"))


# Create a DataFrame for nice display
df_metrics = pd.DataFrame(metrics_data, columns=["Model", "R2 Score", "MAE", "RMSE"])
print(df_metrics)

**Visualizations**

**Show/visualize the performance metric(s).**

>I will visualize the Predicted vs. Actual values for the best performing model. A perfect model would show points lying exactly on the diagonal red line.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualization for Apple Linear Regression
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test_a, y=y_pred_lr_a, alpha=0.6, color='blue', label='Daily Data Points')

# Plotting the "Perfect Fit" line (y = x), where predicted equals actual
lims = [y_test_a.min(), y_test_a.max()]
plt.plot(lims, lims, color='red', linestyle='--', linewidth=2, label='Perfect Fit')

plt.xlabel("Actual Resting Heart Rate")
plt.ylabel("Predicted Resting Heart Rate")
plt.title("Evaluation of Fit: Actual vs Predicted (Apple Data)")
plt.show()

**Visual Analysis:**
The scatter plot above visualizes the model's accuracy.
* The **Red Line** represents a perfect prediction.
* The **Blue Dots** are individual days in the test set, plotted as actual vs. predicted RHR.
* **Observation:** The spread of dots around the line confirms that while the model captures the general range of my heart rate, there is significant variance (noise) that steps and calories alone cannot explain.

**Evaluate Fit**

**Are you overfitting? Underfitting? Fitting well? How do you know?**

> Conclusion: The models are likely **Underfitting**

>How I came to this conclusion:
> 1. Low $R^2$ Scores: The $R^2$ scores are relatively low (e.g., 0.08 - 0.30). A model that fits well would typically have an $R^2$ above 0.50 or 0.60.
> 2. No "High Train/Low Test" Gap: Overfitting usually presents as a very high score on training data (e.g., $R^2 = 0.95$) and a very low score on testing data. Since the performance is low across the board, the model is failing to capture the complexity of the data (High Bias), which is the definition of Underfitting.
> 3. Missing Variables: Heart rate is affected by stress, hydration, genetics, and diet—variables not present in this dataset. The model cannot "fit" what it cannot see.
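
The cells above only report test-set scores, so the train/test gap in point 2 is not shown directly. A minimal sketch of how that check could look, run here on synthetic data with one weak signal and a lot of noise (the data and the helper `train_test_r2` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: one weak linear signal buried in noise.
X = rng.normal(size=(400, 3))
y = 60 - 1.5 * X[:, 0] + rng.normal(scale=4.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def train_test_r2(model):
    """Fit on the training split and return (train R^2, test R^2)."""
    model.fit(X_train, y_train)
    return (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )

lr_train, lr_test = train_test_r2(LinearRegression())
rf_train, rf_test = train_test_r2(
    RandomForestRegressor(n_estimators=100, random_state=42)
)

print(f"Linear Regression  train R2={lr_train:.2f}  test R2={lr_test:.2f}")
print(f"Random Forest      train R2={rf_train:.2f}  test R2={rf_test:.2f}")
```

A model that scores low on both splits points to underfitting, while a large gap between the two points to overfitting. On noisy data like this, a fully grown Random Forest often shows a sizeable gap even when its test score is low, so it may be worth running the same check on the notebook's own models, for example with `r2_score(y_train_a, rf_model_a.predict(X_train_a_scaled))`.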

**Alternative Algorithms (Extra Credit)**

**Is there a different ML algorithm or tweak to the existing one that could be as good or better? Why?**

> Yes, there are different algorithms which I could use.
> 1. Ridge Regression: This introduces **regularization**. Since "Steps" and "Calories" are often highly correlated (multicollinearity), standard Linear Regression can become unstable. Ridge handles this by penalizing large coefficients.
> 2. Gradient Boosting (XGBoost/GBR): Unlike Random Forest, which builds independent trees, Gradient Boosting builds trees sequentially, where each new tree specifically tries to fix the errors of the previous one. This is often superior for tabular data with subtle patterns.
> 3. Support Vector Regression (SVR): SVR works well for finding a "margin of tolerance" around the data, which might be robust against the noisy daily fluctuations of heart rate data.
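
To illustrate what "penalizing large coefficients" does when two features are nearly duplicates, here is a sketch using the closed-form ridge solution on synthetic data. The variables `steps` and `calories` are made-up stand-ins and `ridge_coefficients` is a hypothetical helper; the notebook itself uses `sklearn.linear_model.Ridge`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, strongly correlated features (stand-ins for steps and calories).
steps = rng.normal(size=n)
calories = steps + rng.normal(scale=0.05, size=n)   # nearly a copy of steps
X = np.column_stack([steps, calories])
y = -1.0 * steps + rng.normal(scale=1.0, size=n)

# Center features and target so that no intercept term is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

ols_coef = ridge_coefficients(Xc, yc, alpha=0.0)    # alpha = 0 is ordinary least squares
ridge_coef = ridge_coefficients(Xc, yc, alpha=1.0)

print("OLS coefficients:  ", np.round(ols_coef, 2))
print("Ridge coefficients:", np.round(ridge_coef, 2))
```

With near-duplicate features, ordinary least squares can split the effect into large offsetting coefficients, while the ridge penalty always yields a coefficient vector with a smaller norm. If the two sets of coefficients come out almost identical on real data, that is a hint that multicollinearity is not causing much instability.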


In [None]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Scaling is required for SVR and Ridge
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_a)
X_test_scaled = scaler.transform(X_test_a)

# --- Algorithm 1: Ridge Regression ---
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train_a)
y_pred_ridge = ridge.predict(X_test_scaled)

# --- Algorithm 2: Gradient Boosting Regressor ---
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train_scaled, y_train_a)
y_pred_gbr = gbr.predict(X_test_scaled)

# --- Algorithm 3: Support Vector Regression (SVR) ---
svr = SVR(kernel='rbf', C=10, gamma='scale')
svr.fit(X_train_scaled, y_train_a)
y_pred_svr = svr.predict(X_test_scaled)

# Collect metrics for the new models
ec_metrics = []
ec_metrics.append(get_metrics(y_test_a, y_pred_ridge, "Extra: Ridge Reg"))
ec_metrics.append(get_metrics(y_test_a, y_pred_gbr, "Extra: Gradient Boosting"))
ec_metrics.append(get_metrics(y_test_a, y_pred_svr, "Extra: SVR"))

# Tabulate the new models (compare against the Apple baselines in df_metrics above)
df_ec = pd.DataFrame(ec_metrics, columns=["Model", "R2 Score", "MAE", "RMSE"])
print("--- EXTRA CREDIT ALGORITHM COMPARISON ---")
print(df_ec)


**Interpretation of Metrics**

* **R-squared (~0.20 - 0.26):** The scores indicate a weak-to-moderate fit. This is expected in biological data, as Resting Heart Rate is influenced by many unmeasured factors like diet, hydration, and stress. The model captures the *trend* but not every daily fluctuation.
* **RMSE (~4.7 - 4.9 bpm):** A typical error of just under 5 beats per minute is acceptable for a consumer health context. It suggests the model is generally accurate enough to distinguish between a "healthy/recovered" day and a "stressed/fatigued" day.

**How does the other algorithm/tweak compare for the metrics you care about?**

>Based on the evaluation metrics from the new algorithms, here is how they compare:

> **1. Gradient Boosting:**
> - Observation: Gradient Boosting achieved the highest $R^2$ Score (0.2633) and the lowest RMSE (4.71) of all the models tested.
> - Why it helps: Unlike Random Forest, which builds trees independently, Gradient Boosting builds trees sequentially to specifically correct the errors of the previous tree. This allowed it to better capture the subtle, non-linear patterns in the daily recovery data that the other models missed. It effectively squeezed the most "signal" out of the noisy health data.

> **2. Ridge Regression (The Stability Check):**
> - Observation: The $R^2$ (0.2067) and RMSE (4.90) are nearly identical to the baseline Linear Regression model.
> - Why it matters: This indicates that "multicollinearity" (overlap between Steps and Calories) was not a major issue in the dataset. If it were, Ridge would likely have outperformed standard Linear Regression by a clearer margin. Since it didn't, the simple linear model appears to have been stable already.

> **3. Support Vector Regression (SVR):**
> - Observation: SVR performed the worst of the group, with the lowest $R^2$ (0.1899).
> - Why it failed: SVR attempts to fit a strict "margin" around the data points. Because daily Resting Heart Rate is highly volatile (noisy) due to external factors like stress or diet, SVR struggled to find a consistent margin, resulting in a poorer fit than Gradient Boosting and Ridge.
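
The "each tree corrects the previous errors" idea behind Gradient Boosting can be sketched by hand for squared-error loss, where the quantity each new tree is fitted to is simply the current residual. The example below uses synthetic data and arbitrary settings (50 trees, depth 3, learning rate 0.1); it is an illustration of the mechanism, not a reimplementation of `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear data, not the notebook's data.
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # stage 0: predict the mean everywhere
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                           # the new tree targets the current errors
    prediction += learning_rate * tree.predict(X)    # take a small corrective step
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 0 trees:  {train_mse[0]:.3f}")
print(f"training MSE after 50 trees: {train_mse[-1]:.3f}")
```

The training error falls with every added tree, which is also why boosting needs a held-out test set (as used above) to confirm that the gain is real signal and not fitted noise.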


**Conclusion:** Gradient Boosting is the superior algorithm for this specific physiological dataset, raising $R^2$ from about 0.20-0.21 for the baseline linear approaches to 0.26, an improvement of roughly 6 percentage points in explained variance.

####**3. Insights**

**Hypothesis Findings**

**What did you find in terms of your hypothesis?**
> **Hypothesis**: "Increased physical activity leads to improvement in cardiovascular health (lower RHR) over time."

> **Findings**: The hypothesis was **Partially Confirmed**, but with important nuances:

> **1. Long-Term Trend (Confirmed)**: The Apple (Personal) Linear Regression model showed a clear negative coefficient for the Date variable (-0.4772). This supports the idea that consistent activity over a long period correlates with a lower resting heart rate.

> **2. Daily Intensity (Mixed/Rejected)**: The relationship between *daily* activity and *same-day* RHR was not straightforward.
> - In the **Fitbit (Population)** data, higher calories correlated with lower RHR (coefficient **-0.58**), supporting the hypothesis.
> - However, in the **Apple** data, **Active Calories** had a positive coefficient (+1.37), suggesting that very high intensity days might temporarily raise RHR due to recovery demands.


**Assumptions & Adjustments**

**Any previous assumptions that you had to adjust, or proved wrong?**

> **Assumption:** I assumed that a "hard workout today" would guarantee a "lower heart rate tomorrow".

> **Adjustments:** I had to adjust this view. The positive coefficient for calories in my personal data suggests that RHR is also a **measure of stress/recovery**. A hard workout imposes acute physiological stress, which keeps the heart rate elevated while the body repairs itself. I learned that "health improvement" is a long-term adaptation, not an immediate daily reward.

**Is the problem different from what you had originally thought?**

> **Yes**, I originally thought this was a simple "Input (Exercise) -> Output (Health)" problem.

> **Realization**: I now see it as a multivariate control problem where sleep is just as critical as exercise. The Fitbit model showed **TotalMinutesAsleep (coefficient -1.16)** was the strongest driver of low RHR. The problem is not just "how much did I run?" but "did I sleep enough to recover from the run?"

**Future Improvements**

**Anything you would do differently if you were to do it again?**
> **1. Feature Engineering (Time-Lags):** Instead of comparing Today's Steps to Today's RHR, I would create lagged features (e.g., "Steps_Yesterday", "Avg_Steps_Last_Week"). Physiological adaptations take time, and a rolling average would likely smooth out the noise and improve the $R^2$ score.

> **2. Personalised Modeling for Fitbit:** Instead of aggregating 35 different users into one "blob" of data (which led to a low $R^2$ of ~0.17), I would build individual regression models for each user ID or use a Mixed-Effects Model. This would account for the fact that every person has a different baseline heart rate.
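
Both ideas can be prototyped with a few lines of pandas. The sketch below uses a made-up daily log for two hypothetical users; the `Id` and `Date` column names are assumptions, and subtracting each user's mean RHR is a much simpler stand-in for per-user baselines than a true mixed-effects model.

```python
import pandas as pd

# Toy daily log for two hypothetical users (all values are made up).
df = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"] * 2
    ),
    "TotalSteps": [4000, 12000, 8000, 6000, 9000, 3000, 7000, 11000],
    "RestingHeartRate": [62, 61, 64, 62, 70, 71, 69, 70],
})
df = df.sort_values(["Id", "Date"]).reset_index(drop=True)

steps_by_user = df.groupby("Id")["TotalSteps"]

# Yesterday's steps, computed within each user so values never leak across users.
df["Steps_Yesterday"] = steps_by_user.shift(1)

# Mean of up to the previous 3 days (today excluded): shift first, then roll.
df["Avg_Steps_Prev_3d"] = steps_by_user.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

# Each user's RHR relative to their own average (a simple per-user baseline).
user_mean_rhr = df.groupby("Id")["RestingHeartRate"].transform("mean")
df["RHR_vs_Baseline"] = df["RestingHeartRate"] - user_mean_rhr

print(df)
```

The first row of each user has no lagged value, so those rows would need to be dropped or imputed before fitting. Shifting before rolling keeps today's steps out of the feature, which matters if the aim is to predict RHR from activity that has already happened.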

**Policy & Ethical Implications**

**Are there any policy or other decisions that could be influenced by an analysis like yours?
What are they, and what could be the wider effects?**

> **Corporate Wellness Programs:** Currently, many companies run "Step Count Challenges." My analysis suggests this is incomplete.
> - New Policy: Companies should incentivize **Rest & Recovery** (e.g., "7 Hours of Sleep Challenge").
> - Wider Effect: Prioritizing sleep over just "movement volume" could reduce employee burnout, improve cognitive function, and lower long-term healthcare costs more effectively than steps alone.

> **Athletic Training Decisions:** Coaches often push for maximum volume.
> - New decision: Use RHR data to **regulate training load**. If RHR spikes (as seen in my Apple data), the athlete should rest, not just train harder. This helps prevent overtraining injuries.


**What ethical concerns should you or someone reading your project consider?**

> **1. Data Privacy and Re-identification:** The Fitbit dataset contains granular minute-by-minute health data. Even if "anonymized", combining this with location data or other public leaks could re-identify specific individuals, revealing sensitive health conditions.

> **2. Algorithmic Bias (Insurance):** If health insurance companies used models like this to set premiums, they might unfairly penalize people with naturally higher heart rates or those with stressful jobs (which can raise RHR), even if they are physically active.

> **3. Data Ownership:** Who owns the biological data generated by the watch? The user, Apple/Google, or the employer paying for the wellness program? If the data suggests a health risk, does the platform have a duty to warn the user?

**Final Thoughts**

**Final thoughts: Summarize your experience across all 5 projects. What did you learn?**

> **Project 1 and 2 (Data Cleaning):** I learned that real-world data is messy. Cleaning dates, merging datasets, and handling missing values took 80% of the effort, but without it, the analysis would be impossible.

> **Project 3 and 4 (Modeling):** I learned that "more complex" isn't always "better," but sometimes it is necessary. Comparing Linear Regression to Random Forest showed me that human health data is often non-linear.

> **Project 5 (Evaluation):** I learned that a low $R^2$ isn't a failure—it's a finding. It tells us that the system we are studying (the human body) is complex and influenced by variables we didn't capture (like diet or genetics). Data Science is as much about understanding what you can't predict as what you can.

####**Final Hypothesis Verdict**

**Original Hypothesis:** "Increased physical activity leads to improvement in cardiovascular health (lower RHR) over time."

**Conclusion:** The data **partially supports** this hypothesis, but with a critical distinction between *long-term* and *short-term* effects.
1.  **Confirmed (Long-Term):** The negative coefficient for the `Date` variable indicates that maintaining an active lifestyle over months correlates with a gradual decrease in RHR.
2.  **Refined (Short-Term):** On a daily basis, high-intensity activity (`Active Calories`) does *not* immediately lower RHR; in fact, it often raises it due to acute recovery stress.
3.  **New Insight:** The analysis revealed that **Sleep** is a co-equal factor in improving heart health, a variable I had originally underestimated.