# **From Wrist to Insights: Apple Watch Data Analysis** 
Using Apple Health Data

With wearable technology, specifically the Apple Watch, I've embarked on a quest to uncover valuable insights about my health and fitness. By analyzing the data collected by my Apple Watch, I've gained a deeper understanding of my daily activities, exercise routines, heart health, and more. Through data analysis and visualization, I've turned raw wrist-worn data into insights that help me make informed decisions about my well-being. Join me on this exploration of what wearable data can reveal about personal health and fitness.

In [1]:
# Importing the libraries we would need
import xml.etree.ElementTree as ET
import pandas as pd
from datetime import datetime
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## Importing and Converting data 
Since the health file is saved in XML format, we use a for-loop to parse each record and convert the result into a dataframe.

In [2]:
file_path= "/kaggle/input/apple-datav2/export.xml"

def parse_apple_health_xml(file_path):
    # parse the xml file
    tree = ET.parse(file_path)
    root = tree.getroot()
    
    # create a list to hold the data
    data = []
    
    # loop through each record in the xml file
    for record in root.findall('Record'):
        # get the data for each record
        values = {}
        values['type'] = record.get('type')
        values['sourceName'] = record.get('sourceName')
        values['unit'] = record.get('unit')
        values['creationDate'] = record.get('creationDate')
        values['startDate'] = record.get('startDate')
        values['endDate'] = record.get('endDate')
        values['value'] = record.get('value')
        
        # add the record data to the list
        data.append(values)
    
    # turn the list into a dataframe
    df = pd.DataFrame(data)
    
    return df
df = parse_apple_health_xml(file_path)
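
`ET.parse` loads the whole export into memory, which can be slow for exports that cover several years. As an alternative, here is a minimal sketch that streams the file with `ET.iterparse`. The helper name `iter_health_records` and the `RECORD_FIELDS` constant are illustrative and not part of the original notebook. It assumes, as `root.findall('Record')` does, that only `Record` elements directly under the root are wanted.

```python
import xml.etree.ElementTree as ET

# Attributes copied from each <Record>; the same ones read in the loop above
RECORD_FIELDS = ("type", "sourceName", "unit", "creationDate",
                 "startDate", "endDate", "value")

def iter_health_records(source):
    """Yield one dict per top-level <Record> element (hypothetical helper).

    `source` is a file path or a binary file object.
    """
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # An "end" event at depth 2 closes a direct child of the root
        if depth == 2:
            if elem.tag == "Record":
                yield {field: elem.get(field) for field in RECORD_FIELDS}
            # Drop the element's attributes and children once it is processed
            elem.clear()
        depth -= 1
```

The rows can then be turned into the same dataframe with `pd.DataFrame(list(iter_health_records(file_path)))`.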

In [3]:
df.head(2)

| | type | sourceName | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|---|
| 0 | HKQuantityTypeIdentifierHeight | Rooted Android | cm | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 178 |
| 1 | HKQuantityTypeIdentifierBodyMass | Rooted Android | kg | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 81 |


In [4]:
df["type"].unique()

array(['HKQuantityTypeIdentifierHeight',
       'HKQuantityTypeIdentifierBodyMass',
       'HKQuantityTypeIdentifierHeartRate',
       'HKQuantityTypeIdentifierOxygenSaturation',
       'HKQuantityTypeIdentifierRespiratoryRate',
       'HKQuantityTypeIdentifierStepCount',
       'HKQuantityTypeIdentifierDistanceWalkingRunning',
       'HKQuantityTypeIdentifierBasalEnergyBurned',
       'HKQuantityTypeIdentifierActiveEnergyBurned',
       'HKQuantityTypeIdentifierFlightsClimbed',
       'HKQuantityTypeIdentifierAppleExerciseTime',
       'HKQuantityTypeIdentifierRestingHeartRate',
       'HKQuantityTypeIdentifierWalkingHeartRateAverage',
       'HKQuantityTypeIdentifierEnvironmentalAudioExposure',
       'HKQuantityTypeIdentifierHeadphoneAudioExposure',
       'HKQuantityTypeIdentifierWalkingDoubleSupportPercentage',
       'HKQuantityTypeIdentifierSixMinuteWalkTestDistance',
       'HKQuantityTypeIdentifierAppleStandTime',
       'HKQuantityTypeIdentifierWalkingSpeed',
       'HKQuanti

In [5]:
#describing data based on unique types
df.groupby('type')['value'].describe()

| type | count | unique | top | freq |
|---|---|---|---|---|
| HKCategoryTypeIdentifierAppleStandHour | 1564 | 2 | HKCategoryValueAppleStandHourStood | 853 |
| HKCategoryTypeIdentifierSleepAnalysis | 6032 | 6 | HKCategoryValueSleepAnalysisInBed | 3159 |
| HKDataTypeSleepDurationGoal | 2 | 2 | 6 | 1 |
| HKQuantityTypeIdentifierActiveEnergyBurned | 67029 | 3837 | 0.414 | 2210 |
| HKQuantityTypeIdentifierAppleExerciseTime | 1272 | 1 | 1 | 1272 |
| HKQuantityTypeIdentifierAppleStandTime | 3087 | 5 | 1 | 1222 |
| HKQuantityTypeIdentifierAppleWalkingSteadiness | 24 | 24 | 0.83038 | 1 |
| HKQuantityTypeIdentifierBasalEnergyBurned | 33228 | 8311 | 0.074 | 15033 |
| HKQuantityTypeIdentifierBodyMass | 1 | 1 | 81 | 1 |
| HKQuantityTypeIdentifierDistanceWalkingRunning | 16697 | 8920 | 0.002822 | 264 |


The unique values describe the record types that are available for our analysis. 
We are mainly looking for types with a high record count, so the following can be used:
* Active Energy Burned
* Basal Energy Burned
* Distance Walking Running
* Headphone Audio Exposure
* Heart Rate
* Step Count
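
To rank the record types by how many records each has, a small helper like the one below could be used. This is only a sketch: `top_record_types` is a hypothetical name, and it assumes the dataframe has the `type` column built above. It also strips the `HKQuantityTypeIdentifier` / `HKCategoryTypeIdentifier` prefix to make the names easier to read.

```python
import pandas as pd

def top_record_types(df, n=6):
    """Return the n most frequent record types with the HealthKit prefix removed."""
    # value_counts() sorts from the most to the least frequent type
    counts = df["type"].value_counts()
    counts.index = counts.index.str.replace(
        r"^HK(?:Quantity|Category)TypeIdentifier", "", regex=True)
    return counts.head(n)
```

Calling `top_record_types(df)` on the parsed export would list the six types with the most records.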


---
## Data Sources

In [6]:
#Check for Sources of data
df["sourceName"].unique()

array(['Rooted Android', 'Syed’s Apple\xa0Watch', 'Blood Oxygen',
       'Health'], dtype=object)

Since there are two sources of data, the phone and the watch, we will need to pick one or set a preference on the days where both are present.

In [7]:
# Replacing the phone name and watch name for convenience
# Assigning the result back avoids relying on inplace=True on a column selection
df["sourceName"] = df["sourceName"].replace(["Rooted Android","Syed’s Apple\xa0Watch"],["Phone data","Watch Data"])
df["sourceName"].unique()

array(['Phone data', 'Watch Data', 'Blood Oxygen', 'Health'], dtype=object)
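
The watch name contains a non-breaking space (`\xa0`), which makes exact string matching fragile when the name is typed by hand. One possible way to guard against this, sketched here with a hypothetical helper `normalize_source_name`, is to apply Unicode NFKC normalization before comparing names.

```python
import unicodedata

def normalize_source_name(name):
    """Apply NFKC normalization, which maps U+00A0 (no-break space) to a plain space."""
    return unicodedata.normalize("NFKC", name)

# The curly apostrophe is left untouched; only the no-break space changes
print(normalize_source_name("Syed’s Apple\xa0Watch") == "Syed’s Apple Watch")  # True
```

On the dataframe this could be applied with `df["sourceName"].map(normalize_source_name)`.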

---

## Working with Dates

In [8]:
#checking format of time and date
df["startDate"].head(3)

0    2022-11-21 07:07:18 +0530
1    2022-11-21 07:07:18 +0530
2    2022-11-16 01:33:56 +0530
Name: startDate, dtype: object

As the dates are stored as plain strings rather than datetimes, they would be painful to work with once we create graphs. Additionally, we need to group the data by day instead of by timestamp, otherwise there would be too many points on the horizontal axis. 

To circumvent these issues, we split each timestamp on the space and append the date part to a list using a for-loop. We then add these lists to the dataframe as new columns.

In [9]:
# Use space as the delimiter to keep the date and drop the time
def extract_dates(column):
    # A local list means re-running the cell does not append duplicates
    dates = []
    for i in df[column]:
        date_parts = i.split(" ")
        if len(date_parts) > 0:
            dates.append(date_parts[0])
    return dates

# Run the function for the start and end dates
strt_date = extract_dates("startDate")
end_date = extract_dates("endDate")
print("Success")

Success


In [10]:
#assign new columns and convert to date format
df['start_date'] = strt_date
df['end_date'] = end_date
df['start_date']= pd.to_datetime(df["start_date"])
df['end_date']= pd.to_datetime(df["end_date"])
df.head()

| | type | sourceName | unit | creationDate | startDate | endDate | value | start_date | end_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | HKQuantityTypeIdentifierHeight | Phone data | cm | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 178 | 2022-11-21 | 2022-11-21 |
| 1 | HKQuantityTypeIdentifierBodyMass | Phone data | kg | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 2022-11-21 07:07:18 +0530 | 81 | 2022-11-21 | 2022-11-21 |
| 2 | HKQuantityTypeIdentifierHeartRate | Watch Data | count/min | 2022-11-16 01:35:14 +0530 | 2022-11-16 01:33:56 +0530 | 2022-11-16 01:33:56 +0530 | 80 | 2022-11-16 | 2022-11-16 |
| 3 | HKQuantityTypeIdentifierHeartRate | Watch Data | count/min | 2022-11-16 01:35:48 +0530 | 2022-11-16 01:33:17 +0530 | 2022-11-16 01:33:17 +0530 | 81 | 2022-11-16 | 2022-11-16 |
| 4 | HKQuantityTypeIdentifierHeartRate | Watch Data | count/min | 2022-11-16 01:40:19 +0530 | 2022-11-16 01:38:58 +0530 | 2022-11-16 01:38:58 +0530 | 80 | 2022-11-16 | 2022-11-16 |
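
The same date columns can also be built without an explicit loop. The sketch below assumes every timestamp starts with `YYYY-MM-DD`, as in the output above, and uses a hypothetical helper name `local_calendar_date`. Like the split approach, it ignores the UTC offset and keeps the local calendar date.

```python
import pandas as pd

def local_calendar_date(timestamps):
    """Convert strings such as '2022-11-21 07:07:18 +0530' to midnight datetimes."""
    # The first ten characters hold the date part, e.g. '2022-11-21'
    return pd.to_datetime(timestamps.str.slice(0, 10), format="%Y-%m-%d")
```

For example, `df['start_date'] = local_calendar_date(df['startDate'])` would produce the same column as the two cells above.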


# Visualizing the data

## 1. Number of Steps
Let's start by looking at the number of steps taken per day over the whole period.

In [11]:
# Filter the data by 'type' (append "& (df['sourceName'] == 'Watch Data')" to keep only the watch)
# .copy() makes an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning
filtered_df1 = df[df['type'] == 'HKQuantityTypeIdentifierStepCount'].copy()

# Convert the 'value' column to float
filtered_df1['value'] = filtered_df1['value'].astype(float)

# Group the data by 'start_date' and sum 'value' to get the number of steps per day
grouped_data1 = filtered_df1.groupby('start_date')['value'].sum().reset_index()

# Convert the 'start_date' to a datetime object
grouped_data1['start_date'] = pd.to_datetime(grouped_data1['start_date'])

# Create an interactive bar chart with Plotly
fig = px.bar(grouped_data1, x='start_date', y='value', labels={'value': 'Sum of Steps'}, title='Number of Steps Over Time',
            color='value', color_continuous_scale='blues')
# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Sum of Steps')

fig.show()


In [12]:
grouped_data1['value'].describe()

count      377.000000
mean      3749.973475
std       3568.605699
min         74.000000
25%       1027.000000
50%       2661.000000
75%       5443.000000
max      20051.000000
Name: value, dtype: float64

In [13]:
# Create a box plot 
fig = px.box(grouped_data1, y='value', labels={'value': 'Steps per Day'},
             title='Box Plot of Daily Step Counts')
fig.update_traces(hovertemplate='Value: %{y}')

# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.show()


**Some Insights From Graph:** 
* The gap (July to August) corresponds to me accidentally turning off Apple Health tracking.
* There seems to be an increasing trend in the second half of the plot, after November 2022. This can possibly help us understand the outliers in the data. 
> **Refer to the graph below for better understanding**
> * I purchased my watch at the end of November, hence the spike in the data. 
> * There is a high possibility of double counting steps from both the watch and the phone, leading to the high step counts. 

---

**Some Insights From Box Plot:** 
* Above are some summary statistics for my daily step count. There appear to be some outliers.
* The average is around 3750 steps per day, while the median is lower at 2661, so a handful of high-step days pull the mean up.
* The large standard deviation (about 3569, approximately equal to the mean) indicates a wide variation in step counts from day to day.
* The minimum step count of 74 steps indicates that there are days with very low physical activity or perhaps incomplete data.
* The middle 50% of days have step counts between about 1027 and 5443 steps.
* The maximum of 20051 steps is far above the 75th percentile of 5443, so the highest days are likely outliers, possibly caused by double counting.
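
To make the outlier claim concrete, the quartiles from the `describe()` output above can be plugged into the common 1.5 × IQR rule. This is a minimal sketch with a hypothetical helper `tukey_fences`; the quartiles Plotly draws in the box plot may differ slightly because it can use a different quantile method.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% values of the daily step counts reported above
lower, upper = tukey_fences(1027.0, 5443.0)
print(lower, upper)  # -5597.0 12067.0
```

Under this rule, any day above roughly 12067 steps counts as an outlier, which includes the maximum of 20051.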

In [14]:
# Filter the data by 'type'
# .copy() makes an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning
filtered_df2 = df[df['type'] == 'HKQuantityTypeIdentifierStepCount'].copy()

# Convert the 'value' column to float
filtered_df2['value'] = filtered_df2['value'].astype(float)

# Group the data by 'start_date' and 'sourceName', and calculate the sum of 'value'
grouped_data2 = filtered_df2.groupby(['start_date', 'sourceName'])['value'].sum().reset_index()

# Convert the 'start_date' to a datetime object
grouped_data2['start_date'] = pd.to_datetime(grouped_data2['start_date'])

# Create an interactive bar chart with different colors for 'sourceName' using Plotly Express
fig = px.bar(grouped_data2, x='start_date', y='value', color='sourceName', labels={'value': 'Sum of Steps'},
             title='Number of Steps Over Time', barmode='group')  
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Sum of Steps')
# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')

fig.show()
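
One possible way to handle the double counting between the phone and the watch is to keep a single source per day, preferring the watch whenever it has data. The sketch below is an assumption about how this could be done, not something the notebook applies: `prefer_source` is a hypothetical helper, and it expects one row per day and source with the columns `start_date`, `sourceName` and `value`, like `grouped_data2`.

```python
import pandas as pd

def prefer_source(daily_totals, preferred="Watch Data"):
    """Keep one row per day, choosing the preferred source when it is present."""
    ranked = daily_totals.assign(
        _is_preferred=daily_totals["sourceName"].eq(preferred))
    # Within each day, sort the preferred source first
    ranked = ranked.sort_values(["start_date", "_is_preferred"],
                                ascending=[True, False])
    deduplicated = ranked.drop_duplicates(subset="start_date", keep="first")
    return deduplicated.drop(columns="_is_preferred").reset_index(drop=True)
```

A call such as `prefer_source(grouped_data2)` would return at most one row per day. Note that this may undercount on days when the watch was only worn part of the time.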



---
## 2. Heart Rate 

Since the heart rate is measured by the sensors on the watch, heart rate data only starts after I purchased it, which is why the graphs start in mid-November. 
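
To check when each source actually started recording a given type, the earliest date per source can be looked up. This is a small sketch with a hypothetical helper `first_record_date`; it assumes the `type`, `sourceName` and `start_date` columns created earlier.

```python
import pandas as pd

def first_record_date(df, record_type):
    """Return the earliest start_date per source for one record type."""
    subset = df[df["type"] == record_type]
    return subset.groupby("sourceName")["start_date"].min()
```

For instance, `first_record_date(df, 'HKQuantityTypeIdentifierHeartRate')` would show the first day with heart rate readings from each source.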

In [15]:
# Filter the data by 'type' and 'sourceName'
# .copy() makes an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning
heart_df = df[(df['type'] == 'HKQuantityTypeIdentifierHeartRate') & (df['sourceName'] == 'Watch Data')].copy()

# Convert the 'value' column to float
heart_df['value'] = heart_df['value'].astype(float)

# Group the data by 'start_date' and calculate the mean heart rate per day
grouped_data3 = heart_df.groupby('start_date')['value'].mean().reset_index()

# Convert the 'start_date' to a datetime object
grouped_data3['start_date'] = pd.to_datetime(grouped_data3['start_date'])

# Create an interactive bar chart with Plotly
fig = px.bar(grouped_data3, x='start_date', y='value', labels={'value': 'Avg Heartbeat'}, 
             title='Avg Heart-rate Over Time', color='value', color_continuous_scale='reds')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Avg Heartbeat per min')
# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')

fig.show()



In [16]:
grouped_data3['value'].describe()

count     88.000000
mean      75.612090
std       19.483285
min       54.732034
25%       64.372151
50%       68.506502
75%       77.156359
max      142.108333
Name: value, dtype: float64

In [17]:
# Create a box plot 
fig = px.box(grouped_data3, y='value', labels={'value': 'Avg Heartbeat per min'},
             title='Box Plot of Daily Average Heart Rate')
fig.update_traces(hovertemplate='Value: %{y}')

# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.update_traces(marker=dict(color='red'))
fig.show()

**Some Insights From the Graph:**
* The average heart rate in the graph seems to hover around 80 bpm, with a few days showing higher values. This might be attributed to rigorous exercise on those days. 

---
**Some Insights From the Box plot:**
* Average Heart Rate: The mean of the daily average heart rates is around 75.6 bpm, which falls within the typical range for adults at rest (60-100 bpm). Note that these are averages over all readings in a day, including periods of activity, not resting measurements.
* Variability: The standard deviation of approximately 19.5 bpm indicates a moderate amount of variability in the daily averages.
* Minimum and Maximum Values: The daily averages range from approximately 54.7 bpm to around 142.1 bpm.
* Quartiles: The middle 50% of daily averages (25th to 75th percentile) fall between approximately 64.4 and 77.2 bpm, with a median of 68.5 bpm.
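
A daily mean hides how far the heart rate moved within a day, which matters when attributing high days to exercise. As a rough sketch, the minimum, mean and maximum per day could be computed together. The helper name `daily_heart_rate_summary` is hypothetical, and it assumes a dataframe like `heart_df` with a float `value` column.

```python
import pandas as pd

def daily_heart_rate_summary(heart_df):
    """Return the min, mean and max heart rate for each start_date."""
    summary = heart_df.groupby("start_date")["value"].agg(["min", "mean", "max"])
    return summary.reset_index()
```

Days with a high `max` but an ordinary `mean` would point to short bursts of activity rather than a generally elevated heart rate.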

---
## 3. Active Energy burned

In [18]:
# Filter the data by 'type' (append "& (df['sourceName'] == 'Watch Data')" to keep only the watch)
# .copy() makes an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning
filtered_df4 = df[df['type'] == 'HKQuantityTypeIdentifierActiveEnergyBurned'].copy()

filtered_df4['value'] = filtered_df4['value'].astype(float)

# Group the data by 'start_date' and sum 'value' to get the active energy burned per day
grouped_data4 = filtered_df4.groupby('start_date')['value'].sum()

# Convert the 'start_date' index to a datetime object
grouped_data4.index = pd.to_datetime(grouped_data4.index)

# Create a line plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=grouped_data4.index, y=grouped_data4.values, mode='lines+markers'))
fig.update_layout(title='Active Energy Burned Over Time', xaxis_title='Date', yaxis_title='Calories Burned')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.show()



In [19]:
grouped_data4.describe()

count     187.000000
mean      212.190749
std       203.129905
min         0.255000
25%        57.585000
50%       148.194000
75%       346.561000
max      1025.376000
Name: value, dtype: float64

In [20]:
# Create a box plot 
fig = px.box(grouped_data4, y='value', labels={'value': 'Calories Burned per Day'},
             title='Box Plot of Daily Active Energy Burned')
fig.update_traces(hovertemplate='Value: %{y}')

# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.update_traces(marker=dict(color='red'))
fig.show()

**Find which days are more active than others**

In [21]:
# Extract the day of the week (0 = Monday, 6 = Sunday)
# .assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
filtered_df4 = filtered_df4.assign(day_of_week=filtered_df4['start_date'].dt.dayofweek)

# Group the data by day of the week and sum the energy (a total per weekday, not an average)
grouped_data5 = filtered_df4.groupby('day_of_week')['value'].sum()

day_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
grouped_data5.index = day_labels

# Create a bar chart using Plotly Express
fig = px.bar(x=grouped_data5.index, y=grouped_data5.values,
             labels={'x': 'Day of the Week', 'y': 'Total Active Energy'}, 
             title='Total Active Energy by Day of the Week')
fig.update_xaxes(title='Day of the Week')
fig.update_yaxes(title='Total Active Energy')

# Customizing layout for a more appealing chart
fig.update_layout(
    xaxis=dict(tickmode='array', tickvals=list(range(7)), ticktext=day_labels),
    plot_bgcolor='white', bargap=0.1, showlegend=False)

fig.show()



> "Active energy is the energy that the user has burned due to physical activity and exercise." -Apple

Since the phone alone is a poor estimator of active energy, the values rise noticeably once data from the Apple Watch becomes available. 

* Average Active Energy Burned: The average active energy burned is around 212.19 calories per day.
* Variability: The standard deviation of approximately 203.13 calories, almost as large as the mean, suggests a wide day-to-day variation in the energy burned.
* Minimum and Maximum Values: The range between the minimum (0.255 calories) and the maximum (1025 calories) recorded per day shows a considerable spread of values.
* Quartiles: The middle 50% of days (25th to 75th percentile) range from approximately 57 to 346 calories, indicating a moderate range of activity levels.

Which days are more active than the others? 
According to the bar chart, Saturdays are the most active while Tuesdays are the least active. 
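
Because weekdays may not occur equally often in the export, an average of the daily totals per weekday can be a fairer comparison than a raw sum. The sketch below shows one way this might be computed; `mean_daily_total_by_weekday` and `WEEKDAYS` are hypothetical names, and the input is assumed to look like `filtered_df4`, with a datetime `start_date` column and a float `value` column.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def mean_daily_total_by_weekday(records):
    """Average the per-day totals for each weekday, ordered Monday to Sunday."""
    # First collapse the individual records into one total per calendar day
    daily = records.groupby("start_date")["value"].sum()
    # Then average those daily totals within each weekday
    by_weekday = daily.groupby(daily.index.day_name()).mean()
    # Weekdays with no data show up as NaN
    return by_weekday.reindex(WEEKDAYS)
```

If the ranking of weekdays changes under this measure, the difference comes from some weekdays simply having more recorded days than others.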


In [22]:
# Filter the data by 'type' (append "& (df['sourceName'] == 'Watch Data')" to keep only the watch)
# .copy() makes an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning
basal = df[df['type'] == 'HKQuantityTypeIdentifierBasalEnergyBurned'].copy()
basal['value'] = basal['value'].astype(float)

# Group the data by 'start_date' and sum 'value' to get the basal energy burned per day
bgrouped_data = basal.groupby('start_date')['value'].sum()

# Convert the 'start_date' index to a datetime object
bgrouped_data.index = pd.to_datetime(bgrouped_data.index)

# Create a line plot (y uses the basal series, not the active energy series)
fig = go.Figure()
fig.add_trace(go.Scatter(x=bgrouped_data.index, y=bgrouped_data.values, mode='lines+markers'))
fig.update_layout(title='Basal Energy Burned Over Time', xaxis_title='Date', yaxis_title='Calories Burned')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.show()



In [23]:
# Create a box plot 
fig = px.box(bgrouped_data, y='value', labels={'value': 'Calories Burned per Day'},
             title='Box Plot of Daily Basal Energy Burned')
fig.update_traces(hovertemplate='Value: %{y}')

# Set background color to transparent
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
fig.update_traces(marker=dict(color='red'))
fig.show()

In [24]:
# Create a figure with a secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add bar chart trace
bar_trace = go.Bar(x=grouped_data1['start_date'], y=grouped_data1['value'], name='Sum of Steps')
fig.add_trace(bar_trace, secondary_y=False)

# Add line chart trace (bgrouped_data holds the daily basal energy)
line_trace = go.Scatter(x=bgrouped_data.index, y=bgrouped_data.values, mode='lines', name='Basal Calories Burned')
fig.add_trace(line_trace, secondary_y=True)

# Update axis labels and titles
fig.update_xaxes(title_text='Date', row=1, col=1)
fig.update_yaxes(title_text='Sum of Steps', secondary_y=False, row=1, col=1)
fig.update_yaxes(title_text='Basal Calories Burned', secondary_y=True, row=1, col=1)

# Update layout
fig.update_layout(title='Steps and Basal Energy Burned Over Time')
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)',  
                  paper_bgcolor='rgba(0,0,0,0)')
# Show the combined plot
fig.show()
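
An overlay shows whether two series move together, but a single number can back it up. As a rough sketch, the daily steps and daily energy could be aligned on their shared dates and their Pearson correlation computed. The helper `daily_correlation` is hypothetical and assumes both inputs are Series indexed by date.

```python
import pandas as pd

def daily_correlation(steps, energy):
    """Pearson correlation of two date-indexed Series over their shared dates."""
    # join="inner" keeps only the dates present in both Series
    joined = pd.concat({"steps": steps, "energy": energy}, axis=1, join="inner")
    return joined["steps"].corr(joined["energy"])
```

With the variables above, this could be called as `daily_correlation(grouped_data1.set_index('start_date')['value'], bgrouped_data)`.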
