Apple Health ETL Project

Architecture Diagram

(see `Apple_Health_ETL.drawio` in the repository)

Description

This project involves an ETL (Extract, Transform, Load) process to analyze sleep data exported from Apple Health to iCloud in XML format.

The data is then processed and transformed using AWS services, queried through Amazon Athena, and visualized using a Streamlit dashboard.

Project Overview

  • Export Health Data from iPhone to iCloud in XML format.
  • Load the data into an Amazon S3 bucket.
  • Set up an AWS Lambda function to process the XML data into a CSV file and store it in the S3 bucket.
  • Set up another AWS Lambda function to further transform the data using DuckDB and Pandas.
  • Forward the transformed data to another Lambda function, which saves it as Parquet files in an S3 bucket, partitioned by year.
  • Set up AWS Glue crawlers to crawl the Parquet files stored in the S3 bucket and register them as year-partitioned tables in the AWS Glue Data Catalog.
  • Finally, host a Streamlit dashboard on an Amazon EC2 instance to display sleep analytics over the years.
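The Lambda steps above are wired together through S3 object events. A minimal sketch of how a handler can pull the bucket and key out of the event payload (function names here are illustrative, not taken from this repo):

```python
import urllib.parse

def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 ObjectCreated event payload."""
    records = []
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (e.g. spaces become '+').
        key = urllib.parse.unquote_plus(rec["s3"]["object"]["key"])
        records.append((bucket, key))
    return records

def lambda_handler(event, context):
    for bucket, key in parse_s3_event(event):
        print(f"new object: s3://{bucket}/{key}")
    return {"statusCode": 200}
```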

Tech Stack

  • AWS Services: S3, Lambda, Glue, Athena, SNS, EC2
  • Python Libraries: boto3, lxml, s3fs, awswrangler, pandas, duckdb, streamlit
  • Data Processing: DuckDB
  • Analytics and Visualization: Athena, Streamlit

Prerequisites

The tech stack above and an iCloud account with Apple Health data regularly synced from an Apple Watch are required. If you don't have an account, you can download my Health dataset.

Workflow

  1. The data is exported from Apple Health to iCloud in XML format.

  2. The export is then transferred from local storage to an S3 bucket using the AWS CLI.
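The same upload can be scripted with boto3 (already in the tech stack). The bucket name and key below are placeholders, not the ones used in this project:

```python
try:
    import boto3  # provided in the Lambda runtime, or `pip install boto3`
except ImportError:  # lets the pure helper below run without AWS dependencies
    boto3 = None

RAW_BUCKET = "apple-health-raw"  # placeholder bucket name

def object_uri(bucket: str, key: str) -> str:
    """S3 URI for a bucket/key pair."""
    return f"s3://{bucket}/{key}"

def upload_export(local_path: str, bucket: str = RAW_BUCKET,
                  key: str = "raw/export.xml") -> str:
    """Upload the Apple Health export and return its S3 URI."""
    boto3.client("s3").upload_file(local_path, bucket, key)
    return object_uri(bucket, key)
```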


  3. An AWS Lambda function (Process_XML) transforms the raw XML data and saves it as CSV files in an S3 bucket. If the function's execution fails, its error response is redirected to an AWS SNS topic.
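A sketch of what the XML-to-CSV step might look like. The project lists lxml in its tech stack; the stdlib `xml.etree` module used here exposes the same `iterparse` API, and the `Record` attributes (`type`, `startDate`, `endDate`, `value`) follow the standard Apple Health export schema. The SNS notification is a Lambda failure destination and is not shown:

```python
import csv
import io
import xml.etree.ElementTree as ET  # lxml.etree offers the same iterparse API

SLEEP_TYPE = "HKCategoryTypeIdentifierSleepAnalysis"

def sleep_records_to_csv(xml_bytes: bytes) -> str:
    """Pull sleep Record elements out of an Apple Health export, render as CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["start_date", "end_date", "value"])
    # Stream the document instead of loading it whole: exports can be huge.
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Record" and elem.get("type") == SLEEP_TYPE:
            writer.writerow([elem.get("startDate"), elem.get("endDate"),
                             elem.get("value")])
        elem.clear()  # free each element once processed to keep memory flat
    return out.getvalue()
```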


The processed CSV files are stored in the S3 bucket.

  4. Another Lambda function (Transform Health) is triggered by an object PUT in the S3 bucket. It further transforms the data using DuckDB and forwards the result to another Lambda function (To_Parquet), which saves the data as Parquet files partitioned by year.
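The nightly roll-up that DuckDB performs can be sketched as follows. `NIGHTLY_SQL` is an illustrative DuckDB query (the table name `sleep_raw` is an assumption; the actual SQL is not shown in this README), and `nightly_hours` is a plain-Python equivalent:

```python
from datetime import datetime

# Illustrative DuckDB query for the nightly roll-up; table and column names
# are assumptions, not taken from the actual Lambda.
NIGHTLY_SQL = """
SELECT CAST(end_date AS DATE)                            AS night,
       SUM(epoch(end_date) - epoch(start_date)) / 3600.0 AS hours_asleep
FROM sleep_raw
WHERE value LIKE '%Asleep%'
GROUP BY 1
ORDER BY 1
"""

FMT = "%Y-%m-%d %H:%M:%S %z"  # timestamp format used by the Apple Health export

def nightly_hours(rows):
    """Plain-Python equivalent: (start, end, value) rows -> {night: hours asleep}."""
    totals = {}
    for start, end, value in rows:
        if "Asleep" not in value:
            continue  # skip InBed / Awake intervals
        s, e = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
        night = e.date().isoformat()  # attribute the interval to its wake-up date
        totals[night] = totals.get(night, 0.0) + (e - s).total_seconds() / 3600.0
    return totals
```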


The transformed Parquet files are stored in the S3 bucket, partitioned by year.

  5. The same Lambda function (To_Parquet) then triggers AWS Glue crawlers to crawl the Parquet files in the S3 bucket. The crawled schema is stored in AWS Glue Data Catalog tables, partitioned by year.
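A sketch of the partitioning and crawler hand-off. The Hive-style `year=` key layout matches what Glue crawlers recognize as partitions; the key prefix and crawler name below are assumptions:

```python
try:
    import boto3  # provided in the Lambda runtime
except ImportError:  # lets the pure helper below run without AWS dependencies
    boto3 = None

def partition_key(year: int, filename: str, prefix: str = "sleep") -> str:
    """Hive-style year partition path, e.g. sleep/year=2022/data.parquet."""
    return f"{prefix}/year={year}/{filename}"

def start_crawler(name: str = "health_parquet_crawler") -> None:
    # Kick off the Glue crawler that registers the Parquet partitions
    # in the Data Catalog. The crawler name here is a placeholder.
    boto3.client("glue").start_crawler(Name=name)
```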


Querying Data using Athena:

Example Query:

```sql
SELECT
    FROM_UNIXTIME(recorded_on / 1000) AS recorded_on,
    avg_heart_rate,
    year
FROM heart_data_parquet
WHERE year = '2022';
```

In the above query, the WHERE clause filters on the year partition column, so Athena scans only that partition instead of the whole table. The FROM_UNIXTIME() function converts the Unix epoch value (stored in milliseconds, hence the division by 1000) to a TIMESTAMP.
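For reference, the same epoch-millisecond conversion in Python (Athena's FROM_UNIXTIME uses the session time zone; UTC is assumed here):

```python
from datetime import datetime, timezone

def from_unix_ms(ms: int) -> datetime:
    """Python equivalent of Athena's FROM_UNIXTIME(recorded_on / 1000)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```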


  6. The same Lambda function also starts an EC2 instance and launches the Streamlit dashboard app.
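A sketch of how the instance might be started and the dashboard URL derived with boto3; the instance ID is a placeholder, and 8501 is Streamlit's default port:

```python
try:
    import boto3  # provided in the Lambda runtime
except ImportError:  # lets the pure helper below run without AWS dependencies
    boto3 = None

def dashboard_url(public_ip: str, port: int = 8501) -> str:
    """URL of the Streamlit app on its default port."""
    return f"http://{public_ip}:{port}"

def start_dashboard_instance(instance_id: str) -> str:
    """Start the EC2 instance, wait for it, and return the dashboard URL."""
    ec2 = boto3.client("ec2")
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    desc = ec2.describe_instances(InstanceIds=[instance_id])
    ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]
    return dashboard_url(ip)
```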


You can access the app at the following URL: http://<your_instance_public_ip>:8501

Replace <your_instance_public_ip> with your EC2 instance's public IPv4 address.

Configuring Auto Launcher:

On the EC2 instance, open the crontab editor:

crontab -e

Add the following line to the editor:

@reboot /home/ec2-user/.local/bin/streamlit run /home/ec2-user/<path_to_streamlit_app> --server.port 8501

Replace <path_to_streamlit_app> with the path to your Streamlit app.

Now, whenever the EC2 instance restarts, the Streamlit app will automatically run on port 8501.

Streamlit Dashboard

Here's a quick look at the Streamlit dashboard hosted on EC2:

Screen.Recording.2023-04-30.at.3.40.14.AM.mov
