Create and activate a virtual environment (make sure you use a recent Python release, i.e. > 3.7):

```bash
python -m venv venv
source venv/bin/activate
```
Install the dependencies:

```bash
pip install -r requirements.txt
```
Install the package in editable mode to allow for development and testing without import issues:

```bash
pip install -e .
```
Install the SlurmMultiNodePool package to make distributing the code on a cluster simple:

```bash
pip install git+https://github.com/NarayanSchuetz/SlurmMultiNodePool.git
```
Currently the code is designed to be run on a SLURM cluster (specifically the Stanford Sherlock cluster). The notebook `run_on_sherlock.ipynb` contains an example of how to run the code for a single `user_id` (one user) as well as how to run it on a cluster. Make sure that the folder structure of the input data is correct, or adjust the dataloader accordingly.
Without going into too much detail, the pre-processing pipeline mostly follows the logic displayed below:
The dataset is stored in the specified output folder, e.g. `data/mhc`. For each `user_id`, a separate directory named after the `user_id` is created. Inside this directory there are multiple `YYYY-MM-DD.npy` files, one for each day on which the user has any HealthKit data, as illustrated below.
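For illustration, the resulting layout might look like the following (the user IDs and dates are placeholders, not actual values):

```
data/mhc/
├── <user_id_1>/
│   ├── 2020-01-01.npy
│   ├── 2020-01-02.npy
│   └── metadata.parquet
└── <user_id_2>/
    ├── 2020-03-15.npy
    └── metadata.parquet
```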
Each numpy file contains a 2 x C x 1440 array, where 1440 is the number of minutes in a day and C is the number of data streams. Currently there are C = 24 data streams:
| Data Stream | Minute 1 | ... | Minute 1440 |
|---|---|---|---|
| HKQuantityTypeIdentifierStepCount | | ... | |
| HKQuantityTypeIdentifierActiveEnergyBurned | | ... | |
| HKQuantityTypeIdentifierDistanceWalkingRunning | | ... | |
| HKQuantityTypeIdentifierDistanceCycling | | ... | |
| HKQuantityTypeIdentifierAppleStandTime | | ... | |
| HKQuantityTypeIdentifierHeartRate | | ... | |
| HKWorkoutActivityTypeWalking | | ... | |
| HKWorkoutActivityTypeCycling | | ... | |
| HKWorkoutActivityTypeRunning | | ... | |
| HKWorkoutActivityTypeOther | | ... | |
| HKWorkoutActivityTypeMixedMetabolicCardioTraining | | ... | |
| HKWorkoutActivityTypeTraditionalStrengthTraining | | ... | |
| HKWorkoutActivityTypeElliptical | | ... | |
| HKWorkoutActivityTypeHighIntensityIntervalTraining | | ... | |
| HKWorkoutActivityTypeFunctionalStrengthTraining | | ... | |
| HKWorkoutActivityTypeYoga | | ... | |
| HKCategoryValueSleepAnalysisAsleep | | ... | |
| HKCategoryValueSleepAnalysisInBed | | ... | |
| stationary | | ... | |
| walking | | ... | |
| running | | ... | |
| automotive | | ... | |
| cycling | | ... | |
| not available | | ... | |
Along the tensor's first dimension, index 0 contains an indicator mask (1 = observed data; 0 = no observed data) with the same shape (C x 1440) as the data matrix, and index 1 contains the actual per-minute data.
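A minimal sketch of how such a day file could be read (the example path and the heart-rate channel index are assumptions based on the folder structure and table above, not part of the package API):

```python
import numpy as np

# Hypothetical example path; actual user_id directories and dates will differ.
path = "data/mhc/<user_id>/2020-01-01.npy"
day = np.load(path)  # shape: (2, C, 1440) with C = 24

mask = day[0]  # (C, 1440) indicator mask: 1 = observed minute, 0 = missing
data = day[1]  # (C, 1440) per-minute values

# Assuming the channel order follows the table above, heart rate is row index 5.
hr_idx = 5
observed_hr = data[hr_idx][mask[hr_idx] == 1]
print(f"Observed heart-rate minutes: {observed_hr.size} / 1440")
```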
In each created `user_id` directory there is also a `metadata.parquet` file which contains the following information for each data stream and date:

- `data_coverage`: Percentage of minutes that have valid data (neither 0 nor NaN) for each data stream
- `n`: Number of valid elements (from the mask) for each data stream
- `sum`: Sum of the valid elements for each data stream
- `sum_of_squares`: Sum of the squared valid elements for each data stream
- `date`: The date of the measurements
- `original_time_offset`: The timezone offset of the original data
This metadata is used to track data quality and can be used for global data standardization across multiple days and users.
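As an illustration of how these statistics could be aggregated for global standardization, the sketch below combines the per-user metadata files and derives per-stream means and standard deviations from the running sums; the glob pattern and the assumption that rows are indexed by data stream are ours, not guaranteed by the package:

```python
import glob

import numpy as np
import pandas as pd

# Hypothetical layout: one metadata.parquet per user_id directory.
meta = pd.concat(
    pd.read_parquet(p) for p in glob.glob("data/mhc/*/metadata.parquet")
)

# Aggregate the running statistics per data stream across all users and days.
# Assumption: the rows are indexed by data stream name.
agg = meta.groupby(level=0)[["n", "sum", "sum_of_squares"]].sum()

# Global mean and standard deviation from the running sums:
# mean = sum / n, var = sum_of_squares / n - mean^2
agg["mean"] = agg["sum"] / agg["n"]
agg["std"] = np.sqrt(agg["sum_of_squares"] / agg["n"] - agg["mean"] ** 2)
```

The resulting per-stream means and standard deviations can then be used to standardize the per-minute values across multiple days and users.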