# Import Health Data from Phone

Import the Health data exported from the phone and reduce it to heart-specific data (heart rate and blood pressure).

Change History
9/2/2021 Created
9/8/2021 Extract heart rate data
9/30/2021 Reviewed and tidied up for production use.
10/12/2021 Blood pressure dataset processing added.
10/12/2021 Preprocessing split off
10/24/2021 Drop ISO week date

### Health File Preparation

These instructions were correct as of 10/27/2021 with iOS 15.0.2.

#### Phone Export

Open the Health app on the iPhone. Touch the profile picture in the top-right corner, then touch Export All Health Data at the bottom of the next screen.
You will be presented with several options to share the resulting zip file. Save the file to Documents / Coding / Apple Medical / Pandas / Data / Raw

#### Prepare for Import

On your desktop, unzip the exported archive.
Rename the zip file to export_YYYY_MM_DD. Rename the export.xml file to export_YYYY_MM_DD.xml and move it to the Raw directory. Discard the unzipped folder and its remaining contents.
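
The manual steps above could also be scripted. The following is a minimal sketch, not part of the original workflow: the function name `prepare_export` and its parameters are hypothetical, and it assumes the archive contains exactly one member named `export.xml` (possibly inside a top-level folder). It copies that member directly to its dated name, so no unzipped folder is left to discard.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_export(zip_path: Path, raw_dir: Path, file_date: str) -> Path:
    """Copy export.xml out of the archive into raw_dir as export_<file_date>.xml."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"export_{file_date}.xml"
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: exactly one member is named export.xml,
        # possibly nested inside a top-level folder.
        members = [name for name in archive.namelist()
                   if Path(name).name == "export.xml"]
        if len(members) != 1:
            raise ValueError(f"expected one export.xml, found {len(members)}")
        # Stream the member straight to its final name.
        with archive.open(members[0]) as source, open(target, "wb") as destination:
            shutil.copyfileobj(source, destination)
    return target
```

A call might look like `prepare_export(Path('export.zip'), Path.cwd().parent / 'data' / 'raw', '2021_09_30')`, where the zip location is a placeholder for wherever the export was saved.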

In this notebook change the import_file_date below to YYYY_MM_DD.

In [1]:
# Change the import_file_date to match the date in the name of the current import file.
import_file_date = '2021_09_30'

In [2]:
# lxml is needed by pandas.read_xml so…
# noinspection PyUnresolvedReferences
import lxml
from pathlib import Path
import pandas

In [3]:
iphone_file = Path.cwd().parent / 'data' / 'raw' / f"export_{import_file_date}.xml"
heart_rate_pickle = Path.cwd().parent / 'data' / 'processed' / 'heart_preprocessed.pickle'

#### TODO

As of 9/8/2021 it's believed that the simple read_xml is not getting all of the relevant data from
the iPhone export file.
Further study of the export file structure is required.
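
One way to start that study is to stream the file and count which elements it contains. The sketch below uses only the standard library. The function name `summarize_export` is hypothetical, and it assumes the records of interest are `Record` elements carrying a `type` attribute, which matches the `Record` and `type` columns shown further down but has not been verified against the full export.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_export(xml_path):
    """Stream the export once, counting element tags and Record types."""
    tag_counts = Counter()
    record_types = Counter()
    for _, element in ET.iterparse(xml_path, events=("end",)):
        tag_counts[element.tag] += 1
        if element.tag == "Record":
            # Assumption: each Record names its measurement in a "type" attribute.
            record_types[element.get("type")] += 1
        # Drop the element's attributes and children once counted to limit memory use.
        element.clear()
    return tag_counts, record_types
```

Comparing `tag_counts['Record']` with the row count from `pandas.read_xml` would show whether nested elements are being lost or mixed in. If they are, the `xpath` argument of `pandas.read_xml` (for example `xpath='.//Record'`) may be worth trying as a way to select only those elements.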

In [4]:
hf = pandas.read_xml(iphone_file)
hf.shape

(1198262, 41)

In [5]:
hf.columns

Index(['value', 'HKCharacteristicTypeIdentifierDateOfBirth',
       'HKCharacteristicTypeIdentifierBiologicalSex',
       'HKCharacteristicTypeIdentifierBloodType',
       'HKCharacteristicTypeIdentifierFitzpatrickSkinType',
       'HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse', 'type',
       'sourceName', 'sourceVersion', 'unit', 'creationDate', 'startDate',
       'endDate', 'device', 'MetadataEntry', 'Record', 'SensitivityPoint',
       'workoutActivityType', 'duration', 'durationUnit', 'totalDistance',
       'totalDistanceUnit', 'totalEnergyBurned', 'totalEnergyBurnedUnit',
       'WorkoutEvent', 'dateComponents', 'activeEnergyBurned',
       'activeEnergyBurnedGoal', 'activeEnergyBurnedUnit', 'appleMoveTime',
       'appleMoveTimeGoal', 'appleExerciseTime', 'appleExerciseTimeGoal',
       'appleStandHours', 'appleStandHoursGoal',
       'HeartRateVariabilityMetadataList', 'identifier', 'sourceURL',
       'fhirVersion', 'receivedDate', 'resourceFilePath'],
      dty

In [6]:
hf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1198262 entries, 0 to 1198261
Data columns (total 41 columns):
 #   Column                                                     Non-Null Count    Dtype  
---  ------                                                     --------------    -----  
 0   value                                                      1195719 non-null  object 
 1   HKCharacteristicTypeIdentifierDateOfBirth                  1 non-null        object 
 2   HKCharacteristicTypeIdentifierBiologicalSex                1 non-null        object 
 3   HKCharacteristicTypeIdentifierBloodType                    1 non-null        object 
 4   HKCharacteristicTypeIdentifierFitzpatrickSkinType          1 non-null        object 
 5   HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse  1 non-null        object 
 6   type                                                       1197094 non-null  object 
 7   sourceName                                                 1197707 non-n

# Refine Dataset

Extract columns and rows with useful information.

#### Refine columns (pass 1)

In [7]:
health_file = hf.loc[:, ['value', 'type', 'sourceName', 'sourceVersion', 'unit',
                         'creationDate', 'startDate', 'endDate', 'device']]
health_file.shape

(1198262, 9)

In [8]:
health_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1198262 entries, 0 to 1198261
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   value          1195719 non-null  object
 1   type           1197094 non-null  object
 2   sourceName     1197707 non-null  object
 3   sourceVersion  1186999 non-null  object
 4   unit           1185798 non-null  object
 5   creationDate   1197098 non-null  object
 6   startDate      1197098 non-null  object
 7   endDate        1197098 non-null  object
 8   device         1157755 non-null  object
dtypes: object(9)
memory usage: 82.3+ MB


#### Refine Rows

In [9]:
health_file.head(7)

|   | value | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-30 07:12:09 -0400 | | | | | | | | |
| 1 | | | | | | | | | |
| 2 | 5.75 | HKQuantityTypeIdentifierHeight | iPhone | 13.4 | ft | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | |
| 3 | 5.83333 | HKQuantityTypeIdentifierHeight | Stephen’s iPhone 11 | 15.0 | ft | 2021-09-21 13:20:50 -0400 | 2021-09-21 13:20:50 -0400 | 2021-09-21 13:20:50 -0400 | |
| 4 | 170 | HKQuantityTypeIdentifierBodyMass | iPhone | 13.4 | lb | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | |
| 5 | 67 | HKQuantityTypeIdentifierHeartRate | Stephen’s Apple Watch | 6.1.3 | count/min | 2020-03-30 16:13:44 -0400 | 2020-03-30 16:11:49 -0400 | 2020-03-30 16:11:49 -0400 | `<<HKDevice: 0x282f77b10>, name:Apple Watch, ma...` |
| 6 | 67 | HKQuantityTypeIdentifierHeartRate | Stephen’s Apple Watch | 6.1.3 | count/min | 2020-03-30 16:18:49 -0400 | 2020-03-30 16:16:17 -0400 | 2020-03-30 16:16:17 -0400 | `<<HKDevice: 0x282f77b10>, name:Apple Watch, ma...` |


In [10]:
health_file.tail()

|   | value | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | device |
|---|---|---|---|---|---|---|---|---|---|
| 1198257 | | DiagnosticReport | OhioHealth | | | | | | |
| 1198258 | | DiagnosticReport | OhioHealth | | | | | | |
| 1198259 | | DiagnosticReport | OhioHealth | | | | | | |
| 1198260 | | DiagnosticReport | OhioHealth | | | | | | |
| 1198261 | | Patient | OhioHealth | | | | | | |


In [11]:
health_file['type'].value_counts()

HKQuantityTypeIdentifierActiveEnergyBurned                566485
HKQuantityTypeIdentifierBasalEnergyBurned                 220225
HKQuantityTypeIdentifierHeartRate                         174296
HKQuantityTypeIdentifierDistanceWalkingRunning             64556
HKQuantityTypeIdentifierStepCount                          56605
HKQuantityTypeIdentifierAppleStandTime                     30697
HKQuantityTypeIdentifierAppleExerciseTime                  26048
HKQuantityTypeIdentifierEnvironmentalAudioExposure         17874
HKQuantityTypeIdentifierFlightsClimbed                     10102
HKCategoryTypeIdentifierAppleStandHour                      9147
HKQuantityTypeIdentifierStairDescentSpeed                   3903
HKQuantityTypeIdentifierWalkingSpeed                        3060
HKQuantityTypeIdentifierWalkingStepLength                   3060
HKQuantityTypeIdentifierHeartRateVariabilitySDNN            2850
HKQuantityTypeIdentifierWalkingDoubleSupportPercentage      2002
HKQuantityTypeIdentifierS

#### Select rows with heart types. Refine columns (pass 2)

In [12]:
# Boolean masks for the three heart-related record types.
heart_rate = health_file['type'] == 'HKQuantityTypeIdentifierHeartRate'
bp_diastolic = health_file['type'] == 'HKQuantityTypeIdentifierBloodPressureDiastolic'
bp_systolic = health_file['type'] == 'HKQuantityTypeIdentifierBloodPressureSystolic'
# Copy the selection so the conversions below act on an independent frame.
ds = health_file.loc[heart_rate | bp_diastolic | bp_systolic, ['value', 'type', 'startDate']].copy()
# Assign whole columns so the new dtypes replace the original object dtype.
ds['value'] = ds['value'].astype('float')
ds = ds.rename(columns={'startDate': 'date'})
ds['date'] = ds['date'].astype('datetime64[ns]')
ds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174654 entries, 5 to 174658
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   value   174654 non-null  float64       
 1   type    174654 non-null  object        
 2   date    174654 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 5.3+ MB


In [13]:
ds.to_pickle(heart_rate_pickle)