Copyright ©2021-2022. Stephen Rigden.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>.

### Health File Preparation

These instructions were correct as of 10/27/2021 with iOS 15.0.2.

#### Phone Export

See the README file for instructions on exporting the iPhone health file.

#### Prepare for Import

On your desktop, unzip the attached archive.
Rename the zip file to export_YYYY_MM_DD. Rename the export.xml file to export_YYYY_MM_DD.xml and move it to the Raw directory. Discard the unzipped folder and its remaining contents.
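
The manual steps above could also be scripted. What follows is a minimal sketch and not part of the original workflow: the function name `stage_export` and its parameters are hypothetical, and it assumes the archive holds exactly one member named `export.xml` (in any folder).

```python
import shutil
import zipfile
from pathlib import Path


def stage_export(zip_path: Path, raw_dir: Path, date_tag: str) -> Path:
    """Copy export.xml out of the archive to raw_dir/export_<date_tag>.xml."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"export_{date_tag}.xml"
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: exactly one member is named export.xml (in any folder).
        members = [name for name in archive.namelist()
                   if Path(name).name == "export.xml"]
        if len(members) != 1:
            raise ValueError(f"expected one export.xml, found {len(members)}")
        # Stream the member straight to its final name; nothing else is unzipped.
        with archive.open(members[0]) as source, open(target, "wb") as sink:
            shutil.copyfileobj(source, sink)
    return target
```

Because only the one member is written out, there is no unzipped folder to discard afterwards.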

In this notebook, set import_file_date below to the same YYYY_MM_DD date.

In [14]:
# Change import_file_date to match the date in the name of the current import file.
import_file_date = '2021_10_31'
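
A mistyped date only shows up later as a missing-file error. As a hedged illustration, the value could be checked up front; `check_date_tag` is a hypothetical helper, not something the notebook defines.

```python
from datetime import datetime


def check_date_tag(tag: str) -> str:
    """Return tag unchanged if it is a real date in YYYY_MM_DD form."""
    # strptime raises ValueError for malformed or impossible dates.
    parsed = datetime.strptime(tag, "%Y_%m_%d")
    # Round-trip to reject unpadded forms such as '2021_1_5'.
    if parsed.strftime("%Y_%m_%d") != tag:
        raise ValueError(f"expected zero-padded YYYY_MM_DD, got {tag!r}")
    return tag
```

Calling `check_date_tag(import_file_date)` before building the file path would fail early with a `ValueError` if the tag is not a valid date.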

In [15]:
# lxml is needed by pandas.read_xml so…
# noinspection PyUnresolvedReferences
import lxml
from pathlib import Path
import pandas

In [16]:
iphone_file = Path.cwd().parent / 'data' / 'raw' / f"export_{import_file_date}.xml"
heart_rate_pickle = Path.cwd().parent / 'data' / 'processed' / 'heart_preprocessed.pickle'

In [17]:
hf = pandas.read_xml(iphone_file)
hf.shape

(1232153, 41)
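
Reading all 1,232,153 rows and 41 columns in one call can use a lot of memory. If that becomes a problem, one alternative (not what this notebook does) is to stream the file with the standard library and keep only the attributes of interest. The sketch below assumes each measurement is a `Record` element whose data are stored as XML attributes; `iter_records` and `wanted` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, Optional


def iter_records(source, wanted: Iterable[str]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per <Record> element, keeping only the wanted attributes."""
    wanted = tuple(wanted)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            # Missing attributes come back as None.
            yield {key: elem.get(key) for key in wanted}
        # Drop the element's attributes, text and children once it is done.
        # The emptied element stays attached to its parent, so memory use
        # still grows, but far more slowly than building the full tree.
        elem.clear()
```

The dicts can be collected into a frame with `pandas.DataFrame(list(iter_records(iphone_file, [...])))`, where the list names the columns to keep.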

In [18]:
hf.columns

Index(['value', 'HKCharacteristicTypeIdentifierDateOfBirth',
       'HKCharacteristicTypeIdentifierBiologicalSex',
       'HKCharacteristicTypeIdentifierBloodType',
       'HKCharacteristicTypeIdentifierFitzpatrickSkinType',
       'HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse', 'type',
       'sourceName', 'sourceVersion', 'unit', 'creationDate', 'startDate',
       'endDate', 'device', 'MetadataEntry', 'Record', 'SensitivityPoint',
       'workoutActivityType', 'duration', 'durationUnit', 'totalDistance',
       'totalDistanceUnit', 'totalEnergyBurned', 'totalEnergyBurnedUnit',
       'WorkoutEvent', 'dateComponents', 'activeEnergyBurned',
       'activeEnergyBurnedGoal', 'activeEnergyBurnedUnit', 'appleMoveTime',
       'appleMoveTimeGoal', 'appleExerciseTime', 'appleExerciseTimeGoal',
       'appleStandHours', 'appleStandHoursGoal',
       'HeartRateVariabilityMetadataList', 'identifier', 'sourceURL',
       'fhirVersion', 'receivedDate', 'resourceFilePath'],
      dtype='object')

In [19]:
hf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232153 entries, 0 to 1232152
Data columns (total 41 columns):
 #   Column                                                     Non-Null Count    Dtype  
---  ------                                                     --------------    -----  
 0   value                                                      1229430 non-null  object 
 1   HKCharacteristicTypeIdentifierDateOfBirth                  1 non-null        object 
 2   HKCharacteristicTypeIdentifierBiologicalSex                1 non-null        object 
 3   HKCharacteristicTypeIdentifierBloodType                    1 non-null        object 
 4   HKCharacteristicTypeIdentifierFitzpatrickSkinType          1 non-null        object 
 5   HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse  1 non-null        object 
 6   type                                                       1230928 non-null  object 
 7   sourceName                                                 1231567 non-n

### Refine Dataset

Extract columns and rows with useful information.

#### Refine Columns (pass 1)

In [20]:
health_file = hf.loc[:, ['value', 'type', 'sourceName', 'sourceVersion', 'unit',
                         'creationDate', 'startDate', 'endDate', 'device']]
health_file.shape

(1232153, 9)

In [21]:
health_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232153 entries, 0 to 1232152
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   value          1229430 non-null  object
 1   type           1230928 non-null  object
 2   sourceName     1231567 non-null  object
 3   sourceVersion  1220858 non-null  object
 4   unit           1218922 non-null  object
 5   creationDate   1230957 non-null  object
 6   startDate      1230957 non-null  object
 7   endDate        1230957 non-null  object
 8   device         1188483 non-null  object
dtypes: object(9)
memory usage: 84.6+ MB


#### Refine Rows

In [22]:
health_file.head(7)

| | value | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-10-31 12:17:44 -0400 | | | | | | | | |
| 1 | | | | | | | | | |
| 2 | 5.75 | HKQuantityTypeIdentifierHeight | iPhone | 13.4 | ft | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | |
| 3 | 5.83333 | HKQuantityTypeIdentifierHeight | Stephen’s iPhone 11 | 15.0 | ft | 2021-09-21 13:20:50 -0400 | 2021-09-21 13:20:50 -0400 | 2021-09-21 13:20:50 -0400 | |
| 4 | 170 | HKQuantityTypeIdentifierBodyMass | iPhone | 13.4 | lb | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | 2020-03-30 16:01:19 -0400 | |
| 5 | 67 | HKQuantityTypeIdentifierHeartRate | Stephen’s Apple Watch | 6.1.3 | count/min | 2020-03-30 16:13:44 -0400 | 2020-03-30 16:11:49 -0400 | 2020-03-30 16:11:49 -0400 | <<HKDevice: 0x2839b57c0>, name:Apple Watch, ma... |
| 6 | 67 | HKQuantityTypeIdentifierHeartRate | Stephen’s Apple Watch | 6.1.3 | count/min | 2020-03-30 16:18:49 -0400 | 2020-03-30 16:16:17 -0400 | 2020-03-30 16:16:17 -0400 | <<HKDevice: 0x2839b57c0>, name:Apple Watch, ma... |
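
The date columns above hold strings that end in a UTC offset such as `-0400`. Before converting them it helps to decide whether to keep local wall-clock time or normalize to UTC. The following standard-library sketch shows what one such string means; `parse_health_date` is a hypothetical helper, and the notebook itself converts the column with pandas instead.

```python
from datetime import datetime, timezone


def parse_health_date(text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS +HHMM' (or -HHMM) into an aware datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")


stamp = parse_health_date("2020-03-30 16:11:49 -0400")
# The same instant expressed in UTC: four hours later on the clock.
stamp_utc = stamp.astimezone(timezone.utc)
```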


In [23]:
health_file.tail()

| | value | type | sourceName | sourceVersion | unit | creationDate | startDate | endDate | device |
|---|---|---|---|---|---|---|---|---|---|
| 1232148 | | DiagnosticReport | OhioHealth | | | | | | |
| 1232149 | | DiagnosticReport | OhioHealth | | | | | | |
| 1232150 | | DiagnosticReport | OhioHealth | | | | | | |
| 1232151 | | DiagnosticReport | OhioHealth | | | | | | |
| 1232152 | | Patient | OhioHealth | | | | | | |


In [24]:
health_file['type'].value_counts()

HKQuantityTypeIdentifierActiveEnergyBurned                575015
HKQuantityTypeIdentifierBasalEnergyBurned                 218602
HKQuantityTypeIdentifierHeartRate                         188620
HKQuantityTypeIdentifierDistanceWalkingRunning             67300
HKQuantityTypeIdentifierStepCount                          58980
HKQuantityTypeIdentifierAppleStandTime                     32647
HKQuantityTypeIdentifierAppleExerciseTime                  27353
HKQuantityTypeIdentifierEnvironmentalAudioExposure         18886
HKQuantityTypeIdentifierFlightsClimbed                     10822
HKCategoryTypeIdentifierAppleStandHour                      9669
HKQuantityTypeIdentifierStairDescentSpeed                   4412
HKQuantityTypeIdentifierWalkingSpeed                        3322
HKQuantityTypeIdentifierWalkingStepLength                   3322
HKQuantityTypeIdentifierHeartRateVariabilitySDNN            2979
HKQuantityTypeIdentifierWalkingDoubleSupportPercentage      2130
HKQuantityTypeIdentifierS

#### Select Heart-Related Rows and Refine Columns (pass 2)

In [25]:
# Boolean masks for the three heart-related record types.
heart_rate = health_file['type'] == 'HKQuantityTypeIdentifierHeartRate'
bp_diastolic = health_file['type'] == 'HKQuantityTypeIdentifierBloodPressureDiastolic'
bp_systolic = health_file['type'] == 'HKQuantityTypeIdentifierBloodPressureSystolic'
# Copy the selection so the assignments below act on an independent frame.
ds = health_file.loc[heart_rate | bp_diastolic | bp_systolic, ['value', 'type', 'startDate']].copy()
# Assign whole columns so the converted dtype replaces object.
ds['value'] = ds['value'].astype('float')
ds = ds.rename(columns={'startDate': 'date'})
ds['date'] = ds['date'].astype('datetime64[ns]')
ds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189184 entries, 5 to 189188
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   value   189184 non-null  float64       
 1   type    189184 non-null  object        
 2   date    189184 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 5.8+ MB
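
The three equality masks in the cell above can also be expressed with `Series.isin`, which keeps the list of record types in one place. This is an equivalent sketch rather than the notebook's own code; `HEART_TYPES` and `select_heart_rows` are hypothetical names.

```python
import pandas

# Record types kept by the row filter in the cell above.
HEART_TYPES = [
    'HKQuantityTypeIdentifierHeartRate',
    'HKQuantityTypeIdentifierBloodPressureDiastolic',
    'HKQuantityTypeIdentifierBloodPressureSystolic',
]


def select_heart_rows(frame: pandas.DataFrame) -> pandas.DataFrame:
    """Return value, type and startDate for rows whose type is in HEART_TYPES."""
    # isin is False for missing types, so rows without a type are dropped.
    mask = frame['type'].isin(HEART_TYPES)
    # .copy() returns an independent frame that is safe to modify later.
    return frame.loc[mask, ['value', 'type', 'startDate']].copy()
```

Adding another record type then only requires extending the list.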


In [26]:
ds.to_pickle(heart_rate_pickle)