# Health app data preprocessing

The Health app contains numerous records. For the project, only the data related to physical activity were taken into account. Step count was considered a direct measure of physical activity, while heart rate was used to measure physical activity not associated with walking, such as exercise in a gym.

In [1]:
import pandas as pd

### 1. Data exploration

An online XML-to-CSV converter was used to convert the data extracted from the Health app to CSV format. Each data frame was given a unique name (`SC` for step count, `HR` for heart rate) so that it could be identified and retrieved easily during preprocessing.
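
The conversion step could also be done locally. The sketch below is an assumption-heavy illustration, not the converter used here: it assumes the export stores each measurement as a `Record` element whose attributes (`type`, `unit`, `creationDate`, `startDate`, `endDate`, `value`) map to the CSV columns. The sample XML, the type identifiers, and the helper name `records_to_frame` are hypothetical.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up two-record sample; the element and attribute names are assumptions.
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          creationDate="2018-10-28 21:05:47 +1000"
          startDate="2018-10-28 20:53:44 +1000"
          endDate="2018-10-28 20:53:47 +1000" value="16"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min"
          creationDate="2018-10-29 13:45:43 +1000"
          startDate="2018-10-29 13:44:42 +1000"
          endDate="2018-10-29 13:45:43 +1000" value="72"/>
</HealthData>"""


def records_to_frame(xml_text, record_type):
    """Collect the attributes of every Record of one type into a data frame."""
    root = ET.fromstring(xml_text)
    rows = [rec.attrib for rec in root.iter("Record")
            if rec.get("type") == record_type]
    return pd.DataFrame(rows)


steps = records_to_frame(SAMPLE_XML, "HKQuantityTypeIdentifierStepCount")
# XML attributes are always strings, so the values are converted explicitly.
steps["value"] = pd.to_numeric(steps["value"])
print(steps[["startDate", "value"]])
```

Filtering by `type` while reading yields one data frame per measurement, which mirrors the one-CSV-per-measurement layout used in this notebook.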

In [2]:
sc = pd.read_csv('StepCount.csv', sep=';')
sc.name = 'SC'  #label used later by rename_value as the new column name
sc

Unnamed: 0,version,unit,creationDate,startDate,endDate,value
0,software:12.0.1>,count,2018-10-28 21:05:47 +1000,2018-10-28 20:53:44 +1000,2018-10-28 20:53:47 +1000,16
1,software:5.0.1>,count,2018-10-28 21:33:21 +1000,2018-10-28 21:25:37 +1000,2018-10-28 21:25:40 +1000,8
2,software:5.0.1>,count,2018-10-28 22:25:01 +1000,2018-10-28 22:16:31 +1000,2018-10-28 22:16:34 +1000,2
3,software:12.0.1>,count,2018-10-28 22:28:22 +1000,2018-10-28 22:16:19 +1000,2018-10-28 22:16:24 +1000,11
4,software:12.0.1>,count,2018-10-29 08:00:34 +1000,2018-10-29 07:32:13 +1000,2018-10-29 07:32:31 +1000,28
...,...,...,...,...,...,...
61161,software:16.0.3>,count,2022-10-19 09:47:55 +1000,2022-10-19 09:09:51 +1000,2022-10-19 09:09:58 +1000,16
61162,software:16.0.3>,count,2022-10-19 10:53:21 +1000,2022-10-19 10:42:18 +1000,2022-10-19 10:42:21 +1000,2
61163,software:16.0.3>,count,2022-10-19 11:22:30 +1000,2022-10-19 10:58:40 +1000,2022-10-19 11:08:00 +1000,611
61164,software:16.0.3>,count,2022-10-19 11:26:39 +1000,2022-10-19 11:13:04 +1000,2022-10-19 11:23:04 +1000,348


In [3]:
hr = pd.read_csv('HeartRate.csv', sep=';')
hr.name = 'HR'  #label used later by rename_value as the new column name
hr

Unnamed: 0,version,unit,creationDate,startDate,endDate,value
0,software:5.0.1>,ms,2018-10-29 13:45:43 +1000,2018-10-29 13:44:42 +1000,2018-10-29 13:45:43 +1000,25.5881
1,software:5.0.1>,ms,2018-10-29 14:00:30 +1000,2018-10-29 13:59:25 +1000,2018-10-29 14:00:30 +1000,19.6373
2,software:5.0.1>,ms,2018-10-30 13:31:21 +1000,2018-10-30 13:30:16 +1000,2018-10-30 13:31:21 +1000,29.8444
3,software:5.1>,ms,2018-10-31 10:42:34 +1000,2018-10-31 10:41:29 +1000,2018-10-31 10:42:34 +1000,28.7467
4,software:5.1>,ms,2018-10-31 13:37:13 +1000,2018-10-31 13:36:11 +1000,2018-10-31 13:37:13 +1000,19.4100
...,...,...,...,...,...,...
793,software:7.5>,ms,2021-06-17 11:00:36 +1000,2021-06-17 10:59:30 +1000,2021-06-17 11:00:35 +1000,16.6264
794,software:7.5>,ms,2021-06-17 15:03:25 +1000,2021-06-17 15:02:20 +1000,2021-06-17 15:03:25 +1000,21.1793
795,software:7.5>,ms,2021-06-23 09:54:35 +1000,2021-06-23 09:53:35 +1000,2021-06-23 09:54:35 +1000,16.4564
796,software:7.5>,ms,2021-06-23 10:38:10 +1000,2021-06-23 10:37:05 +1000,2021-06-23 10:38:10 +1000,27.0266


### 2. Data cleaning

Only two columns are needed for this project: the start date and the value. Each data frame was therefore cleaned in five steps: the time zone offset was removed, the start date was converted to datetime format, both columns were renamed, and all other columns were dropped. A pipeline was created to automate these steps. It was not strictly necessary for just two data frames, but it demonstrates the technique.

In [4]:
#removing time zones: keep the text before the '+' offset (modifies df in place)
def remove_tz(df):
    df['startDate'] = df['startDate'].str.split('+').str[0].str.strip()
    return df
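
Splitting on `'+'` works for the `+1000` offset in this data, but it would not match a negative offset such as `-0500`. As a hedged alternative, the offset can be parsed with `%z` and then dropped while keeping the local wall-clock time. This is a minimal sketch on a made-up sample that assumes every row carries the same offset; it is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample in the same text format as the startDate column.
raw = pd.Series([
    "2018-10-28 20:53:44 +1000",
    "2018-10-29 07:32:13 +1000",
])

# Parse the UTC offset with %z, which gives time-zone-aware timestamps.
aware = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z")

# Drop the time zone but keep the local wall-clock time.
naive = aware.dt.tz_localize(None)
print(naive)
```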

In [5]:
#converting to datetime format
def convert_to_dt(df):
    df['startDate'] = pd.to_datetime(df['startDate'])
    return df

In [6]:
#renaming the start date column
def rename_startdate_to_date(df):
    df.rename(columns={'startDate':'Date'},inplace=True)
    return df

In [7]:
#renaming the value column using the unique name assigned earlier
def rename_value(df):
    df.rename(columns={'value':df.name},inplace=True)
    return df
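
The function above relies on `df.name`, an ordinary Python attribute that pandas does not carry over to new data frames, so it only works while the pipeline keeps passing the same object along. A possible alternative, sketched here with a hypothetical helper `rename_value_as` and made-up sample data, passes the label explicitly, since `pipe` forwards extra keyword arguments to the function.

```python
import pandas as pd


def rename_value_as(df, label):
    """Return a new data frame with the 'value' column renamed to label."""
    return df.rename(columns={"value": label})


# Made-up one-row sample with the two columns of interest.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2018-10-28 20:53:44"]),
    "value": [16],
})

# pipe passes label="SC" through to rename_value_as.
renamed = sample.pipe(rename_value_as, label="SC")
print(list(renamed.columns))
```

Because `rename` is called without `inplace=True`, the input data frame is left unchanged.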

In [8]:
#narrowing to two columns by position: third from last (Date) and last (renamed value)
def get_columns(df):
    return df.iloc[:, [-3, -1]]
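
Positional selection depends on the column order of the converted CSV. A hedged alternative is to select by name, which keeps working if the order changes. The helper `select_columns` and the sample row below are hypothetical.

```python
import pandas as pd


def select_columns(df, columns):
    """Return a copy of df that contains only the named columns."""
    return df.loc[:, list(columns)].copy()


# Made-up one-row sample with the column layout after the renaming steps.
sample = pd.DataFrame({
    "version": ["software:12.0.1>"],
    "unit": ["count"],
    "creationDate": ["2018-10-28 21:05:47 +1000"],
    "Date": pd.to_datetime(["2018-10-28 20:53:44"]),
    "endDate": ["2018-10-28 20:53:47 +1000"],
    "SC": [16],
})

narrowed = sample.pipe(select_columns, columns=["Date", "SC"])
print(narrowed)
```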

In [9]:
sc_cleaned = (sc
              .pipe(remove_tz)
              .pipe(convert_to_dt)
              .pipe(rename_startdate_to_date)
              .pipe(rename_value)
              .pipe(get_columns))
sc_cleaned

Unnamed: 0,Date,SC
0,2018-10-28 20:53:44,16
1,2018-10-28 21:25:37,8
2,2018-10-28 22:16:31,2
3,2018-10-28 22:16:19,11
4,2018-10-29 07:32:13,28
...,...,...
61161,2022-10-19 09:09:51,16
61162,2022-10-19 10:42:18,2
61163,2022-10-19 10:58:40,611
61164,2022-10-19 11:13:04,348


In [10]:
hr_cleaned = (hr
              .pipe(remove_tz)
              .pipe(convert_to_dt)
              .pipe(rename_startdate_to_date)
              .pipe(rename_value)
              .pipe(get_columns))
hr_cleaned

Unnamed: 0,Date,HR
0,2018-10-29 13:44:42,25.5881
1,2018-10-29 13:59:25,19.6373
2,2018-10-30 13:30:16,29.8444
3,2018-10-31 10:41:29,28.7467
4,2018-10-31 13:36:11,19.4100
...,...,...
793,2021-06-17 10:59:30,16.6264
794,2021-06-17 15:02:20,21.1793
795,2021-06-23 09:53:35,16.4564
796,2021-06-23 10:37:05,27.0266


### 3. Data concatenation

The two cleaned data frames were concatenated row-wise, so each row holds a value in either `SC` or `HR` and a missing value in the other column. The result was saved as a new CSV file, ready to be joined with the rest of the data.
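
The effect of concatenating two data frames that share only the `Date` column can be seen on a small made-up sample. This sketch uses `ignore_index=True` to renumber the rows in one call, which is an alternative to a separate `reset_index` step.

```python
import pandas as pd

# Made-up samples shaped like the two cleaned data frames.
sc_sample = pd.DataFrame({
    "Date": pd.to_datetime(["2018-10-28 20:53:44", "2018-10-28 21:25:37"]),
    "SC": [16, 8],
})
hr_sample = pd.DataFrame({
    "Date": pd.to_datetime(["2018-10-29 13:44:42"]),
    "HR": [25.5881],
})

# Rows are stacked; columns missing from one input are filled with NaN.
combined = pd.concat([sc_sample, hr_sample], ignore_index=True)
print(combined)
print(combined.dtypes)
```

The integer `SC` column becomes `float64` in the result because integer columns cannot hold `NaN`.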

In [11]:
hd = pd.concat([sc_cleaned, hr_cleaned])

In [12]:
hd.reset_index(drop=True,inplace=True)

In [13]:
hd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61964 entries, 0 to 61963
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    61964 non-null  datetime64[ns]
 1   SC      61166 non-null  float64       
 2   HR      798 non-null    float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 1.4 MB


In [14]:
hd

Unnamed: 0,Date,SC,HR
0,2018-10-28 20:53:44,16.0,
1,2018-10-28 21:25:37,8.0,
2,2018-10-28 22:16:31,2.0,
3,2018-10-28 22:16:19,11.0,
4,2018-10-29 07:32:13,28.0,
...,...,...,...
61959,2021-06-17 10:59:30,,16.6264
61960,2021-06-17 15:02:20,,21.1793
61961,2021-06-23 09:53:35,,16.4564
61962,2021-06-23 10:37:05,,27.0266


In [15]:
hd.to_csv('HealthData.csv',index=False)
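
When the saved file is read back for joining, the `Date` column comes in as text unless it is parsed again. The following sketch shows the round trip on a made-up sample, using an in-memory buffer in place of `HealthData.csv`.

```python
import io

import pandas as pd

# Made-up sample shaped like the concatenated data frame.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2018-10-28 20:53:44", "2018-10-29 13:44:42"]),
    "SC": [16.0, float("nan")],
    "HR": [float("nan"), 25.5881],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype of the Date column.
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```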