# Final Project

Tony Nguyen

CPSC 222 01

Dr. Gina Sprint

December 13th, 2022

In [1]:
import utils
import importlib

importlib.reload(utils)

<module 'utils' from '/Users/tony/Documents/Python/CPSC222/FinalProject/utils.py'>

## Introduction

For our class final project, I chose to work primarily with my **Apple Health Sleep Data** and **Netflix Watching History**.

Throughout this project, I want to learn more about my sleep history, specifically how long I sleep each night on average. Getting a decent amount of sleep is essential because we all work in an academically demanding environment; sleeping well is one of the easiest ways to keep ourselves from burning out.

I am a big Netflix user. Watching series on Netflix is one of my ways to relax after school and work. Therefore, I want to know whether there is a correlation between the number of series episodes or movies I watch and the total time I sleep each night.

I hope the results of this project provide a glimpse of my sleep routine, something I usually take for granted without thinking about how much I actually slept the night before. At the same time, I want to know whether Netflix plays a role in my sleep routine, since I find myself binge-watching a lot. In addition, because the Apple Health data is extensive and covers many other data types, I hope to analyze other health data in the future and find something equally interesting.

**TODO: STAKEHOLDERS INTERESTED IN THE RESULT AND HYPOTHESES**

### Apple Health Sleep Dataset

The original Apple Health Dataset takeout contains several tables in different formats, including electrocardiograms in CSV, workout routes in GPX, clinical health reports in JSON, and an XML file with the other data points that my phone and my watch collect. The file type I work with is XML.

This original XML file, `export.xml`, contains entries for many different types of health information, including the sleep data that I need. Because the file is large, at approximately 390MB, loading it directly into a DataFrame takes my computer a very long time. That is unwise, since I need to re-run this notebook many times while coding.

Therefore, in the file `healthdata_preprocessing.py`, I load the original XML file into a DataFrame, then export it to a CSV file called `export_converted.csv`. Although the exported CSV is still relatively big at roughly 260MB, it loads significantly faster than the XML file.
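The conversion step can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `healthdata_preprocessing.py`: the function name `xml_to_csv`, the tiny `SAMPLE_XML` string, and the choice of `parser="etree"` are assumptions made so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical two-record stand-in for export.xml (the real file is about 390MB).
SAMPLE_XML = """<HealthData>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis" sourceName="Tony"
          startDate="2022-11-18 23:58:22-08:00" endDate="2022-11-19 07:28:58-08:00"
          value="HKCategoryValueSleepAnalysisInBed"/>
  <Record type="HKQuantityTypeIdentifierHeartRateVariabilitySDNN" sourceName="Watch"
          unit="ms" startDate="2022-11-28 18:22:47-08:00"
          endDate="2022-11-28 18:23:47-08:00" value="13.2906"/>
</HealthData>"""


def xml_to_csv(xml_source, csv_target):
    """Read every <Record> element into a DataFrame and write it out as CSV."""
    # Each XML attribute of a <Record> element becomes one column.
    records = pd.read_xml(xml_source, xpath=".//Record", parser="etree")
    records.to_csv(csv_target, index=False)
    return records


csv_buffer = io.StringIO()
records = xml_to_csv(io.StringIO(SAMPLE_XML), csv_buffer)
print(records[["type", "startDate", "endDate"]])
```

With the real files, the two arguments would be file paths instead of in-memory buffers. The slow XML parse then happens once, and later notebook runs only read the CSV.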

From this point forward, I use the CSV file to work with the health data. After the format conversion, there are a total of 946,979 instances spread across different data types. The attributes of this dataset are as follows:
* `type`: The type of data point. I use this attribute to filter the sleep data later.
* `sourceName`: Where the data comes from. The most common sources are my iPhone and my Apple Watch. Other sources include the Health app and *MyChart*, the portal that my hospital uses.
* `sourceVersion`: The software version of `sourceName`.
* `unit`: The unit of the record, which depends on the type of data.
* `creationDate`: The time when an instance was logged.
* `startDate`: For entries that record a period, the time when the record starts.
* `endDate`: For entries that record a period, the time when the record ends.
* `value`: The value of a record. For sleep data, this is a category label such as `HKCategoryValueSleepAnalysisInBed`.
* `device`: Hardware information, if applicable. It contains `sourceName`'s name, manufacturer, model, hardware version, and software version.
* `MetadataEntry`: Contains a key-value pair.
* `HeartRateVariabilityMetadataList`: For heart rate data, a list of instantaneous beats-per-minute readings.

Apple has a dedicated sleep mode that I have been using to record my sleep. Every night, just before I put my phone down and go to sleep, I turn sleep mode on, which prompts the phone to start counting my sleep time. The next morning, at the wake-up time I set earlier, the alarm starts quietly and grows louder so that it does not wake me abruptly, which makes for a better sleep experience. If I wake up during the night and go back to sleep later, the phone can automatically subtract the time I was awake from my sleep time. The process becomes even more precise when paired with an Apple Watch, which can analyze sleep stages and respiratory rate. This information could be very helpful for further analysis, but since I rarely wear my watch to sleep, I decided not to include the sleep-stage data, as it is too sparse on a daily basis. Read more about Apple's sleep mode [here](https://support.apple.com/en-us/HT211685).

**TODO: CLASSIFICATION TASK**

**Important Notes**: 
1. For the `pandas.read_xml()` function to work, I manually removed the first 213 lines (out of 1,443,267 lines) of the XML file. Those 213 lines are Apple's description of how to interpret the instances.
1. When I exported the data, Apple returned a dataset with every timestamp expressed in Pacific Standard Time (PST), based on my current time zone. Because I occasionally travel between the U.S. and Vietnam, some instances were recorded in a different time zone and then converted to PST. As a result, those instances show my sleep starting at unconventional times, such as the middle of the day, which may cause inaccuracies in the analysis.
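To make the second note concrete, the sketch below converts one made-up night of sleep between time zones. The timestamps are hypothetical (not taken from the export), and the zone names `America/Los_Angeles` and `Asia/Ho_Chi_Minh` are assumed stand-ins for the two locations.

```python
import pandas as pd

# Hypothetical night of sleep in Vietnam, stored with a Pacific (-08:00) offset.
exported = pd.Series(
    ["2022-01-10 08:30:00-08:00", "2022-01-10 16:00:00-08:00"],
    index=["startDate", "endDate"],
)

# Parse to UTC first, then view the same instants in each time zone.
as_exported = pd.to_datetime(exported, utc=True).dt.tz_convert("America/Los_Angeles")
as_local = as_exported.dt.tz_convert("Asia/Ho_Chi_Minh")

print(as_exported.dt.strftime("%Y-%m-%d %H:%M"))
print(as_local.dt.strftime("%Y-%m-%d %H:%M"))
```

In this example, a night that ran from 23:30 to 07:00 in Vietnam appears as 08:30 to 16:00 in Pacific time, which is the "middle of the day" effect described above.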

### Netflix Viewing History Dataset
The Netflix data comes in CSV format and contains my viewing history since I started using the service. The dataset has a single table with the following attributes:
* `Title`: The name of a movie or a series episode.
* `Date`: The watch date. Netflix does not provide the specific time at which I watched, or the time zone in which this information is recorded, so I assume the dates are recorded based on my watch location, which may be in different time zones.

There are 2,578 instances in this dataset, spanning January 25, 2020 to November 29, 2022, the day I downloaded the dataset.
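As a rough illustration of how a two-column history like this can be turned into a per-day count, consider the sketch below. The titles are invented placeholders, and the `%m/%d/%y` date format is an assumption about the export, not something confirmed by the dataset description.

```python
import io

import pandas as pd

# Hypothetical rows in the Title/Date layout; titles and date format are assumptions.
SAMPLE_CSV = """Title,Date
"Show A: Season 1: Episode 1",11/28/22
"Show A: Season 1: Episode 2",11/28/22
"Movie B",11/29/22
"""

history = pd.read_csv(io.StringIO(SAMPLE_CSV))
history["Date"] = pd.to_datetime(history["Date"], format="%m/%d/%y")

# Number of titles (episodes or movies) watched on each date.
titles_per_day = history.groupby("Date")["Title"].count()
print(titles_per_day)
print(history["Date"].min(), history["Date"].max())
```

A daily count of this kind is the quantity that would later be compared against nightly sleep time.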

**TODO: CLASSIFICATION TASK**

## Data Analysis
### Data Preparation and Cleaning
#### Apple Health Dataset
From the converted CSV file produced in the preprocessing step, I load the data into a pandas DataFrame. I also convert `creationDate`, `startDate`, and `endDate` into datetime format.

In [2]:
df = utils.load_apple()
df.tail(10)



Unnamed: 0,originalIndex,type,sourceName,sourceVersion,unit,creationDate,startDate,endDate,value,device,MetadataEntry,HeartRateVariabilityMetadataList
946970,946970,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-27 14:29:24-08:00,2022-11-27 14:28:23-08:00,2022-11-27 14:29:22-08:00,72.2227,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946971,946971,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-27 16:26:12-08:00,2022-11-27 16:25:01-08:00,2022-11-27 16:25:57-08:00,44.5048,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946972,946972,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-27 18:30:47-08:00,2022-11-27 18:29:46-08:00,2022-11-27 18:30:46-08:00,26.2327,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946973,946973,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 08:00:59-08:00,2022-11-28 07:59:58-08:00,2022-11-28 08:00:58-08:00,27.0778,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946974,946974,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 08:47:54-08:00,2022-11-28 08:46:53-08:00,2022-11-28 08:47:52-08:00,21.1496,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946975,946975,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 10:24:48-08:00,2022-11-28 10:23:47-08:00,2022-11-28 10:24:46-08:00,24.8409,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946976,946976,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 12:28:40-08:00,2022-11-28 12:27:39-08:00,2022-11-28 12:28:39-08:00,13.0849,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946977,946977,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 14:23:52-08:00,2022-11-28 14:22:51-08:00,2022-11-28 14:23:45-08:00,17.1092,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946978,946978,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 16:34:54-08:00,2022-11-28 16:33:53-08:00,2022-11-28 16:34:53-08:00,26.8948,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
946979,946979,HKQuantityTypeIdentifierHeartRateVariabilitySDNN,Tony’s Apple Watch,9.1,ms,2022-11-28 18:23:48-08:00,2022-11-28 18:22:47-08:00,2022-11-28 18:23:47-08:00,13.2906,"<<HKDevice: 0x2815fd360>, name:Apple Watch, ma...",,
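The loading step described above might look something like the following. This is a sketch under assumptions rather than the real `utils.load_apple()`: the function name `load_health_csv`, the `low_memory=False` option, and the two-row sample (modeled on rows from this notebook) are all illustrative.

```python
import io

import pandas as pd

DATE_COLUMNS = ["creationDate", "startDate", "endDate"]

# Small sample in the layout of export_converted.csv (only a few columns kept).
SAMPLE_CSV = """type,creationDate,startDate,endDate,value
HKCategoryTypeIdentifierSleepAnalysis,2022-11-19 08:00:03-08:00,2022-11-18 23:58:22-08:00,2022-11-19 07:28:58-08:00,HKCategoryValueSleepAnalysisInBed
HKQuantityTypeIdentifierHeartRateVariabilitySDNN,2022-11-28 18:23:48-08:00,2022-11-28 18:22:47-08:00,2022-11-28 18:23:47-08:00,13.2906
"""


def load_health_csv(csv_source):
    """Load the converted CSV and parse the three timestamp columns."""
    # low_memory=False reads the file in one pass so column types are inferred consistently.
    df = pd.read_csv(csv_source, low_memory=False)
    for column in DATE_COLUMNS:
        df[column] = pd.to_datetime(df[column])
    return df


health = load_health_csv(io.StringIO(SAMPLE_CSV))
print(health.dtypes)
print(health["endDate"] - health["startDate"])
```

Once the columns are real datetimes, subtracting `startDate` from `endDate` gives the length of each record directly. If a file mixes several UTC offsets, passing `utc=True` to `pd.to_datetime` is one way to keep a single datetime dtype.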


Since this project is concerned only with my sleep data, I keep only the instances whose `type` is sleep data and discard the rest.

In [3]:
df = utils.sleep_filtering(df)
df.tail(10)

Unnamed: 0,originalIndex,type,sourceName,sourceVersion,unit,creationDate,startDate,endDate,value,device,MetadataEntry,HeartRateVariabilityMetadataList
937029,937029,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-19 08:00:03-08:00,2022-11-18 23:58:22-08:00,2022-11-19 07:28:58-08:00,HKCategoryValueSleepAnalysisInBed,,,
937030,937030,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-20 09:09:31-08:00,2022-11-20 03:14:28-08:00,2022-11-20 09:09:19-08:00,HKCategoryValueSleepAnalysisInBed,,,
937031,937031,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-21 06:56:37-08:00,2022-11-21 00:52:02-08:00,2022-11-21 06:42:47-08:00,HKCategoryValueSleepAnalysisInBed,,,
937032,937032,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-22 06:45:22-08:00,2022-11-22 00:26:27-08:00,2022-11-22 06:45:22-08:00,HKCategoryValueSleepAnalysisInBed,,,
937033,937033,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-23 07:09:08-08:00,2022-11-23 00:31:18-08:00,2022-11-23 06:48:28-08:00,HKCategoryValueSleepAnalysisInBed,,,
937034,937034,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-24 06:40:01-08:00,2022-11-24 00:20:11-08:00,2022-11-24 06:40:00-08:00,HKCategoryValueSleepAnalysisInBed,,,
937035,937035,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-25 08:39:16-08:00,2022-11-25 03:14:23-08:00,2022-11-25 08:39:16-08:00,HKCategoryValueSleepAnalysisInBed,,,
937036,937036,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-27 07:16:44-08:00,2022-11-27 02:11:47-08:00,2022-11-27 06:40:18-08:00,HKCategoryValueSleepAnalysisInBed,,,
937037,937037,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-28 06:40:45-08:00,2022-11-27 23:32:00-08:00,2022-11-28 06:40:45-08:00,HKCategoryValueSleepAnalysisInBed,,,
937038,937038,HKCategoryTypeIdentifierSleepAnalysis,Tony,16.1,,2022-11-29 06:40:15-08:00,2022-11-29 00:13:46-08:00,2022-11-29 06:40:14-08:00,HKCategoryValueSleepAnalysisInBed,,,
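A filter of this kind can be written as a boolean mask on the `type` column. The sketch below is an assumption about what `utils.sleep_filtering()` does, not its actual code; the helper name `keep_sleep_records` and the three-row input are hypothetical.

```python
import pandas as pd

SLEEP_TYPE = "HKCategoryTypeIdentifierSleepAnalysis"


def keep_sleep_records(df):
    """Return only the rows whose `type` marks a sleep-analysis record."""
    is_sleep = df["type"] == SLEEP_TYPE
    # .copy() returns an independent DataFrame rather than a slice of df.
    return df.loc[is_sleep].copy()


# Hypothetical mixed input; the index keeps the original row labels.
mixed = pd.DataFrame(
    {
        "type": [
            SLEEP_TYPE,
            "HKQuantityTypeIdentifierHeartRateVariabilitySDNN",
            SLEEP_TYPE,
        ],
        "value": ["HKCategoryValueSleepAnalysisInBed", "13.2906", "HKCategoryValueSleepAnalysisInBed"],
    },
    index=[937037, 946979, 937038],
)

sleep_only = keep_sleep_records(mixed)
print(sleep_only)
```

Returning a copy means later cleaning steps modify their own DataFrame instead of a view into the full health dataset.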


Next, I drop the following columns as they are not needed for this project: `type`, `sourceName`, `sourceVersion`, `unit`, `value`, `device`, `MetadataEntry`, `HeartRateVariabilityMetadataList`, and `originalIndex`. Although `originalIndex` might be useful if I need to refer back to the original dataset later, I also call the `reset_index()` function, which produces another column called `index` that holds the same values as `originalIndex`; keeping both would be redundant.

In [4]:
df = utils.sleep_cleaning(df)
df.tail(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop("type", axis=1, inplace=True)

Unnamed: 0,originalIndex,creationDate,startDate,endDate
937029,937029,2022-11-19 08:00:03-08:00,2022-11-18 23:58:22-08:00,2022-11-19 07:28:58-08:00
937030,937030,2022-11-20 09:09:31-08:00,2022-11-20 03:14:28-08:00,2022-11-20 09:09:19-08:00
937031,937031,2022-11-21 06:56:37-08:00,2022-11-21 00:52:02-08:00,2022-11-21 06:42:47-08:00
937032,937032,2022-11-22 06:45:22-08:00,2022-11-22 00:26:27-08:00,2022-11-22 06:45:22-08:00
937033,937033,2022-11-23 07:09:08-08:00,2022-11-23 00:31:18-08:00,2022-11-23 06:48:28-08:00
937034,937034,2022-11-24 06:40:01-08:00,2022-11-24 00:20:11-08:00,2022-11-24 06:40:00-08:00
937035,937035,2022-11-25 08:39:16-08:00,2022-11-25 03:14:23-08:00,2022-11-25 08:39:16-08:00
937036,937036,2022-11-27 07:16:44-08:00,2022-11-27 02:11:47-08:00,2022-11-27 06:40:18-08:00
937037,937037,2022-11-28 06:40:45-08:00,2022-11-27 23:32:00-08:00,2022-11-28 06:40:45-08:00
937038,937038,2022-11-29 06:40:15-08:00,2022-11-29 00:13:46-08:00,2022-11-29 06:40:14-08:00
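The "A value is trying to be set on a copy of a slice" message in the output above typically appears when `drop(..., inplace=True)` is called on a filtered DataFrame. One common way to avoid it is to drop all unneeded columns in a single non-in-place call. The sketch below shows that approach; it is an assumption, not the actual `utils.sleep_cleaning()`, and the helper name `drop_unused_columns` and the sample rows are hypothetical.

```python
import pandas as pd

UNUSED_COLUMNS = [
    "type", "sourceName", "sourceVersion", "unit", "value", "device",
    "MetadataEntry", "HeartRateVariabilityMetadataList", "originalIndex",
]


def drop_unused_columns(sleep_df):
    """Drop the columns not needed for the sleep analysis and reset the index."""
    # drop() without inplace returns a new DataFrame; errors="ignore" skips absent columns.
    cleaned = sleep_df.drop(columns=UNUSED_COLUMNS, errors="ignore")
    # reset_index() moves the old row labels into a column named "index".
    return cleaned.reset_index()


# Hypothetical two-row input with only some of the original columns.
sample = pd.DataFrame(
    {
        "originalIndex": [937037, 937038],
        "type": ["HKCategoryTypeIdentifierSleepAnalysis"] * 2,
        "creationDate": ["2022-11-28 06:40:45-08:00", "2022-11-29 06:40:15-08:00"],
        "startDate": ["2022-11-27 23:32:00-08:00", "2022-11-29 00:13:46-08:00"],
        "endDate": ["2022-11-28 06:40:45-08:00", "2022-11-29 06:40:14-08:00"],
        "value": ["HKCategoryValueSleepAnalysisInBed"] * 2,
    },
    index=[937037, 937038],
)

cleaned = drop_unused_columns(sample)
print(cleaned)
```

Because the old row labels survive as the `index` column, each row can still be traced back to the original dataset after `originalIndex` is dropped.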
