# Smartphone and Smartwatch Human Activity Recognition (HAR) Model

[Eugene Zen](mailto:ezen@ucsd.edu), [Shane Luna](mailto:shluna@ucsd.edu)


## Table of Contents

[I. Summary](#summary)<br>
[II. Dataset Description](#dataset-description)<br>
[III. Development](#development)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[A. Dependencies](#development-dependencies)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[B. Load Data](#development-load-data)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[C. Exploratory Data Analysis (EDA)](#development-eda)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[D. Data Preparation](#development-data-preparation)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[E. Model Selection & Training](#development-model-selection-and-training)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[F. Model Testing](#development-model-testing)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[G. Dask Cluster Shutdown](#development-dask-cluster-shutdown)

<a id='summary'></a>
## I. Summary
This notebook presents the development of a Human Activity Recognition (HAR) model that uses sensors from both smartphones and smartwatches. The data was originally collected and analyzed by members of the WISDM (Wireless Sensor Data Mining) Lab in the Department of Computer and Information Science at Fordham University. More information on the original experiment can be found in the 2019 publication [here](https://ieeexplore.ieee.org/document/8835065). The data was made publicly available on the UCI Machine Learning Repository as the "WISDM Smartphone and Smartwatch Activity and Biometrics Dataset" and can be found [here](https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset+).

<a id='dataset-description'></a>
## II. Dataset Description
"The 'WISDM Smartphone and Smartwatch Activity and Biometrics Dataset' includes data collected from 51 subjects, each of whom were asked to perform 18 tasks for 3 minutes each. Each subject had a smartwatch placed on his/her dominant hand and a smartphone in their pocket. The data collection was controlled by a custom-made app that ran on the smartphone and smartwatch. The sensor data that was collected was from the accelerometer and gyroscope on both the smartphone and smartwatch, yielding four total sensors."

| Summary Item | Description |
| --------------- | --------------- |
| Number of subjects | 51 |
| Number of activities | 18 |
| Minutes collected per activity | 3 |
| Sensor polling rate | 20Hz |
| Smartphone used | Google Nexus 5/5x or Samsung Galaxy S5 |
| Smartwatch used | LG G Watch |
| Number of raw measurements | 15,630,426 |

<a id='development'></a>
## III. Development

<a id='development-dependencies'></a>
### A. Dependencies

In [None]:
%matplotlib inline

from dask.distributed import Client
import dask.dataframe as dd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import numpy as np
import pandas as pd
import string
import os

In [None]:
# Start local dask client
client = Client(n_workers=4)

In [None]:
client

<a id='development-load-data'></a>
### B. Load Data

In [None]:
def read_file(filepath):
    df = dd.read_csv(filepath, sep = ',', header = None)
    df.columns = ['subject_id', 'activity_code', 'timestamp', 'x', 'y', 'z']
    # Original note: phone in microseconds / watch in milliseconds. No unit is passed,
    # so integer timestamps are read with the pandas default unit (nanoseconds).
    df['timestamp_dt'] = dd.to_datetime(df['timestamp'], origin='unix')
    df['z'] = df['z'].str.replace(";","").astype('float64') # remove ; and ensure float (having issues with lineterminator)
    return df

In [None]:
# Phone Data
phone_accel_df = read_file('wisdm-dataset/raw/phone/accel/*.txt')
phone_gyro_df = read_file('wisdm-dataset/raw/phone/gyro/*.txt')

# Watch Data
watch_accel_df = read_file('wisdm-dataset/raw/watch/accel/*.txt')
watch_gyro_df = read_file('wisdm-dataset/raw/watch/gyro/*.txt')

According to the dataset description provided by the WISDM Lab, one would expect to see the following row counts:

- raw/phone/accel: 4,804,403
- raw/phone/gyro: 3,608,635
- raw/watch/accel: 3,777,046
- raw/watch/gyro: 3,440,342
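
As a quick consistency check, the four expected counts above should add up to the total number of raw measurements listed in Section II. A minimal sketch of that arithmetic follows; the dictionary keys are illustrative names chosen here, not identifiers from the dataset.

```python
# Expected row counts per sensor, as listed in the WISDM dataset description above.
expected_rows = {
    "phone_accel": 4_804_403,
    "phone_gyro": 3_608_635,
    "watch_accel": 3_777_046,
    "watch_gyro": 3_440_342,
}

# Total raw measurements reported in the summary table (Section II).
reported_total = 15_630_426

total_rows = sum(expected_rows.values())
print(f"sum of per-sensor counts: {total_rows:,}")
print(f"matches reported total:   {total_rows == reported_total}")
```

The same dictionary could be compared against the `len(...)` of each loaded dataframe to flag a partially downloaded or truncated file.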

In [None]:
print(f'Phone Accel:\t{len(phone_accel_df)}')
print(f'Phone Gyro:\t{len(phone_gyro_df)}')
print(f'Watch Accel:\t{len(watch_accel_df)}')
print(f'Watch Gyro:\t{len(watch_gyro_df)}')

All of the above dataframes are structured similarly. A sample output of the column datatypes is provided below for reference. An additional column, timestamp_dt, was added to the original data by converting the timestamp attribute to a datetime type. The original timestamp column has been preserved in case it is needed for future enhancements.

In [None]:
print(phone_accel_df.dtypes)
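
The timestamp_dt column depends on how pandas reads integer timestamps. When no `unit` is given, `pd.to_datetime` treats integers as nanoseconds since the epoch; it does not infer a unit from their magnitude. The sketch below uses a made-up integer (not a value from the dataset) to show how strongly the unit affects the resulting date. It assumes `dd.to_datetime` follows the same pandas behavior.

```python
import pandas as pd

# Hypothetical integer timestamp, chosen only for illustration.
raw = 1_600_000_000_000

# Default: integers are interpreted as nanoseconds since the Unix epoch.
as_ns = pd.to_datetime(raw, origin="unix")

# The same integer interpreted as milliseconds.
as_ms = pd.to_datetime(raw, unit="ms", origin="unix")

print(as_ns)  # 1970-01-01 00:26:40
print(as_ms)  # 2020-09-13 12:26:40
```

A wrong unit changes the apparent date and scales every duration by the same factor, so it is worth checking that durations computed from the converted column come out near the expected 3 minutes.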

<a id='development-eda'></a>
### C. Exploratory Data Analysis (EDA)

First, a check is done for any null values in the data. As expected, none were found.

In [None]:
# Check for nulls

print('--Phone Accel--')
print(phone_accel_df.isna().sum().compute())

print('--Phone Gyro--')
print(phone_gyro_df.isna().sum().compute())

print('--Watch Accel--')
print(watch_accel_df.isna().sum().compute())

print('--Watch Gyro--')
print(watch_gyro_df.isna().sum().compute())

Second, a check was done for any missing sensor datasets.

That is, did all subjects actually perform all activities, and are readings available for each activity on each sensor?

In [None]:
# Which sensors have missing activities
subjects = phone_accel_df['subject_id'].unique().compute()
activities = phone_accel_df['activity_code'].unique().compute()

subject_activities_df = pd.DataFrame(subjects).merge(pd.DataFrame(activities), how='cross')

phone_accel_interval_counts = phone_accel_df.groupby(['subject_id', 'activity_code']).size().rename('phone_accel').reset_index().compute()
subject_activities_df = subject_activities_df.merge(phone_accel_interval_counts, on=['subject_id', 'activity_code'], how='left')

phone_gyro_interval_counts = phone_gyro_df.groupby(['subject_id', 'activity_code']).size().rename('phone_gyro').reset_index().compute()
subject_activities_df = subject_activities_df.merge(phone_gyro_interval_counts, on=['subject_id', 'activity_code'], how='left')

watch_accel_interval_counts = watch_accel_df.groupby(['subject_id', 'activity_code']).size().rename('watch_accel').reset_index().compute()
subject_activities_df = subject_activities_df.merge(watch_accel_interval_counts, on=['subject_id', 'activity_code'], how='left')

watch_gyro_interval_counts = watch_gyro_df.groupby(['subject_id', 'activity_code']).size().rename('watch_gyro').reset_index().compute()
subject_activities_df = subject_activities_df.merge(watch_gyro_interval_counts, on=['subject_id', 'activity_code'], how='left')

subject_activities_df = subject_activities_df.set_index(['subject_id'])

subject_activities_df.isna().sum()
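
The pattern used above (build every subject/activity pair with a cross join, then left-merge the observed counts) can be sketched on a toy frame. The data and the `sensor_count` column name below are made up for illustration.

```python
import pandas as pd

# Toy readings for one hypothetical sensor: subject 2 never performed activity "B".
readings = pd.DataFrame({
    "subject_id":    [1, 1, 1, 2, 2],
    "activity_code": ["A", "A", "B", "A", "A"],
})

subjects = pd.DataFrame({"subject_id": readings["subject_id"].unique()})
activities = pd.DataFrame({"activity_code": readings["activity_code"].unique()})

# Every subject/activity pair that should exist.
grid = subjects.merge(activities, how="cross")

# Number of readings actually recorded per pair.
counts = (
    readings.groupby(["subject_id", "activity_code"])
    .size()
    .rename("sensor_count")
    .reset_index()
)

# The left merge keeps every expected pair; pairs without readings get NaN.
coverage = grid.merge(counts, on=["subject_id", "activity_code"], how="left")
missing = coverage[coverage["sensor_count"].isna()]
print(missing)
```

Because the grid is built from the pairs that should exist rather than the pairs that were observed, a combination with no rows at all still shows up as a NaN instead of silently disappearing.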

Below are the subject activities that are missing sensor data. Note also that read counts are higher for certain sensors. At 20 Hz, 3 minutes of activity data should yield 3,600 readings per activity, yet some activities below show more than 8,000 readings for certain sensors. This anomaly is called out in the original research publication, which states that "due to the nature of the Android OS, the sampling rate is only taken as a suggestion, so actual sampling rates sometimes differed."

In [None]:
subject_activities_df[subject_activities_df.isna().any(axis=1)]
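
The expected reading count follows directly from the nominal rate and the activity length. The sketch below works through that arithmetic and shows the average rate implied by a larger count; the helper name `effective_rate_hz` and the count of 8,000 are illustrative only.

```python
NOMINAL_RATE_HZ = 20        # requested sampling rate
ACTIVITY_SECONDS = 3 * 60   # 3 minutes per activity

# Readings expected if the sensor honored the requested rate exactly.
expected_readings = NOMINAL_RATE_HZ * ACTIVITY_SECONDS


def effective_rate_hz(n_readings, duration_s=ACTIVITY_SECONDS):
    """Average sampling rate implied by a reading count over a duration."""
    return n_readings / duration_s


print(expected_readings)        # 3600
print(effective_rate_hz(3600))  # 20.0
print(effective_rate_hz(8000))  # a little over 44 Hz
```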

To confirm the sampling rate discrepancy noted above, a test is performed. The high read counts could have one of two causes:
1. A difference in sampling rate
2. A difference in overall duration (longer duration at the same sampling rate)

Option #2 is tested, and it is confirmed that all the durations are approximately the same. The duration of each activity falls between 179 and 182 seconds (the checks below use an upper bound of 182 seconds for the phone accelerometer and 181 seconds for the other three sensors), except for one activity recorded on both the watch accelerometer and gyroscope. Therefore, by deduction, the discrepancy above is attributed to a difference in sampling rate. Additional confirmation is done later in the train/test step.

In [None]:
pa = phone_accel_df[['subject_id', 'activity_code', 'timestamp_dt']].groupby(['subject_id', 'activity_code']).agg(['max', 'min']).reset_index()
pa['duration'] = pa['timestamp_dt']['max'] - pa['timestamp_dt']['min']
pa['duration_s'] = pa['duration'].dt.total_seconds()

print(f'total length: {len(pa)}')
print(f"total in range: {len(pa[(pa['duration_s'] > 179) & (pa['duration_s'] < 182)])}")

In [None]:
pg = phone_gyro_df[['subject_id', 'activity_code', 'timestamp_dt']].groupby(['subject_id', 'activity_code']).agg(['max', 'min']).reset_index()
pg['duration'] = pg['timestamp_dt']['max'] - pg['timestamp_dt']['min']
pg['duration_s'] = pg['duration'].dt.total_seconds()

print(f'total length: {len(pg)}')
print(f"total in range: {len(pg[(pg['duration_s'] > 179) & (pg['duration_s'] < 181)])}")

In [None]:
wa = watch_accel_df[['subject_id', 'activity_code', 'timestamp_dt']].groupby(['subject_id', 'activity_code']).agg(['max', 'min']).reset_index()
wa['duration'] = wa['timestamp_dt']['max'] - wa['timestamp_dt']['min']
wa['duration_s'] = wa['duration'].dt.total_seconds()

print(f'total length: {len(wa)}')
print(f"total in range: {len(wa[(wa['duration_s'] > 179) & (wa['duration_s'] < 181)])}")

print(f"min: {wa['duration_s'].min().compute()}")
print(f"max: {wa['duration_s'].max().compute()}")

In [None]:
wg = watch_gyro_df[['subject_id', 'activity_code', 'timestamp_dt']].groupby(['subject_id', 'activity_code']).agg(['max', 'min']).reset_index()
wg['duration'] = wg['timestamp_dt']['max'] - wg['timestamp_dt']['min']
wg['duration_s'] = wg['duration'].dt.total_seconds()

print(f'total length: {len(wg)}')
print(f"total in range: {len(wg[(wg['duration_s'] > 179) & (wg['duration_s'] < 181)])}")

print(f"min: {wg['duration_s'].min().compute()}")
print(f"max: {wg['duration_s'].max().compute()}")
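
The reasoning behind the duration checks above can be illustrated with two synthetic recordings of equal length but different sampling rates. This is a sketch in plain pandas on made-up data, not a rerun of the notebook's Dask computation.

```python
import pandas as pd

# Two hypothetical 180-second recordings: one at 20 Hz, one at 40 Hz.
slow = pd.DataFrame({
    "subject_id": 1,
    "activity_code": "A",
    "timestamp_dt": pd.date_range("2020-01-01", periods=3601, freq="50ms"),
})
fast = pd.DataFrame({
    "subject_id": 2,
    "activity_code": "A",
    "timestamp_dt": pd.date_range("2020-01-01", periods=7201, freq="25ms"),
})
df = pd.concat([slow, fast], ignore_index=True)

# First timestamp, last timestamp and reading count per subject activity.
summary = (
    df.groupby(["subject_id", "activity_code"])["timestamp_dt"]
    .agg(["min", "max", "count"])
    .reset_index()
)
summary["duration_s"] = (summary["max"] - summary["min"]).dt.total_seconds()

# count - 1 intervals separate count readings.
summary["rate_hz"] = (summary["count"] - 1) / summary["duration_s"]
print(summary[["subject_id", "count", "duration_s", "rate_hz"]])
```

Both recordings last 180 seconds, so the difference in reading counts can only come from the sampling rate, which is the same deduction made for the real data.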

Third, a check is performed to confirm whether the timestamps among device sensors are synced, which would simplify joining the data in the data preparation step. The test picks a single subject and a single activity and compares the starting timestamp on each sensor. It shows that the clocks of the different devices (phone vs. watch) are not necessarily synced. The years/dates in the timestamps can be ignored; they are not believed to be accurate. The times are the more important piece. It is confirmed that the intervals appear as expected and that the timestamps are read in the same manner across sensors. One of the professors/authors of the original publication later clarified that, to combine the datasets, one should assume the same start time across sensors for each subject activity.

In [None]:
print(phone_accel_df[(phone_accel_df['subject_id'] == 1600) & (phone_accel_df['activity_code'] == 'A')]['timestamp_dt'].min().compute())
print(phone_gyro_df[(phone_gyro_df['subject_id'] == 1600) & (phone_gyro_df['activity_code'] == 'A')]['timestamp_dt'].min().compute())
print(watch_accel_df[(watch_accel_df['subject_id'] == 1600) & (watch_accel_df['activity_code'] == 'A')]['timestamp_dt'].min().compute())
print(watch_gyro_df[(watch_gyro_df['subject_id'] == 1600) & (watch_gyro_df['activity_code'] == 'A')]['timestamp_dt'].min().compute())
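
One way to read the four start times printed above is to compare them within a device and across devices. The sketch below does this with made-up timestamps; the values are hypothetical and are not the notebook's output.

```python
import pandas as pd

# Hypothetical first timestamps for one subject activity on each sensor.
starts = pd.Series({
    "phone_accel": pd.Timestamp("1970-01-03 22:10:05.120"),
    "phone_gyro":  pd.Timestamp("1970-01-03 22:10:05.125"),
    "watch_accel": pd.Timestamp("1970-01-01 08:41:17.300"),
    "watch_gyro":  pd.Timestamp("1970-01-01 08:41:17.310"),
})

# Offset of every sensor relative to the earliest start.
offsets = starts - starts.min()
print(offsets)

# Sensors on the same device share a clock; sensors on different devices do not.
within_phone = abs(starts["phone_accel"] - starts["phone_gyro"])
across_devices = abs(starts["phone_accel"] - starts["watch_accel"])
print(within_phone, across_devices)
```

A small within-device gap together with a large cross-device gap is the signature of unsynced clocks, as opposed to one sensor simply starting late.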

Fourth, a high-level analysis was performed by looking at some visualizations of the tri-axial sensor readings from both the smartphone and the smartwatch. A random subject was picked for the visualization. The efforts of the rest of this notebook aim to assess the distinctness of these patterns for different physical activities. The plots below are currently separated by device sensor.

In [None]:
labels_dict = {
    'A': "Walking", 
    'B': "Jogging", 
    'C': "Stairs", 
    'D': "Sitting", 
    'E': "Standing", 
    'F': "Typing", 
    'G': "Brushing Teeth", 
    'H': "Eating Soup", 
    'I': "Eating Chips", 
    'J': "Eating Pasta", 
    'K': "Drinking from Cup", 
    'L': "Eating Sandwich", 
    'M': "Kicking (Soccer Ball)", 
    'O': "Playing Catch w/Tennis Ball", 
    'P': "Dribbling (Basketball)", 
    'Q': "Writing", 
    'R': "Clapping", 
    'S': "Folding Clothes"
}

def plot_activity(df, title=None, x_label=None, y_label=None):
    # assuming all 18 activities
    
    subject_id = df.head(1)['subject_id'].item()
    activities = df['activity_code'].unique()
    
    fig, axs = plt.subplots(6, 3, figsize=(20,20))
    fig.suptitle(title)
    
    count, i, j = 0, -1, 0
    for act in activities:
        if count % 3 == 0:
            i += 1
            j = 0
        # do plot stuff #
        axs[i, j].plot(range(len(df[df['activity_code'] == act])), df[df['activity_code'] == act]['x'].to_numpy(), c='C0', label='x')
        axs[i, j].plot(range(len(df[df['activity_code'] == act])), df[df['activity_code'] == act]['y'].to_numpy(), c='C1', label='y')
        axs[i, j].plot(range(len(df[df['activity_code'] == act])), df[df['activity_code'] == act]['z'].to_numpy(), c='C2', label='z')
        axs[i, j].set_title(labels_dict[act])
        axs[i, j].legend()
        #################
        count += 1
        j += 1
    
    for ax in axs.flat:
        ax.set(xlabel=x_label, ylabel=y_label)
        
    # Hide x labels and tick labels for top plots and y ticks for right plots.
    for ax in axs.flat:
        ax.label_outer()
        
#     plt.savefig('phone_accel.png')

In [None]:
activities = phone_accel_df['activity_code'].unique().compute()
activity = pd.DataFrame()
for act in activities:
    temp_act = phone_accel_df[(phone_accel_df['subject_id'] == 1600) & (phone_accel_df['activity_code'] == act)].compute()[0:100]
    activity = pd.concat([activity, temp_act])
plot_activity(activity, title='Subject 1600: Phone Accelerometer', x_label='reading #', y_label='m/s2')

In [None]:
activities = phone_gyro_df['activity_code'].unique().compute()
activity = pd.DataFrame()
for act in activities:
    temp_act = phone_gyro_df[(phone_gyro_df['subject_id'] == 1600) & (phone_gyro_df['activity_code'] == act)].compute()[0:100]
    activity = pd.concat([activity, temp_act])
plot_activity(activity, title='Subject 1600: Phone Gyroscope', x_label='reading #', y_label='radians/s')

In [None]:
activities = watch_accel_df['activity_code'].unique().compute()
activity = pd.DataFrame()
for act in activities:
    temp_act = watch_accel_df[(watch_accel_df['subject_id'] == 1600) & (watch_accel_df['activity_code'] == act)].compute()[0:100]
    activity = pd.concat([activity, temp_act])
plot_activity(activity, title='Subject 1600: Watch Accelerometer', x_label='reading #', y_label='m/s2')

In [None]:
activities = watch_gyro_df['activity_code'].unique().compute()
activity = pd.DataFrame()
for act in activities:
    temp_act = watch_gyro_df[(watch_gyro_df['subject_id'] == 1600) & (watch_gyro_df['activity_code'] == act)].compute()[0:100]
    activity = pd.concat([activity, temp_act])
plot_activity(activity, title='Subject 1600: Watch Gyroscope', x_label='reading #', y_label='radians/s')

Summary of takeaways from EDA:
- Not all subjects performed all activities
- Some subjects may have performed certain activities with limited sensors recording
- Sensors potentially have different frequencies
- Timestamps/clocks between smartphone and smartwatch are not synced

<a id='development-data-preparation'></a>
### D. Data Preparation

*After noticing that the timestamps/clocks were out of sync, but also reading that the device data had been successfully joined in the original publication, contact was made for clarification with one of the co-authors, Gary Weiss, Professor in the Department of Computer and Information Science at Fordham University. He clarified that the data was originally aligned on the assumption that each activity had the same start time and that there would not be much drift. The rest of the analysis in this notebook proceeds under the same assumption.*

The data preparation step will be broken into the following tasks:
1. Group the sensor data into x second non-overlapping intervals, for each subject activity.
    - x = 3
    - aggregate/engineer desired features for each window in this step
2. Join the data by timestamp and/or interval index.
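
Step 1 can be sketched on a small synthetic stream. The example assumes a single hypothetical subject activity sampled at exactly 20 Hz and a single axis `x`; the feature set (mean and standard deviation per window) mirrors the aggregation used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 20 Hz stream for one subject activity: 9 seconds = 180 readings.
n = 180
df = pd.DataFrame({
    "subject_id": 1600,
    "activity_code": "A",
    "timestamp_dt": pd.date_range("2020-01-01 00:00:00", periods=n, freq="50ms"),
    "x": np.arange(n, dtype="float64"),
})

# Non-overlapping 3-second windows per subject activity, one row of features each.
windows = (
    df.set_index("timestamp_dt")
    .groupby(["subject_id", "activity_code", pd.Grouper(freq="3s")])["x"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
print(windows)
```

At 20 Hz each 3-second window holds 60 readings, so the `count` column is a convenient check that the windows are full.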

The following sensor combinations will be evaluated:

- **Phone** = phone_accel + phone_gyro
- **Watch** = watch_accel + watch_gyro
- **Both** = phone_accel + phone_gyro + watch_accel + watch_gyro **(INCOMPLETE)**

When training a model based on both, only activity data available with all 4 sensors will be considered.
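
A minimal sketch of that restriction, assuming a coverage table shaped like the one built during EDA (one row per subject activity, one count column per sensor, NaN where a sensor has no data). The values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sensor reading counts; NaN marks a sensor with no data.
coverage = pd.DataFrame({
    "subject_id":    [1600, 1600, 1601],
    "activity_code": ["A", "B", "A"],
    "phone_accel":   [3600, 3600, 3600],
    "phone_gyro":    [3600, np.nan, 3600],
    "watch_accel":   [3600, 3600, 3600],
    "watch_gyro":    [3600, 3600, np.nan],
})
sensor_cols = ["phone_accel", "phone_gyro", "watch_accel", "watch_gyro"]

# Subject activities for which all four sensors recorded data.
complete_pairs = coverage.dropna(subset=sensor_cols)[["subject_id", "activity_code"]]

# An inner merge restricts any sensor frame to those subject activities.
readings = pd.DataFrame({
    "subject_id":    [1600, 1600, 1601],
    "activity_code": ["A", "B", "A"],
    "x":             [0.1, 0.2, 0.3],
})
usable = readings.merge(complete_pairs, on=["subject_id", "activity_code"], how="inner")
print(usable)
```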

The following section of code groups the sensor data by the window size prior to joining the respective device's sensors. This is done for both the phone dataset and the watch dataset.

In [None]:
# Group phone accel by window size
phone_accel_grouped_df = phone_accel_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])
phone_accel_grouped_mean_df = phone_accel_grouped_df.agg(['mean', 'std']).fillna(0)
phone_accel_grouped_mean_df.columns = phone_accel_grouped_mean_df.columns.map('_'.join)
phone_accel_grouped_mean_df = phone_accel_grouped_mean_df.reset_index()
# phone_accel_grouped_mean_df.head()

In [None]:
# Group phone gyro by window size
phone_gyro_grouped_df = phone_gyro_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])
phone_gyro_grouped_mean_df = phone_gyro_grouped_df.agg(['mean', 'std']).fillna(0)
phone_gyro_grouped_mean_df.columns = phone_gyro_grouped_mean_df.columns.map('_'.join)
phone_gyro_grouped_mean_df = phone_gyro_grouped_mean_df.reset_index()
# phone_gyro_grouped_mean_df.head()

In [None]:
# Join phone sensor datasets
phone_grouped_means_df = phone_accel_grouped_mean_df.merge(phone_gyro_grouped_mean_df, on=["subject_id", "activity_code", "timestamp_dt"], how="inner", suffixes=['_accel', '_gyro'])
phone_grouped_means_df = phone_grouped_means_df.drop(['timestamp_mean_accel', 'timestamp_std_accel', 'timestamp_mean_gyro', 'timestamp_std_gyro'], axis=1)
# phone_grouped_means_df.head()

In [None]:
# Group watch accel by window size
watch_accel_grouped_df = watch_accel_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])
watch_accel_grouped_mean_df = watch_accel_grouped_df.agg(['mean', 'std']).fillna(0)
watch_accel_grouped_mean_df.columns = watch_accel_grouped_mean_df.columns.map('_'.join)
watch_accel_grouped_mean_df = watch_accel_grouped_mean_df.reset_index()
# watch_accel_grouped_mean_df.head()

In [None]:
# Group watch gyro by window size
watch_gyro_grouped_df = watch_gyro_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])
watch_gyro_grouped_mean_df = watch_gyro_grouped_df.agg(['mean', 'std']).fillna(0)
watch_gyro_grouped_mean_df.columns = watch_gyro_grouped_mean_df.columns.map('_'.join)
watch_gyro_grouped_mean_df = watch_gyro_grouped_mean_df.reset_index()
# watch_gyro_grouped_mean_df.head()

In [None]:
# Join watch sensor datasets
watch_grouped_means_df = watch_accel_grouped_mean_df.merge(watch_gyro_grouped_mean_df, on=["subject_id", "activity_code", "timestamp_dt"], how="inner", suffixes=['_accel', '_gyro'])
watch_grouped_means_df = watch_grouped_means_df.drop(['timestamp_mean_accel', 'timestamp_std_accel', 'timestamp_mean_gyro', 'timestamp_std_gyro'], axis=1)
# watch_grouped_means_df.head()

In [None]:
print(f'Phone joined length: {len(phone_grouped_means_df)}')
print(f'Watch joined length: {len(watch_grouped_means_df)}')

In [None]:
phone_grouped_clean_df = phone_grouped_means_df.copy()
watch_grouped_clean_df = watch_grouped_means_df.copy()

### ***WORK IN PROGRESS***

#### Both = phone_accel + phone_gyro + watch_accel + watch_gyro
#### Adjust Subject Activity Start Time to Nearest Minute (Floor) Prior to Grouping Data 

The timestamps were shifted in preparation for use with pd.Grouper, which bins on fixed boundaries aligned to whole time units rather than on the first reading of each recording. To avoid a small partial window at the start of each subject activity, the timestamps are shifted so that each recording starts on a whole minute (floor). In this step, the min() timestamp for each subject activity is taken and rounded down to the nearest whole minute. The difference between the original min() value and this floored value is the offset, which is then subtracted from every timestamp in that subject activity, shifting the whole subset slightly.
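
The effect of the shift can be seen on three made-up timestamps. This sketch uses `dt.floor` to stand in for the window assignment and is only an illustration of the idea, not the notebook's persisted implementation.

```python
import pandas as pd

# Hypothetical readings that all fall within the first 3 seconds of a recording.
ts = pd.Series(pd.to_datetime([
    "2020-01-01 10:15:41.0",
    "2020-01-01 10:15:42.5",
    "2020-01-01 10:15:43.5",
]))

# Offset between the first reading and the start of its minute.
start = ts.min()
offset = start - start.floor("min")
shifted = ts - offset

# Window boundaries fall on multiples of 3 seconds, so the unshifted data is
# split across two windows while the shifted data lands in a single one.
print(ts.dt.floor("3s").nunique())       # 2
print(shifted.dt.floor("3s").nunique())  # 1
```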

TO DO:
1. Incorporate interval number when grouping to be used for joining the device datasets together
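
One possible approach to the interval number, offered as a sketch rather than the authors' planned solution: number each window by the elapsed time since the first reading of its subject activity, then join devices on that number instead of on wall-clock time. The helper `windowed_mean`, the column names, and the data are all hypothetical.

```python
import pandas as pd

WINDOW = pd.Timedelta(seconds=3)
KEYS = ["subject_id", "activity_code"]


def windowed_mean(df, prefix):
    """Mean of x per (subject, activity, interval), with interval counted from the first reading."""
    start = df.groupby(KEYS)["timestamp_dt"].transform("min")
    interval = (df["timestamp_dt"] - start) // WINDOW  # integer window index
    return (
        df.assign(interval=interval)
        .groupby(KEYS + ["interval"])["x"]
        .mean()
        .rename(f"{prefix}_x_mean")
        .reset_index()
    )


# Hypothetical phone and watch streams whose clocks disagree by several hours.
phone = pd.DataFrame({
    "subject_id": 1600, "activity_code": "A",
    "timestamp_dt": pd.date_range("2020-01-01 10:00:00", periods=6, freq="1s"),
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})
watch = pd.DataFrame({
    "subject_id": 1600, "activity_code": "A",
    "timestamp_dt": pd.date_range("2020-01-01 03:17:42.5", periods=6, freq="1s"),
    "x": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

both = windowed_mean(phone, "phone").merge(
    windowed_mean(watch, "watch"), on=KEYS + ["interval"], how="inner"
)
print(both)
```

This relies on the assumption stated at the top of this section: each activity starts at the same moment on every sensor, with little drift afterwards.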

In [None]:
# # Create directories to persist data if they don't already exist
# outdir = './data_persist'
# if not os.path.exists(outdir):
#     os.makedirs(f'{outdir}/phone/accel')
#     os.makedirs(f'{outdir}/phone/gyro')
#     os.makedirs(f'{outdir}/watch/accel')
#     os.makedirs(f'{outdir}/watch/gyro')

In [None]:
# def write_files(outputdirectory, filename_suffix, df):
#     subjects = df['subject_id'].unique().compute()
#     activities = df['activity_code'].unique().compute()
    
#     # persist original df for performance gains
#     # create copy for usage below
#     df = df.persist()
#     df_shift = df.copy()
    
#     min_times = df_shift.groupby(['subject_id', 'activity_code']).min().reset_index().compute()
#     df_shift['timestamp_shift_dt'] = ''
    
#     count = 0
#     for sub in subjects:
#         subject_df = df_shift[(df_shift['subject_id'] == sub)].persist()
#         for act in activities:
#             min_ts = min_times[(min_times['subject_id'] == sub) & (min_times['activity_code'] == act)]['timestamp_dt']
#             if len(min_ts) == 1:
#                 min_ts = min_ts.item()
#                 floor_ts = min_ts.floor('min')
#                 offset_ts = min_ts-floor_ts
#                 subject_df['timestamp_shift_dt'] = subject_df['timestamp_shift_dt'].mask(((subject_df['activity_code'] == act)), (subject_df['timestamp_dt'] - offset_ts))
#         subject_df['timestamp_shift_dt'] = dd.to_datetime(subject_df['timestamp_shift_dt'], origin='unix')
#         subject_df = subject_df.compute()
#         subject_df.to_csv(f'{outputdirectory}{sub}_{filename_suffix}.csv', sep=',', index=False, header=True)

In [None]:
# %%time
# write_files('./data_persist/phone/accel/', 'phone_accel', phone_accel_df)

In [None]:
# %%time
# write_files('./data_persist/phone/gyro/', 'phone_gyro', phone_gyro_df)

In [None]:
# %%time
# write_files('./data_persist/watch/accel/', 'watch_accel', watch_accel_df)

In [None]:
# %%time
# write_files('./data_persist/watch/gyro/', 'watch_gyro', watch_gyro_df)

In [None]:
# # Read files back in
# def read_file_new(filepath):
#     df = dd.read_csv(filepath, sep = ',', header=0)
#     df['timestamp_dt'] = dd.to_datetime(df['timestamp_dt'], origin='unix')
#     df['timestamp_shift_dt'] = dd.to_datetime(df['timestamp_shift_dt'], origin='unix')
#     return df

In [None]:
# # Phone Data
# phone_accel_shift_df = read_file_new('data_persist/phone/accel/*.csv')
# phone_gyro_shift_df = read_file_new('data_persist/phone/gyro/*.csv')

# # Watch Data
# watch_accel_shift_df = read_file_new('data_persist/watch/accel/*.csv')
# watch_gyro_shift_df = read_file_new('data_persist/watch/gyro/*.csv')

In [None]:
# # Confirm lengths match the original files
# print('--NEW--')
# print(f'Phone Accel:\t{len(phone_accel_shift_df)}')
# print(f'Phone Gyro:\t{len(phone_gyro_shift_df)}')
# print(f'Watch Accel:\t{len(watch_accel_shift_df)}')
# print(f'Watch Gyro:\t{len(watch_gyro_shift_df)}')
# print('--ORIG--')
# print(f'Phone Accel:\t{len(phone_accel_df)}')
# print(f'Phone Gyro:\t{len(phone_gyro_df)}')
# print(f'Watch Accel:\t{len(watch_accel_df)}')
# print(f'Watch Gyro:\t{len(watch_gyro_df)}')

In [None]:
# Sample Usage - missing interval numbers for join afterwards
# phone_gyro_grouped_df = phone_gyro_shift_df.copy()
# phone_gyro_grouped_df = phone_gyro_grouped_df.drop('timestamp_dt', axis=1)
# phone_gyro_grouped_df = phone_gyro_grouped_df.set_index('timestamp_shift_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])

### ***PREV WORK / ARCHIVE***

This section of code below was used for joining the device dataset first and then grouping. This was done for each respective device.

In [None]:
# Phone
# phone_combined_df = phone_accel_df.merge(phone_gyro_df, on=['subject_id', 'activity_code', 'timestamp_dt'], how='inner', suffixes=['_accel', '_gyro'])
# phone_combined_df = phone_combined_df.drop(['timestamp_gyro'], axis=1)
# phone_combined_df = phone_combined_df.rename(columns={'timestamp_accel': 'timestamp'})
# phone_combined_df.head()

In [None]:
# Watch
# watch_combined_df = watch_accel_df.merge(watch_gyro_df, on=['subject_id', 'activity_code', 'timestamp_dt'], how='inner', suffixes=['_accel', '_gyro'])
# watch_combined_df = watch_combined_df.drop(['timestamp_gyro'], axis=1)
# watch_combined_df = watch_combined_df.rename(columns={'timestamp_accel': 'timestamp'})
# watch_combined_df.head()

In [None]:
# Phone & Watch
# phone_watch_combined_df = phone_combined_df.merge(watch_combined_df, on=['subject_id', 'activity_code', 'timestamp_dt'], how='inner', suffixes=['_phone', '_watch'])
# phone_watch_combined_df.head()
# phone and watch data do not have times synced -- currently unable to use together

In [None]:
# print(f'Phone: {len(phone_combined_df)}')
# print(f'Watch: {len(watch_combined_df)}')

In [None]:
# corr_matrix_phone = phone_combined_df.corr()
# corr_matrix_phone.compute().style.background_gradient(cmap='coolwarm')

In [None]:
# corr_matrix_watch = watch_combined_df.corr()
# corr_matrix_watch.compute().style.background_gradient(cmap='coolwarm')

In [None]:
# phone_grouped_df = phone_combined_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])
# watch_grouped_df = watch_combined_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])

In [None]:
# Timestamp grouping/interval count check
# size = phone_grouped_df.size().rename('count').to_frame()
# size.compute().value_counts().head(50)

In [None]:
# pd.Grouper check (to understand if it keeps the first or last of the interval)
# https://stackoverflow.com/questions/35898667/group-by-time-and-other-column-in-pandas
# sample_df = phone_accel_df.head(60)
# sample_group_df = sample_df.set_index('timestamp_dt').groupby(['subject_id', 'activity_code', pd.Grouper(freq='3S')])

# print(sample_df.head())
# sample_group_df.agg(['mean', 'count'])

# Findings:
# groups by looking at whole second intervals
# shows timestamp for first time in interval

In [None]:
# phone_grouped_means_df = phone_grouped_df.mean().reset_index()
# phone_grouped_counts_df = phone_grouped_df.size().rename('count').reset_index()
# phone_grouped_means_df = phone_grouped_means_df.merge(phone_grouped_counts_df, on=['subject_id', 'activity_code', 'timestamp_dt'])
# print(f'Phone full: {len(phone_grouped_means_df)}')

# phone_grouped_clean_df = phone_grouped_means_df[phone_grouped_means_df['count'] > 20]
# print(f'Phone > 20: {len(phone_grouped_clean_df)}')

# watch_grouped_means_df = watch_grouped_df.mean().reset_index()
# watch_grouped_counts_df = watch_grouped_df.size().rename('count').reset_index()
# watch_grouped_means_df = watch_grouped_means_df.merge(watch_grouped_counts_df, on=['subject_id', 'activity_code', 'timestamp_dt'])
# print(f'Watch full: {len(watch_grouped_means_df)}')

# watch_grouped_clean_df = watch_grouped_means_df[watch_grouped_means_df['count'] > 20]
# print(f'Watch > 20: {len(watch_grouped_clean_df)}')

In [None]:
# phone = phone_grouped_clean_df['activity_code'].value_counts().rename('total').reset_index().compute().sort_values(by=['index'], ascending=True)
# watch = watch_grouped_clean_df['activity_code'].value_counts().rename('total').reset_index().compute().sort_values(by=['index'], ascending=True)

In [None]:
# total = phone['total'].sum()

# plt.figure(figsize=(10,5))

# plt.plot(phone['index'], (phone['total']/total*100), label='phone')
# plt.plot(watch['index'], (watch['total']/total*100), label='watch')

# plt.title('Data Distribution by Activity Code')
# plt.xlabel('Activity Code')
# plt.ylabel('% of Respective Dataset')
# plt.legend()
# plt.show()

### ***END PREV WORK / ARCHIVE***

<a id='development-model-selection-and-training'></a>
### E. Model Selection & Training

In [None]:
# train test split with 80-20 split
import dask_ml.model_selection

data_columns = ['x_mean_accel', 'y_mean_accel', 'z_mean_accel', 'x_mean_gyro', 'y_mean_gyro', 'z_mean_gyro', 'x_std_accel', 'y_std_accel', 'z_std_accel', 'x_std_gyro', 'y_std_gyro', 'z_std_gyro']
label_columns = ['activity_code']

# drop subject and timestamps from train data
phone_data_all = phone_grouped_clean_df[data_columns]
phone_labels_all = phone_grouped_clean_df[label_columns]

# drop subject and timestamps from train data
watch_data_all = watch_grouped_clean_df[data_columns]
watch_labels_all = watch_grouped_clean_df[label_columns]

X_train_phone, X_test_phone, y_train_phone, y_test_phone = dask_ml.model_selection.train_test_split(phone_data_all, phone_labels_all, shuffle = True, random_state=0, test_size = 0.2)
X_train_watch, X_test_watch, y_train_watch, y_test_watch = dask_ml.model_selection.train_test_split(watch_data_all, watch_labels_all, shuffle = True, random_state=0, test_size = 0.2)

In [None]:
print(f'X_train_phone: {len(X_train_phone)}')
print(f'X_test_phone: {len(X_test_phone)}')
print(f'y_train_phone: {len(y_train_phone)}')
print(f'y_test_phone: {len(y_test_phone)}')

In [None]:
print(f'X_train_watch: {len(X_train_watch)}')
print(f'X_test_watch: {len(X_test_watch)}')
print(f'y_train_watch: {len(y_train_watch)}')
print(f'y_test_watch: {len(y_test_watch)}')

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

clf_phone = RandomForestClassifier(random_state=0)

with joblib.parallel_backend('dask'):
    clf_phone.fit(X_train_phone, y_train_phone)

In [None]:
%%time

clf_watch = RandomForestClassifier(random_state=0)

with joblib.parallel_backend('dask'):
    clf_watch.fit(X_train_watch, y_train_watch)

<a id='development-model-testing'></a>
### F. Model Testing

In [None]:
%%time
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

with joblib.parallel_backend('dask'):
    predicted_labels_phone = clf_phone.predict(X_test_phone)

In [None]:
%%time

with joblib.parallel_backend('dask'):
    predicted_labels_watch = clf_watch.predict(X_test_watch)

In [None]:
%%time

with joblib.parallel_backend('dask'):
    print(accuracy_score(y_test_phone, predicted_labels_phone))

In [None]:
%%time

with joblib.parallel_backend('dask'):
    print(accuracy_score(y_test_watch, predicted_labels_watch))

In [None]:
unique_phone_labels = np.unique(y_test_phone)
unique_watch_labels = np.unique(y_test_watch)

In [None]:
%%time

# Confusion Matrix: Phone
cm_phone = confusion_matrix(y_test_phone, predicted_labels_phone, labels = unique_phone_labels)

display_labels = ["Walking", "Jogging", "Stairs", "Sitting", "Standing", "Typing", "Brushing Teeth", "Eating Soup", "Eating Chips", "Eating Pasta", "Drinking from Cup", "Eating Sandwich", "Kicking (Soccer Ball)", "Playing Catch w/Tennis Ball", "Dribbling (Basketball)", "Writing", "Clapping", "Folding Clothes"]

fig, ax = plt.subplots(figsize=(10,10))
ax.xaxis.set_ticks_position('top')
ax.xaxis.set_label_position('top')
sns.heatmap(cm_phone, annot=True, fmt='g', xticklabels = display_labels, yticklabels = display_labels)

In [None]:
%%time

# Confusion Matrix: Watch
cm_watch = confusion_matrix(y_test_watch, predicted_labels_watch, labels = unique_watch_labels)

display_labels = ["Walking", "Jogging", "Stairs", "Sitting", "Standing", "Typing", "Brushing Teeth", "Eating Soup", "Eating Chips", "Eating Pasta", "Drinking from Cup", "Eating Sandwich", "Kicking (Soccer Ball)", "Playing Catch w/Tennis Ball", "Dribbling (Basketball)", "Writing", "Clapping", "Folding Clothes"]

fig, ax = plt.subplots(figsize=(10,10))
ax.xaxis.set_ticks_position('top')
ax.xaxis.set_label_position('top')
sns.heatmap(cm_watch, annot=True, fmt='g', xticklabels = display_labels, yticklabels = display_labels)
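
The confusion matrices above pair a hard-coded list of 18 display names with whatever labels appear in the test set, so the tick labels would be misaligned if a class were absent. A hedged alternative is to derive the display names from the labels actually present. The sketch below uses a three-entry subset of `labels_dict` and made-up predictions.

```python
import numpy as np

# Subset of the code-to-name mapping defined earlier in this notebook.
labels_dict = {"A": "Walking", "B": "Jogging", "C": "Stairs"}

# Hypothetical true and predicted activity codes; "B" never occurs.
y_true = np.array(["A", "A", "C", "C", "C"])
y_pred = np.array(["A", "C", "C", "C", "A"])

# Sorted codes that actually occur in either array.
present = np.unique(np.concatenate([y_true, y_pred]))

# Display names in the same order as the codes.
display_labels = [labels_dict[code] for code in present]
print(list(present), display_labels)
```

Passing `present` as the `labels` argument of `confusion_matrix` and `display_labels` as the heatmap tick labels keeps the rows, columns, and names in the same order.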

<a id='development-dask-cluster-shutdown'></a>
### G. Dask Cluster Shutdown

In [None]:
client.shutdown()