##Human activity recognition with the WISDM Smartphone and Smartwatch Activity and Biometrics Dataset
<p>In this tutorial, we'll build a human activity recognition (HAR) model using the WISDM Smartphone and Smartwatch Activity and Biometrics Dataset. Along the way, we'll use several Databricks features together with popular open source data analysis tools to perform exploratory data analysis and build our model.</p>

This tutorial consists of three parts:
1. Exploratory analysis and advanced analytics with pandas and scikit-learn
2. Exploratory analysis and advanced analytics with PySpark
3. Deep Learning with GPUs on Databricks

In [2]:
#%run ./Includes/Setup

In [3]:
#setup some variables
data_path = "dbfs:/FileStore/tmp/wisdm"
data_path_local = "/{}".format(data_path.replace(":",""))
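To make the path conversion concrete, here is a minimal sketch of what the format/replace expression produces. The helper name to_local_path is hypothetical and only wraps the same expression; it assumes the input always starts with the dbfs: scheme.

```python
def to_local_path(dbfs_uri):
    # Hypothetical helper: same expression as in the cell above. Dropping the
    # ":" and prefixing "/" turns "dbfs:/..." into "/dbfs/...".
    return "/{}".format(dbfs_uri.replace(":", ""))

data_path = "dbfs:/FileStore/tmp/wisdm"
data_path_local = to_local_path(data_path)
print(data_path_local)  # /dbfs/FileStore/tmp/wisdm
```

The resulting /dbfs/... path is what the glob patterns in the loading step rely on.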

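Use the %fs magic (a shorthand for dbutils.fs) to list the files we've downloaded and copied to DBFS (adjust the path if your copy lives elsewhere, for example under your username).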

In [5]:
%fs ls /FileStore/tmp/wisdm/phone/accel

path,name,size
dbfs:/FileStore/tmp/wisdm/phone/accel/.DS_Store,.DS_Store,6148
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1600_accel_phone.txt,data_1600_accel_phone.txt,3387478
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1601_accel_phone.txt,data_1601_accel_phone.txt,4324299
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1602_accel_phone.txt,data_1602_accel_phone.txt,4544415
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1603_accel_phone.txt,data_1603_accel_phone.txt,4158728
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1604_accel_phone.txt,data_1604_accel_phone.txt,3297781
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1605_accel_phone.txt,data_1605_accel_phone.txt,4296869
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1606_accel_phone.txt,data_1606_accel_phone.txt,3513357
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1607_accel_phone.txt,data_1607_accel_phone.txt,4109725
dbfs:/FileStore/tmp/wisdm/phone/accel/data_1608_accel_phone.txt,data_1608_accel_phone.txt,5068650


Databricks also exposes DBFS through a FUSE mount at /dbfs, which lets you work with files stored on DBFS as if they were on the local machine.

In [7]:
%sh ls -lh /dbfs/FileStore/tmp/wisdm/phone/accel

In [8]:
%sh head -n5 /dbfs/FileStore/tmp/wisdm/phone/accel/data_1600_accel_phone.txt
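Because the FUSE mount looks like an ordinary directory, plain Python file I/O works on it too. The sketch below is a rough Python equivalent of head -n5; the function name head is our own, and the commented-out call only works on a Databricks cluster where the file exists.

```python
from itertools import islice

def head(path, n=5):
    # Return the first n lines of a text file without reading the rest.
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# On Databricks (path taken from the cell above; not runnable elsewhere):
# head("/dbfs/FileStore/tmp/wisdm/phone/accel/data_1600_accel_phone.txt")
```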

#Loading Data
Let's load the data from DBFS into a pandas DataFrame. Since pandas' read_csv reads a single file per call, we'll build a small helper function and concatenate the results for each directory.

In [10]:
import pandas as pd
import numpy as np
from glob import glob

#define function to read one file; column names are supplied because none are read from the file
def read_data(path):
    return pd.read_csv(path, names=['Subject-id', 'ActivityLabel', 'Timestamp', 'x', 'y', 'z'])

#read data from each directory
pdf_phone_accel = pd.concat(map(read_data, glob('{}/phone/accel/*.txt'.format(data_path_local))))
pdf_phone_gyro = pd.concat(map(read_data, glob('{}/phone/gyro/*.txt'.format(data_path_local))))
pdf_watch_accel = pd.concat(map(read_data, glob('{}/watch/accel/*.txt'.format(data_path_local))))
pdf_watch_gyro = pd.concat(map(read_data, glob('{}/watch/gyro/*.txt'.format(data_path_local))))

#add device and sensor columns
pdf_phone_accel['Device'] = "phone"
pdf_phone_accel['Sensor'] = "accel"
pdf_phone_gyro['Device'] = "phone" 
pdf_phone_gyro['Sensor'] = "gyro" 
pdf_watch_accel['Device'] = "watch" 
pdf_watch_accel['Sensor'] = "accel" 
pdf_watch_gyro['Device'] = "watch"
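#NOTE: this looks like it should be "gyro"; it is left as "accel" because the merge later in this notebook joins on Sensor, and with differing values that inner join would return no rows
pdf_watch_gyro['Sensor'] = "accel"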

#pdf_watch = pd.concat([pdf_watch_accel,pdf_watch_gyro])
#pdf_phone = pd.concat([pdf_phone_accel,pdf_phone_gyro])
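The read-then-concatenate pattern can also be wrapped into a single function. This is a minimal sketch, not the notebook's own code: load_dir is a hypothetical helper, and the two demo files contain invented values rather than WISDM data.

```python
import os
import tempfile
from glob import glob

import pandas as pd

COLUMNS = ['Subject-id', 'ActivityLabel', 'Timestamp', 'x', 'y', 'z']

def load_dir(pattern, device, sensor):
    # Hypothetical helper: read every file matching the pattern, stack the
    # rows, and tag them with the device and sensor they came from.
    files = sorted(glob(pattern))
    frames = [pd.read_csv(f, names=COLUMNS) for f in files]
    pdf = pd.concat(frames, ignore_index=True)
    pdf['Device'] = device
    pdf['Sensor'] = sensor
    return pdf

# Demo on two tiny made-up files.
demo_dir = tempfile.mkdtemp()
for name, row in [("a.txt", "1,A,10,0.1,0.2,0.3\n"), ("b.txt", "2,B,20,0.4,0.5,0.6\n")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(row)

pdf_demo = load_dir(os.path.join(demo_dir, "*.txt"), "phone", "accel")
print(pdf_demo)
```

Unlike the cell above, this sketch passes ignore_index=True so the combined frame gets a fresh row index; the notebook's version keeps each file's own index, which is harmless here because the index is replaced by the timestamp in a later step.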

In [11]:
#fix the timestamps: the raw values are treated as nanoseconds, so dividing by 1e6 gives milliseconds
pdf_watch_accel['Timestamp'] = pd.to_datetime(pdf_watch_accel['Timestamp']/1000000,unit='ms')
pdf_watch_gyro['Timestamp'] = pd.to_datetime(pdf_watch_gyro['Timestamp']/1000000,unit='ms')
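As a quick check of the unit arithmetic, the sketch below converts a few made-up nanosecond values two ways: with the division used above, and by telling pandas the unit directly. It assumes, as the cell above does, that the raw values are nanosecond counts.

```python
import pandas as pd

raw_ns = pd.Series([0, 1_000_000, 60_000_000_000])  # made-up values in nanoseconds

# Same expression as the cell above: nanoseconds / 1e6 = milliseconds.
via_ms = pd.to_datetime(raw_ns / 1000000, unit='ms')

# Alternative that states the unit directly and avoids the float division.
via_ns = pd.to_datetime(raw_ns, unit='ns')

print(via_ms.tolist())
print((via_ms - via_ns).abs().max())
```

Both routes should land on the same instants for these values; passing unit='ns' keeps the values as integers, which may matter for very large timestamps.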

In [12]:
#set index to timestamp so we can resample later
pdf_watch_accel.set_index('Timestamp',inplace=True)
pdf_watch_gyro.set_index('Timestamp',inplace=True)

In [13]:
%sh head -n5 /dbfs/FileStore/tmp/wisdm/activity_key.txt

In [14]:
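#get activity descriptions: each non-empty line is split on ' = ' into a description and its label
activity_map = {}
with open('/dbfs/FileStore/tmp/wisdm/activity_key.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        if len(line) > 0:
            description, label = line.split(' = ')
            activity_map[label] = description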
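The parsing step can be tried without access to DBFS. In this sketch the three sample lines are an assumption about the file layout (description, then ' = ', then a one-letter label), and parse_activity_key is a hypothetical name for the same loop.

```python
sample_key = """walking = A
jogging = B

stairs = C
"""

def parse_activity_key(lines):
    # Map label -> description, skipping blank lines.
    mapping = {}
    for line in lines:
        line = line.rstrip()
        if line:
            description, label = line.split(' = ')
            mapping[label] = description
    return mapping

activity_map_demo = parse_activity_key(sample_key.splitlines())
print(activity_map_demo)  # {'A': 'walking', 'B': 'jogging', 'C': 'stairs'}
```

Keying the dictionary by label is what lets it be passed straight to Series.map on the ActivityLabel column later on.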

#Exploratory Analysis

In [16]:
#Run a pandas profiling report
import pandas_profiling
profile = pandas_profiling.ProfileReport(pdf_watch_accel.sample(1000))
displayHTML(profile.to_html())

Overview: dataset info
Statistic,Value
Number of variables,8
Number of observations,1000
Missing cells,0 (0.0%)
Duplicate rows,0 (0.0%)
Total size in memory,62.6 KiB
Average record size in memory,64.1 B

Overview: variable types
Type,Count
Numeric,4
Categorical,1
Boolean,0
Date,1
URL,0
Text (Unique),0
Rejected,2
Unsupported,0

Overview: warnings
Warning,Status
"Device has constant value ""watch""",Rejected
"Sensor has constant value ""accel""",Rejected

ActivityLabel (categorical): summary
Statistic,Value
Distinct count,18
Unique (%),1.8%
Missing (%),0.0%
Missing (n),0

ActivityLabel: top categories
Value,Count
H,68
O,66
B,65
Other values (15),801

ActivityLabel: common values
Value,Count,Frequency (%)
H,68,6.8%
O,66,6.6%
B,65,6.5%
E,65,6.5%
C,62,6.2%
A,60,6.0%
F,58,5.8%
L,57,5.7%
K,57,5.7%
I,54,5.4%

ActivityLabel: length and composition
Statistic,Value
Max length,1
Mean length,1
Min length,1
Contains chars,True
Contains digits,False
Contains spaces,False
Contains non-words,False

Device (constant, rejected)
Statistic,Value
Constant value,watch

Sensor (constant, rejected)
Statistic,Value
Constant value,accel

Subject-id (numeric): summary
Statistic,Value
Distinct count,51
Unique (%),5.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,1625.572
Minimum,1600
Maximum,1650
Zeros (%),0.0%

Subject-id: quantile statistics
Statistic,Value
Minimum,1600.0
5-th percentile,1602.95
Q1,1613.0
Median,1627.0
Q3,1638.0
95-th percentile,1647.0
Maximum,1650.0
Range,50.0
Interquartile range,25.0

Subject-id: descriptive statistics
Statistic,Value
Standard deviation,14.35502988
Coef of variation,0.008830756116
Kurtosis,-1.199003446
Mean,1625.572
MAD,12.466264
Skewness,-0.1107972398
Sum,1625572
Variance,206.0668829
Memory size,7.9 KiB

Subject-id: common values
Value,Count,Frequency (%)
1639,43,4.3%
1638,43,4.3%
1629,35,3.5%
1637,35,3.5%
1640,33,3.3%
1615,28,2.8%
1621,26,2.6%
1625,24,2.4%
1628,24,2.4%
1644,24,2.4%

Subject-id: minimum 5 values
Value,Count,Frequency (%)
1600,23,2.3%
1601,15,1.5%
1602,12,1.2%
1603,12,1.2%
1604,21,2.1%

Subject-id: maximum 5 values
Value,Count,Frequency (%)
1650,12,1.2%
1649,15,1.5%
1648,16,1.6%
1647,18,1.8%
1646,20,2.0%

Timestamp (date): summary
Statistic,Value
Distinct count,1000
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Minimum,1970-01-01 00:27:53.304080460
Maximum,1970-02-01 12:49:48.936085023

x (numeric): summary
Statistic,Value
Distinct count,995
Unique (%),99.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,0.5157551209
Minimum,-19.648615
Maximum,19.665974
Zeros (%),0.0%

x: quantile statistics
Statistic,Value
Minimum,-19.648615
5-th percentile,-9.2996624
Q1,-4.60300245
Median,-0.347291735
Q3,6.2087635
95-th percentile,12.07664605
Maximum,19.665974
Range,39.314589
Interquartile range,10.81176595

x: descriptive statistics
Statistic,Value
Standard deviation,7.129000658
Coef of variation,13.8224525
Kurtosis,-0.2525639758
Mean,0.5157551209
MAD,5.771963687
Skewness,0.2754689583
Sum,515.7551209
Variance,50.82265038
Memory size,7.9 KiB

x: common values
Value,Count,Frequency (%)
4.6492405,2,0.2%
10.044712,2,0.2%
-6.2438164,2,0.2%
8.627805,2,0.2%
-6.109695,2,0.2%
-0.6582558,1,0.1%
0.095768064,1,0.1%
8.59683,1,0.1%
4.630835,1,0.1%
-5.572205,1,0.1%

x: minimum 5 values
Value,Count,Frequency (%)
-19.648615,1,0.1%
-19.580978,1,0.1%
-18.883667,1,0.1%
-18.350359,1,0.1%
-17.567455,1,0.1%

x: maximum 5 values
Value,Count,Frequency (%)
19.665974,1,0.1%
19.656397,1,0.1%
19.639936,1,0.1%
19.610157,1,0.1%
19.60507,1,0.1%

y (numeric): summary
Statistic,Value
Distinct count,985
Unique (%),98.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,-4.679795137
Minimum,-33.077618
Maximum,20.034231
Zeros (%),0.0%

y: quantile statistics
Statistic,Value
Minimum,-33.077618
5-th percentile,-12.31133655
Q1,-8.158498
Median,-4.6190555
Q3,-1.677287875
95-th percentile,4.428756735
Maximum,20.034231
Range,53.111849
Interquartile range,6.481210125

y: descriptive statistics
Statistic,Value
Standard deviation,5.436455386
Coef of variation,-1.161686618
Kurtosis,2.447907985
Mean,-4.679795137
MAD,4.008827
Skewness,0.08277634973
Sum,-4679.795137
Variance,29.55504716
Memory size,7.9 KiB

y: common values
Value,Count,Frequency (%)
-19.404556,3,0.3%
-2.6312277,2,0.2%
-0.904709,2,0.2%
-6.0841155,2,0.2%
-8.1480665,2,0.2%
-19.648466,2,0.2%
-2.9770093,2,0.2%
-8.361013,2,0.2%
-2.6876411,2,0.2%
-2.7407625,2,0.2%

y: minimum 5 values
Value,Count,Frequency (%)
-33.077618,1,0.1%
-23.78973,1,0.1%
-20.719315,1,0.1%
-19.904345,1,0.1%
-19.868582,1,0.1%

y: maximum 5 values
Value,Count,Frequency (%)
20.034231,1,0.1%
18.18037,1,0.1%
16.002995,1,0.1%
15.562611,1,0.1%
12.886192,1,0.1%

z (numeric): summary
Statistic,Value
Distinct count,990
Unique (%),99.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,1.52482513
Minimum,-32.24415
Maximum,29.516222
Zeros (%),0.0%

z: quantile statistics
Statistic,Value
Minimum,-32.24415
5-th percentile,-7.275358185
Q1,-1.38504565
Median,1.5217396
Q3,5.264253
95-th percentile,8.93482485
Maximum,29.516222
Range,61.760372
Interquartile range,6.64929865

z: descriptive statistics
Statistic,Value
Standard deviation,5.251667402
Coef of variation,3.444111261
Kurtosis,2.913425432
Mean,1.52482513
MAD,4.03298594
Skewness,-0.5863417584
Sum,1524.82513
Variance,27.5800105
Memory size,7.9 KiB

z: common values
Value,Count,Frequency (%)
5.271438,2,0.2%
-5.518125,2,0.2%
-0.7807764,2,0.2%
5.0406923,2,0.2%
5.264253,2,0.2%
-0.32830492,2,0.2%
4.5074267,2,0.2%
5.4127445,2,0.2%
1.969979,2,0.2%
5.4010196,2,0.2%

z: minimum 5 values
Value,Count,Frequency (%)
-32.24415,1,0.1%
-19.666721,1,0.1%
-19.556288,1,0.1%
-18.269705,1,0.1%
-16.663643,1,0.1%

z: maximum 5 values
Value,Count,Frequency (%)
29.516222,1,0.1%
18.999636,1,0.1%
15.464298,1,0.1%
14.590265,1,0.1%
13.464691,1,0.1%

Sample: first 10 rows
Row,ActivityLabel,Device,Sensor,Subject-id,Timestamp,x,y,z
0,S,watch,accel,1623,1970-01-01 02:45:22.212484279,-5.013159,-3.079243,2.369212
1,R,watch,accel,1608,1970-01-09 07:50:49.345695805,0.489016,-11.146206,2.111087
2,E,watch,accel,1616,1970-01-02 05:38:07.432031271,7.704092,-5.397728,2.449717
3,E,watch,accel,1638,1970-01-14 03:03:32.187854000,1.607058,-9.127419,-1.434617
4,A,watch,accel,1646,1970-01-18 00:23:11.917292884,1.646912,-9.908105,-6.879589
5,C,watch,accel,1616,1970-01-02 05:13:16.481479588,-0.895282,2.504634,10.686669
6,M,watch,accel,1626,1970-01-05 07:52:00.805076296,9.411757,-3.453786,-1.276858
7,I,watch,accel,1638,1970-01-14 03:36:46.900368000,2.835703,-7.695198,4.957691
8,E,watch,accel,1619,1970-01-05 01:07:58.102155417,9.119365,-2.765453,-2.030732
9,B,watch,accel,1645,1970-01-14 03:02:07.414888120,14.058454,-13.438953,-7.869143

Sample: last 10 rows
Row,ActivityLabel,Device,Sensor,Subject-id,Timestamp,x,y,z
990,A,watch,accel,1602,1970-01-07 01:03:33.414259722,15.832706,-4.499004,1.767071
991,Q,watch,accel,1639,1970-01-03 01:49:01.220097000,-6.229446,-4.715794,5.271438
992,L,watch,accel,1636,1970-01-06 21:28:19.183620434,0.40043,4.470124,8.812907
993,F,watch,accel,1615,1970-01-01 21:54:37.690777778,-0.687286,-4.418649,8.701726
994,R,watch,accel,1624,1970-01-04 08:35:12.821054145,3.938462,-6.531083,-5.79307
995,K,watch,accel,1644,1970-01-22 01:19:20.415181407,-0.70599,-8.495975,4.761468
996,S,watch,accel,1621,1970-01-03 02:26:33.978842897,4.151396,-4.868011,1.03654
997,Q,watch,accel,1611,1970-01-01 01:01:19.545747725,-1.28015,-8.372523,3.849577
998,K,watch,accel,1639,1970-01-03 02:04:47.373132000,-7.098838,-3.151846,5.422325
999,A,watch,accel,1631,1970-01-16 08:55:11.890002750,9.71163,-3.251476,-0.807893


In [17]:
import seaborn as sns
import matplotlib.pyplot as plt
        
#unpivot the dataset and add activity description        
pdf_watch_melt = pd.melt(pdf_watch_accel,
                    id_vars=['Subject-id', 'ActivityLabel','Device','Sensor'],
                    value_vars=['x','y','z'])
pdf_watch_melt['ActivityDescription'] = pdf_watch_melt['ActivityLabel'].map(activity_map) 
        
plt.figure(figsize=(6,10))
sns.boxplot(x="value", y="ActivityDescription",
            hue="variable", 
            showfliers=False,
            data=pdf_watch_melt)
plt.tight_layout()
display()
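If the unpivot step is unfamiliar, the sketch below shows what pd.melt does on a two-row frame with invented values: each of the x, y, and z columns becomes its own set of rows, with the column name in variable and the reading in value. That long layout is what lets seaborn use variable as the hue.

```python
import pandas as pd

wide = pd.DataFrame({
    'ActivityLabel': ['A', 'B'],
    'x': [1.0, 2.0],
    'y': [3.0, 4.0],
    'z': [5.0, 6.0],
})

# One output row per (input row, value column) pair.
long = pd.melt(wide, id_vars=['ActivityLabel'], value_vars=['x', 'y', 'z'])
print(long)
```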

In [18]:
#resample the data into 10s averages of x, y, and z
pdf_watch_ac10s = pdf_watch_accel.groupby(['Subject-id','ActivityLabel','Sensor','Device']).resample('10s').mean().drop('Subject-id',axis=1).reset_index()

pdf_watch_gr10s = pdf_watch_gyro.groupby(['Subject-id','ActivityLabel','Sensor','Device']).resample('10s').mean().drop('Subject-id',axis=1).reset_index()
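The grouped resampling can be seen on a toy frame. This is a simplified sketch with one grouping column and one invented signal; it selects the value column explicitly before averaging, which sidesteps the question of what happens to the grouping columns and is a different spelling from the cell above.

```python
import pandas as pd

demo = pd.DataFrame(
    {'ActivityLabel': ['A', 'A', 'A', 'B'], 'x': [1.0, 3.0, 10.0, 7.0]},
    index=pd.to_datetime([0, 5, 12, 3], unit='s'),
)
demo.index.name = 'Timestamp'

# Within each activity, bucket rows into 10-second windows and average x.
avg10s = (demo.groupby('ActivityLabel')
              .resample('10s')['x']
              .mean()
              .reset_index())
print(avg10s)
```

Activity A has readings at 0s and 5s in the first window and one at 12s in the second, so it yields two rows; activity B yields one.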

In [19]:
#join accelerometer and gyroscope telemetry data
pdf_watch = pdf_watch_ac10s.merge(
  pdf_watch_gr10s,on=['Subject-id','ActivityLabel','Sensor','Device','Timestamp'],
                                  how='inner',
                                  suffixes=('_acc','_gr')
).dropna()
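The suffixes argument is what produces the x_acc and x_gr style column names used from here on. A minimal sketch with invented values, where window is a stand-in for the 10-second timestamp:

```python
import pandas as pd

acc = pd.DataFrame({'Subject-id': [1600, 1600], 'window': [0, 1], 'x': [0.1, 0.2]})
gyr = pd.DataFrame({'Subject-id': [1600, 1600], 'window': [1, 2], 'x': [9.0, 8.0]})

# Inner join keeps only key combinations present in both frames; the
# overlapping non-key column x is renamed using the suffixes.
joined = acc.merge(gyr, on=['Subject-id', 'window'], how='inner', suffixes=('_acc', '_gr'))
print(joined)
```

Only window 1 appears in both frames, so the result has a single row.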

In [20]:
#see how well the datapoints can be clustered together with K-Means
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import KMeans
sil = SilhouetteVisualizer(KMeans(5), colors='yellowbrick')
sil.fit(pdf_watch[['x_acc','y_acc', 'z_acc', 'x_gr', 'y_gr', 'z_gr']])
display(sil.show())
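The visualizer plots per-sample silhouette values; the same idea can be summarised as a single number with scikit-learn's silhouette_score. The sketch below uses synthetic, well-separated blobs rather than the WISDM features, so the scores it prints say nothing about this dataset; it only illustrates how the score reacts to the choice of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Two made-up, well-separated blobs in 2-D (not WISDM data).
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(20, 1, size=(50, 2))])

scores = {}
for k in (2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X, labels)
    print(k, scores[k])
```

With data like this, asking for more clusters than really exist tends to lower the score, which is the kind of signal to look for in the plot above.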

#Machine Learning

In [22]:
#split into train and test by subject, so no subject appears in both sets
train = pdf_watch[pdf_watch['Subject-id'] <= 1635]
test = pdf_watch[pdf_watch['Subject-id'] > 1635]
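Splitting on Subject-id rather than on random rows keeps every subject's windows on one side of the split, so the test set measures performance on people the model has not seen. A minimal sketch with invented rows that checks the two sides share no subject:

```python
import pandas as pd

demo = pd.DataFrame({
    'Subject-id': [1600, 1600, 1635, 1636, 1650],
    'x_acc': [0.1, 0.2, 0.3, 0.4, 0.5],
})

cutoff = 1635  # same threshold as in the cell above
train_demo = demo[demo['Subject-id'] <= cutoff]
test_demo = demo[demo['Subject-id'] > cutoff]

# No subject should appear on both sides of the split.
shared = set(train_demo['Subject-id']) & set(test_demo['Subject-id'])
print(len(train_demo), len(test_demo), shared)  # 3 2 set()
```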

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(pdf_watch['ActivityLabel'].unique())

model = RandomForestClassifier(n_estimators=100)
model.fit(train[['x_acc','y_acc', 'z_acc', 'x_gr', 'y_gr', 'z_gr']], enc.transform(train['ActivityLabel']))
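LabelEncoder maps each distinct label to an integer in sorted order and can map predictions back again. A small sketch with three stand-in labels:

```python
from sklearn.preprocessing import LabelEncoder

enc_demo = LabelEncoder()
enc_demo.fit(['C', 'A', 'B'])

codes = enc_demo.transform(['A', 'C', 'B', 'A'])
print(enc_demo.classes_.tolist())                  # ['A', 'B', 'C']
print(codes.tolist())                              # [0, 2, 1, 0]
print(enc_demo.inverse_transform(codes).tolist())  # ['A', 'C', 'B', 'A']
```

The classes_ attribute records which label each integer stands for, which is useful when labelling the rows and columns of a confusion matrix.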

In [24]:
from sklearn.metrics import classification_report

actual = enc.transform(test['ActivityLabel'])
predicted = model.predict(test[['x_acc','y_acc', 'z_acc', 'x_gr', 'y_gr', 'z_gr']])
accuracy = sum(actual == predicted) / actual.shape[0]

print("Accuracy: {} \n".format(accuracy))
print(classification_report(actual,predicted))

In [25]:
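from sklearn.metrics import confusion_matrix
#name rows and columns in the encoder's class order so the names line up with the matrix
names = [activity_map[c] for c in enc.classes_]
mat = pd.DataFrame(confusion_matrix(actual, predicted, labels=list(range(len(names)))),
                   columns=names, index=names)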

In [26]:
plt.figure(figsize = (12,7))
sns.heatmap(mat, cmap="Reds",annot=True, annot_kws={"size": 10}, fmt='d', linewidths=.5)
sns.despine()
plt.tight_layout()
display()

#Case for using Pandas
- Works well on smaller datasets that fit into memory
- Tightly integrated with the SciPy ecosystem
- Wide array of built-in data visualizations

#Case against Pandas
- Not suitable for Data Lakes:
  - No support for SQL
  - Reads one file at a time
  - Only works with structured data
  - No streaming analytics
- Not suitable for production workloads:
  - No schema enforcement
  - No true null by default (missing values are stored as NaN, which is not null)
- Does not scale:
  - Runs in a single process, largely on a single thread
  - Limited by how much data fits in RAM
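The points about nulls and schema enforcement can be seen in a few lines. This sketch shows default pandas behaviour on a made-up series: a missing entry silently turns an integer column into floats, and NaN does not compare equal to itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)            # float64: the integers were upcast to hold NaN
print(np.nan == np.nan)   # False: NaN does not compare equal to itself
print(s.isna().tolist())  # [False, False, True]
```

Nothing here raises an error or warning, which is the practical meaning of "no schema enforcement": the column's type changed because of the data that happened to arrive.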