Human Activity Recognition with Smartphones

https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# load up the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# let's take a gander
display(train.head())

print(train.shape)
print(test.shape)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING


(7352, 563)
(2947, 563)


In [3]:
# Separate subject information
subject_training_data = train['subject']
subject_testing_data = test['subject']

# Separate labels
training_labels = train['Activity']
testing_labels = test['Activity']

# Drop labels and subject info from data
train = train.drop(['subject', 'Activity'], axis=1)
test = test.drop(['subject', 'Activity'], axis=1)

# Print some information about our data
print("Training data consists of {} instances of data with {} total features:\n{}".format(train.shape[0], train.shape[1], list(train.columns)))
print("Training data includes value counts of {}".format(training_labels.value_counts()))
print("Testing data consists of {} instances of data".format(test.shape[0]))
print("Testing data includes value counts of {}".format(testing_labels.value_counts()))

Training data consists of 7352 instances of data with 561 total features:
['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X', 'tBodyAcc-max()-Y', 'tBodyAcc-max()-Z', 'tBodyAcc-min()-X', 'tBodyAcc-min()-Y', 'tBodyAcc-min()-Z', 'tBodyAcc-sma()', 'tBodyAcc-energy()-X', 'tBodyAcc-energy()-Y', 'tBodyAcc-energy()-Z', 'tBodyAcc-iqr()-X', 'tBodyAcc-iqr()-Y', 'tBodyAcc-iqr()-Z', 'tBodyAcc-entropy()-X', 'tBodyAcc-entropy()-Y', 'tBodyAcc-entropy()-Z', 'tBodyAcc-arCoeff()-X,1', 'tBodyAcc-arCoeff()-X,2', 'tBodyAcc-arCoeff()-X,3', 'tBodyAcc-arCoeff()-X,4', 'tBodyAcc-arCoeff()-Y,1', 'tBodyAcc-arCoeff()-Y,2', 'tBodyAcc-arCoeff()-Y,3', 'tBodyAcc-arCoeff()-Y,4', 'tBodyAcc-arCoeff()-Z,1', 'tBodyAcc-arCoeff()-Z,2', 'tBodyAcc-arCoeff()-Z,3', 'tBodyAcc-arCoeff()-Z,4', 'tBodyAcc-correlation()-X,Y', 'tBodyAcc-correlation()-X,Z', 'tBodyAcc-correlation()-Y,Z', '

In [4]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
#from sklearn.manifold import TSNE

#scaler = MinMaxScaler()
#scaled_trainingdata = scaler.fit_transform(train)

# Encode our categorical labels into numerical target labels
le = LabelEncoder()
le = le.fit(["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS", "SITTING", "STANDING", "LAYING"])
enc_training_labels = le.transform(training_labels)
enc_testing_labels = le.transform(testing_labels)


#tsne = TSNE(init = 'pca')
#tsne_vis = tsne.fit_transform(scaled_trainingdata)
#plt.scatter(tsne_vis[:,0], tsne_vis[:,1], c=enc_training_labels)
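
A small sketch of what the encoder above does with the six activity names. LabelEncoder stores its classes in sorted order, so the integer codes follow alphabetical order rather than the order passed to fit. The names label_to_code and decoded are illustrative only and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

activities = ["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
              "SITTING", "STANDING", "LAYING"]

le = LabelEncoder().fit(activities)

# classes_ is sorted, so each integer code is the position in this sorted list
label_to_code = {str(name): code for code, name in enumerate(le.classes_)}
print(label_to_code)

# inverse_transform maps integer codes (e.g. model predictions) back to names
decoded = [str(name) for name in le.inverse_transform([2, 0])]
print(decoded)
```

Keeping the fitted encoder around makes it possible to report predictions as activity names instead of integers.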

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier 

#Let's try out some out-of-the-box classifiers and see how they perform
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
xt = ExtraTreesClassifier()
kn = KNeighborsClassifier()

def evaluateclf(clf):
    scores = cross_val_score(clf, train, enc_training_labels)
    avg = scores.mean()
    return "performances: {}, \nAverage: {}".format(scores, avg)

print("Decision Tree {}".format(evaluateclf(dt)))

print("Random Forest {}".format(evaluateclf(rf)))

print("Extra Trees {}".format(evaluateclf(xt)))

print("K Neighbors {}".format(evaluateclf(kn)))

Decision Tree performances: [ 0.84013051  0.8225938   0.81168301], 
Average: 0.824802437741
Random Forest performances: [ 0.89885808  0.88743883  0.90604575], 
Average: 0.897447550708
Extra Trees performances: [ 0.90130506  0.88336052  0.91013072], 
Average: 0.898265432691
K Neighbors performances: [ 0.8817292  0.8682708  0.9121732], 
Average: 0.887391067538
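
The cross-validation above splits rows without regard to which subject produced them, so windows from the same person can land in both the training and validation folds. Since the subject ids were set aside in an earlier cell, a group-aware split is one possible alternative. The sketch below uses synthetic stand-in data (X_demo, y_demo, subjects_demo are assumptions, not the notebook's variables); in the notebook the groups argument would presumably be subject_training_data.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-ins (assumptions): 300 windows, 20 features, 6 subjects, 3 classes
X_demo = rng.normal(size=(300, 20))
y_demo = rng.randint(0, 3, size=300)
subjects_demo = np.repeat(np.arange(6), 50)

cv = GroupKFold(n_splits=3)

# Each subject's rows fall entirely in the training part or the validation part
for train_idx, val_idx in cv.split(X_demo, y_demo, groups=subjects_demo):
    shared = set(subjects_demo[train_idx]) & set(subjects_demo[val_idx])
    assert not shared

clf = ExtraTreesClassifier(n_estimators=20, random_state=0)
group_scores = cross_val_score(clf, X_demo, y_demo, groups=subjects_demo, cv=cv)
print(group_scores, group_scores.mean())
```

Scores from a subject-wise split are often lower than row-wise scores, but they are closer to how the model would behave on a person it has never seen.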


In [6]:
from sklearn.model_selection import RandomizedSearchCV

# The Extra Trees (extremely randomized trees) classifier looks promising; let's tune some hyperparameters and see how much we can improve
parameters = {'n_estimators': np.arange(10,100,10), 'max_features': np.arange(10, 200, 10), 'min_samples_split': np.arange(2,10,2)}

randgrid = RandomizedSearchCV(xt, parameters, n_iter = 133, verbose = 3)

randgrid = randgrid.fit(train, enc_training_labels)

Fitting 3 folds for each of 133 candidates, totalling 399 fits
[CV] n_estimators=80, min_samples_split=6, max_features=20 ...........
[CV]  n_estimators=80, min_samples_split=6, max_features=20, score=0.929445 -   0.0s
[CV] n_estimators=80, min_samples_split=6, max_features=20 ...........


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV]  n_estimators=80, min_samples_split=6, max_features=20, score=0.893148 -   0.1s
[CV] n_estimators=80, min_samples_split=6, max_features=20 ...........


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.6s remaining:    0.0s


[CV]  n_estimators=80, min_samples_split=6, max_features=20, score=0.941585 -   0.0s
[CV] n_estimators=10, min_samples_split=6, max_features=30 ...........
[CV]  n_estimators=10, min_samples_split=6, max_features=30, score=0.907830 -   0.0s
[CV] n_estimators=10, min_samples_split=6, max_features=30 ...........
[CV]  n_estimators=10, min_samples_split=6, max_features=30, score=0.870718 -   0.0s
[CV] n_estimators=10, min_samples_split=6, max_features=30 ...........
[CV]  n_estimators=10, min_samples_split=6, max_features=30, score=0.919935 -   0.0s
[CV] n_estimators=90, min_samples_split=8, max_features=80 ...........
[CV]  n_estimators=90, min_samples_split=8, max_features=80, score=0.939233 -   0.0s
[CV] n_estimators=90, min_samples_split=8, max_features=80 ...........
[CV]  n_estimators=90, min_samples_split=8, max_features=80, score=0.880914 -   0.0s
[CV] n_estimators=90, min_samples_split=8, max_features=80 ...........
[CV]  n_estimators=90, min_samples_split=8, max_features=80, sco

[Parallel(n_jobs=1)]: Done 399 out of 399 | elapsed: 15.9min finished
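
As a quick sanity check on the search above, the sketch below counts how many combinations the parameter dictionary defines and how much of that space 133 random draws cover. The parameter ranges are copied from the cell above; grid_size, cv_folds and coverage are illustrative names.

```python
import numpy as np

parameters = {
    'n_estimators': np.arange(10, 100, 10),
    'max_features': np.arange(10, 200, 10),
    'min_samples_split': np.arange(2, 10, 2),
}

# Total number of distinct combinations in the search space
grid_size = int(np.prod([len(values) for values in parameters.values()]))
print(grid_size)  # 684

n_iter = 133   # candidates sampled by RandomizedSearchCV
cv_folds = 3   # folds reported in the log above

# Number of model fits, matching "totalling 399 fits" in the log
total_fits = n_iter * cv_folds
print(total_fits)  # 399

# Fraction of the full grid that the random search visits
coverage = n_iter / float(grid_size)
print(round(coverage, 3))
```

So the search evaluates roughly a fifth of the full grid, which is the trade-off a randomized search makes against an exhaustive one.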


In [7]:
print(randgrid.best_estimator_)
print(randgrid.best_score_)

# Take the estimator with the best 3-fold cross-validation score and fit it on the full training set
xt = randgrid.best_estimator_
xt.fit(train, enc_training_labels)

# Check the performance of the tuned and trained model on the testing set
print("Testing score for extra random trees is {:.4f}".format(xt.score(test, enc_testing_labels)))

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features=50, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=8, min_weight_fraction_leaf=0.0,
           n_estimators=70, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
0.925870511425




Testing score for extra random trees is 0.9301
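
A single accuracy figure does not show which activities get confused with each other. One way to look closer is a confusion matrix with per-class recall, sketched below. The y_true and y_pred arrays are made-up values for illustration only, not results from this notebook; in the notebook they would presumably be enc_testing_labels and xt.predict(test).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Class names in the sorted order used by LabelEncoder
class_names = ["LAYING", "SITTING", "STANDING",
               "WALKING", "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS"]

# Hypothetical true and predicted codes (illustrative only)
y_true = np.array([0, 1, 1, 2, 2, 3, 4, 5])
y_pred = np.array([0, 1, 2, 2, 1, 3, 4, 5])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(class_names)))
print(cm)

# Recall per class: correct predictions divided by the row total
per_class_recall = cm.diagonal() / cm.sum(axis=1).astype(float)
for name, recall in zip(class_names, per_class_recall):
    print("{:<20s} recall = {:.2f}".format(name, recall))
```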


In [8]:
from keras.utils.np_utils import to_categorical

# Now let's experiment with a neural network to classify this data and see if we can improve our accuracy even further
# First we need to encode our targets as one-hot label vectors
oh_training_labels = to_categorical(enc_training_labels)
oh_testing_labels = to_categorical(enc_testing_labels)

Using Theano backend.
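
For readers unfamiliar with one-hot encoding, the sketch below reproduces the idea with plain NumPy: each integer code becomes a row with a single 1 in the column of that class. The one_hot helper is a hypothetical illustration and is not how Keras implements to_categorical internally.

```python
import numpy as np

def one_hot(codes, n_classes):
    """Return an array of shape (len(codes), n_classes) with one 1 per row."""
    codes = np.asarray(codes)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(n_classes)[codes]

demo = one_hot([2, 0, 5], n_classes=6)
print(demo)

# argmax along the class axis recovers the original integer codes
recovered = demo.argmax(axis=1)
print(recovered)
```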


In [29]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

# Build a network for this classification task
model = Sequential()
model.add(Dense(36, input_dim = train.shape[1], activation = 'tanh'))
model.add(Dropout(0.5))
model.add(Dense(22, activation = 'tanh'))
model.add(Dropout(0.4))
model.add(Dense(6, activation = 'softmax'))

sgd = SGD(lr = .1, momentum = .1, decay = 1e-4)
model.compile(optimizer = sgd, loss = 'categorical_crossentropy', metrics = ['accuracy'])

model.fit(train.values, oh_training_labels, nb_epoch = 133, batch_size = 50, verbose = 2,
          validation_split = .15, shuffle=True)

Train on 6249 samples, validate on 1103 samples
Epoch 1/133
0s - loss: 1.0265 - acc: 0.5489 - val_loss: 0.4431 - val_acc: 0.8205
Epoch 2/133
0s - loss: 0.5349 - acc: 0.7638 - val_loss: 0.2924 - val_acc: 0.8849
Epoch 3/133
0s - loss: 0.4474 - acc: 0.8024 - val_loss: 0.3192 - val_acc: 0.8667
Epoch 4/133
0s - loss: 0.3854 - acc: 0.8307 - val_loss: 0.2321 - val_acc: 0.9021
Epoch 5/133
0s - loss: 0.3509 - acc: 0.8480 - val_loss: 0.2054 - val_acc: 0.9139
Epoch 6/133
0s - loss: 0.3218 - acc: 0.8637 - val_loss: 0.1861 - val_acc: 0.9202
Epoch 7/133
0s - loss: 0.2974 - acc: 0.8781 - val_loss: 0.1860 - val_acc: 0.9112
Epoch 8/133
0s - loss: 0.2962 - acc: 0.8739 - val_loss: 0.1795 - val_acc: 0.9157
Epoch 9/133
0s - loss: 0.2617 - acc: 0.8934 - val_loss: 0.2053 - val_acc: 0.9048
Epoch 10/133
0s - loss: 0.2451 - acc: 0.8997 - val_loss: 0.1805 - val_acc: 0.9175
Epoch 11/133
0s - loss: 0.2453 - acc: 0.9040 - val_loss: 0.2558 - val_acc: 0.8994
Epoch 12/133
0s - loss: 0.2436 - acc: 0.9000 - val_loss: 0.

<keras.callbacks.History at 0x1d106b70>

In [None]:
# Test set performance
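
The cell above is empty in the original. One possible way to score the network on the test set is to compare the argmax of its softmax outputs with the argmax of the one-hot labels, sketched below with made-up arrays. The helper accuracy_from_probabilities and the demo arrays are assumptions for illustration; in the notebook the probabilities would presumably come from model.predict(test.values) and the labels from oh_testing_labels.

```python
import numpy as np

def accuracy_from_probabilities(probs, one_hot_labels):
    """Fraction of rows where the most probable class matches the true class."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Hypothetical softmax outputs for 4 samples and 3 classes (illustrative only)
probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4],
                       [0.6, 0.3, 0.1]])
labels_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 1, 0]])

acc = accuracy_from_probabilities(probs_demo, labels_demo)
print("Test accuracy: {:.4f}".format(acc))
```

Computing accuracy this way makes it directly comparable with the score reported for the tuned Extra Trees model.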