## Introduction
The objective of this project is to classify the activities of a subject based on the readings of inertial sensors embedded in a waist-mounted smartphone. The original dataset was taken from the Kaggle website, where it is published under the name "Human Activity Recognition with Smartphones".

## Experiment Description
The dataset was collected from 30 volunteers who performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone with an embedded accelerometer and gyroscope. The sensors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The dataset description also specifies that the sensor signals were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec with 50% overlap (128 readings/window).
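
The window arithmetic described above can be checked with a short sketch. The `sliding_windows` helper and the synthetic signal are illustrative assumptions, not part of the dataset's actual preprocessing code; only the 50 Hz rate, the 2.56 sec width and the 50% overlap come from the dataset description.

```python
import numpy as np

SAMPLE_RATE_HZ = 50     # sampling rate stated in the dataset description
WINDOW_SECONDS = 2.56   # window width stated in the dataset description
OVERLAP = 0.5           # 50% overlap between consecutive windows

window_size = int(round(SAMPLE_RATE_HZ * WINDOW_SECONDS))  # readings per window
step = int(window_size * (1 - OVERLAP))                    # shift between windows


def sliding_windows(signal, size, step):
    """Split a 1-D signal into fixed-width, overlapping windows (hypothetical helper)."""
    starts = range(0, len(signal) - size + 1, step)
    return np.array([signal[s:s + size] for s in starts])


# Synthetic 10-second recording of a single sensor axis.
signal = np.arange(10 * SAMPLE_RATE_HZ, dtype=float)
windows = sliding_windows(signal, window_size, step)

print(window_size, step, windows.shape)
```

With 50% overlap, each reading (away from the edges of the recording) ends up in two consecutive windows, so neighbouring rows of the feature table are not independent of each other.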

## Attribute Information
-->Triaxial acceleration from accelerometer                         
-->Triaxial angular velocity from gyroscope                         
-->A 561-feature vector with time and frequency domain variables     
-->Its activity label                                               
-->An identifier of the subject who carried out the experiment

#### Let's start by importing relevant libraries and datasets

In [65]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

train=pd.read_csv('train data Human Activity Recognition with Smartphones.csv')
test=pd.read_csv('test data Human Activity Recognition with Smartphones.csv')

Let's take a look at the first ten rows of the training dataset

In [90]:
train.head(10)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
3900,0.217424,-0.0227,-0.106453,-0.975585,-0.950248,-0.98311,-0.976929,-0.967475,-0.985013,-0.943272,...,-0.625542,-0.047591,0.294363,0.371649,0.603734,0.822001,-0.470098,-0.501469,19,LAYING
3249,0.272708,-0.01702,-0.116223,-0.995603,-0.983865,-0.987185,-0.995962,-0.981904,-0.98612,-0.942257,...,-0.834985,0.062786,0.116185,0.539752,0.794508,-0.667936,0.219258,0.219253,17,STANDING
3133,0.234235,-0.010039,-0.099005,-0.987626,-0.992428,-0.994599,-0.987766,-0.992376,-0.994516,-0.944877,...,-0.829668,-0.009214,0.057465,0.009576,-0.550627,0.500488,0.423776,-0.779215,16,LAYING
4748,0.276778,-0.015506,-0.11229,-0.99486,-0.990906,-0.960534,-0.995527,-0.991223,-0.956639,-0.939066,...,-0.358486,-0.006207,0.06588,0.430711,-0.348322,-0.876278,0.057382,0.099142,23,SITTING
4332,0.262955,0.010117,-0.031551,0.187343,0.199283,0.024492,0.14577,0.144106,-0.012666,0.532282,...,-0.400709,0.267329,-0.806729,0.525317,-0.547566,-0.633226,0.319721,0.14166,21,WALKING_DOWNSTAIRS
2334,0.199453,-0.059948,-0.09176,-0.052589,0.365935,0.440688,-0.204273,0.150769,0.341375,0.450127,...,-0.822773,0.274326,-0.70322,0.499672,-0.669544,-0.57203,0.257497,0.27284,14,WALKING_DOWNSTAIRS
1011,0.274678,-0.023188,-0.124038,-0.986829,-0.896585,-0.949241,-0.98864,-0.905424,-0.949307,-0.924255,...,-0.559741,0.036589,-0.208449,-0.622251,-0.746405,-0.723051,0.275768,-0.054909,6,STANDING
2432,0.285042,-0.014604,-0.098389,-0.981022,-0.941819,-0.969614,-0.983827,-0.941344,-0.970939,-0.91556,...,-0.512113,0.013388,0.28503,-0.373395,-0.199334,-0.572743,0.176423,0.313674,14,SITTING
5187,0.231026,-0.018501,-0.085397,-0.647512,-0.226758,-0.464132,-0.684126,-0.211773,-0.465463,-0.526001,...,-0.302043,0.398029,0.035812,0.766658,-0.813525,-0.698809,0.174313,-0.183912,25,WALKING
4948,0.247617,-0.025411,-0.098561,-0.985436,-0.993529,-0.986809,-0.98616,-0.992939,-0.985661,-0.938357,...,-0.517001,-0.034842,0.692982,0.378813,0.19542,0.548939,-0.247938,-0.767568,23,LAYING


Looking at the shapes of training and test datasets

In [67]:
train.shape

(7352, 563)

In [68]:
test.shape

(2947, 563)

Before we begin, let's shuffle our training and test datasets to avoid any ordering bias or patterns and to make sure that our models remain general.

In [72]:
train=shuffle(train)
test=shuffle(test)
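
The shuffle above is unseeded, so the row order changes on every run. A minimal sketch of a reproducible variant follows, assuming a toy frame in place of the real data; `random_state` is a standard argument of `sklearn.utils.shuffle`.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4],
    "Activity": ["A", "B", "A", "B"],
})

first = shuffle(toy, random_state=0)
second = shuffle(toy, random_state=0)

# Same seed gives the same row order; each row keeps its feature/label pairing.
print(first.index.tolist() == second.index.tolist())
```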

Next, we check whether there are any missing or null values, which could affect our model performance.

In [73]:
print(train.isnull().any().any())
print(test.isnull().any().any())


False
False


Since there are no missing values in our training and test datasets, no imputation or row removal is needed.

Next, we separate the label column (Activity) from the features in the training and test sets.

In [74]:

trainData=train.drop('Activity',axis=1).values
trainLabel=train.Activity.values

testData=test.drop('Activity',axis=1).values
testLabel=test.Activity.values
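
The table printed earlier shows two columns that are not sensor features: `Unnamed: 0` (which looks like a saved row index) and `subject`. The cell above keeps both in the feature matrix. The sketch below shows one way to exclude them, using a hypothetical miniature frame; it is a variant, not what this notebook ran, and the scores reported later would likely change if it were applied.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [3900, 3249],
    "tBodyAcc-mean()-X": [0.217424, 0.272708],
    "subject": [19, 17],
    "Activity": ["LAYING", "STANDING"],
})

non_feature_cols = ["Unnamed: 0", "subject", "Activity"]

# errors="ignore" skips any listed column that is absent from the frame.
features = toy.drop(columns=non_feature_cols, errors="ignore")
labels = toy["Activity"].values

print(features.columns.tolist())
```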

Since Random Forests are easy to tune and less prone to overfitting than single decision trees, we will first build a Random Forest Classifier to see how well it does in terms of both predictive performance and training time.

For background on the gini and entropy split criteria available in RandomForestClassifier, see https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
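
As a rough illustration of the two criteria, the sketch below computes Gini impurity and entropy for a node's class proportions. The helper functions are illustrative and are not scikit-learn's internal implementation. In scikit-learn the choice is made through the `criterion` argument of `RandomForestClassifier`, whose default is `'gini'`, so the forest trained below uses Gini impurity.

```python
import numpy as np


def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)


def entropy(p):
    """Shannon entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with zero proportion contribute nothing
    return -np.sum(p * np.log2(p))


pure = [1.0, 0.0]    # node containing a single class
mixed = [0.5, 0.5]   # node split evenly between two classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why they usually lead to similar trees.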

In [76]:

forest=RandomForestClassifier(n_estimators=100,
                              n_jobs=-1,
                              random_state=0)

start=time.time()
forest.fit(trainData,trainLabel)
end=time.time()

print('Time Taken by Random Forest Classifier :',(end-start),'sec')

forest_prediction=forest.predict(testData)
print("F1-score for Random Forest Classifier :",f1_score(testLabel,forest_prediction,average='micro'))
print("Accuracy score for Random Forest Classifier :",accuracy_score(testLabel,forest_prediction))
print('Classification Report :',classification_report(testLabel,forest_prediction,))

Time Taken by Random Forest Classifier : 3.6924359798431396 sec
F1-score for Random Forest Classifier : 0.924669155073
Accuracy score for Random Forest Classifier : 0.924669155073
Classification Report :                     precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.91      0.88      0.89       491
          STANDING       0.89      0.92      0.90       532
           WALKING       0.90      0.97      0.93       496
WALKING_DOWNSTAIRS       0.96      0.85      0.90       420
  WALKING_UPSTAIRS       0.90      0.92      0.91       471

       avg / total       0.93      0.92      0.92      2947
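
The F1-score and accuracy printed above are identical to every digit. This is expected: with `average='micro'`, true positives, false positives and false negatives are pooled over all classes, and for single-label multiclass predictions that pooled F1 equals plain accuracy. A small sketch with made-up labels (not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels, used only to illustrate the relationship.
y_true = ["SITTING", "STANDING", "WALKING", "SITTING", "LAYING", "WALKING"]
y_pred = ["SITTING", "SITTING", "WALKING", "STANDING", "LAYING", "WALKING"]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

# micro-F1 matches accuracy; macro-F1 averages the per-class F1 scores instead.
print(micro, acc, macro)
```

If a second, independent summary is wanted, `average='macro'` or `average='weighted'` would give a number that can differ from accuracy.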



The Random Forest Classifier gave us an F1-score and accuracy of about 92%. Let's see if we can increase the scores using a GradientBoosting Classifier, which builds an additive ensemble of trees in which each new tree is fitted to correct the errors of the trees before it. Since GradientBoosting has many hyperparameters that can be tuned to reach its full potential, we will employ a Grid Search to find good values for them. But before we step into hyperparameter tuning, let's take a look at a baseline model that uses the default parameters (apart from max_features='sqrt'), and then proceed by tuning the hyperparameters.

In [77]:

param_grid={'learning_rate': [0.1],
            'n_estimators':[100],
            'max_depth': [3],
            'min_samples_leaf': [1],
            'max_features':['sqrt']
           }


grad_boost_GS_base = GridSearchCV(GradientBoostingClassifier(random_state=0),
                                  param_grid=param_grid,
                                  cv=5,scoring='accuracy',
                                  verbose=1,n_jobs=-1)

start=time.time()
grad_boost_GS_base.fit(trainData,trainLabel)
end=time.time()

print('Training Complete\n')
print('Time Taken :',(end-start),'sec')


grad_boost_predictions_base=grad_boost_GS_base.predict(testData)

print("F1-score for Gradient Boosting Classifier :",f1_score(testLabel,grad_boost_predictions_base,average='micro'))
print("Accuracy score for Gradient Boosting Classifier :",accuracy_score(testLabel,grad_boost_predictions_base))
print('Classification Report :',classification_report(testLabel,grad_boost_predictions_base))

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   24.2s finished


Training Complete

Time Taken : 33.964617013931274 sec
F1-score for Gradient Boosting Classifier : 0.941974889718
Accuracy score for Gradient Boosting Classifier : 0.941974889718
Classification Report :                     precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.93      0.89      0.91       491
          STANDING       0.91      0.94      0.92       532
           WALKING       0.94      0.97      0.95       496
WALKING_DOWNSTAIRS       0.97      0.91      0.94       420
  WALKING_UPSTAIRS       0.91      0.94      0.92       471

       avg / total       0.94      0.94      0.94      2947



Here we find that the baseline model for GradientBoosting Classifier has achieved an accuracy and f1-score of ~94.2%. Let's tune some hyperparameters to see if we can push our model performance even further.

We will follow the general approach for parameter tuning advised in this tutorial (https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/). We first tune the boosting parameters (learning rate and number of trees), followed by the tree-specific parameters.

First, let's find a good learning rate and the number of trees that works best with it.

In [78]:
param_grid={'learning_rate': [0.1,0.2,0.3],
            'n_estimators':[100,150,200],
            'max_depth': [3],
            'min_samples_leaf': [1],
            'max_features':['sqrt']
           }


grad_boost_GS = GridSearchCV(GradientBoostingClassifier(random_state=0),
                             param_grid=param_grid,
                             cv=5,scoring='accuracy',
                             verbose=1,n_jobs=-1)
start=time.time()
grad_boost_GS.fit(trainData,trainLabel)
end=time.time()

print('Training Complete\n')
print('Time Taken :',(end-start),'sec')


grad_boost_predictions=grad_boost_GS.predict(testData)

print("F1-score for Gradient Boosting Classifier :",f1_score(testLabel,grad_boost_predictions,average='micro'))
print("Accuracy score for Gradient Boosting Classifier :",accuracy_score(testLabel,grad_boost_predictions))
print('Classification Report :',classification_report(testLabel,grad_boost_predictions))

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  2.9min finished


Training Complete

Time Taken : 186.42142868041992 sec
F1-score for Gradient Boosting Classifier : 0.947404139803
Accuracy score for Gradient Boosting Classifier : 0.947404139803
Classification Report :                     precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.94      0.88      0.91       491
          STANDING       0.90      0.94      0.92       532
           WALKING       0.94      0.97      0.96       496
WALKING_DOWNSTAIRS       0.98      0.92      0.95       420
  WALKING_UPSTAIRS       0.93      0.95      0.94       471

       avg / total       0.95      0.95      0.95      2947



In [80]:
grad_boost_GS.best_params_,grad_boost_GS.best_score_

({'learning_rate': 0.2,
  'max_depth': 3,
  'max_features': 'sqrt',
  'min_samples_leaf': 1,
  'n_estimators': 150},
 0.99265505984766045)

After a few iterations of the step above, we found 0.2 to be the best learning rate, with 150 trees. We can also see that test accuracy rose to ~94.7%, which is about 0.5 percentage points higher than the baseline model. The next step is to tune the tree-specific parameters. We will try tuning max_depth, min_samples_split and min_samples_leaf and see if we get any further performance gain.
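
The cost of a grid search grows multiplicatively with each parameter added. As a quick sketch, `ParameterGrid` from scikit-learn can be used to count the candidates for the tree-parameter grid in the next cell before committing to a long run:

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as the tree-parameter grid searched in the next cell.
param_grid = {
    "max_depth": list(range(3, 10, 3)),              # 3, 6, 9
    "min_samples_leaf": list(range(10, 51, 10)),     # 10, 20, 30, 40, 50
    "min_samples_split": list(range(100, 801, 200)), # 100, 300, 500, 700
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # cv=5 in the GridSearchCV call

print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

Sixty candidates with five folds each means 300 model fits, which is why this step takes far longer than the previous ones.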

In [82]:
param_grid={'max_depth': range(3,10,3),
            'min_samples_leaf': range(10,51,10),
            'min_samples_split':range(100,801,200)
           }


grad_boost_GS = GridSearchCV(GradientBoostingClassifier(learning_rate=0.2,
                                                        n_estimators=150,
                                                        max_features='sqrt',
                                                        random_state=0),
                             param_grid=param_grid,
                             cv=5,scoring='accuracy',
                             verbose=1,n_jobs=-1)

start=time.time()
grad_boost_GS.fit(trainData,trainLabel)
end=time.time()

print('Training Complete\n')
print('Time Taken :',(end-start),'sec')


grad_boost_predictions=grad_boost_GS.predict(testData)

print("F1-score for Gradient Boosting Classifier :",f1_score(testLabel,grad_boost_predictions,average='micro'))
print("Accuracy score for Gradient Boosting Classifier :",accuracy_score(testLabel,grad_boost_predictions))
print('Classification Report :',classification_report(testLabel,grad_boost_predictions))

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 19.8min finished


Training Complete

Time Taken : 1200.3514626026154 sec
F1-score for Gradient Boosting Classifier : 0.945707499152
Accuracy score for Gradient Boosting Classifier : 0.945707499152
Classification Report :                     precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.96      0.89      0.92       491
          STANDING       0.91      0.96      0.94       532
           WALKING       0.94      0.95      0.94       496
WALKING_DOWNSTAIRS       0.97      0.91      0.94       420
  WALKING_UPSTAIRS       0.90      0.94      0.92       471

       avg / total       0.95      0.95      0.95      2947



In [83]:
grad_boost_GS.best_params_,grad_boost_GS.best_score_

({'max_depth': 9, 'min_samples_leaf': 50, 'min_samples_split': 500},
 0.99415125136017413)

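The test accuracy of the model above (~94.6%) did not improve on the previous step (~94.7%), even though the cross-validated score rose slightly (0.9942 vs 0.9927). This may be due to overfitting of our GradientBoosting model.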
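
One possible reason for the gap between the cross-validated score (~0.99) and the test accuracy (~0.95) is how the folds are formed. If, as in the original release of this dataset, the test set contains volunteers who do not appear in the training set, then a shuffled 5-fold split is easier than the real test, because windows from the same person land in both the training and validation folds. The sketch below shows subject-wise folds with `GroupKFold` on synthetic stand-in data; it was not run on the actual dataset, and the array shapes and subject ids are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in: 6 hypothetical subjects, 20 windows each, 4 features.
subjects = np.repeat(np.arange(6), 20)
X = rng.normal(size=(120, 4))
y = rng.randint(0, 2, size=120)

# GroupKFold keeps all windows of a given subject inside a single fold.
cv = GroupKFold(n_splits=3)
for train_idx, val_idx in cv.split(X, y, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[val_idx])

scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    X, y, groups=subjects, cv=cv, scoring="accuracy",
)
print(scores.shape)
```

On the real data, the `subject` column could play the role of `subjects` here, which would make the cross-validated score a closer proxy for accuracy on unseen people.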
Finally, let's try one more model, an SVM, to see whether it can do any better than the models studied above.

In [88]:
from sklearn.svm import SVC

param_grid={'kernel':['linear','rbf'],
           'C':[0.1,1,10]}

svm_gs=GridSearchCV(SVC(random_state=0),
                    param_grid=param_grid,
                    scoring='accuracy',
                    verbose=1,n_jobs=-1)

start=time.time()
svm_gs.fit(trainData,trainLabel)
end=time.time()

svm_gs_predictions=svm_gs.predict(testData)


print("F1-score for SVM Classifier :",f1_score(testLabel,svm_gs_predictions,average='micro'))
print("Accuracy score for SVM Classifier :",accuracy_score(testLabel,svm_gs_predictions))
print('Classification Report :',classification_report(testLabel,svm_gs_predictions))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  1.8min finished


F1-score for SVM Classifier : 0.963352561927
Accuracy score for SVM Classifier : 0.963352561927
Classification Report :                     precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.96      0.89      0.92       491
          STANDING       0.90      0.97      0.94       532
           WALKING       0.95      1.00      0.97       496
WALKING_DOWNSTAIRS       0.99      0.98      0.98       420
  WALKING_UPSTAIRS       0.98      0.95      0.97       471

       avg / total       0.96      0.96      0.96      2947



We see that the SVM performed noticeably better than the RandomForest and GradientBoosting classifiers, with an accuracy and F1-score of more than 96%. The combination of a linear kernel with C=1 gave the best result in this case.
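
The winning kernel and C value are not printed in the output above. A minimal sketch of how they could be inspected after a grid search follows; it uses the iris dataset as stand-in data and a hypothetical `toy_gs` object, so its results say nothing about the activity data. On the fitted `svm_gs` from this notebook, the same `best_params_` and `cv_results_` attributes would apply.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data (iris), used only so that the sketch runs on its own.
X, y = load_iris(return_X_y=True)

toy_gs = GridSearchCV(
    SVC(random_state=0),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    cv=3,
    scoring="accuracy",
)
toy_gs.fit(X, y)

# One row per candidate, ordered by mean cross-validated accuracy.
results = pd.DataFrame(toy_gs.cv_results_)[
    ["param_kernel", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(toy_gs.best_params_)
print(results)
```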