# Human Activity Recognition with Smartphones

The Human Activity Recognition database was built from the recordings of 30 study participants performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The objective is to classify activities into one of the six activities performed.

See https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones for more details.

The dataset is also available on UCI Machine Learning Repository.

* Problem Type : Multi-Class Classification
* Algorithm  : SVM

# Import Library

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

# import SVC classifier
from sklearn.svm import SVC

# import metrics to evaluate the model
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV

# Load the Training DataSet

In [None]:
df_train = pd.read_csv("../input/human-activity-recognition-with-smartphones/train.csv")

# EDA

In [None]:
df_train.head()

In [None]:
df_train.tail()

In [None]:
df_train.shape

So the training set has 563 columns (561 features plus the `subject` identifier and the target) and 7352 rows, or data points.

The target is `Activity`. As mentioned in the data description, it has 6 unique values. Let's check them in a later step.

## Check for missing values in the dataset

In [None]:
df_train.isnull().values.any()

## Class Distribution

In [None]:
df_train["Activity"].unique()

In [None]:
pd.crosstab(index = df_train["Activity"],columns="count")

## Visualize the Class Distribution

In [None]:
plt.figure(figsize=(10,5))
ax = sns.countplot(x="Activity", data=df_train)
plt.xticks(rotation='vertical')  # rotate the existing tick labels; no tick values need to be passed
plt.show()

Class distribution looks good.

Next, we check the feature `subject`.

It only identifies the participant who carried out the experiment, so it is not useful as a predictor here. We can safely drop it.

In [None]:
df_train["subject"].unique()

In [None]:
X = pd.DataFrame(df_train.drop(['Activity','subject'],axis=1))
Y = df_train.Activity.values.astype(object)

X.shape, Y.shape

In [None]:
X.head()

In [None]:
Y[1]

## Check the data types of each feature

All 561 features are numeric (float64). Only the class label (Y) is non-numeric, so we need to apply a label encoder to convert it to numeric values.

In [None]:
X.info()

In [None]:
# Total number of numeric features in the training set
num_cols = X.select_dtypes(include="number").columns
print("Number of numeric features:",num_cols.size)

## Transforming non numerical labels into numerical labels

In [None]:
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()

In [None]:
# encoding train labels 
encoder.fit(Y)
y = encoder.transform(Y)
y.shape

In [None]:
y[1]

In [None]:
encoder.classes_

In [None]:
encoder.classes_[2]
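
The mapping that `LabelEncoder` builds can be sketched in plain Python: the distinct labels are sorted and numbered from 0. The label list below is a short illustrative sample, not the notebook's real `Y`, and the variable names are made up for this example.

```python
# Illustrative labels only; in the notebook these come from df_train["Activity"].
labels = ["STANDING", "SITTING", "LAYING", "WALKING", "SITTING", "LAYING"]

# LabelEncoder sorts the distinct labels and assigns 0, 1, 2, ... in that order.
classes = sorted(set(labels))
to_index = {name: i for i, name in enumerate(classes)}

encoded = [to_index[name] for name in labels]   # mirrors encoder.transform
decoded = [classes[i] for i in encoded]         # mirrors encoder.inverse_transform

print(classes)
print(encoded)
```

This is why `encoder.classes_[k]` returns the activity name for encoded label `k`.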

## Feature Scaling

In [None]:
# Scaling the feature 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
X = scaler.fit_transform(X)
X[1]
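
As a rough sketch of what standardization does to a single feature: "fit" learns the mean and the population standard deviation, and "transform" applies z = (x - mean) / std. The numbers and names below are hypothetical. The example learns the statistics from the training part only and reuses them for the validation part, which is one common way to keep validation data out of the fitting step; the cell above instead fits on all rows before splitting.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature, already split into two parts.
train_col = [1.0, 2.0, 3.0, 4.0]
valid_col = [2.5, 5.0]

# "fit": learn the mean and the population standard deviation (ddof=0).
mu = fmean(train_col)
sigma = pstdev(train_col)

def standardize(values):
    # "transform": apply z = (x - mu) / sigma with the fitted statistics.
    return [(x - mu) / sigma for x in values]

train_scaled = standardize(train_col)
valid_scaled = standardize(valid_col)   # reuses the training statistics

print(train_scaled)
print(valid_scaled)
```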

# Split X and y into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 99)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

# Train the Model

## Run SVM with default hyperparameters
Default hyperparameters mean C=1.0, kernel='rbf' and gamma='scale' (the default since scikit-learn 0.22; earlier versions used gamma='auto'), among other parameters.

In [None]:
# instantiate classifier with default hyperparameters
svc = SVC() 

In [None]:
# fit classifier to training set
svc.fit(X_train,y_train)

In [None]:
# make predictions on validation set
y_pred = svc.predict(X_valid)

In [None]:
# compute and print accuracy score
print('Model accuracy score with default hyperparameters: {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

Here, y_valid holds the true class labels and y_pred holds the predicted class labels for the validation set.

## Run SVM with rbf kernel and C=100.0

The parameter C controls the trade-off between a wide margin and fitting every training point. A higher C penalizes margin violations more heavily, so the model follows the training data, including any outliers, more closely. So, we might run SVM with kernel=rbf and C=100.0.

We will try several hyperparameter values.
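
For reference, the role of C can be read off the standard soft-margin SVM objective for a two-class problem. This is the textbook formulation rather than something taken from this notebook, and `SVC` combines such binary problems one-vs-one for multi-class data. Here $\xi_i$ is the slack of sample $i$ and $\phi$ is the feature map implied by the kernel.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

A large C makes slack expensive, so the decision boundary bends to classify more training points correctly. A small C tolerates more violations in exchange for a wider margin.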

In [None]:
# instantiate classifier with rbf kernel and C=100
svc = SVC(C=100.0) 


# fit classifier to training set
svc.fit(X_train,y_train)


# make predictions on validation set
y_pred = svc.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with rbf kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

We obtain a higher accuracy with C=100.0, which suggests that the default C=1.0 regularizes too strongly for this data.

Now, I will increase C further to 1000.0 and check the accuracy.

## Run SVM with rbf kernel and C=1000.0

In [None]:
# instantiate classifier with rbf kernel and C=1000
svc=SVC(C=1000.0) 


# fit classifier to training set
svc.fit(X_train,y_train)


# make predictions on validation set
y_pred=svc.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with rbf kernel and C=1000.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

In this case, the accuracy decreased with C=1000.0.

## Run SVM with linear kernel 
Run SVM with linear kernel and C=1.0

In [None]:
# instantiate classifier with linear kernel and C=1.0
linear_svc=SVC(kernel='linear', C=1.0) 


# fit classifier to training set
linear_svc.fit(X_train,y_train)


# make predictions on validation set
y_pred_test=linear_svc.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with linear kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred_test)))

## Run SVM with linear kernel and C=100.0

In [None]:
# instantiate classifier with linear kernel and C=100.0
linear_svc100=SVC(kernel='linear', C=100.0) 


# fit classifier to training set
linear_svc100.fit(X_train, y_train)


# make predictions on validation set
y_pred=linear_svc100.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with linear kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

## Run SVM with linear kernel and C=1000.0

In [None]:
# instantiate classifier with linear kernel and C=1000.0
linear_svc1000=SVC(kernel='linear', C=1000.0) 


# fit classifier to training set
linear_svc1000.fit(X_train, y_train)


# make predictions on validation set
y_pred=linear_svc1000.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with linear kernel and C=1000.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

We obtain higher accuracy with C=100.0 and C=1000.0 than with C=1.0.



## Compare the train-set and validation-set accuracy
Now, I will compare the train-set and validation-set accuracy to check for overfitting.

In [None]:
y_pred_train = linear_svc.predict(X_train)

y_pred_train

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

The training-set and validation-set accuracies are very much comparable.

# Check for overfitting and underfitting

In [None]:
print('Training set score: {:.4f}'.format(linear_svc.score(X_train, y_train)))

print('Validation set score: {:.4f}'.format(linear_svc.score(X_valid, y_valid)))

The training-set accuracy is 99.71% while the validation-set accuracy is 98.44%. These two values are close, so there is no sign of substantial overfitting.

# Compare model accuracy with null accuracy
So, the validation accuracy is about 0.98. But we cannot say that our model is very good based on accuracy alone. We must compare it with the null accuracy, which is the accuracy that could be achieved by always predicting the most frequent class.

So, we should first check the class distribution in the validation set.

In [None]:
# check class distribution in validation set

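# y_valid is a NumPy array, so wrap it in a Series to count the classes
pd.Series(encoder.inverse_transform(y_valid)).value_counts()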

We can see that the number of occurrences of the most frequent class --- is ---. So, we can calculate null accuracy by dividing --- by the total number of occurrences.
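
A minimal sketch of the null-accuracy calculation, using a short made-up label list in place of the real `y_valid` (the name `y_valid_demo` and its values are assumptions for illustration only):

```python
from collections import Counter

# Hypothetical encoded validation labels; in the notebook this would be y_valid.
y_valid_demo = [0, 2, 2, 1, 2, 0, 3, 2]

# Count each class and pick the most frequent one.
counts = Counter(y_valid_demo)
most_common_class, most_common_count = counts.most_common(1)[0]

# Null accuracy: share of samples that belong to the most frequent class.
null_accuracy = most_common_count / len(y_valid_demo)

print(f"Most frequent class: {most_common_class} ({most_common_count} samples)")
print(f"Null accuracy: {null_accuracy:.4f}")
```

With the notebook's array, `Counter(y_valid.tolist())` should give the real counts. With six roughly balanced classes, the null accuracy would be expected to sit far below the model accuracy.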

## Run SVM with polynomial kernel
Run SVM with polynomial kernel and C=1.0

In [None]:
# instantiate classifier with polynomial kernel and C=1.0
poly_svc=SVC(kernel='poly', C=1.0) 


# fit classifier to training set
poly_svc.fit(X_train,y_train)


# make predictions on validation set
y_pred=poly_svc.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with polynomial kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

## Run SVM with polynomial kernel and C=100.0

In [None]:
# instantiate classifier with polynomial kernel and C=100.0
poly_svc100=SVC(kernel='poly', C=100.0) 


# fit classifier to training set
poly_svc100.fit(X_train, y_train)


# make predictions on validation set
y_pred=poly_svc100.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with polynomial kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

The polynomial kernel gives poor performance. It may be overfitting the training set.

## Run SVM with sigmoid kernel
Run SVM with sigmoid kernel and C=1.0

In [None]:
# instantiate classifier with sigmoid kernel and C=1.0
sigmoid_svc=SVC(kernel='sigmoid', C=1.0) 


# fit classifier to training set
sigmoid_svc.fit(X_train,y_train)


# make predictions on validation set
y_pred=sigmoid_svc.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with sigmoid kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

## Run SVM with sigmoid kernel and C=100.0

In [None]:
# instantiate classifier with sigmoid kernel and C=100.0
sigmoid_svc100=SVC(kernel='sigmoid', C=100.0) 


# fit classifier to training set
sigmoid_svc100.fit(X_train,y_train)


# make predictions on validation set
y_pred=sigmoid_svc100.predict(X_valid)


# compute and print accuracy score
print('Model accuracy score with sigmoid kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_valid, y_pred)))

The sigmoid kernel also performs poorly, just like the polynomial kernel.
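
To make the four kernel choices concrete, here is a plain-Python sketch of the similarity each one computes for a pair of vectors, following the formulas given in the scikit-learn documentation. The helper names (`k_linear`, `k_poly`, `k_rbf`, `k_sigmoid`), the toy vectors and the gamma value are assumptions for illustration; `degree=3` and `coef0=0.0` mirror the `SVC` defaults.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_linear(u, v):
    # <u, v>
    return dot(u, v)

def k_poly(u, v, gamma, degree=3, coef0=0.0):
    # (gamma * <u, v> + coef0) ** degree
    return (gamma * dot(u, v) + coef0) ** degree

def k_rbf(u, v, gamma):
    # exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def k_sigmoid(u, v, gamma, coef0=0.0):
    # tanh(gamma * <u, v> + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)

# Toy vectors and gamma, chosen only for illustration.
u, v = [1.0, 0.0], [0.5, 0.5]
gamma = 0.5

print("linear :", k_linear(u, v))
print("poly   :", k_poly(u, v, gamma))
print("rbf    :", k_rbf(u, v, gamma))
print("sigmoid:", k_sigmoid(u, v, gamma))
```

The polynomial and sigmoid kernels also depend on `degree` and `coef0`, which were left at their defaults in the runs above, so their poor scores may partly reflect untuned settings.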

# Hyperparameter tuning using grid search and cross validation

In [None]:
# Create the parameter grid based on the manual experiments above
params_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

In [None]:
# Performing CV to tune parameters for best SVM fit 
svm_model = GridSearchCV(SVC(), params_grid, cv=5)
svm_model.fit(X_train, y_train)
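
To get a feel for how much work this search does, the grid can be expanded by hand. The sketch below copies `params_grid` from the cell above; the `expand` helper is a hypothetical stand-in written for this illustration, not part of scikit-learn.

```python
from itertools import product

params_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

def expand(grid):
    # Yield one dict per parameter combination, sub-grid by sub-grid.
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))

candidates = list(expand(params_grid))
n_folds = 5  # matches cv=5 in the GridSearchCV call

print(len(candidates), "candidate settings")
print(len(candidates) * n_folds, "cross-validation fits, plus one final refit")
```

Each candidate is trained once per fold, and with the default `refit=True` the best setting is then trained once more on all of `X_train`.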

## Confusion Matrix and Accuracy Score

In [None]:
# View the accuracy score
print('Best mean cross-validation score:', svm_model.best_score_,"\n") 

# View the best parameters for the model found using grid search
print('Best C:',svm_model.best_estimator_.C,"\n") 
print('Best Kernel:',svm_model.best_estimator_.kernel,"\n")
print('Best Gamma:',svm_model.best_estimator_.gamma,"\n")

final_model = svm_model.best_estimator_
Y_pred = final_model.predict(X_valid)
Y_pred_label = list(encoder.inverse_transform(Y_pred))

In [None]:
# Making the Confusion Matrix
#print(pd.crosstab(encoder.inverse_transform(y_valid), np.array(Y_pred_label), rownames=['Actual Activity'], colnames=['Predicted Activity']))
print(confusion_matrix(y_valid,Y_pred))
print("\n")
print(classification_report(y_valid,Y_pred))

print("Training set score for SVM: %f" % final_model.score(X_train , y_train))
print("Validation set score for SVM: %f" % final_model.score(X_valid  , y_valid ))

# svm_model.score
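
As a reading aid for the output above, here is a small hand-rolled confusion matrix on made-up labels (three classes, hypothetical values). Rows are true classes and columns are predicted classes, the same layout that `confusion_matrix` uses.

```python
# Hypothetical true and predicted class indices for three classes.
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 1, 2, 2, 0, 2]

n_classes = 3
matrix = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true_demo, y_pred_demo):
    matrix[t][p] += 1   # row = true class, column = predicted class

# Accuracy is the diagonal divided by the total number of samples.
accuracy = sum(matrix[k][k] for k in range(n_classes)) / len(y_true_demo)

# Recall divides each diagonal entry by its row sum, precision by its column sum.
recall = [matrix[k][k] / sum(matrix[k]) for k in range(n_classes)]
precision = [matrix[k][k] / sum(row[k] for row in matrix) for k in range(n_classes)]

for row in matrix:
    print(row)
print("accuracy :", round(accuracy, 4))
print("recall   :", recall)
print("precision:", precision)
```

The per-class precision and recall columns in `classification_report` are computed from the matrix in the same way.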

# Comments
We get the highest accuracy with the `rbf` and `linear` kernels at C=100.0, where the accuracy is about 99%. Based on the above analysis, we can conclude that the classification accuracy is very good: the model does a very good job of predicting the class labels.
