# Overview
The goal of this assignment is to use the data from devices to predict users manner. Based on the belt, forearm, arm, and dumbbell of 6 participants, we use the Support Vector Machine (SVM) classification model to predict testing data and generate the labels.

# Background  
Human Activity Recognition (HAR) has been recognized as a key research area and is gaining attention by the computing research community, especially for the development of context-aware systems. 
There are many potential applications for HAR, like: elderly monitoring, life log systems for monitoring energy expenditure and for supporting weight-loss programs, and digital assistants for weight lifting exercises.

#### First, we input the module for our later data anlysis

In [1]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize
from sklearn import linear_model

# Data Exploration  
We first use pandas do read .csv file and filter out the redundant cloumns in the file. There are 52 features left after filteration, which related to 'belt', 'forearm', 'arm', and 'dumbbell'. In order to finding the best model paramaters for SVM, we use GridSearchCV to go through different parameters combination. For the validation purpose, we seperate the training data to 80/20, which 80% of the data for model building 20% of data for validation.

In [2]:
#%% Load Training Data
df = pd.read_csv('pml-training.csv', sep=',', error_bad_lines=False) 
df_clean = df.dropna(axis=1, thresh = None)

# Data Extraction
idx = ['belt','arm','dumbbell']
df_clean_key = []
for i in idx:
    df_clean_key.append(df_clean.filter(like = i, axis=1))

df_data = pd.concat(df_clean_key,ignore_index=True,axis=1)

df_label=df_clean.loc[:,'classe']

# Transform the data valus and type to list
data_train = df_data.values
lbl_train_list = df_label.tolist()
# mapping = {'A':0,'B':1,'C',2,'D':3,'E':4}    
lbl_train = list(map(lambda x: ord(x)-ord('A'), lbl_train_list))
lbl_train = np.array(lbl_train) # numpy.array for later analysis

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
#%% Load Testing  Data
df = pd.read_csv('pml-testing.csv', sep=',', error_bad_lines=False) 
df_clean = df.dropna(axis=1, thresh = None)

idx = ['belt','arm','dumbbell']
df_clean_key = []
for i in idx:
    df_clean_key.append(df_clean.filter(like = i, axis=1))

df_data = pd.concat(df_clean_key,ignore_index=True,axis=1)

data_test = df_data.values

In [4]:
# Normalize
# data_train = normalize(data_train,norm = 'l1')

# Prediction Modeling
We use GridSearchCV to find the best model parameters with cross validation = 3.

In [6]:
# Find the best SVM model
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':['rbf'], 'C':[i for i in range(3,9)], 'tol':[1e-5,1e-6]}
svc = svm.SVC(gamma="scale")
clf_sr = GridSearchCV(svc, parameters, cv=3) # cv: cross validation
clf_sr.fit(data_train, lbl_train)
print(clf_sr.best_params_)
print(clf_sr.best_score_)
# sorted(clf_sr.cv_results_.keys())

{'C': 3, 'kernel': 'rbf', 'tol': 1e-05}
0.42416675160534095


#### We then shuffle our training data and split if to 80/20 for validate our model.

In [7]:
#%% shuffle data for cross validation
indices = np.arange(data_train.shape[0])
np.random.shuffle(indices)
print(indices)
data_train = data_train[indices]
lbl_train = lbl_train[indices]

train_set = data_train[:19622//100*80,:]
test_set = data_train[-19622*20//100:-1,:]
train_set_lbl = lbl_train[:19622//100*80]
test_set_lbl = lbl_train[-19622*20//100:-1]

[ 1884  3808  4230 ... 11031  5179  8293]


#### We first try the Linear SVM to see the prediction result if not the best model

In [8]:
#%% Linear SVM
clf = LinearSVC(penalty = 'l2',dual = False, tol = 1e-6, max_iter = 10000)
clf.fit(train_set, train_set_lbl) 
train_set_pred = clf.predict(train_set)
test_set_pred = clf.predict(test_set)
accuracy_train_set = sum(np.array(train_set_lbl==train_set_pred))/len(train_set_lbl)*100
accuracy_test_set = sum(np.array(test_set_lbl==test_set_pred))/len(test_set_lbl)*100
print('train accuracy = ',accuracy_train_set)
print('test accuracy = ', accuracy_test_set)

train accuracy =  73.41198979591836
test accuracy =  72.6809378185525


#### We then try different C, which control the soft margins.

In [14]:
#%% SVM with kernel: rbf, C = 5
clf5 = SVC(gamma='scale', decision_function_shape='ovo',kernel = 'rbf',tol = 1e-5,C=5)
clf5.fit(train_set, train_set_lbl) 
train_set_pred = clf5.predict(train_set)
test_set_pred = clf5.predict(test_set)
accuracy_train_set = sum(np.array(train_set_lbl==train_set_pred))/len(train_set_lbl)*100
accuracy_test_set = sum(np.array(test_set_lbl==test_set_pred))/len(test_set_lbl)*100
print('train accuracy = ',accuracy_train_set)
print('test accuracy = ', accuracy_test_set)

train accuracy =  99.9936224489796
test accuracy =  93.78185524974516


In [15]:
#%% SVM with kernel: rbf, C = 3
clf3 = SVC(gamma='scale', decision_function_shape='ovo',kernel = 'rbf',tol = 1e-5,C=3)
clf3.fit(train_set, train_set_lbl) 
train_set_pred = clf3.predict(train_set)
test_set_pred = clf3.predict(test_set)
accuracy_train_set = sum(np.array(train_set_lbl==train_set_pred))/len(train_set_lbl)*100
accuracy_test_set = sum(np.array(test_set_lbl==test_set_pred))/len(test_set_lbl)*100
print('train accuracy = ',accuracy_train_set)
print('test accuracy = ', accuracy_test_set)

train accuracy =  99.96811224489795
test accuracy =  93.75637104994902


From the above two results, we can see that the result is pretty much similar with slightly different.
For the further analysis, we can try take some features off and compare the result again. This can help us distinquish if there are some redundant features.

Here we try another model to see the faster prediction result, but with lower acacuracy. We may try different loss function to see the performance.

In [11]:
#%% SGD classifier (use as comparison)
clf = linear_model.SGDClassifier(max_iter=10000, tol=1e-6,loss = 'perceptron')
clf.fit(train_set, train_set_lbl)
train_set_pred = clf.predict(train_set)
test_set_pred = clf.predict(test_set)
accuracy_train_set = sum(np.array(train_set_lbl==train_set_pred))/len(train_set_lbl)*100
accuracy_test_set = sum(np.array(test_set_lbl==test_set_pred))/len(test_set_lbl)*100
print('train accuracy = ',accuracy_train_set)
print('test accuracy = ', accuracy_test_set)

train accuracy =  62.079081632653065
test accuracy =  62.181447502548416


# Model Application
#### Use the best model above (SVM with radio basis function(rbf) kernel and C = 3) to have our test result

In [16]:
test_pred_C3 = clf3.predict(data_test)
print(test_pred_C3)
test_pred_C5 = clf5.predict(data_test)
print(test_pred_C5)

[1 0 1 0 0 4 3 1 0 0 1 2 1 0 4 4 0 1 1 1]
[1 0 1 0 0 4 3 1 0 0 1 2 1 0 4 4 0 1 1 1]


#### We can see from above the prediction result is the same. Let turn our prediciton back to class A - E 

In [22]:
import string
chars = list(string.ascii_uppercase)
print('Testing Data label prediction result: \n',[chars[i] for i in test_pred_C5])
print('Testing Data label prediction result: \n',[chars[i] for i in test_pred_C3])

Testing Data label prediction result: 
 ['B', 'A', 'B', 'A', 'A', 'E', 'D', 'B', 'A', 'A', 'B', 'C', 'B', 'A', 'E', 'E', 'A', 'B', 'B', 'B']
Testing Data label prediction result: 
 ['B', 'A', 'B', 'A', 'A', 'E', 'D', 'B', 'A', 'A', 'B', 'C', 'B', 'A', 'E', 'E', 'A', 'B', 'B', 'B']


The results above are the prediction of label on testing data.