# CROSS VALIDATION ON GESTURE PHASE DATA IN PYTHON

The dataset is composed of features extracted from 1 of 7 videos of people gesticulating, with the aim of studying Gesture Phase Segmentation. More at https://archive.ics.uci.edu/ml/datasets/Gesture+Phase+Segmentation .

The dataset contains 18 numeric attributes (double), a timestamp and a class attribute (nominal). The 18 features loaded below are the x, y and z positions of the hands, wrists, head and spine of the user in each frame. The velocity and acceleration of the hands and wrists, which the dataset description also mentions, do not appear among these columns.

The task here is to classify each observation into the appropriate gesture phase and then tune the hyperparameters using cross-validation.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

With the libraries imported, let's load the dataset and shuffle its rows randomly.

In [2]:
gesture = pd.read_csv('gesture.csv')

In [3]:
gesture = gesture.sample(frac=1).reset_index(drop=True)
gesture.head(20)

,lhx,lhy,lhz,rhx,rhy,rhz,hx,hy,hz,sx,sy,sz,lwx,lwy,lwz,rwx,rwy,rwz,timestamp,phase
0,4.374785,4.862335,1.487255,5.57188,3.699457,1.463631,5.013357,1.647419,1.756658,5.055002,4.289006,1.734413,4.25392,4.684481,1.546359,5.536563,3.537047,1.452075,5755441,Stroke
1,4.397175,5.343486,1.543894,6.463796,3.804019,1.422329,4.698179,1.666746,1.767339,4.989073,4.302614,1.741924,4.287542,5.064166,1.577469,6.458109,3.838278,1.412729,5731526,Stroke
2,4.593478,4.297289,1.496404,5.975338,2.936082,1.408245,5.03016,1.656722,1.735953,5.070876,4.26241,1.747062,4.493125,4.338934,1.508733,5.952941,2.840218,1.401616,5711558,Stroke
3,4.472016,3.046018,1.48435,5.865152,4.11085,1.382707,5.100109,1.697934,1.722947,5.143596,4.286234,1.735683,4.610734,2.61881,1.490633,5.89959,4.114076,1.401826,5712853,Stroke
4,5.787846,4.347365,1.489267,5.600442,4.34487,1.523706,5.571479,1.673162,1.760309,5.537751,4.26748,1.759097,5.446428,4.298337,1.557798,5.525572,4.320976,1.533866,5706363,Rest
5,5.183446,4.283505,1.47161,4.582945,4.310018,1.496711,5.041199,1.636649,1.7699,5.040886,4.262479,1.742013,4.847561,4.244131,1.533032,4.97226,4.265435,1.54367,5800448,Rest
6,4.704253,5.002657,1.490003,4.736167,5.05258,1.46773,5.171966,1.664425,1.765566,5.092784,4.321531,1.733817,4.363448,4.947539,1.542017,4.9038,4.95816,1.488358,5736814,Rest
7,5.032917,4.917639,1.488403,6.538051,3.216248,1.418388,5.048034,1.658013,1.772019,5.284384,4.289791,1.751189,4.948143,4.856595,1.504246,6.471444,3.361984,1.411637,5733585,Stroke
8,3.960567,3.879969,1.500739,6.082852,3.847628,1.471514,5.114834,1.709278,1.733524,5.080438,4.328213,1.737525,4.106085,3.838933,1.456843,6.024591,3.792006,1.440216,5724615,Stroke
9,4.58466,4.337426,1.471784,5.742022,4.078259,1.465616,5.226921,1.752029,1.710091,5.204073,4.302849,1.725498,4.526713,4.377347,1.500223,5.556025,3.636419,1.421782,5714101,Rest


In [4]:
gesture.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1747 entries, 0 to 1746
Data columns (total 20 columns):
lhx          1747 non-null float64
lhy          1747 non-null float64
lhz          1747 non-null float64
rhx          1747 non-null float64
rhy          1747 non-null float64
rhz          1747 non-null float64
hx           1747 non-null float64
hy           1747 non-null float64
hz           1747 non-null float64
sx           1747 non-null float64
sy           1747 non-null float64
sz           1747 non-null float64
lwx          1747 non-null float64
lwy          1747 non-null float64
lwz          1747 non-null float64
rwx          1747 non-null float64
rwy          1747 non-null float64
rwz          1747 non-null float64
timestamp    1747 non-null int64
phase        1747 non-null object
dtypes: float64(18), int64(1), object(1)
memory usage: 273.0+ KB


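Observe that the dataset has 1747 rows and 20 columns. We use the columns at positions 0 to 17 (the 18 position coordinates) as the feature matrix X and the column at position 19 ('phase') as the class/output y. The 'timestamp' column at position 18 is left out.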

In [5]:
X = gesture.iloc[:, 0:18].values
y = gesture.iloc[:,19].values
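Selecting by position works, but it silently breaks if the column order changes. As a minimal sketch, the same selection can be written with column names. The tiny frame below is a made-up stand-in that uses only a few of the column names from `gesture.info()`; `demo`, `X_named` and `y_named` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in frame with a subset of the real column names
demo = pd.DataFrame({
    'lhx': [4.37, 5.79],
    'lhy': [4.86, 4.35],
    'timestamp': [5755441, 5706363],
    'phase': ['Stroke', 'Rest'],
})

# Features: everything except the timestamp and the class label
X_named = demo.drop(columns=['timestamp', 'phase']).values
# Target: the class label column
y_named = demo['phase'].values

print(X_named.shape, y_named)
```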

## Data Preprocessing

Now standardize the features so that each one has zero mean and unit variance.

In [6]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
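To make the scaling step concrete, the sketch below checks on a toy matrix (made-up values, not taken from the gesture data) that `StandardScaler` subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))               # the two results agree
print(scaled.mean(axis=0), scaled.std(axis=0))   # approximately 0 and 1 per column
```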

We also need to encode the categorical output variable, y.

In [7]:
# Encoding Categorical output
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
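As a small illustration of what the encoder does, the sketch below uses only the two phase labels visible in the sample rows above. `LabelEncoder` assigns integer codes in sorted label order, and `inverse_transform` maps the codes back to the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A handful of labels as they appear in the 'phase' column
labels = np.array(['Stroke', 'Rest', 'Stroke', 'Rest', 'Rest'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(encoder.classes_)                   # unique labels in sorted order
print(codes)                              # integer code of each label
print(encoder.inverse_transform(codes))   # back to the original strings
```

Keeping the fitted encoder around is useful later, for example to turn predicted integer codes back into phase names.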

In [8]:
# Split the data into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
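Note that in the cells above the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic stand-in data from `make_classification` (an assumption, not the gesture data), is to split first and put the scaler and the classifier in a `Pipeline`, so that the scaler is fitted on the training rows only. The names `X_demo`, `y_demo` and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 18 features and 3 classes (not the gesture data)
X_demo, y_demo = make_classification(n_samples=300, n_features=18,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Split first; stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

# The scaler is fitted inside the pipeline, on the training rows only
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can also be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted on the training part of every fold.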

## Building a classifier

Let's use the well-known SVM classifier on the data.

In [9]:
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)

Now print the accuracy on the test set.

In [10]:
from sklearn import metrics
# accuracy_score expects the true labels first, then the predictions
print('The accuracy of the svm', metrics.accuracy_score(y_test, y_pred))

The accuracy of the svm 0.872380952381
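Accuracy alone does not show which classes get confused with each other. As a hedged sketch on hand-written toy labels (not the notebook's predictions), a confusion matrix breaks the same information down per class. Rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true labels and predictions for three classes (made-up values)
y_true = [0, 0, 1, 1, 1, 2]
y_hat = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_hat)   # fraction of correct predictions: 4 of 6
cm = confusion_matrix(y_true, y_hat)  # cm[i, j] = samples of class i predicted as j

print(acc)
print(cm)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the total number of samples equals the accuracy.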


Now apply k-fold cross-validation with k = 10 on the training set and compute the mean accuracy.

In [11]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()

0.87319972011195524

The mean cross-validated accuracy (87.3%) is very close to the accuracy on the test set (87.2%).
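To show what `cross_val_score` does under the hood, here is a minimal sketch that writes the 10-fold loop by hand. It assumes synthetic stand-in data from `make_classification` rather than the gesture data. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the manual loop uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the gesture dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=18,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

svm = SVC(kernel='rbf', random_state=0)
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(svm)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call performs the same fit/score loop
library_scores = cross_val_score(svm, X_demo, y_demo, cv=10)

print(manual_scores.mean(), library_scores.mean())
```

Each sample is used for validation exactly once, and the reported figure is the mean of the ten per-fold accuracies.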

Now apply Grid Search to find the hyperparameter values ('C', 'kernel' and 'gamma') that give the best cross-validated accuracy.

In [12]:
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)

For every parameter combination, the score is the mean of the ten fold accuracies. The best of these means is available through the `best_score_` attribute, and the corresponding parameters through `best_params_`.

In [13]:
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

In [14]:
best_accuracy

0.94108019639934537

In [15]:
best_parameters

{'C': 10, 'gamma': 0.4, 'kernel': 'rbf'}
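The best score reported above is a cross-validated accuracy on the training data, so it is not directly comparable to the earlier test-set accuracy. A possible follow-up, sketched here on synthetic stand-in data with a deliberately small grid (both are assumptions, not the notebook's setup), is to score the refitted best model on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the gesture dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=18,
                                     n_informative=6, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# Small illustrative grid; the values are arbitrary
param_grid = {'C': [1, 10], 'gamma': [0.01, 0.1], 'kernel': ['rbf']}
search = GridSearchCV(SVC(random_state=0), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, the best model is retrained on all of X_tr
held_out_accuracy = search.score(X_te, y_te)

print(search.best_params_)
print(search.best_score_)    # mean CV accuracy on the training data
print(held_out_accuracy)     # accuracy on data not used in the search
```

Reporting the held-out figure next to `best_score_` gives a less optimistic estimate, because the test rows played no part in choosing the parameters.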

## Conclusion
1. The SVM classifier has an accuracy of 87.2% on the test set. The mean accuracy determined using 10-fold CV on the training set is 87.3%.
2. The best cross-validated accuracy found by GridSearch is 94.1%, and the best parameters are:
    {'C': 10, 'gamma': 0.4, 'kernel': 'rbf'}