Linear Regression Machine Learning Algorithm to Recognise Gestures

In [245]:
# Import relevant libraries to be used
# import pandas as pd : not relevant to a numpy only solution attempt
import numpy as np 
import matplotlib.pyplot as plt
import os

Data Preparation

Initially, I downloaded the dataset file and opened all the .rar files to take a cursory look at their structure. All the folders are indexed by user and by day/repetition. This is not relevant for our purposes and only serves to keep the data organised. The files inside them, however, are named and indexed by gesture and repetition. The gesture index is something we have to extract, as it will be the target label for the ML model. The files contain space-separated x, y, z readings from an accelerometer in a quasi time-series format. These readings will be the training features of our ML model.
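
As a rough illustration of the layout described above, the sketch below parses one made-up recording and pulls a gesture code out of a file name. The file name `Sample_Recording3-2.txt` and the three readings are hypothetical and chosen only for the example; the slicing expression is the one used later in this notebook.

```python
import io
import numpy as np

# Hypothetical file name and readings, made up for illustration only
file_name = "Sample_Recording3-2.txt"
raw_text = "0.1 -0.2 0.9\n0.0 -0.1 1.0\n0.2 -0.3 0.8\n"

# Each line holds one space-separated x, y, z reading
readings = np.loadtxt(io.StringIO(raw_text))

# Flatten the (time steps, 3) array into a single feature row
feature_row = readings.reshape(1, -1)

# Last character before the '-' in the final '_'-separated segment
gesture_code = int(file_name.split('_')[-1].split('-')[0][-1])

print(readings.shape, feature_row.shape, gesture_code)
```

Note that taking only the last character assumes single-digit gesture indices.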

In [246]:
#Reach data in nested directory

root_dir = '.' # Root directory where files are located is our current folder

file_paths = [] # Empty List to store all filepaths

#Recursively get all files from current working directory
for root, dirs, files in os.walk(root_dir):
    for file in files:
        if ('.txt' in file) and ('readme' not in file.lower()) and (len(file) >= 7): # Keep .txt files, but skip the Readme.txt file and the parameter files, whose names are typically shorter than 7 characters
            file_paths.append(os.path.join(root, file)) #Append filepaths to empty list

print(len(file_paths))

4481


In [268]:
#Use numpy to arrange data

ges_arrs = [] #List to store numpy array from text files
ges_codes = [] #List to store the corresponding Gesture Code from path name

for file_path in file_paths:
    ges_arr = np.loadtxt(file_path).reshape(1,-1) # Load the values from a text file and flatten them into a single row, so each recording becomes one row of features
    gesture_code= file_path.split('_')[-1].split('-')[0][-1] # The gesture code only appears in the file name, which follows the schema <gesture index>-<repetition index>. Slice the path string to get the gesture code, which becomes the target label
    
    ges_arrs.append(ges_arr)
    ges_codes.append(int(gesture_code))

max_arr_length = max(arr.shape[1] for arr in ges_arrs) # Largest number of columns among all the arrays

#Pad the shorter arrays with extra columns so that all rows have the same length and can be stacked into one numpy array
ges_arrs_padded = [] # Empty List in which to append padded arrays
for arr in ges_arrs:
    ges_arr_padded = np.pad(arr, pad_width=((0,0),(0, max_arr_length - arr.shape[1])), mode='constant', constant_values= 0 ) # constant_values defaults to 0 in np.pad, so the argument could be left out
    ges_arrs_padded.append(ges_arr_padded)

comb_ges_arr = np.vstack(ges_arrs_padded)

ges_codes_in_np = np.vstack(ges_codes)

if ges_codes_in_np.shape[0] == comb_ges_arr.shape[0]:
    print('Matrixes are conform')

# Normalize dataset to reduce noise or bias and shuffle samples in rows to prevent user bias in ML Model
# Shuffle Dataset
combined_dataset = np.hstack((comb_ges_arr, ges_codes_in_np)) # Combine dataset

shuffled_dataset = np.random.permutation(combined_dataset) # Create shuffled copy

# Split into X and y
comb_ges_arr = shuffled_dataset[:, :-1] # X: every column except the last
ges_codes_in_np = shuffled_dataset[:, -1].reshape(-1,1) # y

# Normalize Dataset
mean_ds = np.mean(comb_ges_arr, axis=0) # Mean of each column (feature), taken over all rows
std_ds = np.std(comb_ges_arr, axis = 0) # Standard deviation of each column (feature)

X_normed = (comb_ges_arr - mean_ds) / std_ds # Normalized X







Matrixes are conform
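
The normalization above divides by the per-column standard deviation, which could produce NaN or inf if a column happens to be constant (for example, a padded column that is zero in every row). The sketch below is one possible guard, not the notebook's method: the helper name `standardize` and the threshold `eps` are assumptions, and it also computes the statistics from the training rows only, which is a common alternative to using the whole dataset.

```python
import numpy as np

def standardize(X_train, X_valid, eps=1e-8):
    """Scale both sets with the training mean and std (hypothetical helper)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Replace (near-)zero standard deviations by 1 to avoid dividing by zero
    std_safe = np.where(std < eps, 1.0, std)
    return (X_train - mean) / std_safe, (X_valid - mean) / std_safe

# Toy data with one constant column, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
X[:, 3] = 0.0

X_tr, X_va = standardize(X[:8], X[8:])
print(np.isfinite(X_tr).all(), np.isfinite(X_va).all())
```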


Prepare for a Logistic Regression Algorithm application

Now that the data has been prepared, and assuming the dataset itself is complete and correct, it is time to set up the logistic regression in numpy.

The mathematics of linear regression is simple enough for classification problems. First, set up the equation with a randomly assigned weight for each feature. Add a column of constant values to account for the bias, or intercept. With the closed-form solution, we can get the weights and use them to identify gestures. It will be useful to set aside a few hundred rows of sample data to validate the trained model. I will also attempt to add optimisation and regularisation techniques to bring the performance closer to the library levels I'm used to in sklearn. It is a novel challenge, as I have never written my own implementation of the LogReg algorithm from scratch before!
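
To make the closed-form idea concrete, here is a minimal sketch on made-up toy data rather than the gesture dataset. It assumes one-hot targets and uses `np.linalg.lstsq` to obtain the least-squares weights; the cluster centers, sample counts, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the gesture data: 3 classes, 2 features (made up)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels = rng.integers(0, 3, size=150)
X = centers[labels] + rng.normal(scale=0.5, size=(150, 2))

# Prepend a column of ones so the first weight row acts as the intercept
X_b = np.hstack((np.ones((X.shape[0], 1)), X))

# One-hot targets: one column per class
Y = np.eye(3)[labels]

# Hold out the last 20% of rows for validation
cutoff = int(0.8 * X_b.shape[0])
X_tr, X_va = X_b[:cutoff], X_b[cutoff:]
Y_tr, labels_va = Y[:cutoff], labels[cutoff:]

# Least-squares weights, computed without forming an explicit inverse
W, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)

# Predict the class whose column has the largest score
pred = np.argmax(X_va @ W, axis=1)
val_accuracy = np.mean(pred == labels_va)
print(f'Validation accuracy: {val_accuracy:.2f}')
```

Regressing on one-hot columns and taking the argmax is only one way to use linear regression for classification; regressing directly on the integer gesture code, as done further down, treats the codes as ordered quantities.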

In [273]:
#Make Training Data Set of sample size ratio 8:2
index_cutoff= int(np.round(comb_ges_arr.shape[0] * 0.8)) # Cutoff index 8:2

#Segregate Samples for Training 

X_train_0 = comb_ges_arr[:index_cutoff,:] # Shuffled Dataset X
y_train = ges_codes_in_np[:index_cutoff,:] # Shuffled Dataset y

# Shuffled and Normalized Dataset; y_train is unchanged
X_train_0_norm = X_normed[:index_cutoff,:]


#Add a column of constant value, in this case 1, for line intercept considerations
X_train = np.hstack( ((np.ones((X_train_0.shape[0], 1), dtype=int) ), X_train_0))

#Make Validation Data Set of Sample Size of ratio 8:2

# Shuffled validation sets
X_valid_0 = comb_ges_arr[index_cutoff:,:] 
y_valid = ges_codes_in_np[index_cutoff:,:]

#Shuffled and normalized set
X_valid_0_norm = X_normed[index_cutoff:,:]

#Add a column of constant value, in this case 1, for line intercept considerations
X_valid = np.hstack( ((np.ones((X_valid_0.shape[0], 1), dtype=int) ), X_valid_0))

#Set up weights for matrix
np.random.seed(13)
weights = np.random.randn(X_train.shape[1]).reshape(-1,1)

if X_train.shape[1] == weights.shape[0]:
    print('Matrixes can be multiplied')

Matrixes can be multiplied


Setup a Logistic Regression Algorithm

In [274]:
# Define an extended sigmoid function for multi-class classification purposes
# Taken from ML Script for TU Darmstadt and Springer Text Book on Numerical Analysis

def sigmoid_softmax(scores):
    '''The Softmax function is an extension of the Sigmoid function.
    It is commonly used in logistic regression and extends
    binary classification to multiple classes.
    In multi-class classification algorithms, it takes a vector of
    scores and transforms them into a probability distribution over the classes.
    The predicted class is the one with the highest probability
    in the output.
    
    Input: array of real-valued scores, one row per sample
    Output: probability distribution over the classes for each row'''

    # Subtract each row's maximum before exponentiating for numerical stability
    exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))

    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    return probs
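
A quick check of the two properties the function relies on may help: every output row sums to 1, and subtracting the row maximum does not change the result while keeping large scores from overflowing. The sketch below re-defines the function under the shorter, assumed name `softmax` so that it runs on its own; the score values are made up.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the max-subtraction trick (illustrative copy)."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Second row equals the first plus a constant offset of 999
scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = softmax(scores)

print(probs.sum(axis=1))          # each row sums to 1
print(np.argmax(probs, axis=1))   # index of the most likely class per row
```

Because softmax only depends on differences between scores, both rows yield the same distribution.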


In [275]:
# Define the loss function
# Taken from Springer Text Book on Numerical Analysis

def cross_entropy (y_predicted, y):
    '''The Cross Entropy Loss function is used in logistic regression for multi-class classification problems.
    It measures the deviation between predicted probabilities and true labels.
    
    The goal of our regression is to minimize the loss represented by this function.
    This helps to assign each input sample to its correct class.'''

    loss = -np.mean(np.sum(y * np.log(y_predicted), axis = 1))

    return loss
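
The loss above multiplies `y` element-wise with the log-probabilities, so it expects `y` to be one-hot encoded (one column per class) rather than a single column of gesture codes. The sketch below shows one possible way to build such a matrix; the helper names `one_hot` and `cross_entropy_clipped` are assumptions, and the clipping by `eps` is an added safeguard against `log(0)` that is not part of the function above.

```python
import numpy as np

def one_hot(labels):
    """Map arbitrary label codes to a one-hot matrix (hypothetical helper)."""
    classes, idx = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    return np.eye(classes.size)[idx], classes

def cross_entropy_clipped(y_predicted, y, eps=1e-12):
    """Cross entropy with probabilities clipped away from zero."""
    clipped = np.clip(y_predicted, eps, 1.0)
    return -np.mean(np.sum(y * np.log(clipped), axis=1))

# Made-up label column in the same (n, 1) layout as the notebook's targets
labels = np.array([[1.0], [3.0], [2.0], [3.0]])
Y, classes = one_hot(labels)

uniform = np.full(Y.shape, 1.0 / Y.shape[1])
print(Y.shape, classes)
print(cross_entropy_clipped(uniform, Y))  # equals log(number of classes)
```

A uniform prediction gives a loss of log(K) for K classes, which is a useful baseline when reading the training log.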

In [276]:
# Setup Gradient Descent Algorithm to update and calculate weights
# Taken from documentation in sklearn library

def gradient_descent (X, y, learning_rate, iter):
    '''Implement the gradient descent algorithm to initialize and update the weights of our model'''
    
    num_samples, num_features = X.shape 
    num_classes = y.shape[1] # y is expected to be one-hot encoded, one column per class

    # Initialize weights
    weight = np.zeros((num_features,num_classes))

    # Iterate to update the weights
    for i in range(iter):
        scores = np.dot(X, weight) # Matrix product; X @ weight is equivalent
        probability_dist = sigmoid_softmax(scores)
        gradient = np.dot(X.T, probability_dist - y) / num_samples
        weight = weight - learning_rate*gradient

        # Print progress
        if (i+1)%50 == 0:
            loss = cross_entropy(probability_dist, y)
            print(f'Iteration {i+1}, Loss: {loss}')
    
    return weight
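
The gradient used in the loop above, `X.T @ (probabilities - y) / num_samples`, can be sanity-checked against a finite-difference approximation of the cross-entropy loss. The following is a sketch on random toy data; all function names and the step size `h` are assumptions made for this check.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def loss_fn(W, X, Y):
    # Mean cross entropy of softmax(X @ W) against one-hot Y
    return -np.mean(np.sum(Y * np.log(softmax(X @ W)), axis=1))

def analytic_grad(W, X, Y):
    # Same formula as in the gradient descent loop
    return X.T @ (softmax(X @ W) - Y) / X.shape[0]

def numeric_grad(W, X, Y, h=1e-6):
    # Central differences, one weight entry at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (loss_fn(W_plus, X, Y) - loss_fn(W_minus, X, Y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y = np.eye(3)[rng.integers(0, 3, size=20)]
W = rng.normal(scale=0.1, size=(4, 3))

max_diff = np.max(np.abs(analytic_grad(W, X, Y) - numeric_grad(W, X, Y)))
print(max_diff)
```

A very small maximum difference suggests the analytic formula and the loss agree.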
    
        



In [277]:
# Setup prediction function
# Self implemented from context

def prediction(X, weight):

    '''Take X and learned weight from gradient descent to give a prediction on y'''
    scores = np.dot(X,weight)
    probability_dist = sigmoid_softmax(scores)
    
    # Return indices from probability
    return np.argmax(probability_dist, axis=1)



In [278]:
# Setup accuracy function
# Self implemented from context

def accuracy(y_predicted, y):
    '''Calculate the accuracy of our predictions against the true values'''

    acc_pred = (y_predicted == np.argmax(y, axis=1))
    acc_pred_mean = np.mean(acc_pred)

    print(f'Accuracy is {acc_pred_mean}')

    return acc_pred, acc_pred_mean



In [None]:
# Set hyperparameters in a dictionary
hyperparameters = {
    'Learning Rate' : 0.1,
    'Number of Iterations': 1000
}


In [254]:
# Setup learning parameters
learn_rate = 0.009
iter = 10
a = 0.015

# Define a Loss Function
def loss_mse(y_1,y_2):
    # Mean squared error between two arrays of the same shape
    return np.mean((y_1-y_2)**2)

After a bit of consideration, I decided to look up sources in some Kaggle notebooks and to consult friends and StackExchange to see how I could add a regularization term to constrain my weights and tell the algorithm whether it is doing a good job. Without it, I keep getting fractional and negative values, which is normal, and I can still apply abs and int to get interpretable values. But I still want to go up to higher learning rates and iteration counts and get a solid basis for prediction without such modifications.

The regularization method I chose is L2 (ridge), as it looked the easiest to implement at first glance. In future, it is worth testing the individual regularizers to see the subtleties of how they perform.
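
For reference, the update in the following loop can be read as gradient descent on a ridge-regularized mean squared error. This is a sketch under the assumption that the objective is the MSE plus `a` times the squared norm of the weights; the symbols $n$ (number of training rows), $w$ (the weights), and $\eta$ (the learning rate `learn_rate`) are introduced here for illustration.

```latex
\begin{aligned}
J(w) &= \frac{1}{n}\,\lVert Xw - y \rVert_2^2 + a\,\lVert w \rVert_2^2 \\
\nabla_w J(w) &= \frac{2}{n}\,X^\top (Xw - y) + 2a\,w \\
w &\leftarrow w - \eta\,\nabla_w J(w)
\end{aligned}
```

The first gradient term corresponds to `nabla` and the second to `L2` in the code.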

In [250]:
#Implement gradient descent to update weight
loss_perepoch = [] # Empty list to append loss values

# Loop to fit the regression by gradient descent
for i in range(iter):
    
    #Calculate predicted outputs with current version of weights
    y_pred = X_train @  weights
    print(y_pred)
    
    #Calculate gradients
    nabla = (2 / X_train.shape[0]) * (X_train.T @ (y_pred - y_train))
    L2 = 2 * a * weights # L2 regularization term
    
    #Update weight
    weights = weights - learn_rate*(nabla+L2)

    #Loss function
    loss = loss_mse(y_pred,y_train)
    loss_perepoch.append(loss)

    if loss < 0.2:
        break

loss_epoch = np.array(loss_perepoch) # Loss of every epoch as an array




[[  1.19189288]
 [  4.91927124]
 [ -8.27645365]
 ...
 [-13.124223  ]
 [-14.66841443]
 [-15.03571337]]
[[ 5.46397246]
 [12.78182927]
 [-2.63900787]
 ...
 [-5.08166382]
 [-6.09792999]
 [-6.37703142]]
[[ 4.61136341]
 [ 7.46161872]
 [-3.13884069]
 ...
 [-5.57121479]
 [-6.23359398]
 [-6.45965054]]
[[ 5.42872415]
 [ 8.11846095]
 [-1.76991677]
 ...
 [-3.17220967]
 [-3.51936795]
 [-3.71979377]]
[[ 5.2688431 ]
 [ 6.71017064]
 [-1.61860128]
 ...
 [-2.43167842]
 [-2.53359785]
 [-2.72562986]]
[[ 5.3271257 ]
 [ 6.44697454]
 [-1.26504255]
 ...
 [-1.35872069]
 [-1.25392722]
 [-1.44954179]]
[[ 5.18093912]
 [ 5.92888442]
 [-1.19307306]
 ...
 [-0.68918649]
 [-0.41505872]
 [-0.62173607]]
[[ 5.05134541]
 [ 5.67683163]
 [-1.13379852]
 ...
 [-0.05758821]
 [ 0.35882002]
 [ 0.13642076]]
[[ 4.87817046]
 [ 5.43477602]
 [-1.15079152]
 ...
 [ 0.43125854]
 [ 0.96694608]
 [ 0.72625988]]
[[ 4.70638235]
 [ 5.27779271]
 [-1.18589686]
 ...
 [ 0.85464508]
 [ 1.49137462]
 [ 1.23122225]]


In [251]:
#Validate the result of our final weights
y_model = X_valid @ weights

#Loss in Validation Data
#loss_valid = loss_mse(y_model, y_valid)

# Compare our predictions to our validation data
y_model = np.abs(np.round(y_model))
comparison = y_model == y_valid
perc_pred = (np.sum(comparison) / comparison.size) * 100 
print(f'{perc_pred}%')
#Display visualisations to gain insight


10.397553516819572%


In [252]:
y_model

array([[ 4.],
       [ 4.],
       [ 6.],
       [ 9.],
       [ 7.],
       [15.],
       [10.],
       [14.],
       [ 9.],
       [ 9.],
       [12.],
       [ 7.],
       [ 8.],
       [13.],
       [ 7.],
       [10.],
       [ 3.],
       [ 5.],
       [13.],
       [ 3.],
       [ 5.],
       [ 1.],
       [23.],
       [ 3.],
       [ 7.],
       [ 3.],
       [ 4.],
       [ 9.],
       [ 5.],
       [15.],
       [18.],
       [ 4.],
       [ 9.],
       [ 3.],
       [ 5.],
       [10.],
       [ 5.],
       [12.],
       [ 2.],
       [ 0.],
       [17.],
       [ 2.],
       [13.],
       [ 5.],
       [10.],
       [ 4.],
       [ 2.],
       [ 2.],
       [ 4.],
       [ 7.],
       [ 5.],
       [ 1.],
       [ 6.],
       [ 2.],
       [ 3.],
       [ 6.],
       [ 7.],
       [ 6.],
       [ 4.],
       [ 5.],
       [ 7.],
       [ 1.],
       [ 1.],
       [ 3.],
       [ 7.],
       [ 3.],
       [ 1.],
       [ 2.],
       [ 5.],
       [ 0.],
       [ 5.],
      