# Pre-Processing Data for Training and Testing Models

This notebook walks through the pre-processing steps that take the data from the open-source repository and structure it into matrices that are ready to train and test machine learning and deep learning models.

You will have to change file paths and create folders either locally or online if you plan to use this script to pre-process the data. 

### Objective of Restructuring Data

We want to structure our data so that we have each signal from a sensor in its own matrix where each row of the matrix is a stride and each column of the matrix is a timepoint of the gait cycle. 

Since there are 4 sensors and 3 (angular velocity) signals per sensor, there will be 12 total matrices. 

Each stride is resampled to 101 data points so that it spans 0-100% of the gait cycle.

The following function takes one data file (and its heel strike indices), crops each signal from right heel strike to right heel strike (RHS to RHS), resamples each stride to 101 data points, and stores it in a matrix. The function is called repeatedly in a later cell to restructure the data automatically.
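
As a minimal sketch of the time-normalization step, the example below resamples one synthetic stride to 101 points, matching the `resample(stride,101)` call used in the function. It assumes `resample` is `scipy.signal.resample` (the notebook does not show the import), and the 137-sample stride length is made up for illustration.

```python
import numpy as np
from scipy.signal import resample  # assumed source of resample()

# Synthetic "stride": 137 samples covering one sine period (made-up length)
raw_stride = np.sin(np.linspace(0, 2 * np.pi, 137, endpoint=False))

# Fourier-based resampling to 101 points, i.e. 0-100% of the gait cycle
normalized_stride = resample(raw_stride, 101)

print(raw_stride.shape, normalized_stride.shape)
```

`scipy.signal.resample` uses a Fourier method and therefore treats the input as periodic, which is a reasonable fit for a signal cropped from heel strike to heel strike. Linear interpolation with `np.interp` would be an alternative if edge artifacts turn out to be a concern.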

# Table of Contents
1) Restructuring Data<br>
2) Standardizing Data

# Part 1: Restructuring Data

In [None]:
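#import libraries
import os
import numpy as np
import pandas as pd
#resample() is used below but was never imported; scipy.signal.resample is assumed
#because its (x, num) signature matches the resample(stride,101) calls
from scipy.signal import resample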

### Right Shank

Function that crops the right leg signals from one right heel strike (RHS) to the next consecutive RHS, time-normalizes each stride to 0-100% of the gait cycle, and concatenates the strides into a matrix where each row is a stride and each column is a timepoint of the gait cycle.
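
To make the cropping and stacking concrete, here is a small sketch on synthetic data. The column name `GYRz_demo`, the heel strike indices, and the signal itself are all made up, and `resample` is again assumed to be `scipy.signal.resample`.

```python
import numpy as np
import pandas as pd
from scipy.signal import resample  # assumed source of resample()

# Toy signal and heel-strike table (all values are made up)
data = pd.DataFrame({'GYRz_demo': np.sin(np.linspace(0, 6 * np.pi, 300))})
indices = pd.DataFrame({'RHS': [0, 100, 200, 299]})

rows = []
heel_strikes = indices['RHS'].values
for start, stop in zip(heel_strikes[:-1], heel_strikes[1:]):
    # .loc slicing is label-based and includes both end points
    stride = data['GYRz_demo'].loc[start:stop]
    rows.append(resample(stride.to_numpy(), 101))

# One row per stride, one column per timepoint of the gait cycle
stride_matrix = pd.DataFrame(rows)
print(stride_matrix.shape)
```

Four heel strikes give three strides, so the resulting matrix has 3 rows and 101 columns.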

In [30]:
#function for cropping right leg signals to RHS-RHS and concatenating them into a single matrix for each variable
def right_leg_data_wrangler(data, indices, variable, subject_ID, speed, trial):
    data_matrix = []
    
    #loop over each row of the indices data, pulling the two heel strike indexes for each stride
    for i in range(len(indices)-1):
    
        #heel strike indices
        first_RHS = indices['RHS'].values[i]
        second_RHS = indices['RHS'].values[i+1]

        #get vector between current heel strikes
        stride = data[variable].loc[first_RHS:second_RHS]

        #resample stride to 101 data points (0-100% of the gait cycle)
        resampled_stride = resample(stride,101)
        
        #append trial, speed, and subject_ID to the end of the stride vector
        resampled_stride = np.append(resampled_stride, trial)
        resampled_stride = np.append(resampled_stride, speed)
        resampled_stride = np.append(resampled_stride, subject_ID)

        #append data matrix
        data_matrix.append(resampled_stride)
        
    #convert to dataframe
    data_frame = pd.DataFrame(data_matrix)
        
    return data_frame

### Set Directory

Set the working directory to the folder containing the data from the open source repository.

In [41]:
data_folder = sorted(os.listdir(os.getcwd()+'/data'))
#remove .DS_Store file if it exists
if '.DS_Store' in data_folder:
    data_folder.remove('.DS_Store')

### Restructure data and save 

The next cell loops over the directory, loads each file, and calls the right_leg_data_wrangler function to restructure the data. Data files are then exported to the "variable_matrices" folder.
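
The file handling in the loop relies on fixed character positions in each file name. The sketch below isolates that logic in a hypothetical helper, `parse_trial_file` (not part of the notebook). The example file names are invented purely to exercise the slicing and may not match the repository's actual naming scheme.

```python
def parse_trial_file(file_name):
    """Return metadata for a trial file, or None if the file should be skipped."""
    # Mirror the notebook's filter: skip names containing 'ev' or 'up'
    if 'ev' in file_name or 'up' in file_name:
        return None
    return {
        'subject_ID': file_name[0:3],               # first three characters
        'speed': file_name[3],                      # fourth character
        'trial': file_name[4],                      # fifth character
        'events_file': file_name[:-4] + 'ev.txt',   # matching heel-strike file
    }

# Hypothetical file names, used only to exercise the slicing
for name in ['S01a1.txt', 'S01a1ev.txt', 'S01up.txt']:
    print(name, '->', parse_trial_file(name))
```

Because the filter is a plain substring test, any data file whose name happens to contain `ev` or `up` elsewhere would also be skipped, which is worth checking against the real file names.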

In [42]:
#right limb variables of interest
GYRx_taR_all = pd.DataFrame()
GYRy_taR_all = pd.DataFrame()
GYRz_taR_all = pd.DataFrame()
GYRx_tbR_all = pd.DataFrame()
GYRy_tbR_all = pd.DataFrame()
GYRz_tbR_all = pd.DataFrame()


#loop over all subjects
for subject_idx,subject_name in enumerate(data_folder):
    #assign current subject directory
    subject_folder = sorted(os.listdir(os.getcwd()+'/data/'+subject_name))
    #loop over files in directory
    for file_idx, file_name in enumerate(subject_folder):
        #ignore any files with 'ev' or 'up'
        if 'ev' not in file_name:
            if 'up' not in file_name:
                #assign current data file and load
                data_path = os.path.join(os.getcwd()+'/data/'+subject_name+'/'+file_name)
                data = pd.read_csv(data_path,sep='\t')
                #assign current indices file and load
                indices_path = os.path.join(os.getcwd()+'/data/'+subject_name+'/'+file_name[:-4]+'ev.txt')
                indices = pd.read_csv(indices_path,sep='\t')

                #using the function from above, generate matrices where each row is a stride and each column is a timepoint of the gait cycle
                #easiest to do this without a loop
                #6 right leg metrics = 6 pandas dataframes
                
                #GYRx_taR
                GYRx_taR = right_leg_data_wrangler(data, indices, 'GYRx_taR', file_name[0:3], file_name[3], file_name[4])
                GYRx_taR_all = pd.concat([GYRx_taR_all, GYRx_taR])
                #GYRy_taR
                GYRy_taR = right_leg_data_wrangler(data, indices, 'GYRy_taR', file_name[0:3], file_name[3], file_name[4])
                GYRy_taR_all = pd.concat([GYRy_taR_all, GYRy_taR])
                #GYRz_taR
                GYRz_taR = right_leg_data_wrangler(data, indices, 'GYRz_taR', file_name[0:3], file_name[3], file_name[4])
                GYRz_taR_all = pd.concat([GYRz_taR_all, GYRz_taR])

                #GYRx_tbR
                GYRx_tbR = right_leg_data_wrangler(data, indices, 'GYRx_tbR', file_name[0:3], file_name[3], file_name[4])
                GYRx_tbR_all = pd.concat([GYRx_tbR_all, GYRx_tbR])
                #GYRy_tbR
                GYRy_tbR = right_leg_data_wrangler(data, indices, 'GYRy_tbR', file_name[0:3], file_name[3], file_name[4])
                GYRy_tbR_all = pd.concat([GYRy_tbR_all, GYRy_tbR])
                #GYRz_tbR
                GYRz_tbR = right_leg_data_wrangler(data, indices, 'GYRz_tbR', file_name[0:3], file_name[3], file_name[4])
                GYRz_tbR_all = pd.concat([GYRz_tbR_all, GYRz_tbR])
                

                
#move last three columns to beginning of pandas dataframes and rename
GYRx_taR_all_1 = GYRx_taR_all.pop(101)
GYRx_taR_all.insert(0, 'trial', GYRx_taR_all_1)
GYRx_taR_all_2 = GYRx_taR_all.pop(102)
GYRx_taR_all.insert(0, 'speed', GYRx_taR_all_2)
GYRx_taR_all_3 = GYRx_taR_all.pop(103)
GYRx_taR_all.insert(0, 'subject_ID', GYRx_taR_all_3)

GYRy_taR_all_1 = GYRy_taR_all.pop(101)
GYRy_taR_all.insert(0, 'trial', GYRy_taR_all_1)
GYRy_taR_all_2 = GYRy_taR_all.pop(102)
GYRy_taR_all.insert(0, 'speed', GYRy_taR_all_2)
GYRy_taR_all_3 = GYRy_taR_all.pop(103)
GYRy_taR_all.insert(0, 'subject_ID', GYRy_taR_all_3)

GYRz_taR_all_1 = GYRz_taR_all.pop(101)
GYRz_taR_all.insert(0, 'trial', GYRz_taR_all_1)
GYRz_taR_all_2 = GYRz_taR_all.pop(102)
GYRz_taR_all.insert(0, 'speed', GYRz_taR_all_2)
GYRz_taR_all_3 = GYRz_taR_all.pop(103)
GYRz_taR_all.insert(0, 'subject_ID', GYRz_taR_all_3)

GYRx_tbR_all_1 = GYRx_tbR_all.pop(101)
GYRx_tbR_all.insert(0, 'trial', GYRx_tbR_all_1)
GYRx_tbR_all_2 = GYRx_tbR_all.pop(102)
GYRx_tbR_all.insert(0, 'speed', GYRx_tbR_all_2)
GYRx_tbR_all_3 = GYRx_tbR_all.pop(103)
GYRx_tbR_all.insert(0, 'subject_ID', GYRx_tbR_all_3)

GYRy_tbR_all_1 = GYRy_tbR_all.pop(101)
GYRy_tbR_all.insert(0, 'trial', GYRy_tbR_all_1)
GYRy_tbR_all_2 = GYRy_tbR_all.pop(102)
GYRy_tbR_all.insert(0, 'speed', GYRy_tbR_all_2)
GYRy_tbR_all_3 = GYRy_tbR_all.pop(103)
GYRy_tbR_all.insert(0, 'subject_ID', GYRy_tbR_all_3)

GYRz_tbR_all_1 = GYRz_tbR_all.pop(101)
GYRz_tbR_all.insert(0, 'trial', GYRz_tbR_all_1)
GYRz_tbR_all_2 = GYRz_tbR_all.pop(102)
GYRz_tbR_all.insert(0, 'speed', GYRz_tbR_all_2)
GYRz_tbR_all_3 = GYRz_tbR_all.pop(103)
GYRz_tbR_all.insert(0, 'subject_ID', GYRz_tbR_all_3)


#export matrices to .csv files
pd.DataFrame.to_csv(GYRx_taR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRx_taR.csv', sep=',')
pd.DataFrame.to_csv(GYRy_taR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRy_taR.csv', sep=',')
pd.DataFrame.to_csv(GYRz_taR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRz_taR.csv', sep=',')

pd.DataFrame.to_csv(GYRx_tbR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRx_tbR.csv', sep=',')
pd.DataFrame.to_csv(GYRy_tbR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRy_tbR.csv', sep=',')
pd.DataFrame.to_csv(GYRz_tbR_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRz_tbR.csv', sep=',')

I ended up running the above cell 3 separate times because my laptop didn't have enough computing power to handle all of the files at once. I divided the 22 subjects into 3 groups, ran the previous 2 cells separately for each group (saving each group's output to its own folder: variable_matrices_1, variable_matrices_2, and variable_matrices_3), then used the following cell to concatenate all of the files.
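
As a rough illustration of stitching the per-group outputs back together, the sketch below stacks three toy DataFrames that stand in for the per-group CSV files. The values and column names are made up.

```python
import pandas as pd

# Three toy batches standing in for the per-group CSV files (values made up)
batch_1 = pd.DataFrame({'subject_ID': ['A', 'A'], '0': [0.1, 0.2]})
batch_2 = pd.DataFrame({'subject_ID': ['B'], '0': [0.3]})
batch_3 = pd.DataFrame({'subject_ID': ['C', 'C'], '0': [0.4, 0.5]})

# Stack rows; ignore_index=True gives a fresh 0..n-1 index instead of
# repeating each batch's own index
combined = pd.concat([batch_1, batch_2, batch_3], ignore_index=True)

print(len(combined))
```

The notebook's own cell keeps the original row labels. Passing `ignore_index=True` is an optional variant that gives every stride a unique row label, which can make later column-wise alignment easier to reason about.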

In [52]:
#create list of variables
variable_list = ['GYRx_taR', 'GYRy_taR', 'GYRz_taR', 'GYRx_tbR', 'GYRy_tbR', 'GYRz_tbR']
#loop over variables in list
for i in variable_list:
    #load files
    file_1 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_1/'+i+'.csv', index_col=0)
    file_2 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_2/'+i+'.csv', index_col=0)
    file_3 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_3/'+i+'.csv', index_col=0)
    #concatenate files
    master_file = pd.concat([file_1, file_2, file_3])
    #save file
    pd.DataFrame.to_csv(master_file, os.getcwd()+'/strides/variable_matrices_strides/'+i+'.csv', sep=',')

12102
12102
12102
12102
12102
12102


Lastly, let's horizontally concatenate the files from all 6 variables so we are left with one matrix with 12102 rows and 606 variables (6 signals x 101 timepoints), plus the subject_ID, speed, and trial columns.
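
One way to picture the horizontal concatenation is the sketch below, which joins two toy per-signal matrices. It takes a slightly different route from the notebook's cell: it keeps a single copy of the metadata columns and prefixes each timepoint column with its signal name, so no duplicate column names are created. The signal names `GYRx_demo` and `GYRy_demo` and all values are made up.

```python
import pandas as pd

meta_cols = ['subject_ID', 'speed', 'trial']

# Two toy per-signal matrices with 3 timepoints each (values made up)
frames = {
    'GYRx_demo': pd.DataFrame({'subject_ID': ['A', 'B'], 'speed': [1, 2], 'trial': [1, 1],
                               '0': [0.1, 0.2], '1': [0.3, 0.4], '2': [0.5, 0.6]}),
    'GYRy_demo': pd.DataFrame({'subject_ID': ['A', 'B'], 'speed': [1, 2], 'trial': [1, 1],
                               '0': [1.1, 1.2], '1': [1.3, 1.4], '2': [1.5, 1.6]}),
}

# Keep one copy of the metadata, then add each signal's timepoints with a prefix
pieces = [next(iter(frames.values()))[meta_cols]]
for name, frame in frames.items():
    pieces.append(frame.drop(columns=meta_cols).add_prefix(name + '_'))

wide = pd.concat(pieces, axis=1)
print(wide.shape)
print(list(wide.columns))
```

With 3 metadata columns and 2 signals of 3 timepoints each, the toy result has 9 columns. The same arithmetic applies to the real data with 6 signals of 101 timepoints each.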

In [69]:
#create list of variables
variable_list = ['GYRx_taR', 'GYRy_taR', 'GYRz_taR', 'GYRx_tbR', 'GYRy_tbR', 'GYRz_tbR']
R_variables_all = pd.DataFrame()
#loop over variables in list
for i in variable_list:
    #load files
    current_file = pd.read_csv(os.getcwd()+'/strides/variable_matrices_strides/'+i+'.csv', index_col=0)
    #horizontally concatenate files
    R_variables_all = pd.concat([R_variables_all, current_file], axis=1)

    
#pop off repeated columns (subject_ID, trial, speed)
subject_ID = R_variables_all.pop('subject_ID').iloc[:,1]
speed = R_variables_all.pop('speed').iloc[:,1]
trial = R_variables_all.pop('trial').iloc[:,1]
#and reinsert one instance at beginning of dataframe
R_variables_all.insert(0, 'trial', trial)
R_variables_all.insert(0, 'speed', speed)
R_variables_all.insert(0, 'subject_ID', subject_ID)

#save file    
pd.DataFrame.to_csv(R_variables_all, os.getcwd()+'/strides/ML_data/R_variables_all.csv', sep=',')

Now we have one matrix with 12102 observations and 606 variables, where each observation is a stride from a subject at a specific walking speed and each variable is a datapoint from a percentage of the gait cycle from one of the IMU's angular velocity signals. This is for the right limb only, so next we'll repeat for the left shank.

### Left Shank

In [79]:
#function for cropping left leg signals to LHS-LHS and concatenating them into a single matrix for each variable
def left_leg_data_wrangler(data, indices, variable, subject_ID, speed, trial):
    data_matrix = []
    
    #loop over each row of the indices data, pulling the two heel strike indexes for each stride
    for i in range(len(indices)-1):
    
        #heel strike indices
        first_LHS = indices['LHS'].values[i]
        second_LHS = indices['LHS'].values[i+1]

        #get vector between current heel strikes
        stride = data[variable].loc[first_LHS:second_LHS]

        #resample stride to 101 data points (0-100% of the gait cycle)
        resampled_stride = resample(stride,101)
        
        #append trial, speed, and subject_ID to the end of the stride vector
        resampled_stride = np.append(resampled_stride, trial)
        resampled_stride = np.append(resampled_stride, speed)
        resampled_stride = np.append(resampled_stride, subject_ID)

        #append data matrix
        data_matrix.append(resampled_stride)
        
    #convert to dataframe
    data_frame = pd.DataFrame(data_matrix)
        
    return data_frame

### Set Directory

In [80]:
data_folder = sorted(os.listdir(os.getcwd()+'/data'))
#remove .DS_Store file if it exists
if '.DS_Store' in data_folder:
    data_folder.remove('.DS_Store')

### Restructure data and save

The next cell loops over the directory, loads each file, and calls the left_leg_data_wrangler function to restructure the data. Data files are then exported to the "variable_matrices" folder.

In [83]:
#left limb variables of interest
GYRx_taL_all = pd.DataFrame()
GYRy_taL_all = pd.DataFrame()
GYRz_taL_all = pd.DataFrame()
GYRx_tbL_all = pd.DataFrame()
GYRy_tbL_all = pd.DataFrame()
GYRz_tbL_all = pd.DataFrame()


#loop over all subjects
for subject_idx,subject_name in enumerate(data_folder):
    #assign current subject directory
    subject_folder = sorted(os.listdir(os.getcwd()+'/data/'+subject_name))
    #loop over files in directory
    for file_idx, file_name in enumerate(subject_folder):
        #ignore any files with 'ev' or 'up'
        if 'ev' not in file_name:
            if 'up' not in file_name:
                #assign current data file and load
                data_path = os.path.join(os.getcwd()+'/data/'+subject_name+'/'+file_name)
                data = pd.read_csv(data_path,sep='\t')
                #assign current indices file and load
                indices_path = os.path.join(os.getcwd()+'/data/'+subject_name+'/'+file_name[:-4]+'ev.txt')
                indices = pd.read_csv(indices_path,sep='\t')

                #using the function from above, generate matrices where each row is a stride and each column is a timepoint of the gait cycle
                #easiest to do this without a loop
                #6 left leg metrics = 6 pandas dataframes
                
                #GYRx_taL
                GYRx_taL = left_leg_data_wrangler(data, indices, 'GYRx_taL', file_name[0:3], file_name[3], file_name[4])
                GYRx_taL_all = pd.concat([GYRx_taL_all, GYRx_taL])
                #GYRy_taL
                GYRy_taL = left_leg_data_wrangler(data, indices, 'GYRy_taL', file_name[0:3], file_name[3], file_name[4])
                GYRy_taL_all = pd.concat([GYRy_taL_all, GYRy_taL])
                #GYRz_taL
                GYRz_taL = left_leg_data_wrangler(data, indices, 'GYRz_taL', file_name[0:3], file_name[3], file_name[4])
                GYRz_taL_all = pd.concat([GYRz_taL_all, GYRz_taL])

                #GYRx_tbL
                GYRx_tbL = left_leg_data_wrangler(data, indices, 'GYRx_tbL', file_name[0:3], file_name[3], file_name[4])
                GYRx_tbL_all = pd.concat([GYRx_tbL_all, GYRx_tbL])
                #GYRy_tbL
                GYRy_tbL = left_leg_data_wrangler(data, indices, 'GYRy_tbL', file_name[0:3], file_name[3], file_name[4])
                GYRy_tbL_all = pd.concat([GYRy_tbL_all, GYRy_tbL])
                #GYRz_tbL
                GYRz_tbL = left_leg_data_wrangler(data, indices, 'GYRz_tbL', file_name[0:3], file_name[3], file_name[4])
                GYRz_tbL_all = pd.concat([GYRz_tbL_all, GYRz_tbL])
                

                
#move last three columns to beginning of pandas dataframes and rename
GYRx_taL_all_1 = GYRx_taL_all.pop(101)
GYRx_taL_all.insert(0, 'trial', GYRx_taL_all_1)
GYRx_taL_all_2 = GYRx_taL_all.pop(102)
GYRx_taL_all.insert(0, 'speed', GYRx_taL_all_2)
GYRx_taL_all_3 = GYRx_taL_all.pop(103)
GYRx_taL_all.insert(0, 'subject_ID', GYRx_taL_all_3)

GYRy_taL_all_1 = GYRy_taL_all.pop(101)
GYRy_taL_all.insert(0, 'trial', GYRy_taL_all_1)
GYRy_taL_all_2 = GYRy_taL_all.pop(102)
GYRy_taL_all.insert(0, 'speed', GYRy_taL_all_2)
GYRy_taL_all_3 = GYRy_taL_all.pop(103)
GYRy_taL_all.insert(0, 'subject_ID', GYRy_taL_all_3)

GYRz_taL_all_1 = GYRz_taL_all.pop(101)
GYRz_taL_all.insert(0, 'trial', GYRz_taL_all_1)
GYRz_taL_all_2 = GYRz_taL_all.pop(102)
GYRz_taL_all.insert(0, 'speed', GYRz_taL_all_2)
GYRz_taL_all_3 = GYRz_taL_all.pop(103)
GYRz_taL_all.insert(0, 'subject_ID', GYRz_taL_all_3)

GYRx_tbL_all_1 = GYRx_tbL_all.pop(101)
GYRx_tbL_all.insert(0, 'trial', GYRx_tbL_all_1)
GYRx_tbL_all_2 = GYRx_tbL_all.pop(102)
GYRx_tbL_all.insert(0, 'speed', GYRx_tbL_all_2)
GYRx_tbL_all_3 = GYRx_tbL_all.pop(103)
GYRx_tbL_all.insert(0, 'subject_ID', GYRx_tbL_all_3)

GYRy_tbL_all_1 = GYRy_tbL_all.pop(101)
GYRy_tbL_all.insert(0, 'trial', GYRy_tbL_all_1)
GYRy_tbL_all_2 = GYRy_tbL_all.pop(102)
GYRy_tbL_all.insert(0, 'speed', GYRy_tbL_all_2)
GYRy_tbL_all_3 = GYRy_tbL_all.pop(103)
GYRy_tbL_all.insert(0, 'subject_ID', GYRy_tbL_all_3)

GYRz_tbL_all_1 = GYRz_tbL_all.pop(101)
GYRz_tbL_all.insert(0, 'trial', GYRz_tbL_all_1)
GYRz_tbL_all_2 = GYRz_tbL_all.pop(102)
GYRz_tbL_all.insert(0, 'speed', GYRz_tbL_all_2)
GYRz_tbL_all_3 = GYRz_tbL_all.pop(103)
GYRz_tbL_all.insert(0, 'subject_ID', GYRz_tbL_all_3)


#export matrices to .csv files
pd.DataFrame.to_csv(GYRx_taL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRx_taL.csv', sep=',')
pd.DataFrame.to_csv(GYRy_taL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRy_taL.csv', sep=',')
pd.DataFrame.to_csv(GYRz_taL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRz_taL.csv', sep=',')

pd.DataFrame.to_csv(GYRx_tbL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRx_tbL.csv', sep=',')
pd.DataFrame.to_csv(GYRy_tbL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRy_tbL.csv', sep=',')
pd.DataFrame.to_csv(GYRz_tbL_all, os.getcwd()+'/pre_processing/variable_matrices/'+'GYRz_tbL.csv', sep=',')

I ended up running the above cell 3 separate times because my laptop didn't have enough computing power to handle all of the files at once. I divided the 22 subjects into 3 groups, ran the previous 2 cells separately for each group (saving each group's output to its own folder: variable_matrices_1, variable_matrices_2, and variable_matrices_3), then used the following cell to concatenate all of the files.

In [85]:
#create list of variables
variable_list = ['GYRx_taL', 'GYRy_taL', 'GYRz_taL', 'GYRx_tbL', 'GYRy_tbL', 'GYRz_tbL']
#loop over variables in list
for i in variable_list:
    #load files
    file_1 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_1/'+i+'.csv', index_col=0)
    file_2 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_2/'+i+'.csv', index_col=0)
    file_3 = pd.read_csv(os.getcwd()+'/pre_processing/variable_matrices_3/'+i+'.csv', index_col=0)
    #concatenate files
    master_file = pd.concat([file_1, file_2, file_3])
    #save file
    pd.DataFrame.to_csv(master_file, os.getcwd()+'/strides/variable_matrices_strides/'+i+'.csv', sep=',')

Lastly, let's horizontally concatenate the files from all 6 variables so we are left with one matrix with 12102 rows and 606 variables (6 signals x 101 timepoints), plus the subject_ID, speed, and trial columns.

In [86]:
#create list of variables
variable_list = ['GYRx_taL', 'GYRy_taL', 'GYRz_taL', 'GYRx_tbL', 'GYRy_tbL', 'GYRz_tbL']
L_variables_all = pd.DataFrame()
#loop over variables in list
for i in variable_list:
    #load files
    current_file = pd.read_csv(os.getcwd()+'/strides/variable_matrices_strides/'+i+'.csv', index_col=0)
    #horizontally concatenate files
    L_variables_all = pd.concat([L_variables_all, current_file], axis=1)

    
#pop off repeated columns (subject_ID, trial, speed)
subject_ID = L_variables_all.pop('subject_ID').iloc[:,1]
speed = L_variables_all.pop('speed').iloc[:,1]
trial = L_variables_all.pop('trial').iloc[:,1]
#and reinsert one instance at beginning of dataframe
L_variables_all.insert(0, 'trial', trial)
L_variables_all.insert(0, 'speed', speed)
L_variables_all.insert(0, 'subject_ID', subject_ID)

#save file    
pd.DataFrame.to_csv(L_variables_all, os.getcwd()+'/strides/ML_data/L_variables_all.csv', sep=',')

# Part 2: Standardizing Data

It is important to standardize data before feeding it into machine learning models, because variables with larger ranges may carry more weight when training and validating the models. For example, we would expect motion in the sagittal plane (i.e., when the shank swings forward and back) to produce larger angular velocity values than motion in the frontal plane (i.e., when the shank moves from left to right). Standardizing time-series data is a little different from traditional standardization in ML modeling. Here I use a z-transformation applied to each stride separately: subtract the stride's mean and divide by its standard deviation. This keeps the overall shape of the waveform, which is really what we care about, while removing the discrepancies in range between signals.
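
The sketch below shows the row-wise z-transformation on two toy strides that share a shape but differ tenfold in amplitude. The values are made up; the pandas calls mirror the ones used in the cell that follows.

```python
import numpy as np
import pandas as pd

# Two toy strides with the same shape but very different amplitudes (values made up)
strides = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                        [10.0, 20.0, 30.0, 40.0]])

# Row-wise z-transformation: subtract each stride's mean, divide by its SD
centered = strides.sub(strides.mean(axis=1), axis=0)
standardized = centered.div(strides.std(axis=1), axis=0)

print(standardized.round(3))
```

After the transformation both rows are identical, with mean 0 and standard deviation 1, which illustrates that the waveform shape is kept while the amplitude difference is removed. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default.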

In [502]:
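#list the per-variable stride matrices (assumed definition; this variable is not defined in the cells shown)
variable_matrices_folder = sorted(os.listdir(os.getcwd()+'/strides/variable_matrices_strides'))
#remove .DS_Store file if it exists
if '.DS_Store' in variable_matrices_folder:
    variable_matrices_folder.remove('.DS_Store')
#loop over each file in the folder
for file_idx, file_name in enumerate(variable_matrices_folder):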
    #load file
    file_path = os.path.join(os.getcwd()+'/strides/variable_matrices_strides/'+file_name)
    variable_matrix = pd.read_csv(file_path,sep=',', index_col=0)
    #drop the subject_ID, speed, and trial columns
    variable_matrix_values = variable_matrix.drop(['subject_ID', 'speed', 'trial'], axis=1)
    
    #z-transformation
    variable_matrix_minus_mean = variable_matrix_values.sub(variable_matrix_values.mean(axis=1), axis=0)
    variable_matrix_std = variable_matrix_minus_mean.divide(variable_matrix_minus_mean.std(axis=1), axis=0)
    
    #add the columns back to the dataframe
    variable_matrix_std.insert(0, 'subject_ID', variable_matrix['subject_ID'])
    variable_matrix_std.insert(1, 'speed', variable_matrix['speed'])
    variable_matrix_std.insert(2, 'trial', variable_matrix['trial'])
    
    #export standardized dataframes
    pd.DataFrame.to_csv(variable_matrix_std, os.getcwd()+'/strides/variable_matrices_strides_standard/'+file_name[:-4]+'_std.csv', sep=',')

Next, concatenate all data for each limb into one file where each row is an observation and each column is a timepoint of one of the variables (6 variables x 101 timepoints, for a total of 606 signal columns).

In [504]:
#create list of variables
variable_list = ['GYRx_taR_std', 'GYRy_taR_std', 'GYRz_taR_std', 'GYRx_tbR_std', 'GYRy_tbR_std', 'GYRz_tbR_std']
R_variables_all = pd.DataFrame()
#loop over variables in list
for i in variable_list:
    #load files
    current_file = pd.read_csv(os.getcwd()+'/strides/variable_matrices_strides_standard/'+i+'.csv', index_col=0)
    #horizontally concatenate files
    R_variables_all = pd.concat([R_variables_all, current_file], axis=1)

    
#pop off repeated columns (subject_ID, trial, speed)
subject_ID = R_variables_all.pop('subject_ID').iloc[:,1]
speed = R_variables_all.pop('speed').iloc[:,1]
trial = R_variables_all.pop('trial').iloc[:,1]
#and reinsert one instance at beginning of dataframe
R_variables_all.insert(0, 'trial', trial)
R_variables_all.insert(0, 'speed', speed)
R_variables_all.insert(0, 'subject_ID', subject_ID)

#save file    
pd.DataFrame.to_csv(R_variables_all, os.getcwd()+'/strides/ML_data/R_variables_all_std.csv', sep=',')

In [506]:
#create list of variables
variable_list = ['GYRx_taL_std', 'GYRy_taL_std', 'GYRz_taL_std', 'GYRx_tbL_std', 'GYRy_tbL_std', 'GYRz_tbL_std']
L_variables_all = pd.DataFrame()
#loop over variables in list
for i in variable_list:
    #load files
    current_file = pd.read_csv(os.getcwd()+'/strides/variable_matrices_strides_standard/'+i+'.csv', index_col=0)
    #horizontally concatenate files
    L_variables_all = pd.concat([L_variables_all, current_file], axis=1)

    
#pop off repeated columns (subject_ID, trial, speed)
subject_ID = L_variables_all.pop('subject_ID').iloc[:,1]
speed = L_variables_all.pop('speed').iloc[:,1]
trial = L_variables_all.pop('trial').iloc[:,1]
#and reinsert one instance at beginning of dataframe
L_variables_all.insert(0, 'trial', trial)
L_variables_all.insert(0, 'speed', speed)
L_variables_all.insert(0, 'subject_ID', subject_ID)

#save file    
pd.DataFrame.to_csv(L_variables_all, os.getcwd()+'/strides/ML_data/L_variables_all_std.csv', sep=',')

Now we can think about performing some machine learning modeling. One approach would be to feed the standardized IMU signals directly into a model; however, that might make training and validation slower because of the large number of variables. Another approach would be to extract features from the data, such as statistical metrics (mean, median, range, max, min, etc.) and/or principal components. Let's try the principal components first and see how well that performs.
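
As a rough preview of the principal components idea, the sketch below computes PCA scores with a plain NumPy SVD. It is not the notebook's implementation: the input is random data standing in for the stride matrix, and the choice of 3 components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the stride matrix: 50 observations x 12 variables (random values)
X = rng.normal(size=(50, 12))

# Center each column, then take the SVD of the centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 3                                # arbitrary choice for illustration
scores = X_centered @ Vt[:n_components].T       # PC scores per observation
explained = (S ** 2) / np.sum(S ** 2)           # fraction of variance per component

print(scores.shape, explained[:n_components].round(3))
```

The `scores` matrix has one row per observation and one column per retained component, so it could replace the full set of signal columns as model input.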