### Making Dataset

In this file we will do the following:

* Creating the dataset from raw data
* Merging the dataset (containing measurements from the accelerometer and gyroscope)
* Preprocessing the data
* Cleaning the data
* Finally, exporting the preprocessed data and storing it in the `../data/interim` folder.

In [1]:
# Importing the necessary libraries

import pandas as pd
from glob import glob

In [2]:
# Function to read the raw CSV files into one accelerometer and one gyroscope DataFrame
files = glob('../data/raw/MetaMotion/*.csv')
data_path = '../data/raw/MetaMotion/'

def read_data(files):
    accel_df = pd.DataFrame()
    gyro_df = pd.DataFrame()

    # Adding exercise sets for each type of dataset
    accel_set = 1
    gyro_set  = 1

    for f in files:
        participant = f.split('-')[0].replace(data_path, '')
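        # Label: type of exercise, e.g. ohp (overhead press)
        label = f.split('-')[1]
        # Category: quick fix that strips a trailing set digit (1-3) and any '_MetaWear_2019' remnant.
        # Note: rstrip removes a set of characters, not a literal suffix.
        category = f.split('-')[2].rstrip('123').rstrip('_MetaWear_2019')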
        df = pd.read_csv(f)
        df['participant'] = participant
        df['label'] = label
        df['category'] = category

        if 'Accelerometer' in f:
            df['set'] = accel_set
            accel_set += 1
            accel_df = pd.concat([accel_df, df])

        if 'Gyroscope' in f:
            df['set'] = gyro_set
            gyro_set += 1
            gyro_df = pd.concat([gyro_df, df])
    
    accel_df.index = pd.to_datetime(accel_df['epoch (ms)'], unit='ms')
    gyro_df.index = pd.to_datetime(gyro_df['epoch (ms)'], unit='ms') 
    del accel_df['epoch (ms)']
    del accel_df['time (01:00)']
    del accel_df['elapsed (s)']

    del gyro_df['epoch (ms)']
    del gyro_df['time (01:00)']
    del gyro_df['elapsed (s)'] 

    return accel_df, gyro_df

accel_df, gyro_df = read_data(files) 
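
Because `rstrip` removes any trailing characters from the given set rather than a literal suffix, the quick fix above can over-strip some names. Below is a minimal sketch of a stricter alternative, assuming file names follow a `<participant>-<label>-<category>...` pattern. The helper `parse_filename` and the example file name are hypothetical and not part of the original notebook.

```python
import re
from pathlib import Path

# Hypothetical file name used for illustration only; the real naming scheme is an assumption.
EXAMPLE = (
    "../data/raw/MetaMotion/"
    "A-bench-heavy2-rpe8_MetaWear_2019-01-14T14.22.49.901_C42732BE255C_Accelerometer_12.500Hz_1.4.4.csv"
)

def parse_filename(path):
    """Return (participant, label, category) from a '<participant>-<label>-<category>...' name."""
    name = Path(path).name  # drop the directory part
    participant, label, rest = name.split("-", 2)
    # Keep only the leading letters of the third field, dropping set digits and suffixes.
    category = re.match(r"[A-Za-z]+", rest).group(0)
    return participant, label, category

print(parse_filename(EXAMPLE))

# The pitfall: every character of the hypothetical category "rear" is in the strip set.
print(repr("rear".rstrip("_MetaWear_2019")))
```

Taking the leading letters with a regular expression avoids depending on which characters happen to appear in the suffix.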


In [35]:
# Merging the data (Accelerometer and Gyroscope)
# Only the three axis columns are taken from accel_df so the metadata columns are not duplicated

merged_data = pd.concat([accel_df.iloc[:,:3], gyro_df], axis=1)

In [36]:
merged_data.tail()

| epoch (ms) | x-axis (g) | y-axis (g) | z-axis (g) | x-axis (deg/s) | y-axis (deg/s) | z-axis (deg/s) | participant | label | category | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-20 17:35:13.382 | -0.06 | -1.021 | -0.058 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-01-20 17:35:13.462 | -0.035 | -1.037 | -0.026 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-01-20 17:35:13.542 | -0.045 | -1.029 | -0.033 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-01-20 17:35:13.622 | -0.039 | -1.027 | -0.039 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-01-20 17:35:13.702 | -0.049 | -1.031 | -0.049 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [37]:
# Renaming columns

merged_data.columns = [
    'acc_x',
    'acc_y',
    'acc_z',
    'gyr_x',
    'gyr_y',
    'gyr_z',
    'participant',
    'label',
    'category',
    'set'
]

**Note:** After the merge, many rows contain NaN values. The merged data frame combines two types of measurement:

1. Accelerometer data
2. Gyroscope data

The two sensors do not record at the same instants (in the displayed rows, accelerometer samples are 80 ms apart and gyroscope samples are 40 ms apart), so the chance that an accelerometer reading and a gyroscope reading share exactly the same timestamp while the device is active is fairly small. A row where only one sensor reported has NaN in the other sensor's columns.

We will handle these null values by resampling the data.
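
The effect described in the note can be reproduced on a small scale. This is a sketch with synthetic timestamps and values (not taken from the dataset), assuming one sensor samples every 80 ms and the other every 40 ms with a slight offset.

```python
import pandas as pd

# Synthetic timestamps: one sensor every 80 ms, the other every 40 ms, offset by 10 ms.
acc_idx = pd.date_range("2019-01-11 15:08:05.000", periods=3, freq="80ms")
gyr_idx = pd.date_range("2019-01-11 15:08:05.010", periods=6, freq="40ms")

acc = pd.DataFrame({"acc_x": [0.01, 0.02, 0.03]}, index=acc_idx)
gyr = pd.DataFrame({"gyr_x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=gyr_idx)

# Column-wise concat aligns on the index; timestamps present in only one frame
# produce NaN in the other frame's columns.
merged = pd.concat([acc, gyr], axis=1)
print(merged)
print(merged.isna().sum())
```

Since no timestamp appears in both frames here, every row of `merged` has exactly one NaN.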

In [38]:
merged_data.head()

| epoch (ms) | acc_x | acc_y | acc_z | gyr_x | gyr_y | gyr_z | participant | label | category | set |
|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-11 15:08:04.950 | NaN | NaN | NaN | -10.671 | -1.524 | 5.976 | B | bench | heavy | 64.0 |
| 2019-01-11 15:08:04.990 | NaN | NaN | NaN | -8.72 | -2.073 | 3.171 | B | bench | heavy | 64.0 |
| 2019-01-11 15:08:05.030 | NaN | NaN | NaN | 0.488 | -3.537 | -4.146 | B | bench | heavy | 64.0 |
| 2019-01-11 15:08:05.070 | NaN | NaN | NaN | 0.244 | -5.854 | 3.537 | B | bench | heavy | 64.0 |
| 2019-01-11 15:08:05.110 | NaN | NaN | NaN | -0.915 | 0.061 | -2.805 | B | bench | heavy | 64.0 |


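**Resampling:** Converting a time series from one sampling frequency to another, either higher (upsampling) or lower (downsampling). Here the data is downsampled: all readings that fall within the same 200 ms window are aggregated into a single row.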
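
As a minimal illustration of downsampling, the sketch below uses a synthetic frame (values and column choice are made up) sampled every 40 ms and aggregates it into 200 ms windows, with the mean for the numeric column and the last value for the text column.

```python
import pandas as pd

# Ten synthetic readings, 40 ms apart, covering two full 200 ms windows.
idx = pd.date_range("2019-01-11 15:08:05.000", periods=10, freq="40ms")
toy = pd.DataFrame({"gyr_x": range(10), "label": ["bench"] * 10}, index=idx)

# One aggregation rule per column: average the sensor values, keep the last label.
resampled = toy.resample("200ms").agg({"gyr_x": "mean", "label": "last"})
print(resampled)
```

The ten input rows collapse into two output rows, one per 200 ms window.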

In [39]:
# Now we will be resampling the data
# Aggregation rule per column: mean for the sensor axes, last value for the metadata
sampling = {
    'acc_x': 'mean',
    'acc_y': 'mean',
    'acc_z': 'mean',
    'gyr_x': 'mean',
    'gyr_y': 'mean',
    'gyr_z': 'mean',
    'participant': 'last',
    'label': 'last',
    'category': 'last',
    'set': 'last',  
}


# Resample one day at a time so the gaps between recording days do not generate empty 200 ms bins
days = [g for n, g in merged_data.groupby(pd.Grouper(freq='D'))]
data_final = pd.concat([df.resample(rule='200ms').apply(sampling).dropna() for df in days])
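
One plausible reason for splitting by day before resampling is that resampling the whole span at once creates a bin for every 200 ms between the first and last timestamp, including the idle time between days. The sketch below shows this on synthetic data (four made-up readings on two consecutive days); it illustrates the effect and is not a statement of the author's intent.

```python
import pandas as pd

# Two readings on each of two consecutive days (synthetic values).
idx = pd.DatetimeIndex([
    "2019-01-11 15:08:05.000", "2019-01-11 15:08:05.100",
    "2019-01-12 09:00:00.000", "2019-01-12 09:00:00.100",
])
toy = pd.DataFrame({"acc_x": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Resampling everything at once fills the overnight gap with empty bins.
whole = toy.resample("200ms").mean()

# Resampling each day separately only creates bins within that day's recording span.
per_day = pd.concat(
    [day.resample("200ms").mean() for _, day in toy.groupby(pd.Grouper(freq="D"))]
)

print(len(whole), len(per_day))
```

After `dropna()` both approaches keep the same rows; the per-day version simply avoids materializing the empty bins first.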

In [40]:
# Changing set from float back to int (the NaN rows introduced by the merge forced it to float)
data_final['set'] = data_final['set'].astype('int')

In [41]:
data_final.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9009 entries, 2019-01-11 15:08:05.200000 to 2019-01-20 17:33:27.800000
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   acc_x        9009 non-null   float64
 1   acc_y        9009 non-null   float64
 2   acc_z        9009 non-null   float64
 3   gyr_x        9009 non-null   float64
 4   gyr_y        9009 non-null   float64
 5   gyr_z        9009 non-null   float64
 6   participant  9009 non-null   object 
 7   label        9009 non-null   object 
 8   category     9009 non-null   object 
 9   set          9009 non-null   int64  
dtypes: float64(6), int64(1), object(3)
memory usage: 774.2+ KB


In [42]:
# Exporting the dataset as a pickle file
# Pickle is used instead of CSV because it preserves the DatetimeIndex and column dtypes,
# so the timestamps do not have to be parsed again when the file is loaded

data_final.to_pickle('../data/interim/processed_data.pkl')
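
The difference between the two formats can be checked with a quick round trip. This is a sketch on a synthetic frame written to a temporary directory (file names are placeholders), comparing what comes back from pickle with what comes back from CSV when no date parsing is requested.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small synthetic frame with a DatetimeIndex, similar in shape to the processed data.
idx = pd.date_range("2019-01-11 15:08:05.200", periods=3, freq="200ms", name="epoch (ms)")
toy = pd.DataFrame({"acc_x": [0.1, 0.2, 0.3], "set": [64, 64, 64]}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy.pkl"
    csv_path = Path(tmp) / "toy.csv"
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)
    # Without parse_dates, the index comes back as plain text.
    from_csv = pd.read_csv(csv_path, index_col=0)

print(type(from_pickle.index).__name__)
print(type(from_csv.index).__name__)
```

With CSV the timestamps can still be recovered by passing `parse_dates=True` to `pd.read_csv`, but that is an extra step the pickle route avoids.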