# Mobile Sensor data

The dataset we are going to analyse comes from [Kaggle](https://www.kaggle.com/) datasets. It includes time-series data generated by accelerometer and gyroscope sensors (attitude, gravity, userAcceleration, and rotationRate). It was collected with an iPhone 6s kept in the participant's front pocket, using SensingKit, which collects information from the Core Motion framework on iOS devices. A total of 24 participants covering a range of gender, age, weight, and height performed 6 activities in 15 trials in the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing. With this dataset, the authors aim to look for personal-attribute fingerprints in time series of sensor data, i.e. attribute-specific patterns that can be used to infer the gender or personality of the data subjects in addition to their activities.


For each participant, the study began by collecting their demographic (age and gender) and physical (height and weight) information. The researchers then provided each participant with a dedicated smartphone (iPhone 6s) and asked them to keep it in their trousers' front pocket during the experiment. All participants were asked to wear flat shoes. They were then asked to perform 6 different activities (walking downstairs, walking upstairs, sitting, standing, walking and jogging) around [Queen Mary University of London](https://www.qmul.ac.uk/)'s Mile End campus.

For each trial, the researcher set up the phone, gave it to the current participant, and then stood in a corner. The participant pressed the start button of the Crowdsense app, put the phone in their trousers' front pocket and performed the specified activity. Participants were asked to do it as naturally as possible, as in their everyday life. At the end of each trial, they took the phone out of their pocket and pressed the stop button.

There are 15 trials:

**Long trials:** those numbered 1 to 9, lasting around 2 to 3 minutes.<br>
**Short trials:** those numbered 11 to 16, lasting around 30 seconds to 1 minute.

There are 24 data subjects. The `A_DeviceMotion_data` folder contains the time series collected by both the accelerometer and the gyroscope for all 15 trials. Every trial is a multivariate time series with 12 features: `attitude.roll`, `attitude.pitch`, `attitude.yaw`, `gravity.x`, `gravity.y`, `gravity.z`, `rotationRate.x`, `rotationRate.y`, `rotationRate.z`, `userAcceleration.x`, `userAcceleration.y`, `userAcceleration.z`.

The accelerometer measures the sum of two acceleration vectors: gravity and user acceleration. 

**User acceleration** is the acceleration that the user imparts to the device. Because Core Motion is able to track a device’s attitude using both the gyroscope and the accelerometer, it can differentiate between gravity and user acceleration. A CMDeviceMotion object provides both measurements in the gravity and userAcceleration properties. 
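As a minimal sketch of this decomposition, the total acceleration can be rebuilt by adding the two vectors component by component. The helper names `total_acceleration` and `magnitude` are hypothetical, and the values are taken from the first row of the data printed further down; the sketch assumes both vectors are expressed in the same units.

```python
import math


def total_acceleration(gravity, user_acceleration):
    """Component-wise sum of the gravity and user acceleration vectors (x, y, z)."""
    return tuple(g + u for g, u in zip(gravity, user_acceleration))


def magnitude(vector):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(component * component for component in vector))


# First row of the dataset: (gravity.x, gravity.y, gravity.z) and
# (userAcceleration.x, userAcceleration.y, userAcceleration.z)
gravity = (0.741895, 0.669768, -0.031672)
user_acceleration = (0.294894, -0.184493, 0.377542)

total = total_acceleration(gravity, user_acceleration)
print(total)               # what the raw accelerometer would report
print(magnitude(gravity))  # close to 1 for this row
```

The gravity vector has a norm close to 1 in this row, which is consistent with it only describing the orientation of the device relative to the ground.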

There are 6 different labels:


**`dws`**: downstairs <br>
**`ups`**: upstairs <br>
**`sit`**: sitting <br>
**`std`**: standing <br>
**`wlk`**: walking <br>
**`jog`**: jogging <br>
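These codes can be kept in a small lookup table so that plots and summaries show readable names. This is only a sketch: the names `ACTIVITY_NAMES` and `activity_name` are hypothetical and are not part of the dataset.

```python
# Mapping from the label codes used in the folder names to readable names
ACTIVITY_NAMES = {
    "dws": "downstairs",
    "ups": "upstairs",
    "sit": "sitting",
    "std": "standing",
    "wlk": "walking",
    "jog": "jogging",
}


def activity_name(code):
    """Return the readable name for a label code, failing loudly on unknown codes."""
    try:
        return ACTIVITY_NAMES[code]
    except KeyError:
        raise ValueError(f"unknown activity code: {code!r}") from None


print(activity_name("dws"))
```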


*__Acknowledgements:__*<br>
*Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2018. Protecting Sensory Data against Sensitive Inferences. In W-P2DS’18: 1st Workshop on Privacy by Design in Distributed Systems, April 23–26, 2018, Porto, Portugal. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3195258.3195260*


## Reading the data

Importing the libraries.

In [1]:
import pandas as pd
import statsmodels.api as sm
import os
import re
from pathlib import Path  # standard library; replaces the pathlib2 backport
from glob import glob  # standard library; glob(pattern, recursive=True) supports recursive wildcards

Next, I set my working directory. The cell has been hidden to preserve privacy, but the pattern of the code is:

In [2]:
#os.chdir("Path")

The data has been compiled in a series of separate csv files. In order to have a full time series and practise some data wrangling, I want to create a single dataset from all these files.

In [4]:
os.listdir(".") #List current folder

['.ipynb_checkpoints',
 'A_DeviceMotion_data',
 'data.txt',
 'data_subjects_info.csv',
 'mobile_ts.ipynb',
 'mobile_ts.md',
 'output.csv',
 'output2.csv',
 'output3.csv',
 'output4.csv',
 'output_1.csv',
 'output_original.csv']

In [6]:
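#root = Path+A_DeviceMotion_data # hidden information but this is the pattern of the string
root = "A_DeviceMotion_data" # relative to the working directory set above
df_csv = []

# Walk the folder tree and keep the full path of every csv file
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        if filename.endswith(".csv"):
            df_csv.append(os.path.join(dirpath, filename)) #path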
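The same listing can be sketched with the standard library `pathlib` module, whose `rglob` method searches sub-folders recursively. The function name `list_csv_files` is hypothetical, and the temporary folder below only imitates the layout of the dataset so that the example can run on its own.

```python
import tempfile
from pathlib import Path


def list_csv_files(root):
    """Return the sorted paths of every csv file found under root."""
    return sorted(str(path) for path in Path(root).rglob("*.csv"))


# Small self-contained demonstration on a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    trial_dir = Path(tmp) / "dws_1"
    trial_dir.mkdir()
    (trial_dir / "sub_1.csv").write_text("a,b\n1,2\n")
    (trial_dir / "notes.txt").write_text("not a csv file")

    found = list_csv_files(tmp)
    found_names = [Path(path).name for path in found]
    print(found_names)
```

Sorting the result makes the order of the files reproducible, which `os.walk` does not guarantee.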

In [7]:
#print(df_csv) #uncomment to print

Now we have a list of the files to read, each with its directory. The file name is the id of the subject in the experiment, so I want to keep this information. The performed activity is also stored, as the folder name.
So, I am going to create a pair of columns with this information.

In [8]:
data = []
for csv in df_csv:
    frame = pd.read_csv(csv)
    path = os.path.dirname(csv)
    frame["activity"] = os.path.basename(path)
    frame["subject"] = os.path.basename(csv)
    data.append(frame)

bigframe = pd.concat(data,ignore_index=True) #concatenate all files

There is some cleaning to be done before exploring our data.

In [9]:
bigframe["act_id"] = bigframe.activity.apply(lambda x: x.split("_")[-1]) #write to another column activity id
bigframe["activity"] = bigframe.activity.apply(lambda x: x.split("_")[0]) #column to keep only activity name
bigframe["subject"] = bigframe["subject"].str.replace(r"\.csv$", "", regex=True) #remove the file extension (rstrip strips a set of characters, not a suffix)
bigframe["subject"] = bigframe.subject.apply(lambda x: x.split("sub_")[-1]) #keep only subject id
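The three pieces of information can also be extracted in one step with regular expressions. The sketch below assumes the layout implied by the code above, that is, a folder named `<activity>_<trial>` containing files named `sub_<id>.csv`; the function name `parse_trial_path` is hypothetical. A regular expression also avoids relying on `str.rstrip`, which removes a set of characters rather than a suffix.

```python
import re
from pathlib import PurePath

# Assumed naming patterns: folder "<activity>_<trial>", file "sub_<id>.csv"
TRIAL_DIR = re.compile(r"^(?P<activity>[a-z]+)_(?P<trial>\d+)$")
SUBJECT_FILE = re.compile(r"^sub_(?P<subject>\d+)\.csv$")


def parse_trial_path(csv_path):
    """Return (activity, trial id, subject id) as strings for one csv path."""
    path = PurePath(csv_path)
    folder = TRIAL_DIR.match(path.parent.name)
    file_name = SUBJECT_FILE.match(path.name)
    if folder is None or file_name is None:
        raise ValueError(f"unexpected path layout: {csv_path}")
    return folder["activity"], folder["trial"], file_name["subject"]


print(parse_trial_path("A_DeviceMotion_data/dws_1/sub_1.csv"))

# rstrip removes trailing characters from the set {".", "c", "s", "v"}
print("abc.csv".rstrip(".csv"))
```

The last line shows why `rstrip` is fragile here: it also eats the final `c` of `abc`. It happens to work on this dataset only because the subject ids end in digits.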

In [10]:
bigframe.columns = ["time", "roll", "pitch", "yaw", 
                    "gr_x", "gr_y", "gr_z", 
                    "rot_x", "rot_y", "rot_z",
                    "acc_x", "acc_y", "acc_z", 
                    "activity", "subject", "act_id"] 
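Assigning a list to `columns` depends on the order of the columns. A possible alternative is to rename them by name with `DataFrame.rename`, sketched below on a toy frame. The mapping `COLUMN_NAMES` is hypothetical, and the key `"Unnamed: 0"` assumes that the first column of the csv files has an empty header, which pandas labels that way when reading.

```python
import pandas as pd

# Old name -> new name; columns missing from the mapping are left untouched
COLUMN_NAMES = {
    "Unnamed: 0": "time",
    "attitude.roll": "roll",
    "attitude.pitch": "pitch",
    "attitude.yaw": "yaw",
    "gravity.x": "gr_x",
    "gravity.y": "gr_y",
    "gravity.z": "gr_z",
    "rotationRate.x": "rot_x",
    "rotationRate.y": "rot_y",
    "rotationRate.z": "rot_z",
    "userAcceleration.x": "acc_x",
    "userAcceleration.y": "acc_y",
    "userAcceleration.z": "acc_z",
}

# Toy frame with only a few of the original columns
toy = pd.DataFrame({
    "Unnamed: 0": [0],
    "attitude.roll": [1.5],
    "userAcceleration.z": [0.4],
    "activity": ["dws"],
})

renamed = toy.rename(columns=COLUMN_NAMES)
print(list(renamed.columns))
```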

In [11]:
def label_len(act_id):
    '''
    Return "long" or "short" according to the trial number:
    Long trials: those numbered 1 to 9, around 2 to 3 minutes long.
    Short trials: those numbered 11 to 16, around 30 seconds to 1 minute long.
    Any other number returns None.
    '''
    trial = int(act_id)
    if trial in range(1, 10):
        return "long"
    elif trial in range(11, 17):
        return "short"
    return None


bigframe["label_len"] = bigframe.act_id.apply(label_len)
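A quick sanity check after this step is to confirm that every row received a label and to count how many distinct trials fall under each one. The sketch below uses a toy frame in place of `bigframe`; on the full data, the description above suggests 9 long and 6 short trials.

```python
import pandas as pd

# Toy stand-in for bigframe, with only the two columns the check needs
toy = pd.DataFrame({
    "act_id": ["1", "1", "9", "11", "16"],
    "label_len": ["long", "long", "long", "short", "short"],
})

# Rows whose trial number matched neither range would show up as missing
unlabelled = int(toy["label_len"].isna().sum())

# Number of distinct trial ids per label
trials_per_label = toy.groupby("label_len")["act_id"].nunique()

print(unlabelled)
print(trials_per_label)
```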

Let's take a look at what it looks like now:

In [12]:
print(bigframe.head())

   time      roll     pitch       yaw      gr_x      gr_y      gr_z     rot_x  \
0     0  1.528132 -0.733896  0.696372  0.741895  0.669768 -0.031672  0.316738   
1     1  1.527992 -0.716987  0.677762  0.753099  0.657116 -0.032255  0.842032   
2     2  1.527765 -0.706999  0.670951  0.759611  0.649555 -0.032707 -0.138143   
3     3  1.516768 -0.704678  0.675735  0.760709  0.647788 -0.041140 -0.025005   
4     4  1.493941 -0.703918  0.672994  0.760062  0.647210 -0.058530  0.114253   

      rot_y     rot_z     acc_x     acc_y     acc_z activity subject act_id  \
0  0.778180  1.082764  0.294894 -0.184493  0.377542      dws       1      1   
1  0.424446  0.643574  0.219405  0.035846  0.114866      dws       1      1   
2 -0.040741  0.343563  0.010714  0.134701 -0.167808      dws       1      1   
3 -1.048717  0.035860 -0.008389  0.136788  0.094958      dws       1      1   
4 -0.912890  0.047341  0.199441  0.353996 -0.044299      dws       1      1   

  label_len  
0      long  
1      lon

Saving our final result is never a bad option.

In [13]:
bigframe.to_csv("output.csv", index = True, header = True)
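Because the file is written with `index = True`, the index is stored as an unnamed first column. When reading the file back, passing `index_col=0` restores it as the index, and forcing the id columns to `str` keeps them as text instead of integers. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of `output.csv`.

```python
import io

import pandas as pd

# Toy stand-in for bigframe
toy = pd.DataFrame({
    "time": [0, 1],
    "acc_x": [0.25, -0.5],
    "activity": ["dws", "dws"],
    "subject": ["1", "1"],
    "act_id": ["1", "1"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=True, header=True)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the index
restored = pd.read_csv(buffer, index_col=0, dtype={"subject": str, "act_id": str})
print(restored.head())
```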