<center><h1>Pre Processing</h1></center>

<h2>Importing Libraries</h2>

In [1]:
import pandas as pd
import numpy as np

<h2>Loading Data</h2>

<h3>X Data</h3>

If we examine our data, we can see that the X_train dataset has no header row; the feature names are stored in a separate file called features.txt.

So let us load the feature names first, so that they are easy to attach to the data when we load it.

In [2]:
features = pd.read_csv("../data/UCI HAR Dataset/features.txt", header=None, delim_whitespace=True, names = ["index", "feature"])
feature_names = pd.Series(features["feature"].values)
feature_names[:5]

0    tBodyAcc-mean()-X
1    tBodyAcc-mean()-Y
2    tBodyAcc-mean()-Z
3     tBodyAcc-std()-X
4     tBodyAcc-std()-Y
dtype: object

In [3]:
features["feature"].value_counts()

fBodyAccJerk-bandsEnergy()-41,48    3
fBodyAcc-bandsEnergy()-57,64        3
fBodyAcc-bandsEnergy()-41,48        3
fBodyAcc-bandsEnergy()-33,40        3
fBodyAcc-bandsEnergy()-25,32        3
                                   ..
tBodyGyro-arCoeff()-Z,3             1
tBodyGyro-arCoeff()-Z,2             1
tBodyGyro-arCoeff()-Z,1             1
tBodyGyro-arCoeff()-Y,4             1
angle(Z,gravityMean)                1
Name: feature, Length: 477, dtype: int64

Looks like there are duplicates in our feature names (only 477 distinct names), which would conflict when we assign them as column names to the data we are loading, so let us make the necessary changes to accommodate the duplicates.

In [4]:
feature_names = feature_names + "_" + features.index.astype(str)
feature_names.head()

0    tBodyAcc-mean()-X_0
1    tBodyAcc-mean()-Y_1
2    tBodyAcc-mean()-Z_2
3     tBodyAcc-std()-X_3
4     tBodyAcc-std()-Y_4
dtype: object

This should solve the problem: we are attaching the index number, which is just the column position starting from 0, so each column name is unique.
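As a minimal sketch of the idea, the toy Series below stands in for features.txt (the three names are only illustrative, and `n_duplicated` and `unique_names` are names introduced here). It counts the colliding names and then applies the same index suffix.

```python
import pandas as pd

# Toy stand-in for the feature list: the first two names collide on purpose.
names = pd.Series([
    "fBodyAcc-bandsEnergy()-1,8",
    "fBodyAcc-bandsEnergy()-1,8",
    "tBodyAcc-mean()-X",
])

# duplicated(keep=False) flags every member of a duplicated group.
n_duplicated = int(names.duplicated(keep=False).sum())

# Appending the positional index makes every name distinct.
unique_names = names + "_" + names.index.astype(str)

print(n_duplicated)
print(unique_names.tolist())
```

Checking `unique_names.is_unique` before passing the names to `read_csv` is a cheap way to confirm the fix worked.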

<b>data = pd.read_csv("../data/UCI HAR Dataset/train/X_train.txt",sep=" ", header=None)</b>

I initially tried to load the dataset using the line of code above, but it failed. It threw an error saying that row 27 had 665 columns while 662 were expected. However, the Readme.md file makes no mention of an uneven number of columns.

Taking a look at the features.txt file, there are only 561 features, which contradicts the number of columns that our function thinks there are.

Then it was time to dive into the X_train.txt file and take a deeper look. The first observation is that the values are separated by spaces, but on a closer look we cannot tell whether they are separated by a single space or by several, which is where the issue arises.

When we use sep = " ", pandas treats every single space as a column separator, so a run of several spaces produces extra empty columns, while delim_whitespace=True treats any amount of whitespace as a single column separator.

<b>It is a common practice to use delim_whitespace = True when the data columns are separated by whitespace rather than a single fixed delimiter.</b>
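The difference can be illustrated with a small sketch. The two-line string below is made-up sample text, not the real X_train.txt, and `str.split(" ")` is used only as a stand-in for what a single-space delimiter does. The pandas call uses `sep=r"\s+"`, which behaves like `delim_whitespace=True`; recent pandas releases deprecate `delim_whitespace` in favor of this form.

```python
import io

import pandas as pd

# Made-up sample: leading spaces and an irregular number of spaces between values.
raw = "  0.28 -0.02  -0.13\n  0.27 -0.01 -0.12\n"

# Splitting on exactly one space yields empty fields and uneven field counts.
single_space_counts = [len(line.split(" ")) for line in raw.splitlines()]

# Treating any run of whitespace as one separator recovers three columns per row.
frame = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(single_space_counts)
print(frame.shape)
```

The uneven counts in the first result are the same kind of mismatch that the loading error complained about.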

In [5]:
X_data = pd.read_csv("../data/UCI HAR Dataset/train/X_train.txt",delim_whitespace=True, header=None, names = feature_names)

In [6]:
X_data.shape

(7352, 561)

In [7]:
X_data.head()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tBodyAcc-mad()-X_6,tBodyAcc-mad()-Y_7,tBodyAcc-mad()-Z_8,tBodyAcc-max()-X_9,...,fBodyBodyGyroJerkMag-meanFreq()_551,fBodyBodyGyroJerkMag-skewness()_552,fBodyBodyGyroJerkMag-kurtosis()_553,"angle(tBodyAccMean,gravity)_554","angle(tBodyAccJerkMean),gravityMean)_555","angle(tBodyGyroMean,gravityMean)_556","angle(tBodyGyroJerkMean,gravityMean)_557","angle(X,gravityMean)_558","angle(Y,gravityMean)_559","angle(Z,gravityMean)_560"
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.074323,-0.298676,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,0.404573,-0.11729,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,0.087753,-0.351471,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892


In [8]:
X_test = pd.read_csv("../data/UCI HAR Dataset/test/X_test.txt",delim_whitespace=True, header=None, names = feature_names)

In [9]:
X_test.shape

(2947, 561)

We can see that the training set has 7352 rows and the test set has 2947 rows, each with 561 features.

In [10]:
X_test.head()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tBodyAcc-mad()-X_6,tBodyAcc-mad()-Y_7,tBodyAcc-mad()-Z_8,tBodyAcc-max()-X_9,...,fBodyBodyGyroJerkMag-meanFreq()_551,fBodyBodyGyroJerkMag-skewness()_552,fBodyBodyGyroJerkMag-kurtosis()_553,"angle(tBodyAccMean,gravity)_554","angle(tBodyAccJerkMean),gravityMean)_555","angle(tBodyGyroMean,gravityMean)_556","angle(tBodyGyroJerkMean,gravityMean)_557","angle(X,gravityMean)_558","angle(Y,gravityMean)_559","angle(Z,gravityMean)_560"
0,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,-0.952501,-0.925249,-0.674302,-0.894088,...,0.071645,-0.33037,-0.705974,0.006462,0.16292,-0.825886,0.271151,-0.720009,0.276801,-0.057978
1,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,-0.986799,-0.968401,-0.945823,-0.894088,...,-0.401189,-0.121845,-0.594944,-0.083495,0.0175,-0.434375,0.920593,-0.698091,0.281343,-0.083898
2,0.275485,-0.02605,-0.118152,-0.993819,-0.969926,-0.962748,-0.994403,-0.970735,-0.963483,-0.93926,...,0.062891,-0.190422,-0.640736,-0.034956,0.202302,0.064103,0.145068,-0.702771,0.280083,-0.079346
3,0.270298,-0.032614,-0.11752,-0.994743,-0.973268,-0.967091,-0.995274,-0.974471,-0.968897,-0.93861,...,0.116695,-0.344418,-0.736124,-0.017067,0.154438,0.340134,0.296407,-0.698954,0.284114,-0.077108
4,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,-0.994111,-0.965953,-0.977346,-0.93861,...,-0.121711,-0.534685,-0.846595,-0.002223,-0.040046,0.736715,-0.118545,-0.692245,0.290722,-0.073857


In [11]:
Y_data = pd.read_csv("../data/UCI HAR Dataset/train/y_train.txt", delim_whitespace=True, header=None, names = ["activity_code"])

In [12]:
Y_data.head()

Unnamed: 0,activity_code
0,5
1,5
2,5
3,5
4,5


In [13]:
Y_data.value_counts()

activity_code
6                1407
5                1374
4                1286
1                1226
2                1073
3                 986
dtype: int64

In [14]:
Y_test = pd.read_csv("../data/UCI HAR Dataset/test/y_test.txt", delim_whitespace=True, header=None, names = ["activity_code"])

In [15]:
Y_test.head()

Unnamed: 0,activity_code
0,5
1,5
2,5
3,5
4,5


In [16]:
Y_test.value_counts()

activity_code
6                537
5                532
1                496
4                491
2                471
3                420
dtype: int64

Looks like we have 6 distinct activities, coded by the numbers 1 through 6. Let us associate the activity codes with their activity names, so that the labels are easier to interpret.

In [17]:
activity_labels = pd.read_csv("../data/UCI HAR Dataset/activity_labels.txt",delim_whitespace = True, header=None, names = ["id", "label"])

In [18]:
activity_labels

Unnamed: 0,id,label
0,1,WALKING
1,2,WALKING_UPSTAIRS
2,3,WALKING_DOWNSTAIRS
3,4,SITTING
4,5,STANDING
5,6,LAYING


In [19]:
Y_data["activity"] = Y_data["activity_code"].map(dict(zip(activity_labels.id, activity_labels.label)))
Y_test["activity"] = Y_test["activity_code"].map(dict(zip(activity_labels.id, activity_labels.label)))

What we are doing here is mapping each activity code to the label given in the activity labels file, so that we can see both the activity code and its label for easier human understanding.
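A minimal sketch of the same mapping on toy data, with one deliberately invalid code to show how an unmapped value surfaces. The frames and the names `lookup`, `mapped` and `unmapped` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy label table and codes; code 7 is deliberately missing from the table.
labels = pd.DataFrame({"id": [1, 2], "label": ["WALKING", "WALKING_UPSTAIRS"]})
codes = pd.Series([1, 2, 2, 7], name="activity_code")

# Series.map with a dict returns NaN for keys that are not in the dict.
lookup = dict(zip(labels["id"], labels["label"]))
mapped = codes.map(lookup)

# Collect the codes that found no label.
unmapped = codes[mapped.isna()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```

An empty `unmapped` list on the real data would confirm that every code has a label.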

In [20]:
Y_data.value_counts()

activity_code  activity          
6              LAYING                1407
5              STANDING              1374
4              SITTING               1286
1              WALKING               1226
2              WALKING_UPSTAIRS      1073
3              WALKING_DOWNSTAIRS     986
dtype: int64

In [21]:
Y_test.value_counts()

activity_code  activity          
6              LAYING                537
5              STANDING              532
1              WALKING               496
4              SITTING               491
2              WALKING_UPSTAIRS      471
3              WALKING_DOWNSTAIRS    420
dtype: int64

The counts per activity code are unchanged after adding the labels, so the mapping is consistent and we can proceed.

<h2>Overview of our Data</h2>

In [22]:
X_data.describe()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tBodyAcc-mad()-X_6,tBodyAcc-mad()-Y_7,tBodyAcc-mad()-Z_8,tBodyAcc-max()-X_9,...,fBodyBodyGyroJerkMag-meanFreq()_551,fBodyBodyGyroJerkMag-skewness()_552,fBodyBodyGyroJerkMag-kurtosis()_553,"angle(tBodyAccMean,gravity)_554","angle(tBodyAccJerkMean),gravityMean)_555","angle(tBodyGyroMean,gravityMean)_556","angle(tBodyGyroJerkMean,gravityMean)_557","angle(X,gravityMean)_558","angle(Y,gravityMean)_559","angle(Z,gravityMean)_560"
count,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,...,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0,7352.0
mean,0.274488,-0.017695,-0.109141,-0.605438,-0.510938,-0.604754,-0.630512,-0.526907,-0.60615,-0.468604,...,0.125293,-0.307009,-0.625294,0.008684,0.002186,0.008726,-0.005981,-0.489547,0.058593,-0.056515
std,0.070261,0.040811,0.056635,0.448734,0.502645,0.418687,0.424073,0.485942,0.414122,0.544547,...,0.250994,0.321011,0.307584,0.336787,0.448306,0.608303,0.477975,0.511807,0.29748,0.279122
min,-1.0,-1.0,-1.0,-1.0,-0.999873,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-0.995357,-0.999765,-0.97658,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,0.262975,-0.024863,-0.120993,-0.992754,-0.978129,-0.980233,-0.993591,-0.978162,-0.980251,-0.936219,...,-0.023692,-0.542602,-0.845573,-0.121527,-0.289549,-0.482273,-0.376341,-0.812065,-0.017885,-0.143414
50%,0.277193,-0.017219,-0.108676,-0.946196,-0.851897,-0.859365,-0.950709,-0.857328,-0.857143,-0.881637,...,0.134,-0.343685,-0.711692,0.009509,0.008943,0.008735,-0.000368,-0.709417,0.182071,0.003181
75%,0.288461,-0.010783,-0.097794,-0.242813,-0.034231,-0.262415,-0.29268,-0.066701,-0.265671,-0.017129,...,0.289096,-0.126979,-0.503878,0.150865,0.292861,0.506187,0.359368,-0.509079,0.248353,0.107659
max,1.0,1.0,1.0,1.0,0.916238,1.0,1.0,0.967664,1.0,1.0,...,0.9467,0.989538,0.956845,1.0,1.0,0.998702,0.996078,1.0,0.478157,1.0


In [23]:
X_test.describe()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tBodyAcc-mad()-X_6,tBodyAcc-mad()-Y_7,tBodyAcc-mad()-Z_8,tBodyAcc-max()-X_9,...,fBodyBodyGyroJerkMag-meanFreq()_551,fBodyBodyGyroJerkMag-skewness()_552,fBodyBodyGyroJerkMag-kurtosis()_553,"angle(tBodyAccMean,gravity)_554","angle(tBodyAccJerkMean),gravityMean)_555","angle(tBodyGyroMean,gravityMean)_556","angle(tBodyGyroJerkMean,gravityMean)_557","angle(X,gravityMean)_558","angle(Y,gravityMean)_559","angle(Z,gravityMean)_560"
count,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,...,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0,2947.0
mean,0.273996,-0.017863,-0.108386,-0.613635,-0.50833,-0.633797,-0.641278,-0.522676,-0.637038,-0.462063,...,0.130236,-0.277593,-0.598756,0.005264,0.003799,0.040029,-0.017298,-0.513923,0.074886,-0.04872
std,0.06057,0.025745,0.042747,0.412597,0.494269,0.362699,0.385199,0.479899,0.357753,0.523916,...,0.231018,0.317245,0.311042,0.336147,0.445077,0.634989,0.501311,0.509205,0.3243,0.241467
min,-0.592004,-0.362884,-0.576184,-0.999606,-1.0,-0.998955,-0.999417,-0.999914,-0.998899,-0.952357,...,-0.785543,-1.0,-1.0,-1.0,-0.993402,-0.998898,-0.991096,-0.984195,-0.913704,-0.949228
25%,0.262075,-0.024961,-0.121162,-0.990914,-0.973664,-0.976122,-0.992333,-0.974131,-0.975352,-0.934447,...,-0.008433,-0.517494,-0.829593,-0.130541,-0.2826,-0.518924,-0.428375,-0.829722,0.02214,-0.098485
50%,0.277113,-0.016967,-0.108458,-0.931214,-0.790972,-0.827534,-0.937664,-0.799907,-0.817005,-0.852659,...,0.142676,-0.311023,-0.683672,0.005188,0.006767,0.047113,-0.026726,-0.729648,0.181563,-0.010671
75%,0.288097,-0.010143,-0.097123,-0.267395,-0.105919,-0.311432,-0.321719,-0.133488,-0.322771,-0.009965,...,0.28832,-0.083559,-0.458332,0.1462,0.288113,0.622151,0.394387,-0.545939,0.260252,0.092373
max,0.671887,0.246106,0.494114,0.465299,1.0,0.489703,0.439657,1.0,0.427958,0.786436,...,1.0,1.0,1.0,0.998898,0.986347,1.0,1.0,0.83318,1.0,0.973113


In [24]:
X_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7352 entries, 0 to 7351
Columns: 561 entries, tBodyAcc-mean()-X_0 to angle(Z,gravityMean)_560
dtypes: float64(561)
memory usage: 31.5 MB


In [25]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2947 entries, 0 to 2946
Columns: 561 entries, tBodyAcc-mean()-X_0 to angle(Z,gravityMean)_560
dtypes: float64(561)
memory usage: 12.6 MB


<h3> Checking for Null Values</h3>

In [26]:
X_data.isnull().values.any()

False

In [27]:
X_test.isnull().values.any()

False

Our data does not have any null values, so it is safe to proceed to the next steps.
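If the check had returned True, a per-column count would show where the gaps are. The sketch below uses a toy frame with one missing value; the frame and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in column "a".
frame = pd.DataFrame({"a": [0.1, np.nan, 0.3], "b": [1.0, 2.0, 3.0]})

# Count nulls per column, then keep only the columns that have any.
nulls_per_column = frame.isnull().sum()
columns_with_nulls = nulls_per_column[nulls_per_column > 0].index.tolist()

print(columns_with_nulls)
```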

<h2> Feature Selection </h2>

Both our training data (X_data) and our test set have 561 features each, which is a lot to train a model on.

Using all of the features would slow down training and make the model more complex and harder to interpret.

So let us take a deeper look at features.txt to see which features we can drop, based on what each of them measures.

In [28]:
selected_features = feature_names[feature_names.str.contains("mean\\(\\)|std\\(\\)", regex=True)]

In [29]:
selected_features

0                  tBodyAcc-mean()-X_0
1                  tBodyAcc-mean()-Y_1
2                  tBodyAcc-mean()-Z_2
3                   tBodyAcc-std()-X_3
4                   tBodyAcc-std()-Y_4
                    ...               
516      fBodyBodyAccJerkMag-std()_516
528        fBodyBodyGyroMag-mean()_528
529         fBodyBodyGyroMag-std()_529
541    fBodyBodyGyroJerkMag-mean()_541
542     fBodyBodyGyroJerkMag-std()_542
Length: 66, dtype: object

If we take a closer look at the features, every signal comes with a mean() and a std() estimate. These 66 features are the ones we select to train our model, so that we can compare it with a model trained on all the features.
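The escaped parentheses in the pattern matter. A small sketch on five of the column names seen above shows that `mean\(\)` matches `mean()` but not `meanFreq()` or `gravityMean` (the variable names are illustrative):

```python
import pandas as pd

# A handful of column names taken from the outputs above.
names = pd.Series([
    "tBodyAcc-mean()-X_0",
    "tBodyAcc-std()-X_3",
    "tBodyAcc-mad()-X_6",
    "fBodyBodyGyroJerkMag-meanFreq()_551",
    "angle(X,gravityMean)_558",
])

# Literal "mean()" or "std()"; the backslashes stop "(" and ")" acting as a regex group.
pattern = r"mean\(\)|std\(\)"
selected = names[names.str.contains(pattern, regex=True)]

print(selected.tolist())
```

Without the escapes, `mean()` would be read as `mean` followed by an empty group and would also match `meanFreq()`.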

<h3><b>Why are we selecting the mean and standard deviation feature over everything else?</b></h3>

The mean and standard deviation are selected because they capture the most essential patterns of human motion. The mean represents the overall activity level, whereas the standard deviation represents the variability or intensity of the motion. Together they provide highly discriminative signals for human activities (the goal of this project) without being overly sensitive to noise.

In [30]:
diminished_X_train = X_data[selected_features]
diminished_X_train.head()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tGravityAcc-mean()-X_40,tGravityAcc-mean()-Y_41,tGravityAcc-mean()-Z_42,tGravityAcc-std()-X_43,...,fBodyGyro-std()-Y_427,fBodyGyro-std()-Z_428,fBodyAccMag-mean()_502,fBodyAccMag-std()_503,fBodyBodyAccJerkMag-mean()_515,fBodyBodyAccJerkMag-std()_516,fBodyBodyGyroMag-mean()_528,fBodyBodyGyroMag-std()_529,fBodyBodyGyroJerkMag-mean()_541,fBodyBodyGyroJerkMag-std()_542
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,0.963396,-0.14084,0.115375,-0.98525,...,-0.973886,-0.994035,-0.952155,-0.956134,-0.993726,-0.993755,-0.980135,-0.961309,-0.99199,-0.990697
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,0.966561,-0.141551,0.109379,-0.997411,...,-0.987168,-0.989785,-0.980857,-0.975866,-0.990335,-0.99196,-0.988296,-0.983322,-0.995854,-0.996399
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,0.966878,-0.14201,0.101884,-0.999574,...,-0.993399,-0.987328,-0.987795,-0.989015,-0.98928,-0.990867,-0.989255,-0.986028,-0.995031,-0.995127
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,0.967615,-0.143976,0.09985,-0.996646,...,-0.991646,-0.988678,-0.987519,-0.986742,-0.992769,-0.9917,-0.989413,-0.987836,-0.995221,-0.995237
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,0.968224,-0.14875,0.094486,-0.998429,...,-0.991956,-0.987944,-0.993591,-0.990063,-0.995523,-0.994389,-0.991433,-0.989059,-0.995093,-0.995465


In [31]:
diminished_X_train.shape

(7352, 66)

In [32]:
diminished_X_test = X_test[selected_features]

In [33]:
diminished_X_test.head()

Unnamed: 0,tBodyAcc-mean()-X_0,tBodyAcc-mean()-Y_1,tBodyAcc-mean()-Z_2,tBodyAcc-std()-X_3,tBodyAcc-std()-Y_4,tBodyAcc-std()-Z_5,tGravityAcc-mean()-X_40,tGravityAcc-mean()-Y_41,tGravityAcc-mean()-Z_42,tGravityAcc-std()-X_43,...,fBodyGyro-std()-Y_427,fBodyGyro-std()-Z_428,fBodyAccMag-mean()_502,fBodyAccMag-std()_503,fBodyBodyAccJerkMag-mean()_515,fBodyBodyAccJerkMag-std()_516,fBodyBodyGyroMag-mean()_528,fBodyBodyGyroMag-std()_529,fBodyBodyGyroJerkMag-mean()_541,fBodyBodyGyroJerkMag-std()_542
0,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,0.936489,-0.282719,0.115288,-0.925427,...,-0.822677,-0.956165,-0.790946,-0.711074,-0.895061,-0.89636,-0.77061,-0.797113,-0.890165,-0.907308
1,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,0.927404,-0.289215,0.152568,-0.989057,...,-0.932011,-0.970143,-0.954127,-0.959746,-0.945437,-0.934152,-0.924461,-0.916774,-0.951977,-0.938212
2,0.275485,-0.02605,-0.118152,-0.993819,-0.969926,-0.962748,0.929915,-0.287513,0.146086,-0.995937,...,-0.977194,-0.979095,-0.97565,-0.983784,-0.971069,-0.970308,-0.975209,-0.973998,-0.985689,-0.983273
3,0.270298,-0.032614,-0.11752,-0.994743,-0.973268,-0.967091,0.928881,-0.293396,0.142926,-0.993139,...,-0.971909,-0.965275,-0.973393,-0.98212,-0.971655,-0.978484,-0.976297,-0.971248,-0.985562,-0.985843
4,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,0.9266,-0.302961,0.138307,-0.995575,...,-0.976565,-0.970017,-0.977739,-0.978838,-0.987489,-0.989716,-0.977007,-0.969619,-0.990498,-0.990572


In [34]:
diminished_X_test.shape

(2947, 66)

<h2> Scaling </h2>

Tree-based models split on threshold values, so scaling the features makes no difference to them. I am therefore not going to scale the data, although scaling is a safe default for algorithms that are sensitive to feature magnitudes, which trees are not.
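The claim can be checked with a small sketch. The helper below is a hypothetical, simplified single-feature split search using Gini impurity, not the implementation of any particular library. It shows that the best threshold split partitions the samples identically before and after standardization, because standardization preserves the ordering of the values.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean 'left' mask of the threshold split with the lowest Gini impurity."""
    values = np.unique(x)  # sorted distinct values
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    best_mask, best_score = None, np.inf
    for t in thresholds:
        left = x <= t
        score = 0.0
        for mask in (left, ~left):
            p = np.bincount(y[mask], minlength=2) / mask.sum()
            score += mask.sum() / len(y) * (1.0 - np.sum(p ** 2))
        if score < best_score:
            best_mask, best_score = left, score
    return best_mask

# Toy binary problem: small values belong to class 0, large values to class 1.
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.75])
y = np.array([0, 0, 0, 1, 1, 1])

# Standardize the feature (zero mean, unit variance).
x_scaled = (x - x.mean()) / x.std()

same_partition = np.array_equal(
    best_split_partition(x, y), best_split_partition(x_scaled, y)
)
print(same_partition)
```

The threshold value itself changes with the scale, but the rows that fall on each side of it do not, which is all a tree uses.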

<h2> Checking Class-Imbalance</h2>

For classification, it is always recommended to check for class imbalance before proceeding. If the classes are imbalanced, the model might not learn the minority classes effectively because very little data is available for them, and on top of that common metrics like accuracy no longer provide a reliable measure.

Checking for class-imbalance on both training and testing label sets.

In [35]:
Y_data.value_counts(normalize=True)

activity_code  activity          
6              LAYING                0.191376
5              STANDING              0.186888
4              SITTING               0.174918
1              WALKING               0.166757
2              WALKING_UPSTAIRS      0.145947
3              WALKING_DOWNSTAIRS    0.134113
dtype: float64

In [36]:
Y_test.value_counts(normalize=True)

activity_code  activity          
6              LAYING                0.182219
5              STANDING              0.180523
1              WALKING               0.168307
4              SITTING               0.166610
2              WALKING_UPSTAIRS      0.159824
3              WALKING_DOWNSTAIRS    0.142518
dtype: float64

We can say that class imbalance is not a problem in this case, so we can proceed and export our files.
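One way to put a number on this is the ratio between the largest and the smallest class. The sketch below uses the class counts reported earlier; the ratio is just a convenient summary chosen here, not a formal test for imbalance.

```python
import pandas as pd

# Class counts copied from the value_counts() outputs above.
train_counts = pd.Series({
    "LAYING": 1407, "STANDING": 1374, "SITTING": 1286,
    "WALKING": 1226, "WALKING_UPSTAIRS": 1073, "WALKING_DOWNSTAIRS": 986,
})
test_counts = pd.Series({
    "LAYING": 537, "STANDING": 532, "WALKING": 496,
    "SITTING": 491, "WALKING_UPSTAIRS": 471, "WALKING_DOWNSTAIRS": 420,
})

# Largest class divided by smallest class; 1.0 would mean perfectly balanced.
train_ratio = train_counts.max() / train_counts.min()
test_ratio = test_counts.max() / test_counts.min()

print(round(train_ratio, 2), round(test_ratio, 2))
```

Both ratios stay well below 2, which supports treating the classes as roughly balanced.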

<h2>Exporting Data</h2>

Exporting datasets with all the features.

In [37]:
X_data.to_csv("../Processed Data/X_train.csv", index = False)
X_test.to_csv("../Processed Data/X_test.csv", index = False)
Y_data.to_csv("../Processed Data/Y_train.csv", index = False)
Y_test.to_csv("../Processed Data/Y_test.csv", index = False)

Exporting the datasets with the reduced feature set.

In [38]:
diminished_X_train.to_csv("../Processed Data/feature_reduced_X_train.csv", index = False)
diminished_X_test.to_csv("../Processed Data/feature_reduced_X_test.csv", index = False)
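Two details are worth a quick check when exporting: `to_csv` needs the target folder to exist, and some column names contain commas. The sketch below writes a toy frame to a temporary directory (the file name `X_sample.csv` and the values are made up) and reads it back to show that the quoted header survives the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame; the second column name contains a comma, like the angle(...) features.
frame = pd.DataFrame({
    "tBodyAcc-mean()-X_0": [0.5, 0.25],
    "angle(X,gravityMean)_558": [-0.5, -0.75],
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder first so that to_csv has somewhere to write.
    out_dir = Path(tmp) / "Processed Data"
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / "X_sample.csv"
    frame.to_csv(path, index=False)

    # The comma-containing name is written in quotes and restored on read.
    header_line = path.read_text().splitlines()[0]
    restored = pd.read_csv(path)

print(header_line)
print(restored.columns.tolist())
```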