## Preparing the data

For data preparation, I am going to:
- read all the files,
- sort them by timestamp,
- merge the corresponding label and sensor files,
- drop rows with missing values,
- add a sensor type to the dataset,
- combine data from all sensors, sort by timestamp, and then delete the timestamps.

In [1]:
import pandas as pd
import numpy as np
a_sensor_data = pd.read_csv("Downloads/Phone shake detection assignmen/a.sensor.csv")
a_labels = pd.read_csv("Downloads/Phone shake detection assignmen/a.lbl.csv")

p_sensor_data = pd.read_csv("Downloads/Phone shake detection assignmen/p.sensor.csv")
p_labels = pd.read_csv("Downloads/Phone shake detection assignmen/p.lbl.csv")


m_sensor_data = pd.read_csv("Downloads/Phone shake detection assignmen/m.sensor.csv")
m_labels = pd.read_csv("Downloads/Phone shake detection assignmen/m.lbl.csv")

# sort the files by time
a_sensor_data.sort_values('timestamp(ms)', axis=0, inplace=True)
a_labels.sort_values('timestamp(ms)', axis=0, inplace=True)
p_sensor_data.sort_values('timestamp(ms)', axis=0, inplace=True)
p_labels.sort_values('timestamp(ms)', axis=0, inplace=True)
m_sensor_data.sort_values('timestamp(ms)', axis=0, inplace=True)
m_labels.sort_values('timestamp(ms)', axis=0, inplace=True)

In [2]:
# add the labels to the sensor file.
# For consecutive label timestamps t_k < t_m, every sensor reading with a timestamp between t_k and t_m gets the label recorded at t_k.
# For example, if the label is 1 at t_k and 0 at t_m, all readings between t_k and t_m are labeled 1.

def add_labels(sensor_data, label_data):
    sensor_data['label'] = np.nan
    for i in range(label_data.shape[0] - 1):
        # boolean mask of the readings that fall inside the current label interval
        # (a mask avoids mixing up index labels and positions after sorting)
        in_interval = (sensor_data['timestamp(ms)'] >= label_data.iloc[i, 0]) & (
                sensor_data['timestamp(ms)'] <= label_data.iloc[i + 1, 0])
        sensor_data.loc[in_interval, 'label'] = label_data.iloc[i, 1]
    
    return sensor_data

a_sensor_data = add_labels(a_sensor_data, a_labels)
p_sensor_data = add_labels(p_sensor_data, p_labels)
m_sensor_data = add_labels(m_sensor_data, m_labels)
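
As a possible vectorized alternative to the loop above, here is a minimal sketch based on `pd.merge_asof`. The helper name `add_labels_asof` and the toy frames are hypothetical, and the sketch assumes the label file holds the timestamp column plus a single label column, with matching timestamp dtypes in both files. Rows that sit exactly on a label timestamp may be labeled differently from the loop version, so treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

def add_labels_asof(sensor_data, label_data, ts_col="timestamp(ms)"):
    """Label each sensor row with the most recent label at or before its timestamp."""
    sensor_sorted = sensor_data.sort_values(ts_col)
    # Assumption: the label file has the timestamp column plus one label column.
    label_col = [c for c in label_data.columns if c != ts_col][0]
    labels_sorted = (
        label_data[[ts_col, label_col]]
        .rename(columns={label_col: "label"})
        .sort_values(ts_col)
    )
    merged = pd.merge_asof(sensor_sorted, labels_sorted, on=ts_col, direction="backward")
    # Rows after the last label timestamp have no closing boundary: leave them unlabeled.
    last_ts = labels_sorted[ts_col].iloc[-1]
    merged["label"] = merged["label"].where(merged[ts_col] <= last_ts)
    return merged

# Toy data (hypothetical values, only to show the behavior)
sensor = pd.DataFrame({"timestamp(ms)": [0, 5, 10, 15, 20, 25, 30],
                       "acceleration_x(g)": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})
labels = pd.DataFrame({"timestamp(ms)": [5, 15, 25], "label": [1, 0, 1]})
labeled = add_labels_asof(sensor, labels)
print(labeled)
```

Readings before the first label timestamp and after the last one stay unlabeled, so the later `dropna` step would remove them.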

In [3]:
# drop missing values
a_sensor_data.dropna(axis=0, how='any', inplace=True)
p_sensor_data.dropna(axis=0, how='any', inplace=True)
m_sensor_data.dropna(axis=0, how='any', inplace=True)

In [4]:
# add a sensor type to the dataset
a_sensor_data['sensor_type'] = 'a'
p_sensor_data['sensor_type'] = 'p'
m_sensor_data['sensor_type'] = 'm'

In [5]:
# rearrange columns
column_order = ['timestamp(ms)', 'acceleration_x(g)', 'acceleration_y(g)', 'acceleration_z(g)', 'roll(rad)', 'pitch(rad)', 'yaw(rad)', 'angular_velocity_x(rad/sec)', 'angular_velocity_y(rad/sec)', 'angular_velocity_z(rad/sec)', 'sensor_type', 'label']
a_sensor_data = a_sensor_data[column_order]
p_sensor_data = p_sensor_data[column_order]
m_sensor_data = m_sensor_data[column_order]

In [6]:
# combine data, sort by timestamps, and delete timestamps
phone_shake_data = pd.concat([a_sensor_data, p_sensor_data, m_sensor_data], ignore_index=True)
phone_shake_data.sort_values('timestamp(ms)', inplace=True)
phone_shake_data.reset_index(inplace=True)
phone_shake_data.drop(["index", 'timestamp(ms)'], inplace=True, axis=1)


## Analysis

### Univariate analysis

#### Check the number of ones and zeros in the data

In [7]:
phone_shake_data['label'].value_counts()

1.0    84386
0.0     2548
Name: label, dtype: int64

This is an imbalanced dataset: the ratio of zeros to ones is about 1:33. We might have to address this when we start the modeling process.

#### Check for outliers

I am going to use z-scores to evaluate the data. If any data point has a z-score more than 3 or less than -3, I am going to flag it as an outlier.


In [8]:
# let's check for outliers in our data
from scipy import stats
import numpy as np
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
z = np.abs(stats.zscore(phone_shake_data[numeric_cols]))
threshold = 3
rows, cols = np.where(z > threshold)
phone_shake_data[numeric_cols].values[rows, cols]
phone_shake_data[numeric_cols].iloc[rows[0], cols[0]]
print("Number of records that has atleast one outlier value: ", len(rows))


Number of records that has atleast one outlier value:  16444


I kept the threshold at three standard deviations because, on a normal distribution, that range covers about 99.7% of the data. Other thresholds are possible, but since this is sensor data and these accelerations and velocities may be genuine, I don't want to be too strict with outlier treatment.
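
As a quick sanity check on the three-sigma rule, the share of a normal distribution within `k` standard deviations of the mean is `erf(k / sqrt(2))`. The helper name `normal_coverage` below is only illustrative.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {100 * normal_coverage(k):.2f}%")
```

This only holds for normally distributed data, which the sensor readings here may well not be.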

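Going by z-scores, about 16k values are flagged. Note that `len(rows)` counts flagged values rather than distinct rows, so a row with several outlying values is counted several times. Summing the per-row breakdown further below gives 8,128 distinct rows, which is about 9 percent of our data. That's still a lot. I can't just delete these records straight away.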
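
To make the difference between flagged values and flagged rows concrete, here is a small sketch on made-up data. `np.where(z > threshold)` returns one `(row, column)` pair per outlying value, so a row with outliers in two columns shows up twice. The z-scores are computed with plain NumPy using the population standard deviation, which is what `stats.zscore` uses by default.

```python
import numpy as np

# Toy matrix: 20 rows, 3 columns, all zeros except three extreme values
X = np.zeros((20, 3))
X[0, 0] = 100.0  # row 0 is extreme in column 0 ...
X[0, 1] = 100.0  # ... and in column 1
X[5, 2] = 100.0  # row 5 is extreme in column 2

# Column-wise z-scores (population standard deviation, ddof=0)
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))

rows, cols = np.where(z > 3)
n_outlier_values = len(rows)           # one entry per outlying value
n_outlier_rows = len(np.unique(rows))  # distinct rows with at least one outlier

print(n_outlier_values, n_outlier_rows)
```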

Also, since this is an imbalanced dataset, I want to see how many zeros we would lose if we simply dropped the outliers. Let's check how many of these outliers are labeled as ones and how many as zeros.

In [9]:
# calculate the percentage of the dataset's zeros that fall in outlier rows
num_zeros_in_dataset = (1-phone_shake_data['label'].mean())*phone_shake_data.shape[0]
num_zeros_in_outliers = len(set(rows)) - phone_shake_data.values[list(set(rows)), -1].sum()
print("Percentage of zeros in the outliers: ", str(100*round(num_zeros_in_outliers/num_zeros_in_dataset, 2))[:4])

Percentage of zeros in the outliers:  72.0


This is a large percentage of our zeros. If we were to drop the outliers, we would lose very valuable data. In other words, these outliers have become even more useful to us.

In fact, I am going to treat them as valuable data points rather than outliers. To do that I am adding a flag column. If a row has even one value that is an outlier in its column, I am going to flag it. 

But before I do that, there is one more thing I want to look at. Many of the flagged rows have outliers in multiple variables. I want to see if there is a relation between the number of outliers in a row and the labels.

In [10]:
from collections import Counter
cntr = Counter(rows)
rev_cntr = {}
for k in dict(cntr):
    if cntr[k] in rev_cntr:
        rev_cntr[cntr[k]].append(k)
    else:
        rev_cntr[cntr[k]] = [k]

rev_cntr[4][:5]

[0, 2, 44, 46, 65]

The `cntr` dictionary tells me how many outliers each row has. `rev_cntr` is a reverse dictionary that tells me which row indexes have `k` outliers. For example, `rev_cntr[4]` stores the values `[0,2,44,46,...]`, which means the rows with index `0`, `2`, etc. have four outliers each. The same goes for `rev_cntr[3]`, and so on.

In [11]:
import collections
rev_cntr = collections.OrderedDict(sorted(rev_cntr.items()))
print('Num outliers in a row', '    Num rows having that number of outliers', '      Percentage of ones')
for each_key in rev_cntr:
    er = phone_shake_data.loc[rev_cntr[each_key], 'label'].mean()
    print("{0: <25} {1: <45} {2: <25}".format(each_key, len(rev_cntr[each_key]), 100*er))

Num outliers in a row     Num rows having that number of outliers       Percentage of ones
1                         3760                                          86.99468085106383        
2                         2183                                          78.74484654145671        
3                         1094                                          70.018281535649          
4                         600                                           57.666666666666664       
5                         339                                           43.657817109144545       
6                         126                                           27.77777777777778        
7                         23                                            4.3478260869565215       
8                         3                                             0.0                      


The population drops as the number of outliers in a row increases. But the percentage of ones also starts dropping heavily with each additional outlier. So instead of simply adding a binary flag, I am going to use this count information.

In [12]:
phone_shake_data.loc[:, 'outlier_counts'] = 0
for each_key in rev_cntr:
    phone_shake_data.loc[rev_cntr[each_key], 'outlier_counts'] = each_key
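
The same per-row count could also be computed without the intermediate dictionaries. The sketch below uses `np.bincount` on a made-up `rows` array. It assumes the frame has a default `0..n-1` index, so that the positions returned by `np.where` line up with the row labels.

```python
import numpy as np

# Hypothetical output of np.where(z > threshold): one entry per outlying value
rows = np.array([0, 0, 5, 0, 7, 5])
n_rows = 10  # number of rows in the (toy) dataset

# Count how often each row position appears; rows without outliers get 0
outlier_counts = np.bincount(rows, minlength=n_rows)
print(outlier_counts.tolist())
```

Assigning this array to a new column gives the same feature in a single step.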

#### Taking aggregates by `sensor_type`

In [13]:
# groupwise counts. Just to see how much data we got from each sensor.
phone_shake_data['sensor_type'].value_counts()

p    60660
m    21279
a     4995
Name: sensor_type, dtype: int64

In [14]:
# check groupwise event rate
phone_shake_data.groupby('sensor_type')['label'].mean()

sensor_type
a    0.986186
m    0.945862
p    0.978124
Name: label, dtype: float64

All the sensor types have an imbalanced distribution of zeros and ones.

#### Checking for high correlation within the data

In [15]:
# check correlations. Since the features are angular velocities and acc in perpendicular directions, I don't expect to see high correlation between features
phone_shake_data.corr()


feature,acceleration_x(g),acceleration_y(g),acceleration_z(g),roll(rad),pitch(rad),yaw(rad),angular_velocity_x(rad/sec),angular_velocity_y(rad/sec),angular_velocity_z(rad/sec),label,outlier_counts
acceleration_x(g),1.0,-0.05214,-0.016647,-0.031733,-0.069623,0.038151,-0.06905,-0.06905,0.110037,0.068899,-0.158333
acceleration_y(g),-0.05214,1.0,0.018931,0.115318,0.219196,-0.044758,-0.093788,-0.093788,0.058232,-0.286008,0.465782
acceleration_z(g),-0.016647,0.018931,1.0,-0.006351,-0.018357,0.017445,-0.108362,-0.108362,0.007924,0.060044,-0.049836
roll(rad),-0.031733,0.115318,-0.006351,1.0,-0.250398,-0.074876,-0.000929,-0.000929,-0.013593,-0.055336,0.097269
pitch(rad),-0.069623,0.219196,-0.018357,-0.250398,1.0,0.095816,0.008207,0.008207,-0.015275,-0.06018,0.027842
yaw(rad),0.038151,-0.044758,0.017445,-0.074876,0.095816,1.0,-0.000327,-0.000327,-0.008461,0.074939,-0.105644
angular_velocity_x(rad/sec),-0.06905,-0.093788,-0.108362,-0.000929,0.008207,-0.000327,1.0,1.0,-0.137175,0.017399,0.000717
angular_velocity_y(rad/sec),-0.06905,-0.093788,-0.108362,-0.000929,0.008207,-0.000327,1.0,1.0,-0.137175,0.017399,0.000717
angular_velocity_z(rad/sec),0.110037,0.058232,0.007924,-0.013593,-0.015275,-0.008461,-0.137175,-0.137175,1.0,-0.037306,0.004935
label,0.068899,-0.286008,0.060044,-0.055336,-0.06018,0.074939,0.017399,0.017399,-0.037306,1.0,-0.446593


This is a little strange. I didn't expect to see any significant correlation, but `angular_velocity_x(rad/sec)` and `angular_velocity_y(rad/sec)` have a correlation of 1.

In [16]:
sum(phone_shake_data['angular_velocity_x(rad/sec)'] - phone_shake_data['angular_velocity_y(rad/sec)'])

0.0

So the two columns appear to be identical. I checked the original files, and the columns are the same there too. I am going to drop `angular_velocity_y(rad/sec)` from the dataset.
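
A sum of differences equal to zero does not by itself prove that two columns are identical, because positive and negative differences can cancel. A stricter check, sketched here on toy series, compares the values element by element.

```python
import pandas as pd

# Toy series: different values whose differences cancel out
a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([2.0, 1.0, 3.0])

diff_sum = (a - b).sum()           # 0.0, even though the series differ
identical = a.equals(b)            # element-wise equality check
n_different = int((a != b).sum())  # number of positions that differ

print(diff_sum, identical, n_different)
```

Applied to the two angular velocity columns, `equals` returning `True` would confirm the duplicate.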

In [17]:
phone_shake_data.drop('angular_velocity_y(rad/sec)', axis=1, inplace=True)

#### Comparing mean and distribution of the numeric features by `label`

Now let's check how the mean of each variable varies across the ones and zeros.

In [18]:
phone_shake_data.groupby('label').mean()

label,acceleration_x(g),acceleration_y(g),acceleration_z(g),roll(rad),pitch(rad),yaw(rad),angular_velocity_x(rad/sec),angular_velocity_z(rad/sec),outlier_counts
0.0,-0.195617,0.552796,-0.082712,0.101988,0.078071,-0.156965,-0.096869,0.349567,2.0
1.0,-0.031582,0.022265,0.012478,-0.067346,-0.073827,0.680466,0.007036,-0.010356,0.134477


By the looks of it, many variables have very different means across the labels. We can perform a t-test to check whether the means of the variables in the two groups are the same.

The t-test assumes homogeneity of variance and normality of the distributions. Let's check if these assumptions hold.

In [19]:
phone_shake_data.groupby('label').std()

label,acceleration_x(g),acceleration_y(g),acceleration_z(g),roll(rad),pitch(rad),yaw(rad),angular_velocity_x(rad/sec),angular_velocity_z(rad/sec),outlier_counts
0.0,1.503442,1.170089,0.972815,1.134731,0.76265,1.799946,3.626207,6.179145,1.825742
1.0,0.311645,0.226445,0.211744,0.484516,0.410482,1.881964,0.805034,1.253804,0.555718


The variances look very different. We could apply an F-test to see if the variances are equal. But before that, let's see if the distributions are normal.
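
For reference, here is a sketch of an equal-variance check on synthetic data. It uses Levene's test (`scipy.stats.levene`) rather than the F-test mentioned above, because Levene's test is less sensitive to departures from normality. The two samples are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
narrow = rng.normal(loc=0.0, scale=1.0, size=500)  # small spread
wide = rng.normal(loc=0.0, scale=5.0, size=500)    # large spread

# Null hypothesis: both samples come from populations with equal variances
stat, p_value = stats.levene(narrow, wide)
print(f"Levene statistic = {stat:.2f}, p-value = {p_value:.3g}")
```

A small p-value suggests the variances differ, which is what the group-wise standard deviations above hint at.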

In [20]:
from scipy import stats
normality_tests = pd.DataFrame([col, stats.normaltest(phone_shake_data[col])] for col in phone_shake_data if col not in ['label', 'sensor_type'])
normality_tests

,feature,"normaltest (statistic, p-value)"
0,acceleration_x(g),"(48331.53242501518, 0.0)"
1,acceleration_y(g),"(99262.97243275064, 0.0)"
2,acceleration_z(g),"(79202.278579982, 0.0)"
3,roll(rad),"(14021.489936568762, 0.0)"
4,pitch(rad),"(7041.943555582621, 0.0)"
5,yaw(rad),"(40489.37398100806, 0.0)"
6,angular_velocity_x(rad/sec),"(38884.515952611364, 0.0)"
7,angular_velocity_z(rad/sec),"(41179.290053697805, 0.0)"
8,outlier_counts,"(81223.15804426529, 0.0)"


Using a significance level of 0.05, we have to reject the null hypothesis of normality for all variables. Together with the unequal variances, this means our data violates two important assumptions of the t-test.

Since these assumptions are violated, let's apply the Mann-Whitney U test. It is a non-parametric test and is useful when the t-test assumptions are violated. From this test we get the p-values under the null hypothesis that both groups come from identical distributions.
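
As a small illustration of how the test behaves, the sketch below runs `scipy.stats.mannwhitneyu` on two synthetic samples with shifted locations. The samples are made up, and `alternative="two-sided"` is passed explicitly because the default has changed between SciPy versions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=300)
group_b = rng.normal(loc=1.0, scale=1.0, size=300)  # shifted by one unit

# Null hypothesis: both samples come from the same distribution
stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat:.1f}, p-value = {p_value:.3g}")
```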

In [21]:
mannWhitneyU_tests = []
zero_index = phone_shake_data.loc[phone_shake_data['label']==0,:].index
ones_index = phone_shake_data.loc[phone_shake_data['label']==1,:].index

# perform the test and get p-values
for col in phone_shake_data:
    if col not in ['label', 'sensor_type']:
        mannWhitneyU_tests.append( stats.mannwhitneyu(phone_shake_data.loc[zero_index, col], phone_shake_data.loc[ones_index, col])[1])

# display results
pd.DataFrame([[col for col in phone_shake_data if col not in ['label', 'sensor_type']], mannWhitneyU_tests]).transpose()

,feature,p-value
0,acceleration_x(g),1.21402e-17
1,acceleration_y(g),6.10027e-193
2,acceleration_z(g),0.0672456
3,roll(rad),2.13668e-10
4,pitch(rad),3.5927100000000005e-55
5,yaw(rad),2.2330900000000002e-114
6,angular_velocity_x(rad/sec),0.175571
7,angular_velocity_z(rad/sec),1.0895e-05
8,outlier_counts,0.0


Looking at the p-values, the Mann-Whitney U test suggests that the null hypothesis (both groups are from identical distributions) can be rejected at the 0.05 level for all variables except `acceleration_z(g)` and `angular_velocity_x(rad/sec)`. These two may not be very important features.

This suggests that despite the imbalance, we may have strong predictors in the data. Tree-based models might do well in these cases. Let's take a look at the distributions of these variables across the two labels.

In [22]:
from matplotlib import pyplot as plt
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
i = 1
plt.figure(figsize=(20, 16), dpi= 80, facecolor='w', edgecolor='k')
for num_col in numeric_cols:    
    plt.subplot(9, 2, i)  # grid of 9 rows and 2 columns; odd positions (left column) show the zeros
    plt.hist(phone_shake_data.loc[zero_index, num_col], 200, alpha=0.75, color="b")
    plt.xlabel('zeros')
    plt.ylabel(num_col)
    i+=1
    plt.subplot(9, 2, i)  # even positions (right column) show the ones
    plt.hist(phone_shake_data.loc[ones_index, num_col], 200, alpha=0.75, color="g")
    plt.xlabel('ones')
    plt.ylabel(num_col)
    i+=1

plt.subplots_adjust(left=1.25, bottom=0.1, right=1.9, top=0.9, wspace=0.3, hspace=2.0)
plt.show()

<Figure size 1600x1280 with 18 Axes>

This is a useful plot for us. We know that zeros make up only about three percent of our dataset and ones about 97 percent. In spite of this, the distribution of ones is tightly centered, while the distribution of zeros is more spread out. Let's look at the standard deviations once again.

In [23]:
phone_shake_data.groupby('label').std()

label,acceleration_x(g),acceleration_y(g),acceleration_z(g),roll(rad),pitch(rad),yaw(rad),angular_velocity_x(rad/sec),angular_velocity_z(rad/sec),outlier_counts
0.0,1.503442,1.170089,0.972815,1.134731,0.76265,1.799946,3.626207,6.179145,1.825742
1.0,0.311645,0.226445,0.211744,0.484516,0.410482,1.881964,0.805034,1.253804,0.555718


Given that 97 percent of the data is ones, I was expecting the data for `label == 1` to be more spread out. But since `label == 1` implies the end of the shake motion, this distribution seems fine.

## Latent Structures

The next thing I want to look at is if there is a latent structure in the dataset. I'm going to use K-Means clustering and PCA for that. 

### K Means
The idea of using K-Means is to see if there is any latent structure. After selecting an appropriate number of clusters, I will add two sets of variables:
- the cluster number, to see the distribution of labels across clusters,
- the distance from each cluster center.
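
To show what these two feature sets look like, here is a minimal sketch on synthetic two-dimensional data (the blobs are made up). `KMeans.predict` gives the cluster number, and `KMeans.transform` gives the distance from each point to every cluster center, so the predicted cluster is simply the one with the smallest distance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

cluster_number = km.predict(X)  # one cluster id per row
distances = km.transform(X)     # shape (n_rows, n_clusters): distance to each center

print(distances.shape)
print((distances.argmin(axis=1) == cluster_number).all())
```

Because the cluster number is determined by the distances, the two feature sets carry overlapping information.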

In [24]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# fit the scaler that centers and scales the features
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
ss = StandardScaler(with_std=True, with_mean=True)
ss.fit(phone_shake_data.loc[:, numeric_cols])

# center scale the data
phone_shake_data_scaled = pd.DataFrame(ss.transform(phone_shake_data.loc[:, numeric_cols]))

# perform clustering for different cluster numbers
wsse_collect = []
k_min = 2
k_max = 15
for i in range(k_min, k_max):
    km = KMeans(random_state=42, max_iter=200, n_clusters=i)
    _ = km.fit(phone_shake_data_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=round(wsse,200)))
    wsse_collect.append(wsse)
    


For k = 002 WSSE is 682052.276029
For k = 003 WSSE is 607928.333134
For k = 004 WSSE is 557237.514108
For k = 005 WSSE is 514085.021047
For k = 006 WSSE is 479132.389410
For k = 007 WSSE is 451596.773563
For k = 008 WSSE is 432165.837520
For k = 009 WSSE is 414676.748351
For k = 010 WSSE is 397700.877929
For k = 011 WSSE is 383695.122162
For k = 012 WSSE is 371624.462840
For k = 013 WSSE is 361038.971355
For k = 014 WSSE is 352178.435402


In [None]:
# plot the within cluster sum of square 
plt.style.use('seaborn-deep')
plt.plot(range(k_min, k_max), wsse_collect, 'b',alpha=.55)
plt.plot(7, wsse_collect[7 - k_min], 'r.', alpha=.95, ms=10)  # mark k = 7; wsse_collect[0] corresponds to k = k_min
plt.ylabel("Within cluster SSE")
plt.xlabel("Number of clusters")
plt.show()

I am going to select 7 as the optimal number of clusters here.

In [26]:
# train the K Means model with 7 clusters
n_clusters = 7
km = KMeans(random_state=42, max_iter=200, n_clusters=n_clusters)
_ = km.fit(phone_shake_data_scaled)

In [27]:
# add the clustering features
phone_shake_data.loc[:,"cluster_number"] = km.predict(phone_shake_data_scaled)
distance_from_cluster = km.transform(phone_shake_data_scaled)
distance_from_cluster = pd.DataFrame(distance_from_cluster, columns=["distance_from_cluster_"+str(i) for i in range(0, n_clusters)])

# merge the data
phone_shake_data = phone_shake_data.merge(distance_from_cluster, left_index=True, right_index=True)

#### Cluster Analysis

Now that we have done our clustering, let's see what kind of results we got from that exercise. I am going to look at
- the cluster-wise mean of our target variable, `label`
- the number of rows in each cluster
- the mean of the variables across the clusters

In [28]:
# check the cluster wise event rate
phone_shake_data.groupby('cluster_number')['label'].mean()

cluster_number
0    0.983183
1    0.994196
2    0.797757
3    0.954916
4    0.891322
5    0.496324
6    0.618009
Name: label, dtype: float64

Two groups of clusters form here. Clusters 2, 4, 5 and 6 have the lowest event rates, while clusters 0, 1 and 3 have the highest. Let's check what percentage of the total data lies in each group.

In [29]:
100*phone_shake_data.groupby('cluster_number')['label'].count()/phone_shake_data.shape[0]

cluster_number
0    25.650493
1    59.065498
2     2.872294
3     7.475786
4     2.783721
5     0.938643
6     1.213564
Name: label, dtype: float64

Clusters 0 and 1 alone account for about 85 percent of our data, and about 92 percent together with cluster 3. This looks useful. We'll see how it does in the models. We have also added the cluster distances to our dataset. Let's see how they relate to the labels.

In [30]:
phone_shake_data.groupby('label')[["distance_from_cluster_"+str(i) for i in range(0, n_clusters)]].mean()

label,distance_from_cluster_0,distance_from_cluster_1,distance_from_cluster_2,distance_from_cluster_3,distance_from_cluster_4,distance_from_cluster_5,distance_from_cluster_6
0.0,8.263061,8.32152,8.779548,8.711568,8.848802,10.798159,10.01469
1.0,2.349495,1.776491,5.444479,3.474924,4.27316,9.12589,7.869939


The mean distances to several clusters look quite different across the two labels. I am not sure if there is a statistical test that can be used in this scenario to see whether these means are significantly different.

### PCA
The second part of our attempt to find latent structure revolves around PCA. The idea is simple: train a PCA model and see whether the resulting components are good indicators.

The problem with this approach is that PCA is built on the Pearson correlation (or covariance) between variables and so captures only linear relationships. Since many of the given features are components along independent directions, such as `acceleration_x(g)` and `acceleration_y(g)`, we cannot rely on this method.

### Checking for multicollinearity in the newly added features

We've added a handful of features. The distance features that we picked up from K-Means often show high multicollinearity, which is expected. So in each step I will:
- calculate the VIF for each variable,
- delete the variable with the highest VIF score (provided it is above 5),
- repeat until all remaining variables have a VIF under 5.
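
The steps above can be sketched as a loop. This is a hand-rolled version for intuition, not the code used in the cells that follow: the helper names `vif` and `drop_high_vif` are hypothetical, the VIF is computed with plain NumPy as `1 / (1 - R^2)` from regressing each column on the others with an intercept, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def vif(df):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the column with the highest VIF until all VIFs are at or below the threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        scores = vif(df)
        worst = scores.idxmax()
        if scores[worst] <= threshold:
            break
        df = df.drop(columns=worst)
    return df

# Toy data: c is almost exactly a + b, so the three columns are collinear
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.05, size=500)
features = pd.DataFrame({"a": a, "b": b, "c": c})

reduced = drop_high_vif(features)
print(vif(features))
print(reduced.columns.tolist())
```

In the toy example one of the three collinear columns is removed, after which the remaining VIFs fall below 5.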

In [31]:
# check for multicollinearity using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# add a constant because statsmodels' VIF doesn't add a constant
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          466.365006
acceleration_x(g)                7.481648
acceleration_y(g)                2.982452
acceleration_z(g)                4.321960
roll(rad)                       10.522018
pitch(rad)                       6.800106
yaw(rad)                         7.856432
angular_velocity_x(rad/sec)      6.338166
angular_velocity_z(rad/sec)      4.039489
outlier_counts                  10.532517
cluster_number                   4.881399
distance_from_cluster_0         49.817396
distance_from_cluster_1         31.816849
distance_from_cluster_2         40.252507
distance_from_cluster_3         46.977570
distance_from_cluster_4         64.163915
distance_from_cluster_5         12.409318
distance_from_cluster_6         21.722797
dtype: float64

In [32]:
phone_shake_data.drop('distance_from_cluster_4', axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          408.913551
acceleration_x(g)                7.479693
acceleration_y(g)                1.898464
acceleration_z(g)                4.100130
roll(rad)                        2.077954
pitch(rad)                       6.249977
yaw(rad)                         7.539152
angular_velocity_x(rad/sec)      6.137281
angular_velocity_z(rad/sec)      3.939166
outlier_counts                   9.424188
cluster_number                   4.827463
distance_from_cluster_0         41.902290
distance_from_cluster_1         30.328149
distance_from_cluster_2         39.676601
distance_from_cluster_3         46.737994
distance_from_cluster_5         11.329096
distance_from_cluster_6         18.715387
dtype: float64

In [33]:
phone_shake_data.drop('distance_from_cluster_3', axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          408.912701
acceleration_x(g)                7.462798
acceleration_y(g)                1.815058
acceleration_z(g)                4.096640
roll(rad)                        1.339759
pitch(rad)                       1.851603
yaw(rad)                         7.147379
angular_velocity_x(rad/sec)      6.080281
angular_velocity_z(rad/sec)      3.933524
outlier_counts                   6.777659
cluster_number                   4.185274
distance_from_cluster_0         35.763112
distance_from_cluster_1         26.478600
distance_from_cluster_2         39.402150
distance_from_cluster_5         11.296718
distance_from_cluster_6         18.319135
dtype: float64

In [34]:
phone_shake_data.drop('distance_from_cluster_2', axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          346.679609
acceleration_x(g)                1.383273
acceleration_y(g)                1.735822
acceleration_z(g)                1.170836
roll(rad)                        1.338475
pitch(rad)                       1.847640
yaw(rad)                         7.141277
angular_velocity_x(rad/sec)      5.911242
angular_velocity_z(rad/sec)      3.926054
outlier_counts                   5.175560
cluster_number                   4.147186
distance_from_cluster_0         35.286702
distance_from_cluster_1         25.489568
distance_from_cluster_5          8.900189
distance_from_cluster_6         12.786482
dtype: float64

In [35]:
phone_shake_data.drop('distance_from_cluster_0', axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          282.933972
acceleration_x(g)                1.365834
acceleration_y(g)                1.633716
acceleration_z(g)                1.170793
roll(rad)                        1.227058
pitch(rad)                       1.847314
yaw(rad)                         2.791875
angular_velocity_x(rad/sec)      5.898146
angular_velocity_z(rad/sec)      3.923587
outlier_counts                   4.382144
cluster_number                   3.356181
distance_from_cluster_1         11.457052
distance_from_cluster_5          8.735630
distance_from_cluster_6         12.123101
dtype: float64

In [36]:
phone_shake_data.drop(['distance_from_cluster_6'], axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          264.485330
acceleration_x(g)                1.183225
acceleration_y(g)                1.620645
acceleration_z(g)                1.099380
roll(rad)                        1.222139
pitch(rad)                       1.763706
yaw(rad)                         2.651004
angular_velocity_x(rad/sec)      1.407068
angular_velocity_z(rad/sec)      1.300819
outlier_counts                   4.381426
cluster_number                   3.318327
distance_from_cluster_1         10.169866
distance_from_cluster_5          3.445686
dtype: float64

In [37]:
phone_shake_data.drop(['distance_from_cluster_1'], axis=1, inplace=True)
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]
X = add_constant(phone_shake_data[numeric_cols])
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

const                          182.219911
acceleration_x(g)                1.182867
acceleration_y(g)                1.480223
acceleration_z(g)                1.084435
roll(rad)                        1.213595
pitch(rad)                       1.720393
yaw(rad)                         1.193296
angular_velocity_x(rad/sec)      1.215314
angular_velocity_z(rad/sec)      1.155933
outlier_counts                   2.686758
cluster_number                   2.680874
distance_from_cluster_5          2.097570
dtype: float64

### Adding acceleration and angular velocity vectors.

Since we have the x, y, z components of acceleration and angular velocity, I am going to combine them into vector magnitudes (the square root of the sum of the squared components) and see if the resulting features have any value.

In [38]:
phone_shake_data.loc[:, "acceleration"] = phone_shake_data.loc[:, "acceleration_x(g)"]**2 + phone_shake_data.loc[:, "acceleration_y(g)"]**2 + phone_shake_data.loc[:, "acceleration_z(g)"]**2 

phone_shake_data.loc[:, "acceleration"] = phone_shake_data.loc[:, "acceleration"]**0.5

In [39]:
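# angular_velocity_y(rad/sec) was dropped earlier because it was identical to angular_velocity_x(rad/sec),
# so counting the x component twice is equivalent to using the x and y components
phone_shake_data.loc[:, "angular_velocity"] = phone_shake_data.loc[:, "angular_velocity_x(rad/sec)"]**2 + phone_shake_data.loc[:, "angular_velocity_x(rad/sec)"]**2 + phone_shake_data.loc[:, "angular_velocity_z(rad/sec)"]**2 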

phone_shake_data.loc[:, "angular_velocity"] = phone_shake_data.loc[:, "angular_velocity"]**0.5
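
The same magnitude can be computed in one step with `np.linalg.norm`. The sketch below uses a toy frame with made-up values and the acceleration column names from this dataset.

```python
import numpy as np
import pandas as pd

# Toy readings: the first row is a 3-4-0 vector, so its magnitude is 5
toy = pd.DataFrame({
    "acceleration_x(g)": [3.0, 0.0],
    "acceleration_y(g)": [4.0, 0.0],
    "acceleration_z(g)": [0.0, 2.0],
})
acc_cols = ["acceleration_x(g)", "acceleration_y(g)", "acceleration_z(g)"]

# Row-wise Euclidean norm: sqrt(x^2 + y^2 + z^2)
toy["acceleration"] = np.linalg.norm(toy[acc_cols].to_numpy(), axis=1)
print(toy["acceleration"].tolist())
```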

## Modeling

I am going to build several classification models (listed below). Since we have an imbalanced dataset on our hands, I am going to:
- use _F1_ as our evaluation metric,
- use `class_weight='balanced'` wherever possible, so that the algorithm weights each class inversely proportional to its frequency in the input data.
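
To give a feel for what `class_weight='balanced'` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the label counts reported earlier (2548 zeros and 84386 ones).

```python
import numpy as np

# Label counts from the value_counts() output above: [zeros, ones]
counts = np.array([2548, 84386])
n_samples = counts.sum()
n_classes = len(counts)

# 'balanced' heuristic: weight each class inversely to its frequency
weights = n_samples / (n_classes * counts)
print(weights)  # roughly 17.06 for zeros and 0.52 for ones
```

With these weights each class contributes the same total weight, so the rare zeros are not drowned out by the ones.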

Also, I am using 5-fold cross-validation to report the evaluation metrics. There are many models to pick from, but I'll be sticking to these four:
- Logistic Regression
- Random Forest
- XGBoost
- CatBoost

### Logistic Regression

#### Logistic Regression with added features
I'll build two models here: one with the original variables only and one that also includes the added features. Let's see if our insights so far are correct.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# select numeric columns
numeric_cols = [i for i in phone_shake_data.columns if i not in ['sensor_type', 'label']]

# train logistic model
lr = LogisticRegression(class_weight="balanced", random_state=42, verbose=0, solver='lbfgs', max_iter=1000)
lr_scores = cross_val_score(lr, phone_shake_data[numeric_cols], phone_shake_data['label'], cv=5, scoring='f1')

print("Logistic Regression with added features F1: %0.2f (+/- %0.2f)" % (lr_scores.mean(), lr_scores.std() * 2))

Logistic Regression with added features F1: 0.92 (+/- 0.25)


#### Logistic Regression with the given variables only

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# initialize model
lr = LogisticRegression(class_weight="balanced", random_state=42, verbose=0, solver='lbfgs', max_iter=1000)
original_cols = ['acceleration_x(g)', 'acceleration_y(g)', 'acceleration_z(g)', 'roll(rad)', 'pitch(rad)', 'yaw(rad)', 'angular_velocity_x(rad/sec)', 'angular_velocity_z(rad/sec)']

# train with cross validation
lr_scores = cross_val_score(lr, phone_shake_data[original_cols], phone_shake_data['label'], cv=5, scoring='f1')
print("Logistic Regression with Original variables F1: %0.2f (+/- %0.2f)" % (lr_scores.mean(), lr_scores.std() * 2))

Logistic Regression with Original variables F1: 0.78 (+/- 0.37)


Adding the clustering features has not only improved the average score (0.92 vs. 0.78), it has also made the model more stable across folds (+/- 0.25 vs. +/- 0.37). I am not going to repeat this comparison for the rest of the models; I just wanted to check whether the variables I created add any value.
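
For reference, here is a small sketch of how the `mean (+/- 2 * std)` figures printed above are formed. The fold scores are hypothetical (the notebook does not print its per-fold values), and `summarize` is a made-up helper. It uses the population standard deviation, which is what NumPy's `.std()` computes by default.

```python
from statistics import mean, pstdev


def summarize(fold_scores):
    """Return (mean, 2 * population std), mirroring scores.mean() and scores.std() * 2."""
    return mean(fold_scores), 2 * pstdev(fold_scores)


# Hypothetical per-fold F1 values, NOT the notebook's actual folds
fold_scores = [0.99, 0.98, 0.97, 0.96, 0.70]

avg, spread = summarize(fold_scores)
print("F1: %0.2f (+/- %0.2f)" % (avg, spread))
```

A single weak fold is enough to produce a wide spread, and a band such as 0.92 +/- 0.25 extends past 1.0, so the number is best read as a rough indicator of fold-to-fold variation rather than as a confidence interval.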

#### Summary of Logistic Regression model

Since _sklearn_ doesn't provide p-values and other statistical measurements for regression models, I am using the _statsmodels_ API to get the model summary.

In [42]:
import statsmodels.api as sm
logit = sm.Logit(phone_shake_data['label'], sm.add_constant(phone_shake_data[numeric_cols]))
logit_fit = logit.fit()

Optimization terminated successfully.
         Current function value: 0.080029
         Iterations 9


In [43]:
logit_fit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | label | No. Observations: | 86934 |
| Model: | Logit | Df Residuals: | 86920 |
| Method: | MLE | Df Model: | 13 |
| Date: | Mon, 01 Jun 2020 | Pseudo R-squ.: | 0.3953 |
| Time: | 00:59:22 | Log-Likelihood: | -6957.2 |
| converged: | True | LL-Null: | -11504.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.0366 | 0.239 | 8.522 | 0.000 | 1.568 | 2.505 |
| acceleration_x(g) | -0.1168 | 0.031 | -3.794 | 0.000 | -0.177 | -0.056 |
| acceleration_y(g) | 0.6703 | 0.052 | 12.769 | 0.000 | 0.567 | 0.773 |
| acceleration_z(g) | 0.1860 | 0.047 | 3.954 | 0.000 | 0.094 | 0.278 |
| roll(rad) | 0.0280 | 0.028 | 0.984 | 0.325 | -0.028 | 0.084 |
| pitch(rad) | -0.4135 | 0.044 | -9.349 | 0.000 | -0.500 | -0.327 |
| yaw(rad) | 0.0959 | 0.014 | 6.859 | 0.000 | 0.068 | 0.123 |
| angular_velocity_x(rad/sec) | -0.1101 | 0.014 | -8.077 | 0.000 | -0.137 | -0.083 |
| angular_velocity_z(rad/sec) | 0.0740 | 0.008 | 9.022 | 0.000 | 0.058 | 0.090 |
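
One way to read the coefficients is to convert them to odds ratios. The sketch below is plain Python rather than a statsmodels call. It copies three coefficients and one confidence interval from the summary above and exponentiates them. Each ratio is the multiplicative change in the odds of `label == 1` for a one-unit increase in that feature with the others held fixed, in whatever units the feature has after the earlier preprocessing.

```python
import math

# Coefficients copied from the summary table above
coefs = {
    "acceleration_y(g)": 0.6703,
    "pitch(rad)": -0.4135,
    "roll(rad)": 0.0280,
}

# exp(coef) turns a change in log-odds into a multiplicative change in odds
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print("%-20s odds ratio %.3f" % (name, ratio))

# 95% interval for roll(rad), copied from the table and exponentiated
roll_low, roll_high = math.exp(-0.028), math.exp(0.084)
print("roll(rad) odds ratio interval: %.3f to %.3f" % (roll_low, roll_high))
```

The interval for `roll(rad)` straddles 1, which is the odds-ratio view of its p-value of 0.325: the data are consistent with roll having no effect in this model.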


### Random Forest

In [44]:
from sklearn.ensemble import RandomForestClassifier

# initialize model
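# max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn versions no longer accept 'auto'
rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=12, min_samples_split=3, min_samples_leaf=1, max_features='sqrt', min_impurity_decrease=10**-3, random_state=42, class_weight="balanced_subsample")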

# train with cross validation
rf_scores = cross_val_score(rf, phone_shake_data[numeric_cols], phone_shake_data['label'], cv=5, scoring='f1')
print("Random Forest F1: %0.2f (+/- %0.2f)" % (rf_scores.mean(), rf_scores.std() * 2))

Random Forest F1: 0.89 (+/- 0.30)


In [45]:
rf.fit(phone_shake_data[numeric_cols], phone_shake_data['label'])
rf_importances = pd.DataFrame([numeric_cols, rf.feature_importances_]).transpose()
rf_importances.columns = ['feature', 'rf_importance']
rf_importances = rf_importances.sort_values(by='rf_importance', ascending=False).reset_index().drop("index", axis=1)
rf_importances

| | feature | rf_importance |
|---|---|---|
| 0 | angular_velocity | 0.243208 |
| 1 | acceleration | 0.233795 |
| 2 | outlier_counts | 0.151643 |
| 3 | distance_from_cluster_5 | 0.0849706 |
| 4 | acceleration_y(g) | 0.0772178 |
| 5 | cluster_number | 0.0705915 |
| 6 | acceleration_x(g) | 0.0456815 |
| 7 | angular_velocity_z(rad/sec) | 0.0437702 |
| 8 | angular_velocity_x(rad/sec) | 0.02027 |
| 9 | yaw(rad) | 0.0101188 |


The variables I created seem to be doing well in the random forest model. Also, earlier in the analysis we estimated that `roll(rad)` might not be very useful, and that seems to hold here: it does not appear among the top ten features.

### XGBoost

In [46]:
import xgboost
from sklearn.model_selection import cross_val_score

# initialize model
xgb = xgboost.XGBClassifier(objective='binary:logistic', random_state=42, max_depth=12)

# train with cross validation
xgb_scores = cross_val_score(xgb, phone_shake_data[numeric_cols], phone_shake_data['label'], cv=5, scoring='f1')
print("XGBoost F1: %0.2f (+/- %0.2f)" % (xgb_scores.mean(), xgb_scores.std() * 2))

XGBoost F1: 0.94 (+/- 0.18)


In [47]:
xgb = xgboost.XGBClassifier(objective='binary:logistic', random_state=42, max_depth=12)
_ = xgb.fit(phone_shake_data[numeric_cols], phone_shake_data['label'])

In [48]:
xgb_importances = pd.DataFrame([numeric_cols, xgb.feature_importances_]).transpose()
xgb_importances.columns = ['feature', 'xgb_importance']
xgb_importances = xgb_importances.sort_values(by='xgb_importance', ascending=False).reset_index().drop("index", axis=1)
xgb_importances

| | feature | xgb_importance |
|---|---|---|
| 0 | acceleration | 0.356262 |
| 1 | angular_velocity | 0.13172 |
| 2 | cluster_number | 0.0601349 |
| 3 | yaw(rad) | 0.0537647 |
| 4 | pitch(rad) | 0.0520796 |
| 5 | roll(rad) | 0.0517437 |
| 6 | acceleration_x(g) | 0.0457271 |
| 7 | acceleration_z(g) | 0.0454941 |
| 8 | acceleration_y(g) | 0.0447633 |
| 9 | angular_velocity_z(rad/sec) | 0.0413676 |


### CatBoost

In [49]:
from catboost import CatBoostClassifier

# initialize model
cb = CatBoostClassifier(iterations=200, eval_metric="F1", random_seed=42, max_depth=12, logging_level='Silent')

# train with cross validation
cb_scores = cross_val_score(cb, phone_shake_data[numeric_cols], phone_shake_data['label'], cv=5, scoring='f1')
print("CatBoost F1: %0.2f (+/- %0.2f)" % (cb_scores.mean(), cb_scores.std() * 2))

CatBoost F1: 0.94 (+/- 0.18)


In [50]:
cb = CatBoostClassifier(iterations=100, eval_metric="F1", random_seed=42, max_depth=12)
cb.fit(phone_shake_data[numeric_cols], phone_shake_data['label'], logging_level='Silent')

<catboost.core.CatBoostClassifier at 0x1294fd550>

In [51]:
from sklearn.metrics import f1_score
cb_importances = pd.DataFrame([numeric_cols, cb.get_feature_importance()]).transpose()
cb_importances.columns = ['feature', 'cb_importance']
cb_importances = cb_importances.sort_values(by='cb_importance', ascending=False).reset_index().drop("index", axis=1)
cb_importances

| | feature | cb_importance |
|---|---|---|
| 0 | yaw(rad) | 24.108 |
| 1 | roll(rad) | 13.7007 |
| 2 | pitch(rad) | 11.592 |
| 3 | angular_velocity | 7.90542 |
| 4 | angular_velocity_z(rad/sec) | 6.7634 |
| 5 | acceleration_x(g) | 6.49436 |
| 6 | acceleration_z(g) | 6.3079 |
| 7 | angular_velocity_x(rad/sec) | 6.16798 |
| 8 | distance_from_cluster_5 | 5.2677 |
| 9 | acceleration_y(g) | 4.75807 |


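In CatBoost, the features I created have moved down the ranking: the raw orientation angles (`yaw(rad)`, `roll(rad)`, `pitch(rad)`) take the top three spots, `angular_velocity` is fourth, and `distance_from_cluster_5` is near the bottom of the top ten. 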
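
The three importance tables use different scales (the CatBoost values are much larger than the other two), so comparing positions is safer than comparing raw values. The sketch below copies the top-ten orderings from the tables above and reports where each engineered feature lands, using the same 0-based index as the tables. Which features count as "engineered" is my assumption based on their names, and `engineered_positions` is a made-up helper.

```python
# Top-10 orderings copied from the three importance tables above
rankings = {
    "random_forest": [
        "angular_velocity", "acceleration", "outlier_counts",
        "distance_from_cluster_5", "acceleration_y(g)", "cluster_number",
        "acceleration_x(g)", "angular_velocity_z(rad/sec)",
        "angular_velocity_x(rad/sec)", "yaw(rad)",
    ],
    "xgboost": [
        "acceleration", "angular_velocity", "cluster_number", "yaw(rad)",
        "pitch(rad)", "roll(rad)", "acceleration_x(g)", "acceleration_z(g)",
        "acceleration_y(g)", "angular_velocity_z(rad/sec)",
    ],
    "catboost": [
        "yaw(rad)", "roll(rad)", "pitch(rad)", "angular_velocity",
        "angular_velocity_z(rad/sec)", "acceleration_x(g)",
        "acceleration_z(g)", "angular_velocity_x(rad/sec)",
        "distance_from_cluster_5", "acceleration_y(g)",
    ],
}

# Assumption: these are the features created during feature engineering
engineered = {
    "acceleration", "angular_velocity", "outlier_counts",
    "cluster_number", "distance_from_cluster_5",
}


def engineered_positions(order):
    """Map each engineered feature in the top 10 to its 0-based table index."""
    return {name: i for i, name in enumerate(order) if name in engineered}


for model, order in rankings.items():
    print(model, engineered_positions(order))
```

A feature missing from a model's output simply ranks outside that model's top ten; the tables do not show how far outside.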

### TPOT Classifier

TPOT is an automated machine learning tool that uses genetic programming to search over model pipelines, which makes it different from our usual models. It does well but runs for a long time. 

In [52]:
#from tpot import TPOTClassifier
#tpot = TPOTClassifier(generations=1, population_size=50, verbosity=2, random_state=42)
#tpot.fit(phone_shake_data[numeric_cols], phone_shake_data['label'])

## train with cross validation
#tpot_scores = cross_val_score(tpot, phone_shake_data[numeric_cols], phone_shake_data['label'], cv=3, scoring='f1')
#print("TPOT F1: %0.2f (+/- %0.2f)" % (tpot_scores.mean(), tpot_scores.std() * 2))

## Balancing the dataset

We have seen the models do reasonably well on our imbalanced dataset. I am going to try and add some balance to it and see how the models respond to it. 

I usually like the minority class to be at least around 10-15 percent. So, in the resampled dataset, I would want the zeros to make up nearly 15 percent of the rows.

In [53]:
from imblearn.over_sampling import ADASYN as oversampler_
from sklearn.preprocessing import StandardScaler

# centre and scale the data
ss = StandardScaler()
ss.fit(phone_shake_data[numeric_cols])

# initialize and train the sampler
oversampler = oversampler_(random_state=42, sampling_strategy=0.15)
X_res, y_res = oversampler.fit_resample(ss.transform(phone_shake_data[numeric_cols]), phone_shake_data['label'])
X_res = ss.inverse_transform(X_res)




Let's check the share of ones in the resampled data

In [54]:
y_res.mean()

0.8700125780976142

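That's about 13 percent zeros rather than exactly 15, because `sampling_strategy` is the minority-to-majority ratio rather than the minority share of the whole dataset, but it's good enough to work with. In the next section, I'll retrain all the models we trained above and see if we get better results.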
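
The gap between the requested 0.15 and the observed share can be worked out by hand. This is a minimal sketch, assuming imbalanced-learn interprets a float `sampling_strategy` as the ratio of minority to majority samples after resampling; the helper names are hypothetical.

```python
def minority_share(ratio):
    """Share of the whole dataset taken by the minority class for a given ratio."""
    return ratio / (1.0 + ratio)


def ratio_for_share(share):
    """Minority-to-majority ratio needed to reach a target minority share."""
    return share / (1.0 - share)


observed_share = 1 - 0.8700125780976142  # zeros in y_res, from the output above
expected_share = minority_share(0.15)    # what a ratio of 0.15 implies
needed_ratio = ratio_for_share(0.15)     # ratio that would give a 15% share

print("observed share of zeros: %.4f" % observed_share)
print("share implied by ratio 0.15: %.4f" % expected_share)
print("ratio needed for a 15%% share: %.4f" % needed_ratio)
```

So a `sampling_strategy` of roughly 0.176 would be the value to try for a resampled dataset with about 15 percent zeros. ADASYN generates an approximate number of synthetic rows, which would explain the small remaining difference between the observed and implied shares.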

## Modeling after balancing the data


In [55]:
#################### LOGISTIC REGRESSION ####################
# initialize model
lr = LogisticRegression(class_weight="balanced", random_state=42, verbose=0, solver='lbfgs', max_iter=1000)
original_cols = ['acceleration_x(g)', 'acceleration_y(g)', 'acceleration_z(g)', 'roll(rad)', 'pitch(rad)', 'yaw(rad)', 'angular_velocity_x(rad/sec)', 'angular_velocity_z(rad/sec)']

# train with cross validation
lr_scores = cross_val_score(lr, X_res, y_res, cv=5, scoring='f1')
print("Logistic Regression F1 with balanced data : %0.2f (+/- %0.2f)" % (lr_scores.mean(), lr_scores.std() * 2))

Logistic Regression F1 with balanced data : 0.90 (+/- 0.27)


In [56]:
#################### RANDOM FOREST ########################
# initialize model
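# max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn versions no longer accept 'auto'
rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=12, min_samples_split=3, min_samples_leaf=1, max_features='sqrt', min_impurity_decrease=10**-3, random_state=42, class_weight="balanced_subsample")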

# train with cross validation
rf_scores = cross_val_score(rf, X_res, y_res, cv=5, scoring='f1')
print("Random Forest F1 with balanced data: %0.2f (+/- %0.2f)" % (rf_scores.mean(), rf_scores.std() * 2))

Random Forest F1 with balanced data: 0.85 (+/- 0.34)


In [57]:
#################### XGBOOST #############################
# initialize model
xgb = xgboost.XGBClassifier(objective='binary:logistic', random_state=42, max_depth=12)

# train with cross validation
xgb_scores = cross_val_score(xgb, X_res, y_res, cv=5, scoring='f1')
print("XGBoost F1 with balanced data : %0.2f (+/- %0.2f)" % (xgb_scores.mean(), xgb_scores.std() * 2))

XGBoost F1 with balanced data : 0.90 (+/- 0.23)


In [58]:
#################### CatBOOST ############################
# initialize model
cb = CatBoostClassifier(iterations=200, eval_metric="F1", random_seed=42, max_depth=12, logging_level='Silent')

# train with cross validation
cb_scores = cross_val_score(cb, X_res, y_res, cv=5, scoring='f1')
print("CatBoost F1 with balanced data : %0.2f (+/- %0.2f)" % (cb_scores.mean(), cb_scores.std() * 2))

CatBoost F1 with balanced data : 0.91 (+/- 0.24)


Across the experiments above, and after trying different values of `sampling_strategy` and `k_neighbors`, oversampling did not improve model performance. In fact, it slightly reduced it: every model's mean F1 dropped by 0.02 to 0.04. 

## Model Tuning

We have worked with four models above, all of which have performed decently. Now it's time to improve their performance by searching across a hyperparameter space for each model. 

I am using the `skopt` package and its Gaussian-process-based Bayesian optimization routine (`BayesSearchCV`) for this search.
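
Several of the search spaces below use `prior='log-uniform'`. As a rough illustration of what that means, the sketch below draws values uniformly in log space using only the standard library. It is a stand-in for the idea, not skopt's implementation, and `sample_log_uniform` is a made-up helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)

# Same bounds as the learning_rate ranges used below: four decades, 0.0001 to 1
draws = [sample_log_uniform(0.0001, 1, rng) for _ in range(10000)]

# Each decade gets about the same number of draws, so roughly half fall below 0.01
share_below_001 = sum(d < 0.01 for d in draws) / len(draws)
print("share of draws below 0.01: %.2f" % share_below_001)
```

A log-uniform prior matters most for ranges that span several orders of magnitude, such as `learning_rate`. For a narrow range such as 0.3 to 0.7 it behaves almost like a plain uniform prior.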

In [69]:
from skopt.space import Real, Integer
from skopt import BayesSearchCV

# logistic regression hyperparameter space
lr_space = {
    "C": Real(1, 100, prior='log-uniform'),
    "max_iter": Integer(300, 1000)  # max_iter must be an integer
}


# random forest hyperparameter space
rf_space = {
    "n_estimators": Integer(100, 200),
    "max_depth": Integer(6, 32),
    "min_samples_split": Integer(2, 100),  # an integer min_samples_split must be at least 2
    "min_samples_leaf": Integer(1, 200),
    "max_features": Real(0.3, 0.7, prior='log-uniform'),
    "max_samples": Real(0.3, 0.7, prior='log-uniform'),
}

# xgboost hyperparameter space
xgb_space = {
    "n_estimators": Integer(100, 200),
    "max_depth": Integer(6, 32),
    "gamma": Real(0.01, 1, prior='log-uniform'),
    "learning_rate": Real(0.0001, 1, prior='log-uniform'),
    "max_delta_step": Real(0.1, 10, prior='log-uniform'),
    "reg_alpha": Real(0.1, 100, prior='log-uniform'),
    "reg_lambda": Real(0.1, 100, prior='log-uniform'),
    "subsample": Real(0.3, 0.7, prior='log-uniform'),
}


# catboost hyperparameter space
cb_space = {
    "iterations": Integer(100, 200),
    "max_depth": Integer(6, 16),
    "learning_rate": Real(0.0001, 1, prior='log-uniform'),
    "l2_leaf_reg": Real(1, 10, prior='log-uniform'),
}

n_iter = 2
lr_opt = BayesSearchCV(
    LogisticRegression(random_state=42),
    lr_space,
    n_iter=n_iter,
    random_state=42,
    scoring='f1'
)

rf_opt = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    rf_space,
    n_iter=n_iter,
    random_state=42,
    scoring='f1'
)

xgb_opt = BayesSearchCV(
    xgboost.XGBClassifier(random_state=42),
    xgb_space,
    n_iter=n_iter,
    random_state=42,
    scoring='f1'
)

cb_opt = BayesSearchCV(
    CatBoostClassifier(random_seed=42, logging_level='Silent'),
    cb_space,
    n_iter=n_iter,
    random_state=42,
    scoring='f1'
)

# shuffle the data
phone_shake_data = phone_shake_data.sample(frac=1).reset_index(drop=True)

num_train_rows = int(0.7*phone_shake_data.shape[0])


In [60]:
# executes bayesian optimization
_ = lr_opt.fit(phone_shake_data.loc[0:num_train_rows,numeric_cols], phone_shake_data.loc[0:num_train_rows, 'label'])
print("Logistic Regression F1 after model tuning: ", lr_opt.score(phone_shake_data.loc[num_train_rows:,numeric_cols], phone_shake_data.loc[num_train_rows:, 'label']))



Logistic Regression F1 after model tuning:  0.9860765690786253


In [61]:
# executes bayesian optimization
_ = rf_opt.fit(phone_shake_data.loc[0:num_train_rows,numeric_cols], phone_shake_data.loc[0:num_train_rows, 'label'])
print("Random Forest F1 after model tuning: ", rf_opt.score(phone_shake_data.loc[num_train_rows:,numeric_cols], phone_shake_data.loc[num_train_rows:, 'label']))



Random Forest F1 after model tuning:  0.9867734687896846


In [62]:
# executes bayesian optimization
_ = xgb_opt.fit(phone_shake_data.loc[0:num_train_rows,numeric_cols], phone_shake_data.loc[0:num_train_rows, 'label'])
print("XGBoost F1 after model tuning: ", xgb_opt.score(phone_shake_data.loc[num_train_rows:,numeric_cols], phone_shake_data.loc[num_train_rows:, 'label']))



XGBoost F1 after model tuning:  0.9874440869496979


In [None]:
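# executes bayesian optimization
# NOTE: unlike the cells above, this fits on all rows rather than the first 70%,
# so the score in the next cell is measured on rows already seen during training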
_ = cb_opt.fit(phone_shake_data[numeric_cols], phone_shake_data['label'])

In [68]:
print("CatBoost F1 after model tuning: ", cb_opt.score(phone_shake_data.loc[num_train_rows:,numeric_cols], phone_shake_data.loc[num_train_rows:, 'label']))

CatBoost F1 after model tuning:  0.9952979598260835


## Conclusion

A few insights from the exercise:
- outliers in this scenario should not be discarded; they carry signal and need to be valued 
- the overall acceleration and angular velocity are useful features too
- oversampling the data using SMOTE, ADASYN, etc. doesn't make much of an impact
- CatBoost and XGBoost have been the better-performing models, with Logistic Regression coming in close
- model tuning with just a few iterations gives a large improvement in F1 scores, although the tuned scores come from a shuffled 70/30 hold-out split rather than the 5-fold cross-validation used earlier, so the two sets of numbers are not directly comparable
