## Introduction:
Robots are smart… by design. To fully understand and properly navigate a task, however, they need input about their environment.
In this competition, you’ll help robots recognize the floor surface they’re standing on using data collected from Inertial Measurement Units (IMU sensors).

## About Data: 
CareerCon has collected IMU sensor data while driving a small mobile robot over different floor surfaces on the university premises. 

## Objective:
The task is to predict which one of the nine floor types (e.g., carpet, tiles, concrete) the robot is on, using sensor data such as acceleration and velocity. Succeed and you'll help robots navigate many different surfaces without assistance, so they won’t fall down on the job.


In [57]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style 
style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')

import plotly.offline as py 
from plotly.offline import init_notebook_mode, iplot
py.init_notebook_mode(connected=True) # enable offline Plotly rendering inside the notebook
import plotly.graph_objs as go # Plotly's graph objects, roughly analogous to matplotlib's "plt"

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix
import gc


# Any results you write to the current directory are saved as output.

['X_train.csv', 'sample_submission.csv', 'X_test.csv', 'y_train.csv']


In [58]:
X_train = pd.read_csv('../input/X_train.csv')
X_train.head(3)

Unnamed: 0,row_id,series_id,measurement_number,orientation_X,orientation_Y,orientation_Z,orientation_W,angular_velocity_X,angular_velocity_Y,angular_velocity_Z,linear_acceleration_X,linear_acceleration_Y,linear_acceleration_Z
0,0_0,0,0,-0.75853,-0.63435,-0.10488,-0.10597,0.10765,0.017561,0.000767,-0.74857,2.103,-9.7532
1,0_1,0,1,-0.75853,-0.63434,-0.1049,-0.106,0.067851,0.029939,0.003385,0.33995,1.5064,-9.4128
2,0_2,0,2,-0.75853,-0.63435,-0.10492,-0.10597,0.007275,0.028934,-0.005978,-0.26429,1.5922,-8.7267


In [59]:
y_train = pd.read_csv('../input/y_train.csv')
y_train.head(3)

Unnamed: 0,series_id,group_id,surface
0,0,13,fine_concrete
1,1,31,concrete
2,2,20,concrete


In [None]:
X_test = pd.read_csv('../input/X_test.csv')
X_test.head(3)

# Descriptive Statistics

In [None]:
print('Size of Train Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(X_train.shape[0], X_train.shape[1]))

print('\nSize of Test Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(X_test.shape[0], X_test.shape[1]))

print('\nSize of Target Data')
print('Number of samples are: {0}\nNumber of features are: {1}'.format(y_train.shape[0], y_train.shape[1]))

## Train Data Description

In [None]:
X_train.describe()

## Target surface type and their sample count

In [None]:
target = y_train['surface'].value_counts().reset_index().rename(columns = {'index' : 'target'})
target

In [None]:
#sns.countplot(y='surface',data = y_train)
trace0 = go.Bar(
    x = y_train['surface'].value_counts().index,
    y = y_train['surface'].value_counts().values
    )

trace1 = go.Pie(
    labels = y_train['surface'].value_counts().index,
    values = y_train['surface'].value_counts().values,
    domain = {'x':[0.55,1]})

data = [trace0, trace1]
layout = go.Layout(
    title = 'Frequency Distribution for surface/target data',
    xaxis = dict(domain = [0,.50]))

fig = go.Figure(data = data, layout = layout)
py.iplot(fig)


## Preprocessing data

### Is there any missing data?

In [None]:
X_train.isnull().sum()

#### Observation: No missing data

### Is there any duplicate data?

In [None]:
X_train['is_duplicate'] = X_train.duplicated()
X_train['is_duplicate'].value_counts()

#### Observation: There is no duplicate data

In [None]:
X_train = X_train.drop(['is_duplicate'], axis = 1)

### Sorting based on series_id and measurement_number

In [None]:
X_train_sort = X_train.sort_values(by = ['series_id', 'measurement_number'], ascending = True)
X_train_sort.head()

### Correlation Matrix

In [None]:
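# Restrict to numeric columns: row_id is a string and cannot be correlated
corr = X_train.select_dtypes(include='number').corr()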
corr

In [None]:
fig, ax = plt.subplots(1,1, figsize = (15,6))

hm = sns.heatmap(X_train.iloc[:,3:].corr(),
                ax = ax,
                cmap = 'coolwarm',
                annot = True,
                fmt = '.2f',
                linewidths = 0.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Orientation, Angular_velocity and Linear_acceleration Correlation Heatmap for Train dataset', 
              fontsize=14, 
              fontweight='bold')

In [None]:
fig, ax = plt.subplots(1,1, figsize = (15,6))

hm = sns.heatmap(X_test.iloc[:,3:].corr(),
                ax = ax,
                cmap = 'coolwarm',
                annot = True,
                fmt = '.2f',
                linewidths = 0.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Orientation, Angular_velocity and Linear_acceleration Correlation Heatmap for Test dataset', 
              fontsize=14, 
              fontweight='bold')

**Observation:**
*     orientation_X and orientation_W are strongly correlated
*     orientation_Y and orientation_Z are strongly correlated
*     linear_acceleration_Y and linear_acceleration_Z are also positively correlated
*     angular_velocity_Y and angular_velocity_Z are negatively correlated
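
Reading strongly correlated pairs off a dense heatmap is error-prone, so it can help to rank them numerically. The following is a minimal sketch, not part of the original kernel: `top_correlated_pairs` is a hypothetical helper, and the toy frame stands in for `X_train.iloc[:, 3:]`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is listed once
    # and the trivial self-correlations on the diagonal are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Sort by absolute value, strongest first, but report the signed value.
    order = np.argsort(-pairs.abs().to_numpy())
    return pairs.iloc[order].head(n)

# Toy data (hypothetical): b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),
    'c': rng.normal(size=500),
})
top = top_correlated_pairs(toy, n=3)
print(top)
```

In the notebook this would presumably be called as `top_correlated_pairs(X_train.iloc[:, 3:])`, which keeps the sign of each correlation visible next to the pair.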

### Box plot of angular_velocity, orientation and linear_acceleration data

In [None]:
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(311)
ax.set_title('Distribution of Orientation_X,Y,Z,W',
             fontsize=14, 
             fontweight='bold')
X_train.iloc[:,3:7].boxplot()
ax = fig.add_subplot(312)
ax.set_title('Distribution of Angular_Velocity_X,Y,Z',fontsize=14, 
             fontweight='bold')
X_train.iloc[:,7:10].boxplot()
ax = fig.add_subplot(313)
ax.set_title('Distribution of linear_acceleration_X,Y,Z',fontsize=14, 
             fontweight='bold')
X_train.iloc[:,10:13].boxplot()

**Observation**: There are many outliers in the angular_velocity and linear_acceleration data
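
To put a number on "many outliers", one option is to apply the same 1.5 × IQR rule that box plot whiskers conventionally use. This is a sketch under that assumption; `iqr_outlier_fraction` is a hypothetical helper and the toy column is made up for illustration.

```python
import pandas as pd

def iqr_outlier_fraction(frame, k=1.5):
    """Fraction of values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons between a DataFrame and a Series align on column names.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.mean()

# Toy column (hypothetical): nine ordinary values and one extreme value.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})
fractions = iqr_outlier_fraction(toy)
print(fractions)
```

Applied to `X_train.iloc[:, 7:13]`, this would give one outlier fraction per angular_velocity and linear_acceleration column.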

### Histogram plot for all features

In [None]:
plt.figure(figsize=(26, 16))
for i, col in enumerate(X_train.columns[3:]):
    ax = plt.subplot(3, 4, i + 1)
    sns.distplot(X_train[col], bins=100, label='train')
    sns.distplot(X_test[col], bins=100, label='test')
    ax.legend()   

### Observation:
*    Angular velocity features are approximately normally distributed; in fact, their distributions are symmetrical.
*    linear_acceleration features are also approximately normal/symmetrical, but the average value of linear_acceleration_Z is negative (the sample rows above show values near -9, consistent with gravity).
*    X,Y,Z,W orientation data are neither symmetrical nor bell-shaped.
*         X,Y orientation data are unevenly distributed between -1 and 1.
*         Z,W orientation data are unevenly distributed between -1.5 and 1.5.

Since the orientation data is not normally distributed, a transformation such as a logarithm may improve the results. A plain log is undefined for the negative values present in these columns, so a shift or a sign-preserving variant would be needed.
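
As a rough illustration of what a log-style transform could look like here: orientation components take negative values (for example -0.75853 in the first rows shown earlier), so this sketch swaps in a sign-preserving `log1p` in place of a plain log. `signed_log1p` is a hypothetical helper, and whether the transform actually helps the model is untested.

```python
import numpy as np

def signed_log1p(values):
    """Sign-preserving log transform: sign(x) * log(1 + |x|)."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))

# A few illustrative values, including negatives and zero.
sample = np.array([-0.75853, -0.10488, 0.0, 0.5])
transformed = signed_log1p(sample)
print(transformed)
```

The transform is monotonic and maps zero to zero, so the ordering of the measurements is preserved while large magnitudes are compressed.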

### Feature distribution for each target value (surface)

In [None]:
df = X_train.merge(y_train, on = 'series_id', how = 'inner')
targets = (y_train['surface'].value_counts()).index

In [None]:
df.head(3)

In [None]:
plt.figure(figsize=(26, 16))
for i,col in enumerate(df.columns[3:13]):
    ax = plt.subplot(3,4,i+1)
    ax = plt.title(col)
    for surface in targets:
        surface_feature = df[df['surface'] == surface]
        sns.kdeplot(surface_feature[col], label = surface)

**Observation:**

*     For the hard tile surface, we can see a small jerk in the orientation data:
*     for orientation_X, the range is approx. 0.5 to 1.0,
*     for orientation_Y, the range is approx. -1.0 to -0.5,
*     for orientation_Z, the range is approx. -0.12 to -0.8,
*     for orientation_W, the range is approx. 0.07 to 0.12.
*     For the angular velocity and linear acceleration data, the distributions are symmetric around the mean.
    

## Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
It is fundamental to the application of machine learning, and is both difficult and expensive.
The features in your data are important to the predictive models you use and will influence the results you achieve. The quality and quantity of the features have a great influence on whether the model is good or not.

### Fast Fourier Transform Denoising

In [None]:
series_dict = {}
for series in (X_train['series_id'].unique()):
    series_dict[series] = X_train[X_train['series_id'] == series] 

In [None]:
# From: Code Snippet For Visualizing Series Id by @shaz13
def plotSeries(series_id):
    style.use('ggplot')
    plt.figure(figsize=(28, 16))
    print(y_train[y_train['series_id'] == series_id]['surface'].values[0].title())
    for i, col in enumerate(series_dict[series_id].columns[3:]):
        if col.startswith("o"):
            color = 'red'
        elif col.startswith("a"):
            color = 'green'
        else:
            color = 'blue'
        if i >= 7:
            i+=1
        plt.subplot(3, 4, i + 1)
        plt.plot(series_dict[series_id][col], color=color, linewidth=3)
        plt.title(col)

In [None]:
plotSeries(1)

If for whatever reason you want to denoise the signal, you can use fast fourier transform. Detailed implementation of how it's done is out of the scope of this kernel. You can learn more about it here: https://en.wikipedia.org/wiki/Fast_Fourier_transform
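
To make the idea concrete before the kernel's own helper, here is a minimal sketch of FFT low-pass denoising on a synthetic signal. Everything in it is an assumption for illustration: `lowpass_fft` is a hypothetical function, and the sampling interval and cutoff were picked so the result is easy to check, not taken from the robot data.

```python
import numpy as np

def lowpass_fft(signal, sample_spacing, cutoff_hz):
    """Zero every frequency component above cutoff_hz and transform back."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=sample_spacing)
    spectrum[freqs > cutoff_hz] = 0
    # Pass n so the output has the same length as the input.
    return np.fft.irfft(spectrum, n=signal.size)

# Synthetic example: 128 samples over one second (hypothetical sampling).
t = np.arange(128) / 128.0
clean = np.sin(2 * np.pi * 2 * t)                  # slow 2 Hz component
noisy = clean + 0.3 * np.sin(2 * np.pi * 40 * t)   # plus fast 40 Hz "noise"
denoised = lowpass_fft(noisy, sample_spacing=1 / 128.0, cutoff_hz=10)

print(np.max(np.abs(denoised - clean)))
```

Because both sine waves fall exactly on FFT bins here, removing everything above 10 Hz recovers the slow component almost perfectly. Real sensor noise is spread across many frequencies, so the cutoff is a trade-off between smoothness and lost detail.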

In [None]:
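# from @theoviel at https://www.kaggle.com/theoviel/fast-fourier-transform-denoising
from numpy.fft import rfft, rfftfreq, irfft

def filter_signal(signal, threshold=1e3):
    # Low-pass filter: zero every frequency component above `threshold`
    fourier = rfft(signal)
    frequencies = rfftfreq(signal.size, d=20e-3/signal.size)
    fourier[frequencies > threshold] = 0
    # Pass n so the output length always matches the input length
    return irfft(fourier, n=signal.size)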

In [None]:
# denoise train and test angular_velocity and linear_acceleration data
X_train_denoised = X_train.copy()
X_test_denoised = X_test.copy()

Suppose we want to denoise the signal in the angular_velocity and linear_acceleration columns.

In [None]:
X_train.head(3)

In [None]:
from numpy.fft import *

# train
for col in X_train.columns:
    if col[0:3] == 'ang' or col[0:3] == 'lin':
        # Apply filter_signal function to the data in each series
        denoised_data = X_train.groupby(['series_id'])[col].apply(lambda x: filter_signal(x))
        
        # Assign the denoised data back to X_train
        list_denoised_data = []
        for arr in denoised_data:
            for val in arr:
                list_denoised_data.append(val)
                
        X_train_denoised[col] = list_denoised_data
        
# test
for col in X_test.columns:
    if col[0:3] == 'ang' or col[0:3] == 'lin':
        # Apply filter_signal function to the data in each series
        denoised_data = X_test.groupby(['series_id'])[col].apply(lambda x: filter_signal(x))
        
        # Assign the denoised data back to X_test
        list_denoised_data = []
        for arr in denoised_data:
            for val in arr:
                list_denoised_data.append(val)
                
        X_test_denoised[col] = list_denoised_data
        

Now, let's look at the result:

In [None]:
series_dict = {}
for series in (X_train_denoised['series_id'].unique()):
    series_dict[series] = X_train_denoised[X_train_denoised['series_id'] == series] 

In [None]:
plotSeries(1)

As you can see, our signal becomes much smoother than before. Here's a closer comparison:

In [None]:
plt.figure(figsize=(24, 8))
plt.title('angular_velocity_Z')  # title matches the column plotted below
plt.plot(X_train.angular_velocity_Z[128:256], label="original");
plt.plot(X_train_denoised.angular_velocity_Z[128:256], label="denoised");
plt.legend()
plt.show()

## Euler angles
The Euler angles are three angles introduced by Leonhard Euler to describe the orientation of a rigid body with respect to a fixed coordinate system.

In [None]:
#https://en.wikipedia.org/wiki/Conversion_between_quaternions_and_Euler_angles
#quaternion to Euler angles
import math

def quaternion_to_euler(qx,qy,qz,qw):
    # roll (x-axis rotation)
    sinr_cosp = +2.0 * (qw * qx + qy * qz)
    cosr_cosp = +1.0 - 2.0 * (qx * qx + qy * qy)
    roll = math.atan2(sinr_cosp, cosr_cosp)
    
    # pitch (y-axis rotation)
    sinp = +2.0 * (qw * qy - qz * qx)
    if(math.fabs(sinp) >= 1):
        # out of asin's domain: clamp to +/-90 degrees
        pitch = math.copysign(math.pi/2, sinp)
    else:
        pitch = math.asin(sinp)
        
    # yaw (z-axis rotation)
    siny_cosp = +2.0 * (qw * qz + qx * qy)
    cosy_cosp = +1.0 - 2.0 * (qy * qy + qz * qz)
    yaw = math.atan2(siny_cosp, cosy_cosp)
    
    return roll, pitch, yaw
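
A quick way to sanity-check a quaternion-to-Euler conversion is to feed it rotations whose angles are known in advance. The sketch below uses the formulas from the Wikipedia page cited above; `to_euler` is a hypothetical standalone helper written only for this check, with the same `(qx, qy, qz, qw)` argument order as the function in this notebook.

```python
import math

def to_euler(qx, qy, qz, qw):
    """Convert a unit quaternion to (roll, pitch, yaw) in radians."""
    roll = math.atan2(2.0 * (qw * qx + qy * qz),
                      1.0 - 2.0 * (qx * qx + qy * qy))
    sinp = 2.0 * (qw * qy - qz * qx)
    # Clamp to +/-90 degrees when sinp leaves asin's domain.
    pitch = math.copysign(math.pi / 2, sinp) if abs(sinp) >= 1 else math.asin(sinp)
    yaw = math.atan2(2.0 * (qw * qz + qx * qy),
                     1.0 - 2.0 * (qy * qy + qz * qz))
    return roll, pitch, yaw

half = math.sqrt(0.5)
identity = to_euler(0.0, 0.0, 0.0, 1.0)   # no rotation at all
yaw_90 = to_euler(0.0, 0.0, half, half)   # 90 degrees about the z-axis
roll_90 = to_euler(half, 0.0, 0.0, half)  # 90 degrees about the x-axis

print(identity, yaw_90, roll_90)
```

If a conversion function returns a non-zero roll or pitch for a pure z-axis rotation, one of the product terms is most likely wrong.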

In [None]:
def eular_angle(data):
    x, y, z, w = data['orientation_X'].tolist(), data['orientation_Y'].tolist(), data['orientation_Z'].tolist(), data['orientation_W'].tolist()
    nx, ny, nz = [], [], []
    for i in range(len(x)):
        xx, yy, zz = quaternion_to_euler(x[i], y[i], z[i], w[i])
        nx.append(xx)
        ny.append(yy)
        nz.append(zz)
    
    data['euler_x'] = nx
    data['euler_y'] = ny
    data['euler_z'] = nz
    
    return data

In [None]:
data = eular_angle(X_train_denoised)
test = eular_angle(X_test_denoised)
print(data.shape, test.shape)

In [None]:
data.head(3)

### Feature Engineering
* calculate total angular velocity
* calculate total linear acceleration
* calculate total orientation
* calculate acceleration vs velocity
* calculate total Euler angle

In [None]:
def fe_eng1(data):
    data['total_angular_vel'] = (data['angular_velocity_X']**2 + data['angular_velocity_Y']**2 + data['angular_velocity_Z']**2)** 0.5
    data['total_linear_acc'] = (data['linear_acceleration_X']**2 + data['linear_acceleration_Y']**2 + data['linear_acceleration_Z']**2)**0.5
    data['total_orientation'] = (data['orientation_X']**2 + data['orientation_Y']**2 + data['orientation_Z']**2)**0.5
    data['acc_vs_vel'] = data['total_linear_acc'] / data['total_angular_vel']
    # Euclidean norm of the Euler angles (exponent 0.5, consistent with the other "total" features)
    data['total_angle'] = (data['euler_x'] ** 2 + data['euler_y'] ** 2 + data['euler_z'] ** 2) ** 0.5
    data['angle_vs_acc'] = data['total_angle'] / data['total_linear_acc']
    data['angle_vs_vel'] = data['total_angle'] / data['total_angular_vel']
    return data

In [None]:
data = fe_eng1(data)
test = fe_eng1(test)
print(data.shape, test.shape)

In [None]:
def fe_eng2(data):
    df = pd.DataFrame()
    
    for col in data.columns:
        if col in ['row_id','series_id','measurement_number']:
            continue
        df[col + '_mean'] = data.groupby(['series_id'])[col].mean()
        df[col + '_median'] = data.groupby(['series_id'])[col].median()
        df[col + '_max'] = data.groupby(['series_id'])[col].max()
        df[col + '_min'] = data.groupby(['series_id'])[col].min()
        df[col + '_std'] = data.groupby(['series_id'])[col].std()
        df[col + '_range'] = df[col + '_max'] - df[col + '_min']
        df[col + '_maxtoMin'] = df[col + '_max'] / df[col + '_min']
        # Note: despite the `_mad` name, this is the median of absolute successive differences (np.diff), not the textbook median absolute deviation
        df[col + '_mad'] = data.groupby(['series_id'])[col].apply(lambda x: np.median(np.abs(np.diff(x))))
        df[col + '_abs_max'] = data.groupby(['series_id'])[col].apply(lambda x: np.max(np.abs(x)))
        df[col + '_abs_min'] = data.groupby(['series_id'])[col].apply(lambda x: np.min(np.abs(x)))
        df[col + '_abs_avg'] = (df[col + '_abs_min'] + df[col + '_abs_max'])/2
    return df
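
The per-series statistics can also be computed in a single `groupby(...).agg(...)` call, which makes it easy to compare the median of absolute successive differences (what the `_mad` column above computes) with the textbook median absolute deviation. This is only a sketch on a toy frame; both helper functions and the column name `signal` are hypothetical.

```python
import numpy as np
import pandas as pd

def median_abs_diff(x):
    """Median of absolute differences between consecutive samples."""
    return np.median(np.abs(np.diff(x)))

def median_abs_deviation(x):
    """Textbook MAD: median of absolute deviations from the median."""
    return np.median(np.abs(x - np.median(x)))

# Toy data (hypothetical): two short series of four measurements each.
toy = pd.DataFrame({
    'series_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'signal':    [1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 5.0, 9.0],
})

# One row per series_id; custom functions are named after their __name__.
stats = toy.groupby('series_id')['signal'].agg(
    ['mean', 'min', 'max', median_abs_diff, median_abs_deviation]
)
print(stats)
```

For series 0 the two measures already differ, which shows why the naming matters when interpreting feature importances later on.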

In [None]:
%%time
data = fe_eng2(data)
test = fe_eng2(test)
print(data.shape, test.shape)

In [None]:
data.head(3)

#### Observation:
The train feature table now has one row per series, matching the number of target rows; the test feature table has one row per requested series_id.

In [None]:
data.fillna(0, inplace = True)
data.replace(-np.inf, 0, inplace = True)
data.replace(np.inf, 0, inplace = True)
test.fillna(0, inplace = True)
test.replace(-np.inf, 0, inplace = True)
test.replace(np.inf, 0, inplace = True)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train['surface'] = le.fit_transform(y_train['surface'])

In [None]:
y_train.head()

## Run Model:
#### As this is a multi-class classification problem, let's try the Random Forest Classifier algorithm.

In [None]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=60)
predicted = np.zeros((test.shape[0],9))
measured= np.zeros((data.shape[0]))
score = 0

In [None]:
for times, (trn_idx, val_idx) in enumerate(folds.split(data.values,y_train['surface'].values)):
    model = RandomForestClassifier(n_estimators=700, n_jobs = -1)
    #model = RandomForestClassifier(n_estimators=500, max_depth=10, min_samples_split=5, n_jobs=-1)
    model.fit(data.iloc[trn_idx],y_train['surface'][trn_idx])
    measured[val_idx] = model.predict(data.iloc[val_idx])
    predicted += model.predict_proba(test)/folds.n_splits
    score += model.score(data.iloc[val_idx],y_train['surface'][val_idx])
    print("Fold: {} score: {}".format(times,model.score(data.iloc[val_idx],y_train['surface'][val_idx])))
    
    gc.collect()

In [None]:
print('Average score', score / folds.n_splits)

In [None]:
# confusion_matrix expects (y_true, y_pred); `measured` holds the out-of-fold predictions
confusion_matrix(y_train['surface'], measured)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(12,5))
# confusion_matrix expects (y_true, y_pred): rows are actual labels, columns are predicted labels
sns.heatmap(pd.DataFrame(confusion_matrix(y_train['surface'], measured)),
            ax = ax,
            cmap = 'coolwarm',
            annot = True,
            fmt = 'd',
            linewidths = 0.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Confusion matrix: actual vs. predicted label', 
              fontsize=14, 
              fontweight='bold')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
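
A confusion matrix is easier to act on once it is reduced to a per-class recall, i.e. the share of each true class that was predicted correctly. The sketch below assumes a matrix laid out with true labels on the rows, as scikit-learn's `confusion_matrix(y_true, y_pred)` produces; `per_class_recall` is a hypothetical helper and the toy matrix is invented.

```python
import numpy as np

def per_class_recall(cm):
    """Diagonal divided by row totals; rows are assumed to be true labels."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Classes with no true samples get a recall of 0 instead of a division error.
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals),
                     where=row_totals > 0)

# Toy 3-class matrix (hypothetical counts).
toy_cm = np.array([[8, 2, 0],
                   [1, 9, 0],
                   [0, 5, 5]])
recall = per_class_recall(toy_cm)
print(recall)
```

Classes with low recall are the surfaces the model confuses most often, which is a reasonable place to start when designing further features.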

**Feature Importance**

Understanding which features are important helps us fine-tune the feature engineering and improve accuracy.

In [None]:
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis = 0)
indices = np.argsort(importances)[::-1]

In [None]:
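feature_importances = pd.DataFrame(importances, index = data.columns, columns = ['importance'])
# sort_values returns a new frame, so assign it back before taking the top 20
feature_importances = feature_importances.sort_values('importance', ascending = False)
feature_importances.head(20)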

In [None]:
feature_importances.sort_values('importance', ascending = False).plot(kind = 'bar', 
                         figsize = (35,8), 
                         color = 'r', 
                         yerr=std[indices], 
                        align = 'center')
plt.xticks(rotation=90)
plt.show()

In [None]:
feature_importances.sort_values('importance', ascending = False)[:100].plot(kind = 'bar',
                                                                            figsize = (30,5),
                                                                            color = 'g', 
                                                                            yerr=std[indices[:100]], 
                                                                            align = 'center')
plt.xticks(rotation=90)
plt.show()

In [None]:
less_important_features = feature_importances.loc[feature_importances['importance'] < 0.0025]
print('There are {0} features whose importance value is less than 0.0025'.format(less_important_features.shape[0]))

In [None]:
#Remove less important features from train and test set.
data = data.drop(columns = less_important_features.index)
test = test.drop(columns = less_important_features.index)
    
data.shape, test.shape

### Run ML Model Again

In [None]:
predicted = np.zeros((test.shape[0],9))
measured= np.zeros((data.shape[0]))
score = 0
for times, (trn_idx, val_idx) in enumerate(folds.split(data.values,y_train['surface'].values)):
    model = RandomForestClassifier(n_estimators=700, n_jobs = -1)
    #model = RandomForestClassifier(n_estimators=500, max_depth=10, min_samples_split=5, n_jobs=-1)
    model.fit(data.iloc[trn_idx],y_train['surface'][trn_idx])
    measured[val_idx] = model.predict(data.iloc[val_idx])
    predicted += model.predict_proba(test)/folds.n_splits
    score += model.score(data.iloc[val_idx],y_train['surface'][val_idx])
    print("Fold: {} score: {}".format(times,model.score(data.iloc[val_idx],y_train['surface'][val_idx])))
    
    gc.collect()

In [None]:
print('Average score', score / folds.n_splits)

**Observation:**

Orientation features appear to be the most important ones, so further feature engineering around orientation is a natural next step. The run above uses the data with the low-importance features removed.

In [None]:
submission = pd.read_csv('../input/sample_submission.csv')
submission['surface'] = le.inverse_transform(predicted.argmax(axis=1))
submission.to_csv('rs_surface_submission6.csv', index=False)
submission.head(10)

Ref:

Feature engineering kernel 1: https://www.kaggle.com/jesucristo/1-robots-eda-rf-cval-0-73
Feature engineering kernel 2: https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics/notebook

Feature importance: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Median absolute deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation
Quaternions and 3D rotation, explained interactively: https://www.youtube.com/watch?v=zjMuIxRvygQ
Conversion between quaternions and Euler angles: https://en.wikipedia.org/wiki/Conversion_between_quaternions_and_Euler_angles

Thanks for stopping by. Please upvote if you like my kernel. 
Stay tuned for further analysis and model accuracy improvements.