<h1><center><font size="6">Robots need help!</font></center></h1>

<img src="https://upload.wikimedia.org/wikipedia/commons/d/df/RobotsMODO.jpg" width="400"></img>

<br>

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare for data analysis</a>  
- <a href='#3'>Data exploration</a>   
 - <a href='#31'>Check the data</a>   
 - <a href='#32'>Distribution of target feature `surface`</a>   
 - <a href='#33'>Density plots of features</a>   
- <a href='#4'>Feature engineering</a>
- <a href='#5'>Model</a>
- <a href='#6'>Submission</a>  
- <a href='#7'>References</a>

# <a id='1'>Introduction</a>  

## Competition
In this competition, we will help robots recognize the floor surface they're standing on. The floor can be of various types, such as carpet, tiles, or concrete.

## Data
The data provided by the organizers is IMU sensor data collected while driving a small mobile robot over different floor surfaces on the university premises.  

## Kernel
In this Kernel we perform EDA on the data, experiment with feature engineering, and build a predictive model.

# <a id='2'>Prepare for data analysis</a>  


## Load packages


In [None]:
import gc
import os
import logging
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')

## Load data   

Let's check what data files are available.

In [None]:
IS_LOCAL = False
if(IS_LOCAL):
    PATH="../input/careercon/"
else:
    PATH="../input/"
os.listdir(PATH)

Let's load the data.

In [None]:
%%time
X_train = pd.read_csv(os.path.join(PATH, 'X_train.csv'))
X_test = pd.read_csv(os.path.join(PATH, 'X_test.csv'))
y_train = pd.read_csv(os.path.join(PATH, 'y_train.csv'))

In [None]:
print("Train X: {}\nTrain y: {}\nTest X: {}".format(X_train.shape, y_train.shape, X_test.shape))

We can observe that the train data and the train labels have a different number of rows.
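
The difference is what one would expect if the data file holds one row per measurement and the label file holds one row per series. A minimal sketch on toy data shows one way to check this by counting rows per `series_id`. The frames `toy_X` and `toy_y` and their sizes are made up for illustration and are not the competition files.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical data: 3 series x 4 measurements).
n_series, n_meas = 3, 4
toy_X = pd.DataFrame({
    "series_id": np.repeat(np.arange(n_series), n_meas),
    "measurement_number": np.tile(np.arange(n_meas), n_series),
})
toy_y = pd.DataFrame({
    "series_id": np.arange(n_series),
    "surface": ["carpet", "tiled", "concrete"],
})

# Number of measurement rows belonging to each series.
rows_per_series = toy_X.groupby("series_id").size()
print(rows_per_series.unique())          # a single value if all series have equal length
print(len(toy_X), len(toy_y) * n_meas)   # the two row counts agree once scaled
```

The same `groupby("series_id").size()` call could be run on `X_train` to see how many measurements each series contains.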

# <a id='3'>Data exploration</a>  

## <a id='31'>Check the data</a>  

Let's look at the train and test sets.

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_test.head()

The X_train and X_test datasets have the following columns:  

* series and measurement identifiers: **row_id**, **series_id**, **measurement_number**: these uniquely identify a series and a measurement; the train set has 3810 series (numbered 0 to 3809), each with 128 measurements (numbered 0 to 127);  
* measurement orientations: **orientation_X**, **orientation_Y**, **orientation_Z**, **orientation_W**;   
* angular velocities: **angular_velocity_X**, **angular_velocity_Y**, **angular_velocity_Z**;
* linear accelerations: **linear_acceleration_X**, **linear_acceleration_Y**, **linear_acceleration_Z**.

y_train has the following columns:  

* **series_id** - this corresponds to the series in train data;  
* **group_id**;  
* **surface** - this is the surface type that needs to be predicted.



In [None]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

In [None]:
missing_data(X_train)

In [None]:
missing_data(X_test)

There are no missing values in train and test data.

In [None]:
missing_data(y_train)

The train labels also have no missing data.

In [None]:
X_train.describe()

In [None]:
X_test.describe()

In [None]:
y_train.describe()

There is the same number of series in X_train and y_train, numbered from 0 to 3809 (3810 in total). Each series has 128 measurements.   
Each series in the train dataset is part of a group (numbered from 0 to 72, i.e. 73 groups).  
The number of rows in X_train and X_test differs by 6 x 128, 128 being the number of measurements in each series.  
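
The row counts implied by these figures can be checked with a few lines of arithmetic. This is only a sketch: the constants below are the numbers quoted above, not values read from the files.

```python
n_train_series = 3810   # series_id runs from 0 to 3809
n_measurements = 128    # measurements per series
extra_series = 6        # difference in series count between X_train and X_test

train_rows = n_train_series * n_measurements
row_difference = extra_series * n_measurements
print(train_rows)       # 487680
print(row_difference)   # 768
```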

## <a id='32'>Distribution of target feature - surface</a>  


In [None]:
f, ax = plt.subplots(1,1, figsize=(16,4))
g = sns.countplot(x=y_train['surface'], ax=ax)  # pass the column by keyword; recent seaborn versions require it
g.set_title("Number of labels for each class")
plt.show()    

## <a id='33'>Density plots of features</a>  

Let's now show the density plots of the features in the train and test datasets. 

Then, for the train set, we plot the distribution of each feature separately for each value of **surface**, using a different color for each class.

In [None]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(2,5,figsize=(16,8))

    for feature in features:
        i += 1
        plt.subplot(2,5,i)
        sns.kdeplot(df1[feature], bw=0.5,label=label1)
        sns.kdeplot(df2[feature], bw=0.5,label=label2)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=8)
        plt.tick_params(axis='y', which='major', labelsize=8)
    plt.show();

In [None]:
features = X_train.columns.values[3:]
plot_feature_distribution(X_train, X_test, 'train', 'test', features)

In [None]:
def plot_feature_class_distribution(classes,tt, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(5,2,figsize=(16,24))

    for feature in features:
        i += 1
        plt.subplot(5,2,i)
        for clas in classes:
            ttc = tt[tt['surface']==clas]
            sns.kdeplot(ttc[feature], bw=0.5,label=clas)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=8)
        plt.tick_params(axis='y', which='major', labelsize=8)
    plt.show();

In [None]:
classes = (y_train['surface'].value_counts()).index
tt = X_train.merge(y_train, on='series_id', how='inner')
plot_feature_class_distribution(classes, tt, features)

# <a id='4'>Feature engineering</a>  


This section borrows heavily from this Kernel: https://www.kaggle.com/vanshjatana/help-humanity-by-helping-robots

In [None]:
# https://stackoverflow.com/questions/53033620/how-to-convert-euler-angles-to-quaternions-and-get-the-same-euler-angles-back-fr?rq=1
def quaternion_to_euler(x, y, z, w):
    import math
    t0 = +2.0 * (w * x + y * z)
    t1 = +1.0 - 2.0 * (x * x + y * y)
    X = math.atan2(t0, t1)

    t2 = +2.0 * (w * y - z * x)
    t2 = +1.0 if t2 > +1.0 else t2
    t2 = -1.0 if t2 < -1.0 else t2
    Y = math.asin(t2)

    t3 = +2.0 * (w * z + x * y)
    t4 = +1.0 - 2.0 * (y * y + z * z)
    Z = math.atan2(t3, t4)

    return X, Y, Z

def perform_feature_engineering(actual):
    new = pd.DataFrame()
    actual['total_angular_velocity'] = (actual['angular_velocity_X'] ** 2 + actual['angular_velocity_Y'] ** 2 + actual['angular_velocity_Z'] ** 2) ** 0.5
    actual['total_linear_acceleration'] = (actual['linear_acceleration_X'] ** 2 + actual['linear_acceleration_Y'] ** 2 + actual['linear_acceleration_Z'] ** 2) ** 0.5
    
    actual['acc_vs_vel'] = actual['total_linear_acceleration'] / actual['total_angular_velocity']
    
    x, y, z, w = actual['orientation_X'].tolist(), actual['orientation_Y'].tolist(), actual['orientation_Z'].tolist(), actual['orientation_W'].tolist()
    nx, ny, nz = [], [], []
    for i in range(len(x)):
        xx, yy, zz = quaternion_to_euler(x[i], y[i], z[i], w[i])
        nx.append(xx)
        ny.append(yy)
        nz.append(zz)
    
    actual['euler_x'] = nx
    actual['euler_y'] = ny
    actual['euler_z'] = nz
    
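    # Magnitude of the Euler angle vector, computed like the two norms above (square root of the sum of squares).
    actual['total_angle'] = (actual['euler_x'] ** 2 + actual['euler_y'] ** 2 + actual['euler_z'] ** 2) ** 0.5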
    actual['angle_vs_acc'] = actual['total_angle'] / actual['total_linear_acceleration']
    actual['angle_vs_vel'] = actual['total_angle'] / actual['total_angular_velocity']
    
    def mean_change_of_abs_change(x):
        return np.mean(np.diff(np.abs(np.diff(x))))

    def mean_abs_change(x):
        return np.mean(np.abs(np.diff(x)))
    
    for col in actual.columns:
        if col in ['row_id', 'series_id', 'measurement_number']:
            continue
        new[col + '_mean'] = actual.groupby(['series_id'])[col].mean()
        new[col + '_min'] = actual.groupby(['series_id'])[col].min()
        new[col + '_max'] = actual.groupby(['series_id'])[col].max()
        new[col + '_std'] = actual.groupby(['series_id'])[col].std()
        new[col + '_max_to_min'] = new[col + '_max'] / new[col + '_min']
        
        # Change. 1st order.
        new[col + '_mean_abs_change'] = actual.groupby('series_id')[col].apply(mean_abs_change)
        
        # Change of Change. 2nd order.
        new[col + '_mean_change_of_abs_change'] = actual.groupby('series_id')[col].apply(mean_change_of_abs_change)
        
        new[col + '_abs_max'] = actual.groupby('series_id')[col].apply(lambda x: np.max(np.abs(x)))
        new[col + '_abs_min'] = actual.groupby('series_id')[col].apply(lambda x: np.min(np.abs(x)))

    return new
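
The quaternion conversion above is applied row by row in a Python loop. As a hedged alternative, the same formulas can be written with NumPy so that they work on whole columns at once. The function name `quaternion_to_euler_np` and the names `roll`, `pitch`, `yaw` (for the X, Y, Z outputs) are assumptions made for this sketch, which also checks the result on two quaternions with known answers.

```python
import numpy as np

def quaternion_to_euler_np(x, y, z, w):
    """Vectorized variant of the scalar conversion, for arrays of quaternion components."""
    x, y, z, w = (np.asarray(a, dtype=float) for a in (x, y, z, w))
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clip to [-1, 1] so rounding error cannot push arcsin out of its domain.
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# Two quaternions with known answers: the identity, and a 90 degree turn about Z.
s = np.sqrt(0.5)
roll, pitch, yaw = quaternion_to_euler_np([0.0, 0.0], [0.0, 0.0], [0.0, s], [1.0, s])
print(np.degrees(yaw))   # approximately [0, 90]
```

On the real data this could be called with the four `orientation_*` columns in place of the toy lists.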

In [None]:
%%time
X_train = perform_feature_engineering(X_train)
X_test = perform_feature_engineering(X_test)
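
To make the two change statistics more concrete, here is a small worked sketch on a toy signal. It repeats the two helper definitions so that it runs on its own and applies them per `series_id`, in the same way the loop in `perform_feature_engineering` does for every column. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

def mean_abs_change(x):
    # Average size of the step between consecutive measurements.
    return np.mean(np.abs(np.diff(x)))

def mean_change_of_abs_change(x):
    # Average change in step size from one step to the next.
    return np.mean(np.diff(np.abs(np.diff(x))))

# Toy signal (hypothetical values): two series of four measurements each.
toy = pd.DataFrame({
    "series_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "signal":    [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 5.0, 5.0],
})

grouped = toy.groupby("series_id")["signal"]
summary = pd.DataFrame({
    "signal_mean_abs_change": grouped.apply(mean_abs_change),
    "signal_mean_change_of_abs_change": grouped.apply(mean_change_of_abs_change),
})
print(summary)
```

For series 0 the successive differences are 2, -1 and 4, so the mean absolute change is 7/3 and the mean change of the absolute change is 1. The constant series 1 gives 0 for both.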

In [None]:
X_train.head()

In [None]:
X_test.head()

# <a id='5'>Model</a>  


In [None]:
le = LabelEncoder()
y_train['surface'] = le.fit_transform(y_train['surface'])

In [None]:
X_train.fillna(0, inplace = True)
X_train.replace(-np.inf, 0, inplace = True)
X_train.replace(np.inf, 0, inplace = True)
X_test.fillna(0, inplace = True)
X_test.replace(-np.inf, 0, inplace = True)
X_test.replace(np.inf, 0, inplace = True)
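
Ratio features such as `acc_vs_vel` and the `_max_to_min` columns divide by quantities that can be zero, which is presumably where the infinite values come from. The cleanup above can also be written as a single chained expression. The helper name `clean_features` and the toy frame are hypothetical, and unlike the in-place calls above this sketch returns a new frame.

```python
import numpy as np
import pandas as pd

def clean_features(df):
    """Return a copy with +inf, -inf and NaN all replaced by 0."""
    return df.replace([np.inf, -np.inf], np.nan).fillna(0)

# Toy frame (hypothetical values) with one value of each problematic kind.
toy = pd.DataFrame({"a": [1.0, np.inf, -np.inf], "b": [np.nan, 2.0, 3.0]})
cleaned = clean_features(toy)
print(cleaned)
```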

In [None]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
#               'max_features': max_features,
               'max_depth': max_depth,
#               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
#               'bootstrap': bootstrap
              }
print(random_grid)

In [None]:
params = {'n_estimators': 800, 'min_samples_leaf': 1, 'max_depth': 20}

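We use a Random Forest Classifier, trained with 10-fold stratified cross-validation. The test predictions are the average of the class probabilities predicted in each fold.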

In [None]:
sub_preds_rf = np.zeros((X_test.shape[0], 9))
oof_preds_rf = np.zeros((X_train.shape[0]))
score = 0
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train['surface'])):
    clf =  RandomForestClassifier(**params)
    #rf_random = RandomizedSearchCV(estimator = clf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
    # Fit the random search model
    #rf_random.fit(X_train.iloc[trn_idx], y_train['surface'][trn_idx])
    #print(rf_random.best_params_)
    #clf = rf_random.best_estimator_
    clf.fit(X_train.iloc[trn_idx], y_train['surface'][trn_idx])
    oof_preds_rf[val_idx] = clf.predict(X_train.iloc[val_idx])
    sub_preds_rf += clf.predict_proba(X_test) / folds.n_splits
    score += clf.score(X_train.iloc[val_idx], y_train['surface'][val_idx])
    print('Fold: {} score: {}'.format(fold_,clf.score(X_train.iloc[val_idx], y_train['surface'][val_idx])))
print('Avg Accuracy', score / folds.n_splits)
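
The loop stores out-of-fold predictions but reports only accuracy. Since `confusion_matrix` is already imported, a per-class view is a natural next step. The sketch below uses toy encoded labels, so the arrays `y_true` and `oof_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy encoded labels (hypothetical): 3 classes, 6 validation samples.
y_true = np.array([0, 0, 1, 1, 2, 2])
oof_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, oof_pred)   # rows: true class, columns: predicted class
print(cm)
print(accuracy_score(y_true, oof_pred))   # fraction of samples on the diagonal
```

With the notebook's own variables this would presumably be `confusion_matrix(y_train['surface'], oof_preds_rf.astype(int))`, with `le.classes_` giving the surface name for each row and column.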

# <a id='6'>Submission</a>  

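We take the class with the highest averaged probability for each test series, convert it back to a surface name, and write the submission file.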

In [None]:
submission = pd.read_csv(os.path.join(PATH,'sample_submission.csv'))
submission['surface'] = le.inverse_transform(sub_preds_rf.argmax(axis=1))
submission.to_csv('submission.csv', index=False)
submission.head(10)
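
The decoding step can be illustrated on toy data. The sketch below averages class probabilities from two hypothetical folds, picks the most likely class, and maps the class index back to a label with `inverse_transform`. The encoder `le_toy`, the three labels, and the probabilities are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical subset of surfaces); LabelEncoder sorts classes alphabetically.
le_toy = LabelEncoder()
le_toy.fit(["tiled", "carpet", "concrete"])

# Class probabilities from two folds for two test series, averaged over the folds.
fold_probs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]),
]
avg_probs = sum(p / len(fold_probs) for p in fold_probs)

# Index of the most likely class per row, mapped back to the original label.
predicted = le_toy.inverse_transform(avg_probs.argmax(axis=1))
print(predicted)
```

Because the columns of `predict_proba` follow the sorted encoded classes, the `argmax` index can be passed directly to `inverse_transform`.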

# <a id='7'>References</a>    

[1] https://www.kaggle.com/vanshjatana/help-humanity-by-helping-robots-4e306b  
[2] https://www.kaggle.com/artgor/where-do-the-robots-drive  
