# **Descriptions**<br>
In this competition, you’ll help robots recognize the floor surface they’re standing on using data collected from Inertial Measurement Units (IMU sensors).
We’ve collected IMU sensor data while driving a small mobile robot over different floor surfaces on the university premises. The task is to predict which one of the nine floor types (for example carpet, tiles, or concrete) the robot is on, using sensor data such as acceleration and velocity. Succeed and you'll help robots navigate without assistance across many different surfaces, so they won’t fall down on the job.

# **Objective**<br>
We have to predict which one of the 9 floor types the robot is standing on.

# **Evaluation/Performance Metric** <br>
This is a multiclass classification problem, and the performance metric is **Multiclass Accuracy**, which is simply the fraction of observations with the correct label.
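
As a minimal sketch of the metric, the snippet below computes multiclass accuracy by hand on made-up labels and compares it with `sklearn.metrics.accuracy_score`. The label values are illustrative only and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (not competition data)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

# Accuracy = fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```

Both values agree because accuracy ignores which wrong class was predicted; it only counts exact matches.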

# **DataSet Information** <br>
   - **Train_data & Test_data** - Train_data is used to train the model and Test_data to generate predictions. Each contains 10 sensor channels and 128 measurements per time series, plus three ID columns. We can think of the sensor signals as having been filtered and then sampled in fixed windows of 128 readings each. <br> 
      - **row_id** - Current row number<br>
      - **series_id** - ID number for the measurement series. Foreign key to y_train/sample_submission.<br>
      - **measurement_number** - Measurement number within the series<br>
      - **orientation_W,X,Y,Z** - 4 of the 10 sensor channels; they describe the robot's current orientation as a quaternion<br>
      - **angular_velocity_X,Y,Z** - 3 of the 10 sensor channels; they measure angular velocity (rotation angle per unit time), as a gyroscope does<br>
      - **linear_acceleration_X,Y,Z** - 3 of the 10 sensor channels; they measure how velocity changes over time<br>
   - **Y_train** - The surface labels for the training set<br>
      - **series_id** - ID number for the measurement series.<br>
      - **group_id** - ID number of the recording session that the measurements were taken in.<br>
      - **surface** - Class label/target.<br>
   - **sample_submission** - Template for the submission, which must contain series_id and the predicted surface<br>
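
The fixed-window layout described above can be sketched as a reshape from the long table (one row per measurement) into an array of shape `(n_series, window, n_channels)`. The helper name `to_windows` and the tiny synthetic table are assumptions made for illustration; the real files use a window of 128 and 10 channels.

```python
import numpy as np
import pandas as pd

def to_windows(df, channels, window=128):
    """Reshape a long table into an array of shape (n_series, window, n_channels)."""
    # Sort so that each series is contiguous and its measurements are in order
    ordered = df.sort_values(["series_id", "measurement_number"])
    n_series = ordered["series_id"].nunique()
    return ordered[channels].to_numpy().reshape(n_series, window, len(channels))

# Tiny synthetic stand-in: 2 series, 4 measurements each, 2 channels (hypothetical data)
toy = pd.DataFrame({
    "series_id": np.repeat([0, 1], 4),
    "measurement_number": np.tile(np.arange(4), 2),
    "angular_velocity_X": np.arange(8, dtype=float),
    "angular_velocity_Y": np.arange(8, dtype=float) * 10,
})
windows = to_windows(toy, ["angular_velocity_X", "angular_velocity_Y"], window=4)
print(windows.shape)
```

This view is convenient for per-series statistics or sequence models; the notebook itself keeps the long format and aggregates with `groupby`.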
   
   

# **Y_labels(Encoded)**<br>
Since this is a multiclass problem, we encode the class labels as integers from 0 to 8.<br>
 - fine_concrete                    0
 - concrete                            1
 - soft_tiles                            2
 - tiled                                   3
 - soft_pvc                            4
 - hard_tiles_large_space    5
 - carpet                               6
 - hard_tiles                         7
 - wood                                8
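
A small sketch of the encoding and its inverse, using the zero-based codes that the notebook's code cells apply. The names `encode`, `decode`, and `sample` are introduced here for illustration only.

```python
import pandas as pd

# Surface names in the order of their integer codes (0 to 8)
surfaces = ['fine_concrete', 'concrete', 'soft_tiles', 'tiled', 'soft_pvc',
            'hard_tiles_large_space', 'carpet', 'hard_tiles', 'wood']
encode = {name: code for code, name in enumerate(surfaces)}
decode = {code: name for name, code in encode.items()}

# Round trip on a few example labels
sample = pd.Series(['carpet', 'wood', 'concrete'])
codes = sample.map(encode)
restored = codes.map(decode)
print(codes.tolist(), restored.tolist())
```

Building `decode` from `encode` keeps the two mappings in sync, which matters when predictions are mapped back to names for the submission file.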

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
import itertools
import os
import warnings
from datetime import datetime

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# List the files in the input directory
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Read and store train and test data
train_df = pd.read_csv("../input/X_train.csv")
test_df = pd.read_csv("../input/X_test.csv")

In [None]:
# shape of train data
train_df.shape

In [None]:
# Each series (window) has 128 readings, so rows // 128 is the number of series
print("Number of series in train data", train_df.shape[0]//128)
# First 5 rows of the train data
train_df.head()

In [None]:
# Read y_train(target)
y_train = pd.read_csv("../input/y_train.csv", skipinitialspace = True)
y_train.head()

In [None]:
# Encode class label
y_train["surface_label"] = y_train["surface"].map({'fine_concrete':0, 'concrete':1, 'soft_tiles':2, 'tiled':3, 'soft_pvc':4,
       'hard_tiles_large_space':5, 'carpet':6, 'hard_tiles':7, 'wood':8}) 

In [None]:
# To merge train_data with corresponding class label
train_df = pd.merge(train_df, y_train[["surface_label", "series_id"]], on = "series_id") 
train_df.head()

In [None]:
# Unique surface (class) names
# There are 9 classes
y_train.surface.unique()

In [None]:
# Number of series in the test data, followed by its first 5 rows
print("Number of series in test data",test_df.shape[0]//128)
test_df.head() 

# **Data Exploration**

In [None]:
# Check for number of datapoints per class
y_train["surface"].value_counts()

In [None]:
# Plot per class data-points
plt.figure(figsize = (12, 8))
sns.countplot(x = y_train["surface"])
plt.title("Number of datapoints per class")
plt.ylabel("Number of datapoints")
plt.xlabel("Class name")
plt.show()

The classes are imbalanced. We will keep this in mind and account for it when splitting the data and evaluating models.
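
One common way to account for imbalance, shown here only as a hedged sketch on toy labels, is to weight classes inversely to their frequency. `compute_class_weight` is a scikit-learn utility; the toy array `y_toy` is made up, and this notebook does not necessarily use class weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 0 appears three times, class 1 once (illustrative only)
y_toy = np.array([0, 0, 0, 1])
classes = np.unique(y_toy)

# "balanced" weights follow n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Classifiers such as `RandomForestClassifier` and `LogisticRegression` accept such a dictionary (or the string `"balanced"`) through their `class_weight` parameter.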

In [None]:
# Check for duplicate rows in train and test
print("Number of duplicate rows in train data {}".format(train_df.duplicated().sum()))
print("Number of duplicate rows in test data {}".format(test_df.duplicated().sum()))

In [None]:
# Check for null/nan values in both
print("Total number of null/nan values in train data \n{}".format(train_df.isnull().sum()))

In [None]:
print("Total number of null/nan values in test data \n{}".format(test_df.isnull().sum()))

> The dataset has no null/NaN values and no duplicated rows. That's great!

In [None]:
# Boxplot of angular_velocity_Z
# Look up each row's surface through series_id so that x and y both have one value per measurement
surface = train_df["series_id"].map(y_train.set_index("series_id")["surface"])
plt.figure(figsize = (12, 10))
sns.boxplot(x = surface, y = train_df["angular_velocity_Z"])
plt.show()

In [None]:
# Boxplot for orientation_X
plt.figure(figsize = (10, 8))
sns.boxplot(x = surface, y = train_df["orientation_X"])
plt.show()

In [None]:
# Boxplot for linear_acceleration_X
plt.figure(figsize = (10, 8))
sns.boxplot(x = surface, y = train_df["linear_acceleration_X"])
plt.show()

In [None]:
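# Distribution plot for linear_acceleration_X
# Names listed in the same order as the surface_label encoding (0 to 8)
label = ['fine_concrete', 'concrete', 'soft_tiles', 'tiled', 'soft_pvc',
         'hard_tiles_large_space', 'carpet', 'hard_tiles', 'wood']
plt.figure(figsize = (10, 8))
color = ["r", "g", "b", "c", "k", "y", "lime", "orange", "m"]
for i in range(len(label)):
    df = train_df[train_df["surface_label"] == i]
    # Kernel density estimate of the feature for one class
    sns.kdeplot(df["linear_acceleration_X"], color = color[i], label = label[i])
plt.legend()
plt.tight_layout()
plt.show()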

- The distribution of linear_acceleration_X is sharply peaked (i.e. kurtosis is high) and almost centered at 0. It resembles a Gaussian, but it is not one.
- The data does not look linearly separable, so we will create some features that might be useful in predicting the class label.
- Without domain knowledge, EDA alone gives limited insight.
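
To make the "peaked but not Gaussian" observation concrete, excess kurtosis can be compared against a Gaussian reference. The sketch below uses synthetic samples (a Laplace distribution stands in for a peaked signal); it is not computed on the competition data.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)   # reference: Gaussian sample
peaked = rng.laplace(size=100_000)    # stand-in for a sharply peaked signal

# Fisher (excess) kurtosis: about 0 for a Gaussian, about 3 for a Laplace distribution
k_gauss = kurtosis(gaussian)
k_peaked = kurtosis(peaked)
print(k_gauss, k_peaked)
```

Applying the same call to `train_df["linear_acceleration_X"]` would give a number to back up what the density plot suggests.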

# **Feature Engineering**

We will introduce some features that may be useful in prediction and will explore some of them later. 
* **mean():** Mean value
* **std():** Standard deviation
* **mad():** Median absolute deviation
* **max():** Largest value in array
* **min():** Smallest value in array
* **sma():** Signal magnitude area
* **iqr():** Interquartile range
* **entropy():** Signal entropy
* **arCoeff():** Autoregression coefficients with Burg order equal to 4
* **correlation():** Correlation coefficient between two signals
* **maxInds():** Index of the frequency component with largest magnitude
* **meanFreq():** Weighted average of the frequency components to obtain a mean frequency
* **skewness():** Skewness of the frequency domain signal
* **kurtosis():** Kurtosis of the frequency domain signal
* **angle():** Angle between two vectors.
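
A few of the listed features can be sketched with plain NumPy for a single window. The function names follow the list above, but the implementations are illustrative assumptions rather than the exact definitions used later; in particular, `sample_rate` is a hypothetical parameter because the sampling rate is not stated here.

```python
import numpy as np

def iqr(x):
    # Interquartile range: 75th percentile minus 25th percentile
    q75, q25 = np.percentile(x, [75, 25])
    return q75 - q25

def mad(x):
    # Median absolute deviation around the median
    return np.median(np.abs(x - np.median(x)))

def sma(x, y, z):
    # Signal magnitude area: mean of |x| + |y| + |z| over the window
    return np.mean(np.abs(x) + np.abs(y) + np.abs(z))

def mean_freq(x, sample_rate=1.0):
    # Magnitude-weighted average of the frequency components
    magnitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return np.sum(freqs * magnitude) / np.sum(magnitude)

t = np.arange(128)
signal = np.sin(2 * np.pi * 8 * t / 128)  # 8 full cycles in one 128-sample window
print(iqr(signal), mad(signal), mean_freq(signal))
```

For a pure sinusoid, `mean_freq` recovers the signal's own frequency (8/128 cycles per sample here), which is a quick sanity check for the implementation.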

In [None]:
train_data = train_df.drop(["surface_label"], axis = 1)
train_data.columns[3:]

In [None]:
# Signal magnitude area: mean of |x| + |y| + |z| over a window
def sma(x, y, z):
    return np.mean(np.abs(x) + np.abs(y) + np.abs(z))

In [None]:
# Compute SMA separately for each series (one 128-reading window) and copy it to that series' rows,
# so the feature differs between series instead of being one constant for the whole file
for data in (train_data, test_df):
    per_series = data.groupby('series_id').apply(
        lambda g: sma(g['angular_velocity_X'], g['angular_velocity_Y'], g['angular_velocity_Z']))
    data['sma'] = data['series_id'].map(per_series)

In [None]:
# https://www.kaggle.com/jesucristo/1-robots-eda-rf-predictions-0-72
def feat_eng(data):
    
    df = pd.DataFrame()
    # Vector magnitudes: every component is squared before taking the square root
    data['totl_anglr_vel'] = (data['angular_velocity_X']**2 + data['angular_velocity_Y']**2 +
                             data['angular_velocity_Z']**2)** 0.5
    data['totl_linr_acc'] = (data['linear_acceleration_X']**2 + data['linear_acceleration_Y']**2 +
                             data['linear_acceleration_Z']**2)**0.5
    data['totl_xyz'] = (data['orientation_X']**2 + data['orientation_Y']**2 +
                             data['orientation_Z']**2)**0.5
    data['acc_vs_vel'] = data['totl_linr_acc'] / data['totl_anglr_vel']
    
    for col in data.columns:
        if col in ['row_id','series_id','measurement_number']:
            continue
        # Group once per column; each aggregate below yields one value per series
        grouped = data.groupby('series_id')[col]
        df[col + '_mean'] = grouped.mean()
        df[col + '_median'] = grouped.median()
        df[col + '_max'] = grouped.max()
        df[col + '_min'] = grouped.min()
        df[col + '_std'] = grouped.std()
        df[col + '_q25'] = grouped.quantile(0.25)
        #df[col + '_q50'] = grouped.quantile(0.5)
        df[col + '_q75'] = grouped.quantile(0.75)
        #df[col + '_mad'] = grouped.mad()
        #df[col + '_skew'] = grouped.skew()
        df[col + '_range'] = df[col + '_max'] - df[col + '_min']
        #df[col + '_maxtoMin'] = df[col + '_max'] / df[col + '_min']
        df[col + '_mean_abs_chg'] = grouped.apply(lambda x: np.mean(np.abs(np.diff(x))))
        df[col + '_abs_max'] = grouped.apply(lambda x: np.max(np.abs(x)))
        df[col + '_abs_min'] = grouped.apply(lambda x: np.min(np.abs(x)))
        df[col + '_abs_avg'] = (df[col + '_abs_min'] + df[col + '_abs_max'])/2
        #df[col + '_angle'] = grouped.apply(lambda x: np.angle(x, deg = True))
        #df[col + 'perm_entropy'] = grouped.apply(lambda x: ent.permutation_entropy(x, order = 3, normalize = True))
    return df

In [None]:
X_train = feat_eng(train_data)
test_data = feat_eng(test_df)
print(X_train.shape)
X_train.head()

In [None]:
#tsne_data = train_df.drop("surface_label", axis = 1)
tsne_data = X_train
#sampled = tsne_data[0:3810]
x_tsne = MinMaxScaler().fit_transform(tsne_data) 
y_tsne = y_train["surface"]
# Convert nan to num
x_tsne = np.nan_to_num(x_tsne)

In [None]:
# Performs t-SNE with different perplexity values and draws the respective plots

def perform_tsne(X_data, y_data, perplexities, n_iter=1000, img_name_prefix='t-sne'):
    # Note: n_iter is used only in the messages, plot title and file name;
    # TSNE itself runs with its default number of iterations
    for index,perplexity in enumerate(perplexities):
        # perform t-sne
        print('\nperforming tsne with perplexity {} and with {} iterations at max'.format(perplexity, n_iter))
        X_reduced = TSNE(verbose=2, perplexity=perplexity).fit_transform(X_data)
        print('Done..')
        
        # prepare the data for seaborn         
        print('Creating plot for this t-sne visualization..')
        df = pd.DataFrame({'x':X_reduced[:,0], 'y':X_reduced[:,1] ,'label':np.asarray(y_data)})
        
        # draw the scatter plot, colored by class label
        sns.lmplot(data=df, x='x', y='y', hue='label', fit_reg=False, height=8,
                   palette="Set1")
        plt.title("perplexity : {} and max_iter : {}".format(perplexity, n_iter))
        img_name = img_name_prefix + '_perp_{}_iter_{}.png'.format(perplexity, n_iter)
        print('saving this plot as image in present working directory...')
        plt.savefig(img_name)
        plt.show()
        print('Done')


In [None]:
# Call method to plot tsne
perform_tsne(X_data = x_tsne,y_data = y_tsne, perplexities = [2, 5, 10, 20, 50])

- The classes are not fully separated, but points of the same class do cluster together reasonably well, which suggests the classes can be separated in a higher-dimensional space. We may also get a cleaner plot (all points of the same class in one group) by changing the perplexity and the number of iterations.

# **Function to plot confusion matrix**

In [None]:
plt.rcParams["font.family"] = 'DejaVu Sans'

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# **Generic method to run any model**

In [None]:
def perform_model(test_data, model, X_train, y_train, X_test, y_test, class_labels, cm_normalize=True, \
                 print_cm=True, cm_cmap=plt.cm.Greens):
    
    
    # to store results at various phases
    results = dict()
    
    # time at which model starts training 
    train_start_time = datetime.now()
    print('training the model..')
    model.fit(X_train, y_train)
    print('Done \n \n')
    train_end_time = datetime.now()
    results['training_time'] =  train_end_time - train_start_time
    print('training_time(HH:MM:SS.ms) - {}\n\n'.format(results['training_time']))
    
    
    # predict test data
    print('Predicting test data')
    test_start_time = datetime.now()
    y_pred = model.predict(X_test)
    prediction = model.predict(test_data)
    #y_pred = np.argmax(y_pred, axis=1)
    test_end_time = datetime.now()
    print('Done \n \n')
    results['testing_time'] = test_end_time - test_start_time
    print('testing time(HH:MM:SS:ms) - {}\n\n'.format(results['testing_time']))
    results['predicted'] = y_pred
   
    # calculate overall accuracy of the model
    accuracy = metrics.accuracy_score(y_true=y_test, y_pred=y_pred)
    # store accuracy in results
    results['accuracy'] = accuracy
    print('---------------------')
    print('|      Accuracy      |')
    print('---------------------')
    print('\n    {}\n\n'.format(accuracy))
    
    
    # confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    results['confusion_matrix'] = cm
    if print_cm: 
        print('--------------------')
        print('| Confusion Matrix |')
        print('--------------------')
        print('\n {}'.format(cm))
        
    # plot confusion matrix (row-normalized when cm_normalize is True)
    plt.figure(figsize=(8,8))
    plt.grid(False)
    plot_confusion_matrix(cm, classes=class_labels, normalize=cm_normalize,
                          title='Normalized confusion matrix' if cm_normalize else 'Confusion matrix', cmap = cm_cmap)
    plt.show()
    
    # get classification report
    print('-------------------------')
    print('| Classification Report |')
    print('-------------------------')
    classification_report = metrics.classification_report(y_test, y_pred)
    # store report in results
    results['classification_report'] = classification_report
    print(classification_report)
    
    # add the trained  model to the results
    results['model'] = model
    
    return prediction, results
    
    

# **Method to print grid search attributes**

In [None]:
def print_grid_search_attributes(model):
    # Estimator that gave highest score among all the estimators formed in GridSearch
    print('--------------------------')
    print('|      Best Estimator     |')
    print('--------------------------')
    print('\n\t{}\n'.format(model.best_estimator_))


    # parameters that gave best results while performing grid search
    print('--------------------------')
    print('|     Best parameters     |')
    print('--------------------------')
    print('\tParameters of best estimator : \n\n\t{}\n'.format(model.best_params_))


    #  number of cross validation splits
    print('---------------------------------')
    print('|   No of CrossValidation sets   |')
    print('---------------------------------')
    print('\n\tTotal number of cross validation sets: {}\n'.format(model.n_splits_))


    # Average cross validated score of the best estimator, from the Grid Search 
    print('--------------------------')
    print('|        Best Score       |')
    print('--------------------------')
    print('\n\tAverage Cross Validate scores of best estimator : \n\n\t{}\n'.format(model.best_score_))

In [None]:
X_train = np.nan_to_num(X_train)
#X_test = np.nan_to_num(X_test)
test_data = np.nan_to_num(test_data)

In [None]:
X_train.shape, y_train["surface_label"].shape

In [None]:
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train["surface_label"], test_size = 0.3, random_state = 4, stratify = y_train["surface_label"], shuffle = True)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

# **Logistic Regression with Hyperparameter Tuning**

In [None]:
# start Grid search
labels = ['fine_concrete', 'concrete', 'soft_tiles', 'tiled', 'soft_pvc', 'hard_tiles_large_space', 'carpet', 'hard_tiles', 'wood']
parameters = {'C':[0.01, 0.1, 1, 10, 20, 30], 'penalty':['l2','l1']}
# liblinear supports both the l1 and l2 penalties in the grid
log_reg = linear_model.LogisticRegression(solver = 'liblinear')
log_reg_grid = GridSearchCV(log_reg, param_grid = parameters, cv = 10, verbose = 1, n_jobs = -1)
predict_lr, log_reg_grid_results =  perform_model(test_data, log_reg_grid, X_train, y_train, X_test, y_test, class_labels = labels)

In [None]:
# Confusion matrix
plt.figure(figsize = (8,8))
plt.grid(False)
plot_confusion_matrix(log_reg_grid_results['confusion_matrix'], classes = labels, cmap = plt.cm.Greens)
plt.show()

In [None]:
# observe the attributes of the model 
print_grid_search_attributes(log_reg_grid_results['model'])

- Accuracy is low, and the confusion matrix shows which classes are mistaken for one another. We will try another model.

# **GBDT With Hyperparameter Tuning**

In [None]:
# Model
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'max_depth': np.arange(5,8,1), \
             'n_estimators':np.arange(130,170,10)}
gbdt = GradientBoostingClassifier()
gbdt_grid = GridSearchCV(gbdt, param_grid=param_grid, n_jobs=-1)
predict_gbdt, gbdt_grid_results = perform_model(test_data, gbdt_grid, X_train, y_train, X_test, y_test, class_labels=labels)
print_grid_search_attributes(gbdt_grid_results['model'])

# **RandomForest Classifier with Hyperparameter Tuning**

In [None]:
# Model
params = {'n_estimators': np.arange(10,201,20), 'max_depth':np.arange(3,15,2)}
rfc = RandomForestClassifier()
rfc_grid = GridSearchCV(rfc, param_grid=params, cv = 10, n_jobs=-1)
labels = ['fine_concrete', 'concrete', 'soft_tiles', 'tiled', 'soft_pvc', 'hard_tiles_large_space', 'carpet', 'hard_tiles', 'wood']
predict_rf, rfc_grid_results = perform_model(test_data, rfc_grid, X_train, y_train, X_test, y_test, class_labels = labels)
print_grid_search_attributes(rfc_grid_results['model']) 

In [None]:
# Submission 
submission = pd.read_csv("../input/sample_submission.csv")
submission["surface"] = predict_rf
submission["surface"] = submission["surface"].map({0:'fine_concrete', 1:'concrete', 2:'soft_tiles', 3:'tiled', 4:'soft_pvc',
       5:'hard_tiles_large_space', 6:'carpet', 7:'hard_tiles', 8:'wood'}) 
submission.to_csv("sample_submission.csv", index = False)

# Stay Tuned...........