# Starter Code For CareerCon Challenge
"To fully understand and properly navigate a task, however, they need input about their environment. In this competition, you’ll help robots recognize the floor surface they’re standing on using data collected from Inertial Measurement Units (IMU sensors)."

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Loading Data
We now load the training data, the test data, and the training labels. The goal is to predict one surface label for each series_id. X_train has 487680 rows and y_train has 3810 labels, which works out to 128 measurements per series.

In [None]:
train = pd.read_csv('../input/X_train.csv')
test = pd.read_csv('../input/X_test.csv')
y = pd.read_csv('../input/y_train.csv')
print(train.head())
print(train.columns)
print(y.head())
print("Length of Train", len(train))
print("Length of Y Labels", len(y))
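The row counts above imply 487680 / 3810 = 128 measurements per series. Here is a minimal sketch of how one might check that every series has the same length. It uses a tiny synthetic frame in place of X_train.csv, so the `toy` data and its sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for X_train.csv: 3 series with 4 measurements each.
toy = pd.DataFrame({
    "series_id": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "measurement_number": [0, 1, 2, 3] * 3,
})

# Number of rows recorded for each series_id.
rows_per_series = toy.groupby("series_id").size()

# True when every series has the same number of measurements.
is_uniform = rows_per_series.nunique() == 1
print(rows_per_series.tolist(), is_uniform)
```

On the real data, `train.groupby('series_id').size()` should report 128 for every series if the counts above hold.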

## Standardizing Columns

In [None]:
# Standardize all Columns that are not ID's or measurement numbers
col = train.columns[3:]
scaler = StandardScaler()
# scale the columns that contain the data
new_df = scaler.fit_transform(train[col])
new_df = pd.DataFrame(new_df, columns=col)
# Add back index
new_df["series_id"] = train['series_id']
new_df.head()
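For reference, `StandardScaler` centers each column to zero mean and divides it by its population standard deviation. The sketch below reproduces that arithmetic with NumPy on made-up numbers. It also reuses the training statistics for a hypothetical test array, which is the usual way to keep train and test on the same scale.

```python
import numpy as np

# Made-up sensor readings: rows are measurements, columns are channels.
train_vals = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_vals = np.array([[2.0, 40.0]])  # hypothetical test rows

# Statistics come from the training data only.
mean = train_vals.mean(axis=0)
std = train_vals.std(axis=0)  # ddof=0, the population standard deviation

train_scaled = (train_vals - mean) / std
test_scaled = (test_vals - mean) / std  # same mean/std as train

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

With scikit-learn, the second step corresponds to calling `scaler.transform(test[col])` after `scaler.fit_transform(train[col])`.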

## Encoding Y Labels
There are 9 different surface types. We encode them as integers so that our models can predict the label.

In [None]:
np.unique(y['surface'])

In [None]:
y = pd.read_csv('../input/y_train.csv')
y['surface'].value_counts().plot(kind='bar')

In [None]:
le = LabelEncoder()
y = le.fit_transform(y['surface'])
y
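`LabelEncoder` assigns integer codes to the sorted unique class names. The sketch below mimics that mapping with `np.unique` on a few made-up labels, to show how the codes map back to surface names.

```python
import numpy as np

labels = np.array(["wood", "carpet", "tiled", "carpet"])  # made-up sample

# classes: sorted unique names; codes: index of each label in classes.
classes, codes = np.unique(labels, return_inverse=True)
print(classes)  # ['carpet' 'tiled' 'wood']
print(codes)    # [2 0 1 0]

# Decoding is just indexing back into the sorted class array.
decoded = classes[codes]
```

This is what `le.inverse_transform` relies on later: an integer prediction is looked up in the stored class array.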

## Feature Engineering
We need one prediction per series, so we use groupby to aggregate each series_id's time series into a single row of summary features, giving 3810 rows.

Taken from: https://www.kaggle.com/jsaguiar/surface-recognition-baseline

In [None]:
def change1(x):
    # mean absolute difference between consecutive samples
    return np.mean(np.abs(np.diff(x)))

def change2(x):
    # mean change of the absolute consecutive differences
    return np.mean(np.diff(np.abs(np.diff(x))))

def feature_extraction(raw_frame):
    # work on a copy so the caller's frame is not modified
    raw_frame = raw_frame.copy()
    frame = pd.DataFrame()
    raw_frame['angular_velocity'] = raw_frame['angular_velocity_X'] + raw_frame['angular_velocity_Y'] + raw_frame['angular_velocity_Z']
    raw_frame['linear_acceleration'] = raw_frame['linear_acceleration_X'] + raw_frame['linear_acceleration_Y'] + raw_frame['linear_acceleration_Z']
    raw_frame['velocity_to_acceleration'] = raw_frame['angular_velocity'] / raw_frame['linear_acceleration']
    #raw_frame['acceleration_cumsum'] = raw_frame['linear_acceleration'].cumsum()

    # Aggregate every signal column. Selecting by name instead of columns[3:]
    # keeps orientation_X/Y/Z and leaves out series_id, which sits last in new_df.
    id_cols = ('row_id', 'series_id', 'measurement_number')
    feature_cols = [c for c in raw_frame.columns if c not in id_cols]
    grouped = raw_frame.groupby('series_id')

    for col in feature_cols:
        frame[col + '_mean'] = grouped[col].mean()
        frame[col + '_std'] = grouped[col].std()
        frame[col + '_max'] = grouped[col].max()
        frame[col + '_min'] = grouped[col].min()
        frame[col + '_max_to_min'] = frame[col + '_max'] / frame[col + '_min']

        # Change 1st order
        frame[col + '_mean_abs_change'] = grouped[col].apply(change1)
        # Change 2nd order
        #frame[col + '_mean_abs_change2'] = grouped[col].apply(change2)
        frame[col + '_abs_max'] = grouped[col].apply(lambda x: np.max(np.abs(x)))
    return frame

train_df = feature_extraction(new_df)
len(train_df)
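To make the two change features concrete, here is a small worked example on a made-up four-sample signal. The functions are repeated so the snippet runs on its own.

```python
import numpy as np

def change1(x):
    # Mean absolute first difference.
    return np.mean(np.abs(np.diff(x)))

def change2(x):
    # Mean change of the absolute first differences.
    return np.mean(np.diff(np.abs(np.diff(x))))

x = np.array([1.0, 3.0, 2.0, 5.0])
# np.diff(x) -> [2, -1, 3]; absolute values -> [2, 1, 3]
first = change1(x)   # (2 + 1 + 3) / 3 = 2.0
second = change2(x)  # diff of [2, 1, 3] is [-1, 2]; mean = 0.5
print(first, second)
```

`change1` grows with how much the signal jumps between samples, while `change2` indicates whether those jumps are getting larger or smaller over the series.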

## Light Gradient Boosting
We will now classify the surfaces using LightGBM (Light Gradient Boosting Machine).

In [None]:
import lightgbm as lgb
import time
num_folds = 10
target = y

params = {
    'num_leaves': 18,
    'min_data_in_leaf': 40,
    'objective': 'multiclass',
    'metric': 'multi_error',
    'max_depth': 8,
    'learning_rate': 0.01,
    "boosting": "gbdt",
    "bagging_freq": 5,
    "bagging_fraction": 0.812667,
    "bagging_seed": 11,
    "verbosity": -1,
    'reg_alpha': 0.2,
    'reg_lambda': 0,
    "num_class": 9,
    'nthread': -1
}

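t0 = time.time()
train_set = lgb.Dataset(train_df, label=target)
# Cross-validate to pick the number of boosting rounds
# (stops after 100 rounds without improvement).
# Note: this call signature targets LightGBM < 4.0; newer releases take
# callbacks=[lgb.early_stopping(100)] instead of early_stopping_rounds.
eval_hist = lgb.cv(params, train_set, nfold=num_folds, num_boost_round=9999,
                   early_stopping_rounds=100, seed=19)
num_rounds = len(eval_hist['multi_error-mean'])
# retrain the model on the full training set with the selected number of rounds
clf = lgb.train(params, train_set, num_boost_round=num_rounds)

print("Timer: {:.1f}s".format(time.time() - t0))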

In [17]:
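# Class probabilities for the training features; the test set has not been
# scaled or aggregated in this notebook, so these are not test predictions.
predictions = clf.predict(train_df)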

In [27]:
y_pred = np.argmax(predictions, axis = 1)
le.inverse_transform(y_pred)

array(['fine_concrete', 'concrete', 'concrete', ..., 'fine_concrete',
       'tiled', 'soft_pvc'], dtype=object)
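With the `multiclass` objective, `predict` returns one row of class probabilities per sample, and `np.argmax` picks the most likely class index. Here is a minimal sketch of that decoding step on made-up probabilities for three hypothetical classes:

```python
import numpy as np

classes = np.array(["carpet", "concrete", "tiled"])  # hypothetical class order
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

y_idx = np.argmax(probs, axis=1)  # index of the largest probability per row
y_names = classes[y_idx]          # same lookup le.inverse_transform performs
print(y_names)  # ['concrete' 'carpet' 'tiled']
```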

## Extra Links for Further Exploring!
- PyTorch LSTM: https://www.kaggle.com/artgor/basic-pytorch-lstm
- Complete EDA with Model Analysis: https://www.kaggle.com/artgor/where-do-the-robots-drive
- Current Best Score: https://www.kaggle.com/jesucristo/1-robots-eda-rf-cv-predictions-0-73