# CareerCon 2019 - Help Navigate Robots
## This is a first take on this data. The main point is to perform some exploratory data analysis so we can:

#### 1. Understand better what the data is about
#### 2. Understand the problem



In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

## Reading the data

We have two training .csv files: one with the response variable and one with all the explanatory variables. There is also a test file and a sample submission file.

In [None]:

data_train_path = '../input/X_train.csv'
response_train_path = '../input/y_train.csv'
test_data_path = '../input/X_test.csv'
sub_data_path = '../input/sample_submission.csv'

data_train = pd.read_csv(data_train_path)
response_train = pd.read_csv(response_train_path)
data_test = pd.read_csv(test_data_path)
sub_data = pd.read_csv(sub_data_path)

### Going into point 1 of this notebook, let's see what the data looks like

The 3 IDs in the data set are defined as:

* row_id : The ID for this row.
* series_id: ID number for the measurement series. Foreign key to y_train/sample_submission.
* measurement_number: Measurement number within the series.

In [None]:
data_train.head()

In [None]:
data_train.shape

Let's check what's up with the IDs first.
* So far nothing wrong with *row_id*: it should have as many unique values as there are rows in the train set.
* *series_id* should have as many unique values as there are rows in the response set, since it is its foreign key.
* *measurement_number* should count the measurements within each value of *series_id*.

In [None]:
print( 'Counts row_id: ' + str(len(data_train['row_id'].unique())),
      '\nCounts series_id: ' + str(len(data_train['series_id'].unique())),
      '\nCounts measurement_number: ' + str(len(data_train['measurement_number'].unique())) 
     )

In [None]:
#a result of 0 means every value in 'series_id' holds exactly 128 measurements
sum(data_train.groupby('series_id')['measurement_number'].count() !=128)
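
The same ID checks can also be written with `nunique` and `groupby(...).size()`. The sketch below is only an illustration: it runs on a small synthetic frame (3 series of 4 measurements, both sizes assumed) rather than on the competition data, and the `toy` name is hypothetical.

```python
import pandas as pd

# Synthetic stand-in for X_train: 3 series with 4 measurements each (assumed sizes).
n_series, n_meas = 3, 4
toy = pd.DataFrame({
    'series_id': [s for s in range(n_series) for _ in range(n_meas)],
    'measurement_number': [m for _ in range(n_series) for m in range(n_meas)],
})
toy['row_id'] = toy['series_id'].astype(str) + '_' + toy['measurement_number'].astype(str)

# Number of distinct values per ID column.
id_counts = toy[['row_id', 'series_id', 'measurement_number']].nunique()

# Number of rows in each series; every series should have the same length.
series_lengths = toy.groupby('series_id').size()
all_same_length = bool((series_lengths == n_meas).all())

print(id_counts)
print(all_same_length)
```

On the real train set the expected length would be 128 instead of `n_meas`.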

Then we have:
* 4 variables for orientation, 3 for angular velocity and 3 more for linear acceleration

Before checking what is up with them, let's quickly see what the response set is about:

#### response training set exploration

In [None]:
response_train.head()

Let's try to understand what *group_id* means

In [None]:
#73 numbers for group id
len(response_train['group_id'].unique())

In [None]:
series_bygroupid = response_train.groupby('series_id')['group_id'].max()
groupid_counts = response_train.groupby('series_id')['group_id'].count()

series_bygroupid.head()

In [None]:
#number of series_id values that have exactly one group_id entry
(groupid_counts == 1).sum()

Let's plot *group_id* for every *series_id*:

In [None]:
series_bygroupid.plot()

OK, so what is this?

We've plotted *series_id* against its maximum value of *group_id* (which happens to be its only value), and we can clearly see a change between the two halves of the set.

A *group_id* is a number shared by all the series recorded in the same measurement session. We can see a clear distinction between the series before ~1750 and those after.

That could mean the series are not properly shuffled and that there are two groups which were sampled separately.

* could we create a feature out of this? binary variable for the two groups, would it be meaningful?
* Is there any difference between those two groups?
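
A minimal sketch of such a binary feature, assuming a simple cut-off on *series_id*. The frame, the `late_block` column name and the `split_at` value are all hypothetical; on the real data the plot only suggests a change somewhere around ~1750.

```python
import pandas as pd

# Synthetic stand-in for y_train (hypothetical values).
toy_response = pd.DataFrame({
    'series_id': [0, 1, 2, 3, 4, 5],
    'group_id': [13, 13, 31, 40, 40, 52],
})

# Assumed cut-off between the two blocks of series.
split_at = 3

# 0 for series before the cut-off, 1 for series at or after it.
toy_response['late_block'] = (toy_response['series_id'] >= split_at).astype(int)

# Compare how many distinct groups fall on each side of the split.
groups_per_block = toy_response.groupby('late_block')['group_id'].nunique()
print(groups_per_block)
```

Whether the flag carries any signal would still have to be checked, for example by comparing the surface distribution in the two blocks.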

In [None]:
#we've got 9 different categories
len(response_train['surface'].unique())

The names seem coherent and descriptive, so their meaning could hold some value.

In [None]:
#let's see their names
response_train['surface'].unique()

Let's see how these categories are distributed across the different groups found. 

Here we can see how often each category appears in the set, with *hard_tiles* and *carpet* being the least frequent ones.

In [None]:
response_train.groupby('surface')['group_id'].count()

We can see that they are not evenly distributed. For example, *hard_tiles* appears in only one *group_id*.

In [None]:
response_train.groupby('surface')['group_id'].nunique()

In [None]:
response_train['surface'] = response_train['surface'].astype('category')

Here we can see how each group is assigned solely to one category. 

In [None]:
sum(response_train.groupby('group_id')['surface'].nunique() != 1)

#### Now let's go back to the train set with the explanatory variables and let's try to understand the rest of it

OK, all of the sensor variables are float64 types

In [None]:
data_train.dtypes

We can see that, for the ones referring to orientation:

     X and Y move between ~1 and ~-1 , and Z and W between ~0.16 and ~-0.16
     Std is ~2/3 of their maximum and minimum values
     their means are all around 0
     
We see for the angular velocity variables:

      how the max values are between 1 and 2.28
      stds are just ~0.1
      mean is also ~0

For linear accelerations: 
    
      max is 36.8 for X, 73 for Y, 65.8 for Z
      std 1.9, 2.1 and 2.8
      mean 0.13, 2.9, -9.34



In [None]:
data_train.describe()

Let's start with the orientation axes

*"The orientation channels encode the current angles how the robot is oriented as a quaternion"*

A unit quaternion is defined as:

$$ \mathbf{q} = \begin{bmatrix} q_w & q_x & q_y & q_z \end{bmatrix}^T $$


$$|\mathbf{q}|^2 = q_w^2 + q_x^2 + q_y^2 + q_z^2 = 1$$

The squared norms computed from the data are indeed either 1 or very close to it:

In [None]:
unit_quat = (data_train['orientation_W']**2 +
             data_train['orientation_X']**2 +
             data_train['orientation_Y']**2 +
             data_train['orientation_Z']**2)

unit_quat.head()
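
Instead of eyeballing the first rows, the unit-norm condition can be checked for every row with a tolerance. The sketch below uses two hypothetical quaternions in (w, x, y, z) order, and the `atol` value is an assumption, not something derived from the data.

```python
import numpy as np

# Two hypothetical unit quaternions in (w, x, y, z) order.
quats = np.array([
    [1.0, 0.0, 0.0, 0.0],                              # identity rotation
    [np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)],  # 45 degrees about Z
])

# Squared norm per row; unit quaternions give 1 up to rounding error.
sq_norm = (quats ** 2).sum(axis=1)

# True where the squared norm is within the assumed tolerance of 1.
is_unit = np.isclose(sq_norm, 1.0, atol=1e-6)
print(is_unit.all())
```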

The following formula can be used to transform these variables into Euler angles, writing $q_0 = q_w$, $q_1 = q_x$, $q_2 = q_y$ and $q_3 = q_z$:

$$ \begin{bmatrix}
\phi \\ \theta \\ \psi
\end{bmatrix} =
\begin{bmatrix}
\mbox{atan2}  (2(q_0 q_1 + q_2 q_3),1 - 2(q_1^2 + q_2^2)) \\
\mbox{asin} (2(q_0 q_2 - q_3 q_1)) \\
\mbox{atan2}  (2(q_0 q_3 + q_1 q_2),1 - 2(q_2^2 + q_3^2))
\end{bmatrix} $$


More intuition on quaternions can be built here: https://eater.net/quaternions/
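
The Euler angle formula can be sketched in plain Python as follows. This assumes $q_0$ is the scalar part (W) and $q_1, q_2, q_3$ are X, Y, Z. The function name `quaternion_to_euler` is hypothetical, and the clamp before `asin` is an added safeguard that is not part of the formula itself.

```python
import math

def quaternion_to_euler(w, x, y, z):
    """Convert a unit quaternion (w, x, y, z) to (roll, pitch, yaw) in radians."""
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to [-1, 1] so rounding error cannot push asin out of its domain.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

# A 90 degree rotation about the Z axis should give a yaw of pi/2.
half = math.pi / 4
roll, pitch, yaw = quaternion_to_euler(math.cos(half), 0.0, 0.0, math.sin(half))
print(roll, pitch, yaw)
```

Applied row by row to the four orientation columns, this would give three angle columns that may be easier to interpret than the raw quaternion components.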

Let's check the angular velocity variables

*"the angular velocity of a particle is the rate at which its angular position about a chosen center point changes"*

In [None]:
data_train['angular_velocity_Y'].plot()

This matches what we saw in the descriptive statistics of the dataset: the angular velocities are centered around a mean of ~0 and look stationary

Now let's dive into the linear acceleration ones.

In [None]:
data_train['linear_acceleration_X'].plot()

In [None]:
data_train['linear_acceleration_Y'].plot()

In [None]:
data_train['linear_acceleration_Z'].plot()

## Moving on to point 2. Let's now focus on the problem

The problem consists of predicting which of those 9 surface types the robot is moving on, based on its sensor data.

So we are going to start with a baseline model that we will then try to optimize.

In [None]:
data_merge = pd.merge(data_train, response_train, on = 'series_id')

In [None]:
#we check all series have only one category assigned
sum(data_merge.groupby('series_id')['surface'].nunique() !=  1)

Now let's drop all the ID variables except series_id, so we can group by series_id and take the mean of each axis variable (orientation, acceleration, velocity)

In [None]:
data_clus = data_train.drop(['row_id','measurement_number'], axis=1)

In [None]:
data_clusmean = data_clus.groupby('series_id').mean()

Now let's fit the data with 9 clusters, as many as the categories we have to predict

In [None]:
kmeans = KMeans(n_clusters=9, random_state=0).fit(data_clusmean)

In [None]:
#let's get the labels as a dataframe, paired with the series_id index of data_clusmean, so we can merge it
labels_clus = pd.DataFrame({'labels': kmeans.labels_, 'series_id': data_clusmean.index})

In [None]:
labels_clus.columns = ['labels', 'series_id']

In [None]:
response_labeled = pd.merge(response_train, labels_clus, on='series_id')

Now we can check which are the most frequent categories for every label

In [None]:
freq_catlabel = response_labeled.groupby('labels')['surface'].value_counts()

In [None]:
freq_catlabel

OK, so let's create a dictionary with the most frequent category for every cluster label. Let's not be meticulous and just go with the most frequent ones, no matter what.

In [None]:
freq_labels_dict = ({0:'concrete', 1:'soft_pvc',2:'wood', 3:'concrete',
                    4:'soft_pvc',5:'concrete',6:'tiled',7:'soft_tiles', 8:'tiled'})

In [None]:
freq_labels_dict
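
The mapping above was typed in by hand from the counts. As a hedged alternative, it could be derived from a cross-tabulation. The sketch below uses a small synthetic frame with hypothetical values, and the names `toy_labeled`, `majority_map` and `train_accuracy` are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for response_labeled (hypothetical values).
toy_labeled = pd.DataFrame({
    'labels':  [0, 0, 0, 1, 1, 2, 2, 2],
    'surface': ['concrete', 'concrete', 'wood', 'tiled', 'tiled',
                'wood', 'wood', 'carpet'],
})

# Rows: cluster labels, columns: surfaces, values: counts.
counts = pd.crosstab(toy_labeled['labels'], toy_labeled['surface'])

# For each cluster, pick the surface with the highest count.
majority_map = counts.idxmax(axis=1).to_dict()

# Share of series whose surface matches the majority surface of their cluster.
train_accuracy = (toy_labeled['labels'].map(majority_map) == toy_labeled['surface']).mean()
print(majority_map, train_accuracy)
```

The accuracy computed this way is measured on the same data used to build the mapping, so it is an optimistic estimate rather than a proxy for the leaderboard score.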

So, let's check it on the test set

In [None]:
data_test_cluster = data_test.drop(['measurement_number', 'row_id'],axis=1)
data_test_cluster_group = data_test_cluster.groupby('series_id').mean()

In [None]:
#predict fitted clusters on the test
y_pred_test = kmeans.predict(data_test_cluster_group)

In [None]:
y_pred_test = pd.DataFrame(y_pred_test)

In [None]:
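#take series_id from the grouped test frame so predictions stay aligned with their series
y_pred_test['series_id']  = data_test_cluster_group.index.values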

In [None]:
y_pred_test.columns = ['surface', 'series_id']

In [None]:
y_pred_test['surface']  = y_pred_test['surface'].map(freq_labels_dict)

In [None]:
y_pred_test = y_pred_test[['series_id', 'surface']]

Let's save the submission. After submission the score was 0.21.

In [None]:
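#'submission.csv' is an assumed file name; index=False keeps only the series_id and surface columns
#y_pred_test.to_csv('submission.csv', index=False)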