# Safety
### Based on telematics data, how might we detect if the driver is driving dangerously?  
Given the telematics data for each trip and a label indicating whether the trip was tagged as dangerous driving, derive a model that can detect dangerous trips.

Submission by: Xavier M. Puspus  
Email: xpuspus@gmail.com  
Country: Philippines  

The dataset contains telematics readings recorded during trips, each identified by a `bookingID`. A separate label file assigns every trip a label of 1 or 0 to indicate whether it involved dangerous driving. Note that labels are assigned per trip, while a single trip can contain thousands of telematics data points, so participants are expected to engineer trip-level features from the telematics data before training models.
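
To illustrate what turning per-reading data into per-trip features can look like, here is a minimal sketch using a pandas group-by. It is not the `process_data` helper used later in this notebook, whose internals are not shown here; the toy values and the choice of summary statistics are assumptions made only for illustration.

```python
import pandas as pd

# Toy telematics readings: two trips with a few readings each (made-up values).
readings = pd.DataFrame({
    "bookingID": [0, 0, 0, 1, 1],
    "second": [0.0, 1.0, 2.0, 0.0, 1.0],
    "Speed": [3.0, 5.0, 4.0, 10.0, 14.0],
    "gyro_z": [0.1, -0.1, 0.0, 0.5, -0.5],
})

# Collapse each trip into a single row of summary statistics.
trip_features = (
    readings.groupby("bookingID")[["Speed", "gyro_z"]].agg(["mean", "max", "std"])
)

# Flatten the two-level column index into names such as "Speed_mean".
trip_features.columns = ["_".join(col) for col in trip_features.columns]
print(trip_features)
```

Each trip ends up as one row, which matches the one-label-per-trip structure of the label file.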



In [1]:
# Reload dependencies dynamically
%load_ext autoreload
%autoreload 2

### Load Dependencies

In [2]:
import glob
import numpy as np  # needed for np.sqrt in the feature engineering cells
import pandas as pd
from utils.utils import load_from_directory, process_data, show_confusion_matrix
from sklearn.metrics import confusion_matrix as cf
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from imblearn.over_sampling import SMOTE as sm
from sklearn.metrics import roc_curve as roc
from sklearn.metrics import auc as auc
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
import joblib  # sklearn.externals.joblib is deprecated and removed from recent scikit-learn releases
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

### Load Data

In [3]:
# Place data in directory data/
FEATURES_PATH = 'data/safety/features'
LABELS_PATH = 'data/safety/labels'

# Load data
features_df = load_from_directory(FEATURES_PATH)
features_df = features_df.sort_values(['bookingID', 'second'])#.set_index('second')
labels_df = load_from_directory(LABELS_PATH)

# Group by booking and take the max label so that bookings with conflicting labels default to dangerous (1)
labels_df = labels_df.groupby('bookingID').max().reset_index()

all_data = pd.merge(features_df, labels_df, on='bookingID', how='left')
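
The deduplication step above can be checked on a small example. The sketch below uses a made-up label table (not the challenge data) to show how `groupby(...).max()` resolves a booking that appears with both labels, and how one might count such bookings first.

```python
import pandas as pd

# Toy label file in which booking 7 appears twice with conflicting labels.
raw_labels = pd.DataFrame({"bookingID": [3, 7, 7, 9], "label": [0, 0, 1, 0]})

# Count bookings that carry more than one distinct label.
n_conflicting = int((raw_labels.groupby("bookingID")["label"].nunique() > 1).sum())
print("bookings with conflicting labels:", n_conflicting)

# max() keeps label 1 whenever any row marks the trip as dangerous.
deduped = raw_labels.groupby("bookingID").max().reset_index()
print(deduped)
```

After this step every `bookingID` appears exactly once, so the left merge onto the features cannot duplicate telematics rows.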

In [4]:
all_data.head()

| | bookingID | Accuracy | Bearing | acceleration_x | acceleration_y | acceleration_z | gyro_x | gyro_y | gyro_z | second | Speed | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 12.0 | 143.298294 | 0.818112 | -9.941461 | -2.014999 | -0.016245 | -0.09404 | 0.070732 | 0.0 | 3.442991 | 0 |
| 1 | 0 | 8.0 | 143.298294 | 0.546405 | -9.83559 | -2.038925 | -0.047092 | -0.078874 | 0.043187 | 1.0 | 0.228454 | 0 |
| 2 | 0 | 8.0 | 143.298294 | -1.706207 | -9.270792 | -1.209448 | -0.028965 | -0.032652 | 0.01539 | 2.0 | 0.228454 | 0 |
| 3 | 0 | 8.0 | 143.298294 | -1.416705 | -9.548032 | -1.860977 | -0.022413 | 0.005049 | -0.025753 | 3.0 | 0.228454 | 0 |
| 4 | 0 | 8.0 | 143.298294 | -0.598145 | -9.853534 | -1.378574 | -0.014297 | -0.046206 | 0.021902 | 4.0 | 0.228454 | 0 |


In [5]:
all_data.columns

Index(['bookingID', 'Accuracy', 'Bearing', 'acceleration_x', 'acceleration_y',
       'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z', 'second', 'Speed',
       'label'],
      dtype='object')

In [24]:
# Feature engineering: magnitude of the acceleration and gyroscope vectors
all_data['acceleration'] = np.sqrt(all_data['acceleration_x']**2 +
                                   all_data['acceleration_y']**2 +
                                   all_data['acceleration_z']**2)

all_data['gyro'] = np.sqrt(all_data['gyro_x']**2 +
                           all_data['gyro_y']**2 +
                           all_data['gyro_z']**2)
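
The two new columns are the Euclidean norms of the three-axis sensor vectors. As a small sanity check, the sketch below computes the same quantity with `np.linalg.norm` on made-up readings; the sample values are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Made-up accelerometer readings; the first row is a 3-4-5 triangle.
sample = pd.DataFrame({
    "acceleration_x": [3.0, 0.0],
    "acceleration_y": [4.0, -9.8],
    "acceleration_z": [0.0, 0.0],
})

axes = ["acceleration_x", "acceleration_y", "acceleration_z"]

# Row-wise Euclidean norm: sqrt(x**2 + y**2 + z**2).
sample["acceleration"] = np.linalg.norm(sample[axes].to_numpy(), axis=1)
print(sample)
```

The magnitude is independent of how the phone is oriented in the vehicle, which is one reason it can be a more stable feature than the individual axes.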


In [25]:
# Split Bookings to Train and Test
n_ratio = 0.7
n_ids = labels_df.shape[0]

# Get Train-Test IDs
train_ids = pd.DataFrame(labels_df.bookingID[:int(n_ids*n_ratio)])
test_ids = pd.DataFrame(labels_df.bookingID[int(n_ids*n_ratio):])
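
The split above slices the booking IDs in their existing order. An alternative, shown here only as a sketch and not what this notebook runs, is a shuffled split that is stratified on the label, using `train_test_split` (already imported above). The toy label table and the `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy label table: 10 bookings, 3 of them dangerous (made-up values).
toy_labels = pd.DataFrame({
    "bookingID": range(100, 110),
    "label": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Shuffle the bookings and keep the class ratio similar in both parts.
toy_train_ids, toy_test_ids = train_test_split(
    toy_labels[["bookingID"]],
    train_size=0.7,
    stratify=toy_labels["label"],
    random_state=42,
)

print(len(toy_train_ids), len(toy_test_ids))
```

Splitting on booking IDs rather than on individual readings keeps all readings of a trip on the same side of the split.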

In [26]:
# Get features for train-test ids
train_df = pd.merge(train_ids, all_data, on='bookingID', how='inner')
test_df = pd.merge(test_ids, all_data, on='bookingID', how='inner')

# Get labels for train-test ids
train_label = pd.merge(train_ids, labels_df, on='bookingID', how='inner')
test_label = pd.merge(test_ids, labels_df, on='bookingID', how='inner')

In [27]:
# Get df sizes
train_df.shape, test_df.shape, train_label.shape, test_label.shape

((11246691, 14), (4888870, 14), (14000, 2), (6000, 2))

In [28]:
# Get train-test Columns
feature_columns = ['Accuracy', 'Bearing', 'acceleration_x', 'acceleration_y',
                    'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z', 'second', 'Speed']
label_column = ['label']

# Get train-test features and labels only
# train_feats = train_df[feature_columns]
# test_feats = test_df[feature_columns]

# Feature Engineer Train-Test features
train_feats = process_data(train_df)
test_feats = process_data(test_df)

train_lbl = train_label[label_column]
test_lbl = test_label[label_column]

In [29]:
train_feats.shape, train_lbl.shape

((14000, 1320), (14000, 1))

In [30]:
# Oversample Minority in train set
sm_ = sm(random_state = 42)
X_train_res, y_train_res = sm_.fit_resample(train_feats, train_lbl)  # fit_sample is the older, since-removed name

  y = column_or_1d(y, warn=True)


In [31]:
X_train_res.shape, y_train_res.shape

((21340, 1320), (21340,))

In [32]:
# Create dataframe from resampled feature train set
X_train_res_df = pd.DataFrame(X_train_res)

In [40]:
# Instantiate model
model = MLPClassifier((256, 128, 64), random_state=42, verbose=1, early_stopping=True)

In [41]:
%%time
# Fit Data to Model
model.fit(X_train_res_df, y_train_res)
predictions = model.predict(test_feats)

Iteration 1, loss = 2.41905530
Validation score: 0.526242
Iteration 2, loss = 0.77602333
Validation score: 0.622306
Iteration 3, loss = 0.67022786
Validation score: 0.603561
Iteration 4, loss = 0.64453347
Validation score: 0.664480
Iteration 5, loss = 0.61169415
Validation score: 0.685098
Iteration 6, loss = 0.58216474
Validation score: 0.610590
Iteration 7, loss = 0.55168076
Validation score: 0.705248
Iteration 8, loss = 0.49699708
Validation score: 0.608716
Iteration 9, loss = 0.48347671
Validation score: 0.721181
Iteration 10, loss = 0.43215887
Validation score: 0.630272
Iteration 11, loss = 0.39878368
Validation score: 0.643861
Iteration 12, loss = 0.41391980
Validation score: 0.750703
Iteration 13, loss = 0.34725292
Validation score: 0.721649
Iteration 14, loss = 0.31770138
Validation score: 0.735239
Iteration 15, loss = 0.30512780
Validation score: 0.783974
Iteration 16, loss = 0.25871267
Validation score: 0.650890
Iteration 17, loss = 0.30998096
Validation score: 0.752577
Iterat

In [44]:
# Accuracy Score
model.score(test_feats, test_lbl)

0.6565

In [45]:
# Measure AUC-ROC
roc_auc_score(test_lbl, predictions)

0.4968435498058478
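
The AUC above is computed from hard 0/1 predictions. ROC AUC is a ranking metric, so it is usually computed from continuous scores; for this model those would presumably come from `model.predict_proba(test_feats)[:, 1]`. The sketch below uses made-up labels and probabilities to show how thresholding can lower the measured AUC. It makes no claim about what the score would be on the challenge data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and hypothetical predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.35, 0.80, 0.05])

# Thresholding at 0.5 discards the ordering among the scores.
y_hard = (y_prob >= 0.5).astype(int)

auc_hard = roc_auc_score(y_true, y_hard)
auc_prob = roc_auc_score(y_true, y_prob)
print("AUC from hard labels:  ", auc_hard)
print("AUC from probabilities:", auc_prob)
```

In this toy case every positive is ranked above every negative, yet two of the three positives fall below the 0.5 threshold, so the hard-label AUC understates the ranking quality.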

In [46]:
# Save Model to File
model_fn = "model/safety_challenge.joblib.dat"
joblib.dump(model, model_fn)

['model/safety_challenge.joblib.dat']

# Measure From Holdout Data

**For examiner:** Please save the holdout data to the `data/test/` folder, using the same folder structure as the challenge data in the `safety/` folder. Once the holdout data is in place, run the cells below.

### Load Holdout Data

In [20]:
# Place data in directory data/
FEATURES_HOLDOUT_PATH = 'data/test/features'
LABELS_HOLDOUT_PATH = 'data/test/labels'

# Load data
features_holdout_df = load_from_directory(FEATURES_HOLDOUT_PATH)
features_holdout_df = features_holdout_df.sort_values(['bookingID', 'second'])#.set_index('second')
labels_holdout_df = load_from_directory(LABELS_HOLDOUT_PATH)

# Group by booking and take the max label so that bookings with conflicting labels default to dangerous (1)
labels_holdout_df = labels_holdout_df.groupby('bookingID').max().reset_index()

# Merge Holdout Data
all_holdout_data = pd.merge(features_holdout_df, labels_holdout_df, on='bookingID', how='left')

In [22]:
# Feature engineering: magnitude of the acceleration and gyroscope vectors
all_holdout_data['acceleration'] = np.sqrt(all_holdout_data['acceleration_x']**2 +
                                           all_holdout_data['acceleration_y']**2 +
                                           all_holdout_data['acceleration_z']**2)

all_holdout_data['gyro'] = np.sqrt(all_holdout_data['gyro_x']**2 +
                                   all_holdout_data['gyro_y']**2 +
                                   all_holdout_data['gyro_z']**2)



# Process holdout data
holdout_features = process_data(all_holdout_data)

In [47]:
# Load Saved Model
loaded_model = joblib.load(model_fn)

In [48]:
# Make predictions for the holdout data
holdout_predictions = loaded_model.predict(holdout_features)

In [49]:
# Measure AUC-ROC
roc_auc_score(labels_holdout_df.label, holdout_predictions)

0.5005791894491294
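
The score above is only meaningful if the rows of `holdout_features` and `labels_holdout_df` are in the same booking order. Whether `process_data` preserves that order is not shown in this notebook, so the sketch below assumes, purely for illustration, that predictions are available as a Series indexed by `bookingID`, and aligns the labels to that index before scoring.

```python
import pandas as pd

# Hypothetical per-trip predictions indexed by bookingID, in arbitrary order.
preds = pd.Series([1, 0, 0], index=pd.Index([30, 10, 20], name="bookingID"))

# Label table sorted by bookingID, as produced by groupby(...).max().reset_index().
labels = pd.DataFrame({"bookingID": [10, 20, 30], "label": [0, 0, 1]})

# Reorder the labels to follow the prediction index before scoring.
aligned = labels.set_index("bookingID")["label"].reindex(preds.index)

# A missing label would show up as NaN after reindexing.
n_missing = int(aligned.isna().sum())
print(aligned)
```

Checking `n_missing` before scoring catches bookings that have features but no label.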