# HackerEarth ML 1: Predict the Road Sign
### Random Forest solution: 91%

Download the dataset from the following link.

[Dataset Download](https://he-s3.s3.amazonaws.com/media/hackathon/hackerearth/predict-the-road-sign/4b699168-4-here_dataset.zip)

In [1]:
import os
import sys
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier as RFC
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
from sklearn.model_selection import GridSearchCV

Defining folder paths and reading train and test data.

In [2]:
model_name = "RFC"
obj_dir = "./obj"
data_dir = "./data"
output_dir = "./output"

train_file = os.path.join(data_dir, "train.csv")
test_file = os.path.join(data_dir, "test.csv")
op_file = os.path.join(output_dir, model_name+"_"+datetime.now().strftime('%Y%m%d%H%M%S')+".csv")
model_file = os.path.join(obj_dir, model_name+"_"+datetime.now().strftime('%Y%m%d')+".pkl")

print("Output file: {}\nModel file: {}".format(op_file, model_file))

Output file: ./output/RFC_20180901223049.csv
Model file: ./obj/RFC_20180901.pkl
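
The cell above assumes that the obj and output directories already exist, and it calls datetime.now() once per file name. A minimal sketch of a variant that creates the folders on demand and reuses a single timestamp follows. The helper name build_paths and its base_dir and now parameters are hypothetical and not part of the original notebook.

```python
import os
import tempfile
from datetime import datetime


def build_paths(base_dir, model_name="RFC", now=None):
    """Return (op_file, model_file), creating the folders if needed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    now = now or datetime.now()  # one timestamp shared by both file names
    obj_dir = os.path.join(base_dir, "obj")
    output_dir = os.path.join(base_dir, "output")
    for folder in (obj_dir, output_dir):
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    op_name = "{}_{}.csv".format(model_name, now.strftime("%Y%m%d%H%M%S"))
    model_name_on_disk = "{}_{}.pkl".format(model_name, now.strftime("%Y%m%d"))
    return os.path.join(output_dir, op_name), os.path.join(obj_dir, model_name_on_disk)


# Usage with a throwaway directory and a fixed timestamp
demo_dir = tempfile.mkdtemp()
op_file, model_file = build_paths(demo_dir, now=datetime(2018, 9, 1, 22, 30, 49))
print(os.path.basename(op_file), os.path.basename(model_file))
```

Creating the folders up front avoids a failure at the very end of a long run, when the model or the submission file is finally written.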


In [3]:
# DataFrame.info() prints its summary and returns None, so call it directly
train = pd.read_csv(train_file)
print("Train data info:")
train.info()

test = pd.read_csv(test_file)
print("Test data info:")
test.info()

Train data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38485 entries, 0 to 38484
Data columns (total 7 columns):
Id                     38485 non-null object
DetectedCamera         38485 non-null object
AngleOfSign            38485 non-null int64
SignAspectRatio        38485 non-null float64
SignWidth              38485 non-null int64
SignHeight             38485 non-null int64
SignFacing (Target)    38485 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 2.1+ MB
Test data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31485 entries, 0 to 31484
Data columns (total 6 columns):
Id                 31485 non-null object
DetectedCamera     31485 non-null object
AngleOfSign        31485 non-null int64
SignAspectRatio    31485 non-null float64
SignWidth          31485 non-null int64
SignHeight         31485 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 1.4+ MB


Create a common mapping dictionary for the camera and target columns. Apply the mapping, then drop the SignAspectRatio feature along with the target and Id columns.

In [4]:
mapping = {'Front':0, 'Right':1, 'Left':2, 'Rear':3}

train = train.replace({'DetectedCamera':mapping})
test = test.replace({'DetectedCamera':mapping})
train = train.replace({'SignFacing (Target)':mapping})

y_train = train['SignFacing (Target)']
test_id = test['Id']

df = pd.concat([train, test], sort=False)  # DataFrame.append was removed in pandas 2.0
print(df.shape, train.shape, test.shape)

drop_cols = ['SignAspectRatio', 'SignFacing (Target)', 'Id']
df.drop(columns=drop_cols, inplace=True)
X_train = df.iloc[:len(train), :]
print(X_train.shape)
X_test = df.iloc[len(train):, :]
print(X_test.shape)

(69970, 7) (38485, 7) (31485, 6)
(38485, 4)
(31485, 4)
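
The encode, stack, drop, and split steps can be tried on a toy example before running them on the full data. The sketch below uses small made-up frames with the same column names as the competition files. It encodes with Series.map instead of DataFrame.replace, which is a swapped-in alternative and not what the notebook does. All names ending in _demo are illustrative.

```python
import pandas as pd

mapping = {'Front': 0, 'Right': 1, 'Left': 2, 'Rear': 3}

# Tiny made-up frames with the same column names as the competition data
train_demo = pd.DataFrame({
    'Id': ['a', 'b', 'c'],
    'DetectedCamera': ['Front', 'Left', 'Rear'],
    'AngleOfSign': [10, 200, 95],
    'SignAspectRatio': [1.0, 0.8, 1.2],
    'SignWidth': [50, 40, 60],
    'SignHeight': [50, 50, 50],
    'SignFacing (Target)': ['Front', 'Rear', 'Left'],
})
test_demo = train_demo.drop(columns=['SignFacing (Target)']).iloc[:2].copy()

# Series.map encodes one column at a time; values missing from the dict become NaN
for frame in (train_demo, test_demo):
    frame['DetectedCamera'] = frame['DetectedCamera'].map(mapping)
y_demo = train_demo['SignFacing (Target)'].map(mapping)

# Stack both frames so the same columns are dropped from each, then split by position
df_demo = pd.concat([train_demo, test_demo], sort=False)
df_demo = df_demo.drop(columns=['SignAspectRatio', 'SignFacing (Target)', 'Id'])
X_train_demo = df_demo.iloc[:len(train_demo), :]
X_test_demo = df_demo.iloc[len(train_demo):, :]

print(X_train_demo.shape, X_test_demo.shape)
print(list(X_train_demo.columns))
```

Splitting with iloc relies on the row order being preserved by the concatenation, which is why the frames are stacked train first and test second.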


Building the random forest classifier model. A baseline model is fitted and saved to the obj directory, then the hyperparameters are tuned with a 5-fold grid search.

In [5]:
# Baseline forest; oob_score=True stores an out-of-bag accuracy estimate in rfc.oob_score_
rfc = RFC(n_estimators=300, max_features=3, min_samples_split=5, oob_score=True)
rfc.fit(X_train, y_train)
joblib.dump(rfc, os.path.join(obj_dir, "rfc_base.pkl"))

# Hyperparameter grid searched with 5-fold cross-validation on all available cores
param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 500]
}
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits

[Parallel(n_jobs=-1)]: Done 1080 out of 1080 | elapsed: 25.6min finished
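
The candidate and fit counts reported in the log follow directly from the grid: one candidate per combination of parameter values, and one fit per candidate per fold. A minimal sketch of that arithmetic using only the standard library (scikit-learn's ParameterGrid enumerates the same combinations); the variable names are illustrative.

```python
from itertools import product

param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 500],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)   # 1 * 3 * 2 * 3 * 3 * 4
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 216 1080
```

Running this count before launching a search gives a quick sense of its cost: adding one more value to any parameter multiplies the number of fits.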


Predicting the target probabilities for the test dataset.

In [6]:
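print(grid_search.best_params_)
best_grid = grid_search.best_estimator_  # refitted on the full training set with the best parameters
joblib.dump(best_grid, model_file)  # save the tuned model rather than the baseline rfc
y_test = best_grid.predict_proba(X_test)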

{'bootstrap': True, 'max_depth': 100, 'max_features': 2, 'min_samples_leaf': 5, 'min_samples_split': 12, 'n_estimators': 300}
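
With the default refit=True, GridSearchCV retrains the best candidate on the whole training set and exposes it as best_estimator_, so predictions from the search object and from that estimator should agree. The sketch below checks this on small synthetic data and round-trips the tuned model through joblib. The data, the reduced grid, and all variable names are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for X_train / y_train (made-up data: 4 features, 4 classes)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = rng.randint(0, 4, size=200)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={'max_features': [2, 3]},
    cv=3,
)
search.fit(X_demo, y_demo)  # refit=True (default) retrains the best candidate on all rows

best_model = search.best_estimator_
proba_from_search = search.predict_proba(X_demo[:5])
proba_from_best = best_model.predict_proba(X_demo[:5])

# Round-trip the tuned model through joblib, as a later session would load it
demo_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
joblib.dump(best_model, demo_path)
reloaded = joblib.load(demo_path)
proba_from_disk = reloaded.predict_proba(X_demo[:5])

print(np.allclose(proba_from_search, proba_from_best))
print(np.allclose(proba_from_best, proba_from_disk))
print(proba_from_best.shape)
```

Saving best_estimator_ rather than the search object keeps the file small, at the cost of losing the cross-validation results stored in cv_results_.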


Saving the submission file.

In [7]:
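# predict_proba columns follow the sorted class codes 0..3, i.e. the mapping order: Front, Right, Left, Rear
submission = pd.DataFrame(data=y_test, columns=['Front','Right','Left','Rear'])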
submission['Id'] = test_id
submission = submission[['Id','Front','Left','Rear','Right']]
submission.to_csv(op_file, index=False)
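
scikit-learn orders the columns of predict_proba by the estimator's classes_ attribute, which holds the class labels in sorted order. Because the target was encoded with the mapping dictionary, that order is Front, Right, Left, Rear rather than alphabetical. A minimal sketch of deriving the column labels from the mapping and then reordering them for the submission; the probability row is made up.

```python
mapping = {'Front': 0, 'Right': 1, 'Left': 2, 'Rear': 3}

# Class codes in ascending order give the predict_proba column order
proba_columns = [name for name, code in sorted(mapping.items(), key=lambda item: item[1])]
print(proba_columns)  # ['Front', 'Right', 'Left', 'Rear']

# One made-up probability row in predict_proba order, labelled and then reordered
row = [0.7, 0.1, 0.15, 0.05]
labelled = dict(zip(proba_columns, row))
submission_order = ['Front', 'Left', 'Rear', 'Right']
print([labelled[name] for name in submission_order])  # [0.7, 0.15, 0.05, 0.1]
```

Deriving the labels from the mapping keeps the submission columns correct even if the encoding is later changed.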