Hi, Kagglers!

In this notebook I will share **some basic approaches to climbing the leaderboard**.

**Competition goal**

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench.
<br>Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. <br>Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards. 

**This notebook borrows its skeleton from (I believe) this script: https://www.kaggle.com/ermolushka/starter-xgboost**

### Stay tuned, this notebook will be updated on a regular basis
**P.S. Upvotes and comments will help me update it faster and in a smarter way :)**

In [6]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA, FastICA

%load_ext autotime

### Import

In [2]:
# read datasets
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

# process columns, apply LabelEncoder to categorical features
for c in train.columns:
    if train[c].dtype == 'object':
        lbl = LabelEncoder() 
        lbl.fit(list(train[c].values) + list(test[c].values)) 
        train[c] = lbl.transform(list(train[c].values))
        test[c] = lbl.transform(list(test[c].values))

# shape        
print('Shape train: {}\nShape test: {}'.format(train.shape, test.shape))

Shape train: (4209, 378)
Shape test: (4209, 377)
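
The encoder in the cell above is fitted on the union of train and test values, so a category that appears only in the test set still receives a code. A minimal sketch of that behaviour, assuming made-up column values (they are hypothetical and not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; 't' appears only in the test split.
train_col = ['az', 'k', 'az']
test_col = ['t', 'k']

lbl = LabelEncoder()
lbl.fit(train_col + test_col)          # learn the codes from both splits

train_codes = lbl.transform(train_col)
test_codes = lbl.transform(test_col)   # 't' is known, so no unseen-label error

print(list(lbl.classes_))
print(train_codes, test_codes)
```

The integer codes simply follow the sorted order of the labels, so they carry no meaning of their own. Tree models usually cope with that, but a linear model would read them as an ordered scale.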


### Add decomposed components: PCA / ICA etc.

In [4]:
n_comp = 10

# PCA
pca = PCA(n_components=n_comp, random_state=42)
pca2_results_train = pca.fit_transform(train.drop(["y"], axis=1))
pca2_results_test = pca.transform(test)

# ICA
ica = FastICA(n_components=n_comp, random_state=42)
ica2_results_train = ica.fit_transform(train.drop(["y"], axis=1))
ica2_results_test = ica.transform(test)

# Append decomposition components to datasets
for i in range(1, n_comp+1):
    train['pca_' + str(i)] = pca2_results_train[:,i-1]
    test['pca_' + str(i)] = pca2_results_test[:, i-1]
    
    train['ica_' + str(i)] = ica2_results_train[:,i-1]
    test['ica_' + str(i)] = ica2_results_test[:, i-1]
    
y_train = train["y"]
y_mean = np.mean(y_train)
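
The pattern in the cell above is to fit the decomposition on the training features only and then apply the same fitted transform to the test features, so both sets share one projection. A small sketch of that pattern on random data (the array shapes, the seed and the names `X_train`, `X_test` are arbitrary assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.rand(100, 20)   # stand-in for train.drop(["y"], axis=1)
X_test = rng.rand(40, 20)     # stand-in for test

n_comp = 5
pca = PCA(n_components=n_comp, random_state=42)
train_comp = pca.fit_transform(X_train)  # learn the projection on train
test_comp = pca.transform(X_test)        # reuse it on test, no refit

print(train_comp.shape, test_comp.shape)
# Share of the total variance kept by the retained components
kept = pca.explained_variance_ratio_.sum()
print(kept)
```

Note that in the notebook the `ID` column is part of the input to both PCA and ICA, and the components are appended to the original columns rather than replacing them. Since PCA is sensitive to feature scale, a wide-ranging column such as `ID` will likely dominate the first component unless the features are scaled first.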

### Preparing Regressor

In [13]:
# mmm, xgboost, loved by everyone ^-^
import xgboost as xgb

# prepare dict of params for xgboost to run with
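xgb_params = {
    'n_trees': 500,  # not a native XGBoost parameter; num_boost_round sets the tree count
    'eta': 0.005,
    'max_depth': 4,
    'subsample': 0.95,
    'objective': 'reg:linear',  # renamed 'reg:squarederror' in newer XGBoost releases
    'eval_metric': 'rmse',
    'base_score': y_mean, # base prediction = mean(target)
    'silent': 1  # replaced by 'verbosity' in newer XGBoost releases
}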

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(train.drop('y', axis=1), y_train)
dtest = xgb.DMatrix(test)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=500, # increase for better results (~700); the CV error below is still falling at round 499
                   early_stopping_rounds=50,
                   verbose_eval=50, 
                   show_stdv=False
                  )

num_boost_rounds = len(cv_result)
print(num_boost_rounds)

# train model
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=num_boost_rounds)
print(r2_score(dtrain.get_label(), model.predict(dtrain)))

[0]	train-rmse:12.6399	test-rmse:12.6383
[50]	train-rmse:11.0903	test-rmse:11.1515
[100]	train-rmse:10.0181	test-rmse:10.1468
[150]	train-rmse:9.28968	test-rmse:9.48769
[200]	train-rmse:8.80274	test-rmse:9.06665
[250]	train-rmse:8.47841	test-rmse:8.80237
[300]	train-rmse:8.26046	test-rmse:8.63718
[350]	train-rmse:8.09258	test-rmse:8.53672
[400]	train-rmse:7.9487	test-rmse:8.47722
[450]	train-rmse:7.83236	test-rmse:8.44252
[499]	train-rmse:7.72831	test-rmse:8.42335
500
0.611979654616
time: 55.4 s
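
The last number printed above is the coefficient of determination, R² = 1 - SS_res / SS_tot. It is evaluated on the training data itself, so it is likely more optimistic than a held-out score would be. A short sketch with made-up targets, showing that a constant prediction equal to the target mean (the value `base_score` starts from) scores exactly 0; the helper `r2` is a hypothetical stand-in written out for clarity, not the notebook's own code:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets, not taken from the competition data
y = np.array([90.0, 100.0, 110.0, 120.0])

baseline = np.full_like(y, y.mean())   # predict the mean everywhere
print(r2(y, baseline))                 # mean-only model
print(r2(y, y))                        # perfect predictions
```

Anything the boosted trees add on top of the mean therefore shows up directly as a positive R².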


In [1]:
# mmm, xgboost, loved by everyone ^-^
import xgboost as xgb

# prepare dict of params for xgboost to run with
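xgb_params = {
    'n_trees': 500,  # not a native XGBoost parameter; num_boost_round sets the tree count
    'eta': 0.005,
    'max_depth': 4,
    'subsample': 0.95,
    'objective': 'reg:linear',  # renamed 'reg:squarederror' in newer XGBoost releases
    'eval_metric': 'rmse',
    'base_score': y_mean, # base prediction = mean(target)
    'silent': 1,  # replaced by 'verbosity' in newer XGBoost releases
    'tree_method': 'hist'  # histogram-based tree construction
}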

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(train.drop('y', axis=1), y_train)
dtest = xgb.DMatrix(test)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   num_boost_round=500, # increase for better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=50, 
                   show_stdv=False
                  )

num_boost_rounds = len(cv_result)
print(num_boost_rounds)

# train model
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=num_boost_rounds)
print(r2_score(dtrain.get_label(), model.predict(dtrain)))

NameError: name 'train' is not defined

In [19]:
xgb_params['tree_method']

'hist'

time: 3.6 ms


### Prepare Predictions

In [20]:
# make predictions and save results
y_pred = model.predict(dtest)
output = pd.DataFrame({'id': test['ID'].astype(np.int32), 'y': y_pred})
output.to_csv('../submit/xgboost-depth{}{}-pca-ica.csv'.format(xgb_params['max_depth'], xgb_params['tree_method']), index=False)

time: 90.6 ms
