# Baselines-To-Start-With(LB=0.56+)

[Fred Navruzov's Kernel](https://www.kaggle.com/frednavruzov/baselines-to-start-with-lb-0-56)

Hi, Kagglers!

In this notebook I will try to publish **some basic approaches for climbing the leaderboard.**

**Competition goal**

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler's standards.

**The notebook adopts its skeleton from (maybe?) this script:**  
[https://www.kaggle.com/ermolushka/starter-xgboost](https://www.kaggle.com/ermolushka/starter-xgboost)

Stay tuned, this notebook will be updated on a regular basis.

**P.S. Upvotes and comments would let me update it faster and in a smarter way :)**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder

### Import

In [7]:
# read datasets
train = pd.read_csv("../../input/train.csv")
test = pd.read_csv("../../input/test.csv")

print("##Before Label Encoding##")
print("Shape train: {}\nShape test: {}".format(train.shape, test.shape))
print("--------------------------------------------")

# process columns, apply LabelEncoder to categorical features
for c in train.columns :
    if train[c].dtype == 'object' :
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(list(train[c].values))
        test[c] = lbl.transform(list(test[c].values))
        
# shape
print("##After Label Encoding##")
print("Shape train: {}\nShape test: {}".format(train.shape, test.shape))

##Before Label Encoding##
Shape train: (4209, 378)
Shape test: (4209, 377)
--------------------------------------------
##After Label Encoding##
Shape train: (4209, 378)
Shape test: (4209, 377)
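
The cell above fits each `LabelEncoder` on the train and test values together. A minimal sketch of why that matters, using toy frames (`toy_train`, `toy_test` and their values are made up for illustration, not competition data): an encoder fitted on the training values alone cannot encode a category that appears only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for train/test (hypothetical values, not competition data)
toy_train = pd.DataFrame({"X0": ["a", "b", "a"]})
toy_test = pd.DataFrame({"X0": ["b", "c", "c"]})  # "c" never appears in toy_train

# Fit on the union of both columns so every category gets an integer code
enc = LabelEncoder()
enc.fit(list(toy_train["X0"]) + list(toy_test["X0"]))
train_codes = enc.transform(toy_train["X0"])
test_codes = enc.transform(toy_test["X0"])
print(list(enc.classes_), list(train_codes), list(test_codes))

# An encoder fitted on the training values alone fails on the unseen label "c"
train_only = LabelEncoder().fit(toy_train["X0"])
try:
    train_only.transform(toy_test["X0"])
    unseen_error = False
except ValueError:
    unseen_error = True
print("train-only encoder failed on test data:", unseen_error)
```

Note that the integer codes follow the sorted order of the labels, so they carry no meaning of their own; tree models can still split on them, which is presumably why this simple encoding is good enough for a baseline.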


### Add decomposed components: PCA / ICA etc.

In [10]:
from sklearn.decomposition import PCA, FastICA
n_comp = 10

# PCA
pca = PCA(n_components=n_comp, random_state=42)
pca2_results_train = pca.fit_transform(train.drop(["y"], axis=1))
pca2_results_test = pca.transform(test)

# ICA
ica = FastICA(n_components=n_comp, random_state=42)
ica2_results_train = ica.fit_transform(train.drop(["y"], axis=1))
ica2_results_test = ica.transform(test)

# Append decomposition components to datasets
for i in range(1, n_comp+1) :
    train['pca_' + str(i)] = pca2_results_train[:, i-1]
    test['pca_' + str(i)] = pca2_results_test[:, i-1]
    
    train['ica_' + str(i)] = ica2_results_train[:, i-1]
    test['ica_' + str(i)] = ica2_results_test[:, i-1]
    
y_train = train['y']
y_mean = np.mean(y_train)
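
To judge whether `n_comp = 10` is a reasonable choice, one option is to look at how much variance the fitted PCA components retain. The sketch below does this on synthetic binary data (the array `X_demo` is a made-up stand-in, so the printed number says nothing about the competition data); on the real data the same attribute could be read from the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic stand-in: 200 rows, 30 binary columns (hypothetical, not competition data)
X_demo = rng.binomial(1, 0.3, size=(200, 30)).astype(float)

pca_demo = PCA(n_components=10, random_state=42)
components = pca_demo.fit_transform(X_demo)

# Share of total variance kept by the first k components, for k = 1..10
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(components.shape)
print("variance retained by 10 components: {:.3f}".format(cumulative[-1]))
```

If the cumulative share flattens out early, extra components add little; if it is still rising at the last component, a larger `n_comp` may be worth trying.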

### Preparing Regressor

In [12]:
# mmm, xgboost, loved by everyone ^-^
import xgboost as xgb

# Prepare dict of params for xgboost to run with
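xgb_params = {
    'n_trees': 500, # not a native xgboost parameter; the number of rounds is set by num_boost_round below
    'eta': 0.005,
    'max_depth': 4,
    'subsample': 0.95,
    'objective': 'reg:linear', # named 'reg:squarederror' in newer xgboost releases
    'eval_metric': 'rmse',
    'base_score': y_mean, # base prediction = mean(target)
    'silent': 1 # replaced by 'verbosity' in newer xgboost releases
}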

# form DMatrices for Xgboost training
dtrain = xgb.DMatrix(train.drop('y', axis=1), y_train)
dtest = xgb.DMatrix(test)

# xgboost, cross-validation
cv_result = xgb.cv(xgb_params,
                   dtrain,
                   num_boost_round=500, # increase to have better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=50,
                   show_stdv=False)

num_boost_rounds = len(cv_result)
print(num_boost_rounds)

# train model
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=num_boost_rounds)

[0]	train-rmse:12.6401	test-rmse:12.6387
[50]	train-rmse:11.0905	test-rmse:11.1527
[100]	train-rmse:10.0181	test-rmse:10.1487
[150]	train-rmse:9.28978	test-rmse:9.48862
[200]	train-rmse:8.8028	test-rmse:9.06727
[250]	train-rmse:8.47773	test-rmse:8.80199
[300]	train-rmse:8.25941	test-rmse:8.63741
[350]	train-rmse:8.09155	test-rmse:8.53632
[400]	train-rmse:7.95297	test-rmse:8.47684
[450]	train-rmse:7.83878	test-rmse:8.44109
[499]	train-rmse:7.73445	test-rmse:8.42257
500
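
The parameters above set `base_score` to the mean of the target. A small sketch of the reasoning, using made-up targets (`y_demo` is hypothetical): among constant predictions, the mean gives the lowest RMSE, and that RMSE equals the standard deviation of the target. The value `0.5` is used for comparison because, as far as I know, it was the default `base_score` in older xgboost releases.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical targets, loosely shaped like a testing time in seconds
y_demo = rng.normal(loc=100.0, scale=12.0, size=1000)

def rmse(y_true, y_hat):
    """Root mean squared error of a prediction (scalar or array)."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

rmse_mean = rmse(y_demo, np.mean(y_demo))  # start boosting from the mean
rmse_default = rmse(y_demo, 0.5)           # start boosting from 0.5 (assumed old default)

print("RMSE from mean: {:.3f}".format(rmse_mean))
print("RMSE from 0.5:  {:.3f}".format(rmse_default))
```

With a small learning rate such as `eta = 0.005`, each round moves the prediction only slightly, so a starting point far from the targets would spend many rounds just closing that gap.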


In [13]:
# check R^2 score (to get a higher score, increase num_boost_round in the previous cell)
from sklearn.metrics import r2_score

# R^2 computed on the training data itself, so it is an optimistic estimate
print(r2_score(dtrain.get_label(), model.predict(dtrain)))

0.6108367775830508
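
For reference, `r2_score` computes one minus the ratio of the residual sum of squares to the total sum of squares around the target mean. The sketch below reproduces it by hand on four made-up values (`y_true` and `y_hat` are illustrative only) and checks that a constant mean prediction scores zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# Predicting the mean everywhere gives R^2 = 0, the baseline to beat
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_hat), r2_baseline)
```

The score printed by the cell above is measured on the rows the model was trained on, so it should not be read as an estimate of the leaderboard score.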


### Prepare Prediction

In [14]:
# make predictions and save results
y_pred = model.predict(dtest)
output = pd.DataFrame({'id': test['ID'].astype(np.int32),
                       'y': y_pred})
output.to_csv('xgboost-depth{}-pca-ica.csv'.format(xgb_params['max_depth']), index=False)
