This is a LightGBM starter based on previous work with XGBoost. I prefer saving the current settings to JSON and being able to reload them, so that the work stays reproducible. External tuning and CV can also be performed and then simply applied by changing the JSON file.

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from datetime import datetime

from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

import lightgbm as lgb

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train: ", train_df.shape)
print("Test: ", test_df.shape)

train_df.head()

Perform a little bit of data cleaning and remove outliers. Not much so far.
There is one strange outlier in the data with a time of more than 250 seconds. Keeping it will not bring any significant improvement with tree-based methods, so it is dropped by applying a cut above the "usual" time range.

In [None]:
maxtime = 200
train_df = train_df.loc[train_df['y']<maxtime]

plt.figure(figsize=(8,6))
plt.title('y distribution')
plt.hist(train_df['y'], bins=100)
plt.xlabel('y value [s]', fontsize=12)
plt.yscale('log')
plt.show()

In [None]:
df_nan = train_df.isnull().sum(axis=0)
print("Null/NaN values: " + str(df_nan.sum()))

Prepare and encode labels

In [None]:
for col in train_df.columns:
    if train_df[col].dtype == 'object':
        lenc = preprocessing.LabelEncoder()
        lenc.fit(list(train_df[col]) + list(test_df[col]))
        train_df[col] = lenc.transform(list(train_df[col]))
        test_df[col] = lenc.transform(list(test_df[col]))
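To make explicit what the encoding step above does: as far as I understand it, `LabelEncoder` maps the sorted unique values of a column to consecutive integers, and fitting on train and test together keeps the codes consistent and avoids errors on values that only occur in the test set. Below is a minimal plain-Python sketch of the same idea. `fit_codes` and `encode` are hypothetical helper names, and the sample values are made up, not taken from the data.

```python
def fit_codes(*columns):
    """Map every distinct value across the given columns to an integer code."""
    classes = sorted(set().union(*columns))
    return {value: code for code, value in enumerate(classes)}


def encode(column, codes):
    """Replace each value by its integer code."""
    return [codes[value] for value in column]


# Made-up example values, for illustration only.
train_col = ["az", "k", "az", "t"]
test_col = ["k", "w"]  # "w" never occurs in the train column

# Fit on train and test together, as in the loop above.
codes = fit_codes(train_col, test_col)
train_encoded = encode(train_col, codes)
test_encoded = encode(test_col, codes)
```

Note that the integer codes impose an arbitrary order on the categories. Tree-based models usually tolerate this better than linear models do, but one-hot encoding may still be worth a try.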

Generate final arrays

In [None]:
X = train_df.drop(['ID','y'], axis=1)
y = train_df['y']

X_test = test_df.drop(['ID'], axis=1)
ID_test = test_df['ID']

print("-> Train: ", X.shape)
print("-> Test: ", X_test.shape)

Load LGBM Parameters and train

In [None]:
#with open("parameter_LGB_0.5166900124_2017-06-13-11-46.json") as fp:
#    loaded_pars = json.load(fp)

model = lgb.LGBMRegressor()
#model.set_params(**loaded_pars)
print("Training...")
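# init_score expects one value per sample, not a scalar. It is dropped here
# because LightGBM starts regression boosting from the label average by default.
model.fit(X, y)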

CV predict and plot

In [None]:
y_pred = cross_val_predict(model, X=X, y=y, cv=10, n_jobs=5)
y_diff = np.clip(100 * ( (y_pred - y) / y ), -50, 50)

R2 = r2_score(y, y_pred)

print("CV R2-Score: " + str(R2))

plt.figure(figsize=(8,6))
plt.title('True vs Predicted Y')
plt.scatter(y, y_pred, c=y_diff, cmap=plt.cm.seismic)
plt.colorbar()
plt.xlabel('True y')
plt.ylabel('Predicted y')
plt.show()

plt.figure(figsize=(8,6))
plt.hist(y_pred, 50)
plt.xlabel('Predicted y')
plt.show()

Interesting: the predictions appear to be grouped in bands or clusters. This definitely needs further investigation.
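For reference, the two quantities computed in the CV cell can be written out by hand. This is a minimal sketch, assuming the usual definition R2 = 1 - SS_res / SS_tot that `r2_score` uses by default. `r2` and `clipped_pct_diff` are hypothetical helper names and the numbers are made up.

```python
def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def clipped_pct_diff(y_true, y_hat, limit=50.0):
    """Relative error in percent, clipped to [-limit, limit] for coloring."""
    return [max(-limit, min(limit, 100 * (p - t) / t))
            for t, p in zip(y_true, y_hat)]


# Made-up values; the last prediction is far off on purpose.
y_true = [80.0, 100.0, 120.0, 100.0]
y_hat = [90.0, 100.0, 110.0, 160.0]

score = r2(y_true, y_hat)
diffs = clipped_pct_diff(y_true, y_hat)
```

In this toy example a single badly predicted point is enough to push the score below zero, which is one reason the outlier cut above matters for the reported R2.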

Finally prepare submission and save parameters.

In [None]:
print("Preparing submission and parameters file...")
subm = pd.DataFrame()
subm['ID'] = ID_test
subm['y'] = model.predict(X_test)

sub_file = 'submission_LGB_' + str(R2) + '_' + datetime.now().strftime('%Y-%m-%d-%H-%M') + '.csv'
lgb_file = sub_file.replace('submission', 'parameter')
lgb_file = lgb_file.replace('csv', 'json')

#with open(lgb_file, mode="w") as fp:
#    json.dump(model.get_params(), fp)

subm.to_csv(sub_file, index=False)
print("done.")
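The save and reload round trip that is commented out in this notebook could look roughly like the sketch below. It uses only the standard library and a plain dictionary, so the parameter values are made up, and `save_params` and `load_params` are hypothetical helper names. Skipping values that cannot be serialized is an assumption about what one would want, since a parameter dictionary may contain callables.

```python
import json
import os
import tempfile


def save_params(params, path):
    """Write the JSON-serializable subset of params to path and return it."""
    clean = {}
    for key, value in params.items():
        try:
            json.dumps(value)
        except TypeError:
            continue  # skip callables and other non-serializable objects
        clean[key] = value
    with open(path, mode="w") as fp:
        json.dump(clean, fp, indent=2, sort_keys=True)
    return clean


def load_params(path):
    """Read a parameter dictionary back from a JSON file."""
    with open(path) as fp:
        return json.load(fp)


# Hypothetical parameter values, for illustration only.
params = {"n_estimators": 100, "learning_rate": 0.1, "objective": None,
          "callback": print}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "parameter_example.json")
    saved = save_params(params, path)
    reloaded = load_params(path)
```

In this notebook that would correspond to calling `save_params(model.get_params(), lgb_file)` after training, and `model.set_params(**load_params(...))` on a saved file before the next run.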