A simplified and optimized version of this code is available on my profile: https://www.kaggle.com/valeriyparubets/sklearn-gbr-0-413
This notebook explains how it is done.
The same explanation in Russian is here: https://github.com/Vzzzz/kaggle-bike

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from datetime import datetime
from scipy import stats

In [None]:
trainData = pd.read_csv('../input/train.csv')
testData = pd.read_csv('../input/test.csv')

In [None]:
trainData.head(2)

First, let's visualize the distribution and the rolling mean of bikes rented over the whole period:

In [None]:
fig, axes = plt.subplots(figsize=(15, 4), ncols=2, nrows=1)
sn.distplot(trainData["count"], ax=axes[0])
# pd.rolling_mean was removed from pandas; use the Series.rolling API instead
axes[1].plot(trainData['count'].rolling(100).mean())
plt.show()

With a distribution this skewed, it is better to model the logarithm of count, which compresses the long right tail:

In [None]:
trainData['logcount'] = np.log1p(trainData['count'])
fig, axes = plt.subplots(figsize=(15, 8))
sn.distplot(trainData["logcount"], ax=axes)
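
The log transform above uses `log1p` rather than a plain logarithm, and the predictions later in this notebook are mapped back with `expm1`. A minimal sketch of that round trip, using made-up counts (the array below is hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical rental counts, including an hour with zero rentals.
counts = np.array([0, 1, 16, 40, 977], dtype=float)

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
log_counts = np.log1p(counts)

# expm1 is the inverse of log1p, so values predicted on the log scale
# can be mapped back to rental counts.
restored = np.expm1(log_counts)

print(log_counts.round(3))
print(np.allclose(restored, counts))
```

This is why every submission cell applies `np.expm1` to the model output before writing the CSV.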

As a first step of feature engineering, we split the datetime column into several features: 'date', 'month', 'hour', 'weekday'.

In [None]:
trainData['date'] = trainData.datetime.apply(lambda x : x.split()[0])
# Cast hour to int so that it is a numeric feature rather than a string
trainData['hour'] = trainData.datetime.apply(lambda x : x.split()[1].split(":")[0]).astype(int)
trainData['weekday'] = trainData.date.apply(lambda dateString : datetime.strptime(dateString, '%Y-%m-%d').weekday())
trainData['month'] = trainData.date.apply(lambda dateString : datetime.strptime(dateString, '%Y-%m-%d').month)

testData['date'] = testData.datetime.apply(lambda x : x.split()[0])
testData['hour'] = testData.datetime.apply(lambda x : x.split()[1].split(":")[0]).astype(int)
testData['weekday'] = testData.date.apply(lambda dateString : datetime.strptime(dateString, '%Y-%m-%d').weekday())
testData['month'] = testData.date.apply(lambda dateString : datetime.strptime(dateString, '%Y-%m-%d').month)

timeColumn = testData['datetime']
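
The same date-time split can also be done with the pandas `.dt` accessor instead of string splitting. This is a sketch of an alternative, not the approach used in this notebook; the two-row frame `sample` is hypothetical and only mimics the `YYYY-MM-DD HH:MM:SS` format of the datetime column:

```python
import pandas as pd

# Hypothetical two-row frame in the same timestamp format as the dataset.
sample = pd.DataFrame({'datetime': ['2011-01-01 00:00:00', '2011-01-20 23:00:00']})

parsed = pd.to_datetime(sample['datetime'])
sample['date'] = parsed.dt.strftime('%Y-%m-%d')
sample['hour'] = parsed.dt.hour          # integer in 0..23
sample['weekday'] = parsed.dt.weekday    # Monday=0 ... Sunday=6, as in datetime.weekday()
sample['month'] = parsed.dt.month

print(sample)
```

With this variant the hour is an integer from the start, so no extra cast is needed before passing it to a model.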

We can already try to train a regressor at this point. It gives a score of approximately 0.465.

In [None]:
import xgboost as xgb

X = trainData.drop(['count', 'datetime', 'registered', 'casual', 'date', 'logcount'], axis=1).values
Y = trainData['logcount'].values

testX = testData.drop(['datetime', 'date'], axis=1).values

max_depth = 5
min_child_weight = 8
subsample = 0.9
num_estimators = 1000
learning_rate = 0.1

clf = xgb.XGBRegressor(max_depth=max_depth,
                min_child_weight=min_child_weight,
                subsample=subsample,
                n_estimators=num_estimators,
                learning_rate=learning_rate)

clf.fit(X,Y)

pred = clf.predict(testX)
pred = np.expm1(pred)

submission = pd.DataFrame({
        "datetime": timeColumn,
        "count": pred
    })
submission.to_csv('XGBNoFE.csv', index=False)

Let's continue with further feature engineering.
First, let's look at how count is distributed across seasons, working and non-working days, and hours of the day.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(15, 8)
sn.boxplot(data=trainData, y='count', x='season', ax=axes[0])
sn.boxplot(data=trainData, y='count', x='workingday', ax=axes[1])
axes[0].set(xlabel='season', ylabel='count')
axes[1].set(xlabel='workingday', ylabel='count')

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
sn.boxplot(data=trainData, y='count', x='hour', ax=axes)

Using the three-sigma rule, let's clear the dataset of "anomalous" entries. This drops about 1% of the rows, those whose count lies more than three standard deviations from the mean.

In [None]:
trainDataWithoutOutliers = trainData[np.abs(trainData['count']-trainData['count'].mean())
                                     <=(3*trainData['count'].std())] 
print(trainDataWithoutOutliers.shape)
trainData = trainDataWithoutOutliers
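
To see the three-sigma filter in isolation, here is a small sketch on synthetic data. The series `values` is hypothetical: 999 draws from a normal distribution plus one extreme value that the rule should remove.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 999 ordinary values plus one extreme value.
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(100, 10, size=999), 1000.0))

# Keep only the entries within three standard deviations of the mean.
mean, std = values.mean(), values.std()
keep = (values - mean).abs() <= 3 * std
filtered = values[keep]

print(len(values), '->', len(filtered))
```

Note that the mean and standard deviation are computed with the outlier still included, exactly as in the cell above, so a single extreme value widens the accepted band somewhat.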

We can also look at the correlation between features. This helps to decide which of them to drop.

In [None]:
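# Correlate numeric columns only; 'datetime' and 'date' are strings
corrMat = trainData.select_dtypes(include='number').corr()
# Mask the upper triangle so that each pair of features is shown once
mask = np.triu(np.ones_like(corrMat, dtype=bool), k=1)
fig, ax = plt.subplots(figsize=(20, 10))
sn.heatmap(corrMat, mask=mask, vmax=1., square=True, annot=True, ax=ax)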

We can also explore dependencies between features using plots like these:

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(15, 15))

meanMonthly = pd.DataFrame(trainData.groupby('month')['count'].mean()).reset_index().sort_values(by='count', ascending=False)
sn.barplot(data=meanMonthly, x='month', y='count', ax=axes[0])
axes[0].set(xlabel='month', ylabel='count')

hoursSeasonly = pd.DataFrame(trainData.groupby(['hour', 'season'], sort=True)['count'].mean()).reset_index()
sn.pointplot(x=hoursSeasonly['hour'], y=hoursSeasonly['count'], hue=hoursSeasonly['season'], data=hoursSeasonly, join=True, ax=axes[1])
axes[1].set(xlabel='hour', ylabel='count')

hoursDayly = pd.DataFrame(trainData.groupby(['hour','weekday'], sort=True)['count'].mean()).reset_index()
sn.pointplot(x=hoursDayly['hour'], y=hoursDayly['count'], hue=hoursDayly['weekday'], data=hoursDayly, join=True,ax=axes[2])
axes[2].set(xlabel='hour', ylabel='count')

hoursMonthly = pd.DataFrame(trainData.groupby(['hour', 'month'], sort=True)['count'].mean()).reset_index()
sn.pointplot(x=hoursMonthly['hour'], y=hoursMonthly['count'], hue=hoursMonthly['month'], data=hoursMonthly, join=True, ax=axes[3])
axes[3].set(xlabel='hour', ylabel='count')

With this information we choose the list of features to use for training:

In [None]:
X = trainData.drop(['date', 'temp', 'casual', 'registered', 'logcount', 'datetime', 'count'], axis=1)

season_df = pd.get_dummies(trainData['season'], prefix='s', drop_first=True)
weather_df = pd.get_dummies(trainData['weather'], prefix='w', drop_first=True)
hour_df = pd.get_dummies(trainData['hour'], prefix='h', drop_first=True)
weekday_df = pd.get_dummies(trainData['weekday'], prefix='d', drop_first=True)
month_df = pd.get_dummies(trainData['month'], prefix='m', drop_first=True)

X = X.join(season_df)
X = X.join(weather_df)
X = X.join(hour_df)
X = X.join(weekday_df)
X = X.join(month_df)

X = X.values
Y=trainData['logcount'].values
print(X.shape)

testX = testData.drop(['date', 'temp', 'datetime'], axis=1)

season_df = pd.get_dummies(testData['season'], prefix='s', drop_first=True)
weather_df = pd.get_dummies(testData['weather'], prefix='w', drop_first=True)
hour_df = pd.get_dummies(testData['hour'], prefix='h', drop_first=True)
weekday_df = pd.get_dummies(testData['weekday'], prefix='d', drop_first=True)
month_df = pd.get_dummies(testData['month'], prefix='m', drop_first=True)

testX = testX.join(season_df)
testX = testX.join(weather_df)
testX = testX.join(hour_df)
testX = testX.join(weekday_df)
testX = testX.join(month_df)

testX = testX.values
print(testX.shape)
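
The cell above builds the dummy columns for train and test separately, which only lines up when both sets contain the same category values. As a hedged sketch of one way to guard against a mismatch, the test dummies can be reindexed to the training columns. The frames `train_part` and `test_part` are hypothetical, and this step is not part of the original pipeline:

```python
import pandas as pd

# Hypothetical frames: the "test" one never contains weather category 4.
train_part = pd.DataFrame({'weather': [1, 2, 3, 4]})
test_part = pd.DataFrame({'weather': [1, 2, 3, 1]})

train_dummies = pd.get_dummies(train_part['weather'], prefix='w', drop_first=True)
test_dummies = pd.get_dummies(test_part['weather'], prefix='w', drop_first=True)

# Reindex the test dummies to the training columns;
# categories missing from the test set become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_dummies.shape)
```

Comparing the two printed shapes in the cell above (the number of columns of `X` and `testX`) is a quick check that no such mismatch occurred.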

Now we train some models:

In [None]:
clf=xgb.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4,colsample_bytree=0.6,subsample=0.6)
clf.fit(X,Y)

pred = clf.predict(testX)
pred = np.expm1(pred)

submission = pd.DataFrame({
        "datetime": timeColumn,
        "count": pred
    })
submission.to_csv('XGBwithFE.csv', index=False)

We can also use some sklearn models and tune their hyperparameters:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer


The competition metric is RMSLE, so to tune a regressor we should use a custom loss function:

In [None]:
def loss_func(truth, prediction):
    # Both arguments are on the log1p scale; convert them back to counts first
    truth = np.expm1(truth)
    prediction = np.expm1(prediction)
    # RMSLE: root mean squared difference of log(1 + count)
    log1 = np.log1p(truth)
    log2 = np.log1p(prediction)
    return np.sqrt(np.mean((log1 - log2)**2))
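
Since the models here are trained on `log1p(count)`, RMSLE on the original counts is the same number as plain RMSE on the log-scale values. The sketch below checks this on a few made-up counts; `rmsle`, `actual` and `predicted` are illustrative names, not part of the notebook:

```python
import numpy as np

def rmsle(actual_counts, predicted_counts):
    """Root mean squared logarithmic error on the original count scale."""
    diff = np.log1p(predicted_counts) - np.log1p(actual_counts)
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical actual and predicted rental counts.
actual = np.array([10.0, 200.0, 35.0])
predicted = np.array([12.0, 180.0, 40.0])

# The same values on the log scale that the models are trained on.
log_actual, log_predicted = np.log1p(actual), np.log1p(predicted)
rmse_on_logs = np.sqrt(np.mean((log_predicted - log_actual) ** 2))

print(rmsle(actual, predicted), rmse_on_logs)
```

The two printed numbers agree up to floating-point error, which is the reason for training on the log-transformed target in the first place.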

In [None]:
param_grid = {
    'n_estimators': [50, 80, 100, 120],
    'max_depth': [None, 1, 2, 5],
    'max_features': ['sqrt', 'log2', 1.0]  # 1.0 (all features) replaces 'auto', removed in scikit-learn 1.3
}

scorer = make_scorer(loss_func, greater_is_better=False)

regr = RandomForestRegressor(random_state=42)

rfr = GridSearchCV(regr, param_grid, cv=4, scoring=scorer, n_jobs=4).fit(X, Y)
print('\tParams:', rfr.best_params_)
print('\tScore:', rfr.best_score_)
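
Because the scorer is built with `greater_is_better=False`, scikit-learn negates the loss so that a higher score is always better. That is why the reported best score is negative. A minimal sketch with a toy loss and a `DummyRegressor` (all names and data below are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def rmse(truth, prediction):
    return np.sqrt(np.mean((truth - prediction) ** 2))

# greater_is_better=False makes the scorer return the negated loss.
scorer = make_scorer(rmse, greater_is_better=False)

# Hypothetical data; DummyRegressor always predicts the training mean.
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy='mean').fit(X_demo, y_demo)

score = scorer(model, X_demo, y_demo)
print(score)
```

To read a grid search result as a loss, flip the sign of `best_score_`.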

In [None]:
pred = rfr.predict(testX)
pred = np.expm1(pred)

submission = pd.DataFrame({
        "datetime": timeColumn,
        "count": pred
    })
submission.to_csv('RandomForest.csv', index=False)

The code below takes too long to run. It usually returns the following:
('Params:', {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 4})
('Score:', -0.09649149584358846)

In [None]:
#
#param_grid = {
#    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
#    'n_estimators': [100, 1000, 1500, 2000, 4000],
#    'max_depth': [1, 2, 3, 4, 5, 8, 10]
#}
#
#scorer = make_scorer(loss_func, greater_is_better=False)
#
#gb = GradientBoostingRegressor(random_state=42)
#
#gbr = GridSearchCV(gb, param_grid, cv=4, scoring=scorer, n_jobs=3).fit(X, Y)
#print('\tParams:', gbr.best_params_)
#print('\tScore:', gbr.best_score_)

gbr = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.01, max_depth=4)

gbr.fit(X, Y)

In [None]:
pred = gbr.predict(testX)
pred = np.expm1(pred)

submission = pd.DataFrame({
        "datetime": timeColumn,
        "count": pred
    })
submission.to_csv('GradientBoost.csv', index=False)