In this notebook, I will use Random Forest and XGBoost with some improvements to get an RMSLE score of about 0.26 on a validation set.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

from sklearn.model_selection import train_test_split,cross_validate,cross_val_score,cross_val_predict
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import MinMaxScaler,StandardScaler

**1. Import and preprocess data:**

The most important step is transforming the "datetime" column into hour, day of week, day of month, month and year features. The hour attribute is a very interesting feature, as we'll see. Also, categorical features like the weather and the season must be one-hot encoded.

In [None]:
#Read training/test set file
df1=pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv")
df2=pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv")

#Define preprocessing function to apply to train & test sets
def preprocess(df):
    #Work on a copy so the DataFrame passed in is not modified
    df=df.copy()
    #Parse datetime once, then extract hour,weekday,day,month,year
    dt=pd.to_datetime(df["datetime"])
    df["hour"]=dt.dt.hour
    df["dayofweek"]=dt.dt.dayofweek
    df["dayofmonth"]=dt.dt.day
    df["month"]=dt.dt.month
    df["year"]=dt.dt.year.map({2011:0, 2012:1})
    #One-hot encode weather and season
    df=pd.get_dummies(df,columns=["weather","season"],prefix=["weather","season"],drop_first=True,dtype=int)
    #Drop datetime and return
    return df.drop(["datetime"],axis=1)

df1=preprocess(df1)
df2=preprocess(df2)
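
One caveat when the train and test sets are encoded separately: pd.get_dummies only creates columns for the categories present in the frame it receives, so a category missing from one set leads to mismatched columns. Below is a minimal sketch of one way to guard against this, using toy frames whose values are made up for illustration (they are not taken from the competition data). The idea is to reindex the test columns on the training columns.

```python
import pandas as pd

# Toy frames (hypothetical values): weather category 4 is absent from the test frame
train = pd.DataFrame({"temp": [9.8, 14.0, 22.1, 30.3], "weather": [1, 2, 3, 4]})
test = pd.DataFrame({"temp": [10.7, 25.4, 18.0], "weather": [1, 2, 3]})

train_enc = pd.get_dummies(train, columns=["weather"], prefix=["weather"], drop_first=True, dtype=int)
test_enc = pd.get_dummies(test, columns=["weather"], prefix=["weather"], drop_first=True, dtype=int)

# test_enc has no "weather_4" column; reindex adds it, filled with zeros
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

In this notebook, such a reindex would be applied to the feature columns only, since the training set also contains the label columns.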

**2. Visualization**

We will plot the correlation matrix and the impact of some important features on the target. I chose 2 features (hour and temp) for the sake of simplicity.

In [None]:
#Visualize correlation matrix.
cor_mat=df1.corr()
#Boolean mask hiding the upper triangle (cells where the mask is True are not drawn)
mask=np.triu(np.ones_like(cor_mat,dtype=bool),k=1)
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(cor_mat,mask=mask,square=True,annot=True,cbar=True)

In [None]:
sns.barplot(x="hour",y="count",data=df1)

Here, you see 3 peaks (around 8:00, 12:00-13:00 and 17:00-18:00), corresponding to people who regularly use bikes to go to work or to lunch. This non-linearity makes it hard for linear models to capture the relationship between the hour feature and the label. Tree-based models such as random forest handle it naturally, and we will see that hour is by far the most important feature.

As you can see below, the target increases roughly linearly with temperature.

In [None]:
sns.lineplot(x="temp",y="count",data=df1)

**3. Making predictions**

Now, we are going to fit 2 algorithms, Random Forest and Extreme Gradient Boosting, on a 75/25 train/validation split.
Note that the evaluation metric is RMSLE (Root Mean Squared Logarithmic Error), which is the RMSE computed on log(1+y) instead of y. Since these regression models minimize a squared-error loss by default, training them on log-transformed labels optimizes a quantity that matches the metric much more closely. We'll see how this little trick improves the results.
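
To make the link between RMSLE and the label transform concrete, here is a small sketch checking that the RMSLE equals the RMSE computed after applying log1p to both labels and predictions. The arrays are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up counts and predictions, for illustration only
y_true = np.array([3.0, 40.0, 120.0, 0.0, 560.0])
y_pred = np.array([5.0, 35.0, 150.0, 1.0, 480.0])

# RMSLE computed from scikit-learn's mean squared log error
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# RMSE computed after applying log1p to both labels and predictions
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(rmsle, rmse_on_logs)
```

Both values agree, which is why a model fitted with a squared-error loss on log1p(count) is effectively trained on the competition metric.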

In [None]:
#Define features and labels
features=[x for x in df1.columns if x not in ["count","casual","registered"]]
label1="casual"
label2="registered"
label="count"

#Separate features from labels (labels are 1-D Series, the shape scikit-learn expects for y)
X_train,y1_train,y2_train,y_train=df1[features],df1[label1],df1[label2],df1[label]
X_test=df2[features]

#Validation set: X_train_0/y_train_0 are used for training, X_train_1/y_train_1 for validation
X_train_0,X_train_1,y_train_0,y_train_1=train_test_split(X_train,y_train,test_size=0.25,random_state=42)

In [None]:
#Train Random Forest Model
rf=RandomForestRegressor(n_estimators=500,n_jobs=-1)

rf.fit(X_train_0,y_train_0)
pred=rf.predict(X_train_1)

#Validation set error
print(mean_squared_log_error(y_train_1,pred)**0.5)

In [None]:
#Transforming labels: fit on log1p(count), then invert predictions with expm1
rf.fit(X_train_0,np.log1p(y_train_0))
pred=np.expm1(rf.predict(X_train_1))

#Validation set error
print(mean_squared_log_error(y_train_1,pred)**0.5)

So transforming labels lowers the RMSLE by 0.03 (0.32 -> 0.29) for a Random Forest regression model with 500 trees.
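
The log1p/expm1 round trip can also be delegated to scikit-learn's TransformedTargetRegressor, which transforms the labels before fitting and inverts the predictions automatically. The sketch below is not the code used in this notebook; it runs on synthetic stand-in data (a single made-up "hour"-like feature and Poisson counts) and uses a smaller forest to stay fast.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

# Synthetic stand-in data (not the bike-sharing set): counts growing with one feature
rng = np.random.default_rng(0)
X = rng.integers(0, 24, size=(400, 1)).astype(float)
y = rng.poisson(lam=10 + 5 * X[:, 0])

model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to the regressor's predictions
)
model.fit(X[:300], y[:300])
pred = model.predict(X[300:])

# Validation error on the held-out rows
rmsle = mean_squared_log_error(y[300:], pred) ** 0.5
print(rmsle)
```

This keeps the transform and its inverse in one object, so they cannot get out of sync between fitting and prediction.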

The feature importances stored in the fitted random forest show that hour accounts for around 76% of the total importance, while temp accounts for around 4%.

In [None]:
#Print most important features
print(pd.DataFrame({"Features": features,"Importance" : rf.feature_importances_.round(2)}).sort_values("Importance",ascending=False))

In [None]:
#Train Extreme Gradient Boosting Model

#Create D-matrices for training and validation sets

DM_0=xgb.DMatrix(X_train_0,y_train_0)
DM_1=xgb.DMatrix(X_train_1,y_train_1)

params={"booster":"gbtree"}
xgbr=xgb.train(dtrain=DM_0,num_boost_round=380,params=params)

pred = xgbr.predict(DM_1)

#Clip negative predictions to 0: mean_squared_log_error rejects negative values
pred[pred<0]=0
#Validation set error
print(mean_squared_log_error(y_train_1,pred)**0.5)

In [None]:
#Transforming labels: train on log1p(count); DM_1 is reused for its features only
DM_0=xgb.DMatrix(X_train_0,np.log1p(y_train_0))

params={"booster":"gbtree"}
xgbr=xgb.train(dtrain=DM_0,num_boost_round=380,params=params)

pred = np.expm1(xgbr.predict(DM_1))

pred[pred<0]=0

#Validation set error
print(mean_squared_log_error(y_train_1,pred)**0.5)

So transforming labels lowers the RMSLE by 0.23 (0.53 -> 0.30) for an XGBoost regression tree model!

By tuning some hyperparameters, we can obtain an RMSLE close to 0.26. This is hand-tuning, so a grid or random search may yield better performance.

In [None]:
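#Fine-tuning some hyperparameters (DM_0 still holds the log1p-transformed labels)
params={"booster":"gbtree","learning_rate":0.09,"n_jobs":-1,"subsample":0.7,"alpha": 0.009}

xgbr=xgb.train(dtrain=DM_0,num_boost_round=380,params=params)

pred = np.expm1(xgbr.predict(DM_1))
#Clip negative predictions to 0, as in the previous cells
pred[pred<0]=0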

#Validation set error
print(mean_squared_log_error(y_train_1,pred)**0.5)

Finally, note that the score obtained on submission may differ from the validation score. To go further, one should consider time series analysis, since the test set covers roughly the last 10 days of each month. The test set distribution is therefore not the same as the training set distribution.
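
One simple way to make the validation set resemble the test set more closely is to split by day of month rather than at random, validating on the latest days available. The sketch below illustrates the idea on a toy frame; split_by_day is a hypothetical helper, and the cutoff of 14 and the toy days 1 to 19 are assumptions chosen for illustration.

```python
import pandas as pd

def split_by_day(df, cutoff_day=14):
    """Train on days <= cutoff_day, validate on later days (hypothetical helper)."""
    train_part = df[df["dayofmonth"] <= cutoff_day]
    valid_part = df[df["dayofmonth"] > cutoff_day]
    return train_part, valid_part

# Toy frame with one row per day, mimicking the "dayofmonth" column built in preprocess
toy = pd.DataFrame({"dayofmonth": range(1, 20), "count": range(100, 119)})
train_part, valid_part = split_by_day(toy)

print(len(train_part), len(valid_part))
```

With such a split, the validation rows always come after the training rows within each month, which is closer to the situation faced at submission time.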