This notebook is less about data exploration and more about data preparation and model building.
Many other notebooks already handle data exploration wonderfully, with excellent visuals.
I will nevertheless present the general findings and some personal ideas regarding the data.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing

Read & append train and test

In [None]:
train = pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv')
test = pd.read_csv('/kaggle/input/bike-sharing-demand/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
data = pd.concat([train, test], sort=False)

In [None]:
data.head()

Extract date features from datetime

In [None]:
# Parse the timestamp once, then read each calendar part through the .dt accessor
data['datetime'] = pd.to_datetime(data['datetime'])
data['Day'] = data['datetime'].dt.day
data['Month'] = data['datetime'].dt.month
data['Year'] = data['datetime'].dt.year
data['Hour'] = data['datetime'].dt.hour
data['weekday'] = data['datetime'].dt.weekday

Now I will get rid of some columns that are not useful:
- atemp - highly correlated with temp
- registered, casual - not used as the predicted variable; "count" will be used instead
- season - as long as Month is present, season does not add any value.
- Day - the train data only has days 1 to 19, while the test data has days 20 to 31. The model would have to predict on day values it never saw during training, which I do not expect it to do well.

Leaving some of these variables in will most likely overfit the model. It can nevertheless bring a higher public score, since the models are evaluated with a single metric.

This could be the case for workingday, which I decided to keep although it is mostly explained by weekday and holiday: workingday 0 means a Saturday, a Sunday or a holiday.
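
To illustrate the redundancy, here is a minimal sketch on a made-up four-row frame (the rows and the `workingday_derived` column name are assumptions, not part of the dataset). It rebuilds a working-day flag from weekday and holiday alone:

```python
import pandas as pd

# Hypothetical toy rows: a Friday, a Saturday, a Sunday and a Monday flagged as a holiday.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-14", "2011-01-15", "2011-01-16", "2011-01-17"]),
    "holiday": [0, 0, 0, 1],
})

# pandas weekday: Monday=0 ... Sunday=6
toy["weekday"] = toy["datetime"].dt.weekday

# A working day is a Monday-to-Friday date that is not a holiday.
toy["workingday_derived"] = ((toy["weekday"] < 5) & (toy["holiday"] == 0)).astype(int)
print(toy)
```

If the same rule reproduces the real workingday column, the column carries no information beyond weekday and holiday.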

In [None]:
df=data.drop(['registered','casual','atemp','Day','season'],axis=1)

Replace the 0's in the windspeed column or not?
I prefer not to replace them, as there might be a reasonable explanation in the way the measurements were taken that leads to 0's.

In [None]:
df[df["windspeed"]==0].head()

No significant **correlation** between continuous variables.

In [None]:
%config InlineBackend.figure_format = 'retina'
corr1 = df[['temp','humidity','windspeed']].corr() 
mask = np.zeros_like(corr1)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr1, vmax=0.8, mask=mask, square=True,annot=True)

**Data preprocessing**

- Year - encode as the integers 0 and 1
- Continuous variable normalisation - When data is made up of attributes with varying scales, many machine learning algorithms benefit from rescaling the attributes to the same scale. The attributes are rescaled into the range between 0 and 1, so they end up on the same order of magnitude. Note that min-max scaling does not remove outliers; it only compresses all values into the same range.
For random forests in particular this will not add much, as tree-based models split on thresholds and are largely insensitive to the scale of the features.
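
As a small illustration of both points, the sketch below uses made-up numbers (the arrays `x` and `y` are assumptions). It checks that `MinMaxScaler` computes `(x - min) / (max - min)` and that a decision tree fitted on raw or scaled values gives the same predictions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Made-up feature values and targets, for illustration only.
x = np.array([[2.0], [10.0], [18.0], [42.0]])
y = np.array([5.0, 20.0, 35.0, 80.0])

# Min-max scaling by the library and by hand.
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# A tree only cares about the ordering of feature values, so rescaling
# changes the split thresholds but not the resulting partitions.
tree_raw = DecisionTreeRegressor(random_state=0).fit(x, y)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(scaled, y)
same_predictions = np.allclose(tree_raw.predict(x), tree_scaled.predict(scaled))

print(scaled.ravel(), same_predictions)
```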



In [None]:
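lb = preprocessing.LabelBinarizer()
# With two classes, fit_transform returns an (n, 1) array; flatten it to a single column
df['Year'] = lb.fit_transform(df['Year']).ravel()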
cont=['temp','humidity','windspeed']
feat=df[cont]
minmax_scale = preprocessing.MinMaxScaler().fit(feat.values)
df[cont] = minmax_scale.transform(feat.values)

In [None]:
df.head()

In [None]:
df[df["windspeed"]==0].head()

**Categorize variables and create dummy variables** for holiday, weather, workingday and Year. This is because some machine learning algorithms would otherwise read the integer codes of categorical values as ordered magnitudes.
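
A minimal sketch of what dummy encoding does, on a made-up `weather` column (the toy frame is an assumption, not the real data):

```python
import pandas as pd

# Toy column with three weather codes; the codes are labels, not quantities.
toy = pd.DataFrame({"weather": [1, 2, 3, 1]})

# Each distinct code becomes its own indicator column.
dummies = pd.get_dummies(toy, columns=["weather"])
print(dummies)
```

Each row has exactly one indicator switched on, so the encoding no longer implies that weather 3 is "three times" weather 1.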

In [None]:
categ=['holiday','weather','Month','Year','Hour','weekday','workingday']
for var in categ:
    df[var] = df[var].astype("category")
df = pd.get_dummies(data=df, columns=['holiday','weather','workingday','Year'])
df.head()

**Outliers**

Normalisation and transformation of the predicted variable will reduce the effect of these outliers, so I will not eliminate them. I also prefer to keep some variability in the train data, as it might better reflect the variability of the test data.

**Splitting** the train and test data.

In [None]:
df_train = df[pd.notnull(df['count'])].sort_values(by=['datetime'])
df_test = df[~pd.notnull(df['count'])].sort_values(by=['datetime'])

In [None]:
df_train=df_train.drop(['datetime'],axis=1)
test=df_test.drop(['datetime','count'],axis=1)

**Transformation** of the predicted variable

* As the dependent variable is highly skewed, I will apply a log transformation to it. With a log-transformed target, large absolute differences between high counts are penalised less, because errors are effectively measured in relative terms. This also matters because the evaluation metric is RMSLE.
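
The link between the log transform and the metric can be checked on made-up numbers (the arrays below are assumptions): RMSLE on the raw counts is the same as plain RMSE on `log1p` of the counts, so a model trained on the transformed target minimises a loss aligned with the metric.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up true and predicted counts, for illustration only.
y_true = np.array([1.0, 10.0, 100.0, 500.0])
y_pred = np.array([2.0, 8.0, 120.0, 400.0])

# RMSLE computed on the original count scale.
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# RMSE computed after applying log1p to both sides.
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(rmsle, rmse_log)
```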


In [None]:

df_train['count'].plot(kind="hist", bins=100)

The large skew is evident.

In [None]:
df_train['count'] = np.log1p(df_train['count'])

**Random Forest**

In [None]:
X = df_train.drop(['count'],axis=1)
y=df_train['count']
X.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1)
rfr = RandomForestRegressor(n_estimators = 100)
rfr.fit(X_train, y_train)
pred = rfr.predict(X_test)
sns.scatterplot(x = y_test, y = pred)

Predictions look reasonably good. Let's calculate the RMSLE score.

**The RMSLE score**

In [None]:
from sklearn import metrics
print('RMSLE:', np.sqrt(metrics.mean_squared_log_error(np.expm1(y_test), np.expm1(pred))))

**Gradient Boost**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
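# Note: alpha only affects the 'huber' and 'quantile' losses; with the default loss it has no effect
gbm = GradientBoostingRegressor(n_estimators=2000,alpha=0.01)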
gbm.fit(X_train,y_train)
preds = gbm.predict(X_test)
print('RMSLE:', np.sqrt(metrics.mean_squared_log_error(np.expm1(y_test), np.expm1(preds))))

**Combine methods**

In [None]:
algo_gbm = gbm.predict(X_test)
algo_rf = rfr.predict(X_test)
algo_mean =np.expm1(algo_gbm)*0.9 + np.expm1(algo_rf)*0.1
print('RMSLE:', np.sqrt(metrics.mean_squared_log_error(np.expm1(y_test), algo_mean)))
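
Note that the blend is a weighted mean taken on the original count scale (after `np.expm1`), not in log space. A minimal sketch of the same arithmetic, where the helper name `blend_counts` and the example inputs are assumptions made for illustration:

```python
import numpy as np

def blend_counts(log_pred_a, log_pred_b, weight_a=0.9):
    """Weighted mean of two log1p-scale predictions, taken on the count scale."""
    counts_a = np.expm1(np.asarray(log_pred_a, dtype=float))
    counts_b = np.expm1(np.asarray(log_pred_b, dtype=float))
    return weight_a * counts_a + (1.0 - weight_a) * counts_b

# Two hypothetical models predicting 9 and 19 rentals (given in log1p space).
blended = blend_counts(np.log1p([9.0]), np.log1p([19.0]), weight_a=0.9)
print(blended)
```

The 0.9 / 0.1 weights mirror the ones used in this notebook; other weights could be compared on the validation split in the same way.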

**Prepare and submit**

In [None]:
algo_gbm_tst = gbm.predict(test)
algo_rf_tst = rfr.predict(test)
algo_mean_tst =np.expm1(algo_gbm_tst)*0.9 + np.expm1(algo_rf_tst)*0.1

In [None]:
submission = pd.DataFrame({'datetime':df_test['datetime'],'count':algo_mean_tst})
submission.head()

In [None]:
submission.to_csv('Submission.csv',index=False)