# **Bike sharing demand - brief data analysis**
#### In this section I'll try to get an idea about the data, correlations, disproportions etc.

#### I'll try to choose the variables I'll use as an input, maybe produce some new ones.

#### I'll also work out the train-validation split.

Some of the graphs posted here were directly inspired by [Chun's notebook](https://www.kaggle.com/chun1182/bike-shared-xgboost), [Vivek Srinivasan's notebook](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile) and [Raj Mehrotra's notebook](https://www.kaggle.com/rajmehra03/bike-sharing-demand-rmsle-0-3194/).

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
%matplotlib inline
sns.set(style='whitegrid',color_codes=True)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# sampleSubmission = pd.read_csv("../input/bike-sharing-demand/sampleSubmission.csv")
test = pd.read_csv("../input/bike-sharing-demand/test.csv")
train = pd.read_csv("../input/bike-sharing-demand/train.csv")
df = train.copy()
test_df = test.copy()

In [None]:
# Let's look at first 5 rows of train data
df.head()

In [None]:
df.describe()

## **Some basic things to remember:**
* All features are numeric.
 
* They are not balanced, which may be an issue, TODO.
 
* Datetime is treated as ID.

* We only care about total count. *However, discriminating registered and casual users may prove helpful? (as with the whales;) TODO verify.*
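
Before treating the two user groups separately, it is worth confirming that the total really is the sum of `casual` and `registered`. A minimal sketch on a hand-made toy frame (the values are invented for illustration; on the real data the same check would run on `df`):

```python
import pandas as pd

# Toy rows with the same column names as train.csv (values are made up).
toy = pd.DataFrame({
    "casual":     [3, 8, 5, 0],
    "registered": [13, 32, 27, 1],
    "count":      [16, 40, 32, 1],
})

# True if the total equals the sum of the two user groups in every row.
is_consistent = (toy["casual"] + toy["registered"]).eq(toy["count"]).all()

# Share of casual riders per row: a quick look at how the user mix varies.
casual_share = toy["casual"] / toy["count"]

print(is_consistent)
print(casual_share.round(2).tolist())
```

If the check holds, predicting the two groups separately and summing the predictions is a valid way to obtain the total count.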

In [None]:
# (safety first)
df.isnull().sum()

Disclaimer: the following statistics are plain row counts per category and do not reflect any correlation with demand.

In [None]:
# by season
print("season")
print(df.season.value_counts())
# catplot replaces the removed factorplot; 'height' replaces the old 'size' argument
sns.catplot(x='season',data=df,kind='count',height=3,aspect=1)
# by holiday
print("holiday")
print(df.holiday.value_counts())
sns.catplot(x='holiday',data=df,kind='count',height=3,aspect=1) # majority of data is for non holiday days.
# by working day
print("working day")
print(df.workingday.value_counts())
sns.catplot(x='workingday',data=df,kind='count',height=3,aspect=1) # majority of data is for working days.
# by weather
print("weather (1 is the clearest)")
print(df.weather.value_counts())
sns.catplot(x='weather',data=df,kind='count',height=3,aspect=1)


In [None]:
# distribution and outliers of continuous variables
sns.boxplot(data=df[['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']])
fig=plt.gcf()
fig.set_size_inches(9,9)
# or maybe let's see the histograms
df.temp.unique()
fig,axes=plt.subplots(2,2)
axes[0,0].hist(x="temp",data=df,edgecolor="black",linewidth=2,color='#ff4125')
axes[0,0].set_title("Variation of temp")
axes[0,1].hist(x="atemp",data=df,edgecolor="black",linewidth=2,color='#ff4125')
axes[0,1].set_title("Variation of atemp")
axes[1,0].hist(x="windspeed",data=df,edgecolor="black",linewidth=2,color='#ff4125')
axes[1,0].set_title("Variation of windspeed")
axes[1,1].hist(x="humidity",data=df,edgecolor="black",linewidth=2,color='#ff4125')
axes[1,1].set_title("Variation of humidity")
fig.set_size_inches(10,10)

And now what we all have been waiting for...

In [None]:
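# correlation matrix of the numeric columns only (datetime is still a string here)
cor_mat = df.select_dtypes(include=np.number).corr()
# the matrix is symmetric, so hide the upper triangle and keep the lower half and diagonal
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)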

# Ok, so..
1. Atemp and temp correlate with the count to a very similar extent, and they are strongly correlated with each other, so let's get rid of one of them. Based on my intuition, I'll keep atemp. (I'd say the relationship with demand is most likely non-linear though: extreme temperatures are probably bad, so some combination of temperature and season is probably what we care about.)
2. "Casual" and "registered" are not "factors" here (they are parts of the target), so ignore those.
3. Other important factors are humidity, season, weather and windspeed.
4. Casual users are more affected by temperature than registered ones. They're also slightly more discouraged by high humidity, bad weather and, a tiny bit, by the wind.
5. Casual users care less about the season though: my guess is that registered users switch to other transportation in the winter.
6. Casual users don't ride bikes on working days, as opposed to registered ones. On holidays the disproportion is smaller, but still: more casual bikers, fewer regulars.
7. Also, "humidity" and "weather" are pretty strongly correlated. Humidity has more impact and should be kept, but "weather" may not be a crucial factor here after all.

Note: We still have to check the hour's effect on the demand.
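
One way to look at the hour's effect is to average the count per hour, separately for working and non-working days. The sketch below uses a synthetic two-week frame with an invented demand pattern (commuter peaks on working days, a midday bump otherwise); it only illustrates the grouping, not the real data:

```python
import numpy as np
import pandas as pd

# Two weeks of hourly timestamps starting on a Monday (synthetic, not Kaggle data).
stamps = pd.Timestamp("2011-01-03") + pd.to_timedelta(np.arange(14 * 24), unit="h")
toy = pd.DataFrame({"datetime": stamps})
toy["hour"] = toy["datetime"].dt.hour
toy["workingday"] = (toy["datetime"].dt.dayofweek < 5).astype(int)

# Invented demand: peaks at 8 and 17 on working days, a midday bump on other days.
peak = np.where(
    toy["workingday"] == 1,
    toy["hour"].isin([8, 17]) * 200,
    toy["hour"].between(11, 15) * 120,
)
toy["count"] = 20 + peak

# Mean demand per hour, one column per value of workingday.
hourly = toy.pivot_table(index="hour", columns="workingday",
                         values="count", aggfunc="mean")
print(hourly.loc[[8, 12, 17]])
```

On the real data, the same `pivot_table` call on `df` (after the `hour` column is created) would show whether the hourly profile differs between the two day types.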

In [None]:
# extract calendar features from the timestamp, the same way for train and test
for frame in (df, test_df):
    stamps = pd.to_datetime(frame['datetime'])
    frame['hour'] = stamps.dt.hour
    frame['day'] = stamps.dt.day
    frame['month'] = stamps.dt.month
    frame['year'] = stamps.dt.year.map({2011:0, 2012:1})
# correlation matrix again, now including the new columns (numeric columns only)
cor_mat = df.select_dtypes(include=np.number).corr()
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)

Ok. So, we don't care about "day" (also shown in cell 17 [here](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile#Ensemble-Model---Gradient-Boost)). We do care about "month", but more or less in the same way as about "season".
We do, however, care about the hour: there is a strong correlation here. Also, it seems like the year matters; the service getting more popular with time might be an explanation.

So for now let's drop the inputs we decided are redundant.

In [None]:
df.drop('datetime',axis=1,inplace=True)
test_df.drop('datetime',axis=1,inplace=True)
df.drop('temp',axis=1,inplace=True)
test_df.drop('temp',axis=1,inplace=True)
# let's not drop day for now, we shall need it for the split
# df.drop('day',axis=1,inplace=True)
# test_df.drop('day',axis=1,inplace=True)
print(df.head())
print(test_df.head())

# Train-validation split
#### Now, let's split data into training and validation
The test data covers the rest of each month, from the 20th onward (roughly the last 10 days), while the train set consists of the first 19 days of each month. So for validation, let's take the first 4 days of each month; hopefully this way we can avoid domain shift. I'll just do it manually.
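
The day-based split described above can be sketched as a small helper. The function name `split_by_day` and the toy frame are assumptions made for illustration; the notebook itself does the same thing with two boolean filters on `df["day"]`:

```python
import numpy as np
import pandas as pd

def split_by_day(frame, last_valid_day=4, day_col="day"):
    """Rows with day <= last_valid_day go to validation, the rest to training."""
    is_valid = frame[day_col] <= last_valid_day
    return frame[~is_valid].copy(), frame[is_valid].copy()

# Toy frame: one row per day for days 1..19 of two months (invented data).
toy = pd.DataFrame({
    "month": np.repeat([1, 2], 19),
    "day": np.tile(np.arange(1, 20), 2),
})
toy["count"] = toy["day"] * 10

train_part, valid_part = split_by_day(toy)
valid_share = len(valid_part) / len(toy)
print(len(train_part), len(valid_part), round(valid_share, 3))
```

Holding out whole days, rather than random rows, keeps the validation set structurally similar to the test set, which also consists of whole days that the model never saw.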

In [None]:
valid_df = df[df["day"] < 5]
valid_df.head()

In [None]:
train_df = df[df["day"] >= 5]
train_df.head()

In [None]:
x_train = train_df.copy()
x_train.drop('count',axis=1,inplace=True)
x_train.drop("casual",axis=1,inplace=True)
x_train.drop("registered",axis=1,inplace=True)
x_valid = valid_df.copy()
x_valid.drop('count',axis=1,inplace=True)
x_valid.drop("casual",axis=1,inplace=True)
x_valid.drop("registered",axis=1,inplace=True)
y_train = train_df["count"]
y_valid = valid_df["count"]
print(x_train.head())
print(x_valid.head())
print(y_train.head())
print(y_valid.head())

# 1st model choice: RF and other regressors
Ok, judging by other people's notebooks, Random Forests seem like the way to go... let's compare them with some other regression models.
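
All models below are compared by RMSLE, the root mean squared logarithmic error: the square root of the mean of (log(1 + prediction) - log(1 + actual))². A minimal NumPy sketch of the metric follows; the helper name `rmsle` is chosen here for illustration, and the notebook itself uses `np.sqrt(mean_squared_log_error(...))` from scikit-learn for the same quantity:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# A perfect prediction has zero error.
perfect = rmsle([10, 100], [10, 100])

# Predicting double the true value costs a similar amount at low and high counts,
# because the metric works on log-ratios rather than absolute differences.
low_counts = rmsle([10], [20])
high_counts = rmsle([1000], [2000])

print(perfect, low_counts, high_counts)
```

Because the error is relative, mistakes in quiet night hours matter about as much as mistakes in rush hours, which is worth keeping in mind when reading the scores.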

In [None]:
from sklearn.linear_model import LinearRegression,Ridge,Lasso,RidgeCV
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error 

models=[RandomForestRegressor(),AdaBoostRegressor(),BaggingRegressor(),SVR(),KNeighborsRegressor()]
model_names=['RandomForestRegressor','AdaBoostRegressor','BaggingRegressor','SVR','KNeighborsRegressor']
rmsle=[]
for clf in models:
    clf.fit(x_train,y_train)
    valid_pred=clf.predict(x_valid)
    # arguments are (y_true, y_pred); the call raises a ValueError if any prediction is negative
    rmsle.append(np.sqrt(mean_squared_log_error(y_valid,valid_pred)))
d={'Modelling Algo':model_names,'RMSLE':rmsle}

In [None]:
rmsle_frame=pd.DataFrame(d)
rmsle_frame

Now, out of curiosity, I'll try fitting separate regressors for registered and casual users.

In [None]:
registered_train = train_df["registered"]
casual_train = train_df["casual"]

registered_valid = valid_df["registered"]
casual_valid = valid_df["casual"]

In [None]:
# registered
rmsle = []
reg_counts = []
cas_counts = []

for clf in models:
    clf.fit(x_train,registered_train)
    valid_pred = clf.predict(x_valid)
    reg_counts.append(valid_pred)
    rmsle.append(np.sqrt(mean_squared_log_error(registered_valid,valid_pred)))
d={'Modelling Algo':model_names,'RMSLE':rmsle}
print(pd.DataFrame(d))
# casual
rmsle = []
for clf in models:
    clf.fit(x_train,casual_train)
    valid_pred = clf.predict(x_valid)
    cas_counts.append(valid_pred)
    rmsle.append(np.sqrt(mean_squared_log_error(casual_valid,valid_pred)))
d={'Modelling Algo':model_names,'RMSLE':rmsle}
print(pd.DataFrame(d))
del d

In [None]:
# let's see total count error now
rmsle = []
for reg_c, cas_c in zip(reg_counts, cas_counts):
    total_pred = reg_c + cas_c  # element-wise sum of the two prediction arrays
    rmsle.append(np.sqrt(mean_squared_log_error(y_valid,total_pred)))
d={'Modelling Algo':model_names,'RMSLE':rmsle}
print("TOTAL COUNTS SCORE")
print(pd.DataFrame(d))
del d

Random forests do seem the best of these regression models. Let's try to find better parameters...

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
no_of_test=[200,300,400,500,600,700]
# 1.0 = use all features (what "auto" meant for regressors before it was removed from scikit-learn)
params_dict={'n_estimators':no_of_test,'n_jobs':[-1],'max_features':[1.0,'sqrt','log2']}
clf_rf=GridSearchCV(estimator=RandomForestRegressor(),param_grid=params_dict,scoring='neg_mean_squared_log_error')
clf_rf.fit(x_train,y_train)
print("RMSLE on training set")
pred_tr=clf_rf.predict(x_train)
print((np.sqrt(mean_squared_log_error(y_train,pred_tr))))
print("RMSLE on validation set")
pred=clf_rf.predict(x_valid)
print((np.sqrt(mean_squared_log_error(y_valid,pred))))
print(clf_rf.best_params_)
del pred

In [None]:
# registered
no_of_test=[300] # to simplify computations; experiments show it doesn't affect the score too much
params_dict={'n_estimators':no_of_test,'n_jobs':[-1],'max_features':[1.0,'sqrt','log2']} # 1.0 = all features (formerly "auto")
clf_rf_reg=GridSearchCV(estimator=RandomForestRegressor(),param_grid=params_dict,scoring='neg_mean_squared_log_error')
clf_rf_reg.fit(x_train,registered_train)
reg_pred=clf_rf_reg.predict(x_valid)
reg_train_pred = clf_rf_reg.predict(x_train)
print("REGISTERED USER COUNT ERROR")
print((np.sqrt(mean_squared_log_error(reg_pred,registered_valid))))
print(clf_rf_reg.best_params_)
# casual
no_of_test=[500]
params_dict={'n_estimators':no_of_test,'n_jobs':[-1],'max_features':[1.0,'sqrt','log2']} # 1.0 = all features (formerly "auto")
clf_rf_cas=GridSearchCV(estimator=RandomForestRegressor(),param_grid=params_dict,scoring='neg_mean_squared_log_error')
clf_rf_cas.fit(x_train,casual_train)
cas_pred=clf_rf_cas.predict(x_valid)
cas_train_pred = clf_rf_cas.predict(x_train)
print("CASUAL USER COUNT ERROR")
print((np.sqrt(mean_squared_log_error(cas_pred,casual_valid))))
print(clf_rf_cas.best_params_)
# total
# predictions are NumPy arrays, so they can be added element-wise
total_pred = reg_pred + cas_pred
total_train_pred = reg_train_pred + cas_train_pred
print("TOTAL COUNTS ERROR - validation")
print((np.sqrt(mean_squared_log_error(total_pred,y_valid))))
print("TOTAL COUNTS ERROR - training")
print((np.sqrt(mean_squared_log_error(total_train_pred,y_train))))

# 2nd model: Now let's try MLP

In [None]:
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(hidden_layer_sizes=(80, 80, 60, 60, 40), activation='relu', solver='adam', alpha=0.001, batch_size=40, learning_rate='adaptive', learning_rate_init=0.001, power_t=0.5, max_iter=1000, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=True, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=20)
clf.fit(x_train,y_train)
train_pred=clf.predict(x_train)
train_pred[train_pred < 0] = 0 #making sure there are no negative values
rmsle_tr = np.sqrt(mean_squared_log_error(abs(train_pred),y_train))
valid_pred=clf.predict(x_valid)
valid_pred[valid_pred < 0] = 0 #making sure there are no negative values
rmsle = np.sqrt(mean_squared_log_error(abs(valid_pred),y_valid))
print("RMSLE on validation")
print(rmsle)
print("RMSLE on training")
print(rmsle_tr)
print("nr of iterations")
print(clf.n_iter_)

Not too bad, let's see what changes when we train it separately for registered and casual users.

In [None]:
# registered
# clf_mlp_reg= MLPRegressor(hidden_layer_sizes=(80, 80, 60,40), activation='relu', solver='adam', alpha=0.0001, batch_size=40, learning_rate='adaptive', learning_rate_init=0.001, power_t=0.5, max_iter=1000, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=True, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)
# clone() gives a fresh, unfitted copy with the same hyperparameters,
# so the total-count model in clf is not overwritten by the fits below
from sklearn.base import clone
clf_mlp_reg = clone(clf)
clf_mlp_reg.fit(x_train,registered_train)
reg_pred=clf_mlp_reg.predict(x_valid)
reg_pred[reg_pred < 0] = 0 #making sure there are no negative values

print("REGISTERED USER COUNT ERROR")
print((np.sqrt(mean_squared_log_error(reg_pred,registered_valid))))
# casual
# clf_mlp_cas=MLPRegressor(hidden_layer_sizes=(80, 80, 60,40), activation='relu', solver='adam', alpha=0.0001, batch_size=40, learning_rate='adaptive', learning_rate_init=0.001, power_t=0.5, max_iter=1000, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=True, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)
clf_mlp_cas = clone(clf)
clf_mlp_cas.fit(x_train,casual_train)
cas_pred=clf_mlp_cas.predict(x_valid)
cas_pred[cas_pred < 0] = 0 #making sure there are no negative values
print("CASUAL USER COUNT ERROR")
print((np.sqrt(mean_squared_log_error(cas_pred,casual_valid))))
# total
total_pred = [cas_pred_single+reg_pred_single for reg_pred_single, cas_pred_single in zip(reg_pred, cas_pred)]
print("TOTAL COUNTS ERROR")
print((np.sqrt(mean_squared_log_error(total_pred,y_valid))))

Now let's export test predictions from the best model.

In [None]:
best_clf = clf_rf

pred=best_clf.predict(test_df)
d={'datetime':test['datetime'],'count':pred}
ans=pd.DataFrame(d)
ans.to_csv('answer_bestsinglemodel.csv',index=False) # saving to a csv file for predictions on kaggle.

best_cas_clf = clf_rf_cas
best_reg_clf = clf_rf_reg

pred_cas=best_cas_clf.predict(test_df)
pred_reg=best_reg_clf.predict(test_df)
pred = [pred_reg_s + pred_cas_s for pred_reg_s, pred_cas_s in zip(pred_reg, pred_cas)]
d={'datetime':test['datetime'],'count':pred}
ans=pd.DataFrame(d)
ans.to_csv('answer_besttwomodels.csv',index=False) # saving to a csv file for predictions on kaggle.


# Results
RMSLE:

On test set:
* single RF model: 0.49458
* two RF models (for casual and registered): 0.49267

...which corresponds to roughly the 1400th position on the leaderboard, out of 3242 participating teams.

On validation set (days 1-4 of each month, i.e. 4 of 19 training days, ca. 21% of all training data):
* single RF model: 0.38816
* two RF models (for casual and registered): 0.38387
* single MLP: 0.36067

On training set:
* single RF model: 0.15186
* two RF models (for casual and registered): 0.15151
* single MLP: 0.15044 



# Conclusions and future work
Here, unsurprisingly (if you looked at other people's work), Random Forests proved best suited. The striking thing is the difference between the scores for casual and registered users. The former's behaviour proved much harder to predict, both for the random forest regressor and for the MLP.

**About the difference between train, validation and test scores**: the differences between the sets are significant. The training and validation scores in particular are very different, which suggests overfitting. In the case of RF, changing the number of trees or limiting their depth (i.e. regularization) could probably mitigate this problem. The test and validation scores also differ, but here the difference is smaller. This seems to suggest domain shift, i.e. that the validation and test sets are too different from each other, despite my efforts to create them analogously. The fault could be that I didn't balance holidays between the test and validation sets: if, e.g., there were almost none in validation, it wasn't a representative sample for estimating generalization. Still, when I looked through other users' notebooks, I noticed that some of them had much bigger differences between the scores they claimed on validation and their positions on the leaderboard.
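
The holiday-balance suspicion above is easy to check by comparing the share of holiday rows in each split. A minimal sketch with two invented toy frames (the helper name `holiday_share` and all values are assumptions; on the real data the inputs would be `valid_df` and `test_df`):

```python
import pandas as pd

def holiday_share(frame, col="holiday"):
    """Fraction of rows flagged as holiday (the column holds 0/1 values)."""
    return float(frame[col].mean())

# Invented 0/1 holiday flags, only to illustrate the comparison.
valid_toy = pd.DataFrame({"holiday": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
test_toy = pd.DataFrame({"holiday": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]})

shares = pd.Series({
    "validation": holiday_share(valid_toy),
    "test": holiday_share(test_toy),
})
gap = abs(shares["validation"] - shares["test"])
print(shares)
print("absolute gap:", round(gap, 3))
```

A large gap would support the domain-shift explanation; the same comparison can be repeated for `workingday`, `weather` or `season`.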

### Random Forest

I believe the RF score may be significantly improved by hand-picking the input features more wisely, i.e. by maximizing the correlations between the input variables and the total bikers' count. As it is now, the only modifications made were dropping temp and creating numerical hour, day, month and year variables. It would probably be beneficial to, e.g., merge the "season" and "weather" categories into categories such as "spring and sunny", "spring and rainy", "winter and cloudy" etc.
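
The merged season/weather category could be built along these lines. This is only a sketch on a toy frame: the short labels in `season_names` and `weather_names` are my own shorthand for the dataset's numeric codes, and the column name `season_weather` is hypothetical.

```python
import pandas as pd

# Shorthand labels for the dataset's numeric codes (the label text is an assumption).
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
weather_names = {1: "clear", 2: "mist", 3: "light_precip", 4: "heavy_precip"}

# Toy rows; on the real data this would be applied to df and test_df alike.
toy = pd.DataFrame({"season": [1, 1, 4, 3], "weather": [1, 3, 2, 1]})

# One combined categorical feature, e.g. "spring_clear".
toy["season_weather"] = (
    toy["season"].map(season_names) + "_" + toy["weather"].map(weather_names)
)

# One-hot encode it so that tree models and the MLP can both consume it.
dummies = pd.get_dummies(toy["season_weather"], prefix="sw")
print(toy["season_weather"].tolist())
print(dummies.columns.tolist())
```

When encoding train and test separately, the resulting dummy columns need to be aligned, since a rare combination may appear in only one of the two sets.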

I performed a grid search, but it was not very thorough. Tuning more parameters over a wider range may improve the score too. Also, a randomized search (`RandomizedSearchCV`) often seems like a more reasonable idea than an exhaustive grid search, so it should probably be attempted as well.
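
A randomized search over a wider set of Random Forest parameters might look like the sketch below. The data is synthetic (Poisson counts driven by one feature), and the parameter ranges, `n_iter` and `cv` are small illustrative assumptions chosen so that the example runs quickly, not tuned recommendations:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic non-negative count data (an assumption, not the Kaggle data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = rng.poisson(lam=5 + 20 * X[:, 0])

# Distributions to sample from; depth and leaf size act as regularizers.
param_distributions = {
    "n_estimators": randint(10, 40),
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 10),
    "max_features": [1.0, "sqrt", "log2"],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                       # number of sampled parameter settings
    scoring="neg_mean_squared_log_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print("CV RMSLE:", np.sqrt(-search.best_score_))
```

Unlike a grid, the number of fits is fixed by `n_iter` regardless of how many parameters are included, so adding `max_depth` or `min_samples_leaf` does not multiply the cost.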

Another improvement could be achieved by making sure the inputs are well balanced, i.e. that the model treats the inputs with similar/proportional importance. Also, removing outliers (e.g. values outside the range of mean ± 2 standard deviations) may improve the regression.
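
The outlier rule mentioned above (keep only values within mean ± 2 standard deviations) can be sketched as follows. The helper name `drop_outliers` and the toy values are assumptions for illustration:

```python
import pandas as pd

def drop_outliers(frame, col, n_std=2.0):
    """Keep rows whose value in `col` lies within mean +/- n_std * std."""
    mean, std = frame[col].mean(), frame[col].std()
    keep = (frame[col] - mean).abs() <= n_std * std
    return frame[keep].copy()

# Toy counts with one obviously extreme value (invented data).
toy = pd.DataFrame({"count": [10, 12, 11, 13, 9, 10, 12, 11, 10, 500]})
cleaned = drop_outliers(toy, "count")

print(len(toy), "->", len(cleaned))
print(cleaned["count"].max())
```

Such filtering would only be applied to the training rows; validation and test rows have to stay untouched so that the scores remain comparable.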

### Multi-layer Perceptron

Intuition tells me that RF cares about having the right inputs (highly correlated with the target value and ideally not correlated with each other) more than MLP does. I was able to improve the MLP significantly by tweaking the architecture (number of layers and of neurons in the hidden layers), batch size, learning rate, number of epochs, early stopping etc., without touching the input features. The network should, by definition, eventually be able to adjust to the inputs, even to learn that some of them are not helpful after all, especially if we are using regularization (which I am here: the `alpha` parameter of `MLPRegressor` is an L2 penalty). That is why it remains my favourite method, even if it didn't achieve the best score in this case. Note that it probably would achieve the best score if I played around with all the hyperparameters.


NOTE: Obviously, more graphs could be created, but the notebooks cited above contain so many that I found it unnecessary to copy all of them here.