Here we analyze a bike sharing dataset from the following link:
https://www.kaggle.com/marklvl/bike-sharing-dataset

This dataset provides the number of bikes rented in DC along with other features such as time, weather conditions, and day of the week.
In this notebook we use the dataset to predict the number of rented bikes from the other features.
We first decide which features to treat as categorical and which as numerical.
Then we split the data into training and test sets.
We use a regression tree to predict the number of rented bikes.
Finally, we evaluate the performance of the model.
We run this process twice, once on the hourly data and once on the daily data.
On the daily data we found an anomaly whose cause turned out to be very interesting; it is discussed toward the end of this notebook.
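
Because the same split, fit, and evaluate steps run once per table, they could be wrapped in a single function. The following is a minimal sketch, not the code used in this notebook: `fit_and_score` is a hypothetical helper, and the small `demo` frame is synthetic data that only borrows the column names `hr`, `temp` and `cnt`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(df, target, drop_cols=(), seed=42):
    """Hypothetical helper: split df, fit a regression tree, return error metrics."""
    X = df.drop(columns=[target, *drop_cols])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed
    )
    model = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, y_pred),
        "median_ae": median_absolute_error(y_test, y_pred),
        "n_test": len(y_test),
    }


# Synthetic stand-in for the real tables (made-up values, illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "hr": rng.integers(0, 24, size=300),
    "temp": rng.random(300),
})
demo["cnt"] = 10 * demo["hr"] + (50 * demo["temp"]).round()

scores = fit_and_score(demo, target="cnt")
print(scores)
```

With the real tables, the call would presumably look like `fit_and_score(all_data, target='cnt', drop_cols=['casual', 'registered'])`, once for the hourly frame and once for the daily one.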

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

In [9]:
day = pd.read_csv('../input/bike-sharing-dataset/day.csv')#,parse_dates=['dteday'])
day.describe()

In [8]:
hour = pd.read_csv('../input/bike-sharing-dataset/hour.csv',parse_dates=['dteday'])
hour.describe()

In [4]:
hour.columns

Check whether any data is missing.

In [5]:
hour.isnull().sum().sum()

We will treat the following features as categorical:
'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit'
and the following as numerical:
'hr', 'temp', 'atemp', 'hum', 'windspeed'
The possible targets are 'casual', 'registered' and 'cnt'; the code below predicts 'cnt'.
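
Note that scikit-learn's decision trees expect numeric input and split on ordered thresholds, so casting a column to a string does not by itself make the tree treat it as categorical. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a tiny made-up frame (`toy` is not part of the dataset, only the column names are borrowed) and is an illustration rather than the approach taken in this notebook.

```python
import pandas as pd

# Tiny made-up frame using the dataset's column names (illustration only).
toy = pd.DataFrame({
    "weathersit": [1, 2, 3, 1],
    "holiday": [0, 0, 1, 0],
    "temp": [0.24, 0.22, 0.30, 0.41],
})

categorical = ["weathersit", "holiday"]

# One indicator column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)
print(encoded.columns.tolist())
```

Each level of a categorical feature becomes its own 0/1 column, so the tree can isolate a single level (for example one weather situation) with one split instead of relying on the numeric order of the codes.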

In [6]:
#Make a copy of hour to process
all_data = hour.copy()

In [7]:
categorical = ['yr','mnth','holiday','weekday','workingday','weathersit']

In [8]:
for feat in categorical:
    all_data[feat] = all_data[feat].astype(str)

In [9]:
all_data.dtypes

In [10]:
all_data.drop(['dteday', 'instant', 'season'], axis=1, inplace=True)

In [11]:
# Features are everything except the three count columns; the target is the total count
X_train, X_test, y_train, y_test = train_test_split(all_data.drop(['casual','registered','cnt'], axis=1), all_data['cnt'], test_size=0.33, random_state=42)

In [12]:
regr = DTR()
regr.fit(X_train,y_train)
y_pred = regr.predict(X_test)

In [13]:
mean_absolute_error(y_test,y_pred)

In [14]:
median_absolute_error(y_test,y_pred)

In [15]:
feature_importance = regr.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

In [16]:
#Relative Error
RE= abs(y_pred-y_test)/y_test

In [22]:
#Relative Error Histogram
sns.histplot(RE, kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()

In [23]:
#Make a copy of day to process
day_data = day.copy()

In [24]:
categorical = ['yr','mnth','holiday','weekday','workingday','weathersit']

In [25]:
for feat in categorical:
    day_data[feat] = day_data[feat].astype(str)

In [26]:
day_data.dtypes

In [27]:
day_data.drop(['dteday', 'instant', 'season'], axis=1, inplace=True)

In [28]:
# Features are everything except the three count columns; the target is the total count
X_train, X_test, y_train, y_test = train_test_split(day_data.drop(['casual','registered','cnt'], axis=1), day_data['cnt'], test_size=0.33, random_state=42)

In [29]:
regr = DTR()
regr.fit(X_train,y_train)
y_pred = regr.predict(X_test)

In [30]:
mean_absolute_error(y_test,y_pred)

In [31]:
median_absolute_error(y_test,y_pred)

In [32]:
feature_importance = regr.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

In [33]:
#Relative Error
RE= abs(y_pred-y_test)/y_test

In [35]:
#Relative Error Histogram
sns.histplot(RE, kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()

# One Anomaly Detected

In [36]:
RE[RE>10]

Only one point has a relative error greater than 10.

In [37]:
max_err_idx = (abs(y_pred-y_test)/y_test).idxmax(axis=0)
max_err_pos = y_test.index.get_loc(max_err_idx)
y_test.iloc[max_err_pos],y_pred[max_err_pos]

In [38]:
(y_pred[max_err_pos]-y_test.iloc[max_err_pos])/y_test.iloc[max_err_pos]

In [39]:
day.loc[max_err_idx,'dteday']

Hurricane Sandy struck on 29 October 2012, which explains the anomaly.
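
The lookup above can also be done in one table that keeps each test point's date, actual value, prediction, and relative error side by side. This is a minimal sketch on made-up numbers: `dates`, `y_true` and `y_hat` are hypothetical stand-ins for the date column, `y_test` and `y_pred`, and the threshold of 10 mirrors the filter used above.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the date column, y_test and y_pred.
idx = [10, 11, 12, 13]
dates = pd.Series(
    pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    index=idx,
)
y_true = pd.Series([4000, 3500, 20, 1000], index=idx)
y_hat = np.array([4200.0, 3300.0, 3000.0, 1100.0])

# Relative error per point, aligned on the original row index.
rel_err = (pd.Series(y_hat, index=idx) - y_true).abs() / y_true

report = pd.DataFrame({
    "date": dates,
    "actual": y_true,
    "predicted": pd.Series(y_hat, index=idx),
    "rel_err": rel_err,
})

# Rows whose relative error exceeds the threshold, worst first.
flagged = report[report["rel_err"] > 10].sort_values("rel_err", ascending=False)
print(flagged)
```

Keeping the original row index on every column avoids translating between index labels and array positions when tracing a large error back to its date.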