In [1]:
import sys
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\BikeSharing/train.csv", header=0, on_bad_lines='skip')
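
In pandas 1.3 and later, the `on_bad_lines` argument of `read_csv` controls what happens to malformed rows (older releases used `error_bad_lines`). The sketch below shows the effect on a small in-memory CSV. The three rows and the name `sample` are made up for illustration and are not taken from train.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the middle row has one field too many.
raw = io.StringIO(
    "datetime,count\n"
    "2011-01-01 00:00:00,16\n"
    "2011-01-01 01:00:00,40,extra\n"
    "2011-01-01 02:00:00,32\n"
)

# Rows with too many fields are dropped instead of raising a parser error.
sample = pd.read_csv(raw, header=0, on_bad_lines="skip")
print(len(sample))
```

Skipping rows silently can hide data problems, so it may be worth comparing the row count against the raw file once.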

#### Engineer features for the model

Split the datetime field into date and time.

In [3]:
# engineer date and time features
temp = pd.DatetimeIndex(data['datetime'])
data['date'] = temp.date
data['time'] = temp.time

A closer look at the time feature shows that every reading is on the hour, so let's create an hour feature instead.

In [4]:
# create a feature called hour, taken straight from the parsed timestamps
data['hour'] = temp.hour
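
The claim that every reading is on the hour can be checked before the minutes and seconds are discarded. This is a minimal sketch on a toy frame; `toy`, `stamps` and `on_the_hour` are illustrative names, and the three timestamps are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "datetime": [
        "2011-01-01 00:00:00",
        "2011-01-01 01:00:00",
        "2011-01-01 02:00:00",
    ]
})
stamps = pd.to_datetime(toy["datetime"])

# True only if no timestamp carries minutes or seconds
on_the_hour = bool(((stamps.dt.minute == 0) & (stamps.dt.second == 0)).all())

# The hour can then be read directly through the .dt accessor
toy["hour"] = stamps.dt.hour
print(on_the_hour, toy["hour"].tolist())
```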

The date tells us what day of the week it is, so let's create a feature dayofweek. It also tells us how much time has passed since data collection began, which could be useful if rental volumes grow over time, so let's also create a feature dateDays.

In [5]:
# there appears to be a general increase in rentals over time, so days from start should be captured
dates = pd.to_datetime(data['date'])
data['dateDays'] = (dates - dates.iloc[0]).dt.days

# create a categorical feature for day of the week (0=Monday to 6=Sunday)
data['dayofweek'] = pd.DatetimeIndex(data.date).dayofweek
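
Both date features can be verified by hand on a few rows. The sketch below uses three invented timestamps and illustrative names (`toy`, `days`); it assumes days are counted from the first row, as in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "datetime": [
        "2011-01-01 00:00:00",  # a Saturday
        "2011-01-03 05:00:00",  # a Monday
        "2011-01-09 23:00:00",  # a Sunday
    ]
})

# normalize() sets each timestamp to midnight, so differences are whole days
days = pd.to_datetime(toy["datetime"]).dt.normalize()

toy["dateDays"] = (days - days.iloc[0]).dt.days   # days since the first row
toy["dayofweek"] = days.dt.dayofweek              # 0=Monday ... 6=Sunday
print(toy[["dateDays", "dayofweek"]])
```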

We can look more closely at which days of the week impact bike rental volumes. For casual users, volumes are much greater on Saturday and Sunday. For registered users, volumes are lower on both weekend days, and lowest on Sunday. So let’s create binary features that flag whether a day is Saturday or Sunday.

In [6]:
byday = data.groupby('dayofweek')
byday['casual'].sum().reset_index()

| dayofweek | casual |
|---|---|
| 0 | 46288 |
| 1 | 35365 |
| 2 | 34931 |
| 3 | 37283 |
| 4 | 47402 |
| 5 | 100782 |
| 6 | 90084 |


In [7]:
byday['registered'].sum().reset_index()

| dayofweek | registered |
|---|---|
| 0 | 249008 |
| 1 | 256620 |
| 2 | 257295 |
| 3 | 269118 |
| 4 | 255102 |
| 5 | 210736 |
| 6 | 195462 |
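
The two user groups are easier to compare when both totals sit in one table. A possible sketch, using a small invented frame rather than the real data (`toy`, `totals` and `casual_share` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "dayofweek":  [0, 0, 5, 5, 6],
    "casual":     [3, 5, 20, 30, 25],
    "registered": [40, 60, 15, 25, 10],
})

# Sum both columns per weekday in a single groupby
totals = toy.groupby("dayofweek")[["casual", "registered"]].sum()

# Share of casual rentals per weekday makes the weekend shift explicit
totals["casual_share"] = totals["casual"] / (totals["casual"] + totals["registered"])
print(totals)
```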


In [8]:
# create binary features which show if day is Saturday/Sunday
# (a direct comparison avoids chained assignment and its SettingWithCopyWarning)
data['Saturday'] = (data['dayofweek'] == 5).astype(int)
data['Sunday'] = (data['dayofweek'] == 6).astype(int)
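
Chained indexing such as `df.col[mask] = 1` may write to a temporary copy, which is what pandas' `SettingWithCopyWarning` is about. Two safer patterns are sketched below on a toy frame; the frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"dayofweek": [4, 5, 6, 0]})

# Option 1: turn the boolean comparison straight into 0/1
toy["Saturday"] = (toy["dayofweek"] == 5).astype(int)

# Option 2: initialise the column, then assign through .loc in one step
toy["Sunday"] = 0
toy.loc[toy["dayofweek"] == 6, "Sunday"] = 1

print(toy)
```

Both options index the frame once, so the assignment always lands on the original DataFrame.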


##### Finally, let’s remove redundant fields and see what we are left with.

In [9]:
# remove old data features
dataRel = data.drop(['datetime', 'count','date','time','dayofweek'], axis=1)

### Vectorize features

All features must be in a vectorized format for the scikit-learn models. We can achieve this by converting the pandas DataFrame to a list of per-row dictionaries, and then using a DictVectorizer from there.

We will split continuous and categorical features for now, so that we can prepare them differently in the next section.

In [10]:
# put continuous features into a dictionary
featureConCols = ['temp','atemp','humidity','windspeed','dateDays','hour']
dataFeatureCon = dataRel[featureConCols]
dataFeatureCon = dataFeatureCon.fillna( 'NA' ) #in case I missed any
X_dictCon = dataFeatureCon.to_dict(orient='records')  # one dict per row

In [11]:
# put categorical features into a dictionary
featureCatCols = ['season','holiday','workingday','weather','Saturday', 'Sunday']
dataFeatureCat = dataRel[featureCatCols]
dataFeatureCat = dataFeatureCat.fillna( 'NA' ) #in case I missed any
X_dictCat = dataFeatureCat.to_dict(orient='records')  # one dict per row


In [13]:
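# vectorize features (one vectorizer per group, so each keeps its own fitted feature names)
from sklearn.feature_extraction import DictVectorizer
vec_cat = DictVectorizer(sparse=False)
vec_con = DictVectorizer(sparse=False)
X_vec_cat = vec_cat.fit_transform(X_dictCat)
X_vec_con = vec_con.fit_transform(X_dictCon)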
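
It helps to know which column ends up where after vectorizing. With its default settings, `DictVectorizer` sorts feature names alphabetically, so the column order need not match the order of the original list. A small sketch with two invented rows (`rows`, `vec_demo` and `X_demo` are illustrative names):

```python
from sklearn.feature_extraction import DictVectorizer

rows = [
    {"temp": 9.84, "hour": 0, "humidity": 81},
    {"temp": 9.02, "hour": 1, "humidity": 80},
]

vec_demo = DictVectorizer(sparse=False)
X_demo = vec_demo.fit_transform(rows)

# feature_names_ lists the columns of X_demo in order
print(vec_demo.feature_names_)
print(X_demo)
```

Keeping the fitted vectorizer around also means new rows can later be passed through `transform` with the same column layout.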

### Standardize continuous features

Continuous features should be standardized to have zero mean and unit variance. This stops features with large numeric ranges from dominating models that are sensitive to feature scale.

In [17]:
# standardize data - zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_vec_con)
X_vec_con = scaler.transform(X_vec_con)
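
Standardization subtracts each column's mean and divides by its standard deviation. The NumPy sketch below reproduces that arithmetic on an invented 3x2 matrix; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Two toy columns on very different scales
X = np.array([[10.0, 0.0],
              [20.0, 50.0],
              [30.0, 100.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)        # population standard deviation (ddof=0)
X_std = (X - mean) / std   # column-wise z-scores

print(X_std)
```

After the transform both columns span the same range, so neither dominates purely because of its units.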

### Encode categorical features

For example, before encoding, the season field is one column containing 1, 2, 3 or 4. After encoding, the season field is represented by four binary fields, one for each of the options 1, 2, 3 and 4.

In [19]:
# encode categorical features
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(X_vec_cat)
X_vec_cat = enc.transform(X_vec_cat).toarray()
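
The season example can be worked through by hand. This is a plain NumPy sketch of one-hot encoding for a single column, not the `OneHotEncoder` internals; the five season values are invented.

```python
import numpy as np

season = np.array([1, 2, 4, 3, 1])

# categories: sorted distinct values; codes: position of each row's value in categories
categories, codes = np.unique(season, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to its season.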

### Combine all features

After combining all features into one vector, we are ready to start training models!

In [20]:
# combine cat & con features
X_vec = np.concatenate((X_vec_con,X_vec_cat), axis=1)

Below is a view of the vectorized feature matrix. The first 6 columns are standardized continuous features; the rest are encoded categorical features.

In [23]:
X_vec

array([[-1.09273697, -1.70912256, -1.66894356, ...,  0.        ,
         1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.52434128, ...,  0.        ,
         1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.379739  , ...,  0.        ,
         1.        ,  0.        ],
       ..., 
       [-0.91395927,  1.70183906,  1.36770431, ...,  0.        ,
         0.        ,  1.        ],
       [-0.73518157,  1.70183906,  1.51230659, ...,  0.        ,
         0.        ,  1.        ],
       [-0.82486544,  1.70183906,  1.65690887, ...,  0.        ,
         0.        ,  1.        ]])
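
Because the continuous block is concatenated first, its columns can be recovered by slicing on the column count. A toy sketch with invented values and illustrative names (`X_con_demo`, `X_cat_demo`, `n_con`):

```python
import numpy as np

X_con_demo = np.array([[-1.2, 0.3],
                       [0.4, -0.7]])       # 2 standardized columns
X_cat_demo = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0]])   # 3 one-hot columns

# axis=1 places the blocks side by side; row counts must match
X_demo = np.concatenate((X_con_demo, X_cat_demo), axis=1)

n_con = X_con_demo.shape[1]
print(X_demo.shape)        # (rows, continuous + categorical columns)
print(X_demo[:, :n_con])   # continuous block
print(X_demo[:, n_con:])   # categorical block
```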

### Vectorize targets

Finally, we must also create target vectors for the model: one for registered rentals and one for casual rentals.

In [24]:
# vectorize targets
Y_vec_reg = dataRel['registered'].values.astype(float)
Y_vec_cas = dataRel['casual'].values.astype(float)
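
Before training, it is worth confirming that each target vector has one entry per row of the feature matrix. A minimal sketch with invented numbers; `toy`, `y_reg`, `y_cas` and `X_demo` are illustrative stand-ins for the real arrays.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"registered": [13, 32, 27], "casual": [3, 8, 5]})
X_demo = np.zeros((3, 2))  # placeholder feature matrix with 3 rows

# One float target vector per user group
y_reg = toy["registered"].to_numpy(dtype=float)
y_cas = toy["casual"].to_numpy(dtype=float)

# Targets and features must line up row for row
rows_match = len(y_reg) == len(y_cas) == X_demo.shape[0]
print(rows_match, y_reg.dtype)
```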