In [None]:
Trying to figure out how to implement a ML pipeline, starting with the individual parts.

In [26]:
%matplotlib inline

import os
import pickle
import numpy as np
import pandas as pd
from datetime import datetime

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report


	- instant: record index
	- dteday : date
	- season : season (1:springer, 2:summer, 3:fall, 4:winter)
	- yr : year (0: 2011, 1:2012)
	- mnth : month ( 1 to 12)
	- hr : hour (0 to 23)
	- holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
	- weekday : day of the week
	- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
	+ weathersit : 
		- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
	- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
	- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
	- hum: Normalized humidity. The values are divided to 100 (max)
	- windspeed: Normalized wind speed. The values are divided to 67 (max)
	- casual: count of casual users
	- registered: count of registered users
	- cnt: count of total rental bikes including both casual and registered

In [19]:
df = pd.read_csv('~/Desktop/py/data/day.csv')
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
instant,731.0,366.0,211.165812,1.0,183.5,366.0,548.5,731.0
season,731.0,2.49658,1.110807,1.0,2.0,3.0,3.0,4.0
yr,731.0,0.500684,0.500342,0.0,0.0,1.0,1.0,1.0
mnth,731.0,6.519836,3.451913,1.0,4.0,7.0,10.0,12.0
holiday,731.0,0.028728,0.167155,0.0,0.0,0.0,0.0,1.0
weekday,731.0,2.997264,2.004787,0.0,1.0,3.0,5.0,6.0
workingday,731.0,0.683995,0.465233,0.0,0.0,1.0,1.0,1.0
weathersit,731.0,1.395349,0.544894,1.0,1.0,1.0,2.0,3.0
temp,731.0,0.495385,0.183051,0.05913,0.337083,0.498333,0.655417,0.861667
atemp,731.0,0.474354,0.162961,0.07907,0.337842,0.486733,0.608602,0.840896


Date: String to datetime to ordinal.

In [20]:
df['date'] = df['dteday'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df['date'] = df['date'].apply(lambda x: x.toordinal())
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,date
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985,734138
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801,734139
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349,734140
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562,734141
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600,734142


Dropping based on a priori knowledge. cnt is our target, dteday is in the wrong format, instant is the index, and casual and registered would of course be collinear with total rides. This is our pre-split X matrix.

Our y vector (target) is just cnt.

In [21]:
X = df.drop(['cnt', 'dteday', 'instant', 'casual', 'registered'], axis=1)
y = df.cnt
X.head()

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,date
0,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,734138
1,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,734139
2,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,734140
3,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,734141
4,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,734142


In [6]:
#pipeline = work on later when better understood
'''
features = Pipeline([
    ('scaling', StandardScaler()),
    ('feature_selection', SelectFromModel(LassoCV()))
])

'''

"\nfeatures = Pipeline([\n    ('scaling', StandardScaler()),\n    ('feature_selection', SelectFromModel(LassoCV()))\n])\n\n"

Train-test-split then fit the scaler on the training data. Use the fitted scaler to transform the train and test X matrices.

In [23]:
# Seems like scaling target isn't necessary
# https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
# Split
X_train, X_test, y_train, y_test = tts(X, y, train_size=0.33)
# Fit the scaler on TRAINING data - important!
ss = StandardScaler().fit(X_train)
# Now can scale training and test data based on training data
X_train_std = ss.transform(X_train)
X_test_std = ss.transform(X_test)
X_train_std




array([[-1.28839265, -0.9958592 , -1.47871842, ..., -0.83942658,
        -0.21461332, -1.70233918],
       [ 1.40382668, -0.9958592 ,  1.06881156, ...,  0.02426597,
        -0.11593013, -0.25771417],
       [-0.39098621, -0.9958592 , -0.91260064, ..., -1.56136205,
         0.29618567, -1.32166097],
       ...,
       [ 1.40382668, -0.9958592 ,  1.06881156, ..., -0.28077592,
        -0.1314123 , -0.32116054],
       [-1.28839265,  1.00415802, -1.47871842, ...,  0.56240416,
        -1.0878426 ,  0.08880062],
       [-1.28839265, -0.9958592 , -1.47871842, ...,  1.28183255,
        -0.31816126, -1.73162212]])

Fit a lasso model and instantiate SFM using its results. 

Use SFM to further transform the X matrix (selecting columns). 

Use sfm.get_support() to see which variables are selected.

In [24]:
lasso = LassoCV().fit(X_train_std, y_train)
sfm = SelectFromModel(lasso, prefit=True)
X_train_new = sfm.transform(X_train_std)
print(X_train_new.shape)
# What variables did lasso select?
print(X.columns[sfm.get_support()])


(241, 12)
Index(['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'date'],
      dtype='object')




Apply the variable selection from fitting SFM to the training data, to the test data. 

Fit another LassoCV model (again? did I just do this twice?). 

Use that fitted model to predict y_hat using the transformed X_test data. 

Print MSE.

In [27]:
X_test_new = X_test[X.columns[sfm.get_support()]]
model = LassoCV()
model.fit(X_train_new, y_train)
y_hat = model.predict(X_test_new)
print("MSE: {}".format(mean_squared_error(y_test, y_hat)))

MSE: 1.0712427968803574e+17




Not sure about the interpretation of MSE here (seems really high, but I have no baseline)