# ML Lab
In this notebook, I use the UCI bikeshare data to demonstrate preparing data and creating pipelines for machine learning.

The focus here is on feature selection.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='darkgrid', palette='colorblind')
pd.options.display.max_rows = None
pd.options.display.max_columns = None

	- instant: record index
	- dteday : date
	- season : season (1:spring, 2:summer, 3:fall, 4:winter)
	- yr : year (0: 2011, 1:2012)
	- mnth : month ( 1 to 12)
	- hr : hour (0 to 23)
	- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
	- weekday : day of the week
	- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
	+ weathersit : 
		- 1: Clear, Few clouds, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
	- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
	- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
	- hum: Normalized humidity. The values are divided by 100 (max)
	- windspeed: Normalized wind speed. The values are divided by 67 (max)
	- casual: count of casual users
	- registered: count of registered users
	- cnt: count of total rental bikes including both casual and registered
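Taking the scaling described above at face value (plain division by the stated maximum), a normalized value can be mapped back to physical units by multiplying. The helper name `to_celsius` is hypothetical, and the sample value is the mean of atemp from the summary table in this notebook.

```python
# Hypothetical helper: undo the "divided by 50 (max)" scaling listed for atemp.
# Assumes the scaling is a plain division, as the data dictionary states.
ATEMP_MAX_C = 50

def to_celsius(atemp_scaled, max_c=ATEMP_MAX_C):
    """Convert a normalized 'feels like' temperature back to degrees Celsius."""
    return atemp_scaled * max_c

mean_atemp_scaled = 0.474354          # mean of atemp from df.describe()
mean_atemp_c = to_celsius(mean_atemp_scaled)
print(round(mean_atemp_c, 2))         # 23.72
```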

# 0. Load data

In [2]:
# What does our data look like?
df = pd.read_csv('~/Desktop/py/data/day.csv')
df.shape

(731, 16)

In [3]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
instant,731.0,366.0,211.165812,1.0,183.5,366.0,548.5,731.0
season,731.0,2.49658,1.110807,1.0,2.0,3.0,3.0,4.0
yr,731.0,0.500684,0.500342,0.0,0.0,1.0,1.0,1.0
mnth,731.0,6.519836,3.451913,1.0,4.0,7.0,10.0,12.0
holiday,731.0,0.028728,0.167155,0.0,0.0,0.0,0.0,1.0
weekday,731.0,2.997264,2.004787,0.0,1.0,3.0,5.0,6.0
workingday,731.0,0.683995,0.465233,0.0,0.0,1.0,1.0,1.0
weathersit,731.0,1.395349,0.544894,1.0,1.0,1.0,2.0,3.0
temp,731.0,0.495385,0.183051,0.05913,0.337083,0.498333,0.655417,0.861667
atemp,731.0,0.474354,0.162961,0.07907,0.337842,0.486733,0.608602,0.840896


Since we want to focus on total rides ('cnt'), we'll want to drop casual and registered.

We can also drop the datetime variable 'dteday' and keep 'instant' as a quasi-time variable (increments by 1 every day).

We may also want to drop one of temp and atemp, since the two are highly collinear. Of the two, atemp is probably the more useful: what the temperature feels like is more likely than the measured temperature to affect a person's decision whether or not to bike.
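One quick way to check the collinearity claim is the Pearson correlation between the two columns. The sketch below uses a small synthetic frame (`toy`, made up for illustration) so that it runs on its own; it does not reproduce the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df[['temp', 'atemp']]: atemp is built as a noisy
# linear function of temp, purely to illustrate the check.
rng = np.random.default_rng(0)
temp = rng.uniform(0.05, 0.85, size=200)
atemp = 0.9 * temp + rng.normal(0, 0.02, size=200)
toy = pd.DataFrame({'temp': temp, 'atemp': atemp})

# Pearson correlation; values near 1 indicate near-duplicate information.
corr = toy['temp'].corr(toy['atemp'])
print(f"corr(temp, atemp) = {corr:.3f}")
```

On the real data the equivalent call would be `df['temp'].corr(df['atemp'])`.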

# 1. Select our y variable
We are interested in total rides, so we choose 'cnt'.

In [8]:
y_cols = ['casual', 'registered', 'cnt']
y = df[y_cols[-1]]

# 2. Select our X for further processing
Drop unneeded variables

In [9]:
# Exclude the targets, the raw date, and temp (collinear with atemp)
drop_cols = y_cols + ['dteday', 'temp']
feature_cols = [col for col in df.columns if col not in drop_cols]
Xraw = df[feature_cols]
Xraw.shape

(731, 11)

# 3. Train-test-split

In [10]:
from sklearn.model_selection import train_test_split as tts

Xtrain, Xtest, ytrain, ytest = tts(Xraw, y, test_size=0.2)

Xtrain.head()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,atemp,hum,windspeed
402,403,1,1,2,0,2,1,1,0.361104,0.49625,0.147379
270,271,4,0,9,0,3,1,2,0.575158,0.84875,0.148629
433,434,1,1,3,0,5,1,2,0.397083,0.407083,0.4148
680,681,4,1,11,0,0,0,1,0.421713,0.659167,0.1275
349,350,4,0,12,0,5,1,2,0.359825,0.500417,0.260575
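Because the split above is random, the sampled rows will differ from run to run. A minimal sketch of pinning the split with `random_state` follows; the seed value 42 and the toy arrays are arbitrary assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)   # stand-in for Xraw
y_toy = np.arange(20)                  # stand-in for y

# The same seed yields the same partition on every call.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))  # True
print(a_train.shape, a_test.shape)     # (16, 2) (4, 2)
```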


# 4. Preprocessing pipeline

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

processpipe = Pipeline([
    ('poly', PolynomialFeatures(2, include_bias=False)),
    ('scaler', StandardScaler()),
])

# Run train and test X matrices through pipeline
Xtrain_scaled = processpipe.fit_transform(Xtrain)
Xtest_scaled = processpipe.transform(Xtest)

# Get all of the feature names created by poly. The same fitted step names the
# columns for both matrices. (get_feature_names_out replaces get_feature_names,
# which was removed in scikit-learn 1.2.)
feature_names = processpipe.named_steps['poly'].get_feature_names_out(Xtrain.columns)

Xtrain = pd.DataFrame(Xtrain_scaled, columns=feature_names)
Xtest = pd.DataFrame(Xtest_scaled, columns=feature_names)
print(Xtrain.shape, Xtest.shape)

(584, 77) (147, 77)
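The 77 columns follow from simple counting: with 11 input features and degree 2 (no bias term), there are 11 linear terms, 11 squares, and C(11, 2) = 55 pairwise interactions. A small self-contained check, using a dummy array in place of the real data:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n = 11                              # number of raw features in Xraw
expected = n + n + comb(n, 2)       # linear + squared + pairwise interaction terms

# A dummy array is enough here, since only the output width matters.
poly = PolynomialFeatures(2, include_bias=False)
n_out = poly.fit_transform(np.zeros((3, n))).shape[1]

print(expected, n_out)              # 77 77
```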


# 5. Feature selection
Now that we have interaction/squared terms and scaled variables, we can perform feature selection.

First we can try ElasticNet and Lasso using SelectFromModel and see how they compare to f_regression.

In [16]:
'''Elastic Net with Select From Model'''
from sklearn.linear_model import ElasticNetCV
from sklearn.feature_selection import SelectFromModel

alphas = np.logspace(-10, 0, 200)
l1_ratio = [.1, .5, .7, .9, .95, .99, 1]

en = ElasticNetCV(l1_ratio=l1_ratio, alphas=alphas, fit_intercept=False)
sfm = SelectFromModel(en)
Xtrain_en = sfm.fit_transform(Xtrain, ytrain)
Xtest_en = sfm.transform(Xtest)

mask = sfm.get_support()
selected_columns_en = Xtrain.columns[mask]
Xtrain_en = pd.DataFrame(Xtrain_en, columns=selected_columns_en)
Xtest_en = pd.DataFrame(Xtest_en, columns=selected_columns_en)
print(selected_columns_en)
print("Elastic Net selects", len(selected_columns_en), "features")



Index(['instant', 'season', 'yr', 'weathersit', 'atemp', 'windspeed',
       'instant season', 'instant yr', 'instant weekday', 'instant workingday',
       'instant atemp', 'season yr', 'season atemp', 'yr^2', 'yr weekday',
       'yr workingday', 'yr atemp', 'mnth^2', 'mnth weathersit', 'mnth atemp',
       'weekday atemp', 'workingday atemp', 'weathersit^2', 'weathersit atemp',
       'weathersit hum', 'weathersit windspeed', 'atemp^2', 'atemp hum',
       'atemp windspeed', 'hum^2', 'hum windspeed', 'windspeed^2'],
      dtype='object')
32
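How many features SelectFromModel keeps depends on its threshold: a feature survives when the absolute value of its coefficient is at least the threshold. When `threshold` is left unset, the default depends on the estimator (and possibly on the scikit-learn version), so it is worth inspecting the fitted `threshold_` attribute rather than assuming a value. The sketch below uses synthetic data from `make_regression` and a plain `Lasso` with an arbitrary `alpha`; none of it comes from the bikeshare data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, of which only 5 carry signal.
X_toy, y_toy = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=1.0, random_state=0)

sfm_toy = SelectFromModel(Lasso(alpha=1.0)).fit(X_toy, y_toy)

# threshold_ is the cutoff actually applied after fitting.
print("threshold used:", sfm_toy.threshold_)
print("kept:", sfm_toy.get_support().sum(), "of", X_toy.shape[1])

# The support mask can be rebuilt by hand from the fitted coefficients.
manual_mask = np.abs(sfm_toy.estimator_.coef_) >= sfm_toy.threshold_
print(np.array_equal(manual_mask, sfm_toy.get_support()))
```

Passing an explicit value, for example `SelectFromModel(en, threshold=1e-5)`, would put two different estimators on the same footing when comparing how many features each keeps.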


In [None]:
'''Elastic Net with RFE - THIS CELL TAKES DAYS TO RUN

from sklearn.linear_model import ElasticNetCV
from sklearn.feature_selection import RFECV

alphas = np.logspace(-10, 10, 200)
l1_ratio = [.1, .5, .7, .9, .95, .99, 1]

en = ElasticNetCV(fit_intercept=False)
rfe = RFECV(en)
Xtrain_enrfe = rfe.fit_transform(Xtrain, ytrain)
Xtest_enrfe = rfe.transform(Xtest)

mask = rfe.get_support()
selected_columns_enrfe = Xtrain.columns[mask]
Xtrain_enrfe = pd.DataFrame(Xtrain_enrfe, columns=selected_columns_enrfe)
Xtest_enrfe = pd.DataFrame(Xtest_enrfe, columns=selected_columns_enrfe)
print(selected_columns_enrfe)
'''

In [25]:
'''Lasso with Select From Model'''
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

alphas = np.logspace(-10, 0, 200)

la = LassoCV(alphas=alphas, fit_intercept=False)
sfm = SelectFromModel(la)
Xtrain_la = sfm.fit_transform(Xtrain, ytrain)
Xtest_la = sfm.transform(Xtest)

mask = sfm.get_support()
selected_columns_la = Xtrain.columns[mask]
Xtrain_la = pd.DataFrame(Xtrain_la, columns=selected_columns_la)
Xtest_la = pd.DataFrame(Xtest_la, columns=selected_columns_la)
print(selected_columns_la)
print("Lasso selects", len(selected_columns_la), "features")



Index(['instant', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
       'weathersit', 'atemp', 'hum', 'windspeed', 'instant^2', 'instant mnth',
       'instant holiday', 'instant weekday', 'instant workingday',
       'instant weathersit', 'instant atemp', 'instant hum',
       'instant windspeed', 'season^2', 'season yr', 'season holiday',
       'season weekday', 'season workingday', 'season weathersit',
       'season atemp', 'season windspeed', 'yr^2', 'yr mnth', 'yr holiday',
       'yr weekday', 'yr weathersit', 'mnth^2', 'mnth weekday',
       'mnth workingday', 'mnth atemp', 'mnth hum', 'holiday^2',
       'holiday weekday', 'holiday weathersit', 'holiday atemp', 'holiday hum',
       'holiday windspeed', 'weekday workingday', 'weekday atemp',
       'weekday hum', 'weekday windspeed', 'workingday^2', 'workingday atemp',
       'workingday hum', 'workingday windspeed', 'weathersit^2',
       'weathersit atemp', 'weathersit windspeed', 'atemp^2', 'atemp hum',
      



In [38]:
print("What variables do both select?")
print(set(selected_columns_en).intersection(selected_columns_la))

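# Note: difference() is one-directional, so this lists features that
# Elastic Net selected and Lasso did not (not the reverse).
print("\nWhat variables are selected by one but not the other?")
print(set(selected_columns_en).difference(selected_columns_la))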

print("\nWhat variables are in one or another but not both?")
print(set(selected_columns_en).symmetric_difference(selected_columns_la))

la_en_selected = list(set(selected_columns_en).intersection(selected_columns_la))

What variables do both select?
{'instant workingday', 'instant weekday', 'weathersit windspeed', 'weathersit atemp', 'season', 'instant', 'weathersit^2', 'yr', 'hum^2', 'hum windspeed', 'mnth atemp', 'atemp', 'atemp hum', 'yr^2', 'weekday atemp', 'yr weekday', 'mnth^2', 'season yr', 'windspeed', 'season atemp', 'weathersit', 'atemp windspeed', 'instant atemp', 'windspeed^2', 'atemp^2', 'workingday atemp'}

What variables are selected by one but not the other?
{'instant yr', 'instant season', 'mnth weathersit', 'weathersit hum', 'yr atemp', 'yr workingday'}

What variables are in one or another but not both?
{'holiday weekday', 'season workingday', 'holiday', 'instant holiday', 'workingday hum', 'holiday^2', 'instant mnth', 'season weekday', 'workingday^2', 'mnth', 'instant yr', 'workingday windspeed', 'instant^2', 'yr weathersit', 'instant season', 'season windspeed', 'instant weathersit', 'mnth weathersit', 'mnth hum', 'holiday atemp', 'instant hum', 'weathersit hum', 'yr atemp', 'yr 
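The three set operations used above answer different questions, and `difference` in particular depends on which set is on the left. A toy illustration with made-up names (`en_toy`, `la_toy`):

```python
en_toy = {'a', 'b', 'c'}          # stand-in for the Elastic Net picks
la_toy = {'b', 'c', 'd', 'e'}     # stand-in for the Lasso picks

both = en_toy & la_toy            # intersection: chosen by both
only_en = en_toy - la_toy         # difference: in en_toy but not in la_toy
only_la = la_toy - en_toy         # the reverse difference
either_not_both = en_toy ^ la_toy # symmetric difference

print(sorted(both))               # ['b', 'c']
print(sorted(only_en))            # ['a']
print(sorted(only_la))            # ['d', 'e']
print(sorted(either_not_both))    # ['a', 'd', 'e']
```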

Now to try f_regression to see how the results turn out.

In [36]:
'''f_regression with SelectKBest'''

from sklearn.feature_selection import f_regression, SelectKBest

skb = SelectKBest(f_regression, k=30)
skb.fit(Xtrain, ytrain)
Xtrain_fr = skb.transform(Xtrain)
Xtest_fr = skb.transform(Xtest)

mask = skb.get_support()
selected_columns_fr = Xtrain.columns[mask]
Xtrain_fr = pd.DataFrame(Xtrain_fr, columns=selected_columns_fr)
Xtest_fr = pd.DataFrame(Xtest_fr, columns=selected_columns_fr)
print(selected_columns_fr)
print("F_regression selects",len(selected_columns_fr),"features (by design)")

Index(['instant', 'season', 'yr', 'atemp', 'instant^2', 'instant season',
       'instant yr', 'instant mnth', 'instant weekday', 'instant workingday',
       'instant atemp', 'instant hum', 'instant windspeed', 'season^2',
       'season yr', 'season atemp', 'yr^2', 'yr mnth', 'yr weekday',
       'yr workingday', 'yr weathersit', 'yr atemp', 'yr hum', 'yr windspeed',
       'mnth atemp', 'weekday atemp', 'workingday atemp',
       'weathersit windspeed', 'atemp^2', 'atemp hum'],
      dtype='object')
F_regression selects 30 features (by design)
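For context on what SelectKBest is ranking: f_regression scores each feature on its own, converting its Pearson correlation r with the target into an F statistic, F = r² / (1 − r²) · (n − 2). The sketch below checks that relationship on synthetic data (default `center=True`); the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0,
                               random_state=0)
F, pvals = f_regression(X_toy, y_toy)

# Recompute the statistic by hand from per-feature correlations.
n = X_toy.shape[0]
r = np.array([np.corrcoef(X_toy[:, j], y_toy)[0, 1]
              for j in range(X_toy.shape[1])])
F_manual = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(F, F_manual))
```

Since each feature is scored in isolation, this method ignores redundancy between features, which is one reason its picks can differ from those of the penalized models.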




In [40]:
print("What variables do all 3 select?")
print(set(selected_columns_fr).intersection(la_en_selected))

selected_columns = list(set(selected_columns_fr).intersection(la_en_selected))

What variables do all 3 select?
{'instant', 'yr', 'season atemp', 'weekday atemp', 'season yr', 'instant workingday', 'instant weekday', 'mnth atemp', 'weathersit windspeed', 'atemp', 'instant atemp', 'atemp^2', 'workingday atemp', 'atemp hum', 'yr^2', 'yr weekday', 'season'}
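Converting a set to a list gives no guaranteed ordering, so the column order of `selected_columns` may vary between runs. One possible way to keep the original column order is to filter the full column list by membership; the names below are toy stand-ins, not the real selections.

```python
cols = ['instant', 'season', 'yr', 'atemp', 'hum']   # stand-in for Xtrain.columns
picks_en = ['instant', 'yr', 'atemp', 'hum']         # hypothetical selections
picks_la = ['season', 'yr', 'atemp', 'instant']
picks_fr = ['atemp', 'instant', 'yr']

# Intersect all three, then restore the original column order.
common = set(picks_en) & set(picks_la) & set(picks_fr)
ordered = [c for c in cols if c in common]

print(ordered)   # ['instant', 'yr', 'atemp']
```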


To do: add more feature selection. I continue on with model fitting in the bikeshare_ml_fitting.ipynb notebook.