# Cyclic Factor Encoding

## Three Approaches:
1. Label Encoding (Not Recommended)
2. Mapping Factors onto the Unit Circle
3. One-Hot Encoding (also called Dummy Variables)

## To Do:
* Entity Embeddings for Factors (also called Categorical Embeddings)
* Time Series Analysis (Most Appropriate for Autocorrelated Data)

In [1]:
import os
os.getcwd()

'/home/tim/Documents/VSCode/Cyclic_Factor_Encoding'

In [2]:
# run once

# !curl -O 'https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip'
# !unzip 'Bike-Sharing-Dataset.zip' -d '/home/tim/Documents/VSCode/Cyclic_Factor_Encoding'

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('hour.csv')

df.columns.values

array(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday',
       'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum',
       'windspeed', 'casual', 'registered', 'cnt'], dtype=object)

In [4]:
df.shape

(17379, 17)

In [5]:
df.head()

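| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |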


In [6]:
# keep only the columns we need; .copy() returns an independent DataFrame
# focus on the cyclic factors ('mnth', 'hr') and the number of bikes rented ('cnt')
# create a simple two-factor model using ['mnth', 'hr'] to predict 'cnt'


df = df[['mnth','hr','cnt']].copy()

In [7]:
print('Unique values of month:', df.mnth.unique())
print('Unique values of hour:',  df.hr.unique())

Unique values of month: [ 1  2  3  4  5  6  7  8  9 10 11 12]
Unique values of hour: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]


### Label Encoding

#### (encode each factor with a single column of numbers)

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline

# Construct the pipeline with a standard scaler and a small neural network
# (no random_state is set on MLPRegressor, so scores vary slightly from run to run)
model = Pipeline([
    ('standardize', StandardScaler()),
    ('nn', MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000)),
])

# the pipeline bundles scaling and the model into one estimator, so both steps
# are refit on each training fold when the pipeline is cross-validated
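
To illustrate why scaling belongs inside the pipeline, here is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical, and `LinearRegression` stands in for the neural network only to keep the example fast; the point is that `cross_val_score` refits the whole pipeline, scaler included, on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 24, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

# Scaler and regressor travel together, so the scaler never sees the test fold
demo_model = make_pipeline(StandardScaler(), LinearRegression())

demo_scores = cross_val_score(
    demo_model, X_demo, y_demo,
    cv=KFold(n_splits=5),
    scoring='neg_mean_squared_error',
)
print(demo_scores)  # one negated MSE per fold
```

Scaling the full dataset before splitting would leak test-fold statistics into training; wrapping the scaler in the pipeline avoids that.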

In [9]:
features = ['mnth','hr']
X = df[features].values
y = df.cnt

In [10]:
# We'll use 5-fold cross-validation: the data is split into five folds, and each
# fold in turn serves as the 20% test set while the model trains on the other 80%.
# KFold does not shuffle by default, so the folds are consecutive blocks of rows.
# The five test sets are disjoint; the training sets overlap.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
kfold = KFold(n_splits=5)
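
The fold layout can be inspected directly. This sketch uses a hypothetical ten-row array (`X_toy`) rather than the bike data, and prints the train and test indices that `KFold(n_splits=5)` produces.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # hypothetical toy data: 10 rows

# Materialize the (train, test) index pairs for each of the five folds
splits = list(KFold(n_splits=5).split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, 'train:', train_idx, 'test:', test_idx)
```

Each test fold is a consecutive block of rows and no row is tested twice. Because the bike data is ordered in time, passing `shuffle=True` (with a `random_state`) to `KFold` would be one way to get randomized folds instead; that option is not used in this notebook.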

In [11]:
%%time
# use the cross_val_score function on the model, X, y, and cv=kfold
# scoring='neg_mean_squared_error' returns the negated MSE, so scores closer to 0 are better

results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
print('CV Scoring Result: mean=',np.mean(results),'std=',np.std(results))

CV Scoring Result: mean= -28784.0955880366 std= 10796.091569464059
CPU times: user 42.5 s, sys: 645 ms, total: 43.2 s
Wall time: 42 s


In [12]:
baseline_cv_score = np.mean(results) # save for later

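Not a great MSE, and the spread across folds is wide. Label encoding places hour 23 and hour 0 (and December and January) at opposite ends of a numeric scale, even though they are adjacent in the cycle.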
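
Because the scorer returns a negated MSE, it can help to convert the mean score into a root-mean-squared error, which is in the same units as `cnt`. A small sketch, using the mean score printed above (the variable names are illustrative):

```python
import numpy as np

mean_neg_mse = -28784.0955880366  # mean CV score printed above

# Flip the sign to recover the MSE, then take the square root
rmse = np.sqrt(-mean_neg_mse)
print(f'RMSE: {rmse:.1f} (in units of cnt, i.e. bikes per hourly record)')
```

The result is roughly 170, which is the typical size of a prediction error per hourly record under label encoding.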

### Map Factors onto the Unit Circle

In [13]:
# transform values to the corresponding (x, y) coordinates on a unit circle
# via the sin and cos ufuncs

df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
df['mnth_sin'] = np.sin((df.mnth-1)*(2.*np.pi/12))
df['mnth_cos'] = np.cos((df.mnth-1)*(2.*np.pi/12))
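
To see what this mapping buys, compare distances between hours before and after the transform. The helpers `circle_coords` and `circle_distance` below are hypothetical names introduced for this sketch; they apply the same sin/cos formula as the cell above.

```python
import numpy as np

def circle_coords(value, period):
    """Map a cyclic value onto (sin, cos) coordinates of the unit circle."""
    angle = 2.0 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(a, b, period):
    """Euclidean distance between two cyclic values after the mapping."""
    return np.linalg.norm(circle_coords(a, period) - circle_coords(b, period))

label_gap = abs(23 - 0)               # label encoding: 11 pm and midnight are 23 apart
d_23_0 = circle_distance(23, 0, 24)   # unit circle: 11 pm vs midnight
d_0_1 = circle_distance(0, 1, 24)     # unit circle: midnight vs 1 am
d_0_12 = circle_distance(0, 12, 24)   # unit circle: midnight vs noon

print(label_gap, d_23_0, d_0_1, d_0_12)
```

On the circle, 11 pm is exactly as close to midnight as 1 am is, and noon is the farthest point from midnight, which matches the actual structure of the cycle.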

In [14]:
%%time

features = ['mnth_sin','mnth_cos','hr_sin','hr_cos']
X = df[features].values

results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
print('CV Scoring Result: mean=',np.mean(results),'std=',np.std(results))

CV Scoring Result: mean= -22742.956898950797 std= 8266.53665241351
CPU times: user 2min 4s, sys: 736 ms, total: 2min 4s
Wall time: 2min 3s


In [15]:
unit_circle_cv_score = np.mean(results) # cv score after mapping factors onto the unit circle

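# baseline/new - 1 measures how much larger the baseline MSE is than the new MSE
# (equivalently, the new MSE is about 21% lower than the baseline)
print("The label-encoding (baseline) MSE is", (baseline_cv_score/unit_circle_cv_score - 1)*100, "% higher than the unit-circle MSE.")

The label-encoding (baseline) MSE is 26.56267923263971 % higher than the unit-circle MSE.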


### One-Hot Encoding

In [16]:
mnth_dummies = pd.get_dummies(df['mnth'])
hr_dummies = pd.get_dummies(df['hr'])
X = np.column_stack([mnth_dummies, hr_dummies])
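
For a concrete picture of what `pd.get_dummies` produces, here is a sketch on a hypothetical three-row frame (`toy`). It uses the `columns` and `dtype` arguments, which the cell above does not, to encode both factors in one call and return floats.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the bike data
toy = pd.DataFrame({'mnth': [1, 2, 12], 'hr': [0, 23, 0]})

# One indicator column per distinct value, prefixed with the source column name
encoded = pd.get_dummies(toy, columns=['mnth', 'hr'], dtype=float)

print(encoded.columns.tolist())
print(encoded.sum(axis=1).tolist())  # each row has exactly one 1 per factor
```

Only values that actually occur get a column, which is why the full dataset yields 12 month columns and 24 hour columns. Every pair of distinct categories is equally far apart under this encoding, so the model has to learn any cyclic structure from the data.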

In [17]:
mnth_dummies.shape

(17379, 12)

In [18]:
hr_dummies.shape

(17379, 24)

In [19]:
X.shape

(17379, 36)

In [20]:
%%time
results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
print('CV Scoring Result: mean=',np.mean(results),'std=',np.std(results))

CV Scoring Result: mean= -21540.303113957958 std= 5442.637197887752
CPU times: user 1min 54s, sys: 703 ms, total: 1min 55s
Wall time: 1min 54s


In [21]:
one_hot_cv_score = np.mean(results) # cv score after one-hot encoding

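# baseline/new - 1 measures how much larger the baseline MSE is than the new MSE
# (equivalently, the new MSE is about 25% lower than the baseline)
print("The label-encoding (baseline) MSE is", (baseline_cv_score/one_hot_cv_score - 1)*100, "% higher than the one-hot MSE.")

The label-encoding (baseline) MSE is 33.62901829076266 % higher than the one-hot MSE.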
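
The three results can be put side by side. The ratio `baseline/new - 1` measures how much larger the baseline MSE is, whereas the reduction relative to the baseline is `1 - new/baseline`. This sketch computes the latter from the mean scores printed above; the dictionary names are illustrative.

```python
# Mean CV scores (negated MSE) copied from the outputs above
scores = {
    'label': -28784.0955880366,
    'unit_circle': -22742.956898950797,
    'one_hot': -21540.303113957958,
}

baseline_mse = -scores['label']
reductions = {}
for name, score in scores.items():
    mse = -score  # flip the sign to recover the MSE
    reductions[name] = (1.0 - mse / baseline_mse) * 100.0
    print(f'{name:12s} MSE={mse:10.1f}  reduction vs. label={reductions[name]:5.1f}%')
```

Measured this way, the unit-circle and one-hot encodings reduce the MSE by roughly 21% and 25%. The fold-to-fold standard deviations reported above are of similar size to these differences, so the ranking between the two should be read with some caution.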
