### Big Idea: Learned embeddings for categorical features and the datetime component. 

* This notebook shows how to learn embeddings from datetime components.

Better features can doubtless be extracted, e.g. cyclical components or the time elapsed since the start of the day (code included), and latitude/longitude values that are rounded and then embedded.
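
The "rounded then embedded" idea for coordinates could look roughly like the sketch below. It is only an illustration: the column names `pickup_latitude` and `pickup_longitude`, the toy values, and the choice of 2 decimal places are all assumptions, not something this notebook uses later.

```python
import pandas as pd

# Toy coordinates; the column names are assumptions about the taxi data.
df = pd.DataFrame({
    "pickup_latitude": [40.7679, 40.7681, 40.7101],
    "pickup_longitude": [-73.9822, -73.9818, -74.0100],
})

# Round to 2 decimals (cells of roughly 1 km) and join into one cell label.
cell = (df["pickup_latitude"].round(2).astype(str) + "_"
        + df["pickup_longitude"].round(2).astype(str))

# factorize assigns zero-indexed integer codes, the format an Embedding expects.
df["pickup_cell"], cell_labels = pd.factorize(cell)
n_cells = len(cell_labels)  # would become the Embedding's input size
```

Coarser rounding gives fewer, better-populated cells; finer rounding gives more cells with fewer trips each.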

* Based on: https://github.com/minimaxir/predict-reddit-submission-success/blob/master/predict_askreddit_submission_success_timing.ipynb

* Rossmann categorical embeddings idea: https://www.kaggle.com/c/rossmann-store-sales/discussion/17974

* The approach is also mentioned by the taxi trajectory winners: http://blog.kaggle.com/2015/07/27/taxi-trajectory-winners-interview-1st-place-team-%F0%9F%9A%95/
 

In [None]:
import pandas as pd
import numpy as np

from random import random, sample, seed

In [None]:
train = pd.read_csv('../input/train.csv',infer_datetime_format=True,parse_dates=["pickup_datetime"])

test = pd.read_csv('../input/test.csv',infer_datetime_format=True,parse_dates=["pickup_datetime"])
print(train.shape)

In [None]:
## Drop trips with outlier durations. Trips with 0 passengers and the like are left in, so you may want to clean differently.

duration_mask = ((train.trip_duration < 70) |       # under 70 seconds
                 (train.trip_duration > 3600 * 4))  # over 4 hours (originally 3600 = 1 hour)
print('Anomalies in trip duration, %: {:.2f}'.format(
    train[duration_mask].shape[0] / train.shape[0] * 100
))
train = train[~duration_mask] # drop 10k anomalies
print(train.shape)

In [None]:
train.head()

In [None]:
train.head().pickup_datetime

### Adding cyclic datetime components 
* More details: https://github.com/ddofer/talk/blob/master/Introduction%20to%20Time%20Series%20and%20Feature%20Engineering.pdf
* A common method for encoding cyclical data is to transform the data into two dimensions using a sine and cosine transformation.
We divide by the length of the period in question (its number of distinct values): e.g. 24 for hours of the day, 7 for days of the week.
    * Zero-indexed values (e.g. hours 0-23) are divided by the period as they are; 1-indexed values (e.g. weeks 1-53) have 1 subtracted first. Dividing by the cardinality - 1 instead would map the first and last values (hours 0 and 23) to the same point.
    
* Excellent reference notebook: https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning#Encoding-Cyclical-Features-for-Deep-Learning
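
As a quick sanity check of the sine/cosine encoding, the sketch below encodes the 24 hours with a period of 24 and compares distances between encoded points. The helper `dist` and the variable names are made up for this illustration.

```python
import numpy as np

hours = np.arange(24)   # zero-indexed hours, 0-23
period = 24             # number of distinct hour values

# Each hour becomes a point on the unit circle.
points = np.column_stack([np.sin(2 * np.pi * hours / period),
                          np.cos(2 * np.pi * hours / period)])

def dist(a, b):
    """Euclidean distance between the encodings of hours a and b."""
    return float(np.linalg.norm(points[a] - points[b]))

wrap_gap = dist(23, 0)   # across midnight
step_gap = dist(0, 1)    # one ordinary hour apart
far_gap = dist(0, 12)    # opposite sides of the day
```

With this period, `wrap_gap` equals `step_gap`, so 11PM and midnight are as close as any other pair of adjacent hours, while `far_gap` is the circle's diameter.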

In [None]:
def cyclic_datetime(df,date="date"):
    """Assumes datetime column exists and is parsed as pandas datetime. Extracts cyclic seasonality features.
    By default, leaves original values in.
    More values can be added, and can use already extracted datetime components with modification (e.g. df['hour'] = df[date].dt.hour)"""
    
    # Hours are zero-indexed (0-23), so divide by the period, 24.
    df['hour_sin'] = np.sin(2 * np.pi * df[date].dt.hour/24.0)
    df['hour_cos'] = np.cos(2 * np.pi * df[date].dt.hour/24.0)
    
#     df['checkin_month_sin'] = np.sin((df["checkin_month"]-1)*(2.*np.pi/12))
#     df['checkin_month_cos'] = np.cos((df["checkin_month"]-1)*(2.*np.pi/12))

    # ISO weeks run 1-53. .dt.isocalendar() (pandas >= 1.1) replaces .dt.week,
    # which is no longer available in pandas 2.x.
    week = df[date].dt.isocalendar().week.astype(int)
    df['checkin_week_sin'] = np.sin((week-1)*(2.*np.pi/53))
    df['checkin_week_cos'] = np.cos((week-1)*(2.*np.pi/53))
    
    # Minutes are zero-indexed (0-59), so the period is 60.
    df['minutes_sin'] = np.sin(df[date].dt.minute*(2.*np.pi/60))
    df['minutes_cos'] = np.cos(df[date].dt.minute*(2.*np.pi/60))
    return df
    
    
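### Adapted from: https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning#Encoding-Cyclical-Features-for-Deep-Learning
def cyclic_encode(data, col, max_val=None):
    # max_val is the period length, e.g. 24 for hours 0-23 or 12 for months 1-12.
    # The fallback assumes zero-indexed values that span the whole cycle;
    # pass max_val explicitly whenever that may not hold.
    if max_val is None:
        max_val = data[col].max() + 1
    data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
    return data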

## Add seconds since start of day
* Done by subtracting each timestamp, normalized to midnight, from the original timestamp.
    * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.normalize.html#pandas.Series.dt.normalize
    * Currently not used.
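
If the seconds-elapsed value were used, one option would be to encode it cyclically with a period of 86,400 seconds. The sketch below illustrates this on two made-up timestamps; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Two made-up pickup times.
ts = pd.Series(pd.to_datetime(["2016-03-14 00:00:30", "2016-03-14 17:24:55"]))

# normalize() sets the time to midnight, so the difference is time since midnight.
seconds = (ts - ts.dt.normalize()).dt.seconds

# One full day is the period of the cycle.
seconds_per_day = 24 * 60 * 60
sec_sin = np.sin(2 * np.pi * seconds / seconds_per_day)
sec_cos = np.cos(2 * np.pi * seconds / seconds_per_day)
```

This keeps 23:59:59 and 00:00:00 next to each other, which the raw seconds count does not.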


In [None]:
# seconds since start of day
train["seconds_elapsed"] = (train.pickup_datetime - train.pickup_datetime.dt.normalize()).dt.seconds

## Add cyclical time features 

# train['week_delta_sin'] = np.sin((train["pickup_datetime"].dt.dayofweek / 7) * np.pi)**2
# train['hour_sin'] = np.sin((train["pickup_datetime"].dt.hour / 24) * np.pi)**2

In [None]:
hours = np.array(train["pickup_datetime"].dt.hour, dtype=int)
minutes = np.array(train["pickup_datetime"].dt.minute, dtype=int)
dayofweeks = np.array(train["pickup_datetime"].dt.dayofweek, dtype=int)
dayofyear = np.array(train["pickup_datetime"].dt.dayofyear, dtype=int)

In [None]:
print(hours[0:2])
print(minutes[0:2])
print(dayofweeks[0:2])
print(dayofyear[0:2])

## Process Categoricals 
* All features must be zero-indexed integers.
* hours is already in the correct format (0 = 12AM EST, 23 = 11PM EST).
* dayofweeks is already in the correct format (pandas uses 0 = Monday, 6 = Sunday).
* minutes is already in the correct format (0-59).
* dayofyear is 1-indexed, so subtract 1.
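
Because an out-of-range index only fails once it reaches the Embedding layer, it can be worth checking the ranges up front. The helper below, `check_categorical`, is a hypothetical sketch and not part of this notebook; the sample values are made up.

```python
import numpy as np

def check_categorical(values, cardinality, name):
    """Return values unchanged if they are integers in [0, cardinality)."""
    values = np.asarray(values)
    if not np.issubdtype(values.dtype, np.integer):
        raise TypeError(f"{name} must hold integers, got {values.dtype}")
    if values.min() < 0 or values.max() >= cardinality:
        raise ValueError(
            f"{name} must lie in [0, {cardinality - 1}], "
            f"got [{values.min()}, {values.max()}]")
    return values

# Made-up day-of-year values, 1-indexed as pandas returns them.
dayofyear = np.array([1, 74, 366])

# After subtracting 1 they fit an Embedding with 366 rows.
dayofyears_tf = check_categorical(dayofyear - 1, 366, "dayofyears")
```

Passing the unshifted `dayofyear` with the same cardinality would raise a `ValueError`, since 366 is one past the last valid index.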

In [None]:
dayofyears_tf = dayofyear - 1

print(dayofyears_tf[0:10])

In [None]:
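# Import everything from keras.layers; the keras.layers.core and
# keras.layers.normalization paths are not importable in recent Keras releases.
from keras.models import Model
from keras.layers import Input, Dense, Embedding, GlobalAveragePooling1D, concatenate, Activation
from keras.layers import Masking, Dropout, Reshape
from keras.layers import BatchNormalization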

batch_size = 64
embedding_dims = 64
epochs = 20

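# Categoricals' Embedding Branch
Each variable gets its own input and Embedding. (The input size of each Embedding is already known from the construction of the variables: 24 hours, 7 weekdays, 60 minutes, 366 days.)

Reshape is necessary to drop the length-1 sequence axis: each Embedding outputs shape (1, meta_embedding_dims) per sample, which Reshape flattens to (meta_embedding_dims,).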
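
To make the shapes concrete, the sketch below mimics an embedding lookup in plain NumPy. It illustrates the idea only and is not Keras's implementation; the random `table` stands in for learned weights and the sizes are chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hours, dims = 24, 4

# Stand-in for the learned weight matrix: one row of `dims` values per hour.
table = rng.normal(size=(n_hours, dims))

# A batch of 3 samples, each holding a single hour index: shape (3, 1).
hours_batch = np.array([[17], [0], [17]])

# Looking up rows keeps the length-1 axis: shape (3, 1, 4).
looked_up = table[hours_batch]

# Dropping that axis gives one vector per sample: shape (3, 4).
flat = looked_up.reshape(len(hours_batch), dims)
```

Samples with the same hour share the same vector, which is what lets the model learn one representation per category.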

In [None]:
meta_embedding_dims = 64

hours_input = Input(shape=(1,), name='hours_input')
hours_embedding = Embedding(24, meta_embedding_dims)(hours_input)
hours_reshape = Reshape((meta_embedding_dims,))(hours_embedding)

dayofweeks_input = Input(shape=(1,), name='dayofweeks_input')
dayofweeks_embedding = Embedding(7, meta_embedding_dims)(dayofweeks_input)
dayofweeks_reshape = Reshape((meta_embedding_dims,))(dayofweeks_embedding)

minutes_input = Input(shape=(1,), name='minutes_input')
minutes_embedding = Embedding(60, meta_embedding_dims)(minutes_input)
minutes_reshape = Reshape((meta_embedding_dims,))(minutes_embedding)

dayofyears_input = Input(shape=(1,), name='dayofyears_input')
dayofyears_embedding = Embedding(366, meta_embedding_dims)(dayofyears_input)
dayofyears_reshape = Reshape((meta_embedding_dims,))(dayofyears_embedding)

## Following this, combine with other feature layers, then train

* The remainder of the code is still to be filled in, e.g. with all numeric features (after scaling to 0-1 or normalizing).
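
A minimal sketch of the 0-1 scaling step for numeric features follows. The toy frames and the choice of `passenger_count` and `seconds_elapsed` as columns are assumptions for illustration. The statistics come from the training rows only and are then reused for the test rows.

```python
import pandas as pd

# Toy numeric features; real ones would come from train/test.
train_num = pd.DataFrame({"passenger_count": [1, 2, 6],
                          "seconds_elapsed": [30, 43200, 86399]})
test_num = pd.DataFrame({"passenger_count": [0, 3],
                         "seconds_elapsed": [100, 50000]})

# Fit the scaling on the training data only.
col_min = train_num.min()
col_range = (train_num.max() - col_min).replace(0, 1)  # avoid dividing by zero

train_scaled = (train_num - col_min) / col_range

# Test values outside the training range are clipped back into [0, 1].
test_scaled = ((test_num - col_min) / col_range).clip(0, 1)
```

The scaled columns could then feed one extra numeric input that is concatenated with the embedding branches.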

In [None]:
merged = concatenate([ hours_reshape, dayofweeks_reshape, minutes_reshape, dayofyears_reshape])

hidden_1 = Dense(256, activation='relu')(merged)
hidden_1 = BatchNormalization()(hidden_1)

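# Linear output for regression; a sigmoid would bound predictions to (0, 1).
main_output = Dense(1, name='main_out')(hidden_1)


model = Model(inputs=[hours_input,
                      dayofweeks_input,
                      minutes_input,
                      dayofyears_input], outputs=[main_output])

model.compile(loss='mean_squared_error', optimizer='adam')

model.summary()

In [None]:
# Inputs must follow the order given to Model(inputs=...); each is reshaped to (n, 1).
# Assumption: the target is log1p(trip_duration); swap in whatever target you train on.
model.fit([x.reshape(-1, 1) for x in (hours, dayofweeks, minutes, dayofyears_tf)],
          np.log1p(train.trip_duration.values),
          batch_size=batch_size, epochs=epochs)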