# On-line Feature Engineering

The goal of this notebook is to provide a framework for the online feature engineering that this competition seems to require. It mainly relies on my work in previous competitions (Jane Street: https://www.kaggle.com/lucasmorin/running-algos-fe-for-fast-inference, G-Research: https://www.kaggle.com/code/lucasmorin/on-line-feature-engineering).

## Feature engineering techniques:

- [Start with the end](#Start) 
- [Get Data](#Get_Data)
- [Reorder Data](#Reorder_Data)
- [Missing Assets](#Missing_Assets)
- [Base Feature Engineering](#Base_FE)
- [Market Features](#Market_Features)
- [Time Features](#Time_Features)
- [Running Moving Average](#RMA) (<- Magic)
- [Moving Average Features](#MA_FE)
- [Betas](#Betas)
- [Putting it all together](#All) (<- All the features)
- [Complete Feature Exploration](#FE_exploration)

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
import pickle

def timestamp_to_date(timestamp):
    return(datetime.fromtimestamp(timestamp))

<a id='Start'></a>
# Start with the end
We first look at the data returned by the submission iterator.

In [None]:
(prices, options, financials, trades, secondary_prices, sample_prediction) = next(iter_test)

We'll start with prices.

In [None]:
prices.head()

In [None]:
specs = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/data_specifications/stock_price_spec.csv')

<a id='Get_Data'></a>
# Get Data
Change data from pandas to numpy.

In [None]:
train_df = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')


In [None]:
test_df = prices

In [None]:
len(test_df.SecuritiesCode.unique())

<a id='Missing_Assets'></a>
# Missing Assets ?

Missing assets are handled by adding rows filled with NaN, so that every date carries the same set of securities.

In [None]:
dtype_dict = {'RowId':         'object',
'Date':                 'object',
'SecuritiesCode':        'int16',
'Open':                'float32',
'High':                'float32',
'Low':                 'float32',
'Close':               'float32',
'Volume':                'int64',
'AdjustmentFactor':    'float32',
'ExpectedDividend':    'float32',
'SupervisionFlag':        'bool',
'Target':              'float64'}

df_train = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv', dtype = dtype_dict)
df_train_sup = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv', dtype = dtype_dict)

In [None]:
plt.plot(df_train.groupby('Date')['SecuritiesCode'].nunique());
Asset_ID = test_df.SecuritiesCode.unique()

We simulate missing assets by keeping only the first 10 securities.

In [None]:
Asset_id_sample = Asset_ID[:10]
test_df_missing = test_df[test_df.SecuritiesCode.isin(Asset_id_sample)]
# Compare security codes with codes (the original compared positions with codes)
missing_ID = [a for a in Asset_ID if a not in Asset_id_sample]
val = test_df_missing

In [None]:
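def deal_MV(df, Asset_ID=Asset_ID):
    # Right-merge on the full asset list so missing assets appear as NaN rows
    new_df = pd.DataFrame({'SecuritiesCode':Asset_ID})
    date_ref = df.Date.iloc[0]
    df = df.merge(new_df,on='SecuritiesCode',how='right')
    # Assign back rather than using inplace on a column, which may not update df
    df['Date'] = df['Date'].fillna(date_ref)
    df['RowId'] = df.Date +'_'+df.SecuritiesCode.astype('str')
    return df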

In [None]:
deal_MV(val)

Don't forget to remove the added rows before making predictions.
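
A minimal sketch of that clean-up step on toy data: the padded frame is filtered back to the securities that were actually present in the incoming batch. The helper name `drop_padded_rows` and the toy values are assumptions made for illustration; they are not part of the competition API.

```python
import numpy as np
import pandas as pd

def drop_padded_rows(padded_df, original_codes):
    """Keep only rows whose SecuritiesCode was present in the original batch."""
    mask = padded_df['SecuritiesCode'].isin(original_codes)
    return padded_df[mask].reset_index(drop=True)

# Toy batch: only 2 of the 4 known securities are present.
all_codes = np.array([1301, 1332, 1333, 1376])
batch = pd.DataFrame({'Date': ['2021-12-06'] * 2,
                      'SecuritiesCode': [1301, 1333],
                      'Close': [2980.0, 2405.0]})

# Pad to the full universe (same idea as deal_MV): missing assets get NaN rows.
padded = batch.merge(pd.DataFrame({'SecuritiesCode': all_codes}),
                     on='SecuritiesCode', how='right')
padded['Date'] = padded['Date'].fillna(batch['Date'].iloc[0])

# ... feature engineering would run on `padded` here ...

# Before predicting, drop the rows that were only added as padding.
restored = drop_padded_rows(padded, batch['SecuritiesCode'])
```

Padding keeps the running statistics aligned across dates, while the filter ensures that only securities requested in the batch end up in the submission.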

<a id='Base_FE'></a>
# Base Feature Engineering

In [None]:
def Base_FE(df):
    
    df['Avg_Price'] = (df['Close']+df['Open'])/2
    df['Avg_Price_HL'] = (df['High']+df['Low'])/2
    df['Side'] = 2*(df['Avg_Price']-df['Avg_Price_HL'])/(df['High']-df['Low'])
    
    df['ret_HL'] = df['High']/df['Low']-1
    df['ret'] = df['Close']/df['Open']-1
    df['ret_Div'] = df['ExpectedDividend']/df['Avg_Price']
    
    df['log_Dollars'] = np.log(df['Avg_Price']*df['Volume'])
    
    # Garman-Klass volatility estimator (square root of the daily variance estimate)
    df['GK_sqrt_vol'] = np.sqrt((1 / 2 * np.log(df['High']/df['Low']) ** 2 - (2 * np.log(2) - 1) * np.log(df['Close'] / df['Open']) ** 2))
    # Rogers-Satchell volatility estimator
    df['RS_sqrt_vol'] = np.sqrt(np.log(df['High']/df['Close'])*np.log(df['High']/df['Open']) + np.log(df['Low']/df['Close'])*np.log(df['Low']/df['Open']))

    df[Base_Features] = df[Base_Features].astype('float32')
    
    return df

Base_Features  = ['Side','ret_HL','ret','ret_Div','log_Dollars','GK_sqrt_vol','RS_sqrt_vol']

In [None]:
test_df = Base_FE(test_df)
test_df[Base_Features].hist();

<a id='Market_Features'></a>
# Market Features

In [None]:
def Market_FE(df,features = Base_Features):
    df[[f+'_M_mean'for f in features]] = df.groupby('Date')[features].transform('mean')
    df[[f+'_M_std'for f in features]] = df.groupby('Date')[features].transform('std')
    df[[f+'_M_skew'for f in features]] = df.groupby('Date')[features].transform('skew')
    return df

Market_Features = [f+'_M_mean'for f in Base_Features]+[f+'_M_std'for f in Base_Features]+[f+'_M_skew'for f in Base_Features]

In [None]:
test_df = Market_FE(test_df)

<a id='Time_Features'></a>
# Time Features

In [None]:
def Time_FE(df):
    # Positional access: the index label 0 is not guaranteed to exist
    day = pd.to_datetime(df.Date.iloc[0])
    df['sin_month'] = (np.sin(2 * np.pi * day.month/12))
    df['cos_month'] = (np.cos(2 * np.pi * day.month/12))
    df['sin_week'] = (np.sin(2 * np.pi * day.week/52))
    df['cos_week'] = (np.cos(2 * np.pi * day.week/52))
    df['sin_day'] = (np.sin(2 * np.pi * day.day/31))
    df['cos_day'] = (np.cos(2 * np.pi * day.day/31))
    return df

Time_Features = ['sin_month','cos_month','sin_week','cos_week','sin_day','cos_day']

In [None]:
test_df = Time_FE(test_df)

Open question: how should irregular Japanese trading days be handled?

<a id='RMA'></a>
# Running Moving Average

Standard pandas moving average implementation would look like this:

In [None]:
#rw = 10000
#train_data_rolled = train_data.rolling(window=rw).mean()

But keeping and updating a DataFrame at every time step would not be practical. One idea is to keep the values in memory and recompute the mean each time, but that would be rather inefficient too.

A better approach is to keep track of the cumulated sum, adding the newest value and removing the oldest one at each time step.

In [None]:
from collections import deque

class RunningMean:
    def __init__(self, WIN_SIZE=20, n_size = 1):
        self.n = 0
        self.n_size = n_size
        self.mean = np.zeros(n_size)
        self.cum_sum = 0
        self.past_value = 0
        self.WIN_SIZE = WIN_SIZE
        # One extra slot so the oldest value is still available to subtract
        self.windows = deque(maxlen=WIN_SIZE+1)
        
    def clear(self):
        # Reset the full state, including the cumulated sum
        self.n = 0
        self.cum_sum = 0
        self.past_value = 0
        self.windows.clear()

    def push(self, x):
        # Replace NaNs with the previous observation
        x = fillna_npwhere(x, self.past_value)
        self.past_value = x
        
        self.windows.append(x)
        self.cum_sum += x
        
        if self.n < self.WIN_SIZE:
            self.n += 1
            self.mean = self.cum_sum / float(self.n)
            
        else:
            self.cum_sum -= self.windows.popleft()
            self.mean = self.cum_sum / float(self.WIN_SIZE)

    def get_mean(self):
        return self.mean if self.n else np.zeros(self.n_size)

    def __str__(self):
        return "Current window values: {}".format(list(self.windows))

# Temporarily removing njit as it causes many bugs down the line
# Problems are mainly due to data types; I have to find where to constrain types to keep njit happy
#@njit
def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array
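
The cumulated-sum idea can be checked on a small scalar series. The sketch below is a simplified stand-in written for illustration (the function name `running_means` is an assumption, and NaN handling is left out); it compares the online result with a pandas rolling mean that allows partial windows at the start.

```python
from collections import deque

import numpy as np
import pandas as pd

def running_means(values, win_size):
    """Yield the mean of the last `win_size` values using a cumulated sum."""
    window = deque()
    cum_sum = 0.0
    for x in values:
        window.append(x)
        cum_sum += x                      # add the newest observation
        if len(window) > win_size:
            cum_sum -= window.popleft()   # drop the oldest one
        yield cum_sum / len(window)

values = [1.0, 2.0, 4.0, 8.0, 16.0]
online = np.array(list(running_means(values, win_size=3)))

# Reference: pandas rolling mean, allowing partial windows at the start.
reference = pd.Series(values).rolling(window=3, min_periods=1).mean().to_numpy()

print(np.allclose(online, reference))
```

Each update costs one addition and at most one subtraction, whatever the window length, which is what makes the approach suitable for inference inside the submission loop.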

<a id='MA_FE'></a>
# Moving Average Features

In [None]:
%%time 
 
MA_lags = [2,5,20,60,300]
Suffixes = ['2d','W','M','Q','Y']
    
dict_RM = {}
for lag in MA_lags:
    dict_RM[lag] = RunningMean(lag)

Features_to_lag = Base_Features+Market_Features

def Lag_FE(df, Features_to_lag = Features_to_lag, MA_lags = MA_lags, Suffixes = Suffixes):
    for (lag,suffixe) in zip(MA_lags,Suffixes):
        dict_RM[lag].push(df[Features_to_lag].values)
        df[[f+'_'+suffixe for f in Features_to_lag]] = dict_RM[lag].get_mean()
    return df

for i in tqdm(range(100)):
    df = Lag_FE(test_df,Features_to_lag=Features_to_lag)
    
Moving_average_Features = [f+'_'+s for f in Features_to_lag for s in  Suffixes]

<a id='Betas'></a>
# Betas

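Beta measures how strongly a series co-moves with another series. We start with the market beta.
For lack of a better implementation, I keep just two running memories: the mean of the squared market return and the mean of the product of the market and asset returns.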
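
As a rough illustration of why two memories are enough: a beta without intercept is the ratio of two averages, mean(r_m * r) / mean(r_m ** 2). The sketch below uses synthetic returns (an assumption made for illustration, not competition data) and checks that ratio against a least-squares fit without intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: the asset moves 1.5x the market plus noise.
market_ret = rng.normal(0.0, 0.01, size=300)
asset_ret = 1.5 * market_ret + rng.normal(0.0, 0.002, size=300)

# The two "memories": mean of r_m * r and mean of r_m ** 2.
mean_mr = np.mean(market_ret * asset_ret)
mean_mm = np.mean(market_ret ** 2)
beta = mean_mr / mean_mm

# Same quantity from a least-squares fit without intercept.
beta_lstsq = np.linalg.lstsq(market_ret[:, None], asset_ret, rcond=None)[0][0]

print(np.isclose(beta, beta_lstsq))
```

Note that this is an uncentered beta: the means of the returns are not subtracted, which is usually a small effect for daily returns.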

In [None]:
%%time 

beta_lags = [20,60,300]
Suffixes_Beta = ['M','Q','Y']

dict_MM = {}
dict_Mr = {}
for lag in beta_lags:
    dict_MM[lag] = RunningMean(lag)
    dict_Mr[lag] = RunningMean(lag)

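def Betas_FE(df, beta_lags = beta_lags, Suffixes_Beta = Suffixes_Beta):
    # Work on numpy arrays so that NaNs are detected and filled by RunningMean
    # (Series.sum() skips NaNs, which would hide them from fillna_npwhere)
    market_ret = df['ret_M_mean'].values
    asset_ret = df['ret'].values
    for (lag,suffixe) in zip(beta_lags, Suffixes_Beta):
        dict_MM[lag].push(market_ret**2)
        dict_Mr[lag].push(market_ret*asset_ret)
        df['Beta_'+suffixe] = dict_Mr[lag].get_mean()/dict_MM[lag].get_mean()
    return df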
    
for i in tqdm(range(100)):
    df = Betas_FE(df)


In [None]:
Beta_Features = ['Beta_'+s for s in Suffixes_Beta]

<a id='All'></a>
# Putting it all together - cleaning and testing

In [None]:
%%time 

df_grouped = df_train.groupby('Date')

df_result = pd.DataFrame()

MA_lags = [2,5,20,60,300]
Suffixes = ['2d','W','M','Q','Y']
beta_lags = [20,60,300]
Suffixes_Beta = ['M','Q','Y']

dict_RM = {}
for lag in MA_lags:
    dict_RM[lag] = RunningMean(lag)
    
dict_MM = {}
dict_Mr = {}
for lag in beta_lags:
    dict_MM[lag] = RunningMean(lag)
    dict_Mr[lag] = RunningMean(lag)

Features_to_lag = Base_Features+Market_Features

list_df = [] 

for name, df in tqdm(df_grouped):
    df = deal_MV(df)
    df = Base_FE(df)
    df = Market_FE(df)
    df = Time_FE(df)
    df = Lag_FE(df, Features_to_lag = Features_to_lag, MA_lags = MA_lags, Suffixes = Suffixes)
    df = Betas_FE(df, beta_lags = beta_lags, Suffixes_Beta = Suffixes_Beta)
    
    list_df.append(df)
    
df_result = pd.concat(list_df)

del list_df

In [None]:
list_float64 = [c for c in df_result.select_dtypes(np.float64).columns if c not in ['Target']]
df_result[list_float64] = df_result[list_float64].astype(np.float32)

df_result.to_parquet('train_FE.parquet')

In [None]:
pickle.dump(dict_RM, open('dict_RM_train.pkl', 'wb'))
pickle.dump(dict_MM, open('dict_MM_train.pkl', 'wb'))
pickle.dump(dict_Mr, open('dict_MR_train.pkl', 'wb'))

In [None]:
df_grouped_sup = df_train_sup.groupby('Date')
df_result_sup = pd.DataFrame()
list_df_sup = [] 

for name, df in tqdm(df_grouped_sup):
    df = deal_MV(df)
    df = Base_FE(df)
    df = Market_FE(df)
    df = Time_FE(df)
    df = Lag_FE(df, Features_to_lag = Features_to_lag, MA_lags = MA_lags, Suffixes = Suffixes)
    df = Betas_FE(df, beta_lags = beta_lags, Suffixes_Beta = Suffixes_Beta)    
    list_df_sup.append(df)
    
df_result_sup = pd.concat(list_df_sup)

list_float64 = [c for c in df_result_sup.select_dtypes(np.float64).columns if c not in ['Target']]
df_result_sup[list_float64] = df_result_sup[list_float64].astype(np.float32)

df_result_sup.to_parquet('train_FE_sup.parquet')

pickle.dump(dict_RM, open('dict_RM_train_sup.pkl', 'wb'))
pickle.dump(dict_MM, open('dict_MM_train_sup.pkl', 'wb'))
pickle.dump(dict_Mr, open('dict_MR_train_sup.pkl', 'wb'))

<a id='FE_exploration'></a>
# Complete Feature Exploration

In [None]:
for c in df_result.columns:
    if df_result[c].dtype.kind in 'biufc':
        print(c)
        print(df_result[c].describe())
        df_result[c].plot(kind = 'hist', stacked=True, bins=100).set_xlim((df_result[c].quantile(0.025), df_result[c].quantile(0.975)));
        plt.show();