# Proxy Metrics for Utility Score to Use in Regression 
# &
# Custom Objective Function Implementations for LGBM


## Why might regression be more appropriate?
In this notebook, I create proxy metrics that turn the classification task into a regression task. I wanted to see the effect of framing this problem as regression instead of classification, because "resp" carries more information than the binary label "action".


## How are those Custom Objective Functions related to the Evaluation Metric, i.e. Utility Score?

I started by adding variations of penalty terms to MSE and MAE that apply when the predicted "resp" and the true "resp" don't have the same sign. This modification gives specific importance to the match between the predicted "action" and the true "action", on top of the penalty for residuals between predicted and actual "resp". This property allows those new metrics to be used as a proxy for the Utility score in a regression setting.


## Modified MSE Derivation

$$Modified\ MSE = f(\hat{y}) = \sum_{j} \left[ (y_j - \hat{y}_j)^2 + \left\{
\begin{array}{ll}
      0  &, y_j \cdot \hat{y}_j > 0 \\
      - \lambda \cdot y_j \cdot \hat{y}_j   &, y_j \cdot \hat{y}_j \leq 0 \\
\end{array} 
\right. \right]$$

$y_j$ :  True "resp" value for row j.

$\hat{y}_j$ :  Predicted "resp" value for row j.

$ \lambda $ : Parameter that controls the amount of penalty for non-matching action predictions. Needs to be tuned.

**The piecewise part of the metric represents the extra penalty for a mismatch between the true and predicted action. If the predicted and true "resp" do not have the same sign, their product is less than or equal to 0 and the piecewise function adds an extra penalty of $-\lambda \cdot y_j \cdot \hat{y}_j$, which is non-negative.**
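
To make the penalty concrete, here is a minimal sketch that evaluates the per-row Modified MSE for one matching and one mismatching prediction. The helper name `modified_mse_row` and the sample values are illustrative assumptions, not part of this notebook's training code; $\lambda = 2$ is the same value used for the plots.

```python
import numpy as np

def modified_mse_row(y_true, y_pred, lmbda):
    """Per-row Modified MSE: squared residual plus a sign-mismatch penalty."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_error = (y_true - y_pred) ** 2
    # The product is <= 0 on a mismatch, so -lmbda * product is a non-negative penalty.
    penalty = np.where(y_true * y_pred > 0, 0.0, -lmbda * y_true * y_pred)
    return squared_error + penalty

matching = modified_mse_row(0.1, 0.2, lmbda=2)      # same sign: only the squared error
mismatching = modified_mse_row(0.1, -0.2, lmbda=2)  # opposite sign: squared error plus penalty
print(float(matching), float(mismatching))
```

With these values the matching row costs about 0.01, while the mismatching row costs about 0.09 from the residual plus about 0.04 from the penalty.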


### Gradient and Hessian 
I will implement objective functions for the different variants of Modified MSE type hybrid metrics in the code below. Before that, I want to calculate the gradient and hessian for this example, as LGBM requires an objective function to return both.

$$ Gradient\ Modified\ MSE =  \frac{\partial f(\hat{y})}{\partial \hat{y}_j} = -2(y_j - \hat{y}_j) + \left\{
\begin{array}{ll}
      0  &, y_j \cdot \hat{y}_j > 0 \\
      - \lambda \cdot y_j   &, y_j \cdot \hat{y}_j \leq 0 \\
\end{array} 
\right.$$

$$ Hessian\ Modified\ MSE = \frac{\partial^2 f(\hat{y})}{\partial \hat{y}_j^2} = 2 $$
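
As a sanity check on the derivation, the following sketch compares the analytic per-row gradient with a central finite difference. The function names and the sample values are assumptions made for illustration; the predictions are kept away from zero, where the piecewise term switches branch.

```python
import numpy as np

def modified_mse(y_true, y_pred, lmbda):
    # Per-row loss: squared residual plus the sign-mismatch penalty.
    penalty = np.where(y_true * y_pred > 0, 0.0, -lmbda * y_true * y_pred)
    return (y_true - y_pred) ** 2 + penalty

def modified_mse_grad(y_true, y_pred, lmbda):
    # Analytic derivative of the per-row loss with respect to y_pred.
    extra = np.where(y_true * y_pred > 0, 0.0, -lmbda * y_true)
    return -2.0 * (y_true - y_pred) + extra

lmbda = 2.0
y_true = np.array([0.10, 0.10, -0.30, -0.30])
y_pred = np.array([0.20, -0.20, 0.05, -0.10])  # none of these is close to 0

eps = 1e-6
numeric_grad = (modified_mse(y_true, y_pred + eps, lmbda)
                - modified_mse(y_true, y_pred - eps, lmbda)) / (2 * eps)
analytic_grad = modified_mse_grad(y_true, y_pred, lmbda)
max_abs_diff = np.max(np.abs(numeric_grad - analytic_grad))
print(max_abs_diff)
```

Because each branch of the loss is quadratic in the prediction, the two gradients should agree up to floating point rounding.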



There are 2 requirements for an objective function:

- It should be differentiable to first and second order with respect to $\hat{y}_j$, as those derivatives are the gradient and hessian used by the underlying optimisation algorithm to minimize the objective function.

- Being differentiable implies another condition, which is being continuous.
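
A quick way to probe these two requirements for the Modified MSE is to evaluate the per-row loss and gradient just left and right of $\hat{y}_j = 0$. The sketch below does this for a single assumed true value ($y_j = 0.1$, $\lambda = 2$); the helper names are illustrative. It suggests that the loss is continuous at the sign boundary, while the gradient changes by $\lambda \cdot y_j$ there, so the derivative formulas hold on either side of zero rather than exactly at it.

```python
lmbda = 2.0
y_true = 0.1
step = 1e-8  # small offset on each side of y_pred = 0

def loss(y_pred):
    # Per-row Modified MSE for the fixed y_true above.
    penalty = 0.0 if y_true * y_pred > 0 else -lmbda * y_true * y_pred
    return (y_true - y_pred) ** 2 + penalty

def grad(y_pred):
    # Per-row gradient with respect to y_pred.
    extra = 0.0 if y_true * y_pred > 0 else -lmbda * y_true
    return -2.0 * (y_true - y_pred) + extra

loss_gap = abs(loss(step) - loss(-step))  # shrinks with step: the loss is continuous
grad_gap = grad(step) - grad(-step)       # stays near lmbda * y_true: the gradient jumps
print(loss_gap, grad_gap)
```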

We can quickly look at the shape of those objective functions using a 3-d plot. I will fix $\lambda = 2$ so that the function can be visualised in 3-d.

In [None]:
# Plot Modified MSE 
import numpy as np 
import plotly.graph_objects as go
lmbda = 2
x = np.outer(np.linspace(-0.5, 0.5, 30), np.ones(30))
y = x.copy().T 
z=np.where(x*y>=0, (y-x)**2, ((y-x)**2) - (x*y)*lmbda )

trace = go.Surface(x = x, y = y, z =z )
data = [trace]
layout = go.Layout(title = '3D Surface plot')
fig = go.Figure(data = data)
fig.update_layout(title='MSE Modified 1 Objective function values', autosize=False,
                  width=500, height=500,
                  margin=dict(l=80, r=70, b=85, t=110),
scene = dict(xaxis_title='Predicted resp',yaxis_title='True resp',zaxis_title='Value of Objective Function'))
fig.show()


Given that "resp" ranges between -0.5 and 0.5 in the training set, I restricted the axes to $|resp_{pred}|<0.5$ and $|resp_{true}|<0.5$.


In [None]:
# Plot Modified MAE
lmbda = 2
x = np.outer(np.linspace(-0.5, 0.5, 30), np.ones(30))
y = x.copy().T # transpose
z=np.where(x*y>=0, abs(y-x), abs(y-x) - (x*y)*lmbda )


trace = go.Surface(x = x, y = y, z =z )
data = [trace]
layout = go.Layout(title = '3D Surface plot')
fig = go.Figure(data = data)
fig.update_layout(title='MAE Modified 1 Objective function values', autosize=False,
                  width=500, height=500,
                  margin=dict(l=80, r=70, b=85, t=110),
scene = dict(xaxis_title='Predicted resp',yaxis_title='True resp',zaxis_title='Value of Objective Function'))
fig.show()


## Weighted Training with Daily Signal-to-Noise Ratio -- Added at Version 3

We know that a low signal-to-noise ratio is a problem in this dataset. Hence, I used an alternative definition of SNR, calculated on daily "resp" as follows ([formula from Wikipedia](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)):

$$ SNR = \frac{\mu}{\sigma} \quad \text{or} \quad SNR = \frac{\mu^2}{\sigma^2} $$

I will assign daily SNR values of "resp" as training weights so that the model assigns higher importance to the days with higher SNR.

I will derive some variants of those formulas, because the first formula can produce negative weights and the second formula has a log-normal distribution, which made me experiment with a log-transformed version of it as well.

I derived 4 versions of SNR and compared their results on a time-series split validation set.


Because it squares the mean, the second formula also assigns high importance to days whose mean "resp" is negative with a high absolute value.
The first formula can be considered after taking the absolute value of the mean "resp" first.
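
The four variants can be sketched on toy data as follows. The DataFrame `toy` and its values are made up for illustration, and `groupby(...).transform` is used here as one possible way to repeat each daily statistic on every row of that day; the sign convention of the logged variants follows the notebook's code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data: three days with three "resp" values each.
toy = pd.DataFrame({
    "date": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "resp": [0.01, -0.02, 0.04, -0.05, 0.01, 0.01, 0.03, -0.03, 0.03],
})

daily = toy.groupby("date")["resp"]
mean_per_row = daily.transform("mean")  # daily mean repeated on every row of that day
std_per_row = daily.transform("std")    # daily sample standard deviation, same alignment

# The two base formulas, using the absolute mean so the first one cannot go negative.
snr_abs = mean_per_row.abs() / std_per_row
snr_squared = mean_per_row ** 2 / std_per_row ** 2

# Logged variants with the same leading minus sign as the notebook's code.
snr_abs_logged = -np.log(snr_abs)
snr_squared_logged = -np.log(snr_squared)

print(pd.DataFrame({"snr_abs": snr_abs, "snr_abs_logged": snr_abs_logged}))
```

Since $\mu^2 / \sigma^2$ is the square of $|\mu| / \sigma$, the logged squared variant is exactly twice the logged absolute variant, so the two differ only by a constant factor when used as weights.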

In [None]:
%%time
import datatable as dt
import numpy as np
import pandas as pd 
import os
import random
import seaborn as sns
import matplotlib.pyplot as plt
import janestreet
import warnings
warnings.filterwarnings('ignore')
from tqdm.notebook import tqdm
import plotly.graph_objects as go

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import KFold
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args
from sklearn.model_selection import train_test_split


import lightgbm as lgb


import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
%%time
# Load data
train_data = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()
feature = pd.read_csv("/kaggle/input/jane-street-market-prediction/features.csv")
test_example = pd.read_csv("/kaggle/input/jane-street-market-prediction/example_test.csv")
sample_sub = pd.read_csv("/kaggle/input/jane-street-market-prediction/example_sample_submission.csv")


ALL_FEATURES = ["feature_" + str(i) for i in range(0,130) ]
KEPT_FEATURES = ALL_FEATURES
CAT_FEATURES = ["feature_0"]
LABEL_COLUMNS = ["resp", "resp_1", "resp_2", "resp_3", "resp_4"]
DATE_COLUMNS = ["date", "ts_id"]
# "weight" is the only remaining column


# Derive the binary "action" label from the sign of "resp"
#train_data = train_data.loc[train_data.weight !=0,]
train_data['action'] = 0
train_data.loc[train_data['resp']>0.0,'action'] = 1
features = [f"feature_{x}" for x in range(130)]

# Drop the early days and keep only rows with date > 85
train_data = train_data.loc[train_data.date>85,]

# Calculate daily Signal to Noise ratio to use as weight
daily_resp_mean = train_data.groupby("date").resp.mean()[train_data.date].values
daily_resp_std = train_data.groupby("date").resp.std()[train_data.date].values

# 4 variants of daily SNR -- logged versions are quite normally distributed.
# Note: -log(SNR) is positive only while SNR < 1, and it decreases as SNR grows.
daily_SNR_abs = abs(daily_resp_mean) / daily_resp_std ## This is not as useful as the logged one
daily_SNR_abs_logged = -1*np.log(daily_SNR_abs)

daily_SNR_squared = np.square(daily_resp_mean) / np.square(daily_resp_std) ## This is not as useful as the logged one
daily_SNR_squared_logged = -1*np.log(daily_SNR_squared)

del daily_resp_mean, daily_resp_std


target = 'resp'

y = train_data[target].values
date = train_data["date"].values
weight = train_data["weight"].values
resp = train_data["resp"].values
train_data = train_data[features].values



In [None]:
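# Objective function implementation
def mse_modified_1(y_pred, y_true):
    """Custom LGBM objective: weighted Modified MSE gradient and hessian.

    y_pred holds the raw predictions; y_true is the lgb.Dataset passed in by LightGBM.
    """
    # Set hyperparameter lambda
    lmbda = 0.15
    
    # This weight is SNR
    weight = y_true.get_weight()
    
    y_true = y_true.get_label()
    residual = (y_true - y_pred).astype("float64")
    
    signs_matching = (y_true * y_pred) >= 0
    
    # The extra term -lambda * y_true applies only when the signs do not match
    grad = np.where(signs_matching,  weight *(-2 * residual), weight * (-2 * residual - y_true*lmbda))
    # The hessian is 2 in both branches, so no np.where is needed
    hess = weight * 2.0
    return grad, hess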


def Eval_mse_modified_1(y_pred, y_true):
    lmbda = 0.15
    
    weight = y_true.get_weight()
    y_true = y_true.get_label()
    residual = (y_true - y_pred).astype("float")
    
    signs_matching = (y_true * y_pred) >= 0
    
    mse_action_value = np.where(signs_matching,  residual**2, residual**2 - y_true*y_pred*lmbda)
    
    mse_action_value = weight* mse_action_value 
    
    return "MSE_Modified_1", np.mean(mse_action_value), False







def mse_modified_2(y_pred, y_true):
    lmbda = 0.2
    
    # Weights are Daily signal to noise ratio
    weight = y_true.get_weight()
    
    y_true = y_true.get_label()
    residual = (y_true - y_pred).astype("float64")
    
    signs_matching = (y_true * y_pred) >= 0
    
    
    grad = np.where(signs_matching, -2 * residual,  -2 * residual + 2*y_pred*(y_true**2)*lmbda)
    hess = np.where(signs_matching, 2 , 2 + 2*(y_true**2)*lmbda )
    
    grad = weight * grad
    hess = weight * hess
    return grad, hess


def Eval_mse_modified_2(y_pred, y_true):
    lmbda = 0.2
    
    # Weights are signal to noise ratio
    weight = y_true.get_weight()
    y_true = y_true.get_label()
    residual = (y_true - y_pred).astype("float")
    
    signs_matching = (y_true * y_pred) >= 0
    
    value = np.where(signs_matching,  residual**2, residual**2 + (y_true**2)*(y_pred**2)*lmbda)
    
    value = weight* value 
    
    return "MSE_Modified_2", np.mean(value), False

                    
       

                    
# THIS WILL NOT WORK: THE HESSIAN IS ZERO AND THE FUNCTION IS NOT DIFFERENTIABLE AT THE CRITICAL POINT
# MAE is usually approximated by a smooth function such as log-cosh around that point to make it differentiable.

# def mae_modified_1(y_pred, y_true):
#     lmbda = 2
#     #weight = y_true.get_weight()
#     y_true = y_true.get_label().astype("float64")
#     signs_matching = (y_true * y_pred) >= 0
#     is_pred_bigger = y_pred > y_true
    
    
#     grad = np.where(is_pred_bigger, 1, -1)
#     grad[~signs_matching] = grad[~signs_matching] - y_true[~signs_matching]*lmbda
#     #grad = weight * grad
    
#     hess = np.where(signs_matching, 0, 0)
#     return grad, hess

# def Eval_mae_modified_1(y_pred, y_true):
#     lmbda = 2
#     #weight = y_true.get_weight()
#     y_true = y_true.get_label()
#     residual = (y_true - y_pred).astype("float")
    
#     signs_matching = (y_true * y_pred) >= 0
#     is_pred_bigger = y_pred > y_true
    
#     mae_action_value = np.where(is_pred_bigger, -residual, residual)
#     mae_action_value[~signs_matching] += y_true[~signs_matching]*y_pred[~signs_matching]*lmbda
    
#     #mae_action_value = weight * mae_action_value
    
#     return("MAE_Modified_1", np.mean(mae_action_value), False)




# Taken from Yurin's notebook 
# https://www.kaggle.com/gogo827jz/jane-street-super-fast-utility-score-function
def utility_score_bincount(date, weight, resp, action):
    count_i = len(np.unique(date))
    Pi = np.bincount(date, weight * resp * action)
    t = np.sum(Pi) / np.sqrt(np.sum(Pi ** 2)) * np.sqrt(250 / count_i)
    u = np.clip(t, 0, 6) * np.sum(Pi)
    return u




In [None]:
split_from = train_data.shape[0] - train_data.shape[0]//6
train = train_data[:split_from,]
y_train = y[:split_from]

valid = train_data[split_from:,]
y_valid = y[split_from:]

train_daily_SNR = daily_SNR_abs_logged[:split_from]
valid_daily_SNR = daily_SNR_abs_logged[split_from:]



params = {
        'boosting_type': 'gbdt',
        'objective': 'custom',
        'n_jobs': -1,
        'seed': 0,
        "num_leaves": 32,
        'learning_rate': 0.01,
        'bagging_fraction': 0.8,
        'bagging_freq': 10,
        'colsample_bytree': 0.9,
        "num_boost_round": 2500,
        "early_stopping_rounds": 50,
        "min_data_in_leaf": 20}



lgb_train = lgb.Dataset(train,y_train, weight=train_daily_SNR)
lgb_valid = lgb.Dataset(valid,y_valid, weight=valid_daily_SNR)

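# Written against the LightGBM 3.x API; LightGBM 4.0 removed the fobj and verbose_eval arguments of lgb.train
model = lgb.train(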
    params,
    lgb_train,
    valid_sets = [lgb_valid],
    verbose_eval = 50,
    fobj=mse_modified_1,
    feval=Eval_mse_modified_1
)

resp_preds = model.predict(valid)
action_preds = (resp_preds > 0).astype(int)
val_utility_score = utility_score_bincount(date[split_from:], weight[split_from:], resp[split_from:], action_preds)

print(f"Validation Utility Score: {val_utility_score:.2f}")

In [None]:
for (test_df, sample_prediction_df) in tqdm(iter_test):
    predictions = model.predict(test_df[features].values[:,])
    predictions = predictions > 0
    sample_prediction_df.action = predictions.astype(int)
    env.predict(sample_prediction_df)