In [None]:
import pandas as pd
import numpy as np
import warnings
import os
warnings.simplefilter(action='ignore')

In [None]:
df = pd.read_csv("/kaggle/input/jane-street-market-prediction/train.csv")

Here I describe some intuitions behind the utility score function for the Jane Street Market Prediction
competition on Kaggle.

You can find more information on the problem here: https://www.kaggle.com/c/jane-street-market-prediction/overview

## Utility Score Definition

This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.


$$
p_i = \sum_j(weight_{ij} * resp_{ij} * action_{ij}),
$$

$$
t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}},
$$

where $|i|$ is the number of unique dates in the test set. The utility is then defined as:

$$ u = min(max(t,0), 6) \sum p_i. $$
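The three formulas above can be put together in one small function. This is only a minimal sketch of my reading of the definition, not the official scoring code; the function name `utility_score` and the toy numbers are made up for illustration.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of the utility u for equal-length 1-D inputs (hypothetical helper)."""
    date = np.asarray(date)
    contrib = (np.asarray(weight, dtype=float)
               * np.asarray(resp, dtype=float)
               * np.asarray(action, dtype=float))
    # Map every date to an index 0..|i|-1 and sum the contributions per day -> p_i
    unique_dates, day_idx = np.unique(date, return_inverse=True)
    p = np.bincount(day_idx, weights=contrib, minlength=len(unique_dates))
    sum_p = p.sum()
    norm = np.sqrt(np.sum(p ** 2))
    if norm == 0:
        return 0.0  # every p_i is 0 (for example, no trade taken), so u is 0
    t = sum_p / norm * np.sqrt(250 / len(unique_dates))
    return float(min(max(t, 0.0), 6.0) * sum_p)

# Toy data (made up): two days, the second trade of day 0 is skipped
u_example = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.5, -0.1, 0.2, 0.3],
    action=[1, 0, 1, 1],
)
print(u_example)  # p_0 = p_1 = 0.5, t is capped at 6, so u is about 6.0
```

The early return for an all-zero $p$ is my own choice to avoid a division by zero; the competition text does not say how that case is handled.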

In [None]:
df.head(2)

## $p_i$

Each row or trading opportunity can be chosen (action == 1) or not (action == 0). 

The variable $p_i$ is an indicator for each day $i$, showing how much weighted return we got for that day.

Let's say for example we want to verify the potential return for day 0.


In [None]:
# Take a copy so that adding columns below does not write into a view of df
df_0 = df[df['date'] == 0].copy()

Let's say we end up choosing all transactions for day 0. We would have:

$$
p_i = \sum_j(weight_{ij} * resp_{ij} * 1)
$$

In [None]:
# If we choose all transactions
df_0['mult'] = df_0['weight']*df_0['resp']*1
p_0 = df_0['mult'].sum()
p_0

Obviously, if we choose no transactions, $p_i = 0$.

In [None]:
# If we choose no transactions
df_0['mult'] = df_0['weight']*df_0['resp']*0
p_0 = df_0['mult'].sum()
p_0

Now, let's say that we only choose the ones that would give us a positive return.
This gives the maximum return we can get from day 0.

In [None]:
# Highest possible p for day 0
df_0['mult'] = df_0['weight']*df_0['resp']*(df_0['resp'] > 0)
p_0 = df_0['mult'].sum()
p_0

Since we want to maximize $u$, we also want to maximize each $p_i$. To do that, we should select as few transactions with negative $resp$ as possible
(since $weight$ is never negative, $resp$ is the only term in the equation that can make the sum go down)
and as many transactions with positive $resp$ as possible.
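The same selection rule can be applied to every day at once with a `groupby`. Below is a small sketch on a made-up toy frame that only borrows the column names `date`, `weight` and `resp` from the training data; the `action` and `contrib` columns are helpers I add for illustration. Keep in mind that `resp` is not known at prediction time, so this is only an upper bound on $p_i$.

```python
import pandas as pd

# Toy data (made up), same column names as the training set
toy = pd.DataFrame({
    "date":   [0, 0, 0, 1, 1],
    "weight": [1.0, 2.0, 0.5, 1.0, 3.0],
    "resp":   [0.02, -0.01, 0.04, -0.03, 0.01],
})

# "Perfect" action: take a trade only when its return is positive
toy["action"] = (toy["resp"] > 0).astype(int)
toy["contrib"] = toy["weight"] * toy["resp"] * toy["action"]

# p_i for every day i: sum of the contributions of that day
p_by_day = toy.groupby("date")["contrib"].sum()
print(p_by_day)
```

Replacing `toy` with the real `df` should give the best possible $p_i$ for each training day.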

## $t$

Now, let's try to understand what $t$ is all about.
Let's create an example.

Let's say we have two days to compose $t$.

First scenario, we have:

$$ Day0 -> p_0 = 74$$

$$ Day1 -> p_1 = 2$$


where $\sum p_i = 76$. If we calculate $t$ for this scenario we would have:

In [None]:
# t = sum(p) / sqrt(sum(p^2)) * sqrt(250 / number of days)
p = np.array([74, 2])
t = p.sum() / np.sqrt((p**2).sum()) * np.sqrt(250 / len(p))
t

Now, let's say we had different values for each day.


$$ Day0 -> p_0 = 38$$

$$ Day1 -> p_1 = 38$$

Note that in this scenario $\sum p_i $ is also 76.

In [None]:
# Same total return as before, but split evenly between the two days
p = np.array([38, 38])
t = p.sum() / np.sqrt((p**2).sum()) * np.sqrt(250 / len(p))
t

Ok, so we can see that $t$ is larger when the returns are spread evenly across days and vary less from day to day.
For the same total, it is better to have returns divided uniformly among days than to have all of your returns concentrated in just one day.
It reminds me a little of an $L_1$ over $L_2$ situation, where the $L_2$ norm penalizes outliers more than $L_1$.
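To look at that ratio on its own, here is a small sketch that leaves out the $\sqrt{250/|i|}$ factor. The helper name `sum_over_l2` is mine. For non-negative daily returns the plain sum equals the $L_1$ norm, and by the Cauchy-Schwarz inequality the ratio can never exceed $\sqrt{n}$ for $n$ days, a bound reached only when all days are equal.

```python
import numpy as np

def sum_over_l2(p):
    """Sum of the daily returns divided by their L2 norm (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    return p.sum() / np.sqrt(np.sum(p ** 2))

concentrated = sum_over_l2([74, 2])   # almost everything in one day
even = sum_over_l2([38, 38])          # same total, split evenly

print(concentrated, even)             # roughly 1.03 versus 1.41
print(np.sqrt(2))                     # upper bound for two days
```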

There is one more thing to consider in the $t$ equation.
We have a multiplying factor of $\sqrt{\frac{250}{|i|}}$.

So, basically, for the same daily returns, the higher $|i|$ the lower my $t$ value will be.

Let's say that, similar to the scenario above, we actually had 3 days instead of 2:

$$ Day0 -> p_0 = 38$$

$$ Day1 -> p_1 = 38$$

$$ Day2 -> p_2 = 0$$


In [None]:
# Third day with zero return: the sums stay the same, but |i| grows to 3
p = np.array([38, 38, 0])
t = p.sum() / np.sqrt((p**2).sum()) * np.sqrt(250 / len(p))
t

We can see we get a lower $t$ value than with 2 days.

Basically, we want returns that are uniformly distributed over days while maximizing our total return;
days that add nothing to the total only increase $|i|$ and lower $t$.
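One way to see both effects at once is to rewrite $t$ in terms of daily averages. This is just an algebraic rearrangement of the definition above (multiply numerator and denominator by $1/|i|$), not something stated in the competition description:

$$
t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \sqrt{\frac{250}{|i|}}
  = \sqrt{250} \, \frac{\frac{1}{|i|}\sum_i p_i}{\sqrt{\frac{1}{|i|}\sum_i p_i^2}}
$$

So $t$ is $\sqrt{250}$ times the mean daily return divided by the root mean square of the daily returns. A day with $p_i = 0$ lowers the mean by a larger factor than it lowers the root mean square, which is why the 3-day example scores lower, and $t$ can never be larger than $\sqrt{250} \approx 15.8$.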

The variable $t$, however, will only change the score while it is between 0 and 6, given the final equation:

$$ u = min(max(t,0), 6) \sum p_i. $$

otherwise, $t$ will be replaced by the number 6, or by 0 if it is negative (I am still trying to understand why 6, if anyone knows please share it with me :) ).
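The effect of the clipping can be sketched with a few made-up $t$ values and the total return of 76 from the examples above. The helper name `clipped_t` is mine.

```python
import numpy as np

def clipped_t(t):
    """min(max(t, 0), 6) from the utility formula (hypothetical helper)."""
    return float(np.clip(t, 0.0, 6.0))

sum_p = 76.0  # total return used in the examples above

# u for a negative t, a t inside [0, 6] and two t values above 6
u_by_t = {t: clipped_t(t) * sum_p for t in (-2.0, 3.0, 11.48, 15.81)}
print(u_by_t)
```

Once $t$ is above 6, improving it further no longer changes $u$; only a larger $\sum p_i$ does.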
