# Introduction

This notebook does not contribute directly to the competition.  
However, it is included because it may provide a hint to raise the score.  

The Public LB for this competition appears to be scored on the data provided in supplemental_files (dates 2021-12-06 to 2022-02-28).  
Therefore, the known Target values can be used as-is as the forecast, which gives a perfect prediction.  

However, in practice, a score based on perfect predictions is not the best score.  
This note explains why.

#### References
https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition

# Score based on perfect predictions

First, let's calculate the score based on the perfect prediction.

In [None]:
import os
from pathlib import Path
from decimal import ROUND_HALF_UP, Decimal

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from scipy.stats import norm

import matplotlib.pyplot as plt
import seaborn as sns

import optuna
optuna.logging.set_verbosity(optuna.logging.CRITICAL)

import warnings
warnings.simplefilter('ignore')

In [None]:
def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): spread return
    """
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, toprank_weight_ratio: float = 2) -> tuple:
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): sharpe ratio
        (pd.Series): daily spread returns, indexed by Date
    """
    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf

def add_rank(df, col_name="pred"):
    df["Rank"] = df.groupby("Date")[col_name].rank(ascending=False, method="first") - 1 
    df["Rank"] = df["Rank"].astype("int")
    return df

In [None]:
# Read supplemental_files
base_df = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')

In [None]:
# Predict and calculate scores
feature_df = base_df.copy()
feature_df["pred"] =  feature_df["Target"] # perfect predictions
feature_df = add_rank(feature_df)
score, buf = calc_spread_return_sharpe(feature_df)
print(f'score -> {score}')

5.434 matches the score obtained when the known Target is submitted as the prediction.  
Reference -> https://www.kaggle.com/code/ikeppyo/jpx-target-and-shifted-target-submit-test  

But this is not the highest possible score on the Public LB.  
You can see why by looking at how the scores are calculated.

# JPX Competition Metric

In this competition, scores are computed under the following conditions.

1. Every business day, the model takes as input the closing prices ($C_{(k, t)}$) up to that business day ($t$) and other data for each stock ($k$), and ranks the stocks by the predicted rate of change ($r_{(k, t)}$) of the closing price from the following business day ($C_{(k, t+1)}$) to the business day after that ($C_{(k, t+2)}$). The top 200 and bottom 200 ranked stocks are used for scoring.

    $$
    r_{(k, t)} = \frac{C_{(k, t+2)} - C_{(k, t+1)}}{C_{(k, t+1)}}
    $$
    
2. For the top 200 predicted stocks ($up_i\;\;(i = 1, 2, \ldots, 200)$), multiply each rate of change by linear weights running from 2 to 1 for ranks 1-200, and denote the normalized sum as $S_{up}$.

    $$
    S_{up} = \frac{\sum^{200}_{i=1}\left(r_{(up_i, t)} \cdot \text{linear function}(2, 1)_i\right)}{\mathrm{Average}(\text{linear function}(2, 1))}
    $$
    
3. For the bottom 200 predicted stocks ($down_i\;\;(i = 1, 2, \ldots, 200)$), multiply each rate of change by linear weights running from 2 to 1 for bottom ranks 1-200, and denote the normalized sum as $S_{down}$.

    $$
    S_{down} = \frac{\sum^{200}_{i=1}\left(r_{(down_i, t)} \cdot \text{linear function}(2, 1)_i\right)}{\mathrm{Average}(\text{linear function}(2, 1))}
    $$
    
4. The result of subtracting $S_{down}$ from $S_{up}$ is $R_{day}$ and is called "**daily spread return**".

    $$
    R_{day} = S_{up} - S_{down}
    $$
    
5. The daily spread return is calculated every business day during the public/private period and obtained as a time series for that period. The score is the mean of this time series divided by its standard deviation. Score calculation formula ($x$ is the last business day of the public/private period):

    $$
    Score = \frac{\mathrm{Average}(R_{day_1}, \ldots, R_{day_x})}{\mathrm{STD}(R_{day_1}, \ldots, R_{day_x})}
    $$
    
6. The Kaggler with the largest score for the private period wins.
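
The steps above can be checked by hand on a tiny example. The following is a minimal sketch, not the official metric code: the function name `toy_spread_return`, the six-stock day, and the portfolio size of 2 are all assumptions made for illustration (the competition uses 200).

```python
import numpy as np

def toy_spread_return(targets_by_rank, portfolio_size, toprank_weight_ratio=2.0):
    """Daily spread return for targets already sorted by predicted rank (best first).

    Hypothetical helper for illustration only.
    """
    targets = np.asarray(targets_by_rank, dtype=float)
    # Linear weights from toprank_weight_ratio down to 1
    weights = np.linspace(toprank_weight_ratio, 1.0, portfolio_size)
    # S_up: weighted sum over the top-ranked stocks, divided by the mean weight
    s_up = (targets[:portfolio_size] * weights).sum() / weights.mean()
    # S_down: same weights applied from the bottom of the ranking upward
    s_down = (targets[::-1][:portfolio_size] * weights).sum() / weights.mean()
    return s_up - s_down

# Made-up day with six stocks, perfectly ranked, portfolio size 2
targets = [0.03, 0.02, 0.01, -0.01, -0.02, -0.03]
r_day = toy_spread_return(targets, portfolio_size=2)
print(r_day)
```

With weights `[2, 1]` and mean weight 1.5, $S_{up} = 0.08 / 1.5$ and $S_{down} = -0.08 / 1.5$, so the daily spread return is about 0.1067.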

# Why a score based on perfect predictions is not the best score

As stated in the Competition Metric, the score is the mean of the daily spread returns divided by their std.  
A perfect prediction maximizes the mean of daily spread returns, but does not minimize the std.  
In other words, deliberately lowering unusually large daily spread returns reduces the std, which can increase the score.  

In the following, we take stocks with a particularly large absolute Target out of the top and bottom of the ranking by setting their prediction to 0. This decreases the mean of daily spread returns, but it decreases the std even more, so the score increases.
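
The effect is easy to see on made-up numbers. The sketch below uses five hypothetical daily spread returns (not competition data) and a hypothetical helper `sharpe`; it dampens the single outlier day and compares mean / std before and after.

```python
import numpy as np

def sharpe(daily_returns):
    """Mean divided by sample std (ddof=1, matching the pandas Series.std default)."""
    r = np.asarray(daily_returns, dtype=float)
    return r.mean() / r.std(ddof=1)

# Hypothetical daily spread returns: four ordinary days and one outlier day
perfect = np.array([10.0, 11.0, 9.0, 10.0, 30.0])

# Same days, but the outlier is deliberately dampened
damped = perfect.copy()
damped[-1] = 12.0

print(f"perfect: mean={perfect.mean():.2f}, score={sharpe(perfect):.2f}")
print(f"damped : mean={damped.mean():.2f}, score={sharpe(damped):.2f}")
```

In this toy case the mean drops from 14.0 to 10.4, yet the ratio rises from roughly 1.56 to roughly 9.12, because the std shrinks far more than the mean does.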

In [None]:
def calc_score(_predictor, base_df):
    feature_df = base_df.copy()
    feature_df = feature_df.groupby("Date").apply(lambda df: df.assign(pred=_predictor(df))).reset_index(level=0, drop=True)
    feature_df = add_rank(feature_df)
    score, buf = calc_spread_return_sharpe(feature_df)
    print(f'<<< {_predictor.__name__} >>>')
    print(f'score -> {score}')
    print(f'mean  -> {buf.mean()}')
    print(f'std   -> {buf.std()}')
    return feature_df['pred'], score, buf

def visualizer(_predictor, base_df):
    pred, score, buf = calc_score(_predictor, base_df)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,5))
    sns.distplot(pred, bins=100, ax=ax1)
    ax1.axes.set_xlim(0, 0.20)
    ax1.axes.set_ylim(0, 20)

    sns.distplot(buf, bins=50, fit=norm ,fit_kws={'label':'norm','color':'red'}, ax=ax2)
    ax2.set_xlabel("Daily spread return")
    ax2.axes.set_xlim(0, 30)
    ax2.axes.set_ylim(0, 0.3)
    plt.show()

def _predictor_perfect(feature_df):
    return feature_df['Target']

def _predictor_custom_g(x):
    def _predictor(feature_df):
        return feature_df['Target'].where(feature_df['Target'].abs() < x, 0)
    _predictor.__name__ = f'_predictor2_{x}'
    return _predictor    
    
visualizer(_predictor_perfect, base_df)
visualizer(_predictor_custom_g(0.20), base_df)
visualizer(_predictor_custom_g(0.15), base_df)
visualizer(_predictor_custom_g(0.10), base_df)

# Adjustment of predictions

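If your predictions are reasonably accurate, you can increase the score (almost without bound) by adjusting the ranking so that the daily spread return is nearly constant, because the std in the denominator then approaches zero.  
It is important to note that there is no point in raising the Public LB score this way: the Public LB is scored on data that is already published, and the winner is decided on the private period.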
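
To illustrate why the score can grow almost without bound, here is a small sketch on synthetic data. The level of 10, the 60 days, and the spread values are arbitrary assumptions; the point is only that mean / std scales like 1 / spread when the daily returns cluster around a fixed level.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=60)
# Standardize so the noise has zero mean and unit sample std
noise = (noise - noise.mean()) / noise.std(ddof=1)

target_level = 10.0  # assumed constant level for the daily spread return
scores = {}
for spread in (5.0, 1.0, 0.1, 0.01):
    daily = target_level + spread * noise   # synthetic daily spread returns
    scores[spread] = daily.mean() / daily.std(ddof=1)

for spread, score in scores.items():
    print(f"spread={spread:<5} score={score:.1f}")
```

Here the score equals `target_level / spread` by construction, so it goes from 2 to 1000 as the spread shrinks from 5 to 0.01.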

In [None]:
%%time

feature_cols = ['SecuritiesCode','Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']
target_cols = ['Target']

# create model
feature_df = base_df.copy()
model = lgb.LGBMRegressor(num_leaves=2**12, min_child_samples=1, random_state=2022)
model.fit(feature_df[feature_cols], feature_df[target_cols])
model.score(feature_df[feature_cols], feature_df[target_cols])

In [None]:
# This function adjusts the predictions so that the daily spread return approaches a certain value.
def adjuster(df):
    def calc_pred(df, x, y, z):
        return df['Target'].where(df['Target'].abs() < x, df['Target'] * y + np.sign(df['Target']) * z)

    def objective(trial, df):
        # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna versions
        x = trial.suggest_float('x', 0, 0.2)
        y = trial.suggest_float('y', 0, 0.1)
        z = trial.suggest_float('z', 0, 1e-3)
        df["Rank"] = calc_pred(df, x, y, z).rank(ascending=False, method="first") - 1 
        df["Rank"] = df["Rank"].astype("int")
        return calc_spread_return_per_day(df, 200, 2)
    
    def predictor_per_day(df):
        study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(seed=2022))
        # Run 10 trials and keep the parameters whose daily spread return is closest to 10
        study.optimize(lambda trial: abs(objective(trial, df) - 10), n_trials=10)
        best = study.best_params
        return calc_pred(df, best['x'], best['y'], best['z'])

    return df.groupby("Date").apply(predictor_per_day).reset_index(level=0, drop=True)

def _predictor_model(feature_df):
    pred = model.predict(feature_df[feature_cols])
    return pred

def _predictor_model_with_adjuster(feature_df):
    feature_df["Target"] = model.predict(feature_df[feature_cols])
    pred = adjuster(feature_df).iloc[0]
    return pred

def _predictor_target(feature_df):
    current_date = feature_df["Date"].iloc[0]
    target_df = base_df[base_df['Date'] == current_date].copy()
    target_map = target_df.set_index('SecuritiesCode')['Target'].to_dict()
    pred = feature_df['SecuritiesCode'].map(target_map)
    return pred

def _predictor_target_with_adjuster(feature_df):
    current_date = feature_df["Date"].iloc[0]
    target_df = base_df[base_df['Date'] == current_date].copy()
    target_map = target_df.set_index('SecuritiesCode')['Target'].to_dict()
    feature_df['Target'] = feature_df['SecuritiesCode'].map(target_map)
    pred = adjuster(feature_df).iloc[0]
    return pred

visualizer(_predictor_model, base_df)
visualizer(_predictor_model_with_adjuster, base_df)
visualizer(_predictor_target, base_df)
visualizer(_predictor_target_with_adjuster, base_df)

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

predictor = _predictor_target_with_adjuster

for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    feature_df = prices.copy()
    feature_df["pred"] = predictor(feature_df)
    feature_df = add_rank(feature_df)
    feature_map = feature_df.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(feature_map)
    env.predict(sample_prediction)