**This notebook is an English translation of [LanceZero's notebook](https://www.kaggle.com/code/lancezero/simple-leakage-submission-test/notebook?scriptVersionId=95129549).<br>
For educational purposes only.**

In [None]:
import pandas as pd
import numpy as np
import itertools
from warnings import filterwarnings
filterwarnings("ignore")

**The submission file of the competition looks like this**

In [None]:
submission = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/example_test_files/sample_submission.csv')

In [None]:
submission

**There are 56 trading days in total, and each trading day has 2000 stocks to choose from. <br>
We need to predict the rank of every stock separately for each day and overwrite the initial Rank values shown above. The result is the final submission file.**

**The scoring backend then computes a score from this table. <br>
For each day it takes the 200 top-ranked and the 200 bottom-ranked stocks, goes long the former and short the latter, and computes the return spread between the two for that day, giving 56 daily spreads in total. <br>
The score is the mean of the daily spreads divided by their standard deviation, so a larger spread does not automatically mean a better score. <br>
On the contrary, I think a spread that barely changes from day to day will score highly even if it is small, because its fluctuation is small. The idea is as follows.**

In [None]:
spread_return_1 = np.full(56,12) + np.random.normal(loc=0,scale=1,size=56)
spread_return_2 = np.full(56,15) + np.random.normal(loc=0,scale=2,size=56)

In [None]:
spread_return_1.mean(),spread_return_2.mean()

In [None]:
spread_return_1.std(),spread_return_2.std()

In [None]:
sharp_ratio_1 = spread_return_1.mean()/spread_return_1.std()
sharp_ratio_2 = spread_return_2.mean()/spread_return_2.std()

In [None]:
sharp_ratio_1,sharp_ratio_2  

**Even though the mean return of series 2 is 3 points higher than that of series 1 (15 vs. 12), its volatility is twice as large, so its Sharpe ratio is lower. <br>
This shows that the goal is not simply to rank stocks by the predicted Target (the rate of return), but to rank them with volatility in mind, so that the daily spread return is both large and stable.**
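**Because the cells above draw random noise, their exact numbers change on every run. As a minimal deterministic sketch of the same point, the two series below are hand-built (hypothetical values, not competition data) so that their means and standard deviations are exact.**

```python
import numpy as np

# Hypothetical, hand-built spread series (not competition data).
# Each alternates +/- a fixed deviation around its mean, so the
# population standard deviation equals that deviation exactly.
signs = np.tile([1.0, -1.0], 28)      # 56 "days"
steady = 12.0 + 1.0 * signs           # mean 12, std 1
volatile = 15.0 + 2.0 * signs         # mean 15, std 2

ratio_steady = steady.mean() / steady.std()
ratio_volatile = volatile.mean() / volatile.std()
print(ratio_steady, ratio_volatile)   # the steadier series has the higher ratio
```

**The series with the lower mean still gets the higher mean/std ratio, because its deviation is half as large.**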

In [None]:
def calc_spread_return_per_day(df, portfolio_size=200, toprank_weight_ratio=2):
    # Given one day's table with 'Rank' and 'Target' columns, return that day's long-short spread return
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

In [None]:
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size=200, toprank_weight_ratio=2):
    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf 
# buf is the Series of daily spread returns, one value per Date
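**To see what these two functions compute, here is a small sketch on a made-up table: two dates, six stocks each, and `portfolio_size=2` instead of 200. The dates, targets, and ranks are hypothetical. The functions are copied from the cells above, with one small deviation: the grouped frame is restricted to the `Rank` and `Target` columns before `apply`, which is an assumption made here to keep the behavior the same across pandas versions.**

```python
import numpy as np
import pandas as pd

def calc_spread_return_per_day(df, portfolio_size=200, toprank_weight_ratio=2):
    # Ranks must be exactly 0 .. n-1 for the day
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    # Linearly decaying weights: the top-ranked stock counts twice as much as the last one
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

def calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2):
    # Deviation from the notebook cell: select only the columns the per-day function reads
    buf = df.groupby('Date')[['Rank', 'Target']].apply(
        calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    return buf.mean() / buf.std(), buf

# Hypothetical toy data: both days are ranked perfectly (Rank 0 = highest Target)
targets_day1 = [0.03, 0.02, 0.01, 0.0, -0.01, -0.02]
toy = pd.DataFrame({
    "Date": ["2021-12-06"] * 6 + ["2021-12-07"] * 6,
    "Target": targets_day1 + [2 * t for t in targets_day1],
    "Rank": list(range(6)) * 2,
})

sharpe, daily = calc_spread_return_sharpe(toy, portfolio_size=2)
print(daily)
print(sharpe)
```

**With two stocks per side the weights are 2 and 1, so the first day's spread is (2·0.03 + 0.02)/1.5 − (2·(−0.02) + (−0.01))/1.5 = 0.13/1.5. The second day's spread is twice that, and it is this day-to-day variation that the ratio penalizes.**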

**The training data ends on Friday, 2021-12-03. <br>
The supplemental data runs from Monday, 2021-12-06, to 2022-02-28. <br>
In theory, we would predict the Target from 2021-12-06 onward and submit it, and the backend would score it against the real data after 2021-12-06. However, the real data from 2021-12-06 onward is already given to us in supplemental_files. <br>
That leaves three options. <br>
The first is to skip model training entirely, submit the real data, and get a very high score. <br>
The second is to use this data set as test data and score a trained model yourself, which should have the same results as version A. <br>
The third is to add it to the training set so the model has more data, which should be an advantage in the final submission.**

# 1. Use supplemental_files to get a feel for the submission process first

In [None]:
# parse_dates=["Date"] converts the Date column to datetime
df = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv', parse_dates=["Date"])

In [None]:
# The supplemental data covers 56 distinct dates
df['Date'].nunique()

In [None]:
# df.groupby("Date") splits df into 56 sub-tables, one per date.
# Selecting only the Target column of each group is equivalent to 56 Series.
df.groupby("Date")["Target"]

In [None]:
len(df.groupby("Date")["Target"])

In [None]:
list(df.groupby("Date")["Target"])[1]

# 1. Rank stocks by Target alone, ignoring volatility, on the assumption that a higher return is always better

In [None]:
# Rank the Target of the 2000 stocks within each date (0 = highest Target)
df.groupby("Date")["Target"].rank(ascending=False, method="first") - 1

In [None]:
a = pd.Series([23,34,13,13,44,51])
a

In [None]:
# Without method, tied values share the same averaged rank, which may not be an integer
a.rank(ascending=False)

In [None]:
# method='first' breaks ties by order of appearance, so every rank is a distinct integer
a.rank(ascending=False,method='first')

In [None]:
a.rank(ascending=False,method='first') -1 

In [None]:
df['Rank'] = df.groupby("Date")["Target"].rank(ascending=False, method="first") - 1

In [None]:
df['Rank'] = df['Rank'].astype('int')

**This gives the Rank column: the rank of each stock's Target among the 2000 stocks on that day.**

**The submission file, however, requires the table to be sorted by date and, within each date, by rank in ascending order from 0 to 1999.**

In [None]:
# Sort by date first, then each date by rank.
df.sort_values(["Date", "Rank"],ascending=True)

In [None]:
df_submission = df.sort_values(["Date", "Rank"],ascending=True)

**Next, look at the submission file. Its Rank column has to be matched to our prediction through the SecuritiesCode column: for each date, look up the Rank that df_submission assigns to each SecuritiesCode.**
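**The lookup by SecuritiesCode can be sketched on a tiny example. The codes and ranks below are made up for illustration, and `code_to_rank` is a hypothetical name. The idea is to index one day's predictions by SecuritiesCode and then use `Series.map` to fill the Rank column of the submission rows, whatever order they arrive in.**

```python
import pandas as pd

# Hypothetical one-day slice of our predictions (codes and ranks are made up)
day_df = pd.DataFrame({"SecuritiesCode": [1301, 1332, 1333], "Rank": [2, 0, 1]})

# Hypothetical submission rows for the same day, in a different order,
# carrying placeholder ranks that need to be overwritten
sample_prediction = pd.DataFrame({"SecuritiesCode": [1333, 1301, 1332], "Rank": [0, 1, 2]})

# Series indexed by SecuritiesCode, so it works as a code -> rank lookup
code_to_rank = day_df.set_index("SecuritiesCode")["Rank"]
sample_prediction["Rank"] = sample_prediction["SecuritiesCode"].map(code_to_rank)

# A code missing from the lookup would become NaN, so this check is worth keeping
print(sample_prediction["Rank"].isna().any())
print(sample_prediction)
```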

In [None]:
submission

**In fact, this is already done; all that remains is to submit.**

**Submission is iterative: the Rank column of the submission file is filled in one date at a time.**

## 1.1. Calculate the score yourself first (in theory this is not needed, since the backend calculates it after the file is submitted)

In [None]:
return_list = []
for date in df_submission.Date.unique():
    today_return = calc_spread_return_per_day(df_submission.loc[ df_submission.Date == date])
    return_list.append(today_return)

In [None]:
return_list = np.array(return_list)

In [None]:
min(return_list)  # The minimum is 11.44

In [None]:
sharp_ratio = return_list.mean() / return_list.std()

In [None]:
sharp_ratio 
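**One detail worth checking here: `return_list.std()` is NumPy's standard deviation, which defaults to `ddof=0`, whereas `calc_spread_return_sharpe` calls pandas `Series.std()`, which defaults to `ddof=1`. The two therefore give slightly different ratios for the same daily returns (a factor of sqrt(56/55), roughly 0.9%, over 56 days). The sketch below uses made-up numbers only to show the difference in defaults.**

```python
import numpy as np
import pandas as pd

# Hypothetical daily spread returns, for illustration only
daily = np.array([1.0, 2.0, 3.0, 4.0])

population_std = daily.std()             # NumPy default: ddof=0
sample_std = pd.Series(daily).std()      # pandas default: ddof=1
matched_std = daily.std(ddof=1)          # NumPy made to agree with pandas

print(population_std, sample_std, matched_std)
```

**If the hand-computed ratio should agree with `calc_spread_return_sharpe`, passing `ddof=1` to the NumPy call is one way to line them up.**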

# 2. Submission

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
# Each iteration delivers one date; only prices and the submission template are used here
for prices, _, _, _, _, sample_prediction in iter_test:
    # Select the precomputed ranks for the current date
    day_df = df_submission[df_submission['Date']==prices["Date"].iloc[0]]
    # Series mapping SecuritiesCode -> Rank for that date
    map_dict = day_df.set_index("SecuritiesCode")["Rank"]
    sample_prediction["Rank"] = sample_prediction.SecuritiesCode.map(map_dict)
    env.predict(sample_prediction)