# JPX Tokyo Stock Exchange Prediction

# 1.1 About the Competition
We need to build a model that is evaluated with the
[Sharpe ratio evaluation metric](https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition).
<p> We need to return the rank of each stock active on the given day.</p>
<p>The returns for a single day treat the 200 highest (e.g. 0 to 199) ranked stocks as purchased and the lowest (e.g. 1999 to 1800) ranked 200 stocks as shorted. The stocks are then weighted based on their ranks and the total returns for the portfolio are calculated assuming the stocks were purchased the next day and sold the day after that.</p>
<p> Also we must submit to this competition using the provided python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks:</p>
<pre><code>import numpy as np
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    sample_prediction['Rank'] = np.arange(len(sample_prediction))  # make your predictions here
    env.predict(sample_prediction)   # register your predictions
</code></pre>
<p>We will get an error if we:</p>  
<ul>
<li>Use ranks that are below zero or greater than or equal to the number of stocks for a given date.</li>
<li>Submit any duplicated ranks.</li>
<li>Change the order of the rows.</li>
</ul>
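
The three rules above can be checked locally before calling `env.predict`. The helper below is a minimal sketch: `validate_ranks` is a hypothetical name that is not part of the competition API, and it assumes the submission frame has `SecuritiesCode` and `Rank` columns.

```python
import numpy as np
import pandas as pd


def validate_ranks(submission: pd.DataFrame, expected_codes) -> None:
    """Raise ValueError if `submission` breaks one of the three rules above."""
    ranks = submission["Rank"].to_numpy()
    n = len(submission)
    # Rule 1: every rank must lie in [0, n - 1].
    if ranks.min() < 0 or ranks.max() >= n:
        raise ValueError("ranks must be between 0 and n - 1")
    # Rule 2: no rank may be used twice.
    if len(np.unique(ranks)) != n:
        raise ValueError("duplicated ranks")
    # Rule 3: rows must stay in the order delivered by the API.
    if list(submission["SecuritiesCode"]) != list(expected_codes):
        raise ValueError("row order was changed")


# Toy example with three securities codes.
codes = [1301, 1332, 1333]
ok = pd.DataFrame({"SecuritiesCode": codes, "Rank": [2, 0, 1]})
validate_ranks(ok, codes)  # passes silently
```

In the submission loop, `expected_codes` would be the `SecuritiesCode` column of `sample_prediction` as it was delivered, copied before any sorting.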

# 1.2 Business Requirements
<p>We need to predict the stock ranks as accurately as possible.</p>
<p>Low latency </p>
<p> High speed </p>

# 1.3 Let's Understand the given Data

This dataset contains historic data for a variety of Japanese stocks and options. Your challenge is to predict the future returns of the stocks.

As historic stock prices are not confidential this will be a forecasting competition using the time series API. The data for the public leaderboard period is included as part of the competition dataset. Expect to see many people submitting perfect submissions for fun. Accordingly, the active phase public leaderboard for this competition is intended as a convenience for anyone who wants to test their code. The forecasting phase leaderboard will be determined using real market data gathered after the submission period closes.

<h2>Files</h2>


<h3>stock_prices.csv:</h3> The core file of interest. Includes the daily closing price for each stock and the target column.

<h3>options.csv:</h3> Data on the status of a variety of options based on the broader market. Many options include implicit predictions of the future price of the stock market and so may be of interest even though the options are not scored directly.

<h3>secondary_stock_prices.csv:</h3> The core dataset contains only the 2,000 most commonly traded equities, but many less liquid securities are also traded on the Tokyo market. This file contains data for those securities, which aren't scored but may be of interest for assessing the market as a whole.

<h3>trades.csv:</h3> Aggregated summary of trading volumes from the previous business week.

<h3>financials.csv:</h3> Results from quarterly earnings reports.

<h3>stock_list.csv:</h3> Mapping between the SecuritiesCode and company names, plus general information about which industry the company is in.

<h2> Folders </h2> 

<h3> data_specifications: </h3> Definitions for individual columns.

<h3> jpx_tokyo_market_prediction: </h3> Files that enable the API. Expect the API to deliver all rows in under five minutes and to reserve less than 0.5 GB of memory.

Copies of data files exist in multiple folders that cover different time windows and serve different purposes.

<h3> train_files: </h3> Data folder covering the main training period.

<h3> supplemental_files: </h3> Data folder containing a dynamic window of supplemental training data. This will be updated with new data during the main phase of the competition in early May, early June, and roughly a week before the submissions are locked.

<h3> example_test_files: </h3> Data folder covering the public test period. Intended to facilitate offline testing. Includes the same columns delivered by the API (i.e. no Target column). You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that. This folder also includes an example of the sample submission file that will be delivered by the API.
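
The Target calculation described above can be sketched as follows. This is an illustration under assumptions: `compute_target` is a hypothetical helper, the toy prices are invented, and the plain `Close` column is only correct when no split occurs in the window (otherwise an adjusted close should be used).

```python
import pandas as pd


def compute_target(prices: pd.DataFrame, close_col: str = "Close") -> pd.Series:
    """Return the rate of change of the close between t+1 and t+2 for each row."""
    prices = prices.sort_values(["SecuritiesCode", "Date"])
    grouped = prices.groupby("SecuritiesCode")[close_col]
    close_t1 = grouped.shift(-1)  # close on the next trading day
    close_t2 = grouped.shift(-2)  # close on the trading day after that
    return (close_t2 - close_t1) / close_t1


# Toy data for one security over four trading days.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-06", "2021-12-07", "2021-12-08", "2021-12-09"]),
    "SecuritiesCode": [1301] * 4,
    "Close": [100.0, 110.0, 121.0, 108.9],
})
toy["Target"] = compute_target(toy)
```

The last two rows of each security have no Target, because the closes for t+1 and t+2 are not yet known.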

# 1.4 Let's understand the Data by performing EDA(Exploratory Data Analysis)

The data provided consists of 24 files: 22 .csv files, one .py file and one .so file.

The stock_prices.csv file has 12 columns in total, covering each day's open, close, volume and other features.

RowId: unique ID of price records, the combination of Date and SecuritiesCode.

Date: trade date

SecuritiesCode: Security code given by security depositors(JPX, TSE, OSE and TOCOM)
  
Open: First traded price of that particular day.

High: Day's High.

Low: Day's Low.

Close: Closing price of that day.

Volume: Total number of shares traded on that particular day.

AdjustmentFactor: to calculate theoretical price/volume when split/reverse-split happens (NOT including dividend/allot...)

ExpectedDividend: Expected dividend value for ex-right date. This value is recorded for t+2 business days before ex-dividend...

SupervisionFlag: Flag of securities under supervision and securities to be delisted -> info.

Target: Change ratio of adjusted closing price between t+2 and t+1 where t+0 is TradeDate.

In [None]:
#Installing Libraries
!pip install --upgrade ta

# Importing Packages
import os 
from pathlib import Path
import pandas as pd
import numpy as np
import pandas_profiling as pp
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('seaborn')
import ta
from ta import add_all_ta_features
from ta.utils import dropna
from decimal import ROUND_HALF_UP, Decimal
from lightgbm import LGBMRegressor
from tqdm.notebook import tqdm
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
import warnings
from sklearn import preprocessing
import jpx_tokyo_market_prediction

warnings.filterwarnings('ignore')
%matplotlib inline
init_notebook_mode(connected = True)




In [None]:
data_dir = Path("../input/jpx-tokyo-stock-exchange-prediction")
train_dir = data_dir / "train_files"
stock_prices = pd.read_csv(train_dir / "stock_prices.csv")
financials = pd.read_csv(train_dir / "financials.csv")
options = pd.read_csv(train_dir / "options.csv")
secondary_stock_prices = pd.read_csv(train_dir / "secondary_stock_prices.csv")
stock_list = pd.read_csv(data_dir / "stock_list.csv")


In [None]:
stock_prices

In [None]:
financials

In [None]:
options

In [None]:
secondary_stock_prices

In [None]:
stock_list

In [None]:
pp.ProfileReport(stock_prices)

### Stock Price Overview

<b>Variable types:</b>

*  Categorical : 2
*  Numerical : 9
*  Boolean : 1

<p><b>RowId:</b> Completely distinct (100%); values: 2332531 (no missing data)</p>
<p><b>Date:</b> Distinct (0.1%); values: 1202 (no missing data)</p>
<p><b>SecuritiesCode:</b> Distinct (0.1%); values: 2000 (no missing data); codes range from 1301 to 9997</p>
<p><b>Open:</b> Distinct (1%); values: 23067 (missing data: 7608 (0.3%)); highly correlated</p>
<p><b>High:</b> Distinct (1%); values: 23960 (missing data: 7608 (0.3%)); highly correlated</p>
<p><b>Low:</b> Distinct (1%); values: 23904 (missing data: 7608 (0.3%)); highly correlated</p>
<p><b>Close:</b> Distinct (1%); values: 24046 (missing data: 7608 (0.3%)); highly correlated</p>
<p><b>Volume:</b> Distinct (3.8%); values: 89006 (no missing data); zeros: 7608 (0.3%)</p>
<p><b>AdjustmentFactor:</b> Distinct (&lt;0.1%); values: 19 (no missing data)</p>
<p><b>ExpectedDividend:</b> Distinct (2.4%); values: 446 (missing data: 2313666 (99.2%)); zeros: 3551 (0.2%)</p>
<p><b>SupervisionFlag:</b> Distinct (&lt;0.1%); values: 2; Boolean (True: 1495, False: 2331036)</p>
<p><b>Target:</b> Distinct (15.2%); values: 354507 (missing data: 238 (&lt;0.1%)); zeros: 87990 (3.8%)</p>
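
The distinct, missing and zero counts listed above can also be computed with plain pandas, which is handy when the full profile report is too slow. The snippet below is a small sketch on invented toy data; the `summary` frame and its column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the stock_prices columns.
toy = pd.DataFrame({
    "Close": [100.0, np.nan, 102.0, 102.0],
    "Volume": [10, 0, 12, 9],
})

summary = pd.DataFrame({
    "distinct": toy.nunique(),               # NaN is not counted as a value
    "missing": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
    "zeros": (toy == 0).sum(),
})
print(summary)
```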


<h4> The ProfileReport gives an overview of the data; now let's look at the individual features in more detail.</h4>

RowId needs no further analysis, as it has no missing data and every value is distinct.



<h2> Date Feature </h2>
Let's perform some more EDA on the Date feature.

In [None]:
stock_prices['Date'] = pd.to_datetime(stock_prices['Date'])
fig = px.bar(stock_prices['Date'].value_counts())
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.show()

In [None]:
fig = px.line(stock_prices['Date'].value_counts())
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
# add annotation
fig.add_annotation(dict(font=dict(color='Red',size=15),
                                        x=0.01,
                                        y=0.97,
                                        showarrow=True,
                                        text="Note:<br>The line plot does not show holidays, because it connects the points and skips the missing dates.<br>We use it to see how the number of listed stocks grows over time.",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))
fig.show()

From the bar plot we can see that there is no data for some days: stock markets are closed on Saturdays and Sundays, and also on public holidays. The white gaps in the bar plot mark the dates on which the market was closed. We can also see that not all stocks are present from the start: the daily count begins at 1865 and rises as more stocks are listed, and there are currently 2000 stocks in the dataset.
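
The weekday gaps seen in the bar plot can also be listed directly. The helper below is a sketch: `missing_weekdays` is a hypothetical name, and it only compares the observed dates against all Monday to Friday dates in the covered range.

```python
import pandas as pd


def missing_weekdays(dates: pd.Series) -> pd.DatetimeIndex:
    """Return Monday-Friday dates inside the covered range that have no rows."""
    observed = pd.DatetimeIndex(pd.to_datetime(dates).unique())
    all_weekdays = pd.bdate_range(observed.min(), observed.max())
    return all_weekdays.difference(observed)


# Toy data: Wednesday 2021-11-03 is absent (weekends are never counted).
toy_dates = pd.Series(pd.to_datetime(
    ["2021-11-01", "2021-11-02", "2021-11-04", "2021-11-05", "2021-11-08"]
))
holidays = missing_weekdays(toy_dates)
print(holidays)
```

On the real data this would be called as `missing_weekdays(stock_prices['Date'])`.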

<h2> Security Feature </h2>

In [None]:
stock_prices['SecuritiesCode'].value_counts()

In [None]:
# number of price records per security
counts = stock_prices['SecuritiesCode'].value_counts()
print('Fraction of stocks with fewer than 500 records: ', (counts < 500).mean())

In [None]:
pp.ProfileReport(financials)

In [None]:
pp.ProfileReport(secondary_stock_prices)

In [None]:
pp.ProfileReport(stock_list)

In [None]:
data_distr=stock_list.groupby('SecuritiesCode').size().reset_index(name='total')
stock_list_data=pd.merge(stock_list,data_distr, how='left',on=['SecuritiesCode'])
stock_list_data=stock_list_data.groupby(['33SectorName']).total.sum().reset_index(name='total')
fig = px.bar(x = stock_list_data["33SectorName"],
             y = stock_list_data['total'] , 
             color = stock_list_data["33SectorName"] ,
             color_continuous_scale="Emrld") 
fig.update_xaxes(title="Sector")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Data Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'} ,
        template="plotly_white")
fig.show()

# CandleStick Charts to understand Market Performance


We have an excellent library called "ta" for technical analysis [github](https://github.com/bukosabino/ta). "It is a Technical Analysis library useful to do feature engineering from financial time series datasets (Open, Close, High, Low, Volume). It is built on Pandas and Numpy."

The library has implemented 42 indicators:

Volume
* Money Flow Index (MFI)
* Accumulation/Distribution Index (ADI)
* On-Balance Volume (OBV)
* Chaikin Money Flow (CMF)
* Force Index (FI)
* Ease of Movement (EoM, EMV)
* Volume-price Trend (VPT)
* Negative Volume Index (NVI)
* Volume Weighted Average Price (VWAP)

Volatility
* Average True Range (ATR)
* Bollinger Bands (BB)
* Keltner Channel (KC)
* Donchian Channel (DC)
* Ulcer Index (UI)

Trend
* Simple Moving Average (SMA)
* Exponential Moving Average (EMA)
* Weighted Moving Average (WMA)
* Moving Average Convergence Divergence (MACD)
* Average Directional Movement Index (ADX)
* Vortex Indicator (VI)
* Trix (TRIX)
* Mass Index (MI)
* Commodity Channel Index (CCI)
* Detrended Price Oscillator (DPO)
* KST Oscillator (KST)
* Ichimoku Kinkō Hyō (Ichimoku)
* Parabolic Stop And Reverse (Parabolic SAR)
* Schaff Trend Cycle (STC)

Momentum
* Relative Strength Index (RSI)
* Stochastic RSI (SRSI)
* True strength index (TSI)
* Ultimate Oscillator (UO)
* Stochastic Oscillator (SR)
* Williams %R (WR)
* Awesome Oscillator (AO)
* Kaufman's Adaptive Moving Average (KAMA)
* Rate of Change (ROC)
* Percentage Price Oscillator (PPO)
* Percentage Volume Oscillator (PVO)

Others
* Daily Return (DR)
* Daily Log Return (DLR)
* Cumulative Return (CR)

In [None]:
def adjust_price(price):
    """
    Args:
        price (pd.DataFrame)  : pd.DataFrame include stock_price
    Returns:
        price DataFrame (pd.DataFrame): stock_price with generated AdjustedClose
    """
    # transform Date column into datetime (plain assignment so the dtype really changes)
    price["Date"] = pd.to_datetime(price["Date"], format="%Y-%m-%d")

    def generate_adjusted_close(df):
        """
        Args:
            df (pd.DataFrame)  : stock_price for a single SecuritiesCode
        Returns:
            df (pd.DataFrame): stock_price with AdjustedClose for a single SecuritiesCode
        """
        # sort data to generate CumulativeAdjustmentFactor
        df = df.sort_values("Date", ascending=False)
        # generate CumulativeAdjustmentFactor
        df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
        # generate AdjustedClose
        df.loc[:, "AdjustedClose"] = (
            df["CumulativeAdjustmentFactor"] * df["Close"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        # reverse order
        df = df.sort_values("Date")
        # to fill AdjustedClose, replace 0 into np.nan
        df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
        # forward fill AdjustedClose
        df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
        return df

    # generate AdjustedClose
    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)

    price.set_index("Date", inplace=True)
    return price

In [None]:
# generate AdjustedClose
stock_prices = adjust_price(stock_prices)
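
To see what the cumulative adjustment does, here is a small worked example on invented numbers: a toy 2-for-1 split with an `AdjustmentFactor` of 0.5 on the second day. It repeats only the core steps of `generate_adjusted_close` (no rounding or forward fill).

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
    "Close": [1000.0, 1000.0, 500.0, 500.0],
    "AdjustmentFactor": [1.0, 0.5, 1.0, 1.0],
})

# Walk backwards in time so each row accumulates every later adjustment.
toy = toy.sort_values("Date", ascending=False)
toy["CumulativeAdjustmentFactor"] = toy["AdjustmentFactor"].cumprod()
toy["AdjustedClose"] = toy["CumulativeAdjustmentFactor"] * toy["Close"]

# Restore chronological order.
toy = toy.sort_values("Date")
print(toy[["Date", "Close", "AdjustedClose"]])
```

After the adjustment the series is flat at 500, so the split no longer looks like a 50% price drop.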

# Example: Nintendo Co., Ltd (SecuritiesCode: 7974)

In [None]:
nintendo_data = stock_prices.loc[stock_prices["SecuritiesCode"] == 7974].copy()

In [None]:
nintendo_data = ta.add_all_ta_features(
    nintendo_data, "Open", "High", "Low", "Close", "Volume", fillna=False
)

In [None]:
nintendo_data.shape


# Bollinger Bands

In [None]:
plt.plot(nintendo_data[100:500].Close, label='Close')
plt.plot(nintendo_data[100:500].volatility_bbh, label='High BB')
plt.plot(nintendo_data[100:500].volatility_bbl, label='Low BB')
plt.plot(nintendo_data[100:500].volatility_bbm, label='SMA BB')
plt.title('Bollinger Bands')
plt.legend()
plt.show()
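
For intuition, the three bands plotted above can be reproduced with plain pandas. This is a sketch of the usual textbook definition, not the source code of `ta`; the 20-day window and the 2 standard deviations are common defaults and are assumptions here, and `bollinger_bands` is a hypothetical helper.

```python
import pandas as pd


def bollinger_bands(close: pd.Series, window: int = 20, n_std: float = 2.0) -> pd.DataFrame:
    """Middle band = rolling mean; outer bands = mean +/- n_std rolling std."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)  # population standard deviation
    return pd.DataFrame({
        "bbm": mid,
        "bbh": mid + n_std * std,
        "bbl": mid - n_std * std,
    })


# Tiny example with a 2-day window so the numbers are easy to check by hand.
toy_close = pd.Series([10.0, 12.0, 14.0, 12.0, 10.0])
bands = bollinger_bands(toy_close, window=2, n_std=2.0)
print(bands)
```

The first `window - 1` rows are NaN because the rolling window is not yet full.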

# MACD

In [None]:
plt.plot(nintendo_data[100:500].trend_macd, label='MACD')
plt.plot(nintendo_data[100:500].trend_macd_signal, label='MACD Signal')
plt.plot(nintendo_data[100:500].trend_macd_diff, label='MACD Difference')
plt.title('MACD, MACD Signal and MACD Difference')
plt.legend()
plt.show()
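
The three MACD series can likewise be sketched with pandas exponential moving averages. The spans 12, 26 and 9 are the conventional defaults and are assumptions here; `macd` is a hypothetical helper written for illustration, not the implementation used by `ta`.

```python
import pandas as pd


def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, its signal line, and the difference between the two."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    line = ema_fast - ema_slow                        # MACD line
    signal_line = line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({
        "macd": line,
        "signal": signal_line,
        "diff": line - signal_line,                   # often drawn as a histogram
    })


# A steadily rising toy price: the fast EMA stays above the slow EMA.
rising = macd(pd.Series(range(1, 101), dtype=float))
print(rising.tail(3))
```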

# KST

In [None]:
plt.plot(nintendo_data[100:500].trend_kst, label='KST')
plt.plot(nintendo_data[100:500].trend_kst_sig, label='KST Signal')
plt.plot(nintendo_data[100:500].trend_kst_diff, label='KST - KST Signal')
plt.title('Know Sure Thing (KST)')
plt.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(11, 9, figsize=(20,20))
fig.tight_layout()
ax = ax.flatten()
for i, col in enumerate(nintendo_data.columns):
    ax[i].plot(nintendo_data[col], color="m")
    ax[i].title.set_text(col)
    ax[i].axis('off')
plt.show()

In [None]:
print(nintendo_data.columns)


# Example: Toyota Motor Corporation (SecuritiesCode: 7203)

In [None]:
toyota_data = stock_prices.loc[stock_prices["SecuritiesCode"] == 7203].copy()

In [None]:
toyota_data = ta.add_all_ta_features(
    toyota_data, "Open", "High", "Low", "Close", "Volume", fillna=False
)

In [None]:
toyota_data.shape


# Bollinger Bands

In [None]:
plt.plot(toyota_data[100:500].Close, label='Close')
plt.plot(toyota_data[100:500].volatility_bbh, label='High BB')
plt.plot(toyota_data[100:500].volatility_bbl, label='Low BB')
plt.plot(toyota_data[100:500].volatility_bbm, label='SMA BB')
plt.title('Bollinger Bands')
plt.legend()
plt.show()

# MACD

In [None]:
plt.plot(toyota_data[100:500].trend_macd, label='MACD')
plt.plot(toyota_data[100:500].trend_macd_signal, label='MACD Signal')
plt.plot(toyota_data[100:500].trend_macd_diff, label='MACD Difference')
plt.title('MACD, MACD Signal and MACD Difference')
plt.legend()
plt.show()

# KST

In [None]:
plt.plot(toyota_data[100:500].trend_kst, label='KST')
plt.plot(toyota_data[100:500].trend_kst_sig, label='KST Signal')
plt.plot(toyota_data[100:500].trend_kst_diff, label='KST - KST Signal')
plt.title('Know Sure Thing (KST)')
plt.legend()
plt.show()

In [None]:
print(len(toyota_data.columns))


In [None]:
print(toyota_data.columns)


In [None]:
fig, ax = plt.subplots(11, 9, figsize=(20,20))
fig.tight_layout()
ax = ax.flatten()
for i, col in enumerate(toyota_data.columns):
    ax[i].plot(toyota_data[col])
    ax[i].title.set_text(col)
    ax[i].axis('off')
plt.show()

# Pre-processing for model building
This notebook presents a simple model using LightGBM.

First, the features are generated with the `ta` technical indicators described above.

In [None]:
def get_features_for_predict(price, code):
    """
    Args:
        price (pd.DataFrame)  : pd.DataFrame include stock_price
        code (int)  : A local code for a listed company
    Returns:
        feature DataFrame (pd.DataFrame)
    """
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code].copy()
    
    # Add the columns for all indicators provided by ta
    feats = ta.add_all_ta_features(
        feats, "Open", "High", "Low", close_col, "Volume", fillna=False
    )
    
    # To only add specific features
    # Example: https://github.com/bukosabino/ta/blob/master/examples_to_use/bollinger_band_features_example.py
    # df['bb_bbm'] = indicator_bb.bollinger_mavg()
    # df['bb_bbh'] = indicator_bb.bollinger_hband()
    # df['bb_bbl'] = indicator_bb.bollinger_lband()
    
    # filling data for nan and inf
    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    # drop AdjustedClose column
    feats = feats.drop([close_col], axis=1)

    return feats

In [None]:
# fetch prediction target SecuritiesCodes
codes = sorted(stock_prices["SecuritiesCode"].unique())
len(codes)

In [None]:
# generate feature
buff = []
for code in tqdm(codes):
    feat = get_features_for_predict(stock_prices, code)
    buff.append(feat)
feature = pd.concat(buff)

In [None]:
feature.tail(2)


# Label creation
Next, we obtain the labels to be used for training the model (this is where we load and split the label data).

In [None]:
def get_label(price, code):
    """ Labelizer
    Args:
        price (pd.DataFrame): dataframe of stock_price.csv
        code (int): Local Code in the universe
    Returns:
        df (pd.DataFrame): label data
    """
    df = price.loc[price["SecuritiesCode"] == code].copy()
    df.loc[:, "label"] = df["Target"]

    return df.loc[:, ["SecuritiesCode", "label"]]

In [None]:
# split data into TRAIN and TEST
TRAIN_END = "2019-12-31"
# We put a week gap between TRAIN_END and TEST_START
# to avoid leakage of test data information from label
TEST_START = "2020-01-06"

def get_features_and_label(price, codes, features):
    """
    Args:
        price (pd.DataFrame): loaded price data
        codes  (array) : target codes
        features (pd.DataFrame): features
    Returns:
        train_X (pd.DataFrame): training data
        train_y (pd.DataFrame): label for train_X
        test_X (pd.DataFrame): test data
        test_y (pd.DataFrame): label for test_X
    """
    # to store split data
    trains_X, tests_X = [], []
    trains_y, tests_y = [], []

    # generate feature one by one
    for code in tqdm(codes):

        feats = features[features["SecuritiesCode"] == code].dropna()
        labels = get_label(price, code).dropna()

        if feats.shape[0] > 0 and labels.shape[0] > 0:
            # align label and feature indexes
            labels = labels.loc[labels.index.isin(feats.index)]
            feats = feats.loc[feats.index.isin(labels.index)]

            assert (labels.loc[:, "SecuritiesCode"] == feats.loc[:, "SecuritiesCode"]).all()
            labels = labels.loc[:, "label"]

            # split data into TRAIN and TEST
            _train_X = feats[: TRAIN_END]
            _test_X = feats[TEST_START:]

            _train_y = labels[: TRAIN_END]
            _test_y = labels[TEST_START:]
            
            assert len(_train_X) == len(_train_y)
            assert len(_test_X) == len(_test_y)

            # store features
            trains_X.append(_train_X)
            tests_X.append(_test_X)
            # store labels
            trains_y.append(_train_y)
            tests_y.append(_test_y)
            
    # combine features for each codes
    train_X = pd.concat(trains_X)
    test_X = pd.concat(tests_X)
    # combine label for each codes
    train_y = pd.concat(trains_y)
    test_y = pd.concat(tests_y)

    return train_X, train_y, test_X, test_y

In [None]:
# generate feature/label
train_X, train_y, test_X, test_y = get_features_and_label(
    stock_prices, codes, feature
)
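
The split relies on label-based slicing of a sorted `DatetimeIndex`, where both the end of the training slice and the start of the test slice are inclusive. The toy example below illustrates that behaviour and the gap between the two periods; the frame `toy` is invented for the illustration.

```python
import numpy as np
import pandas as pd

TRAIN_END = "2019-12-31"
TEST_START = "2020-01-06"

# Toy frame with one row per calendar day around the split.
idx = pd.date_range("2019-12-25", "2020-01-10", freq="D")
toy = pd.DataFrame({"x": np.arange(len(idx))}, index=idx)

train = toy.loc[:TRAIN_END]    # label-based slice, end date included
test = toy.loc[TEST_START:]    # start date included
gap = toy.loc[(toy.index > pd.Timestamp(TRAIN_END)) & (toy.index < pd.Timestamp(TEST_START))]

print(len(train), len(gap), len(test))
```

The rows in `gap` are used by neither set, which keeps labels that look two days ahead from overlapping the test period.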

# Building a simple model
Using a selected subset of the features and the labels, build a model with the following procedure.

In [None]:
lgbm_params = {
    'seed': 42,
    'n_jobs': -1,
}

feat_cols = [
    "momentum_rsi",
    "trend_macd",
    "trend_kst",
    "trend_ema_fast",
    "volatility_bbm",
    "volatility_atr",
]

In [None]:
# initialize model
pred_model = LGBMRegressor(**lgbm_params)
# train
pred_model.fit(train_X[feat_cols].values, train_y)
# prepare result data
result = test_X[["SecuritiesCode"]].copy()
# predict
result.loc[:, "predict"] = pred_model.predict(test_X[feat_cols].values)
# actual result
result.loc[:, "Target"] = test_y.values

def set_rank(df):
    """
    Args:
        df (pd.DataFrame): including predict column
    Returns:
        df (pd.DataFrame): df with Rank
    """
    # sort records to set Rank
    df = df.sort_values("predict", ascending=False)
    # set Rank starting from 0
    df.loc[:, "Rank"] = np.arange(len(df["predict"]))
    return df

result = result.sort_values(["Date", "predict"], ascending=[True, False])
# group_keys=False keeps a single Date index level on the result
result = result.groupby("Date", group_keys=False).apply(set_rank)
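
As an alternative to sorting inside `set_rank`, the per-day ranks can be sketched with `groupby(...).rank`, which leaves the row order untouched. This is not the method used above, just an equivalent illustration on invented predictions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-06"] * 3 + ["2021-12-07"] * 3),
    "SecuritiesCode": [1301, 1332, 1333] * 2,
    "predict": [0.01, 0.03, -0.02, 0.00, -0.01, 0.02],
})

# Highest prediction gets rank 0 within each day.
# method="first" breaks ties by row order, so ranks within a day never repeat.
toy["Rank"] = (
    toy.groupby("Date")["predict"].rank(ascending=False, method="first").astype(int) - 1
)
print(toy)
```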

In [None]:
result.tail()


# Evaluation
Pass the model's predictions to the evaluation function and plot the daily returns.

The evaluation function for this competition is shown below.

For more detail on the evaluation function, see the [metric definition](https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition).

In [None]:
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, toprank_weight_ratio: float = 2) -> float:
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): sharpe ratio
    """
    def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
        """
        Args:
            df (pd.DataFrame): predicted results
            portfolio_size (int): # of equities to buy/sell
            toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
        Returns:
            (float): spread return
        """
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio

In [None]:
# calc spread return sharpe
calc_spread_return_sharpe(result, portfolio_size=200)
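
To make the weighting concrete, here is a hand-checkable example of one day's spread return with four invented stocks and a portfolio size of 2. It repeats the arithmetic of `_calc_spread_return_per_day` on toy numbers; the extra daily values used for the Sharpe ratio at the end are also invented.

```python
import numpy as np
import pandas as pd

portfolio_size, toprank_weight_ratio = 2, 2
# Linearly decreasing weights from the best rank to the last one held: [2.0, 1.0]
weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)

day = pd.DataFrame({"Rank": [0, 1, 2, 3], "Target": [0.04, 0.02, -0.01, -0.03]})
top = day.sort_values("Rank")["Target"].to_numpy()[:portfolio_size]
bottom = day.sort_values("Rank", ascending=False)["Target"].to_numpy()[:portfolio_size]

purchase = (top * weights).sum() / weights.mean()   # (0.04*2 + 0.02*1) / 1.5
short = (bottom * weights).sum() / weights.mean()   # (-0.03*2 - 0.01*1) / 1.5
spread = purchase - short                           # 0.17 / 1.5

# The score is the mean of the daily spreads divided by their standard deviation.
daily = pd.Series([spread, 0.05, -0.02])
sharpe = daily.mean() / daily.std()
print(spread, sharpe)
```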

Next, we show the daily spread return of the model.



In [None]:
def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): spread return
    """
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

df_result = result.groupby('Date').apply(_calc_spread_return_per_day, 200, 2)

In [None]:
df_result.plot(figsize=(20, 8))


We also show the cumulative spread return of the model.

In [None]:
df_result.cumsum().plot(figsize=(20, 8))


The model in this notebook is now complete! Try different features and training methods through trial and error!

# Saving model
You need to save the model parameters so that the trained model can be used for your submission.

In [None]:
pred_model.booster_.save_model("simple-model.txt")


In [None]:
# # Make predictions and submission
# env = jpx_tokyo_market_prediction.make_env()   # initialize the environment
# iter_test = env.iter_test()    # an iterator which loops over the test files
# for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
#     # Combine history data with incoming new data
#     display(prices)
#     df_prices = pd.concat([stock_prices, prices], ignore_index=True)
#     training_cutoff = prices['Date'].values[0]
#     print("Training cutoff: ", training_cutoff)
    
#     # Get processed test data
#     X_test, _ = preprocessing(
#         df=stock_prices, 
#         date_col='Date', 
#         target_col='Target', 
#         group_col='SecuritiesCode', 
#         training_cutoff=training_cutoff, 
#         num_periods_input=num_periods_input, 
#         num_periods_output=num_periods_output,
#         fill_missing_train=False,
#         fill_missing_test=True,
#         training=False)
#     X_test = create_test_instances(X_test)
    
#     # Make predictions
#     sample_prediction['target_pred'] = model.predict(X_test)
#     sample_prediction = sample_prediction.sort_values(by="target_pred", ascending=False)
# #     sample_prediction['Rank'] = np.arange(2000)
#     sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
#     sample_prediction.drop(['target_pred'], axis=1, inplace=True)
#     display(sample_prediction)
#     env.predict(sample_prediction)  # register your predictions