In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Notes


The dataset is split into several folders containing different materials. Here is an overview.
* **data_specifications/** 
    * Definitions for individual columns.

* **jpx_tokyo_market_prediction/** 
    * Files that enable the API. Expect the API to deliver all rows in under five minutes and to reserve less than 0.5 GB of memory.

    * Copies of data files exist in multiple folders that cover different time windows and serve different purposes.

* **train_files/**
    * Data folder covering the main training period.

* **supplemental_files/** 
    * Data folder containing a dynamic window of supplemental training data. This will be updated with new data during the main phase of the competition in early May, early June, and roughly a week before the submissions are locked.

* **example_test_files/** 
    * Data folder covering the public test period. Intended to facilitate offline testing. Includes the same columns delivered by the API (ie no Target column). You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that. This folder also includes an example of the sample submission file that will be delivered by the API.

* **stock_list.csv**
    * Maps each stock to its symbol and includes the industry the stock is in.

# EDA

In [None]:
stock_list = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/stock_list.csv')

In [None]:
stock_fin_spec = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/data_specifications/stock_fin_spec.csv')
trade_spec = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/data_specifications/trades_spec.csv')
stock_price_spec = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/data_specifications/stock_price_spec.csv')
option_spec = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/data_specifications/options_spec.csv')
stock_list_spec = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/data_specifications/stock_list_spec.csv')

In [None]:
# parse_dates alone is enough here; infer_datetime_format is deprecated in pandas 2.x
train_stock_prices = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv', parse_dates=['Date'])
train_secondary_stock_prices = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/secondary_stock_prices.csv', parse_dates=['Date'])
#train_options = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/options.csv', parse_dates=['Date'])
train_trades = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/trades.csv', parse_dates=['Date'])
train_financials = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/financials.csv', parse_dates=['Date'])

In [None]:
# sup_options = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/options.csv')
# sup_financials = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/financials.csv')
# sup_secondary_stock_prices = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/secondary_stock_prices.csv')
# sup_trades = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/trades.csv')
# sup_stock_prices = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')

Let's begin with the sample submission file, just to see what the expected result looks like.

In [None]:
sample = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/example_test_files/sample_submission.csv')
sample

In [None]:
sample.nunique()

So, we are expected to deliver the "daily ranks" of the 2000 stocks for each business day from 2021-12-06 to 2022-02-28.

### What really is the "rank" here?

From the competition [page](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction/overview/evaluation), here is the explanation of the evaluation:

### From the original page.

1. The model will use the closing price ($C_{(k, t)}$) until that business day ($t$) and other data every business day as input data for a stock ($k$), and predict rate of change ($r_{(k, t)}$) of closing price of the top 200 stocks and bottom 200 stocks on the following business day ($C_{(k, t+1)}$) to next following business day ($C_{(k, t+2)}$)

    $$
    r_{(k, t)} = \frac{C_{(k, t+2)} - C_{(k, t+1)}}{C_{(k, t+1)}}
    $$
    
2. Within top 200 stock predicted ($up_i\;\;(i = 1, 2, \ldots, 200)$), multiply by their respective rate of change with linear weights of 2-1 for rank 1-200 and denote their sum as $S_{up}$.

    $$
    S_{up} = \frac{\sum^{200}_{i=1}(r_{({up_i}, t)} * linear function(2, 1)_i)}{Average(linear function(2, 1))}
    $$
    
3. Within bottom 200 stocks predicted  ($down_i\;\;(i = 1, 2, \ldots, 200)$), multiply by their respective rate of change with linear weights of 2-1 for bottom rank 1-200 and denote their sum as $S_{down}$.

    $$
    S_{down} = \frac{\sum^{200}_{i=1}(r_{({down_i}, t)} * linear function(2, 1)_i)}{Average(linear function(2, 1))}
    $$
    
4. The result of subtracting $S_{down}$ from $S_{up}$ is $R_{day}$ and is called "**daily spread return**".

    $$
    R_{day} = S_{up} - S_{down}
    $$
    
5. The daily spread return is calculated every business day during the public/private period and obtained as a time series for that period. The mean/standard deviation of the time series of daily spread returns is used as the score. Score calculation formula (x is the business day of public/private period)

    $$
    Score = \frac{Average(R_{day_1-day_x})}{STD(R_{day_1-day_x})}
    $$
    


### If the above are confusing, here is my attempt to deconstruct it:
--------------------------------------------------------------------------

   1. Predict the daily return of all 2000 stocks and rank them accordingly.
       * Highest predicted return (top gainer) gets rank 0, 1, 2, ...; lowest (top loser) gets rank ..., 1997, 1998, 1999.
   2. Pick the top 200 and bottom 200 of the daily list; these form the long and short sides.
   3. Weight each side's returns by rank with linearly decreasing weights from 2 to 1 (`np.linspace(2, 1, 200)`), so the most confident picks count the most.
   4. Assume every position is entered at the close of the next business day and exited at the close of the day after that.
   5. The weighted return of the 200 long positions minus that of the 200 short positions gives the 'Daily Spread Return' (DSR).
   6. Use the DSR from all 56 days to calculate the Sharpe ratio, which is the competition score. Here the Sharpe ratio is the mean of the daily spread returns divided by their standard deviation over the period.
--------------------------------------------------------------------------
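To make the steps above concrete, here is a minimal sketch of the score calculation, written directly from the formulas on the competition page. The function names `daily_spread_return` and `spread_return_sharpe`, the column names `Rank` and `Target`, and the tiny synthetic dataset are my own assumptions for illustration; this is not the official scoring code.

```python
import numpy as np
import pandas as pd

def daily_spread_return(day_df, portfolio_size=200, top_weight=2.0, bottom_weight=1.0):
    """Spread return for one day, given 'Rank' (0 = best) and 'Target' columns."""
    weights = np.linspace(top_weight, bottom_weight, portfolio_size)
    targets = day_df.sort_values('Rank')['Target'].to_numpy()
    # Long side: the best-ranked stocks, heaviest weight on rank 0.
    s_up = (targets[:portfolio_size] * weights).sum() / weights.mean()
    # Short side: the worst-ranked stocks, heaviest weight on the last rank.
    s_down = (targets[::-1][:portfolio_size] * weights).sum() / weights.mean()
    return s_up - s_down

def spread_return_sharpe(df, portfolio_size=200):
    """Mean of the daily spread returns divided by their standard deviation."""
    daily = pd.Series({date: daily_spread_return(group, portfolio_size)
                       for date, group in df.groupby('Date')})
    return daily.mean() / daily.std()

# Synthetic example (assumed data): 5 days, 20 stocks, a "perfect" ranking.
rng = np.random.default_rng(0)
frames = []
for d in pd.date_range('2022-01-03', periods=5, freq='B'):
    target = rng.normal(0, 0.02, size=20)
    rank = (-target).argsort().argsort()  # rank 0 goes to the highest target
    frames.append(pd.DataFrame({'Date': d, 'Rank': rank, 'Target': target}))
demo = pd.concat(frames, ignore_index=True)
print(spread_return_sharpe(demo, portfolio_size=4))
```

With the real data, `portfolio_size` would stay at 200 and `Rank` would come from the model's predictions.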

### Note for the scoring.
Since the score is a Sharpe ratio, i.e. the mean return divided by its SD, remember that another way to raise the score is to lower the SD of the portfolio's daily returns. Don't focus only on return.

Now, let's explore the given data, beginning with the main focus, "train_stock_prices", which includes the 2000 most liquid stocks in the market. These stocks are also the stock universe for this competition.

### Train_stock_prices
   


In [None]:
train_stock_prices.tail(3)

Here we have many columns containing a lot of information about each stock. I'd like to make a note about a few of them.
* AdjustmentFactor : In case of a stock split or reverse split, the price of a single share must be adjusted by this column.
* ExpectedDividend : Expected amount of dividends paid. This info becomes available 2 business days before the ex-dividend date.
* SupervisionFlag : Flags a stock that is under supervision or due to be delisted.
* Target : Change ratio of the adjusted closing price between t+2 and t+1, where t+0 is the trade date. Basically, the return from buying tomorrow and selling the day after.

Before I continue, let's replicate the target to make sure where it comes from. Let's pick one stock, say the one with SecuritiesCode '9997'.

In [None]:
df_9997 = train_stock_prices.loc[train_stock_prices['SecuritiesCode'] == 9997].copy()
# Note: this uses the raw Close, so it matches Target only where no split adjustment applies
df_9997['close_sft_-1'] = df_9997['Close'].shift(-1)  # tomorrow's close
df_9997['close_sft_-2'] = df_9997['Close'].shift(-2)  # close of the day after tomorrow
df_9997['dup_target'] = (df_9997['close_sft_-2'] - df_9997['close_sft_-1'])/df_9997['close_sft_-1']
df_9997.head()

In [None]:
df_9997[['Target','dup_target']].head()

This confirms that the target is, in my own words,

### the percentage change from tomorrow's closing price to the closing price of the day after.
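The same check can be extended to every stock at once. The sketch below is one way to do it, assuming a frame with `SecuritiesCode`, `Date` and `Close` columns; the helper name `replicate_target` and the toy data are hypothetical. It works on the raw `Close`, so rows around a split (where `AdjustmentFactor` is not 1) would not be expected to match the provided `Target`.

```python
import pandas as pd

def replicate_target(prices):
    """Recompute the t+1 -> t+2 close-to-close return for each stock separately."""
    prices = prices.sort_values(['SecuritiesCode', 'Date']).copy()
    by_code = prices.groupby('SecuritiesCode')['Close']
    close_t1 = by_code.shift(-1)  # tomorrow's close, within the same stock
    close_t2 = by_code.shift(-2)  # close of the day after tomorrow
    prices['dup_target'] = (close_t2 - close_t1) / close_t1
    return prices

# Toy data (assumed): two stocks, four business days each.
dates = pd.date_range('2022-01-03', periods=4, freq='B')
toy = pd.DataFrame({
    'SecuritiesCode': [1301] * 4 + [9997] * 4,
    'Date': list(dates) * 2,
    'Close': [100.0, 110.0, 121.0, 121.0, 50.0, 50.0, 55.0, 44.0],
})
print(replicate_target(toy))
```

Grouping by `SecuritiesCode` before shifting keeps one stock's closing prices from leaking into the last rows of the previous stock.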

In [None]:
df_9997['Target'].plot()

In [None]:
df_9997['Target'].plot(kind='hist')

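The target distribution looks roughly bell-shaped and centered near zero, although a histogram alone does not establish normality; daily returns often have heavier tails than a normal distribution.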
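A quick way to go beyond the histogram is to look at skewness and excess kurtosis, both of which are about 0 for a normal distribution. The sketch below uses synthetic samples as stand-ins, since I am not asserting anything about the real `Target` values here; the helper name `tail_summary` is my own.

```python
import numpy as np
import pandas as pd

def tail_summary(returns):
    """Skewness and excess kurtosis of a return series (both ~0 for a normal)."""
    returns = pd.Series(returns).dropna()
    return {'skew': returns.skew(), 'excess_kurtosis': returns.kurt()}

# Synthetic stand-ins (assumed data): a normal sample and a heavy-tailed one.
rng = np.random.default_rng(42)
normal_sample = rng.normal(0, 0.02, size=10_000)
heavy_sample = 0.02 * rng.standard_t(df=3, size=10_000)

print(tail_summary(normal_sample))
print(tail_summary(heavy_sample))
```

On the notebook's data this would be called as `tail_summary(df_9997['Target'])`; an excess kurtosis well above 0 would suggest fatter tails than the histogram shows at first glance.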

Let's explore other training files.

### Trades

In [None]:
train_trades.dropna().tail()

In [None]:
train_trades.info()

The trades file contains a lot of information. Judging by the column headers, this dataset summarizes the market's trading activity for each week, broken down by type of market participant, such as proprietary trading, individuals, foreigners, institutions, etc.

This makes me curious about who the biggest player in the Japanese market is. Perhaps this money can drive or lead the market. Let's see.

In [None]:
standard_trade_total = train_trades.loc[train_trades['Section'] == 'Standard Market (Second Section)', ['EndDate','Section','TotalTotal','ProprietaryTotal','BrokerageTotal',
              'IndividualsTotal','ForeignersTotal','SecuritiesCosTotal',
              'InvestmentTrustsTotal','BusinessCosTotal','OtherInstitutionsTotal',
              'InsuranceCosTotal','CityBKsRegionalBKsEtcTotal','TrustBanksTotal',
              'OtherFinancialInstitutionsTotal'
             ]].dropna()
standard_trade_total = standard_trade_total.drop('Section',axis=1)
standard_trade_total.head()

In [None]:
total_sum = standard_trade_total.loc[:,'ProprietaryTotal':'OtherFinancialInstitutionsTotal'].sum().sum()
(standard_trade_total.loc[:,'ProprietaryTotal':'OtherFinancialInstitutionsTotal'].sum()/total_sum).plot(kind='bar',title='total trading volume by group of investor')

In the Standard Market, the groups with the highest trading volume are Brokerage, Individuals, and Foreigners, in that order. I think all other groups can be neglected since their volumes are so small.

Anyhow, let's look at the balance of the major players.

I think the financial data might not help much in predicting the return of such a short holding period, since the ranking is derived from a single day of holding. However, the release date of a financial report might really affect the return, since the surprise factor from the earnings, or something else in the report, can move the price instantly. Let's look at the financial reports.

In [None]:
def announcement_effect(df_price, df_announcement, stock_id):
    """Plot a stock's close, mark its financial disclosure dates, and overlay a 7-day mean return."""
    df_stock = df_price.loc[df_price['SecuritiesCode'] == stock_id].copy()
    df_plot = df_stock[['Date', 'Close']].set_index('Date')

    # Flag trading days with a financial disclosure.
    # isin() avoids a KeyError when a disclosure date is not a trading day.
    announce_dates = df_announcement.loc[df_announcement['SecuritiesCode'] == stock_id, 'Date']
    df_plot['announce'] = df_plot.index.isin(announce_dates).astype(int)

    # Centered 7-day moving average of the daily percentage change.
    daily_return = (df_plot['Close'] - df_plot['Close'].shift(1)) / df_plot['Close'].shift(1)
    df_plot['pct_change_ma7'] = daily_return.rolling(7, center=True).mean()

    fig, ax1 = plt.subplots(figsize=(18, 7))
    ax1.set_title('StockID {} Price and Fin Disclosure'.format(stock_id))
    ax1.set_ylabel('Close')

    # Red dots mark the close on disclosure days.
    announced = df_plot[df_plot['announce'] == 1]
    ax1.plot(announced.index, announced['Close'], 'o', color='red')
    ax1.plot(df_plot['Close'])

    # Second y-axis for the smoothed daily return.
    ax2 = ax1.twinx()
    ax2.set_ylabel('pct_change_ma7', color='g')
    ax2.plot(df_plot['pct_change_ma7'], color='g', alpha=0.3)
    ax2.tick_params(axis='y', labelcolor='g')

    plt.show()

Let's test the hypothesis that volatility is higher around financial disclosure dates.

In [None]:
announcement_effect(train_stock_prices,train_financials,8306) #MITSUBISHI UFJ FINANCIAL GROUP	

In [None]:
announcement_effect(train_stock_prices,train_financials,7267) #Honda Motors

In [None]:
announcement_effect(train_stock_prices,train_financials,6752) #PANASONIC HOLDINGS CORPORATION

In [None]:
announcement_effect(train_stock_prices,train_financials,8035) #TOKYO ELECTRON LTD.

Considering the peaks in pct_change and the closing price, it's noticeable that the volatility is higher around the financial announcement dates.
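The visual impression could be backed up with a number. The sketch below compares the mean absolute daily return on days within a few trading days of a disclosure against all other days. The helper name `announcement_volatility`, the default window of 2 trading days, and the toy series are assumptions for illustration, not results from the competition data.

```python
import pandas as pd

def announcement_volatility(close, announce_dates, window=2):
    """Mean |daily return| near disclosure dates vs. on all other trading days.

    close: Series of closing prices indexed by trading date.
    announce_dates: iterable of disclosure dates.
    window: number of trading days on each side counted as "near".
    """
    abs_ret = (close / close.shift(1) - 1).abs()
    is_announce = pd.Series(close.index.isin(announce_dates), index=close.index, dtype=float)
    # A day is "near" if any day in the centered window is a disclosure day.
    near = is_announce.rolling(2 * window + 1, center=True, min_periods=1).max().astype(bool)
    return pd.Series({'near_announcement': abs_ret[near].mean(),
                      'other_days': abs_ret[~near].mean()})

# Toy series (assumed): flat price with one jump on the disclosure day.
idx = pd.date_range('2022-01-03', periods=10, freq='B')
toy_close = pd.Series([100, 100, 100, 100, 110, 100, 100, 100, 100, 100],
                      index=idx, dtype=float)
print(announcement_volatility(toy_close, [idx[4]], window=1))
```

For a real stock, `close` would be the `Close` column indexed by `Date` for one `SecuritiesCode`, and `announce_dates` the matching `Date` values from `train_financials`.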

### Don't trade the market if you don't know what you are doing?

One myth I want to check here is the saying that the 'retail investor' is always a loser in this game.

The overall movements of the stocks plotted above seem to be somewhat correlated. I will use the stocks with the biggest weights in the NIKKEI225 index to calculate the correlation with the balance of individuals' trading. Those stocks are
* TOKYO ELECTRON LTD. 8.01% ->  Code:8035
* FAST RETAILING CO., 7.98% -> Code:9983
* SOFTBANK GROUP CORP 4.23% -> Code:9984

ref : https://indexes.nikkei.co.jp/en/nkave/factsheet?idx=nk225

In [None]:
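# .copy() avoids pandas' SettingWithCopyWarning when adding/modifying columns on a filtered frame
nak_test = train_trades.loc[train_trades['Section']=='Standard Market (Second Section)'].copy()
nak_test['EndDate'] = pd.to_datetime(nak_test['EndDate'])
nak_test = nak_test.set_index('EndDate')

# Keep only the weekly records that end on a Friday (weekday 4)
nak_test = nak_test.loc[nak_test.index.weekday==4]
nak_test.head(2)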

In [None]:
def retail_corr(price_df, trade_df, stock_code):
    """Correlate a stock's Friday-to-Friday return with weekly investor balances.

    trade_df must be indexed by EndDate (see nak_test above).
    """
    trade_df = trade_df[trade_df['Section'] == 'Standard Market (Second Section)']

    # Keep Friday closes only, so each row lines up with a weekly trades record.
    df_price = price_df.loc[price_df['SecuritiesCode'] == stock_code].set_index('Date')[['Close']]
    df_price = df_price.loc[df_price.index.weekday == 4].copy()
    # Return relative to the previous Friday present in the data.
    df_price['return'] = (df_price['Close'] - df_price['Close'].shift(1)) / df_price['Close'].shift(1)

    df_price = df_price.merge(trade_df, how='left', left_index=True, right_index=True)

    cols = ['return', 'BrokerageBalance', 'IndividualsBalance', 'ForeignersBalance']
    corr = df_price[cols].dropna().corr()
    return corr.style.background_gradient('inferno')

In [None]:
retail_corr(train_stock_prices,nak_test,8035) #TOKYO ELECTRON LTD. 8.01% -> Code:8035

In [None]:
retail_corr(train_stock_prices,nak_test,9983) #FAST RETAILING CO., 7.98% -> Code:9983


In [None]:
retail_corr(train_stock_prices,nak_test,9984) #SOFTBANK GROUP CORP 4.23% -> Code:9984

The weekly IndividualsBalance, which represents the net amount traded by individual investors, is not really correlated with the weekly return of these major stocks. Thus, I can't say that individuals are always losers in this market.

### Models

Let's explore some models for the task of creating a sustainable portfolio by constructing a steady stream of 'Daily Spread Return'.

#### Models to explore
* Linear Regression
* LGB / XGB
* AR
* MA
* ARMA
* ARIMA

### Features
Just to kickstart, let's use only coarse data such as the financial release date and average trading volume, along with other basic price and technical indicators, as features.

For simplicity, I think I will use these features for now.
* Date Features
    * Quarter
    * day of week
    * month of year
    * week of year
    * whether the day is within +/- 2 days of a financial announcement date
* MA in trading volume
* Whether a dividend is expected in the next few days or not.
* MA percent change in return
* MA in trading value from major players
    * Brokerage 
    * Foreigners
* Target Lag
    * maybe even a rank lag 
* [MACD](https://www.investopedia.com/terms/m/macd.asp#:~:text=Moving%20average%20convergence%20divergence%20(MACD)%20is%20a%20trend%2Dfollowing,from%20the%2012%2Dperiod%20EMA.)
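As a starting point, several of the features listed above could be built roughly as follows. This is only a sketch under my own assumptions: the helper name `add_basic_features`, the 5-day volume window, and the conventional 12/26/9 MACD spans are illustrative choices, and the toy frame stands in for `train_stock_prices`. Note that `Target` on day t depends on the closes of t+1 and t+2, so the most recent target known on day t is the one from t-2, which (absent splits) equals the previous day's close-to-close return computed here.

```python
import numpy as np
import pandas as pd

def add_basic_features(df, volume_window=5):
    """Add date features, a volume MA, a lagged return and MACD per stock."""
    df = df.sort_values(['SecuritiesCode', 'Date']).copy()
    # Date features
    df['quarter'] = df['Date'].dt.quarter
    df['day_of_week'] = df['Date'].dt.dayofweek
    df['month'] = df['Date'].dt.month
    df['week_of_year'] = df['Date'].dt.isocalendar().week.astype(int)

    grouped = df.groupby('SecuritiesCode')
    # Moving average of trading volume
    df['volume_ma'] = grouped['Volume'].transform(lambda s: s.rolling(volume_window).mean())
    # Latest return known on day t; equals Target lagged by 2 days (absent splits)
    df['prev_return'] = grouped['Close'].transform(lambda s: s / s.shift(1) - 1)
    # MACD: fast EMA minus slow EMA, plus an EMA of that difference as signal line
    ema_fast = grouped['Close'].transform(lambda s: s.ewm(span=12, adjust=False).mean())
    ema_slow = grouped['Close'].transform(lambda s: s.ewm(span=26, adjust=False).mean())
    df['macd'] = ema_fast - ema_slow
    df['macd_signal'] = (df.groupby('SecuritiesCode')['macd']
                           .transform(lambda s: s.ewm(span=9, adjust=False).mean()))
    return df

# Toy frame (assumed data): one stock, 30 business days, constant close.
toy_dates = pd.date_range('2022-01-03', periods=30, freq='B')
toy_prices = pd.DataFrame({
    'SecuritiesCode': 9997,
    'Date': toy_dates,
    'Close': np.full(30, 100.0),
    'Volume': np.arange(1, 31, dtype=float),
})
features = add_basic_features(toy_prices)
print(features.head())
```

All rolling and exponentially weighted statistics are computed per `SecuritiesCode`, so one stock's history never bleeds into another's.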