Success in any financial market requires one to identify solid investments. When a stock or derivative is undervalued, it makes sense to buy. If it's overvalued, perhaps it's time to sell. While these finance decisions were historically made manually by professionals, technology has ushered in new opportunities for retail investors. Data scientists, specifically, may be interested to explore quantitative trading, where decisions are executed programmatically based on predictions from trained models.

There are plenty of existing quantitative trading efforts used to analyze financial markets and formulate investment strategies. To create and execute such a strategy requires both historical and real-time data, which is difficult to obtain especially for retail investors. This competition will provide financial data for the Japanese market, allowing retail investors to analyze the market to the fullest extent.

Japan Exchange Group, Inc. (JPX) is a holding company operating one of the largest stock exchanges in the world, Tokyo Stock Exchange (TSE), and derivatives exchanges Osaka Exchange (OSE) and Tokyo Commodity Exchange (TOCOM). JPX is hosting this competition and is supported by AI technology company AlpacaJapan Co.,Ltd.

This competition will compare your models against real future returns after the training phase is complete. The competition will involve building portfolios from the stocks eligible for predictions (around 2,000 stocks). Specifically, each participant ranks the stocks from highest to lowest expected returns and is evaluated on the difference in returns between the top and bottom 200 stocks. You'll have access to financial data from the Japanese market, such as stock information and historical stock prices to train and test your model.



# First look into data

Here I try various ideas to find clues for future feature engineering.

# Imports

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style()
import jpx_tokyo_market_prediction
from sklearn.tree import DecisionTreeRegressor
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import re

In [None]:
# useful tool for visualizing NaNs
!pip install missingno
import missingno as msno

In [None]:
dataset_path = Path('/kaggle/input/jpx-tokyo-stock-exchange-prediction')
train_path = dataset_path / 'train_files'
supplemental_path = dataset_path / 'supplemental_files'

assert train_path.is_dir()
assert supplemental_path.is_dir()

In [None]:
list(train_path.iterdir())

# Stocks

In [None]:
prices_df = pd.read_csv(train_path / 'stock_prices.csv')
prices_df.columns

### Nulls

In [None]:
msno.matrix(prices_df)

In [None]:
prices_df.isna().sum(axis=0)

Interestingly, a few rows have a missing target.

In [None]:
prices_df[prices_df['Target'].isna()].SecuritiesCode.value_counts()

In [None]:
prices_df[prices_df['Target'].isna()].isna().mean(axis=0)

These rows with a missing target don't offer much information. They can be dropped.
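
Dropping those rows could look roughly like this. It is only a sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); on the real data the same call would be applied to `prices_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for prices_df
toy = pd.DataFrame({
    'SecuritiesCode': [1301, 1301, 1332],
    'Close': [2900.0, None, 570.0],
    'Target': [0.01, None, -0.002],
})

# Keep only rows where the target is known
cleaned = toy.dropna(subset=['Target']).reset_index(drop=True)
print(cleaned)
```

Using `subset=['Target']` keeps rows that have NaNs in other columns only, which may or may not be what we want later.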

### ID

Check for gaps in the numbering of security IDs

In [None]:
plt.figure(figsize=(15, 15))
sns.scatterplot(x=prices_df['Date'], y=prices_df['SecuritiesCode'], s=0.5)

In [None]:
securities = prices_df['SecuritiesCode'].unique()
securities.sort()

plt.figure(figsize=(15, 15))
sns.scatterplot(x=np.arange(securities.size), y=securities)

### Correlation

In [None]:
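# keep only numeric/bool columns: recent pandas versions raise on string columns such as RowId and Date
numeric_df = prices_df.drop(['SecuritiesCode'], axis=1).select_dtypes(include=['number', 'bool'])
corr = numeric_df.corr(method='spearman')
px.imshow(corr, text_auto=True)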

In [None]:
prices_df[~prices_df['ExpectedDividend'].isna()]['AdjustmentFactor'].value_counts()

### Recalculating target

We will use the formula

$$\mathrm{Target}_t = \frac{\frac{\mathrm{Close}_{t+2}}{\mathrm{AdjustmentFactor}_{t+1}} - \mathrm{Close}_{t+1}}{\mathrm{Close}_{t+1}}$$

In [None]:
t2 = prices_df.groupby(['SecuritiesCode'])['Close'].shift(-2)
t1 = prices_df.groupby(['SecuritiesCode'])['Close'].shift(-1)
next_factor = prices_df.groupby(['SecuritiesCode'])['AdjustmentFactor'].shift(-1)

target_calc = (t2 / next_factor - t1) / t1
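
To see what the adjustment factor does in this formula, here is a small worked example on a hypothetical single security (all numbers are made up). The price halves after the third row, and the factor of 0.5 recorded on that row cancels the jump.

```python
import pandas as pd

# Hypothetical toy data: a 1:2 split takes effect after the third row
toy = pd.DataFrame({
    'SecuritiesCode': [1] * 5,
    'Close': [100.0, 110.0, 121.0, 60.5, 60.5],
    'AdjustmentFactor': [1.0, 1.0, 0.5, 1.0, 1.0],
})

by_code = toy.groupby('SecuritiesCode')
close_t2 = by_code['Close'].shift(-2)
close_t1 = by_code['Close'].shift(-1)
factor_t1 = by_code['AdjustmentFactor'].shift(-1)

# Same formula as above: (Close[t+2] / AdjustmentFactor[t+1] - Close[t+1]) / Close[t+1]
toy['TargetCalc'] = (close_t2 / factor_t1 - close_t1) / close_t1
print(toy)
```

Row 0 gets roughly 0.1 (110 to 121), row 1 gets 0.0 instead of -0.5 because 60.5 is divided by 0.5 first, and the last two rows are NaN since there is no `t+2` price.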

Check for errors

In [None]:
target_diff = prices_df['Target'] - target_calc
plt.hist(target_diff, log=True)

Seems OK. The difference is probably due to rounded closing prices and floating-point errors.

In [None]:
px.histogram(prices_df['Target'], log_y=True)

In [None]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True)
fig.update_layout(title_text="Target dynamics")

date_df = pd.DataFrame({
    'Target-mean': prices_df.groupby('Date')['Target'].mean(),
    'Target-std': prices_df.groupby('Date')['Target'].std(),
})

for i, col in enumerate(date_df.columns):
    fig.add_trace(go.Scatter(x=date_df[col].index, y=date_df[col].values, name=col), row=i+1, col=1)
fig.show()

- Std values are large in the spring of 2020.
- There are seasonal std spikes: May, August, November and February usually have a large std. This can be explained by seasonal financial reports (see the Financials section below).
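
The seasonality claim could be checked numerically by averaging the daily cross-sectional std per calendar month. Below is a sketch on synthetic data (the `toy` frame, the spike months and the noise scales are all assumptions chosen to mimic the pattern); on the real data `Date` would first need `pd.to_datetime`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
spike_months = [2, 5, 8, 11]  # assumed reporting months, for the synthetic data only

# Synthetic targets: 50 "stocks" per business day, noisier in the spike months
frames = []
for d in pd.bdate_range('2020-01-01', '2020-12-31'):
    scale = 0.03 if d.month in spike_months else 0.01
    frames.append(pd.DataFrame({'Date': d, 'Target': rng.normal(0, scale, 50)}))
toy = pd.concat(frames, ignore_index=True)

# Cross-sectional std per day, then the average of that std per calendar month
daily_std = toy.groupby('Date')['Target'].std()
monthly_std = daily_std.groupby(daily_std.index.month).mean()

top_months = sorted(monthly_std.nlargest(4).index)
print(monthly_std.round(4))
print(top_months)
```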

In [None]:
asset_stats = prices_df.groupby('SecuritiesCode')['Target'].describe()

In [None]:
asset_stats['count'].max()

In [None]:
# share of assets whose target count equals the maximum observed count
(asset_stats['count'] == asset_stats['count'].max()).mean()

93% of assets are present in the dataset for the whole duration.

In [None]:
x_stat = 'mean'
y_stat = 'std'
ax = sns.jointplot(x=asset_stats[x_stat], y=asset_stats[y_stat], kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'red'}})
ax.ax_joint.set_xlabel(x_stat)
ax.ax_joint.set_ylabel(y_stat)

In [None]:
x_stat = 'min'
y_stat = 'max'
ax = sns.jointplot(x=asset_stats[x_stat], y=asset_stats[y_stat], kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'red'}})
ax.ax_joint.set_xlabel(x_stat)
ax.ax_joint.set_ylabel(y_stat)

Do successful assets have a higher std?

In [None]:
prices_df

In [None]:
sns.boxplot(data=prices_df['Target'])

In [None]:
x_stat = '50%'
y_stat = 'mean'
ax = sns.jointplot(x=asset_stats[x_stat], y=asset_stats[y_stat], kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'red'}})
ax.ax_joint.set_xlabel(x_stat)
ax.ax_joint.set_ylabel(y_stat)

In [None]:
(asset_stats['50%'] == 0).mean()

61% of assets have a median of exactly zero but a non-zero mean. Strange.

In [None]:
weird_assets = asset_stats[asset_stats['50%'] == 0].index

In [None]:
prices_df[prices_df['SecuritiesCode'].isin(weird_assets)]

In [None]:
df = prices_df[prices_df['SecuritiesCode'] == 1301]

In [None]:
df.Target.median()
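
One way to probe the zero-median observation is to count how often the target is exactly zero for each asset: if unchanged prices sit in the middle of the sorted targets, the median is exactly zero while the mean is still driven by the tails. Whether this is what happens in the real data is an assumption to verify; the sketch below uses a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical targets: asset 1 has many unchanged days, asset 2 has none
toy = pd.DataFrame({
    'SecuritiesCode': [1] * 7 + [2] * 5,
    'Target': [-0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04,
               -0.02, -0.01, 0.01, 0.02, 0.03],
})

stats = toy.groupby('SecuritiesCode')['Target'].agg(
    median='median',
    mean='mean',
    zero_share=lambda s: (s == 0).mean(),  # fraction of exactly-zero targets
)
print(stats)
```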

### Volume and Close dynamic

Idea: look at the aggregated market state

In [None]:
fig = make_subplots(rows=4, cols=1, shared_xaxes=True)
fig.update_layout(title_text="Volume vs. Close")

date_df = pd.DataFrame({
    'Volume-median': prices_df.groupby('Date')['Volume'].median(),
    'Volume-mean': prices_df.groupby('Date')['Volume'].mean(),
    'Close-median': prices_df.groupby('Date')['Close'].median(),
    'Close-mean': prices_df.groupby('Date')['Close'].mean()
})

for i, col in enumerate(date_df.columns):
    fig.add_trace(go.Scatter(x=date_df[col].index, y=date_df[col].values, name=col), row=i+1, col=1)
fig.show()

- The mean and the median differ by orders of magnitude. This means a few very large samples bias the mean, so the median is a much better way to assess the "average" value.
- There are some spikes in trading volume.
- There is a drop in closing prices in the spring of 2020.
- There is an unusual surge of the Close median in 2017.
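
A tiny illustration of the mean vs. median point from the list above, with made-up volumes: one huge value moves the mean by two orders of magnitude and leaves the median untouched.

```python
import pandas as pd

# Hypothetical daily volumes: nine ordinary stocks and one very heavily traded one
volumes = pd.Series([1000] * 9 + [1_000_000])

mean_volume = volumes.mean()
median_volume = volumes.median()
print(mean_volume, median_volume)
```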

Let's investigate the 2017 surge:

In [None]:
prices_df[prices_df['Date'] == '2017-09-26']['AdjustmentFactor'].value_counts()

It seems to be caused by adjustments of over a hundred securities.

In the dataset processing step we should take all adjustments into account to get unbiased data.
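
A possible way to do this is to build an adjusted close from a reverse cumulative product of `AdjustmentFactor` per security. This is a sketch, not necessarily the final processing: it assumes that the factor on a given date applies to that date's price and all earlier ones, which is the convention implied by the target formula used earlier. The toy frame and the `AdjustedClose` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the price halves after the second row (factor 0.5 on that row)
toy = pd.DataFrame({
    'SecuritiesCode': [1] * 4,
    'Date': ['2017-09-25', '2017-09-26', '2017-09-27', '2017-09-28'],
    'Close': [1000.0, 1000.0, 500.0, 500.0],
    'AdjustmentFactor': [1.0, 0.5, 1.0, 1.0],
})
toy = toy.sort_values(['SecuritiesCode', 'Date'])

# Cumulative product taken from the latest date backwards, per security;
# the result keeps the original index, so it aligns with toy on assignment
cum_factor = toy.iloc[::-1].groupby('SecuritiesCode')['AdjustmentFactor'].cumprod()
toy['AdjustedClose'] = toy['Close'] * cum_factor
print(toy)
```

After the adjustment the series is flat at 500, so the artificial drop from the split is gone.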

In [None]:
adjustments_df = prices_df[prices_df['AdjustmentFactor'] != 1.0]
adjustments_per_date = adjustments_df.groupby('Date')['AdjustmentFactor'].describe()
px.line(adjustments_per_date['mean'])

### Volume and Close distributions

In [None]:
px.histogram(prices_df.groupby('SecuritiesCode')['Volume'].mean(), log_y=True)

In [None]:
px.histogram(prices_df.groupby('SecuritiesCode')['Close'].mean(), log_y=True)

In [None]:
secondary_prices_df = pd.read_csv(train_path / 'secondary_stock_prices.csv')
secondary_prices_df.columns

In [None]:
prices_df[prices_df.SupervisionFlag].SecuritiesCode.value_counts()

# Financials

In [None]:
financials_df = pd.read_csv(train_path / 'financials.csv')
financials_df

In [None]:
financials_df.info()

Most of the columns have the `object` type even though they are numeric. This means some values in these columns prevent pandas from converting them to `float`.

Let's check what these values are:

In [None]:
col = 'ResultDividendPerShare3rdQuarter'
# raw string avoids invalid-escape warnings; values that don't look like a number are kept
financials_df[col][financials_df[col]\
                   .fillna('0.0')\
                   .apply(lambda x: re.match(r'^-?[0-9]+(\.[0-9]+)?$', x) is None)]\
                   .value_counts()

In [None]:
col = 'NetSales'
# same check for another column
financials_df[col][financials_df[col]\
                   .fillna('0.0')\
                   .apply(lambda x: re.match(r'^-?[0-9]+(\.[0-9]+)?$', x) is None)]\
                   .value_counts()

It seems these columns contain "-" values. To convert them to float we have to do it this way:

In [None]:
pd.to_numeric(financials_df['ResultDividendPerShare2ndQuarter'], errors='coerce')

In [None]:
# columns that should be numeric (note the comma after 'ForecastOperatingProfit')
numeric_cols = ['SecuritiesCode', 'NetSales', 'OperatingProfit', 'OrdinaryProfit', 'Profit', 'EarningsPerShare',
               'TotalAssets', 'Equity', 'EquityToAssetRatio', 'BookValuePerShare', 'ResultDividendPerShare1stQuarter',
               'ResultDividendPerShare2ndQuarter', 'ResultDividendPerShare3rdQuarter', 'ResultDividendPerShareFiscalYearEnd', 'ResultDividendPerShareAnnual',
               'ForecastDividendPerShare1stQuarter', 'ForecastDividendPerShare2ndQuarter', 'ForecastDividendPerShare3rdQuarter',
               'ForecastDividendPerShareFiscalYearEnd', 'ForecastDividendPerShareAnnual', 'ForecastNetSales', 'ForecastOperatingProfit',
               'ForecastOrdinaryProfit', 'ForecastProfit', 'ForecastEarningsPerShare',
               'NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock', 'NumberOfTreasuryStockAtTheEndOfFiscalYear',
               'AverageNumberOfShares']
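
Applying the same `pd.to_numeric(..., errors='coerce')` conversion to a whole list of columns could look like this. It is a sketch on a hypothetical toy frame with a two-column subset (`cols`); on the real data the list would be `numeric_cols` and the frame `financials_df`.

```python
import pandas as pd

# Hypothetical toy frame with "-" placeholders, standing in for financials_df
toy = pd.DataFrame({
    'NetSales': ['1000', '-', '2500.5'],
    'Profit': ['-', '-30', None],
    'TypeOfDocument': ['A', 'B', 'C'],
})
cols = ['NetSales', 'Profit']  # hypothetical subset of numeric_cols

converted = toy.copy()
# Anything that cannot be parsed (such as "-") becomes NaN
converted[cols] = converted[cols].apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
```

Note that after coercion a "-" is indistinguishable from a genuinely missing value, so if that difference matters it has to be recorded before converting.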

In [None]:
financials_df.TypeOfCurrentPeriod.value_counts()

There are rare values "4Q" and "5Q" that we have to deal with.

In [None]:
financials_df.TypeOfDocument.value_counts()

In [None]:
msno.matrix(financials_df)

In [None]:
# set the title on this figure directly instead of updating the earlier `fig`
px.line(financials_df.groupby('Date')['DisclosureNumber'].count(),
        title="Number of disclosures per date")

We can clearly see seasonality here.

These spikes also happen at the same time as the spikes in the target's std. This suggests that the data in this .csv has a large influence on the target.
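
The visual match between disclosure spikes and std spikes could be backed by a rank correlation between the two daily series. The sketch below uses hypothetical toy frames (`financials_toy`, `prices_toy`) and same-day alignment; on the real data a lag between disclosure and price reaction may matter, and dates without any disclosure are simply dropped here.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2021-05-10', periods=6).strftime('%Y-%m-%d')

# Hypothetical disclosures: the fourth day is a heavy reporting day
n_disclosures = [1, 1, 1, 6, 1, 1]
financials_toy = pd.DataFrame({
    'Date': np.repeat(dates, n_disclosures),
    'DisclosureNumber': np.arange(sum(n_disclosures)),
})

# Hypothetical targets: a wider spread on the reporting day
targets = {d: [-0.01, 0.0, 0.0, 0.01] for d in dates}
targets[dates[3]] = [-0.05, -0.02, 0.02, 0.05]
prices_toy = pd.DataFrame(
    [{'Date': d, 'Target': t} for d, ts in targets.items() for t in ts]
)

# Align both daily series on Date and compare them
per_date = pd.concat({
    'disclosures': financials_toy.groupby('Date')['DisclosureNumber'].count(),
    'target_std': prices_toy.groupby('Date')['Target'].std(),
}, axis=1).dropna()
corr = per_date['disclosures'].corr(per_date['target_std'], method='spearman')
print(per_date)
print(corr)
```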

# Options

In [None]:
options_df = pd.read_csv(train_path / 'options.csv')
options_df

In [None]:
px.line(options_df.groupby('Date')[['BaseVolatility', 'ImpliedVolatility']].median())

In [None]:
px.line(options_df.groupby('Date')[['InterestRate']].median())

In [None]:
px.line(options_df.groupby('Date')[['TradingVolume']].mean())