# Stocks that fell under supervision

These stocks were at one point or another marked as supervised[0] in expectation of subsequent delisting.
Reasons for delisting vary, but include splitting of the stock or a takeover bid.

As this information is public and indicates that trading of the instrument will soon end, we can consider these stocks to be outliers deserving special attention, especially during the time they were marked as supervised, but also in the preceding days.
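
To make "preceding days" concrete, here is a minimal sketch of how the rows shortly before a supervised day could be flagged. The toy frame, the securities code `9999`, the `PreSupervision` column name and the window length `LOOKBACK` are all assumptions made for illustration; none of them come from the dataset.

```python
import pandas as pd

LOOKBACK = 2  # hypothetical number of trading days to flag before supervision

# Toy data (hypothetical): one stock, six trading days, supervised on the last two
toy = pd.DataFrame({
    'SecuritiesCode': [9999] * 6,
    'Date': pd.date_range('2021-01-04', periods=6, freq='B'),
    'SupervisionFlag': [False, False, False, False, True, True],
})
toy = toy.sort_values(['SecuritiesCode', 'Date']).reset_index(drop=True)

# Look ahead: is any of the next LOOKBACK rows of the same stock supervised?
flag = toy['SupervisionFlag'].astype(int)
upcoming = sum(
    flag.groupby(toy['SecuritiesCode']).shift(-k, fill_value=0)
    for k in range(1, LOOKBACK + 1)
) > 0

# Flag only unsupervised days that directly precede supervision
toy['PreSupervision'] = upcoming & ~toy['SupervisionFlag']
print(toy)
```

Shifting within each `SecuritiesCode` group keeps one stock's flags from leaking into the rows of another.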


In [None]:
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

In [None]:
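# Parse 'Date' as datetime so that the date locators used in the plots below work on a real time axis
stocks = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv', parse_dates=['Date'])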
stock_info = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/stock_list.csv')
stocks = stocks.merge(stock_info[['SecuritiesCode', 'Name']], on='SecuritiesCode')

In [None]:
supervised_ids = stocks[stocks['SupervisionFlag']==True]['SecuritiesCode'].unique()
supervised_st = stocks[stocks['SecuritiesCode'].isin(supervised_ids)]

In [None]:
stock_info['SupervisionFlag'] = stock_info['SecuritiesCode'].isin(supervised_ids)
stock_info[stock_info['SupervisionFlag'] == True]

Only a handful of stocks in the dataset were at one point marked as supervised. The companies were active in a wide range of areas, from real estate to raw materials.

In [None]:
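# Number and share of supervised days per stock (string names instead of NumPy callables, which recent pandas deprecates here)
supervised_st.groupby('Name')[['SupervisionFlag']].agg(['sum', 'mean']).sort_values(by=[('SupervisionFlag','sum')], ascending=False)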

## Supervised stocks time series

Most of these stocks were supervised for only a handful of days, with the exception of 'C.I.MEDICAL CO.,LTD.', which was in this state for almost three years and for most of its existence.

A web search doesn't reveal much about it. There are indications it might still be operational: the company website[1] is available and the WHOIS record confirms the domain ownership.

By plotting the history of supervised stocks, specifically the closing price and volume for the given day, and marking the days during which the stocks in question were supervised, we can perform a preliminary visual analysis of their trading behavior.
We can also easily determine whether a stock was supervised only before delisting, or whether it recovered its original status.


In [None]:
for st in supervised_ids:
    fig, axes = plt.subplots(2, figsize=(20, 10))
    stock = supervised_st[supervised_st['SecuritiesCode'] == st]
    stock_name = stock['Name'].values[0]
    under_sp = stock[stock['SupervisionFlag'] == True]
    axes[0].set_title(stock_name + " - Prices")
    axes[0].plot(stock['Date'].values, stock['Close'])
    axes[0].plot(under_sp['Date'].values, under_sp['Close'], color='red', marker='x')
    axes[1].plot(stock['Date'].values, stock['Volume'])
    axes[1].plot(under_sp['Date'].values, under_sp['Volume'], color='red', marker='x')
    axes[1].set_title(stock_name + " - Volume")

    for ax in axes:
        ax.xaxis.set_major_locator(mdates.YearLocator())
        ax.xaxis.set_minor_locator(mdates.MonthLocator())
        ax.grid(True)

plt.show()

For some of the stocks, supervision ended after a period of time and their original status was restored.
Presumably, the reason for their supervision was no longer relevant. The duration of supervision seems to range from days to weeks.
Other stocks remain supervised until the end of the time series.

The stock of 'C.I.MEDICAL CO.,LTD.' remains an outlier of the group.
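
The distinction between stocks that recovered and stocks that stayed supervised can also be checked programmatically instead of by eye. The following is a minimal sketch on toy data; the stock names 'A' and 'B' and the `ends_supervised` column are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy data (hypothetical): stock 'A' recovers, stock 'B' stays supervised to the end
toy = pd.DataFrame({
    'Name': ['A'] * 5 + ['B'] * 5,
    'Date': list(pd.date_range('2021-01-04', periods=5, freq='B')) * 2,
    'SupervisionFlag': [False, True, True, False, False,
                        False, False, False, True, True],
}).sort_values(['Name', 'Date'])

# First day, last day and number of supervised days per stock
supervised_days = toy[toy['SupervisionFlag']]
summary = supervised_days.groupby('Name')['Date'].agg(
    first_day='min', last_day='max', n_days='count')

# A stock "ends supervised" if its final observation still carries the flag
summary['ends_supervised'] = toy.groupby('Name')['SupervisionFlag'].last()
print(summary)
```

Applied to the real data, a stock whose last row is still flagged would correspond to the "supervised until the end of the time series" case described above.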

Next, we derive some aggregate statistics and see how the supervised stocks compare to their counterparts.

It is important to understand that these statistics are derived from all data gathered over the observation period. Therefore, they are not directly useful for predicting the future evolution of stock prices and other metrics. Their utility lies in categorizing the stocks as objects.

Grouping the data by stock name, we compute aggregate statistics for all columns, such as the mean, minimum and standard deviation.

In [None]:
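# String names select the pandas built-in aggregations (passing NumPy callables is deprecated in recent pandas)
stock_agg = stocks.drop(columns=['Date','RowId','SecuritiesCode']).groupby('Name').agg(['min', 'max', 'std', 'mean', 'median'])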
stock_agg.columns = ["{}_{}".format(*col) for col in stock_agg.columns]
stock_agg = stock_agg.reset_index()
stock_agg

## PCA

The newly transformed data will be subjected to PCA in order to reduce dimensionality. Setting the `n_components` parameter to 'mle' lets the estimator determine the appropriate number of components for our data.[2] With this choice, the default `svd_solver` value of 'auto' is interpreted as 'full'.[3]
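
As a small illustration of what `n_components='mle'` does, here is a sketch on synthetic data. The data, the variable names and the noise level are assumptions chosen for the example. Note that scikit-learn only supports this option when there are at least as many samples as features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 2 strong latent factors spread over 6 features, plus weak noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 6))

pca_demo = PCA(n_components='mle').fit(X)

# n_components_ holds the dimensionality estimated from the data
print(pca_demo.n_components_)
print(pca_demo.explained_variance_ratio_)
```

After fitting, `n_components_` reports how many components were retained, which may be fewer than the number of input features.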



In [None]:
pca = PCA(n_components='mle')
pca.fit(stock_agg.drop(columns=['Name']))

fig, ax = plt.subplots(1)
ax.plot(pca.explained_variance_ratio_)

In [None]:
sum(pca.explained_variance_ratio_[:2])

It seems the first two components account for well over 99.9% of the variance. This means we can easily plot our stocks as a 2D scatter plot.

This also greatly simplifies the analysis and allows us to explore the relationships between variables in more detail.
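
One caveat worth keeping in mind: PCA works on raw variances, so a column with a much larger numeric range can dominate the leading components. Whether that is what happens with the aggregated stock features is not checked here. The sketch below only demonstrates the effect on synthetic data, with `StandardScaler` shown as one common remedy; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, independent features; the first one is on a much larger scale
X = rng.normal(size=(500, 3))
X[:, 0] *= 1e6

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio.round(4))     # explained variance without scaling
print(scaled_ratio.round(4))  # explained variance after standardization
```

If a comparison like this shows a very different picture after scaling, the 2D projection mostly reflects the units of the dominant column and not the structure shared by the features.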

In [None]:
stock_agg[['Component_A', 'Component_B']] = pca.transform(stock_agg.drop(columns=['Name']))[:,0:2]

In [None]:
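# Keep the column order used when fitting the PCA; a set would shuffle the labels relative to pca.components_
original_features = [col for col in stock_agg.columns if col not in ('Component_A', 'Component_B', 'Name')]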
fig, ax = plt.subplots()
heatmap = ax.imshow(pca.components_.transpose())
ax.set_yticks(np.arange(len(original_features)), labels=original_features)
ax.set_xticks(np.arange(pca.components_.shape[0]), labels=pca.explained_variance_ratio_)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.figure.colorbar(heatmap, ax=ax)
fig.set_size_inches(20, 20)

It appears that both of the most important components focus on a handful of the features in our data set, such as those derived from the High, Low and Open columns, to the detriment of others. As we are taking a global view of the trading history, this isn't necessarily indicative of their relevance to predicting further development.

However, if properly applied, this knowledge might help us to differentiate between various types of stocks.

Interestingly enough, one of the features derived from the SupervisionFlag column makes an appearance in both of our components. This calls for closer inspection.

In [None]:
fig, ax = plt.subplots()
selected_comp = pca.components_[:2]
heatmap = ax.imshow(selected_comp.transpose())
ax.set_yticks(np.arange(len(original_features)), labels=original_features)
ax.set_xticks(np.arange(selected_comp.shape[0]), labels=pca.explained_variance_ratio_[:2])
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")


for f in np.arange(len(original_features)):
    for c in np.arange(selected_comp.shape[0]):
        ax.text(c, f, np.round(selected_comp[c, f], 2),
                       ha="center", va="center", color="white")

fig.set_size_inches(20, 20)

A closer look confirms the initial assessment. All features except for those previously indicated have coefficients approaching zero.
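
As an alternative to reading the heatmap, the loadings can be put into a labeled table and sorted. The sketch below uses synthetic data with hypothetical feature names. The important detail is that the labels are taken from the fitted frame's columns, in the same order as they were passed to `fit`, so names and coefficients stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic frame with hypothetical feature names; 'f_big' has by far the largest variance
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=['f_big', 'f_a', 'f_b', 'f_c'])
X['f_big'] *= 100

pca_demo = PCA(n_components=2).fit(X)

# Rows are components, columns keep the order of the fitted frame
loadings = pd.DataFrame(pca_demo.components_,
                        columns=X.columns,
                        index=['Component_A', 'Component_B'])

# Features sorted by absolute loading on the first component
top = loadings.loc['Component_A'].abs().sort_values(ascending=False)
print(top)
```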


In [None]:
fig, axes = plt.subplots(2)
axes[0].scatter(
    stock_agg[stock_agg['SupervisionFlag_mean']==0]['Component_A'],
    stock_agg[stock_agg['SupervisionFlag_mean']==0]['Component_B'])
axes[0].scatter(
    stock_agg[stock_agg['SupervisionFlag_mean']>0]['Component_A'],
    stock_agg[stock_agg['SupervisionFlag_mean']>0]['Component_B'],
    color='red')
for st in supervised_st['Name'].unique():
    axes[1].scatter(
        stock_agg[stock_agg['Name']==st]['Component_A'],
        stock_agg[stock_agg['Name']==st]['Component_B'],
        label=st)
axes[1].legend()
fig.set_size_inches(20, 25)

The synthetic features indicate the existence of another outlier, 'TOSHIBA CORPORATION', which occupies an extreme position in the scatter plot both among the supervised stocks and among all stocks.
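
Outliers such as this one could also be ranked numerically instead of being read off the plot. The following is a minimal sketch, assuming a robust z-score (distance from the median in units of the median absolute deviation) as the measure; this is one possible choice and not the method used above. The names and values in the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy component scores (hypothetical names and values); 'D' sits far from the rest
scores = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Component_A': [0.1, -0.2, 0.0, 9.0, 0.3],
    'Component_B': [0.2, 0.1, -0.1, -7.0, 0.0],
})

comps = scores[['Component_A', 'Component_B']]

# Robust z-score per component: distance from the median in units of the MAD
median = comps.median()
mad = (comps - median).abs().median()
robust_z = (comps - median) / mad

# Combine both components into one distance and rank the stocks by it
scores['outlier_score'] = np.sqrt((robust_z ** 2).sum(axis=1))
print(scores.sort_values('outlier_score', ascending=False))
```

The median and MAD are used here because a single extreme point distorts them far less than it would the mean and standard deviation.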

[0] https://www.jpx.co.jp/english/listing/market-alerts/supervision/00-archives/index.html

[1] http://www.ci-medical.co.jp/

[2] https://tminka.github.io/papers/pca/minka-pca.pdf

[3] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn-decomposition-pca