# Introduction
This competition asks you to rank stocks that are actually traded on the Japan Exchange Group, Inc. (JPX) by their expected future returns.
This notebook explains how to use `jpx_tokyo_market_prediction`, an API that may be unfamiliar to Kaggle beginners, and how to extract the relevant information from the training data.

# Table of Contents
*  [Explanation of data](#Explanation-of-data)
*  [jpx_tokyo_market_prediction](#jpx_tokyo_market_prediction)
*  [Create models and submit data](#Create-models-and-submit-data)

# Explanation of data
## Loading Modules
First, load the required modules.  
In this case, we will use pandas to load the data.

In [None]:
import numpy as np
import pandas as pd

## Check the data
Read *`stock_prices.csv`* using `read_csv` in pandas.

In [None]:
stock_price_df = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")

Check the shape of this data (rows, columns) and its contents.

In [None]:
print('(rows, columns) =', stock_price_df.shape)
stock_price_df.tail()

The data contained in `stock_prices.csv` is as follows.  
*  `SecuritiesCode` ... Securities code (number assigned to each stock)
*  `Open` ... Opening price (price per share at the start of the trading day, 9:00 am)
*  `High` ... High price (the highest price of the day)
*  `Low` ... Low price (the lowest price of the day)
*  `Close` ... Closing price
*  `Volume` ... Volume (number of shares traded in a day)
*  `AdjustmentFactor` ... Used to calculate the theoretical stock price and volume at the time of a stock split or reverse stock split
*  `ExpectedDividend` ... Expected dividend on the ex-rights date
*  `SupervisionFlag` ... Flag for supervised issues and delisted issues
*  `Target` ... Percentage change in adjusted closing price (from one day to the next)  
  
Although much more data is available for this competition, this notebook uses only the information in `stock_prices.csv`.
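
To make the `Target` column more concrete, here is a minimal sketch of a "change in closing price from one day to the next" computed per stock on toy data. It assumes the change is measured between the next two trading days' closes; the offsets are parameters, so adjust them if the definition you rely on differs. The helper `add_return_target` and the column `ToyTarget` are hypothetical names, and price adjustment via `AdjustmentFactor` is ignored here.

```python
import pandas as pd

def add_return_target(df, start=1, end=2):
    """Rate of change of Close between day t+start and day t+end, per stock."""
    df = df.sort_values(["SecuritiesCode", "Date"]).copy()
    close = df.groupby("SecuritiesCode")["Close"]
    # shift(-k) looks k rows ahead within each stock
    df["ToyTarget"] = close.shift(-end) / close.shift(-start) - 1.0
    return df

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-01", "2021-12-02", "2021-12-03", "2021-12-06"]),
    "SecuritiesCode": [1301] * 4,
    "Close": [100.0, 110.0, 121.0, 108.9],
})
toy = add_return_target(toy)
print(toy[["Date", "Close", "ToyTarget"]])
```

The last rows have no future closes to compare, so their value is NaN.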

# jpx_tokyo_market_prediction
Next, we look at how to use the API named `jpx_tokyo_market_prediction`.  
First, import it as you would any other module.  
Because `jpx_tokyo_market_prediction` can only be run once per session, the examples in this section are written in Markdown rather than as executable cells.

***
```python
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
```


Calling *`make_env()`* creates the environment, and calling *`iter_test()`* creates the iterator object.
As shown below, the type of `iter_test` is a generator, so it hands over its data one step at a time in a `for` statement.  
***
```python
print(type(iter_test))
```
[Output]  
```
<class 'generator'>
```
Running a `for` loop over it behaves as follows.
***
```python
count = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    print(prices.head())
    env.predict(sample_prediction)
    count += 1
    break
```
[Output]
```
This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
         Date          RowId  SecuritiesCode    Open    High     Low   Close  \
0  2021-12-06  20211206_1301            1301  2982.0  2982.0  2965.0  2971.0   
1  2021-12-06  20211206_1332            1332   592.0   599.0   588.0   589.0   
2  2021-12-06  20211206_1333            1333  2368.0  2388.0  2360.0  2377.0   
3  2021-12-06  20211206_1375            1375  1230.0  1239.0  1224.0  1224.0   
4  2021-12-06  20211206_1376            1376  1339.0  1372.0  1339.0  1351.0   

    Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag  
0     8900               1.0               NaN            False  
1  1360800               1.0               NaN            False  
2   125900               1.0               NaN            False  
3    81100               1.0               NaN            False  
4     6200               1.0               NaN            False
```

The variables returned on each iteration are as follows.  
*  `prices` ... Data for each stock on the target date; the same information as stock_prices.csv, without Target.
*  `options` ... Same information as options.csv for the target date.
*  `financials` ... Same information as financials.csv for the target date.
*  `trades` ... Same information as trades.csv for the target date.
*  `secondary_prices` ... Same information as secondary_stock_prices.csv for the target date, without Target.
*  `sample_prediction` ... Submission template for the target date (same layout as sample_submission.csv).


Thus, *`jpx_tokyo_market_prediction`* delivers the data for the 2000 target stocks one day at a time. For each day, we forecast with the model we created and pass the resulting submission data to `env.predict`, which is what produces a score.
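
Since the real environment can only be run once, it can help to rehearse the day-by-day loop locally. The following is a rough sketch with a hypothetical stand-in generator, `mock_iter_test`, which is not part of the competition API and yields only two of the six items; the toy prices and the "rank by closing price" rule are purely illustrative.

```python
import pandas as pd

def mock_iter_test(prices_df):
    """Hypothetical local stand-in for env.iter_test(): one trading day per step."""
    for _, day in prices_df.groupby("Date"):
        day = day.reset_index(drop=True)
        sample_prediction = day[["Date", "SecuritiesCode"]].copy()
        sample_prediction["Rank"] = range(len(day))  # placeholder ranks
        yield day, sample_prediction

toy_prices = pd.DataFrame({
    "Date": ["2021-12-06", "2021-12-06", "2021-12-07", "2021-12-07"],
    "SecuritiesCode": [1301, 1332, 1301, 1332],
    "Close": [2971.0, 589.0, 2980.0, 585.0],
})

submitted = []
for prices, sample_prediction in mock_iter_test(toy_prices):
    # A trivial "model": rank by closing price, highest first
    order = prices["Close"].rank(ascending=False, method="first") - 1
    sample_prediction["Rank"] = order.astype(int)
    submitted.append(sample_prediction)  # the real loop calls env.predict here

print(len(submitted), "days processed")
```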

# Create models and submit data
Here, we create a simple model trained on stock_prices.csv and take it all the way to submission.
## Create Model(LSTM)
We use a model called LSTM (Long Short-Term Memory).  
LSTM is a type of RNN used for sequence data and can learn long-term dependencies.  
We will implement the LSTM using PyTorch.

In [None]:
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size=8, sequence_num=31, lstm_dim=128,
                 num_layers=2, output_size=1):
        super().__init__()
        # Bidirectional LSTM: each time step outputs 2 * lstm_dim features
        self.lstm = nn.LSTM(input_size, lstm_dim, num_layers, batch_first=True, bidirectional=True)
        flat_dim = lstm_dim * sequence_num * 2
        self.bn1 = nn.BatchNorm1d(flat_dim)
        self.linear1 = nn.Linear(flat_dim, output_size)

    def forward(self, x):
        # x: (batch, sequence_num, input_size)
        lstm_out, _ = self.lstm(x)
        # Flatten all time steps: (batch, sequence_num * 2 * lstm_dim)
        x = lstm_out.reshape(lstm_out.shape[0], -1)
        x = self.linear1(self.bn1(x))
        return x
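
To see where the `lstm_dim*sequence_num*2` input size of the linear layer comes from, the following small check traces the tensor shapes through a bidirectional `nn.LSTM` with the same sizes as above. The batch size of 4 and the random input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

batch, seq_len, n_features, hidden = 4, 31, 8, 128
lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, n_features)
out, (h_n, c_n) = lstm(x)

# Forward and backward directions are concatenated on the last axis
print(out.shape)   # torch.Size([4, 31, 256])
# Flattening the time axis gives 31 * 256 = 7936 features per sample
flat = out.reshape(batch, -1)
print(flat.shape)  # torch.Size([4, 7936])
```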

## Create Dataset
Create a dataset that can be retrieved for each securities code.  
  
First, in *`stock_price_df`*, fill the NaN values of `ExpectedDividend` with 0, convert the bool column `SupervisionFlag` to int, and convert *`'Date'`* to datetime.

In [None]:
stock_price_df['ExpectedDividend'] = stock_price_df['ExpectedDividend'].fillna(0)
stock_price_df['SupervisionFlag'] = stock_price_df['SupervisionFlag'].map({True: 1, False: 0})
stock_price_df['Date'] = pd.to_datetime(stock_price_df['Date'])
stock_price_df.info()

Some rows still contain missing values, so we remove them.

In [None]:
stock_price_df = stock_price_df.dropna(how='any')
# Confirm that no missing values remain
stock_price_df_na = (stock_price_df.isnull().sum() / len(stock_price_df)) * 100
stock_price_df_na = stock_price_df_na.drop(stock_price_df_na[stock_price_df_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :stock_price_df_na})
missing_data.head(22)

Standardize the eight input features (every column except RowId, Date, SecuritiesCode, and Target) using scikit-learn's StandardScaler.

In [None]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']
stock_price_df[columns] = stdsc.fit_transform(stock_price_df[columns])
stock_price_df.head()
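
The scaler fitted here is reused later on unseen data, so it is worth seeing the difference between `fit_transform` and `transform`. This is a minimal sketch on made-up numbers: the statistics are learned from the training rows only, and a new day is scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[100.0, 10.0], [200.0, 20.0], [300.0, 30.0]])
new_day = np.array([[400.0, 40.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean/std, then scales
new_std = scaler.transform(new_day)      # reuses the learned mean/std

print(scaler.mean_)  # per-column means learned from `train`
print(new_std)       # new values expressed in training standard deviations
```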

Store the data for each stock in a dictionary keyed by securities code so that it can be retrieved stock by stock.

In [None]:
dataset_dict = {}
for sc in stock_price_df['SecuritiesCode'].unique():
    dataset_dict[str(sc)] = stock_price_df[stock_price_df['SecuritiesCode'] == sc].values[:, 3:].astype(np.float32)
print(dataset_dict['1301'].shape)
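
As an alternative to filtering the whole frame once per code, the same kind of dictionary can be built in a single `groupby` pass. This is only a sketch on a toy frame with fewer columns than the real data; `toy` and `toy_dict` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RowId": ["a", "b", "c"],
    "Date": pd.to_datetime(["2021-12-01", "2021-12-01", "2021-12-02"]),
    "SecuritiesCode": [1301, 1332, 1301],
    "Close": [1.0, 2.0, 3.0],
    "Target": [0.1, 0.2, 0.3],
})

# Skip the first three identifier columns, keep features + Target as float32
toy_dict = {
    str(code): group.iloc[:, 3:].to_numpy(dtype=np.float32)
    for code, group in toy.groupby("SecuritiesCode")
}
print(toy_dict["1301"].shape)  # (2, 2)
```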


Use a PyTorch `DataLoader` to draw the data in mini-batches.

In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, X, sequence_num=31, y=None, mode='train'):
        self.data = X
        self.teacher = y
        self.sequence_num = sequence_num
        self.mode = mode

    def __len__(self):
        # Use the feature array so this also works when y is None (test mode)
        return len(self.data)

    def __getitem__(self, idx):
        # idx is an array of consecutive row indices supplied by the sampler
        out_data = self.data[idx]
        if self.mode == 'train':
            # Label of the last day in the window
            out_label = self.teacher[idx[-1]]
            return out_data, out_label
        else:
            return out_data

def create_dataloader(dataset, dataset_num, sequence_num=31, input_size=8, batch_size=32, shuffle=False):
    # Each sampler entry is one sliding window of `sequence_num` consecutive rows
    sampler = np.array([list(range(i, i+sequence_num)) for i in range(dataset_num-sequence_num+1)])
    if shuffle:
        np.random.shuffle(sampler)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size, sampler=sampler)
    return dataloader
#### Check operation ####
X_check, y_check = dataset_dict['1301'][:, :-1], dataset_dict['1301'][:, -1]
dataset_check = MyDataset(X_check, y=y_check, sequence_num=31, mode='train')
dataloader_check = create_dataloader(dataset_check, X_check.shape[0], sequence_num=31, input_size=8, batch_size=32, shuffle=False)
for b, tup in enumerate(dataloader_check):
    print('---------')
    print(tup[0].shape, tup[1].shape)
    break
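
The sampler above is what turns a flat table into sequences: each entry is a window of consecutive row indices, and the label is taken from the last row of the window. The following toy example, with 6 rows and a window of 3 instead of 31, shows the indexing in isolation.

```python
import numpy as np

n_rows, seq = 6, 3
# One window of `seq` consecutive row indices per starting position
windows = np.array([list(range(i, i + seq)) for i in range(n_rows - seq + 1)])
print(windows)  # [[0 1 2] [1 2 3] [2 3 4] [3 4 5]]

X_toy = np.arange(n_rows * 2).reshape(n_rows, 2)  # 2 toy features per row
y_toy = np.arange(n_rows) * 10                    # toy labels

first_seq = X_toy[windows[0]]        # rows 0..2, shape (3, 2)
first_label = y_toy[windows[0][-1]]  # label of the window's last row
print(first_seq.shape, first_label)  # (3, 2) 20
```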

## Training
For each stock, we build a dataset and data loader and train the same LSTM on it, repeating this over all stocks in every epoch.

In [None]:
from tqdm import tqdm
epochs = 10
batch_size = 512
# Check whether a GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# Model Instantiation
model = LSTM(input_size=8, sequence_num=31, lstm_dim=128, num_layers=2, output_size=1)
model.to(device)
model.train()
# setting optimizer
lr = 0.0001
weight_decay = 1.0e-05
optimizer = torch.optim.Adagrad(model.parameters(), lr=lr, weight_decay=weight_decay)
# setting criterion
criterion = nn.MSELoss()
# set iteration counter
iteration = 0
# training log: [cumulative iteration counts], [mean loss per epoch]
log_train = [[0], [np.inf]]
for epoch in range(epochs):
    epoch_loss = 0.0
    for sc in tqdm(stock_price_df['SecuritiesCode'].unique()):
        X, y = dataset_dict[str(sc)][:, :-1], dataset_dict[str(sc)][:, -1]
        dataset = MyDataset(X, y=y, sequence_num=31, mode='train')
        dataloader = create_dataloader(dataset, X.shape[0], sequence_num=31, input_size=8, batch_size=batch_size, shuffle=True)
        for data, targets in dataloader:
            data, targets = data.to(device), targets.to(device)
            
            optimizer.zero_grad()
            
            data = data.to(torch.float32)
            output = model(data)
            targets = targets.to(torch.float32)
            
            loss = criterion(output.view(-1), targets)
            
            loss.backward()
            
            optimizer.step()
            
            epoch_loss += loss.item()
            
            iteration += 1
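    # Average over this epoch's batches only (iteration is cumulative across epochs)
    epoch_loss /= (iteration - log_train[0][-1])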
    print('epoch_loss={}'.format(epoch_loss))
    log_train[0].append(iteration)
    log_train[1].append(epoch_loss)
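
The core of the loop is the `zero_grad` → forward → `backward` → `step` cycle plus a per-epoch average of the batch losses. Here is a stripped-down sketch of that pattern on synthetic data; it uses a one-layer linear model and SGD instead of the LSTM and Adagrad above, purely to keep it small, and all `toy_*` names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 1)
y = 3.0 * x + 1.0  # synthetic target the model should recover
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=16)

toy_model = nn.Linear(1, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_criterion = nn.MSELoss()

epoch_means = []
for epoch in range(20):
    running, n_batches = 0.0, 0
    for xb, yb in loader:
        toy_optimizer.zero_grad()                 # clear old gradients
        loss = toy_criterion(toy_model(xb), yb)   # forward pass + loss
        loss.backward()                           # compute gradients
        toy_optimizer.step()                      # update parameters
        running += loss.item()
        n_batches += 1
    # Mean loss over the batches of this epoch only
    epoch_means.append(running / n_batches)

print(epoch_means[0], epoch_means[-1])
```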

To check how training is progressing, plot how the loss changes.  
→ The loss decreases, so the model is learning.

In [None]:
import matplotlib.pyplot as plt
plt.plot(log_train[0][1:], log_train[1][1:])
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()

## Prediction
The trained model is used to make predictions for the submission data.  
The data is converted in the order `DataFrame` → `ndarray` → `tensor` before predicting.

In [None]:
from datetime import datetime
def predict(model, X_df, sequence=31, batch_size=512):
    # Copy to avoid pandas' SettingWithCopyWarning when adding columns below
    pred_df = X_df[['Date', 'SecuritiesCode']].copy()
    # groupby iterates in ascending SecuritiesCode order, so X_df is
    # expected to be sorted by SecuritiesCode as well
    code_group = X_df.groupby('SecuritiesCode')
    X_all = []
    for sc, group in code_group:
        # Standardize target data
        group_std = stdsc.transform(group[columns])
        # Last (sequence - 1) rows of training history for this stock
        X = dataset_dict[str(sc)][-1*(sequence-1):, :-1]
        # History + target day -> one sequence of length `sequence`
        X_all.append(np.vstack((X, group_std)))
    X_all = np.stack(X_all).astype(np.float32)
    # Use BatchNorm running statistics and skip gradient tracking at inference
    model.eval()
    y_pred = []
    with torch.no_grad():
        for start in range(0, X_all.shape[0], batch_size):
            data = torch.from_numpy(X_all[start:start+batch_size]).to(device)
            output = model(data)
            y_pred.append(output.view(-1).cpu().numpy())
    pred_df['target'] = np.concatenate(y_pred)
    # Highest predicted value gets Rank 0; ties are broken by row order
    pred_df['Rank'] = pred_df['target'].rank(ascending=False, method='first') - 1
    pred_df['Rank'] = pred_df['Rank'].astype(int)
    pred_df = pred_df.drop('target', axis=1)
    return pred_df
# Smoke test only: stock_price_df is already standardized, and predict() scales it again
test_X_df = stock_price_df[stock_price_df['Date'] == datetime(2021, 12, 3)].drop('Target', axis=1)
y_pred = predict(model, test_X_df)
print(y_pred.shape)
print(y_pred)
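
The last step of `predict` converts raw model outputs into ranks. Here is that step in isolation on four made-up predictions, including a tie, followed by a quick check that the result uses each value from 0 to n-1 exactly once. The variable names are illustrative.

```python
import pandas as pd

toy_pred = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333, 1375],
    "target": [0.1, 0.3, -0.2, 0.3],
})

# Largest prediction gets rank 0; method="first" breaks ties by row order
toy_pred["Rank"] = (
    toy_pred["target"].rank(ascending=False, method="first") - 1
).astype(int)
print(toy_pred["Rank"].tolist())  # [2, 0, 3, 1]

# Each rank from 0 to n-1 appears exactly once
is_valid = sorted(toy_pred["Rank"]) == list(range(len(toy_pred)))
print(is_valid)  # True
```

Using `method="first"` matters here: with the default `method="average"`, the two tied stocks would share a fractional rank and the ranks would no longer be unique.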

## Submission
Create the submission environment from `jpx_tokyo_market_prediction`.

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

In [None]:
count = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = prices.fillna(0)
    prices['SupervisionFlag'] = prices['SupervisionFlag'].map({True: 1, False: 0})
    prices['Date'] = pd.to_datetime(prices['Date'])
    pred_df = predict(model, prices)
    print(pred_df)
    env.predict(pred_df)
    count += 1

In [None]:
pred_df