# **Optiver Realized Volatility Prediction**&#x1f600;
Apply your data science skills to make financial markets better
 
 ## ※ The Japanese version is here:
 
 https://www.kaggle.com/chumajin/optiver-realized-eda-for-starter-version


## This competition asks us to predict the volatility (degree of price fluctuation) of each stock for each time window (time_id).


## If you find it useful, I would be grateful if you could **upvote**.
 ※ Thank you to everyone who has upvoted my notebooks before!
 
 


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. What do we predict? (Let's look at sample_submission.csv)

In [None]:
sample = pd.read_csv("../input/optiver-realized-volatility-prediction/sample_submission.csv")
sample

### There are only two columns: row_id and target.
### I think target is the volatility (degree of price fluctuation) that we must predict.
### What row_id means is explained in the next section.

# 2. What do we predict from? (Let's look at test.csv)

In [None]:
test = pd.read_csv("../input/optiver-realized-volatility-prediction/test.csv")
test

#### The row_id of the submission file is the stock_id and the time_id joined by the string "-".
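As a minimal sketch of that naming rule, the snippet below rebuilds row_id from the two id columns. `toy_test` and its values are made up for illustration and are not the real test.csv.

```python
import pandas as pd

# Toy rows shaped like test.csv; the values are invented for illustration.
toy_test = pd.DataFrame({"stock_id": [0, 0, 1], "time_id": [4, 32, 4]})

# row_id is "<stock_id>-<time_id>", joined with a hyphen.
toy_test["row_id"] = (
    toy_test["stock_id"].astype(str) + "-" + toy_test["time_id"].astype(str)
)
print(toy_test["row_id"].tolist())
```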

#### In addition, test.csv comes with **book_test.parquet** and **trade_test.parquet**.
#### Let's look at the case of stock_id = 0.
※ You could pass the path of the parquet file itself, but passing the path of the folder above it (`stock_id=0`) also seems to work.

In [None]:
book_testparquet = pd.read_parquet("../input/optiver-realized-volatility-prediction/book_test.parquet/stock_id=0")
book_testparquet

In [None]:
trade_testparquet = pd.read_parquet("../input/optiver-realized-volatility-prediction/trade_test.parquet/stock_id=0")
trade_testparquet

#### From this information, I think the task is to predict the volatility (degree of price fluctuation) of each stock for each time_id.
#### With that in mind, we will look at the training data, explain each column, do some EDA, and finally make a trial submission.

# 3. train.csv

In [None]:
train = pd.read_csv("../input/optiver-realized-volatility-prediction/train.csv")
train

#### A very simple layout: each row has a stock_id, a time_id, and a target value.
#### I think target is the volatility over a 10-minute window, as described on the competition's Data page (this is the label we train on).

# 4. train.parquet

#### As with the test data, train.csv comes with book_train.parquet and trade_train.parquet.
#### As an example, let's look at each parquet file for stock_id = 0.

## **4.1 book_train.parquet**


#### Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared.
#### The first level of the book is more competitive in price terms, so it receives execution priority over the second level.


******Supplementary explanation (Personal Interpretation)******

I think of the order book as a list of reservations (standing orders).


A buyer's reservation is filled when the price drops to the price they named.


Conversely, a seller's reservation is filled when the price rises to the price they named.

In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
book_example

Bid is the price at which a buyer wants to buy the stock, and Ask is the price at which a seller wants to sell it.

* stock_id: ID of the stock. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
* time_id: ID of the time bucket (linked to the time_id in the submission file).
* seconds_in_bucket: Number of seconds since the start of the time_id bucket, starting from 0. Since we are probably dealing with 10-minute windows, seconds_in_bucket should run up to 600 sec.
* bid_price1,2: Desired buying prices (normalized prices of the most / second most competitive buy level).
* ask_price1,2: Desired selling prices (normalized prices of the most / second most competitive sell level).
* bid_size1,2: The number of shares on the most/second most competitive buy level.
* ask_size1,2: The number of shares on the most/second most competitive sell level.
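To connect these columns to the target, here is a minimal sketch that turns level-1 book data into a single price series and then into a volatility number. It assumes the weighted average price (WAP) and the "square root of the sum of squared log returns" definition of realized volatility, which is my understanding of the competition's tutorial rather than something defined in this notebook. `toy_book`, `wap1`, and `realized_volatility` are hypothetical names and the numbers are invented.

```python
import numpy as np
import pandas as pd

# Invented top-of-book snapshots for a single (stock_id, time_id) bucket.
toy_book = pd.DataFrame({
    "seconds_in_bucket": [0, 1, 5, 6],
    "bid_price1": [1.0000, 1.0002, 0.9998, 1.0001],
    "ask_price1": [1.0004, 1.0006, 1.0002, 1.0005],
    "bid_size1": [100, 50, 200, 100],
    "ask_size1": [100, 150, 100, 100],
})


def wap1(book):
    """Level-1 WAP: each price is weighted by the size on the opposite side."""
    numerator = (book["bid_price1"] * book["ask_size1"]
                 + book["ask_price1"] * book["bid_size1"])
    return numerator / (book["bid_size1"] + book["ask_size1"])


def realized_volatility(prices):
    """Square root of the sum of squared log returns (assumed definition)."""
    log_returns = np.log(prices).diff().dropna()
    return float(np.sqrt((log_returns ** 2).sum()))


toy_book["wap"] = wap1(toy_book)
print(realized_volatility(toy_book["wap"]))
```

A flat price series gives a volatility of 0, and larger jumps between consecutive snapshots push the value up.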



## **4.2 trade_train.parquet**

#### Contains data on trades that actually executed. Usually, in the market, there are more passive buy/sell intention updates (book updates) than actual trades, therefore one may expect this file to be more sparse than the order book.


******Supplementary explanation (Personal Interpretation)******

These are the trades that actually took place. Presumably, the buyer bought this quantity at this price and the seller sold it at the same price.

In [None]:
trade_example = pd.read_parquet("../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0")
trade_example

* stock_id - Same as above.
* time_id - Same as above.
* seconds_in_bucket - Same as above. Note that since trade and book data are taken from the same time window and trade data is generally more sparse, this field does not necessarily start from 0.
* price - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
* size - The total number of shares traded.
* order_count - The number of unique trade orders taking place.
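Because trade rows exist only for seconds in which something traded, it can help to put them on a dense per-second grid. The sketch below is one way to do that, assuming a bucket covers seconds 0 to 599. That length is my assumption based on the 10-minute window, and `toy_trade` and `BUCKET_SECONDS` are hypothetical.

```python
import pandas as pd

# Invented trades for one bucket; note the first trade is at second 21, not 0.
toy_trade = pd.DataFrame({
    "seconds_in_bucket": [21, 46, 50],
    "price": [1.0023, 1.0027, 1.0025],
    "size": [326, 128, 55],
    "order_count": [12, 4, 1],
})

BUCKET_SECONDS = 600  # assumption: one bucket spans seconds 0..599

# Reindex onto every second of the bucket; seconds without trades become NaN.
grid = pd.RangeIndex(BUCKET_SECONDS, name="seconds_in_bucket")
dense = toy_trade.set_index("seconds_in_bucket").reindex(grid)

# No trade means zero shares and zero orders in that second.
dense["size"] = dense["size"].fillna(0).astype(int)
dense["order_count"] = dense["order_count"].fillna(0).astype(int)

# Carry the last traded price forward; seconds before the first trade stay NaN.
dense["price"] = dense["price"].ffill()
print(dense.loc[[0, 21, 30, 46]])
```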

# 5. EDA

## 5.1 Analysis by stock
### 5.1.1 Number of stocks

In [None]:
train

In [None]:
for col in train.columns:
    print(col,":",len(train[col].unique()))

## There are 112 unique stock_id values, 3830 unique time_id values, and 414287 unique target values.
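The same kind of count can be obtained in one call with `DataFrame.nunique()`. A small sketch on invented data (`toy_train` is hypothetical, not the real train.csv):

```python
import pandas as pd

# Invented rows shaped like train.csv.
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 1, 1],
    "time_id": [5, 11, 5, 11],
    "target": [0.004, 0.001, 0.002, 0.002],
})

# Number of distinct values in every column at once.
unique_counts = toy_train.nunique()
print(unique_counts)
```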

### 5.1.2 Statistics by stock

In [None]:
stock = train.groupby("stock_id")["target"].agg(["mean","median","std","count","sum"]).reset_index()
stock

#### Let's look at histograms of just the mean and the sum.

In [None]:
print("mean value=" ,stock["mean"].mean())
plt.hist(stock["mean"])

## The average of the per-stock means is about 0.003, which is close to 0.

In [None]:
print("sum value=" ,stock["sum"].mean())
plt.hist(stock["sum"])

## The average per-stock total volatility over this period is about 14.8, and the largest totals exceed 30.

--------------Below, let's look at the 10-minute behavior of time_id = 5 with stock id = 0. -------------

## 5.2 Relationship between bid/ask prices on the order book and actual trade prices within a time_id

In [None]:
book_example

In [None]:
book_test = book_example[book_example["time_id"]==5]
book_test

### 5.2.1 Price fluctuation (individual + combined)

#### First, order book information

In [None]:
samples = ["bid_price1","bid_price2","ask_price1","ask_price2"]

# One subplot per price series, stacked vertically in a single figure
plt.figure(figsize=(20,10))
for num,a in enumerate(samples):
    plt.subplot(4,1,num+1)
    plt.plot(book_test["seconds_in_bucket"],book_test[a])
    plt.title(a)
plt.tight_layout()
plt.show()

# All four price series overlaid on one set of axes
plt.figure(figsize=(20,5))
for num,a in enumerate(samples):
    plt.plot(book_test["seconds_in_bucket"],book_test[a],label=a)
plt.legend(fontsize=12)


#### Add the actual transaction information to this.

In [None]:
trade_example

In [None]:
trade_test = trade_example[trade_example["time_id"]==5]
trade_test.head(5)

#### Add the actual transaction to the whole graph

In [None]:
plt.figure(figsize=(20,5))

for num,a in enumerate(samples):
    
   
    plt.plot(book_test["seconds_in_bucket"],book_test[a],label=a)
    
plt.plot(trade_test["seconds_in_bucket"],trade_test["price"],label="trade_parquet",lw=10)
plt.legend(fontsize=12)

## The purple line is the actual trade price. It wanders between the bid and ask prices on the order book.

#### ※ Presumably, when the price gets close to the bid or the ask there is a tug-of-war, and if it goes past that level the price breaks out, and so on.

#### This is not my area of expertise, so I will leave it to the experts.
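The "trades sit between bid and ask" observation can also be checked numerically. The sketch below attaches the most recent book snapshot to each trade with `pd.merge_asof` and flags whether the trade price lies inside the best bid/ask. The data are invented, and comparing against the latest snapshot at or before the trade is my own simplification.

```python
import pandas as pd

# Invented book snapshots and trades for one bucket.
toy_book = pd.DataFrame({
    "seconds_in_bucket": [0, 5, 20, 40],
    "bid_price1": [1.0010, 1.0012, 1.0020, 1.0022],
    "ask_price1": [1.0020, 1.0022, 1.0030, 1.0030],
})
toy_trade = pd.DataFrame({
    "seconds_in_bucket": [21, 46],
    "price": [1.0023, 1.0031],
})

# For each trade, take the latest book snapshot at or before that second.
# Both frames must already be sorted by seconds_in_bucket.
merged = pd.merge_asof(
    toy_trade, toy_book, on="seconds_in_bucket", direction="backward"
)

# True when the trade price lies within [bid_price1, ask_price1].
merged["inside_spread"] = merged["price"].between(
    merged["bid_price1"], merged["ask_price1"]
)
print(merged)
```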

#### The bid and ask values fluctuate even within the 10 minutes of time_id = 5.
#### This variability may be related to volatility.
#### For example, when news arrives that moves the stock price suddenly, volatility rises.
#### Therefore, the variation in these prices is likely to be large at such times.
#### Additionally, Max-Min may be important. (This is reasoning in reverse, since volatility is what we are predicting.)
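As a rough sketch of how such "variation" and "Max-Min" ideas could become features, the snippet below computes the standard deviation and the range of one price column per time_id. The data and the feature names `price_std` and `price_range` are invented, and whether these features actually help is untested here.

```python
import pandas as pd

# Invented bid prices for two time buckets; the second one moves much more.
toy_book = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11, 11],
    "bid_price1": [1.000, 1.002, 1.001, 1.000, 1.010, 0.995],
})

# One row per time_id: spread of the price and its max-min range.
features = toy_book.groupby("time_id")["bid_price1"].agg(
    price_std="std",
    price_range=lambda s: s.max() - s.min(),
)
print(features)
```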

### 5.2.2 An example to build intuition about volatility

Let's visualize the time_id at which stock_id=0 has its lowest volatility.

In [None]:
stock0 = train[train["stock_id"]==0]
min_index = stock0["target"].idxmin()
# idxmin returns an index label, so look the row up with .loc rather than .iloc
min_time_id = stock0.loc[min_index,"time_id"]
print("min time_id is",min_time_id,"min target is",stock0.loc[min_index,"target"])
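A note on this lookup: `idxmin` returns an index label, not a row position. For stock_id = 0 the two happen to coincide if its rows come first in train.csv, but for any other stock they would not. The sketch below shows the label-based lookup on invented data (`toy_train` is hypothetical).

```python
import pandas as pd

# Invented rows shaped like train.csv.
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 1, 1],
    "time_id": [5, 11, 5, 11],
    "target": [0.004, 0.001, 0.003, 0.0005],
})

# Filtering keeps the original index labels (2 and 3 for stock_id == 1).
stock1 = toy_train[toy_train["stock_id"] == 1]

# idxmin gives the label of the smallest target, so pair it with .loc.
label = stock1["target"].idxmin()
row = stock1.loc[label]
print(label, row["time_id"], row["target"])
```

Here `stock1` has only two rows, so a positional lookup with the same number would fall outside the frame.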

In [None]:
book_test_min = book_example[book_example["time_id"]==min_time_id]
trade_test_min = trade_example[trade_example["time_id"]==min_time_id]


plt.figure(figsize=(20,5))

for num,a in enumerate(samples):
    
   
    plt.plot(book_test_min["seconds_in_bucket"],book_test_min[a],label=a)
    
plt.plot(trade_test_min["seconds_in_bucket"],trade_test_min["price"],label="trade_parquet",lw=10)
plt.legend(fontsize=12)

On the other hand, let's visualize the time_id at which stock_id=0 has its highest volatility.

In [None]:
stock0 = train[train["stock_id"]==0]
max_index = stock0["target"].idxmax()
# idxmax returns an index label, so look the row up with .loc rather than .iloc
max_time_id = stock0.loc[max_index,"time_id"]
print("max time_id is",max_time_id,"max target is",stock0.loc[max_index,"target"])

In [None]:
book_test_max = book_example[book_example["time_id"]==max_time_id]
trade_test_max = trade_example[trade_example["time_id"]==max_time_id]


plt.figure(figsize=(20,5))

for num,a in enumerate(samples):
    
   
    plt.plot(book_test_max["seconds_in_bucket"],book_test_max[a],label=a)
    
plt.plot(trade_test_max["seconds_in_bucket"],trade_test_max["price"],label="trade_parquet",lw=10)
plt.legend(fontsize=12)

Since the vertical scales of the two plots are completely different, let's compare the actual trade prices (the purple lines) on a single plot.

In [None]:
plt.figure(figsize=(20,5))
plt.plot(trade_test_min["seconds_in_bucket"],trade_test_min["price"],lw=10,label="min_vol_time")
plt.plot(trade_test_max["seconds_in_bucket"],trade_test_max["price"],lw=10,label = "max_vol_time")
plt.legend(fontsize=15)

When volatility is high, as in this case, the price clearly fluctuates considerably within the 10 minutes.


(Variation and Max-Min seem to be important)


This is just one example to give an intuition for volatility.

# 6. Submit

## As a trial, fill in the median target value of each stock and submit

## 6.1 Creating a dictionary of median values for each stock

In [None]:
stock

In [None]:
stock2 = stock[["stock_id","median"]]
stock2 = stock2.set_index("stock_id")
stock2


In [None]:
stock_dict = stock2.to_dict()

# example: median target value for stock_id = 0
stock_dict["median"][0]

## 6.2 Fill in the sample file and generate the submission file

In [None]:
sample # sample_submission.csv

Extract stock_id from row_id

In [None]:
sample["stock_id"] = [s.split("-")[0] for s in sample["row_id"]]
sample

Look up the median for each stock_id in the dictionary

In [None]:
sample["target"] = [stock_dict["median"][int(s)] for s in sample["stock_id"]]
sample

Drop the helper stock_id column

In [None]:
sample = sample.drop("stock_id",axis=1)
sample

In [None]:
sample.to_csv("submission.csv",index=False)
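The same steps can be written without the intermediate dictionary or the helper column by mapping a per-stock median Series onto the stock ids. This is only an alternative sketch on invented data. `toy_train` and `toy_sample` are hypothetical stand-ins for train.csv and sample_submission.csv.

```python
import pandas as pd

# Invented training targets and a sample submission with empty targets.
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1],
    "target": [0.001, 0.003, 0.002, 0.010, 0.030, 0.020],
})
toy_sample = pd.DataFrame({"row_id": ["0-4", "0-32", "1-4"]})

# Median target per stock, indexed by stock_id.
median_by_stock = toy_train.groupby("stock_id")["target"].median()

# The part of row_id before the hyphen is the stock_id.
stock_ids = toy_sample["row_id"].str.split("-").str[0].astype(int)

# Look up each row's stock median; no helper column is left behind.
toy_sample["target"] = stock_ids.map(median_by_stock)
print(toy_sample)
```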

# Thank you for reading to the end.
  ※ Thank you to everyone who has upvoted my notebooks before.
# If you find it useful, I would be grateful if you could **upvote**!