The explanations of the market maker and the order book statistics are retained from the tutorial notebook for reference.

## A market maker provides liquidity by buying and selling in the market

A market maker is a firm, an individual, or a program (automated MM) that actively quotes two-sided markets in a security, providing bids and offers (also known as asks) along with the market size of each. Because a market maker shows both bid and offer orders, an order book with a market maker present is more liquid, which gives end investors a more efficient market in which to trade without worrying about execution.


# Order book statistics
There are many statistics that an Optiver data scientist can derive from raw order book data to reflect market liquidity and stock valuation. These stats have proven to be fundamental inputs to market prediction algorithms. Below we list some common ones to inspire Kagglers to mine more valuable signals from the order book data.

Let's come back to the original order book of stock A

**bid/ask spread**

Because different stocks trade at different price levels, we take the ratio of the best offer price to the best bid price to calculate the bid-ask spread.

The formula for the bid/ask spread can be written as:
$$BidAskSpread = BestOffer/BestBid -1$$
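
As a minimal sketch, the formula can be applied to every book update at once. The prices below are made up for illustration; only the column names `bid_price1` and `ask_price1` follow the competition's book data.

```python
import pandas as pd

# Toy book snapshot (illustrative values, not competition data)
toy_book = pd.DataFrame({
    "bid_price1": [99.0, 100.0, 100.0],
    "ask_price1": [101.0, 101.0, 100.5],
})

# BidAskSpread = BestOffer / BestBid - 1, evaluated per book update
toy_book["bid_ask_spread"] = toy_book["ask_price1"] / toy_book["bid_price1"] - 1
print(toy_book)
```

Because the spread is a ratio, it can be compared across stocks that trade at very different price levels.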

**Weighted averaged price**

The order book is also one of the primary sources for stock valuation. A fair book-based valuation must take two factors into account: the level and the size of orders. In this competition the weighted averaged price, or WAP, is used to calculate the instantaneous stock valuation, and the realized volatility target is calculated from it.

The formula for WAP can be written as below; it takes the top-level price and volume information into account:

$$ WAP = \frac{BidPrice_{1}*AskSize_{1} + AskPrice_{1}*BidSize_{1}}{BidSize_{1} + AskSize_{1}} $$
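
A small worked example may help to read the formula. The function name `weighted_average_price` and the quote values are assumptions chosen for illustration; they are not part of the competition data.

```python
def weighted_average_price(bid_price, bid_size, ask_price, ask_size):
    """Top-of-book WAP: each price is weighted by the size on the opposite side."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Illustrative quote with more size on the bid than on the ask
example_wap = weighted_average_price(bid_price=147.0, bid_size=251,
                                     ask_price=148.0, ask_size=221)
print(example_wap)
```

With more size resting on the bid, the WAP in this example lies above the mid price of 147.5, that is, closer to the ask.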


# Log returns

**How can we compare the price of a stock between yesterday and today?**

We can compare two price movements by dividing each move by the starting price of the stock, which gives the percentage change in price, also known as the **stock return**.

Taking the logarithm of the price ratio instead gives the **log return**, $r_{t_1, t_2} = \log(S_{t_2} / S_{t_1})$, where $S_t$ is the price of the stock at time $t$.

Log returns present several advantages, for example:
- they are additive across time $r_{t_1, t_2} + r_{t_2, t_3} = r_{t_1, t_3}$
- regular returns cannot go below -100%, while log returns are not bounded
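
The additivity property can be checked numerically. This is a small sketch on three made-up prices, with the log return taken as the logarithm of the ratio of two consecutive prices.

```python
import numpy as np

# Illustrative prices at times t1, t2, t3
prices = np.array([100.0, 102.0, 99.0])

simple_returns = prices[1:] / prices[:-1] - 1
log_returns = np.log(prices[1:] / prices[:-1])

# Return over the whole period t1 -> t3
total_simple_return = prices[-1] / prices[0] - 1
total_log_return = np.log(prices[-1] / prices[0])

# Log returns add up across time, simple returns do not
print(np.isclose(log_returns.sum(), total_log_return))
print(np.isclose(simple_returns.sum(), total_simple_return))
```

The first check holds and the second does not, which is why sums of log returns are convenient when working with many consecutive book updates.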

# Realized volatility
We compute the log returns over all consecutive book updates and define the **realized volatility, $\sigma$,** as the square root of the sum of squared log returns.
$$
\sigma = \sqrt{\sum_{t}r_{t-1, t}^2}
$$
where the **WAP** is used as the price of the stock to compute the log returns.
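
To make the definition concrete, here is a minimal sketch on four made-up WAP values. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative WAP values for four consecutive book updates (not competition data)
toy_wap = pd.Series([1.0000, 1.0002, 1.0001, 1.0004])

# Log return between consecutive updates; the first value is NaN and is dropped
toy_log_returns = np.log(toy_wap).diff().dropna()

# Realized volatility: square root of the sum of squared log returns
toy_sigma = np.sqrt((toy_log_returns ** 2).sum())
print(toy_sigma)
```

Rescaling all prices by a constant leaves the result unchanged, since log returns depend only on price ratios.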


# Competition data
In this competition, Kagglers are challenged to generate a series of short-term signals from the book and trade data of a fixed 10-minute window to predict the realized volatility of the next 10-minute window. As an avid option-trading enthusiast who has read many books on the topic, I am excited about this competition.

# So why calculate the realized volatility?

The realized volatility helps the market maker decide which option strategies to maintain. In the real market, an option trader must maintain a hedge in another instrument to reduce risk. These risks are calculated by people in the firm, and our predictions are meant to support them.

# Target Data is Realized Volatility:

The target, which is given in train/test.csv, can be linked with the raw order book/trade data by the same **time_id** and **stock_id**. There is no overlap between the feature and target window.

Some musings about the data and the steps to be taken... 
## What is the meaning of short term signals? 

## train.csv provides the realized volatility. The notes below use the WAP of the order book data to generate the same quantity and compare the two through the scores.

1) How might I improve on the predictions based on the WAP technique, and in turn get better R2 and RMSPE scores?

    Bring in the bid/ask spread and tie it in somehow to improve the prediction
    
    There are two levels of ask and bid prices, and the second-level prices can be used to create an additional WAP
    
    Using both WAPs, an aggregated WAP can be built, and from that a new target can be calculated
    
2) How might I use the "trade book" data to improve the prediction?

    Take the WAP of the trade book data, mix it with the B/A spread, and then generate the predictions
    
    The trade book is where the price of the stock finally got decided, so both the WAP and the actual traded price can be calculated there

3) How might I use test.csv, which contains only row_id and no volatility values?

4) How might I create signals from the realized volatility calculations?

    When the volatility falls below a certain threshold, give a buy or sell order
    
    A signal to a market maker is which side of the market to be on: the buy side or the sell side

Note that the competition data comes as partitioned parquet files. You can find a tutorial on handling parquet files in this [notebook](https://www.kaggle.com/sohier/working-with-parquet)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')
train.head()

Taking the first row of data, the realized vol of the **target bucket** for time_id 5, stock_id 0 is 0.004136. What do the book and trade data in the **feature bucket** look like for us to build signals?

In [None]:
book_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0')
trade_example = pd.read_parquet('../input/optiver-realized-volatility-prediction/trade_train.parquet/stock_id=0')
stock_id = '0'
# Keep a single time bucket; .copy() avoids SettingWithCopyWarning when columns are added later
book_example = book_example[book_example['time_id']==11].copy()
book_example['stock_id'] = stock_id
trade_example = trade_example[trade_example['time_id']==11].copy()
trade_example['stock_id'] = stock_id

**book data snapshot**

In [None]:
book_example.head()

In [None]:
# Bid/ask spread for one time_id: compute the spread on every book update, then average it.
# (min(ask)/max(bid) over the whole bucket would mix quotes from different moments.)
id0_BASpread = (book_example['ask_price1'] / book_example['bid_price1'] - 1).mean()
id0_BASpread1 = (book_example['ask_price2'] / book_example['bid_price2'] - 1).mean()
print("spread1:", id0_BASpread)
print("spread2:", id0_BASpread1)
# spread2 is wider than spread1, as expected, since level 2 holds the second-best bid and ask prices

**trade data snapshot**

In [None]:
trade_example.head()

**Realized volatility calculation in python**

## Our Objective

Our objective is to predict short-term realized volatility. Although the order book and trade data for the target window cannot be shared, we can still present the realized volatility calculation using the provided feature data.

## Tactics
As realized volatility is a statistical measure of price changes on a given stock, the plan is:

1) Calculate the WAP from both bid/ask levels, then use each WAP to calculate the realized volatility (RV).

2) Bring in the trade price data and calculate the RV of that data as well.

3) Finally, take the mean of the predictions to submit.

In [None]:
book_example['wap'] = (book_example['bid_price1'] * book_example['ask_size1'] +
                                book_example['ask_price1'] * book_example['bid_size1']) / (
                                       book_example['bid_size1']+ book_example['ask_size1'])

In [None]:
book_example['wap2'] = (book_example['bid_price2'] * book_example['ask_size2'] +
                                book_example['ask_price2'] * book_example['bid_size2']) / (
                                       book_example['bid_size2']+ book_example['ask_size2'])

**The WAP of the stock is plotted below**

In [None]:
fig, axs = plt.subplots(figsize=(16,10))
# Traded price against the level-1 WAP over the bucket
axs.plot(trade_example["seconds_in_bucket"], trade_example["price"], label="trade price")
axs.plot(book_example["seconds_in_bucket"], book_example["wap"], label="wap")
axs.legend()

In [None]:
fig, axs = plt.subplots(figsize=(16,10))
# Traded price against the level-2 WAP over the bucket
axs.plot(trade_example["seconds_in_bucket"], trade_example["price"], label="trade price")
axs.plot(book_example["seconds_in_bucket"], book_example["wap2"], label="wap2")
axs.legend()

In [None]:
import matplotlib.pyplot as plt
plt.scatter(book_example.wap,book_example.wap2)
print(book_example[['wap','wap2']].corr())

## Couple of Observations

`wap` and `wap2` show a correlation of about 52% with each other.

Trade price executions and the `wap` and `wap2` price transitions can be linked to generate signals.

This opens up the idea of using `wap2` and the traded prices to compute two more realized volatility values and check how they relate.

To compute the log return, we can simply take **the logarithm of the ratio** between two consecutive **WAP** values. The first row has an empty return because the previous book update is unknown, so that data point is dropped.

In [None]:
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff() 

In [None]:
book_example.loc[:,'log_return'] = log_return(book_example['wap'])
book_example.loc[:,'log_return2'] = log_return(book_example['wap2'])
trade_example.loc[:,'log_return'] = log_return(trade_example['price'])

In [None]:
book_example = book_example[~book_example['log_return'].isnull()]
book_example = book_example[~book_example['log_return2'].isnull()]
trade_example = trade_example[~trade_example['log_return'].isnull()]

In [None]:
print(book_example.shape)
print(trade_example.shape)
# The trade file has fewer rows because it holds actual traded prices, not a WAP estimated on every book update

**Let's plot the tick-to-tick return of this instrument over this time bucket**

In [None]:
fig, axs = plt.subplots(figsize=(16,10))
axs.plot(book_example["seconds_in_bucket"], book_example["log_return"], label="log_return")
axs.plot(book_example["seconds_in_bucket"], book_example["log_return2"], label="log_return2")
axs.legend()
# log_return2 looks more volatile than log_return

In [None]:
fig, axs = plt.subplots(figsize=(16,10), sharex=False)
line1 = axs.plot(trade_example["seconds_in_bucket"],trade_example["log_return"])
# Tick-to-tick log returns of the traded price

The realized vol of stock 0 in this feature bucket will be:

In [None]:
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

# Calculate the realized volatility of one time_id (time_id 11, selected above)
realized_vol = realized_volatility(book_example['log_return'])
realized_vol_2 = realized_volatility(book_example['log_return2'])
realized_vol_trade = realized_volatility(trade_example['log_return'])

print(f'Realized volatility_1 for stock_id 0 on time_id 11 is {realized_vol}')
print(f'Realized volatility_2 for stock_id 0 on time_id 11 is {realized_vol_2}')
print(f'Trade Realized volatility for stock_id 0 on time_id 11 is {realized_vol_trade}')

The realized volatility calculated from the trade prices is markedly lower than the one calculated from the WAP.

What might cause such a reduction?

The log-return plots show how the volatility differs between the best and second-best bid/ask levels.

# Improving on the naive prediction: using past-period realized volatility calculated from both bid/ask levels as predictions, and aggregating them

A commonly known fact about volatility is that it tends to be autocorrelated. We can use this property to implement a naive model that just "predicts" realized volatility by using whatever the realized volatility was in the initial 10 minutes.

Let's calculate the past realized volatility across the training set to see how predictive a single naive signal can be.

In [None]:
import os
from sklearn.metrics import r2_score
import glob
list_order_book_file_train = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/*')
list_trade_book_file_train = glob.glob('/kaggle/input/optiver-realized-volatility-prediction/trade_train.parquet/*')

In [None]:
print(len(list_order_book_file_train))
print(len(list_trade_book_file_train))
# Both training parquet datasets contain 112 stock_id partitions.
# The book files hold 2 levels of bid/ask prices and sizes.
# The trade files hold only traded prices.

As the data is partitioned by stock_id in this competition to let Kagglers manage memory better, we calculate the realized volatility stock by stock and combine the results into one submission file. Note that the stock id, being the partition column, is not present when we load a single file, so we will remedy that manually. We will reuse the log return and realized volatility functions defined in the previous section.
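
Since each partition directory is named `stock_id=<n>`, the id can be recovered from the path itself. The sketch below shows the idea on an example path; the helper name `stock_id_from_partition` is hypothetical and is not used elsewhere in this notebook.

```python
import os

def stock_id_from_partition(path):
    """Return the integer stock id encoded in a `stock_id=<n>` partition path."""
    # Take the last path component, tolerating a trailing slash
    partition_dir = os.path.basename(os.path.normpath(path))
    key, _, value = partition_dir.partition("=")
    if key != "stock_id":
        raise ValueError(f"not a stock_id partition: {path}")
    return int(value)

example_path = "/kaggle/input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0"
example_stock_id = stock_id_from_partition(example_path)

# row_id has the form "<stock_id>-<time_id>", here for time_id 5
example_row_id = f"{example_stock_id}-{5}"
print(example_row_id)
```

Checking the `stock_id` key explicitly makes the helper fail loudly on an unexpected path, whereas a bare split on `=` would silently return whatever follows the first `=`.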

In [None]:
def realized_volatility_per_time_id(file_path, prediction_column_name):
    df_book_data = pd.read_parquet(file_path)
    #Reads the book order data, for stock id
    df_book_data['wap'] =(df_book_data['bid_price1'] * df_book_data['ask_size1']+df_book_data['ask_price1'] * df_book_data['bid_size1'])  / (
                                      df_book_data['bid_size1']+ df_book_data[
                                  'ask_size1'])
    # Log returns are computed within each time_id, since volatility is required per time_id.
    # transform keeps the result aligned with the original rows regardless of pandas version.
    df_book_data['log_return'] = df_book_data.groupby(['time_id'])['wap'].transform(log_return)
    df_book_data = df_book_data[~df_book_data['log_return'].isnull()]
    
    # Aggregate the log returns of each time_id with the realized volatility function.
    df_realized_vol_per_stock =  pd.DataFrame(df_book_data.groupby(['time_id'])['log_return'].agg(realized_volatility)).reset_index()
    
    df_realized_vol_per_stock = df_realized_vol_per_stock.rename(columns = {'log_return':prediction_column_name})
    #here we get the id of the stock as a number
    stock_id = file_path.split('=')[1]
    
    df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    return df_realized_vol_per_stock[['row_id',prediction_column_name]]

In [None]:
def realized_volatility_2_per_time_id(file_path, prediction_column_name):
    df_book_data = pd.read_parquet(file_path)
    #Reads the book order data, for stock id
    df_book_data['wap2'] =(df_book_data['bid_price2'] * df_book_data['ask_size2']+df_book_data['ask_price2'] * df_book_data['bid_size2'])  / (
                                      df_book_data['bid_size2']+ df_book_data[
                                  'ask_size2'])
    # Log returns of the level-2 WAP are computed within each time_id, since volatility is required per time_id.
    # transform keeps the result aligned with the original rows regardless of pandas version.
    df_book_data['log_return2'] = df_book_data.groupby(['time_id'])['wap2'].transform(log_return)
    df_book_data = df_book_data[~df_book_data['log_return2'].isnull()]
    
    # Aggregate the log returns of each time_id with the realized volatility function.
    df_realized_vol_per_stock =  pd.DataFrame(df_book_data.groupby(['time_id'])['log_return2'].agg(realized_volatility)).reset_index()
    
    df_realized_vol_per_stock = df_realized_vol_per_stock.rename(columns = {'log_return2':prediction_column_name})
    #here we get the id of the stock as a number
    stock_id = file_path.split('=')[1]
    
    df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    return df_realized_vol_per_stock[['row_id',prediction_column_name]]

In [None]:
def realized_volatility_trade_per_time_id(file_path, prediction_column_name):
    df_trade_data = pd.read_parquet(file_path)
    # df_trade_data now holds the trade data for one stock_id partition.
    # Log returns of the traded price within each time_id; transform keeps the original row alignment.
    df_trade_data['log_return'] = df_trade_data.groupby(['time_id'])['price'].transform(log_return)
    df_trade_data = df_trade_data[~df_trade_data['log_return'].isnull()]
    
    # Aggregate the log returns of each time_id with the realized volatility function.
    df_realized_vol_per_stock =  pd.DataFrame(df_trade_data.groupby(['time_id'])['log_return'].agg(realized_volatility)).reset_index()
    
    df_realized_vol_per_stock = df_realized_vol_per_stock.rename(columns = {'log_return':prediction_column_name})
    #here we get the id of the stock as a number
    stock_id = file_path.split('=')[1]
    
    df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock['time_id'].apply(lambda x:f'{stock_id}-{x}')
    return df_realized_vol_per_stock[['row_id',prediction_column_name]]

Looping through the individual stocks, we get the past realized volatility as the prediction for each one.

In [None]:
def past_realized_volatility_per_stock(list_file,prediction_column_name):
    df_past_realized = pd.DataFrame()
    for file in list_file:
        df_past_realized = pd.concat([df_past_realized,
                                     realized_volatility_per_time_id(file,prediction_column_name)])
    return df_past_realized

def past_realized_volatility_per_stock_2(list_file,prediction_column_name):
    df_past_realized = pd.DataFrame()
    for file in list_file:
        df_past_realized = pd.concat([df_past_realized,
                                     realized_volatility_2_per_time_id(file,prediction_column_name)])
    return df_past_realized

In [None]:
def past_order_realized_volatility_per_stock(list_file,prediction_column_name):
    df_past_realized = pd.DataFrame()
    for file in list_file:
        df_past_realized = pd.concat([df_past_realized,
                                     realized_volatility_trade_per_time_id(file,prediction_column_name)])
    return df_past_realized

In [None]:
df_past_realized_train_1 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
                                                           prediction_column_name='pred')
df_past_realized_train_2 = past_realized_volatility_per_stock_2(list_file=list_order_book_file_train,
                                                           prediction_column_name='pred2')

In [None]:
df_past_realized_train_3 = past_order_realized_volatility_per_stock(list_file=list_trade_book_file_train,
                                                           prediction_column_name='trade_Pred')

In [None]:
print(df_past_realized_train_1.info())
print(df_past_realized_train_2.info())
print(df_past_realized_train_3.info())

Let's join the output dataframes with train.csv to see the performance of the naive prediction on the training set.

In [None]:
train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
train = train[['row_id','target']]
df_joined_1 = train.merge(df_past_realized_train_1[['row_id','pred']], on = ['row_id'], how = 'left')
df_joined_2 = df_joined_1.merge(df_past_realized_train_2[['row_id','pred2']], on = ['row_id'], how = 'left')
df_joined_3 = df_joined_2.merge(df_past_realized_train_3[['row_id','trade_Pred']], on = ['row_id'], how = 'left')

In [None]:
print(df_joined_3)
print(df_joined_3.shape)

In [None]:
df_joined_3['mean_pred'] = (df_joined_3.pred + df_joined_3.pred2 + df_joined_3.trade_Pred)/3

In [None]:
print(df_joined_3.mean_pred.isnull().sum())
df_joined_3.fillna(0,inplace=True)

# Trade-based predictions exist for only 428,852 rows, while the order book predictions cover 428,932 rows: a gap of 80 rows.
# fillna(0) sets mean_pred to 0 for those 80 rows. Each of them then has a relative error of 1 in RMSPE, which is small next to 428,932 rows.
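
An alternative to filling with 0 is to average only the predictions that exist for a row. The sketch below illustrates this on made-up numbers; it is not what the cells above do, and the column names simply mirror `pred`, `pred2` and `trade_Pred`.

```python
import numpy as np
import pandas as pd

# Toy predictions (illustrative values); the second row has no trade-based prediction
toy_preds = pd.DataFrame({
    "pred": [0.004, 0.006],
    "pred2": [0.005, 0.008],
    "trade_Pred": [0.003, np.nan],
})

# Column arithmetic propagates NaN into the mean
naive_mean = (toy_preds.pred + toy_preds.pred2 + toy_preds.trade_Pred) / 3

# DataFrame.mean(axis=1) skips NaN by default, so the row falls back to the two book-based values
robust_mean = toy_preds[["pred", "pred2", "trade_Pred"]].mean(axis=1)

print(naive_mean)
print(robust_mean)
```

With the row-wise mean, a missing trade prediction no longer turns the whole row into a zero-volatility forecast.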

We will evaluate the naive prediction result by two metrics: RMSPE and R squared. 
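
For reference, the two metrics can be written as follows. The notation is introduced here for illustration: $y_i$ is the target, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the targets and $n$ the number of rows.

$$ RMSPE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2} $$

$$ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} $$

Because RMSPE divides each error by the target, a prediction of 0 always gives a relative error of exactly 1 for that row, whatever the size of the target.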

In [None]:
from sklearn.metrics import r2_score

def rmspe(y_true, y_pred):
    return  (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

R2_i = round(r2_score(y_true = df_joined_3['target'], y_pred = df_joined_3['pred']),3)
RMSPE_i = round(rmspe(y_true = df_joined_3['target'], y_pred = df_joined_3['pred']),3)

R2 = round(r2_score(y_true = df_joined_3['target'], y_pred = df_joined_3['mean_pred']),3)
RMSPE = round(rmspe(y_true = df_joined_3['target'], y_pred = df_joined_3['mean_pred']),3)

print(f'Performance of the naive prediction: R2 score: {R2_i}, RMSPE: {RMSPE_i}')

print(f'Performance of the updated prediction: R2 score: {R2}, RMSPE: {RMSPE}')

The updated prediction performs worse than the original baseline, so a different method is needed.

# Submission

In [None]:
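# Note: df_joined_3 is built from train.csv rows, so this file holds predictions for the training rows.
# A real submission needs the same calculation run on the test book and trade files.
df_improved_naive_pred_test = df_joined_3[['row_id','mean_pred']].rename(columns = {'mean_pred':"target"})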

In [None]:
df_improved_naive_pred_test.to_csv('submission.csv',index = False)