> NOTE: This is work-in-progress
I used the [notebook](https://www.kaggle.com/code/chiangken/introduction-and-explore-data-analysis) by [Ken Chiang](https://www.kaggle.com/chiangken) to augment the background information. 

---
# Background information

[__Auction__](https://www.londonstockexchange.com/discover/news-and-insights/what-auction) 
is the point in time at which __regular trading__, i.e., the continuous matching of _sell_ and _buy_ orders, is paused.  
Orders are collected from the market during the __call period__.  
During the __call period__ an algorithm calculates the price at which the __maximum__ number of shares can be executed.  
These auction prices describe the state of the market and are used for portfolio evaluation.  
During the __call period__ orders can be entered, modified or cancelled.  
At the end of the __call period__, orders that can be matched are executed in an event called __uncrossing__. The number of shares that cannot be matched defines the __imbalance__. Depending on which side is left unmatched, the __imbalance__ is positive (buy-sided), negative (sell-sided), or 0 if neither.  

During the closing cross, the __traditional order book__ data is merged with the __auction book__ data.  
At this time the bid and ask prices _overlap_.  


Auctions help to concentrate liquidity at a specific time and to evaluate portfolio values and indexes.   
Auctions take place on an __electronic order book__ that automatically matches _buy_ and _sell_ orders.  

In the __Order Book__: 
- `Bid Price`: price that _buyer_ wants to buy.
- `Ask Price`: price that _seller_ wants to sell.
- `Bid Size`: amount of shares that _buyer_ wants to buy.
- `Ask Size`: amount of shares that _seller_ wants to sell.

__Note__: Ask Price $\geq$ Bid Price.

__Order book__ data reflect market __liquidity__ and __stock valuation__.

Almost $10\%$ of Nasdaq’s average daily volume is traded in the [__Nasdaq Closing Cross__](https://www.nasdaqtrader.com/content/productsservices/Trading/ClosingCrossfaq.pdf).  

The weighted average price (WAP) is an asset price that accounts for both the price level and the size of orders.  
Given the same bid and ask prices, stock valuation is _lower_ if 
- the number of shares offered for sale (_Ask Size_) exceeds the number of shares bid for (_Bid Size_). 
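
To make the effect of order sizes concrete, here is a minimal sketch of a top-of-book WAP. It assumes the common definition in which each price is weighted by the size on the _opposite_ side of the book; the function name `wap` and the example numbers are illustrative and do not come from the dataset.

```python
def wap(bid_price, bid_size, ask_price, ask_size):
    """Top-of-book weighted average price (assumed definition).

    Each price is weighted by the size on the opposite side, so a large
    bid size pulls the WAP towards the ask price and vice versa.
    """
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)


# Same prices, different sizes (illustrative numbers only).
more_buyers = wap(bid_price=99.0, bid_size=300, ask_price=101.0, ask_size=100)
more_sellers = wap(bid_price=99.0, bid_size=100, ask_price=101.0, ask_size=300)

print(more_buyers, more_sellers)
```

With identical prices, the book with more shares on the ask side ends up with the lower WAP under this definition.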

### Uncrossing and [_imbalance_](https://www.kaggle.com/code/chiangken/introduction-and-explore-data-analysis) 

Consider the following _closing auction order book_ as it is _uncrossed_: 

| Buy (bid) size | $P$, price | Sell (ask) size |
| --- | --- | --- |
| 0 | 20 | 1 |
| 3 | 19 | 2 |
| 4 | 18 | 4 |

At each candidate price, buyers bidding at that price or higher and sellers asking at that price or lower can trade, so the sizes accumulate:   
- $n=0$ shares matched at $P=20$: no buyer bids this high. 
- $n=3$ shares matched at $P=19$: $3$ shares are bid, while $2+4=6$ shares are offered at $P\leq19$. 
- $n=4$ shares matched at $P=18$: $3+4=7$ shares are bid at $P\geq18$, while $4$ shares are offered. 

`Uncross price` is the price that maximises $n$. Here it is $P=18$.  
`Matched size` is the maximum $n$, reached at the uncross price. Here it is $n=4$.  
`Imbalance` is the number of _unmatched shares_ at the _uncross price_. Here there are $4+3=7$ shares bid and $4$ shares offered at this $P$. Thus, the imbalance is $3$, __buy-sided__.  
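
The uncrossing of the small book above can be reproduced with a short sketch. This is only an illustration of the "maximise matched shares" idea, not the exchange's actual algorithm: the function name `uncross` and the tie-breaking rule (keep the first price found) are assumptions made here.

```python
def uncross(book):
    """Find the price that maximises the matched size.

    `book` is a list of (bid_size, price, ask_size) rows.
    Buyers accept any price at or below their bid, sellers any price
    at or above their ask, so sizes are accumulated across levels.
    """
    best = None
    for _, price, _ in book:
        demand = sum(b for b, p, _ in book if p >= price)  # cumulative bids
        supply = sum(a for _, p, a in book if p <= price)  # cumulative asks
        matched = min(demand, supply)
        # Assumption: on ties the first price encountered is kept.
        if best is None or matched > best["matched_size"]:
            best = {
                "uncross_price": price,
                "matched_size": matched,
                "imbalance": demand - supply,  # > 0 buy-sided, < 0 sell-sided
            }
    return best


book = [(0, 20, 1), (3, 19, 2), (4, 18, 4)]
result = uncross(book)
print(result)
```

The sign of `imbalance` follows the convention used above: positive for buy-sided, negative for sell-sided.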

`Far price` is the price that would __maximize__ the number of shares matched based on __auction orders__ only. This is the _hypothetical uncross price_ if the auction were to end at this moment.  
`Near price` is the price that would __maximize__ the number of shares matched based on both auction orders and __continuous market orders__. In this dataset it is given only from 300 seconds after the start. 


---
# EDA notes  

The dataset contains historic data for the daily ten-minute closing auction on the NASDAQ stock exchange.


The additional complexity of this project is that it is not a simple time-series forecasting problem. Instead, there is one time series per date (so a series of time series) for each of a large number of stocks.  

- $4$ types of IDs with no missing/NaN values
- No duplicated rows
- 'far_price' and 'near_price' have $\sim55\%$ of values missing. They might need to be dropped
- Axis data: 'date_id', 'seconds_in_bucket', 'stock_id', which describe 
    - a given stock  
    - on a given date
    - at a given second during the closing auction  
- __Note__: The product of the numbers of unique 'date_id', 'seconds_in_bucket' and 'stock_id' values, $5291000$, _is not equal_ to the number of unique rows, $5237980$. Open question: why? Missing data?
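
A quick arithmetic check on the counts reported in this notebook (481 dates, 55 time steps, 200 stocks, 5237980 rows) may help with the open question. It is only a sketch: the assumption that `seconds_in_bucket` runs over 0, 10, ..., 540 is mine, and an evenly divisible gap is merely consistent with whole stock-days being absent; it does not prove it.

```python
# Counts taken from the notebook output.
n_dates, n_seconds, n_stocks = 481, 55, 200
n_rows = 5_237_980

full_grid = n_dates * n_seconds * n_stocks  # rows if nothing were missing
gap = full_grid - n_rows                    # rows absent from the grid

# If whole (date, stock) sessions are missing, the gap should be a
# multiple of the number of time steps per session.
stock_days, remainder = divmod(gap, n_seconds)

# Assumption: time steps are 0, 10, ..., 540 seconds, so the steps
# before 300 s (where near_price is not given) are 0, 10, ..., 290.
steps_before_300 = len(range(0, 300, 10))
early_fraction = steps_before_300 / n_seconds

print(full_grid, gap, stock_days, remainder, round(early_fraction, 4))
```

The share of time steps before 300 seconds comes out close to the missing-value percentage of 'near_price', which suggests those NaNs are structural rather than data errors.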

### Target variable
The target is a difference in WAP ratios: the WAP ratio of a custom weighted Nasdaq index is subtracted from the WAP ratio of the stock, and the result is expressed in basis points
$$
\text{target} = \Bigg( \frac{\text{stock}_{t+60}}{\text{stock}_t} - \frac{\text{idx}_{t+60}}{\text{idx}_t} \Bigg) \times 10^4
$$
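
The formula can be sketched in a few lines of NumPy. Note the assumptions: the index WAP is not a column of the dataset, so `index_wap` is a hypothetical input; `horizon=6` assumes 10-second time steps (60 seconds ahead); and `scale` is left as a parameter, with basis points ($10^4$) assumed as the default.

```python
import numpy as np


def wap_target(stock_wap, index_wap, horizon=6, scale=1e4):
    """Stock WAP return minus index WAP return over `horizon` steps.

    `horizon` must be a positive integer. The last `horizon` entries
    are NaN because the future WAP is unknown there.
    """
    s = np.asarray(stock_wap, dtype=float)
    i = np.asarray(index_wap, dtype=float)
    target = np.full(s.shape, np.nan)
    target[:-horizon] = (s[horizon:] / s[:-horizon]
                         - i[horizon:] / i[:-horizon]) * scale
    return target


# Toy series (illustrative values, not from the dataset).
stock = [1.0, 1.0, 1.001, 1.002]
index = [1.0, 1.0, 1.0, 1.001]
toy_target = wap_target(stock, index, horizon=2)
print(toy_target)
```

A positive value means the stock outperformed the index over the horizon, a negative value that it underperformed.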

In [None]:
from IPython.display import display_html, clear_output, Markdown
from gc import collect
from copy import deepcopy
import pandas as pd
import numpy as np

from warnings import filterwarnings
filterwarnings('ignore')

from tqdm.notebook import tqdm

In [72]:
# load data (takes a while)
df = pd.read_csv("./train.csv")
df.head()


Unnamed: 0,stock_id,date_id,seconds_in_bucket,imbalance_size,imbalance_buy_sell_flag,reference_price,matched_size,far_price,near_price,bid_price,bid_size,ask_price,ask_size,wap,target,time_id,row_id
0,0,0,0,3180602.69,1,0.999812,13380276.64,,,0.999812,60651.5,1.000026,8493.03,1.0,-3.029704,0,0_0_0
1,1,0,0,166603.91,-1,0.999896,1642214.25,,,0.999896,3233.04,1.00066,20605.09,1.0,-5.519986,0,0_0_1
2,2,0,0,302879.87,-1,0.999561,1819368.03,,,0.999403,37956.0,1.000298,18995.0,1.0,-8.38995,0,0_0_2
3,3,0,0,11917682.27,-1,1.000171,18389745.62,,,0.999999,2324.9,1.000214,479032.4,1.0,-4.0102,0,0_0_3
4,4,0,0,447549.96,-1,0.999532,17860614.95,,,0.999394,16485.54,1.000016,434.1,1.0,-7.349849,0,0_0_4


In [45]:
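# check duplicates and missing values
print("--- Duplicated rows ---")
print(df.duplicated().sum())
print("--- Fraction of NaNs per column ---")
# print explicitly: a bare expression in the middle of a cell is not displayed
print(df.isnull().sum()/len(df))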

# check df properties
def analyze_df(df : pd.DataFrame)->pd.DataFrame:
    res = pd.DataFrame({
        "is_unique": df.nunique() == len(df),
        "unique": df.nunique(),
        "with_nan":df.isna().any(),
        "percent_nan":round((df.isnull().sum()/len(df))*100,4),
        "dtype":df.dtypes
    })
    return res
analyze_df(df=df)

Unnamed: 0,is_unique,unique,with_nan,percent_nan,dtype
stock_id,False,200,False,0.0,int64
date_id,False,481,False,0.0,int64
seconds_in_bucket,False,55,False,0.0,int64
imbalance_size,False,2971863,True,0.0042,float64
imbalance_buy_sell_flag,False,3,False,0.0,int64
reference_price,False,28741,True,0.0042,float64
matched_size,False,2948862,True,0.0042,float64
far_price,False,95739,True,55.2568,float64
near_price,False,84625,True,54.5474,float64
bid_price,False,28313,True,0.0042,float64


In [41]:
# examine axis values (ids and time)
for key in ['stock_id','date_id','time_id','row_id']:
    print(f"{key} n_unique={df[key].nunique()}")
# size of the full date x second x stock grid, if no combination were missing
n = df['date_id'].nunique()*df['seconds_in_bucket'].nunique()*df['stock_id'].nunique()
print(f"unique date_id*seconds_in_bucket*stock_id={n}")
df[(df['date_id']==0)][['stock_id','date_id','seconds_in_bucket','row_id']]

stock_id n_unique=200
date_id n_unique=481
time_id n_unique=26455
row_id n_unique=5237980
unique date_id*seconds_in_bucket*stock_id=5291000


In [71]:
# Look at the missing data: list stocks that have no row at the first
# time step (seconds_in_bucket == 0) of the first date (date_id == 0)
first_step = df[(df['date_id']==0)&(df['seconds_in_bucket']==0)]
for stock in df['stock_id'].unique():
    datas = first_step[first_step['stock_id']==stock]
    if (len(datas)!=1):
        print(f"stock={stock} datas={len(datas)}")

stock=78 datas=0
stock=69 datas=0
stock=156 datas=0
stock=150 datas=0
stock=153 datas=0
stock=199 datas=0
stock=79 datas=0
stock=135 datas=0
stock=102 datas=0
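
The loop above checks a single date. As a possible extension, the sketch below lists every (date, stock) pair that has no rows at all, by comparing the observed pairs with the full grid. The helper name `missing_stock_days` and the toy frame are illustrative; on the real data one would pass `df` instead.

```python
import pandas as pd


def missing_stock_days(frame):
    """Return the (date_id, stock_id) pairs that never occur in `frame`."""
    observed = pd.MultiIndex.from_frame(
        frame[["date_id", "stock_id"]].drop_duplicates()
    )
    full = pd.MultiIndex.from_product(
        [sorted(frame["date_id"].unique()), sorted(frame["stock_id"].unique())],
        names=["date_id", "stock_id"],
    )
    missing = full.difference(observed)
    return pd.DataFrame(list(missing), columns=["date_id", "stock_id"])


# Toy example: stock 2 has no rows on date 0.
toy = pd.DataFrame({
    "date_id":  [0, 0, 1, 1, 1],
    "stock_id": [0, 1, 0, 1, 2],
})
toy_missing = missing_stock_days(toy)
print(toy_missing)
```

Counting the rows of the result on the full dataset would show whether whole stock-days account for the gap between the grid size and the row count.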


# Plot general time-series

In [149]:
# plotting 
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [150]:
def plot_stocks2(df_stock : pd.DataFrame):
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df_stock["time_id"], y=df_stock["wap"], line=dict(color="blue"), name="wap", opacity=.5, yaxis="y2"
    ))
    fig.add_trace(go.Scatter(
        x=df_stock["time_id"], y=df_stock["target"], line=dict(color="red"), name="target", opacity=.5
    ))
    fig.update_xaxes(title_text="time", type="linear")
    fig.update_yaxes(title_text="target", type="linear")
    fig.update_layout(title="overview target vs wap",
                    showlegend=True,
                    width=1000,
                    height=400,
                    margin=dict(l=40,r=40,t=40,b=20),
                    yaxis2 = dict(title="wap",overlaying="y",side="right"))
    return fig
plot_stocks2(df_stock=df[df["stock_id"]==0]).show()
print("On a small scale the WAP shows quasi-periodicity")

On a small scale the WAP shows quasi-periodicity


In [155]:
def plot_stocks3(df_stock : pd.DataFrame):
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x = df_stock['time_id'], 
        y = df_stock['wap'], 
        name = 'wap',
        line = dict(color = 'gray'),
        opacity=.5
        ))

    fig.add_trace(go.Scatter(
        x = df_stock['time_id'], 
        y = df_stock['ask_price'], 
        name = 'ask price',
        line = dict(color = 'blue'),
        opacity=.5
        ))

    fig.add_trace(go.Scatter(
        x = df_stock['time_id'], 
        y = df_stock['bid_price'], 
        name = 'bid price',
        line = dict(color = 'red'),
        opacity=.5
        ))
    fig.update_xaxes(title_text="time", type="linear")
    # all three traces are prices, so a single y axis is enough
    fig.update_yaxes(title_text="price", type="linear")
    fig.update_layout(title="overview wap vs bid and ask price",
                    showlegend=True,
                    width=1000,
                    height=400,
                    margin=dict(l=40,r=40,t=40,b=20))
    return fig
plot_stocks3(df_stock=df[df["stock_id"]==0]).show()
print("The WAP lies between the ask and bid prices")

The WAP lies between the ask and bid prices


# Plot time-series for closing cross (on a single date)

In [161]:
def plot_stocks4(df_stock : pd.DataFrame, vx = 'seconds_in_bucket'):
    fig = go.Figure()
    fig.add_trace( go.Scatter(
        x = df_stock[df_stock['date_id']==0][vx], 
        y = df_stock[df_stock['date_id']==0]['wap'], 
        name = 'wap',
        line = dict(color = 'gray'),
        opacity=.5
    ))

    fig.add_trace( go.Scatter(
        x = df_stock[df_stock['date_id']==0][vx], 
        y = df_stock[df_stock['date_id']==0]['near_price'], 
        name = 'near price',
        line = dict(color = 'blue'),
        #yaxis = "y2",
        opacity=.5
    ))

    fig.add_trace( go.Scatter(
        x = df_stock[df_stock['date_id']==0][vx], 
        y = df_stock[df_stock['date_id']==0]['far_price'], 
        name = 'far price',
        line = dict(color = 'red'),
        #yaxis = "y2",
        opacity=.5
    ))

    fig.update_xaxes(title_text=vx, type="linear")
    fig.update_yaxes(title_text="price", type="linear")
    fig.update_layout(
        title="overview wap vs near and far price (date_id = 0)",
        showlegend=True,
        width=1000,
        height=400,
        margin=dict(l=40,r=40,t=40,b=20),
        # yaxis2 = dict(title="Near Price/Far Price",overlaying="y",side="right")
    )
    return fig
plot_stocks4(df_stock=df[df["stock_id"]==0]).show()
print("The WAP lies between the ask and bid prices")

The WAP lies between the ask and bid prices


# The a