## [Optiver Realized Volatility Prediction competition](https://www.kaggle.com/c/optiver-realized-volatility-prediction/)
## The Second Update: with and without the noise.

I shall not go into details concerning motivation for creating this notebook; I shall simply refer the interested reader to the following topics:
* [Second leaderboard update is complete [out of date]](https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/282791)
* [Second leaderboard update will be rerun without added noise](https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/284121)
* [Revised second rerun is posted](https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/284665)

###  The "Top 20"

In [None]:
import numpy  as np 
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.express as px
import seaborn as sns
sns.set(style="darkgrid")
sns.set_palette("bright")

csv_0 = pd.read_csv('../input/optiver-public-leaderboard-files/optiver-realized-volatility-prediction-publicleaderboard_00.csv',index_col=0)
csv_1 = pd.read_csv('../input/optiver-public-leaderboard-files/optiver-realized-volatility-prediction-publicleaderboard_01.csv',index_col=0)
csv_2v1 = pd.read_csv('../input/optiver-public-leaderboard-files/optiver-realized-volatility-prediction-publicleaderboard_02.csv',index_col=0)
csv_2 = pd.read_csv('../input/optiver-public-leaderboard-files/optiver-realized-volatility-prediction-publicleaderboard_02_v2.csv',index_col=0)

# create the positions (this assumes the rows of each file are already in leaderboard order)
csv_0.insert(loc=2, column='position 0', value=np.arange(start=1, stop=(len(csv_0)+1) ))
csv_1.insert(loc=2, column='position 1', value=np.arange(start=1, stop=(len(csv_1)+1) ))
csv_2v1.insert(loc=2, column='position 2v1', value=np.arange(start=1, stop=(len(csv_2v1)+1) ))
csv_2.insert(loc=2, column='position 2v2', value=np.arange(start=1, stop=(len(csv_2)+1) ))

csv_0 = csv_0.drop([           'SubmissionDate'], axis='columns')
csv_1 = csv_1.drop(['TeamName','SubmissionDate'], axis='columns')
csv_2v1 = csv_2v1.drop(['TeamName','SubmissionDate'], axis='columns')
csv_2 = csv_2.drop(['TeamName','SubmissionDate'], axis='columns')

csv_0.rename(columns={"Score":"score 0"},inplace=True)
csv_1.rename(columns={"Score":"score 1"},inplace=True)
csv_2v1.rename(columns={"Score":"score 2v1"},inplace=True)
csv_2.rename(columns={"Score":"score 2v2"},inplace=True)

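# join on the shared index (the first CSV column); teams absent from a later file get NaN
merged_df = csv_0.join(csv_1).join(csv_2v1).join(csv_2)

# absolute change between the second update with noise (2v1) and without noise (2v2)
merged_df["delta score"] = abs(merged_df["score 2v1"] - merged_df["score 2v2"])
merged_df["delta position"] = abs(merged_df["position 2v1"] - merged_df["position 2v2"])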

# Take a look at the top 20 for the second update
cols = ["TeamName","position 2v1","score 2v1","position 2v2","score 2v2","delta score","delta position"]
merged_df[cols].set_index("TeamName").sort_values(by='score 2v2', ascending=True).head(20)

The $\Delta$ thresholds that I use here are somewhat arbitrary (if you wish to change them, please feel free to fork this notebook and rerun it, or make a suggestion in the comments section and I will re-run this script). They were chosen by looking at one of the [isoscore strings](https://www.kaggle.com/carlmcbrideellis/shakeup-scatterplots-boxes-strings-and-things) at the bottom of the leaderboard, under the assumption that the underlying notebook was probably an unsophisticated '*naive*' submission and would be little affected by the random noise modification. That string showed a $\Delta$(position) of 33 and a $\Delta$(score) of less than 0.002. I have also arbitrarily added some margin to both of these values.
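As a rough illustration of that reasoning, here is a minimal sketch that measures how far one isoscore string moved between the two versions of the second update. The toy data, its values, and the helper name `isoscore_shift` are invented for this example; only the column names follow `merged_df`. Treating "the largest group of teams sharing exactly the same noise-free score" as the string is an assumption of the sketch, not necessarily how the thresholds above were read off.

```python
import pandas as pd

# Toy stand-in for merged_df: the numbers are invented, only the column names match.
toy = pd.DataFrame(
    {
        "score 2v2": [0.2000, 0.2040, 0.2500, 0.2500, 0.2500, 0.3010],     # without noise
        "position 2v2": [1, 2, 3, 4, 5, 6],
        "score 2v1": [0.2010, 0.2050, 0.2504, 0.2512, 0.2498, 0.3005],     # with noise
        "position 2v1": [1, 2, 4, 5, 3, 6],
    }
)

def isoscore_shift(df):
    """Size of the largest isoscore string and its largest position/score shift."""
    # Assumption: the string is the most common noise-free score.
    string_score = df["score 2v2"].value_counts().idxmax()
    string = df[df["score 2v2"] == string_score]
    delta_position = (string["position 2v1"] - string["position 2v2"]).abs()
    delta_score = (string["score 2v1"] - string["score 2v2"]).abs()
    return len(string), int(delta_position.max()), float(delta_score.max())

n_teams, max_delta_position, max_delta_score = isoscore_shift(toy)
print(n_teams, max_delta_position, round(max_delta_score, 4))
```

In this toy case the three teams sharing a score of 0.25 are reshuffled among themselves by the noise, so their largest shift gives a floor below which a shakeup says little about the model itself.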

* Number of teams with position shakeups:

In [None]:
print("Number of teams with a shakeup of over   50: ",merged_df[merged_df['delta position'] > 50].shape[0] )
print("Number of teams with a shakeup of over  100: ",merged_df[merged_df['delta position'] > 100].shape[0] )
print("Number of teams with a shakeup of over  200: ",merged_df[merged_df['delta position'] > 200].shape[0] )
print("Number of teams with a shakeup of over  500: ",merged_df[merged_df['delta position'] > 500].shape[0] )
print("Number of teams with a shakeup of over 1000: ",merged_df[merged_df['delta position'] > 1000].shape[0] )

We can plot this as an empirical cumulative distribution function (ECDF), shown here only for leaderboard position shakeups of up to 250.
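As a sketch of what such a curve represents, the ECDF can also be computed by hand: sort the values, and for each one record the fraction of observations that are less than or equal to it. The helper name `ecdf` and the toy values are made up for illustration. Missing values are dropped here, since a team absent from one of the leaderboards has no Δ.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations <= each of them."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])           # drop missing deltas, then sort
    y = np.arange(1, len(x) + 1) / len(x)  # cumulative proportion at each value
    return x, y

# Invented position shakeups, including one team with no value.
delta_position = [0, 3, 3, 10, 40, 120, 300, np.nan]
x, y = ecdf(delta_position)

# Height of the curve at 50: the proportion of teams that moved 50 places or fewer.
share_within_50 = np.mean(x <= 50)
print(share_within_50)
```

Reading the curve at a given Δ therefore gives the complement of the "shakeup of over ..." counts, expressed as a proportion rather than a number of teams.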

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.set_xlabel("Δ position", fontsize=14)
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
# draw the ECDF on a twin axis, which also puts y ticks on the right-hand side
ax = ax.twinx()
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
sns.ecdfplot(data=merged_df, x="delta position", ax=ax)
ax.set(xlim=(0, 250))
plt.show();

* Number of teams with score shakeups:

In [None]:
print("Number of teams with a score shakeup of over 0.002: ", merged_df[merged_df['delta score'] > 0.002].shape[0] )
print("Number of teams with a score shakeup of over 0.003: ", merged_df[merged_df['delta score'] > 0.003].shape[0] )
print("Number of teams with a score shakeup of over 0.005: ", merged_df[merged_df['delta score'] > 0.005].shape[0] )
print("Number of teams with a score shakeup of over 0.1  : ", merged_df[merged_df['delta score'] > 0.1].shape[0] )

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.set_xlabel("Δ score", fontsize=14)
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
# draw the ECDF on a twin axis, which also puts y ticks on the right-hand side
ax = ax.twinx()
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
sns.ecdfplot(data=merged_df, x="delta score", ax=ax)
ax.set(xlim=(0, 0.005))
plt.show();

The ECDF above is only plotted up to a score shakeup of 0.005.

## Position shakeups with noise (red) and without noise (blue)

In [None]:
# https://matplotlib.org/stable/gallery/color/named_colors.html
color_1 = 'blue'
color_2 = 'olive'
color_3 = 'orange'
color_4 = 'crimson'
color_5 = 'limegreen'
color_6 = 'red'
color_7 = 'teal'
color_8 = 'yellowgreen'

fig  = px.scatter(merged_df,x='position 1',y='position 2v1',hover_name='TeamName').update_traces(marker=dict(color=color_6))
fig_2  = px.scatter(merged_df,x='position 1',y='position 2v2',hover_name='TeamName').update_traces(marker=dict(color=color_1))
fig.add_trace(fig_2.data[0])

fig.update_layout(
    title="The black line represents Δ=0.",
    xaxis_title="Old Leaderboard position",
    yaxis_title="New Leaderboard position",)
#fig.update_layout(xaxis = dict(range=[0.17,0.3]))
#fig.update_layout(yaxis = dict(range=[0.17,0.3]))
fig.add_shape(type="line",x0=0, y0=0, x1=4000, y1=4000,line=dict(color="black",width=1.5,))
fig.update_traces(mode='markers', marker_size=3)
fig.show();

Here is a smoothed contour plot, the red indicating the "spreading out" of the shakeup due to the addition of noise:
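A crude numerical counterpart to this picture is the average distance of each team from the Δ=0 diagonal, computed once for the run with noise and once for the run without. The sketch below uses an invented six-team table and a hypothetical helper `spread`; only the column names follow `merged_df`, and the mean absolute position change is just one possible choice of summary.

```python
import pandas as pd

# Toy stand-in for merged_df with invented positions.
toy = pd.DataFrame(
    {
        "position 1":   [1, 2, 3, 4, 5, 6],
        "position 2v1": [3, 1, 6, 2, 4, 5],  # second update, with noise
        "position 2v2": [1, 3, 2, 4, 6, 5],  # second update, without noise
    }
)

def spread(df, new_col, old_col="position 1"):
    """Mean absolute position change between two leaderboards."""
    return (df[new_col] - df[old_col]).abs().mean()

with_noise = spread(toy, "position 2v1")
without_noise = spread(toy, "position 2v2")
print(with_noise, without_noise)
```

A larger value for the noisy run than for the noise-free run corresponds to the red contours reaching further from the diagonal than the blue ones.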

In [None]:
# end points of the Δ=0 diagonal
data = [[0, 0], [5000, 5000]]
xy = pd.DataFrame(data, columns = ['x', 'y'])

fig, ax = plt.subplots(figsize=(8, 8))
# density of (LB 1 position, LB 2 position) pairs: with noise first, then without noise on top
sns.kdeplot(x=merged_df["position 1"], y=merged_df["position 2v1"],  fill=True , thresh=0.0035, color=color_6, ax = ax)
sns.kdeplot(x=merged_df["position 1"], y=merged_df["position 2v2"],  fill=True , thresh=0.0035, color=color_1, ax = ax)
sns.lineplot(data=xy, x="x", y="y", color="black", linewidth=0.5, ax = ax)
ax.set_title("With noise in red, and without noise in blue", fontsize=14)
ax.set_xlabel("LB 1 position", fontsize=14)
ax.set_ylabel("LB 2 position", fontsize=14)
ax.set(xlim=(0, 5000))
ax.set(ylim=(0, 5000))
plt.show();

## Interactive score shakeups (with noise in red and without noise in blue)

In [None]:
fig  = px.scatter(merged_df,x='score 1',y='score 2v1',hover_name='TeamName').update_traces(marker=dict(color=color_6))
fig_2  = px.scatter(merged_df,x='score 1',y='score 2v2',hover_name='TeamName').update_traces(marker=dict(color=color_1))
fig.add_trace(fig_2.data[0])

fig.update_layout(
    title="The black line represents Δ=0.",
    xaxis_title="Old Leaderboard score",
    yaxis_title="New Leaderboard score",)
fig.update_layout(xaxis = dict(range=[0.17,0.9]))
fig.update_layout(yaxis = dict(range=[0.17,0.9]))
fig.add_shape(type="line",x0=0, y0=0, x1=250, y1=250,line=dict(color="black",width=1,))
fig.update_traces(mode='markers', marker_size=3)
fig.show();