# Optimal single value for prediction


### What is the best single value to use as a prediction?

__tl;dr: it's not the mean__




The best single value $p$ to use as a prediction for the targets $t_i$ is the one that minimises the error function, here the root mean squared percentage error (RMSPE):

$$ e = \sqrt{\frac{1}{n} \sum_i{\left(\frac{t_i - p}{t_i}\right)^2}} $$

Differentiate with respect to $p$; the minimum error is at the point where the derivative is zero.

$$ \frac{\partial e}{\partial p} = \frac{\partial }{\partial p}\sqrt{\frac{1}{n} \sum_i{\left(\frac{t_i - p}{t_i}\right)^2}} = 0 $$

It is easier, though, to rearrange first:

$$ ne^2 = \sum_i{\left(\frac{t_i - p}{t_i}\right)^2} $$

$$ ne^2 = \sum_i{\left(1 - 2 t_i^{-1}p + t_i^{-2}p^2\right)} $$

$$ \frac{\partial }{\partial p} ne^2 = 2ne \frac{\partial e}{\partial p} = 0 $$

$$ \sum_i{ - 2 t_i^{-1} + 2t_i^{-2}p} = 0 $$
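
Since $p$ does not depend on $i$, it can be pulled out of the sum, which leaves a linear equation in $p$. The second derivative is positive whenever the targets are non-zero, so the stationary point is a minimum:

$$ p \sum_i{t_i^{-2}} = \sum_i{t_i^{-1}} $$

$$ \frac{\partial^2}{\partial p^2} ne^2 = 2 \sum_i{t_i^{-2}} > 0 $$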


The optimal value is:
---
$$ p = \frac{\sum_i{ t_i^{-1}}}{\sum_i{ t_i^{-2}}}$$

### Or in Python:
```
# target: NumPy array or pandas Series of non-zero target values
prediction = sum(1 / target) / sum(1 / target ** 2)
```
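
As a sanity check on the formula, the closed form can be compared with a brute-force search over candidate constants. The sketch below uses synthetic, strictly positive targets (not the competition data), and `rmspe` is a helper name introduced here for illustration. It also shows that the closed form equals a weighted mean of the targets with weights $t_i^{-2}$, which is why it sits below the plain mean: small targets dominate a percentage error.

```python
import numpy as np

# Synthetic, strictly positive targets (illustrative only, not the competition data).
rng = np.random.default_rng(0)
target = rng.lognormal(mean=-5.5, sigma=0.6, size=2_000)


def rmspe(target, prediction):
    """Root mean squared percentage error of a prediction against the targets."""
    return np.sqrt(np.mean(((target - prediction) / target) ** 2))


# Closed-form minimiser from the derivation above.
closed_form = np.sum(1 / target) / np.sum(1 / target ** 2)

# The same value written as a weighted mean with weights 1 / t_i^2.
weighted_mean = np.average(target, weights=target ** -2.0)

# Brute-force search over constants between the smallest and largest target.
candidates = np.linspace(target.min(), target.max(), 2_001)
errors = np.array([rmspe(target, p) for p in candidates])
grid_best = candidates[np.argmin(errors)]

print(f"closed form: {closed_form:.6f}  RMSPE: {rmspe(target, closed_form):.4f}")
print(f"grid search: {grid_best:.6f}  RMSPE: {rmspe(target, grid_best):.4f}")
print(f"mean:        {target.mean():.6f}  RMSPE: {rmspe(target, target.mean()):.4f}")
```

The grid search should land within one grid step of the closed form, and the mean should score worse than both.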

## Implementation

### Imports, constants and loading data

In [None]:
import cufflinks as cf
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold


cf.go_offline()

#DATA_DIR = 'data'
DATA_DIR = '../input/optiver-realized-volatility-prediction'
OUTPUT_DIR = '.'

df_train = pd.read_csv(f'{DATA_DIR}/train.csv')
df_test = pd.read_csv(f'{DATA_DIR}/test.csv')

df_train.head()

In [None]:
def single_prediction(df):
    """Compute the best single value prediction using the formula detailed at the start of the notebook
    
    Parameters
    ----------
    df: pandas.DataFrame
        DataFrame with column for "target"
        
    Returns
    -------
    float
    """
    inverse_target = 1 / df['target']
    single_prediction = inverse_target.sum() / np.square(inverse_target).sum()
    return single_prediction


def stock_id_prediction(df):
    """Compute the best value to use for each stock
    
    Parameters
    ----------
    df: pandas.DataFrame
        DataFrame with columns for "stock_id" and "target"
        
    Returns
    -------
    pandas.Series
        Series with index of stock_id and prediction values to use.
    """
    return df.groupby('stock_id')[['target']].apply(single_prediction)


def score(results_df, predict=None):
    """Compute the score (RMSPE) of a given prediction
    
    Parameters
    ----------
    results_df: pandas.DataFrame
        DataFrame with columns for target, and optionally prediction
    
    predict: float, optional
        Single value to use for prediction
        
    Returns
    -------
    float:
        RMSPE of prediction
    """
    if 'prediction' not in results_df:
        results_df['prediction'] = predict
    results_df['sq_pc_error'] = results_df.eval('(target - prediction) / target').apply(np.square)
    return np.sqrt(results_df['sq_pc_error'].sum() / results_df['sq_pc_error'].shape[0])
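
Note that `score` writes the `prediction` and `sq_pc_error` columns into the DataFrame it is given. Where that side effect is unwanted, a variant that leaves its input untouched might look like the sketch below. `rmspe` is a hypothetical helper name, not something the notebook defines, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def rmspe(target, prediction):
    """RMSPE that does not modify its inputs (hypothetical helper, not from the notebook)."""
    target = np.asarray(target, dtype=float)
    # A scalar prediction is broadcast against every target value.
    prediction = np.broadcast_to(np.asarray(prediction, dtype=float), target.shape)
    return float(np.sqrt(np.mean(np.square((target - prediction) / target))))


# Toy frame with made-up target values.
toy = pd.DataFrame({'target': [0.002, 0.004, 0.008]})
toy_score = rmspe(toy['target'], 0.004)

print(toy_score)
print(list(toy.columns))  # the input frame still only has its original column
```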

# Single Value Prediction

In [None]:
single_prediction(df_train)

In [None]:
prediction_df = df_train.copy()
prediction_df['prediction'] = single_prediction(df_train)
score(prediction_df)

#### Using the mean of the target is significantly worse

In [None]:
prediction_df = df_train.copy()
prediction_df['prediction'] = prediction_df['target'].mean()
score(prediction_df)

## CV of single value prediction

The KFold CV is run without shuffling, so each validation fold is a contiguous block of rows; the aim is for the validation set to consist mostly of unseen time_ids.
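
For reference, `KFold` with `shuffle=False` assigns consecutive rows to each validation fold, so which ids end up held out depends on how the rows of `train.csv` are ordered. A minimal sketch on dummy row indices (not the competition data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows standing in for the rows of a DataFrame.
rows = np.arange(10)

kf_demo = KFold(n_splits=5, shuffle=False)
folds = [validate_index.tolist() for _, validate_index in kf_demo.split(rows)]

# Each validation fold is a run of consecutive row positions.
for fold, validate_index in enumerate(folds):
    print(fold, validate_index)
```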

In [None]:
kf = KFold(
    n_splits=5, 
    shuffle=False
)
results = []


for fold, (train_index, validate_index) in enumerate(kf.split(df_train)):
    X = df_train.loc[train_index]
    X_val = df_train.loc[validate_index]
    
    predict = single_prediction(X)
    results.append({
        'score': score(X, predict),
        'oof_score': score(X_val, predict)
    })

    
pd.DataFrame(results)

# Single Value Prediction Per Stock

In [None]:
stock_id_prediction(df_train)

In [None]:
prediction_df = df_train.copy()
stock_id_prediction_map = stock_id_prediction(prediction_df)
prediction_df['prediction'] = prediction_df['stock_id'].map(stock_id_prediction_map)
score(prediction_df)

## CV of single value prediction per stock

In [None]:
kf = KFold(
    n_splits=5, 
    shuffle=False
)
results = []


for fold, (train_index, validate_index) in enumerate(kf.split(df_train)):
    X = df_train.loc[train_index]
    X_val = df_train.loc[validate_index]
    
    stock_id_prediction_map = stock_id_prediction(X)
    train_df = X.copy()
    train_df['prediction'] = train_df['stock_id'].map(stock_id_prediction_map)
    
    validate_df = X_val.copy()
    validate_df['prediction'] = validate_df['stock_id'].map(stock_id_prediction_map)
    
    results.append({
        'score': score(train_df),
        'oof_score': score(validate_df),
        'oof_results': validate_df
    })

    
pd.DataFrame(results)[['score', 'oof_score']]

#### Error breakdown by stock

In [None]:
error_df = (
    pd.concat([row['oof_results'] for row in results])
    .groupby('stock_id')['sq_pc_error']
    .agg(['sum', 'count'])
    .rename(columns={'sum': 'sum_of_SPE', 'count': 'count_of_SPE'})
    .eval('sum_of_SPE / count_of_SPE')
    .apply(np.sqrt))

error_df.sort_values()

In [None]:
error_df.iplot(kind='hist')

## Create Submission File

In [None]:
stock_id_prediction_map = stock_id_prediction(df_train)
stock_id_prediction_map

In [None]:
df_test['target'] = df_test['stock_id'].map(stock_id_prediction_map)
output_df = df_test[['row_id', 'target']]
output_df.to_csv('submission.csv', index=False)
output_df
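
`Series.map` returns NaN for any `stock_id` that has no entry in the mapping. If the test set could contain such stocks (an assumption; the notebook does not say), one possible fallback is the global single value. A sketch on toy data, with all names and values made up for illustration:

```python
import pandas as pd

# Toy per-stock predictions and a toy test frame (made-up values).
stock_map = pd.Series({0: 0.0030, 1: 0.0045}, name='target')
global_prediction = 0.0035
toy_test = pd.DataFrame({'row_id': ['0-4', '1-4', '2-4'], 'stock_id': [0, 1, 2]})

# stock_id 2 is missing from the map, so map() yields NaN and fillna() replaces it.
toy_test['target'] = toy_test['stock_id'].map(stock_map).fillna(global_prediction)

print(toy_test[['row_id', 'target']])
```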