# Local API Emulator

**NEW**: the updated version of Local API Emulator is neater and packaged as a module with a demo available here:
https://www.kaggle.com/jagofc/local-api-emulator-is-now-a-module

**OLD**: Wouldn't it be great if we could test the API locally on our own slices of data? Well, now you can - here's my crack at a local API emulator with LB scoring. 

#### It has the properties you've come to expect and love from the real API:

+ Delivers data with the same formatting as that from the real API, grouped by `timestamp`.
+ Has a `predict()` method.
+ Berates you if you try to:
    + get data for time `t_2` before calling `predict()` for time `t_1`.
    + predict for time `t_1` before getting the data for time `t_1`.
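
The ordering rules above amount to keeping two counters in step. A minimal sketch, assuming a hypothetical `OrderedFeed` class (not part of the emulator or of `gresearch_crypto`) that serves plain Python items instead of dataframes:

```python
class OrderedFeed:
    """Hypothetical stand-in: enforces the get-data / predict alternation."""

    def __init__(self, batches):
        self._batches = iter(batches)
        self.next_calls = 0   # batches handed out so far
        self.pred_calls = 0   # predictions received so far

    def __iter__(self):
        return self

    def __next__(self):
        # A new batch is only served once the previous one has been answered.
        assert self.pred_calls == self.next_calls, "call predict() first"
        batch = next(self._batches)
        self.next_calls += 1
        return batch

    def predict(self, prediction):
        # A prediction is only accepted for a batch that has been served.
        assert self.pred_calls == self.next_calls - 1, "get the next batch first"
        self.pred_calls += 1


feed = OrderedFeed(["t_1", "t_2"])
first = next(feed)

try:
    next(feed)            # asking for t_2 before predicting for t_1
    skipped_ok = True
except AssertionError:
    skipped_ok = False

feed.predict(0.0)         # now t_1 is answered
second = next(feed)       # and t_2 can be served
```

The failed `next(feed)` does not consume a batch, so nothing is lost when the protocol is violated.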

#### It also has the following functionality:

+ Collects your predictions in a list of dataframe slices which is accessible as an attribute `predictions`.
+ Has a length method which counts the number of unique timestamps remaining to be served.
+ Gives familiar error messages if you don't follow protocol.
+ Has a `score` method which:
    + computes the weighted correlation between your predictions and the true targets.
    + correctly handles constant predictions (giving a score of -1).
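
The special case for constant predictions exists because a correlation is undefined when one input has zero variance (the formula divides by zero). A minimal sketch of that guard, assuming a hypothetical helper `guarded_score` and the same `1e-10` variance threshold that the emulator's `score` method uses; uniform weights are used here only to keep the example short:

```python
import numpy as np


def guarded_score(pred, target, weights, eps=1e-10):
    """Weighted correlation, or -1 if the predictions are (near) constant."""
    pred, target, w = (np.asarray(x, dtype=float).ravel()
                       for x in (pred, target, weights))
    if pred.var() < eps:          # population variance (ddof=0)
        return -1.0
    sum_w = w.sum()
    mean_p = (w * pred).sum() / sum_w
    mean_t = (w * target).sum() / sum_w
    cov = (w * (pred - mean_p) * (target - mean_t)).sum() / sum_w
    var_p = (w * (pred - mean_p) ** 2).sum() / sum_w
    var_t = (w * (target - mean_t) ** 2).sum() / sum_w
    return float(cov / np.sqrt(var_p * var_t))


target = np.array([0.1, -0.2, 0.3, 0.0])   # made-up targets
weights = np.ones_like(target)

constant_score = guarded_score(np.zeros(4), target, weights)
perfect_score = guarded_score(target, target, weights)
```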

#### This is not the *real* API.

This is just an emulator. It is easy to cheat it in ways that I hope won't be possible for the real API.\
e.g. I think it's highly unlikely that, when running code for the private LB, the hosts store a list containing all of your predictions that participants can modify post hoc... (However, if you've played around with the API in your interactive notebook, you'll see that it **is** possible to modify the predictions list in the local `gresearch_crypto` env :O)

#### You might find this code useful for:

+ realistically testing your models on a different time period to that used for the public LB,
+ avoiding the opaque submission procedure,
+ racking up more submission attempts than the ordained 5 per day.

## Code & step-by-step Demo

Preliminaries.

In [None]:
import time
import numpy as np
import pandas as pd
import gresearch_crypto
from numpy import dtype

asset_details = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv')
id_2_weight = dict(zip(asset_details.Asset_ID, asset_details.Weight))

dtypes = {'timestamp': np.int64, 'Asset_ID': np.int8,
          'Count': np.int32,     'Open': np.float64,
          'High': np.float64,    'Low': np.float64,
          'Close': np.float64,   'Volume': np.float64,
          'VWAP': np.float64,    'Target': np.float64}

def datestring_to_timestamp(ts):
    return int(pd.Timestamp(ts).timestamp())

def read_csv_slice(file_path='../input/g-research-crypto-forecasting/train.csv', dtypes=dtypes, use_window=None):
    df = pd.read_csv(file_path, dtype=dtypes)
    if use_window is not None: 
        df = df[(df.timestamp >= use_window[0]) & (df.timestamp < use_window[1])]
    return df

def weighted_correlation(a, b, weights):
    """Weighted Pearson correlation between `a` and `b`."""
    w = np.ravel(weights)
    a = np.ravel(a)
    b = np.ravel(b)
    sum_w = np.sum(w)
    # Weighted means.
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    # Weighted variances and covariance.
    var_a = np.sum(w * np.square(a - mean_a)) / sum_w
    var_b = np.sum(w * np.square(b - mean_b)) / sum_w
    cov = np.sum(a * b * w) / sum_w - mean_a * mean_b
    corr = cov / np.sqrt(var_a * var_b)
    return corr
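
Written out, `weighted_correlation` computes the following, with $w_i$ the per-row weights, $a_i$ the predictions and $b_i$ the targets (the symbol names are chosen here for illustration):

```latex
\bar{a} = \frac{\sum_i w_i a_i}{\sum_i w_i}, \qquad
\bar{b} = \frac{\sum_i w_i b_i}{\sum_i w_i}

\operatorname{var}_w(a) = \frac{\sum_i w_i (a_i - \bar{a})^2}{\sum_i w_i}, \qquad
\operatorname{cov}_w(a, b) = \frac{\sum_i w_i a_i b_i}{\sum_i w_i} - \bar{a}\,\bar{b}

\operatorname{corr}_w(a, b) = \frac{\operatorname{cov}_w(a, b)}{\sqrt{\operatorname{var}_w(a)\,\operatorname{var}_w(b)}}
```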

Select and load the subset that you're interested in filling the API with. *The current setting is for the slice used by the public LB.*

In [None]:
start = datestring_to_timestamp('2021-06-13T00:00')
end = datestring_to_timestamp('2021-09-22T01:00')
train_df = read_csv_slice(use_window=[start, end])
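
The window passed to `read_csv_slice` is half-open: `start` is included and `end` is excluded, and both are Unix timestamps in seconds. A small sketch on a made-up frame (the five rows are illustrative, not competition data); note that `pd.Timestamp(...).timestamp()` treats a date string without a timezone as UTC:

```python
import pandas as pd


def datestring_to_timestamp(ts):
    return int(pd.Timestamp(ts).timestamp())


start = datestring_to_timestamp('2021-06-13T00:00')
end = datestring_to_timestamp('2021-06-13T00:03')

# Toy frame: one row per minute, starting one minute before the window.
toy = pd.DataFrame({'timestamp': [start - 60 + 60 * i for i in range(5)]})

# Same filter as in read_csv_slice: start <= timestamp < end.
window = toy[(toy.timestamp >= start) & (toy.timestamp < end)]
```

Of the five rows, only the three at `start`, `start + 60` and `start + 120` survive the filter.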

Here's code for the local API emulator.

In [None]:
class API:
    def __init__(self, df):
        df = df.astype(dtypes)
        df['row_id'] = df.index
        dfg = df.groupby('timestamp')
        
        self.data_iter = dfg.__iter__()
        self.init_num_times = len(dfg)
        self.next_calls = 0
        self.pred_calls = 0
        self.predictions = []
        self.targets = []
        
        print("This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set. ;)")

    def __iter__(self):
        return self
    
    def __len__(self):
        return self.init_num_times - self.next_calls
        
    def __next__(self):
        assert self.pred_calls == self.next_calls, "You must call `predict()` successfully before you can get the next batch of data."
        timestamp, df = next(self.data_iter)
        self.next_calls += 1
        data_df = df.drop(columns=['Target'])
        true_df = df[['row_id', 'Target', 'Asset_ID']]
        self.targets.append(true_df)
        pred_df = true_df.drop(columns=['Asset_ID'])
        pred_df['Target'] = 0.
        return data_df, pred_df
    
    def predict(self, pred_df):
        assert self.pred_calls == self.next_calls - 1, "You must get the next batch of data from the API before making a new prediction."
        assert pred_df.columns.to_list() == ['row_id', 'Target'], "Prediction dataframe should have columns `row_id` and `Target`."
        pred_df = pred_df.astype({'row_id': dtype('int64'), 'Target': dtype('float64')})
        self.predictions.append(pred_df)
        self.pred_calls += 1
        
    def score(self, id_2_weight=id_2_weight):
        pred_df = pd.concat(self.predictions).rename(columns={'Target':'Prediction'})
        true_df = pd.concat(self.targets)
        scoring_df = pd.merge(true_df, pred_df, on='row_id', how='left')
        scoring_df['Weight'] = scoring_df.Asset_ID.map(id_2_weight)
        scoring_df = scoring_df[scoring_df.Target.notna()]
        if scoring_df.Prediction.var(ddof=0) < 1e-10:
            score = -1
        else:
            score = weighted_correlation(scoring_df.Prediction, scoring_df.Target, scoring_df.Weight)
        return scoring_df, score
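
The emulator gets its batches by iterating over a pandas `groupby` object, which yields `(key, sub-frame)` pairs in sorted key order. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'timestamp': [120, 60, 60, 120, 180],
    'Asset_ID':  [0, 0, 1, 1, 0],
    'Target':    [0.1, 0.2, 0.3, 0.4, 0.5],
})
toy['row_id'] = toy.index

groups = toy.groupby('timestamp')
batches = iter(groups)

first_ts, first_df = next(batches)    # smallest timestamp comes first
n_batches = len(groups)               # one batch per unique timestamp
```

This is why `len(api)` can be computed up front: the number of groups is known before any batch is served.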

Create an API instance.

In [None]:
api = API(train_df)

Get the first batch of data.

In [None]:
(data_df, pred_df) = next(api)
data_df.head(3)

We'll get an error if we try to continue on to the next batch without making our predictions for the current batch. *Commented out so that the notebook doesn't fail.*

In [None]:
# next(api)

Let's make a dummy prediction using `pred_df`.

In [None]:
api.predict(pred_df)

Now you can continue to iterate. Let's get another slice of data and make another dummy prediction:

In [None]:
(data_df, pred_df) = next(api)
api.predict(pred_df)

Your predictions are stored by the API. Let's just look at the first two prediction batches we made:

In [None]:
api.predictions

The API also has a length method, which tracks the number of timestamps still to be served:

In [None]:
len(api)

Note that you don't need to restart the notebook kernel to make a new emulator (or to refresh the current one), in contrast with the `gresearch_crypto` env.

In [None]:
api2 = API(train_df)

#### Main loop

An example loop (making dummy predictions of Target=0) with a timing estimate for 100 days' worth of data.

In [None]:
# Count only the batches this loop will serve; batches already
# consumed in the cells above are excluded from the timing.
n_iters = len(api)
start_time = time.time()

for (data_df, pred_df) in api:
    pred_df['Target'] = 0.
    api.predict(pred_df)
    
finish_time = time.time()

total_time = finish_time - start_time
iter_speed = n_iters / total_time

print(f"Iterations/s = {round(iter_speed, 2)}.")
test_iters = 60 * 24 * 100  # one batch per minute for 100 days
print(f"Expected number of iterations in test set is approx. {test_iters}",
      f"which will take {round(test_iters / iter_speed, 2)}s",
      "using this API emulator while making dummy predictions.")

#### Calculate your LB score!

The API now has a `score` method. This returns:
+ a dataframe containing your predictions, the targets, and weights,
+ the LB score: weighted correlation between predictions and targets.

In [None]:
df, score = api.score()
print(f"Your LB score is {round(score, 4)}")
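
The dataframe returned by `score` is built by joining targets and predictions on `row_id`, attaching the asset weights, and dropping rows whose target is missing. A sketch of those steps on made-up data (the weights and values are illustrative, not the real `asset_details`):

```python
import numpy as np
import pandas as pd

true_df = pd.DataFrame({'row_id': [0, 1, 2, 3],
                        'Target': [0.01, np.nan, -0.02, 0.03],
                        'Asset_ID': [0, 1, 0, 1]})
pred_df = pd.DataFrame({'row_id': [0, 1, 2, 3],
                        'Prediction': [0.0, 0.1, -0.1, 0.2]})
toy_weights = {0: 2.0, 1: 1.0}   # hypothetical Asset_ID -> Weight mapping

scoring_df = pd.merge(true_df, pred_df, on='row_id', how='left')
scoring_df['Weight'] = scoring_df.Asset_ID.map(toy_weights)
# Rows without a target are not scored.
scoring_df = scoring_df[scoring_df.Target.notna()]
```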

### A TL;DR example with random predictions

In [None]:
api = API(train_df)

for (data_df, pred_df) in api:
    pred_df['Target'] = np.random.randn(len(pred_df))
    api.predict(pred_df)
    
df, score = api.score()

print(f"Your LB score is {round(score, 4)}")
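
Random predictions carry no information about the targets, so the score from the loop above should land close to zero. A self-contained sketch of why, using synthetic targets and weights (both made up for illustration) and, as an independent route to the same quantity, `np.cov` with its `aweights` argument rather than the notebook's `weighted_correlation`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

target = rng.normal(scale=0.01, size=n)     # synthetic targets
weights = rng.uniform(1.0, 7.0, size=n)     # synthetic per-row weights
prediction = rng.standard_normal(n)         # random, independent of target

# Weighted covariance matrix of (prediction, target); the normalisation
# constant cancels when forming the correlation.
c = np.cov(prediction, target, aweights=weights)
random_score = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
```

With this many rows the result is within a few hundredths of zero; on a short slice of real data the spread around zero will be wider.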