# Fund Replication Experiment
In this notebook we demonstrate sequential testing using our algorithm, by applying it to a “fund replication” dataset.
The code we used for downloading the data is available [here](https://github.com/amspector100/mlr_knockoff_paper/blob/main/real_data/fund_rep.py).

In [1]:
import pandas as pd
from data.data import read_log_sector_data
from src.e_crt import EcrtTester
from src.utils import BettingFunction, get_martingale_values

In this experiment, as in the paper, we run the algorithm on XLK, a technology sector index fund.
We downloaded the data available from August 2013 until September 2022, which gives $n=2421$ time steps;
each sample corresponds to one trading day.
For the feature importance we used the current S&P 500 information; the
[file](../data/data_imp_XLK_Information%20Technology_Open_10.csv) is available in the data folder.
The [features](../data/xdata_XLK_Information%20Technology_Open_10.csv) and
[labels](../data/ydata_XLK_Information%20Technology_Open_10.csv) files are also in the data folder.

After deleting features with missing information, we ended up with $457$ features.
In this notebook we test the algorithm on $10$ of these features, which we believe represent its performance well.

Because there are $457$ features, we set $n_{init}=500$ (the default is $50$).

To improve robustness, in all our real-data experiments we use $\tanh(20 \cdot (\cdot))$
as the betting function, together with a batch ensemble with batch sizes $[5, 10, 20]$.
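
As a rough illustration of why a scaled $\tanh$ is a convenient betting function, the sketch below builds a toy wealth process from bounded bets. It is not the repository's `BettingFunction.tanh`: the names `tanh_bet` and `wealth_path`, the input `score_diffs`, and the plain product update are assumptions made for this example only.

```python
import numpy as np


def tanh_bet(score_diff, scale=20.0):
    """Map a score difference to a bet in [-1, 1]; scale=20 mirrors tanh(20 * x)."""
    return np.tanh(scale * np.asarray(score_diff, dtype=float))


def wealth_path(score_diffs, scale=20.0):
    """Toy wealth process: multiply the current wealth by (1 + bet) at every step."""
    return np.cumprod(1.0 + tanh_bet(score_diffs, scale))


# Hypothetical score differences; positive values count as evidence against the null.
score_diffs = np.array([0.01, -0.005, 0.02, 0.0])
path = wealth_path(score_diffs)
print(path)
```

Because every bet lies between -1 and 1, no factor is negative and the wealth never drops below zero. The scale of 20 makes the bet respond strongly to small differences, and a difference of exactly zero leaves the wealth unchanged.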

In [None]:
j_vec = [3, 26, 136, 173, 183, 238, 286, 296, 344, 404]
batch_list = [5, 10, 20]
n_init = 500
g_func = BettingFunction.tanh

To illustrate sequential testing, we first use only the data from August 2013 until September **2021**.
This first run has $n=2169$ time steps.


In [None]:
save_path = "./results/fund_replication"
save_name = "martingale_dict_sep21"
X, Y, beta, features_names = read_log_sector_data(date="2021-09-21")

For **each** feature we run a separate test and record whether the feature was rejected.
Because we later continue the test with the data from September 2021 until September 2022, we also save the
martingales at the end of each run.
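
The rejection decision in a betting-based sequential test is typically a threshold rule: reject the first time the martingale reaches $1/\alpha$, which by Ville's inequality keeps the type-I error at most $\alpha$. The sketch below shows that rule on a toy wealth path. The function `first_rejection` is hypothetical, and the level $\alpha=0.05$ (threshold $20$) is an assumption, since this notebook does not state the level. It is consistent with the results table here, where every rejected feature has a martingale above $20$ and every non-rejected one stays below it.

```python
import numpy as np


def first_rejection(wealth, alpha=0.05):
    """Return (rejected, n_used): stop at the first step where wealth >= 1 / alpha."""
    wealth = np.asarray(wealth, dtype=float)
    hits = np.flatnonzero(wealth >= 1.0 / alpha)
    if hits.size:
        # Report the stopping time as a 1-based number of samples.
        return True, int(hits[0]) + 1
    # No crossing: all samples were consumed without rejecting.
    return False, len(wealth)


# Toy wealth paths (not taken from the experiment).
print(first_rejection([1.0, 2.5, 8.0, 21.0, 30.0]))  # (True, 4)
print(first_rejection([1.0, 3.0, 10.6]))             # (False, 3)
```

Stopping at the first crossing is what makes the number of samples used smaller than $n$ for features that are rejected early.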

In [3]:
results_dict = {
    "idx": j_vec,
    "name": features_names[j_vec],
    "important": beta[j_vec],
    "martingale": [],
    "effective n": [],
    "rejected": []
}

for j in j_vec:
    # One tester per feature; each feature gets its own results folder.
    ecrt_tester = EcrtTester(batch_list=batch_list,
                             j=j,
                             g_func=g_func,
                             n_init=n_init,
                             save_name=save_name,
                             load_name="",  # first run: no saved martingales to load
                             path=f"{save_path}/feature_{j}_{features_names[j]}")
    rejected = ecrt_tester.run(X, Y)
    martingale, neff = get_martingale_values(ecrt_tester.martingale_dict)
    results_dict["martingale"].append(martingale)
    results_dict["effective n"].append(neff)
    results_dict["rejected"].append(rejected)
    # Save the martingales so the test can be continued when new data arrives.
    ecrt_tester.save_martingales()

Of the 5 important features, only 2 (AAPL and MSFT) were detected as such by the test. Of the 5
unimportant features, 1 was falsely rejected: the Google stock (GOOGL).

In [4]:
pd.DataFrame(results_dict)

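|   | idx | name | important | martingale | effective n | rejected |
|---|-----|------|-----------|------------|-------------|----------|
| 0 | 3 | AAPL | 1 | 83.232649 | 580 | True |
| 1 | 26 | AMAT | 1 | 10.346207 | 2169 | False |
| 2 | 136 | EBAY | 0 | 0.736875 | 2169 | False |
| 3 | 173 | FTNT | 1 | 10.644899 | 2169 | False |
| 4 | 183 | GOOGL | 0 | 25.355105 | 1120 | True |
| 5 | 238 | KO | 0 | 0.259894 | 2169 | False |
| 6 | 286 | MSFT | 1 | 26.608006 | 560 | True |
| 7 | 296 | NFLX | 0 | 3.430329 | 2169 | False |
| 8 | 344 | PTC | 1 | 1.152892 | 2169 | False |
| 9 | 404 | TSLA | 0 | 1.620387 | 2169 | False |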


Now we add the new data, from September 2021 until September 2022.
To continue the test from the point where it stopped, we load the martingales saved after the first run.
To load saved martingales, pass a **"load_name"** to the `EcrtTester` constructor.
To continue from the previous stopping point, also pass a **"start_idx"** to the *run* method.
Note that we pass the full dataset ($2421$ samples), not only the $252$ samples from the last year.
The old data is not used to update the martingales, but it is used to train the learning model.
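
To see why resuming from a saved martingale is legitimate, the sketch below checks on toy numbers that continuing from a saved wealth value gives the same path as a single uninterrupted pass. It does not reproduce the internals of `EcrtTester`: the function `resume_wealth`, the `bets` array, and the plain product update are assumptions for illustration, and model training on the old data is left out.

```python
import numpy as np


def resume_wealth(bets, saved_wealth, start_idx):
    """Continue a saved wealth value using only the bets from start_idx onward."""
    bets = np.asarray(bets, dtype=float)
    return saved_wealth * np.cumprod(1.0 + bets[start_idx:])


# Hypothetical bets for six time steps; the first run saw only the first four.
bets = np.array([0.2, -0.1, 0.3, 0.1, 0.25, -0.05])
start_idx = 4

first_run = np.cumprod(1.0 + bets[:start_idx])
resumed = resume_wealth(bets, saved_wealth=first_run[-1], start_idx=start_idx)

# Resuming from the saved value matches one uninterrupted pass over all the data.
single_pass = np.cumprod(1.0 + bets)
print(np.allclose(resumed, single_pass[start_idx:]))  # True
```

The full array is passed in and then sliced at `start_idx`, which mirrors the idea of passing the whole dataset while updating the martingale only on the new samples.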

In [None]:
load_name = "martingale_dict_sep21"
save_name = "martingale_dict_sep22"
X, Y, beta, features_names = read_log_sector_data(date="2022-09-21")
results_dict["new martingale"] = []
results_dict["new effective n"] = []

for ii, j in enumerate(j_vec):
    ecrt_tester = EcrtTester(batch_list=batch_list,
                             j=j,
                             g_func=g_func,
                             n_init=n_init,
                             save_name=save_name,
                             load_name=load_name,
                             path=f"{save_path}/feature_{j}_{features_names[j]}")
    # Continue only the tests that have not rejected yet, starting where the first run stopped.
    if not results_dict["rejected"][ii]:
        rejected = ecrt_tester.run(X, Y, start_idx=results_dict["effective n"][ii])
        results_dict["rejected"][ii] = rejected
    martingale, neff = get_martingale_values(ecrt_tester.martingale_dict)
    results_dict["new martingale"].append(martingale)
    results_dict["new effective n"].append(neff)
    ecrt_tester.save_martingales()

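After the second run, 4 out of 5 important features are rejected: AMAT and FTNT are now detected, while PTC is still not.
No additional unimportant feature is rejected, so GOOGL remains the only false rejection.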
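
The counts quoted above can be recomputed from the `important` and `rejected` columns of the final results table. The snippet below is a small sketch that copies those two columns by hand rather than reusing `results_dict`, so it runs on its own; the variable names `final` and `summary` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the final results table in this notebook.
final = pd.DataFrame({
    "name": ["AAPL", "AMAT", "EBAY", "FTNT", "GOOGL",
             "KO", "MSFT", "NFLX", "PTC", "TSLA"],
    "important": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "rejected": [True, True, False, True, True,
                 False, True, False, False, False],
})

# Number of rejections and number of features within each importance group.
summary = final.groupby("important")["rejected"].agg(["sum", "count"])
print(summary)
```

The row for `important == 1` gives the detections among important features, and the row for `important == 0` gives the false rejections.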

In [21]:
pd.DataFrame(results_dict)


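|   | idx | name | important | martingale | effective n | rejected | new martingale | new effective n |
|---|-----|------|-----------|------------|-------------|----------|----------------|-----------------|
| 0 | 3 | AAPL | 1 | 83.232649 | 580 | True | 83.232649 | 580 |
| 1 | 26 | AMAT | 1 | 10.346207 | 2169 | True | 27.038178 | 2309 |
| 2 | 136 | EBAY | 0 | 0.736875 | 2169 | False | 1.244609 | 2421 |
| 3 | 173 | FTNT | 1 | 10.644899 | 2169 | True | 20.531884 | 2229 |
| 4 | 183 | GOOGL | 0 | 25.355105 | 1120 | True | 25.355105 | 1120 |
| 5 | 238 | KO | 0 | 0.259894 | 2169 | False | 0.279492 | 2421 |
| 6 | 286 | MSFT | 1 | 26.608006 | 560 | True | 26.608006 | 560 |
| 7 | 296 | NFLX | 0 | 3.430329 | 2169 | False | 4.63011 | 2421 |
| 8 | 344 | PTC | 1 | 1.152892 | 2169 | False | 0.502565 | 2421 |
| 9 | 404 | TSLA | 0 | 1.620387 | 2169 | False | 1.660369 | 2421 |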
