# Jane Street CV Research: **GroupKFold** or **PurgedGroupTimeSeriesSplit**
In this notebook, I compare which local CV scheme tracks the LB (and, hopefully, the private test set) more closely: **GroupKFold** or **PurgedGroupTimeSeriesSplit**. The results below suggest that **GroupKFold** is closer to the LB. Here are my thoughts:
1. Many people use **PurgedGroupTimeSeriesSplit** because it guards against data leakage when fitting time series, but we found it hard to match the ranking of local CV results with the ranking on the LB.
2. Counterarguments to the two assumptions behind using **PurgedGroupTimeSeriesSplit**: 
    1. "This is a time series competition": since we are not given a "stock id", we can hardly construct time-based features. Also, no feature in the data (such as a "price") can be followed continuously from day 1 to day 2. 
    2. "Training on future data to predict past data causes data leakage": the data is unlikely to be raw and may have been normalized day by day. If so, data leakage may not be an issue here.
3. Reasons why **GroupKFold** can be better:
    1. The training and validation sets can be larger than those in **PurgedGroupTimeSeriesSplit**, because every fold draws on the whole training data.
    2. When we train on future data and validate on past data, we test robustness from a different perspective: whether a model fitted on a future market environment can still predict the current market.
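
To make the difference concrete, here is a minimal sketch on toy data (the array `toy_dates` is a made-up stand-in for the `date` column, not the competition data). It shows that **GroupKFold** never splits a day between train and validation, but it does let the model train on days that come after the validation days:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 "days" with 4 rows each (hypothetical stand-in for train["date"]).
toy_dates = np.repeat(np.arange(12), 4)
toy_X = np.zeros((toy_dates.size, 1))

uses_future_flags = []
for fold, (tr, va) in enumerate(GroupKFold(n_splits=3).split(toy_X, groups=toy_dates)):
    train_days = np.unique(toy_dates[tr])
    valid_days = np.unique(toy_dates[va])
    # A day is never shared between the training and validation parts.
    shared_days = np.intersect1d(train_days, valid_days)
    # Training days may lie after the earliest validation day ("future" data).
    uses_future = bool(train_days.max() > valid_days.min())
    uses_future_flags.append(uses_future)
    print(f"fold {fold}: valid days {valid_days.tolist()}, "
          f"shared days: {shared_days.size}, trains on later days: {uses_future}")
```

Whether this kind of look-ahead hurts depends on assumption 2.2 above, i.e. on how strongly consecutive days are linked in this dataset.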
    
Some notes:
1. During local CV, using **train size : valid size = 2:1** matches the ratio **whole train size : private test size**. That is why I chose those parameters in the Local CV part.
2. As the utility score grows with the length of the data, I also use `u_daily = u / num_of_date` as another metric to mitigate this issue.
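
The second note can be checked with a small sketch on synthetic data. The helper `utility_scores` and all the random inputs below are my own illustration, not part of the competition API. Repeating the same days a second time leaves `t` unchanged, doubles `u`, and keeps `u_daily` the same:

```python
import numpy as np

def utility_scores(date, weight, resp, action):
    """Return (t, u, u_daily) for arrays of per-row date, weight, resp and action."""
    days, day_idx = np.unique(np.asarray(date), return_inverse=True)
    # Daily profit: sum of weight * resp * action over the rows of each day.
    p = np.bincount(day_idx, weights=np.asarray(weight) * np.asarray(resp) * np.asarray(action))
    n_days = days.size
    t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / n_days)
    u = np.clip(t, 0, 6) * p.sum()
    return t, u, u / n_days

rng = np.random.default_rng(0)
n_days, rows_per_day = 50, 20
date = np.repeat(np.arange(n_days), rows_per_day)
weight = rng.uniform(0.1, 2.0, size=date.size)
resp = rng.normal(0.001, 0.01, size=date.size)
action = (resp > 0).astype(int)  # a "perfect" trader, for illustration only

t1, u1, ud1 = utility_scores(date, weight, resp, action)

# Same rows repeated on new date ids: twice as many days, identical daily profits.
date2 = np.concatenate([date, date + n_days])
t2, u2, ud2 = utility_scores(date2, np.tile(weight, 2), np.tile(resp, 2), np.tile(action, 2))

print(f"{n_days} days:  t={t1:.3f}, u={u1:.3f}, u_daily={ud1:.4f}")
print(f"{2 * n_days} days: t={t2:.3f}, u={u2:.3f}, u_daily={ud2:.4f}")
```

This is why `u` alone is hard to compare between validation sets of different lengths, while `u_daily` is roughly comparable.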


## If you found this helpful, please UP VOTE! Thank you!

# Import Packages

In [None]:
import optuna
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GroupKFold

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Load Data
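
The cell below builds a binary target and a per-row training weight. As a small illustration on a made-up frame (`toy` and its values are hypothetical), `action` is 1 when `resp` is positive, and `sample_weight` is `|resp| * sqrt(weight)`, so rows with a larger absolute return count more during fitting:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, only to show how the two derived columns behave.
toy = pd.DataFrame({"resp": [0.02, -0.01, 0.0, 0.005],
                    "weight": [4.0, 1.0, 9.0, 0.25]})
toy["action"] = (toy["resp"] > 0).astype(int)
toy["sample_weight"] = toy["resp"].abs() * np.sqrt(toy["weight"])
print(toy)
```

Note that a row with `resp == 0` gets a sample weight of zero, so it has no influence on training.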

In [None]:
train = pd.read_parquet('../input/dtrain-parquet/dtrain.parquet')
train = train.query("weight != 0")
train.reset_index(inplace=True, drop=True)
train["action"] = (train["resp"] > 0).astype(int)
train["sample_weight"] = abs(train['resp'])*train['weight'].transform('sqrt')
fs = ["weight"] + [f'feature_{x}' for x in range(130)] 
fs_median = train[fs].median()
train.fillna(fs_median,inplace=True)
fs_median_v = fs_median.values
X = train[fs].values
y = train[["action"]].values.ravel()
dates = train["date"].values.ravel()

In [None]:
LOCAL_CV = False
SUBMISSION = True
show_plot = False
SUBMISSION_d = 6

# PurgedGroupTimeSeriesSplit
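
Before the full class, here is a rough sketch of the idea, assuming that group labels sort chronologically. It is not the class used in this notebook: it applies scikit-learn's `TimeSeriesSplit` (with its `gap` argument) to the unique days and maps the selected days back to row indices. The function name `group_time_series_split` and the toy dates are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def group_time_series_split(groups, n_splits=3, gap=2):
    """Yield (train_idx, test_idx) so that every training day precedes the
    test days, with `gap` whole days left out in between."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # sorted; assumed to be chronological
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=gap)
    for train_g, test_g in splitter.split(unique_groups):
        train_idx = np.flatnonzero(np.isin(groups, unique_groups[train_g]))
        test_idx = np.flatnonzero(np.isin(groups, unique_groups[test_g]))
        yield train_idx, test_idx

toy_dates = np.repeat(np.arange(20), 3)  # 20 "days", 3 rows each
folds = list(group_time_series_split(toy_dates, n_splits=3, gap=2))
for k, (tr, te) in enumerate(folds):
    print(f"fold {k}: train days {toy_dates[tr].min()}..{toy_dates[tr].max()}, "
          f"test days {toy_dates[te].min()}..{toy_dates[te].max()}")
```

Compared with **GroupKFold**, each fold here only trains on the past, so the early folds see much less training data.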

In [None]:
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# https://www.kaggle.com/marketneutral/purged-rolling-time-series-cv-split
# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class PurgedGroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Allows for a gap in groups to avoid potentially leaking info from
    train into test if the model has windowed or lag features.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
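    max_train_group_size : int, default=Inf
        Maximum number of groups in a single training set.
    group_gap : int, default=None
        Number of groups left out between the end of the training set and
        the start of the test set. Must be passed as an integer: `split`
        does not handle the default of None. `split` also drops the first
        `group_gap` samples of each test set.
    max_test_group_size : int, default=Inf
        Maximum number of groups in a single test set.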
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_group_size=np.inf,
                 max_test_group_size=np.inf,
                 group_gap=None,
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_group_size = max_train_group_size
        self.group_gap = group_gap
        self.max_test_group_size = max_test_group_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        group_gap = self.group_gap
        max_test_group_size = self.max_test_group_size
        max_train_group_size = self.max_train_group_size
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))

        group_test_size = min(n_groups // n_folds, max_test_group_size)
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []

            group_st = max(0, group_test_start - group_gap - max_train_group_size)
            for train_group_idx in unique_groups[group_st:(group_test_start - group_gap)]:
                train_array_tmp = group_dict[train_group_idx]
                
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)

            train_end = train_array.size
 
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)

            test_array  = test_array[group_gap:]
                    
            yield [int(i) for i in train_array], [int(i) for i in test_array]

# Utility
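
For reference, the function in the cell below implements the competition's utility score as I read it from the evaluation page. Here $i$ indexes dates, $j$ indexes the trades within a date, and $|i|$ is the number of unique dates in the evaluated set:

$$
p_i = \sum_j \mathrm{weight}_{ij} \cdot \mathrm{resp}_{ij} \cdot \mathrm{action}_{ij},
\qquad
t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{|i|}},
\qquad
u = \min(\max(t, 0), 6) \cdot \sum_i p_i
$$

The extra metric used in this notebook is $u_{\mathrm{daily}} = u / |i|$, which is my own addition and not part of the official score.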

In [None]:
from sklearn.metrics import roc_auc_score

def utility(estimator, X, y, idx):
    """Custom scoring object as per documentation:
    https://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object
    Utility score formulae are defined in competition's intro:
    https://www.kaggle.com/c/jane-street-market-prediction/overview/evaluation
    Using optimisation tricks from @gogo827jz:
    https://www.kaggle.com/c/jane-street-market-prediction/discussion/201257
    """

    # still looking for a way to write this for xgb.cv but it passes DMatrix which doesn't allow indexing...
    # https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html
    date = train.loc[idx, 'date'].values
    num_date = np.unique(date).size
    weight = train.loc[idx, 'weight'].values
    resp = train.loc[idx, 'resp'].values
    
    action = estimator.predict(X)
    proba = estimator.predict_proba(X)[:, 1]
    roc_auc = roc_auc_score(y, proba)
    p_i = np.bincount(date, weight * resp * action)
    
    t = p_i.sum() / np.sqrt((p_i ** 2).sum()) * np.sqrt(250 / num_date)
    u = np.clip(t, 0, 6) * p_i.sum()
    u_daily = u / num_date
    print("{} days, roc_auc:{:<10.3f}, t:{:<10.3f}, u:{:<10.3f}, u_daily:{:<10.3f}".format(num_date, roc_auc, t, u, u_daily))
    return roc_auc, t, u, u_daily

# Local CV

In [None]:
depths = [3, 4, 5, 6, 7, 8 ,9]
cv_dict = {
    "PurgedCV": PurgedGroupTimeSeriesSplit(n_splits=5, 
                                           group_gap=5, 
                                           max_train_group_size=200, 
                                           max_test_group_size=100), # train: valid = 2:1 
    "GroupCV_1": GroupKFold(n_splits=3), # train: valid = 2:1
    "GroupCV_2": GroupKFold(n_splits=5), # train: valid = 4:1
}
lgb_params = dict(num_leaves=31, 
                       max_depth=7, 
                       learning_rate=0.01, 
                       n_estimators=200, 
                       objective="binary", 
                       min_child_weight=0.001, min_child_samples=20, 
                       subsample=1.0, subsample_freq=0, 
                       colsample_bytree=0.8, 
                       reg_alpha=1, reg_lambda=1, 
                       random_state=2, 
                       n_jobs=-1, 
                       silent=True, 
                       importance_type='split'
                      )


In [None]:
# Code from https://www.kaggle.com/gogo827jz/jane-street-ffill-xgboost-purgedtimeseriescv
def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    """Create a sample plot for indices of a cross-validation object."""
    
    cmap_cv = plt.cm.coolwarm

    jet = plt.cm.get_cmap('jet', 256)
    seq = np.linspace(0, 1, 256)
    _ = np.random.shuffle(seq)   # inplace
    cmap_data = ListedColormap(jet(seq))

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)

    # Plot the data classes and groups at the end
    ax.scatter(range(len(X)), [ii + 1.5] * len(X),
               c=y, marker='_', lw=lw, cmap=plt.cm.Set3)

    ax.scatter(range(len(X)), [ii + 2.5] * len(X),
               c=group, marker='_', lw=lw, cmap=cmap_data)

    # Formatting
    yticklabels = list(range(n_splits)) + ['target', 'day']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2], xlim=[0, len(y)])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    return ax

In [None]:
%%time
if show_plot:
    fig, ax = plt.subplots()
    cv = cv_dict["PurgedCV"]
    plot_cv_indices(cv, X, y, dates, ax, 5, lw = 20)
    plt.show()

In [None]:
score_cols = ["roc_auc","t","u","u_daily"]
score_df = pd.DataFrame([])
if LOCAL_CV:
    result_d = {}
    for cv_name, cv in cv_dict.items():
        print("*" * 10 + "  " + cv_name + "  " + "*" * 10)    
        for d in depths:
            print("*" * 5 + " Depth: ",d," "+ "*" * 5)
            new_lgb_params = lgb_params.copy()
            new_lgb_params["max_depth"] = d
            model = LGBMClassifier(**new_lgb_params)
            t_score = []
            v_score = []        
            for f,(t_idx, v_idx) in enumerate(cv.split(X, y, dates)):
                print("Fold "+str(f)+" :")
                X_t, X_v = X[t_idx], X[v_idx]
                y_t, y_v = y[t_idx], y[v_idx]
                w_t = train.loc[t_idx, "sample_weight"]
                model.fit(X_t, y_t, eval_set =[(X_v, y_v)], eval_metric = ["logloss"], early_stopping_rounds = 20, sample_weight= w_t,verbose=True)
                iteration = model.best_iteration_
                print("Training Iteration: ", iteration)
                print("Train",end=":\t")
                t_roc_auc, t_t, t_u, t_u_daily = utility(model, X_t, y_t, t_idx)
                print("Valid",end=":\t")
                v_roc_auc, v_t, v_u, v_u_daily = utility(model, X_v, y_v, v_idx)
                t_score.append([t_roc_auc,t_t,t_u,t_u_daily,iteration])
                v_score.append([v_roc_auc,v_t,v_u,v_u_daily,iteration])  
            t_score_df = pd.DataFrame(t_score, columns = score_cols+["iteration"])
            v_score_df = pd.DataFrame(v_score, columns = score_cols+["iteration"])
            t_score_df["train"] = True 
            t_score_df["cv"] = cv_name
            t_score_df["depth"] = d
            v_score_df["train"] = False 
            v_score_df["cv"] = cv_name
            v_score_df["depth"] = d
            t_score_mean = t_score_df[score_cols].mean()
            t_score_std = t_score_df[score_cols].std()
            v_score_mean = v_score_df[score_cols].mean()
            v_score_std = v_score_df[score_cols].std()
            score_df = pd.concat([score_df, t_score_df, v_score_df],axis=0)

# Present CV and LB Result

In [None]:
if LOCAL_CV:
    score_df.to_csv("score_df.csv")
else:
    CV_FOLDER = "cv-score"
    score_df = pd.read_csv('../input/'+ CV_FOLDER +'/score_df.csv', index_col=0)

In [None]:
score_df.head()

In [None]:
score_df_mean = score_df.groupby(["cv","depth","train"])[score_cols+["iteration"]].mean()
score_df_std = score_df.groupby(["cv","depth","train"])[score_cols+["iteration"]].std()

## Local CV Score

In [None]:
score_df_mean.style.format("{:.2f}")

In [None]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]
Idx = pd.IndexSlice
score_df_mean.loc[Idx[:,:,False],:].style.apply(highlight_max).format("{:.4f}")

## LB Score
Numbers in parentheses are the rank of each depth within its column (1 = best).

| Depth | GroupCV 3 folds u_daily | GroupCV 5 folds u_daily | PurgedCV u_daily | LB |
| :--- | :----: | :----: | :----: | ---: |
| 3 | 6.63 (9) | 9.55 (5) | 6.36 (9) | 1548 (9) |
| 4 | 7.10 (8) | 9.17 (7) | 7.26 (5) | 2169 (8) |
| 5 | 10.04 (1) | 9.60 (4) | 8.37 (1) | 2374 (7) |
| 6 | 8.52 (3) | 9.48 (6) | 6.92 (6) | 2512 (2) |
| 7 | 9.37 (2) | 9.96 (1) | 7.33 (4) | 2660 (1) |
| 8 | 8.52 (4) | 9.91 (2) | 7.40 (3) | 2485 (4) |
| 9 | 8.20 (5) | 9.73 (3) | 6.78 (7) | 2488 (3) |
| 10 | 7.90 (6) |  | 7.59 (2) | 2484 (5) |
| 11 | 7.84 (7) |  | 6.61 (8) | 2463 (6) |
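
One way to put a number on "consistent with the LB" is a rank correlation between each CV column and the LB column. The sketch below is my own addition: it retypes the values from the table above and computes Spearman's correlation as the Pearson correlation of the ranks, with ties given their average rank. The missing 5-fold entries are simply dropped, so that column is compared on depths 3 to 9 only.

```python
import numpy as np
import pandas as pd

# Values copied from the LB table above; NaN marks the missing 5-fold runs.
lb_table = pd.DataFrame(
    {
        "GroupCV 3 folds": [6.63, 7.10, 10.04, 8.52, 9.37, 8.52, 8.20, 7.90, 7.84],
        "GroupCV 5 folds": [9.55, 9.17, 9.60, 9.48, 9.96, 9.91, 9.73, np.nan, np.nan],
        "PurgedCV": [6.36, 7.26, 8.37, 6.92, 7.33, 7.40, 6.78, 7.59, 6.61],
        "LB": [1548, 2169, 2374, 2512, 2660, 2485, 2488, 2484, 2463],
    },
    index=pd.Index(range(3, 12), name="depth"),
)

def spearman_vs_lb(column):
    """Spearman rank correlation between one CV column and the LB score."""
    pair = lb_table[[column, "LB"]].dropna()
    ranks = pair.rank()  # average rank for ties
    return ranks[column].corr(ranks["LB"])  # Pearson correlation of the ranks

rank_corr = {name: spearman_vs_lb(name)
             for name in ["GroupCV 3 folds", "GroupCV 5 folds", "PurgedCV"]}
for name, rho in rank_corr.items():
    print(f"{name:<16} Spearman rho vs LB: {rho:.2f}")
```

On these numbers the 3-fold **GroupKFold** column correlates with the LB more strongly than the **PurgedGroupTimeSeriesSplit** column. With only nine depths this is a rough indication, not a significance test.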

# Summary
1.  **GroupKFold with 3 folds** is much more consistent with the LB result than **the 5-fold version** and **PurgedGroupTimeSeriesSplit**.
    - The 5-fold version uses a train:valid ratio of 4:1, which is not the real train:test ratio. The differences between its scores across depths are also relatively small.
2.  As depth goes from 7 to 11, we should see a decreasing validation score because of overfitting. The **GroupKFold 3-fold version** and the LB score confirm this, but **PurgedGroupTimeSeriesSplit** does not.


###  Please let me know if you have any ideas or suggestions about the CV and LB consistency issue. Thank you!

# Submission
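
At inference time, the loop below fills missing features with the training medians using `np.nan_to_num(X_test) + np.isnan(X_test) * fs_median_v`. As a small sketch on made-up numbers (`medians` is a hypothetical stand-in for `fs_median_v`), this gives the same result as the more explicit `np.where` form when the row contains no infinities:

```python
import numpy as np

medians = np.array([0.5, -1.0, 2.0])    # hypothetical stand-in for fs_median_v
row = np.array([np.nan, 3.0, np.nan])   # one test row with two missing features

# Form used in the submission loop: NaN becomes 0, then the median is added back.
filled_sum = np.nan_to_num(row) + np.isnan(row) * medians
# Equivalent, more explicit form: pick the median wherever the value is NaN.
filled_where = np.where(np.isnan(row), medians, row)

print(filled_sum, filled_where)
```

One difference to keep in mind: `np.nan_to_num` also replaces infinite values with very large finite numbers, which `np.where(np.isnan(...))` leaves untouched.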

In [None]:
import janestreet
if SUBMISSION:
    env = janestreet.make_env() # initialize the environment
    iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
if SUBMISSION:
    new_lgb_params = lgb_params.copy()
    new_lgb_params["max_depth"] = SUBMISSION_d
    model = LGBMClassifier(**new_lgb_params)
    model.fit(X, y,sample_weight= train["sample_weight"])

In [None]:
from tqdm import tqdm
if SUBMISSION:
    for (test_df, sample_prediction_df) in tqdm(iter_test):
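        # Assumes the first column of test_df is `weight`; zero-weight rows were
        # also dropped from training, so no model prediction is made for them.
        if test_df.iloc[0,0] == 0: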
            action = 0
        else:
            X_test = test_df.loc[:,fs]
            X_test = X_test.iloc[0,:].values
            if np.isnan(X_test.sum()):
                X_test = np.nan_to_num(X_test)+np.isnan(X_test)*fs_median_v
            pred = model.predict(X_test.reshape(1,-1))
            action = (pred).astype(int)
        sample_prediction_df.action = action #make your 0/1 prediction here
        env.predict(sample_prediction_df)

In [None]:
# sample_prediction_df.action = 0 #make your 0/1 prediction here
# env.predict(sample_prediction_df)