**A huge shoutout goes to [Pawel Jankiewicz](https://www.kaggle.com/paweljankiewicz) who in this [thread](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307288) very generously shared a lot of practical information on approaching a RecSys problem. For people newer to this space like myself this thread has been invaluable.**

In fact, much of the code that follows is based on my understanding of the concepts discussed in that thread.

NOTE: The code in this notebook will require some files that have been generated in NB 01 & NB 02. These files are used for local validation.

NOTE2: I ran this on a VM with 192GB of RAM. You can probably run it with less RAM using a large swap file. Another thing that could be done is to port this to dask (that would likely involve quite a few code changes as dask doesn't support the full pandas API yet).

------------------------------

The other two notebooks have been fine in the sense that I learned a lot about the problem.

I got a sense for how candidate generation feels, what trends might exist in the data, etc.

Let's now try implementing a full (albeit small) data processing pipeline, at the end of which we will train a LightGBM model and make a submission.

Let's get started.

# Feature (and candidate) Engineering

In [1]:
import pandas as pd
import swifter
import numpy as np

In [2]:
VALID_RUN = True # whether to use the last week as local validation (set created in 01) or use it for training
DRY_RUN = False # run on a subset of data, mostly for development

SUB_NAME = 'all_vars'

In [3]:
%%time
# https://www.kaggle.com/paweljankiewicz/hm-create-dataset-samples

if not DRY_RUN:
    transactions = pd.read_csv('data/transactions_train.csv', dtype={"article_id": "str"})
    customers = pd.read_csv('data/customers.csv')
    articles = pd.read_csv('data/articles.csv', dtype={"article_id": "str"})

CPU times: user 27.3 s, sys: 2.35 s, total: 29.7 s
Wall time: 29.7 s


We can skip customers and articles for now; the heart of the problem is the transactions.

We want to use features generated from data up to and including week $t$ to generate predictions for week $t+1$.
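
A minimal sketch of this framing, on made-up toy data: a statistic is computed per week, and then the week label is shifted by one so that a row for week $t+1$ only sees information from week $t$. The column name `sales_prev_week` is an illustrative assumption and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy purchase log: one row per (customer, article, week) purchase.
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["x", "y", "x"],
    "week": [1, 2, 2],
})

# Number of purchases per article and week, computed from data of that week.
weekly_sales = (
    purchases.groupby(["week", "article_id"]).size()
    .rename("sales_prev_week")  # hypothetical feature name
    .reset_index()
)

# Shift the week label by one: the count observed in week t
# becomes a feature for rows of week t+1.
weekly_sales["week"] += 1

# Rows whose article was not sold in the preceding week get NaN.
features = purchases.merge(weekly_sales, on=["week", "article_id"], how="left")
print(features)
```

The same shift-then-merge pattern is what the `week += 1` lines in the cells below rely on.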

Of course, that is just one way to structure the problem. We could treat it as a time series problem where we look at the sequence of weeks $t_1...t_n$ and predict the purchases for week $t_{n+1}$. There are many ways to frame the problem.

But we want to do something that
* is simple
* has a chance of lending itself well to gradient boosted trees

In line with [the suggestion from Pawel Jankiewicz](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307288), we will predict baskets. And to make things even simpler, we will aggregate purchases by weeks.
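
To make the notion of a weekly basket concrete, here is a small sketch on toy data. It only illustrates the grouping idea; the pipeline below never materializes baskets as lists, it keeps one row per (customer, week, article) instead.

```python
import pandas as pd

# Toy transactions, already reduced to a week number.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": ["x", "y", "x", "z"],
    "week": [1, 1, 2, 1],
})

# A basket is everything a customer bought in a given week.
baskets = (
    tx.groupby(["customer_id", "week"])["article_id"]
    .agg(list)
    .reset_index(name="basket")
)
print(baskets)
```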

In [4]:
# %%time

# for sample_repr, sample in [("01", 0.001), ("1", 0.01), ("5", 0.05)]:
#     print(sample)
#     customers_sample = customers.sample(int(customers.shape[0]*sample), replace=False)
#     customers_sample_ids = set(customers_sample["customer_id"])
#     transactions_sample = transactions[transactions["customer_id"].isin(customers_sample_ids)]
#     articles_sample_ids = set(transactions_sample["article_id"])
#     articles_sample = articles[articles["article_id"].isin(articles_sample_ids)]
#     customers_sample.to_csv(f"data/customers_sample_{sample_repr}.csv.gz", index=False)
#     transactions_sample.to_csv(f"data/transactions_train_sample_{sample_repr}.csv.gz", index=False)
#     articles_sample.to_csv(f"data/articles_train_sample_{sample_repr}.csv.gz", index=False)

In [5]:
if DRY_RUN:
    transactions = pd.read_csv('data/transactions_train_sample_01.csv.gz', dtype={"article_id": "str"})
    customers = pd.read_csv('data/customers_sample_01.csv.gz')
    articles = pd.read_csv('data/articles_train_sample_01.csv.gz', dtype={"article_id": "str"})

In [6]:
%%time

transactions['week'] = pd.to_datetime(transactions.t_dat, format='%Y-%m-%d') \
    .swifter.apply(lambda t: t.year * 100 + t.week) \
    .rank(method='dense') \
    .astype('int')

transactions.drop(columns='t_dat', inplace=True)

Dask Apply:   0%|          | 0/128 [00:00<?, ?it/s]

CPU times: user 6.06 s, sys: 4.71 s, total: 10.8 s
Wall time: 17.5 s


In [7]:
%%time

test_set_week = transactions.week.max()
valid_set_week = test_set_week - 1
train_set_weeks = set(transactions.week) - set([test_set_week]) - set([valid_set_week])

# because of some of the generated features some data for the first 3 weeks will be missing
first_three_weeks = transactions.week.sort_values().unique()[:3]

CPU times: user 3.21 s, sys: 297 ms, total: 3.51 s
Wall time: 3.5 s


In [8]:
transactions = transactions[transactions.week > 92] # this should correspond to 15 weeks of data

In [9]:
# I'm using the last week for local validation

if VALID_RUN:
    transactions = transactions[transactions.week != transactions.week.max()]
    transactions.reset_index(inplace=True, drop=True)

In [10]:
transactions['purchased'] = 1 # our positive examples
transactions.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)

In [11]:
%%time

bought_i_weeks_ago = transactions.copy()
for i in range(1,4):
    bought_i_weeks_ago.week += 1
    bought_i_weeks_ago[f'bought_{i}_wks_ago'] = 1
    
    # updating true sales already in transactions with information on whether an article was bought by a customer
    # in the previous week
    transactions = pd.merge(
        transactions,
        bought_i_weeks_ago[['customer_id', 'article_id', 'week', f'bought_{i}_wks_ago']],
        on=['customer_id', 'article_id', 'week'], how='left'
    )

    bought_i_weeks_ago.purchased = 0 # negative examples, possibly (or candidates for the test week)
    transactions = pd.concat([transactions, bought_i_weeks_ago]) # adding our negative examples
    
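    # assign the result back instead of relying on an in-place fill of a column selection
    transactions[f'bought_{i}_wks_ago'] = transactions[f'bought_{i}_wks_ago'].fillna(0)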

del bought_i_weeks_ago

CPU times: user 17.3 s, sys: 2.01 s, total: 19.3 s
Wall time: 19.3 s


`transactions` now contains some fake transactions. Notice, however, that I always append these possibly fake transactions (for instance, negative examples where the customer did in fact purchase that article in that week) at the bottom of the dataframe.

This insight is key. I will be able to remove them by using `drop_duplicates` once I am done generating negative examples (or data to predict on for the week in the test set).
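
The trick relies on `drop_duplicates` keeping the first occurrence of each key by default (`keep='first'`). A small sketch on toy data, with made-up customers and articles:

```python
import pandas as pd

# Real purchases come first in the frame.
real = pd.DataFrame({
    "customer_id": ["a", "b"],
    "article_id": ["x", "y"],
    "week": [2, 2],
    "purchased": [1, 1],
})

# Candidate rows are appended afterwards. The first one collides with a
# real purchase (customer "a" really bought "x" in week 2), the second does not.
candidates = pd.DataFrame({
    "customer_id": ["a", "a"],
    "article_id": ["x", "z"],
    "week": [2, 2],
    "purchased": [0, 0],
})

combined = pd.concat([real, candidates], ignore_index=True)

# keep='first' is the default, so the real purchase (purchased == 1) survives
# and the conflicting candidate (purchased == 0) is dropped.
deduped = combined.drop_duplicates(["customer_id", "article_id", "week"])
print(deduped)
```

If the candidate rows were concatenated before the real ones, the same call would keep the wrong label, which is why the order of the frames matters.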

In [12]:
transactions

| | customer_id | article_id | price | sales_channel_id | week | purchased | bought_1_wks_ago | bought_2_wks_ago | bought_3_wks_ago |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002f2f1deddc6f2504d96d65fa10ecbb22d7a13092b45... | 0868392003 | 0.025407 | 2 | 93 | 1 | 0.0 | 0.0 | 0.0 |
| 1 | 0002f2f1deddc6f2504d96d65fa10ecbb22d7a13092b45... | 0879891001 | 0.033881 | 2 | 93 | 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0002f2f1deddc6f2504d96d65fa10ecbb22d7a13092b45... | 0892794001 | 0.033881 | 2 | 93 | 1 | 0.0 | 0.0 | 0.0 |
| 3 | 0002f2f1deddc6f2504d96d65fa10ecbb22d7a13092b45... | 0855005001 | 0.033881 | 2 | 93 | 1 | 0.0 | 0.0 | 0.0 |
| 4 | 0002f2f1deddc6f2504d96d65fa10ecbb22d7a13092b45... | 0551379001 | 0.038119 | 2 | 93 | 1 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4479115 | fff7e7674509592818bf453391af43a85eaaac9a52d858... | 0624486049 | 0.013542 | 1 | 109 | 0 | 1.0 | 1.0 | 1.0 |
| 4479116 | fff871bf24b40fd1290215414d760afaa69bb164d2b970... | 0717490010 | 0.008458 | 2 | 109 | 0 | 1.0 | 1.0 | 1.0 |
| 4479117 | fff871bf24b40fd1290215414d760afaa69bb164d2b970... | 0717490058 | 0.008458 | 2 | 109 | 0 | 1.0 | 1.0 | 1.0 |
| 4479118 | fff871bf24b40fd1290215414d760afaa69bb164d2b970... | 0717490057 | 0.008458 | 2 | 109 | 0 | 1.0 | 1.0 | 1.0 |


In [13]:
%%time

transactions = transactions[transactions.week <= test_set_week]

CPU times: user 724 ms, sys: 336 ms, total: 1.06 s
Wall time: 1.06 s


In [14]:
%%time

bestsellers_previous_week = transactions \
    .groupby(['week', 'sales_channel_id'])['article_id'].value_counts() \
    .groupby(['week', 'sales_channel_id']).head(20) \
    .groupby(['week', 'sales_channel_id']).rank(method='dense', ascending=False) \
    .to_frame('bestseller_previous_week_rank').reset_index()

bestsellers_previous_week.week += 1

bestsellers_previous_week = bestsellers_previous_week[bestsellers_previous_week.week != bestsellers_previous_week.max().week.item()]

CPU times: user 6.35 s, sys: 287 ms, total: 6.63 s
Wall time: 6.63 s


In [15]:
transactions = pd.merge(transactions, bestsellers_previous_week, on=['week', 'article_id', 'sales_channel_id'], how='left')
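# articles outside the previous week's top 20 get a sentinel rank of 999
transactions['bestseller_previous_week_rank'] = transactions['bestseller_previous_week_rank'].fillna(999)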

In [16]:
%%time

negative_examples_bestsellers_previous_week = pd.merge( # negative examples AND candidates for test week!
    transactions[['customer_id', 'week']].drop_duplicates(),
    bestsellers_previous_week, how='left', on='week')

CPU times: user 10.1 s, sys: 3.47 s, total: 13.5 s
Wall time: 13.5 s


In [17]:
negative_examples_bestsellers_previous_week = \
    negative_examples_bestsellers_previous_week[
    negative_examples_bestsellers_previous_week.week != negative_examples_bestsellers_previous_week.week.min()
]
negative_examples_bestsellers_previous_week.reset_index(inplace=True, drop=True)

In [18]:
%%time

# need to recover the price

negative_examples_bestsellers_previous_week.week -= 1
negative_examples_bestsellers_previous_week = pd.merge(
    negative_examples_bestsellers_previous_week,
    # mean price per article, week and sales channel (prices vary within a week)
    transactions[['week', 'article_id', 'price', 'sales_channel_id']].groupby(['week', 'article_id', 'sales_channel_id']).mean().reset_index(),
    on=['week', 'article_id', 'sales_channel_id'],
    how='left'
) 
negative_examples_bestsellers_previous_week.week += 1

CPU times: user 31.8 s, sys: 9.35 s, total: 41.1 s
Wall time: 41.1 s


In [19]:
transactions = pd.concat([transactions, negative_examples_bestsellers_previous_week])
transactions.reset_index(inplace=True, drop=True)

Let's generate candidate examples for all the customers in the test set as well.
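
The idea, sketched on toy data: every test customer is paired with every bestseller of each sales channel. The frames and values below are made up for illustration; the real cells that follow do the same with `sample_submission.csv` and the bestsellers of the last available week.

```python
import pandas as pd

test_customers = pd.DataFrame({"customer_id": ["a", "b"]})

# Toy bestseller list: two articles for channel 1, one for channel 2.
bestsellers = pd.DataFrame({
    "sales_channel_id": [1, 1, 2],
    "article_id": ["x", "y", "z"],
    "bestseller_previous_week_rank": [1, 2, 1],
})

# One copy of every customer per sales channel.
customers_by_channel = pd.concat(
    [test_customers.assign(sales_channel_id=c) for c in (1, 2)],
    ignore_index=True,
)

# Joining on the channel attaches that channel's bestsellers to each customer.
candidates = customers_by_channel.merge(bestsellers, on="sales_channel_id", how="left")
print(candidates)
```

With 20 bestsellers per channel and two channels, this yields up to 40 candidate rows per customer, which explains why the test portion of the dataframe becomes so large.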

In [20]:
test_data = pd.read_csv('data/sample_submission.csv')[['customer_id']]

if DRY_RUN:
    test_data = test_data.iloc[:1000]

test_data['sales_channel_id'] = 1

test_data_2 = test_data.copy()
test_data_2['sales_channel_id'] = 2

test_data = pd.concat([test_data, test_data_2])

In [21]:
%%time
# adding bestsellers from last week in the dataset as candidates for all customers in the test set

negative_examples_bestsellers_previous_week = \
    negative_examples_bestsellers_previous_week[
        negative_examples_bestsellers_previous_week.week == negative_examples_bestsellers_previous_week.week.max()
]

bestseller_article_info = negative_examples_bestsellers_previous_week.drop_duplicates(['article_id', 'sales_channel_id'])[['article_id', 'sales_channel_id', 'bestseller_previous_week_rank', 'price']]

CPU times: user 3.92 s, sys: 1.96 s, total: 5.89 s
Wall time: 5.89 s


In [22]:
%%time

test_candidates = pd.merge( # negative examples AND candidates for week 108!
    test_data,
    bestseller_article_info,
    how='outer',
    on='sales_channel_id'
)

test_candidates['week'] = transactions.week.max()

transactions = pd.concat([transactions, test_candidates])

CPU times: user 6.21 s, sys: 4.78 s, total: 11 s
Wall time: 11 s


In [23]:
%%time

# candidate rows carry no label yet; treat them as not purchased
transactions['purchased'] = transactions['purchased'].fillna(0)

CPU times: user 310 ms, sys: 468 µs, total: 310 ms
Wall time: 309 ms


In [24]:
for i in range(1,4):
    transactions[f'bought_{i}_wks_ago'] = 0

In [25]:
%%time

# removing "fake", incorrect transactions
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

CPU times: user 1min 22s, sys: 12.7 s, total: 1min 35s
Wall time: 1min 35s


In [26]:
transactions.shape

(197837278, 10)

In [27]:
%%time
# Yet another way to create negative examples, to further differentiate between user preferences.

negative_transactions = transactions[transactions.purchased == 1].copy()
negative_transactions.purchased = 0

for i in range(1,4):
    negative_transactions[f'bought_{i}_wks_ago'] = 0

CPU times: user 435 ms, sys: 148 ms, total: 582 ms
Wall time: 586 ms


In [28]:
%%time

negative_transactions.customer_id = negative_transactions.groupby('week')['customer_id'].transform(np.random.permutation)

CPU times: user 520 ms, sys: 7.93 ms, total: 528 ms
Wall time: 527 ms


In [29]:
%%time

transactions = pd.concat([transactions, negative_transactions])
transactions.drop_duplicates(['customer_id', 'week', 'article_id', 'sales_channel_id'], inplace=True)

CPU times: user 1min 25s, sys: 17 s, total: 1min 42s
Wall time: 1min 42s


In [30]:
transactions.shape

(201775487, 10)

Let's merge `customers` and `articles` with `transactions`.

In [31]:
articles.drop(columns=['detail_desc', 'prod_name'], inplace=True)

In [32]:
%%time

transactions = pd.merge(transactions, articles, on='article_id', how='left')
transactions = pd.merge(transactions, customers, on='customer_id', how='left')

CPU times: user 3min 55s, sys: 1min 50s, total: 5min 46s
Wall time: 5min 45s


In [33]:
transactions.shape

(201775487, 38)

In [34]:
%%time

# we need to sort transactions to create baskets,
# where a basket is a sequence of records
# representing purchases (and purchase candidates / negative examples)
# for a customer for a given week
transactions.sort_values(['customer_id', 'week'], inplace=True)

CPU times: user 1min 35s, sys: 26.8 s, total: 2min 2s
Wall time: 2min 2s


In [35]:
%%time

# operations above are quite expensive -- let's save them so that we can restart from this place
# in the notebook if need be

transactions.to_pickle(f"data/transactions_sorted.pkl")
# transactions = pd.read_pickle(f"data/transactions_sorted.pkl")
# # need to recalculate this if we restarted the kernel and are loading up the data
# test_set_week = transactions.week.max()
# valid_set_week = test_set_week - 1
# train_set_weeks = set(transactions.week) - set([test_set_week]) - set([valid_set_week])
# first_three_weeks = transactions.week.sort_values().unique()[:3]

CPU times: user 2min 8s, sys: 1min 3s, total: 3min 11s
Wall time: 6min 25s


# Dataset preparation

In [36]:
%%time

transactions = transactions[~transactions.week.isin(set(first_three_weeks))]

CPU times: user 1min 7s, sys: 1min 6s, total: 2min 13s
Wall time: 2min 13s


In [37]:
test_candidates = transactions[transactions.week == test_set_week]
train_set = transactions[transactions.week != test_set_week]

In [38]:
train_set.shape, test_candidates.shape

((146239656, 38), (55535831, 38))

In [39]:
%%time

train_X = train_set[train_set.week.isin(train_set_weeks)]
valid_X = train_set[train_set.week == valid_set_week]

train_y = train_X['purchased']
valid_y = valid_X['purchased']

train_X = train_X.drop(columns='purchased')
valid_X = valid_X.drop(columns='purchased')

CPU times: user 38.8 s, sys: 43.1 s, total: 1min 21s
Wall time: 1min 21s


In [40]:
del transactions

In [41]:
test_X = test_candidates.drop(columns='purchased')
del test_candidates

In [42]:
%%time

train_baskets = train_X.groupby(['customer_id', 'week'])['article_id'].count().values
valid_baskets = valid_X.groupby(['customer_id', 'week'])['article_id'].count().values
test_baskets = test_X.groupby(['customer_id', 'week'])['article_id'].count().values

CPU times: user 37.4 s, sys: 3.32 s, total: 40.7 s
Wall time: 40.6 s
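
The arrays computed above are meant to be passed as the `group` argument of `LGBMRanker.fit`, which expects the size of each query group, with the rows of every group stored contiguously and in the same order. This is why `transactions` was sorted by `customer_id` and `week` earlier: `groupby` sorts its keys by default, so the counts line up with the sorted rows. A small sketch on toy data (no LightGBM needed):

```python
import pandas as pd

# Toy rows in arbitrary order.
rows = pd.DataFrame({
    "customer_id": ["b", "a", "a", "b", "a"],
    "week": [1, 2, 1, 1, 1],
    "article_id": ["p", "q", "r", "s", "t"],
})

# Sort so that each (customer, week) basket is a contiguous block of rows.
rows = rows.sort_values(["customer_id", "week"]).reset_index(drop=True)

# groupby sorts by key as well, so the sizes come out in the same order
# as the blocks in the sorted frame.
group_sizes = rows.groupby(["customer_id", "week"])["article_id"].count().values

print(rows)
print(group_sizes)
```

The group sizes must sum to the number of rows; a mismatch usually means the frame was filtered or reordered after the sizes were computed.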


# Vectorize datasets

In [43]:
import lightgbm
lightgbm.__version__

'3.3.2'

In [44]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

from sklearn.compose import ColumnTransformer
from lightgbm.sklearn import LGBMRanker

In [45]:
# https://github.com/logicai-io/recsys2019/blob/master/src/recsys/transformers.py
# https://github.com/logicai-io/recsys2019/blob/master/src/recsys/vectorizers.py

class PandasToRecords(BaseEstimator, TransformerMixin):
    def fit(self, X, *arg):
        return self

    def transform(self, X):
        return X.to_dict(orient="records")

class SparsityFilter(BaseEstimator, TransformerMixin):
    def __init__(self, min_nnz=None):
        self.min_nnz = min_nnz

    def fit(self, X, y=None):
        self.sparsity = X.getnnz(0)
        return self

    def transform(self, X):
        return X[:, self.sparsity >= self.min_nnz]

class PandasToNpArray(BaseEstimator, TransformerMixin):
    def fit(self, X, *arg):
        return self

    def transform(self, X):
        # np.float was removed in NumPy 1.24; the builtin float is equivalent
        return X.values.astype(float)

In [46]:
class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []
        
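    def fit(self, X, y=None):
        # reset so that refitting does not append to categories from a previous fit
        self.categories = []
        for i in range(X.shape[1]):
            vc = X.iloc[:, i].value_counts()
            # keep only values seen more than min_examples times; the rest get code -1
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self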

    def transform(self, X):
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data)

In [47]:
# removed vars: 'customer_id', 'article_id'

In [48]:
categorical_variables = ['index_code', 'postal_code', 'sales_channel_id', 'product_type_name', 'product_group_name', 
 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name', 'department_name',
 'index_name', 'index_group_name', 'section_name', 'garment_group_name',
 'FN', 'Active', 'club_member_status', 'fashion_news_frequency', 'graphical_appearance_name',
 'bestseller_previous_week_rank', 'product_type_no', 'colour_group_code']
numerical_variables = ['graphical_appearance_no', 'perceived_colour_value_id', 'perceived_colour_master_id', 
 'department_no', 'index_group_no', 'section_no', 'garment_group_no', 'age', 'product_code', 'price']
# numerical_variables_to_bin = ['price']

In [49]:
for i in range(1,4):
    categorical_variables += [f'bought_{i}_wks_ago']

In [50]:
pipe = make_pipeline(
    ColumnTransformer(
        [   
            (
                "categorical",
                Categorize(1000),
                categorical_variables,
            ),
            (
                "numerical",
                make_pipeline(PandasToNpArray(), SimpleImputer(strategy="constant", fill_value=-1)),
                numerical_variables,
            ),
#             (
#                 "numerical_to_bin",
#                 KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile'),
#                 numerical_variables_to_bin,
#             ),
        ]
    )
)

In [51]:
# pipe = make_pipeline(
#     ColumnTransformer(
#         [   
#             (
#                 "categorical",
#                 make_pipeline(PandasToRecords(), DictVectorizer(), SparsityFilter(min_nnz=60)),
#                 categorical_variables,
#             ),
#             (
#                 "numerical",
#                 make_pipeline(PandasToNpArray(), SimpleImputer(strategy="constant", fill_value=-1)),
#                 numerical_variables,
#             ),
#         ]
#     )
# )

In [52]:
%%time
train_X_vec = pipe.fit_transform(train_X)

CPU times: user 3min 52s, sys: 34.8 s, total: 4min 27s
Wall time: 4min 26s


In [53]:
%%time
valid_X_vec = pipe.transform(valid_X)

CPU times: user 9.77 s, sys: 2.11 s, total: 11.9 s
Wall time: 11.9 s


In [54]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=10,
    importance_type='gain',
    verbose=10,
#     num_leaves=60,

)

In [55]:
%%time

r = ranker.fit(
    train_X_vec,
    train_y,
    group=train_baskets,
    eval_set=[(valid_X_vec, valid_y)],
    eval_group=[valid_baskets],
)

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.761435
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.304362
[LightGBM] [Debug] init for col-wise cost 1.991427 seconds, init for row-wise cost 11.962770 seconds
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 1834
[LightGBM] [Info] Number of data points in the train set: 135721072, number of used features: 31
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 13
[1]	valid_0's ndcg@1: 0.720481	valid_0's ndcg@2: 0.719496	valid_0's ndcg@3: 0.72308	valid_0's ndcg@4: 0.722894	valid_0's ndcg@5: 0.72269
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 13
[2]	valid_0's ndcg@1: 0.720926	valid_0's ndcg@2: 0.719785	valid_0's ndcg@3: 0.723308	valid_0's ndcg@4: 0.723115	valid_0's ndcg@5: 0.722899
[LightGBM] [Debug] Trained a tree w

Learning more about what our model is doing is always very useful!

This can inform what we might want to work on next.
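
A minimal sketch of how importances can be paired with feature names, using made-up numbers in place of `ranker.feature_importances_`. The names must be listed in the same order as the columns produced by the `ColumnTransformer` (categorical first, then numerical), otherwise the labels will be attached to the wrong scores.

```python
import numpy as np

# Hypothetical values; in the notebook these come from ranker.feature_importances_.
feature_names = ["price", "age", "bestseller_previous_week_rank"]
importances = np.array([120.0, 35.5, 410.2])

# argsort is ascending, so reverse it to get the most important feature first.
order = importances.argsort()[::-1]
ranked = [(feature_names[i], float(importances[i])) for i in order]

for name, score in ranked:
    print(f"{name:35s} {score:10.1f}")
```

Printing the score next to the name, rather than the name alone, makes it easier to see how steeply importance drops off.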

In [56]:
variables = categorical_variables + numerical_variables # + numerical_variables_to_bin

for i in ranker.feature_importances_.argsort()[::-1]:
    print(variables[i])

bestseller_previous_week_rank
price
age
index_code
department_no
garment_group_name
sales_channel_id
product_type_name
perceived_colour_value_name
colour_group_name
section_no
product_group_name
club_member_status
product_code
department_name
postal_code
section_name
graphical_appearance_no
garment_group_no
index_group_name
perceived_colour_master_id
perceived_colour_master_name
index_name
Active
FN
index_group_no
graphical_appearance_name
product_type_no
colour_group_code
bought_1_wks_ago
bought_2_wks_ago
bought_3_wks_ago
perceived_colour_value_id
fashion_news_frequency


In [57]:
# del train_set, valid_X, train_X, valid_X_vec, train_X_vec

In [58]:
# import joblib
# # save model
# # joblib.dump(ranker, 'data/ranker.pkl')
# # load model
# ranker = joblib.load('data/ranker.pkl')

In [59]:
%%time

chunk_size = 5_000_000
preds = []

for i in range(0, test_X.shape[0], chunk_size):
    test_X_chunk_vectorized = pipe.transform(test_X.iloc[i:i+chunk_size])
    preds.append(ranker.predict(test_X_chunk_vectorized))

CPU times: user 1min 42s, sys: 14 s, total: 1min 56s
Wall time: 1min 7s


In [60]:
preds = np.concatenate(preds)

test_X['preds'] = preds

In [61]:
test_X = test_X[['customer_id', 'article_id', 'preds']]

In [62]:
del preds, test_X_chunk_vectorized

In [63]:
%%time
cust_id2pred = {}

for grp in test_X[['customer_id', 'article_id', 'preds']].sort_values(['customer_id', 'preds'], ascending=False).groupby('customer_id'):
    cust_id2pred[grp[0]] = grp[1]['article_id'].head(12).tolist()

CPU times: user 2min 10s, sys: 4.27 s, total: 2min 15s
Wall time: 2min 14s


In [64]:
%%time

if not DRY_RUN:
    sub = pd.read_csv('data/sample_submission.csv')

    preds_str = []
    for c in sub.customer_id:
        preds_str.append(' '.join(cust_id2pred[c]))

    sub.prediction = preds_str
    sub.to_csv(f'data/subs/{SUB_NAME}.csv.gz', index=False)

CPU times: user 15.7 s, sys: 428 ms, total: 16.2 s
Wall time: 16.2 s


In [65]:
!kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f 'data/subs/{SUB_NAME}.csv.gz' -m {SUB_NAME}

100%|██████████████████████████████████████| 53.3M/53.3M [00:03<00:00, 14.5MB/s]
Successfully submitted to H&M Personalized Fashion Recommendations

In [66]:
from utils import eval_sub

In [67]:
eval_sub(f'data/subs/{SUB_NAME}.csv.gz')

0.14255287342912734

In [68]:
eval_sub(f'data/subs/{SUB_NAME}.csv.gz', skip_cust_with_no_purchases=0)

0.007167646336415192