# Candidate ReRank Model using Handcrafted Rules
In this notebook, we present a "candidate rerank" model using handcrafted rules. We can improve this model by engineering features, merging them onto items and users, and training a reranker model (such as XGB) to choose our final 20. Furthermore, to tune and improve this notebook, we should build a local CV scheme to experiment with new logic and/or models.

UPDATE: I published a notebook to compute validation score [here][10] using Radek's scheme described [here][11].

Note that in this competition, a "session" actually means a unique "user". So our task is to predict what each of the `1,671,803` test "users" (i.e. "sessions") will do in the future. For each test "user" (i.e. "session") we must predict what they will `click`, `cart`, and `order` during the remainder of the week-long test period.

### Step 1 - Generate Candidates
For each test user, we generate possible choices, i.e. candidates. In this notebook, we generate candidates from 5 sources:
* User history of clicks, carts, orders
* Most popular 20 clicks, carts, orders during test week
* Co-visitation matrix of click/cart/order to cart/order with type weighting
* Co-visitation matrix of cart/order to cart/order called buy2buy
* Co-visitation matrix of click/cart/order to clicks with time weighting
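
As a minimal sketch of how a user's history and one co-visitation source can be combined into a candidate list: the function name `generate_candidates` and the toy `covisit` dictionary are illustrative assumptions, not this notebook's exact code.

```python
from collections import Counter
import itertools

def generate_candidates(history, covisit, n=20):
    # history: aids in chronological order; covisit: aid -> list of related aids.
    # Unique history items, most recent first.
    unique_aids = list(dict.fromkeys(history[::-1]))
    # Collect co-visited neighbours of every history item.
    neighbours = itertools.chain.from_iterable(
        covisit.get(aid, []) for aid in unique_aids
    )
    # Neighbours suggested by several history items come first.
    extra = [aid for aid, _ in Counter(neighbours).most_common(n)
             if aid not in unique_aids]
    return (unique_aids + extra)[:n]

# Toy example (hypothetical data).
toy_covisit = {1: [3, 4], 2: [3, 5]}
candidates = generate_candidates([1, 2, 1], toy_covisit)
```

Item `3` is proposed by both history items, so it is placed ahead of `4` and `5`.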

### Step 2 - ReRank and Choose 20
Given the list of candidates, we must select 20 to be our predictions. In this notebook, we do this with a set of handcrafted rules. We can improve our predictions by training an XGBoost model to select for us. Our handcrafted rules give priority to:
* Most recent previously visited items
* Items previously visited multiple times
* Items previously in cart or order
* Co-visitation matrix of cart/order to cart/order
* Current popular items
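
The first three priorities can be illustrated with a small worked example. This is a sketch that reuses the weighting scheme of `suggest_clicks` further below (log-spaced recency weights multiplied by a per-type multiplier); the helper name `score_history` and the toy session are assumptions for illustration.

```python
from collections import Counter
import numpy as np

type_weight_multipliers = {0: 1, 1: 6, 2: 3}  # click, cart, order

def score_history(aids, types):
    # Later events get larger weights, from 2**0.1 - 1 up to 2**1 - 1 = 1.
    weights = np.logspace(0.1, 1, len(aids), base=2, endpoint=True) - 1
    scores = Counter()
    for aid, w, t in zip(aids, weights, types):
        # Repeated items accumulate weight; carts and orders count more.
        scores[aid] += w * type_weight_multipliers[t]
    return scores

# Toy session in chronological order:
# item 7 is clicked twice, item 9 is carted, item 5 is the latest click.
scores = score_history(aids=[7, 9, 7, 5], types=[0, 1, 0, 0])
ranked = [aid for aid, _ in scores.most_common()]
```

Here the carted item `9` ranks first, and the most recent click `5` still outranks item `7` even though `7` was clicked twice, because recency weights grow quickly toward the end of the session.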

![](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Nov-2022/c_r_model.png)
  
# Credits
We thank the many Kagglers who have shared ideas. We use the co-visitation matrix idea from Vladimir [here][1]. We use the groupby sort logic from Sinan in the comment section [here][4]. We use the duplicate prediction removal logic from Radek [here][5]. We use the multiple visit logic from Pietro [here][2]. We use the type weighting logic from Ingvaras [here][3]. We use leaky test data from my previous notebook [here][4]. And some ideas may have originated from Tawara [here][6] and KJ [here][7]. We use Colum2131's parquets [here][8]. The image above is from Ravi's discussion about candidate rerank models [here][9].

[1]: https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix
[2]: https://www.kaggle.com/code/pietromaldini1/multiple-clicks-vs-latest-items
[3]: https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items
[4]: https://www.kaggle.com/code/cdeotte/test-data-leak-lb-boost
[5]: https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic
[6]: https://www.kaggle.com/code/ttahara/otto-mors-aid-frequency-baseline
[7]: https://www.kaggle.com/code/whitelily/co-occurrence-baseline
[8]: https://www.kaggle.com/datasets/columbia2131/otto-chunk-data-inparquet-format
[9]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721
[10]: https://www.kaggle.com/cdeotte/compute-validation-score-cv-564
[11]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991

# Notes
Below are notes about versions:
* **Version 1 LB 0.573** Uses popular ideas from public notebooks and adds additional co-visitation matrices and additional logic. Has CV `0.563`. See validation notebook version 2 [here][1].
* **Version 2 LB 0.573** Refactors the logic of `suggest_buys(df)` to make clear how the new co-visitation matrices rerank the candidates by adding to candidate weights. The new logic also boosts CV by `+0.0003`, and LB is slightly better. See validation notebook version 3 [here][1].
* **Version 3** is the same as version 2 but 1.5x faster co-visitation matrix computation!
* **Version 4 LB 0.575** Uses top20 for clicks and top15 for carts and buys (instead of top40 and top40). This boosts CV by `+0.0015`, hooray! New CV is `0.5647`. See validation version 5 [here][1].
* **Version 5** is the same as version 4 but 2x faster co-visitation matrix computation! (and 3x faster than version 1)
* **Version 6** Stay tuned for more versions...

[1]: https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564

# Step 1 - Candidate Generation with RAPIDS
For candidate generation, we build three co-visitation matrices. The first computes the popularity of cart/order given a user's previous click/cart/order. We apply type weighting to this matrix. The second computes the popularity of cart/order given a user's previous cart/order. We call this the "buy2buy" matrix. The third computes the popularity of clicks given a user's previous click/cart/order. We apply time weighting to this matrix. We will use RAPIDS cuDF on GPU to compute these matrices quickly!
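
To make the idea concrete, here is a minimal CPU sketch of a type-weighted co-visitation matrix using pandas instead of cuDF. The one-day pairing window, the `TOP_K` value, and the function name `covisit_type_weighted` are assumptions for illustration; the type weights are the same values as `type_weight_multipliers` used later in this notebook.

```python
import pandas as pd

# Toy event log: ts in seconds, type 0 = click, 1 = cart, 2 = order.
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 20, 30, 10, 20],
    "ts":      [0, 60, 120, 0, 30],
    "type":    [0, 1, 2, 0, 0],
})

TYPE_WEIGHT = {0: 1, 1: 6, 2: 3}  # same values as type_weight_multipliers
WINDOW = 24 * 60 * 60             # assumed pairing window (one day, in seconds)
TOP_K = 2                         # assumed number of neighbours kept per item

def covisit_type_weighted(df, top_k=TOP_K):
    # Pair every event with every other event of the same session.
    pairs = df.merge(df, on="session")
    # Keep pairs of different items that occurred close together in time.
    close = (pairs["ts_x"] - pairs["ts_y"]).abs() < WINDOW
    pairs = pairs[(pairs["aid_x"] != pairs["aid_y"]) & close]
    # Count each (session, aid_x, aid_y) pair only once.
    pairs = pairs.drop_duplicates(subset=["session", "aid_x", "aid_y"])
    # Weight each pair by the type of the target event (aid_y).
    pairs = pairs.assign(wgt=pairs["type_y"].map(TYPE_WEIGHT))
    scores = pairs.groupby(["aid_x", "aid_y"])["wgt"].sum().reset_index()
    # Keep the top_k heaviest neighbours of each aid_x.
    scores = scores.sort_values(["aid_x", "wgt"], ascending=[True, False])
    return scores.groupby("aid_x").head(top_k)

top = covisit_type_weighted(events)
neighbours = top.groupby("aid_x")["aid_y"].apply(list).to_dict()
```

The self-merge grows quadratically with session length, which is one reason a GPU dataframe library and chunked processing are attractive at full data scale.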

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


# word2vec

https://www.kaggle.com/code/mujrush/word2vec-how-to-training-and-submission-a93c6e/edit

In [2]:
INPUT_DIR = '/content/drive/MyDrive/kaggle/2022/OTTO/input/Otto Full Optimized Memory Footprint/'

In [3]:
!pip install polars

import polars as pl
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

train = pl.read_parquet(INPUT_DIR+'train.parquet')
test = pl.read_parquet(INPUT_DIR+'test.parquet')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)

In [5]:
sentences = sentences_df['sentence'].to_list()

In [6]:
import gensim
gensim.__version__

'4.2.0'

In [7]:
# After installing, restart the runtime and re-run all cells
!pip install gensim==4.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
import gensim
gensim.__version__

'4.2.0'

In [9]:
%%time

w2vec = Word2Vec(sentences=sentences, vector_size=32, min_count=1, workers=4)

CPU times: user 1h 17min 15s, sys: 21.9 s, total: 1h 17min 37s
Wall time: 25min 48s


In [10]:
!pip install annoy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
%%time

from annoy import AnnoyIndex

aid2idx = {aid: i for i, aid in enumerate(w2vec.wv.index_to_key)}
index = AnnoyIndex(32, 'euclidean')

for aid, idx in aid2idx.items():
    index.add_item(idx, w2vec.wv.vectors[idx])
    
index.build(10)

CPU times: user 50.7 s, sys: 3.69 s, total: 54.4 s
Wall time: 20.9 s


True

# Step 2 - ReRank (choose 20) using handcrafted rules
For a description of the handcrafted rules, see this notebook's introduction.

In [12]:
import numpy as np
import pandas as pd

import collections
from collections import Counter

import lightgbm as lgb
from sklearn.model_selection import GroupKFold
import pickle

import glob

import gc

import itertools

In [13]:
INPUT_DIR = '/content/drive/MyDrive/kaggle/2022/OTTO/input/otto-chunk-data-inparquet-format/'

In [14]:
type_labels = {'clicks':0, 'carts':1, 'orders':2}

def load_test():    
    dfs = []
    for e, chunk_file in enumerate(glob.glob(INPUT_DIR+'test_parquet/*')):
        chunk = pd.read_parquet(chunk_file)
        chunk.ts = (chunk.ts/1000).astype('int32')
        chunk['type'] = chunk['type'].map(type_labels).astype('int8')
        dfs.append(chunk)
    return pd.concat(dfs).reset_index(drop=True) #.astype({"ts": "datetime64[ms]"})

test_df = load_test()
print('Test data has shape',test_df.shape)
test_df.head()

Test data has shape (6928123, 4)


| | session | aid | ts | type |
|---|---|---|---|---|
| 0 | 12899779 | 59625 | 1661724000 | 0 |
| 1 | 12899780 | 1142000 | 1661724000 | 0 |
| 2 | 12899780 | 582732 | 1661724058 | 0 |
| 3 | 12899780 | 973453 | 1661724109 | 0 |
| 4 | 12899780 | 736515 | 1661724136 | 0 |


In [15]:
%%time

VER = 5
DISK_PIECES = 4

def pqt_to_dict(df):
    return df.groupby('aid_x').aid_y.apply(list).to_dict()
# LOAD "CLICKS" CO-VISITATION MATRIX (STORED IN DISK_PIECES PARTS)
top_20_clicks = pqt_to_dict( pd.read_parquet(f'/content/drive/MyDrive/kaggle/2022/OTTO/input/cris_baseline/output/top_20_clicks_v{VER}_0.pqt') )
for k in range(1,DISK_PIECES): 
    top_20_clicks.update( pqt_to_dict( pd.read_parquet(f'/content/drive/MyDrive/kaggle/2022/OTTO/input/cris_baseline/output/top_20_clicks_v{VER}_{k}.pqt') ) )



# LOAD "CART ORDER" AND "BUY2BUY" CO-VISITATION MATRICES
top_20_buys = pqt_to_dict( pd.read_parquet(f'/content/drive/MyDrive/kaggle/2022/OTTO/input/cris_baseline/output/top_15_carts_orders_v{VER}_0.pqt') )
for k in range(1,DISK_PIECES): 
    top_20_buys.update( pqt_to_dict( pd.read_parquet(f'/content/drive/MyDrive/kaggle/2022/OTTO/input/cris_baseline/output/top_15_carts_orders_v{VER}_{k}.pqt') ) )

top_20_buy2buy = pqt_to_dict( pd.read_parquet(f'/content/drive/MyDrive/kaggle/2022/OTTO/input/cris_baseline/output/top_15_buy2buy_v{VER}_0.pqt') )



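# TOP CLICKS AND ORDERS IN TEST
# 'type' WAS MAPPED TO INTEGER CODES IN load_test(), SO COMPARE AGAINST type_labels
top_clicks = test_df.loc[test_df['type']==type_labels['clicks'],'aid'].value_counts().index.values[:20]
top_orders = test_df.loc[test_df['type']==type_labels['orders'],'aid'].value_counts().index.values[:20]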

print('Here are size of our 3 co-visitation matrices:')
print( len( top_20_clicks ), len( top_20_buy2buy ), len( top_20_buys ) )

Here are size of our 3 co-visitation matrices:
1837166 1168768 1837166
CPU times: user 1min 35s, sys: 13.4 s, total: 1min 48s
Wall time: 1min 56s


In [16]:
#type_weight_multipliers = {'clicks': 1, 'carts': 6, 'orders': 3}
type_weight_multipliers = {0: 1, 1: 6, 2: 3}

def suggest_clicks(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(20) if aid2 not in unique_aids]    
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # FILL REMAINING SLOTS WITH WORD2VEC NEAREST NEIGHBORS OF THE MOST RECENT AID
    
    ##### word2vec ##### 
    most_recent_aid = unique_aids[0]
    w2vec_aid = [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]]
    #################### 
    
    return result + list(w2vec_aid)[:20-len(result)]

def suggest_buys(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    df = df.loc[(df['type']==1)|(df['type']==2)]
    unique_buys = list(dict.fromkeys( df.aid.tolist()[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.5,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        # RERANK CANDIDATES USING "BUY2BUY" CO-VISITATION MATRIX
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        for aid in aids3: aids_temp[aid] += 0.1
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2+aids3).most_common(20) if aid2 not in unique_aids] 
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    
    ##### word2vec ##### 
    most_recent_aid = unique_aids[0]
    w2vec_aid = [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]]
    #################### 

    return result + list(w2vec_aid)[:20-len(result)]

# Create Submission CSV
Inferring test data with Pandas groupby is slow. We need to accelerate the following code.

In [17]:
%%time
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)

pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_buys(x)
)

CPU times: user 34min 44s, sys: 14 s, total: 34min 58s
Wall time: 34min 43s
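
One possible way to reduce the per-group overhead of `groupby().apply()` is to sort once and slice plain NumPy arrays at session boundaries. The sketch below assumes a hypothetical list-based function `fn(aids, types)`; `suggest_clicks` and `suggest_buys` would need to be adapted to take lists instead of a DataFrame. Whether this is faster in practice has not been measured here.

```python
import numpy as np
import pandas as pd

def apply_per_session(df, fn):
    """Call fn(aids, types) once per session, without groupby().apply()."""
    if len(df) == 0:
        return pd.Series([], dtype=object, name="labels")
    df = df.sort_values(["session", "ts"])
    sessions = df["session"].to_numpy()
    aids = df["aid"].to_numpy()
    types = df["type"].to_numpy()
    # Positions where a new session starts in the sorted arrays.
    starts = np.flatnonzero(np.r_[True, sessions[1:] != sessions[:-1]])
    ends = np.r_[starts[1:], len(sessions)]
    keys, values = [], []
    for s, e in zip(starts, ends):
        keys.append(int(sessions[s]))
        values.append(fn(aids[s:e].tolist(), types[s:e].tolist()))
    return pd.Series(values, index=pd.Index(keys, name="session"), name="labels")

# Toy usage with a stand-in function that returns the history, most recent first.
toy = pd.DataFrame({"session": [2, 1, 1], "aid": [30, 10, 20],
                    "ts": [5, 1, 2], "type": [0, 0, 1]})
preds = apply_per_session(toy, lambda aids, types: aids[::-1])
```

The result is a Series indexed by `session`, the same shape that the `groupby(["session"]).apply(...)` calls above produce.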


In [18]:
clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_carts"), columns=["labels"]).reset_index()

In [19]:
class CFG:
  exp = 'Candidate_ReRank_Model_LB0575_and_w2vec'

In [20]:
OUTPUT_DIR = f'/content/drive/MyDrive/kaggle/2022/OTTO/output/cris_baseline/{CFG.exp}_'

In [21]:
pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str,x)))
pred_df.to_csv(OUTPUT_DIR + "submission.csv", index=False)
pred_df.head()

| | session_type | labels |
|---|---|---|
| 0 | 12899779_clicks | 59625 1253524 737445 438191 731692 1790770 942... |
| 1 | 12899780_clicks | 1142000 736515 973453 582732 1502122 889686 48... |
| 2 | 12899781_clicks | 918667 199008 194067 57315 141736 1460571 7594... |
| 3 | 12899782_clicks | 834354 595994 740494 889671 987399 779477 1344... |
| 4 | 12899783_clicks | 1817895 607638 1754419 1216820 1729553 300127 ... |
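
Before uploading, a quick sanity check of the submission frame can catch formatting slips. The helper below is a sketch (the name `check_submission` and the set of checks are assumptions): it verifies the column names, the `_clicks`/`_carts`/`_orders` suffixes, the limit of 20 aids per row, and repeated aids within a row. Repeats can occur because the word2vec neighbors appended in `suggest_clicks` and `suggest_buys` are not filtered against the items already selected.

```python
import pandas as pd

def check_submission(sub, max_items=20):
    """Return a list of human-readable problems found in a submission frame."""
    problems = []
    if list(sub.columns) != ["session_type", "labels"]:
        return ["unexpected columns"]
    if not sub["session_type"].is_unique:
        problems.append("duplicate session_type rows")
    # The part after the last underscore should be one of the three types.
    suffix = sub["session_type"].str.rsplit("_", n=1).str[-1]
    if not set(suffix) <= {"clicks", "carts", "orders"}:
        problems.append("unknown type suffix")
    items = sub["labels"].str.split()
    if (items.str.len() > max_items).any():
        problems.append("more than max_items aids in a row")
    if items.map(lambda xs: len(xs) != len(set(xs))).any():
        problems.append("repeated aid within a row")
    return problems

# Toy frames (hypothetical data).
good = pd.DataFrame({"session_type": ["1_clicks", "1_carts", "1_orders"],
                     "labels": ["10 20", "10", "20 30"]})
bad = pd.DataFrame({"session_type": ["1_clicks"], "labels": ["10 10"]})
```

Calling `check_submission(pred_df)` on the frame built above would return an empty list if none of these problems is present.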
