**Edit**: 
I have created new notebooks that apply our custom function so that all items are used as input for recommendation:
* Using only interactions: https://www.kaggle.com/astrung/sequential-model-fixed-missing-last-item
* Using interactions with item features: https://www.kaggle.com/code/astrung/lstm-model-with-item-infor-fix-missing-last-item

- - -

In my previous [notebook](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial) about sequential models with RecBole, someone asked how test data is handled when `full_sort_topk` is used to produce the submitted recommendations, in this [comment](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial/comments#1723707) and this [comment](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial/comments#1723707). They doubted whether we are using all items to get the final recommendation. Most people have 2 questions about using `full_sort_topk` with test data:
1. Are items in the test data used as input features for getting recommendations?
2. If test data is necessary for getting recommendations through the RecBole API, how can we get recommendations without splitting into train/test data?

In this notebook I will answer both questions:
1. Yes. In sequential models, items in the test data are used as input features, except for the last item. For example, if user X has 3 items in the test data (A, B, C) and 5 items in the train data (a, b, c, d, e), the test data will generate 3 sample rows for evaluating performance on user X:
* Row 1: Input features: `a,b,c,d,e,0,0`. Output label: `A`. `0` is a pad item.
* Row 2: Input features: `a,b,c,d,e,A,0`. Output label: `B`.
* Row 3: Input features: `a,b,c,d,e,A,B`. Output label: `C`.

In my previous notebook, I used the result of the last row as the recommendation, **so we were still using nearly all items as input for recommendation, except the last item (item C)**. The recommendations in the previous notebooks may not be perfect, but the approach is simple enough as a tutorial for anyone who wants to get started.

**Note: This mechanism applies only to sequential models in RecBole. For other types of models it does not hold: they won't use items in the test data for getting recommendations. If you would like an explanation for other models, please upvote and comment, and I will explain it in another notebook.**

In the first section of this notebook, I will dig into the test data to prove this conclusion.

2. Yes, we can get recommendations by using all items as input features, without splitting train/test. To do this, you need to customize the RecBole prediction code:
* First, copy the last row in the dataset (its input features contain all items except the last one), then add the last item to the input features.
* Then predict directly from the model API, without using [full_sort_scores or full_sort_topk](https://recbole.io/docs/user_guide/usage/case_study.html).

In the second section of this notebook, I will show you how to do that.

OK, let's start.

# I. How test items are used in test data.

Each item in the test data generates one sample row. For example, if user X has 3 items in the test data (A, B, C) and 5 items in the train data (a, b, c, d, e), the test data will generate 3 sample rows for evaluating performance on user X:
* Row 1: Input features: `a,b,c,d,e,0,0`. Output label: `A`. `0` is a pad item.
* Row 2: Input features: `a,b,c,d,e,A,0`. Output label: `B`.
* Row 3: Input features: `a,b,c,d,e,A,B`. Output label: `C`.
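
The row-generation rule above can be illustrated with a small, library-free sketch. This is not RecBole's implementation; `build_test_rows` and `PAD` are hypothetical names used only for illustration, and the maximum length of 7 is taken from the toy example.

```python
PAD = 0  # placeholder for the pad item, as in the example above


def build_test_rows(train_items, test_items, max_len):
    """Return (input_sequence, label) pairs, one per test item."""
    rows = []
    history = list(train_items)
    for label in test_items:
        # Keep only the most recent max_len items, then right-pad with PAD.
        recent = history[-max_len:]
        padded = recent + [PAD] * (max_len - len(recent))
        rows.append((padded, label))
        # The label becomes part of the history for the next row.
        history.append(label)
    return rows


rows = build_test_rows(["a", "b", "c", "d", "e"], ["A", "B", "C"], max_len=7)
for seq, label in rows:
    print(seq, "->", label)
```

Note that the last label (`C`) never appears in any input sequence, which is exactly the gap this notebook addresses.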

To prove it, we will create a dataset, then extract the input features and labels from the test data.

### 1. Let's create test data in RecBole

In [None]:
!pip install recbole

In [None]:
import pandas as pd
df = pd.read_csv(r"/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv", 
                 dtype={'article_id': 'str'})
df.head()

In [None]:
import numpy as np
df['t_dat'] = pd.to_datetime(df['t_dat'], format="%Y-%m-%d")
df['timestamp'] = df.t_dat.values.astype(np.int64) // 10 ** 9
df.head()

In [None]:
temp = df[df['timestamp'] > 1585620000][['customer_id', 'article_id', 'timestamp']].rename(
    columns={'customer_id': 'user_id:token', 'article_id': 'item_id:token', 'timestamp': 'timestamp:float'})
temp

Create the data file in RecBole format

In [None]:
!mkdir /kaggle/working/recbox_data
temp.to_csv('/kaggle/working/recbox_data/recbox_data.inter', index=False, sep='\t')
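
As a rough illustration of what the resulting `.inter` file looks like, the sketch below writes two toy rows into an in-memory buffer with the same tab separator and `name:type` headers used in the rename above. The user id and timestamps are made up for illustration; only the layout matters.

```python
import csv
import io

# Toy interactions; the user id and timestamps are invented for illustration.
toy_rows = [
    ("user_1", "0698286004", 1585699200),
    ("user_1", "0861478002", 1585785600),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Header cells follow the "name:type" pattern used in the rename above.
writer.writerow(["user_id:token", "item_id:token", "timestamp:float"])
writer.writerows(toy_rows)

inter_text = buffer.getvalue()
print(inter_text)
```

The part before the colon in each header is the field name that `load_col`, `USER_ID_FIELD`, `ITEM_ID_FIELD` and `TIME_FIELD` refer to in the config further down.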

In [None]:
import gc
del temp
gc.collect()

In [None]:
import logging
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.sequential_recommender import GRU4Rec
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger

In [None]:
parameter_dict = {
    'data_path': '/kaggle/working',
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'TIME_FIELD': 'timestamp',
    'user_inter_num_interval': "[40,inf)",
    'item_inter_num_interval': "[40,inf)",
    'load_col': {'inter': ['user_id', 'item_id', 'timestamp']},
    'neg_sampling': None,
    'epochs': 2,
    'eval_args': {
        'split': {'RS': [9, 0, 1]},
        'group_by': 'user',
        'order': 'TO',
        'mode': 'full'}
}
config = Config(model='GRU4Rec', dataset='recbox_data', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()
# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)
logger.addHandler(c_handler)

# write config info into log
# logger.info(config)

Now let's start splitting train data and test data in RecBole

In [None]:
dataset = create_dataset(config)
logger.info(dataset)

In [None]:
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

### 2. Let's extract sample rows from test data.

We will check the items of user `0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef`.

We expect that the last items of this user will be used as labels in the test data.


In [None]:
last_item_ids = df[df.customer_id == '0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef'
                  ].tail(10).article_id.values
df[df.customer_id == '0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef'].tail(10)

In [None]:
last_item_ids

RecBole uses internal ids to identify user_id and item_id, so let's convert this user_id and their items into internal ids.
* customer_id: `0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef` -> internal user id: 2
* last bought item_id: [..., '0698286004', '0861478002', '0901955001'] -> internal item id: [..., 3237, 4377, 4559]
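
Conceptually, such a mapping is just a lookup table from external tokens to consecutive integers, with `0` reserved for padding. The sketch below is an assumption-laden illustration of that idea, not RecBole's internal code; `build_token_map` is a hypothetical helper and the ids it produces will not match the real internal ids listed above.

```python
def build_token_map(tokens, pad_token="[PAD]"):
    """Assign consecutive integer ids to tokens, reserving 0 for padding."""
    token_to_id = {pad_token: 0}
    for token in tokens:
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)
    return token_to_id


# Repeated tokens keep the id they were given the first time.
token_to_id = build_token_map(
    ["0698286004", "0861478002", "0698286004", "0901955001"]
)
id_to_token = {idx: tok for tok, idx in token_to_id.items()}
print(token_to_id)
```

In the notebook itself, `dataset.token2id(...)` does the external-to-internal lookup, as the next cells show.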

In [None]:
test_data.dataset.token2id(test_data.dataset.uid_field, 
                           '0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef')

In [None]:
print(dataset.token2id(dataset.iid_field, last_item_ids))

**Now let's extract the input features and labels from our test data.
The extraction code is copied from [this source](https://recbole.io/docs/_modules/recbole/utils/case_study.html#full_sort_scores)**

In [None]:
input_features = test_data.dataset[np.isin(test_data.dataset[test_data.dataset.uid_field].numpy(), [2])]
input_features

* **`item_id` in the above interaction is used as the label item**
* **`item_id_list` in the above interaction is used as the feature items**

Let's check it

In [None]:
print("test label: " + str(input_features['item_id']))
print("last 10 items from origin dataset: " + str(dataset.token2id(dataset.iid_field, last_item_ids)))

As expected, of the last 10 items, the last 5 are used as label items. So for evaluating this user, we have 5 sample rows in the test data:
* Input feature: ? -> Output: 6745
* Input feature: ? -> Output: 3237
* Input feature: ? -> Output: 3237
* Input feature: ? -> Output: 4377
* Input feature: ? -> Output: 4559

Now, let's check the input features in **item_id_list**

In [None]:
input_features['item_id_list']

We can see:
* The 1st row uses all items in training as input features.
* The 2nd row uses all items in training + the first label as input features.
* The 3rd row uses all items in training + the first label + the second label as input features.
* ...
* The last row uses all items except the last item as input features.
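
This pattern can also be checked programmatically instead of by eye. The sketch below works on plain Python lists with made-up toy ids; `is_incremental` is a hypothetical helper, not part of RecBole.

```python
def is_incremental(sequences, lengths, labels):
    """Check that each row's input is the previous row's input plus its label."""
    for i in range(1, len(sequences)):
        prev_items = sequences[i - 1][:lengths[i - 1]] + [labels[i - 1]]
        curr_items = sequences[i][:lengths[i]]
        # Compare against the tail of the previous items so that a full
        # window that dropped its oldest item still matches.
        if curr_items != prev_items[-len(curr_items):]:
            return False
    return True


# Toy rows: 0 is the pad item, the ids are invented for illustration.
toy_sequences = [[5, 8, 0, 0], [5, 8, 3, 0], [5, 8, 3, 7]]
toy_lengths = [2, 3, 4]
toy_labels = [3, 7, 9]
print(is_incremental(toy_sequences, toy_lengths, toy_labels))
```

On the real data, the three arguments would presumably come from `input_features['item_id_list'].tolist()`, `input_features['item_length'].tolist()` and `input_features['item_id'].tolist()`.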

In my previous notebooks ([here](https://www.kaggle.com/code/astrung/lstm-sequential-modelwith-item-features-tutorial) and [here](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial/notebook)), **I used the result of the last row for recommendation, so we were missing the information from the last item.**

So now let's fix it and find a way to use all items.

# II. Custom code for using all items in recommendation

We have seen that the last row is missing only the last item, so the idea for the fix is simple:
* copy the last row and add the last item to it as a new interaction (a row in the test dataset)
* make a prediction with the new interaction

So now let's train a dummy model to test it

### 1. Make dummy model

In [None]:
# model loading and initialization
model = GRU4Rec(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data)

In [None]:
model.eval()

In [None]:
input_features['item_id_list'].shape

Our item sequence always has a fixed length (50). So if we have more than 50 items, we need to drop the earliest items, and if there are fewer than 50 items, we need to pad the input item features with 0. For example:
* If the last row input = [3, 4, 7,..., 20, 0, 0, 0] (47 items < 50 items, so we have padding), after adding id=30 we will have input = [3, 4, 7,..., 20, 30, 0, 0]. Now our sequence length = 48 items.
* If the last row input = [3, 4, 7,..., 20, 9, 10, 12] (50 items), after adding id=30 we will have input = [4, 7,..., 9, 10, 12, 30] (drop the first item and append the new one). Now our sequence length is still 50 items.
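
Before the tensor version, here is the same rule as a minimal sketch on plain Python lists, with a short maximum length so the effect is easy to see. `append_item` is a hypothetical helper written only for illustration.

```python
def append_item(sequence, length, new_item, max_len):
    """Return (new_sequence, new_length) after adding new_item at the end."""
    items = list(sequence[:length])  # real items, without padding
    items.append(new_item)
    items = items[-max_len:]         # drop the oldest item if over the limit
    new_length = len(items)
    padded = items + [0] * (max_len - new_length)
    return padded, new_length


# Case 1: padding is available, so the new item fills the first pad slot.
print(append_item([3, 4, 7, 0, 0], 3, 30, max_len=5))
# Case 2: the sequence is full, so the oldest item is dropped.
print(append_item([3, 4, 7, 9, 10], 5, 30, max_len=5))
```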

Now let's implement it.

First, let's extract the last row from all interactions where internal_user_id = 2

In [None]:
index = np.isin(dataset[dataset.uid_field].numpy(), [2])
input_interaction = dataset[index]
input_interaction

Now let's add the last item to the sequence and build a new interaction.
We also need to update the sequence length (without padding).

In [None]:
import torch
from recbole.data.interaction import Interaction

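def add_last_item(old_interaction, last_item_id, max_len=50):
    # Clone so the original interaction's tensor is not modified in place
    new_seq_items = old_interaction['item_id_list'][-1].clone()
    last_len = old_interaction['item_length'][-1].item()
    if last_len < max_len:
        # There is still padding: write the item into the first pad slot
        new_seq_items[last_len] = last_item_id
    else:
        # Sequence is full: drop the oldest item and append the new one
        new_seq_items = torch.roll(new_seq_items, -1)
        new_seq_items[-1] = last_item_id
    return new_seq_items.view(1, len(new_seq_items))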

test = {
            'item_id_list': add_last_item(input_interaction, input_interaction['item_id'][-1].item(), model.max_seq_length),
            'item_length': torch.tensor(
                [input_interaction['item_length'][-1].item() + 1
                 if input_interaction['item_length'][-1].item() < model.max_seq_length else model.max_seq_length])
        }
new_inter = Interaction(test)
new_inter

An Interaction for the GRU4Rec model only needs `item_id_list` and `item_length`. You can drop the other keys.
If you want more information, you can check the [GRU4Rec code](https://recbole.io/docs/_modules/recbole/model/sequential_recommender/gru4rec.html#GRU4Rec)

Then we can apply the remaining prediction code from [full_sort_scores](https://recbole.io/docs/_modules/recbole/utils/case_study.html#full_sort_scores)


In [None]:
new_inter = new_inter.to(config['device'])
new_scores = model.full_sort_predict(new_inter)
new_scores = new_scores.view(-1, test_data.dataset.item_num)
new_scores[:, 0] = -np.inf  # set scores of [pad] to -inf
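
To make the last two steps concrete (masking the pad item, then taking the best-scoring items), here is a plain-Python sketch for a single score vector. It mirrors the idea of the masking line above followed by a top-k selection; `top_k_items` and the toy scores are hypothetical and only for illustration.

```python
import heapq
import math


def top_k_items(scores, k, pad_id=0):
    """Return the k highest-scoring item ids, never returning the pad id."""
    masked = list(scores)
    masked[pad_id] = -math.inf  # the pad item must not be recommended
    # Pick the k indices with the largest scores, best first.
    return heapq.nlargest(k, range(len(masked)), key=lambda i: masked[i])


# Index 0 is the pad item; it has the highest raw score but is masked out.
toy_scores = [9.0, 0.1, 2.5, 1.7, 3.2]
print(top_k_items(toy_scores, k=3))
```

The returned values are internal item ids, so they still need to be mapped back to external article ids before building a submission.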

By combining all these fragments, we get a new function for predicting with all items in the dataset. You can use this custom code for all sequential models.

In [None]:
import torch
from recbole.data.interaction import Interaction

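def add_last_item(old_interaction, last_item_id, max_len=50):
    # Clone so the original interaction's tensor is not modified in place
    new_seq_items = old_interaction['item_id_list'][-1].clone()
    last_len = old_interaction['item_length'][-1].item()
    if last_len < max_len:
        # There is still padding: write the item into the first pad slot
        new_seq_items[last_len] = last_item_id
    else:
        # Sequence is full: drop the oldest item and append the new one
        new_seq_items = torch.roll(new_seq_items, -1)
        new_seq_items[-1] = last_item_id
    return new_seq_items.view(1, len(new_seq_items))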

def predict_for_all_item(external_user_id, dataset, model):
    model.eval()
    with torch.no_grad():
        uid_series = dataset.token2id(dataset.uid_field, [external_user_id])
        index = np.isin(dataset[dataset.uid_field].numpy(), uid_series)
        input_interaction = dataset[index]
        test = {
            'item_id_list': add_last_item(input_interaction, 
                                          input_interaction['item_id'][-1].item(), model.max_seq_length),
            'item_length': torch.tensor(
                [input_interaction['item_length'][-1].item() + 1
                 if input_interaction['item_length'][-1].item() < model.max_seq_length else model.max_seq_length])
        }
        new_inter = Interaction(test)
        new_inter = new_inter.to(config['device'])
        new_scores = model.full_sort_predict(new_inter)
        new_scores = new_scores.view(-1, dataset.item_num)  # use the dataset passed in, not the global test_data
        new_scores[:, 0] = -np.inf  # set scores of [pad] to -inf
    return torch.topk(new_scores, 10)

In [None]:
predict_for_all_item('0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef', 
                     dataset, model)  # we pass the original dataset directly, not train_data or test_data

Congratulations! Now you can use all the data as the training set: you no longer need a test set, and you can still predict directly from the dataset. Now let's apply it to our previous notebooks.

I have created new notebooks that apply our custom function so that all items are used as input for recommendation:
* Using only interactions: https://www.kaggle.com/astrung/sequential-model-fixed-missing-last-item
* Using interactions with item features: https://www.kaggle.com/code/astrung/lstm-model-with-item-infor-fix-missing-last-item

Please check them out and upvote.