# SASRec & SSEPT
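A family of sequential recommenders built on transformers. A transformer encodes the user's preferences, which are represented by the sequence of items the user has previously purchased or browsed. Instead of a CNN (Caser) or an RNN (GRU4Rec, SLI-Rec, etc.), a transformer-based encoder produces a new representation of the item sequence. This notebook covers two models:
- Self-Attentive Sequential Recommendation (SASRec), which is based on the vanilla transformer and models only the item sequence
- Stochastic Shared Embedding based Personalized Transformer (SSEPT), which models users as well as items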
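
As a rough illustration of what a transformer encoder computes over an item sequence, the sketch below implements single-head causal self-attention in NumPy. It is not the `recommenders` implementation: the function name `causal_self_attention`, the random weight matrices, and the toy dimensions are assumptions made purely for illustration.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position t attends only to positions <= t."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Hide future positions so an item cannot look at later interactions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (shifted by the row maximum for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, dim = 4, 8  # toy sizes, not the notebook's hyperparameters
x = rng.normal(size=(seq_len, dim))  # stand-in for item embeddings
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)
print(np.round(weights, 2))  # upper triangle is zero: no attention to the future
```

Each output row is a new representation of the corresponding position that depends only on the items up to and including that position.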


In [1]:
%load_ext autoreload
%autoreload 2

import re
import sys
import os
import scrapbook as sb
import numpy as np
import pandas as pd
import tensorflow as tf
from tempfile import TemporaryDirectory
from collections import defaultdict
tf.get_logger().setLevel('ERROR')

from recommenders.utils.timer import Timer
from recommenders.datasets.amazon_reviews import get_review_data
from recommenders.datasets.split_utils import filter_k_core

# Transformer based models
from recommenders.models.sasrec.model import SASREC
from recommenders.models.sasrec.ssept import SSEPT

# Sampler for sequential prediction
from recommenders.models.sasrec.sampler import WarpSampler
from recommenders.models.sasrec.util import SASRecDataSet

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))


System version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
Tensorflow version: 2.7.3


## Input Parameters

In [2]:
epochs = 5
batch_size = 128
seed = 100

data_dir = os.path.join('..', '..' ,'tests', 'resources', 'deeprec', 'sasrec')

dataset = 'reviews_Electronics_5'

lr = 0.001
maxlen = 50              # maximum sequence length for each user
num_blocks = 2           # number of transformer blocks
hidden_units = 100       # number of units in the attention calculation
num_heads = 1            # number of attention heads
dropout_rate = 0.1
l2_emb = 0.0             # L2 regularization coefficient
num_neg_test = 100       # number of negative examples per positive example
model_name = 'ssept'     # 'sasrec' or 'ssept'


In [None]:
reviews_name = dataset + '.json'
outfile = dataset + '.txt'

reviews_file = os.path.join(data_dir, reviews_name)
if not os.path.exists(reviews_file):
    reviews_output = get_review_data(reviews_file)
else:
    reviews_output = os.path.join(data_dir, dataset+'.json_output')

In [4]:
if not os.path.exists(os.path.join(data_dir, outfile)):
    df = pd.read_csv(reviews_output, sep='\t', names=['userID', 'itemID', 'time'])
    print('df before filtering:', df.shape)
    print(df)
    # Keep only users and items that have at least 10 interactions
    df = filter_k_core(df, 10)
    print('df after filtering:', df.shape)
    print(df)
    
    # Map the raw string IDs to consecutive integers starting at 1
    user_set, item_set = set(df['userID'].unique()), set(df['itemID'].unique())
    user_map = dict()
    item_map = dict()
    for u, user in enumerate(user_set):
        user_map[user] = u+1
    for i, item in enumerate(item_set):
        item_map[item] = i+1
        
    df['userID'] = df['userID'].apply(lambda x:user_map[x])
    df['itemID'] = df['itemID'].apply(lambda x:item_map[x])
    # Sort by user and interaction time, then drop the time column before saving
    df = df.sort_values(by=['userID', 'time'])
    df.drop(columns=['time'], inplace=True)
    df.to_csv(os.path.join(data_dir, outfile), sep='\t', header=False, index=False)
    

df before filtering: (1689188, 3)
                 userID      itemID        time
0         AO94DHGC771SJ  0528881469  1370131200
1         AMO214LNFCEI4  0528881469  1290643200
2        A3N7T0DY83Y4IG  0528881469  1283990400
3        A1H8PY3QHMQQA0  0528881469  1290556800
4        A24EV6RXELQZ63  0528881469  1317254400
...                 ...         ...         ...
1689183  A34BZM6S9L7QI4  B00LGQ6HL8  1405555200
1689184  A1G650TTTHEAL5  B00LGQ6HL8  1405382400
1689185  A25C2M3QF9G7OQ  B00LGQ6HL8  1405555200
1689186   A1E1LEVQ9VQNK  B00LGQ6HL8  1405641600
1689187  A2NYK9KWFMJV4Y  B00LGQ6HL8  1405209600

[1689188 rows x 3 columns]
df after filtering: (347393, 3)
                       userID      itemID        time
1480967  A0251761JI35FM4C8VK2  B009NHAEXE  1359936000
1397003  A0251761JI35FM4C8VK2  B008HD3CTI  1359936000
1330225  A0251761JI35FM4C8VK2  B007SZ0EOW  1359936000
599564   A0251761JI35FM4C8VK2  B002HK8TE0  1359936000
718345   A0251761JI35FM4C8VK2  B0036Q7MV0  1359936000
...                  

SASRec requires a sequence input and a sequence target. The inputs to the model are:
- the user's item history, as input to the transformer
- the user's item history shifted by one step, as target for the transformer (positive examples)
- a sequence of items that differ from the positive examples (negative examples)

If a user has *N* items, *N-2* of them are used for training, and the remaining two are held out for validation and testing, respectively.   
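
The split and the one-step shift described above can be sketched as follows. This is a minimal illustration, not the code inside `SASRecDataSet` or `WarpSampler`: the helper names `split_history` and `make_input_and_target`, the handling of very short histories, and the use of 0 as the padding ID are assumptions (the preprocessing above starts IDs at 1, which leaves 0 free for padding in this sketch).

```python
def split_history(items):
    """Leave-last-two-out split: last item for test, second to last for validation."""
    if len(items) < 3:
        # Assumption: histories too short to split are used for training only.
        return list(items), [], []
    return list(items[:-2]), [items[-2]], [items[-1]]


def make_input_and_target(train_items, maxlen):
    """Left-pad with 0 and shift by one step so each position predicts the next item."""
    seq = [0] * maxlen  # model input
    pos = [0] * maxlen  # positive targets (the input shifted by one step)
    nxt = train_items[-1]
    idx = maxlen - 1
    for item in reversed(train_items[:-1]):
        seq[idx] = item
        pos[idx] = nxt
        nxt = item
        idx -= 1
        if idx < 0:
            break
    return seq, pos


history = [11, 12, 13, 14, 15, 16]
train, valid, test = split_history(history)
seq, pos = make_input_and_target(train, maxlen=5)

print(train, valid, test)  # [11, 12, 13, 14] [15] [16]
print(seq)                 # [0, 0, 11, 12, 13]
print(pos)                 # [0, 0, 12, 13, 14]
```

Every non-padding position in `seq` is paired with the item that followed it in `pos`, so a single sequence yields several training targets.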

The input file must have the following format:
- Each row contains a user-id and an item-id, both converted to integers.
- Rows are sorted by user-id and then by interaction time.
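
To make the format concrete, the sketch below parses a few tab-separated rows into one chronological item list per user. It only illustrates the file layout. `read_interactions` is a hypothetical helper, not the parser used by `SASRecDataSet`, and the sample rows are made up.

```python
import io
from collections import defaultdict


def read_interactions(handle, sep="\t"):
    """Group item IDs by user ID, preserving the row order of the file."""
    user_items = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(sep)
        user_items[int(user)].append(int(item))
    return dict(user_items)


# Rows are "user-id<TAB>item-id", already sorted by user and interaction time.
sample = io.StringIO("1\t5\n1\t3\n1\t9\n2\t3\n2\t7\n")
user_items = read_interactions(sample)
print(user_items)  # {1: [5, 3, 9], 2: [3, 7]}
```

Because the file carries no timestamp column, the row order is the only record of when each interaction happened, which is why the rows must be sorted before the file is written.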

In [8]:
input_file = os.path.join(data_dir, dataset+'.txt')
print(input_file)

data = SASRecDataSet(filename=input_file, col_sep='\t')
data.split()

num_steps = int(len(data.user_train)/batch_size)
cc = 0.0
for u in data.user_train:
    cc += len(data.user_train[u])
print('{} users and {} items'.format(data.usernum, data.itemnum))
print('average sequence length : {}'.format(cc/len(data.user_train)))

../../tests/resources/deeprec/sasrec/reviews_Electronics_5.txt
20247 users and 11589 items
average sequence length : 15.157751765693684


## Model Creation

In [9]:
if model_name == 'sasrec':
    model = SASREC(item_num=data.itemnum, seq_max_len=maxlen, num_blocks=num_blocks,
                  embedding_dim=hidden_units, attention_dim=hidden_units, attention_num_heads=num_heads,
                  dropout_rate=dropout_rate, conv_dims=[100, 100], l2_reg=l2_emb, num_neg_test=num_neg_test)
    
elif model_name == 'ssept':
    model = SSEPT(item_num=data.itemnum, user_num=data.usernum, seq_max_len=maxlen,
                  num_blocks=num_blocks, user_embedding_dim=10, item_embedding_dim=hidden_units,
                  attention_dim=hidden_units, attention_num_heads=num_heads, dropout_rate=dropout_rate,
                  conv_dims = [110, 110], l2_reg=l2_emb, num_neg_test=num_neg_test)
                  # embedding_dim=hidden_units,  # optional
        
else:
    raise ValueError('Model-{} not found'.format(model_name))


In [12]:
model

<recommenders.models.sasrec.ssept.SSEPT at 0x7f81fddf0650>

## Sampler
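For each batch, the sampler creates negative samples from the training data. It does this by looking at the user's original interaction history and drawing items that never appear in it. The sampler produces a sequence of negative items with the same length as the original history.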
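
The idea can be sketched in a few lines of plain Python. This is not the `WarpSampler` implementation: `sample_negatives` is a hypothetical helper, and it assumes item IDs run from 1 to `item_num` and that the user has not interacted with every item (otherwise the loop would never finish).

```python
import random


def sample_negatives(history, item_num, rng):
    """Draw one unseen item per history position by rejection sampling."""
    seen = set(history)
    negatives = []
    while len(negatives) < len(history):
        candidate = rng.randint(1, item_num)  # both bounds inclusive
        if candidate not in seen:
            negatives.append(candidate)
    return negatives


rng = random.Random(100)
history = [3, 7, 2, 9]
negatives = sample_negatives(history, item_num=20, rng=rng)
print(negatives)  # four item IDs, none of which appear in `history`
```

Rejection sampling is cheap here because a typical history is tiny compared with the item catalogue, so most random draws are already unseen items.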

In [13]:
sampler = WarpSampler(data.user_train, data.usernum, data.itemnum, batch_size=batch_size)

## Model Training
- The loss function is defined over all the negative and positive logits.
- A mask has to be applied to indicate the non-zero items present in the output.
- Also add the regularization loss here.
- Having a train-step signature function can speed up the training process.
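
A loss of this kind can be sketched in NumPy as a masked binary cross-entropy over the positive and negative logits. This is an assumption about the general shape of the objective, not a copy of the library's training step: `masked_bce_loss` and the `eps` constant are illustrative, padding is assumed to use ID 0, and the regularization term is left out.

```python
import numpy as np


def masked_bce_loss(pos_logits, neg_logits, target_ids, eps=1e-24):
    """Binary cross-entropy over positive and negative logits, ignoring padding."""
    mask = (target_ids != 0).astype(np.float64)  # 1 for real targets, 0 for padding
    pos_prob = 1.0 / (1.0 + np.exp(-pos_logits))
    neg_prob = 1.0 / (1.0 + np.exp(-neg_logits))
    # Push positive scores up and negative scores down at every position.
    per_position = -np.log(pos_prob + eps) - np.log(1.0 - neg_prob + eps)
    return float((per_position * mask).sum() / mask.sum())


target_ids = np.array([0, 0, 12, 13, 14])  # the first two positions are padding
pos_logits = np.array([5.0, 5.0, 2.0, 1.0, 0.5])
neg_logits = np.array([5.0, 5.0, -1.0, -2.0, 0.0])

loss = masked_bce_loss(pos_logits, neg_logits, target_ids)
print(round(loss, 4))
```

Without the mask, the logits at padded positions would contribute to the loss even though they correspond to no real interaction.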

In [14]:
with Timer() as train_time:
    t_test = model.train(data, sampler, num_epochs=epochs, batch_size=batch_size, lr=lr, val_epoch=6)

print('Time cost for training is {0:.2f} mins'.format(train_time.interval/60.0))



epoch: 5, test (NDCG@10: 0.299637104966323, HR@10: 0.4957)
Time cost for training is 2.79 mins




In [15]:
res_syn = {'ndcg@10': t_test[0], 'Hit@10': t_test[1]}
print(res_syn)

sb.glue("ndcg@10", t_test[0])
sb.glue("Hit@10", t_test[1])

{'ndcg@10': 0.299637104966323, 'Hit@10': 0.4957}
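
For reference, the sketch below shows how NDCG@10 and HR@10 are commonly computed when each held-out item is ranked against a set of sampled negatives (`num_neg_test = 100` in this notebook). It assumes that protocol. The evaluation code inside the library may differ in detail, and both helper functions and the example ranks are hypothetical.

```python
import numpy as np


def rank_of_positive(scores):
    """scores[0] belongs to the held-out item, the rest to negatives; rank 0 is best."""
    return int((scores[1:] > scores[0]).sum())


def ndcg_and_hit_at_k(ranks, k=10):
    """Average NDCG@k and hit rate@k when each user has exactly one relevant item."""
    ranks = np.asarray(ranks)
    hits = ranks < k
    # With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 2).
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 2), 0.0)
    return float(ndcg.mean()), float(hits.mean())


print(rank_of_positive(np.array([0.5, 0.9, 0.1, 0.7])))  # 2: two negatives score higher

ranks = [0, 3, 25]  # made-up ranks of the held-out item for three users
ndcg, hit = ndcg_and_hit_at_k(ranks)
print(ndcg, hit)
```

HR@10 only records whether the held-out item reached the top ten, while NDCG@10 also rewards placing it closer to the top, which is why NDCG@10 can never exceed HR@10.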
