Sequential Recommendation #119
To use information about a user's sequence of interactions in TFRS, you need to write the appropriate user model. For example, suppose your training dataset contains tuples of (past watches, new watch). If that's the case, you could set up a model like this and train and evaluate it exactly like any of the tutorial models:

class Model(tfrs.Model):

    def __init__(self):
        super().__init__()
        self._movie_model = tf.keras.layers.Embedding(...)
        self._user_model = tf.keras.Sequential([
            # Look up embeddings of past watches.
            tf.keras.layers.Embedding(...),
            # Summarize them using, for example, a recurrent layer.
            tf.keras.layers.GRU(),
            # Perhaps more layers here.
        ])
        self._task = tfrs.tasks.Retrieval(...)

    def compute_loss(self, inputs, training=False):
        past_watches, new_watch = inputs
        new_watch_embedding = self._movie_model(new_watch)
        user_embedding = self._user_model(past_watches)
        return self._task(user_embedding, new_watch_embedding)
Thanks for your reply; it is very enlightening. A few more questions regarding this topic:
Last but not least, after all these questions are clear, can I open a PR adding it to the documentation?
You may have to define for me what you mean by "dense layer" in your question. I'll try to clarify in the meantime:
Let's figure this out first - once we do, we can think about documenting this.
By a dense layer with one unit, I mean a tf.keras.layers.Dense layer with a single unit. I am finally getting hands-on with deep learning, but almost everything I know is research-related, where many implementation details remain implicit. It may be more appropriate for me to explain my goal:
Got it. To the best of my knowledge, the example I wrote above gives you a more-or-less full implementation of GRU4Rec, assuming you transform your data into pairs of (past watches, next watch). If you're predicting the next item, you're (to a first approximation) fitting a retrieval model, for which you'd use the Retrieval task. For your model, then, you need a candidate (item) model alongside the sequence-based user model; looking at my example, your user model corresponds to the embedding-plus-GRU stack above. In terms of metrics, TFRS's default factorized top-K metrics apply here as well.
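(As a rough illustration of the kind of data transformation meant here, with made-up item ids and an assumed context length of three, not details taken from this thread:)

import tensorflow as tf

# Toy watch histories, one list per user, in chronological order.
histories = [
    ["a", "b", "c", "d", "e"],
    ["u", "v", "w", "x"],
]

def sliding_pairs(history, context_len=3):
    # Each window of `context_len` past watches predicts the item that follows it.
    return [
        (history[i:i + context_len], history[i + context_len])
        for i in range(len(history) - context_len)
    ]

pairs = [pair for h in histories for pair in sliding_pairs(h)]
past_watches = tf.constant([p for p, _ in pairs])  # shape: (num_examples, 3)
next_watch = tf.constant([n for _, n in pairs])    # shape: (num_examples,)
train_ds = tf.data.Dataset.from_tensor_slices((past_watches, next_watch))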
Thank you @maciejkula for all those clarifications. @juliobguedes I am currently working on the same use case, but for retail purchases. I will keep you updated if I make a good notebook example, and I will open a PR adding it to the examples if it can be useful for other people. :-)
I spent some time debugging my code (I had a few bugs in my own implementation) but I understood your comments and I am trying to implement them. I am having one more problem: when I am training the model, I always get this error and the model never gets past the first epoch.
After these prints, I have no updates for quite some time, leading me to stop the execution. Let me add a few things that may help figure out the problem. As previously stated, my input is a sequence of track ids, which I preprocess as follows:

vocab = StringLookup(vocabulary=vocabulary)
sequences = [vocab(seq) for seq in input_seq]
padded = pad_sequences(sequences, maxlen=seq_len)

After this process:
I have coded the model similarly to your example:

class GRU4Rec(TfrsModel):
    """
    Defines a GRU4Rec Tensorflow model. This model was initially
    described by Hidasi et al in ...
    """

    def __init__(
            self, ds, embedding_size=100,
            gru_units=100, gru_activation='tanh',
            dense_units=100, dense_activation='softmax',
            loss='cross-entropy'):
        super().__init__()
        self.embedding_size = embedding_size
        self.gru_units = gru_units
        self.gru_activation = gru_activation
        self.dense_activation = dense_activation
        self.vocab = ds.vocab
        self.seq_length = ds.sequence_length
        self.num_users = ds.num_users
        self.num_items = ds.num_items
        self._build_model(ds.candidates())

    def _build_model(self, candidates):
        self.user_model = Sequential([
            Embedding(self.num_items, self.embedding_size, input_length=self.seq_length),
            GRU(self.gru_units, activation=self.gru_activation),  # HIDDEN LAYER
        ], name='User_Model')
        self.item_model = Sequential([
            self.vocab,
            Embedding(self.num_items, self.embedding_size)
        ], name='Item_Model')
        self.task = Retrieval(metrics=FactorizedTopK(candidates.map(self.item_model)))

    def compute_loss(self, inputs, training=False):
        watching_history, next_item = inputs
        history_embedding = self.user_model(watching_history)
        next_item_embedding = self.item_model(next_item)
        return self.task(history_embedding, next_item_embedding)

So, why isn't my model fitting? Please let me know if I should add any other details.
Thanks for the detailed description: would you mind putting it into a colab with some dummy data so that I could run it?
Off the top of my head, this is what I think might be happening:
Please give these a go and let me know!
Understood. I am doing this right now and will get back to you soon.
Ok, sorry for taking so long. Here is the link to the colab: https://colab.research.google.com/drive/1pmhsZxG_rVij7hT3qCt-_8xARb-Aq4aF?usp=sharing
I was able to build a similar dataset (although much smaller) using the Last.fm 1K users dataset. Using Colab, it got past the first epoch, but the results are still not correct.
Thanks for the notebook! When you say that the results are not correct, what do you mean?
Ignore my last statement. The metrics are being computed; it just wasn't visible due to the param you pointed out earlier. A new question: to predict the next item, how would I use the user_model and item_model? Do I have to add dense layers as in here?
Right, that's because we turned off the metric computation during training - if you run evaluation explicitly, you will see them. There are a couple of things wrong here:
With these two changes I get reasonable-looking evaluation metrics, with the whole evaluation loop taking about 30 seconds.
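(For reference, a minimal sketch of the compute_metrics pattern from the TFRS tutorials, applied to the model above; this is an illustration, not the exact change made in the notebook.)

# Inside the GRU4Rec model class:
    def compute_loss(self, inputs, training=False):
        watching_history, next_item = inputs
        history_embedding = self.user_model(watching_history)
        next_item_embedding = self.item_model(next_item)
        # Skip the expensive FactorizedTopK computation while training;
        # it still runs when the model is evaluated.
        return self.task(history_embedding, next_item_embedding,
                         compute_metrics=not training)

Running model.evaluate on the evaluation dataset (for example model.evaluate(test_ds.batch(4096), return_dict=True), where test_ds stands for whatever held-out data is used) then reports the top-K metrics explicitly.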
Nice, that's awesome. I am now able to run it; it still takes some time since I am not running it on a GPU. Considering a real research scenario: I have the ground truth for the next K predictions in my datasets and would like to compute metrics such as Recall@K or NDCG@K. How would I do that? Thank you so much for your help so far, I'm really learning a lot here :)
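(One possible direction, sketched here as an assumption rather than an answer given in the thread: FactorizedTopK accepts custom Keras metrics, so top-k categorical accuracy, which equals Recall@K when there is a single positive per example, and ranking metrics from TensorFlow Ranking can be plugged in. The names candidates and item_model stand in for the objects defined in the model above.)

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_ranking as tfr

metrics = tfrs.metrics.FactorizedTopK(
    candidates=candidates.batch(1024).map(item_model),
    metrics=[
        tf.keras.metrics.TopKCategoricalAccuracy(k=10, name="recall_at_10"),
        tfr.keras.metrics.NDCGMetric(name="ndcg"),
    ],
)
task = tfrs.tasks.Retrieval(metrics=metrics)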
So, I was able to use the BruteForce layer based on the tutorial examples. I am using the very same model I posted earlier in the colab. One weird thing that keeps happening is: while my training loss decreases, my validation loss increases, and I can only think of this being related to these messages:
What causes them? They are even present in the quickstart's last cell output, and the regularization loss is always 0.
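(For reference, a rough sketch of what using the BruteForce layer can look like, following the TFRS tutorials; model, candidates, and history are assumptions standing in for the notebook's objects, and the indexing API differs slightly across TFRS versions.)

import tensorflow_recommenders as tfrs

# Build an index over all candidate items, scored against the trained user model.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    candidates.batch(128).map(lambda ids: (ids, model.item_model(ids)))
)

# `history` is a batch of padded past-interaction sequences, in the same
# format the user model saw during training.
scores, predicted_ids = index(history)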
How are you computing your validation loss? Are the test metrics on the validation set also deteriorating over time? I suspect that
As we talked about earlier, I am using the last N interactions to predict the (N+1)th interaction. In order to create a validation set, I applied the same idea, using the last N-1 interactions to predict the Nth during training, and the last N to predict the (N+1)th during validation. To ensure that the parameter
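(A small illustrative sketch of that split for a single user's chronologically ordered history; the item names are made up.)

history = ["t1", "t2", "t3", "t4", "t5", "t6"]

# Training example: everything before the second-to-last item predicts it.
train_input, train_target = history[:-2], history[-2]

# Validation example: everything before the last item predicts it.
valid_input, valid_target = history[:-1], history[-1]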
It looks like your validation doesn't run at all - certainly the metrics aren't updated (the symptom is the metric values being all zero). The metrics are computed on the training set, however. Did you set that parameter? What do you get if you run it explicitly?
I didn't set it.
I'm trying to run again using topk instead of mrr.
The numbers you have here suggest severe overfitting - the MRR metric is much higher on the training set than on the evaluation set. I'm trying to disentangle two things here:
If (1) doesn't happen, it might be that there is a bug in the library. (2) is a tuning/modelling issue.
I thought it might be overfitting, but I just can't understand how and why it is overfitting. As you can see in the plot, I am using only 20 epochs, but by the 5th the validation loss is already increasing. My model is almost the same as the one you posted in your very first comment:
In the plots from my last comment, I ran again using the same data and topk, and these are the results: You can check the complete logs here. I have my code in a private GitHub repo, but I can make it public so you can see it, if that's necessary.
Cool, thanks for the details. I'm afraid this does look like overfitting! A couple of things to try:
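(As an illustration of typical remedies for overfitting in a model like this: reducing the embedding dimension, which a later reply confirms was one of the suggestions, and adding dropout to the GRU. The dropout values below are assumptions, not from this thread; num_items and seq_length are as in the GRU4Rec model above.)

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GRU

embedding_size = 32  # down from 100

user_model = Sequential([
    Embedding(num_items, embedding_size, input_length=seq_length),
    GRU(embedding_size, dropout=0.2, recurrent_dropout=0.2),
], name='User_Model')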
I was running with a sample of 1% of the dataset, but I had around 5700 examples for training and validation, and 2k examples for testing. I have now changed to 10% of the dataset, with 72k examples for training and validation, and 18k examples for testing. I also applied your suggestion of reducing the embedding dim to 32. It will take some time before I can get back to you.
About the 4th point: I believe the data is correct, as I have checked it multiple times, but if I still can't solve the problem, I'll get back to that.
So, I ran again following your comments, and you can see the complete logs here. There is a memory error at the end, but it's not related to the model itself; I have already tried to fix it and am waiting for the output to see whether the fix worked, so we can ignore it here. As you stated, I have now started to agree that the model is overfitting easily, which would seem odd if I were using the entire dataset rather than a sample of it. Here are the results: If the validation loss makes sense to you as simply overfitting, I am OK with closing this issue, but I wonder whether overfitting alone would explain the validation top-k not increasing considerably (epoch 1 was 0.0002, epoch 10 was 0.0031).
Hi @juliobguedes, I am currently facing the same kind of issue with a similar architecture. Did you have any updates about the way you worked around this problem? Thank you.
Hi @anisayari, I have worked with this architecture using TF Recommenders for almost 2 months, but I was not able to achieve any results. I tried to change the architecture (increasing the number of GRU layers, adding regularization, and so on) and to increase the dataset, but none of the changes seemed to result in metrics better than 10^-4. If you find any results better than this, I'd like to understand your thoughts and implementation.
This is somewhat surprising to me. Julio's code here seems solid, and the evaluation metrics on the training set make sense. Julio, how do you do your train/test split? Is it possible that the candidates you are trying to predict in the test set differ significantly from your training set? It could be that the majority of the test set targets aren't in the training set: for example, if you perform your train/test split by time, the test candidates may be mostly new items not represented in the training set. You could have a look at the 20 most popular targets in the train and test sets; if they do not overlap, you have a lot of distribution shift. One other thing to look at is your vocabulary construction: what proportion of the test set maps to the OOV bucket?
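(A quick sketch of those two checks; train_targets, test_targets, and vocabulary are assumed to be the raw next-item ids of each split and the vocabulary list used to build the StringLookup.)

from collections import Counter

top_train = {item for item, _ in Counter(train_targets).most_common(20)}
top_test = {item for item, _ in Counter(test_targets).most_common(20)}
print("overlap among the top-20 targets:", len(top_train & top_test), "of 20")

vocab_set = set(vocabulary)
oov_fraction = sum(t not in vocab_set for t in test_targets) / len(test_targets)
print("fraction of test targets mapping to the OOV bucket:", oov_fraction)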
Let me explain my entire process.
I have not checked the distribution of items between training/validation and testing. I still have to modify a few things in my code before open-sourcing it, but that was it. One very important thing in sequential recommendation is to use the proper losses, but these aren't yet implemented in TF Recommenders; I tried to use the losses implemented in TF Ranking, but the results were the same. Please let me know if I can help with something else.
Thank you for the detailed write-up! In the end, how many training sequences did you have? Based on the dataset statistics you shared, I would say that your data is simply too sparse to train an effective model. You get great training accuracy because GRUs + embeddings are excellent at memorization, but you will get terrible test accuracy, because you will be asking the model to extrapolate far beyond what it has seen. The MovieLens 20M dataset has 20M interactions with 27,000 movies; your dataset has 20M interactions with 1M items. This makes your dataset roughly 40 times sparser (about 740 interactions per movie versus about 20 per item). How many sequences did you have in your largest dataset? It's hard to say what would be a sufficient number of observations here, but I would look to have at least 100M sequences. @anisayari I think Julio's approach here is good, and you could follow his steps; I think the main problem is simply not enough data. @juliobguedes what do you mean when you say "proper losses"? The in-batch softmax loss implemented by the Retrieval task
I am explaining a lot of things as I understood them from the papers and code I read. You may know this better than me and not need the explanation; if so, I'd appreciate any corrections. One of the ideas of sequential recommendation, when avoiding matrices (matrix factorization and so on), is that it is also able to solve the recommendation problem with sparse datasets. We can see this in the GRU4Rec paper, which performs its experiments on two datasets, one not that sparse (31M interactions with 37k items) and another sparse (13M interactions with 330k items), and the performance is better on the sparser one. Link to the paper. My comment about a proper loss is that I don't see how a matrix factorization loss/learning would help the sequential recommendation problem, since I don't understand how FactorizedTopK works but still use it as a black-box layer. I only have minor knowledge about this; the GRU4Rec paper also tried using softmax as its loss but found TOP1 and BPR to achieve better results. Sorry that I cannot contribute more on this point. Let me know if there is anything else I can help with.
@juliobguedes @anisayari it has been some time since this issue was opened, but I was wondering if you solved your problems, and if yes, how? From reading this and watching the GRU talk, a thought comes to mind: could the issue be with the user model? The idea is that sessions are anonymous, so learning an embedding per user (when in reality it is a session) could lead to severe overfitting. @maciejkula what are your thoughts on this? If the above could be the issue, how would you suggest "smoothing" the sessions? Thanks in advance!
@juliobguedes I had a look at your code because I'm trying a similar implementation.

def _build_model(self, candidates):
    self.user_model = Sequential([
        Embedding(self.num_items, self.embedding_size, input_length=self.seq_length),
        GRU(self.gru_units, activation=self.gru_activation),  # HIDDEN LAYER
    ], name='User_Model')
    self.item_model = Sequential([
        self.vocab,
        Embedding(self.num_items, self.embedding_size)
    ], name='Item_Model')
    metrics = [TopK(), tfr.keras.metrics.MRRMetric()]
    topk = FactorizedTopK(candidates.batch(1024).map(self.item_model), metrics=metrics)
    self.task = Retrieval(metrics=topk)

Since the input to your user_model is a sequence of the past items, you should apply the same vocabulary there. Otherwise, the item ids seen by the user model and by the item model won't refer to the same id space.

@maciejkula if you would be so kind, please let me know if I'm right or I have got it all wrong.
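(A sketch of what that change might look like in the _build_model above, assuming raw string ids are fed to both towers; vocabulary_size() is the method name in recent TF versions, and the lookup adds OOV/mask indices that the embedding has to cover, depending on its configuration.)

def _build_model(self, candidates):
    self.user_model = Sequential([
        self.vocab,  # same StringLookup as in the item model
        Embedding(self.vocab.vocabulary_size(), self.embedding_size),
        GRU(self.gru_units, activation=self.gru_activation),
    ], name='User_Model')
    self.item_model = Sequential([
        self.vocab,
        Embedding(self.vocab.vocabulary_size(), self.embedding_size)
    ], name='Item_Model')
    self.task = Retrieval(
        metrics=FactorizedTopK(candidates.batch(1024).map(self.item_model)))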
There are a few independent implementations of sequential recommenders (sequence-aware, session-based, and so on), such as slientGe's repository. How do we perform such recommendations using this library?
Every example in the guides focuses on solving the matrix-completion problem, while this is not the objective in sequential recommendation.