Fix SAR normalization and add accuracy evaluation metrics #1128
Conversation
Hi @viktorku, thanks for your contribution. Could you please explain the steps to get these numbers? |
@miguelgfierro If you run the same notebook with `normalize=True` both when instantiating the model:

```python
model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard",
    time_decay_coefficient=30,
    timedecay_formula=True,
    normalize=True
)
```

and when generating the recommendations:

```python
with Timer() as test_time:
    top_k = model.recommend_k_items(test, remove_seen=True, normalize=True)

print("Took {} seconds for prediction.".format(test_time.interval))
```

you will get these numbers. |
@miguelgfierro @gramhagen I've updated the branch after fixing conflicts with the refactor PR and adapted the unit test. |
@viktorku it appears that the rescaling impacts the ranking metrics, which I would not expect to happen. In this algorithm the ratings are not directly comparable across users because they depend on the number of item interactions. Can you apply the rescaling only per user? Would that be different from the current normalization approach? |
Which rescaling are you referring to here? The current one does indeed impact the ranking metrics, but not the one in this PR. I've updated the ranking/relevance metrics in the readme since they were a bit different from what we were getting in the notebook (if you were referring to that).
I'm not sure if that would be different. Each row in the score matrix corresponds to a single user, so with the current normalization we do element-wise division of each user's item scores; hence all scores are normalized (incorrectly) by their corresponding affinity. Here's a quick experiment with 2 users (the co-occurrence matrix is correctly derived from the ratings):

```python
In [125]: ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])

In [127]: i2i = np.array([[2, 2, 1, 1, 1],
     ...:                 [2, 2, 1, 1, 1],
     ...:                 [1, 1, 1, 0, 0],
     ...:                 [1, 1, 0, 1, 1],
     ...:                 [1, 1, 0, 1, 1]])

In [128]: ua.dot(i2i)
Out[128]:
array([[20, 20, 11,  9,  9],
       [24, 24,  9, 15, 15]])

In [129]: scores = ua.dot(i2i)

In [130]: np.where(ua != 0, 1, ua)
Out[130]:
array([[1, 1, 1, 0, 0],
       [1, 1, 0, 1, 1]])

In [131]: uua = np.where(ua != 0, 1, ua)

In [132]: scores / uua.dot(i2i)
Out[132]:
array([[4.        , 4.        , 3.66666667, 4.5       , 4.5       ],
       [4.        , 4.        , 4.5       , 3.75      , 3.75      ]])

In [133]: uua.dot(i2i)
Out[133]:
array([[5, 5, 3, 2, 2],
       [6, 6, 2, 4, 4]])

In [134]: uua[0].dot(i2i)
Out[134]: array([5, 5, 3, 2, 2])

In [135]: scores[0] / uua[0].dot(i2i)
Out[135]: array([4.        , 4.        , 3.66666667, 4.5       , 4.5       ])
```

You can see that in the non-normalized score matrix, the elements with the highest raw scores (20) end up with lower normalized values (4.0) than the elements with the lowest raw scores (9, which become 4.5).

@gramhagen This is exactly why we cannot normalize the way it's currently done. Normalization in this sense penalizes scores with more interaction and favors items with less, which doesn't make sense. And, as demonstrated above, normalizing for each user yields the same result. |
What does make sense, actually, is to rescale with the new min/max method for each user individually and then compose the final scoring matrix. The difference in this case would be:

```python
In [17]: rescale(scores, 1, 5, 0)
Out[17]:
array([[4.33333333, 4.33333333, 2.83333333, 2.5       , 2.5       ],
       [5.        , 5.        , 2.5       , 3.5       , 3.5       ]])

In [18]: np.array([rescale(scores[0], 1, 5, 0), rescale(scores[1], 1, 5, 0)])
Out[18]:
array([[5. , 5. , 3.2, 2.8, 2.8],
       [5. , 5. , 2.5, 3.5, 3.5]])
```

In this case, the ranking and relevance metrics would still differ from the non-normalized version, albeit a little less so, because we evaluate them across all users. |
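For reference, the `rescale` helper is not shown in the session above; here is a minimal sketch that reproduces the outputs, with the signature inferred from the calls (the actual helper in the PR may differ):

```python
import numpy as np

def rescale(data, new_min, new_max, data_min=None, data_max=None):
    # Affinely map `data` into [new_min, new_max]. Bounds not supplied
    # are inferred from the data itself, so rescale(scores, 1, 5, 0)
    # fixes the lower bound at 0 and infers the upper bound.
    data_min = np.min(data) if data_min is None else data_min
    data_max = np.max(data) if data_max is None else data_max
    return (data - data_min) / (data_max - data_min) * (new_max - new_min) + new_min
```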
@viktorku thanks for this, we are reviewing your code; it will take some time, hopefully you understand. For @gramhagen @loomlike @anargyri: if we compare the original SAR with this version, the ranking metrics are the same, so the new normalization is not affecting the ranking metrics, which is expected. |
@miguelgfierro you must use `normalize=True` in both the constructor and the `recommend_k_items` method to see the poorly performing metrics. I do like the simplification in this implementation that only requires the flag at construction (though perhaps it would be useful to provide a way to get unnormalized results). |
@viktorku it's true the current normalization penalizes items with more interactions; this is done to remove the bias the algorithm has towards more popular items. One could argue that these offline metrics reward bias towards popular items, but it does seem like the results are pretty extreme after normalization.

The concern I have with the rescaling you propose is determining the appropriate min/max values for the data. If they are inferred across all users, we further penalize less-frequently interacted items and favor more popular items. But if they are inferred per user, we could arbitrarily generate large differences between similar scores, or shift small scores so they have high ratings. You can see some of this clearly happening in the large RMSE values. It is worth noting that the current normalization at least produces better calibrated scores.

I'm wondering if there's a way we can use the result of the inner product between the unity user affinity and the item-item similarity to re-weight the results that isn't just dividing the scores; for example, using a Bayesian estimator to get to a weighted rating (see the sketch below). I'm struggling a bit with the prior C to use here. I'm thinking the average rating for each item over all users would work, but that may lose some of the specificity the algorithm is providing with the user affinity. Another option could be the average rating across all items for that user, or even your proposed re-scaled rating for the user-item pair. There's still an arbitrary value m to define, but this could be adjusted by the user depending on how strongly they want to rely on the prior. |
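For concreteness, a minimal sketch of the IMDB-style weighted rating alluded to above, assuming `R` is the raw score, `v` the per-item evidence (e.g. the unity-affinity/item-similarity denominator), `C` the prior and `m` the weight on the prior; all names are illustrative:

```python
import numpy as np

def weighted_rating(R, v, C, m):
    # Bayesian / IMDB-style estimator: shrink the raw score R towards
    # the prior C, with m controlling how strongly the prior is trusted
    # relative to the evidence v.
    return v / (v + m) * R + m / (v + m) * C

# e.g. shrink the per-user scores from the earlier experiment towards
# a hypothetical prior rating of 3.5
scores = np.array([[4.0, 4.0, 3.67, 4.5, 4.5]])
v = np.array([[5, 5, 3, 2, 2]])  # similarity mass per item
print(weighted_rating(scores, v, C=3.5, m=2))
```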
The more I think about it, the more I think it's probably right not to "correct" bias in the algorithm through normalization. Your approach still maintains the original ranking, so that seems intuitive. Maybe a good compromise is to use the result of the dot product of the unity-user-affinity and item-item similarity to determine the min and max possible scores. Then for each user we can scale based on the min and max scores across all items for that user? This would still maintain the per-user ranking and hopefully avoid edge cases where values are inflated or shrunk. I may have some time to try this out and will report back with results. Any thoughts on this approach @viktorku @anargyri @miguelgfierro ? |
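A minimal sketch of this proposal, reusing the toy matrices from the experiment above and assuming a 1-5 rating scale (an illustration of the idea, not the code that was eventually merged):

```python
import numpy as np

ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])  # user affinity
uua = np.where(ua != 0, 1, ua)                     # unity user affinity
i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1],
                [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])
rating_min, rating_max = 1, 5

scores = ua.dot(i2i)
counts = uua.dot(i2i)  # similarity mass reaching each item, per user

# Min/max *possible* score per user across all items: every rated item
# at rating_min or rating_max respectively.
user_min = (counts * rating_min).min(axis=1, keepdims=True)
user_max = (counts * rating_max).max(axis=1, keepdims=True)

# Affine map into [rating_min, rating_max]; monotone per user, so the
# per-user ranking is unchanged.
normalized = (scores - user_min) / (user_max - user_min) * (rating_max - rating_min) + rating_min
print(np.round(normalized, 3))
```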
I think so too.
That sounds like a good idea. Please do try it out. My intuition says it would clamp outliers without affecting/worsening the ratings of the normal data points (which, I'm afraid, is the cost of my naive approach). As long as the ranking stays the same and the generated recommendations are identical, this makes sense, and we might see better accuracy. Either way, normalising shouldn't tamper with the actual results. |
@viktorku I pushed an implementation to your repo/branch. Please review and merge in if you approve. Thank you for highlighting this discrepancy, the examples were very helpful! |
I had a quick look at the code. It seems to me there is a bug in the code in master which is not really fixed in this PR. The bug comes from this function: https://github.com/microsoft/recommenders/blob/558733f366983d576953a407ab7180b1642dbc5b/reco_utils/recommender/sar/sar_singlenode.py#L456

So the bug is not caused by the type of normalization used; the desired behavior is that the normalized scores are used only where the true rating scale matters (e.g. for RMSE/MAE), while the rankings come from the unnormalized scores. An error in the notebook is using `normalize=True` when generating the top-k recommendations as well.

I also recommend adding tests that cover all cases and check the expected results, as numpy arrays, both for the normalized scores and the top-k recommendations. |
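As an illustration of the kind of test suggested here, a sketch using the toy matrices from earlier; `normalize_scores` is a hypothetical stand-in for whatever helper the PR exposes, implemented below with the per-user rescaling so the test actually runs:

```python
import numpy as np

def normalize_scores(scores, counts, rmin=1, rmax=5):
    # stand-in for the model's normalization, per the earlier sketch
    lo = (counts * rmin).min(axis=1, keepdims=True)
    hi = (counts * rmax).max(axis=1, keepdims=True)
    return (scores - lo) / (hi - lo) * (rmax - rmin) + rmin

def test_normalization_preserves_ranking_and_range():
    scores = np.array([[20, 20, 11, 9, 9], [24, 24, 9, 15, 15]])
    counts = np.array([[5, 5, 3, 2, 2], [6, 6, 2, 4, 4]])
    normalized = normalize_scores(scores, counts)
    # normalized scores must land in the rating scale...
    assert normalized.min() >= 1 and normalized.max() <= 5
    # ...and must not reorder any user's items
    assert np.array_equal(np.argsort(-scores, axis=1, kind="stable"),
                          np.argsort(-normalized, axis=1, kind="stable"))

test_normalization_preserves_ranking_and_range()
```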
@anargyri can you explain the rationale behind ignoring the inconsistency between normalized and unnormalized results? Since, as you mention, normalization in the previous item-based scheme changes the ranking, is it possible to treat the unnormalized ranking metrics and the normalized rating metrics as representing the performance of the same algorithm? |
Why would normalized scores only be used for performance evaluation? I'm aware predicting accurate ratings is not SAR's strong suit, but I see plenty of reasons for the recommendations to be scored in their true scale for external consumers, especially now that it's obvious the accuracy is not that bad. So I find it absurd that we can't have both: the top-K items in their true scale and correctly ranked. |
Agree with this. I'll adapt the notebook |
Well, there is not a good rationale other than that normalization inevitably involves some trade-off. There are ways to avoid this inconsistency that would normalize with factors depending on the ratings, but these would be more sensitive to variation from new ratings data. Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it (see the sketch below); it would keep the rankings consistent and the ratings in the correct range. Since there is no obvious choice of normalization, there could be multiple options for the user to choose from. |
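A sketch of this max-over-items variant, under the same toy setup as before: dividing each user's row by a single constant keeps the per-user ranking intact, and since every raw score is bounded by the rating maximum times its denominator, the results stay within the rating range:

```python
import numpy as np

ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])
uua = np.where(ua != 0, 1, ua)
i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1],
                [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])

scores = ua.dot(i2i)
denom = uua.dot(i2i)                         # current per-item denominator
per_user = denom.max(axis=1, keepdims=True)  # max over items, per user
print(scores / per_user)                     # [[4., 4., 2.2, 1.8, 1.8], ...]
```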
See the above reply to @gramhagen; there are ways to do this but they may have some other drawbacks. Feel free to choose one of them. |
That is essentially the proposal here, if I'm interpreting what you're saying correctly. This method normalizes per user by first scaling each user's scores between 0 and 1, based on the min and max possible scores (looking at the current denominators across all items for that user), and then scaling back up to the original rating range. We keep the scores in the correct rating range as well as maintain the original ranking. The downside is there can be some distortion of the ratings (the rating results are 30-40% worse than ALS, but I think this is still reasonable).

The other issue is the expectation for usage of the normalization flag. If we do not specify it at construction, then we will need to generate the unity-user-affinity matrix in all cases, which consumes unneeded memory for users that don't care about normalization. Since the proposed approach does not alter rankings, specifying normalization once at construction is best: it allows for reduced memory if excluded and does not require the user to remember to use it later during scoring, an easy error to make given I did it in the notebook you pointed out =).

Are there cases where the unnormalized scores would be useful? If so, I suggest we keep the normalize flag in the scoring methods but set the default to None and inherit from what was set at construction (see the sketch below). That way the user doesn't need to worry about it if they've already specified one way or the other, and can disable it if desired. We would need to keep the raised error for the case where the user does not construct the model with normalization but then sets score normalization = True, but honestly this seems like an odd requirement. |
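A minimal sketch of the flag behaviour described above (class and method names only illustrate the pattern, they are not the actual SAR implementation):

```python
class Model:
    def __init__(self, normalize=False):
        # Decided once, up front, so the unity-user-affinity matrix is
        # only built (and memory spent) when normalization is wanted.
        self.normalize = normalize

    def recommend_k_items(self, test, top_k=10, normalize=None):
        # None means "inherit whatever was chosen at construction".
        normalize = self.normalize if normalize is None else normalize
        if normalize and not self.normalize:
            raise ValueError(
                "Construct the model with normalize=True to get "
                "normalized scores at recommendation time.")
        # ... score and rank items here ...
```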
This renormalization sounds good. The important thing is to maintain the rankings and the range of the ratings at the same time. There could be multiple ways to do this, in fact, and in the future there could be an option to choose between different normalization methods. As you say, the implementation is optimal in terms of efficiency. One question I have: is the unity-user-affinity matrix required in your newly proposed normalization (which seems to require only the scores, AFAIU)?

In general, my concern was about the outward interface, where you need to specify the normalization flag twice (both in the constructor and in recommendation) and it is easy to forget one of the two. I think it is not necessary to have the flag when recommending the top k items; it is just a matter of implementation whether you use the unnormalized or the normalized scores for ranking the items (and with your proposed normalization they should be the same). On the other hand, in predict() we need to use the normalized scores, so again no flag is needed.

In summary, it seems to me that the flag in the constructor should be an "efficiency" flag, not a functionality flag, i.e. the results (rankings and ratings) should not change if you flip the flag, but the computational time will be affected. |
Ok, sounds good. So we keep this PR's version of normalization with the single flag at construction. I think the only remaining items were the comments on the evaluation tools. @viktorku when you've addressed those and the evaluation in the notebook, we should be good. Let me know if you need any further support. |
@viktorku is this ready? Or do you need any support to complete this? |
Hey! Sorry for the wait, I was on vacation. I will finish this up in the next few days. |
@gramhagen I finished this. Sorry for the delay. Let me know if there's anything else to be done 👍 |
Looks good, thanks a lot for your work on this!
Description
Fixes the SAR normalization in the `sar_movielens.ipynb` notebook. With this PR we get the same rank/relevance metrics as the non-normalized version and the following accuracy:
To illustrate the problem with the current normalization, here are the current metrics with the (incorrect) normalization technique.
Preview notebook link
We always need to normalize the scores so that RMSE and MAE are computed in the correct scale.
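For instance, with this repo's evaluation helpers (assuming `test` holds the ground-truth ratings and `predictions` the model's normalized scores, with the column names used in the notebook):

```python
from reco_utils.evaluation.python_evaluation import rmse, mae

# Rating metrics only make sense on normalized, true-scale predictions.
eval_rmse = rmse(test, predictions, col_user="userID", col_item="itemID",
                 col_rating="rating", col_prediction="prediction")
eval_mae = mae(test, predictions, col_user="userID", col_item="itemID",
               col_rating="rating", col_prediction="prediction")
print("RMSE:\t{}\nMAE:\t{}".format(eval_rmse, eval_mae))
```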
Related Issues
Closes #903
Checklist:
- [x] The PR is opened against `staging` and not `master`.