
Fix SAR normalization and add accuracy evaluation metrics #1128

Conversation

@viktorku (Contributor) commented Jun 25, 2020

Description

  • The normalization method in the SAR algorithm does not seem to be correct - it is currently implemented as an element-wise division of the computed scores by the dot product of each user's unary (unity) affinity with the item similarity matrix. If we actually use this normalization technique when evaluating SAR, we get extremely poor relevance and ranking metrics. Furthermore, this method removes outliers and skews the relevance and ordering of the generated recommendations, which shouldn't be the case. Normalizing the scores to the original rating scale should instead yield identical ranking metrics.
  • The above fix allows us to correctly evaluate accuracy measures such as RMSE, MAE, and log loss. This PR also adds that evaluation to the sar_movielens.ipynb notebook.

With this PR we get the same rank/relevance metrics as the non-normalized version and the following accuracy:

Top K: 10
MAP: 0.110591
NDCG: 0.382461
Precision@K: 0.330753
Recall@K: 0.176385
RMSE: 3.697559
MAE: 3.513341
R2: -12.648769
Exp var: -0.442580
Logloss: 3.268522

To illustrate the problem, here are the metrics obtained with the current (incorrect) normalization technique:

Top K: 10
MAP: 0.000045
NDCG: 0.000736
Precision@K: 0.000742
Recall@K: 0.000118

Preview notebook link

We always need to normalize the scores so that RMSE and MAE are computed in the correct scale.

Related Issues

Closes #903

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging and not master.



@miguelgfierro (Collaborator)

Hi @viktorku, thanks for your contribution. Could you please explain the steps to reproduce these numbers?

Top K: 10
MAP: 0.000045
NDCG: 0.000736
Precision@K: 0.000742
Recall@K: 0.000118

@viktorku (Contributor, Author) commented Jun 25, 2020

@miguelgfierro If you run the same notebook on master with the following changes, you will get those numbers.

When instantiating the model:

model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)

and when generating the recommendations:

with Timer() as test_time:
    top_k = model.recommend_k_items(test, remove_seen=True, normalize=True)

print("Took {} seconds for prediction.".format(test_time.interval))

@viktorku force-pushed the fix-sar-normalization-and-add-accuracy-metrics branch from d68d9ae to 2832e12 on July 1, 2020, 10:34
@viktorku (Contributor, Author) commented Jul 1, 2020

@miguelgfierro @gramhagen I've updated the branch after resolving conflicts with the refactor PR and adapting the unit test.
Commit 2832e12 shows the clear difference in scoring with this PR.

@gramhagen (Collaborator)

@viktorku it appears that the rescaling impacts the ranking metrics, which I would not expect to happen.

In this algorithm the ratings are not directly comparable across users because they depend on the number of item interactions. Can you apply the rescaling only per user? Would that be different from the current normalization approach?

@viktorku (Contributor, Author) commented Jul 2, 2020

it appears that the rescaling impacts the ranking metrics, which I would not expect to happen.

Which rescaling are you referring to here? The current one does indeed impact the ranking metrics, but not the one in this PR. I've updated the ranking/relevance metrics in the readme since they were a bit different from what we were getting in the notebook (if you were referring to that).

Can you apply the rescaling only per user? Would that be different from the current normalization approach?

I'm not sure if that would be different. Each row in the score matrix corresponds to a single user, so with the current normalization we do an element-wise division of each user's item scores. Hence all scores are normalized (incorrectly) by their corresponding affinity. Here's a quick experiment with 2 users (the co-occurrence matrix is correctly derived from the ratings).

In [125]: ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])

In [127]: i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1], [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])

In [128]: ua.dot(i2i)
Out[128]:
array([[20, 20, 11,  9,  9],
       [24, 24,  9, 15, 15]])

In [129]: scores = ua.dot(i2i)

In [130]: np.where(ua != 0, 1, ua)
Out[130]:
array([[1, 1, 1, 0, 0],
       [1, 1, 0, 1, 1]])

In [131]: uua = np.where(ua != 0, 1, ua)

In [132]: scores / uua.dot(i2i)
Out[132]:
array([[4.        , 4.        , 3.66666667, 4.5       , 4.5       ],
       [4.        , 4.        , 4.5       , 3.75      , 3.75      ]])

In [133]: uua.dot(i2i)
Out[133]:
array([[5, 5, 3, 2, 2],
       [6, 6, 2, 4, 4]])

In [134]: uua[0].dot(i2i)
Out[134]: array([5, 5, 3, 2, 2])

In [135]: scores[0] / uua[0].dot(i2i)
Out[135]: array([4.        , 4.        , 3.66666667, 4.5       , 4.5       ])

You can see that in the non-normalized score matrix, the elements with 9 are normalized up to 4.5, while 11 is brought down to 3.67, and 15 to 3.75.

In this algorithm the ratings [...] depend on the number of item interactions

@gramhagen This is exactly why we cannot normalize the way it's currently done. Normalization in this sense penalizes items with more interactions and favors items with fewer, which doesn't make sense. And as demonstrated above, normalizing for each user separately yields the same result as the current approach.

@viktorku (Contributor, Author) commented Jul 2, 2020

What does make sense, actually, is to rescale with the min/max method for each user individually and then compose the final scoring matrix.

In that case the difference would be:

In [17]: rescale(scores, 1, 5, 0)
Out[17]:
array([[4.33333333, 4.33333333, 2.83333333, 2.5       , 2.5       ],
       [5.        , 5.        , 2.5       , 3.5       , 3.5       ]])

In [18]: np.array([rescale(scores[0], 1, 5, 0), rescale(scores[1], 1, 5, 0)])
Out[18]:
array([[5. , 5. , 3.2, 2.8, 2.8],
       [5. , 5. , 2.5, 3.5, 3.5]])

In this case, the ranking and relevance metrics would still differ from the non-normalized version, albeit a little less so, because we evaluate them across all users.
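(For reference, the rescale helper used above behaves like a standard min/max rescaling utility; a minimal sketch consistent with the outputs shown, assuming the old bounds default to the data's own min/max when not supplied:)

import numpy as np

def rescale(data, new_min=0, new_max=1, data_min=None, data_max=None):
    # Min-max rescale `data` into [new_min, new_max]; the old bounds default
    # to the data's own min/max when not given explicitly.
    data_min = np.min(data) if data_min is None else data_min
    data_max = np.max(data) if data_max is None else data_max
    return (data - data_min) / (data_max - data_min) * (new_max - new_min) + new_min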

@miguelgfierro (Collaborator) commented Jul 2, 2020

@viktorku thanks for this. We are reviewing your code; it will take some time, hopefully you understand.

For @gramhagen, @loomlike, @anargyri: if we compare the original SAR with this version, the ranking metrics are the same:

Top K: 10
MAP: 0.110591
NDCG: 0.382461
Precision@K: 0.330753
Recall@K: 0.176385

So the new normalization does not affect the ranking metrics, which is expected.

@gramhagen (Collaborator)

@miguelgfierro you must use normalize=True in both the constructor and the recommend_k_items method to see the poorly performing metrics. I do like the simplification in this implementation that only requires the flag at construction (though perhaps it would be useful to provide a way to get unnormalized results).

@gramhagen (Collaborator) commented Jul 4, 2020

@viktorku it's true that the current normalization penalizes items with more interactions; this is done to remove the bias the algorithm has towards more popular items. One could argue that these offline metrics reward bias towards popular items, but the results do seem pretty extreme after normalization.

The concern I have with the rescaling you propose is determining the appropriate min/max values for the data. If they are inferred across all users, we further penalize less frequently interacted items and favor more popular ones. But if they are inferred per user, we could arbitrarily generate large differences between similar scores, or shift small scores so they end up with high ratings. You can see some of this clearly happening in the large RMSE values. It is worth noting that the current normalization at least produces better-calibrated scores:
RMSE: 1.243966
MAE: 1.050101

I'm wondering if there's a way we can use the result of the inner product between the unity user affinity and the item-item similarity to re-weight the results in a way that isn't just dividing the scores.

For example, using a Bayesian estimator to get a weighted rating:

W = (R·v + C·m) / (v + m)

where:
W = normalized user rating
R·v = un-normalized score
C = average item rating (over all users)
v = user-specific item counts (dot product of the unity user affinity with the item-item similarity)
m = tunable number of ratings controlling how strongly scores are pulled towards the average value.

I'm struggling a bit with which "prior" C to use here. I'm thinking the average rating for each item over all users would work, but that may lose some of the specificity the algorithm provides through the user affinity. Another option could be the average rating across all items for that user, or even your proposed re-scaled rating for the user-item pair. There's still an arbitrary value m to define, but this could be adjusted by the user depending on how strongly they want to rely on the prior.
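A rough numpy sketch of that weighted-rating idea, reusing the toy arrays from the earlier example (the values of C and m are placeholders, and choosing them is exactly the open question above):

import numpy as np

scores = np.array([[20, 20, 11, 9, 9], [24, 24, 9, 15, 15]])  # R*v: ratings dot item-item similarity
counts = np.array([[5, 5, 3, 2, 2], [6, 6, 2, 4, 4]])         # v: unity affinity dot item-item similarity

C = 3.5  # prior, e.g. an average item rating (placeholder value)
m = 2.0  # tunable pseudo-count controlling the pull towards the prior

weighted = (scores + C * m) / (counts + m)  # W = (R*v + C*m) / (v + m)
print(weighted)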

@gramhagen (Collaborator)

The more I think about it, the more I think it's probably right not to "correct" the bias in the algorithm through normalization. Your approach still maintains the original ranking, so that seems intuitive.

Maybe a good compromise is to use the result of the dot product of the unity-user-affinity and the item-item similarity to determine the min and max possible scores. Then for each user we can scale based on the min and max scores across all items for that user? This would still maintain the per-user ranking and hopefully avoid edge cases where values are inflated or shrunk.
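A rough sketch of that compromise, reusing the toy arrays from the earlier experiment (illustrative only, not the final implementation):

import numpy as np

ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])  # user-item ratings
i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1],
                [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])

scores = ua.dot(i2i)                       # raw SAR scores
counts = np.where(ua != 0, 1, 0).dot(i2i)  # unity-user-affinity dot item-item similarity

rating_min, rating_max = 1.0, 5.0
# per-user min/max *possible* scores derived from the counts and the rating scale
score_min = rating_min * counts.min(axis=1, keepdims=True)
score_max = rating_max * counts.max(axis=1, keepdims=True)

# rescale each user's scores into the original rating range; the per-user ranking is preserved
normalized = (scores - score_min) / (score_max - score_min) * (rating_max - rating_min) + rating_min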

I may have some time to try this out and will report back with results. Any thoughts on this approach, @viktorku @anargyri @miguelgfierro?

@viktorku (Contributor, Author) commented Jul 4, 2020

The more I think about it, the more I think it's probably right not to "correct" the bias in the algorithm through normalization

I think so too.

Maybe a good compromise is to use the result of the dot product of the unity-user-affinity and the item-item similarity to determine the min and max possible scores. Then for each user we can scale based on the min and max scores across all items for that user? This would still maintain the per-user ranking and hopefully avoid edge cases where values are inflated or shrunk.

That sounds like a good idea. Please do try it out. My intuition says it would clamp outliers without affecting or worsening the ratings of the normal data points (which, I'm afraid, is the cost of my naive approach). As long as the ranking stays the same and the generated recommendations are identical, this would make sense, and we might see better accuracy. Either way, normalising shouldn't tamper with the actual results.

@gramhagen (Collaborator)

@viktorku I pushed an implementation to your repo/branch. Please review and merge it in if you approve.

Thank you for highlighting this discrepancy, the examples were very helpful!

@anargyri (Collaborator) commented Jul 9, 2020

I had a quick look at the code. It seems to me there is a bug in the code in master which is not really fixed in this PR. The bug comes from this function https://github.com/microsoft/recommenders/blob/558733f366983d576953a407ab7180b1642dbc5b/reco_utils/recommender/sar/sar_singlenode.py#L456
recommend_k_items() is used when evaluating the ranking metrics. So it should not depend on normalized scores (and hence there should not be a normalize argument in its signature). The PR code still has this issue.

So the bug is not caused by the type of normalization used; the desired behavior of recommend_k_items() should be independent of the normalization, because it outputs a ranking of the items per user. Normalization should be applied in a separate function (I don't think we should have a normalize flag in the constructor).

Note that any type of normalization with an item-dependent denominator will have this issue: the ranking of the scores before and after normalization will be different, because item i is normalized with a different denominator than item j, and hence the order of i and j in the ranking may flip. We opted to ignore this inconsistency and use the normalized scores only for rating metrics; an alternative would be to normalize scores by a factor that depends only on the user and not on the item. The former has the advantage that the normalized scores are in the right range; the latter does not guarantee this.

An error in the notebook is using top_k for the computation of rating metrics; rating metrics should be computed on the full test data, not on k items only.

I also recommend adding tests that cover all cases and check for the expected results (as numpy arrays), both for the normalized scores and for the top-k recommendations.

@gramhagen (Collaborator)

@anargyri can you explain the rationale behind ignoring the inconsistency between normalized and unnormalized results? Since, as you mention, normalization in the previous item-based scheme changes the ranking, is it possible to treat the unnormalized ranking metrics and the normalized rating metrics as representing the performance of the same algorithm?

@viktorku (Contributor, Author) commented Jul 9, 2020

We opted to ignore this inconsistency and use the normalized scores only for rating metrics

Why would normalized scores only be used for performance evaluation? I'm aware that predicting accurate ratings is not SAR's strong suit, but I see plenty of reasons for the recommendations to be scored in their true scale for external consumers, especially now that it's obvious the accuracy is not that bad. So I find it absurd that we can't have both: the top K items in their true scale and correctly ranked.

@viktorku (Contributor, Author) commented Jul 9, 2020

An error in the notebook is using top_k for the computation of rating metrics; rating metrics should be computed on the full test data, not on k items only.

Agree with this. I'll adapt the notebook.

@anargyri (Collaborator) commented Jul 9, 2020

@anargyri can you explain the rationale behind ignoring the inconsistency between normalized and unnormalized results? Since, as you mention, normalization in the previous item-based scheme changes the ranking, is it possible to treat the unnormalized ranking metrics and the normalized rating metrics as representing the performance of the same algorithm?

Well, there is no good rationale other than that any normalization inevitably involves some trade-off. There are ways to avoid this inconsistency that would normalize with factors depending on the ratings, but this would be more sensitive to variation from new ratings data. Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.
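For illustration, a quick sketch of that max-over-items idea with the toy arrays from earlier (dividing by a per-user scalar cannot flip the order of items within a user):

import numpy as np

scores = np.array([[20, 20, 11, 9, 9], [24, 24, 9, 15, 15]])  # ratings dot item-item similarity
counts = np.array([[5, 5, 3, 2, 2], [6, 6, 2, 4, 4]])         # the current per-item denominators

# take the max of the denominator over items for each user, then divide
normalized = scores / counts.max(axis=1, keepdims=True)
print(normalized)  # [[4.  4.  2.2 1.8 1.8], [4.  4.  1.5 2.5 2.5]]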

Since there is no obvious choice of normalization, there could be multiple options for the user to choose.

@anargyri (Collaborator) commented Jul 9, 2020

We opted to ignore this inconsistency and use the normalized scores only for rating metrics

Why would normalized scores only be used for performance evaluation? I'm aware that predicting accurate ratings is not SAR's strong suit, but I see plenty of reasons for the recommendations to be scored in their true scale for external consumers, especially now that it's obvious the accuracy is not that bad. So I find it absurd that we can't have both: the top K items in their true scale and correctly ranked.

See the above reply to @gramhagen; there are ways to do this, but they may have some other drawbacks. Feel free to choose one of them.
About the rating metrics, I think you could use the predict() method in the notebook.
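For example, something along these lines in the notebook (a sketch only, assuming the repo's python_evaluation helpers and that predict() returns a DataFrame with a "prediction" column):

from reco_utils.evaluation.python_evaluation import rmse, mae

# score the full test set (not just the top-k recommendations) for rating metrics
predictions = model.predict(test)

eval_rmse = rmse(test, predictions, col_user="userID", col_item="itemID",
                 col_rating="rating", col_prediction="prediction")
eval_mae = mae(test, predictions, col_user="userID", col_item="itemID",
               col_rating="rating", col_prediction="prediction")
print("RMSE:\t{:.6f}\nMAE:\t{:.6f}".format(eval_rmse, eval_mae))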

@gramhagen (Collaborator)

Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.

That is essentially the proposal here, if I'm interpreting what you're saying correctly. This method normalizes per user by first scaling each user's scores between 0 and 1 based on the min and max possible scores (looking at the current denominators across all items for a user). Then this is scaled back up to the original rating range. We keep the scores in the correct rating range as well as maintain the original ranking. The downside is there can be some distortion of the ratings (the rating results are 30-40% worse than ALS, but I think this is still reasonable).

The other issue is the expected usage of the normalization flag. If we do not specify it at construction, then we would need to generate the unity-user-affinity matrix in all cases, which consumes unneeded memory for users who don't care about normalization. Since the proposed approach does not alter rankings, specifying normalization once at construction is best: it avoids that memory cost when normalization is excluded and does not require the user to remember a second flag later during scoring, an easy error to make given that I made it myself in the notebook you pointed out =).

Are there cases where the unnormalized scores would be useful? If so, I suggest we keep the normalize flag in the scoring methods but set the default to None and inherit the value set at construction. That way the user doesn't need to worry about it if they've already specified it one way or the other, and can disable it if desired. We would need to keep the raised error for the case where the user does not construct the model with normalization but then sets normalize=True at scoring time, though honestly that seems like an odd requirement.
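A small sketch of that flag-inheritance pattern (illustrative class and names only, not the actual SAR code):

class ExampleRecommender:
    def __init__(self, normalize=False):
        # the flag chosen at construction decides whether normalization data is built at fit time
        self.normalize = normalize

    def recommend_k_items(self, test, top_k=10, normalize=None):
        # None means "inherit whatever was set at construction"
        if normalize is None:
            normalize = self.normalize
        if normalize and not self.normalize:
            raise ValueError("Model was not constructed with normalize=True")
        # ... scoring would use normalized or raw scores depending on `normalize` ...
        return normalize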

@anargyri (Collaborator)

Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.

That is essentially the proposal here, if I'm interpreting what you're saying correctly. This method normalizes per user by first scaling each user's scores between 0 and 1 based on the min and max possible scores (looking at the current denominators across all items for a user). Then this is scaled back up to the original rating range. We keep the scores in the correct rating range as well as maintain the original ranking. The downside is there can be some distortion of the ratings (the rating results are 30-40% worse than ALS, but I think this is still reasonable).

The other issue is the expected usage of the normalization flag. If we do not specify it at construction, then we would need to generate the unity-user-affinity matrix in all cases, which consumes unneeded memory for users who don't care about normalization. Since the proposed approach does not alter rankings, specifying normalization once at construction is best: it avoids that memory cost when normalization is excluded and does not require the user to remember a second flag later during scoring, an easy error to make given that I made it myself in the notebook you pointed out =).

Are there cases where the unnormalized scores would be useful? If so, I suggest we keep the normalize flag in the scoring methods but set the default to None and inherit the value set at construction. That way the user doesn't need to worry about it if they've already specified it one way or the other, and can disable it if desired. We would need to keep the raised error for the case where the user does not construct the model with normalization but then sets normalize=True at scoring time, though honestly that seems like an odd requirement.

This renormalization sounds good. The important thing is to maintain the rankings and the range of the ratings at the same time. There could be multiple ways to do this, in fact, and in the future there could be an option to choose between different normalization methods.

As you say, the implementation is optimal in terms of efficiency. One question I have: is the unity-user-affinity matrix required in your newly proposed normalization (which seems to require only the scores, AFAIU)?

In general, my concern was the outward interface, where you need to specify the normalization flag twice (both in the constructor and when recommending) and it is easy to forget one of the two. I think it is not necessary to have the flag when recommending the top k items. It is just a matter of implementation whether you use the unnormalized or the normalized scores for ranking the items (and with your proposed normalization they should be the same). On the other hand, in predict() we need to use the normalized scores, so again no flag is needed. In summary, it seems to me that the flag in the constructor should be an "efficiency" flag, not a functionality flag, i.e. the results (rankings and ratings) should not change if you flip the flag, but the computational time will be affected.

@gramhagen (Collaborator)

Ok. Sounds good. So we keep this PR's version of normalization with the single flag at construction.

I think the only remaining items were the comments on the evaluation tools. @viktorku when you've addressed those and the evaluation in the notebook, we should be good.

Let me know if you need any further support.

@gramhagen (Collaborator)

@viktorku is this ready? Or do you need any support to complete it?

@viktorku (Contributor, Author)

Hey! Sorry for the wait, I was on vacation. I will finish this up in the coming days.

@viktorku force-pushed the fix-sar-normalization-and-add-accuracy-metrics branch from 7891791 to 803aacf on August 18, 2020, 09:39
@viktorku (Contributor, Author)

@gramhagen I finished this. Sorry for the delay. Let me know if there's anything else to be done 👍

@gramhagen (Collaborator) left a comment


Looks good, thanks a lot for your work on this!

@miguelgfierro merged commit c7658ac into recommenders-team:staging on Aug 20, 2020