
Fix SAR normalization and add accuracy evaluation metrics #1128

Conversation

@viktorku (Contributor) commented Jun 25, 2020

Description

  • The normalization method in the SAR algorithm does not seem to be correct - it is currently implemented as an element-wise division of the computed scores by the dot product of each user's unary (unity) affinity with the item similarity matrix. If we actually use this normalization technique when evaluating SAR, we get extremely poor relevance and ranking metrics. Furthermore, this method removes outliers and skews the relevance and ordering of the generated recommendations, which shouldn't be the case. Normalizing the scores to the original rating scale should instead yield identical ranking metrics.
  • The above fix allows us to correctly evaluate accuracy measures such as RMSE, MAE, and log loss. This PR also adds that evaluation to the sar_movielens.ipynb notebook.

With this PR we get the same rank/relevance metrics as the non-normalized version and the following accuracy:

Top K: 10
MAP: 0.110591
NDCG: 0.382461
Precision@K: 0.330753
Recall@K: 0.176385
RMSE: 3.697559
MAE: 3.513341
R2: -12.648769
Exp var: -0.442580
Logloss: 3.268522

To illustrate the problem, here are the metrics obtained with the current (incorrect) normalization technique:

Top K: 10
MAP: 0.000045
NDCG: 0.000736
Precision@K: 0.000742
Recall@K: 0.000118

Preview notebook link

We always need to normalize the scores so that RMSE and MAE are computed in the correct scale.

Related Issues

Closes #903

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging and not master.



@miguelgfierro (Collaborator)

Hi @viktorku, thanks for your contribution. Could you please explain the steps to reproduce these numbers?

Top K: 10
MAP: 0.000045
NDCG: 0.000736
Precision@K: 0.000742
Recall@K: 0.000118

@viktorku (Contributor, Author) commented Jun 25, 2020

@miguelgfierro If you run the same notebook on master with the following changes, you will get those numbers.

When instantiating the model:

model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)

and when generating the recommendations:

with Timer() as test_time:
    top_k = model.recommend_k_items(test, remove_seen=True, normalize=True)

print("Took {} seconds for prediction.".format(test_time.interval))

@viktorku force-pushed the fix-sar-normalization-and-add-accuracy-metrics branch from d68d9ae to 2832e12 on July 1, 2020, 10:34
@viktorku (Contributor, Author) commented Jul 1, 2020

@miguelgfierro @gramhagen I've updated the branch after resolving conflicts with the refactor PR and adapting the unit test.
Commit 2832e12 shows the clear difference in scoring with this PR.

@gramhagen (Collaborator)

@viktorku it appears that the rescaling impacts the ranking metrics, which I would not expect to happen.

In this algorithm the ratings are not directly comparable across users because they depend on the number of item interactions. Can you apply the rescaling only per user? Would that be different from the current normalization approach?

@viktorku (Contributor, Author) commented Jul 2, 2020

it appears that the rescaling impacts the ranking metrics, which I would not expect to happen.

Which rescaling are you referring to here? The current one does indeed impact the ranking metrics, but not the one in this PR. I've updated the ranking/relevance metrics in the readme since they were a bit different from what we were getting in the notebook (if you were referring to that).

Can you apply the rescaling only per user? Would that be different from the current normalization approach?

I'm not sure if that would be different. Each row in the score matrix corresponds to a single user, so with the current normalization we do an element-wise division of each user's item scores. Hence all scores are normalized (incorrectly) by their corresponding affinity. Here's a quick experiment with 2 users (the co-occurrence matrix is correctly derived from the ratings).

In [125]: ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])

In [127]: i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1], [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])

In [128]: ua.dot(i2i)
Out[128]:
array([[20, 20, 11,  9,  9],
       [24, 24,  9, 15, 15]])

In [129]: scores = ua.dot(i2i)

In [130]: np.where(ua != 0, 1, ua)
Out[130]:
array([[1, 1, 1, 0, 0],
       [1, 1, 0, 1, 1]])

In [131]: uua = np.where(ua != 0, 1, ua)

In [132]: scores / uua.dot(i2i)
Out[132]:
array([[4.        , 4.        , 3.66666667, 4.5       , 4.5       ],
       [4.        , 4.        , 4.5       , 3.75      , 3.75      ]])

In [133]: uua.dot(i2i)
Out[133]:
array([[5, 5, 3, 2, 2],
       [6, 6, 2, 4, 4]])

In [134]: uua[0].dot(i2i)
Out[134]: array([5, 5, 3, 2, 2])

In [135]: scores[0] / uua[0].dot(i2i)
Out[135]: array([4.        , 4.        , 3.66666667, 4.5       , 4.5       ])

You can see that in the non-normalized score matrix, the elements with 9 are normalized up to 4.5, while 11 is brought down to 3.67, and 15 to 3.75.

In this algorithm the ratings [...] depend on the number of item interactions

@gramhagen This is exactly why we cannot normalize the way it's currently done. Normalization in this sense penalizes items with more interactions and favors items with fewer, which doesn't make sense. And as demonstrated above, normalizing for each user separately yields the same result as the current approach.

@viktorku (Contributor, Author) commented Jul 2, 2020

What does make sense, actually, is to rescale with the min/max method for each user individually and then compose the final scoring matrix.

In that case the difference would be:

In [17]: rescale(scores, 1, 5, 0)
Out[17]:
array([[4.33333333, 4.33333333, 2.83333333, 2.5       , 2.5       ],
       [5.        , 5.        , 2.5       , 3.5       , 3.5       ]])

In [18]: np.array([rescale(scores[0], 1, 5, 0), rescale(scores[1], 1, 5, 0)])
Out[18]:
array([[5. , 5. , 3.2, 2.8, 2.8],
       [5. , 5. , 2.5, 3.5, 3.5]])

In this case, the ranking and relevance metrics would still differ from the non-normalized version, albeit a little less so, because we evaluate them across all users.
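(For reference, the rescale helper used above behaves like a standard min/max rescaling utility; a minimal sketch consistent with the outputs shown, assuming the old bounds default to the data's own min/max when not supplied:)

import numpy as np

def rescale(data, new_min=0, new_max=1, data_min=None, data_max=None):
    # Min-max rescale `data` into [new_min, new_max]; the old bounds default
    # to the data's own min/max when not given explicitly.
    data_min = np.min(data) if data_min is None else data_min
    data_max = np.max(data) if data_max is None else data_max
    return (data - data_min) / (data_max - data_min) * (new_max - new_min) + new_min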

@miguelgfierro (Collaborator) commented Jul 2, 2020

@viktorku thanks for this. We are reviewing your code; it will take some time, hopefully you understand.

For @gramhagen, @loomlike, @anargyri: if we compare the original SAR with this version, the ranking metrics are the same:

Top K: 10
MAP: 0.110591
NDCG: 0.382461
Precision@K: 0.330753
Recall@K: 0.176385

So the new normalization does not affect the ranking metrics, which is expected.

@gramhagen (Collaborator)

@miguelgfierro you must use normalize=True in both the constructor and the recommend_k_items method to see the poorly performing metrics. I do like the simplification in this implementation that only requires the flag at construction (though perhaps it would be useful to provide a way to get unnormalized results).

@gramhagen (Collaborator) commented Jul 4, 2020

@viktorku it's true that the current normalization penalizes items with more interactions; this is done to remove the bias the algorithm has towards more popular items. One could argue that these offline metrics reward bias towards popular items, but the results do seem pretty extreme after normalization.

The concern I have with the rescaling you propose is determining the appropriate min/max values for the data. If they are inferred across all users, we further penalize less frequently interacted items and favor more popular ones. But if they are inferred per user, we could arbitrarily generate large differences between similar scores, or shift small scores so they end up with high ratings. You can see some of this clearly happening in the large RMSE values. It is worth noting that the current normalization at least produces better-calibrated scores:
RMSE: 1.243966
MAE: 1.050101

I'm wondering if there's a way we can use the result of the inner product between the unity user affinity and the item-item similarity to re-weight the results in a way that isn't just dividing the scores.

For example, using a Bayesian estimator to get a weighted rating:

W = (R·v + C·m) / (v + m)

where:
W = normalized user rating
R·v = un-normalized score
C = average item rating (over all users)
v = user-specific item counts (dot product of the unity user affinity with the item-item similarity)
m = tunable number of ratings controlling how strongly scores are pulled towards the average value.

I'm struggling a bit with which "prior" C to use here. I'm thinking the average rating for each item over all users would work, but that may lose some of the specificity the algorithm provides through the user affinity. Another option could be the average rating across all items for that user, or even your proposed re-scaled rating for the user-item pair. There's still an arbitrary value m to define, but this could be adjusted by the user depending on how strongly they want to rely on the prior.
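A rough numpy sketch of that weighted-rating idea, reusing the toy arrays from the earlier example (the values of C and m are placeholders, and choosing them is exactly the open question above):

import numpy as np

scores = np.array([[20, 20, 11, 9, 9], [24, 24, 9, 15, 15]])  # R*v: ratings dot item-item similarity
counts = np.array([[5, 5, 3, 2, 2], [6, 6, 2, 4, 4]])         # v: unity affinity dot item-item similarity

C = 3.5  # prior, e.g. an average item rating (placeholder value)
m = 2.0  # tunable pseudo-count controlling the pull towards the prior

weighted = (scores + C * m) / (counts + m)  # W = (R*v + C*m) / (v + m)
print(weighted)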

@gramhagen (Collaborator)

The more I think about it, the more I think it's probably right not to "correct" the bias in the algorithm through normalization. Your approach still maintains the original ranking, so that seems intuitive.

Maybe a good compromise is to use the result of the dot product of the unity-user-affinity and the item-item similarity to determine the min and max possible scores. Then for each user we can scale based on the min and max scores across all items for that user? This would still maintain the per-user ranking and hopefully avoid edge cases where values are inflated or shrunk.
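A rough sketch of that compromise, reusing the toy arrays from the earlier experiment (illustrative only, not the final implementation):

import numpy as np

ua = np.array([[5, 4, 2, 0, 0], [4, 5, 0, 3, 3]])  # user-item ratings
i2i = np.array([[2, 2, 1, 1, 1], [2, 2, 1, 1, 1],
                [1, 1, 1, 0, 0], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1]])

scores = ua.dot(i2i)                       # raw SAR scores
counts = np.where(ua != 0, 1, 0).dot(i2i)  # unity-user-affinity dot item-item similarity

rating_min, rating_max = 1.0, 5.0
# per-user min/max *possible* scores derived from the counts and the rating scale
score_min = rating_min * counts.min(axis=1, keepdims=True)
score_max = rating_max * counts.max(axis=1, keepdims=True)

# rescale each user's scores into the original rating range; the per-user ranking is preserved
normalized = (scores - score_min) / (score_max - score_min) * (rating_max - rating_min) + rating_min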

I may have some time to try this out and will report back with results. Any thoughts on this approach, @viktorku @anargyri @miguelgfierro?

@viktorku (Contributor, Author) commented Jul 4, 2020

The more I think about it, the more I think it's probably right not to "correct" the bias in the algorithm through normalization

I think so too.

Maybe a good compromise is to use the result of the dot product of the unity-user-affinity and the item-item similarity to determine the min and max possible scores. Then for each user we can scale based on the min and max scores across all items for that user? This would still maintain the per-user ranking and hopefully avoid edge cases where values are inflated or shrunk.

That sounds like a good idea. Please do try it out. My intuition says it would clamp outliers without affecting or worsening the ratings of the normal data points (which, I'm afraid, is the cost of my naive approach). As long as the ranking stays the same and the generated recommendations are identical, this would make sense, and we might see better accuracy. Either way, normalising shouldn't tamper with the actual results.

@gramhagen (Collaborator)

@viktorku I pushed an implementation to your repo/branch. Please review and merge it in if you approve.

Thank you for highlighting this discrepancy, the examples were very helpful!

@anargyri (Collaborator) commented Jul 9, 2020

I had a quick look at the code. It seems to me there is a bug in the code in master which is not really fixed in this PR. The bug comes from this function https://github.com/microsoft/recommenders/blob/558733f366983d576953a407ab7180b1642dbc5b/reco_utils/recommender/sar/sar_singlenode.py#L456
recommend_k_items() is used when evaluating the ranking metrics. So it should not depend on normalized scores (and hence there should not be a normalize argument in its signature). The PR code still has this issue.

So the bug is not caused by the type of normalization used; the desired behavior of recommend_k_items() should be independent of the normalization, because it outputs a ranking of the items per user. Normalization should be applied in a separate function (I don't think we should have a normalize flag in the constructor).

Note that any type of normalization with an item-dependent denominator will have this issue: the ranking of the scores before and after normalization will be different, because item i is normalized with a different denominator than item j, and hence the order of i and j in the ranking may flip. We opted to ignore this inconsistency and use the normalized scores only for rating metrics; an alternative would be to normalize scores by a factor that depends only on the user and not on the item. The former has the advantage that the normalized scores are in the right range; the latter does not guarantee this.

An error in the notebook is using top_k for the computation of rating metrics; rating metrics should be computed on the full test data, not on k items only.

I also recommend adding tests that cover all cases and check for the expected results (as numpy arrays), both for the normalized scores and for the top-k recommendations.

@gramhagen (Collaborator)

@anargyri can you explain the rationale behind ignoring the inconsistency between normalized and unnormalized results? Since, as you mention, normalization in the previous item-based scheme changes the ranking, is it possible to treat the unnormalized ranking metrics and the normalized rating metrics as representing the performance of the same algorithm?

@viktorku (Contributor, Author) commented Jul 9, 2020

We opted to ignore this inconsistency and use the normalized scores only for rating metrics

Why would normalized scores only be used for performance evaluation? I'm aware that predicting accurate ratings is not SAR's strong suit, but I see plenty of reasons for the recommendations to be scored in their true scale for external consumers, especially now that it's obvious the accuracy is not that bad. So I find it absurd that we can't have both: the top K items in their true scale and correctly ranked.

@viktorku (Contributor, Author) commented Jul 9, 2020

An error in the notebook is using top_k for the computation of rating metrics; rating metrics should be computed on the full test data, not on k items only.

Agree with this. I'll adapt the notebook.

@anargyri (Collaborator) commented Jul 9, 2020

@anargyri can you explain the rationale behind ignoring the inconsistency between normalized and unnormalized results? Since, as you mention, normalization in the previous item-based scheme changes the ranking, is it possible to treat the unnormalized ranking metrics and the normalized rating metrics as representing the performance of the same algorithm?

Well, there is no good rationale other than that any normalization inevitably involves some trade-off. There are ways to avoid this inconsistency that would normalize with factors depending on the ratings, but this would be more sensitive to variation from new ratings data. Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.
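For illustration, a quick sketch of that max-over-items idea with the toy arrays from earlier (dividing by a per-user scalar cannot flip the order of items within a user):

import numpy as np

scores = np.array([[20, 20, 11, 9, 9], [24, 24, 9, 15, 15]])  # ratings dot item-item similarity
counts = np.array([[5, 5, 3, 2, 2], [6, 6, 2, 4, 4]])         # the current per-item denominators

# take the max of the denominator over items for each user, then divide
normalized = scores / counts.max(axis=1, keepdims=True)
print(normalized)  # [[4.  4.  2.2 1.8 1.8], [4.  4.  1.5 2.5 2.5]]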

Since there is no obvious choice of normalization, there could be multiple options for the user to choose.

@anargyri (Collaborator) commented Jul 9, 2020

We opted to ignore this inconsistency and use the normalized scores only for rating metrics

Why would normalized scores only be used for performance evaluation? I'm aware that predicting accurate ratings is not SAR's strong suit, but I see plenty of reasons for the recommendations to be scored in their true scale for external consumers, especially now that it's obvious the accuracy is not that bad. So I find it absurd that we can't have both: the top K items in their true scale and correctly ranked.

See the above reply to @gramhagen; there are ways to do this, but they may have some other drawbacks. Feel free to choose one of them.
About the rating metrics, I think you could use the predict() method in the notebook.
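For example, something along these lines in the notebook (a sketch only, assuming the repo's python_evaluation helpers and that predict() returns a DataFrame with a "prediction" column):

from reco_utils.evaluation.python_evaluation import rmse, mae

# score the full test set (not just the top-k recommendations) for rating metrics
predictions = model.predict(test)

eval_rmse = rmse(test, predictions, col_user="userID", col_item="itemID",
                 col_rating="rating", col_prediction="prediction")
eval_mae = mae(test, predictions, col_user="userID", col_item="itemID",
               col_rating="rating", col_prediction="prediction")
print("RMSE:\t{:.6f}\nMAE:\t{:.6f}".format(eval_rmse, eval_mae))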

@gramhagen (Collaborator)

Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.

That is essentially the proposal here, if I'm interpreting what you're saying correctly. This method normalizes per user by first scaling each user's scores between 0 and 1 based on the min and max possible scores (looking at the current denominators across all items for a user). Then this is scaled back up to the original rating range. We keep the scores in the correct rating range as well as maintain the original ranking. The downside is there can be some distortion of the ratings (the rating results are 30-40% worse than ALS, but I think this is still reasonable).

The other issue is the expected usage of the normalization flag. If we do not specify it at construction, then we would need to generate the unity-user-affinity matrix in all cases, which consumes unneeded memory for users who don't care about normalization. Since the proposed approach does not alter rankings, specifying normalization once at construction is best: it avoids that memory cost when normalization is excluded and does not require the user to remember a second flag later during scoring, an easy error to make given that I made it myself in the notebook you pointed out =).

Are there cases where the unnormalized scores would be useful? If so, I suggest we keep the normalize flag in the scoring methods but set the default to None and inherit the value set at construction. That way the user doesn't need to worry about it if they've already specified it one way or the other, and can disable it if desired. We would need to keep the raised error for the case where the user does not construct the model with normalization but then sets normalize=True at scoring time, though honestly that seems like an odd requirement.
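A small sketch of that flag-inheritance pattern (illustrative class and names only, not the actual SAR code):

class ExampleRecommender:
    def __init__(self, normalize=False):
        # the flag chosen at construction decides whether normalization data is built at fit time
        self.normalize = normalize

    def recommend_k_items(self, test, top_k=10, normalize=None):
        # None means "inherit whatever was set at construction"
        if normalize is None:
            normalize = self.normalize
        if normalize and not self.normalize:
            raise ValueError("Model was not constructed with normalize=True")
        # ... scoring would use normalized or raw scores depending on `normalize` ...
        return normalize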

@anargyri (Collaborator)

Perhaps a better way to fix this issue is to take the current denominator and apply a max over items to it. It would keep the rankings consistent and the ratings in the correct range.

That is essentially the proposal here, if I'm interpreting what you're saying correctly. This method normalizes per user by first scaling each user's scores between 0 and 1 based on the min and max possible scores (looking at the current denominators across all items for a user). Then this is scaled back up to the original rating range. We keep the scores in the correct rating range as well as maintain the original ranking. The downside is there can be some distortion of the ratings (the rating results are 30-40% worse than ALS, but I think this is still reasonable).

The other issue is the expected usage of the normalization flag. If we do not specify it at construction, then we would need to generate the unity-user-affinity matrix in all cases, which consumes unneeded memory for users who don't care about normalization. Since the proposed approach does not alter rankings, specifying normalization once at construction is best: it avoids that memory cost when normalization is excluded and does not require the user to remember a second flag later during scoring, an easy error to make given that I made it myself in the notebook you pointed out =).

Are there cases where the unnormalized scores would be useful? If so, I suggest we keep the normalize flag in the scoring methods but set the default to None and inherit the value set at construction. That way the user doesn't need to worry about it if they've already specified it one way or the other, and can disable it if desired. We would need to keep the raised error for the case where the user does not construct the model with normalization but then sets normalize=True at scoring time, though honestly that seems like an odd requirement.

This renormalization sounds good. The important thing is to maintain the rankings and the range of the ratings at the same time. There could be multiple ways to do this, in fact, and in the future there could be an option to choose between different normalization methods.

As you say, the implementation is optimal in terms of efficiency. One question I have: is the unity-user-affinity matrix required in your newly proposed normalization (which seems to require only the scores, AFAIU)?

In general, my concern was the outward interface, where you need to specify the normalization flag twice (both in the constructor and when recommending) and it is easy to forget one of the two. I think it is not necessary to have the flag when recommending the top k items. It is just a matter of implementation whether you use the unnormalized or the normalized scores for ranking the items (and with your proposed normalization they should be the same). On the other hand, in predict() we need to use the normalized scores, so again no flag is needed. In summary, it seems to me that the flag in the constructor should be an "efficiency" flag, not a functionality flag, i.e. the results (rankings and ratings) should not change if you flip the flag, but the computational time will be affected.

@gramhagen (Collaborator)

Ok. Sounds good. So we keep this PR's version of normalization with the single flag at construction.

I think the only remaining items were the comments on the evaluation tools. @viktorku when you've addressed those and the evaluation in the notebook, we should be good.

Let me know if you need any further support.

@gramhagen (Collaborator)

@viktorku is this ready? Or do you need any support to complete it?

@viktorku (Contributor, Author)

Hey! Sorry for the wait, I was on vacation. I will finish this up in the coming days.

@viktorku force-pushed the fix-sar-normalization-and-add-accuracy-metrics branch from 7891791 to 803aacf on August 18, 2020, 09:39
@viktorku (Contributor, Author)

@gramhagen I finished this. Sorry for the delay. Let me know if there's anything else to be done 👍

@gramhagen (Collaborator) left a comment


Looks good, thanks a lot for your work on this!

@miguelgfierro merged commit c7658ac into recommenders-team:staging on Aug 20, 2020