
add logic and example of item feature vector based item-item similarity calculation #1505

Merged — 20 commits merged into staging from zhangya/itemsimilarity, Sep 9, 2021

Conversation

@YanZhangADS (Collaborator) commented Aug 20, 2021

Description

In scenarios where item features are available, we may want to calculate item-item similarity based on item feature vectors. This PR adds the logic and an example of how to calculate diversity metrics using item-feature-vector-based item-item similarity.
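The core computation here is cosine similarity between two item feature vectors. The PR itself operates on Spark ML vectors; the following is a minimal, framework-free sketch of the idea, with an illustrative function name:

```python
import math

def item_cosine_similarity(v1, v2):
    """Cosine similarity between two item feature vectors (illustrative sketch).

    Identical directions give 1.0, orthogonal vectors give 0.0.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Example: two items described by hypothetical (genre_a, genre_b, recency) features
sim = item_cosine_similarity([1.0, 0.0, 0.5], [1.0, 0.0, 0.5])  # identical -> 1.0
```

Diversity metrics then use 1 minus the average pairwise similarity of recommended items, so higher feature-vector similarity within a recommendation list means lower diversity.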

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@review-notebook-app commented: Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter Notebooks.

@gramhagen (Collaborator) left a comment:

A few small comments; this looks like a nice addition.

examples/03_evaluate/als_movielens_diversity_metrics.ipynb (outdated, resolved)
recommenders/evaluation/spark_evaluation.py (outdated, resolved)
    p = 2
    return float(v1.dot(v2)) / float(v1.norm(p) * v2.norm(p))
except:
    return 0
@gramhagen (Collaborator):

Do we want to indicate that two vectors are identical if there's a failure to compute the distance? Can we return a NaN or something?

@YanZhangADS (Collaborator, Author):

If two vectors are identical, their cosine similarity will be 1. In which situation do you think the calculation would fail? @gramhagen

@gramhagen (Collaborator):

Sorry, you're right, we're not computing a distance here. I think returning 0 in the exception is okay in this case. My concern is about silent failures: if for some reason there is an exception, most likely because the vectors are not the correct size or have invalid values, we never surface the fact that one of the item vectors is incorrectly defined.

@YanZhangADS (Collaborator, Author):

Added logic to check the size of the vectors. If the sizes of the two input vectors are different, an exception is raised.

Collaborator:

I'm still a bit confused on this part. Instead of both raising an exception for one case and providing a generic except that silently swallows everything else and returns 0, I think we should do one or the other: either use a generic exception and return a NaN, or drop the try-except entirely and let the exception propagate.

@anargyri (Collaborator) — Sep 2, 2021:

It's probably better practice to pass the exception through instead of returning 0, because people interpret 0 as meaning the two vectors are orthogonal. Returning NaN in the case of an exception is also not great, because people would expect NaN to be the output when one of the vectors is zero.

@YanZhangADS (Collaborator, Author):

Removed the try/catch block.
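With the try/except removed, errors surface naturally rather than being mapped to 0. A hypothetical pure-Python sketch of the resulting behavior (the actual code works on Spark ML vectors; `sim_cos` is an illustrative name, not the PR's function):

```python
import math

def sim_cos(v1, v2):
    """Cosine similarity with no try/except: bad inputs fail loudly.

    A size mismatch raises ValueError, and a zero-length vector raises
    ZeroDivisionError, instead of being silently reported as similarity 0.
    """
    if len(v1) != len(v2):
        raise ValueError(f"vector sizes differ: {len(v1)} vs {len(v2)}")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```

This matches the concerns in the thread: 0 keeps its meaning of "orthogonal", NaN is not overloaded, and an incorrectly defined item vector is no longer hidden.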

@miguelgfierro (Collaborator) left a comment:

This is really nice, Yan. It looks like there are a couple of tests breaking; @laserprec might be able to help.

@@ -588,6 +608,15 @@ def _get_pairwise_items(self, df):
)

def _get_cosine_similarity(self, n_partitions=200):
if self.item_feature_df is None:
@anargyri (Collaborator) — Sep 2, 2021:

It would be better to select the method for calculating similarity via an argument, rather than based on whether item_feature_df is present or not. Otherwise, if the item features are available in the dataframe but one wishes to compute the co-occurrence based similarities, one would need to create a new SparkDiversityEvaluation object. There would also need to be an additional member dataframe, to keep track of whether it has already been computed.

@YanZhangADS (Collaborator, Author):

Do you mean we should create only a single SparkDiversityEvaluation object and be able to get diversity values from two different item similarity approaches when item features are available? E.g.

    eval = SparkDiversityEvaluation(args, ...)
    diversity_item_cooccurrence = eval.diversity(method="item_cooccurrence")
    diversity_item_similarity = eval.diversity(method="item_similarity")

@YanZhangADS (Collaborator, Author):

If we use that approach, it doubles the code in multiple functions and looks very confusing. I therefore kept the previous method unchanged, i.e. only one item_sim_measure can be chosen per object:

    eval = SparkDiversityEvaluation(args, ...)

"Select the method for calculating similarity using an argument instead of whether item_feature_df is present or not" — this is implemented.
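As a hypothetical sketch of this design, the similarity measure is fixed per evaluator instance via a constructor argument (the class name and validation details below are illustrative, not the exact library API):

```python
class DiversityEvaluator:
    """Toy sketch: one similarity measure is fixed per evaluator object."""

    SUPPORTED_MEASURES = ("item_cooccurrence_count", "item_feature_vector")

    def __init__(self, item_sim_measure="item_cooccurrence_count", item_feature_df=None):
        if item_sim_measure not in self.SUPPORTED_MEASURES:
            raise ValueError(f"unsupported item_sim_measure: {item_sim_measure}")
        # Feature-based similarity needs the item feature dataframe up front.
        if item_sim_measure == "item_feature_vector" and item_feature_df is None:
            raise ValueError("item_feature_df is required for feature-vector similarity")
        self.item_sim_measure = item_sim_measure
        self.item_feature_df = item_feature_df

# Under this design, computing both measures requires two evaluator objects:
eval_cooccurrence = DiversityEvaluator(item_sim_measure="item_cooccurrence_count")
eval_features = DiversityEvaluator(item_sim_measure="item_feature_vector",
                                   item_feature_df={"item_1": [1.0, 0.0]})
```

This is exactly the trade-off debated below: the constructor-level choice keeps each function simple, at the cost of one object per similarity measure.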

Collaborator:

Why would it increase the code significantly? I think you would just need to add an extra argument to some functions and pass it from the calling function to the called one, so that you could write

    als_diversity = als_eval.diversity(method="cooccurrence")

or something similar. And you would remove the argument from the constructor that you added in the most recent commit.

Collaborator:

The main thing I would like to avoid is the user creating two SparkDiversityEvaluation objects when they want to compute both types of diversity or serendipity metrics. This is more user-unfriendly than the alternative. And imagine what would happen if we added more item similarity methods: users would need to create as many objects as there are methods.

@YanZhangADS (Collaborator, Author):

It would look something like this: [screenshot of the proposed code]

@YanZhangADS (Collaborator, Author):

@anargyri @gramhagen, what do you think? Should we change to the above method?

@anargyri (Collaborator) — Sep 9, 2021:

Actually, you don't need to duplicate the code in this way. What I meant was to change some of the member DFs into dictionaries indexed by the similarity method, i.e. do something like this instead:

    def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
        if item_sim_measure not in self.df_user_diversity:
            self.df_intralist_similarity[item_sim_measure] = self._get_intralist_similarity(
                self.reco_df, item_sim_measure=item_sim_measure
            )
            self.df_user_diversity[item_sim_measure] = (
                self.df_intralist_similarity[item_sim_measure]
                .withColumn("user_diversity", 1 - F.col("avg_il_sim"))
                .select(self.col_user, "user_diversity")
                .orderBy(self.col_user)
            )
        return self.df_user_diversity[item_sim_measure]

So the code doesn't really increase, but it becomes less readable. I am fine either way; it's a trade-off, shifting convenience from the end user to the developer or vice versa.
@yueguoguo @miguelgfierro, any opinions on this?
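The dictionary-indexed caching pattern sketched in that comment can be shown outside Spark with a toy class (all names illustrative; the stand-in string plays the role of a Spark DataFrame):

```python
class CachedEvaluator:
    """Toy sketch: memoize per-measure results in dicts keyed by the measure name."""

    def __init__(self):
        self._user_diversity = {}  # keyed by similarity measure
        self.compute_calls = 0     # counts actual computations, for illustration

    def _compute_user_diversity(self, item_sim_measure):
        self.compute_calls += 1
        return f"user_diversity[{item_sim_measure}]"  # stand-in for a DataFrame

    def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
        # Compute once per measure; later calls with the same measure hit the cache.
        if item_sim_measure not in self._user_diversity:
            self._user_diversity[item_sim_measure] = self._compute_user_diversity(
                item_sim_measure
            )
        return self._user_diversity[item_sim_measure]
```

One object can then serve any number of similarity measures without recomputation, which is the user-friendliness argument made above.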

@gramhagen (Collaborator) left a comment:

LGTM

@codecov-commenter commented:

Codecov Report

Merging #1505 (39ca61d) into staging (c5029af) will decrease coverage by 12.20%. The diff coverage is 82.75%.

@@             Coverage Diff              @@
##           staging    #1505       +/-   ##
============================================
- Coverage    74.23%   62.03%   -12.21%     
============================================
  Files           84       84               
  Lines         8369     8397       +28     
============================================
- Hits          6213     5209     -1004     
- Misses        2156     3188     +1032     
Impacted Files Coverage Δ
recommenders/evaluation/spark_evaluation.py 86.66% <81.48%> (-0.78%) ⬇️
recommenders/utils/constants.py 100.00% <100.00%> (ø)
...ecommenders/models/newsrec/io/mind_all_iterator.py 12.21% <0.00%> (-86.65%) ⬇️
recommenders/models/newsrec/io/mind_iterator.py 15.67% <0.00%> (-82.71%) ⬇️
...ommenders/models/deeprec/io/sequential_iterator.py 15.85% <0.00%> (-81.94%) ⬇️
recommenders/models/newsrec/models/base_model.py 30.90% <0.00%> (-59.40%) ⬇️
...deeprec/models/sequential/sequential_base_model.py 46.97% <0.00%> (-47.66%) ⬇️
recommenders/models/geoimc/geoimc_data.py 41.66% <0.00%> (-44.80%) ⬇️
...enders/models/deeprec/io/dkn_item2item_iterator.py 45.61% <0.00%> (-42.11%) ⬇️
...menders/models/deeprec/models/graphrec/lightgcn.py 51.47% <0.00%> (-40.24%) ⬇️
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@YanZhangADS YanZhangADS merged commit c806d55 into staging Sep 9, 2021
@miguelgfierro miguelgfierro deleted the zhangya/itemsimilarity branch February 7, 2022 10:04