
add logic and example of item feature vector based item-item similarity calculation #1505

Merged — 20 commits merged into staging from zhangya/itemsimilarity, Sep 9, 2021

Conversation

@YanZhangADS (Collaborator) commented Aug 20, 2021

Description

In scenarios where item features are available, we may want to calculate item-item similarity based on item feature vectors. This PR adds the logic and an example of how to calculate diversity metrics using item-feature-vector-based item-item similarity.
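The core computation here is cosine similarity between two item feature vectors. The PR itself operates on Spark ML vectors; the following is a minimal, framework-free sketch of the idea, with an illustrative function name:

```python
import math

def item_cosine_similarity(v1, v2):
    """Cosine similarity between two item feature vectors (illustrative sketch).

    Identical directions give 1.0, orthogonal vectors give 0.0.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Example: two items described by hypothetical (genre_a, genre_b, recency) features
sim = item_cosine_similarity([1.0, 0.0, 0.5], [1.0, 0.0, 0.5])  # identical -> 1.0
```

Diversity metrics then use 1 minus the average pairwise similarity of recommended items, so higher feature-vector similarity within a recommendation list means lower diversity.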

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@review-notebook-app commented: Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter Notebooks.

@gramhagen (Collaborator) left a comment:

A few small comments; this looks like a nice addition.

examples/03_evaluate/als_movielens_diversity_metrics.ipynb (outdated, resolved)
recommenders/evaluation/spark_evaluation.py (outdated, resolved)
    p = 2
    return float(v1.dot(v2)) / float(v1.norm(p) * v2.norm(p))
except:
    return 0
@gramhagen (Collaborator):

Do we want to indicate that two vectors are identical if there's a failure to compute the distance? Can we return a NaN or something?

@YanZhangADS (Collaborator, Author):

If two vectors are identical, their cosine similarity will be 1. In which situation do you think the calculation would fail? @gramhagen

@gramhagen (Collaborator):

Sorry, you're right, we're not computing a distance here. I think returning 0 in the exception is okay in this case. My concern is about silent failures: if for some reason there is an exception, most likely because the vectors are not the correct size or have invalid values, we never surface the fact that one of the item vectors is incorrectly defined.

@YanZhangADS (Collaborator, Author):

Added logic to check the size of the vectors. If the sizes of the two input vectors are different, an exception is raised.

Collaborator:

I'm still a bit confused on this part. Instead of both raising an exception for one case and providing a generic except that silently swallows everything else and returns 0, I think we should do one or the other: either use a generic exception and return a NaN, or drop the try-except entirely and let the exception propagate.

@anargyri (Collaborator) — Sep 2, 2021:

It's probably better practice to pass the exception through instead of returning 0, because people interpret 0 as meaning the two vectors are orthogonal. Returning NaN in the case of an exception is also not great, because people would expect NaN to be the output when one of the vectors is zero.

@YanZhangADS (Collaborator, Author):

Removed the try/catch block.
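With the try/except removed, errors surface naturally rather than being mapped to 0. A hypothetical pure-Python sketch of the resulting behavior (the actual code works on Spark ML vectors; `sim_cos` is an illustrative name, not the PR's function):

```python
import math

def sim_cos(v1, v2):
    """Cosine similarity with no try/except: bad inputs fail loudly.

    A size mismatch raises ValueError, and a zero-length vector raises
    ZeroDivisionError, instead of being silently reported as similarity 0.
    """
    if len(v1) != len(v2):
        raise ValueError(f"vector sizes differ: {len(v1)} vs {len(v2)}")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```

This matches the concerns in the thread: 0 keeps its meaning of "orthogonal", NaN is not overloaded, and an incorrectly defined item vector is no longer hidden.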

@miguelgfierro (Collaborator) left a comment:

This is really nice, Yan. It looks like there are a couple of tests breaking; @laserprec might be able to help.

@@ -588,6 +608,15 @@ def _get_pairwise_items(self, df):
)

def _get_cosine_similarity(self, n_partitions=200):
if self.item_feature_df is None:
@anargyri (Collaborator) — Sep 2, 2021:

It would be better to select the method for calculating similarity via an argument, rather than based on whether item_feature_df is present or not. Otherwise, if the item features are available in the dataframe but one wishes to compute the co-occurrence based similarities, one would need to create a new SparkDiversityEvaluation object. There would also need to be an additional member dataframe, to keep track of whether it has already been computed.

@YanZhangADS (Collaborator, Author):

Do you mean we should create only a single SparkDiversityEvaluation object and be able to get diversity values from two different item similarity approaches when item features are available? E.g.

    eval = SparkDiversityEvaluation(args, ...)
    diversity_item_cooccurrence = eval.diversity(method="item_cooccurrence")
    diversity_item_similarity = eval.diversity(method="item_similarity")

@YanZhangADS (Collaborator, Author):

If we use that approach, it doubles the code in multiple functions and looks very confusing. I therefore kept the previous method unchanged, i.e. only one item_sim_measure can be chosen per object:

    eval = SparkDiversityEvaluation(args, ...)

"Select the method for calculating similarity using an argument instead of whether item_feature_df is present or not" — this is implemented.
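As a hypothetical sketch of this design, the similarity measure is fixed per evaluator instance via a constructor argument (the class name and validation details below are illustrative, not the exact library API):

```python
class DiversityEvaluator:
    """Toy sketch: one similarity measure is fixed per evaluator object."""

    SUPPORTED_MEASURES = ("item_cooccurrence_count", "item_feature_vector")

    def __init__(self, item_sim_measure="item_cooccurrence_count", item_feature_df=None):
        if item_sim_measure not in self.SUPPORTED_MEASURES:
            raise ValueError(f"unsupported item_sim_measure: {item_sim_measure}")
        # Feature-based similarity needs the item feature dataframe up front.
        if item_sim_measure == "item_feature_vector" and item_feature_df is None:
            raise ValueError("item_feature_df is required for feature-vector similarity")
        self.item_sim_measure = item_sim_measure
        self.item_feature_df = item_feature_df

# Under this design, computing both measures requires two evaluator objects:
eval_cooccurrence = DiversityEvaluator(item_sim_measure="item_cooccurrence_count")
eval_features = DiversityEvaluator(item_sim_measure="item_feature_vector",
                                   item_feature_df={"item_1": [1.0, 0.0]})
```

This is exactly the trade-off debated below: the constructor-level choice keeps each function simple, at the cost of one object per similarity measure.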

Collaborator:

Why would it increase the code significantly? I think you would just need to add an extra argument to some functions and pass it from the calling function to the called one, so that you could write

    als_diversity = als_eval.diversity(method="cooccurrence")

or something similar. And you would remove the argument from the constructor that you added in the most recent commit.

Collaborator:

The main thing I would like to avoid is the user creating two SparkDiversityEvaluation objects when they want to compute both types of diversity or serendipity metrics. This is more user-unfriendly than the alternative. And imagine what would happen if we added more item similarity methods: users would need to create as many objects as there are methods.

@YanZhangADS (Collaborator, Author):

It would look something like this: [screenshot of the proposed code]

@YanZhangADS (Collaborator, Author):

@anargyri @gramhagen, what do you think? Should we change to the above method?

@anargyri (Collaborator) — Sep 9, 2021:

Actually, you don't need to duplicate the code in this way. What I meant was to change some of the member DFs into dictionaries indexed by the similarity method, i.e. do something like this instead:

    def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
        if item_sim_measure not in self.df_user_diversity:
            self.df_intralist_similarity[item_sim_measure] = self._get_intralist_similarity(
                self.reco_df, item_sim_measure=item_sim_measure
            )
            self.df_user_diversity[item_sim_measure] = (
                self.df_intralist_similarity[item_sim_measure]
                .withColumn("user_diversity", 1 - F.col("avg_il_sim"))
                .select(self.col_user, "user_diversity")
                .orderBy(self.col_user)
            )
        return self.df_user_diversity[item_sim_measure]

So the code doesn't really increase, but it becomes less readable. I am fine either way; it's a trade-off, shifting convenience from the end user to the developer or vice versa.
@yueguoguo @miguelgfierro, any opinions on this?
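The dictionary-indexed caching pattern sketched in that comment can be shown outside Spark with a toy class (all names illustrative; the stand-in string plays the role of a Spark DataFrame):

```python
class CachedEvaluator:
    """Toy sketch: memoize per-measure results in dicts keyed by the measure name."""

    def __init__(self):
        self._user_diversity = {}  # keyed by similarity measure
        self.compute_calls = 0     # counts actual computations, for illustration

    def _compute_user_diversity(self, item_sim_measure):
        self.compute_calls += 1
        return f"user_diversity[{item_sim_measure}]"  # stand-in for a DataFrame

    def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
        # Compute once per measure; later calls with the same measure hit the cache.
        if item_sim_measure not in self._user_diversity:
            self._user_diversity[item_sim_measure] = self._compute_user_diversity(
                item_sim_measure
            )
        return self._user_diversity[item_sim_measure]
```

One object can then serve any number of similarity measures without recomputation, which is the user-friendliness argument made above.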

@gramhagen (Collaborator) left a comment:

LGTM

@codecov-commenter commented:

Codecov Report

Merging #1505 (39ca61d) into staging (c5029af) will decrease coverage by 12.20%. The diff coverage is 82.75%.

@@             Coverage Diff              @@
##           staging    #1505       +/-   ##
============================================
- Coverage    74.23%   62.03%   -12.21%     
============================================
  Files           84       84               
  Lines         8369     8397       +28     
============================================
- Hits          6213     5209     -1004     
- Misses        2156     3188     +1032     
Impacted Files Coverage Δ
recommenders/evaluation/spark_evaluation.py 86.66% <81.48%> (-0.78%) ⬇️
recommenders/utils/constants.py 100.00% <100.00%> (ø)
...ecommenders/models/newsrec/io/mind_all_iterator.py 12.21% <0.00%> (-86.65%) ⬇️
recommenders/models/newsrec/io/mind_iterator.py 15.67% <0.00%> (-82.71%) ⬇️
...ommenders/models/deeprec/io/sequential_iterator.py 15.85% <0.00%> (-81.94%) ⬇️
recommenders/models/newsrec/models/base_model.py 30.90% <0.00%> (-59.40%) ⬇️
...deeprec/models/sequential/sequential_base_model.py 46.97% <0.00%> (-47.66%) ⬇️
recommenders/models/geoimc/geoimc_data.py 41.66% <0.00%> (-44.80%) ⬇️
...enders/models/deeprec/io/dkn_item2item_iterator.py 45.61% <0.00%> (-42.11%) ⬇️
...menders/models/deeprec/models/graphrec/lightgcn.py 51.47% <0.00%> (-40.24%) ⬇️
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@YanZhangADS YanZhangADS merged commit c806d55 into staging Sep 9, 2021
@miguelgfierro miguelgfierro deleted the zhangya/itemsimilarity branch February 7, 2022 10:04