
Niwilso/adding tfidf #1088

Merged: 35 commits merged into recommenders-team:staging on Apr 23, 2020

Conversation

@niwilso (Contributor) commented Apr 18, 2020

Description

Adding the quick start notebook (./notebooks/00_quick_start/tfidf_covid.ipynb), associated utils (./reco_utils/dataset/covid_utils.py and ./reco_utils/recommender/tfidf/tfidf_utils.py), and unit tests (./tests/unit/test_covid_utils.py and ./tests/unit/test_tfidf_utils.py) to the staging branch.

This directly addresses the open issue linked below, which proposes adding a simple TF-IDF recommender demonstration using a novel dataset.

Related Issues

#1087

Checklist:

  • [x] I have followed the contribution guidelines and code style for this project.
  • [x] I have added tests covering my contributions.
  • [x] I have updated the documentation accordingly.
  • [x] This PR is being made to staging and not master.

@review-notebook-app

Check out this pull request on ReviewNB.

You'll be able to see the Jupyter notebook diff and discuss changes. Powered by ReviewNB.

@gramhagen (Collaborator)

This is really great, thanks for the addition! I looked through it briefly, and it looks like you did a very thorough job of fitting this into the repo. I need to spend a bit more time looking more closely and will share a few suggestions.

I'm thinking we may want to balance maintenance cost against the likelihood of reuse for some of these functions. If there's clear value in wrapping a function because it simplifies some complex functionality (particularly if it's expected to be repeated), then it makes sense. In general, we've sometimes gone the other way in this repo and ended up creating many functions that make it hard to discover functionality or to ensure it is tested and working for all cases. So I'll take some time to make sure we can avoid that here.

Review threads (resolved): reco_utils/dataset/covid_utils.py (4 threads), reco_utils/recommender/tfidf/tfidf_utils.py (1 thread)
@anargyri (Collaborator)

The notebook looks very nice, great job! Just one thing: could you truncate the text output? It is a bit too long.

I agree with the comments above about refactoring; the methods that are not specific to the COVID data set can be moved to another place like download_utils.py, and some of them (like removing duplicates or NaNs) can be done easily with pandas.

Another suggestion is to follow the convention we use for the other data sets (MovieLens, Criteo) with the load_pandas_df() functions, i.e. you could have such a function for the COVID data that the user can call, as in https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb

@niwilso (Contributor, Author) commented Apr 21, 2020


Thank you @anargyri for the comment! I have made the following changes in my latest commit:

  1. Text output is now truncated.
  2. get_blob_service() and load_csv_from_blob() have been moved from covid_utils.py to download_utils.py.
  3. covid_utils.py now has a load_pandas_df() function like in the MovieLens example (see the sketch after this list).
  4. The quickstart notebook has been updated to accommodate the above changes.
  5. I have kept remove_duplicates() and remove_nan() for the reasons explained in my responses to Miguel: these wrapper functions handle duplicates and NaNs more thoroughly for this specific dataset.
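
For illustration, a minimal sketch of what such a load_pandas_df() entry point for the COVID metadata could look like, following the MovieLens-style convention and delegating the download to the shared blob helper. The parameter names, the default container and blob names, and the helper's exact signature are assumptions made for this sketch, not the committed implementation.

    # Hypothetical sketch of covid_utils.load_pandas_df(); names and defaults are assumptions.
    from reco_utils.dataset.download_utils import load_csv_from_blob  # helper moved there in this PR


    def load_pandas_df(
        azure_storage_account_name,
        azure_storage_sas_token,
        container_name="covid19temp",      # illustrative container name
        metadata_filename="metadata.csv",  # illustrative blob name
    ):
        """Load the CORD-19 metadata CSV from Azure Blob Storage into a pandas DataFrame."""
        return load_csv_from_blob(
            azure_storage_account_name,
            container_name,
            metadata_filename,
            sas_token=azure_storage_sas_token,
        )

A caller would then only need account details and a SAS token, e.g. df = load_pandas_df(account_name, sas_token), mirroring how movielens.load_pandas_df() is used elsewhere in the repo.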

@miguelgfierro (Collaborator)

Amazing work, Nile! I'll run the tests manually, since right now our test machines are down due to the COVID situation.

@gramhagen (Collaborator) left a review

Lots of great content here! I added some suggestions to simplify and generalize where possible. The less code we put in, the less we have to test =)

Review threads (resolved): README.md, reco_utils/dataset/download_utils.py (2 threads), reco_utils/dataset/covid_utils.py (2 threads), reco_utils/recommender/tfidf/tfidf_utils.py
Code excerpt under review (reco_utils/recommender/tfidf/tfidf_utils.py):

        # Save to class
        self.recommendations = results

    def __organize_results_as_tabular(self, df_clean):
@niwilso (Contributor, Author) replied:
Thank you for pointing this function out! I tested whether it was faster to (1) keep the code as is, (2) use python_utils.get_top_k_score_items() in this method, or (3) reduce the iteration over k without using python_utils.get_top_k_score_items().

With k = 200:

Method                                       | Time to execute __organize_results_as_tabular()
(1) Original                                 | 0.266 seconds
(2) Use python_utils.get_top_k_score_items() | 0.353 seconds
(3) Reduce iterating in original             | 0.228 seconds

This is just a rough comparison, but I did a few runs and took the mean (for k = 200), and approach (3) was the fastest. My latest commit uses this approach.

Also, I noticed while playing around with this that python_utils.get_top_k_score_items() breaks when you set sort_top_k=True. If I were to use python_utils.get_top_k_score_items(), I would need some extra steps to make sure the output is properly sorted.
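
For context, a rough, hypothetical sketch of the "reduce iterating" idea in (3): extract the top-k most similar items per row of a dense cosine-similarity matrix with vectorized NumPy operations, then assemble the result as a tabular DataFrame. The function and column names below are illustrative assumptions and are not the code committed in tfidf_utils.py.

    # Hypothetical sketch; not the actual __organize_results_as_tabular() implementation.
    import numpy as np
    import pandas as pd


    def organize_top_k_tabular(similarity, item_ids, k=200):
        """Return a long-format DataFrame of the top-k most similar items per row.

        Assumes `similarity` is a square (n_items x n_items) array and k < n_items.
        """
        sim = np.array(similarity, dtype=float, copy=True)
        np.fill_diagonal(sim, -np.inf)  # exclude self-matches
        # argpartition selects the k largest entries per row without a full sort;
        # only those k entries are then sorted.
        top_idx = np.argpartition(-sim, kth=k - 1, axis=1)[:, :k]
        rows = np.arange(sim.shape[0])[:, None]
        order = np.argsort(-sim[rows, top_idx], axis=1)
        top_idx = np.take_along_axis(top_idx, order, axis=1)
        item_ids = np.asarray(item_ids)
        return pd.DataFrame(
            {
                "item_id": np.repeat(item_ids, k),
                "rec_item_id": item_ids[top_idx].ravel(),
                "rec_score": sim[rows, top_idx].ravel(),
                "rec_rank": np.tile(np.arange(1, k + 1), sim.shape[0]),
            }
        )

This is just one way to avoid a per-row Python loop over k; the actual change in the commit may differ.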

Review thread (resolved): reco_utils/recommender/tfidf/tfidf_utils.py
Code excerpt under review (unit tests):

        output = clean_dataframe(df)
        assert len(df) > len(output)

    def test_extract_text_from_file():
Collaborator:

Do you plan on adding tests for these functions? I would rather force them to fail than add ones that pass; that way they will get attention.

@niwilso (Contributor, Author) replied:

extract_text_from_file() has now been combined with retrieve_text().

I was not planning on writing tests for retrieve_text() and get_public_domain_text(), because both of these functions require accessing Azure Blob Storage.

Review thread (resolved): reco_utils/recommender/tfidf/tfidf_utils.py
@niwilso (Contributor, Author) commented Apr 22, 2020

Thank you @gramhagen for the thorough review!

With these changes, the code should be more efficient, since we are no longer looping through pandas DataFrames (thank you for pointing that out and providing the resource).

In addition, the TfidfRecommender class is now more generalized and is no longer limited to the COVID-19 dataset.

There are still a few open comments where I would like to get others' feedback before making changes.

Regardless of whether we make further changes on top of the latest commit, I believe this PR should be ready to go.

@gramhagen (Collaborator)

Thanks for making these changes. One last thing: we try to keep everything in the same format for readability by using black. Can you run that on the files for this PR? Some details here if you need them: https://github.com/microsoft/recommenders/wiki/Coding-Guidelines#python-and-docstrings-style

@miguelgfierro (Collaborator) left a review

Amazing work, Nile!

Review threads: reco_utils/dataset/covid_utils.py (resolved), tests/unit/test_tfidf_utils.py, tests/unit/test_covid_utils.py
Diff excerpt (new file):

    @@ -0,0 +1,83 @@
    # Copyright (c) Microsoft Corporation. All rights reserved.
Collaborator:

Also, it would be great to add an integration test for the notebook: https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

If you don't have the bandwidth, can you at least mark it as a TODO with @pytest.mark.skip(reason="TODO: Implement this")?
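
For reference, a minimal sketch of what such a skipped, papermill-driven notebook test might look like. The fixture names (notebooks, output_notebook, kernel_name), the notebook key, the integration marker, and the notebook parameter are assumptions based on how notebook tests are typically structured in this repo, not the actual contents of the test file.

    # Hypothetical placeholder test; fixture names, notebook key, and parameters are assumptions.
    import papermill as pm
    import pytest


    @pytest.mark.integration  # assumes the repo's integration marker is registered
    @pytest.mark.skip(reason="TODO: Implement this")
    def test_tfidf_covid_integration(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["tfidf_covid"]  # assumed fixture mapping notebook names to paths
        pm.execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(TOP_K=5),  # illustrative notebook parameter
        )
        # Once implemented, read results back (e.g. via scrapbook) and assert on them.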

@niwilso (Contributor, Author) replied:

I have created a new integration test file, ./tests/integration/test_covid.py. It does not currently have any tests to run, but I hope to write them out today. Will update.

@niwilso (Contributor, Author) replied:

Unfortunately, I did not have time to write out the tests today, but a placeholder file has been created at ./tests/integration/test_covid.py.

@miguelgfierro (Collaborator)

Regarding the black formatting request above: @niwilso, just trying to make your life easier. If you are using VSCode with the current repo, you would just need to add these lines to your settings:

    // Python
    "python.pythonPath": "C:/Anaconda3/envs/py36/python.exe",
    "python.formatting.blackPath": "C:/Anaconda3/envs/py36/Scripts/black.exe",
    "python.formatting.provider": "black",

and when you save, magically, everything will be automatically formatted. If you are interested in the full settings I use, you can find them here: https://github.com/miguelgfierro/codebase/blob/master/minsc/vscode_settings.json

@niwilso (Contributor, Author) commented Apr 23, 2020

Summary of today's main edits:

  • load_csv_from_blob() has been moved to a new general-purpose file, ./reco_utils/dataset/blob_utils.py.
  • Required packages have been added to ./scripts/generate_conda_file.py, matching the versioning in the NLP repo.
  • Created a placeholder for integration testing and marked the TODO tests to be skipped.
  • Formatted the code using black.

I believe all that remains now is writing the integration tests for the quickstart notebook. I don't think I'll be able to get to it tomorrow, but hopefully I can on Friday. If the notebook integration test isn't required for this PR, I would be happy to open a separate PR after this one has been completed.
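
For illustration, a minimal sketch of what a generic load_csv_from_blob() helper in blob_utils.py might look like, assuming the azure-storage-blob 2.x SDK (BlockBlobService) that was current at the time. The signature and parameter names are assumptions for this sketch, not the actual implementation.

    # Hypothetical sketch; signature and SDK choice (azure-storage-blob 2.x) are assumptions.
    from io import StringIO

    import pandas as pd
    from azure.storage.blob import BlockBlobService  # azure-storage-blob <= 2.x API


    def load_csv_from_blob(account_name, container_name, blob_name, sas_token=None, **read_csv_kwargs):
        """Download a CSV blob from Azure Blob Storage and return it as a pandas DataFrame."""
        blob_service = BlockBlobService(account_name=account_name, sas_token=sas_token)
        csv_text = blob_service.get_blob_to_text(container_name, blob_name).content
        return pd.read_csv(StringIO(csv_text), **read_csv_kwargs)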

@gramhagen (Collaborator) left a review

Excellent work!

@miguelgfierro merged commit e646c4e into recommenders-team:staging on Apr 23, 2020.