-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC Add get_ken_embeddings example #602
DOC Add get_ken_embeddings example #602
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments
skrub/datasets/_ken_embeddings.py
Outdated
>>> games_embedding = get_ken_embeddings(types="video_games") | ||
>>> games_embedding.head() | ||
Entity Type ... X198 X199 | ||
0 <The_Mysterious_Island> <wikicat_novels_adapted_into_video_games> ... -0.072814 -0.156973 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: we still have the "<" here, it seems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, corrected, after I merged with #601
skrub/datasets/_ken_embeddings.py
Outdated
@@ -99,6 +99,7 @@ def get_ken_types( | |||
search_result = unique_types.X[unique_types.X["Type"].str.contains(search)] | |||
if exclude is not None: | |||
search_result = search_result[~search_result["Type"].str.contains(exclude)] | |||
search_result["Type"] = search_result["Type"].str.replace("<", "").str.replace(">", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to do ".str[1:-1]": what if there was a "<" inside the string (maybe it's impossible?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are extracted from Wikipedia URL's, thus we have the "_" instead of spaces.
And having ">" or "<" symbols is apparently not possible:
https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid#1547940
And having ">" or "<" symbols is apparently not possible:
https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid#1547940
OK. I would still think that .str[1:-1] is faster than two string replaces (you can time it, if you want :) ).
|
I also have a doubt about the name of the |
I also have a doubt about the name of the types parameter of the get_ken_embeddings function. People may not understand that we are actually performing a search. I would rather rename it to types_search or search.
I agreed.
Also, the "get_ken" functions should all be renamed "fetch_ken..." as they perform downloads over internet. The convention is that whatever does a download over internet is called "fetch" (we should write this somewhere).
I guess it's now or never after the release :)
Yes. Let's do it! :)
|
There's a conflict :( Also: don't we need to do the renaming also in the api docs? |
LGTM. Merging |
Fix for #578