-
Notifications
You must be signed in to change notification settings - Fork 129
Add example with KEN Wikipedia embeddings #487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add example with KEN Wikipedia embeddings #487
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, here's a first pass on the code :)
Regarding the memory errors, do we know exactly where they happen? What is the dtype of the ken embeddings? float32, float64? I'm wondering if we could not have them with a less good precision: I suspect that it would lead to the same performance but lighter models. |
) | ||
emb_final = pd.concat([emb_final, emb_pca]) | ||
else: | ||
emb_final = pd.concat([emb_final, emb_extracts]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very good at pandas, but my gut feeling is that growing a pandas dataframe like this might lead to a duplication of emb_final
at each iteration.
I would try assembling the DataFrame at the end. To get a good understanding of the tradeoffs, it could be useful to do a small script to compare different ways of assembling the data in a DataFrame and profile it, including using memprofiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick experiment, and there is indeed a sizable benefit in terms both of memory and of time to use a list as an accumulator:
import pandas as pd
import numpy as np
def concat(n=30):
df = pd.DataFrame()
for i in range(n):
this_df = pd.DataFrame(np.empty((int(1e5), 10)))
df = pd.concat((df, this_df))
def accumulate(n=30):
l = list()
for i in range(n):
this_df = pd.DataFrame(np.empty((int(1e5), 10)))
l.append(this_df)
df = pd.concat(l)
# Measure execution time
import time
for func in concat, accumulate:
t0 = time.time()
# Run the function many times
(func() for _ in range(100))
t1 = time.time()
print(f"{func}: execution time: {t1 - t0}")
# Measure memory usage
from memory_profiler import memory_usage
for func in concat, accumulate:
# Run the function many times
mem_used = np.mean([np.max(memory_usage(func, )) for _ in range(3)])
print(f"{func}: memory usage: {mem_used}")
Results in
In [1]: %run test_pandas.py
<function concat at 0x7f2e9d82df30>: execution time: 4.291534423828125e-06
<function accumulate at 0x7f2e9d82e200>: execution time: 2.1457672119140625e-06
<function concat at 0x7f2e9d82df30>: memory usage: 581.1041666666666
<function accumulate at 0x7f2e9d82e200>: memory usage: 383.70703125
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this is great!
I will look for more improvements of this type so that we may avoid the memory errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an idea of how far above the memory limit we are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, still trying to find. I know the resources allocated by circleci are 2 CPUs and 4GB of RAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The embeddings are float32, from what I saw, it is not possible do downcast the dtypes more to reduce memory usage. |
And we are good with memory! Taking a smaller subset of entity types and using pyarrow rather than pandas in |
Stupid comment: the figure visible in the summary of the example is not very exciting. I think that there is a way with sphinx-gallery to tell it to take not the first figure but another one. |
The get_ken_embeddings needs to be added to the documented functions. |
Parameters | ||
---------- | ||
types : str | ||
Types of embeddings to include. Write in lowercase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to give a list of acceptable types, or a pointer to where this list can be found (eg by adding it on the ken embedding page)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will just leave for now a link to where the detailed types can be found, and then maybe we can improve this in a second iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, as long as the link points to a list that is easy to understand.
emb_id : str | ||
Figshare ID of the file containing all embeddings. | ||
emb_type_id : str | ||
Figshare ID of the file containing the type of embeddings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment. Can it / should it, be the same list of names: when we give emb_id as a name, the emb_type_id number is automatically deduced (aka the table has three columns: the name and the two numbers)? In which case, this number becomes 'None' by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be possible, yes.
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Is maybe the best solution is to add this correspondence table to https://github.com/dirty-cat/datasets?
Sounds good!
|
@@ -24,6 +24,8 @@ def get_ken_embeddings( | |||
---------- | |||
types : str | |||
Types of embeddings to include. Write in lowercase. | |||
For the full list of accepted types, see the `entity_detailed_types` | |||
table on https://soda-inria.github.io/ken_embeddings/. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, this is not simple enough for the user: it forces the download of a parquet file just to know the accepted types. We should run a "unique" on this file and display the result in a way that is readable in a browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that being able to see all types in the browser would be great! Then the user can use Ctrl+F to find the types of interest. However there are a lot of types (dozens of thousands if I remember well, but I guess it can be filtered), so I don't know if it will load well in the browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, maybe we need to upload a table with two columns: type name and number of occurences to a google doc spreadsheet and share this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upper limit for the number of rows in Google spreadsheet is 40 000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This types
parameter was meant to facilitate the needs of the user as it already filters the types by the input string pattern.
So if you put "music" you will get all types with this string (e.g. "wikicat_musician_from_france", "wikicat_music_label" etc.)
Going directly for the exact type name (e.g. "wikicat_rock_music_bands") seems a bit unnecessary.
But I agree it might be useful to have the most common types. We have 114 509 types in yago3 2022. But I guess it would be best to take the top 40 000 types with their number of occurrences, put it into a google doc spreadsheet and share it. There are 84 entities in the 40 000th type, so I guess it's ok to leave out the other ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so it should be better described: that it is a substring and that it will be matched (the paragraph that you have above is actually quite useful.
I wonder if we should not provide a helper function to help people tune this substring: "get_ken_matching_types" that would return all the types that have the corresponding substring, and expose it to the users. I guess that we can do this in a second time, after merging this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created an issue for the additional function: #495
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; merging. Thank you!
Add example using embeddings from https://soda-inria.github.io/ken_embeddings/