Skip to content

Add example with KEN Wikipedia embeddings #487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Feb 16, 2023

Conversation

jovan-stojanovic
Copy link
Member

Add example using embeddings from https://soda-inria.github.io/ken_embeddings/

Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here's a first pass on the code :)

@GaelVaroquaux
Copy link
Member

Regarding the memory errors, do we know exactly where they happen?

What is the dtype of the ken embeddings? float32, float64? I'm wondering if we could not have them with a less good precision: I suspect that it would lead to the same performance but lighter models.

)
emb_final = pd.concat([emb_final, emb_pca])
else:
emb_final = pd.concat([emb_final, emb_extracts])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very good at pandas, but my gut feeling is that growing a pandas dataframe like this might lead to a duplication of emb_final at each iteration.

I would try assembling the DataFrame at the end. To get a good understanding of the tradeoffs, it could be useful to do a small script to compare different ways of assembling the data in a DataFrame and profile it, including using memprofiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick experiment, and there is indeed a sizable benefit in terms both of memory and of time to use a list as an accumulator:

import pandas as pd
import numpy as np

def concat(n=30):
    df = pd.DataFrame()
    for i in range(n):
        this_df = pd.DataFrame(np.empty((int(1e5), 10)))
        df = pd.concat((df, this_df))


def accumulate(n=30):
    l = list()
    for i in range(n):
        this_df = pd.DataFrame(np.empty((int(1e5), 10)))
        l.append(this_df)
    df = pd.concat(l)


# Measure execution time
import time

for func in concat, accumulate:
    t0 = time.time()
    # Run the function many times
    (func() for _ in range(100))
    t1 = time.time()
    print(f"{func}: execution time: {t1 - t0}")


# Measure memory usage
from memory_profiler import memory_usage

for func in concat, accumulate:
    # Run the function many times
    mem_used = np.mean([np.max(memory_usage(func, )) for _ in range(3)])
    print(f"{func}: memory usage: {mem_used}")

Results in

In [1]: %run test_pandas.py
<function concat at 0x7f2e9d82df30>: execution time: 4.291534423828125e-06
<function accumulate at 0x7f2e9d82e200>: execution time: 2.1457672119140625e-06
<function concat at 0x7f2e9d82df30>: memory usage: 581.1041666666666
<function accumulate at 0x7f2e9d82e200>: memory usage: 383.70703125

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great!
I will look for more improvements of this type so that we may avoid the memory errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an idea of how far above the memory limit we are?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, still trying to find. I know the resources allocated by circleci are 2 CPUs and 4GB of RAM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figure_1

Ok, just did a memory profiling and it seems that we are around 1GB above the limit at one point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot from 2023-02-15 09-44-35
Of course, it's the first get_ken_embeddings function (when getting all games) that is the cause.
But interestingly, when repeated with line by line analysis we are somewhat under the limit.

@jovan-stojanovic
Copy link
Member Author

Regarding the memory errors, do we know exactly where they happen?

What is the dtype of the ken embeddings? float32, float64? I'm wondering if we could not have them with a less good precision: I suspect that it would lead to the same performance but lighter models.

The embeddings are float32, from what I saw, it is not possible do downcast the dtypes more to reduce memory usage.

@jovan-stojanovic
Copy link
Member Author

And we are good with memory! Taking a smaller subset of entity types and using pyarrow rather than pandas in fetch_figshare did the trick. Now just to verify if the rendering of the example is fine.

@GaelVaroquaux
Copy link
Member

Stupid comment: the figure visible in the summary of the example is not very exciting.

I think that there is a way with sphinx-gallery to tell it to take not the first figure but another one.

@GaelVaroquaux
Copy link
Member

The get_ken_embeddings needs to be added to the documented functions.

Parameters
----------
types : str
Types of embeddings to include. Write in lowercase.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to give a list of acceptable types, or a pointer to where this list can be found (eg by adding it on the ken embedding page)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will just leave for now a link to where the detailed types can be found, and then maybe we can improve this in a second iteration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, as long as the link points to a list that is easy to understand.

emb_id : str
Figshare ID of the file containing all embeddings.
emb_type_id : str
Figshare ID of the file containing the type of embeddings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment. Can it / should it, be the same list of names: when we give emb_id as a name, the emb_type_id number is automatically deduced (aka the table has three columns: the name and the two numbers)? In which case, this number becomes 'None' by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be possible, yes.

jovan-stojanovic and others added 2 commits February 15, 2023 15:34
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
jovan-stojanovic and others added 5 commits February 15, 2023 15:35
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Feb 15, 2023 via email

@@ -24,6 +24,8 @@ def get_ken_embeddings(
----------
types : str
Types of embeddings to include. Write in lowercase.
For the full list of accepted types, see the `entity_detailed_types`
table on https://soda-inria.github.io/ken_embeddings/.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, this is not simple enough for the user: it forces the download of a parquet file just to know the accepted types. We should run a "unique" on this file and display the result in a way that is readable in a browser.

Copy link
Contributor

@alexis-cvetkov alexis-cvetkov Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that being able to see all types in the browser would be great! Then the user can use Ctrl+F to find the types of interest. However there are a lot of types (dozens of thousands if I remember well, but I guess it can be filtered), so I don't know if it will load well in the browser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe we need to upload a table with two columns: type name and number of occurences to a google doc spreadsheet and share this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upper limit for the number of rows in Google spreadsheet is 40 000

Copy link
Member Author

@jovan-stojanovic jovan-stojanovic Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This types parameter was meant to facilitate the needs of the user as it already filters the types by the input string pattern.
So if you put "music" you will get all types with this string (e.g. "wikicat_musician_from_france", "wikicat_music_label" etc.)
Going directly for the exact type name (e.g. "wikicat_rock_music_bands") seems a bit unnecessary.

But I agree it might be useful to have the most common types. We have 114 509 types in yago3 2022. But I guess it would be best to take the top 40 000 types with their number of occurrences, put it into a google doc spreadsheet and share it. There are 84 entities in the 40 000th type, so I guess it's ok to leave out the other ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so it should be better described: that it is a substring and that it will be matched (the paragraph that you have above is actually quite useful.

I wonder if we should not provide a helper function to help people tune this substring: "get_ken_matching_types" that would return all the types that have the corresponding substring, and expose it to the users. I guess that we can do this in a second time, after merging this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue for the additional function: #495

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; merging. Thank you!

@GaelVaroquaux GaelVaroquaux merged commit 877859a into skrub-data:main Feb 16, 2023
@jovan-stojanovic jovan-stojanovic deleted the add_ken_example branch May 16, 2023 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants