[FEA] Port CLX wordpiece tokenizer into cuDF #4981
Comments
Adding more information about the subword tokenizer and references. Typically, these types of tokenizers accept either a list of filenames or a list of strings/sentences. This version of our subword tokenizer is based on the … Currently, in the GPU subword tokenizer, we have mandatory parameters of … The tokenizer is currently called a wordpiece tokenizer, but we would like to rename it to be a …
We need to work in terms of columns and tables. And I think we should not load files in this tokenizer; that should be done with the file reader/writer APIs (cuIO). So I guess the input would be a column of strings.
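To make that concrete, here is a minimal sketch of what a column-based entry point could look like; the namespace, function name, parameters, and vocabulary type below are illustrative assumptions, not an agreed API:

```cpp
// Illustrative sketch only: names, parameters, and the vocabulary type are
// assumptions, not an agreed interface.
#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <cstdint>
#include <memory>

namespace nvtext {

struct hashed_vocabulary;  // parsed vocab/hash-table data, loaded separately

// Takes a strings column (one sentence per row); reading files from disk is
// left to cuIO or a separate loader rather than being done inside this call.
// The exact output shape is discussed further down the thread.
std::unique_ptr<cudf::column> subword_tokenize(
  cudf::strings_column_view const& sentences,
  hashed_vocabulary const& vocabulary,
  uint32_t max_sequence_length);

}  // namespace nvtext
```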
I agree with this. I'd rather use cuIO to do the file in/out. Column of strings is what we've been using.
What about the vocab/hash-table file that is loaded here?
How should this be handled? The file appears to be loaded as 3 vectors (hash-table, coefficients, offsets) and 6 scalars (unk-tok-id, first-tok-id, sep-tok-id, outer-table-a, outer-table-b, num-bins). cc @kkraus14 for his opinion on this as well.
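One way it could be handled, sketched under the assumption that the file is parsed once (by a small helper or via cuIO) and handed to the tokenizer as a single object; the element types below are guesses:

```cpp
// Sketch of a container for the parsed vocab/hash-table file; field names
// follow the description above, element types are assumptions.
#include <rmm/device_uvector.hpp>

#include <cstdint>

namespace nvtext {

struct hashed_vocabulary {
  rmm::device_uvector<uint64_t> hash_table;    // flattened hash table
  rmm::device_uvector<uint64_t> coefficients;  // per-bin hash coefficients
  rmm::device_uvector<uint16_t> offsets;       // per-bin offsets into hash_table
  uint32_t unknown_token_id;                   // unk-tok-id
  uint32_t first_token_id;                     // first-tok-id
  uint32_t separator_token_id;                 // sep-tok-id
  uint32_t outer_table_a;                      // outer hash parameter a
  uint32_t outer_table_b;                      // outer hash parameter b
  uint32_t num_bins;                           // number of hash bins
};

}  // namespace nvtext
```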
How are the 3 device memory buffers returned in the … freed? I see they are allocated here, but I don't see where they are freed.
The … So if these are all zero-copy conversions, then who is taking ownership and is calling … ?
It actually doesn't look like they are being freed. But maybe I'm missing something - @Iroy30 or @brhodes10?
I cannot see where it is being freed. Willing to take advice on how to fix that.
Because they're being explicitly allocated with … Is waiting for it to be ported into cuDF an option? Within cuDF we can make the initial allocations using …
Absolutely. We have a working version right now, so taking care of this at integration time is fine.
ACK. We can do C++ first and do Cython / Python as a follow-up unless you want to do it all at once for unit testing purposes. If you make them …
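Presumably the suggestion here is to use RMM's owning containers so that nothing needs an explicit free; a minimal sketch, assuming rmm::device_uvector and a hypothetical helper name:

```cpp
// Hypothetical helper illustrating RAII ownership through RMM: the container
// releases its device memory when it is destroyed (or after being moved into
// the result), so no explicit free call is needed anywhere.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>
#include <cstdint>

rmm::device_uvector<uint32_t> make_token_ids(std::size_t num_tokens,
                                             rmm::cuda_stream_view stream)
{
  rmm::device_uvector<uint32_t> token_ids(num_tokens, stream);
  // ... tokenization kernels would write into token_ids.data() here ...
  return token_ids;  // ownership transfers to the caller
}
```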
@BartleyR What is the expected output here? The current CLX code is returning Torch tensors (?)
Would it be OK if this returned 3 …?
It was returning Torch tensors because that is our current pipeline. After this, those tensors would be directly fed into a PyTorch model. However, @raykallen and I have been talking about making it more generic so it could be fed to a TF workflow as well. I'm fine with either from a pipeline perspective, but I'm not sure if one of them is more performant when going to a tensor as a next step. We don't envision there being a use case where you would not go directly to a tensor then into a BERT model for inference, but I can't say with 100% certainty. So I would prefer to err on the side of (1) efficiency and (2) generalizability.
OK, thanks. I think a …
I thought we already discussed that the input and output would become cudf::columns. Then the API would be similar to any libcudf API and the caller is responsible for deleting the results when finished.
@davidwendt Mark is right and I forgot we discussed this.
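A sketch of what a column-based result could look like; the struct and field names are illustrative only (the attention-mask and metadata columns are assumptions based on typical BERT inputs, not something decided above):

```cpp
// Illustrative output shape only; the names and the exact set of columns are
// assumptions, not the agreed interface.
#include <cudf/column/column.hpp>

#include <memory>

namespace nvtext {

struct tokenize_result {
  std::unique_ptr<cudf::column> token_ids;       // token ids, one row per token slot
  std::unique_ptr<cudf::column> attention_mask;  // 1 for real tokens, 0 for padding
  std::unique_ptr<cudf::column> metadata;        // maps output rows back to input strings
};

}  // namespace nvtext
```

As with other libcudf APIs, the caller would own the returned columns, and they are released automatically when the unique_ptrs are destroyed.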
Closed by #5511
Is your feature request related to a problem? Please describe.
The RAPIDS CLX repo has GPU-accelerated tokenizers that will be more widely usable if incorporated into libcudf/cuDF. We would like to port the CLX wordpiece tokenizer to libcudf.
Describe the solution you'd like
The new wordpiece tokenizer should be redesigned to use libcudf types (columns, tables, views) for input and output while maintaining or improving its current high performance.
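As a rough caller-side illustration of "libcudf types in, libcudf types out" (the nvtext declarations below are the hypothetical ones sketched in the comments above, not a settled design):

```cpp
// Hypothetical caller-side sketch; the nvtext declarations are illustrative,
// not a settled design.
#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>

namespace nvtext {
struct hashed_vocabulary;  // parsed vocab/hash-table data
std::unique_ptr<cudf::column> subword_tokenize(cudf::strings_column_view const& sentences,
                                               hashed_vocabulary const& vocabulary);
}  // namespace nvtext

// The caller works only with libcudf types: a strings column view in, a column
// out, owned by the caller through the returned unique_ptr.
std::unique_ptr<cudf::column> tokenize_for_bert(cudf::strings_column_view const& sentences,
                                                nvtext::hashed_vocabulary const& vocab)
{
  return nvtext::subword_tokenize(sentences, vocab);
}
```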
Since cuDF already has a (different type of) tokenizer, we should ensure consistency in tokenization APIs within the library.
Once it is ported to libcudf, we will need to write new Cython bindings and provide a Python interface.
@BartleyR took an action item to provide an easy-to-run benchmark of the current tokenizer to ensure we don't have regressions when we port.