Skip to content
This repository has been archived by the owner on May 19, 2023. It is now read-only.

[REVIEW] Tokenizer with rmm integration #155

Merged
merged 71 commits into from
Jun 12, 2020
Merged

[REVIEW] Tokenizer with rmm integration #155

merged 71 commits into from
Jun 12, 2020

Conversation

brhodes10
Copy link
Contributor

@brhodes10 brhodes10 commented May 15, 2020

  • Updated tokenizer to use rmm
  • Added tokenizer benchmark and unit tests

Bianca Rhodes added 30 commits April 28, 2020 21:35
brhodes10 and others added 5 commits May 27, 2020 16:18
update to use thrust exclusive scan

Co-authored-by: Mark Harris <mharris@nvidia.com>
update to use thrust exclusive scan

Co-authored-by: Mark Harris <mharris@nvidia.com>
@brhodes10
Copy link
Contributor Author

I just reviewed some of BasicTokenizer for now. Suggest making those changes and corresponding changes to the other tokenizers, and then I can re-review. This should simplify the changes a little.

Hi @harrism I have made changes based on your comments. I saw areas where I could apply your suggestions to the fullTokenizer as well. Let me know if you have any more comments. Thank you

@brhodes10 brhodes10 closed this May 28, 2020
@brhodes10 brhodes10 reopened this May 28, 2020
@brhodes10 brhodes10 changed the base branch from branch-0.14 to branch-0.15 June 1, 2020 14:13
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
@harrism
Copy link
Member

harrism commented Jun 10, 2020

I think something is wrong here? 124000 additions and deletions? Those files can't be that long, can they?

image

@BartleyR
Copy link
Member

I think something is wrong here? 124000 additions and deletions? Those files can't be that long, can they?

I believe there are a few vectors in those files that we use as input to the basic tokenizer. They are large, and that's what is likely causing the large diffs.

@brhodes10
Copy link
Contributor Author

rerun tests

2 similar comments
@efajardo-nv
Copy link
Contributor

rerun tests

@brhodes10
Copy link
Contributor Author

rerun tests

Co-authored-by: Mark Harris <mharris@nvidia.com>
@efajardo-nv
Copy link
Contributor

rerun tests

@brhodes10 brhodes10 merged commit d72b813 into rapidsai:branch-0.15 Jun 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants