Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW]Hashing Vectorizer and general vectorizer improvements #2554

Merged
merged 17 commits into from
Jul 20, 2020

Conversation

VibhuJawa
Copy link
Member

@VibhuJawa VibhuJawa commented Jul 13, 2020

This pr adds Hashing Vectorizer and improves CountVectorizer

Count Vectorizer Improvements

All benchmarks on a Tesla T4 and dataset used is the one used in the blog

Improves memory pressures

2x improvement
PR: 4192 MB (1500 MB is dataset) 
Master: 9712 MB (1500 MB is dataset) 
Improves Run Time(3x improvement)
Master: 25.4 s 
Pr: 8.8 s
Sklearn: 1 Min

Hash Vectorizer: Benchmarks (9.5x speed up)

Half Dataset:
CUML: 4.37 s
Sklearn: 41.1 s


Full dataset:
7.83 s
1min 14s

Known Divergence (NOT INTRODUCED BY THIS PR): https://gist.github.com/VibhuJawa/5f8e2666da9ecbb63d0398277fc93ac5

Just linking the test/benchmarking notebooks here:

@GPUtester
Copy link
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer [WIP]Hashing Vectorizer[skip-ci] Jul 14, 2020
@dantegd dantegd added 2 - In Progress Currenty a work in progress Cython / Python Cython or Python issue New Algorithm For tracking new algorithms that will be added to our existing collection labels Jul 14, 2020
@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer[skip-ci] [REVIEW]Hashing Vectorizer Jul 14, 2020
@VibhuJawa
Copy link
Member Author

This should now be ready for a first review.

" is not supported in cuML,"
" please read the cuML documentation for"
" more information.")
class HashingVectorizer(_VectorizerMixin):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main Logic Starts here

@VibhuJawa VibhuJawa changed the title [REVIEW]Hashing Vectorizer [WIP]Hashing Vectorizer Jul 14, 2020
@@ -602,15 +633,248 @@ def get_feature_names(self):
"""
return self.vocabulary_

def _check_sklearn_params(self, analyzer, sklearn_params):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to base class _VectorizerMixin .

@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer [REVIEW]Hashing Vectorizer Jul 14, 2020
@VibhuJawa VibhuJawa changed the title [REVIEW]Hashing Vectorizer [REVIEW]Hashing Vectorizer and General Vectorizer improvements Jul 16, 2020
@VibhuJawa VibhuJawa changed the title [REVIEW]Hashing Vectorizer and General Vectorizer improvements [REVIEW]Hashing Vectorizer and general vectorizer improvements Jul 16, 2020
@VibhuJawa VibhuJawa marked this pull request as ready for review July 16, 2020 21:49
@VibhuJawa VibhuJawa requested a review from a team as a code owner July 16, 2020 21:49
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a quick first review I have mainly comments regarding docstrings and using the cuml logger, nothing major, will do another pass tomorrow.

python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
@dantegd dantegd removed the 2 - In Progress Currenty a work in progress label Jul 16, 2020
@dantegd dantegd added the 4 - Waiting on Author Waiting for author to respond to review label Jul 16, 2020
@@ -51,15 +49,23 @@ def _preprocess(doc, lower=False, remove_non_alphanumeric=False,
if lower:
doc = doc.str.lower()
if remove_non_alphanumeric:
non_alpha = _get_non_alphanumeric_characters(doc)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all non_alpha chars excluding _ char, ^\w does not extract _.

@VibhuJawa VibhuJawa changed the title [REVIEW]Hashing Vectorizer and general vectorizer improvements [WIP]Hashing Vectorizer and general vectorizer improvements Jul 17, 2020
@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer and general vectorizer improvements [WIP]Hashing Vectorizer and general vectorizer improvements[Skip-CI] Jul 17, 2020
@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer and general vectorizer improvements[Skip-CI] [WIP]Hashing Vectorizer and general vectorizer improvements[skip-ci] Jul 17, 2020
VibhuJawa and others added 2 commits July 17, 2020 10:51
@VibhuJawa VibhuJawa changed the title [WIP]Hashing Vectorizer and general vectorizer improvements[skip-ci] [REVIEW]Hashing Vectorizer and general vectorizer improvements Jul 17, 2020
@VibhuJawa
Copy link
Member Author

Ready for the second round of reviews.

@VibhuJawa
Copy link
Member Author

Just linking the test/benchmarking notebooks here:

Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 super small comments/question, but overall looks great (and worked well in my initial testings)! Awesome job @VibhuJawa

python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
python/cuml/feature_extraction/_vectorizers.py Outdated Show resolved Hide resolved
VibhuJawa and others added 4 commits July 20, 2020 11:17
@VibhuJawa
Copy link
Member Author

All reviews have been addressed, Should be good for a final review now I think.

@VibhuJawa
Copy link
Member Author

I just realized rapidsai/cudf#5658 is also in, will integrate that in a follow-up pr.

@dantegd dantegd merged commit f938ef4 into rapidsai:branch-0.15 Jul 20, 2020
@VibhuJawa
Copy link
Member Author

Closed last remaining bits of #1266 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on Author Waiting for author to respond to review Cython / Python Cython or Python issue New Algorithm For tracking new algorithms that will be added to our existing collection
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants