[MRG] Hashing text vectorizer #1471
Conversation
|
The reuse of |
examples/document_clustering.py
By this you mean normalize each vector to have a norm of 1? Users coming from computational
linguistics backgrounds might find this phrasing hard to grasp.
'l2' instead of 'l1'. I will rephrase.
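For readers who want the concrete meaning of that phrasing: 'l2' normalization divides each document vector by its Euclidean norm, so every row ends up with length 1. A minimal sketch (not part of the PR), using NumPy and scikit-learn's `normalize` helper:

```python
import numpy as np
from sklearn.preprocessing import normalize

# A single document vector with raw term counts.
counts = np.array([[3.0, 0.0, 4.0]])

# 'l2' normalization divides by the Euclidean norm sqrt(3**2 + 4**2) = 5,
# so the resulting vector has unit length.
unit = normalize(counts, norm="l2")
print(unit)                   # [[0.6 0.  0.8]]
print(np.linalg.norm(unit))   # 1.0
```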
|
I'll try to finish the documentation tonight. |
|
Ok, I could not find the time to do it tonight as I reviewed other PRs / bugfixes instead. Will try to do it tomorrow. I think it's a simple yet very useful feature that should be part of the release though. |
|
@larsmans I did a rebase onto master to fix the conflicts. I hope you did not have any pending changes in your branch; I forgot to ask first. I am working on finishing the documentation. |
|
No pending changes! |
|
Done. I think this is ready for a final review and hopefully merging before the 0.13 release if all goes well. |
I would add that building the document-term matrix requires intermediate / temporary data structures. Also, an entire pass over the dataset is required to build the word-to-index mapping.
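To make the contrast with the stateless hashing approach concrete, a small sketch (not part of the PR) using the current scikit-learn API:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the quick brown fox", "jumped over the lazy dog"]

# CountVectorizer must see the whole corpus once to build the word -> column
# mapping (stored in vocabulary_) before it can produce the matrix.
cv = CountVectorizer()
X_counts = cv.fit_transform(docs)
print(len(cv.vocabulary_))        # size of the in-memory vocabulary

# HashingVectorizer is stateless: column indices come from hashing the tokens,
# so no vocabulary and no preliminary pass over the data are needed.
hv = HashingVectorizer(n_features=2 ** 20)
X_hashed = hv.transform(docs)     # works without fitting
```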
|
@ogrisel have you considered the option of using multiple probes as Mahout does? This basically means mapping a word not to one single feature id, but to several. This allows even better feature space compaction, since partial collisions are easily handled by downstream estimators. |
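For context, a purely illustrative sketch of what "multiple probing" could look like (this is not what the PR implements, which uses a single hash per token, and Mahout's exact scheme may differ):

```python
import hashlib

def multi_probe_ids(word, n_features=2 ** 18, n_probes=2):
    # Derive n_probes independent hashes of the same token by salting it
    # with the probe index, and map each one into the feature space.
    ids = []
    for seed in range(n_probes):
        digest = hashlib.md5(("%d:%s" % (seed, word)).encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % n_features)
    return ids

print(multi_probe_ids("vectorizer"))   # a list of two feature ids for one token
```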
|
@mblondel those are all good remarks, thanks. I will try to address them tonight. @paolo-losi yes, I would like to add support for that later. |
|
Maybe the document classification example could plot the scores obtained by both ways of vectorizing? But since that example doesn't get plotted in the docs, maybe it wouldn't be that useful. |
|
Or maybe rather, attach a classifier to the comparison example, and bar plot time (or MB/s rate?) and scores, maybe on 2 different axes? (I don't like how the document classification one plots different units on the same axis and truncates times longer than 1s.) |
The score will be the same if IDF is not enabled. Only the memory usage will differ (and possibly the speed too).
MB/s rate would be for transform + predict? It does not really mean anything during fit while the model has not converged. The wall clock training speed is more important in that case. BTW: I think this text classification example is getting too complex for new users to be able to read. I will try to think of another example for scalable / online / streaming learning on text data, probably in a separate PR. |
|
Oh sorry, I was thinking collisions happen. |
|
They do happen, but the effect is negligible. |
|
Actually I will add a MB/s rate for the feature extraction part as it's well separated from the training in this example. |
|
@vene I added the feature extraction streaming speed. Around 5MB/s on my laptop for the hashing vectorizer vs ~3MB/s for the other. |
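For reference, a minimal sketch (not the example's actual benchmark code; the corpus here is a placeholder) of how such a streaming extraction rate can be measured:

```python
import time
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["some raw text document to vectorize"] * 10000   # placeholder corpus
data_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6

for vec in (HashingVectorizer(n_features=2 ** 20), CountVectorizer()):
    start = time.time()
    vec.fit_transform(docs)       # fit is a no-op for the stateless hasher
    rate = data_mb / (time.time() - start)
    print("%s: %.1f MB/s" % (type(vec).__name__, rate))
```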
|
Fixed, thanks. |
|
I just did some copyediting to the docs, will push to your branch. As for the "no tf-idf" part: logarithmic scaling of frequencies is still possible, right? This could either be done by the user ( |
|
Indeed log scaling would be useful directly in the |
|
But an additional parameter would cause an invalid combination to arise: |
|
I would rather keep |
|
That sounds like a much cleaner solution as it brings the counts close to a Gaussian distribution with mean 0. Great idea. |
|
Will you do it? |
|
No hurry, this can be added in a later PR after the release if you don't have time now. |
|
Let's do it in a separate PR. |
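For the record, the user-side workaround discussed above is straightforward; a minimal sketch, assuming the present-day HashingVectorizer API (the alternate_sign parameter post-dates this PR):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# alternate_sign=False keeps the hashed values as non-negative counts and
# norm=None disables the default l2 normalization, so the raw counts survive.
vectorizer = HashingVectorizer(n_features=2 ** 20, alternate_sign=False, norm=None)
X = vectorizer.transform(docs)     # scipy.sparse matrix of hashed term counts

# Log-scale the non-zero entries in place: count -> log(1 + count).
X.data = np.log1p(X.data)
```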
|
Can't be merged currently :-/. Do you want to do the example before the release, or do you want to merge without the improvements? |
|
I will try to do a rebase + update the clustering example tonight.
Commits:
- Moving common text feature extraction methods in a new base class
- Hashing vectorizer work in progress
- …m alias for HashingVectorizer instead
- …ature extraction doc
|
I did the rebase, there are no longer any conflicts but GitHub is still gray for some reason. I won't have time to refactor the example tonight (and maybe not tomorrow either). So @amueller please feel free to merge this to master before the release (all tests pass on my box). I will do the clustering example rework in a separate PR. |
|
ok, merging now :) |
|
merged :) |
|
Thanks! |
New PR to wrap the FeatureHasher with the configurable text tokenization features, to be able to extract hashed features directly from a collection of raw text files / strings.

TODO:
- Update narrative documentation
- Update reference documentation
- Update classification example
- Update clustering example
- Update doc/whats_new.rst
- Add a new example for online text classification, e.g. sentiment analysis from the twitter feed using the streaming API? (twitter is a bad idea because of the oauth requirement, I'll probably write another example later; a rough sketch of the out-of-core approach follows below)
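On the last TODO item: because the hashing vectorizer is stateless, it combines naturally with estimators that support partial_fit for out-of-core learning. A rough sketch of that setup (not the example that was eventually written; the mini-batch stream and labels are placeholders):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 20)
classifier = SGDClassifier()
all_classes = [0, 1]                      # e.g. negative / positive sentiment

def stream_minibatches():
    # Placeholder generator: a real example would read raw text and labels
    # from disk or from a network stream, one mini-batch at a time.
    yield ["great movie", "terrible plot"], [1, 0]
    yield ["loved it", "waste of time"], [1, 0]

for texts, labels in stream_minibatches():
    X = vectorizer.transform(texts)       # stateless: no fit, constant memory
    classifier.partial_fit(X, labels, classes=all_classes)
```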