[MRG] Hashing text vectorizer #1471
Conversation
|
The reuse of |
examples/document_clustering.py
By this you mean normalize each vector to have a norm of 1? Users coming from computational
linguistics backgrounds might find this phrasing hard to grasp.
'l2' instead of 'l1'. I will rephrase.
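For readers who want the concrete meaning of that phrasing: 'l2' normalization divides each document vector by its Euclidean norm, so every row ends up with length 1. A minimal sketch (not part of the PR), using NumPy and scikit-learn's `normalize` helper:

```python
import numpy as np
from sklearn.preprocessing import normalize

# A single document vector with raw term counts.
counts = np.array([[3.0, 0.0, 4.0]])

# 'l2' normalization divides by the Euclidean norm sqrt(3**2 + 4**2) = 5,
# so the resulting vector has unit length.
unit = normalize(counts, norm="l2")
print(unit)                   # [[0.6 0.  0.8]]
print(np.linalg.norm(unit))   # 1.0
```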
|
I'll try to finish the documentation tonight. |
|
Ok, I could not find the time to do it tonight as I reviewed other PRs / bugfixes instead. Will try to do it tomorrow. I think it's a simple yet very useful feature that should be part of the release though. |
|
@larsmans I did a rebase onto master to fix the conflicts. I hope you did not have any pending changes in your branch; I forgot to ask first. I am working on finishing the documentation. |
|
No pending changes! |
|
Done. I think this is ready for a final review and hopefully merging before the 0.13 release if all goes well. |
I would add that building the document-term matrix requires intermediate / temporary data structures. Also, an entire pass over the dataset is required to build the word-to-index mapping.
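To make the contrast with the stateless hashing approach concrete, a small sketch (not part of the PR) using the current scikit-learn API:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the quick brown fox", "jumped over the lazy dog"]

# CountVectorizer must see the whole corpus once to build the word -> column
# mapping (stored in vocabulary_) before it can produce the matrix.
cv = CountVectorizer()
X_counts = cv.fit_transform(docs)
print(len(cv.vocabulary_))        # size of the in-memory vocabulary

# HashingVectorizer is stateless: column indices come from hashing the tokens,
# so no vocabulary and no preliminary pass over the data are needed.
hv = HashingVectorizer(n_features=2 ** 20)
X_hashed = hv.transform(docs)     # works without fitting
```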
|
@ogrisel have you considered the option of using multiple probes as Mahout does? This basically means mapping a word not to one single feature id, but to several. This allows even better feature space compaction, since partial collisions are easily handled by downstream estimators. |
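For context, a purely illustrative sketch of what "multiple probing" could look like (this is not what the PR implements, which uses a single hash per token, and Mahout's exact scheme may differ):

```python
import hashlib

def multi_probe_ids(word, n_features=2 ** 18, n_probes=2):
    # Derive n_probes independent hashes of the same token by salting it
    # with the probe index, and map each one into the feature space.
    ids = []
    for seed in range(n_probes):
        digest = hashlib.md5(("%d:%s" % (seed, word)).encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % n_features)
    return ids

print(multi_probe_ids("vectorizer"))   # a list of two feature ids for one token
```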
|
@mblondel those are all good remarks, thanks. I will try to address them tonight. @paolo-losi yes, I would like to add support for that later. |
|
Maybe the document classification example could plot the scores obtained by both ways of vectorizing? But since that example doesn't get plotted in the docs, maybe it wouldn't be that useful. |
|
Or maybe rather, attach a classifier to the comparison example, and bar plot time (or MB/s rate?) and scores, maybe on 2 different axes? (I don't like how the document classification one plots different units on the same axis and truncates times longer than 1s.) |
The score will be the same if IDF is not enabled. Only the memory usage will differ (and possibly the speed too).
MB/s rate would be for transform + predict? It does not really mean anything during fit while the model has not converged. The wall clock training speed is more important in that case. BTW: I think this text classification example is getting too complex for new users to be able to read. I will try to think of another example for scalable / online / streaming learning on text data, probably in a separate PR. |
|
Oh sorry, I was thinking collisions happen. |
|
They do happen, but the effect is negligible. |
|
Actually I will add a MB/s rate for the feature extraction part as it's well separated from the training in this example. |
|
@vene I added the feature extraction streaming speed. Around 5MB/s on my laptop for the hashing vectorizer vs ~3MB/s for the other. |
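For reference, a minimal sketch (not the example's actual benchmark code; the corpus here is a placeholder) of how such a streaming extraction rate can be measured:

```python
import time
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["some raw text document to vectorize"] * 10000   # placeholder corpus
data_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6

for vec in (HashingVectorizer(n_features=2 ** 20), CountVectorizer()):
    start = time.time()
    vec.fit_transform(docs)       # fit is a no-op for the stateless hasher
    rate = data_mb / (time.time() - start)
    print("%s: %.1f MB/s" % (type(vec).__name__, rate))
```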
|
Fixed, thanks. |
|
I just did some copyediting to the docs, will push to your branch. As for the "no tf-idf" part: logarithmic scaling of frequencies is still possible, right? This could either be done by the user ( |
|
Indeed log scaling would be useful directly in the |
|
But an additional parameter would cause an invalid combination to arise: |
|
I would rather keep |
|
That sounds like a much cleaner solution as it brings the counts close to a Gaussian distribution with mean 0. Great idea. |
|
Will you do it? |
|
No hurry, this can be added in a later PR after the release if you don't have time now. |
|
Let's do it in a separate PR. |
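For the record, the user-side workaround discussed above is straightforward; a minimal sketch, assuming the present-day HashingVectorizer API (the alternate_sign parameter post-dates this PR):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# alternate_sign=False keeps the hashed values as non-negative counts and
# norm=None disables the default l2 normalization, so the raw counts survive.
vectorizer = HashingVectorizer(n_features=2 ** 20, alternate_sign=False, norm=None)
X = vectorizer.transform(docs)     # scipy.sparse matrix of hashed term counts

# Log-scale the non-zero entries in place: count -> log(1 + count).
X.data = np.log1p(X.data)
```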
|
Can't be merged currently :-/. Do you want to do the example before the release, or do you want to merge without the improvements? |
|
I will try to do a rebase + update the clustering example tonight.
Commits:
- Moving common text feature extraction methods in a new base class
- Hashing vectorizer work in progress
- …m alias for HashingVectorizer instead
- …ature extraction doc
|
I did the rebase, there are no longer any conflicts but GitHub is still gray for some reason. I won't have time to refactor the example tonight (and maybe not tomorrow either). So @amueller please feel free to merge this to master before the release (all tests pass on my box). I will do the clustering example rework in a separate PR. |
|
ok, merging now :) |
|
merged :) |
|
Thanks! |
New PR to wrap the FeatureHasher with the configurable text tokenization features, to be able to extract hashed features directly from a collection of raw text files / strings.

TODO:
- Update narrative documentation
- Update reference documentation
- Update classification example
- Update clustering example
- Update doc/whats_new.rst
- Add a new example for online text classification, e.g. sentiment analysis from the twitter feed using the streaming API? (twitter is a bad idea because of the oauth requirement, I'll probably write another example later; a rough sketch of the out-of-core approach follows below)
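On the last TODO item: because the hashing vectorizer is stateless, it combines naturally with estimators that support partial_fit for out-of-core learning. A rough sketch of that setup (not the example that was eventually written; the mini-batch stream and labels are placeholders):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 20)
classifier = SGDClassifier()
all_classes = [0, 1]                      # e.g. negative / positive sentiment

def stream_minibatches():
    # Placeholder generator: a real example would read raw text and labels
    # from disk or from a network stream, one mini-batch at a time.
    yield ["great movie", "terrible plot"], [1, 0]
    yield ["loved it", "waste of time"], [1, 0]

for texts, labels in stream_minibatches():
    X = vectorizer.transform(texts)       # stateless: no fit, constant memory
    classifier.partial_fit(X, labels, classes=all_classes)
```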