
Conversation

@ogrisel
Member

@ogrisel ogrisel commented Dec 14, 2012

New PR that wraps the FeatureHasher with the configurable text tokenization features, so that hashed features can be extracted directly from a collection of raw text files / strings.

TODO:

  • Update narrative documentation
  • Update reference documentation
  • Update classification example
  • Update clustering example
  • Update doc/whats_new.rst
  • Add a new example for online text classification, e.g. sentiment analysis from the twitter feed using the streaming API? (twitter is a bad idea because of the oauth requirement, I'll probably write another example later)
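For context, the resulting HashingVectorizer is stateless: it hashes tokens straight into a fixed-width sparse matrix, with no vocabulary to fit. A minimal sketch of the intended usage (API names per scikit-learn):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "the quick brown fox",
    "jumped over the lazy dog",
]

# transform() works without any prior fit: each token is hashed to a
# column index modulo n_features, and each row is l2-normalized.
vectorizer = HashingVectorizer(n_features=2 ** 10, norm="l2")
X = vectorizer.transform(docs)
print(X.shape)  # (2, 1024)
```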

@mblondel
Member

The reuse of FeatureHasher is nice!

Member

By this you mean normalize each vector to have a norm of 1? Users coming from computational
linguistics backgrounds might find this phrasing hard to grasp.

Member Author

'l2' instead of 'l1'. I will rephrase.
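To spell out the point for readers less used to the 'l1' / 'l2' jargon, the difference in plain NumPy (my own illustration, not code from the PR):

```python
import numpy as np

x = np.array([3.0, 4.0])          # raw term counts for one document

x_l2 = x / np.linalg.norm(x)      # 'l2': unit Euclidean norm -> [0.6, 0.8]
x_l1 = x / np.abs(x).sum()        # 'l1': absolute values sum to 1 -> [3/7, 4/7]

assert np.isclose(np.linalg.norm(x_l2), 1.0)
assert np.isclose(np.abs(x_l1).sum(), 1.0)
```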

@ghost ghost assigned ogrisel Dec 16, 2012
@ogrisel
Member Author

ogrisel commented Jan 7, 2013

I'll try to finish the documentation tonight.

@ogrisel
Member Author

ogrisel commented Jan 8, 2013

Ok, I could not find the time to do it tonight as I reviewed other PRs / bugfixes instead. Will try to do it tomorrow. I think it's a simple yet very useful feature that should be part of the release though.

@ogrisel
Member Author

ogrisel commented Jan 8, 2013

@larsmans I did a rebase onto master to fix the conflicts. I hope you did not have any pending changes in your branch; I forgot to ask first. I am working on finishing the documentation.

@larsmans
Member

larsmans commented Jan 8, 2013

No pending changes!

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

Done. I think this is ready for a final review and, if all goes well, merging before the 0.13 release.

Member

I would add that building the document-term matrix requires intermediate / temporary data structures.

And also an entire pass over the dataset is required to know the word-index mapping.
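To make the contrast concrete, a small sketch (my own, not from the PR) of why the stateful vectorizer needs that extra pass while the hashing one does not:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["hash the tokens", "store the vocabulary"]

# CountVectorizer must scan the whole corpus once to build (and keep in
# memory) the word -> column-index mapping.
cv = CountVectorizer().fit(docs)
print(sorted(cv.vocabulary_))
# ['hash', 'store', 'the', 'tokens', 'vocabulary']

# HashingVectorizer stores no vocabulary at all: the column index is
# hash(token) % n_features, so documents can be streamed in one by one.
hv = HashingVectorizer(n_features=2 ** 8)
print(hv.transform(docs).shape)  # (2, 256)
```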

@paolo-losi
Member

@ogrisel have you considered the option of using multiple probes as mahout does?

This basically means mapping a word not to one single feature id, but to n_probes feature ids, by using n_probes different hashing functions (e.g. adding a different constant prefix to the word).

This allows even better feature space compaction since partial collisions are easily handled by downstream
classifiers...
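For readers unfamiliar with the trick: multi-probe hashing maps each token to n_probes column indices instead of one. A toy sketch of the idea (a hypothetical helper of my own, not scikit-learn or mahout API):

```python
import hashlib

def probe_indices(token, n_probes=2, n_features=2 ** 20):
    """Map one token to n_probes column indices by salting the hash
    with a per-probe prefix, as described above."""
    return [
        int(hashlib.md5(f"{probe}_{token}".encode()).hexdigest(), 16) % n_features
        for probe in range(n_probes)
    ]

# Two tokens rarely collide on all probes at once, so a downstream
# linear model can still tell them apart via the non-colliding probes.
print(probe_indices("hello", n_probes=2, n_features=16))
```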

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

@mblondel those are all good remarks, thanks. I will try to address them tonight.

@paolo-losi yes, I would like to add n_probes and feature cross products (possibly over a small moving window) as mahout does, but this requires changes in the FeatureHasher and / or its cython helper. I think we can postpone that to a later PR. Please feel free to pitch in if you have the need for them and spare time at hand :)

@vene
Member

vene commented Jan 9, 2013

Maybe the document classification example could plot the scores obtained by both ways of vectorizing? But since that example doesn't get plotted in the docs, maybe it wouldn't be that useful.

@vene
Member

vene commented Jan 9, 2013

Or maybe rather, attach a classifier to the comparison example, and bar plot time (or MB/s rate?) and scores, maybe on 2 different axes? (I don't like how the document classification one plots different units on the same axis and truncates times more than 1s)

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

Maybe the document classification example could plot the scores obtained by both ways of vectorizing?

The score will be the same if IDF is not enabled. Only the memory usage will differ (and possibly the speed too).

Or maybe rather, attach a classifier to the comparison example, and bar plot time (or MB/s rate?)

MB/s rate would be for transform + predict? It does not really mean anything during fit while the model has not converged. The wall clock training speed is more important in that case.

BTW: I think this text classification example is getting too complex for new users to follow. I will try to think of another example for scalable / online / streaming learning on text data, but probably in a separate PR.

@vene
Member

vene commented Jan 9, 2013

Oh sorry, I was thinking collisions happen.

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

They do happen but it's negligible.
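A back-of-envelope estimate (mine, not from the PR) of how often collisions occur with the default hash-space size — and why a linear model barely notices, since a collision only merges two usually rare features:

```python
# Assuming tokens hash uniformly: with v distinct tokens and m buckets,
# a given token shares its bucket with at least one other token with
# probability 1 - (1 - 1/m) ** (v - 1).
v = 100_000   # distinct tokens in a medium-sized corpus (assumed)
m = 2 ** 20   # FeatureHasher's default n_features
p_collide = 1 - (1 - 1 / m) ** (v - 1)
print(f"{p_collide:.1%}")  # roughly 9% of tokens share a bucket
```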

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

Actually I will add a MB/s rate for the feature extraction part as it's well separated from the training in this example.

@ogrisel
Member Author

ogrisel commented Jan 9, 2013

@mblondel I think I addressed your comments.

@larsmans any more review from your side?

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

@vene I added the feature extraction streaming speed. Around 5MB/s on my laptop for the hashing vectorizer vs ~3MB/s for the other.

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

Fixed, thanks.

@larsmans
Member

I just did some copyediting to the docs, will push to your branch.

As for the "no tf-idf" part: logarithmic scaling of frequencies is still possible, right? This could either be done by the user (HashingVectorizer -> TfidfTransformer pipeline) or in the hashing vectorizer itself. Since it's both extremely simple and tied to the non_negative parameter, we might want to do it in the hasher.
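The user-side variant can already be written as a pipeline. A sketch (note: in current scikit-learn the non-negativity switch is called alternate_sign=False rather than the non_negative parameter discussed here, so treat the exact names as version-dependent):

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

docs = ["one fish two fish", "red fish blue fish"]

# alternate_sign=False keeps hashed counts non-negative so that
# TfidfTransformer(sublinear_tf=True) can apply its 1 + log(tf) scaling.
pipe = make_pipeline(
    HashingVectorizer(n_features=2 ** 10, alternate_sign=False, norm=None),
    TfidfTransformer(sublinear_tf=True),
)
X = pipe.fit_transform(docs)
print(X.shape)  # (2, 1024)
```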

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

Indeed log scaling would be useful directly in the FeatureHasher and exposed as an additional constructor param in HashingVectorizer as it's stateless (data independent, no need to fit).

@larsmans
Member

But an additional parameter would cause an invalid combination to arise: (sublinear_tf=True, non_negative=False). I'd rather merge these into one param with three possible values. Any idea on how to call that? Maybe just non_negative with values False, True, "log"?

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

I would rather keep non_negative as a boolean and have sublinear_tf able to deal with negative values by taking sign(x) * log(abs(x)) instead of log(x).
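A NumPy sketch of that signed variant, applied to the nonzero entries only. I add the "1 +" offset that TfidfTransformer's sublinear_tf uses so that counts of 1 do not map to zero; that offset is my own assumption, not part of the comment above:

```python
import numpy as np

def signed_sublinear(x):
    """sign(x) * (1 + log(|x|)) on nonzero entries, zero elsewhere."""
    out = np.zeros_like(x, dtype=float)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * (1.0 + np.log(np.abs(x[nz])))
    return out

# Symmetric around zero, so sign information from the hasher survives.
print(signed_sublinear(np.array([-3.0, 0.0, 1.0, 10.0])))
```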

@larsmans
Member

That sounds like a much cleaner solution as it brings the counts close to a Gaussian distribution with mean 0. Great idea.

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

Will you do it?

@ogrisel
Member Author

ogrisel commented Jan 10, 2013

No hurry, this can be added in a later PR after the release if you don't have time now.

@larsmans
Member

Let's do it in a separate PR.

@amueller
Member

Can't be merged currently :-/. Do you want to do the example before the release, or do you want to merge without the improvements?

@ogrisel
Member Author

ogrisel commented Jan 19, 2013

I will try to do a rebase + update the clustering example tonight.

@ogrisel
Member Author

ogrisel commented Jan 20, 2013

I did the rebase, there are no longer any conflicts, but GitHub still shows the merge status as gray for some reason. I won't have time to refactor the example tonight (and maybe not tomorrow either). So @amueller please feel free to merge this to master before the release (all tests pass on my box). I will do the clustering example rework in a separate PR.

@amueller
Member

ok, merging now :)

@amueller
Member

merged :)

@amueller amueller closed this Jan 20, 2013
@ogrisel
Member Author

ogrisel commented Jan 21, 2013

Thanks!

