New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

K-skip ngrams for quadratic support in HashedDoc classes [WIP] #1342

Merged
merged 1 commit into from Aug 22, 2013

Conversation

Projects
None yet
4 participants
@van51
Contributor

van51 commented Jul 31, 2013

Still a WIP, since it remains(at least) to add this support in the conversion of the Converter and as well to remove the first commit. @sonney2k: have a look when you have some time

if ( n+s >= hashes->get_num_elements())
break;
uint32_t ngram_hash = hashes->get_element(0);

This comment has been minimized.

@rostyboost

rostyboost Aug 1, 2013

@van51 Are you sure of this 0 index? Shouldn't it be s instead?

@rostyboost

rostyboost Aug 1, 2013

@van51 Are you sure of this 0 index? Shouldn't it be s instead?

This comment has been minimized.

@van51

van51 Aug 1, 2013

Contributor

It's 0 because in this method I make all the n-combinations (and for every ngrams >= n > 1) that start with the first token.

@van51

van51 Aug 1, 2013

Contributor

It's 0 because in this method I make all the n-combinations (and for every ngrams >= n > 1) that start with the first token.

@rostyboost

View changes

Show outdated Hide outdated src/shogun/converter/HashedDocConverter.cpp
@rostyboost

View changes

Show outdated Hide outdated src/shogun/features/HashedDocDotFeatures.cpp
@@ -33,10 +33,12 @@ class CHashedDocDotFeatures: public CDotFeatures
* @param docs the document collection
* @param tzer the tokenizer to use on the documents
* @param normalize whether or not to normalize the result of the dot products
* @param n_grams max number of consecutive tokens to hash together (extra features)
* @param skips max number of tokens to skip when combining tokens

This comment has been minimized.

@rostyboost

rostyboost Aug 1, 2013

@van51 please give more details about how the skipping work

@rostyboost

rostyboost Aug 1, 2013

@van51 please give more details about how the skipping work

This comment has been minimized.

@van51

van51 Aug 1, 2013

Contributor

The idea behind the skipping is that you when you combine multiple tokens, you allow it to skip some and consider some others further ahead.
Eg. if you have the tokens ["a", "b", "c", "d",] and you specify n_grams = 2 and skips = 2, you will get :
["a", "ab", "ac" (skipped 1), "ad" (skipped 2), "b", "bc", "bd" (skipped 1), "c", "cd", "d"].

Maybe this example or something more elaborate should be added in the constructor definition.

@van51

van51 Aug 1, 2013

Contributor

The idea behind the skipping is that you when you combine multiple tokens, you allow it to skip some and consider some others further ahead.
Eg. if you have the tokens ["a", "b", "c", "d",] and you specify n_grams = 2 and skips = 2, you will get :
["a", "ab", "ac" (skipped 1), "ad" (skipped 2), "b", "bc", "bd" (skipped 1), "c", "cd", "d"].

Maybe this example or something more elaborate should be added in the constructor definition.

This comment has been minimized.

@rostyboost

rostyboost Aug 5, 2013

@van51 thanks. yes that would be great to add this little example somewhere in the documentation

@rostyboost

rostyboost Aug 5, 2013

@van51 thanks. yes that would be great to add this little example somewhere in the documentation

uint32_t ngram_hash = hashes->get_element(0);
for (index_t i=1+s; i<=n+s; i++)
ngram_hash = ngram_hash ^ hashes->get_element(i);

This comment has been minimized.

@van51

van51 Aug 1, 2013

Contributor

Is it sufficient here to xor the hashed values or will consecutive xor's somehow diminish the randomness or limit the outcomes?

@van51

van51 Aug 1, 2013

Contributor

Is it sufficient here to xor the hashed values or will consecutive xor's somehow diminish the randomness or limit the outcomes?

This comment has been minimized.

@rostyboost

rostyboost Aug 5, 2013

@van51 from my experience, it's completely ok, as long as a base hasher is very good (MurMur3).

@rostyboost

rostyboost Aug 5, 2013

@van51 from my experience, it's completely ok, as long as a base hasher is very good (MurMur3).

@sonney2k

View changes

Show outdated Hide outdated src/shogun/converter/HashedDocConverter.cpp
@van51

This comment has been minimized.

Show comment
Hide comment
@van51

van51 Aug 7, 2013

Contributor

Not good to merge yet. Found it has a bug when used with multiple threads (specifically in the apply methods of some classifiers). I'm in the process of trying to find out what it is

Contributor

van51 commented Aug 7, 2013

Not good to merge yet. Found it has a bug when used with multiple threads (specifically in the apply methods of some classifiers). I'm in the process of trying to find out what it is

@van51

View changes

Show outdated Hide outdated src/shogun/features/HashedDocDotFeatures.cpp
@van51

This comment has been minimized.

Show comment
Hide comment
@van51

van51 Aug 8, 2013

Contributor

Changed the way I was creating the combined ngrams, by passing an already created CDynamicArray (to hold the combination indices) as an argument to the function generate_ngrams(), instead of creating a new CDynamicArray each time in the method. It's not a pretty function call now, but it seems to solve the problem with mulithreading, at least in a minimal example. I plan on running the language detection demo tonight to make sure.

Contributor

van51 commented Aug 8, 2013

Changed the way I was creating the combined ngrams, by passing an already created CDynamicArray (to hold the combination indices) as an argument to the function generate_ngrams(), instead of creating a new CDynamicArray each time in the method. It's not a pretty function call now, but it seems to solve the problem with mulithreading, at least in a minimal example. I plan on running the language detection demo tonight to make sure.

*
* @param hashes the hashes of the tokens to combine as k-skip ngrams
* @param ngram_hashes the dynamic array in which to store the created indices
* @param num_bits the dimension in which to limit the hashed indices (means a dimension of size 2^num_bits)

This comment has been minimized.

@votjakovr

votjakovr Aug 17, 2013

Contributor

Doxygen shouldn't generate proper documentation for this method, since description and declaration are different: "num_bits" and "dim". Also, please check doxygen warnings, there are some of them, that you can fix :)

@votjakovr

votjakovr Aug 17, 2013

Contributor

Doxygen shouldn't generate proper documentation for this method, since description and declaration are different: "num_bits" and "dim". Also, please check doxygen warnings, there are some of them, that you can fix :)

This comment has been minimized.

@van51

van51 Aug 20, 2013

Contributor

Thanks for pointing that out. I will make sure they are fixed :)

@van51

van51 Aug 20, 2013

Contributor

Thanks for pointing that out. I will make sure they are fixed :)

sonney2k pushed a commit that referenced this pull request Aug 22, 2013

Soeren Sonnenburg
Merge pull request #1342 from van51/feature/quadratic
K-skip ngrams for quadratic support in HashedDoc classes

@sonney2k sonney2k merged commit 30caf36 into shogun-toolbox:develop Aug 22, 2013

1 check failed

default The Travis CI build could not complete due to an error
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment