
Stemming handlers (start with porter?) #60

Closed
raijyan opened this issue Mar 3, 2020 · 5 comments

raijyan commented Mar 3, 2020

Would be great if we could have stemmers added to the "other" section to reduce the dimensionality of NLP features: cutting out wasted memory/processing time from things like plurals, and generating stronger links for the TfIdf transformer. Stemming is usually applied after basic normalisation and stop-word removal.

Would imagine something like it becoming a 4th option of the WordCountVectorizer. Though for processing it'd make sense for it to kick in during the tokenize method, e.g. in NGram before it stitches the split word tokens back together.

Examples that'd be easy to drop in can be found at https://tartarus.org/martin/PorterStemmer/php.txt and https://github.com/angeloskath/php-nlp-tools/blob/master/src/NlpTools/Stemmers/PorterStemmer.php

^ tartarus.org/martin being the home of the author of the Porter algorithm.

If more adventurous, there's a bunch of multi-language examples at https://github.com/wamania/php-stemmer (could be added as a composer dependency?)
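
For illustration, a rough sketch of stemming during tokenization, using the wamania/php-stemmer API referenced above (StemmerFactory and stem() as exposed by that package; the word splitting here is a simplification, not NGram's actual logic):

use Wamania\Snowball\StemmerFactory;

// Stem each word token before any n-gram stitching happens.
$stemmer = StemmerFactory::create('english');

$words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

$tokens = array_map(function (string $word) use ($stemmer) {
    return $stemmer->stem($word);
}, $words);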

@raijyan raijyan added the enhancement New feature or request label Mar 3, 2020
andrewdalpino (Member) commented Mar 4, 2020

@raijyan I think this is a great idea and I've considered it before myself.

One of my concerns was with non-English use cases. I like the idea of a stemming tokenizer for the reason you've mentioned, but also because it wouldn't require another argument to the Word Count Vectorizer.

https://github.com/wamania/php-stemmer seems like it can be integrated into a tokenizer quite easily. We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.
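
As a rough sketch of that wrapper idea (the Tokenizer interface path is assumed to match the tokenizer namespace used later in this thread; the class name is purely illustrative):

use Rubix\ML\Other\Tokenizers\Tokenizer;

// Illustrative decorator: stems whatever tokens the wrapped tokenizer emits.
class StemmingTokenizer implements Tokenizer
{
    protected $base;

    protected $stemmer;

    public function __construct(Tokenizer $base, $stemmer)
    {
        $this->base = $base;
        $this->stemmer = $stemmer;
    }

    public function tokenize(string $string) : array
    {
        return array_map([$this->stemmer, 'stem'], $this->base->tokenize($string));
    }
}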

I am considering a 'Rubix ML Extras' repository and package that would include experimental features such as obscure transformers, neural network activation functions, and perhaps stemmers. The hope is that we have enough hardcore users who will install and experiment with these features before (and if) we include them in the main package.

We are currently in a 'feature freeze' until our first stable release (we just put out our first release candidate this week), which means we do not plan to add additional functionality until after then: only optimizations, bugfixes, and mayyyyyyybe a small feature. However, we are free to develop an 'Extras' package in the meantime.

I'd love to hear your thoughts

Do you or someone you know have proficiency with stemmers?

Thanks for the great recommendation and information!

simplechris (Contributor) commented:

Just lurkin' around, but yeah, I've ported all of the stemmers/tokenizers etc. (including PorterStemmer) from Lucene. I agree that it probably belongs in an 'extras' or other external package if you want tighter integration into Rubix.

raijyan (Author) commented Mar 5, 2020

Tested adding Wamania\Snowball to my dependencies - on my product descriptions dataset (4,000 products) it went from:

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Other\Tokenizers\NGram;

$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 1448.11M
Tokens: 3575
Array
(
    [0] => style
    [1] => size
    [2] => this
...

to:

use Wamania\Snowball\StemmerFactory;
$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1), StemmerFactory::create('english'));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 853.78M
Tokens: 2680
Array
(
    [0] => style
    [1] => size
    [2] => this
...

So there are some savings to be made, at least for my use cases. Should cut a few hours off my training times on a Jaccard-based model.
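
For anyone wondering where the vocabulary reduction comes from: the stemmer collapses inflected variants onto a single stem, e.g. (using the same StemmerFactory as above):

use Wamania\Snowball\StemmerFactory;

$stemmer = StemmerFactory::create('english');

// The classic Porter example: all five variants collapse to 'connect',
// occupying one vocabulary slot instead of five.
foreach (['connect', 'connected', 'connecting', 'connection', 'connections'] as $word) {
    echo $stemmer->stem($word), PHP_EOL; // connect
}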

An extras setup would be cool if you're pushing for a feature freeze. Would probably look at putting in a lemmatizer and a locality normaliser too then (darn variants of English).

Loving the library so far though; working through moving my existing production NLP over to it, then time for some experiments >:)

raijyan (Author) commented Mar 5, 2020

> We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.

Yeah, a wrapper might help; currently I've just tacked it in as:

public function tokenize(string $string, $stemmer = null) : array
...
    $nGram = $stemmer ? $stemmer->stem($word) : $word;
...

Not quite as clean as I'd like, but it has done the trick for getting it up and running. Getting some nice results using NGram over my old php-nlp/php-ai combo with single word tokens.

A slight change to the structure would be cool if it allowed for fuller use of the multi-dictionary setup you've made. Could then set the token configuration per defined dictionary from the column picker.
E.g. being able to configure that my tags/attributes are single word tokens, but my titles/descriptions are NGram(1, 3), when it iterates over them and builds the dictionaries used for vectors, would offer further performance improvements for... lazy... datasets.
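
Purely as a sketch of what that per-dictionary configuration could look like (nothing here is a real Rubix ML API; Word is one of the tokenizers mentioned above, its namespace assumed to match NGram's):

use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\Word;

// Hypothetical per-column tokenizer map, keyed by column offset.
$tokenizers = [
    0 => new Word(),       // tags/attributes: single word tokens
    1 => new NGram(1, 3),  // titles/descriptions: richer n-grams
];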

andrewdalpino (Member) commented:

@raijyan @simplechris

We went ahead and created an Extras package that can be installed (composer require rubix/extras) right now as dev-master.

Included is the Word Stemmer, which can be used alone or as the base tokenizer for either N-Gram or Skip Gram. Example below ...

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

$transformer = new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english')));

The changes to N-Gram and Skip Gram have not been released yet but you can install the latest dev-master to preview the features.

In addition, we've added the Delta TF-IDF Transformer, a supervised TF-IDF transformer that boosts term frequencies by how unique they are to a particular class, not just the entire corpus.
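
A hedged usage sketch (the class name is as announced above; its namespace is assumed to follow the same convention as WordCountVectorizer):

use Rubix\ML\Transformers\DeltaTfIdfTransformer; // namespace assumed

// Delta TF-IDF is supervised, so the dataset must be Labeled
// for the transformer to see the classes when fitting.
$dataset->apply(new DeltaTfIdfTransformer());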

Preliminary tests using the Sentiment example and the new Word Stemmer as the base tokenizer for N-Gram show no noticeable improvement in accuracy or training speed, however, your mileage may vary. Let me know how it works for you.

With that, we now have a standard way to introduce experimental features in Rubix ML. Feel free to suggest features or contribute to the development of the project if you are so willing.

@andrewdalpino andrewdalpino self-assigned this Mar 14, 2020
@andrewdalpino andrewdalpino added this to Backlog in Roadmap via automation Mar 14, 2020
@andrewdalpino andrewdalpino moved this from Backlog to In progress in Roadmap Mar 14, 2020
@andrewdalpino andrewdalpino moved this from In progress to Review in Roadmap Mar 14, 2020
@andrewdalpino andrewdalpino moved this from Review to Completed in Roadmap Mar 14, 2020