A native Go clean room implementation of the Porter Stemming algorithm.
Go
Permalink
Failed to load latest commit information.
LICENSE A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
README.md made some of the descrition in the README.md file more clear Jun 2, 2013
porterstemmer.go Merge pull request #10 from magiczhao/master Apr 12, 2016
porterstemmer_contains_vowel_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_fixes_test.go added test Apr 12, 2016
porterstemmer_fuzz_test.go added fuzz test to try many inputs to reproduce panics Oct 20, 2014
porterstemmer_has_repeat_double_consonant_suffix_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_has_suffix_test.go Renamed test file to add _test Sep 23, 2015
porterstemmer_is_consontant_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_measure_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_stem_string_test.go made it so this test can download the test files it needs Jun 2, 2013
porterstemmer_stem_without_lower_casing_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step1a_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step1b_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step1c_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step2_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step3_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step4_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step5a_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013
porterstemmer_step5b_test.go A native Go clean room implementation of the Porter Stemming Algorithm. Jun 2, 2013

README.md

Go Porter Stemmer

A native Go clean room implementation of the Porter Stemming Algorithm.

This algorithm is of interest to people doing Machine Learning or Natural Language Processing (NLP).

This is NOT a port. This is a native Go implementation from the human-readable description of the algorithm.

I've tried to make it (more) efficient by NOT internally using string's, but instead internally using []rune's and using the same (array) buffer used by the []rune slice (and sub-slices) at all steps of the algorithm.

For Porter Stemmer algorithm, see:

http://tartarus.org/martin/PorterStemmer/def.txt (URL #1)

http://tartarus.org/martin/PorterStemmer/ (URL #2)

Departures

Also, since when I initially implemented it, it failed the tests at...

http://tartarus.org/martin/PorterStemmer/voc.txt (URL #3)

http://tartarus.org/martin/PorterStemmer/output.txt (URL #4)

... after reading the human-readble text over and over again to try to figure out what the error I made was (and doing all sorts of things to debug it) I came to the conclusion that the some of these tests were wrong according to the human-readable description of the algorithm.

This led me to wonder if maybe other people's code that was passing these tests had rules that were not in the human-readable description. Which led me to look at the source code here...

http://tartarus.org/martin/PorterStemmer/c.txt (URL #5)

... When I looked there I noticed that there are some items marked as a "DEPARTURE", which differ from the original algorithm. (There are 2 of these.)

I implemented these departures, and the tests at URL #3 and URL #4 all passed.

Usage

To use this Golang library, use with something like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {

  word := "Waxes"

  stem := porterstemmer.StemString(word)

  fmt.Printf("The word [%s] has the stem [%s].\n", word, stem)
}

Alternatively, if you want to be a bit more efficient, use []rune slices instead, with code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {

  word := []rune("Waxes")

  stem := porterstemmer.Stem(word)

  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Although NOTE that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".

Also alternatively, if you already know that your word is already lowercase (and you don't need this library to lowercase your word for you) you can instead use code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {

  word := []rune("waxes")

  stem := porterstemmer.StemWithoutLowerCasing(word)

  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Again NOTE (like with the previous example) that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".