Spanish support? #40

Closed
demian85 opened this issue Oct 26, 2016 · 13 comments

@demian85

There is no mention whatsoever of language support.
The Schinke stemmer is supposed to be for Latin, but it doesn't work as expected.
Thanks.

@Yomguithereal
Owner

Yomguithereal commented Oct 26, 2016

Hello @demian85. I guess you mean to ask if the library has a stemmer for the Spanish language. Unfortunately it does not have one yet. Using Schinke stemmer on Spanish text will indeed produce only garbage since the algorithm is targeting Latin.

However, I am currently working on Talisman, a much broader library than this one (written in JavaScript, not Clojure), and can probably implement some kind of Spanish stemmer there soon (one of those used by Lucene, I think). Tell me if this would suit your use case.

The stemmers I found for Spanish are the Martin Porter one in Snowball & the UniNe one used by Lucene.

@demian85
Author

Turns out that what I'm looking for is an inflector; I just want a way to normalize a string. More specifically, I need to singularize nouns in Spanish.

@Yomguithereal
Owner

Ok. The UniNe stemmer might be of some use to you then. It performs really simple stemming and will probably drop most plural forms (it won't inflect them in a grammatically correct way, though).

Here is how it works:

  • Deburr the string (strip the accents)
  • If the string is less than 5 characters long, leave it unchanged
  • Otherwise, drop a final o, a or e
  • Handle a final s as follows:
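// s holds the word's characters and len its length; the returned value is the new length.
if (s[len-2] == 'e' && s[len-3] == 's' && s[len-4] == 'e')
  return len - 2;  // -eses: drop the final "es" (meses -> mes)
if (s[len-2] == 'e' && s[len-3] == 'c') {
  s[len-3] = 'z';  // -ces: replace with "z" (luces -> luz)
  return len - 2;
}
if (s[len-2] == 'o' || s[len-2] == 'a' || s[len-2] == 'e')
  return len - 2;  // -os, -as, -es: drop the last two characters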
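
For readers who want to try the steps above in JavaScript, here is a minimal sketch of them. The names `deburr` and `lightSpanishStem` are hypothetical and are not Talisman's API. Deburring is assumed to mean stripping every combining diacritic through Unicode decomposition, which also turns ñ into n and may be broader than what the original stemmer does.

```javascript
// Hypothetical helper (an assumption): strip diacritics via Unicode decomposition.
function deburr(word) {
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Sketch of the light stemming steps listed above; returns the stemmed string.
function lightSpanishStem(word) {
  const s = deburr(word);
  const len = s.length;

  // Short words are left alone.
  if (len < 5) return s;

  const last = s[len - 1];

  // Drop a final o, a or e.
  if (last === 'o' || last === 'a' || last === 'e') return s.slice(0, len - 1);

  if (last === 's') {
    // -eses: drop the final "es".
    if (s[len - 2] === 'e' && s[len - 3] === 's' && s[len - 4] === 'e') {
      return s.slice(0, len - 2);
    }
    // -ces: replace with "z".
    if (s[len - 2] === 'e' && s[len - 3] === 'c') {
      return s.slice(0, len - 3) + 'z';
    }
    // -os, -as, -es: drop the last two characters.
    if (s[len - 2] === 'o' || s[len - 2] === 'a' || s[len - 2] === 'e') {
      return s.slice(0, len - 2);
    }
  }

  return s;
}

console.log(lightSpanishStem('gatos')); // gat
console.log(lightSpanishStem('luces')); // luz
console.log(lightSpanishStem('meses')); // mes
console.log(lightSpanishStem('casa'));  // casa (shorter than 5 characters)
```

Note that the result is a stem such as "gat", not a grammatical singular such as "gato", which is why an inflector may be the better fit for singularizing nouns.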

@Yomguithereal
Owner

Otherwise, here is the code of a Python inflector for the Spanish language.

@Yomguithereal
Owner

What are you trying to achieve specifically here? Fuzzy matching? Clustering?

@demian85
Author

I'm using MongoDB, but its full-text search is not smart enough to cover edge cases. I cannot find a way to match all terms using AND without losing stemming and other features.
I'm just planning to store a normalized string and search for equality.
Thanks for the Python version. Do you know of any JS implementation?
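
To make the "store a normalized string and search for equality" idea concrete, here is a rough sketch in plain JavaScript, with no MongoDB calls. Everything in it is an assumption made for illustration: `naiveSingular` is a crude placeholder for a real Spanish inflector or stemmer, and `normalizedTerms` and `matchesAll` are hypothetical helper names.

```javascript
// Placeholder normalizer (an assumption): lowercases, strips accents,
// and drops a final "es" or "s". A real inflector would replace this.
function naiveSingular(token) {
  const t = token.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
  if (t.length > 4 && t.endsWith('es')) return t.slice(0, -2);
  if (t.length > 3 && t.endsWith('s')) return t.slice(0, -1);
  return t;
}

// Split a text into tokens, normalize each one, and return the sorted unique terms.
function normalizedTerms(text, normalize = naiveSingular) {
  const tokens = text.normalize('NFC').split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  return [...new Set(tokens.map(normalize))].sort();
}

// AND semantics: every normalized query term must equal one of the stored terms.
function matchesAll(storedTerms, query, normalize = naiveSingular) {
  const stored = new Set(storedTerms);
  return normalizedTerms(query, normalize).every((term) => stored.has(term));
}

const stored = normalizedTerms('Los gatos negros');
console.log(stored);                            // [ 'gato', 'los', 'negro' ]
console.log(matchesAll(stored, 'gato negro'));  // true
console.log(matchesAll(stored, 'perro negro')); // false
```

The stored terms would be computed once when a document is saved, and the same normalizer applied to each query, so that matching reduces to exact equality on the normalized terms.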

@Yomguithereal
Owner

If you tell me the Python inflector works for you and solves your problem, I can implement it in Talisman, but I'll need some time to do so.

@Yomguithereal
Owner

Yomguithereal commented Oct 27, 2016

Ok, I just implemented both the UniNe stemmer and the Python inflector in Talisman, @demian85. Here is how to use them:

npm install talisman
// The stemmer
const stemmer = require('talisman/stemmers/spanish/unine');

// The inflector
const inflector = require('talisman/inflectors/spanish/noun').singularize;

@demian85
Author

Great! I'll give it a try! Thanks!

@Yomguithereal
Owner

I'll close this issue. Open one on talisman if you have any problem.

@Yomguithereal
Owner

So did it work for you?

@Yomguithereal
Owner

@demian85 you never told me whether this worked for you or whether there were things to fix.

@demian85
Author

demian85 commented Dec 2, 2016 via email
