Handle "you" and "your" better in poterminology with --nonstop-needed #1102
This patch tries to solve this by changing the class of "you" and "your" in the stopword list. Please review, and comment on the placing in the stopword file. Should it move to the ">" section?
--- stoplist-en (revision 12164)
Firstly, to address your suggested patch, which changes the stopword prefix for "you" and "your" from '=' to '>': this will not generally do what you want, although in some cases it may be helpful.
The "don't count against length for phrases" feature prevents the default three-word limit on phrases from counting the word, but this will not automatically prevent "you sure you" from appearing. It does allow "you sure you want to" and "you sure you do" to appear, and if all appearances of the shorter phrase are within appearances of the longer phrase (i.e. the counts for both are equal), then the shorter phrase will be suppressed in favor of the longer one.
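To make the equal-count rule concrete, here is a minimal sketch in Python. It is not poterminology's actual code: the function name `suppress_subphrases` and the sample counts are made up for illustration, and it only models the simple case where a single longer phrase accounts for every occurrence of the shorter one.

```python
def suppress_subphrases(counts):
    """Drop a phrase when a longer phrase contains it and has the same count.

    `counts` maps each candidate phrase to its number of occurrences.
    Hypothetical helper for illustration only.
    """
    def contains(longer, shorter):
        lw, sw = longer.split(), shorter.split()
        # True only if `shorter` is a contiguous word sequence inside `longer`.
        return len(lw) > len(sw) and any(
            lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1)
        )

    return {
        phrase: n
        for phrase, n in counts.items()
        if not any(contains(other, phrase) and counts[other] == n
                   for other in counts)
    }


# Made-up counts: every "you sure you" occurs inside "you sure you want to",
# while "sure you want" also occurs elsewhere.
counts = {
    "you sure you": 4,
    "you sure you want to": 4,
    "sure you want": 6,
}
kept = suppress_subphrases(counts)
```

In this example "you sure you" is dropped because its count matches the longer phrase, while "sure you want" survives because it has occurrences outside the longer phrase.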
This raises the question of whether the longer phrases are useful, but it is probably a slight improvement over the current output. (If the stoplist prefix for "are" were '=' instead of '@', the longer phrases would probably be "are you sure you want to" and "are you sure you do" - which are really not too bad, but this would also allow other phrases with "are" to appear, which might make things worse overall.)
Also, while this approach works for "you sure you" (in most cases there will be a longer phrase containing all instances of the shorter one), it may not suppress other undesirable instances of phrases containing "you".
The only way to suppress those other phrases with the current implementation of poterminology is to use the '@' or '<' stopword prefixes, which will eliminate all phrases containing "you" and "your" (the '<' prefix would allow them to appear as single words, which the current '=' prefix does not allow).
The original motivation for leaving "you" and "your" as non-disregarded stopwords was that in some languages there are different words for this depending on level of formality (e.g. fr: tu / vous, de: du / Sie, es: tú / Ud. or usted). In the case of Spanish, a full or abbreviated version of the formal "you" can be chosen.
While a terminology file for use across many projects probably shouldn't insist on formal or informal choice, within one project it makes sense to be consistent, e.g. OLPC, targeted at kids, would prefer the informal "tu" while time tracking software for lawyers would definitely want the formal "vous". Other projects might choose one or the other, but they would probably not want to mix them.
I did choose to have a stoplist entry for "you" so that it would not appear as a word by itself - this is because English uses "you" for both subject (you love me) and object (I love you) pronouns, as well as both singular and plural, so that the single word could have as many as four distinct translations. While "your" is not quite as bad (subject / object doesn't apply to a possessive), the singular you / plural you distinction is still lost, and there may be multiple forms of the possessive for gender and/or quantity (e.g. fr: ta / ton / votre / vos, es: tu / tus / su / sus).
When "you" or "your" appears in a phrase, however, it usually will be restricted by context to one of the possible meanings (and for "your" will probably have the following noun to specify gender/quantity), making a terminology entry more useful. This was the reason I ended up with the '=' stoplist prefix for these words.
For these reasons, I am not in favor of a '@' stoplist entry for these words as a default (global terminology projects may well wish to add it themselves, though), and I would only accept a '<' stoplist entry for "your" (as "you" by itself is just too ambiguous to be useful).
With code changes, there are some other possibilities for suppressing these less-than-useful terminology entries, and in fact I made a note in http://translate.sourceforge.net/wiki/toolkit/poterminology#issues:
Terms containing only words that are ignored individually, but not excluded from phrases (e.g. “you are you”) may be generated by poterminology, but aren't generally useful. Adding a new threshold option --nonstop-needed could allow these to be suppressed ("in your," "you are you," and "you sure you" contain no non-stopwords, so a default of --nonstop-needed=1 would suppress all of them).
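As a rough sketch of what such a threshold could look like (this is only an illustration of the idea, not the toolkit's code; the helper name `has_enough_nonstop`, the tiny stopword set, and the "your password" phrase are assumptions):

```python
# Illustrative subset only; the real stoplist is much larger.
STOPWORDS = {"in", "your", "you", "are", "sure"}


def has_enough_nonstop(phrase, stopwords, nonstop_needed=1):
    """Return True if the phrase has at least `nonstop_needed` non-stopwords."""
    words = phrase.lower().split()
    nonstop = sum(1 for w in words if w not in stopwords)
    return nonstop >= nonstop_needed


phrases = ["in your", "you are you", "you sure you", "your password"]
kept = [p for p in phrases if has_enough_nonstop(p, STOPWORDS)]
```

With the threshold at 1, the three stopword-only phrases are filtered out and only "your password" remains; with a threshold of 0 nothing would be filtered.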
Something like this is probably a better approach than using the '@' stopword prefix, and could be combined with a change to '>' stopword prefix if the longer phrases like "are you sure you want" do indeed seem to be useful (otherwise, leaving the current '=' will suppress the shorter phrases, and the longer ones will not be considered). With --nonstop-needed=1 default, it may also make sense to revisit some '@' stoplist entries (like "are") and consider changing them to '=' or '>' now that many phrases containing them would be suppressed by other means.
In summary, I would accept your patch as-is, although I would suggest that using the '<' stopword prefix for "your" might be more effective in reducing useless phrases in that case. However, I would put a comment on those modified entries referencing this bug id (1102) and would leave it open pending implementation of the --nonstop-needed enhancement, which is a much more complete solution.
I doubt you would want to set --nonstop-needed=0 (thus the default of 1), but you might want to set it higher. I suppose that phrases with no stopwords would be exempt, i.e. with --nonstop-needed=3 a two-word phrase with no stopwords would still be allowed.
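A small sketch of how that exemption might behave, assuming the check is simply skipped for phrases that contain no stopwords at all (the function name `passes_threshold`, the stopword set, and the sample phrases are hypothetical, and this is not necessarily how the option ended up being implemented):

```python
def passes_threshold(phrase, stopwords, nonstop_needed=1):
    """Apply the non-stopword threshold, exempting stopword-free phrases."""
    words = phrase.lower().split()
    stop_count = sum(1 for w in words if w in stopwords)
    if stop_count == 0:
        # No stopwords at all: exempt, however short the phrase is.
        return True
    return len(words) - stop_count >= nonstop_needed


stopwords = {"you", "your", "sure"}  # illustrative subset
results = {
    p: passes_threshold(p, stopwords, nonstop_needed=3)
    for p in [
        "translation memory",            # no stopwords, so exempt
        "your translation memory",       # 2 non-stopwords, below 3
        "your translation memory file",  # 3 non-stopwords
        "you sure you",                  # 0 non-stopwords
    ]
}
```

Under these assumptions the two-word phrase is kept even though it has fewer than three non-stopwords, which is the behaviour described above.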
Would it make sense to have a (separate) minimum phrase length option as well as the current -t maximum?
I implemented --nonstop-needed in r14950
I hope you can review this commit, Alexander.