Handle "you" and "your" better in poterminology with --nonstop-needed #1102

Open
friedelwolff opened this Issue Aug 12, 2009 · 5 comments

Member

friedelwolff commented Aug 12, 2009

Version: trunk

Terms containing "you" and "your" often appear in poterminology output in forms that seem less useful, and that could be dangerous if people tried to standardise them (like "you sure you", "in your", etc.).

friedelwolff Aug 12, 2009

Member

This patch tries to solve this by changing the class of "you" and "your" in the stopword list. Please review, and comment on the placing in the stopword file. Should it move to the ">" section?

Index: stoplist-en

--- stoplist-en (revision 12164)
+++ stoplist-en (working copy)
@@ -596,11 +596,11 @@
=year
=years
=yet
-=you
+>you
=young
=younger
=youngest
-=your
+>your
=yourself
=yourselves

dupuy Aug 13, 2009

Contributor

Firstly, to address your suggested patch, which changes the stopword prefix for "you" and "your" from '=' to '>': this will not generally do what you want, although in some cases it may be helpful.

The "don't count against length for phrases" feature prevents the default three-word limit on phrases from counting the word, but this will not automatically prevent "you sure you" from appearing. It does allow "you sure you want to" and "you sure you do" to appear, and if all appearances of the shorter phrase are within appearances of the longer phrase (i.e. the counts for both are equal), then the shorter phrase will be suppressed in favor of the longer one.
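The suppression rule described above can be sketched roughly as follows. This is only an illustration of the idea (a shorter phrase is dropped when a longer phrase contains it and has the same count), not the actual poterminology code; the function names and the sample counts are made up for the example.

```python
def contains(longer, shorter):
    """Return True if the word list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def suppress_subsumed(counts):
    """Drop each phrase that a longer phrase with an equal count contains.

    `counts` maps a phrase (words separated by spaces) to its number of
    occurrences. Equal counts are taken to mean that every occurrence of
    the shorter phrase lies inside an occurrence of the longer one.
    """
    kept = {}
    for phrase, count in counts.items():
        words = phrase.split()
        subsumed = any(
            len(other.split()) > len(words)
            and other_count == count
            and contains(other.split(), words)
            for other, other_count in counts.items()
        )
        if not subsumed:
            kept[phrase] = count
    return kept


# Hypothetical counts, for illustration only.
counts = {
    "you sure you": 4,
    "you sure you want to": 4,
    "in your": 7,
    "in your account": 5,
}
result = suppress_subsumed(counts)
# "you sure you" is dropped; "in your" stays because its count differs
# from that of "in your account".
```

Note that "in your" survives here, which is the limitation raised in this comment: the rule only helps when a single longer phrase covers every occurrence of the shorter one.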

This raises the question of whether the longer phrases are useful, but it is probably a slight improvement over the current output. (If the stoplist prefix for "are" were '=' instead of '@', the longer phrases would probably be "are you sure you want to" and "are you sure you do" - which are really not too bad, but this would also allow other phrases with "are" to appear, which might make things worse overall.)

Also, while this approach works for "you sure you" (in most cases there will be a longer phrase containing all instances of the shorter one), it may not suppress other undesirable instances of phrases containing "you".

The only way to suppress those other phrases with the current implementation of poterminology is to use the '@' or '<' stopword prefixes, which will eliminate all phrases containing "you" and "your" (the '<' prefix would allow them to appear as single words, which the current '=' prefix does not allow).
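To make the four prefixes discussed so far concrete, here is a minimal sketch of how they could be modelled. It follows only the behaviour described in this thread and is not the real poterminology implementation: the class and function names are hypothetical, the sample stoplist entries are illustrative rather than taken from stoplist-en, and treating '>' words as disallowed on their own is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopClass:
    allow_alone: bool           # may the word be a term by itself?
    allow_in_phrase: bool       # may phrases contain the word?
    counts_toward_length: bool  # does it count against the phrase limit?


# Behaviour as described in this thread; '>' disallowing the single word
# is an assumption.
CLASSES = {
    "=": StopClass(False, True, True),
    ">": StopClass(False, True, False),
    "<": StopClass(True, False, True),
    "@": StopClass(False, False, True),
}


def parse_stoplist(lines):
    """Map each word to its class, ignoring blank or unrecognised lines."""
    stops = {}
    for line in lines:
        line = line.strip()
        if line and line[0] in CLASSES:
            stops[line[1:]] = CLASSES[line[0]]
    return stops


def phrase_allowed(words, stops, max_length=3):
    """Apply the single-word, containment, and length rules to a word list."""
    if len(words) == 1:
        cls = stops.get(words[0])
        return cls is None or cls.allow_alone
    if any(w in stops and not stops[w].allow_in_phrase for w in words):
        return False
    length = sum(
        1 for w in words if w not in stops or stops[w].counts_toward_length
    )
    return length <= max_length


# Illustrative entries only.
patched = parse_stoplist([">you", "=sure", "=to", "@are", "<your"])
current = parse_stoplist(["=you", "=sure", "=to", "@are", "=your"])
long_phrase = "you sure you want to".split()
```

With the patched list, `phrase_allowed(long_phrase, patched)` is true because the two occurrences of "you" no longer count against the three-word limit, while `phrase_allowed(long_phrase, current)` is false. With '<' on "your", the single word passes but "in your" does not.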

The original motivation for leaving "you" and "your" as non-disregarded stopwords was that in some languages there are different words for this depending on level of formality (e.g. fr: tu / vous, de: du / Sie, es: tú / Ud. or usted). In the case of Spanish, a full or abbreviated version of the formal "you" can be chosen.

While a terminology file for use across many projects probably shouldn't insist on formal or informal choice, within one project it makes sense to be consistent, e.g. OLPC, targeted at kids, would prefer the informal "tu" while time tracking software for lawyers would definitely want the formal "vous". Other projects might choose one or the other, but they would probably not want to mix them.

I did choose to have a stoplist entry for "you" so that it would not appear as a word by itself - this is because English uses "you" for both subject (you love me) and object (I love you) pronouns, as well as both singular and plural, so that the single word could have as many as four distinct translations. While "your" is not quite as bad (subject / object doesn't apply to a possessive), the singular you / plural you distinction is still lost, and there may be multiple forms of the possessive for gender and/or quantity (e.g. fr: ta / ton / votre / vos, es: tu / tus / su / sus).

When "you" or "your" appears in a phrase, however, it usually will be restricted by context to one of the possible meanings (and for "your" will probably have the following noun to specify gender/quantity), making a terminology entry more useful. This was the reason I ended up with the '=' stoplist prefix for these words.

For these reasons, I am not in favor of a '@' stoplist entry for these words as a default (global terminology projects may well wish to add it themselves, though), and I would only accept a '<' stoplist entry for "your" (as "you" by itself is just too ambiguous to be useful).

With code changes, there are some other possibilities for suppressing these less-than-useful terminology entries, and in fact I made a note in http://translate.sourceforge.net/wiki/toolkit/poterminology#issues:

Terms containing only words that are ignored individually, but not excluded from phrases (e.g. “you are you”) may be generated by poterminology, but aren't generally useful. Adding a new threshold option --nonstop-needed could allow these to be suppressed ("in your," "you are you," and "you sure you" contain no non-stopwords, so a default of --nonstop-needed=1 would suppress all of them).

Something like this is probably a better approach than using the '@' stopword prefix, and could be combined with a change to '>' stopword prefix if the longer phrases like "are you sure you want" do indeed seem to be useful (otherwise, leaving the current '=' will suppress the shorter phrases, and the longer ones will not be considered). With --nonstop-needed=1 default, it may also make sense to revisit some '@' stoplist entries (like "are") and consider changing them to '=' or '>' now that many phrases containing them would be suppressed by other means.
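A minimal sketch of what such a threshold might look like, assuming a plain set of stopwords and a list of candidate phrases. The function names and the sample data are hypothetical and this is not the code that was eventually committed; it only illustrates the counting rule proposed here.

```python
def nonstop_count(phrase, stopwords):
    """Count the words in `phrase` that are not stopwords."""
    return sum(1 for word in phrase.lower().split() if word not in stopwords)


def filter_terms(phrases, stopwords, nonstop_needed=1):
    """Keep only phrases with at least `nonstop_needed` non-stopwords."""
    return [
        phrase
        for phrase in phrases
        if nonstop_count(phrase, stopwords) >= nonstop_needed
    ]


# Illustrative stopword set and candidates, not real poterminology output.
STOPWORDS = {"in", "your", "you", "are", "sure"}
candidates = [
    "in your",
    "you are you",
    "you sure you",
    "your account",
    "are you sure you want",
]
kept = filter_terms(candidates, STOPWORDS)
# kept == ["your account", "are you sure you want"]
```

The three phrases quoted from the wiki note contain no non-stopwords, so a threshold of 1 removes all of them without touching the stoplist prefixes.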

In summary, I would accept your patch as-is, although I would suggest that using the '<' stopword prefix for "your" might be more effective in reducing useless phrases in that case. However, I would put a comment on those modified entries referencing this bug id (1102) and would leave it open pending implementation of the --nonstop-needed enhancement, which is a much more complete solution.

friedelwolff Aug 13, 2009

Member

Thank you for the thorough handling of the issue. In which cases do you think that --nonstop-needed=1 would not do what is expected? Can't we simply make that the behaviour in all cases?

dupuy Aug 13, 2009

Contributor

I doubt you would want to set --nonstop-needed=0 (thus the default of 1), but you might want to set it higher. I suppose that phrases with no stopwords would be exempt, i.e. with --nonstop-needed=3 a two-word phrase with no stopwords would still be allowed.

Would it make sense to have a (separate) minimum phrase length option as well as the current -t maximum?
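The exemption suggested above, together with a possible minimum-length check, could be sketched like this. Both the function and the `min_length` parameter are hypothetical (no such option exists in this thread beyond the question just asked), and the sample stopword set is illustrative.

```python
def passes_thresholds(phrase, stopwords, nonstop_needed=1, min_length=1):
    """Decide whether a candidate phrase survives the two thresholds.

    A phrase made up entirely of non-stopwords is exempt from the
    nonstop_needed check, as suggested in this comment. min_length is a
    hypothetical counterpart to the existing maximum phrase length.
    """
    words = phrase.lower().split()
    if len(words) < min_length:
        return False
    nonstop = sum(1 for word in words if word not in stopwords)
    if nonstop == len(words):
        return True  # no stopwords at all, so the phrase is exempt
    return nonstop >= nonstop_needed


# Illustrative stopword set.
STOPWORDS = {"you", "your", "in", "are", "sure"}

exempt = passes_thresholds("translation memory", STOPWORDS, nonstop_needed=3)
too_few = passes_thresholds("your translation memory", STOPWORDS, nonstop_needed=3)
enough = passes_thresholds("your translation memory file", STOPWORDS, nonstop_needed=3)
too_short = passes_thresholds("memory", STOPWORDS, min_length=2)
```

Here the two-word phrase with no stopwords is accepted even though the threshold is 3, while "your translation memory" is rejected because it contains a stopword and only two non-stopwords.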

alaaosh commented Jul 15, 2010

I implemented --nonstop-needed in r14950
http://translate.svn.sourceforge.net/viewvc/translate/src/trunk/translate/tools/poterminology.py?r1=14950&r2=14949&pathrev=14950

I hope you can review this commit, Alexander.
