poterminology should suggest options when thresholding eliminates all terms #582

Open
leuce opened this Issue Oct 23, 2008 · 6 comments


leuce commented Oct 23, 2008

Version: 1.2.0

Poterminology with the default settings gets zero terms from the attached PO file of over half a million words.

The PO file:

| type | strings | words (source) | words (translation) |
|---|---|---|---|
| translated | 141230 (100%) | 549808 (100%) | 531861 |
| fuzzy | 0 (0%) | 0 (0%) | n/a |
| untranslated | 0 (0%) | 0 (0%) | n/a |
| Total | 141230 | 549808 | 531861 |

The operation:
C:>poterminology ansi.po ansi_terms.po
processing 1 files…
[###########################################] 100%
106960 terms from 141261 units in 1 files
0 terms after thresholding
0 terms after subphrase reduction

leuce commented Oct 23, 2008

Created [attachment 275](http://bugs.locamotion.org/attachment.cgi?id=275)

The half-a-million-word PO file

Member

friedelwolff commented Oct 23, 2008

The PO file seems to be seriously broken. What tool created this?
msgid ""#0000ff" or "blue"):"
msgstr ""#0000ff" of "blue"):"
The quotes are not escaped.

There are also duplicates, which our tools can usually handle, but gettext can’t.

The bigger issue here is the absence of #: comments. It seems poterminology can’t currently operate without #: comments (I verified with another file created with msgunfmt).
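A quick way to confirm whether a PO file carries any `#:` location comments is to scan it before running poterminology. The following is a minimal sketch, not part of the toolkit: the function name `count_location_comments` and the sample text are made up for illustration, and the parsing is deliberately naive (it counts the header entry as a unit and ignores obsolete `#~` entries).

```python
from typing import Tuple


def count_location_comments(po_text: str) -> Tuple[int, int]:
    """Return (units, units_with_locations) for PO-formatted text.

    A unit is counted at each line starting with 'msgid '; it counts as
    located if a '#:' comment line preceded it since the previous unit.
    """
    units = 0
    located = 0
    has_location = False
    for line in po_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#:"):
            has_location = True
        elif stripped.startswith("msgid "):
            units += 1
            if has_location:
                located += 1
            has_location = False
    return units, located


# Hypothetical two-entry sample: only the first entry has a location comment.
SAMPLE = '''#: src/main.c:10
msgid "Open"
msgstr "Maak oop"

msgid "Close"
msgstr "Maak toe"
'''

units, located = count_location_comments(SAMPLE)
print(f"{located} of {units} units have location comments")
```

If the second number comes back as zero, the file has no location information at all, which is the situation described above.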

Contributor

dupuy commented Oct 23, 2008

You need to specify --locs-needed=1 if your input file lacks location information; poterminology can work with such a file. Note that this is already explained in the (excellent) wiki documentation at http://translate.sourceforge.net/wiki/toolkit/poterminology:

--locs-needed

Rather than requiring that a term appear in multiple input PO or POT files, this requires that it have been present in multiple source code files, as evidenced by location comments in the PO/POT sources.

Not all PO/POT files contain proper location comments. If your input files don’t have (good) location comments and the output terminology file is reduced to zero or very few entries by thresholding, you may need to override the default value for this threshold and set it to 1, which disables this check.
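To make the location threshold concrete, here is a minimal sketch of one way such a check could work. It is an assumption for illustration only, not poterminology's actual code: the function `filter_by_locations`, the `>=` comparison, and the sample data are all hypothetical.

```python
from typing import Dict, Set


def filter_by_locations(term_files: Dict[str, Set[str]],
                        locs_needed: int) -> Dict[str, Set[str]]:
    """Keep terms seen in at least `locs_needed` distinct source files.

    `term_files` maps each candidate term to the set of source files named
    in the '#:' comments of the units where the term appears.
    """
    return {term: files for term, files in term_files.items()
            if len(files) >= locs_needed}


# Hypothetical data: "blue" comes from units without any location comment.
terms = {
    "file": {"a.c", "b.c"},
    "open file": {"a.c"},
    "blue": set(),
}

for needed in (2, 1, 0):
    print(needed, sorted(filter_by_locations(terms, needed)))
```

Under this assumed comparison, a term from units with no location comments has an empty file set, so it only survives a threshold of 0. Whether the real tool behaves this way should be checked against its source.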

[There is also a very relevant comment a bit further down:]

--fullmsg-needed
--substr-needed

These two thresholds specify the number of different translation units (messages) in which a term must appear; they both work in the same way, but the first one applies to terms which appear as complete translation units in one or more of the source files (full message terms), and the second one to all other terms (substring terms). Note that translations are extracted only for full message terms; poterminology cannot identify the corresponding substring in a translation.

If you are working with a single input file without useful location comments, increasing these thresholds may be the only way to effectively reduce the output terminology. Generally, you should increase the --substr-needed threshold first, as the full message terms are more likely to be useful terminology.

Given that you are trying to get useful terminology from a single file with no location information, I suspect you will need to use the above thresholds.
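The two unit-count thresholds described above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: the `Term` class, the function `apply_unit_thresholds`, and the counts and threshold values are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Term:
    text: str
    unit_count: int        # number of distinct translation units containing the term
    is_full_message: bool  # True if the term is a complete unit somewhere


def apply_unit_thresholds(terms: List[Term],
                          fullmsg_needed: int,
                          substr_needed: int) -> List[Term]:
    """Keep each term whose unit count meets the threshold for its kind."""
    kept = []
    for term in terms:
        needed = fullmsg_needed if term.is_full_message else substr_needed
        if term.unit_count >= needed:
            kept.append(term)
    return kept


# Hypothetical candidates and counts.
candidates = [
    Term("Save", 3, True),
    Term("save as", 2, False),
    Term("file", 40, False),
    Term("Quit", 1, True),
]

# Raising only the substring threshold trims substring terms
# while leaving full message terms alone.
for substr_needed in (2, 3):
    kept = apply_unit_thresholds(candidates, fullmsg_needed=1,
                                 substr_needed=substr_needed)
    print(substr_needed, [t.text for t in kept])
```

This mirrors the advice to raise the substring threshold first: full message terms are judged against their own, separate threshold.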

Rather than reject this bug, I will take it as a suggestion for improvement, which is that poterminology should suggest option settings (based on the maximum observed values of all threshold quantities) if thresholding removes all terms. This would provide output something like the following:

C:>poterminology ansi.po ansi_terms.po
processing 1 files…
[###########################################] 100%
106960 terms from 141261 units in 1 files
0 terms after thresholding
0 terms after subphrase reduction
Current threshold settings have removed all terms. In order to generate
non-empty output, run poterminology with the following options:
--locs-needed=1

The suggestion text would include every option for which the maximum observed quantity was less than the configured threshold. The other threshold options are --inputs-needed (which is automatically reduced to 1 if only one file is provided), --fullmsg-needed, and --substr-needed.
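The proposed suggestion logic could look roughly like the sketch below. It assumes the tool tracks the maximum observed value of each threshold quantity while scanning; the functions `suggest_options` and `format_suggestion` and all numbers are hypothetical, and only the message wording is taken from the sample output above.

```python
from typing import Dict, List


def suggest_options(configured: Dict[str, int],
                    max_observed: Dict[str, int]) -> List[str]:
    """List options whose configured threshold no term could reach.

    For each such option, suggest the largest value that was observed,
    which is the highest setting that would let at least one term through.
    """
    suggestions = []
    for option, threshold in configured.items():
        observed = max_observed.get(option, 0)
        if observed < threshold:
            suggestions.append(f"--{option}={observed}")
    return suggestions


def format_suggestion(options: List[str]) -> str:
    """Build the hint text, or an empty string if nothing needs changing."""
    if not options:
        return ""
    lines = [
        "Current threshold settings have removed all terms. In order to generate",
        "non-empty output, run poterminology with the following options:",
    ]
    lines.extend(options)
    return "\n".join(lines)


# Hypothetical thresholds and observed maxima for a single input file.
configured = {"inputs-needed": 1, "locs-needed": 2,
              "fullmsg-needed": 1, "substr-needed": 2}
max_observed = {"inputs-needed": 1, "locs-needed": 1,
                "fullmsg-needed": 5, "substr-needed": 12}

print(format_suggestion(suggest_options(configured, max_observed)))
```

Suggesting the observed maximum, rather than a fixed value such as 1, keeps the hint valid even when the highest count seen is lower than expected.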

@alex

leuce commented Oct 23, 2008

The TM was created by me. The escaping issue is not the cause: I refined my tool to escape quotes and slashes and recreated the PO file, but the result from poterminology was the same. Should I upload the new PO file for you?
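For reference, escaping of this kind can be sketched as below. This is not the tool used to build the file; the helper names `escape_po_string` and `format_entry` are hypothetical, and the sketch covers only backslashes, double quotes, and newlines.

```python
def escape_po_string(text: str) -> str:
    """Escape a string for use inside a double-quoted PO msgid or msgstr.

    Backslashes are escaped first so that the backslashes added for
    quotes and newlines are not escaped a second time.
    """
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def format_entry(msgid: str, msgstr: str) -> str:
    """Format one PO entry with both strings escaped."""
    return (f'msgid "{escape_po_string(msgid)}"\n'
            f'msgstr "{escape_po_string(msgstr)}"\n')


# The string from the broken entry quoted earlier in this thread.
print(format_entry('"#0000ff" or "blue"):', '"#0000ff" of "blue"):'))
```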

leuce commented Oct 23, 2008

Aha, the documentation is a little obtuse:

--locs-needed=MIN omit terms appearing in less than MIN different original source files (default 2)

To a non-programmer user of poterminology, the PO file is the original source file.

leuce commented Oct 24, 2008

:poterminology --locs-needed=1 newansi.po newansi_terms.po
processing 1 files…
[###########################################] 100%
108231 terms from 141262 units in 1 files
0 terms after thresholding
0 terms after subphrase reduction

:poterminology --locs-needed=0 newansi.po -o newansi_terms.po
processing 1 files…
[###########################################] 100%
108231 terms from 141262 units in 1 files
38872 terms after thresholding
31464 terms after subphrase reduction

Well, that works at least.

@leuce leuce added the tools label Jul 27, 2014
