Description
Expected Behavior:
The API function PrintVariables
prints current parameters to a file, and ReadConfigFile
reads parameters from a file. Intuitively, ReadConfigFile
should be able to read the files that PrintVariables
writes. This is explicitly assumed within the ProcessPage
function, where these functions are used together to "Save current config variables before switching modes" and then "Restore saved config variables".
Lines 1293 to 1306 in a873553
Current Behavior:
Unfortunately, this does not currently work properly. The issue is that PrintVariables
prints parameter descriptions alongside key/value pairs (e.g. chs_trailing_punct1 ).,;:?! 1st Trailing punctuation
), and ReadConfigFile
reads the description as a value (for string parameters). An example showing this is below.
#include <tesseract/baseapi.h>
int main()
{
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "eng")) {
fprintf(stderr, "Could not initialize tesseract.\n");
exit(1);
}
static const char *kOldVarsFile = "failed_vars.txt";
// Print default value of chs_trailing_punct1
printf("Initial value: %s\n", api->GetStringVariable("chs_trailing_punct1"));
FILE *fp = fopen(kOldVarsFile, "wb");
api->PrintVariables(fp);
fclose(fp);
api->ReadConfigFile(kOldVarsFile);
printf("After PrintVariables/ReadConfigFile: %s\n", api->GetStringVariable("chs_trailing_punct1"));
api->End();
delete api;
return 0;
}
This returns the following:
Initial value: ).,;:?!
After PrintVariables/ReadConfigFile: ).,;:?! 1st Trailing punctuation
The impact of this is:
ProcessPage
does not work correctly when used withretry_config
- There is no simple interface for generating a config file with the user's current settings
- This is useful for saving/restoring configurations (as
ProcessPage
attempts to do)
- This is useful for saving/restoring configurations (as
Suggested Fix:
The simplest solution would be to remove the descriptions from the PrintVariables
output (or at least hide that behavior behind an option). I can write a PR if others agree this makes sense. Editing ReadConfigFile
to ignore the descriptions is likely also possible, but could be higher effort.
Environment
Tesseract Version: 5.2.0
Commit Number: 15200c6
Platform: Linux ubuntu 5.15.0-43-generic
Activity
amitdo commentedon Oct 15, 2022
We don't want to do this because that function is used by the command line tool and we need the descriptions in that case.
tesseract/src/tesseract.cpp
Lines 721 to 724 in 0daf18c
Balearica commentedon Oct 15, 2022
@amitdo Good point. How about adding an option to
PrintVariables
for whether info text should be included (enabled by default, but disabled for the call withinProcessPage
).amitdo commentedon Oct 16, 2022
How about using:
if (fp == stdout)
or:
(fp == stdout) ? ... : ...
in:
tesseract/src/ccutil/params.cpp
Line 164 in 60fd2b4
Updated PrintParams to no longer print info text to files per tessera…
Balearica commentedon Oct 19, 2022
@amitdo Works for me. Added PR #3947.
I confirmed (1) running
tesseract --print-parameters
from the command line still prints the info text and (2) the issue demonstrated in the example above is resolved (the value ofchs_trailing_punct1
is not changed).stweil commentedon Oct 19, 2022
The regression was added in c51691f in 2014, a long time ago, so all relevant releases are affected.
stweil commentedon Oct 19, 2022
Obviously the argument
retry_config
for the API functionsProcessPages
andProcessPagesFileList
which triggers the regression was not used for years. Is it useful? If not, we simply might drop the support for it.If
retry_config
is useful, maybe the regression could be fixed by extendingParamUtils::ReadParamsFromFp
to remove the comment part from the input lines. Or even better,TessBaseAPI::ProcessPage
should save and restore parameters in memory without any file on disk.Balearica commentedon Oct 19, 2022
I think the ability to dump the user's current parameters and easily restore them later is useful (we're implementing a feature in Tesseract.js that tries to do this using
PrintVariables
/ReadConfigFile
).Adding a new function that writes a config file without the info text (suggested in the PR #3947) seems like an easy and effective solution. I looked into whether
ReadConfigFile
could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value ofpage_separator
is\n
).amitdo commentedon Oct 19, 2022
Related issue: #3260
stweil commentedon Oct 19, 2022
ParamUtils::PrintParams
uses KEY TAB VALUE TAB COMMENT.amitdo commentedon Oct 19, 2022
#3670 (comment)
This comment and my comments below it are also related to this issue.
Balearica commentedon Oct 20, 2022
Okay, if we are confident that VALUE will never include tabs, then I can edit
ReadConfigFile
to only read until the first tab. If this is not the case, then I can do the separateDumpParameters
function.Balearica commentedon Nov 4, 2022
@stweil Do you believe it's safe to assume that tabs will not be used as parameter values (so it is safe for
ReadConfigFile
to read everything up to the tab and ignore everything after)?Balearica commentedon May 10, 2023
I created a new PR (#4074) based on @stweil's idea to implement by adding a new function in #3947 (review).