Makefile based plotting #236
@Shreeshrii I was trying to get on with the plotting, but I stumbled upon the term "validationlist".
Usually for training we use a training list and eval list. You can use a different dataset for validation, if you want.
@Shreeshrii But I could not get the result of "Make TSV with Eval CER". I think there is a problem with my environment or steps, and I would appreciate your advice if you can help me.
Environment and Steps
Environment
Steps
I think "validate" should be prepared separately from "eval", but for now I have set "validate" to the same list as "eval".
Result
The logs did not contain any entries about "eval", unlike the logs described in the issue.
Log
TSV
Plot
Can we merge this pull request, or would anybody suggest additional changes?
While I do very much welcome your effort to integrate a decent plotting facility, I have strong reservations about this implementation and about the way this PR is set up technically.
Beyond the comments given inline, I see two general questions:
- How do the two old Python scripts relate to the single new one? (There seems to be lots of code re-use. It's hard to compare to the previous version this way.)
- How does this relate to the bigger problem of CER not being calculated correctly by lstmtraining/lstmeval and checkpoints only being created at arbitrary intervals?
VALIDATE_CER=9

# Training log file. This should match logfile name from training. Default: $(MODEL_LOG)
MODEL_LOG =../data/$(MODEL_NAME).log
Seeing ../data here, this assumes that the toplevel training used the default DATA_DIR=data, which is overly restrictive. I suggest either rewriting in terms of a DATA_DIR variable as well, or recursively calling the plot/Makefile with everything exported from the toplevel Makefile.
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]\.[0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9])_/$${1}00_/' {} \;
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]*\.[0-9][0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9][0-9])_/$${1}0_/' {} \;
What is rename? It's certainly not POSIX, and does not work on Linux.
At any rate, I would prefer changing lstmtraining.cpp to simply create zero-padded file names in the first place.
Then again, I don't even see the need for zero-padding filenames at all: you can always sort them correctly via sort -n.
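To illustrate the point that renaming is not needed for ordering, here is a minimal Python sketch that sorts checkpoint names by the CER embedded in them. The file names are made up; the layout `<model>_<CER>_<learning iteration>_<training iteration>.checkpoint` is an assumption read off the regular expressions in the makefile, not something confirmed against lstmtraining.

```python
import re

# Hypothetical checkpoint names; the layout is assumed from the makefile regexes:
# <model>_<CER>_<learning iteration>_<training iteration>.checkpoint
names = [
    "foo_10.25_300_400.checkpoint",
    "foo_2.5_100_150.checkpoint",
    "foo_2.125_200_260.checkpoint",
]

PATTERN = re.compile(
    r"^(?P<model>.+)_(?P<cer>\d+(?:\.\d+)?)_(?P<learn>\d+)_(?P<train>\d+)\.checkpoint$"
)


def cer_key(name):
    """Return the CER embedded in a checkpoint file name as a float."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    return float(match.group("cer"))


# Sorting on the numeric value makes zero padding of the names unnecessary.
sorted_names = sorted(names, key=cer_key)
print(sorted_names)
```

A plain lexical sort would put `foo_10.25_...` first; the numeric key avoids that without touching the files.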
@grep 'At iteration' $(MODEL_LOG) \
| sed -e '/^Sub/d' \
| sed -e '/^Update/d' \
| sed -e '/^ New worst char/d' \
| sed -e 's/At iteration \([0-9]*\)\/\([0-9]*\)\/.*char train=/\t\t\1\t\2\t\t/' \
| sed -e 's/%, word.*/\t/' >> "$@"
I would strongly recommend generating a CSV/TSV file directly from lstmtraining.cpp instead of trying to parse its output strings here – any slight change there (different wording/formatting or additional lines) would break this.
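For readers trying to follow what the sed chain extracts, here is a rough Python equivalent. It is only a sketch: the sample log line is constructed to fit the patterns in the sed expressions above, not copied from a real training log, and it is just as dependent on the log wording as the original, which is exactly the fragility described in the comment.

```python
import re

# Column order taken from the TSV header used elsewhere in the makefile.
COLUMNS = ["Name", "CheckpointCER", "LearningIteration", "TrainingIteration",
           "EvalCER", "IterationCER", "ValidationCER"]

# Mirrors the sed expressions: first two numbers after "At iteration",
# then the percentage following "char train=".
LINE_RE = re.compile(r"At iteration (\d+)/(\d+)/.*?char train=([\d.]+)%")


def iteration_row(line):
    """Return one TSV row (as a list) for a matching log line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    learning, training, cer = match.groups()
    # Only three of the seven columns are known from this kind of line.
    return ["", "", learning, training, "", cer, ""]


# Constructed sample line (an assumption, not real lstmtraining output).
sample = ("At iteration 100/200/200, Mean rms=1.5%, delta=4.2%, "
          "char train=12.345%, word train=30.1%, skip ratio=0%")
row = iteration_row(sample)
print("\t".join(COLUMNS))
print("\t".join(row))
```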
# END-EVAL

.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)
These are concrete file targets, marking them as phony is wrong.
FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODEL_NAME)_[0-$(VALIDATE_CER)].[0-9]*_*.traineddata | sort -n -r)

# Build validate log files list based on above traineddata list.
FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
The outer subst is a no-op.
Suggested change
-FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
+FAST_LOG_FILES := $(FAST_DATA_FILES:%.traineddata=%.$(VALIDATE_LIST).log)
OMP_THREAD_LIMIT=1 time -p lstmeval \
--verbosity=0 \
--model $< \
--eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@
This rule still needs to declare all its dependencies, esp. TMP_VALIDATE_LIST and FAST_DATA_FILES.
# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
$(TMP_FAST_LOG): $(FAST_LOG_FILES)
@for i in $^; do \
echo Filename : "$$i";echo;cat "$$i"; \
What are the Filename lines used for?
ytsvfile = "tmp-" + args.model + "-" + args.validatelist + "-iteration.tsv"
ctsvfile = "tmp-" + args.model + "-" + args.validatelist + "-checkpoint.tsv"
etsvfile = "tmp-" + args.model + "-" + args.validatelist + "-eval.tsv"
vtsvfile = "tmp-" + args.model + "-" + args.validatelist + "-validate.tsv"
plotfile = "../data/" + args.model + "/plot/" + args.model + "-" + args.validatelist + "-cer.png"
Hard-coding the values in Python redundantly instead of passing the fully defined path names from the makefile is really bad style.
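One way the script could accept the paths instead of rebuilding them is sketched below with argparse. All option names are hypothetical and the paths in the example call are made up; the only point is that the makefile, which already defines these names, passes them in and the script derives nothing on its own.

```python
import argparse


def build_parser():
    """Command-line interface that takes every file path explicitly."""
    parser = argparse.ArgumentParser(
        description="Plot CER values from TSV files prepared by the makefile")
    # One option per input file; all option names here are hypothetical.
    parser.add_argument("--iteration-tsv", required=True)
    parser.add_argument("--checkpoint-tsv", required=True)
    parser.add_argument("--eval-tsv", required=True)
    parser.add_argument("--validate-tsv", required=True)
    parser.add_argument("--output", required=True, help="path of the PNG to write")
    # Optional y-axis limit with a default, so callers need not define it.
    parser.add_argument("--y-max-cer", type=float, default=None)
    return parser


# Example call with made-up paths, as a makefile recipe might pass them.
args = build_parser().parse_args([
    "--iteration-tsv", "tmp-foo-validate-iteration.tsv",
    "--checkpoint-tsv", "tmp-foo-validate-checkpoint.tsv",
    "--eval-tsv", "tmp-foo-validate-eval.tsv",
    "--validate-tsv", "tmp-foo-validate-validate.tsv",
    "--output", "data/foo/plot/foo-validate-cer.png",
])
print(args.output)
```

With this shape, changing DATA_DIR or the temporary file naming only requires touching the makefile.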
# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.
$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
This should really be a separate rule with its own target (the concrete plot PNG), passing all the path names of the necessary input files.
Relying on Y_MAX_CER to be defined outside the makefile (instead of, say, a default) is bad style.
$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
@rm tmp-$(MODEL_NAME)-$(VALIDATE_LIST)*.*
This should be placed into a separate (phony) rule (like clean), and instead of redefining the filenames implicitly (tmp-*), re-use the actual definitions above.
@bertsky
Thank you for your detailed feedback. I agree that this may not be the best
way to implement the plotting facility for Tesseract training. The scripts
are what I used at that point in time, and they helped me visualise the
training process and results.
This PR can be closed if you or others provide an alternative
implementation.
…On Mon, Sep 6, 2021, 15:16 Robert Sachunsky ***@***.***> wrote:
***@***.**** requested changes on this pull request.
While I do very much welcome your effort to integrate a decent plotting
facility, I have strong reservations about this implementation and about
the way this PR is set up technically.
Beyond the comments given inline, I see two general questions:
1. How do the two old Python scripts relate to the single new one?
(There seems to be lots of code re-use. It's hard to compare to the
previous version this way.)
2. How does this relate to the bigger problem
<https://github.com/tesseract-ocr/tesstrain/issues/261> of CER not
being calculated correctly by lstmtraining/lstmeval and checkpoints only
being created at arbitrary intervals?
------------------------------
In plot/Makefile
<#236 (comment)>
:
> @@ -0,0 +1,132 @@
+# Name of the model to be evaluated. Default: $(MODEL_NAME)
+MODEL_NAME = foo
+
+# Suffix of list of lstmf files to be used as validation set e.g. list.validate. Default: $(VALIDATE_LIST)
+VALIDATE_LIST=validate
+
+# Integer part of maximum Validation CER, ONLY use values between 0-9. Default: $(VALIDATE_CER)
+VALIDATE_CER=9
+
+# Training log file. This should match logfile name from training. Default: $(MODEL_LOG)
+MODEL_LOG =../data/$(MODEL_NAME).log
Seeing ../data here, this assumes that the toplevel training used the
default DATA_DIR=data, which is overly restrictive. I suggest either
rewriting in terms of a DATA_DIR variable as well, or by recursively
calling the plot/Makefile with everything exported from the toplevel
Makefile.
------------------------------
In plot/Makefile
<#236 (comment)>
:
> +help:
+ @echo ""
+ @echo " Targets"
+ @echo ""
+ @echo " traineddata Create best and fast .traineddata files from each .checkpoint file"
+ @echo " plotvalidatecer Make plots from TSV files generated from training and eval logs"
+ @echo ""
+ @echo " Variables"
+ @echo ""
+ @echo " MODEL_NAME Name of the model to be built. Default: $(MODEL_NAME)"
+ @echo " VALIDATE_LIST Suffix of lstmf files list, use validate for list.validate. Default: $(VALIDATE_LIST)"
+ @echo ""
+
+# END-EVAL
+
+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)
These are concrete file targets, marking them as phony is wrong.
------------------------------
In plot/Makefile
<#236 (comment)>
:
> + echo "Name CheckpointCER LearningIteration TrainingIteration EvalCER IterationCER ValidationCER" > "$@" ;\
+ else \
+ grep -E "$(VALIDATE_LIST).log$$|iteration" $(TMP_FAST_LOG) > $(TMP_VALIDATE_LOG) ;\
+ echo "Name CheckpointCER LearningIteration TrainingIteration EvalCER IterationCER ValidationCER" > "$@" ;\
+ sed 'N;s/\nAt iteration 0, stage 0, /At iteration 0, stage 0, /;P;D' $(TMP_VALIDATE_LOG) \
+ | grep 'Eval Char' \
+ | sed -e "s/.$(VALIDATE_LIST).log.*Eval Char error rate=/\t\t\t/" \
+ | sed -e 's/, Word.*$$//' \
+ | sed -e 's/\(^.*\)_\([0-9].*\)_\([0-9].*\)_\([0-9].*\)\t/\1\t\2\t\3\t\4\t/g' >> "$@" ;\
+ fi;
+
+# Build fast traineddata file list with CER in range [0-VALIDATE_CER].[0-9].
+FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODEL_NAME)_[0-$(VALIDATE_CER)].[0-9]*_*.traineddata | sort -n -r)
+
+# Build validate log files list based on above traineddata list.
+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
The outer subst is a no-op.
⬇️ Suggested change
-FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
+FAST_LOG_FILES := $(FAST_DATA_FILES:%.traineddata=%.$(VALIDATE_LIST).log)
------------------------------
In plot/Makefile
<#236 (comment)>
:
> + @echo " VALIDATE_LIST Suffix of lstmf files list, use validate for list.validate. Default: $(VALIDATE_LIST)"
+ @echo ""
+
+# END-EVAL
+
+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)
+
+# Rename checkpoints with one/two decimal digits to 3 decimal digits for correct sorting later.
+# Run Makefile in main directory to create traineddata from all checkpoints.
+# Add ../ to lstmf file names in validate list relative to plot subdirectory.
+traineddata:
+ @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]\.[0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9])_/$${1}00_/' {} \;
+ @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]*\.[0-9][0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9][0-9])_/$${1}0_/' {} \;
+ $(MAKE) -C ../ traineddata MODEL_NAME=$(MODEL_NAME)
+ @mkdir -p $(PLOT_DIR)
+ @cp ../data/$(MODEL_NAME)/list.${VALIDATE_LIST} $(TMP_VALIDATE_LIST)
This line creates TMP_VALIDATE_LIST, the rule therefore should mark this
file as its target (perhaps as a dependent sub-rule).
But I don't see how the default list.validate should get created in the
first place. So far, tesstrain only creates list.train and list.eval.
(And since it does not make any actual *use* of the eval files, not for
checkpointing and not even for checkpoint selection, I don't see the merit
in providing a *second* hold-out set. If you have a manual split, you
could easily pass that into list.train / list.eval already.)
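If no manual split exists yet, one hypothetical way to produce a reproducible one is sketched below. The function name, the seed and the file names are illustrative assumptions; the resulting two lists would simply be written to list.train and list.eval, one path per line.

```python
import random


def split_train_eval(lstmf_files, ratio_train=0.9, seed=0):
    """Split a list of .lstmf paths into (train, eval) reproducibly."""
    files = sorted(lstmf_files)           # fixed starting order
    random.Random(seed).shuffle(files)    # seeded, so reruns give the same split
    n_train = round(len(files) * ratio_train)
    return files[:n_train], files[n_train:]


# Made-up file names for illustration only.
all_files = [f"data/foo-ground-truth/line{i:03d}.lstmf" for i in range(10)]
train, evaluation = split_train_eval(all_files)
print(len(train), len(evaluation))
```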
------------------------------
In plot/Makefile
<#236 (comment)>
:
> + @echo ""
+
+# END-EVAL
+
+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)
+
+# Rename checkpoints with one/two decimal digits to 3 decimal digits for correct sorting later.
+# Run Makefile in main directory to create traineddata from all checkpoints.
+# Add ../ to lstmf file names in validate list relative to plot subdirectory.
+traineddata:
+ @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]\.[0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9])_/$${1}00_/' {} \;
+ @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]*\.[0-9][0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9][0-9])_/$${1}0_/' {} \;
+ $(MAKE) -C ../ traineddata MODEL_NAME=$(MODEL_NAME)
+ @mkdir -p $(PLOT_DIR)
+ @cp ../data/$(MODEL_NAME)/list.${VALIDATE_LIST} $(TMP_VALIDATE_LIST)
+ @sed -i -e 's/^data/..\/data/' $(TMP_VALIDATE_LIST)
Again, this makes overly strong directory assumptions. Even if you pass in
DATA_DIR from the main makefile, you need to calculate the exact relative
path, or use absolute paths entirely.
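The relative-path calculation mentioned here can be sketched in Python as follows. The directory names are made up, and the assumption is that list entries are either relative to the toplevel directory or already absolute; `posixpath` is used so the result does not depend on the current working directory.

```python
import posixpath


def rebase(entry, base_dir, start_dir):
    """Rewrite a list entry given relative to base_dir so that it is
    relative to start_dir instead (both directories absolute)."""
    # join() leaves an already absolute entry untouched.
    absolute = posixpath.normpath(posixpath.join(base_dir, entry))
    return posixpath.relpath(absolute, start=start_dir)


# Hypothetical layout: toplevel checkout and its plot subdirectory.
top = "/work/tesstrain"
plot = "/work/tesstrain/plot"

default_case = rebase("data/foo-ground-truth/a.lstmf", top, plot)
custom_case = rebase("/mnt/ocr/foo-ground-truth/a.lstmf", top, plot)
print(default_case)
print(custom_case)
```

The first case reproduces what the sed substitution does for the default DATA_DIR; the second shows a custom, absolute DATA_DIR that the substitution would silently leave wrong.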
------------------------------
In plot.sh
<#236 (comment)>
:
> +cd plot
+make MODEL_NAME=$1 VALIDATE_LIST=$2 Y_MAX_CER=$3
+make MODEL_NAME=$1 VALIDATE_LIST=$2 Y_MAX_CER=$3
If you need a shell script to run a makefile, then that makefile is poorly
written (or documented).
Assuming it is a good choice to have a separate makefile for plotting in
the first place, I think the plot/Makefile should just require/assume that
the toplevel Makefile's traineddata target has already been run.
At the very least, this should read make traineddata && make
plotvalidatecer, not make all; make all.
------------------------------
In plot/Makefile
<#236 (comment)>
:
> + fi;
+
+# Build fast traineddata file list with CER in range [0-VALIDATE_CER].[0-9].
+FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODEL_NAME)_[0-$(VALIDATE_CER)].[0-9]*_*.traineddata | sort -n -r)
+
+# Build validate log files list based on above traineddata list.
+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
+
+#Note: This does not find the new traineddata files from current run.
+# Hence `make` needs to be run twice to generate new validate.log files.
+
+$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata
+ OMP_THREAD_LIMIT=1 time -p lstmeval \
+ --verbosity=0 \
+ --model $< \
+ --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@
This rule still needs to declare all its dependencies, esp.
TMP_VALIDATE_LIST and FAST_DATA_FILES.
------------------------------
In plot/Makefile
<#236 (comment)>
:
> +# Build validate log files list based on above traineddata list.
+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
+
+#Note: This does not find the new traineddata files from current run.
+# Hence `make` needs to be run twice to generate new validate.log files.
+
+$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata
+ OMP_THREAD_LIMIT=1 time -p lstmeval \
+ --verbosity=0 \
+ --model $< \
+ --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@
+
+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
+$(TMP_FAST_LOG): $(FAST_LOG_FILES)
+ @for i in $^; do \
+ echo Filename : "$$i";echo;cat "$$i"; \
What are the Filename lines used for?
------------------------------
In plot/Makefile
<#236 (comment)>
:
> +$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata
+ OMP_THREAD_LIMIT=1 time -p lstmeval \
+ --verbosity=0 \
+ --model $< \
+ --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@
+
+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
+$(TMP_FAST_LOG): $(FAST_LOG_FILES)
+ @for i in $^; do \
+ echo Filename : "$$i";echo;cat "$$i"; \
+ done > $@
+
+# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.
+$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
+ @cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
+ python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
This should really be a separate rule with its own target – the concrete
plot PNG – and passing all the path names of the necessary input files.
Relying on Y_MAX_CER to be defined outside the makefile (instead of, say,
a default), is bad style.
------------------------------
In plot/Makefile
<#236 (comment)>
:
> + OMP_THREAD_LIMIT=1 time -p lstmeval \
+ --verbosity=0 \
+ --model $< \
+ --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@
+
+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
+$(TMP_FAST_LOG): $(FAST_LOG_FILES)
+ @for i in $^; do \
+ echo Filename : "$$i";echo;cat "$$i"; \
+ done > $@
+
+# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.
+$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
+ @cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
+ python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
+ @rm tmp-$(MODEL_NAME)-$(VALIDATE_LIST)*.*
Should be placed into a separate (phony) rule (like clean), and instead
of redefining the filenames implicitly (tmp-*), re-use the actual above
definitions.
Thanks for your fast feedback @Shreeshrii. In that case I suggest we do close, and (being very much interested in this feature) I promise I'll revisit as soon as #261 is out of the way. (I will strive for a close integration with lstmtraining to output data files, and with the tesstrain makefile, and then re-use your plotting code.)
@bertsky I am closing this. Hope you will add a better implementation soon.
I have the same issue as #236 (comment)
@bertsky is planning to improve the output CER as discussed in above thread and will redo the plotting feature. I have added some more hacks to the scripts for my own personal use based on the above feedback. I will look to posting them in a repo and post a link here as a workaround till the official update.
That sounds great! Thank you!
Many thanks! I will try that!
Just to clarify: am I right that it still doesn't plot eval data for fine-tuning (only for replace-a-layer training)?
The scripts create a TSV from the log file generated during the training process. If Tesseract does not run the evaluation, then the info won't be in the log file and won't get plotted. Also, the eval info does not include the training iteration number; it only has the learning iteration number. As an alternative, I run lstmeval on each of the checkpoints after training and plot that separately. I have also added accuracy info from Impact Centre's ocreval as well as from the ISRI evaluation tools. Plotting of accuracy is not yet implemented.
Many thanks for the excellent work! I have tried it, but the script doesn't work with the ground truth tif and txt pairs we have for fine-tuning. It seems to require a model.training_text file and reports:
[01:32:54] INFO - === Starting training for language eng
[01:32:54] INFO - Program /usr/bin/text2image failed with return code 1. Abort.
I will keep investigating.
Yes, as my repo name indicates, this is the version for training from fonts and training text. However, the plotting part only depends on the log file from training. I will upload a different version that works with the existing tesstrain makefile.
Amazing! I will try it when it is available : ) And the lstmeval commands do work with the original make output and are useful, giving CER and WER, although the results do not show up in the plots:
nohup lstmeval --model data/modelname.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/modelname-eval-list.log &
nohup lstmeval --model data/eng.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/eng-eval-list.log &
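For running the same evaluation over several models, the two commands above can be generated rather than typed. This is only a sketch: the helper names are hypothetical, the flags are the ones used in the commands above, and the log naming simply follows the pattern of those two log files. The snippet only builds and prints the commands; it does not run lstmeval.

```python
def lstmeval_command(model, eval_listfile, verbosity=2):
    """Build the lstmeval argument list for one traineddata file."""
    return ["lstmeval",
            "--model", model,
            "--eval_listfile", eval_listfile,
            "--verbosity", str(verbosity)]


def log_path(model):
    """data/foo.traineddata -> data/foo-eval-list.log (naming as in the thread)."""
    suffix = ".traineddata"
    stem = model[:-len(suffix)] if model.endswith(suffix) else model
    return stem + "-eval-list.log"


models = ["data/modelname.traineddata", "data/eng.traineddata"]
jobs = [(lstmeval_command(m, "data/modelname/list.eval"), log_path(m))
        for m in models]

for command, log in jobs:
    print(" ".join(command), ">", log)
```

Each command list could then be handed to `subprocess.run` with the log file opened for writing, if running from Python is preferred over the shell.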