
Makefile based plotting #236

Closed
wants to merge 2 commits into from

Conversation

Shreeshrii
Collaborator

No description provided.

@M3ssman
Contributor

M3ssman commented May 4, 2021

@Shreeshrii I was trying to get on with the plotting, but I stumbled over the term "validationlist".
What does it mean in this context? As far as I understand, plotting requires an already existing *.traineddata model, which is passed to the lstmeval tool. Do I need, besides my previous training data, an additional list of lstmf files to measure the performance? For now I do it the hard way: I put the new model into tessconfigs and do a fresh run with real image data.

@Shreeshrii
Collaborator Author

Usually for training we use a training list and an eval list. You can use a different dataset for validation if you want.
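
To illustrate how such lists come about, here is a minimal sketch that builds a training list and an eval list from a set of .lstmf files. The directory, file names, and the 90/10 split are illustrative only (not tesstrain's own logic), and a real split would usually shuffle the list first:

```shell
#!/bin/sh
# Sketch: build separate training and eval lists from a set of .lstmf files.
# Paths and the 90/10 split are illustrative, not taken from tesstrain.
set -e
mkdir -p demo-lstmf
# Create a few dummy .lstmf files for the demonstration.
for i in 1 2 3 4 5 6 7 8 9 10; do
  : > "demo-lstmf/sample$i.lstmf"
done

find demo-lstmf -name '*.lstmf' | sort > all.list
total=$(wc -l < all.list)
train=$((total * 9 / 10))

# First 90% go into the training list, the remainder into the eval list.
head -n "$train" all.list > list.train
tail -n "+$((train + 1))" all.list > list.eval
wc -l list.train list.eval
```

A separate validation list would simply be a third such file, drawn from data that is used neither for training nor for the eval set.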

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jun 3, 2021
@tfukumori

tfukumori commented Jun 10, 2021

@Shreeshrii
Thanks for the great feature! I was able to draw a plot.

But I could not get the result of "Make TSV with Eval CER".
It seems that the Eval Char values are not present in data/ocrd.log, so they were not output.

I suspect a problem with my environment or steps, but I would appreciate your advice if you can help me.

Environment and Steps.

Environment

$ tesseract --version
tesseract 5.0.0-alpha-20210401
 leptonica-1.80.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

Steps

I think the "validate" list should be prepared separately from "eval", but for now I have set "validate" to the same list as "eval".

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/frk.traineddata -O ~/tessdata_best/frk.traineddata

unzip ocrd-testset.zip -d data/ocrd-ground-truth

nohup make training MODEL_NAME=ocrd \
	START_MODEL=frk \
	TESSDATA=~/tessdata_best \
	MAX_ITERATIONS=10000 > data/ocrd.log

cp data/ocrd/list.eval data/ocrd/list.validate

bash -x plot.sh ocrd validate 10

Result

Unlike the logs described in the issue, my log did not contain any lines about "eval".

Log

ocrd.log.zip

TSV

ocrd-validate-cer.tsv.zip

Plot

ocrd-validate-cer

@stale stale bot removed the stale Issues which require input by the reporter which is not provided label Jun 10, 2021
@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jul 11, 2021
@wrznr wrznr requested review from bertsky, kba and stweil July 15, 2021 11:55
@wrznr wrznr added the pinned Eternal issues which are save from becoming stale label Jul 15, 2021
@stale stale bot removed stale Issues which require input by the reporter which is not provided labels Jul 15, 2021
@tesseract-ocr tesseract-ocr deleted a comment from stale bot Sep 4, 2021
@tesseract-ocr tesseract-ocr deleted a comment from stale bot Sep 4, 2021
@stweil
Collaborator

stweil commented Sep 4, 2021

Can we merge this pull request, or would anybody suggest additional changes?


@bertsky bertsky left a comment


While I do very much welcome your effort to integrate a decent plotting facility, I have strong reservations about this implementation and about the way this PR is set up technically.

Beyond the comments given inline, I see two general questions:

  1. How do the two old Python scripts relate to the single new one? (There seems to be a lot of duplicated code, and it is hard to compare to the previous version this way.)
  2. How does this relate to the bigger problem of CER not being calculated correctly by lstmtraining/lstmeval and checkpoints only being created at arbitrary intervals?

VALIDATE_CER=9

# Training log file. This should match logfile name from training. Default: $(MODEL_LOG)
MODEL_LOG =../data/$(MODEL_NAME).log

Seeing ../data here, this assumes that the toplevel training used the default DATA_DIR=data, which is overly restrictive. I suggest either rewriting in terms of a DATA_DIR variable as well, or by recursively calling the plot/Makefile with everything exported from the toplevel Makefile.

Comment on lines +54 to +55
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]\.[0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9])_/$${1}00_/' {} \;
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]*\.[0-9][0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9][0-9])_/$${1}0_/' {} \;

What is rename? It is certainly not POSIX, and the Perl-style s/// syntax used here is not available everywhere (the util-linux rename found on many Linux systems takes entirely different arguments).


At any rate, I would prefer changing lstmtraining.cpp to simply create zero-padded file names in the first place.


Then again, I don't even see the need for zero-padding the filenames at all: you can always sort them correctly via sort -n.
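
For illustration, a numeric sort keyed on the CER field handles the unpadded names directly (the checkpoint names below are made up):

```shell
#!/bin/sh
# Sketch: unpadded checkpoint names sort correctly with a numeric key on
# the CER field (the second "_"-separated field). Names are illustrative.
printf '%s\n' \
  'ocrd_10.25_120_4600.checkpoint' \
  'ocrd_9.5_123_4500.checkpoint' \
  'ocrd_0.875_200_9100.checkpoint' \
  | sort -t_ -k2,2n
# Prints the names in ascending CER order: 0.875, 9.5, 10.25.
```

With this, the rename pass (and the corresponding padding logic in the makefile) would be unnecessary.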

Comment on lines +66 to +71
@grep 'At iteration' $(MODEL_LOG) \
| sed -e '/^Sub/d' \
| sed -e '/^Update/d' \
| sed -e '/^ New worst char/d' \
| sed -e 's/At iteration \([0-9]*\)\/\([0-9]*\)\/.*char train=/\t\t\1\t\2\t\t/' \
| sed -e 's/%, word.*/\t/' >> "$@"

I would strongly recommend generating a CSV/TSV file directly from lstmtraining.cpp instead of trying to parse its output strings here – any slight change there (different wording/formatting or additional lines) would break this.
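
To see both what the recipe extracts and how tightly it is coupled to the exact wording, here is the same pipeline applied to a single fabricated lstmtraining-style progress line (GNU sed is assumed, for \t in the replacement text):

```shell
#!/bin/sh
# Sketch (GNU sed assumed): the quoted recipe's pipeline applied to one
# fabricated lstmtraining progress line.
line='At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=0.882%, word train=1.285%, skip ratio=0.4%, New best char error = 0.882 wrote checkpoint.'

printf '%s\n' "$line" \
  | grep 'At iteration' \
  | sed -e '/^Sub/d' -e '/^Update/d' -e '/^ New worst char/d' \
  | sed -e 's/At iteration \([0-9]*\)\/\([0-9]*\)\/.*char train=/\t\t\1\t\2\t\t/' \
  | sed -e 's/%, word.*/\t/'
# Emits one TSV row: <tab><tab>14615<tab>695400<tab><tab>0.882<tab>
# Any rewording of lstmtraining's progress output silently breaks this.
```

The extraction works, but only for this precise phrasing — which is exactly why emitting the TSV from lstmtraining itself would be more robust.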


# END-EVAL

.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)

These are concrete file targets, marking them as phony is wrong.

FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODEL_NAME)_[0-$(VALIDATE_CER)].[0-9]*_*.traineddata | sort -n -r)

# Build validate log files list based on above traineddata list.
FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

The outer subst is a no-op.

Suggested change
FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
FAST_LOG_FILES := $(FAST_DATA_FILES:%.traineddata=%.$(VALIDATE_LIST).log)

OMP_THREAD_LIMIT=1 time -p lstmeval \
--verbosity=0 \
--model $< \
--eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

This rule still needs to advertise all its dependencies, especially TMP_VALIDATE_LIST and FAST_DATA_FILES.

# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
$(TMP_FAST_LOG): $(FAST_LOG_FILES)
@for i in $^; do \
echo Filename : "$$i";echo;cat "$$i"; \

What are the Filename lines used for?

Comment on lines +28 to +32
ytsvfile = "tmp-" + args.model + "-" + args.validatelist + "-iteration.tsv"
ctsvfile = "tmp-" + args.model + "-" + args.validatelist + "-checkpoint.tsv"
etsvfile = "tmp-" + args.model + "-" + args.validatelist + "-eval.tsv"
vtsvfile = "tmp-" + args.model + "-" + args.validatelist + "-validate.tsv"
plotfile = "../data/" + args.model + "/plot/" + args.model + "-" + args.validatelist + "-cer.png"

Hard-coding these values redundantly in Python, instead of passing the fully defined path names from the makefile, is really bad style.

# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.
$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)

This should really be a separate rule with its own target – the concrete plot PNG – which gets passed all the path names of the necessary input files.

Relying on Y_MAX_CER being defined outside the makefile (instead of providing a default) is bad style.

$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
@rm tmp-$(MODEL_NAME)-$(VALIDATE_LIST)*.*

This should be placed in a separate (phony) rule (like clean), and instead of redefining the filenames implicitly (tmp-*), it should re-use the actual definitions above.

@Shreeshrii
Collaborator Author

Shreeshrii commented Sep 6, 2021 via email

@bertsky
Collaborator

bertsky commented Sep 6, 2021

Thanks for your fast feedback @Shreeshrii. In that case I suggest we close this, and (being very much interested in this feature) I promise I'll revisit it as soon as #261 is out of the way. (I will strive for a close integration with lstmtraining to output data files and with the tesstrain makefile, and then re-use your plotting code.)

@Shreeshrii
Collaborator Author

@bertsky I am closing this. I hope you will add a better implementation soon.

@whisere

whisere commented Feb 3, 2022

I have the same issue as #236 (comment)
also reported in Shreeshrii/tesstrain-ben#1

@Shreeshrii
Collaborator Author

@bertsky is planning to improve the CER output as discussed in the thread above and will redo the plotting feature.

I have added some more hacks to the scripts for my own personal use based on the above feedback. I will look into posting them in a repo and will add a link here as a workaround until the official update.

@whisere

whisere commented Feb 3, 2022

That sounds great! Thank you!

@whisere

whisere commented Feb 3, 2022

Many thanks! I will try that!

@whisere

whisere commented Feb 3, 2022

Just to clarify: am I right that it still doesn't plot eval data for fine-tuning training, only for replace-layer training?

@Shreeshrii
Collaborator Author

The scripts create a TSV from the log file generated during the training process. If tesseract does not write the eval results to the log, the info won't be there and won't get plotted. Also, the eval info does not include the training iteration number; it only has the learning iteration number.

As an alternative I run lstmeval on each of the checkpoints and plot that separately, i.e. lstmeval after training. I have also added accuracy info from the IMPACT Centre's ocreval as well as from the ISRI evaluation tools. Plotting of the accuracy values is not yet implemented.
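
The "lstmeval on each checkpoint" approach can be sketched as a loop over the converted .traineddata files. All paths and file names below are hypothetical, and echo keeps the sketch runnable without tesseract installed — remove it to actually evaluate:

```shell
#!/bin/sh
# Dry-run sketch: evaluate every intermediate .traineddata against one
# fixed validation list. Paths are hypothetical; `echo` makes this a
# dry run -- drop it to really invoke lstmeval.
set -e
mkdir -p demo-fast
: > demo-fast/ocrd_1.5_100_4000.traineddata
: > demo-fast/ocrd_0.9_120_5000.traineddata

for model in demo-fast/*.traineddata; do
  echo lstmeval --verbosity=0 \
    --model "$model" \
    --eval_listfile data/ocrd/list.validate \
    ">${model%.traineddata}.validate.log"
done
```

Each per-checkpoint log then carries the CER for that checkpoint, which is what gets collected into the validation TSV for plotting.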

@whisere

whisere commented Feb 4, 2022

Many thanks for the excellent work! I have tried it, but the script doesn't work with the ground-truth tif and txt pairs we use for fine tuning... it seems to require a model.training_text file and reports:

[01:32:54] INFO - === Starting training for language eng
[01:32:54] INFO - Testing font: Arial Bold
[01:32:54] ERROR - Could not find font named Arial Bold.
Pango suggested font DejaVu Sans Bold.
Please correct --font arg.

[01:32:54] INFO - Program /usr/bin/text2image failed with return code 1. Abort.
[01:32:54] INFO - === Phase I: Generating training images ===
[01:32:54] CRITICAL - Required/expected file 'data/ground-truth/modelname-eval.training_text' does not exist
Makefile:437: recipe for target 'data/gtd/list.eval' failed
make: *** [data/gtd/list.eval] Error 1

with
nohup bash 2-training.sh eng Latin eng modelname FineTune 9999 > data/logs/modelname.LOG &

I will keep investigating.

@Shreeshrii
Collaborator Author

Yes, as my repo name indicates, this is the version for training from fonts and training text. However, the plotting part only depends on the log file from training.

I will upload a different version that works with the existing tesstrain makefile.

@whisere

whisere commented Feb 4, 2022

Amazing! I will try when it is available : )

And the lstmeval commands do work; they give useful CER and WER output for the original make output, although it is not shown in the plot:

nohup lstmeval --model data/modelname.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/modelname-eval-list.log &

nohup lstmeval --model data/eng.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/eng-eval-list.log &
