
Makefile based plotting #236

Closed
wants to merge 2 commits into from

Conversation

Shreeshrii
Collaborator

No description provided.

@M3ssman
Contributor

M3ssman commented May 4, 2021

@Shreeshrii I was trying to get on with the plotting, but I stumbled over the term "validationlist".
What does it mean in this context? As far as I understand, plotting requires an already existing *.traineddata model, which is passed to the lstmeval tool. Do I need, besides my previous training data, an additional list of lstmf files to measure the performance? For now I do it the hard way: I put the new model into tessconfigs and do a fresh run with real image data.

@Shreeshrii
Collaborator Author

Usually for training we use a training list and an eval list. You can use a different dataset for validation if you want.
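
To illustrate how such lists come about, here is a minimal sketch that builds a training list and an eval list from a set of .lstmf files. The directory, file names, and the 90/10 split are illustrative only (not tesstrain's own logic), and a real split would usually shuffle the list first:

```shell
#!/bin/sh
# Sketch: build separate training and eval lists from a set of .lstmf files.
# Paths and the 90/10 split are illustrative, not taken from tesstrain.
set -e
mkdir -p demo-lstmf
# Create a few dummy .lstmf files for the demonstration.
for i in 1 2 3 4 5 6 7 8 9 10; do
  : > "demo-lstmf/sample$i.lstmf"
done

find demo-lstmf -name '*.lstmf' | sort > all.list
total=$(wc -l < all.list)
train=$((total * 9 / 10))

# First 90% go into the training list, the remainder into the eval list.
head -n "$train" all.list > list.train
tail -n "+$((train + 1))" all.list > list.eval
wc -l list.train list.eval
```

A separate validation list would simply be a third such file, drawn from data that is used neither for training nor for the eval set.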

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jun 3, 2021
@tfukumori

tfukumori commented Jun 10, 2021

@Shreeshrii
Thanks for the great feature! I was able to draw a plot.

But I could not get the result of "Make TSV with Eval CER".
It seems that the Eval Char values are not present in data/ocrd.log, so they were not output.

I suspect a problem with my environment or steps, but I would appreciate your advice if you can help me.

Environment and Steps.

Environment

$ tesseract --version
tesseract 5.0.0-alpha-20210401
 leptonica-1.80.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

Steps

I think the "validate" list should be prepared separately from "eval", but for now I have set "validate" to the same list as "eval".

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/frk.traineddata -O ~/tessdata_best/frk.traineddata

unzip ocrd-testset.zip -d data/ocrd-ground-truth

nohup make training MODEL_NAME=ocrd \
	START_MODEL=frk \
	TESSDATA=~/tessdata_best \
	MAX_ITERATIONS=10000 > data/ocrd.log

cp data/ocrd/list.eval data/ocrd/list.validate

bash -x plot.sh ocrd validate 10

Result

Unlike the logs described in the issue, my log did not contain any lines about "eval".

Log

ocrd.log.zip

TSV

ocrd-validate-cer.tsv.zip

Plot

ocrd-validate-cer

@stale stale bot removed the stale Issues which require input by the reporter which is not provided label Jun 10, 2021
@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jul 11, 2021
@wrznr wrznr requested review from bertsky, kba and stweil July 15, 2021 11:55
@wrznr wrznr added the pinned Eternal issues which are save from becoming stale label Jul 15, 2021
@stale stale bot removed stale Issues which require input by the reporter which is not provided labels Jul 15, 2021
@tesseract-ocr tesseract-ocr deleted a comment from stale bot Sep 4, 2021
@tesseract-ocr tesseract-ocr deleted a comment from stale bot Sep 4, 2021
@stweil
Collaborator

stweil commented Sep 4, 2021

Can we merge this pull request, or would anybody suggest additional changes?


@bertsky bertsky left a comment


While I do very much welcome your effort to integrate a decent plotting facility, I have strong reservations about this implementation and about the way this PR is set up technically.

Beyond the comments given inline, I see two general questions:

  1. How do the two old Python scripts relate to the single new one? (There seems to be a lot of duplicated code, and it is hard to compare to the previous version this way.)
  2. How does this relate to the bigger problem of CER not being calculated correctly by lstmtraining/lstmeval and checkpoints only being created at arbitrary intervals?

VALIDATE_CER=9

# Training log file. This should match logfile name from training. Default: $(MODEL_LOG)
MODEL_LOG =../data/$(MODEL_NAME).log

Seeing ../data here, this assumes that the toplevel training used the default DATA_DIR=data, which is overly restrictive. I suggest either rewriting in terms of a DATA_DIR variable as well, or by recursively calling the plot/Makefile with everything exported from the toplevel Makefile.

Comment on lines +54 to +55
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]\.[0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9])_/$${1}00_/' {} \;
@find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.*$(MODEL_NAME)_[0-9]*\.[0-9][0-9]_.*_.*.checkpoint -exec rename -v 's/(.[0-9][0-9])_/$${1}0_/' {} \;

What is rename? It is certainly not POSIX, and the Perl-style s/// syntax used here is not available everywhere (the util-linux rename found on many Linux systems takes entirely different arguments).


At any rate, I would prefer changing lstmtraining.cpp to simply create zero-padded file names in the first place.


Then again, I don't even see the need for zero-padding the filenames at all: you can always sort them correctly via sort -n.
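
For illustration, a numeric sort keyed on the CER field handles the unpadded names directly (the checkpoint names below are made up):

```shell
#!/bin/sh
# Sketch: unpadded checkpoint names sort correctly with a numeric key on
# the CER field (the second "_"-separated field). Names are illustrative.
printf '%s\n' \
  'ocrd_10.25_120_4600.checkpoint' \
  'ocrd_9.5_123_4500.checkpoint' \
  'ocrd_0.875_200_9100.checkpoint' \
  | sort -t_ -k2,2n
# Prints the names in ascending CER order: 0.875, 9.5, 10.25.
```

With this, the rename pass (and the corresponding padding logic in the makefile) would be unnecessary.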

Comment on lines +66 to +71
@grep 'At iteration' $(MODEL_LOG) \
| sed -e '/^Sub/d' \
| sed -e '/^Update/d' \
| sed -e '/^ New worst char/d' \
| sed -e 's/At iteration \([0-9]*\)\/\([0-9]*\)\/.*char train=/\t\t\1\t\2\t\t/' \
| sed -e 's/%, word.*/\t/' >> "$@"

I would strongly recommend generating a CSV/TSV file directly from lstmtraining.cpp instead of trying to parse its output strings here – any slight change there (different wording/formatting or additional lines) would break this.
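
To see both what the recipe extracts and how tightly it is coupled to the exact wording, here is the same pipeline applied to a single fabricated lstmtraining-style progress line (GNU sed is assumed, for \t in the replacement text):

```shell
#!/bin/sh
# Sketch (GNU sed assumed): the quoted recipe's pipeline applied to one
# fabricated lstmtraining progress line.
line='At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=0.882%, word train=1.285%, skip ratio=0.4%, New best char error = 0.882 wrote checkpoint.'

printf '%s\n' "$line" \
  | grep 'At iteration' \
  | sed -e '/^Sub/d' -e '/^Update/d' -e '/^ New worst char/d' \
  | sed -e 's/At iteration \([0-9]*\)\/\([0-9]*\)\/.*char train=/\t\t\1\t\2\t\t/' \
  | sed -e 's/%, word.*/\t/'
# Emits one TSV row: <tab><tab>14615<tab>695400<tab><tab>0.882<tab>
# Any rewording of lstmtraining's progress output silently breaks this.
```

The extraction works, but only for this precise phrasing — which is exactly why emitting the TSV from lstmtraining itself would be more robust.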


# END-EVAL

.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)

These are concrete file targets, marking them as phony is wrong.

FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODEL_NAME)_[0-$(VALIDATE_CER)].[0-9]*_*.traineddata | sort -n -r)

# Build validate log files list based on above traineddata list.
FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

The outer subst is a no-op.

Suggested change
FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))
FAST_LOG_FILES := $(FAST_DATA_FILES:%.traineddata=%.$(VALIDATE_LIST).log)

OMP_THREAD_LIMIT=1 time -p lstmeval \
--verbosity=0 \
--model $< \
--eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

This rule still needs to advertise all its dependencies, especially TMP_VALIDATE_LIST and FAST_DATA_FILES.

# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.
$(TMP_FAST_LOG): $(FAST_LOG_FILES)
@for i in $^; do \
echo Filename : "$$i";echo;cat "$$i"; \

What are the Filename lines used for?

Comment on lines +28 to +32
ytsvfile = "tmp-" + args.model + "-" + args.validatelist + "-iteration.tsv"
ctsvfile = "tmp-" + args.model + "-" + args.validatelist + "-checkpoint.tsv"
etsvfile = "tmp-" + args.model + "-" + args.validatelist + "-eval.tsv"
vtsvfile = "tmp-" + args.model + "-" + args.validatelist + "-validate.tsv"
plotfile = "../data/" + args.model + "/plot/" + args.model + "-" + args.validatelist + "-cer.png"

Hard-coding these values redundantly in Python, instead of passing the fully defined path names from the makefile, is really bad style.

# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.
$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)

This should really be a separate rule with its own target – the concrete plot PNG – which gets passed all the path names of the necessary input files.

Relying on Y_MAX_CER being defined outside the makefile (instead of providing a default) is bad style.

$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)
@cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"
python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)
@rm tmp-$(MODEL_NAME)-$(VALIDATE_LIST)*.*

This should be placed in a separate (phony) rule (like clean), and instead of redefining the filenames implicitly (tmp-*), it should re-use the actual definitions above.

@Shreeshrii
Collaborator Author

Shreeshrii commented Sep 6, 2021 via email

@bertsky
Collaborator

bertsky commented Sep 6, 2021

Thanks for your fast feedback @Shreeshrii. In that case I suggest we close this, and (being very much interested in this feature) I promise I'll revisit it as soon as #261 is out of the way. (I will strive for a close integration with lstmtraining to output data files and with the tesstrain makefile, and then re-use your plotting code.)

@Shreeshrii
Collaborator Author

@bertsky I am closing this. I hope you will add a better implementation soon.

@whisere

whisere commented Feb 3, 2022

I have the same issue as #236 (comment)
also reported in Shreeshrii/tesstrain-ben#1

@Shreeshrii
Collaborator Author

@bertsky is planning to improve the CER output as discussed in the thread above and will redo the plotting feature.

I have added some more hacks to the scripts for my own personal use based on the above feedback. I will look into posting them in a repo and will add a link here as a workaround until the official update.

@whisere

whisere commented Feb 3, 2022

That sounds great! Thank you!

@whisere

whisere commented Feb 3, 2022

Many thanks! I will try that!

@whisere

whisere commented Feb 3, 2022

Just to clarify: am I right that it still doesn't plot eval data for fine-tuning training, only for replace-layer training?

@Shreeshrii
Collaborator Author

The scripts create a TSV from the log file generated during the training process. If tesseract does not write the eval results to the log, the info won't be there and won't get plotted. Also, the eval info does not include the training iteration number; it only has the learning iteration number.

As an alternative I run lstmeval on each of the checkpoints and plot that separately, i.e. lstmeval after training. I have also added accuracy info from the IMPACT Centre's ocreval as well as from the ISRI evaluation tools. Plotting of the accuracy values is not yet implemented.
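
The "lstmeval on each checkpoint" approach can be sketched as a loop over the converted .traineddata files. All paths and file names below are hypothetical, and echo keeps the sketch runnable without tesseract installed — remove it to actually evaluate:

```shell
#!/bin/sh
# Dry-run sketch: evaluate every intermediate .traineddata against one
# fixed validation list. Paths are hypothetical; `echo` makes this a
# dry run -- drop it to really invoke lstmeval.
set -e
mkdir -p demo-fast
: > demo-fast/ocrd_1.5_100_4000.traineddata
: > demo-fast/ocrd_0.9_120_5000.traineddata

for model in demo-fast/*.traineddata; do
  echo lstmeval --verbosity=0 \
    --model "$model" \
    --eval_listfile data/ocrd/list.validate \
    ">${model%.traineddata}.validate.log"
done
```

Each per-checkpoint log then carries the CER for that checkpoint, which is what gets collected into the validation TSV for plotting.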

@whisere

whisere commented Feb 4, 2022

Many thanks for the excellent work! I have tried it, but the script doesn't work with the ground-truth tif and txt pairs we use for fine tuning... it seems to require a model.training_text file and reports:

[01:32:54] INFO - === Starting training for language eng
[01:32:54] INFO - Testing font: Arial Bold
[01:32:54] ERROR - Could not find font named Arial Bold.
Pango suggested font DejaVu Sans Bold.
Please correct --font arg.

[01:32:54] INFO - Program /usr/bin/text2image failed with return code 1. Abort.
[01:32:54] INFO - === Phase I: Generating training images ===
[01:32:54] CRITICAL - Required/expected file 'data/ground-truth/modelname-eval.training_text' does not exist
Makefile:437: recipe for target 'data/gtd/list.eval' failed
make: *** [data/gtd/list.eval] Error 1

with
nohup bash 2-training.sh eng Latin eng modelname FineTune 9999 > data/logs/modelname.LOG &

I will keep investigating.

@Shreeshrii
Collaborator Author

Yes, as my repo name indicates, this is the version for training from fonts and training text. However, the plotting part only depends on the log file from training.

I will upload a different version that works with the existing tesstrain makefile.

@whisere

whisere commented Feb 4, 2022

Amazing! I will try when it is available : )

And the lstmeval commands do work; they give useful CER and WER output for the original make output, although it is not shown in the plot:

nohup lstmeval --model data/modelname.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/modelname-eval-list.log &

nohup lstmeval --model data/eng.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/eng-eval-list.log &
