
Plotting – again #377

Merged
merged 30 commits into from
Apr 10, 2024

Conversation

bertsky
Collaborator

@bertsky bertsky commented Mar 4, 2024

After several attempts by @Shreeshrii to share her excellent plotting scripts, each of which was unfortunately thwarted by bad circumstances (other big changes occurring at the same time), here comes a plotting facility again.

I based this on the ocrddata branch of her fork, cherry-picking only the two relevant changesets, resolving conflicts and then refactoring to make this better fit our makefileization.

Usage is simply make plot, which will only work after make training. (I could also make this dependency explicit, but that would cause make plot to start the training if it did not happen already for that combination of variables.)

The output files will be created in $OUTPUT_DIR/$MODEL_NAME.plot_log.png, e.g.
[example image: herrnhut-kurrent tess finetuned-htrbin plot_log]

and $OUTPUT_DIR/$MODEL_NAME.plot_cer.png, e.g.
[example image: herrnhut-kurrent tess finetuned-htrbin plot_cer]

All intermediate files (except for the lstmeval log files generated under $OUTPUT_DIR/eval/*.log because they are valuable in their own right) are marked as such and therefore removed by make.

Perhaps we should discuss how both plots could be combined into a single one (which is probably what @Shreeshrii tried to do already) – I can see that there's a problem with the granularity at which these data points are recorded (training iterations for validation during lstmtraining vs. learning iterations for validation afterwards via external lstmeval). But IIUC we have everything it takes to combine them (twin y plot with synced x axes)...
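As a rough illustration of the twin-axis idea (purely hypothetical data points and labels, not the actual plot_cer.py/plot_log.py code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: CER at training iterations (as parsed from the
# lstmtraining log) and at checkpoint evaluations (from lstmeval logs).
train_iters = [100, 200, 300, 400, 500]
train_cer = [12.0, 9.5, 7.8, 6.9, 6.2]
eval_iters = [200, 400]
eval_cer = [10.1, 7.5]

fig, ax1 = plt.subplots()
ax1.plot(train_iters, train_cer, color="tab:blue", label="CER during training")
ax1.set_xlabel("iterations")
ax1.set_ylabel("CER [%] (lstmtraining)", color="tab:blue")
ax1.set_ylim(bottom=0)  # never cut off ranges from the y scale

# Second y axis sharing the same (synced) x axis
ax2 = ax1.twinx()
ax2.plot(eval_iters, eval_cer, "o--", color="tab:red", label="CER of checkpoints")
ax2.set_ylabel("CER [%] (lstmeval)", color="tab:red")
ax2.set_ylim(bottom=0)

fig.savefig("combined_plot.png")
```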

bertsky and others added 8 commits March 3, 2024 23:33
- avoid creating extra subdirectories (would necessitate mkdir and extra deps)
- produce all necessary outputs in OUTPUT_DIR
- make temporary files .INTERMEDIATE (so they will be removed automatically afterwards)
- define file names close to where they are actually used and in the correct order
  (no need for the user to see/customise them)
- use fixed name for training log file and plot file (derived from OUTPUT_DIR and MODEL_NAME)
- simplify extraction pipelines, make dependencies explicit
- make checkpoint evaluations directly dependent on checkpoint models
- provide phony (aliased aggregate) target lstmeval for these
- provide phony (aliased aggregate) target plot for the actual plots
- never cut off ranges from y scale (CER)
- make naming and coloring more consistent
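The .INTERMEDIATE and phony-alias pattern from the list above could look roughly like this (an illustrative sketch with made-up prerequisites, not the actual Makefile):

```make
# Sketch only: file names follow the PR description, rule details are invented.
.PHONY: plot lstmeval

# Phony aggregate targets aliasing the real output files
plot: $(OUTPUT_DIR)/$(MODEL_NAME).plot_log.png $(OUTPUT_DIR)/$(MODEL_NAME).plot_cer.png

# Marking extracted TSV files as .INTERMEDIATE lets make remove them
# automatically once the plots have been produced.
.INTERMEDIATE: $(OUTPUT_DIR)/$(MODEL_NAME).iteration.tsv
```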
@bertsky
Collaborator Author

bertsky commented Mar 5, 2024

With the last commit I re-instated @Shreeshrii's LOG_FILE variable.

The big advantage is that you can thus opt in to plotting even older logs, e.g.

make plot LOG_FILE=nohup.out

@zdenop
Contributor

zdenop commented Mar 7, 2024

I just made a quick test on openSUSE (15.5), and here are a few suggestions:

  • it would be nice to have a short example of how to create a plot from example data:
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
mkdir data
unzip ocrd-testset.zip -d data/ocrd-ground-truth
...
# install needed requirements
...
nohup make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 > plot/TESSTRAIN.LOG &
make plot MODEL_NAME=ocrd
  • I removed python2 from openSUSE and I got this error:
python plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv
/bin/bash: python: command not found

What about using PY_CMD (as rest of Makefile)?

  • When I manually ran python3 plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv I got this error:
Traceback (most recent call last):
  File "/home/podobny/Projekty/tesstrain/plot_cer.py", line 6, in <module>
    import matplotlib
ModuleNotFoundError: No module named 'matplotlib'

It would be good to mention that the user should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.
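The PY_CMD point above might look like this in the recipe (a hypothetical rule sketch, assuming a PY_CMD variable as used in the rest of the Makefile):

```make
PY_CMD ?= python3

# Hypothetical plotting rule using the interpreter variable instead of
# a hard-coded `python` (prerequisites elided):
$(OUTPUT_DIR)/$(MODEL_NAME).plot_cer.png: plot_cer.py
	$(PY_CMD) plot_cer.py ...
```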

@stweil
Collaborator

stweil commented Mar 7, 2024

It would be good to mention that user should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.

Both are mentioned in requirements.txt, so running pip3 install -r requirements.txt is sufficient.

@bertsky
Collaborator Author

bertsky commented Mar 7, 2024

@zdenop very good points – thanks!
I'll address these in a few follow-up commits.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Collaborator

@stweil stweil left a comment

@bertsky, some of the commits could already be applied to the main branch. Would it be okay if I cherry-pick them? You (or I) would have to rebase your plotting branch after that. I think we can improve plotting faster by getting it integrated like that.

bertsky and others added 2 commits March 7, 2024 23:20
@bertsky bertsky requested a review from stweil March 7, 2024 22:32
@bertsky
Collaborator Author

bertsky commented Mar 20, 2024

So?

@zdenop
Contributor

zdenop commented Mar 25, 2024

@bertsky: please reformat the Python scripts with blue (or black); there are plenty of formal formatting issues (missing spaces).

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.
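(For reference, E741 flags single-letter names like `l`; a generic sketch of the kind of rename that silences it, not the actual plot_cer.py code:)

```python
lines = ["alpha", "beta"]

# Flagged by ruff (E741): `l` reads too much like `1` in many fonts
# total = sum(len(l) for l in lines)

# Same logic after renaming, which ruff accepts:
total = sum(len(line) for line in lines)
```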

@bertsky
Collaborator Author

bertsky commented Mar 30, 2024

In light of tesseract-ocr/tesseract#3763 (comment) I tend to prefer changing from fast to best models for evaluation.

@bertsky
Collaborator Author

bertsky commented Apr 5, 2024

@zdenop

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.

I fail to see the ambiguity. The manual says they think l can be easily confused with 1. I really don't care for such subjective stances.

please reformat the Python scripts with blue (or black) there is plenty of formal formatting issues (missing spaces).

Not my code originally, so I don't care. But I also don't believe in code stylers. Since @stweil seems to be an avid user of these, I don't think my help is needed in that respect.

I'll resolve the conflict on the readme arising from your concurrent edits in the plotting section, then I'll be done here, I think.

@@ -126,19 +128,21 @@ help:
@echo " unicharset Create unicharset"
@echo " charfreq Show character histogram"
@echo " lists Create lists of lstmf filenames for training and eval"
@echo " training Start training"
@echo " training Start training (i.e. create .checkpoint files)"
Collaborator

Suggested change
@echo " training Start training (i.e. create .checkpoint files)"
@echo " training Start training"

Collaborator

I'd prefer a less technical description here and avoid details which might change any time.

Collaborator Author

Like I said above, the description of traineddata already mentions .checkpoint files, too. And so does evaluation. So as a matter of clarity and consistency, it should be mentioned here as well.

Since this entire repo is just a small makefile wrapper around the actual training tools, which are only properly documented in tessdoc along these technical details, I prefer keeping it the way it is.

@echo " traineddata Create best and fast .traineddata files from each .checkpoint file"
@echo " proto-model Build the proto model"
@echo " tesseract-langdata Download stock unicharsets"
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval"
Collaborator

Suggested change
@echo " evaluation Evaluate .checkpoint models on eval dataset via lstmeval"
@echo " evaluation Evaluate intermediate models on evaluation dataset"

Collaborator

Use help text with less technical details

Collaborator Author

see above.

Collaborator

@stweil stweil Apr 8, 2024

Isn't your help text even wrong? I thought that lstmeval runs on .traineddata files, not on .checkpoint files.

If you agree, I'd commit my suggestions and apply your pull request.

Collaborator Author

No, I don't agree, hence my comment above.

Collaborator Author

Isn't your help text even wrong? I thought that lstmeval runs on .traineddata files, not on .checkpoint files.

No, it's running on .traineddata extracted from .checkpoint files. The help texts only mention the former when relevant to the user (traineddata target).
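The extraction step referred to here is presumably lstmtraining's --stop_training mode; roughly like this (all paths are placeholders for illustration):

```shell
# Convert an intermediate checkpoint into a standalone .traineddata
# that lstmeval can load (placeholder paths):
lstmtraining --stop_training \
  --continue_from data/ocrd/checkpoints/ocrd_1.234_5678_9000.checkpoint \
  --traineddata data/ocrd/ocrd.traineddata \
  --model_output data/ocrd/tessdata_best/ocrd_1.234_5678_9000.traineddata
```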

@echo " TESSDATA_REPO Tesseract model repo to use (_fast or _best). Default: $(TESSDATA_REPO)"
@echo " MAX_ITERATIONS Max iterations. Default: $(MAX_ITERATIONS)"
@echo " EPOCHS Set max iterations based on the number of lines for the training. Default: none"
@echo " DEBUG_INTERVAL Debug Interval. Default: $(DEBUG_INTERVAL)"
@echo " LEARNING_RATE Learning rate. Default: $(LEARNING_RATE)"
@echo " NET_SPEC Network specification. Default: $(NET_SPEC)"
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)"
Collaborator

Suggested change
@echo " NET_SPEC Network specification (in VGSL) for new model from scratch. Default: $(NET_SPEC)"
@echo " NET_SPEC Network specification (in VGSL), only used for training without START_MODEL. Default: $(NET_SPEC)"

Collaborator Author

I found it more helpful if the relevant terms fine-tune and scratch were already mentioned in their respective variable's short help string. (Since START_MODEL already mentions the former, NET_SPEC now mentions the latter.) Also, scratch already precludes a start model by definition.

@echo " LANG_TYPE Language Type - Indic, RTL or blank. Default: '$(LANG_TYPE)'"
@echo " PSM Page segmentation mode. Default: $(PSM)"
@echo " RANDOM_SEED Random seed for shuffling of the training data. Default: $(RANDOM_SEED)"
@echo " RATIO_TRAIN Ratio of train / eval training data. Default: $(RATIO_TRAIN)"
@echo " TARGET_ERROR_RATE Default Target Error Rate. Default: $(TARGET_ERROR_RATE)"
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)"
Collaborator

Suggested change
@echo " LOG_FILE File to copy training output to and read plot figures from. Default: $(LOG_FILE)"
@echo " LOG_FILE File copy of the training protocol (also used for plotting). Default: $(LOG_FILE)"

Collaborator Author

I don't like your formulation. LOG_FILE is not a protocol of the training procedure (like Tensorboard), it's just the output of the training recipe.

Collaborator

If you prefer "log" instead of "protocol", that would be fine for me, too. The output of the training is much more than the printed messages: the most important output is the new model file.

Collaborator Author

A LOG_FILE is a log file and not a result file.

Collaborator

"Log file of training process (also used as input for plotting)" would also be fine for me.

Comment on lines +215 to +216
Plotting can even be done while training is still running, and will depict the training status
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.)
Collaborator

Suggested change
Plotting can even be done while training is still running, and will depict the training status
up to that point. (It can be rerun any time the `LOG_FILE` has changed or new checkpoints written.)
Plotting can even be done while training is still running, and will show the training status
up to that point. (It can be re-run any time the `LOG_FILE` is changed or new checkpoints are written.)

Collaborator Author

I like "depict" better than "show" here, and "has changed" better than "is changed".

Collaborator

DeepL and I disagree.

Collaborator Author

DeepL does not know anything about tesstrain, and I am pretty confident in my assessment of what the English language allows me to express.

Besides, IMO it should not be your role as a maintainer to micromanage individual contributions to the letter. This is becoming extremely tedious – I thought this addition was long overdue and would be welcomed.

Collaborator

@stweil stweil left a comment

@bertsky, I still have several change requests for the help text in Makefile. Those changes then must be made in README.md as well.

@stweil stweil merged commit 4770fee into tesseract-ocr:main Apr 10, 2024