Query-Efficient Black Box Approximation for OCR

This repository contains the code for training and evaluating the experiments in the submission titled "Document Image Cleaning using Budget-Aware Black-Box Approximation". A large part of the code is derived from Gradient-Approx-to-improve-OCR.

Setup

Create a Python virtual environment and install the required packages using:

pip3 install -r requirements.txt

Datasets

The dataset links are as follows:

Train, Val and Test splits should be extracted and placed in a folder called "data".

Training

An example command to train a preprocessor using the POS dataset is shown below:

python -u train_nn_patch.py --epoch $EPOCH --data_base_path $DATA_PATH --crnn_model $CRNN_MODEL_PATH --exp_base_path $EXP_BASE_PATH --minibatch_subset TopKCER --minibatch_subset_prop 0.95 --inner_limit 1 --inner_limit_skip --cers_ocr_path $CER_JSON_PATH --ocr $OCR

The relevant arguments are explained below:

data_base_path: Path to folder containing train, val and test sets.
crnn_model: Path to pre-trained CRNN model
exp_base_path: Path for saving model checkpoints
minibatch_subset: Selection algorithm used to choose which samples are sent to the OCR. (Random=random, TopKCER=TopKCER, UniformCER=rangeCER)
minibatch_subset_prop: Proportion of samples for which the OCR is not queried. Here, 0.95 means that roughly 95-96% of samples are skipped, so the OCR is queried for only about 4% of samples.
inner_limit: Number of times the images are jittered. If inner_limit_skip is specified, label tracking is enabled and images are not jittered at all.
cers_ocr_path: Path to a JSON file used to initialize the per-sample CERs. E.g. VGG, POS
ocr: OCR engine to use (Tesseract or EasyOCR).
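To make the interaction between minibatch_subset and minibatch_subset_prop more concrete, here is a minimal sketch of how a subset of a minibatch could be picked for OCR queries from per-sample CERs. It is not the repository's implementation: the function name select_for_ocr, the dictionary layout of the CERs, and the exact behaviour of each strategy (in particular reading rangeCER as "evenly spaced picks across the CER-sorted samples") are assumptions made for illustration.

```python
import random
from typing import Dict, List


def select_for_ocr(cers: Dict[str, float], skip_prop: float,
                   strategy: str = "TopKCER", seed: int = 0) -> List[str]:
    """Return the sample ids that would still be sent to the OCR engine.

    cers      : hypothetical mapping of sample id -> last known CER
    skip_prop : fraction of samples for which the OCR is NOT queried
    strategy  : "random", "TopKCER" or "rangeCER" (names taken from the
                README; the behaviour below is an assumption)
    """
    ids = sorted(cers)  # fixed order so results are reproducible
    # Always query at least one sample, even for very high skip proportions.
    n_query = max(1, round(len(ids) * (1.0 - skip_prop)))

    if strategy == "random":
        return random.Random(seed).sample(ids, n_query)

    # Rank samples from highest to lowest CER.
    ranked = sorted(ids, key=lambda sample_id: cers[sample_id], reverse=True)

    if strategy == "TopKCER":
        # Spend the query budget on the samples the OCR currently reads worst.
        return ranked[:n_query]

    if strategy == "rangeCER":
        # Spread the budget evenly over the CER-sorted list (assumed meaning).
        step = len(ranked) / n_query
        return [ranked[int(k * step)] for k in range(n_query)]

    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # Toy CERs for 20 hypothetical images; skipping 75% leaves 5 OCR queries.
    toy_cers = {f"img_{k:02d}": k / 20 for k in range(20)}
    print(select_for_ocr(toy_cers, 0.75, "TopKCER"))
    print(select_for_ocr(toy_cers, 0.75, "rangeCER"))
```

With a skip proportion of 0.95 and a minibatch of 20 samples, this sketch would query the OCR for a single sample per minibatch, which illustrates why high values of minibatch_subset_prop reduce the number of OCR calls so sharply.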

To train a preprocessor with the VGG dataset, use train_nn_area.py with the same arguments as train_nn_patch.py.

An example command to train a CRNN model is shown below:

python -u train_crnn.py --batch_size $BATCH_SIZE --epoch $EPOCH --crnn_model_path $CRNN_MODEL_PATH --dataset vgg --data_base_path $DATA_PATH --ocr EasyOCR

Evaluation

eval_prep.py is used for evaluating a trained preprocessor.

python -u eval_prep.py --prep_path $PREP_PATH --dataset pos --prep_model_name $PREP_MODEL_NAME --data_base_path $DATA_PATH --ocr EasyOCR

prep_path specifies the path to the folder containing the preprocessor checkpoints.
prep_model_name specifies the name of the specific model checkpoint to be evaluated.
dataset specifies the dataset to evaluate on (pos or vgg).
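The README does not state which metric eval_prep.py reports. Since the training arguments refer to per-sample CERs, the character error rate is likely relevant, so the following is a generic sketch of how CER is commonly computed: the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. The function names are hypothetical and this is not the repository's code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate of an OCR prediction against the reference text."""
    if not reference:
        # Convention chosen for this sketch: an empty reference gives 0.0
        # only if the prediction is empty as well.
        return 0.0 if not prediction else 1.0
    return edit_distance(prediction, reference) / len(reference)


if __name__ == "__main__":
    # One substituted character out of five: the digit 0 read instead of O.
    print(cer("T0TAL", "TOTAL"))
```

A lower CER on the preprocessed images than on the raw images would indicate that the preprocessor helps the chosen OCR engine.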

Trained Models

The directory pretrained_models contains trained preprocessors and pretrained CRNN models from some experiments. The preprocessor directory contains models named n_model, where n can be 4, 8 or 100 (indicating the query budget). The models in the preprocessor directory were obtained using the POS dataset and the Tesseract OCR engine.

Pending Items

Trained Models
Add colab link


tataganesh/Query-Efficient-Approx-to-improve-OCR
