# **Donut 🍩 : Document Understanding Transformer**
Donut 🍩 (Document understanding transformer) is a method for document understanding built on an **OCR-free**, end-to-end Transformer model. Donut does not require off-the-shelf OCR engines or APIs, yet it achieves state-of-the-art performance on various visual document understanding tasks, such as visual document classification and information extraction (a.k.a. document parsing). In addition, we present SynthDoG 🐶 (Synthetic Document Generator), which helps make model pre-training flexible across languages and domains.

### **Setting**

In [1]:
!pip install transformers==4.25.1
!pip install pytorch-lightning==1.6.4
!pip install timm==0.5.4
!pip install gradio
!pip install donut-python

Collecting transformers==4.25.1
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0 (from transformers==4.25.1)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.25.1)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.16.4 tokenizers-0.13.3 transformers-4.25.1
Collecting pytorch-lightning==1.6.4
  Downloading pytorch_lightning-1.6.4-py3-none-any.

Collecting timm==0.5.4
  Downloading timm-0.5.4-py3-none-any.whl (431 kB)
Installing collected packages: timm
Successfully installed timm-0.5.4
Collecting gradio
  Downloading gradio-3.40.1-py3-none-any.whl (20.0 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.101.1-py3-none-any.whl (65 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.4.0 (from gradio)
  Downloading gradio_client-0.4.0-py3-none-any.w

In [2]:
import argparse
import gradio as gr
import torch
from PIL import Image

from donut import DonutModel

In [3]:
!unzip /content/DocumentAI_OCR.zip

Archive:  /content/DocumentAI_OCR.zip
   creating: DocumentAI_OCR/
   creating: DocumentAI_OCR/OMG/
  inflating: DocumentAI_OCR/.DS_Store  
  inflating: __MACOSX/DocumentAI_OCR/._.DS_Store  
   creating: DocumentAI_OCR/IVE/
   creating: DocumentAI_OCR/OCR/
   creating: DocumentAI_OCR/Doc/
  inflating: DocumentAI_OCR/OMG/OMG.png  
  inflating: DocumentAI_OCR/OMG/masked1.png  
  inflating: DocumentAI_OCR/OMG/masked0.png  
  inflating: DocumentAI_OCR/OMG/masked2.png  
  inflating: DocumentAI_OCR/OMG/masked3.png  
  inflating: DocumentAI_OCR/IVE/IVE.jpeg  
  inflating: DocumentAI_OCR/IVE/masked1.png  
  inflating: DocumentAI_OCR/IVE/masked0.png  
  inflating: DocumentAI_OCR/IVE/masked2.png  
  inflating: DocumentAI_OCR/OCR/Bumblebee.jpg  
  inflating: __MACOSX/DocumentAI_OCR/OCR/._Bumblebee.jpg  
  inflating: DocumentAI_OCR/OCR/Busanhaeng.jpg  
  inflating: __MACOSX/DocumentAI_OCR/OCR/._Busanhaeng.jpg  
  inflating: DocumentAI_OCR/Doc/.DS_Store  
  inflating: __MACOSX/DocumentAI_OCR/Doc/._

In [4]:
def demo_process_vqa(input_img, question):
    """Answer `question` about a document image (Gradio passes a NumPy array)."""
    global pretrained_model, task_prompt, task_name
    input_img = Image.fromarray(input_img)
    # Fill the question into the DocVQA prompt template.
    user_prompt = task_prompt.replace("{user_input}", question)
    output = pretrained_model.inference(input_img, prompt=user_prompt)["predictions"][0]
    return output


def demo_process(input_img):
    """Run the current task (e.g. classification) on a document image."""
    global pretrained_model, task_prompt, task_name
    input_img = Image.fromarray(input_img)
    output = pretrained_model.inference(image=input_img, prompt=task_prompt)["predictions"][0]
    return output
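
The prompt passed to `inference` depends on the task: DocVQA wraps the user's question in tags, while other tasks use a single start token. A minimal sketch of that logic follows. `build_prompt` is a hypothetical helper name, not part of the `donut` package, and the templates are the ones this notebook assigns to `task_prompt`.

```python
def build_prompt(task_name, question=None):
    """Return the decoder prompt for a Donut task (hypothetical helper)."""
    if task_name == "docvqa":
        if question is None:
            raise ValueError("docvqa needs a question")
        template = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
        return template.replace("{user_input}", question)
    # Every other task is started by a single task token.
    return f"<s_{task_name}>"


print(build_prompt("rvlcdip"))
print(build_prompt("docvqa", "What is the title?"))
```

Keeping the template in one place makes it harder for the classification and VQA cells to drift apart.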

### **Document Classification**

In [5]:
!unzip DocumentAI_OCR.zip

Archive:  DocumentAI_OCR.zip
replace DocumentAI_OCR/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: DocumentAI_OCR/.DS_Store  
replace __MACOSX/DocumentAI_OCR/._.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace DocumentAI_OCR/OMG/OMG.png? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [6]:
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, default="rvlcdip")
parser.add_argument("--pretrained_path", type=str, default="naver-clova-ix/donut-base-finetuned-rvlcdip")
args, left_argv = parser.parse_known_args()

task_name = args.task
if "docvqa" == task_name:
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
else:
    task_prompt = f"<s_{task_name}>"

pretrained_model = DonutModel.from_pretrained(args.pretrained_path)

if torch.cuda.is_available():
    pretrained_model.half()
    device = torch.device("cuda")
    pretrained_model.to(device)
else:
    pretrained_model.encoder.to(torch.bfloat16)

pretrained_model.eval()

demo = gr.Interface(
    fn=demo_process_vqa if task_name == "docvqa" else demo_process,
    inputs=["image", "text"] if task_name == "docvqa" else "image",
    outputs="json",
    title=f"Donut 🍩 demonstration for `{task_name}` task",
)
demo.launch()


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





### **Document VQA**

In [7]:
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, default="docvqa")
parser.add_argument("--pretrained_path", type=str, default="naver-clova-ix/donut-base-finetuned-docvqa")
args, left_argv = parser.parse_known_args()

task_name = args.task
if "docvqa" == task_name:
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
else:
    task_prompt = f"<s_{task_name}>"

pretrained_model = DonutModel.from_pretrained(args.pretrained_path)

if torch.cuda.is_available():
    pretrained_model.half()
    device = torch.device("cuda")
    pretrained_model.to(device)
else:
    pretrained_model.encoder.to(torch.bfloat16)

pretrained_model.eval()

demo = gr.Interface(
    fn=demo_process_vqa if task_name == "docvqa" else demo_process,
    inputs=["image", "text"] if task_name == "docvqa" else "image",
    outputs="json",
    title=f"Donut 🍩 demonstration for `{task_name}` task",
)
demo.launch()


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





# **DiffSTE : Diffusion models for Scene Text Editing**
DiffSTE edits scene text into different font styles and colors following a given text instruction. Specifically, we propose to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. We then use an instruction tuning framework to train our model to learn the mapping from the text instruction to the corresponding image, with either the specified style or the style of the surrounding text in the background. This training method also gives our model zero-shot generalization to three scenarios: generating text with unseen font variations (e.g. italic and bold), mixing different fonts to construct a new font, and using more relaxed forms of natural language as instructions to guide generation.



### **Setting**

In [2]:
!unzip /content/DocumentAI_OCR.zip

Archive:  /content/DocumentAI_OCR.zip
replace DocumentAI_OCR/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: DocumentAI_OCR/.DS_Store  
  inflating: __MACOSX/DocumentAI_OCR/._.DS_Store  
  inflating: DocumentAI_OCR/OMG/OMG.png  
  inflating: DocumentAI_OCR/OMG/masked1.png  
  inflating: DocumentAI_OCR/OMG/masked0.png  
  inflating: DocumentAI_OCR/OMG/masked2.png  
  inflating: DocumentAI_OCR/OMG/masked3.png  
  inflating: DocumentAI_OCR/IVE/IVE.jpeg  
  inflating: DocumentAI_OCR/IVE/masked1.png  
  inflating: DocumentAI_OCR/IVE/masked0.png  
  inflating: DocumentAI_OCR/IVE/masked2.png  
  inflating: DocumentAI_OCR/OCR/Bumblebee.jpg  
  inflating: __MACOSX/DocumentAI_OCR/OCR/._Bumblebee.jpg  
  inflating: DocumentAI_OCR/OCR/Busanhaeng.jpg  
  inflating: __MACOSX/DocumentAI_OCR/OCR/._Busanhaeng.jpg  
  inflating: DocumentAI_OCR/Doc/.DS_Store  
  inflating: __MACOSX/DocumentAI_OCR/Doc/._.DS_Store  
  inflating: DocumentAI_OCR/Doc/PRML.png  
  inflating: __MACOSX/DocumentAI

In [3]:
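# Resize the image to 256x256 first, e.g. with https://www.iloveimg.com/ko/resize-image/resize-jpg
# Read the mask rectangle off the image, e.g. with http://maschek.hu/imagemap/imgmap/
# Format: "x1,y1,x2,y2" (top-left and bottom-right corners, in pixels)
coord = "117,108,233,144"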
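
Because the coordinates are copied by hand from an external tool, a quick sanity check can catch a swapped or out-of-range value before the mask is drawn. The sketch below is one way to do that. `parse_coord` is a hypothetical helper, and the 256x256 default is taken from the resize note above.

```python
def parse_coord(coord, width=256, height=256):
    """Parse "x1,y1,x2,y2" and check it fits inside a width x height image."""
    parts = [int(p) for p in coord.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated integers")
    x1, y1, x2, y2 = parts
    # Both corners must lie inside the image, top-left before bottom-right.
    if not (0 <= x1 < x2 < width and 0 <= y1 < y2 < height):
        raise ValueError(f"rectangle {parts} is outside the image or inverted")
    return x1, y1, x2, y2


x1, y1, x2, y2 = parse_coord("117,108,233,144")
print(x2 - x1, y2 - y1)  # rectangle width and height in pixels
```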

In [4]:
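# Character masking: draw a filled white rectangle on a black canvas
import cv2
import numpy as np

img = '/content/DocumentAI_OCR/OMG/OMG.png'
image = cv2.imread(img)
filename = img.split('/')[-1]  # file name without the directory part

# coord holds two corner points: "x1,y1,x2,y2"
x1, y1, x2, y2 = map(int, coord.split(','))

# Single-channel mask with the same height and width as the input image
mask = np.zeros((image.shape[0], image.shape[1]), dtype="uint8")
cv2.rectangle(mask, (x1, y1), (x2, y2), 255, -1)  # thickness -1 fills the rectangle
cv2.imwrite('/content/masked_testimg.png', mask)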

True

### **Scene Text Editing**

In [5]:
!git clone https://github.com/UCSB-NLP-Chang/DiffSTE.git
%cd DiffSTE

Cloning into 'DiffSTE'...
remote: Enumerating objects: 251, done.
remote: Counting objects: 100% (251/251), done.
remote: Compressing objects: 100% (196/196), done.
remote: Total 251 (delta 52), reused 246 (delta 47), pack-reused 0
Receiving objects: 100% (251/251), 6.57 MiB | 22.74 MiB/s, done.
Resolving deltas: 100% (52/52), done.
/content/DiffSTE


In [None]:
# Before installing, edit requirements.txt so that it pins flax==0.7.2

In [6]:
!curl -sSL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
!bash ./miniconda.sh -bfp /usr/local
!conda --version
!conda create --name DiffSTE python=3.8 -y

PREFIX=/usr/local
Unpacking payload ...
                                                                                      
Installing base environment...


Downloading and Extracting Packages

Preparing transaction: done
Executing transaction: done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /usr/local
conda 23.5.2
Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | 

In [7]:
# pretrained model download
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f" -O diffste.ckpt && rm -rf ~/cookies.txt

--2023-08-17 06:39:56--  https://docs.google.com/uc?export=download&confirm=t&id=1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f
Resolving docs.google.com (docs.google.com)... 209.85.200.139, 209.85.200.101, 209.85.200.113, ...
Connecting to docs.google.com (docs.google.com)|209.85.200.139|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-5o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vfsvja3vfamndghar9o83c2o3vfcuaq5/1692254325000/04178247694259295951/*/1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f?e=download&uuid=0737708e-797a-41cc-821a-355e5301f4f5 [following]
--2023-08-17 06:39:56--  https://doc-0o-5o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vfsvja3vfamndghar9o83c2o3vfcuaq5/1692254325000/04178247694259295951/*/1fc0RKGWo6MPSJIZNIA_UweTOPai64S9f?e=download&uuid=0737708e-797a-41cc-821a-355e5301f4f5
Resolving doc-0o-5o-docs.googleusercontent.com (doc-0o-5o-docs.googleusercontent.com)... 173.194.196.132,

In [8]:
%%bash
# style-free generation (%%bash has to be the first line of the cell)
source activate DiffSTE
pip install -r requirements.txt
pip install jax --upgrade
pip install triton

python generate.py \
    --ckpt_path /content/DiffSTE/diffste.ckpt \
    --in_image /content/DocumentAI_OCR/OMG/OMG.png \
    --in_mask /content/DocumentAI_OCR/OMG/masked0.png \
    --text Q \
    --out_dir /content/

Collecting accelerate==0.16.0 (from -r requirements.txt (line 1))
  Downloading accelerate-0.16.0-py3-none-any.whl (199 kB)
Collecting datasets==2.9.0 (from -r requirements.txt (line 2))
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting editdistance==0.6.2 (from -r requirements.txt (line 3))
  Downloading editdistance-0.6.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)
Collecting einops==0.6.0 (from -r requirements.txt (line 4))
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
Collecting flax==0.7.2 (from -r requirements.txt (line 5))
  Obtaining dependency information for flax==0.7.2 from https://files.pythonhosted.org/packages/7

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

In [9]:
%%bash
# style-conditional generation (%%bash has to be the first line of the cell)
source activate DiffSTE
pip install -r requirements.txt
pip install jax --upgrade

python generate.py \
    --ckpt_path /content/DiffSTE/diffste.ckpt \
    --in_image /content/DocumentAI_OCR/OMG/OMG.png \
    --in_mask /content/DocumentAI_OCR/OMG/masked3.png \
    --text QnA \
    --font Caprasimo \
    --color white \
    --out_dir /content/

Collecting jax==0.4.7 (from -r requirements.txt (line 9))
  Using cached jax-0.4.7-py3-none-any.whl
Installing collected packages: jax
  Attempting uninstall: jax
    Found existing installation: jax 0.4.13
    Uninstalling jax-0.4.13:
      Successfully uninstalled jax-0.4.13
Successfully installed jax-0.4.7
Collecting jax
  Using cached jax-0.4.13-py3-none-any.whl
Installing collected packages: jax
  Attempting uninstall: jax
    Found existing installation: jax 0.4.7
    Uninstalling jax-0.4.7:
      Successfully uninstalled jax-0.4.7
Successfully installed jax-0.4.13
Initialize vae from finetuned...
Initialize CharTokenizer
Initilize char embedder...
AutoencoderKL has 83.65 M params.
UNet2DMultiConditionModel has 1108.40 M params.
CLIPTextModel has 123.06 M params.
CharEmbedder has 0.00 M params.


No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
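
The two `generate.py` invocations above differ only in the optional `--font` and `--color` flags. When several masks or texts need to be tried, the argument list can be assembled in Python. The sketch below uses only the flags that appear in the cells above. `build_generate_cmd` is a hypothetical helper, not part of the DiffSTE repository.

```python
import shlex


def build_generate_cmd(in_image, in_mask, text, out_dir,
                       ckpt_path="/content/DiffSTE/diffste.ckpt",
                       font=None, color=None):
    """Assemble the generate.py argument list (hypothetical helper)."""
    cmd = ["python", "generate.py",
           "--ckpt_path", ckpt_path,
           "--in_image", in_image,
           "--in_mask", in_mask,
           "--text", text]
    # Style flags are optional; omitting both gives style-free generation.
    if font is not None:
        cmd += ["--font", font]
    if color is not None:
        cmd += ["--color", color]
    cmd += ["--out_dir", out_dir]
    return cmd


cmd = build_generate_cmd(
    "/content/DocumentAI_OCR/OMG/OMG.png",
    "/content/DocumentAI_OCR/OMG/masked3.png",
    "QnA", "/content/", font="Caprasimo", color="white",
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, assuming the process runs inside the DiffSTE conda environment and from the repository directory.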


# **EasyOCR**
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including: Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
- Open-source text detection and recognition model
- Supported languages: https://www.jaided.ai/easyocr/

### **Setting**

In [1]:
!pip install easyocr

In [2]:
!unzip DocumentAI_OCR.zip


### **Detection & Crop**

In [12]:
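def crop_box(ocr):
    """Return (left, top, right, bottom) of the tallest detected text box.

    `ocr` is the list returned by `readtext`; each item starts with four
    corner points ordered top-left, top-right, bottom-right, bottom-left.
    """
    max_diff = 0
    max_row = None
    for i in ocr:
        diff = i[0][2][1] - i[0][0][1]    # box height
        x_diff = i[0][2][0] - i[0][0][0]  # box width
        if diff > max_diff and x_diff > 0:
            max_diff = diff
            max_row = i[0]
    if max_row is None:
        raise ValueError("no text box detected")
    # Top-left and bottom-right corners, in the order PIL's Image.crop expects
    box = tuple(max_row[0]) + tuple(max_row[2])
    return box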
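
To see what `crop_box` selects, the sketch below applies a simplified version of the same rule (it skips the width check) to hand-written detections in the `(box, text, confidence)` layout that `readtext` returns. The sample values are made up for illustration, and `tallest_box` is a hypothetical stand-in so the block runs on its own.

```python
def tallest_box(ocr):
    """Keep the detection with the greatest height, as crop_box does."""
    best = max(ocr, key=lambda item: item[0][2][1] - item[0][0][1])
    top_left, bottom_right = best[0][0], best[0][2]
    return tuple(top_left) + tuple(bottom_right)


# Made-up detections: four corners (tl, tr, br, bl), text, confidence
sample = [
    ([[10, 5], [110, 5], [110, 45], [10, 45]], "BUMBLEBEE", 0.98),
    ([[12, 60], [90, 60], [90, 75], [12, 75]], "in theaters", 0.87),
]
print(tallest_box(sample))
```

The first detection is 40 pixels tall and the second 15, so the first box is the one returned for cropping.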

In [13]:
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

def img_show(img):
    img = np.array(img)
    plt.imshow(img)

### **English OCR**

In [14]:
import easyocr

en_imgpath = "/content/DocumentAI_OCR/OCR/Bumblebee.jpg"
en_img = Image.open(en_imgpath)
img_show(en_img)

ModuleNotFoundError: ignored

In [None]:
en_reader = easyocr.Reader(['en'])
en_ocr = en_reader.readtext(en_imgpath)
box = crop_box(en_ocr)
text = en_img.crop(box)
img_show(text)

In [None]:
# recognition
print(en_ocr)
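
`print(en_ocr)` dumps the raw list of `(box, text, confidence)` tuples, which is hard to read for long results. A small filter can keep only the recognized strings above a confidence threshold. In the sketch below, `summarize`, the 0.5 default, and the sample detections are all hypothetical choices for illustration.

```python
def summarize(ocr, min_conf=0.5):
    """Return recognized strings whose confidence is at least min_conf."""
    return [text for _box, text, conf in ocr if conf >= min_conf]


# Made-up detections in the (box, text, confidence) layout
sample = [
    ([[10, 5], [110, 5], [110, 45], [10, 45]], "HELLO", 0.98),
    ([[12, 60], [90, 60], [90, 75], [12, 75]], "w0rld", 0.31),
]
for line in summarize(sample):
    print(line)
```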

### **Korean OCR**

In [None]:
ko_imgpath = "/content/DocumentAI_OCR/OCR/Busanhaeng.jpg"
ko_img = Image.open(ko_imgpath)
img_show(ko_img)

In [None]:
ko_reader = easyocr.Reader(['ko'])
ko_ocr = ko_reader.readtext(ko_imgpath)
box = crop_box(ko_ocr)
text = ko_img.crop(box)
img_show(text)

In [None]:
# recognition
print(ko_ocr)