![title](./pics/dd_logo.png) 

# Getting started

**deep**doctection is a package for extracting text from documents with complex structure. These can be native PDFs or scans. In contrast to many text miners, **deep**doctection uses deep learning models from powerful third-party libraries to solve OCR, vision and language embedding problems.

This notebook will give you a quick tour so that you can get started straight away. 

If you are running this notebook on Colab and have not installed the package yourself yet, simply run the following cells:

In [1]:
pip install "dataflow @ git+https://github.com/tensorpack/dataflow.git"

In [3]:
pip install deepdoctection

In [None]:
#!apt-get install -y tesseract-ocr tesseract-ocr-deu
#!apt-get install poppler-utils
#!pip install -e git+https://github.com/deepdoctection/deepdoctection.git#egg=deepdoctection[source-pt]

In [None]:
import cv2
from pathlib import Path
from matplotlib import pyplot as plt
from IPython.display import HTML  # public import path for the HTML display class

import deepdoctection as dd

## Sample

Take an image (e.g. .png, .jpg, ...). If you use the example below, you may need to change ```image_path```.

In [None]:
image_path = Path.cwd() / "pics/samples/sample_2/sample_2.png"
image = cv2.imread(image_path.as_posix())
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](./pics/samples/sample_2/sample_2.png)

## Analyzer

Next, we instantiate the **deep**doctection analyzer, a built-in pipeline that you can use out of the box. It is one example of a pipeline; others can be assembled depending on the problem you want to tackle. This particular pipeline is composed of the building blocks shown in the diagram.

There is a lot going on under the hood. The analyzer calls three object detectors to structure the page and an OCR engine to extract the text. This alone is not enough, however: words also have to be mapped to layout structures, and a reading order has to be inferred.

![title](./pics/dd_pipeline.png)  

In [None]:
analyzer = dd.get_dd_analyzer(language='deu')

The sample is in German. Passing `language='deu'` selects a Tesseract model trained on a German corpus, which gives much better results than the default English model.

## Analyze methods

Once all models have been loaded, we can process single pages or whole documents. Set `path=path/to/dir` if you have a folder of scans, or `path=path/to/my/doc.pdf` if you have a single PDF document.

In [None]:
path = Path.cwd() / "pics/samples/sample_2"

df = analyzer.analyze(path=path)
df.reset_state()  # This method must be called just before starting the iteration. It is part of the API.

When you run the cell, you will see that not much has happened yet. The reason is that `analyze` behaves like a generator: nothing is processed until we start iterating with a `for` loop or `next`.

In [None]:
doc = iter(df)
page = next(doc)
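
The lazy behaviour described above can be illustrated without loading any models. The following is a minimal sketch in plain Python, not **deep**doctection code: `fake_analyze` is a hypothetical stand-in generator, and the file names are made up. It only shows that the work is deferred until the first call to `next`.

```python
from typing import Iterator, List

processed: List[str] = []  # records which pages have actually been processed


def fake_analyze(file_names: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a lazy pipeline: work happens only on iteration."""
    for name in file_names:
        processed.append(name)  # the "expensive" step runs here, one page at a time
        yield f"page from {name}"


pages = fake_analyze(["sample_1.png", "sample_2.png"])
calls_before_iteration = len(processed)  # calling the generator function did no work

first_page = next(pages)  # only now is the first page processed
calls_after_first_next = len(processed)
```

A `for` loop over `pages` would process the remaining files one by one in the same way.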

## Page

Let's see what we got back. We start with some header information about the page. With `get_attribute_names()` you get a list of all attributes. 

In [None]:
page.height, page.width, page.file_name, page.location

In [None]:
page.get_attribute_names()

`page.document_type` returns None. The reason is that this pipeline is not built for document classification. You can easily build a pipeline containing a document classifier, though. Check this [notebook](Using_LayoutLM_for_sequence_classification.ipynb) for further information.

In [None]:
print(page.document_type)

We can visualize the detected segments. If you pass `interactive=True` to `page.viz`, a viewer will pop up. Use + and - to zoom in and out. Use q to close the page.

Alternatively, you can visualize the output with matplotlib.

In [None]:
image = page.viz()
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](./pics/output_16_1.png)

Let's have a look at other attributes. We can use the `text` property to get the content of the document. You will notice that the table is not included. You can therefore filter tables from the other content. In fact you can even filter on every layout segment.

In [None]:
print(page.text)

In [None]:
for layout in page.layouts:
    if layout.category_name == "title":
        print(f"Title: {layout.text}")
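
The same filtering idea extends to all categories at once. The sketch below groups segment texts by category name. It assumes only the two attributes used in the loop above (`category_name` and `text`); `Segment` is a hypothetical stand-in class and the sample data is made up, so the block runs without a pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """Hypothetical stand-in exposing the two attributes used above."""
    category_name: str
    text: str


def group_by_category(layouts: List[Segment]) -> Dict[str, List[str]]:
    """Collect the text of every layout segment under its category name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for layout in layouts:
        grouped[layout.category_name].append(layout.text)
    return dict(grouped)


# Made-up segments for illustration only
layouts = [
    Segment("title", "Annual report"),
    Segment("text", "First paragraph."),
    Segment("text", "Second paragraph."),
]
grouped = group_by_category(layouts)
```

With a real page, you would pass `page.layouts` instead of the made-up list, provided the category names compare as strings as in the loop above.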

Tables are stored in `page.tables`, which is a Python list of table objects. Here, only one table has been detected. Let's have a closer look at it. Most attributes are hopefully self-explanatory. If you `print(page.tables)`, you will get a rather cryptic `__repr__` output.

In [None]:
len(page.tables)

In [None]:
table = page.tables[0]
table.get_attribute_names()

In [None]:
table.number_of_rows, table.number_of_columns

In [None]:
HTML(table.html)

Let's go deeper down the rabbit hole. A `Table` has cells and we can even get the text of one particular cell. Note that the output list is not sorted by row or column. You have to do it yourself.

In [None]:
cell = table.cells[0]
cell.get_attribute_names()

In [None]:
cell.column_number, cell.row_number, cell.text, cell.annotation_id  # every object comes with a unique annotation_id
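
Since the cell list is not sorted, one possible way to bring it into row-major order is sketched below. It relies only on the attributes shown above (`row_number`, `column_number`, `text`); `CellStub` is a hypothetical stand-in and the data is made up. Cells that span several rows or columns are not handled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CellStub:
    """Hypothetical stand-in with the attributes shown above."""
    row_number: int
    column_number: int
    text: str


def cells_to_grid(cells: List[CellStub]) -> List[List[str]]:
    """Sort cells by (row, column) and return their text row by row."""
    ordered = sorted(cells, key=lambda c: (c.row_number, c.column_number))
    grid: List[List[str]] = []
    current_row: Optional[int] = None
    for cell in ordered:
        if cell.row_number != current_row:
            grid.append([])  # a new row starts
            current_row = cell.row_number
        grid[-1].append(cell.text)
    return grid


# Made-up, unsorted cells for illustration only
cells = [
    CellStub(2, 1, "c"),
    CellStub(1, 2, "b"),
    CellStub(2, 2, "d"),
    CellStub(1, 1, "a"),
]
grid = cells_to_grid(cells)
```

With a real table, `table.cells` would take the place of the made-up list.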

We are still not done: each cell holds a list of words from which the text string is generated.

In [None]:
word = cell.words[0]
word.get_attribute_names()

The reading order determines the position of each word in the string. OCR engines generally provide some heuristics to infer a reading order. This library, however, follows the approach of disentangling every processing step.

In [None]:
word.characters, word.reading_order, word.token_class
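
To make the role of the reading order concrete, here is a minimal sketch of how a text string could be assembled from words. It uses the `characters` and `reading_order` attributes shown above; `WordStub` is a hypothetical stand-in, and joining with a single space is an assumption for illustration, not necessarily what the library does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordStub:
    """Hypothetical stand-in with the two attributes shown above."""
    characters: str
    reading_order: int


def words_to_text(words: List[WordStub]) -> str:
    """Order words by reading_order and join them with single spaces (assumed)."""
    ordered = sorted(words, key=lambda w: w.reading_order)
    return " ".join(w.characters for w in ordered)


# Made-up words, deliberately out of order
words = [WordStub("order", 2), WordStub("Reading", 1), WordStub("matters", 3)]
text = words_to_text(words)
```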

The `Page` object is read-only: you can assign a new value to an attribute, but the change will not be persisted.

In [None]:
word.token_class = "ORG"

In [None]:
word #  __repr__ of the base object does carry <WordType.token_class> information.  

You can save your result in a single `.json` file. The default `save` configuration stores the image as a base64-encoded string, so be aware: the `.json` file for this image has a size of 6.2 MB!

In [None]:
page.save()
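
Part of the file size comes from the encoding itself. The following sketch uses only the standard library and placeholder bytes instead of a real image; the key name `image` is hypothetical and not taken from the actual file layout. It shows that base64 turns every 3 bytes into 4 characters, so an embedded image grows by about a third.

```python
import base64
import json

raw_image = bytes(3_000_000)  # placeholder for 3 MB of raw image bytes (all zeros)
encoded = base64.b64encode(raw_image).decode("ascii")

# Hypothetical JSON layout, for illustration only
payload = json.dumps({"file_name": "sample_2.png", "image": encoded})

overhead = len(encoded) / len(raw_image)  # base64 emits 4 characters per 3 input bytes
```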

Once the results are saved, you can easily parse the file back into the `Page` format.

In [None]:
path = Path.cwd() / "pics/samples/sample_2/sample_2.json"

df = dd.SerializerJsonlines.load(path)
page = dd.Page.from_dict(**next(iter(df)))