

# PDF Annotation Import


Supported annotations for PDF assets

*Annotation types*
- Checklist classification (including nested classifications)
- Radio classifications (including nested classifications)
- Free text classifications
- Bounding box
- Entities
- Relationships (only supported for MAL imports)


*NDJson*
- Checklist classification (including nested classifications)
- Radio classifications (including nested classifications)
- Free text classifications
- Bounding box
- Entities
- Relationships (only supported for MAL imports)

### Setup

In [22]:
%pip install -q "labelbox[data]"

In [23]:
import uuid
import json
import requests
import labelbox as lb
import labelbox.types as lb_types

### Replace with your API key
See the guide to creating an API key: https://docs.labelbox.com/docs/create-an-api-key

In [24]:
# Add your API key (never commit a real key to a shared notebook)
API_KEY = "<YOUR_API_KEY>"
client = lb.Client(api_key=API_KEY)

### Supported Annotations

In [25]:
########## Entity ##########

# Annotation Types
entities_annotations = lb_types.ObjectAnnotation(
    name="named_entity",
    value=lb_types.DocumentEntity(
        name="named_entity",
        textSelections=[
            lb_types.DocumentTextSelection(token_ids=[], group_id="", page=1)
        ],
    ),
)

# NDJSON
entities_annotations_ndjson = {
    "name":
        "named_entity",
    "textSelections": [{
        "tokenIds": ["<UUID>",],
        "groupId": "<UUID>",
        "page": 1,
    }],
}

In [26]:
########### Radio Classification #########

# Annotation types
radio_annotation = lb_types.ClassificationAnnotation(
    name="radio_question",
    value=lb_types.Radio(answer=lb_types.ClassificationAnswer(
        name="first_radio_answer")),
)
# NDJSON
radio_annotation_ndjson = {
    "name": "radio_question",
    "answer": {
        "name": "first_radio_answer"
    },
}

In [27]:
############ Checklist Classification ###########

# Annotation types
checklist_annotation = lb_types.ClassificationAnnotation(
    name="checklist_question",
    value=lb_types.Checklist(answer=[
        lb_types.ClassificationAnswer(name="first_checklist_answer"),
        lb_types.ClassificationAnswer(name="second_checklist_answer"),
    ]),
)

# NDJSON
checklist_annotation_ndjson = {
    "name":
        "checklist_question",
    "answer": [
        {
            "name": "first_checklist_answer"
        },
        {
            "name": "second_checklist_answer"
        },
    ],
}

In [28]:
############ Bounding Box ###########

bbox_annotation = lb_types.ObjectAnnotation(
    name="bounding_box",  # must match your ontology feature's name
    value=lb_types.DocumentRectangle(
        start=lb_types.Point(x=102.771, y=135.3),  # x = left, y = top
        end=lb_types.Point(x=518.571,
                           y=245.143),  # x= left + width , y = top + height
        page=0,
        unit=lb_types.RectangleUnit.POINTS,
    ),
)

bbox_annotation_ndjson = {
    "name": "bounding_box",
    "bbox": {
        "top": 135.3,
        "left": 102.771,
        "height": 109.843,
        "width": 415.8
    },
    "page": 0,
    "unit": "POINTS",
}
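The two bounding box formats above describe the same rectangle: the Python type takes the start and end corner points, while NDJSON takes top, left, height, and width. A minimal sketch of the conversion, assuming plain floats in the same unit; `corners_to_bbox` is a hypothetical helper for illustration and not part of the Labelbox SDK.

```python
def corners_to_bbox(start_x, start_y, end_x, end_y, ndigits=3):
    """Convert corner points (left/top and right/bottom) to an NDJSON-style bbox dict."""
    return {
        "top": round(start_y, ndigits),
        "left": round(start_x, ndigits),
        "height": round(end_y - start_y, ndigits),  # end.y = top + height
        "width": round(end_x - start_x, ndigits),   # end.x = left + width
    }


# Corner points taken from bbox_annotation above
bbox = corners_to_bbox(102.771, 135.3, 518.571, 245.143)
print(bbox)
```

Rounding guards against floating-point noise in the subtraction; the number of digits is an arbitrary choice here.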

In [29]:
# ############ global nested classifications ###########

nested_checklist_annotation = lb_types.ClassificationAnnotation(
    name="nested_checklist_question",
    value=lb_types.Checklist(answer=[
        lb_types.ClassificationAnswer(
            name="first_checklist_answer",
            classifications=[
                lb_types.ClassificationAnnotation(
                    name="sub_checklist_question",
                    value=lb_types.Checklist(answer=[
                        lb_types.ClassificationAnswer(
                            name="first_sub_checklist_answer")
                    ]),
                )
            ],
        )
    ]),
)

nested_checklist_annotation_ndjson = {
    "name":
        "nested_checklist_question",
    "answer": [{
        "name":
            "first_checklist_answer",
        "classifications": [{
            "name": "sub_checklist_question",
            "answer": {
                "name": "first_sub_checklist_answer"
            },
        }],
    }],
}

nested_radio_annotation = lb_types.ClassificationAnnotation(
    name="nested_radio_question",
    value=lb_types.Radio(answer=lb_types.ClassificationAnswer(
        name="first_radio_answer",
        classifications=[
            lb_types.ClassificationAnnotation(
                name="sub_radio_question",
                value=lb_types.Radio(answer=lb_types.ClassificationAnswer(
                    name="first_sub_radio_answer")),
            )
        ],
    )),
)

nested_radio_annotation_ndjson = {
    "name": "nested_radio_question",
    "answer": {
        "name":
            "first_radio_answer",
        "classifications": [{
            "name": "sub_radio_question",
            "answer": {
                "name": "first_sub_radio_answer"
            },
        }],
    },
}

In [30]:
############## Classification Free-form text ##############

text_annotation = lb_types.ClassificationAnnotation(
    name="free_text",  # must match your ontology feature's name
    value=lb_types.Text(answer="sample text"),
)

text_annotation_ndjson = {"name": "free_text", "answer": "sample text"}

In [31]:
######### BBOX with nested classifications #########

bbox_with_radio_subclass_annotation = lb_types.ObjectAnnotation(
    name="bbox_with_radio_subclass",
    value=lb_types.DocumentRectangle(
        start=lb_types.Point(x=317.271, y=226.757),  # x = left, y = top
        end=lb_types.Point(x=566.657,
                           y=420.986),  # x= left + width , y = top + height
        unit=lb_types.RectangleUnit.POINTS,
        page=1,
    ),
    classifications=[
        lb_types.ClassificationAnnotation(
            name="sub_radio_question",
            value=lb_types.Radio(answer=lb_types.ClassificationAnswer(
                name="first_sub_radio_answer",
                classifications=[
                    lb_types.ClassificationAnnotation(
                        name="second_sub_radio_question",
                        value=lb_types.Radio(
                            answer=lb_types.ClassificationAnswer(
                                name="second_sub_radio_answer")),
                    )
                ],
            )),
        )
    ],
)

bbox_with_radio_subclass_annotation_ndjson = {
    "name": "bbox_with_radio_subclass",
    "classifications": [{
        "name": "sub_radio_question",
        "answer": {
            "name":
                "first_sub_radio_answer",
            "classifications": [{
                "name": "second_sub_radio_question",
                "answer": {
                    "name": "second_sub_radio_answer"
                },
            }],
        },
    }],
    "bbox": {
        "top": 226.757,
        "left": 317.271,
        "height": 194.229,
        "width": 249.386,
    },
    "page": 1,
    "unit": "POINTS",
}

In [32]:
############ NER with nested classifications ########

ner_with_checklist_subclass_annotation = lb_types.ObjectAnnotation(
    name="ner_with_checklist_subclass",
    value=lb_types.DocumentEntity(
        name="ner_with_checklist_subclass",
        text_selections=[
            lb_types.DocumentTextSelection(token_ids=[], group_id="", page=1)
        ],
    ),
    classifications=[
        lb_types.ClassificationAnnotation(
            name="sub_checklist_question",
            value=lb_types.Checklist(answer=[
                lb_types.ClassificationAnswer(name="first_sub_checklist_answer")
            ]),
        )
    ],
)

ner_with_checklist_subclass_annotation_ndjson = {
    "name":
        "ner_with_checklist_subclass",
    "classifications": [{
        "name": "sub_checklist_question",
        "answer": [{
            "name": "first_sub_checklist_answer"
        }],
    }],
    "textSelections": [{
        "tokenIds": ["<UUID>"],
        "groupId": "<UUID>",
        "page": 1
    }],
}

In [33]:
######### Relationships ##########
entity_source = lb_types.ObjectAnnotation(
    name="named_entity",
    value=lb_types.DocumentEntity(
        name="named_entity",
        textSelections=[
            lb_types.DocumentTextSelection(token_ids=[], group_id="", page=1)
        ],
    ),
)

entity_target = lb_types.ObjectAnnotation(
    name="named_entity",
    value=lb_types.DocumentEntity(
        name="named_entity",
        textSelections=[
            lb_types.DocumentTextSelection(token_ids=[], group_id="", page=1)
        ],
    ),
)

entity_relationship = lb_types.RelationshipAnnotation(
    name="relationship",
    value=lb_types.Relationship(
        source=entity_source,
        target=entity_target,
        type=lb_types.Relationship.Type.UNIDIRECTIONAL,
    ),
)

## Only supported for MAL imports
uuid_source = str(uuid.uuid4())
uuid_target = str(uuid.uuid4())

entity_source_ndjson = {
    "name":
        "named_entity",
    "uuid":
        uuid_source,
    "textSelections": [{
        "tokenIds": ["<UUID>"],
        "groupId": "<UUID>",
        "page": 1
    }],
}

entity_target_ndjson = {
    "name":
        "named_entity",
    "uuid":
        uuid_target,
    "textSelections": [{
        "tokenIds": ["<UUID>"],
        "groupId": "<UUID>",
        "page": 1
    }],
}
ner_relationship_annotation_ndjson = {
    "name": "relationship",
    "relationship": {
        "source": uuid_source,
        "target": uuid_target,
        "type": "unidirectional",
    },
}

In [34]:
######### BBOX with relationships #############
# Python Annotation
bbox_source = lb_types.ObjectAnnotation(
    name="bounding_box",
    value=lb_types.DocumentRectangle(
        start=lb_types.Point(x=188.257, y=68.875),  # x = left, y = top
        end=lb_types.Point(x=270.907,
                           y=149.556),  # x = left + width , y = top + height
        unit=lb_types.RectangleUnit.POINTS,
        page=1,
    ),
)

bbox_target = lb_types.ObjectAnnotation(
    name="bounding_box",
    value=lb_types.DocumentRectangle(
        start=lb_types.Point(x=96.424, y=66.251),
        end=lb_types.Point(x=179.074, y=146.932),
        unit=lb_types.RectangleUnit.POINTS,
        page=1,
    ),
)

bbox_relationship = lb_types.RelationshipAnnotation(
    name="relationship",
    value=lb_types.Relationship(
        source=bbox_source,
        target=bbox_target,
        type=lb_types.Relationship.Type.UNIDIRECTIONAL,
    ),
)

## Only supported for MAL imports
uuid_source_2 = str(uuid.uuid4())
uuid_target_2 = str(uuid.uuid4())

bbox_source_ndjson = {
    "name": "bounding_box",
    "uuid": uuid_source_2,
    "bbox": {
        "top": 68.875,
        "left": 188.257,
        "height": 80.681,
        "width": 82.65
    },
    "page": 1,
    "unit": "POINTS",
}

bbox_target_ndjson = {
    "name": "bounding_box",
    "uuid": uuid_target_2,
    "bbox": {
        "top": 66.251,
        "left": 96.424,
        "height": 80.681,
        "width": 82.65
    },
    "page": 1,
    "unit": "POINTS",
}

bbox_relationship_annotation_ndjson = {
    "name": "relationship",
    "relationship": {
        "source": uuid_source_2,
        "target": uuid_target_2,
        "type": "unidirectional",
    },
}
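In the NDJSON format, a relationship refers to its source and target by `uuid`, so both annotations need to be present in the same payload. The sketch below checks this locally before uploading. It assumes annotations are plain dicts shaped like the ones above; `find_dangling_relationships` is a hypothetical helper, not an SDK function, and it is not guaranteed to mirror the server-side validation.

```python
import uuid


def find_dangling_relationships(annotations):
    """Return (relationship name, end, uuid) for references missing from the payload."""
    known = {a["uuid"] for a in annotations if "uuid" in a}
    dangling = []
    for annot in annotations:
        rel = annot.get("relationship")
        if rel is None:
            continue
        for end in ("source", "target"):
            if rel[end] not in known:
                dangling.append((annot["name"], end, rel[end]))
    return dangling


# Hypothetical payload mirroring the bbox relationship example
src, tgt = str(uuid.uuid4()), str(uuid.uuid4())
source = {"name": "bounding_box", "uuid": src}
target = {"name": "bounding_box", "uuid": tgt}
relation = {
    "name": "relationship",
    "relationship": {"source": src, "target": tgt, "type": "unidirectional"},
}

print(find_dangling_relationships([source, target, relation]))  # complete payload
print(find_dangling_relationships([target, relation]))          # source missing
```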

## Upload Annotations - putting it all together

### Step 1: Import data rows into Catalog

Passing a `text_layer_url` is no longer required. Labelbox automatically generates a text layer using Google Document AI and its OCR engine to detect tokens.

However, it's important to note that Google Document AI imposes specific restrictions on document size:
- The document must have no more than 15 pages.
- The file size should not exceed 20 MB.

Furthermore, Google Document AI optimizes documents before OCR processing. This optimization might include rotating images or pages to ensure that text appears horizontally. Consequently, token coordinates are calculated based on the rotated/optimized images, resulting in potential discrepancies with the original PDF document.

For example, in a landscape-oriented PDF, the document is rotated by 90 degrees before processing. As a result, all tokens in the text layer are also rotated by 90 degrees.

You may still pass a `text_layer_url` if you wish to bypass the automatic text layer generation.
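The size restrictions above can be checked before creating data rows. This is a minimal sketch that assumes you already know the page count and file size; `ocr_limit_problems` and both constants are hypothetical names, and reading "20 MB" as 20 MiB is an assumption.

```python
MAX_PAGES = 15
MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB


def ocr_limit_problems(page_count, size_bytes):
    """Return a list of reasons the document exceeds the limits (empty if none)."""
    problems = []
    if page_count > MAX_PAGES:
        problems.append(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems


print(ocr_limit_problems(page_count=12, size_bytes=3_000_000))
print(ocr_limit_problems(page_count=40, size_bytes=3_000_000))
```

A document that fails this check is a candidate for supplying your own `text_layer_url` instead of relying on automatic generation.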


In [35]:
global_key = "0801.3483_doc.pdf" + str(uuid.uuid4())
img_url = {
    "row_data": {
        "pdf_url":
            "https://arxiv.org/pdf/1905.01657"
    },
    "global_key": global_key,
}

dataset = client.create_dataset(name="pdf_demo_dataset")
task = dataset.create_data_rows([img_url])
task.wait_till_done()
print(f"Failed data rows: {task.failed_data_rows}")
print(f"Errors: {task.errors}")

if task.errors:
    for error in task.errors:
        if ("Duplicate global key" in error["message"] and
                dataset.row_count == 0):
            # If the global key already exists in the workspace, the dataset is created empty, so we can delete it.
            print(f"Deleting empty dataset: {dataset}")
            dataset.delete()

Failed data rows: None
Errors: None


### Step 2: Create/select an Ontology for your project



In [36]:
## Set up the ontology and link the tools created above.

ontology_builder = lb.OntologyBuilder(
    classifications=[  # List of Classification objects
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="radio_question",
            scope=lb.Classification.Scope.GLOBAL,
            options=[
                lb.Option(value="first_radio_answer"),
                lb.Option(value="second_radio_answer"),
            ],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.CHECKLIST,
            name="checklist_question",
            scope=lb.Classification.Scope.GLOBAL,
            options=[
                lb.Option(value="first_checklist_answer"),
                lb.Option(value="second_checklist_answer"),
            ],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="free_text",
            scope=lb.Classification.Scope.GLOBAL,
        ),
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="nested_radio_question",
            scope=lb.Classification.Scope.GLOBAL,
            options=[
                lb.Option(
                    "first_radio_answer",
                    options=[
                        lb.Classification(
                            class_type=lb.Classification.Type.RADIO,
                            name="sub_radio_question",
                            options=[lb.Option("first_sub_radio_answer")],
                        )
                    ],
                )
            ],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.CHECKLIST,
            name="nested_checklist_question",
            scope=lb.Classification.Scope.GLOBAL,
            options=[
                lb.Option(
                    "first_checklist_answer",
                    options=[
                        lb.Classification(
                            class_type=lb.Classification.Type.CHECKLIST,
                            name="sub_checklist_question",
                            options=[lb.Option("first_sub_checklist_answer")],
                        )
                    ],
                )
            ],
        ),
    ],
    tools=[  # List of Tool objects
        lb.Tool(tool=lb.Tool.Type.BBOX, name="bounding_box"),
        lb.Tool(tool=lb.Tool.Type.NER, name="named_entity"),
        lb.Tool(tool=lb.Tool.Type.RELATIONSHIP, name="relationship"),
        lb.Tool(
            tool=lb.Tool.Type.NER,
            name="ner_with_checklist_subclass",
            classifications=[
                lb.Classification(
                    class_type=lb.Classification.Type.CHECKLIST,
                    name="sub_checklist_question",
                    options=[lb.Option(value="first_sub_checklist_answer")],
                )
            ],
        ),
        lb.Tool(
            tool=lb.Tool.Type.BBOX,
            name="bbox_with_radio_subclass",
            classifications=[
                lb.Classification(
                    class_type=lb.Classification.Type.RADIO,
                    name="sub_radio_question",
                    options=[
                        lb.Option(
                            value="first_sub_radio_answer",
                            options=[
                                lb.Classification(
                                    class_type=lb.Classification.Type.RADIO,
                                    name="second_sub_radio_question",
                                    options=[
                                        lb.Option("second_sub_radio_answer")
                                    ],
                                )
                            ],
                        )
                    ],
                )
            ],
        ),
    ],
)

ontology = client.create_ontology(
    "Document Annotation Import Demo",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Document,
)
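Every annotation name has to match a feature name in the ontology. A quick local check can catch typos before uploading. This sketch assumes the top-level feature names are listed by hand (copied from the ontology cell above) and that annotations are NDJSON-style dicts; `names_missing_from_ontology` is a hypothetical helper.

```python
# Top-level feature names copied from the ontology defined above
ONTOLOGY_NAMES = {
    "radio_question",
    "checklist_question",
    "free_text",
    "nested_radio_question",
    "nested_checklist_question",
    "bounding_box",
    "named_entity",
    "relationship",
    "ner_with_checklist_subclass",
    "bbox_with_radio_subclass",
}


def names_missing_from_ontology(annotations, ontology_names=ONTOLOGY_NAMES):
    """Return the sorted annotation names that have no matching ontology feature."""
    return sorted({a["name"] for a in annotations} - set(ontology_names))


# Hypothetical payload: the second name contains a typo
payload = [{"name": "free_text"}, {"name": "bounding_bx"}]
print(names_missing_from_ontology(payload))
```

Only top-level names are compared here; nested classification names would need a recursive walk.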

### Step 3: Creating a labeling project

In [37]:
# Create a Labelbox project
project = client.create_project(name="PDF_annotation_demo",
                                media_type=lb.MediaType.Document)
project.setup_editor(ontology)



### Step 4: Send a batch of data rows to the project

In [38]:
project.create_batch(
    "PDF_annotation_batch",  # Each batch in a project must have a unique name
    global_keys=[
        global_key
    ],  # Paginated collection of data row objects, list of data row ids or global keys
    priority=5,  # priority between 1 (highest) and 5 (lowest)
)

<Batch ID: 842f27b0-4efd-11ef-aba5-2fbca5f6033d>

### Step 5: Create the annotation payload
Create the annotations payload using the code snippets in the Supported Annotations section.

The payload can be built with Python annotation types or as NDJSON; this tutorial shows both.

The resulting labels should have the same content for annotations that are supported by both formats (except for the generated UUID strings).

##### Step 5.1: Populate the text selections for entity annotations
Importing NER annotations requires a text layer. When a PDF asset is imported without a `text_layer_url`, Labelbox generates one automatically.

To retrieve the generated text layer URL, first export the data row.

In [39]:
client.enable_experimental = True
task = lb.DataRow.export(client=client, global_keys=[global_key])
task.wait_till_done()
stream = task.get_buffered_stream()

text_layer = ""
for output in stream:
    output_json = output.json
    text_layer = output_json["media_attributes"]["text_layer_url"]
print(text_layer)

https://storage.labelbox.com/clz9d0wbx00yo070x2lsg560f%2F2puwxoit7970gfs1tctme9zlc-text-layer.json?Expires=1722490022803&KeyName=labelbox-assets-key-3&Signature=Rn8Ng4-a7hI2GchReh8KPDt2oAs


In [40]:
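# Helper method: overwrite the placeholder text selections of an NDJSON annotation in place
def update_text_selections(annotation, group_id, list_tokens, page):
    annotation.update({
        "textSelections": [{
            "groupId": group_id,
            "tokenIds": list_tokens,
            "page": page
        }]
    })
    # dict.update() returns None, so return the updated annotation explicitly
    return annotation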


# Fetch the content of the text layer
res = requests.get(text_layer)

# Phrases to annotate; each must exactly match a group's content in the text layer
content_phrases = [
    "Metal-insulator (MI) transitions have been one of the",
    "T. Sasaki, N. Yoneyama, and N. Kobayashi",
    "Organic charge transfer salts based on the donor",
    "the experimental investigations on this issue have not",
]

# Parse the text layer
text_selections = []
text_selections_ner = []
text_selections_source = []
text_selections_target = []

for obj in json.loads(res.text):
    for group in obj["groups"]:
        if group["content"] == content_phrases[0]:
            list_tokens = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            document_text_selection = lb_types.DocumentTextSelection(
                groupId=group["id"], tokenIds=list_tokens, page=1)
            text_selections.append(document_text_selection)
            # build text selection for the NDJson annotations
            update_text_selections(
                annotation=entities_annotations_ndjson,
                group_id=group["id"],  # id representing group of words
                list_tokens=
                list_tokens,  # ids representing individual words from the group
                page=1,
            )
        if group["content"] == content_phrases[1]:
            list_tokens_2 = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            ner_text_selection = lb_types.DocumentTextSelection(
                groupId=group["id"], tokenIds=list_tokens_2, page=1)
            text_selections_ner.append(ner_text_selection)
            # build text selection for the NDJson annotations
            update_text_selections(
                annotation=ner_with_checklist_subclass_annotation_ndjson,
                group_id=group["id"],  # id representing group of words
                list_tokens=
                list_tokens_2,  # ids representing individual words from the group
                page=1,
            )
        if group["content"] == content_phrases[2]:
            relationship_source = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            text_selection_entity_source = lb_types.DocumentTextSelection(
                groupId=group["id"], tokenIds=relationship_source, page=1)
            text_selections_source.append(text_selection_entity_source)
            # build text selection for the NDJson annotations
            update_text_selections(
                annotation=entity_source_ndjson,
                group_id=group["id"],  # id representing group of words
                list_tokens=
                relationship_source,  # ids representing individual words from the group
                page=1,
            )
        if group["content"] == content_phrases[3]:
            relationship_target = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            text_selection_entity_target = lb_types.DocumentTextSelection(
                group_id=group["id"], tokenIds=relationship_target, page=1)
            text_selections_target.append(text_selection_entity_target)
            # build text selections for the NDJson annotations
            update_text_selections(
                annotation=entity_target_ndjson,
                group_id=group["id"],  # id representing group of words
                list_tokens=
                relationship_target,  # ids representing individual words from the group
                page=1,
            )
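The four branches in the loop above repeat the same lookup. A minimal sketch of that lookup as a single function follows. It assumes the text layer parses to a list of objects with a `groups` list, where each group has `id`, `content`, and `tokens` (as the loop above implies); `find_text_selection` and the sample data are hypothetical.

```python
def find_text_selection(text_layer_pages, phrase, page=1):
    """Return an NDJSON text selection for the first group matching phrase, else None."""
    for page_obj in text_layer_pages:
        for group in page_obj["groups"]:
            if group["content"] == phrase:
                return {
                    "groupId": group["id"],
                    "tokenIds": [token["id"] for token in group["tokens"]],
                    "page": page,
                }
    return None


# Hypothetical text layer with one page and one group
sample_layer = [{
    "groups": [{
        "id": "group-1",
        "content": "Hello world",
        "tokens": [{"id": "tok-1"}, {"id": "tok-2"}],
    }]
}]

print(find_text_selection(sample_layer, "Hello world"))
print(find_text_selection(sample_layer, "Not in the document"))
```

Returning `None` for a phrase that is not found makes it easy to stop before uploading an annotation that still carries the `<UUID>` placeholders or an empty list of text selections.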

Rewrite the Python annotations to include text selections (only required for Python annotation types).

In [41]:
# re-write the entity annotation with text selections
entities_annotation_document_entity = lb_types.DocumentEntity(
    name="named_entity", textSelections=text_selections)
entities_annotation = lb_types.ObjectAnnotation(
    name="named_entity", value=entities_annotation_document_entity)

# re-write the entity annotation + subclassification with text selections
classifications = [
    lb_types.ClassificationAnnotation(
        name="sub_checklist_question",
        value=lb_types.Checklist(answer=[
            lb_types.ClassificationAnswer(name="first_sub_checklist_answer")
        ]),
    )
]
ner_annotation_with_subclass = lb_types.DocumentEntity(
    name="ner_with_checklist_subclass", textSelections=text_selections_ner)
ner_with_checklist_subclass_annotation = lb_types.ObjectAnnotation(
    name="ner_with_checklist_subclass",
    value=ner_annotation_with_subclass,
    classifications=classifications,
)

# re-write the entity source and target annotations with text selections
entity_source_doc = lb_types.DocumentEntity(
    name="named_entity", text_selections=text_selections_source)
entity_source = lb_types.ObjectAnnotation(name="named_entity",
                                          value=entity_source_doc)

entity_target_doc = lb_types.DocumentEntity(
    name="named_entity", text_selections=text_selections_target)
entity_target = lb_types.ObjectAnnotation(name="named_entity",
                                          value=entity_target_doc)

# re-write the entity relationship with the re-created entities
entity_relationship = lb_types.RelationshipAnnotation(
    name="relationship",
    value=lb_types.Relationship(
        source=entity_source,
        target=entity_target,
        type=lb_types.Relationship.Type.UNIDIRECTIONAL,
    ),
)

In [42]:
# Final NDJSON and python annotations
print(f"entities_annotations_ndjson={entities_annotations_ndjson}")
print(f"entities_annotation={entities_annotation}")
print(
    f"nested_entities_annotation_ndjson={ner_with_checklist_subclass_annotation_ndjson}"
)
print(f"nested_entities_annotation={ner_with_checklist_subclass_annotation}")
print(f"entity_source_ndjson={entity_source_ndjson}")
print(f"entity_target_ndjson={entity_target_ndjson}")
print(f"entity_source={entity_source}")
print(f"entity_target={entity_target}")

entities_annotations_ndjson={'name': 'named_entity', 'textSelections': [{'tokenIds': ['<UUID>'], 'groupId': '<UUID>', 'page': 1}]}
entities_annotation=custom_metrics=None confidence=None name='named_entity' feature_schema_id=None extra={} value=DocumentEntity(text_selections=[]) classifications=[]
nested_entities_annotation_ndjson={'name': 'ner_with_checklist_subclass', 'classifications': [{'name': 'sub_checklist_question', 'answer': [{'name': 'first_sub_checklist_answer'}]}], 'textSelections': [{'tokenIds': ['<UUID>'], 'groupId': '<UUID>', 'page': 1}]}
nested_entities_annotation=custom_metrics=None confidence=None name='ner_with_checklist_subclass' feature_schema_id=None extra={} value=DocumentEntity(text_selections=[]) classifications=[ClassificationAnnotation(custom_metrics=None, confidence=None, name='sub_checklist_question', feature_schema_id=None, extra={}, value=Checklist(confidence=None, name='checklist', answer=[ClassificationAnswer(custom_metrics=None, confidence=None, name='

#### Python annotation
Here we build the complete label payload using only Python annotation types, with one entry for each annotation created above. Note that only a handful of Python annotation types are supported for PDF documents.

In [43]:
labels = []

labels.append(
    lb_types.Label(
        data={"global_key": global_key},
        annotations=[
            entities_annotation,
            checklist_annotation,
            nested_checklist_annotation,
            text_annotation,
            radio_annotation,
            nested_radio_annotation,
            bbox_annotation,
            bbox_with_radio_subclass_annotation,
            ner_with_checklist_subclass_annotation,
            entity_source,
            entity_target,
            entity_relationship,  # Only supported for MAL imports
            bbox_source,
            bbox_target,
            bbox_relationship,  # Only supported for MAL imports
        ],
    ))

#### NDJson annotations
Here we build the complete label payload using only the NDJSON format, with one entry for each annotation created above.

In [44]:
label_ndjson = []
for annot in [
        entities_annotations_ndjson,
        checklist_annotation_ndjson,
        nested_checklist_annotation_ndjson,
        text_annotation_ndjson,
        radio_annotation_ndjson,
        nested_radio_annotation_ndjson,
        bbox_annotation_ndjson,
        bbox_with_radio_subclass_annotation_ndjson,
        ner_with_checklist_subclass_annotation_ndjson,
        entity_source_ndjson,
        entity_target_ndjson,
        ner_relationship_annotation_ndjson,  # Only supported for MAL imports
        bbox_source_ndjson,
        bbox_target_ndjson,
        bbox_relationship_annotation_ndjson,  # Only supported for MAL imports
]:
    annot.update({
        "dataRow": {
            "globalKey": global_key
        },
    })
    label_ndjson.append(annot)
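NDJSON is newline-delimited JSON: one JSON object per line. The loop above keeps the payload as a list of dicts and mutates each dict in place. The sketch below shows a non-mutating variant that also serializes the result to NDJSON text, which can be handy for inspecting or saving the payload. `to_ndjson` and the sample values are hypothetical.

```python
import json


def to_ndjson(annotations, global_key):
    """Attach the data row reference to copies of each annotation and serialize to NDJSON."""
    lines = []
    for annot in annotations:
        row = {**annot, "dataRow": {"globalKey": global_key}}  # shallow copy, original untouched
        lines.append(json.dumps(row))
    return "\n".join(lines)


# Hypothetical annotations and global key
sample = [
    {"name": "free_text", "answer": "sample text"},
    {"name": "radio_question", "answer": {"name": "first_radio_answer"}},
]
ndjson_text = to_ndjson(sample, "my_global_key")
print(ndjson_text)
```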

### Step 6: Import the annotation payload
For the purpose of this tutorial, import only one of the annotation payloads at a time (NDJSON or Python annotation types).

Option A: Upload to a labeling project as pre-labels (MAL)

In [45]:
upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id=project.uid,
    name="pdf_annotation_upload" + str(uuid.uuid4()),
    predictions=labels,
)

upload_job.wait_until_done()
# Errors will appear for annotation uploads that failed.
print("Errors:", upload_job.errors)
print("Status of uploads: ", upload_job.statuses)

Errors: [{'uuid': '7d97ca0c-529e-437a-bf5a-f7af3d06f8e1', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKey': '0801.3483_doc.pdfe4b70893-037a-4233-86cf-03a337d35530'}, 'status': 'FAILURE', 'errors': [{'name': 'InvalidAnnotation', 'message': 'source annotation uuid was not found in current import', 'additionalInfo': None}, {'name': 'InvalidAnnotation', 'message': 'target annotation uuid was not found in current import', 'additionalInfo': None}]}, {'uuid': '570c4405-c1b3-424b-bec8-6f0462a62bb0', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKey': '0801.3483_doc.pdfe4b70893-037a-4233-86cf-03a337d35530'}, 'status': 'FAILURE', 'errors': [{'name': 'InvalidAnnotation', 'message': 'Document annotation is missing page or unit fields.', 'additionalInfo': None}]}, {'uuid': '9be95fce-8742-456c-a587-6c883c2b6095', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKey': '0801.3483_doc.pdfe4b70893-037a-4233-86cf-03a337d35530'}, 'status': 'FAILURE', 'errors': [{'name': 'ValidationE
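The raw error list is hard to read when many annotations fail. A small sketch that groups the failures by message, assuming each entry is a dict with an `errors` list of `{"name", "message", ...}` items as in the output above; `summarize_errors` is a hypothetical helper and the sample below is shortened, hand-written data.

```python
from collections import Counter


def summarize_errors(errors):
    """Count how often each error message appears across failed annotations."""
    return Counter(
        err["message"]
        for entry in errors
        for err in entry.get("errors", [])
    )


# Hypothetical sample shaped like the output above (uuids shortened)
sample_errors = [
    {
        "uuid": "aaa",
        "status": "FAILURE",
        "errors": [
            {"name": "InvalidAnnotation",
             "message": "source annotation uuid was not found in current import"},
            {"name": "InvalidAnnotation",
             "message": "target annotation uuid was not found in current import"},
        ],
    },
    {
        "uuid": "bbb",
        "status": "FAILURE",
        "errors": [
            {"name": "InvalidAnnotation",
             "message": "Document annotation is missing page or unit fields."},
        ],
    },
]

for message, count in summarize_errors(sample_errors).most_common():
    print(f"{count}x {message}")
```

In the notebook, the same function could be applied to `upload_job.errors`.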

Option B: Upload to a labeling project using ground truth

In [47]:
# Relationships are not currently supported for label import.
# Remove entity_relationship and bbox_relationship from `labels` before running this cell.

upload_job = lb.LabelImport.create_from_objects(
    client=client,
    project_id=project.uid,
    name="label_import_job" + str(uuid.uuid4()),
    labels=labels,
)

upload_job.wait_until_done()

print("Errors:", upload_job.errors)
print("Status of uploads: ", upload_job.statuses)

Errors: [{'uuid': '7d97ca0c-529e-437a-bf5a-f7af3d06f8e1', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKey': '0801.3483_doc.pdfe4b70893-037a-4233-86cf-03a337d35530'}, 'status': 'FAILURE', 'errors': [{'name': 'InvalidAnnotation', 'message': 'Relationships are supported only for MAL imports', 'additionalInfo': None}, {'name': 'InvalidAnnotation', 'message': 'source annotation uuid was not found in current import', 'additionalInfo': None}, {'name': 'InvalidAnnotation', 'message': 'target annotation uuid was not found in current import', 'additionalInfo': None}]}, {'uuid': 'dcc2f48a-1104-4661-b19c-22e7e85a8b96', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKey': '0801.3483_doc.pdfe4b70893-037a-4233-86cf-03a337d35530'}, 'status': 'FAILURE', 'errors': [{'name': 'InvalidAnnotation', 'message': 'Relationships are supported only for MAL imports', 'additionalInfo': None}]}, {'uuid': '570c4405-c1b3-424b-bec8-6f0462a62bb0', 'dataRow': {'id': 'clz9emtct1sfg0797tioxwup2', 'globalKe
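The errors above include "Relationships are supported only for MAL imports". For the NDJSON payload, one way to avoid them is to filter relationship entries out before a label import. This sketch assumes relationship annotations are the dicts that carry a `relationship` key, as in the examples earlier; `drop_relationships` is a hypothetical helper.

```python
def drop_relationships(annotations):
    """Return a new list without NDJSON relationship annotations."""
    return [annot for annot in annotations if "relationship" not in annot]


# Hypothetical payload: two boxes and one relationship between them
payload = [
    {"name": "bounding_box", "uuid": "source-id"},
    {"name": "bounding_box", "uuid": "target-id"},
    {
        "name": "relationship",
        "relationship": {
            "source": "source-id",
            "target": "target-id",
            "type": "unidirectional",
        },
    },
]

ground_truth_payload = drop_relationships(payload)
print([annot["name"] for annot in ground_truth_payload])
```

The source and target annotations themselves are ordinary objects and remain in the payload.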

# **Summary**

Completing this assignment has provided me with practical experience in using Labelbox for data annotation. I have learned how to use Labelbox to import PDF data, create labeling projects, and set up ontologies with various annotation types like bounding boxes, entities, and classifications. This process highlighted the importance of high-quality labeled data for training machine learning models. I gained insights into Labelbox’s capabilities in managing large-scale, collaborative annotation projects, ensuring consistency and efficiency. This experience demonstrated the flexibility of Labelbox in adapting to different use cases and its role in integrating annotations into machine learning workflows, preparing me to contribute effectively to data annotation and model training projects.