Skip to content

Latest commit

 

History

History
460 lines (349 loc) · 17.5 KB

partition.mdx

File metadata and controls

460 lines (349 loc) · 17.5 KB
title description
Partition
Documentation for Aryn SDK Partition

import PartitionOptions from '/snippets/partition_options.mdx';

Please find the documentation for the Aryn SDK Partition module below. All parameters are optional unless specified otherwise.

Synchronous Partitioning Functions

partition_file

Sends a file to Aryn DocParse and returns a Python dictionary with elements containing its document structure and text.

from aryn_sdk.partition import partition_file

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-API-KEY",
        text_mode="standard_ocr",
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']

A dictionary containing keys status and elements. If output_format is markdown, returns a dictionary of status and markdown.

User errors:

  • HTTPError: Error:status_code 403. Reason: "This async action requires you to upgrade your account plan"
    • Fix: Please upgrade your account here
  • HTTPError: Error:status_code 403. Reason: "No Aryn API Key provided"
    • Fix: Please provide an API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Invalid Aryn API key"
    • Fix: Please provide a valid API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Expired Aryn API key"
    • Fix: Please get a new API key here.

Other errors:

  • HTTPError: Error:status_code 5xx. Reason: Internal Server Error

Aynchronous Partitioning Functions

partition_file_async_submit

Submit a document for asynchronous partitioning and get its task_id. The results of the task will remain available in the system for 48 hours. Meant to be used with partition_file_async_result. Note: sending multiple asynchronous partitioning tasks at the same time does not guarantee that they will run simultaneously.

  • webhook_url: a string url for Aryn to visit when the async task has stopped. POSTs a body like this: {"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
```python from aryn_sdk.partition import partition_file_async_submit

with open("my-favorite-pdf.pdf", "rb") as f: response = partition_file_async_submit( f, text_mode="standard_ocr", extract_table_structure=True, )

get task id

task_id = response["task_id"]

</Accordion>
<Accordion title="Return Value">
A dict containing the `task_id` of the submitted request.

```json
{
  "task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"
}

User errors:

  • HTTPError: Error:status_code 403. Reason: "This async action requires you to upgrade your account plan"
    • Fix: Please upgrade your account here
  • HTTPError: Error:status_code 403. Reason: "No Aryn API Key provided"
    • Fix: Please provide an API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Invalid Aryn API key"
    • Fix: Please provide a valid API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Expired Aryn API key"
    • Fix: Please get a new API key here.
  • HTTPError: Error:status_code 429. Reason: "Too many requests"
    • Fix: Please try again after some time. Each account is allowed 1000 tasks to run at a time.

Other errors:

  • HTTPError: Error:status_code 5xx. Reason: Internal Server Error

partition_file_async_result

Gets the results of an asynchronous partitioning task by task_id. Meant to be used with partition_file_async_submit.

  • task_id: Required. A string of the task id to poll and attempt to get the result for.
  • aryn_api_key: An Aryn API key, provided as a string. You can get one for free at aryn.ai/get-started. Default is None (If not provided, the sdk will check for it in the environment variable ARYN_API_KEY or will look in aryn_config as specified above).
  • aryn_config: An ArynConfig object (defined in aryn_sdk/config.py), used for finding an api key. If aryn_api_key is set it will override this. The default ArynConfig looks in the env var ARYN_API_KEY and then in the file ~/.aryn/config.yaml. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your environment variables, and then in ~/.aryn/config.yaml).
  • ssl_verify: A bool that controls whether the client verifies the SSL certificate of the chosen DocParse server. ssl_verify is True by default, enforcing SSL verification.
```python import time from aryn_sdk.partition import partition_file_async_result

Poll for the results

while True: result = partition_file_async_result(task_id) if result["task_status"] != "pending": break time.sleep(1)

</Accordion>

<Accordion title="Return Value">
A dict like the one in the example below containing "task_status". When "task_status" is "done", the returned
dict also contains "result" which contains what would have been returned had `partition_file` been called directly. If there is an error with partitioning the file itself, then the "task_status" will still be "done" but the
"result" will contain an "error" field indicating what went wrong.

"task_status" can be "done" or "pending". <br />
```json
{
  "task_status":"done",
  "result": ...
}

User errors:

  • HTTPError: Error:status_code 403. Reason: "This async action requires you to upgrade your account plan"
    • Fix: Please upgrade your account here
  • HTTPError: Error:status_code 403. Reason: "No Aryn API Key provided"
    • Fix: Please provide an API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Invalid Aryn API key"
    • Fix: Please provide a valid API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Expired Aryn API key"
    • Fix: Please get a new API key here.
  • aryn_sdk.partition.partition.PartitionTaskNotFoundError. Reason: "No such task"
    • Fix: Check to make sure the task_id specified is correct.

Other errors:

  • HTTPError: Error:status_code 5xx. Reason: Internal Server Error

partition_file_async_cancel

Cancels the task associated with the task_id specified.

  • task_id: Required. A string of the task id to cancel.
  • aryn_api_key: An Aryn API key, provided as a string. You can get one for free at aryn.ai/get-started. Default is None (If not provided, the sdk will check for it in the environment variable ARYN_API_KEY or will look in aryn_config as specified above).
  • aryn_config: An ArynConfig object (defined in aryn_sdk/config.py), used for finding an api key. If aryn_api_key is set it will override this. The default ArynConfig looks in the env var ARYN_API_KEY and then in the file ~/.aryn/config.yaml. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your environment variables, and then in ~/.aryn/config.yaml).
  • ssl_verify: A bool that controls whether the client verifies the SSL certificate of the chosen DocParse server. ssl_verify is True by default, enforcing SSL verification.
```python from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel task_id = partition_file_async_submit( "path/to/file.pdf", text_mode="standard_ocr", extract_table_structure=True, extract_images=True, )["task_id"]

partition_file_async_cancel(task_id)

</Accordion>
<Accordion title="Return Value">
No return value. Asynchronous tasks may only be successfully cancelled once. Once a task has been
cancelled, any `partition_file_async_result` calls using that task's id will throw an exception.
</Accordion>

<Accordion title="Exceptions">

User errors:
- ```HTTPError: Error:status_code 403```. Reason: `"This async action requires you to upgrade your account plan"`
    - Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
- ```HTTPError: Error:status_code 403```. Reason: `"No Aryn API Key provided"`
    - Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
- ```HTTPError: Error:status_code 403```. Reason: `"Invalid Aryn API key"`
    - Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
- ```HTTPError: Error:status_code 403```. Reason: `"Expired Aryn API key"`
    - Fix: Please get a new API key [here](https://console.aryn.ai/api-keys). 
- ```aryn_sdk.partition.partition.PartitionTaskNotFoundError```. Reason: `"No such task"`
    - Fix: Check to make sure the task_id specified is correct.

Other errors:
- ```HTTPError: Error:status_code 5xx```. Reason: `Internal Server Error`

</Accordion>
</AccordionGroup>




### partition_file_async_list
Lists all the partition_file tasks still running in your account.
<AccordionGroup>
<Accordion title="Parameters">
- `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn_config as specified above).
- `aryn_config`: An ArynConfig object (defined in
[aryn_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
an api key. If `aryn_api_key` is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
and then in the file `~/.aryn/config.yaml`. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your
environment variables, and then in `~/.aryn/config.yaml`).
- `ssl_verify`: A `bool` that controls whether the client verifies the SSL certificate of the chosen DocParse server.
ssl_verify is `True` by default, enforcing SSL verification.
</Accordion>
<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file_async_list

l = partition_file_async_list()
A dict like the one below which maps task_ids to a dict containing details of the respective task. ```json { "aryn:t-sc0v0lglkauo774pioflp4l": { "task_status": "pending" }, "aryn:t-b9xp7ny0eejvqvbazjhg8rn": { "task_status": "pending" } } ```

User errors:

  • HTTPError: Error:status_code 403. Reason: "This async action requires you to upgrade your account plan"
    • Fix: Please upgrade your account here
  • HTTPError: Error:status_code 403. Reason: "No Aryn API Key provided"
    • Fix: Please provide an API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Invalid Aryn API key"
    • Fix: Please provide a valid API key either as a parameter or specify it in the environment variable ARYN_API_KEY.
  • HTTPError: Error:status_code 403. Reason: "Expired Aryn API key"
    • Fix: Please get a new API key here.

Other errors:

  • HTTPError: Error:status_code 5xx. Reason: Internal Server Error

Helper Functions

convert_image_element

Convert an image element to a more usable format. If no format is specified, create a PIL Image object. If a format is specified, output the bytes of the image in that format. If b64encode is set to True, base64-encode the bytes and return them as a string.

- `elem`: Required. An image element from the `elements` field of a `partition_file` response. - `format`: A `string` specifying the format to output bytes to. Default is `PIL`. - `b64encode`: A `boolean` that when set to True enables base64-encoding of the output bytes of this function. Format cannot be `PIL` when this option is `True`. Default is `False`. ```python from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f: data = partition_file( f, extract_images=True ) image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0]) jpg_bytes = convert_image_element(image_elts[1], format='JPEG') png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)

</Accordion>

<Accordion title="Return Value">

Either a PIL `Image` object, bytes of an image, or a base64-encoded image as a `str`.

</Accordion>


<Accordion title="Exceptions">
-  ```ValueError``` - "b64encode was True but format was PIL. Cannot b64-encode a PIL Image". 
    - Fix: If you're calling the function with a PIL image, please set b64encode to False. 
</Accordion>

</AccordionGroup>

### draw_with_boxes

Create a list of images from the provided PDF, one for each page, with bounding boxes detected by the partitioner drawn on.

<AccordionGroup>

<Accordion title="Parameters">
- `pdf_file`: Required. A PDF file opened in binary mode or a path to a PDF file expressed as a `string` or a `PathLike` object upon which to draw.
-`partitioning_data`: Required. The output from `partition_file`.
- `draw_table_cells`: A boolean that when `True`, makes the function draw individually detected cells of tables. When `False`, the bounding boxes of table cells are not drawn but the outer bounding boxes of tables and the bounding boxes of all other elements are still drawn. Default is False.

</Accordion>

<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file, draw_with_boxes

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-API-KEY",
        text_mode="standard_ocr",
        extract_table_structure=True,
        extract_images=True
    )
pages = draw_with_boxes("my-favorite-pdf.pdf", data, draw_table_cells=True)
A list of images of pages of the PDF, each with bounding boxes drawn on. - Will throw an exception if the function is not called with a PDF file.

table_elem_to_dataframe

Create a pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type table or doesn't contain any table data, return None instead.

- `elem`: Required. An element from the 'elements' field of a `partition_file` response. ```python from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f: data = partition_file( f, text_mode="standard_ocr", extract_table_structure=True, extract_images=True )

Find the first table and convert it to a dataframe

df = None for element in data['elements']: if element['type'] == 'table': df = table_elem_to_dataframe(element) break

</Accordion>

<Accordion title="Return Value">
A Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, returns None instead.

</Accordion>

</AccordionGroup>


### tables_to_pandas

For every table element in the provided partitioning response, create a pandas DataFrame representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding DataFrames.


<AccordionGroup>

<Accordion title="Parameters">

- `data`: A response from `partition_file`

</Accordion>


<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file, tables_to_pandas

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-API-KEY",
        text_mode="standard_ocr",
        extract_table_structure=True,
        extract_images=True
    )
elts_and_dataframes = tables_to_pandas(data)

A list of tuples, where each tuple contains an element from the 'elements' field of a partition_file response and a Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, the DataFrame will be None.