title | description |
---|---|
Partition |
Documentation for Aryn SDK Partition |
import PartitionOptions from '/snippets/partition_options.mdx';
Please find the documentation for the Aryn SDK Partition module below. All parameters are optional unless specified otherwise.
Sends a file to Aryn DocParse and returns a Python dictionary with elements containing its document structure and text.
from aryn_sdk.partition import partition_file
with open("my-favorite-pdf.pdf", "rb") as f:
data = partition_file(
f,
aryn_api_key="MY-ARYN-API-KEY",
text_mode="standard_ocr",
extract_table_structure=True,
extract_images=True
)
elements = data['elements']
A dictionary containing keys status
and elements
. If output_format is markdown
, returns a dictionary of status
and markdown
.
User errors:
HTTPError: Error:status_code 403
. Reason:"This async action requires you to upgrade your account plan"
- Fix: Please upgrade your account here
HTTPError: Error:status_code 403
. Reason:"No Aryn API Key provided"
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Invalid Aryn API key"
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Expired Aryn API key"
- Fix: Please get a new API key here.
Other errors:
HTTPError: Error:status_code 5xx
. Reason:Internal Server Error
Submit a document for asynchronous partitioning and get its task_id
. The results of the task will remain available in the system for 48 hours. Meant to be used with partition_file_async_result
.
Note: sending multiple asynchronous partitioning tasks at the same time does not guarantee that they will run simultaneously.
webhook_url
: a string url for Aryn to visit when the async task has stopped. POSTs a body like this:{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
with open("my-favorite-pdf.pdf", "rb") as f: response = partition_file_async_submit( f, text_mode="standard_ocr", extract_table_structure=True, )
task_id = response["task_id"]
</Accordion>
<Accordion title="Return Value">
A dict containing the `task_id` of the submitted request.
```json
{
"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"
}
User errors:
HTTPError: Error:status_code 403
. Reason:"This async action requires you to upgrade your account plan"
- Fix: Please upgrade your account here
HTTPError: Error:status_code 403
. Reason:"No Aryn API Key provided"
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Invalid Aryn API key"
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Expired Aryn API key"
- Fix: Please get a new API key here.
HTTPError: Error:status_code 429
. Reason:"Too many requests"
- Fix: Please try again after some time. Each account is allowed 1000 tasks to run at a time.
Other errors:
HTTPError: Error:status_code 5xx
. Reason:Internal Server Error
Gets the results of an asynchronous partitioning task by task_id
. Meant to be used with partition_file_async_submit
.
task_id
: Required. A string of the task id to poll and attempt to get the result for.aryn_api_key
: An Aryn API key, provided as a string. You can get one for free at aryn.ai/get-started. Default isNone
(If not provided, the sdk will check for it in the environment variableARYN_API_KEY
or will look in aryn_config as specified above).aryn_config
: An ArynConfig object (defined in aryn_sdk/config.py), used for finding an api key. Ifaryn_api_key
is set it will override this. The default ArynConfig looks in the env varARYN_API_KEY
and then in the file~/.aryn/config.yaml
. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your environment variables, and then in~/.aryn/config.yaml
).ssl_verify
: Abool
that controls whether the client verifies the SSL certificate of the chosen DocParse server. ssl_verify isTrue
by default, enforcing SSL verification.
while True: result = partition_file_async_result(task_id) if result["task_status"] != "pending": break time.sleep(1)
</Accordion>
<Accordion title="Return Value">
A dict like the one in the example below containing "task_status". When "task_status" is "done", the returned
dict also contains "result" which contains what would have been returned had `partition_file` been called directly. If there is an error with partitioning the file itself, then the "task_status" will still be "done" but the
"result" will contain an "error" field indicating what went wrong.
"task_status" can be "done" or "pending". <br />
```json
{
"task_status":"done",
"result": ...
}
User errors:
HTTPError: Error:status_code 403
. Reason:"This async action requires you to upgrade your account plan"
- Fix: Please upgrade your account here
HTTPError: Error:status_code 403
. Reason:"No Aryn API Key provided"
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Invalid Aryn API key"
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Expired Aryn API key"
- Fix: Please get a new API key here.
aryn_sdk.partition.partition.PartitionTaskNotFoundError
. Reason:"No such task"
- Fix: Check to make sure the task_id specified is correct.
Other errors:
HTTPError: Error:status_code 5xx
. Reason:Internal Server Error
Cancels the task associated with the task_id specified.
task_id
: Required. A string of the task id to cancel.aryn_api_key
: An Aryn API key, provided as a string. You can get one for free at aryn.ai/get-started. Default isNone
(If not provided, the sdk will check for it in the environment variableARYN_API_KEY
or will look in aryn_config as specified above).aryn_config
: An ArynConfig object (defined in aryn_sdk/config.py), used for finding an api key. Ifaryn_api_key
is set it will override this. The default ArynConfig looks in the env varARYN_API_KEY
and then in the file~/.aryn/config.yaml
. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your environment variables, and then in~/.aryn/config.yaml
).ssl_verify
: Abool
that controls whether the client verifies the SSL certificate of the chosen DocParse server. ssl_verify isTrue
by default, enforcing SSL verification.
partition_file_async_cancel(task_id)
</Accordion>
<Accordion title="Return Value">
No return value. Asynchronous tasks may only be successfully cancelled once. Once a task has been
cancelled, any `partition_file_async_result` calls using that task's id will throw an exception.
</Accordion>
<Accordion title="Exceptions">
User errors:
- ```HTTPError: Error:status_code 403```. Reason: `"This async action requires you to upgrade your account plan"`
- Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
- ```HTTPError: Error:status_code 403```. Reason: `"No Aryn API Key provided"`
- Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
- ```HTTPError: Error:status_code 403```. Reason: `"Invalid Aryn API key"`
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
- ```HTTPError: Error:status_code 403```. Reason: `"Expired Aryn API key"`
- Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).
- ```aryn_sdk.partition.partition.PartitionTaskNotFoundError```. Reason: `"No such task"`
- Fix: Check to make sure the task_id specified is correct.
Other errors:
- ```HTTPError: Error:status_code 5xx```. Reason: `Internal Server Error`
</Accordion>
</AccordionGroup>
### partition_file_async_list
Lists all the partition_file tasks still running in your account.
<AccordionGroup>
<Accordion title="Parameters">
- `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn_config as specified above).
- `aryn_config`: An ArynConfig object (defined in
[aryn_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
an api key. If `aryn_api_key` is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
and then in the file `~/.aryn/config.yaml`. Default is None (aryn-sdk will look in the aryn_api_key parameter, in your
environment variables, and then in `~/.aryn/config.yaml`).
- `ssl_verify`: A `bool` that controls whether the client verifies the SSL certificate of the chosen DocParse server.
ssl_verify is `True` by default, enforcing SSL verification.
</Accordion>
<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file_async_list
l = partition_file_async_list()
User errors:
HTTPError: Error:status_code 403
. Reason:"This async action requires you to upgrade your account plan"
- Fix: Please upgrade your account here
HTTPError: Error:status_code 403
. Reason:"No Aryn API Key provided"
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide an API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Invalid Aryn API key"
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
ARYN_API_KEY
.
- Fix: Please provide a valid API key either as a parameter or specify it in the environment variable
HTTPError: Error:status_code 403
. Reason:"Expired Aryn API key"
- Fix: Please get a new API key here.
Other errors:
HTTPError: Error:status_code 5xx
. Reason:Internal Server Error
Convert an image element to a more usable format. If no format is specified, create a PIL Image object. If a format is specified, output the bytes of the image in that format. If b64encode
is set to True
, base64-encode the bytes and return them as a string
.
with open("my-favorite-pdf.pdf", "rb") as f: data = partition_file( f, extract_images=True ) image_elts = [e for e in data['elements'] if e['type'] == 'Image']
pil_img = convert_image_element(image_elts[0]) jpg_bytes = convert_image_element(image_elts[1], format='JPEG') png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
</Accordion>
<Accordion title="Return Value">
Either a PIL `Image` object, bytes of an image, or a base64-encoded image as a `str`.
</Accordion>
<Accordion title="Exceptions">
- ```ValueError``` - "b64encode was True but format was PIL. Cannot b64-encode a PIL Image".
- Fix: If you're calling the function with a PIL image, please set b64encode to False.
</Accordion>
</AccordionGroup>
### draw_with_boxes
Create a list of images from the provided PDF, one for each page, with bounding boxes detected by the partitioner drawn on.
<AccordionGroup>
<Accordion title="Parameters">
- `pdf_file`: Required. A PDF file opened in binary mode or a path to a PDF file expressed as a `string` or a `PathLike` object upon which to draw.
-`partitioning_data`: Required. The output from `partition_file`.
- `draw_table_cells`: A boolean that when `True`, makes the function draw individually detected cells of tables. When `False`, the bounding boxes of table cells are not drawn but the outer bounding boxes of tables and the bounding boxes of all other elements are still drawn. Default is False.
</Accordion>
<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file, draw_with_boxes
with open("my-favorite-pdf.pdf", "rb") as f:
data = partition_file(
f,
aryn_api_key="MY-ARYN-API-KEY",
text_mode="standard_ocr",
extract_table_structure=True,
extract_images=True
)
pages = draw_with_boxes("my-favorite-pdf.pdf", data, draw_table_cells=True)
Create a pandas
DataFrame representing the tabular data inside the provided table element. If the element is not of type table
or doesn't contain any table data, return None
instead.
with open("partition-me.pdf", "rb") as f: data = partition_file( f, text_mode="standard_ocr", extract_table_structure=True, extract_images=True )
df = None for element in data['elements']: if element['type'] == 'table': df = table_elem_to_dataframe(element) break
</Accordion>
<Accordion title="Return Value">
A Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, returns None instead.
</Accordion>
</AccordionGroup>
### tables_to_pandas
For every table element in the provided partitioning response, create a pandas DataFrame representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding DataFrames.
<AccordionGroup>
<Accordion title="Parameters">
- `data`: A response from `partition_file`
</Accordion>
<Accordion title="Example">
```python
from aryn_sdk.partition import partition_file, tables_to_pandas
with open("my-favorite-pdf.pdf", "rb") as f:
data = partition_file(
f,
aryn_api_key="MY-ARYN-API-KEY",
text_mode="standard_ocr",
extract_table_structure=True,
extract_images=True
)
elts_and_dataframes = tables_to_pandas(data)
A list of tuples, where each tuple contains an element from the 'elements' field of a partition_file
response and a Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, the DataFrame will be None
.