Replace ProcessPoolExecutor with asyncio page splitting #92


Merged
17 commits merged into main from mike/async-page-splitting on May 30, 2024

Conversation

micmarty-deepsense
Contributor

@micmarty-deepsense micmarty-deepsense commented May 23, 2024

  • update the documentation
  • use aiohttp/httpx instead of the synchronous requests library
  • add rate limiting or another way of controlling concurrency
  • remove all references to threads/processes

How to verify that this PR works

Unit & Integration Tests

make install && make test

Manually

make install
pip install --editable .
python -m timeit --repeat 10 --verbose "$(cat test-client.py)"

Where test-client.py has the following contents:

import os

import unstructured_client
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Confirm which installation of the package is being exercised
print(unstructured_client.__file__)

s = UnstructuredClient(
    api_key_auth=os.environ["UNS_API_KEY"],
    server_url="http://localhost:8000",
)

filename = "_sample_docs/layout-parser-paper.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=True,
    split_pdf_concurrency_level=1,
)

try:
    resp = s.general.partition(req)
except SDKError as e:
    raise SystemExit(e)

ids = [e.element_id for e in resp.elements]
print(ids)

Comment on lines 260 to 261
loop = asyncio.get_event_loop()
responses = loop.run_until_complete(asyncio.gather(*prt_requests))


I don't understand how this hook flow works, but I feel like you should start the event loop when you define the tasks, and here just wait for them to complete.

Contributor Author

@micmarty-deepsense micmarty-deepsense May 28, 2024


I was considering that, but as we discussed last week, in such a scenario before_request would also need to be an async method, which can't be done: Speakeasy requires hooks to be regular (synchronous) methods.

@micmarty-deepsense micmarty-deepsense force-pushed the mike/async-page-splitting branch from 250c28f to c85a2d3 Compare May 23, 2024 23:21
@micmarty-deepsense micmarty-deepsense force-pushed the mike/async-page-splitting branch from 668c9c7 to 976b238 Compare May 24, 2024 10:50
@micmarty-deepsense micmarty-deepsense changed the title [WIP] Replace ProcessPoolExecutor with asyncio page splitting Replace ProcessPoolExecutor with asyncio page splitting May 24, 2024

@pytest.mark.parametrize("split_pdf", [True, False])
@pytest.mark.parametrize("error_code", [500, 403])
def test_partition_handling_server_error(error_code, split_pdf, monkeypatch, doc_path):
Contributor Author


https://github.com/Unstructured-IO/unstructured-python-client/actions/runs/9225821357/job/25384186276?pr=92

I managed to pass all tests except for this one. @badGarnet, if you're able to identify the source of the problem with a quick glance, let me know. I haven't managed to figure it out myself in time.

Contributor Author


Now this test is passing; I just had to mock the httpx response 😅

gen.yaml Outdated
Comment on lines 19 to 20
httpx: ">=0.27.0"
aiolimiter: ">=1.1.0"
Contributor Author


Two new dependencies: httpx is BSD-3-licensed, aiolimiter is MIT-licensed.

Contributor Author


No longer being used; resolving.

@@ -135,43 +137,93 @@ def before_request(
fallback_value=DEFAULT_CONCURRENCY_LEVEL,
max_allowed=MAX_CONCURRENCY_LEVEL,
)
limiter = AsyncLimiter(max_rate=concurrency_level, time_period=1)
Contributor Author


So by default it's 15 requests per second.

Contributor


Don't we want to set a hard limit on it? If the limit is per second, then we could end up with more than 15 pending requests with this approach.
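For comparison, a hard cap on in-flight requests can be sketched with the stdlib `asyncio.Semaphore`, which bounds simultaneous requests regardless of elapsed time. The values and names below are illustrative only; this is not the code in the PR.

```python
import asyncio

async def run_requests(total: int, cap: int) -> int:
    sem = asyncio.Semaphore(cap)  # hard limit on simultaneous requests
    active = 0
    peak = 0

    async def send_one(i: int) -> None:
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the real page request
            active -= 1

    await asyncio.gather(*(send_one(i) for i in range(total)))
    return peak  # never exceeds cap

print(asyncio.run(run_requests(total=60, cap=15)))
```

Unlike a rate limiter, the semaphore guarantees the number of concurrently pending requests never exceeds the cap, even if individual requests are slow.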

Comment on lines +161 to +164
response = requests.Response()
response.status_code = status_code
response._content = json.dumps(json_response).encode() # pylint: disable=W0212
response.headers["Content-Type"] = "application/json"
return response
Contributor Author

@micmarty-deepsense micmarty-deepsense May 24, 2024


I wish Speakeasy supported async client generation...
They rely on the synchronous requests library, forcing me to do the conversion here.

@micmarty-deepsense micmarty-deepsense force-pushed the mike/async-page-splitting branch from 74613b8 to 473560e Compare May 28, 2024 14:11
Contributor

@mpolomdeepsense mpolomdeepsense left a comment


LGTM 👍

page_number = page_index + starting_page_number
# Check if this page is the last one
print(f"Page {page_number} of {all_pages_number}")
if page_index == all_pages_number - 1:
Contributor Author


@ds-filipknefel you mentioned that there's a bug on the main branch. I'll incorporate your fix on Friday.

Collaborator

@badGarnet badGarnet left a comment


Blocking for now, as scale testing reveals some potential issues when using this code:

  • the Locust base setup can't run the client with page splitting: it raises the error "There is no current event loop in thread 'Dummy-2'". Investigating this more at the moment.

Collaborator

@badGarnet badGarnet left a comment


The issue with Locust was due to nesting of async event loops; it's not a problem for the user experience.
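A minimal, generic illustration of the failure mode and its fix, independent of Locust or the SDK: in a non-main thread there is no current event loop, so one has to be created explicitly instead of relying on `asyncio.get_event_loop()`.

```python
import asyncio
import threading

results: list[str] = []

async def work() -> str:
    await asyncio.sleep(0)
    return "done"

def run_in_worker_thread() -> None:
    # asyncio.get_event_loop() would fail here with
    # "There is no current event loop in thread ...", so create one.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results.append(loop.run_until_complete(work()))
    finally:
        loop.close()

t = threading.Thread(target=run_in_worker_thread, name="Dummy-2")
t.start()
t.join()
print(results)  # ['done']
```

Each worker thread owning its own loop avoids both the missing-loop error and the nested-loop problem described above.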

Update readme to highlight how to use the new page splitting logic safely.
@badGarnet badGarnet merged commit cb94407 into main May 30, 2024
@badGarnet badGarnet deleted the mike/async-page-splitting branch May 30, 2024 22:01