Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DEFAULT_SESSION use be thread-safe #228

Closed
wants to merge 1 commit into from

Conversation

ruhan
Copy link

@ruhan ruhan commented Apr 8, 2021

I have an implementation that uses threads to distribute calls to S3 resources. When using several threads at once, lots of errors emerged.

I figured out that it was because of this:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/session.html#multithreading-or-multiprocessing-with-sessions

Meaning that Sessions can't be reused by different threads.

To solve this what I did was to change the way we use the global variable DEFAULT_SESSION to make it possible to have one default session for each thread.

I have an implementation that uses threads to distribute
calls to S3 resources. When using several threads at once,
lots of errors emerged.

I figured out that it was because of this:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/session.html#multithreading-or-multiprocessing-with-sessions

Meaning that Sessions can't be reused by different threads.

To solve this what I did was to change the way we use the global
variable DEFAULT_SESSION to make it possible to have one default
session for each thread.
@sadraque4k
Copy link

+1 for this one!

@terricain
Copy link
Owner

Considering this is an async project, is there a specific reason for using aioboto3 in multiple threads over multiple tasks in the event loop, or using boto3 in threads?

I don't at the moment see the use case.

@ruhan
Copy link
Author

ruhan commented Apr 14, 2021

Hey @terrycain ,

Thanks for your answer.

Let me try to explain my case. Suppose you have a task that is both: IO-bound and CPU-bound at the same time. For example, a task that gets lots of data from S3, organizes it in a table and executes a classification algorithm that uses some vector operations on all this data. In a kind of pyseudocode it could be like:

async def task(data_chunk):
    data = await get_data_from_s3_io_bound()
    return execute_complex_vector_operation_cpu_bound(data)

Now, think that you have lots of tasks like this to execute and that they can be carried on in parallel.

In this case, you are going to have something like this in your pyseudocode:

import concurrent

def execute_tasks_in_parallel():
    chunks = range(1000)
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        tasks = [ pool.submit(task, chunk) for chunk  in chunks ]

        for task_completed in concurrent.futures.as_completed(tasks):
            do_something_with_task_result(task_completed.result())

Thus, we are going to have different threads trying to access the same DEFAULT_SESSION variable under the hood, and it crashes the execution sometimes, concluding as a scenario where we need a different session for each thread, otherwise, it's going to crash.

I don't know if anyone else is experiencing this something like this, because it can be too specific, but I think thread-safety characteristic is something this library should have as its property.

Feel free to discord at any point or even to clarify whatever you want to, I may be missing some points here because it is the first time I work with aioboto3.

Thanks for the patience in analyzing this PR.

@ruhan
Copy link
Author

ruhan commented May 19, 2021

Hey @terrycain , can you take a look at my answer here?

@terricain
Copy link
Owner

Ok so these changes have the capability to store infinite copies of the session, especially long after a thread has ended which isn't great for memory usage.

Is there anything stopping you from creating a Session() object yourself and using that instead of relying on the global one.

One of the things I've been contemplating recently is actually dropping the entire global session as it can present problems for frameworks that stop the event loop and create new ones (e.g. aws chalice)

@ruhan
Copy link
Author

ruhan commented Jun 4, 2021 via email

terricain added a commit that referenced this pull request Jun 27, 2021
@terricain terricain closed this in fcc7321 Jun 27, 2021
@terricain
Copy link
Owner

9.0.0 is out, there is no default session anymore.

@ruhan
Copy link
Author

ruhan commented Jun 27, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants