-
-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Custom serializers #143
Custom serializers #143
Conversation
Codecov Report
@@ Coverage Diff @@
## master #143 +/- ##
=========================================
- Coverage 98.74% 97.15% -1.6%
=========================================
Files 8 8
Lines 638 667 +29
Branches 90 95 +5
=========================================
+ Hits 630 648 +18
- Misses 6 12 +6
- Partials 2 7 +5 |
Codecov Report
@@ Coverage Diff @@
## master #143 +/- ##
=========================================
- Coverage 98.74% 98.05% -0.7%
=========================================
Files 8 8
Lines 638 667 +29
Branches 90 95 +5
=========================================
+ Hits 630 654 +24
- Misses 6 10 +4
- Partials 2 3 +1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, will need history and docs and a few tweaks suggested below.
arq/connections.py
Outdated
def __init__( | ||
self, | ||
pool_or_conn, | ||
_serialize: Optional[Callable[[Any], bytes]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_serialize: Optional[Callable[[Any], bytes]] = None, | |
_job_serializer: Optional[Callable[[Any], bytes]] = None, |
Might be even clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and same obviously for _deserialize
-> _job_deserializer
.
arq/jobs.py
Outdated
pass | ||
|
||
|
||
def pickle_job( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's rename unpickle_job
to deserialize_job
, better to get it right now.
Change should be pretty explicit and easy to fix.
arq/jobs.py
Outdated
pass | ||
|
||
|
||
def pickle_job(function_name: str, args: tuple, kwargs: dict, job_try: int, enqueue_time_ms: int): | ||
class PickleError(SerializationError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and please update HISTORY.rst
.
Otherwise LGTM.
arq/connections.py
Outdated
self, | ||
pool_or_conn, | ||
_job_serializer: Optional[Callable[[Any], bytes]] = None, | ||
_job_deserializer: Optional[Callable[[bytes], Any]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_job_deserializer: Optional[Callable[[bytes], Any]] = None, | |
_job_deserializer: Optional[Callable[[bytes], Dict[str, Any]] = None, |
Or am I missing something?
arq/connections.py
Outdated
def __init__( | ||
self, | ||
pool_or_conn, | ||
_job_serializer: Optional[Callable[[Any], bytes]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_job_serializer: Optional[Callable[[Any], bytes]] = None, | |
_job_serializer: Optional[Callable[[Dict[str, Any]], bytes]] = None, |
I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and obviously elsewhere where it's used, might be worth having these types defined once, eg. in jobs.py
?
arq/jobs.py
Outdated
job_try: int, | ||
enqueue_time_ms: int, | ||
*, | ||
_serialize: Optional[Callable[[Any], bytes]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_serialize: Optional[Callable[[Any], bytes]] = None, | |
serialize: Optional[Callable[[Any], bytes]] = None, |
since this is explicitly required as a keyword arg, I don't think you need to make it "private".
694116f
to
084bdac
Compare
084bdac
to
5254e52
Compare
@samuelcolvin I addressed your suggestions. You said that I don't think the HISTORY.rst change belongs to this PR and you should do it in a separate commit, also because I don't know when you will release the next version (even though I hope very soon). |
The way I work every commit to code has an entry in Please add the following to history
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise looking good.
arq/connections.py
Outdated
@@ -141,29 +164,33 @@ class ArqRedis(Redis): | |||
password=settings.password, | |||
timeout=settings.conn_timeout, | |||
encoding='utf8', | |||
commands_factory=ArqRedis, | |||
commands_factory=functools.partial( | |||
ArqRedis, _job_serializer=job_serializer, _job_deserializer=job_deserializer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArqRedis, _job_serializer=job_serializer, _job_deserializer=job_deserializer | |
ArqRedis, job_serializer=job_serializer, job_deserializer=job_deserializer |
I think since the change?
tests/test_worker.py
Outdated
worker: Worker = worker(functions=[foobar]) | ||
with pytest.raises(SerializationError) as exc_info: | ||
await worker.main() | ||
assert exc_info.value.startswith('unable to deserialize job: ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert exc_info.value.startswith('unable to deserialize job: ') | |
assert exc_info.value.startswith('unable to deserialize job: ') |
this is indented wrongly I think.
tests/test_worker.py
Outdated
) | ||
with pytest.raises(SerializationError) as exc_info: | ||
await worker.main() | ||
assert exc_info.value.startswith('unable to deserialize job: ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert exc_info.value.startswith('unable to deserialize job: ') | |
assert exc_info.value.startswith('unable to deserialize job: ') |
@samuelcolvin Done. Then after you merge this PR I'll add it for the other one. |
@samuelcolvin I fixed the code as per your latest feedback. |
It seems some tests fail on Python 3.8, but I don't think the errors are related to this PR. |
Great, thanks. I'll deploy as soon as the other PR is passing and I can merge it. |
@samuelcolvin Thanks for taking the time to review it multiple times. |
Changes
I would like to propose this change, which allows the user to specify custom serializers in alternative to pickle, which is left as the default.
The public API is changed only very slightly: there are new
_serialize
and_deserialize
arguments tocreate_pool
andWorker
. The rest of the changes essentially reduces to passing these to the pickle functions.For backward compatibility, I have left the
pickle_jobs
, etc. functions inarq.jobs
with the same name, even though the name is slightly inappropriate now.Rationale
The main reason I am proposing this change is to allow better memory usage. The default serialization method, pickle, is quite space inefficient. Since the job data is essentially JSON, we can attain much better memory usage by switching to serialization methods that are more appropriate.
In my case, by using MsgPack instead of Pickle, I am seeing an improvement of 47%-49% in memory usage which is quite significant. At tens to hundreds of thousands of tasks this means hundreds of MBs if not GBs in savings.
Usage
The usage is quite simple. Existing code works as usual, but one can use a different serialization method as follows:
Tests and documentation
The tests pass, except two tests that also fail on master. If this change is approved, I can write a paragraph in the documentation on custom serialization functions.