Skip to content

Add pagination and bulk update support for User(discord user) serializer.#378

Merged
MarkKoz merged 42 commits into
python-discord:masterfrom
RohanJnr:user_endpoint
Oct 8, 2020
Merged

Add pagination and bulk update support for User(discord user) serializer.#378
MarkKoz merged 42 commits into
python-discord:masterfrom
RohanJnr:user_endpoint

Conversation

@RohanJnr
Copy link
Copy Markdown
Contributor

@RohanJnr RohanJnr commented Aug 26, 2020

closes #375

Implementation

Pagination

  • Using built in PageNumberPagination with the default page size being 10,000 user objects(dicts) per page.
  • 2 query parameteres to control page number and page size.

Bulk Update

  • New endpoint: /bot/users/bulk_patch
  • A UserListSerializer for UserSerializer has been implemented to support bulk updates.(ref to docs: drf-Customizing multiple update)

Additional

  • Replace restframework_bulk package by implementing bulk creation of User models.

Tests for new features.

  • complete

Pagination is done via PageNumberPagination, i.e, each page contains a specific number of `user` objects.
implemented a method to handle bulk updates on user model via a new endpoint: /bot/users/bulk_patch
@RohanJnr RohanJnr requested a review from a team as a code owner August 26, 2020 17:39
@ghost ghost added the needs 2 approvals label Aug 26, 2020
@ghost
Copy link
Copy Markdown

ghost commented Aug 26, 2020

Thank you for contributing to Python Discord!

Please check out the following documents:

@lemonsaurus
Copy link
Copy Markdown
Contributor

Okay, before I really jump into this review, I need you to take a look at the BulkSerializerMixin and figure out whether we still need it. It looks like it was implemented specifically to solve bulk input, and we even have a dependency just for that. So I'd like you to justify your approach over that approach, and consider whether this PR should remove the dependency that includes the BulkSerializerMixin.

@lemonsaurus
Copy link
Copy Markdown
Contributor

To be clear, I'm not saying you should necessarily be using BulkSerializerMixin instead, but you need to at least figure out what it does and how it does it so we can make an informed decision about whether to keep it or get rid of it.

@RohanJnr
Copy link
Copy Markdown
Contributor Author

I managed to simplify the code in the bulk update PR, I no longer need to override the to_internal_value function, I just had to define an id field explicitly as mentioned in the docs(which I missed before).

Coming to the drf-bulk package, we are only using it for multiple creations, this can be easily achieved by overriding the create method in ModelViewSet as mentioned in the docs(I tested it and works fine, takes around 7-10 lines of code).

So, I would say that we can get rid of drf-bulk(by this, I mean BulkSerailizerMixin and BulkCreateModelMixin which we are using) package and implement bulk features ourselves which is fairly easy.

@lemonsaurus
Copy link
Copy Markdown
Contributor

Okay @RohanJnr - Then I think you should get rid of the bulk plugin.

…mplify UserListSerializer

`to_internal_value()` function is not longer overriden in UserListSerializer, this is due to explicitly stating the `id` field in UserSerializer as mentioned in the documentation.
Override `create()` method in UserListSerializer and override `get_serializer()` method in `UserViewSet` to support bulk creation.
@RohanJnr
Copy link
Copy Markdown
Contributor Author

This PR is ready for review.

the Pipfile.lock conflict was resolved by re-locking the pipfile.
@RohanJnr
Copy link
Copy Markdown
Contributor Author

The bulk update method needs optimization. It calls the database on every instance that needs to be updated.
Currently working on a fix.

…thod

The Model.objects.bulk_update() method greatly reduces the number of SQL queries by updating all required instances in 1 query.
@RohanJnr
Copy link
Copy Markdown
Contributor Author

here are some rough benchmarks of the API changes coming in this PR
Creating 1000 users(bulk).
0.39 seconds

Creating 10000 users(bulk).
3.14 seconds

Updating 1000 users all in 1 SQL query.
0.74 seconds

Updating 10000 users all in 1 SQL query.
14.67 seconds

Updating 10000 users by updating 1000 users in 1 SQL query.
7.12 seconds

Updating 10000 users by updating 2000 users in 1 SQL query.
6.78 seconds

Looks like 2000 users in 1 SQL query is the sweet spot.

@MrHemlock MrHemlock self-requested a review September 22, 2020 17:40
Copy link
Copy Markdown
Member

@MrHemlock MrHemlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything seems to be in order.

@ghost ghost added the needs 1 approval label Sep 23, 2020
@RohanJnr RohanJnr requested a review from MarkKoz October 7, 2020 04:29
Copy link
Copy Markdown
Contributor

@MarkKoz MarkKoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for so many incremental reviews. I've finally done a comprehensive review and test of the API. Along with my request changes, here are the issues I encountered:

  • Posting a single user (i.e. no bulk_create) that already exists results in a 500 response because of an IntegrityError. This should be re-raises as 400 error instead.

  • Posting multiple users which exist with bulk_create works but the response still includes objects for users which were duplicates. It's just a copy of the data given in the request rather than what's in the database.

    For example, if you try to set in_guild from false to true for an existing user, the response will return the object with in_guild: true even though it never changed and is still false in the DB.

    This isn't too important since we never check the return value, but if you know of an easy fix then you should go for it. In fact, you can respond with nothing since we don't rely on the response. This was profiled not too long ago, and it turns out that serialisation takes a significant amount of time. Granted, it won't have as much of a performance benefit at this scale (IIRC we benchmarked serialisation of all users in a single response).

  • bulk_patch returns duplicate objects in the response. More specifically, the amount of duplicates it returns depends on the number of fields that were given in the request. Again, not too important, but fix it if you can.

  • Checking for duplicate IDs in the request doesn't work because filtered_instances = queryset.filter(id__in=object_ids) effectively filters out duplicates before they ever reach the serialiser. This is also indicative of a hole in your tests; there should be a test that catches this issue.

    I suppose you could merge the duplicate check with the creation of [item["id"] for item in request.data] if you use a normal loop instead of a comprehension and use the same concept with a set as before.

"""
Update multiple User objects in a single request.

## Route
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This detailed endpoint info should be in the class docstring rather than in the method's docstring.

## Route
### PATCH /bot/users/bulk_patch
Update all users with the IDs.
`id` field is mandatory, rest are optional.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite accurate because it raises a validation error (400) if there are no fields besides the ID. Therefore, the ID and at least one other field are mandatory.

### POST /bot/users
Adds a single or multiple new users.
The roles attached to the user(s) must be roles known by the site.
User creation process will be skipped if user is already present in the database.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way you worded it is unclear in the sense that it may imply the entire operation will abort.

Suggested change
User creation process will be skipped if user is already present in the database.
Users that already exist in the database will be skipped.

Comment on lines +77 to +78
- page_size: Number of Users in one page.
- page: Page number
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- page_size: Number of Users in one page.
- page: Page number
- page_size: number of Users in one page
- page: page number

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also be noted these are optional parameters and that the default page size is 10k.

#### Status codes
- 201: returned on success
- 400: if one of the given roles does not exist, or one of the given fields is invalid
- 400: if multiple user objects with the same id are given.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- 400: if multiple user objects with the same id are given.
- 400: if multiple user objects with the same id are given

Comment on lines +210 to +212
- 200: returned on success.
- 400: if the request body was invalid, see response body for details.
- 404: if the user with the given id does not exist.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- 200: returned on success.
- 400: if the request body was invalid, see response body for details.
- 404: if the user with the given id does not exist.
- 200: returned on success
- 400: if the request body was invalid, see response body for details
- 404: if the user with the given id does not exist

except KeyError:
# user ID not provided in request body.
resp = {
"Error": "User ID not provided."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the rest of DRF, it should be named "detail".

Suggested change
"Error": "User ID not provided."
"detail": "The id field is missing from at least one object."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you can raise a validation error with

raise ValidationError({"id": ["This field is required."]})

Which may or may not be more idiomatic here. It's in a list cause that's how DRF normally formats such errors. I am leaning more towards the ValidationError approach.

Comment thread pydis_site/apps/api/serializers.py Outdated

if not fields_to_update:
# Raise ValidationError when only id field is given.
raise ValidationError({"data": "Insufficient data provided."})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, using "data" as a key name implies that there's something wrong with a field named data. If you need to raise ValidationErrors that aren't specific to a field, use the configured NON_FIELD_ERRORS_KEY as the key for consistency.

Furthermore, maybe the value should be in a list since this seems to be DRF convention. I think the idea is that there could be several errors for the same field/key, hence the list.

Comment thread pydis_site/apps/api/serializers.py Outdated
try:
user = instance_mapping[user_data["id"]]
except KeyError:
raise NotFound({"id": f"User with id {user_data['id']} not found."})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only ValidationErrors use the convention of the field name as the key. 404s raised by DRF elsewhere use detail as the key.

Suggested change
raise NotFound({"id": f"User with id {user_data['id']} not found."})
raise NotFound({"detail": f"User with id {user_data['id']} not found."})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding this comment for posterity: the reason it's okay to raise a 404 here and risk failing the entire request is because the syncer shouldn't ever run into a race condition where it's trying to update a user that has been deleted (or not created yet). The former is impossible because it never deletes users; it only sets in_guild as false. The latter is impossible because if it sees the user doesn't exist in the DB, it will try to create rather than update.

Comment on lines +194 to +205
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool
... },
... {
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dict items should be indented.

Suggested change
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool
... },
... {
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool
... },
... {
... 'id': int,
... 'name': str,
... 'discriminator': int,
... 'roles': List[int],
... 'in_guild': bool

@ghost ghost added the s: waiting for author Waiting for author to address a review or respond to a comment label Oct 8, 2020
@ghost ghost removed the s: waiting for author Waiting for author to address a review or respond to a comment label Oct 8, 2020
@RohanJnr RohanJnr requested a review from MarkKoz October 8, 2020 09:37
Comment thread pydis_site/apps/api/serializers.py Outdated
user = instance_mapping[user_data["id"]]
except KeyError:
raise NotFound({"id": f"User with id {user_data['id']} not found."})
raise NotFound({"detail": [f"User with id {user_data['id']} not found."]})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a list is a ValidationError convention only

Suggested change
raise NotFound({"detail": [f"User with id {user_data['id']} not found."]})
raise NotFound({"detail": f"User with id {user_data['id']} not found."})

Comment thread pydis_site/apps/api/serializers.py Outdated

return User.objects.bulk_create(new_users, ignore_conflicts=True)
users = User.objects.bulk_create(new_users, ignore_conflicts=True)
return User.objects.filter(id__in=[user.id for user in users])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You misunderstood what I meant by duplicates. I mean that it still returns objects which weren't created because the ID is already in the database. What you're trying to solve is removing duplicate objects in the response, which is impossible.

To fix this, you'd have to know which users are already in the DB before you do a bulk_create. This sounds relatively slow, so it may not be worth fixing (or return nothing) since we don't even read the response.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, I will return an empty list then.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, don't forget to document that.

Comment thread pydis_site/apps/api/serializers.py Outdated
for user_dict in validated_data:
if user_dict["id"] in seen:
raise ValidationError(
{"id": f"User with ID {user_dict['id']} given multiple times."}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"id": f"User with ID {user_dict['id']} given multiple times."}
{"id": [f"User with ID {user_dict['id']} given multiple times."]}

Comment on lines +213 to +228
object_ids = set()
for data in request.data:
try:
if data["id"] in object_ids:
# If request data contains users with same ID.
raise ValidationError(
{"id": [f"User with ID {data['id']} given multiple times."]}
)
except KeyError:
# If user ID not provided in request body.
raise ValidationError(
{"id": ["This field is required."]}
)
object_ids.add(data["id"])

filtered_instances = queryset.filter(id__in=object_ids)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's performing validation, I think it makes more sense for this code to be in the serialiser. In the hypothetical case that the serialiser needs to be used elsewhere, this code would have to be duplicated.

@ghost ghost added the s: waiting for author Waiting for author to address a review or respond to a comment label Oct 8, 2020
@MarkKoz
Copy link
Copy Markdown
Contributor

MarkKoz commented Oct 8, 2020

I also got this warning:

web_1       | /usr/local/lib/python3.7/site-packages/rest_framework/pagination.py:200: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'pydis_site.apps.api.models.bot.user.User'> QuerySet.
web_1       |   paginator = self.django_paginator_class(queryset, page_size)

…onsistent results with an unordered object_list: <class 'pydis_site.apps.api.models.bot.user.User'> QuerySet.
@ghost ghost removed the s: waiting for author Waiting for author to address a review or respond to a comment label Oct 8, 2020
@RohanJnr RohanJnr requested a review from MarkKoz October 8, 2020 18:36
Copy link
Copy Markdown
Contributor

@MarkKoz MarkKoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work. Thanks for putting up with my scrutiny.

@MarkKoz MarkKoz merged commit fcfb422 into python-discord:master Oct 8, 2020
@RohanJnr RohanJnr deleted the user_endpoint branch October 9, 2020 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: API Related to or causes API changes type: feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make the user endpoint more scalable

6 participants