Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API for performantly bulk-creating pages programatically #11480

Draft
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

NXPY123
Copy link
Contributor

@NXPY123 NXPY123 commented Jan 19, 2024

Fixes #10833 adds an API for bulk creating children.

  • Do the tests still pass?[^1]
  • Does the code comply with the style guide?
    • Run make lint from the Wagtail root.
  • For Python changes: Have you added tests to cover the new/fixed behaviour?
  • For new features: Has the documentation been updated accordingly?

Copy link

squash-labs bot commented Jan 19, 2024

Manage this branch in Squash

Test this branch here: https://nxpy123feature10833-api-bulk-c-y7omf.squash.io

@NXPY123
Copy link
Contributor Author

NXPY123 commented Jan 20, 2024

@ababic does slug conflict matter at different depths of the tree? For example, if there's an object with the pathBread/bagel and we try to add another with the path Bread/FrenchBread/bagel should this be considered a conflict?

@NXPY123
Copy link
Contributor Author

NXPY123 commented Jan 20, 2024

I understood your comments earlier but I have a few doubts regarding them:

  1. Is there any benefits to making a set of the existing slugs instead of filtering out matches with the new slugs?
  2. The data being passed is a recursive structure where the children can have their own children. I'm assuming slug conflict doesn't matter when they don't share the same parent so I created a recursive function that searches each level and makes srue there's no conflict and auto generates slugs based on slugs at each level. Is this correct or do I have to check the combined the original combined slug list everytime?

@ababic
Copy link
Contributor

ababic commented Jan 20, 2024

Hi @NXPY123.

  1. Out of all of the common container types in Python, set is significantly faster than anything else when it comes to membership tests. So, if that's what you intend to use it for, why wouldn't you?
  2. Slugs only have to be unique between siblings under the same parent. Personally I think that should be all we should be going for here... Making it recursive is a nice idea, but I would argue that use cases for that are quite rare, so the added code complexity is probably not worth it (Especially while we're still trying to get the basics right)

if kwargs.get("clean", True):
page.full_clean()

page_post_details["slug_changed"] = False
Copy link
Contributor

@ababic ababic Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bulk_create() should only be used to create new items, so this shouldn't be needed (We shouldn't ever modify supplied slugs either)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so sorry the CustomManager was something I tested out earlier. It didn't function and I forgot to remove it before commiting. Since it was recommended that we don't have to replicate everything, I didn't look further into fixing it. The only change I intended was the bulk_add_children and _check_unique

page_post_details["slug_changed"] = False
page_post_details["is_new"] = page.id is None

if page_post_details["is_new"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely this would be all objects? bulk_update() should be uses for updates

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 1, 2024

Can you please check the _check_unique() and bulk_add_children() methods in models/__init__.py ?

@ababic
Copy link
Contributor

ababic commented Feb 1, 2024

@NXPY123 I'm not sure this is a direction that we can continue with unfortunately (building on top of load_bulk()). While it might work for creating vanilla Page objects from dicts, that isn't very useful in practice. As explained here, I think we need to take a separate approach that gives us control over the page types used. I think something more akin to Django's bulk_create() (perhaps just requiring a new parent_page argument) would be much more useful

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 7, 2024

I've made changes to use bulk_create instead in 5be4dff. The page does seem to be made but I think the page is not set as the child page of the parent. Do we have to call add_child on the parent to be able to do this or is there any other way?

@ababic
Copy link
Contributor

ababic commented Feb 7, 2024

Hi @NXPY123. Thanks for sticking with it. Yes, it definitely won't work as it is. We need to replicate what add_child() does in terms of setting field values on the new child pages involved, and updating numchild on the parent page.

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 10, 2024

In the django-treebeard docs they suggest not changing the path, depth, and num_child values directly and using one of the methods instead. Is this something I should take into consideration?

@ababic
Copy link
Contributor

ababic commented Feb 10, 2024

@NXPY123 it's good to be cautious. I expect the reason the docs say that is to keep you using the documented APIs, because what happens underneath is quite complicated. However, I don't see how using those methods can work in a bulk creation context, because each use involves at least one write to the database (one of the key aims of this feature is to avoid that).

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 11, 2024

In the django-treebeard code, when the children have a sorted order set in the tree, they use this to find the location to add the child and get the path:

if self.pos == 'sorted-sibling':
            siblings = self.node.get_sorted_pos_queryset(
                self.node.get_siblings(), newobj)
            try:
                newpos = siblings.all()[0]._get_lastpos_in_path()
            except IndexError:
                newpos = None
            if newpos is None:
                self.pos = 'last-sibling'
        else:
            newpos, siblings = None, []

        _, newpath = self.reorder_nodes_before_add_or_move(
            self.pos, newpos, self.node.depth, self.node, siblings, None,
            False)
            

I'm kinda stuck trying to figure out how to get newpath for each child when we're bulk adding them since we're not committing and updating the query set each time we add a child so there is a possibility of conflict in the newpath we obtain for the children as the function will give out the same newpath for new children that are determined to have the same position in the queryset. I think converting the queryset to a list would help append the children as we create them but get_sorted_pos_queryset expects a queryset to be passed and making it into a list might cause issues.

@ababic
Copy link
Contributor

ababic commented Feb 11, 2024

Hi @NXPY123. I think we can ignore the 'sorted-sibling' case in your code sample (
we're only interested in the 'adding children to a parent' case), which should simplify things somewhat. We just need to figure out how to generate path values for the new pages, in the order they are supplied. You might need to do a single query to identify the current maximum path for children at the same location.

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 11, 2024

Can you please check the changes I've made? I haven't checked the code when the children are ordered, only the case where it's unordered.

node.path = self.__class__._get_path(self.path, node.depth, step)
step += 1
if len(node.path) > max_length:
raise ValidationError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the handling of this. This message is nice and helpful!

Copy link
Contributor

@ababic ababic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along nicely. In addition to the points already raised, it would be good to start thinking about:

  • Adding tests
  • Moving the logic to the Manager class instead of implementing at the Model level

self.numchild += 1
node._cached_parent_obj = self

self.save()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be slightly clearer overall to keep this as a 'preparation' step, and leave saving up to a different method. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a great suggestion. It'll help separate the preparation step and the saving step. Since we're loading the children to the db and saving changes to the parent page I think we can keep them in a separate method.

the page to the database using their inbuilt tree functionality.
"""

if self.node_order_by and not self.is_leaf():
Copy link
Contributor

@ababic ababic Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to remove this case for page types, since node_order_by is not set by default, and overriding it isn't supported by wagtail

slugs.append(candidate_slug)

if child.locale_id is None:
child.locale = self.get_default_locale()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Localised content tends to live in its own part of the tree, so inheriting the parent page's locale might be a better option here

bulk_add_children can only be used to add new pages, not to update existing pages"
)
# Get the list of existing children to check for duplicate slugs
existing_pages = self.get_children()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this value isn't used for anything else, how about just pulling the slugs out of the database in _check_unique()? Also, If slug values are all we are interested in, we should probably use values_list() with flat=True, so that we're only fetching what we need.

@@ -3468,12 +3627,14 @@ def locked_for_user(self, obj, user):

def user_can_lock(self, obj, user):
"""Returns True if a user who would not normally be able to lock the object should be able to if the object is currently on this task.
Note that returning False does not remove permissions from users who would otherwise have them."""
Note that returning False does not remove permissions from users who would otherwise have them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to keep unrelated changes out of the PR if possible. It may help to disable any editor-level auto formatting when working on different projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry about that. I'm not sure why this happened. I only used Ruff and Black for formatting because the checks usually fail when it's not formatted properly. Could it be because of that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The black formatter is reformatting the file and causing this. But if I don't reformat the file won't the checks fail?

"Duplicate slugs in use within the parent page at '%(parent_url_path)s'"
)
% {
"parent_url_path": self.path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be self.url_path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Thank you for catching that! I got confused between path and url_path.

wagtail/models/__init__.py Outdated Show resolved Hide resolved
wagtail/models/__init__.py Outdated Show resolved Hide resolved
Load the pages into the database and save the parent page in the database
"""

pages = Page.objects.bulk_create(children)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Page.objects.bulk_create() here suggests we only want to create generic Page objects, but that is not the case is it? The type of each child should determine which manager is used here.

@ababic
Copy link
Contributor

ababic commented Feb 23, 2024

Hi @NXPY123, I've left a little more feedback. Please continue with moving the changes to the Manager and adding tests. Things tend to change slightly during that process, so it will be better for me to review again once everything is finished.

Please also consider marking this PR as a draft until you are finished, as it helps other contributors to know that it's with you, and you aren't waiting on anyone else.

# Add the slugs of the current children
slugs.extend(self.get_children().values_list("slug", flat=True))

if len(slugs) != len(set(slugs)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're converting the list to a set here, why don't we save a reference to that set and use it in the code below, instead of the list? Remember: Sets are difficult to beat when it comes to membership checks

@NXPY123 NXPY123 marked this pull request as draft February 24, 2024 07:45
@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 27, 2024

I've added tests. When I went through the bulk_create docs they mentioned a few caveats:

  • The model’s save() method will not be called, and the pre_save and post_save signals will not be sent.
  • It does not work with child models in a multi-table inheritance scenario.
  • If the model’s primary key is an AutoField and ignore_conflicts is False, the primary key attribute can only be retrieved on certain databases (currently PostgreSQL, MariaDB, and SQLite 3.35+). On other databases, it will not be set.
  • It does not work with many-to-many relationships.
  • It casts objs to a list, which fully evaluates objs if it’s a generator. The cast allows inspecting all objects so that any objects with a manually set primary key can be inserted first.

Will these cause issues when implementing the functionality?

@ababic
Copy link
Contributor

ababic commented Feb 27, 2024

@NXPY123 Ahh, that second point sounds like it could be problematic. Honestly, I find that quite surprising with how advanced the API is in other areas. Maybe it just hasn't been in-demand enough.

@ababic
Copy link
Contributor

ababic commented Feb 27, 2024

Hi @NXPY123. Just to put you on the right track, I was imagining that this would be baked into the existing BasePageManager class, rather than inventing a new concept. Ideally, we would override the default bulk_create() implementation, so that it is available from the manager of any page subclass. For example:

SimplePage.objects.bulk_create(simplepage_list, parent_page)

It sounds like we might be blocked on being able to utilise Django's default implementation, but that doesn't necessarily need to be the end of the story... I would suspect there is a workable solution out there in the wild somewhere.

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 28, 2024

Hi @NXPY123. Just to put you on the right track, I was imagining that this would be baked into the existing BasePageManager class, rather than inventing a new concept. Ideally, we would override the default bulk_create() implementation, so that it is available from the manager of any page subclass. For example:

SimplePage.objects.bulk_create(simplepage_list, parent_page)

It sounds like we might be blocked on being able to utilise Django's default implementation, but that doesn't necessarily need to be the end of the story... I would suspect there is a workable solution out there in the wild somewhere.

I thought about it but wouldn't the BulkManager inherit methods of the models.Manager class which we wouldn't require?

@NXPY123
Copy link
Contributor Author

NXPY123 commented Feb 28, 2024

@NXPY123 Ahh, that second point sounds like it could be problematic. Honestly, I find that quite surprising with how advanced the API is in other areas. Maybe it just hasn't been in-demand enough.

I found examples for workarounds here: https://stackoverflow.com/questions/49826482/django-valueerror-cant-bulk-create-a-multi-table-inherited-model

Can we use any of these?

@ababic
Copy link
Contributor

ababic commented Feb 28, 2024

I thought about it but wouldn't the BulkManager inherit methods of the models.Manager class which we wouldn't require?

We'll, yes, but that is really the point! 😀Bulk creation is typically done via a model's default Manager, so why should this be any different? In my view at least, the bulk creation logic isn't important enough to warrant invention of new concepts. I beleive developers would expect it to work as closely to the default bulk_create() implementation as possible (the only difference being that a parent page must be provided).

I found examples for workarounds here. Can we use any of these?

Your first step really needs to be writing test that fail due to this limitation. If you're not getting a clear error about multi-table-inheritance not being supported, then clearly things aren't working the way you're intending. Once you're at the point where you're sure that's where the issue is, you'll be in a good place to experiment and see if you can get past it (we're not quite there yet, though).

@NXPY123
Copy link
Contributor Author

NXPY123 commented Mar 1, 2024

I've put the method in the PageManager and set it as the default Manager for Page class. Is this fine?

@NXPY123 NXPY123 marked this pull request as ready for review March 7, 2024 08:30
@NXPY123 NXPY123 marked this pull request as draft April 4, 2024 05:10
@NXPY123
Copy link
Contributor Author

NXPY123 commented Apr 4, 2024

@ababic sorry I couldn't work on this for a while. Are the changes made fine? Is there anything to add to the docs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An API for performantly bulk-creating pages programatically
3 participants