Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New UI for listing object uses, including in RichText and StreamField. #4481

Closed
wants to merge 13 commits into from

Conversation

BertrandBordage
Copy link
Member

This pull request replaces #3961.

The heart of this pull request is the same, except that it was rebuild from current master with all the rich text editor API changes. It is split in 5 comprehensible commits. To avoid mistakes, I did all the rebasing work manually, by coding again everything except the new collectors.py file.

I took the decision to change a bit the rich text API for two reasons:

  • It was missing some critical features for this PR to work:
    • The link between rich text handlers and models
    • There was no unified way differentiating between front-end & editor HMTL generation. At first sight, it seemed to be expand_db_attributes for editor HTML & a top-level function for generation front-end HTML. However, Page link doesn’t follow this pattern, and I found some name mistakes in the tests due to the confusion introduced by this unclear architecture
  • As a site developer using custom editor features, I’m lost in the current API, so I thought it wouldn’t hurt to start introducing some unification with base classes such as LinkHandler, especially since we expect people to reference the link between their custom features and the related models, for the collectors introduced in this PR.

And of course, it still looks the same as in #3961. It might be improved, but that’s enough for now given that this is already an important enhancement compared to the current state where we have nothing except the usage count, and 500 errors on protected items deletion.

image

@gasman
Copy link
Collaborator

gasman commented Apr 20, 2018

Hi @BertrandBordage,
I'm afraid the rich text API changes here are taking things in the opposite direction to what I'd planned...

Prior to Wagtail 2.0, PageLinkHandler (and DocumentLinkHandler etc) was responsible for the conversions from database-HTML representation to front-end HTML and editor-HTML representation, and those things shared a lot of code (such as the expand_db_html method with a for_editor flag). However, the editor-HTML representation is specific to hallo.js, and so that whole code path is now semi-deprecated as of Wagtail 2.0. I made a start on splitting those up - hence the new top-level page_linktype_handler method for the front-end conversion - but I didn't manage to take that as far as I'd hoped for the 2.0 release. (Ideally, the code now residing in PageLinkHandler / DocumentLinkHandler etc would be moved somewhere under wagtail.admin.rich_text.converters.editor_html where we can forget about it forever... except inevitably it's not as easy as that, because wagtail.admin isn't 'allowed' to know about documents/images...)

I haven't looked at your subsequent commits yet, but I guess you need an object to attach the "find entities of this type" method to. PageLinkHandler probably looks like the most suitable candidate for that in the current codebase, but it really isn't :-) We should create a new kind of object to encapsulate the non-editor-specific logic of rich text entities: namely, the front-end HTML conversion and the new "find entities of this type" method. I'll have a ponder this weekend to see how we might perform some minimal refactoring, so that we can introduce this new object without making an even bigger mess :-)

@BertrandBordage
Copy link
Member Author

BertrandBordage commented Apr 23, 2018

@gasman: Indeed, I did this because I needed a system to fetch models linked to an embedtype or linktype and fetch the object from the stored HTML attribute. This system has to be created by inheriting classes, because the editors are extendable and devs might want to create new embedtypes or linktypes. I used the HTML representation simply because… It’s what’s stored in the database, on both editors. I don’t know if you plan to change the way rich text data is stored in the database in future releases, but I assumed it would not change because that would be a dramatic source of migration issues.

I also started bringing back elements together because either we want to create an extension for Hallo or Draftail, most things are the same: the related models, the way we fetch database objects from attributes, the way we select elements in the DOM, the name of the button, etc. To me, this is going exactly the right way because we want to make a registering system that devs can extend with a clear API. The day Hallo goes away, just one or two methods would have to be removed from the registering system. Today, sorry to say it roughly, but it’s messy and I had a hard figuring out lots of details, and some still don’t make sense to me, like the class name convention for lists and dicts like EditorHTMLEmbedConversionRule.

I didn’t put all Draftail linktype and embedtype functions & objects in that “unified” registering system to avoid doing too much at once: I just stuck to modifying what I absolutely needed.

And to conclude, I didn’t really have a choice: I couldn’t base my work on the Draftail extension classes because they don’t exist, and because data is stored as Hallo HTML, not as Draft.js JSON. I didn’t want to introduce a new registration as well, since we obviously don’t want a third system of registering extensions in addition of Hallo’s & Draftail’s ones. Devs would forget to use it in addition of the Draftail & Hallo systems, so the uses UI wouldn’t work well with custom rich text extensions.

So how should it work? We can also use it as is and improve/change it later when we drop Hallo. I’m perfectly willing to do so.

@gasman
Copy link
Collaborator

gasman commented Apr 24, 2018

Have now done some refactoring in https://github.com/gasman/wagtail/commits/cleanup/richtext-converter-refactor, so that the wagtail.*.rich_text modules now export one handler class which is the obvious correct one to extend here. Will try to rebase this PR on top of that.

I acknowledge that the rich text handling has some severe "bus factor" issues right now... there are some design decisions (e.g. my desire to keep the dbHTML -> frontendHTML and dbHTML <-> editorHTML logic separated) that came from being immersed in the rich text logic for several weeks, but might not be so obvious in the resulting code. We should certainly review (and document!) those decisions at some point... however, given that (I believe) it's a fairly tangential part of this PR, and I don't want to hold up the progress of this feature into 2.1, I think it's best to say "just trust me on this" for now :-)


@staticmethod
def get_model():
return Embed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's logically correct to treat EmbedHandler as being associated with the Embed model - the Embed model is more like a cache, and deleting records from that table has no consequence on rich text fields that reference that URL. (Looking at it more abstractly: there's no reason that a linktype/embedtype has to be associated with a model; it's just rewriting one kind of HTML tag into another.)

Happily, I don't think there's any negative consequence from simply returning None here instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this for more than just deleting. The idea is to have a way, through collectors, to find all references of a single object. This gives a lot of possibilities, including the possibility to create a task to clean the embeds from time to time. If a reference to an embed is removed, the object still sits there in the database, useless. On huge sites, that’ll be good to have a command to remove useless embeds from time to time, hence the system to find references for it in rich text & stream fields.

Indeed, linktype/embedtypes can do something without a model, but this is not the case in all the linktype/embedtypes provided by Wagtail.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're using it for garbage-collecting the embeds table, then calling get_embed from get_instance will do exactly the wrong thing... for any embed not found in the table, it will make a round-trip to the embed provider and create a new record.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but in practice, it will almost never happen. If you have a reference to an embed in a rich text, then an embed database object was created via AJAX when you added the embed to the rich text. If you have a reference to an embed in a StreamField, then you most likely used it in front-end, which triggered the creation of an embed object in the database.

So unless you created a script to create “ghost” embed references, all embed references will have a corresponding database object in real life cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair point! On a related note, I think we could optimise things further by having the Link/EmbedHandler classes implement a method to check "do these tag attributes correspond to one of these model instances", so that we're not uselessly instantiating every object we encounter in rich text (only to either throw it away or match it to an object we already have). But at that point I'm gift-horsing ;-)

params['val'] = (values[0] if len(values) == 1
else r'(%s)' % r'|'.join(values))
else:
pattern = r'<(a|embed) %(type)s[^>]*>'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see, this case exists to serve a "hidden feature" of find_objects where passing an empty object list returns all objects of any type referenced within rich text. What's the motivation for this feature? It seems out of place here: it's listing objects rather than finding them; it doesn't match the logical expected behaviour of find_object on an empty list; and it requires quite a bit of special-case code to make it work (and, in particular, conflicts with my earlier assertion that link/embed handlers don't have to be associated with a model)...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is indeed an unused feature for now. At first, I started with doing that wildcard search only if searched_objects is None.
If I remember well, I changed my mind to make a cleaner API using .find_objects(*searched_objects) instead of .find_objects(searched_objects=None). I’m sure I thought “if you search for one or more objects, it searches them, otherwise it searches for every object”. It must have been a brain fart, as I clearly forgot that sometimes we just pass a tuple or list directly without checking whether an item exists or not.

The good news is that it cannot have any impact on the views added, but that’s definitely worth changing. I think that we should just modify line 79 with if searched_objects is None and reverse the condition, then modify find_objects, for example by adding a find_all boolean that could only be True if there’s no searched_objects. Or add a _find_objects(searched_objects) method with all the logic from find_objects, and create two methods find_objects(*searched_objects) and find_all_objects() (that last one passing None to _find_objects).

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two separate methods for find_objects and find_all_objects definitely makes more sense to me - they're functionally very different things. I think get_pattern_for_objects and get_handlers can also be split up in the same way, to simplify the logic.

try:
return model._default_manager.get(id=attrs['id'])
except model.DoesNotExist:
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BertrandBordage Any objection to leaving the DoesNotExist exception uncaught here, to be handled by the caller? I think that makes it a bit more idiomatic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean put it in find_objects_in_rich_text? Yeah, that would be valid, I can’t remember a reason why it would be bad.

@gasman
Copy link
Collaborator

gasman commented Apr 27, 2018

Rebased at https://github.com/gasman/wagtail/tree/BertrandBordage-new-uses-ui-rebase with the proposed changes (not re-combining the frontendHTML and editorHTML handlers; adding separate methods for find_all_objects) and some additional cleanup (reorganising the removal of USAGE_COUNT so tests pass after every commit). I still need to finish reviewing, but hopefully the remaining bits should be the more straightforward ones (dealing with refactoring and points of integration is always more fiddly than new view code)...

@BertrandBordage
Copy link
Member Author

Great @gasman :)
So it’s almost there :D Thanks a lot for your review and the adjustment of the API! It’s better you did it, it’s faster than if I did another iteration 80% compliant with what you had in mind ;)

for field in self.fields]

def prepare_value(self, value):
if isinstance(value, Model):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reposting this comment as I made it against a commit on my own branch rather than this PR, and not sure if you got the notification, @BertrandBordage...)

Do we ever call find_objects with values that aren't model instances? If so, it looks like we're missing tests for these cases, and we should also consider renaming find_objects to find_values. Personally I'd prefer to remove this functionality if there's no immediate need for it, though - it seems like feature creep, and adds complexity that we could do without. (I'm not entirely clear on why " and | characters in strings need to be double-escaped, or why we can't replace most of this method with json.dumps, and if we can avoid going down that rabbit-hole, we should :-) )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should probably rename it to find_values. I wrote this as a framework to search any type of data in StreamFields, so you can for example search for references of an URL that no longer exists and replace them manually. I guess it requires additional tests, like you said.

There are multiple reasons why I didn't use json.dumps. The main reason was performance: it's faster this way and doesn't need to escape | in other cases than a string (although I realise now we should escape pk as well). I know, now that we use the Django collector with it, it's completely negligible.

We need to escape quotes because we want to search for "the \"value\"" instead of "the "value"" in the database cell. We escape the pipe because we search for multiple values at once like this: "a value"|"the other one"|"a Bash command using \|".

Yes both escapes should also be done for pks, as stated above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I'm all in favour of making the collector API broad enough to cover use cases beyond the core functionality as used by the object deletion feature - but as it stands, there's no documentation for this API, or even any samples of calling code that I can infer an API from. This means that as a reviewer it's impossible for me to verify the correctness of the code.

For example, it seems to me that StreamFieldCollector.find_values has a different concept of what a "value" is: for rich text blocks it's calling find_objects_in_rich_text, which will only ever return actual model objects, and therefore there's no way of using this method to find URLs within rich text. Is this a bug that directly impacts on the object deletion feature, or a bug in a currently-unused part of the collector API, or a not-yet-implemented part of the API, or something that's completely out of the scope of the API, or does it actually work perfectly thanks to some clever nuance of the code that I've missed? :-) There are numerous questions like this which I can only answer by trying to keep 1000+ lines of code in my head at once...

Ultimately I think the only way I can review the collector code in its current "open-ended" form is for it to be split up further into one commit for each class, with docstrings and tests to define the acceptable kinds of parameters/return values, and for obvious reasons I don't think you want to spend the 2.2 release cycle doing that :-) To get this into 2.1, I think we're going to have to strip it down to the minimal code necessary for the object deletion feature to work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so would you be OK with you if that simplification consisted in just removing prepare_value and just systematically apply json.dumps(v.pk).replace('|', r'\|') in find_objects?

Sorry I don’t commit anymore on this branch, I’m not quite sure what to do now that you’ve started your own ^^

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll go ahead and remove prepare_value and any other code paths that become redundant as a result (I think there are a few if isinstance(value, Model): switches elsewhere).

I still don't understand why you're manually escaping | in addition to calling re.escape. Again I'm struggling with the lack of docstring here - is the return value of prepare_value intended to be a literal string to match within the json (in which case it's the caller's responsibility to escape that literal string appropriately for use in a regexp, or whatever string-matching mechanism it decides to use), or a regexp pattern (in which case we should call re.replace on the literal string, since | is not the only character that has special meaning)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you’re right, the | escaping must be a mistake given that I added an re.escape. The only purpose of this method was to prepare the value before injecting it in stream_structure_pattern, so it had to be escaped so it’s considered as a literal by stream_structure_pattern.

@gasman
Copy link
Collaborator

gasman commented May 4, 2018

Very reluctantly bumping this to 2.2 - unfortunately the reviewing work has proceeded slower than I'd hoped, and I don't think I'll be able to complete it without causing excessive delays to the 2.1 release (I'm only in for two days next week). To minimise the impact on @BertrandBordage's Kickstarter work, I'm happy to deal with any code changes that arise from the rest of the review (although, of course, that may entail Bertrand or someone else reviewing my changes...)

@gasman gasman modified the milestones: 2.1, 2.2 May 4, 2018
@BertrandBordage
Copy link
Member Author

Ouch, OK.

No problem for me to review or contribute to the improvements, I still reserved some Wagtail work aside from the Kickstarter work.

block_types = (ChooserBlock, RichTextBlock)
block_paths_per_collector = [(c, list(c.find_block_type(block_types)))
for c in self.field_collectors]
stream_structure_pattern = r'[[ \t\n:](%s)[] \t\n,}]' % '|'.join(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the process of trying and failing to update this regexp for readability, I discovered some "fun" quirks of the mysql regexp engine:

  • It doesn't understand backslashed character classes such as \s or \n at all, and interprets them as the letter (s or n) instead. (It does understand POSIX character classes such as [:space:], but the Python module, which is used by sqlite, doesn't.)
  • Backslashes inside a [...] set are handled literally. This means that [[ \t\n:] will match \, t or n, but not a tab or line break

This means there's no portable way to match all whitespace characters :-( I was tempted to just leave out \t\n, since the JSON serializer used by StreamField isn't going to insert those characters, but that's the sort of decision that's bound to come back to bite us. So, I guess I'll go ahead and add an explicit check for mysql.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s right, I also discovered that a few months ago, but forgot that I used it here ☹️
So we have to differentiate between the two cases. Do you think you could make a test for this? I’m not sure if that’s doable without manually modifying the serialized JSON content.

yield from self.find_values(sub_value, block_path)
elif isinstance(stream, StreamValue.StreamChild):
if stream.block == current_block:
yield from self.find_values(stream.value, block_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These recursion rules don't seem quite right: unwrapping StreamValue and StreamChild happens over two iterations of find_values, during which two blocks will be popped off block_path - but both wrapper elements correspond to a StreamBlock, so logically only one block should have been popped by that point.

(Looking closer, I think there's some inconsistency about whether the type of stream corresponds to the first block in block_path, or the block that's just been popped in the previous iteration, and my guess is that this inconsistency only surfaces in nested StreamBlocks, which aren't covered by the tests.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure I fully follow you here.

I started building this recursive method using my knowledge of the way StructBlock, ListBlock and StreamBlock work, but it ended up being more complex, with several intermediate classes (that could probably be refactored). I adjusted the method to work in all the scenarii I could think of. Could you make an example that would not be handled by this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing test (throwing an 'Unexpected StreamField value' warning) here: gasman@4ed64fa

You can verify that there's a logic error here by asking: does the value of stream correspond to the first item of block_path, or not?

  • The line if isinstance(current_block, StreamBlock) and isinstance(stream, StreamValue): suggests that it does correspond
  • The if not block_path: section at the top suggests that it doesn't correspond (block_path is empty, but we still have a stream value)
  • elif current_block.name == '' and isinstance(stream, list): also suggests that it doesn't correspond (the 'nameless' block in a ListBlock definition is the child, not the ListBlock itself)

Unfortunately at this point I really have to ask you to add docstrings to the remainder of collectors.py before I can continue reviewing. Throughout this code there's a lack of precision about the inputs / outputs of each method... up to now I was willing to approach that as a documentation issue, and fill in that missing detail as part of the review process. However, it now seems that this lack of precision is directly causing logical errors such as this one and the regexp escaping bug earlier - at this point, I don't trust my ability to catch these logical errors at the same time as figuring out the intended behaviour of the code.

Copy link
Collaborator

@gasman gasman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping this back to @BertrandBordage to deal with https://github.com/wagtail/wagtail/pull/4481/files#r190932257, which looks like it's going to need some non-trivial refactoring :-/

@BertrandBordage
Copy link
Member Author

@gasman: Just added some in-depth docstrings, hope it helps :)

@BertrandBordage
Copy link
Member Author

Updated the PR against master.
Back to you for the review, @gasman!

@gasman
Copy link
Collaborator

gasman commented Jul 24, 2018

Hi @BertrandBordage - thanks for the added docstrings! Unfortunately we've now ended up with two diverging branches - my branch at https://github.com/gasman/wagtail/tree/BertrandBordage-new-uses-ui-rebase has various fixes from the earlier round of reviewing which aren't included in this PR. Also, https://github.com/wagtail/wagtail/pull/4481/files#r190932257 still needs fixing.

I'll create a rebased version of my branch with your docstrings merged in, and open a new PR to replace this one.

>>> [[b.__class__.__name__ for b in path]
... for path in StreamFieldCollector(stream_field)
... .find_block_type((CharBlock, TextBlock))]
[['StreamBlock', 'ListBlock', 'TextBlock']]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be [['StreamBlock', 'CharBlock'], ['StreamBlock', 'ListBlock', 'TextBlock']]?

@gasman
Copy link
Collaborator

gasman commented Jul 24, 2018

Replaced by #4702

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants