Indent HTML lists correctly (Issue 1073) #1170

lcgeneralprojects · 2024-05-17T05:14:11Z

Fixes #1073
Implements paragraph indentation via adjustments of pdf.x instead of using a natural number of whitespaces.
Breaks up list items into individual paragraphs.
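To illustrate the difference between the two approaches, here is a minimal sketch that does not depend on fpdf2. The helpers `indent_with_spaces` and `indent_with_offset` are hypothetical names, and a fixed character width stands in for real font metrics.

```python
from textwrap import wrap


def indent_with_spaces(text, width, n_spaces):
    """Whitespace approach: prefix the text with spaces, then wrap.

    Only the first line carries the indent; wrapped lines fall back
    to the left margin.
    """
    return wrap(text, width=width, initial_indent=" " * n_spaces)


def indent_with_offset(text, width, indent):
    """Offset approach: wrap to a narrower width and record the
    x position at which every line should start.

    Returns a list of (x, line) tuples, so continuation lines keep
    the same indent as the first line.
    """
    return [(indent, line) for line in wrap(text, width=width - indent)]


text = " ".join(["word"] * 12)
for line in indent_with_spaces(text, 20, 4):
    print(repr(line))
for x, line in indent_with_offset(text, 20, 4):
    print(x, repr(line))
```

With the offset approach the indent no longer depends on the width of a space in the current font, and multi-line list items stay aligned.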

Checklist:

The GitHub pipeline is OK (green),
meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.
A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
A mention of the change is present in CHANGELOG.md

Not sure how to actually name the HTML2FPDF.list_pseudo_margin attribute. It is used for determining the height of the \n line created when a <ul> or <ol> starting tag is handled.

Should the re-implementation of paragraph indentation be reflected in a docstring, even though the tag_indents parameter of write_html() has not been touched? If so, where should it be placed?

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

Some debugging and polish needed.

Some variables need tweaking. Needs testing. Code reuse unsatisfactory.

Potentially significant issues with tests: 1. test_html_ln_outside_p - IndexError: list index out of range. 2. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Code reuse unsatisfactory.

Need feedback for handling <dd> and <blockquote>. Potentially significant issues with tests: 1. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Need feedback for whether or not the new indentation that contradicts old tests is satisfactory. Code reuse unsatisfactory. Need feedback.

Bug present: bullets are made one per line instead of one per paragraph. Saving progress before introducing a `Bullet` class.

Feature implemented. Testing and adjustments of tests needed.

Prevented `Paragraph.top_margin` from being added to `pdf.y` of the first lines of paragraphs with bullets.

Prevented `Paragraph.top_margin` from being added to `pdf.y` of the first lines of paragraphs with bullets. `<ul>` and `<ol>` tags now create a paragraph in which the string `\n` is used to generate a fragment of the height `list_pseudo_margin`. Adjusted defaults for `li_tag_indent`.

fpdf/fpdf.py

fpdf/text_region.py

Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line. Dealing with the issue of inappropriately large distance between `<dt>` and their child `<dd>` elements when `Paragraph.top_margin` is 0.

Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.

# Conflicts: # fpdf/html.py

Adjusted old tests.

gmischler

Nice work so far, but the devil is in the detail...

As I'm sure you've noticed, the interplay between the HTML parser, text regions, line wrapping, and rendering is non-trivial. I've added some pointers of how to fix the parts that don't quite add up yet.

fpdf/line_break.py

fpdf/html.py

fpdf/text_region.py

gmischler · 2024-05-20T12:15:49Z

fpdf/html.py

+            else:
+                self.line_height_stack.append(None)
+            if self.indent == 1:
+                self._new_paragraph(top_margin=self.list_top_margin, line_height=0)


Applying the margins with the </?[uo]l> instead of the <li> is the correct and clean design. 👍

But I'm not sure if checking for self.indent == 1: is the right criterion for the top margin.
What is the logic (or HTML spec) behind that?

Btw:
You're using the NBSP to get the actual margin, with list_top_margin only adding a very small and probably unnecessary amount to it. If you do that, you could probably just drop list_top_margin completely.

But then, actually using the Paragraph() top and bottom margin functionality has additional benefits. For example, the paragraph will not apply them at the top or bottom of a page. Your current solution will create an unnecessary empty space there, possibly resulting in unnecessary page breaks.

The better solution might be to just add a top/bottom margin of the current text size, and modify Paragraph() so it applies its margins even if it contains no text.
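The page-top behaviour described above can be sketched as follows. This is only an illustration of the idea, not the actual `Paragraph` code; `y_after_top_margin` and its parameters are hypothetical.

```python
def y_after_top_margin(y, top_margin, page_top, eps=1e-9):
    """Return the y position at which the paragraph text should start.

    The margin is skipped when the paragraph begins at the very top
    of the page, so no empty band appears there.
    """
    if abs(y - page_top) < eps:
        return y
    return y + top_margin


print(y_after_top_margin(10.0, 5.0, page_top=10.0))  # paragraph at page top
print(y_after_top_margin(30.0, 5.0, page_top=10.0))  # paragraph mid-page
```

An unconditional line break has no such check, which is why it can leave unnecessary space at the top of a page.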

> But I'm not sure if checking for self.indent == 1: is the right criterion for the top margin.
> What is the logic (or HTML spec) behind that?

That is done to avoid applying margins in the case of nested lists, i.e. the margins are only applied if we aren't already handling an HTML element that increases HTML2FPDF.indent.

> The better solution might be to just add a top/bottom margin of the current text size, and modify Paragraph() so it applies its margins even if it contains no text.

Will do.

> That is done to avoid applying margins in the case of nested lists,

The specs suggest the following defaults:

dir, dl, menu, ol, ul { margin-block: 1em; }

which implies that nested lists also should have vertical margins (as there is no exception defined for them).

Browser makers seem to interpret that rather liberally, with both Firefox and Edge (hence Chromium) applying a margin above, but not below. They differ in that Firefox keeps the two nested bullets on the same line, while Chromium jumps one line.

I guess only using vertical margins for the top level list is just as compliant as those two...
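The "top-level lists only" rule discussed here could be expressed roughly like this. It is a sketch under the assumption that the margin is 1em, i.e. equal to the current font size; `list_vertical_margin` as a function and its `top_level_only` flag are hypothetical.

```python
def list_vertical_margin(depth, font_size, top_level_only=True):
    """Vertical margin for a list opened at nesting level `depth`.

    depth == 1 is a top-level <ul>/<ol>. The result is in the same
    unit as `font_size` (1em margin).
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    if top_level_only and depth > 1:
        # Nested lists get no extra vertical space.
        return 0.0
    return float(font_size)


print(list_vertical_margin(1, 12))
print(list_vertical_margin(2, 12))
print(list_vertical_margin(2, 12, top_level_only=False))
```

Setting `top_level_only=False` corresponds to a literal reading of the `margin-block: 1em` default, where nested lists also receive margins.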

gmischler · 2024-05-20T12:47:28Z

test/html/test_html.py

@@ -699,3 +699,11 @@ def test_html_ol_ul_line_height(tmp_path):
    </ul>"""
    )
    assert_pdf_equal(pdf, HERE / "html_ol_ul_line_height.pdf", tmp_path)
+
+
+def test_html_long_list_entries(tmp_path):


You have added functionality to text_region.py, which is very useful even when not parsing HTML. That means we need tests to verify that the new arguments for Paragraph() work correctly when used directly as well. This is necessary to avoid regressions with any future changes.

It is also a good reason to avoid changing line_break.py, because otherwise you'd have to add tests to verify your changes there as well... 😉

On second thought, I would like to also ask what tests I should make for that.

Those tests should verify that Paragraphs are rendered correctly with any combination of indent and bullet string, each either present or absent, and with bullet strings that are both wider and narrower than the text indent.
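One way to enumerate such combinations is sketched below. The indent and bullet widths are made-up numbers, and `bullet_cases` is a hypothetical helper; it only builds the test matrix and does not call fpdf2.

```python
from itertools import product


def bullet_cases(indents=(0.0, 10.0), bullet_widths=(None, 4.0, 25.0)):
    """Build (indent, bullet_width, kind) tuples for every combination.

    bullet_width None means "no bullet"; otherwise the bullet either
    fits inside the indent or overflows it.
    """
    cases = []
    for indent, width in product(indents, bullet_widths):
        if width is None:
            kind = "no bullet"
        elif width <= indent:
            kind = "bullet fits"
        else:
            kind = "bullet overflows"
        cases.append((indent, width, kind))
    return cases


for case in bullet_cases():
    print(case)
```

A list like this could be passed to `pytest.mark.parametrize`, with each case rendering a `Paragraph` directly and comparing the result against a reference PDF.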

I have added a test for long <ol> bullets. The other tests that are already present seem to already fulfil the need for testing whether or not a Paragraph is rendered correctly with different bullets and tag_indents. Am I missing something?

I currently don't see Paragraph being used outside of handling HTML, hence the assumption.

Explained above.

There don't seem to be docstrings for ParagraphCollectorMixin.paragraph() and Paragraph. Should I create one?

That would be very helpful! It looks like I have been skimping a bit on docstrings in that module.

Added the docstring to the Paragraph class. TextRegion.md has been edited. A test for the generation of Paragraph objects has been added, although it does not test the handling of bullet_rel_x_displacement and bullet_rel_y_displacement, as the relevant Paragraph-instantiating methods do not have that functionality yet.

I am considering removing bullet_rel_y_displacement from Paragraph altogether in favour of vertical alignment, but the introduction of alignment will require some additional time.

I'd recommend leaving vertical bullet alignment as a future enhancement. Let's get this stable and robust first, and then think about any more bells and whistles.

Bumping.
A commit is ready for examination.

Apologies for bumping again.
Is there anything else required to do regarding this?

lcgeneralprojects · 2024-05-26T21:27:38Z

Reformatted html.py using black manually.

# Conflicts: # CHANGELOG.md # test/html/html_features.pdf

gmischler · 2024-05-29T09:09:24Z

Now that the code has largely settled, let's look at the results.

html_blockquote_indent.pdf - I know this is a pre-existing test, but it would probably be good to add a blockquoted paragraph with several lines, just to show the effect, and also to prevent possible future regressions.
html_customize_ul.pdf - Small indents.
html_li_prefix_color.pdf - Small indents.
html_li_tag_indent.pdf - Small indents.
html_ln_outside_p.pdf - Small indents.
html_ol_ul_line_height.pdf - Small indents.
html_ul_type.pdf - Small indents.
html_tag_indent.pdf - Besides the width of the indent, this one would also benefit from a bit more (multi-line) context.
html_description.pdf - Has received a vertical margin it didn't have before. Is this now more "correct" (HTML specs / typical browser behaviour)?
html_features.pdf - Reflects changes already seen in other files.
html_long_list_entries - New. Demonstrates multiline list item indent.
html_long_ol_bullets - New. Demonstrates bullet left cut-off.

The indents for lists are now much smaller than before. The previous default indent depended on the font size (5 x width of NBSP). This now seems to have changed to 5 "document units". With the default of mm this is too small. With the document units set to e.g. inches, it will be way too large. You will have to pick a reasonable size in mm (e.g. 8 or 10), and then make sure that if the document units aren't mm, this value gets converted appropriately before being used. The documentation also must clearly state that tag_indents values have to be given in document units.
Warning: I'm unable to test this right now, but this is my conclusion looking at the code. You'll have to test what actually happens when the document units aren't mm, and obviously those tests need to be added to our permanent collection.
Maybe the simplest solution is to systematically convert the value from mm to document units whenever we assign one from DEFAULT_TAG_INDENTS to self.tag_indents.

Note that top and bottom margins of Paragraph()s are also in document units (which the documentation currently doesn't make sufficiently clear).
You'll have to take the same precautions with those as with the indents. In most cases they are defined in terms of font height, which is fine, but there are some hard-coded "magic numbers" present in the code as well. Most of those will have been there before you started (you can probably blame me for some of them 😉), but this is a good opportunity to fix them.
The same applies to the last hardcoded self._ln(2) call within the <li> tag code. This also incorrectly assumes mm, and needs to be converted to document units.
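The conversion suggested above might look like the sketch below. It assumes the usual points-per-unit factors (72 pt per inch, 25.4 mm per inch); `mm_to_document_units` and the values in `DEFAULT_TAG_INDENTS_MM` are hypothetical and not taken from the code under review.

```python
# Points per document unit.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "mm": 72 / 25.4,
    "cm": 72 / 2.54,
    "in": 72.0,
}


def mm_to_document_units(value_mm, unit):
    """Convert a length given in mm into the document's unit."""
    if unit not in POINTS_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return value_mm * POINTS_PER_UNIT["mm"] / POINTS_PER_UNIT[unit]


# Hypothetical defaults, stored in mm regardless of the document unit.
DEFAULT_TAG_INDENTS_MM = {"li": 8, "dd": 10, "blockquote": 8}


def default_tag_indents(unit):
    """Return the default indents expressed in the document's unit."""
    return {
        tag: mm_to_document_units(value, unit)
        for tag, value in DEFAULT_TAG_INDENTS_MM.items()
    }


print(default_tag_indents("mm"))
print(default_tag_indents("in"))
```

The same helper would cover hard-coded margins and the `self._ln(2)` call, so that every "magic number" is defined once in mm and converted at the point of use.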

The docstring for Paragraph() looks good. However, the end user will access this functionality through ParagraphCollectorMixin.paragraph(), so maybe it should rather go there?

lcgeneralprojects · 2024-05-30T13:38:16Z

Regarding tests with small indents, do you want me to pass tag_indents values into pdf.write_html() calls in tests where that argument is not used, in addition to adjusting default tag indentation handling?

Regarding the new margins between elements in html_description.pdf - Firefox and Edge both produce the margin with that HTML code.

I might not have time to confidently deal with the 'magic numbers' tonight, so I will likely be pushing the changes tomorrow, and not today. Should I just do the conversion into appropriate units with them? If so, would you prefer for me to intentionally change them a little in order for them to look nicer, or would you prefer a more exact conversion?

EDIT: Note that, because of the HTML2FPDF._ln() call made when handling <li> start tags, there is currently a gap of the relevant size even if the list is the first visible content of the document. I mention it in case it needs to be eliminated in the future.

…margin values in `html.py` to the chosen document unit of measurement. Adjusted default tag indent values. Moved the `Paragraph` docstring to the `ParagraphCollectorMixin.paragraph()` method. Changed the `CustomPDF` class in `test_html_customize_ul` to have non-static attributes `li_tag_indent` and `ul_bullet_char`. Adjusted tests.

lcgeneralprojects added 13 commits May 5, 2024 05:34

intermediate commit to save progress. Debugging needed.

f794193

Feature mostly implemented.

20c035e

Some debugging and polish needed.

Fixed the issue with indentation of nested lists.

eb93711

Feature implemented.

f8f17a5

Some variables need tweaking. Needs testing. Code reuse unsatisfactory.

Feature implemented.

77a1a31

Some variables need tweaking. Needs testing. Code reuse unsatisfactory.

Feature implemented for <li>.

e80d8d9

Potentially significant issues with tests: 1. test_html_ln_outside_p - IndexError: list index out of range. 2. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Code reuse unsatisfactory.

Merge branch 'refs/heads/master' into issue_1073

dbcce1f

Issue mostly fixed.

fb59849

Bug present: bullets are made one per line instead of one per paragraph. Saving progress before introducing a `Bullet` class.

Issue fixed.

bc1fab8

Feature implemented. Testing and adjustments of tests needed.

Changed <ol> bullets to not introduce an extra whitespace.

d487f7d

Added the li_pseudo_margin attribute to HTML2FPDF.

2caa750

Prevented `Paragraph.top_margin` from being added to `pdf.y` of the first lines of paragraphs with bullets.

lcgeneralprojects commented May 17, 2024

fpdf/fpdf.py (outdated, resolved)

Merge branch 'refs/heads/master' into issue_1073

070a41d

lcgeneralprojects commented May 17, 2024

fpdf/text_region.py (outdated, resolved)

lcgeneralprojects added 6 commits May 19, 2024 11:19

Merge branch 'refs/heads/master' into issue_1073

4ab204e

Fixed the inappropriate TextMode importation.

1e1eb29

Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.

Merge remote-tracking branch 'origin/issue_1073' into issue_1073

dc3d8f8

# Conflicts: # fpdf/html.py

Introduced new test test_html_long_list_entries.

3f56811

Adjusted old tests.

Adjusted Changelog.md and relevant docstrings.

ce7cb9b

lcgeneralprojects marked this pull request as ready for review May 20, 2024 07:37

lcgeneralprojects requested a review from gmischler as a code owner May 20, 2024 07:37

gmischler changed the title from "Issue 1073" to "Indent HTML lists correctly (Issue 1073)" May 20, 2024

gmischler requested changes May 20, 2024

gmischler mentioned this pull request May 24, 2024

Nested HTML lists start with a newline #1148

Open

lcgeneralprojects added 3 commits May 25, 2024 13:55

Changed the name of the relevant variables from list_top_margin to …

24626f9

…`list_vertical_margin`. Removed the `MultiLineBreak.indent` attribute. Added a test for long `<ol>` bullets.

Adjusted html code strings in test_html_long_ol_bullets for aesthet…

208e3b3

…ic purposes.

Merge branch 'refs/heads/master' into issue_1073

8cceb1d

# Conflicts: # CHANGELOG.md

lcgeneralprojects added 6 commits May 25, 2024 15:08

Added self.pdf.normalize_text(bullet_string) to `Paragraph.generate…

bf5f0fa

…_bullet_frag_and_tl`.

Added self.pdf.normalize_text(bullet_string) to `Paragraph.generate…

82fbfda

…_bullet_frag_and_tl`.

Merge remote-tracking branch 'origin/issue_1073' into issue_1073

a75a948

# Conflicts: # fpdf/text_region.py

Adjusted handling of fragments in the `Paragraph.generate_bullet_fr…

fc38846

…ags_and_tl` method and in the `Bullet` class.

Added docstring to Paragraph.

2f69001

Edited `TextRegion.md` to reflect the introduced changes for `Paragraph`s. Added tests for `Paragraph` generation in `test_html.py`

Used black on html.py

f7908e4

lcgeneralprojects added 2 commits May 28, 2024 03:11

Merge branch 'refs/heads/master' into issue_1073

5afb935

# Conflicts: # CHANGELOG.md # test/html/html_features.pdf

Merged changes to the branch master into the branch issue_1073.

4e7118b
