
Always convert text from html in MessagePreviewExtractor #10919

Open
frabera wants to merge 9 commits into thunderbird:main from frabera:patch-1
Conversation

@frabera

@frabera frabera commented Apr 20, 2026

This fixes HTML special characters and HTML code showing up in notifications (#10256) and message previews (#8471).

Previously, convertFromHtmlIfNecessary() converted from HTML only after checking the MIME type. If the plain text part of a message contained HTML code or special characters, they were rendered as-is in message previews and notifications, polluting the content.

I kept the function; let me know if there is a cleaner way to accomplish this. I tested it with messages containing special characters and with HTML code, and neither is shown anymore.

This solves html characters or code in notifications and message previews (thunderbird#10256)
@frabera frabera marked this pull request as draft April 20, 2026 22:17
@frabera frabera marked this pull request as ready for review April 20, 2026 23:53
Updated regex to remove URLs in parentheses from preview text.
before, it interfered with the preview formatting of headers of forwarded and reply emails.
@frabera
Author

frabera commented Apr 21, 2026

I modified the logic by moving the parsing and cleaning steps into stripTextForPreview(). The previous solution worked, but it didn't strip the quotation header from forwarded emails and replies (the preview showed -----Original Message-----). This seems to be caused by parsing before extractUnquotedText().

With this approach, a message that contains only an HTML part is parsed twice. Avoiding that requires a deeper modification, so I'll wait for opinions from reviewers.
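
Besides the extra work, parsing twice could also change the text. The following is only a minimal sketch of that risk, not the project's converter: decodeBasicEntities() is a hypothetical stand-in that handles three entities, and the real HtmlConverter.htmlToText() may behave differently.

```kotlin
// Hypothetical stand-in for an HTML-to-text step. It decodes only three
// entities and is NOT the project's HtmlConverter.
fun decodeBasicEntities(text: String): String =
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&") // decoded last so one pass never decodes twice

fun main() {
    // HTML source of a mail whose author wants the reader to see "&lt;br&gt;".
    val htmlSource = "Use &amp;lt;br&amp;gt; for a line break"

    val once = decodeBasicEntities(htmlSource)
    val twice = decodeBasicEntities(once)

    println(once)  // what the author meant to display
    println(twice) // a second pass turns the literal entities into markup
}
```

In this sketch the second pass produces text containing a real-looking tag, which a full HTML parser would then drop or interpret.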

Here's an example where convertFromHtmlIfNecessary() also returns a boolean that indicates whether HTML parsing is still required:

    fun extractPreview(textPart: Part): String {
        val text = MessageExtractor.getTextFromPart(textPart, MAX_CHARACTERS_CHECKED_FOR_PREVIEW)
            ?: throw PreviewExtractionException("Couldn't get text from part")

        val (plainText, requiresParsingHtml) = convertFromHtmlIfNecessary(textPart, text)
        return stripTextForPreview(plainText, requiresParsingHtml)
    }

    private fun convertFromHtmlIfNecessary(textPart: Part, text: String): Pair<String, Boolean> {
        return if (isSameMimeType(textPart.mimeType, "text/html")) {
            HtmlConverter.htmlToText(text) to false
        } else {
            text to true
        }
    }

    private fun stripTextForPreview(text: String, parseHtml: Boolean): String {
        var intermediateText = text

        intermediateText = normalizeLineBreaks(intermediateText)
        intermediateText = stripSignature(intermediateText)
        intermediateText = extractUnquotedText(intermediateText)

        // try to remove lines of dashes in the preview
        intermediateText = intermediateText.replace("(?m)^----.*?$".toRegex(), "")
        // Remove horizontal rules.
        intermediateText = intermediateText.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")

        // If the textPart was plaintext, parse as HTML
        if (parseHtml) {
            intermediateText = HtmlConverter.htmlToText(intermediateText)
        }

        // Remove parsed HTML links/images "<url>"
        intermediateText = intermediateText.replace("<https?://\\S+>".toRegex(), " ")

        // URLs in the preview should just be shown as "..." - They're not
        // clickable and they usually overwhelm the preview
        intermediateText = intermediateText.replace("https?://\\S+".toRegex(), "...")
        // Don't show newlines in the preview
        intermediateText = intermediateText.replace('\n', ' ')
        // Collapse whitespace in the preview
        intermediateText = intermediateText.replace("\\s+".toRegex(), " ")
        // Remove any whitespace at the beginning and end of the string.
        intermediateText = intermediateText.trim()

        return if (intermediateText.length > MAX_PREVIEW_LENGTH) {
            intermediateText.substring(0, MAX_PREVIEW_LENGTH - 1) + "…"
        } else {
            intermediateText
        }
    }
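
To make the order of the regex steps easier to check in isolation, here is a minimal standalone sketch that uses only the Kotlin standard library. It copies the regexes from the snippet above but leaves out signature stripping, unquoting and the HTML conversion. cleanForPreview() is a hypothetical name, not a function in the codebase.

```kotlin
// Standalone sketch of the regex cleanup steps only. The name is hypothetical;
// signature stripping, unquoting and HTML conversion are intentionally omitted.
fun cleanForPreview(text: String): String {
    var result = text
    // Drop whole lines that start with four dashes, e.g. "-----Original Message-----".
    result = result.replace("(?m)^----.*?$".toRegex(), "")
    // Replace long horizontal rules with a single space.
    result = result.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")
    // Remove "<url>" leftovers first, so the next step does not turn them into "<...>".
    result = result.replace("<https?://\\S+>".toRegex(), " ")
    // Show remaining bare URLs as "...".
    result = result.replace("https?://\\S+".toRegex(), "...")
    // Collapse all whitespace, including newlines, and trim the ends.
    return result.replace("\\s+".toRegex(), " ").trim()
}

fun main() {
    val sample = "Hello\n-----Original Message-----\n" +
        "See <https://example.com/a> and https://example.com/b now"
    println(cleanForPreview(sample))
}
```

Running the bracketed-URL replacement before the bare-URL one matters: in the opposite order the bare-URL pattern would match first and swallow the closing `>`.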

@rafaeltonholo rafaeltonholo added the report: include Include changes in user-facing reports. label Apr 21, 2026
@frabera frabera changed the title Always convert text to html in MessagePreviewExtractor Always convert text from html in MessagePreviewExtractor Apr 21, 2026
@wmontwe wmontwe requested review from wmontwe and removed request for rafaeltonholo April 24, 2026 10:08
@wmontwe wmontwe assigned wmontwe and unassigned rafaeltonholo Apr 24, 2026
@wmontwe
Member

wmontwe commented Apr 29, 2026

@frabera Thanks for the fix. I think having the flag improves the handling. I would add some tests to check the behavior, especially for forwarded emails, special characters, and HTML in plain text.

The following tests might help to check that the logic works. You could also add more variations, such as an HTML mail with forwarded plain text, or a plain text mail with a forwarded HTML email. Then test with special characters and HTML code in the plain text part.

    @Test
    fun extractPreview_forwardedMessage() {
        val text =
            """
            Here is the forwarded message:

            -----Original Message-----
            From: alice@example.com
            Sent: Monday, January 1, 2024 10:00 AM
            To: bob@example.com
            Subject: Hello

            This is the original content.
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/plain", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

    @Test
    fun extractPreview_htmlForwardedMessage() {
        val text =
            """
            <html>
            <body>
            Here is the forwarded message:<br>
            <br>
            -----Original Message-----<br>
            From: alice@example.com<br>
            Sent: Monday, January 1, 2024 10:00 AM<br>
            To: bob@example.com<br>
            Subject: Hello<br>
            <br>
            This is the original content.
            </body>
            </html>
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/html", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

I'm open to suggestions on how to mark the content of the forwarded message, or even to treating this as a separate issue.

When working on the tests, keep the line breaks as they are, otherwise the tests will fail. Ideally the test emails would be loaded from a file instead of being embedded in the code, where reformatting can change the message in an incompatible way. We won't have time to fix this now, so we're open to contributions.
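
As a rough idea of what loading from a file could look like, here is a minimal sketch that assumes the fixture is a plain file on disk. loadTestEmail() is a hypothetical helper, not something that exists in the test module, and the project may prefer classpath resources instead.

```kotlin
import java.io.File

// Hypothetical helper: reads a test email exactly as stored.
// File.readText() does not normalize line endings, so CRLF stays CRLF.
fun loadTestEmail(file: File): String = file.readText(Charsets.UTF_8)

fun main() {
    // Stand-in for a checked-in fixture file; a real test would point at the fixture instead.
    val fixture = File.createTempFile("forwarded-message", ".txt")
    fixture.deleteOnExit()
    fixture.writeText(
        "Here is the forwarded message:\r\n\r\n-----Original Message-----\r\nThis is the original content.",
        Charsets.UTF_8,
    )

    val text = loadTestEmail(fixture)
    println(text.contains("\r\n")) // line breaks survive the round trip
}
```

Keeping the message in a file means a code formatter never touches it, so the line breaks the extractor depends on stay stable.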
