Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: don't show content of raw HTML <script> tags (and maybe others) in search preview #12052

Closed
mgeier opened this issue Mar 7, 2024 · 6 comments · Fixed by #12057
Closed
Labels
html search javascript Pull requests that update Javascript code type:bug

Comments

@mgeier
Copy link
Contributor

mgeier commented Mar 7, 2024

Describe the bug

Some tags are filtered out before creating the search index (I don't know why this appears in two places in the code):

if 'html' in node.get('format', '').split():
# Some people might put content in raw HTML that should be searched,
# so we just amateurishly strip HTML tags and index the remaining
# content
nodetext = re.sub(r'<style.*?</style>', '', node.astext(), flags=re.IGNORECASE|re.DOTALL)
nodetext = re.sub(r'<script.*?</script>', '', nodetext, flags=re.IGNORECASE|re.DOTALL)
nodetext = re.sub(r'<[^<]+?>', '', nodetext)

if 'html' in node.get('format', '').split():
# Some people might put content in raw HTML that should be searched,
# so we just amateurishly strip HTML tags and index the remaining
# content
nodetext = re.sub(r'<style.*?</style>', '', node.astext(),
flags=re.IGNORECASE | re.DOTALL)
nodetext = re.sub(r'<script.*?</script>', '', nodetext,
flags=re.IGNORECASE | re.DOTALL)
nodetext = re.sub(r'<[^<]+?>', '', nodetext)

However, those parts still seem to be contained in the search preview.

How to Reproduce

index.rst:

Hello
=====

.. raw:: html

    <script type="text/javascript">
      console.log("green");
    </script>

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.

And here the search word: green

When searching for "green", the JavaScript code is shown in the preview text:

image

Note that this only happens if the word "green" is also part of the "normal" text.

Environment Information

any

Sphinx extensions

none

Additional context

This problem was reported to nbsphinx, where the raw HTML is created from output cells of Jupyter notebooks: spatialaudio/nbsphinx#777

@picnixz
Copy link
Member

picnixz commented Mar 7, 2024

Wow, never expected that this bug wasn't noticed before! so PR is welcome though I will be less available in the next 10 days.

@jayaddison
Copy link
Contributor

We may need to implement some equivalent filtering in the client-side JavaScript code to resolve this; when configured to, the client makes an HTTP request per search result to retrieve the summary content (as HTML).

The relevant makeSearchSummary code (current v7.2.6 release link, some small changes since then) is here:

/**
* helper function to return a node containing the
* search summary for a given text. keywords is a list
* of stemmed words.
*/
makeSearchSummary: (htmlText, keywords) => {
const text = Search.htmlToText(htmlText);
if (text === "") return null;
const textLower = text.toLowerCase();
const actualStartPosition = [...keywords]
.map((k) => textLower.indexOf(k.toLowerCase()))
.filter((i) => i > -1)
.slice(-1)[0];
const startWithContext = Math.max(actualStartPosition - 120, 0);
const top = startWithContext === 0 ? "" : "...";
const tail = startWithContext + 240 < text.length ? "..." : "";
let summary = document.createElement("p");
summary.classList.add("context");
summary.textContent = top + text.substr(startWithContext, 240).trim() + tail;
return summary;
},
};

@jayaddison
Copy link
Contributor

HTML/XML comments occurred to me as another thing that we'd want to filter out from the summary text; they already are, though. I'll spend some more time to think about other elements that could require filtering. In the meantime, #12057 begins by handling the <script> elements.

@mgeier
Copy link
Contributor Author

mgeier commented Mar 9, 2024

other elements that could require filtering

According to the search index filtering code I referenced above, it looks like <style> tags should also be filtered out, which makes sense.

I assume that the third regular expression is supposed to filter out the tags themselves, without removing the content within those tags. It looks like this is already working correctly.

@jayaddison
Copy link
Contributor

Note: this can only occur when the non-visible HTML elements are located within role="main" on the page -- because content outside of that area will not be retrieved by the htmlToText function.

@zerothi
Copy link

zerothi commented Mar 20, 2024

freaking love this! Thanks! :)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
html search javascript Pull requests that update Javascript code type:bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants