[Feature Request] Ignore any HTML tags when searching but still return response with HTML included #265

Open
sdcosta opened this issue May 2, 2021 · 3 comments

Comments

@sdcosta

sdcosta commented May 2, 2021

Description

Ignore HTML tags when indexing a document, so that ideally only the text outside of tags is indexed.

Ex. <p>Who wants some <b>chimi</b>changas?</p>

If I query for chimichangas, indexing should ignore the <b> and </b> tags as well as the <p> and </p> tags, and index chimichangas as one word.

The hits returned for chimichangas should contain the original HTML: <p>Who wants some <b>chimi</b>changas?</p>.

A great bonus would be for highlights to work with this as well. Ex: <p>Who wants some <CustomHighlightTag><b>chimi</b>changas</CustomHighlightTag>?</p>

@hqm42

hqm42 commented May 4, 2021

For a simple solution, you could introduce an artificial field in which all HTML tags are removed. This will not solve your highlight problem, but I think that cannot be solved at all:

Ex text: <p>Who wants some <b>vegan chimi</b>changas?</p>
Ex result: <p>Who wants some <b>vegan <CustomHighlightTag>chimi</b>changas</CustomHighlightTag>?</p>

It would be very easy to generate broken HTML.
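The artificial field suggested above could be produced before indexing. The following is a minimal sketch using only the Python standard library; the field names `body_html` and `body_text` are hypothetical and not something Typesense prescribes.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


doc_html = "<p>Who wants some <b>chimi</b>changas?</p>"

# Hypothetical field names: one field to search, one to return as-is.
document = {
    "body_html": doc_html,
    "body_text": strip_tags(doc_html),
}
```

Text fragments are joined without a separator so that `<b>chimi</b>changas` becomes one word. The trade-off is that adjacent block elements such as `</p><p>` would also be glued together, so a real pipeline would probably need to insert whitespace for block-level tags.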

@sdcosta
Author

sdcosta commented May 5, 2021

Agreed on having a searchable field that excludes the HTML tags and a non-searchable field that keeps them. This would double the size of each document, but I think Typesense only stores searchable fields in RAM.

That's a really good point that the highlight tags could potentially break the HTML. Is it possible for Typesense to return the offset positions of where the search query matched in the document?
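Whether Typesense can return such offsets is not answered in this thread. As a client-side alternative, a match found in the tag-free text can be mapped back to positions in the HTML. This is a rough sketch: `text_with_offsets` is a hypothetical helper, and the regex is a naive tag matcher that assumes no `>` inside attribute values and no HTML entities.

```python
import re

# Naive tag matcher (assumption: no ">" inside attribute values).
TAG = re.compile(r"<[^>]+>")


def text_with_offsets(html: str):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `html` of the character plain_text[i]."""
    chars, offsets = [], []
    pos = 0
    for m in TAG.finditer(html):
        for i in range(pos, m.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = m.end()  # skip over the tag
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


html = "<p>Who wants some <b>chimi</b>changas?</p>"
text, offsets = text_with_offsets(html)

query = "chimichangas"
start = text.find(query)
end = start + len(query)

# Span in the original HTML that covers the whole match, tags included.
html_start = offsets[start]
html_end = offsets[end - 1] + 1
matched_html = html[html_start:html_end]
```

The resulting span can contain unbalanced tags (here a lone `</b>`), so it is useful for locating a match but should not be wrapped in a highlight tag as a single unit.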

@kishorenc
Member

Typesense highlights each token independently, so I think mixing HTML + highlight tags should be okay. The result will be like this:

<p>Who wants some <b>vegan <CustomHighlightTag>chimi</CustomHighlightTag></b><CustomHighlightTag>changas</CustomHighlightTag>?</p>
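The same idea of wrapping each piece of text separately, so that no highlight tag ever crosses an HTML tag boundary, can be sketched on the client side. This is not Typesense's implementation; `highlight_html` is a hypothetical helper, it highlights only the first match, and it relies on a naive tag regex and on ASCII-safe lowercasing.

```python
import re

# Naive tag matcher (assumption: no ">" inside attribute values).
TAG = re.compile(r"<[^>]+>")


def highlight_html(html, query, open_tag="<mark>", close_tag="</mark>"):
    # Split the HTML into alternating (is_tag, string) pieces.
    pieces = []
    pos = 0
    for m in TAG.finditer(html):
        if m.start() > pos:
            pieces.append((False, html[pos:m.start()]))
        pieces.append((True, m.group()))
        pos = m.end()
    if pos < len(html):
        pieces.append((False, html[pos:]))

    # Search the tag-free text (assumes lower() keeps string lengths).
    text = "".join(s for is_tag, s in pieces if not is_tag)
    start = text.lower().find(query.lower())
    if start < 0:
        return html
    end = start + len(query)

    out = []
    cursor = 0  # offset of the current text piece within `text`
    for is_tag, s in pieces:
        if is_tag:
            out.append(s)
            continue
        # Part of the match that falls inside this text piece, if any.
        lo = max(start, cursor) - cursor
        hi = min(end, cursor + len(s)) - cursor
        if lo < hi:
            out.append(s[:lo] + open_tag + s[lo:hi] + close_tag + s[hi:])
        else:
            out.append(s)
        cursor += len(s)
    return "".join(out)


result = highlight_html(
    "<p>Who wants some <b>vegan chimi</b>changas?</p>",
    "chimichangas",
    open_tag="<CustomHighlightTag>",
    close_tag="</CustomHighlightTag>",
)
```

Because every highlight opens and closes inside a single text piece, the surrounding markup stays well-formed, at the cost of one logical match being split across several highlight elements.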

3 participants