detection of (almost) hidden text in html #72

Open
arpitest opened this issue Dec 29, 2022 · 2 comments

Comments

@arpitest

arpitest commented Dec 29, 2022

Hi! I'm developing spam filters and need to convert HTML emails to plain text for analysis. I've used html2text and later my own simplified implementation, but inscriptis looks even better!

Would it be possible to add optional filtering of hidden text, i.e. text rendered in a very small font size, or in a font color equal (or close) to the background color? Sometimes this is defined in CSS/style tags, sometimes in the attributes of a span tag.
This technique is often used on web pages and in spam emails to fool search engines and spam filters with fake content that is not visible to human readers.

Here is a sample: http://thot.banki.hu/deepspam/poison.html

@AlbertWeichselbraun
Contributor

we could add functions that interpret stylesheet options prior to applying them (e.g., treat the text as invisible if its font size is below a certain threshold, or if its color is too close to the background color).

limitations:

  • we probably wouldn't activate these functions by default
  • spammers could adapt (e.g., by using stylesheets rather than the style attribute).
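
To make the idea concrete, here is a rough sketch of such a threshold check. It is not inscriptis code and does not use the inscriptis API: it relies only on the Python standard library, looks only at the inline `style` attribute (so it has exactly the stylesheet limitation noted above), and assumes a white background. The names `is_hidden_style` and `visible_text` and both threshold values are hypothetical, picked for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical thresholds, chosen for illustration only.
MIN_FONT_PX = 4.0          # smaller fonts count as hidden
MIN_COLOR_DISTANCE = 40.0  # RGB distance to the background
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


def parse_style(style):
    """Split an inline style attribute into a lowercase property dict."""
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props


def parse_hex_color(value):
    """Return (r, g, b) for #rgb or #rrggbb, or None for anything else."""
    match = re.fullmatch(r"#([0-9a-f]{3}|[0-9a-f]{6})", value)
    if not match:
        return None
    digits = match.group(1)
    if len(digits) == 3:
        digits = "".join(c * 2 for c in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def is_hidden_style(style, background=(255, 255, 255)):
    """True if the font is tiny or the color is close to the background."""
    props = parse_style(style)
    size = re.fullmatch(r"(\d*\.?\d+)px", props.get("font-size", ""))
    if size and float(size.group(1)) < MIN_FONT_PX:
        return True
    color = parse_hex_color(props.get("color", ""))
    if color is not None:
        dist = sum((a - b) ** 2 for a, b in zip(color, background)) ** 0.5
        return dist < MIN_COLOR_DISTANCE
    return False


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping elements whose inline style hides them."""

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one flag per currently open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = bool(self.hidden_stack) and self.hidden_stack[-1]
        style = dict(attrs).get("style") or ""
        self.hidden_stack.append(parent_hidden or is_hidden_style(style))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<p>Buy now <span style="font-size:1px">lorem ipsum</span>'
          '<span style="color:#fefefe">filler</span> today</p>')
print(visible_text(sample))
```

The sketch assumes well-formed markup with matching end tags, handles only `px` font sizes and hex colors, and ignores `background-color` on parent elements, so a real implementation would need to cover more cases.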

would such a feature be helpful for your application?

@arpitest
Author

arpitest commented Jan 6, 2023

> would such a feature be helpful for your application?

yes, thanks!
