Noindexing parts of page #49

Closed
Starkmann opened this Issue May 1, 2016 · 8 comments

Starkmann commented May 1, 2016

I need a way to exclude part of a page from indexing, for example the footer.
https://en.wikipedia.org/wiki/Noindex#Noindexing_part_of_a_page

I would sponsor this feature if it does not exist yet.

Marx1st commented May 29, 2017

I second this feature! It would be great to have a filter, e.g. if <div id=content> exists, index only that element...
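
To make the request concrete, here is a minimal sketch of such an include filter in Python, using only the standard library. It is not YaCy code: the names `ContentDivExtractor` and `text_to_index` are hypothetical, and the fallback to the whole page when no matching `<div>` exists is an assumption about the desired behaviour.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Hypothetical helper: collects all text and, separately, the text inside one <div id=...>."""

    def __init__(self, content_id="content"):
        super().__init__()
        self.content_id = content_id
        self.depth = 0           # > 0 while inside the content <div>
        self.found = False       # True once the content <div> has been seen
        self.all_text = []
        self.content_text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1      # nested <div> inside the content <div>
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self.depth:
            self.content_text.append(text)


def text_to_index(html, content_id="content"):
    """Return only the content <div> text if that <div> exists, else all page text (assumed fallback)."""
    parser = ContentDivExtractor(content_id)
    parser.feed(html)
    parser.close()
    chosen = parser.content_text if parser.found else parser.all_text
    return " ".join(chosen)


with_wrapper = '<div id=content>Article <div>body</div></div><div id="footer">Legal</div>'
without_wrapper = "<p>No wrapper here</p>"
print(text_to_index(with_wrapper))
print(text_to_index(without_wrapper))
```

The sketch tracks nesting depth so that `<div>` elements inside the content block do not end the capture early.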

Starkmann changed the title from "Noindexing part pf page" to "Noindexing parts of page" May 30, 2017

Marx1st commented Jun 1, 2017

There are several solutions for Apache Nutch. Maybe those could be used to implement this in YaCy?

https://github.com/BayanGroup/nutch-custom-search

There is also the OpenSearchServer project, where you can add XPath excludes for the HTML crawler, etc.

http://www.opensearchserver.com

Member

luccioman commented Jun 7, 2017

Indeed, the "Custom Search Tools For Nutch" you mention seem to offer many possibilities, but they may be a bit complex...
What about first adding only CSS-style selector support to YaCy, in the form of exclude/include filters in the advanced crawler? Don't you think this could already cover many use cases (at least for HTML documents)?

Contributor

Quix0r commented Feb 13, 2018

Isn't this already done in current master, which can ignore a comma-separated list of CSS classes?

Starkmann commented Feb 13, 2018

Seems like #158 is a duplicate of this issue.

As far as I understand the commit, we can enter a list of CSS classes that we want to ignore. That's great news.

Member

luccioman commented Feb 13, 2018

Yes, you can now enter a list of CSS classes to be ignored by the crawler. The main current limitation is that only <div> elements and their content can be filtered.
Ignoring ANY kind of HTML element that has one of the given classes would be possible, but would require a deeper rewrite of the current HTML parser implementation.
Should we close this issue anyway?
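
For readers unfamiliar with the behaviour described above, the following Python sketch illustrates the idea only. It is not YaCy's implementation (YaCy's parser is written in Java), and the names `DivClassFilter` and `indexable_text` are hypothetical. It assumes the ignore list is a comma-separated string of class names and mirrors the stated limitation by filtering only `<div>` elements.

```python
from html.parser import HTMLParser


class DivClassFilter(HTMLParser):
    """Hypothetical helper: collects text, skipping <div> elements that carry an ignored class."""

    def __init__(self, ignored_classes):
        super().__init__()
        self.ignored = set(ignored_classes)
        self.skip_depth = 0      # > 0 while inside an ignored <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return               # other elements are never filtered, matching the limitation above
        if self.skip_depth:
            self.skip_depth += 1  # nested <div> inside an ignored one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.ignored.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexable_text(html, ignored_csv):
    """ignored_csv is a comma-separated list of class names, for example "footer, nav"."""
    ignored = [c.strip() for c in ignored_csv.split(",") if c.strip()]
    parser = DivClassFilter(ignored)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<div class="content"><p>Main article text</p></div>'
    '<div class="footer"><div>Imprint</div> Contact</div>'
    '<span class="footer">Still indexed</span>'
)
print(indexable_text(page, "footer, nav"))
```

In this sketch the `<span class="footer">` text is kept, which shows the effect of restricting the filter to `<div>` elements.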

Starkmann commented Feb 14, 2018

For me it can be closed.

Member

luccioman commented Feb 15, 2018

OK, I'll close it. Thanks for your feedback, @Starkmann.
