Ignore pages with noindex #9

davwheat · 2020-07-03T12:27:32Z

Thank you so much for this package! I really love using it and it saves me a lot of pain.

I wanted to ask if you could add an option to ignore pages which have been set to have indexing turned off...

e.g. pages with...

<meta name="robots" content="noindex" />

zerodevx · 2020-07-06T08:11:34Z

Hi, thanks for your note - I'm glad this helps!

Regarding your use-case, unfortunately there's no in-built mechanism at the moment that does that... =/ So for now you may need to use the -m flag to specifically ignore those pages.

We may consider adding this feature in a future release, but implementation comes with its own set of challenges:

The quick way may be to string-detect, but it's definitely not robust, because

<meta name="robots" content="noindex" />
<meta content="noindex" name="robots">
<meta
    name="robots"
    content="noindex">

are all valid HTML.
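To illustrate why a plain string match falls short, here is a minimal sketch in plain Node.js. The `variants` array mirrors the three forms above, and `naiveHasNoindex` is a hypothetical helper written for this example only, not part of the package.

```javascript
// Three equivalent spellings of the same robots meta tag.
const variants = [
  '<meta name="robots" content="noindex" />',
  '<meta content="noindex" name="robots">',
  '<meta\n    name="robots"\n    content="noindex">',
];

// Hypothetical naive check: look for one exact spelling of the tag.
const naiveHasNoindex = (html) =>
  html.includes('<meta name="robots" content="noindex"');

// Only the first variant is detected; the reordered and multi-line
// forms slip through even though they mean the same thing.
const naiveResults = variants.map(naiveHasNoindex);
console.log(naiveResults); // [ true, false, false ]
```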

The other way would be to parse the HTML (using JSDOM or similar), but that's a non-trivial task and a resource-intensive operation that will significantly impact speed.

Alternatively, you can continue to include noindex pages in your sitemap - search engines still give the noindex meta the highest priority - though this generates a bunch of errors in Search Console. =/

davwheat · 2020-07-08T16:06:48Z

It could be possible to only parse the <head> tag.
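As a rough sketch of the head-only idea, something like the following could work. It uses only plain JavaScript and tolerant regular expressions rather than a real HTML parser, so it is an illustration under stated assumptions and not a robust implementation: `headHasNoindex` is a hypothetical name, and the regexes assume quoted attribute values with no `>` inside them.

```javascript
// Hypothetical helper: true if the <head> holds a robots meta tag with noindex.
function headHasNoindex(html) {
  // Only inspect the text before </head>; if there is no closing
  // head tag, fall back to scanning the whole document.
  const end = html.search(/<\/head\s*>/i);
  const head = end === -1 ? html : html.slice(0, end);

  // Collect every <meta ...> tag, including ones split across lines.
  const metaTags = head.match(/<meta\b[^>]*>/gi) || [];

  // Check the two attributes independently so their order does not matter.
  return metaTags.some((tag) => {
    const isRobots = /\bname\s*=\s*["']robots["']/i.test(tag);
    const isNoindex = /\bcontent\s*=\s*["'][^"']*\bnoindex\b/i.test(tag);
    return isRobots && isNoindex;
  });
}

const page = '<html><head><meta\n    name="robots"\n    content="noindex"></head><body></body></html>';
console.log(headHasNoindex(page)); // true
```

This still misses edge cases such as unquoted attribute values or meta tags inside HTML comments, which is where a proper parser earns its cost.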

node-html-parser claims it can parse an HTML file in under 2ms, which wouldn't be too much of a speed hit, bearing in mind most people would likely only use this tool before deploying their changes to a webserver. I use this directly after prettier which ends up spending up to 750ms per HTML file.

I'll make some changes and see how an implementation of this could affect runtime.

davwheat · 2020-07-08T19:44:51Z

@zerodevx So I've made a version of the tool which follows noindex meta tags using htmlparser2.

It's slower than the normal version by roughly 4x...

Benchmarking with 529 HTML files (totalling 50 MB), I found that by following the noindex tags, it took about 1400-1500ms. By ignoring them, it took about 350-380ms.
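For reference, the arithmetic behind the "roughly 4x" figure can be checked as below. Taking the midpoint of each reported range is an assumption made for this sketch, not something that was measured separately.

```javascript
const files = 529;                 // HTML files in the benchmark
const withNoindexMs = [1400, 1500]; // reported range when following noindex tags
const withoutMs = [350, 380];       // reported range when ignoring them

// Assumption: use the midpoint of each reported range.
const mid = ([lo, hi]) => (lo + hi) / 2;

const slowdown = mid(withNoindexMs) / mid(withoutMs);
const extraPerFileMs = (mid(withNoindexMs) - mid(withoutMs)) / files;

console.log(slowdown.toFixed(2));       // "3.97"
console.log(extraPerFileMs.toFixed(2)); // "2.05"
```

So the added cost works out to about 2ms per file across this set.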

At the moment I've implemented it as an argument which needs to be manually enabled. I'll PR and see what you think.

zerodevx · 2020-07-09T08:35:58Z

That's great work! Looking through it right now.

zerodevx · 2020-07-09T09:13:01Z

Looks really good to me. I'll merge #10 and release a new minor.

Thanks for your contribution! 🎉

zerodevx linked a pull request Jul 9, 2020 that will close this issue

Add argument to follow "noindex" meta tags #10

Merged

zerodevx closed this as completed in #10 Jul 9, 2020
