Ignore pages with noindex #9

Closed
davwheat opened this issue Jul 3, 2020 · 5 comments · Fixed by #10

Comments

davwheat commented Jul 3, 2020

Thank you so much for this package! I really love using it and it saves me a lot of pain.

I wanted to ask if you could add an option to ignore pages which have been set to have indexing turned off...

e.g. pages with...

<meta name="robots" content="noindex" />

zerodevx (Owner) commented Jul 6, 2020

Hi, thanks for your note - I'm glad this helps!

Regarding your use-case, unfortunately there's no in-built mechanism at the moment that does that... =/ So for now you may need to use the -m flag to specifically ignore those pages.

We may consider adding this feature in a future release, but implementation comes with its own set of challenges:

The quick way may be to string-detect, but that's definitely not robust, because

<meta name="robots" content="noindex" />
<meta content="noindex" name="robots">
<meta
    name="robots"
    content="noindex">

are all valid HTML.

The other way will be to parse the HTML (using JSDOM or such), but it's a non-trivial task and a resource-intensive operation that will significantly impact speed.
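
To illustrate why plain string detection is fragile, here is a minimal sketch that runs a naive substring check against the three spellings listed above. The `naiveHasNoindex` helper is hypothetical and not part of the tool; it only shows that a check written for one spelling misses the other two.

```javascript
// The three equivalent spellings of the tag from the comment above.
const variants = [
  '<meta name="robots" content="noindex" />',
  '<meta content="noindex" name="robots">',
  '<meta\n    name="robots"\n    content="noindex">',
];

// Hypothetical naive check: looks for one exact spelling only.
const naiveHasNoindex = (html) =>
  html.includes('<meta name="robots" content="noindex"');

// Only the first variant matches; reordered attributes and
// line breaks inside the tag both slip through.
const results = variants.map(naiveHasNoindex);
console.log(results); // [ true, false, false ]
```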

Alternatively, you can continue to include noindex pages in your sitemap - search engines still give the noindex meta the highest priority - though this generates a bunch of errors in Search Console. =/

davwheat (Author) commented Jul 8, 2020

It could be possible to only parse the <head> tag.

node-html-parser claims it can parse an HTML file in under 2ms, which wouldn't be too much of a speed hit, bearing in mind most people would likely only use this tool before deploying their changes to a webserver. I use this directly after prettier which ends up spending up to 750ms per HTML file.

I'll make some changes and see how an implementation of this could affect runtime.
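
As a rough, dependency-free sketch of the "only look at the `<head>`" idea: the helpers below (`headOnly`, `parseAttributes`, `hasNoindex`) are hypothetical names, and the regex-based scan is a stand-in for a real parser such as node-html-parser, not how either library works internally. It assumes no attribute value contains a `>` character.

```javascript
// Hypothetical helper: keep only the text before </head>, so the
// (usually much larger) body is never scanned.
function headOnly(html) {
  const end = html.search(/<\/head\s*>/i);
  return end === -1 ? html : html.slice(0, end);
}

// Hypothetical helper: collect a tag's attributes into an object,
// regardless of their order, quoting style, or line breaks.
function parseAttributes(tag) {
  const attrs = {};
  const re = /([^\s"'<>\/=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
  let m;
  while ((m = re.exec(tag)) !== null) {
    const value = m[2] || m[3] || m[4] || '';
    attrs[m[1].toLowerCase()] = value.toLowerCase();
  }
  return attrs;
}

// True if the head contains <meta name="robots"> whose content
// lists "noindex" (possibly alongside other directives).
function hasNoindex(html) {
  const metaTags = headOnly(html).match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) => {
    const { name, content = '' } = parseAttributes(tag);
    const directives = content.split(',').map((s) => s.trim());
    return name === 'robots' && directives.includes('noindex');
  });
}

console.log(hasNoindex('<head><meta content="noindex, nofollow" name="robots"></head>'));
```

Unlike an exact string match, this handles reordered attributes and tags split across lines, but it is still not a full HTML parser, so a proper parsing library remains the safer choice.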

davwheat (Author) commented Jul 8, 2020

@zerodevx So I've made a version of the tool which follows noindex meta tags using htmlparser2.

It's slower than the normal version by roughly 4x...

Benchmarking with 529 HTML files (totalling 50 MB), I found that by following the noindex tags, it took about 1400-1500ms. By ignoring them, it took about 350-380ms.
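
For a sense of scale, the figures above can be turned into a per-file cost. This is only arithmetic on the quoted numbers, taking the midpoint of each reported range as an assumption:

```javascript
// Figures quoted above; midpoints of the reported ranges are an assumption.
const files = 529;
const withNoindexMs = (1400 + 1500) / 2;   // 1450 ms with noindex detection
const withoutNoindexMs = (350 + 380) / 2;  // 365 ms without it

// Overall slowdown factor and the extra time spent per HTML file.
const slowdown = withNoindexMs / withoutNoindexMs;
const extraPerFileMs = (withNoindexMs - withoutNoindexMs) / files;

console.log(`slowdown: ${slowdown.toFixed(1)}x`);                     // slowdown: 4.0x
console.log(`extra cost per file: ${extraPerFileMs.toFixed(2)} ms`);  // extra cost per file: 2.05 ms
```

So the roughly 4x slowdown works out to about 2 ms of extra work per file on this data set.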

At the moment I've implemented it as an argument which needs to be manually enabled. I'll PR and see what you think.

zerodevx linked a pull request Jul 9, 2020 that will close this issue

zerodevx (Owner) commented Jul 9, 2020

That's great work! Looking through it right now.

zerodevx (Owner) commented Jul 9, 2020

Looks really good to me. I'll merge #10 and release a new minor.

Thanks for your contribution! 🎉
