nofollow doesn't work correctly when there are multiple values in the rel attribute #1201

Closed
aldarund opened this issue May 1, 2015 · 0 comments

aldarund commented May 1, 2015

According to the HTML spec, rel can hold multiple space-separated values: http://www.w3.org/TR/html401/struct/links.html#adef-rel

But Scrapy (LxmlParserLinkExtractor and SgmlLinkExtractor, though the latter probably doesn't matter since it's deprecated) only checks whether rel is exactly equal to 'nofollow':

link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)

So links like this one are not handled correctly:

 <a href='http://blablabla.com/' rel='external nofollow'>bla bla</a>

This isn't a made-up case. It comes from real-world sites where I saw Scrapy follow a nofollow link. For example, this site: www.bruceclay.com/blog/secondary-keywords/
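
One possible way to handle this, as a rough sketch rather than the actual Scrapy fix: split the rel value on whitespace and test for membership instead of comparing the whole string. The helper name `has_nofollow` is hypothetical, and lowercasing the value is an assumption based on HTML link types being case-insensitive.

```python
def has_nofollow(rel):
    """Return True if 'nofollow' is one of the whitespace-separated rel values.

    `rel` is the raw attribute value, e.g. the result of el.get('rel'),
    which is None when the attribute is absent.
    """
    if rel is None:
        return False
    # Split on any whitespace so 'external nofollow' yields ['external', 'nofollow'].
    return 'nofollow' in rel.lower().split()


if __name__ == '__main__':
    for value in ('nofollow', 'external nofollow', 'external', None):
        print(repr(value), has_nofollow(value))
```

With something like this, the extractor line could pass `nofollow=has_nofollow(el.get('rel'))` instead of the strict equality check.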
