Skip to content
This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Ticapix/fix srcset discovery #353

Merged

Conversation

ticapix
Copy link
Contributor

@ticapix ticapix commented Feb 20, 2017

What this PR changes

changed the default discovery function to properly detect srcset url according to the specs

Rationale

srcset discovery function wasn't parsing correctly url with zero or more than 2 args.
Added a test as an exemple.

@ticapix
Copy link
Contributor Author

ticapix commented Mar 16, 2017

Any update if this PR will be merged ?

@fredrikekelund
Copy link
Collaborator

Hi @ticapix! Sorry about the slow response here. I've never encountered a srcset attribute with mixed width and pixel densite descriptors, and looking at MDN, it appears to be invalid syntax. I don't think simplecrawler should be too picky about standards if this is something used a lot in the wild, but we have also made the discovery functionality intentionally pluggable, meaning it's easy to swap out for your own needs. Do you know that this is a syntax that appears on a lot of different sites?

@ticapix
Copy link
Contributor Author

ticapix commented Mar 17, 2017

Thanks for looking into it.
From the draft spec http://w3c.github.io/html/semantics-embedded-content.html#element-attrdef-img-srcset, it's a valid syntax. Maybe MDN is describing Mozilla implementation.
But anyway, I would prefer simplecrawler to get better link discovery than carry out to the letter the MDN or other browser documentation.
I hope this PR will be useful to others.

@fredrikekelund
Copy link
Collaborator

Hmm well, the way I read that spec, we can only have "zero or one" of a width descriptor or a pixel density descriptor. I absolutely agree that simplecrawler shouldn't get too bogged down in HTML standards if we see a different syntax being used in the wild, but it's also unnecessary to support an invalid syntax if it's not being widely used. Do you know some places where it is being used?

@ticapix
Copy link
Contributor Author

ticapix commented Mar 17, 2017

var links =
discover("<img src='pic-200.png' srcset='pic-200.png 200px, pic-400.png 400w'>", {
discover("<img src='pic.png' srcset='pic-200.png, pic-400.png 400w, pic-800.png 800w 2x'>", {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove 800w towards the end of this line? As we discussed previously, I don't think the spec permits two descriptors there, so it would be better to not make that part of the test suite (even if the new implementation would unofficially support that syntax)

@fredrikekelund
Copy link
Collaborator

I had another look at the diff here, and I don't see any harm in the changes you propose to crawler.js. The current test suite has a minor fault which you got rid of as well, so that's good. I left a single line comment in the diff, but after we've addressed that, I think we're good to merge :)

@ticapix
Copy link
Contributor Author

ticapix commented Mar 19, 2017

should be good now.

@fredrikekelund fredrikekelund merged commit ac0d568 into simplecrawler:master Mar 19, 2017
@fredrikekelund
Copy link
Collaborator

Thanks, @ticapix!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants