Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed default value of attrs argument in SgmlLinkExtractor to be tuple #661

Merged
merged 1 commit into from Mar 27, 2014

Conversation

@ananana
Copy link
Contributor

@ananana ananana commented Mar 20, 2014

  • Fixed bug in issue #652 (default value of attrs is now tuple)
  • Wrapped tags and attrs in arg_to_iter(), so tag_func and attr_func would accept both strings and lists. (so this version of the constructor is compatible with the old one)
@kmike
Copy link
Member

@kmike kmike commented Mar 20, 2014

Hey @ananana,

Thanks for this PR! Could you please add some tests for it?

I like how you used arg_to_iter because it makes the interface more consistent - other parameters for SgmlLinkExtractor already use it. But as it is a new feature (strings or None were not handled properly previously) it should be mentioned in docs for SgmlLinkExtractor. Also, there is deny_extensions that doesn't use arg_to_iter - maybe fix it as well?

@ananana
Copy link
Contributor Author

@ananana ananana commented Mar 25, 2014

I made the following changes:

  • Used arg_to_iter for deny_extensions as well, as suggested (so now it accepts strings as well)
  • Added tests for attrs and for deny_extensions
  • Changed docs for SgmlLinkExtractor to reflect these changes

Still left to do:

  • Add tests for tags parameter
@ananana
Copy link
Contributor Author

@ananana ananana commented Mar 25, 2014

Added tests for tags as well. Should be ready if everything looks ok.

@nyov
Copy link
Contributor

@nyov nyov commented Mar 25, 2014

plenty good testcases to support this change, for me. Thanks very much.

+1 to merge this

(on the discussion about SGMLParser not parsing <area /> correctly, I'll add that self-closing elements are not strictly valid SGML, but XML, so this could be seen as not a bug in sgmllib. This should be a 'fix' in scrapy, if deemed necessary to fix SgmlLinkExtractor)

self.deny_extensions = set(['.' + e for e in deny_extensions])
tag_func = lambda x: x in tags
attr_func = lambda x: x in attrs
self.deny_extensions = set(['.' + e for e in arg_to_iter(deny_extensions)])

This comment has been minimized.

@dangra

dangra Mar 26, 2014
Member

no need to create a set from a list comprehension, I know it was there but it is a good moment to replace it by

set('.' + e for e in arg_to_iter(deny_extensions))

This comment has been minimized.

@kmike

kmike Mar 26, 2014
Member

as Python 2.6 is no longer supported: {'.' + e for e in arg_to_iter(deny_extensions)}

This comment has been minimized.

@dangra
Copy link
Member

@dangra dangra commented Mar 26, 2014

LGTM. thanks!

attr_func = lambda x: x in attrs
self.deny_extensions = {'.' + e for e in arg_to_iter(deny_extensions)}
tag_func = lambda x: x in arg_to_iter(tags)
attr_func = lambda x: x in arg_to_iter(attrs)

This comment has been minimized.

@kmike

kmike Mar 26, 2014
Member

I suspect that if you remove arg_to_iter and make attrs=('href') again then new tests will still pass because 'href' in 'href' is True. We may e.g. add an element with "ref" attribute to the testing suite to check that condition, but I'm fine with merging without this fix.

@ananana
Copy link
Contributor Author

@ananana ananana commented Mar 26, 2014

That's true, I just added a new test for that (with a "ref" attribute) for the sake of thoroughness.
There are also in the test suite similar tests for the analogous situation for the tags parameter, set to "a" then to "area" (the results for these 2 shouldn't overlap), which seemed like a more natural test case to me.

@dangra
Copy link
Member

@dangra dangra commented Mar 26, 2014

as last request before merging, can we get this changes squashed into a single commit?

…d list parameters (attrs, tags, deny_extensions)
@ananana
Copy link
Contributor Author

@ananana ananana commented Mar 27, 2014

Rebased and squashed.

dangra added a commit that referenced this pull request Mar 27, 2014
Fixed default value of attrs argument in SgmlLinkExtractor to be tuple
@dangra dangra merged commit ade7662 into scrapy:master Mar 27, 2014
1 check passed
1 check passed
default The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants