Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add attributes dict accessor #107

Merged
merged 5 commits into from Jun 8, 2018
Merged

Add attributes dict accessor #107

merged 5 commits into from Jun 8, 2018

Conversation

eliasdorneles
Copy link
Member

It's often useful to get the attributes of the underlying elements, and Parsel currently doesn't make it obvious how to do that.

Currently, the first instinct is to use XPath, which makes it a bit awkward because you need a trick like the name(@*[i]) described in this blog post.

This PR proposes adding two methods .attrs() and .attrs_all() (mirroring get and getall) for getting attributes in a way that more or less made sense to me.
What do you think?

@kmike
Copy link
Member

kmike commented Jan 15, 2018

Great idea, attribute extraction is something that's not obvious indeed!

Something to think about: often only a single attribute is needed. In this case users will be writing sel.attrs()['class']. It is tempting to allow it to be an argument: sel.attrs('class'). But in this case attrs name is not good, as a single attribute is returned. A middle-ground could be sel.attrib('class'), which works both for a single and multiple arguments, but it can be confused with .attrib attribute of lxml node, which is just a dict, not a function. FWIW, BeautifulSoup also provides .attrib attribute, like lxml (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes).

sel.attrs_all() looks confusing - based on its name, I'd expect it to return all attributes of an element.

What are use cases for attrs_all methods? I can see how it is useful e.g. for link extraction, but in this case it is better to have attributes only for elements with an attribute of interest. It can also be useful when combining .getall() results with attributes though.

I like that .attrs is implemented for SelectorList, it looks like an useful shortcut. On the other hand, we have .extract() which follows a different logic: a single value is returned for Selector, but a list of values is returned for SelectorList, and then a method with a different name (.get / .extract_first) as a shortcut to get a first value. If we follow this design, .attrs() should have a different return type, and there could be .attr_first() method for SelectorList. But that's a moot point, as .get() was created to change this.

Arguments against current API design (they don't mean current design is not the best, I'm not sure what the best design is, suggestions and ideas are welcome!):

  • Selector.attrs() -> lxml and BS provide .attrib attribute, which allows to select a single attribute with less effort
  • SelectorList.attrs() -> .extract() works in a different way.
  • Selector.attrs_all() -> it doesn't look useful, added only for consistency;
  • SelectorList.attrs_all() -> a different logic can be more helpful for some use cases like img src extraction

An alternative is to have .attrib, without having both methods. attrs_all looks easy to implement in user code, and less useful than .extract(); .attrib attribute makes code more similar to lxml and BS, and also allow to select specific attribute values with less effort. On the other hand, currently Selector instances don't have public attributes (.root is not documented), and it SelectorList.attrib makes less sense. Providing this attribute also would make introducing attrs_all method (or alike) later more problematic, as there won't be a parallel with .get / .getall.

Thoughts?

@eliasdorneles
Copy link
Member Author

Heey Mike, thank you so much for this thoughtful review! <3

It seems you captured everything I thought about, and more!

About the need for something like attrs_all: some time ago I was using Parsel to explore an SVG file and wanted to get all the attribute names used in every element -- it took me a while because I didn't know about the .root.attrib trick and ended up using the XPath position trick. :)
To be honest, I didn't give THAT much thought about it.

Taking into account the things you mentioned, I'm now leaning more towards adding an .attrib element to Selector only, has the niceties of external consistency with lxml and BeautifulSoup, therefore probably being more like what an user would expect.
Plus, as you said, the attrs_all logic is easy to implement in user code.

So lemme update this with that change!

Thanks again for reviewing!

@codecov
Copy link

codecov bot commented Jan 21, 2018

Codecov Report

Merging #107 into master will increase coverage by 0.27%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   99.34%   99.62%   +0.27%     
==========================================
  Files           5        5              
  Lines         307      266      -41     
  Branches       61       48      -13     
==========================================
- Hits          305      265      -40     
  Misses          1        1              
+ Partials        1        0       -1
Impacted Files Coverage Δ
parsel/selector.py 100% <100%> (+1.07%) ⬆️
parsel/csstranslator.py 98.57% <0%> (-1.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58c85ca...ee042d1. Read the comment docs.

@eliasdorneles eliasdorneles changed the title Add attrs() and attrs_all() methods Add attributes dict accessor Jan 21, 2018
@eliasdorneles eliasdorneles requested review from redapple and removed request for redapple January 21, 2018 01:33
@eliasdorneles
Copy link
Member Author

@redapple sorry, I inadvertently requested a review clicking your name in the reviewer list -- I actually wanted to visit your profile and check out your cool new profile pic 😄

But of course, feel free to weigh in if you want :D

@kmike
Copy link
Member

kmike commented Jan 23, 2018

Exposing .attrib looks minimal and safe, though I would probably miss something attrib-like for SelectorList.

@eliasdorneles
Copy link
Member Author

Hm yeah...

What if we added an .attrib to SelectorList -- should we 1) return only for the first element or 2) do it like extract and return a list?
I lean more into option 1, because it reflects the behavior of XPath of converting a nodeset into a node and also because I think it's weirder to have different return types than ignoring the next sibling elements.

@kmike
Copy link
Member

kmike commented Jan 23, 2018

+1 to return .attrib for a first element in this case, or an empty dict if SelectorList is empty.

//cc @dangra @redapple @lopuhin are you OK with this design?

@dangra
Copy link
Member

dangra commented Jan 25, 2018

+1 to return .attrib for a first element in this case, or an empty dict if SelectorList is empty.
//cc @dangra @redapple @lopuhin are you OK with this design?

Honestly, Selector(List) api looks messy already, old and recent additions went on different directions without a common agreement.

extract() and extract_first() methods

Selector.extract() returns string.
Selector.extract_first() doesn't exist.
SelectorList.extract() returns list of strings.
SelectorList.extract_first() returns string or default=None if empty list.

== .extract() was part of the original API.
-- It returns different types based on parent type.
-- Selector and SelectorList interface is different (no Selector.extract_first())
-- the pattern didn't scale to new features like extract_first() that in retrospective are a clear mistake.

re() and re_first() methods

Selector.re() returns list of strings.
Selector.re_first() returns string or default=None if doesn't match.
SelectorList.re() returns list of strings.
SelectorList.re_first() returns string or default=None if doesn't match.

== re()/re_first() are "perfect"
-- re() is not a pure selector method, it does post extraction filtering plus processing (replace entities).
++ they returns same type and same meaning no matter if called for Selector or SelectorList.
++ They also offer same method signature for Selector and SelectorList.
== they could be called re/reall() following get/getall() pattern except it breaks backwards compatibility.
== Or deprecate re/re_first() methods and extend get/getall() signatures with regex= parameter.

get() and getall() methods

Selector.get() returns string. It doesn't accept default parameter.
Selector.getall() returns list of strings.
SelectorList.get() returns string or default=None if empty list.
SelectorList.getall() returns list of strings.

++ Same return types across classes.
-- .get() signature is different for Selector and SelectorList

New attrib property

Selector.attrib returns dict
SelectorList.attrib returns dict or empty dict if empty list.

== Handy shortcut to access attributes
++ Coherent with lxml and BS interfaces
++ Coherent with get() and clear path to add attriball if needed
-- Not possible to distinguish between not matching node and node with no attributes
-- As it is a property so we can't control default value for above case.

Keeping the discussion around .attrib property, I think the only thing that worries me is the last point about caller not being able to distinguish between non-matching node and node without attributes.

For the other part of the API we probably need a document to discuss on. My starting proposal would be to deprecated extract*() and re*(). Absorb regex matching and replace_entities as get*() arguments and add .text shortcut much like .attrib.

@kmike
Copy link
Member

kmike commented Jan 25, 2018

A small note about .text shortcut: I'd prefer it to be a method, because there are many ways to extract text from html snippet (see also: #34).

I don't have a strong opinion about .attrib being None or an empty dict for empty SelectorList; you need to either handle an exception or check if SelectorList is empty. I slightly prefer an empty dict because an empty SelectorList or a SelectorList where a first element is not what you want look similar from user point of view. Also, with .extract_first it is common to use result directly as an item value, while for .attrib I expect that in most cases it'd be a lookup for a specific attribute. But None is fine as well, you just get a different exception.

As for deprecating extract/extract_first, I think we should start with switching Scrapy docs to get/getall.

@eliasdorneles
Copy link
Member Author

Hey fellows, following up one month later... sorry for the long delay, been a bit behind in the OSS stuff lately!
So I've added the .attrib for SelectorList, does this look good now?
Thanks!

@kmike
Copy link
Member

kmike commented Feb 26, 2018

Thanks @eliasdorneles! I think the current pull request is fine, and the only unresolved question is whether attrib should be None or an empty dict for empty SelectorList objects.

@dangra wrote:

Keeping the discussion around .attrib property, I think the only thing that worries me is the last point about caller not being able to distinguish between non-matching node and node without attributes.

So, let's check example use patterns:

# --- get an attribute, raise an error if attribute is not present:
sel.css('a').attrib['href']  # in both cases

# --- get an attribute, use None as its default value, but raise an error if selector is empty:

# None
sel.css('a').attrib.get('href')

# {}, though not exactly equivalent
sel.css('a')[0].attrib.get('href')

# --- get an attribute, use None as its value when attribute 
# is not present or a selector is empty:

# None
(sel.css('a').attrib or {}).get('href')

# {}
sel.css('a').get('href')

Examples convinced me None is better: it allows to distinguish between non-matching selector and a first element without attributes, and is not much harder to use when the opposite behavior is required. It also looks easier to debug in cases you've got a selector wrong, there will be less questions about "element has attributes, but css('my-selector").attrib is empty".

So +1 to switch to None, add an example to parsel docs, and merge.

@eliasdorneles
Copy link
Member Author

So, I was hoping I wouldn't need to jump into the bikeshed wagon, but here I go... :D

The problem of using None is that it will give rise to this kind of code:

if sel.css('a').attrib:
    something = sel.css('a').attrib['something']

with all its variants:

something = sel.css('a').attrib['something] if sel.css('a').attrib else None
something = (sel.css('a').attrib or {}).get('something')

I would also bet that the null-check would be done after hitting the problem (possibly in production).

It may not seem like a big deal, but forcing all users to add a null-check to guarantee their code won't break just because of an use-case that is likely not that common, it makes for a counter-intuitive API. Moreover, users that actually need to distinguish between empty selectors will probably check sel.css('a') already anyway.

I hereby invoke the motto practicality beats purity. :)

The main reason we're adding attrib to SelectorList is for it to be a shortcut, which means we're already thinking "practicality beats purity" -- we wouldn't want to return attributes for a list if we were caring more about the type consistency than the usability.

TL;DR: I believe that 1) always returning a dict will make it easier for users to write robust code for the most common case, and that 2) returning None will not make the resulting API any cleaner (nor dirtier, either) and is more prone to errors.

Does this make sense?

@cathalgarvey
Copy link

FWIW.. I agree with @eliasdorneles that returning an empty dict would make a more ergonomic API, even if it provides less information to the caller (e.g., on whether the attribute or selector result was empty).

But, to make things a little worse, can I ask whether attributes are guaranteed to be unique in a HTML element? Because I don't think there's anything preventing someone from doing <div foo="bar" foo="baz">, right? So should the API return an empty defaultdict(list), not just an empty dict?

That would mean that the usage starts to look a bit daft, though, so we're back to the question of "purity vs. practicality". Even if sel.attrib['foo'] == ['bar', 'baz'] would be more correct, it would make the vastly-more-common usages more complex.

@kmike
Copy link
Member

kmike commented Feb 27, 2018

But, to make things a little worse, can I ask whether attributes are guaranteed to be unique in a HTML element?

Yes, as per https://www.w3.org/TR/html5/syntax.html#elements-attributes: "There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other." Also, lxml provides attributes as a dict, we're just using its API.

def attrib(self):
"""Return the attributes dictionary for underlying element.
"""
return self.root.attrib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns an lxml.etree._Attrib instance, which can lead to unforseen consequences:

  • changes in this object are reflected in a tree;
  • this object may keep a reference to a tree, preventing response GC;
  • isinstance(sel.attrib, dict) is False

What do you think about returning dict(self.root.attrib) instead? It will ensure a copy, and make output a regular Python dictionary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oohh, good catch!
Sounds good, I'll do that in a jiffy!

@kmike
Copy link
Member

kmike commented Jun 7, 2018

I'm fine with both empty dict and None values for empty SelectorList, @eliasdorneles and @cathalgarvey convinced me that None is not better. Let's put a final decision on @dangra's shoulders :)

This is done to prevent:

- changes in this object be reflected in a tree;
- keeping a reference to a tree and preventing response GC;
- isinstance(sel.attrib, dict) to be False

Thanks @kmike !
@kmike
Copy link
Member

kmike commented Jun 8, 2018

Thanks @eliasdorneles! Actually it was @dangra who brought this up and asked for a copy :)

@dangra dangra merged commit c8ff59b into master Jun 8, 2018
@dangra
Copy link
Member

dangra commented Jun 8, 2018

party time 🎆

@dangra dangra deleted the add-attrs branch June 8, 2018 13:35
@eliasdorneles
Copy link
Member Author

Yaay! Thank you, fellows! 🎉 :)

@kmike kmike added this to the v1.5 milestone Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants