# Selectors

In this tutorial, we will go through the functionality of different `soupsavvy` selectors.

## AttributeSelector

Attribute selectors select elements based on their attributes: either the presence of an attribute or its value.

### Finding by attribute name

`AttributeSelector` can be used to find an HTML element with a specific attribute name. If the element has the given attribute, it matches the selector. The value of the attribute is ignored in this case.

In [67]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """<p id="12ghj8">Book</p><p class="price">Price: $20</p>""",
    features="lxml"
)
price_selector = AttributeSelector("class")
price_selector.find(soup)

<p class="price">Price: $20</p>

### Finding by value exact match

`AttributeSelector` can also match the value of the given attribute. Pass a string as the `value` parameter to match the exact value of the attribute.

In [68]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="price">Price: $20</p>""",
    features="lxml"
)
price_selector = AttributeSelector("class", value="price")
price_selector.find(soup)

<p class="price">Price: $20</p>

### Finding by value partial match

Matching an element by a partial value of the attribute is also possible. Setting the `re` parameter to `True` compiles `value` into a regex pattern and matches it against the attribute value. It uses `re.search` under the hood.

In [69]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <a href="https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm">fictiondb</a>
        <a href="https://search.worldcat.org/title/1056176764">worldcat</a>
    """,
    features="lxml"
)
link_selector = AttributeSelector("href", value="worldcat", re=True)
link_selector.find(soup)

<a href="https://search.worldcat.org/title/1056176764">worldcat</a>

In a similar fashion, a compiled regex pattern can be passed as the `value` parameter, which is equivalent to the example above.

In [70]:
from bs4 import BeautifulSoup
import re

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <a href="https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm">fictiondb</a>
        <a href="https://search.worldcat.org/title/1056176764">worldcat</a>
    """,
    features="lxml"
)
link_selector = AttributeSelector("href", value=re.compile("worldcat"))
link_selector.find(soup)

<a href="https://search.worldcat.org/title/1056176764">worldcat</a>
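
Because matching relies on `re.search`, the pattern may match anywhere in the attribute value, not only at its start. The following is a minimal standard-library sketch of that difference, independent of `soupsavvy`; the `hrefs` list simply reuses the URLs from the examples above.

```python
import re

hrefs = [
    "https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm",
    "https://search.worldcat.org/title/1056176764",
]

pattern = re.compile("worldcat")

# re.search scans the whole string, so a match anywhere in the value counts.
search_hits = [href for href in hrefs if pattern.search(href)]

# re.match only tries the start of the string, so neither URL matches.
match_hits = [href for href in hrefs if pattern.match(href)]

print(search_hits)  # ['https://search.worldcat.org/title/1056176764']
print(match_hits)   # []
```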

When `re` is set to `True`, the regex is compiled from the `value` string as-is; special characters are not escaped. In the example below, `AttributeSelector` therefore looks for a digit followed by the literal `escription` in the `class` attribute value, instead of the intended literal string `\description`.

In [98]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    # Raw strings keep the backslash literal without an invalid-escape warning.
    r"""
    <span class="title\description">Animal Farm; Some animals are more equal than others</span>
    """,
    features="lxml"
)
description_selector = AttributeSelector("class", value=r"\description", re=True)
print(description_selector.find(soup))

None
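
The reason for the missing match can be reproduced with the `re` module alone. This is only a sketch of the regex behavior, not of `soupsavvy` internals; the string `title1escription` is a made-up value used to show what the pattern does match.

```python
import re

# Literal backslash followed by "description", as in the markup above.
class_value = r"title\description"

# As a regex, \d is the digit class: "one digit, then 'escription'".
pattern = re.compile(r"\description")

no_match = pattern.search(class_value)
digit_match = pattern.search("title1escription")

print(no_match)             # None
print(digit_match.group())  # 1escription
```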


To remedy this, pass a compiled regex with the special characters escaped as the `value` parameter. This also gives the user more control over the regex pattern.

In [95]:
import re

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    r"""
        <span class="title\description">Animal Farm; Some animals are more equal than others</span>
        <p class="how_much">Price: $20</p>
    """,
    features="lxml"
)
# re.escape turns the regex token \d into a pattern for a literal backslash followed by "d".
description_selector = AttributeSelector("class", value=re.compile(re.escape(r"\d")))
print(description_selector.find(soup))

<span class="title\description">Animal Farm; Some animals are more equal than others</span>
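
To see what `re.escape` produces, the snippet below applies it to the full `\description` string using only the standard library. It is a sketch of the escaping step alone and makes no assumption about how `soupsavvy` applies the pattern.

```python
import re

class_value = r"title\description"

# re.escape prefixes the backslash with another backslash,
# so the pattern matches it literally instead of reading \d as "digit".
escaped = re.escape(r"\description")
print(escaped)  # \\description

pattern = re.compile(escaped)
match = pattern.search(class_value)
print(match.group())  # \description
```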


When no match is found, the behavior of the `AttributeSelector.find` method depends on the `strict` parameter. If `strict` is set to `True`, it raises a `TagNotFoundException`. If `strict` is set to `False`, which is the default, it returns `None`.

In [73]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector
from soupsavvy.exceptions import TagNotFoundException

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="how_much">Price: $20</p>""",
    features="lxml"
)
price_selector = AttributeSelector("class", value="price")

try:
    price_selector.find(soup, strict=True)
except TagNotFoundException as e:
    print(e)

Tag was not found in markup.
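
The strict and non-strict modes follow a common "return `None` or raise" pattern. The sketch below illustrates that pattern in plain Python; `find_first` and `NotFoundError` are hypothetical names invented for this illustration and are not part of `soupsavvy`, whose implementation may differ.

```python
from typing import Callable, Iterable, Optional


class NotFoundError(Exception):
    """Hypothetical stand-in for soupsavvy's TagNotFoundException."""


def find_first(
    items: Iterable[str],
    predicate: Callable[[str], bool],
    strict: bool = False,
) -> Optional[str]:
    """Return the first item satisfying predicate; on a miss, raise or return None."""
    for item in items:
        if predicate(item):
            return item
    if strict:
        raise NotFoundError("No matching item was found.")
    return None


classes = ["title", "how_much"]

# Non-strict (default): a miss is reported as None.
print(find_first(classes, lambda c: c == "price"))  # None

# Strict: a miss raises, so the caller cannot silently ignore it.
try:
    find_first(classes, lambda c: c == "price", strict=True)
except NotFoundError as e:
    print(e)  # No matching item was found.
```

The non-strict mode suits optional elements, while the strict mode is useful when a missing element should stop processing immediately.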


The `find_all` method returns all matching elements, similar to the `BeautifulSoup` `Tag.find_all` method.

In [74]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: $30</p>
    """,
    features="lxml"
)
price_selector = AttributeSelector("class", value="price")
price_selector.find_all(soup)

[<p class="price">Price: $10</p>,
 <p class="price">Price: $20</p>,
 <p class="price">Price: $30</p>]

In [75]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: $30</p>
    """,
    features="lxml"
)
price_selector = AttributeSelector("class", value="price")
price_selector.find_all(soup, limit=2)

[<p class="price">Price: $10</p>, <p class="price">Price: $20</p>]

In [76]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <div>
            <p class="price">Price: $10</p>
            <p class="price">Price: $20</p>
        </div>
        <p class="price">Price: $30</p>
    """,
    features="html.parser"
)
price_selector = AttributeSelector("class", value="price")
price_selector.find(soup, recursive=False)

<p class="price">Price: $30</p>

In [77]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """<p class="price">Price: $10</p><p class="price">Price: $20</p>""",
    features="lxml"
)
price_selector = TagSelector("p")
price_selector.find(soup)

<p class="price">Price: $10</p>

In [78]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector, AttributeSelector

soup = BeautifulSoup(
    """<p class="title">Animal Farm</p><p class="price">Price: $10</p>""",
    features="lxml"
)
price_selector = TagSelector("p", attributes=[AttributeSelector("class", "price")])
price_selector.find(soup)

<p class="price">Price: $10</p>

In [79]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title">Animal Farm</p>
        <p class="price"><s>Price: $20</s></p>
        <p class="price discount">Price: $10</p>
    """,
    features="lxml"
)
price_selector = TagSelector(
    "p",
    attributes=[
        AttributeSelector("class", "price"),
        AttributeSelector("class", "discount"),
    ],
)
price_selector.find(soup)

<p class="price discount">Price: $10</p>

In [80]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <p class="title" lang="en">Animal Farm</p>
    """,
    features="lxml"
)
title_selector = TagSelector(
    "p",
    attributes=[
        AttributeSelector("class", "title"),
        AttributeSelector("lang", "en"),
    ],
)
title_selector.find(soup)

<p class="title" lang="en">Animal Farm</p>

In [81]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml"
)
title_selector = TagSelector(
    attributes=[
        AttributeSelector("class", "title"),
        AttributeSelector("lang", "en"),
    ],
)
title_selector.find(soup)

<span class="title" lang="en">Animal Farm</span>

In [82]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml"
)
title_selector = TagSelector()
title_selector.find(soup)

<html><body><p class="title" lang="es">Rebelión en la granja</p>
<p class="description" lang="en">Some animals are more equal than others</p>
<span class="title" lang="en">Animal Farm</span>
</body></html>

In [83]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser"
)
title_selector = TagSelector()
title_selector.find(soup)

<p class="title" lang="es">Rebelión en la granja</p>

In [84]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <p class="title" lang="en">Animal Farm</p>
    """,
    features="lxml"
)
title_selector = TagSelector("p", attributes=[AttributeSelector("class", "title")])
title_selector.find_all(soup)

[<p class="title" lang="es">Rebelión en la granja</p>,
 <p class="title" lang="en">Animal Farm</p>]