# Selectors

In this tutorial, we'll explore the functionality of various simple `soupsavvy` selectors, showcasing how they can be used to perform searches in `BeautifulSoup` objects.

## AttributeSelector

Attribute selectors select elements by their attributes: either by the presence of an attribute name, or by the attribute's value.

### Finding by attribute name

`AttributeSelector` can be used to find HTML elements with a specific attribute name. If an element contains the given attribute, it matches the selector, regardless of the attribute's value.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """<p id="12ghj8">Book</p><p class="price">Price: $20</p>""", features="lxml"
)
price_selector = AttributeSelector("class")
price_selector.find(soup)

### Finding by exact value

`AttributeSelector` can also match the exact value of a given attribute. When a string is passed as the `value` parameter, the selector matches only elements whose attribute matches this exact value.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="price">Price: $20</p>""",
    features="lxml",
)
price_selector = AttributeSelector("class", value="price")
price_selector.find(soup)

### Finding by regex

For more flexible searches, `AttributeSelector` can also match elements against a regular expression. Passing a compiled regex pattern as `value` enables partial matches and more complex queries on the attribute's value.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <a href="https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm">fictiondb</a>
        <a href="https://search.worldcat.org/title/1056176764">worldcat</a>
    """,
    features="lxml",
)
# Match links whose href points to a worldcat.org entry with a 10-digit id.
worldcat_selector = AttributeSelector(
    "href", value=re.compile(r"worldcat\.org/.*/\d{10}")
)
worldcat_selector.find(soup)
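
To see what the pattern above accepts, it can be tried against the raw `href` strings using only the `re` module. This sketch assumes the pattern is applied with search semantics (a match anywhere in the value), which is how `BeautifulSoup` treats compiled patterns; `fullmatch` is included only for contrast.

```python
import re

# Same pattern as in the selector example above.
pattern = re.compile(r"worldcat\.org/.*/\d{10}")

hrefs = [
    "https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm",
    "https://search.worldcat.org/title/1056176764",
]

# search() looks for the pattern anywhere in the string.
search_hits = [href for href in hrefs if pattern.search(href)]

# fullmatch() requires the whole string to match, so the scheme and
# subdomain in front of "worldcat" make it fail for both links.
full_hits = [href for href in hrefs if pattern.fullmatch(href)]

print(search_hits)
print(full_hits)
```

If a pattern should cover the whole attribute value, anchoring it explicitly with `^` and `$` keeps the intent visible in the pattern itself.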

### Other functionalities

Let's explore some common functionalities available with the `AttributeSelector` and other selectors in `soupsavvy`. These functionalities behave consistently across all selector types, so the following examples apply universally.

#### Using `strict` mode

When no match is found, the behavior of the `AttributeSelector.find` method depends on the `strict` parameter. If `strict` is set to `True`, it raises a `TagNotFoundException`. If `strict` is set to `False` (the default), it returns `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector
from soupsavvy.exceptions import TagNotFoundException

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="how_much">Price: $20</p>""",
    features="lxml",
)
price_selector = AttributeSelector("class", value="price")

print(f"NOT STRICT: {price_selector.find(soup)}")

try:
    price_selector.find(soup, strict=True)
except TagNotFoundException as e:
    print(f"STRICT: {e}")
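
The same return-`None`-or-raise contract can be sketched in plain Python. `first_match` and `NotFoundError` are hypothetical names used only for illustration; this is not how `soupsavvy` implements `find`, just the behaviour in miniature.

```python
from typing import Callable, Iterable, Optional


class NotFoundError(LookupError):
    """Hypothetical stand-in for an exception such as TagNotFoundException."""


def first_match(
    items: Iterable[str],
    predicate: Callable[[str], bool],
    strict: bool = False,
) -> Optional[str]:
    """Return the first item accepted by predicate.

    With strict=False a miss yields None; with strict=True it raises.
    """
    for item in items:
        if predicate(item):
            return item
    if strict:
        raise NotFoundError("no item matched the predicate")
    return None


classes = ["title", "how_much"]

# Non-strict: the caller must be ready to handle None.
lenient_result = first_match(classes, lambda c: c == "price")

# Strict: the miss surfaces immediately as an exception.
try:
    first_match(classes, lambda c: c == "price", strict=True)
    strict_raised = False
except NotFoundError:
    strict_raised = True

print(lenient_result, strict_raised)
```

As a rule of thumb, strict mode suits fields whose absence signals a changed page layout, while non-strict mode suits optional fields.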

#### Finding all elements

The `find_all` method can be used to return all matching elements, similar to the `BeautifulSoup` `Tag.find_all` method. The elements in the result list are always unique and maintain the same order as they appear in the document.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: $30</p>
    """,
    features="lxml",
)
price_selector = AttributeSelector("class", value="price")
price_selector.find_all(soup)
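
The uniqueness and ordering guarantee can be pictured with a small stand-alone sketch. `unique_in_order` is a hypothetical helper, and treating "unique" as "the same element object" (rather than equal markup) is an assumption made for illustration, not a description of `soupsavvy` internals.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def unique_in_order(items: Iterable[T]) -> List[T]:
    """Drop repeated objects, keeping the first occurrence of each."""
    seen_ids = set()
    unique: List[T] = []
    for item in items:
        # id() identifies the object itself, so two distinct objects with
        # equal content are both kept.
        if id(item) not in seen_ids:
            seen_ids.add(id(item))
            unique.append(item)
    return unique


first = ["p", "Price: $10"]
second = ["p", "Price: $20"]
lookalike = ["p", "Price: $10"]  # equal to `first`, but a separate object

# `first` is encountered twice, as it might be if two search paths
# reached the same element.
result = unique_in_order([first, second, first, lookalike])
print(result)
```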

#### Using `limit` option

When using the `find_all` method, the `limit` parameter can be used to restrict the number of elements returned. If `limit` is set to `None` (the default), all matching elements are returned. This functionality is derived from the `BeautifulSoup` `Tag.find_all` method.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: $30</p>
    """,
    features="lxml",
)
price_selector = AttributeSelector("class", value="price")
price_selector.find_all(soup, limit=2)
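
The semantics of `limit` resemble slicing a stream of matches. The sketch below uses only `itertools`; `take` and `price_stream` are hypothetical names, and nothing here is `soupsavvy` code. It shows that `limit=None` means "no cap" and that a capped search does not need to look past the last element it returns.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def take(matches: Iterable[str], limit: Optional[int] = None) -> List[str]:
    """Collect at most `limit` matches; None means collect them all."""
    # islice(iterable, None) runs until the iterable is exhausted.
    return list(islice(matches, limit))


inspected: List[str] = []


def price_stream() -> Iterator[str]:
    """Yield matches one by one, recording each one that is produced."""
    for price in ["Price: $10", "Price: $20", "Price: $30"]:
        inspected.append(price)
        yield price


first_two = take(price_stream(), limit=2)
print(first_two)
print(inspected)  # the third price was never produced
```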

#### Using `recursive` option

Both `find` and `find_all` accept a `recursive` parameter. If `recursive` is `True` (the default), the search covers all descendants of the element. If it is `False`, the search is limited to the element's direct children. This is consistent with how `recursive` works in `BeautifulSoup`. The example below uses `html.parser`, which, unlike `lxml`, does not wrap the fragment in `<html>` and `<body>`, so the `<span>`, the `<div>` and the last `<p>` are direct children of the soup.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <div>
            <p class="price">Price: $10</p>
            <p class="price">Price: $20</p>
        </div>
        <p class="price">Price: $30</p>
    """,
    features="html.parser",
)
price_selector = AttributeSelector("class", value="price")
price_selector.find(soup, recursive=False)
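
The difference between the two modes can also be illustrated with the standard library alone. In this sketch `xml.etree.ElementTree` stands in for the parsed document and `is_price` is a hypothetical helper; it is an analogy for the direct-children versus all-descendants distinction, not the `soupsavvy` API.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment mirroring the example above, wrapped in a single
# <root> element because ElementTree requires exactly one root.
root = ET.fromstring(
    """
    <root>
        <span class="title">Animal Farm</span>
        <div>
            <p class="price">Price: $10</p>
            <p class="price">Price: $20</p>
        </div>
        <p class="price">Price: $30</p>
    </root>
    """
)


def is_price(element: ET.Element) -> bool:
    """Check whether the element's class attribute equals 'price'."""
    return element.get("class") == "price"


# Iterating an element yields its direct children only,
# comparable to recursive=False.
direct = [el.text for el in root if is_price(el)]

# iter() walks all descendants in document order (and the root itself),
# comparable to recursive=True.
nested = [el.text for el in root.iter() if is_price(el)]

print(direct)
print(nested)
```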

### Specific attribute selectors

`soupsavvy` offers convenience classes for selecting elements based on specific attributes, such as `id` and `class`. These classes simplify the selection process by pre-defining commonly used attribute names.

**Convenience Selectors:**

- **`IdSelector`**: Selector for matching elements by their `id` attribute.

- **`ClassSelector`**: Selector for matching elements by their `class` attribute.

Both are subclasses of `AttributeSelector`. By pre-assigning attribute names, they offer a more intuitive interface and reduce the amount of boilerplate code needed for common attribute-based searches.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="price">Price: $20</p>""",
    features="lxml",
)
price_selector = ClassSelector("price")
price_selector.find(soup)

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import IdSelector

soup = BeautifulSoup(
    """<p id="12ghj8">Book</p><p id="13cji0" class="price">Price: $20</p>""",
    features="lxml",
)
price_selector = IdSelector(re.compile(r"^13.*0$"))
price_selector.find(soup)

Because the attribute name is pre-assigned, neither class takes a `name` parameter: as the examples show, the first argument is the value to match, given either as a string or as a compiled pattern.

## TagSelector

`TagSelector` is used to select elements based on their tag name and attributes.

### Finding by tag name

`TagSelector` can be used to find elements based solely on their tag name, for example all `<p>` elements in a document. This basic usage lets you target elements without specifying any additional attributes.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """<p class="price">Price: $10</p><p class="price">Price: $20</p>""",
    features="lxml",
)
price_selector = TagSelector("p")
price_selector.find(soup)

### Finding by tag name and single attribute

In addition to the tag name, the `TagSelector` can be configured to match elements with a specific attribute. For example, you can find all `<p>` elements with `class` attribute set to `price`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector, AttributeSelector

soup = BeautifulSoup(
    """<p class="title">Animal Farm</p><p class="price">Price: $10</p>""",
    features="lxml",
)
price_selector = TagSelector("p", attributes=[AttributeSelector("class", "price")])
price_selector.find(soup)

### Finding by multiple class attributes

When you need to select elements with a specific combination of class values, `TagSelector` can match those elements accordingly. For instance, you can select all elements whose `class` attribute contains both `price` and `discount`. The `class` attribute is special in this respect, as it can hold a list of values separated by spaces.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title">Animal Farm</p>
        <p class="price"><s>Price: $20</s></p>
        <p class="price discount">Price: $10</p>
    """,
    features="lxml",
)
price_selector = TagSelector(
    attributes=[
        AttributeSelector("class", "price"),
        AttributeSelector("class", "discount"),
    ],
)
price_selector.find(soup)
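
Since `class` holds a whitespace-separated list of tokens, requiring two class values amounts to requiring that both tokens are present. A minimal plain-Python sketch of that check follows; `has_all_classes` is a hypothetical helper, independent of `soupsavvy`, and it treats the tokens as order-independent.

```python
from typing import Set


def has_all_classes(class_attribute: str, required: Set[str]) -> bool:
    """Check that every required class name is among the tokens."""
    tokens = set(class_attribute.split())  # split() handles any whitespace
    return required.issubset(tokens)


class_values = ["title", "price", "price discount", "discount  price sale"]

matching = [
    value
    for value in class_values
    if has_all_classes(value, {"price", "discount"})
]
print(matching)
```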

### Finding by tag name and multiple attributes

`TagSelector` can also match elements based on multiple distinct attributes, for example all `<p>` tags whose `class` equals `title` and whose `lang` equals `en`. You can specify as many attributes as needed to match the desired elements.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <p class="title" lang="en">Animal Farm</p>
    """,
    features="lxml",
)
title_selector = TagSelector(
    "p",
    attributes=[
        AttributeSelector("class", "title"),
        AttributeSelector("lang", "en"),
    ],
)
title_selector.find(soup)

### Finding only by attributes 

Specifying a tag name is optional. `TagSelector` can also find elements based only on their attributes, for example elements whose `class` equals `title` and whose `lang` equals `en`, regardless of their tag name.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
title_selector = TagSelector(
    attributes=[
        AttributeSelector("class", "title"),
        AttributeSelector("lang", "en"),
    ],
)
title_selector.find(soup)

### Finding any element with empty selector

When `TagSelector` is initialized without any parameters, it matches any element in the document. This behavior is similar to the wildcard selector in CSS. Using `find` in this context is equivalent to getting the first element of the soup.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser",
)
title_selector = TagSelector()
title_selector.find(soup)

## AnyTagSelector

This is a wildcard selector that matches any tag, so it can be used to select any element in the document. It is equivalent to `*` in CSS selectors and to `TagSelector()` created without any parameters.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AnyTagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser",
)
any_selector = AnyTagSelector()
any_selector.find(soup)

Additionally, two selectors that are sure to yield the same results are considered equal. `AnyTagSelector` is equal to any other `AnyTagSelector` and to a `TagSelector` created without any parameters.

In [None]:
from soupsavvy import AnyTagSelector, TagSelector

any_selector = AnyTagSelector()
another_any_selector = AnyTagSelector()
empty_tag_selector = TagSelector()

any_selector == another_any_selector == empty_tag_selector

When used with the `find_all` method, `AnyTagSelector` returns all elements in the document. Setting the `recursive` parameter to `False` restricts the result to the direct children of the element.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import AnyTagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser",
)
any_selector = AnyTagSelector()
any_selector.find_all(soup)

## PatternSelector

`PatternSelector` is designed to select elements based on their text content. Unlike the standard `BeautifulSoup` implementation, which returns `NavigableString` when querying by text content, `PatternSelector` provides a more consistent and practical approach.

In `BeautifulSoup`, searching by text content returns a `NavigableString` whenever a match is found. This is cumbersome when further searching or processing is needed, because a `NavigableString` is a string-like node rather than a `Tag`: it has no attributes or child elements to query, so the enclosing element has to be retrieved separately.

In [None]:
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
result = soup.find(string=re.compile("animal", re.IGNORECASE))
print(f"Result is of type {type(result)} : {result}")

`PatternSelector` addresses this limitation by returning a `Tag` object instead of `NavigableString`. This behavior aligns with other selectors and allows for more effective manipulation and querying of the text-containing elements.

### Finding by plain text

When searching by plain text, `PatternSelector` matches elements whose text content exactly matches the provided string.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import PatternSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
animal_selector = PatternSelector("Animal Farm")
animal_selector.find(soup)

### Finding by regex

For more flexible searches, `PatternSelector` can also match elements against a regular expression. Passing a compiled regex pattern enables partial matches and more complex queries on the text content.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import PatternSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
animal_selector = PatternSelector(re.compile(r"animal", re.IGNORECASE))
animal_selector.find(soup)
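
The difference between the two matching modes can be shown on the raw text strings with plain Python. This sketch assumes compiled patterns are applied with search semantics (a match anywhere in the text); it does not involve `soupsavvy` and says nothing about which element a selector returns.

```python
import re

# Text content of the three elements used in the examples above.
texts = [
    "Rebelión en la granja",
    "Some animals are more equal than others",
    "Animal Farm",
]

# Plain text: only a string identical to the query matches.
exact = [text for text in texts if text == "Animal Farm"]

# Regex: any text containing "animal", in any letter case, matches.
pattern = re.compile(r"animal", re.IGNORECASE)
partial = [text for text in texts if pattern.search(text)]

print(exact)
print(partial)
```

Note that the case-insensitive pattern also accepts the description text, because "animals" contains "animal".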

To maintain a streamlined and intuitive interface, `TagSelector` does not include text-based search functionality. Instead, `PatternSelector` was introduced as a separate selector to provide a more coherent design. However, the two selectors can be combined to match elements based on both text content and attributes or tag names. For example, the snippet below uses `TagSelector` alongside `PatternSelector` to locate the first `<p>` element with a `class` of `description`, a `lang` attribute of `en`, and text content containing the word "animal" (case-insensitive).

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, PatternSelector, TagSelector

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="es">Todos los animales son iguales</p>
        <p class="title" lang="en">Animal Farm</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
    """,
    features="lxml",
)
description_selector = TagSelector(
    "p",
    attributes=[
        AttributeSelector("lang", value="en"),
        AttributeSelector("class", value="description"),
    ],
)
animal_selector = PatternSelector(re.compile(r"animal", re.IGNORECASE))

combined_selector = description_selector & animal_selector
combined_selector.find(soup)

This example uses the `soupsavvy`-specific `&` syntax for combining selectors, which results in an `AndSelector`. This topic is explained in detail, with examples, in the next tutorial.
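
As rough intuition for operator-based combination, the idea can be modelled in plain Python with `__and__`. The `Predicate` class and the dictionary-based `Element` type are hypothetical stand-ins invented for this sketch; it is not how `soupsavvy` implements `AndSelector`.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Element = Dict[str, str]  # hypothetical stand-in for a parsed element


@dataclass(frozen=True)
class Predicate:
    """Hypothetical selector-like object wrapping a boolean test."""

    test: Callable[[Element], bool]

    def matches(self, element: Element) -> bool:
        return self.test(element)

    def __and__(self, other: "Predicate") -> "Predicate":
        # a & b builds a new object that accepts an element only if
        # both operands accept it.
        return Predicate(lambda el: self.matches(el) and other.matches(el))


is_english = Predicate(lambda el: el.get("lang") == "en")
is_description = Predicate(lambda el: el.get("class") == "description")
mentions_animal = Predicate(lambda el: "animal" in el.get("text", "").lower())

combined = is_english & is_description & mentions_animal

elements = [
    {"class": "title", "lang": "es", "text": "Rebelión en la granja"},
    {"class": "description", "lang": "es", "text": "Todos los animales son iguales"},
    {"class": "title", "lang": "en", "text": "Animal Farm"},
    {"class": "description", "lang": "en", "text": "Some animals are more equal than others"},
]

hits = [el["text"] for el in elements if combined.matches(el)]
print(hits)
```

Returning a new object from `__and__` keeps the operands unchanged, so each simple condition can be reused in other combinations.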

## Overview

These fundamental selectors form the core of `soupsavvy` and provide the building blocks for more complex queries. In the next tutorials, we'll explore advanced options and composite selectors that build upon these basics to enable even more powerful and flexible searching capabilities.

**Enjoy `soupsavvy` and leave us feedback!**  
**Happy scraping!**