# Soupsavvy

## Motivation

`soupsavvy` is a flexible search engine for `BeautifulSoup`, designed to provide more powerful search capabilities.
While `BeautifulSoup` excels at parsing HTML documents, it falls short in performing more complex searches due to the lack of a robust `selector` concept. This limitation inspired the creation of `soupsavvy`, which integrates seamlessly with `BeautifulSoup` to provide advanced search functionalities.

In `BeautifulSoup`, a `Tag` component serves dual roles as both a node in the HTML tree and a search engine. Its search engine allows for basic searches using tag names, attributes, and text, which suffices for simple scenarios. However, this approach becomes cumbersome for more complex searches. The lack of separation between the tree structure and the search engine also makes reusability challenging. Simplified, the public API of `Tag` looks like this:

```
├── Tag
│   ├── search
│   │   ├── find
│   │   ├── find_all
│   │   ├── select
│   │   ├── select_one
│   │   ├── find_sibling
│   ├── node
│   │   ├── descendants
│   │   ├── children
│   │   ├── siblings
│   │   ├── parent
│   │   ├── next
│   │   ├── previous
```

## Concept of Selector


The concept of a `soupsavvy` selector can be imitated in various ways. For simple searches where the `Tag` search engine is enough, a dictionary of parameters can be assigned to a variable and reused throughout your code by passing it as keyword arguments to the `Tag` find methods. Here is an example:

In [36]:
from bs4 import BeautifulSoup

PRICE_SELECTOR = {"name": "p", "attrs": {"class": "price"}}

soup = BeautifulSoup("""<p class="price">Price: $10</p>""", "lxml")
soup.find_all(**PRICE_SELECTOR, limit=3)
soup.find(**PRICE_SELECTOR, recursive=True)

<p class="price">Price: $10</p>

For more complex searches, we can define a function that takes a Tag as an argument and returns the desired element. This approach encapsulates the search logic, making it reusable across multiple places in your code. Here’s an example:

In [37]:
from bs4 import BeautifulSoup, Tag

# CSS Selector equivalent: p.price > span ~ a.link


def select_sth(tag: Tag):
    # Find the first <p> tag with class "price"
    tag1 = tag.find("p", attrs={"class": "price"})

    if not isinstance(tag1, Tag):
        return None

    # Find the first <span> tag within the <p> tag
    tag2 = tag1.find("span", recursive=False)

    if not isinstance(tag2, Tag):
        return None

    # Find the next sibling <a> tag with class "link"
    return tag2.find_next_sibling("a", attrs={"class": "link"})


soup = BeautifulSoup(
    """<p class="price"><span>Price: $10</span><a class="link"></a></p>""", "lxml"
)
select_sth(soup)

<a class="link"></a>

### Flaws

Moving the search logic outside of the `Tag` class is a step towards a more structured search engine, but this approach has some limitations:

* **Handling Not Found**: What if the function needs to raise an exception when the desired element is not found? In some scenarios, it is critical for the application to break if a required element is missing. This behavior is difficult to enforce with simple selector functions.
* **Parameter Passing**: How should extra parameters for the `find` and `find_all` methods, such as `recursive` or `limit`, be handled consistently across all functions?
* **Maintainability**: Maintaining multiple selector functions can become a nightmare. It makes the code harder to read and more prone to errors.
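
To make the first two points concrete, here is a minimal sketch of the boilerplate that plain selector functions tend to require. It uses only `bs4`; the names `ElementNotFound`, `find_price`, and `require` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Optional

from bs4 import BeautifulSoup, Tag


class ElementNotFound(Exception):
    """Hypothetical exception for this sketch (not part of bs4)."""


def find_price(tag: Tag) -> Optional[Tag]:
    # Plain selector function: returns None when nothing matches.
    result = tag.find("p", attrs={"class": "price"})
    return result if isinstance(result, Tag) else None


def require(selector: Callable[[Tag], Optional[Tag]], tag: Tag) -> Tag:
    # Callers that need a hard failure must route through a wrapper like this.
    result = selector(tag)
    if result is None:
        raise ElementNotFound(f"{selector.__name__} matched nothing")
    return result


soup = BeautifulSoup('<a class="price">Price: $10</a>', "html.parser")

try:
    require(find_price, soup)
    outcome = "found"
except ElementNotFound as error:
    outcome = f"failed: {error}"

print(outcome)
```

Note that `require` has no way to forward `recursive` or `limit` unless every selector function agrees on the same extra parameters, which is exactly the consistency problem described above.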

### Solution

This is where `soupsavvy` comes in, offering a solution to all these problems. It allows you to define selectors declaratively, using a simple and readable syntax. It's a powerful search engine for `BeautifulSoup`, similar to the [soupsieve](https://github.com/facelessuser/soupsieve "soupsieve") library, whose developers describe it as:

* > Soup Sieve is a CSS selector library designed to be used with Beautiful Soup 4.
* > Soup Sieve was written with the intent to replace Beautiful Soup's builtin select feature, and as of Beautiful Soup version 4.7.0, it now is 🎊

While `soupsieve` is an excellent tool for CSS selectors, `soupsavvy` is a more general solution, offering greater flexibility in defining selectors.

Perhaps one day `soupsavvy` will be integrated with BeautifulSoup, but for now, it remains a standalone library.

## Selector

The central component of `soupsavvy` is the `Selector`. This versatile component can be adjusted to suit various needs. All selectors implement the `SoupSelector` interface, which means:

* **Universal Usage**: Selectors can be used in a consistent manner across different scenarios.
* **Combinability**: Selectors can be easily combined and work seamlessly together.
* **Comparability**: Selectors can be compared with one another.
* **Functionality**: Selectors are used to find elements within the soup.

### `find` method

#### Docs

In [38]:
from soupsavvy.tags.base import SoupSelector

help(SoupSelector.find)

Help on function find in module soupsavvy.tags.base:

find(self, tag: 'Tag', strict: 'bool' = False, recursive: 'bool' = True) -> 'Optional[Tag]'
    Finds a first matching element in provided BeautifulSoup Tag.
    
    Parameters
    ----------
    tag : Tag
        Any BeautifulSoup Tag object.
    strict : bool, optional
        If True, raises an exception if tag was not found in markup,
        if False and tag was not found, returns None by defaulting to BeautifulSoup
        implementation. Value of this parameter does not affect behavior if tag
        was successfully found in the markup. By default False.
    recursive : bool, optional
        bs4.Tag.find method parameter that specifies if search should be recursive.
        If set to False, only direct children of the tag will be searched.
        By default True.
    
    Returns
    -------
    Tag | None
        BeautifulSoup Tag if it tag was found in the markup, else None.
    
    Notes
    -----
    If result is an 

#### Overview

The `find` method is one of the two primary search methods in `soupsavvy` selectors, alongside `find_all`. Much like the `BeautifulSoup` `Tag.find` method, `find` returns the first element that matches the selector. In addition to the `recursive` option from `BeautifulSoup`, it introduces a `strict` parameter. 

The `strict` parameter addresses a common issue: often a missing element is unexpected and would break the application further down the line. Continuously checking whether returned objects are `None` creates a lot of boilerplate code. When `strict` is set to `True`, the `find` method raises a `TagNotFoundException` if no element is found, ensuring that the application does not proceed in such cases. By default, `strict` is set to `False`.

A particular challenge with `BeautifulSoup` arises when only a string is provided as a parameter to the `find` method: if anything matches, it returns a `NavigableString` object rather than a `Tag`. This makes searches based solely on text difficult. `soupsavvy` tackles this problem with the `PatternSelector`, though it won't be covered in detail in this article.

Furthermore, `soupsavvy` raises `NavigableStringException` if any selector's `find` method returns a `NavigableString` object, preventing unintended behavior.
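
The `NavigableString` behavior can be reproduced with `BeautifulSoup` alone. The snippet below is a small illustration that assumes the text is matched through the `string` argument of `find`; it does not use `soupsavvy`.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup('<p class="price">Price: $10</p>', "html.parser")

# Searching by text alone yields the matched string, not the enclosing element.
text_match = soup.find(string="Price: $10")
print(type(text_match).__name__)

# The element has to be recovered manually through the parent link.
element = text_match.parent if isinstance(text_match, NavigableString) else None
print(element)
```

Code that expects a `Tag` and calls `find` on the result would fail here, which is the kind of unintended behavior the exception is meant to surface early.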

Generally, in most selectors, the `find` method operates by returning the first element from the results of the `find_all` method, but some selectors optimize this process.
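
As a rough sketch of that default behavior, a selector could derive `find` from `find_all` as shown below. This is not the library's implementation: `NameSelector` and `NotFound` are hypothetical stand-ins, and only the parameter names mirror the signatures shown in the help output above.

```python
from typing import List, Optional

from bs4 import BeautifulSoup, Tag


class NotFound(Exception):
    """Hypothetical stand-in for soupsavvy's TagNotFoundException."""


class NameSelector:
    """Hypothetical selector matching elements by tag name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def find_all(
        self, tag: Tag, recursive: bool = True, limit: Optional[int] = None
    ) -> List[Tag]:
        return list(tag.find_all(self.name, recursive=recursive, limit=limit))

    def find(
        self, tag: Tag, strict: bool = False, recursive: bool = True
    ) -> Optional[Tag]:
        # Ask find_all for at most one match and take it if present.
        matches = self.find_all(tag, recursive=recursive, limit=1)
        if matches:
            return matches[0]
        if strict:
            raise NotFound("Tag was not found in markup.")
        return None


soup = BeautifulSoup("<p>Price: $10</p><p>Price: $20</p>", "html.parser")
first = NameSelector("p").find(soup)
missing = NameSelector("a").find(soup)
print(first, missing)
```

Defining `find` once in terms of `find_all` keeps the `strict` handling in a single place, so concrete selectors would only need to implement `find_all`.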

#### Examples

In [39]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """<p class="price">Price: $10</p><p class="price">Price: $20</p>""", "lxml"
)
price_selector = TagSelector("p")
price_selector.find(soup)

<p class="price">Price: $10</p>

In [40]:
from bs4 import BeautifulSoup, Tag

from soupsavvy import TagSelector

soup: Tag = BeautifulSoup(
    """<div><p class="price">Price: $10</p></div><p class="price">Price: $20</p>""",
    "lxml",
).body  # type: ignore

price_selector = TagSelector("p")
price_selector.find(soup, recursive=False)

<p class="price">Price: $20</p>

In [41]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector
from soupsavvy.exceptions import TagNotFoundException

soup = BeautifulSoup("""<a class="price">Price: $10</a>""", "lxml")

price_selector = TagSelector("p")

try:
    price_selector.find(soup, strict=True)
except TagNotFoundException as e:
    print(e)

Tag was not found in markup.


### `find_all` method

#### Docs

In [42]:
from soupsavvy.tags.base import SoupSelector

help(SoupSelector.find_all)

Help on function find_all in module soupsavvy.tags.base:

find_all(self, tag: 'Tag', recursive: 'bool' = True, limit: 'Optional[int]' = None) -> 'list[Tag]'
    Finds all matching elements in provided BeautifulSoup Tag.
    
    Parameters
    ----------
    tag : Tag
        Any BeautifulSoup Tag object.
    recursive : bool, optional
        bs4.Tag.find method parameter that specifies if search should be recursive.
        If set to False, only direct children of the tag will be searched.
        By default True.
    limit : int, optional
        bs4.Tag.find_all method parameter that specifies maximum number of elements
        to return. If set to None, all elements are returned. By default None.
    
    Returns
    -------
    list[Tag]
        A list of Tag objects matching tag specifications.
        If none found, the list is empty.



#### Overview

The `find_all` method, similar to `BeautifulSoup`'s `Tag.find_all` method, returns all elements that match the selector. It consistently returns a list, so there is no need for a `strict` parameter. If no elements match the selector, it simply returns an empty list. The `find_all` method also supports the `limit` and `recursive` parameters, familiar from `BeautifulSoup`, allowing for controlled and flexible searches.

#### Examples

In [43]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """<p class="price">Price: $10</p><p class="price">Price: $20</p>""", "lxml"
)
price_selector = TagSelector("p")
price_selector.find_all(soup)

[<p class="price">Price: $10</p>, <p class="price">Price: $20</p>]

In [44]:
from bs4 import BeautifulSoup, Tag

from soupsavvy import TagSelector

soup: Tag = BeautifulSoup(
    """<div><p class="price">Price: $10</p></div><p class="price">Price: $20</p>""",
    "lxml",
).body  # type: ignore

price_selector = TagSelector("p")
price_selector.find_all(soup, recursive=False)

[<p class="price">Price: $20</p>]

In [45]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup("""<a class="price">Price: $10</a>""", "lxml")

price_selector = TagSelector("p")
price_selector.find_all(soup)

[]

In [46]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector

soup = BeautifulSoup(
    """
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: $30</p>
    """,
    "lxml",
)
price_selector = TagSelector("p")
price_selector.find_all(soup, limit=2)

[<p class="price">Price: $10</p>, <p class="price">Price: $20</p>]

### Equality

#### Docs

In [47]:
from soupsavvy.tags.base import SoupSelector

help(SoupSelector.__eq__)

Help on function __eq__ in module soupsavvy.tags.base:

__eq__(self, other: 'object') -> 'bool'
    Check self and other object for equality.
    
    This method is abstract and must be implemented by all selectors.
    Selectors are considered equal if their find methods return the same result.
    
    Calling find or find_all methods on selectors that are equal
    should return the same results.
    
    Example
    -------
    >>> selector1 = ElementTag("div")
    >>> selector2 = ElementTag("div")
    >>> selector1 == selector2
    True
    >>> selector1.find(tag) == selector2.find(tag)
    True



#### Overview

All `soupsavvy` Selectors are comparable, meaning they can be checked for equality. If two selectors are equal, their search results will always be identical. However, if two selectors are not equal, their search results may still be the same, but this is not guaranteed and depends on the structure of the soup being queried.
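
The distinction between equal selectors and selectors that merely agree on one soup can be sketched with two toy classes. `NameSelector` and `ClassSelector` are hypothetical and rely on dataclass-generated equality; `soupsavvy` may implement `__eq__` differently.

```python
from dataclasses import dataclass
from typing import List

from bs4 import BeautifulSoup, Tag


@dataclass(frozen=True)
class NameSelector:
    """Hypothetical selector: matches elements by tag name."""

    name: str

    def find_all(self, tag: Tag) -> List[Tag]:
        return list(tag.find_all(self.name))


@dataclass(frozen=True)
class ClassSelector:
    """Hypothetical selector: matches elements by CSS class."""

    css_class: str

    def find_all(self, tag: Tag) -> List[Tag]:
        return list(tag.find_all(class_=self.css_class))


soup = BeautifulSoup(
    '<p class="price">$10</p><p class="price">$20</p>', "html.parser"
)
by_name = NameSelector("p")
by_class = ClassSelector("price")

# Equal selectors share a definition, so their results agree on any soup.
same_definition = by_name == NameSelector("p")

# Unequal selectors can still agree on this particular soup.
different_definition = by_name == by_class
same_results = by_name.find_all(soup) == by_class.find_all(soup)
print(same_definition, different_definition, same_results)
```

Adding a `<div class="price">` to the markup would make the two result lists diverge, even though neither selector changed.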

#### Examples

In [48]:
from soupsavvy import TagSelector, AnyTagSelector

print(f"{TagSelector('p') == TagSelector('div') = }")
print(f"{TagSelector('p') == TagSelector('a') = }")
# selectors of different types can be equal as well
print(f"{TagSelector() == AnyTagSelector() = }")

TagSelector('p') == TagSelector('div') = False
TagSelector('p') == TagSelector('a') = False
TagSelector() == AnyTagSelector() = True


### Combining selectors

#### Overview

All selectors in `soupsavvy` can be combined using Higher Order Selectors, such as `AndSelector`, or with syntactic sugar operators such as `&`. This flexibility makes it possible to build complex search criteria from simple, reusable parts. For more detailed information about combining selectors, read the documentation [here](todo).
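
One way such an operator could map onto a Higher Order Selector is through Python's `__and__` hook. The sketch below is an assumption about the general technique, not the library's code: `Selector`, `ByName`, `ByText`, and `Both` are hypothetical classes built on plain `bs4`.

```python
from typing import List

from bs4 import BeautifulSoup, Tag


class Selector:
    """Hypothetical base class; subclasses implement find_all."""

    def find_all(self, tag: Tag) -> List[Tag]:
        raise NotImplementedError

    def __and__(self, other: "Selector") -> "Both":
        # `a & b` is sugar for constructing the combined selector explicitly.
        return Both(self, other)


class ByName(Selector):
    def __init__(self, name: str) -> None:
        self.name = name

    def find_all(self, tag: Tag) -> List[Tag]:
        return list(tag.find_all(self.name))


class ByText(Selector):
    def __init__(self, fragment: str) -> None:
        self.fragment = fragment

    def find_all(self, tag: Tag) -> List[Tag]:
        # find_all(True) yields every element; keep those containing the text.
        return [el for el in tag.find_all(True) if self.fragment in el.get_text()]


class Both(Selector):
    def __init__(self, left: Selector, right: Selector) -> None:
        self.left = left
        self.right = right

    def find_all(self, tag: Tag) -> List[Tag]:
        # Keep left's matches that right also matched, comparing by identity.
        right_ids = {id(el) for el in self.right.find_all(tag)}
        return [el for el in self.left.find_all(tag) if id(el) in right_ids]


soup = BeautifulSoup(
    '<p class="price">Price: $10</p><p class="price">Price: $20</p>',
    "html.parser",
)
via_operator = (ByName("p") & ByText("20")).find_all(soup)
via_class = Both(ByName("p"), ByText("20")).find_all(soup)
print(via_operator)
```

Both spellings build the same object graph, so the operator form is purely a readability choice.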

#### Examples

In [49]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector, PatternSelector, AndSelector

soup = BeautifulSoup(
    """
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
    """,
    "lxml",
)

# combine selector using Higher Order Selector initializer
price_selector = AndSelector(TagSelector("p"), PatternSelector(r"20", re=True))
price_selector.find(soup)

<p class="price">Price: $20</p>

In [50]:
from bs4 import BeautifulSoup

from soupsavvy import TagSelector, PatternSelector, AndSelector

soup = BeautifulSoup(
    """
        <p class="price">Price: $10</p>
        <p class="price">Price: $20</p>
    """,
    "lxml",
)

# combine selectors with the & operator
price_selector = TagSelector("p") & PatternSelector(r"20", re=True)
price_selector.find(soup)

<p class="price">Price: $20</p>

## Summary

`soupsavvy` is crafted to be a powerful and intuitive search engine for `BeautifulSoup`. By leveraging the concept of selectors, it encapsulates search logic in a structured and reusable manner, addressing the limitations of `BeautifulSoup` and offering seamless integration with it.

**Enjoy the library and leave us feedback!**  
**Happy scraping!**