# About

Web scraping libraries each come with their own interfaces and conventions, so keeping DOM traversal and selection consistent across workflows quickly becomes tedious and produces complex, boilerplate-heavy code.

`soupsavvy` solves this with a unified, consistent approach to selection, based on these principles:

- **Decoupling**: Selection logic is abstracted away from DOM node and traversal implementations.
- **Framework-Agnostic**: Operates consistently with any supported library.
- **Flexible & Extensible**: Lightweight, reusable components can be combined to build complex selection workflows.
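
To make the decoupling principle concrete, here is a minimal, standard-library-only sketch. It is not how `soupsavvy` is implemented: the `Node` protocol, the `EtreeNode` and `DictNode` adapters, and the `TagSelector` class are all hypothetical names chosen for illustration. The point is only that selection logic written against a small node interface can run unchanged on two unrelated tree implementations.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol


class Node(Protocol):
    """Hypothetical minimal interface the selector depends on."""

    @property
    def name(self) -> str: ...

    @property
    def text(self) -> str: ...

    def children(self) -> Iterator[Node]: ...


@dataclass
class EtreeNode:
    """Adapter for xml.etree.ElementTree elements."""

    raw: ET.Element

    @property
    def name(self) -> str:
        return self.raw.tag

    @property
    def text(self) -> str:
        return "".join(self.raw.itertext())

    def children(self) -> Iterator[EtreeNode]:
        return (EtreeNode(child) for child in self.raw)


@dataclass
class DictNode:
    """Adapter for a plain-dict tree, standing in for a second library."""

    raw: dict

    @property
    def name(self) -> str:
        return self.raw["tag"]

    @property
    def text(self) -> str:
        own = self.raw.get("text", "")
        return own + "".join(child.text for child in self.children())

    def children(self) -> Iterator[DictNode]:
        return (DictNode(child) for child in self.raw.get("children", []))


@dataclass
class TagSelector:
    """Selection logic written once, against the Node interface only."""

    tag: str

    def find_all(self, node: Node) -> list[Node]:
        # Pre-order traversal of descendants: each child, then its subtree.
        matches = []
        for child in node.children():
            if child.name == self.tag:
                matches.append(child)
            matches.extend(self.find_all(child))
        return matches

    def find(self, node: Node) -> Optional[Node]:
        matches = self.find_all(node)
        return matches[0] if matches else None


etree_root = EtreeNode(
    ET.fromstring("<html><body><h1>Example Domain</h1></body></html>")
)
dict_root = DictNode(
    {
        "tag": "html",
        "children": [
            {"tag": "body", "children": [{"tag": "h1", "text": "Example Domain"}]}
        ],
    }
)

# The same selector instance works on both tree implementations.
selector = TagSelector("h1")
etree_text = selector.find(etree_root).text
dict_text = selector.find(dict_root).text
print(etree_text, "|", dict_text)
```

Only the adapters know about the underlying tree type. Supporting another library in this sketch would mean writing one more adapter, while `TagSelector` stays untouched.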


Unlike traditional libraries that require various methods and parameters for different tasks, `soupsavvy` uses a simple, consistent selector interface:

```python
selector = TypeSelector("div")
selector.find(element)
selector.find_all(element)
```

Selectors can encapsulate advanced logic, such as XPath queries, logical relationships, sequences and more:

```python
selector = XPathSelector("//div/a")
elements = selector.find_all(element, limit=3)
```

With `soupsavvy`, developers can focus on data extraction workflows instead of wrestling with library-specific quirks and inconsistencies. Boost your web-scraping workflows by eliminating complexity and introducing:

- **Portability**
- **Maintainability**
- **Scalability**

### Portability

`soupsavvy` provides a slim, consistent selector interface, allowing DOM elements from any supported library to be wrapped and used interchangeably. 

Switching between libraries like `BeautifulSoup` and `selenium` no longer means rewriting workflows: `soupsavvy` keeps library-specific logic out of selectors, so the same selectors work across frameworks.

#### Example

This simple workflow extracts the header text from `www.example.com`. `selenium` and `BeautifulSoup` each perform this operation differently.

#### Using `BeautifulSoup`

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

p = soup.find("h1")

if p is None:
    raise Exception("Element not found")

print(p.text)
```

```
Example Domain
```


#### Using `selenium`

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")
p = driver.find_element(By.TAG_NAME, "h1")
print(p.text)
```

```
Example Domain
```


Switching between these requires rewriting the logic, as the libraries have different interfaces and conventions.

In `soupsavvy`, selectors are independent of the underlying library, allowing you to use the same workflow across different libraries. Once you know how to use selectors, you can apply them to any supported implementation.

#### Using `soupsavvy`

##### For `selenium`

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

driver = webdriver.Chrome()
driver.get("https://www.example.com")
root = driver.find_element(By.TAG_NAME, "html")
element = to_soupsavvy(root)

selector = TypeSelector("h1") | Text()

text = selector.find(element)
print(text)
```

```
Example Domain
```


##### For `BeautifulSoup`

```python
import requests
from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

element = to_soupsavvy(soup)
selector = TypeSelector("h1") | Text()

text = selector.find(element)
print(text)
```

```
Example Domain
```


### Maintainability

Maintaining complex scraping workflows becomes challenging as projects grow. Adding new selectors, modifying existing ones, or adapting to changes in target websites often requires updating multiple parts of the codebase. 

With traditional libraries, even seemingly simple workflows can lead to verbose code. For example, finding and handling a specific element’s text with `BeautifulSoup`:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    """<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
span = soup.find("span", attrs={"class": "price"})

if not isinstance(span, Tag):
    raise Exception("Element not found")

p = span.find("p", recursive=False)

if p is None:
    raise Exception("Element not found")

print(p.text)
```

```
Price: $10
```


Switching to `lxml` or `selenium` introduces different methods, parameters, and error handling, adding to the complexity.

With `soupsavvy`, selectors provide a unified interface that encapsulates the entire selection logic, however complex the relationships between elements, which eliminates much of the boilerplate.

```python
from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

soup = BeautifulSoup(
    """<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
element = to_soupsavvy(soup)

selector = (TypeSelector("span") > TypeSelector("p")) | Text()
found = selector.find(element)
print(found)
```

```
Price: $10
```


### Scalability

When building complex scraping workflows, you might need to manage various relationships between elements, handle multiple matches, or apply different selection criteria. Traditional libraries force developers to write extensive boilerplate code to manage lists, sets, and operations on multiple elements, leading to tangled and error-prone logic.

#### Using `BeautifulSoup`

Finding all sibling elements with a specific class after an `<h2>` tag using `BeautifulSoup`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)

h2 = soup.find_all("h2")
matches = []

for tag in h2:
    matches.extend(tag.find_next_siblings(attrs={"class": "price"}))

print([match.text for match in matches])
```

```
['Price: $15', 'Price: $10']
```


This approach requires manually handling sibling relationships and merging results, which becomes increasingly complex in larger workflows.

#### Using `soupsavvy`

With `soupsavvy`, selectors encapsulate the logic for element relationships, providing a concise and reusable workflow:

```python
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

soup = BeautifulSoup(
    """
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = (TypeSelector("h2") * ClassSelector("price")) | Text()
print(selector.find_all(element))
```

```
['Price: $15', 'Price: $10']
```

Implementing a workflow like the one below directly in common web scraping libraries quickly becomes overwhelming.
With `soupsavvy`, selectors act as modular building blocks, encapsulating both selection logic and element relationships. These selectors are reusable and can be easily combined to create complex workflows without the usual overhead.

```python
import re

from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.selectors.nth import NthOfSelector

# Define selectors for the workflow
pattern_selector = PatternSelector(re.compile(r"price"))
type_selector = TypeSelector("span")

# Combine selectors using XOR (matches one or the other but not both)
xor_selector = pattern_selector ^ type_selector

# Select every second match of the XOR selector, starting with the first ("2n+1")
nth_selector = NthOfSelector(xor_selector, nth="2n+1")

# Combine selectors to find specific children inside elements with a class
child_selector = ClassSelector("container") > nth_selector
```
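
The composition above relies on operators such as `^` and `>` to build larger selectors from smaller ones. As a rough illustration of that idea, the sketch below shows how operator overloading could support this style using only the standard library. It is not `soupsavvy`'s actual implementation: the `Selector`, `Tag`, `HasClass`, `Xor`, and `Child` classes are hypothetical, and the tree is a plain `xml.etree.ElementTree` document.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


class Selector:
    """Hypothetical base class: subclasses implement `matches`."""

    def matches(self, node: ET.Element) -> bool:
        raise NotImplementedError

    def find_all(self, root: ET.Element) -> list:
        # All descendants of root (root excluded), in document order.
        return [n for n in root.iter() if n is not root and self.matches(n)]

    def __xor__(self, other: "Selector") -> "Selector":
        return Xor(self, other)

    def __gt__(self, other: "Selector") -> "Selector":
        return Child(self, other)


@dataclass
class Tag(Selector):
    name: str

    def matches(self, node: ET.Element) -> bool:
        return node.tag == self.name


@dataclass
class HasClass(Selector):
    value: str

    def matches(self, node: ET.Element) -> bool:
        return self.value in node.get("class", "").split()


@dataclass
class Xor(Selector):
    left: Selector
    right: Selector

    def matches(self, node: ET.Element) -> bool:
        # True when exactly one of the two selectors matches.
        return self.left.matches(node) != self.right.matches(node)


@dataclass
class Child(Selector):
    """Direct children matching `child` under elements matching `parent`.

    Only `find_all` is supported in this sketch, because ElementTree
    nodes carry no parent pointer for a per-node `matches` check.
    """

    parent: Selector
    child: Selector

    def find_all(self, root: ET.Element) -> list:
        results = []
        for parent in (n for n in root.iter() if self.parent.matches(n)):
            results.extend(c for c in parent if self.child.matches(c))
        return results


root = ET.fromstring(
    '<div class="container">'
    '<span class="price">$10</span>'
    "<span>Bargain</span>"
    '<p class="price">$15</p>'
    "<p>Note</p>"
    "</div>"
)

# Children of .container that are either a span or have class "price", not both.
selector = HasClass("container") > (HasClass("price") ^ Tag("span"))
texts = [node.text for node in selector.find_all(root)]
print(texts)  # ['Bargain', '$15']
```

Each operator returns a new selector object, so a composed selector can be stored, reused, and combined further like any primitive one.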

## Conclusion

`soupsavvy` introduces declarative selectors that simplify and unify selection across the supported libraries. Its flexible, consistent approach removes the need to juggle different APIs, allowing you to focus on the task at hand.

Although the examples and tutorials provided are focused on `BeautifulSoup`, the concepts and workflows are applicable to all supported libraries. Dive deeper into the documentation to explore more examples and see how `soupsavvy` can streamline your web scraping projects, no matter the framework you’re using.

**Enjoy `soupsavvy` and leave us feedback!**  
**Happy scraping!**