# Models

`soupsavvy` provides a streamlined approach to object-oriented web scraping through user-defined models. These models allow you to locate and extract specific structures within HTML content. Each model represents an object expected to be found within a defined scope element.

## Operation

Operations are objects designed to encapsulate transformation logic, such as extracting text from an element, converting data types, or applying custom transformations. You can apply operations after a selector and chain them as needed using the pipe `|` operator. For example, to extract and format the price of a book from a specific HTML structure, you might chain operations to extract text (`Text`), remove a currency symbol (using a custom `Operation`), and convert the value to float (`Operation`).

The `Operation` component allows for any user-defined transformation. It takes a callable that accepts a single argument, performs the transformation, and returns the modified value.

### Applying operation

In [None]:
from soupsavvy.operations import Operation

operation = Operation(lambda x: x.strip("$"))
operation.execute("100$")

All `soupsavvy` operations are mixins of a searcher and an operation: each can be used as an operation (through the `execute` method) or as a searcher (through the `find` and `find_all` methods, which call `execute` internally). This means operations can serve as model fields that extract data directly from the scope element.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Operation

text = """
    <p class="title" id="book1">Animal Farm</p>
"""
soup = BeautifulSoup(text, features="html.parser")

operation = Operation(lambda x: x.get("id"))
operation.find(soup.p)  # type: ignore

### Chaining operations

In [None]:
from soupsavvy.operations import Operation

operation = (
    Operation(lambda x: x.strip("$")) | Operation(int) | Operation(lambda x: x * 2)
)
operation.execute("100$")
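
To build intuition for what the pipe does, here is a minimal plain-Python sketch of a chainable wrapper. The `Step` class is hypothetical and is not how `soupsavvy` implements `Operation`; it only assumes that `a | b` means "run `a`, then feed its result to `b`".

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for an operation; not part of soupsavvy."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def execute(self, arg: Any) -> Any:
        # Apply the wrapped callable to the input.
        return self.func(arg)

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs `a` first, then `b`.
        return Step(lambda arg: other.execute(self.execute(arg)))


pipeline = Step(lambda x: x.strip("$")) | Step(int) | Step(lambda x: x * 2)
result = pipeline.execute("100$")
print(result)  # 200
```

Read left to right, the chain strips the currency symbol, converts the remaining string to an integer, and doubles it.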

### Text

`Text` is a built-in operation that extracts the text content of an element, one of the most common tasks in web scraping.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Text

text = """
    <p class="title">Animal Farm</p>
"""
soup = BeautifulSoup(text, features="html.parser")
operation = Text()
operation.execute(soup.p)

`Text` wraps the `get_text` method from BeautifulSoup, providing a familiar interface. The operation accepts the following arguments:

- **`strip` (bool):** Determines whether to remove leading and trailing whitespace and newline characters from the extracted text. Defaults to `False`.
- **`separator` (str):** Specifies a separator to join multiple text nodes within the element. Defaults to an empty string (`""`), meaning no separator is applied.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Text

text = """
    <p class="title">  Animal Farm  \n\n</p>
"""
soup = BeautifulSoup(text, features="lxml")
operation = Text(strip=True)
operation.execute(soup.p)

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Text

text = """
    <div>
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
operation = Text(separator=" -- ", strip=True)
operation.execute(soup.div)
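
The two parameters can be pictured with plain strings. The sketch below assumes the behaviour of BeautifulSoup's `get_text`, where `strip=True` strips every text node individually and drops nodes that become empty before joining. `join_text` is a hypothetical helper, not a `soupsavvy` or BeautifulSoup function, and the node list only approximates the text nodes of the `<div>` above.

```python
from typing import List


def join_text(nodes: List[str], separator: str = "", strip: bool = False) -> str:
    """Hypothetical model of joining the text nodes of an element."""
    if strip:
        # Strip each node on its own and skip nodes that end up empty.
        nodes = [node.strip() for node in nodes if node.strip()]
    return separator.join(nodes)


# Approximate text nodes of the <div>, including whitespace between the tags.
nodes = ["\n        ", "Animal Farm", "\n        ", "George Orwell", "\n    "]

raw = join_text(nodes)
joined = join_text(nodes, separator=" -- ", strip=True)
print(repr(joined))  # 'Animal Farm -- George Orwell'
```

Without `strip=True`, the whitespace-only nodes between the tags would be joined with the separator as well, which is why the two parameters are often used together.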

### Href

`Href` is a built-in operation that extracts the value of the `href` attribute from an element. It does not accept any parameters. If the element has no `href` attribute, it returns `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Href

text = """
    <a href="www.book.com">Animal Farm</a>
"""
soup = BeautifulSoup(text, features="lxml")
operation = Href()
operation.execute(soup.a)

### Parent

`Parent` is an operation that extracts the parent of the current element, which lets you navigate to a higher level in the HTML structure. It does not accept any parameters. Unlike `Href` or `Text`, it is also a soup selector, because it returns a BeautifulSoup `Tag` object.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Parent

text = """
    <div><a href="www.book.com">Animal Farm</a></div>
"""
soup = BeautifulSoup(text, features="lxml")
operation = Parent()
operation.execute(soup.a)

`Parent` operations can be chained together. The result is an `OperationPipeline` object, because operation is higher in the inheritance hierarchy than selector. The instance in the example below returns the parent of the parent of the element.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Parent

text = """
    <div><div><a href="www.book.com">Animal Farm</a></div></div>
"""
soup = BeautifulSoup(text, features="lxml")
operation = Parent() | Parent()
operation.execute(soup.a)

### Combining with selector

A selector can be combined with an operation to extract data from the selected element. Using the `|` operator on a selector and an operation creates a `SelectionPipeline` instance, which first finds an element with the selector and then applies the operation to it.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector
from soupsavvy.operations import Operation, Text

text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
selector = ClassSelector("book") > ClassSelector("price")
operation = Text() | Operation(lambda x: x.strip("$")) | Operation(int)

pipeline = selector | operation
pipeline.find(soup)

## Model

### Definition

To create a user-defined model in `soupsavvy`, ensure that it meets the following requirements:

- The class must inherit from `soupsavvy.models.BaseModel`.
- It must include a `__scope__` class attribute that defines the model's scope.
- The model should have at least one field defined as a class attribute.

**Scope:** The scope defines the HTML element that is expected to contain all the fields of the model. This must be a `soupsavvy` selector. When an element matches the scope selector, the model is considered found, and its fields are extracted.

**Field:** A field represents a specific piece of information expected to be found within the scope element. Fields are defined as class attributes, like `title` and `price` in the example below. Any attribute with a value that is an instance of `TagSearcher` is recognized as a model field. The value can be:

- Selector, e.g.: `ClassSelector("book")`
- Selector-Operation pipeline, e.g.: `ClassSelector("price") | Text() | Operation(int)`
- Another model class that inherits from `BaseModel`, e.g.: `Author`
- Mixin that functions as both a selector and an operation, e.g.: `Text()`, `Href()` or `Operation(lambda x: x.get("id"))`

### Finding model

To locate an instance of a model within an HTML document, call the `find` method on the model class. This method accepts a BeautifulSoup `Tag` object and returns a model instance built from the first element that matches the scope.

In the example below, the `Book` class defines a model expected to be contained within a `div` element with the class `book`. The model includes two fields:

- **`title`**: Extracts the text content from an element with the class `title`.
- **`price`**: Extracts the text content from an element with the class `price` and converts it to an integer.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

If no scope is found in the provided tag, the `find` method returns `None` by default and no model is extracted.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
result = Book.find(soup)
assert result is None

In `strict` mode, the `find` method raises a `ModelNotFoundException` when it fails to find the specified scope.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.exceptions import ModelNotFoundException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")

try:
    Book.find(soup, strict=True)
except ModelNotFoundException as e:
    print(e)


By default, if an error occurs during data extraction, it is propagated, preventing the model from being built. In the example below, if the `price` element is not found within the specified scope, the selector returns `None`. This causes the `Text` operation to fail, as it cannot extract text from `None`.

It's important to note that the `strict` parameter applies only to the scope search, not to individual field selectors. Field selectors are *forgiving* - they proceed to the next operation even if the previous one returns `None`. As a result, any potential edge cases must be explicitly handled within the model definition.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")

try:
    Book.find(soup)
except FieldExtractionException as e:
    print(e)

### Operations as fields

As mentioned earlier, operations can be used as model fields, where they extract and transform data from the scope element itself. In the example below, `Operation` extracts the `id` attribute from the element and `Href` extracts the `href` attribute.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Href


class Book(BaseModel):

    __scope__ = TypeSelector("div")

    id = Operation(lambda x: x.get("id"))
    link = Href()


text = """
    <div id="book1" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

### Wrappers

Wrappers are higher-order components that modify the behavior of operations or selectors. They provide enhanced control over the extraction process, allowing you to handle edge cases more gracefully and build more complex models. By wrapping operations or selectors, you can customize how data is extracted, transformed, or handled when certain conditions are met.

#### SkipNone

If you anticipate that the `price` element might be absent from the scope, you can use the `SkipNone` operation wrapper. This wrapper ensures that subsequent operations, like extracting text or converting to an integer, are only performed if the `price` element is found. If the input is `None`, `SkipNone` simply returns `None` without applying the wrapped operation. As a result, if a field selector returns `None`, the corresponding field in the model instance is automatically set to `None`, since all fields are nullable by default.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, SkipNone
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | SkipNone(Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)
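
The behaviour described for `SkipNone` can be approximated in plain Python. The `skip_none` helper below is hypothetical and only mirrors the idea of passing `None` through untouched; it is not the library's implementation.

```python
from typing import Any, Callable, Optional


def skip_none(func: Callable[[Any], Any]) -> Callable[[Any], Optional[Any]]:
    """Hypothetical helper: run `func` only when the input is not None."""

    def wrapper(arg: Any) -> Optional[Any]:
        if arg is None:
            # Nothing was found upstream, so pass None through untouched.
            return None
        return func(arg)

    return wrapper


to_int = skip_none(int)
found = to_int("100")
missing = to_int(None)
print(found, missing)  # 100 None
```

Calling `int(None)` directly would raise a `TypeError`; the guard avoids that by never invoking the wrapped callable on a missing value.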

#### Suppress

Another way to handle errors is by using the `Suppress` operation wrapper, which catches and suppresses any exceptions that occur during operation execution. When an exception is raised, `Suppress` returns `None`. This is particularly useful when you anticipate that the input might be incompatible with the operation. In the example below, while the `price` element is expected to be present, the text might be empty. By using `Suppress`, we handle the case where converting an empty string to an integer would normally raise an exception, ensuring the operation returns `None` instead.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, Suppress
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Suppress(Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"></p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

This scenario can be handled more precisely using the `Default` field wrapper. When the field selector returns `None`, the default value is used instead. For instance, if an empty `price` element should be interpreted as a price of `0`, the `Default` wrapper can be employed to manage this case. Note that while `Default` handles missing values by substituting the default, it still propagates any exceptions that occur during extraction.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, Default, Suppress
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Default(ClassSelector("price") | Text() | Suppress(Operation(int)), 0)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">hundred</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)
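
As a rough model of the fallback logic, the sketch below substitutes a default only when the wrapped callable returns `None` and lets exceptions through. Both `default` and `parse_price` are hypothetical helpers written for this illustration, not `soupsavvy` APIs.

```python
from typing import Any, Callable, Optional


def default(func: Callable[[Any], Any], value: Any) -> Callable[[Any], Any]:
    """Hypothetical helper: replace a None result with `value`."""

    def wrapper(arg: Any) -> Any:
        result = func(arg)
        # Only None triggers the fallback; exceptions are not caught here.
        return value if result is None else result

    return wrapper


def parse_price(text: str) -> Optional[int]:
    """Return the price as int, or None when the text is not a number."""
    return int(text) if text.isdigit() else None


price = default(parse_price, 0)
fallback = price("hundred")
parsed = price("60")
print(fallback, parsed)  # 0 60

try:
    default(int, 0)("hundred")  # int raises, and the error is propagated
    raised = False
except ValueError:
    raised = True
```

This is why the example above pairs `Default` with `Suppress`: one turns the conversion error into `None`, the other turns `None` into `0`.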

By default, `Suppress` suppresses any exception raised by the wrapped operation that is a subclass of `Exception`. To narrow this down, pass the `category` parameter, which specifies the exceptions to suppress.

It can be either:
- **single exception class**, e.g. `category=ValueError`
- **tuple of exception classes**, e.g. `category=(ValueError, TypeError)`

The value is passed to the `issubclass` function. In the example below, the `ValueError` raised when converting a non-integer string to an integer is suppressed by the operation.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, Suppress
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (
        ClassSelector("price") | Text() | Suppress(Operation(int), category=ValueError)
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">hundred</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

Any exception other than those specified is not suppressed. In the example below, `Suppress` handles only `TypeError`, so the `ValueError` raised by the wrapped operation is propagated, as it is neither `TypeError` nor one of its subclasses.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel, Suppress
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (
        ClassSelector("price") | Text() | Suppress(Operation(int), category=TypeError)
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">hundred</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")

try:
    Book.find(soup)
except FieldExtractionException as e:
    print(e.__cause__)
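
The effect of `category` can be sketched with an ordinary `try`/`except`. The `suppress` function below is a hypothetical stand-in written for this illustration; it assumes that `category` is used the same way an `except` clause uses an exception class or a tuple of classes.

```python
from typing import Any, Callable, Optional, Tuple, Type, Union

ExceptionCategory = Union[Type[BaseException], Tuple[Type[BaseException], ...]]


def suppress(
    func: Callable[[Any], Any], category: ExceptionCategory = Exception
) -> Callable[[Any], Optional[Any]]:
    """Hypothetical helper: return None when `func` raises a matching error."""

    def wrapper(arg: Any) -> Optional[Any]:
        try:
            return func(arg)
        except category:
            # Only exceptions matching `category` are swallowed.
            return None

    return wrapper


# int("hundred") raises ValueError, which matches the category.
suppressed = suppress(int, category=ValueError)("hundred")

# With category=TypeError the ValueError does not match and escapes.
try:
    suppress(int, category=TypeError)("hundred")
    propagated = None
except ValueError as error:
    propagated = error

print(suppressed, type(propagated).__name__)  # None ValueError
```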

#### Required

By default, all fields are nullable, meaning that if a field selector returns `None`, the corresponding field in the model instance is set to `None`. This behavior can be changed with the `Required` field wrapper, which raises a `FieldExtractionException` when a field selector returns `None`. This ensures that the specified field must be present in the model instance. Note that, similarly to `Default`, all errors must be explicitly handled, as the `Required` wrapper propagates any exceptions that occur during extraction.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel, Required, SkipNone
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Required(ClassSelector("price") | SkipNone(Text() | Operation(int)))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")

try:
    Book.find(soup)
except FieldExtractionException as e:
    print(e)
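
In plain Python, the idea behind `Required` amounts to turning a `None` result into an error. In the sketch below, `required` and `MissingFieldError` are hypothetical names standing in for the wrapper and for `FieldExtractionException`, and a dictionary plays the role of the scope element.

```python
from typing import Any, Callable


class MissingFieldError(Exception):
    """Hypothetical stand-in for soupsavvy's FieldExtractionException."""


def required(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical helper: fail loudly when `func` returns None."""

    def wrapper(arg: Any) -> Any:
        result = func(arg)
        if result is None:
            raise MissingFieldError("required field resolved to None")
        return result

    return wrapper


# A dict stands in for the scope; `.get` returns None when a key is absent.
find_price = required(lambda scope: scope.get("price"))

present = find_price({"title": "Animal Farm", "price": 100})

try:
    find_price({"title": "Animal Farm"})
    error = None
except MissingFieldError as exc:
    error = exc

print(present, error)
```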

#### All

By default, only the first element matching a field selector is used for extraction. If a different result is desired, it can be specified with the appropriate selector. For instance, the `soupsavvy.selectors.nth` selectors can be utilized to match the nth element that meets the selector criteria. In the example provided, the `NthLastOfSelector` is used to find the last element with the class `price` within the scope element.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.nth import NthLastOfSelector


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = NthLastOfSelector(ClassSelector("price"), nth="1") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="title">Animal Farm</p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

If we expect multiple elements to be found within the scope, the `All` field wrapper can be used. This wrapper extracts all elements matching the field selector. In the example below, all available prices for a given book are extracted.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

### Post-Initialization

When an extracted field requires further transformation, the `__post_init__` method can be defined in the model class to handle these cases during the post-initialization step. This method functions similarly to the `__post_init__` in Python's `dataclass`. For instance, in the example below, the `price` is set to the minimum of all prices for the specific book.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))

    def __post_init__(self) -> None:
        self.price = min(self.price)  # type: ignore


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)
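
For comparison, this is how the same hook behaves in a standard-library `dataclass`. `BookRecord` is a plain dataclass written for this illustration and is unrelated to `BaseModel`; only the timing of `__post_init__` is the point.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BookRecord:
    """Plain dataclass used only for comparison; not a soupsavvy model."""

    title: str
    price: Union[int, List[int]]

    def __post_init__(self) -> None:
        # Runs right after __init__, once all fields have been assigned.
        if isinstance(self.price, list):
            self.price = min(self.price)


book = BookRecord(title="Animal Farm", price=[100, 80, 60])
print(book)  # BookRecord(title='Animal Farm', price=60)
```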

The `__post_init__` method can also be utilized to replace certain operations or perform more complex transformations that depend on other extracted fields. Additionally, new attributes can be set within this method, allowing for enhanced customization and flexibility in the model's behavior.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text()
    author = (LastOfType() & TypeSelector("p")) | Text()

    def __post_init__(self) -> None:
        self.price = int(str(self.price).strip("$"))
        self.title = str(self.title).upper()
        self.affordable = (self.price < 100) or (self.author == "George Orwell")


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
result = Book.find(soup)
print(result)
print(f"Is affordable: {result.affordable}")

As previously mentioned, any model class can serve as a field selector. In the example below, the `Author` model is utilized to extract author information from the `author` element within the scope. The `Author` model class is assigned as an attribute of the `Book` model.

In [None]:
import re
from datetime import datetime

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import FirstChild


class Author(BaseModel):
    __scope__ = ClassSelector("author")

    birth = (
        PatternSelector(re.compile(r"\d{4}-\d{2}-\d{2}"))
        | Text()
        | Operation(lambda x: datetime.strptime(x, "%Y-%m-%d"))
    )
    name = FirstChild() | Text()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    author = Author
    title = ClassSelector("title") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <div class="author">
            <p>George Orwell</p>
            <p>Great author</p>
            <p>1903-06-25</p>
        </div>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

### Inheritance

In the case of inheritance, fields are inherited by default, which is the expected behavior. New fields can be added to extend the parent model. For example, the `eBook` model inherits from the `Book` model and adds two additional fields: `link` and `duration`. It also overrides the `__scope__`, although this is not mandatory since special fields are inherited from the parent model.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


class eBook(Book):
    __scope__ = TypeSelector("div") & ClassSelector("ebook")

    link = Href()
    duration = PatternSelector(re.compile(r"\d{1,2}:\d{2}")) | Text()


text = """
    <div class="ebook" href="www.ebook.com">
        <p class="title">Animal Farm</p>
        <p class="price">50</p>
        <p>George Orwell</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
eBook.find(soup)

This default inheritance behavior can be disabled by setting the `__inherit_fields__` class attribute to `False` in the model class. In this case, only the fields defined in the subclass are used. For instance, in the example below, the `eBook` model does not inherit fields from the `Book` model, so only the `link` and `duration` fields are extracted.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


class eBook(Book):
    __inherit_fields__ = False

    link = Href()
    duration = PatternSelector(re.compile(r"\d{1,2}:\d{2}")) | Text()


text = """
    <div class="ebook" href="www.ebook.com">
        <p class="title">Animal Farm</p>
        <p class="price">50</p>
        <p>George Orwell</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
eBook.find(soup)
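
One way such a flag could work is by deciding whether field discovery walks the whole class hierarchy or only the class itself. The sketch below is purely illustrative: `Field` and `collect_fields` are hypothetical, and nothing here is taken from the `soupsavvy` source.

```python
class Field:
    """Hypothetical marker for a model field; not part of soupsavvy."""


def collect_fields(cls: type) -> dict:
    """Gather Field attributes, honouring an `__inherit_fields__` flag."""
    inherit = getattr(cls, "__inherit_fields__", True)
    # Walk base classes first so that subclasses override their parents.
    classes = reversed(cls.__mro__) if inherit else [cls]
    fields = {}
    for klass in classes:
        for name, value in vars(klass).items():
            if isinstance(value, Field):
                fields[name] = value
    return fields


class Book:
    title = Field()
    price = Field()


class EBook(Book):
    link = Field()


class StandaloneEBook(Book):
    __inherit_fields__ = False
    link = Field()


inherited = sorted(collect_fields(EBook))
standalone = sorted(collect_fields(StandaloneEBook))
print(inherited)   # ['link', 'price', 'title']
print(standalone)  # ['link']
```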

### Scope

It is generally recommended to use the most specific scope selector possible to avoid matching elements that are unrelated to the model. This ensures that the scope selector targets only those elements expected to contain the model instance. You can use `HasSelector` to match elements that contain the elements used for extracting fields, further refining your selection criteria.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, HasSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

PRICE_SELECTOR = ClassSelector("price")
TITLE_SELECTOR = ClassSelector("title")


class Book(BaseModel):

    __scope__ = (
        ClassSelector("book")
        & HasSelector(PRICE_SELECTOR)
        & HasSelector(TITLE_SELECTOR)
    )

    title = TITLE_SELECTOR | Text()
    price = PRICE_SELECTOR | Text() | Operation(int)


text = """
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
        <p>4:30</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
        <p>3:30</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

If a sub-model shares the same scope as its parent model, you can use `SelfSelector` to take the provided tag as the scope without searching for it. For example, the `Author` model below takes its scope directly from the parent model. Alternatively, the scope of a sub-model can lie outside the parent model's scope: the parent element can be used as the scope with the `Parent` selector, or any specific ancestor that matches a selector with `Anchor << selector`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, SelfSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Parent, Text


class Author(BaseModel):
    __scope__ = SelfSelector()

    name = ClassSelector("author") | Text()


class eBook(BaseModel):
    __scope__ = Parent()

    link = ClassSelector("ebook") | Href()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    author = Author
    ebook = eBook
    title = ClassSelector("title") | Text()


text = """
    <div>
        <a class="ebook" href="www.ebook.com"></a>
        <div class="book" href="www.book.com">
            <p class="title">Animal Farm</p>
            <p class="author">George Orwell</p>
            <p>Great author</p>
            <p>1903-06-25</p>
        </div>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)

### Finding all

The `find_all` method operates similarly to the `find` method, first identifying the scope element before extracting all fields from it. It returns a list of model instances for all elements that match the scope selector.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">100</p>
        <p>Aldous Huxley</p>
    </div>
    <div class="book">
        <p class="title">The Giver</p>
        <p class="price">80</p>
        <p>Lois Lowry</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find_all(soup)

Any errors encountered during extraction are propagated; if the extraction of any model fails, `find_all` method raises `FieldExtractionException`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType
from soupsavvy.exceptions import FieldExtractionException


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">100</p>
        <p>Aldous Huxley</p>
    </div>
    <div class="book">
        <p class="title">The Giver</p>
        <p>Lois Lowry</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")

try:
    Book.find_all(soup)
except FieldExtractionException as e:
    print(e)
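
The all-or-nothing behavior described above can be pictured with a plain-Python sketch. It does not use `soupsavvy`: `ExtractionError`, `extract`, `extract_all` and `extract_tolerant` are hypothetical stand-ins. They only contrast propagating the first failure with collecting failures per item, a pattern you would need to build yourself if partial results matter.

```python
class ExtractionError(Exception):
    """Hypothetical stand-in for an extraction failure (not a soupsavvy class)."""


def extract(record: dict) -> tuple:
    # Hypothetical per-item extraction: a missing "price" key fails the item.
    if "price" not in record:
        raise ExtractionError(f"missing price in {record['title']!r}")
    return record["title"], int(record["price"])


def extract_all(records: list) -> list:
    # All-or-nothing: the first failure propagates, no partial list is returned.
    return [extract(record) for record in records]


def extract_tolerant(records: list) -> tuple:
    # Alternative: keep successes and collect failures separately.
    results, errors = [], []
    for record in records:
        try:
            results.append(extract(record))
        except ExtractionError as e:
            errors.append(e)
    return results, errors


records = [
    {"title": "Brave New World", "price": "100"},
    {"title": "The Giver"},
]

try:
    extract_all(records)
    failed = False
except ExtractionError:
    failed = True

results, errors = extract_tolerant(records)
```

In this sketch the second record fails, so `extract_all` raises even though the first record is valid, while `extract_tolerant` keeps the valid record and reports the failure alongside it.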

If no scope elements are found, the `find_all` method simply returns an empty list.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
    <p class="title">Animal Farm</p>
    <p class="price">100</p>
    <p>George Orwell</p>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find_all(soup)

### Recursive option

The `recursive` option applies only to the scope search. When set to `True`, the model's scope is searched among all descendants of the given tag; when set to `False`, only its direct children are considered. Once the scope is found, field selectors always search recursively, regardless of the `recursive` parameter.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="book" href="www.book.com">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <span class="author">
                <p>George Orwell</p>
            </span>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="html.parser")
Book.find(soup, recursive=False)

To modify this behavior and restrict field element searches to only the children of the scope element, a relative selector can be used, best created with `Anchor`. Using `Anchor > selector` limits the search to direct child elements. For example, only `price` elements that are immediate children of the `book` element will be matched. To find out more about `Anchor`, see [docs](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html#anchor).

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, TypeSelector
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All((Anchor > ClassSelector("price")) | Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p class="price">50</p>
            <span class="author">
                <p>George Orwell</p>
            </span>
        </span>
        <p class="price">200</p>
    </div>
"""
soup = BeautifulSoup(text, features="html.parser")
Book.find(soup)

With the non-recursive option, the scope is searched only among the direct children of the element passed to the `find` method. In the example below, the scope element is found only when the `span` is passed, because the `div` with class `book` is its direct child. If `body` is passed instead, none of its direct children is a `div` with class `book`, so the scope is not located and `find` returns `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <span>
        <div class="book" href="www.book.com">
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p>George Orwell</p>
        </div>
    </span>
"""
soup = BeautifulSoup(text, features="lxml")
result = Book.find(soup.body, recursive=False)  # type: ignore
assert result is None

Book.find(soup.span, recursive=False)  # type: ignore

### Typing

For those who prefer clean and consistent typing, `typing.cast` can be used to give type checkers hints about the types of instance fields. Without it, a type checker infers the type of the field definition itself (the selector or operation) rather than the type of the extracted value. In the example below, `typing.cast` tells the type checker that the `title` attribute is of type `str`, while the `price` attribute can be of type `int` or `None`.

In [None]:
from typing import cast, Optional

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, SkipNone
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = cast(str, ClassSelector("title") | Text())
    price = cast(
        Optional[int], ClassSelector("price") | SkipNone(Text() | Operation(int))
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
Book.find(soup)
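
Keep in mind that `typing.cast` only informs the type checker: at runtime it returns its argument unchanged and performs no conversion or validation. The sketch below shows this with a hypothetical `FakeSelector` class standing in for a field definition; it is not part of `soupsavvy`.

```python
from typing import Optional, cast


class FakeSelector:
    """Hypothetical stand-in for a field selector (not part of soupsavvy)."""


selector = FakeSelector()

# cast() returns the very same object; only the static type changes.
title = cast(str, selector)
price = cast(Optional[int], selector)

same_object = title is selector and price is selector
still_selector = isinstance(title, FakeSelector)
```

Because nothing is checked at runtime, the declared type should match what the field's operations actually return.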

### Migration

You can migrate an instance of a model to another model using the `migrate` method. This method takes another class as an argument and initializes it with the field values from the current model instance. It's particularly useful for migrating data to a `pydantic` model for validation or to an `SQLAlchemy` model for database operations. The fields in both models must match for migration to work seamlessly. Additionally, you can pass extra parameters as keyword arguments, which are forwarded to the target model's constructor.
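
Conceptually, migration amounts to calling the target class with the instance's field values plus any extra keyword arguments. The plain-Python sketch below illustrates that idea only. `TargetBook`, `migrate_fields` and the `currency` parameter are hypothetical, and the sketch does not claim to mirror how `soupsavvy` implements `migrate`.

```python
from dataclasses import dataclass


@dataclass
class TargetBook:
    """Hypothetical target class; any class accepting keyword arguments works."""

    title: str
    price: int
    currency: str = "USD"


def migrate_fields(fields: dict, target: type, **kwargs):
    # Field values and extra keyword arguments are passed to the constructor.
    return target(**fields, **kwargs)


# Field values as they might look after extraction.
fields = {"title": "Animal Farm", "price": 100}

book = migrate_fields(fields, TargetBook, currency="GBP")

# A field the target does not define makes the constructor fail.
try:
    migrate_fields({**fields, "author": "George Orwell"}, TargetBook)
    mismatch_error = None
except TypeError as e:
    mismatch_error = e
```

The failing call at the end shows why the fields of both models should match: a value with no counterpart in the target is rejected by the target's constructor.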

#### Pydantic

In [None]:
from bs4 import BeautifulSoup
import pydantic
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class PydanticBook(pydantic.BaseModel):
    title: str
    price: int


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
book = Book.find(soup)

book.migrate(PydanticBook)

Migration to a `pydantic` model raises `ValidationError` if validation fails.

In [None]:
from bs4 import BeautifulSoup
import pydantic
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class PydanticBook(pydantic.BaseModel):
    title: str
    price: str


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
book = Book.find(soup)

try:
    book.migrate(PydanticBook)
except pydantic.ValidationError as e:
    print(e)

#### SQLAlchemy

In [None]:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import DeclarativeBase
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Base(DeclarativeBase): ...


class SABook(Base):
    """Mock model for testing migration to SQLAlchemy model."""

    __tablename__ = "book"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=True)
    price = Column(Integer, nullable=True)

    def __repr__(self):
        return f"<SABook(title={self.title}, price={self.price})>"


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
book = Book.find(soup)

book.migrate(SABook)

## Conclusion

`soupsavvy` offers a framework for object-oriented web scraping through user-defined models.  
This allows users to define the structure of data they wish to extract from HTML documents.

**Enjoy `soupsavvy` and leave us feedback!**  
**Happy scraping!**