# Models

A `soupsavvy` **Model** is a user-defined **scraping schema** that uses selectors and operations to extract structured information from a web page. A model represents an entity of interest in scraping, such as a product, article, or job.

## Operations

Operations encapsulate transformation logic, such as extracting text, converting data types, or applying custom transformations.

### Applying operation

In [None]:
from soupsavvy.operations import Operation

operation = Operation(lambda x: x.strip("$"))
operation.execute("100$")

`Operation` accepts positional and keyword arguments, which are passed to the wrapped function on execution.

In [None]:
from datetime import datetime

from soupsavvy.operations import Operation

operation = Operation(datetime.strptime, "%d-%m-%Y")
operation.execute("01-02-2020")
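
To make the argument passing concrete: for the example above to work, the input value has to come first and the extra positional argument after it, so the operation should amount to the direct call below. This is a plain-Python equivalent under that assumption, not a description of `soupsavvy` internals.

```python
from datetime import datetime

# Assumed equivalent of Operation(datetime.strptime, "%d-%m-%Y").execute("01-02-2020"):
# the input is passed first, the extra positional argument second.
result = datetime.strptime("01-02-2020", "%d-%m-%Y")
print(result)  # 2020-02-01 00:00:00
```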

### Chaining operations

Operations can be chained with the `|` operator to apply several of them in sequence.

In [None]:
from soupsavvy.operations import Operation

operation = (
    Operation(lambda x: x.strip("$")) | Operation(int) | Operation(lambda x: x * 2)
)
operation.execute("100$")
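
For readers curious how `|` chaining can work in Python at all, here is a minimal sketch built on `__or__`. The `Step` class is a hypothetical stand-in written for this illustration; it is not part of `soupsavvy` and says nothing about how the library implements chaining.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for an operation; not part of soupsavvy."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def execute(self, arg: Any) -> Any:
        return self.func(arg)

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs `a` first, then feeds its result to `b`.
        return Step(lambda arg: other.execute(self.execute(arg)))


chained = Step(lambda x: x.strip("$")) | Step(int) | Step(lambda x: x * 2)
print(chained.execute("100$"))  # 200
```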

### Text

`Text` is a built-in operation that extracts the text content of an element. It is one of the most common operations in web scraping.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Text
from soupsavvy import to_soupsavvy

text = """
    <p class="title">Animal Farm</p>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.p)
operation = Text()
operation.execute(element)

### Href

`Href` is a built-in operation that extracts the value of the `href` attribute from an element. If the element has no `href` attribute, it returns `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Href
from soupsavvy import to_soupsavvy

text = """
    <a href="www.book.com">Animal Farm</a>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.a)
operation = Href()
operation.execute(element)

### Parent

`Parent` is an operation that extracts the parent element of the current element. It can be used as a selector as well.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy.operations import Parent
from soupsavvy import to_soupsavvy

text = """
    <div><a href="www.book.com">Animal Farm</a></div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.a)
operation = Parent()
operation.execute(element)

### Combining with selector

Selectors in `soupsavvy` can be combined with operations by using the `|` operator. The resulting pipeline first locates the element and then applies the operation to it.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy
from soupsavvy.operations import Operation, Text

text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

selector = ClassSelector("book") > ClassSelector("price")
operation = Text() | Operation(lambda x: x.strip("$")) | Operation(int)

pipeline = selector | operation
pipeline.find(element)

## Models

### Definition

A user-defined model in `soupsavvy` must:

- Inherit from `soupsavvy.models.BaseModel`.
- Define a `__scope__` class attribute specifying the HTML element containing the model's fields.
- Include at least one field as a class attribute.

**Scope:** This selector defines the HTML element that encapsulates all fields of the model.

**Fields:** Class attributes that extract data from within the scope element. These can be:
- Selectors, e.g., `ClassSelector("book")`
- Selector-operation pipelines, e.g., `ClassSelector("price") | Text() | Operation(int)`
- Another model class inheriting from `BaseModel`
- Operations like `Text()`, `Href()`, or a custom `Operation()`


### Finding model

The `find` method of a model class extracts the model from an element (in these examples, a `bs4` object converted with `to_soupsavvy`). It returns a model instance built from the first element that matches the scope.

The `Book` class below defines a model expected to be contained within a `div.book` element and includes two fields:

- **`title`**: Extracts text from `.title`.
- **`price`**: Extracts text from `.price` and converts it to an integer.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
Book.find(element)

By default, if the `find` method doesn't locate the scope in the provided element, it returns `None` and no model is extracted.

In `strict` mode, however, a scope that cannot be found raises a `ModelNotFoundException` instead.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import ModelNotFoundException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
result = Book.find(element)
assert result is None

try:
    Book.find(element, strict=True)
except ModelNotFoundException as e:
    print(e)

By default, errors during data extraction are propagated, stopping the model from being built. For instance, if the `price` element isn't found, the `Text` operation fails since it can't extract text from `None`.

The `strict` parameter only affects scope searches, not individual field selectors. A field selector that finds nothing does not raise; it passes `None` on to the next step of the pipeline, which may then fail. Edge cases must therefore be handled explicitly within the model definition.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

try:
    Book.find(element)
except FieldExtractionException as e:
    print(e)

### Operations as fields

As noted earlier, operations can be used as fields in the model to extract and transform data from the scope element. For example, `Operation` can extract the `id` attribute, and `Href` can be used to retrieve the `href` attribute from an element.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Href


class Book(BaseModel):

    __scope__ = TypeSelector("div")

    id = Operation(lambda x: x.get()["id"])
    link = Href()


text = """
    <div id="book1" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

### Wrappers

Wrappers are composite components that modify the behavior of operations or selectors.  
They handle edge cases, like missing data, more gracefully in the model.

#### SkipNone

The `SkipNone` wrapper prevents operations like text extraction or type conversion from running if the input is `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | SkipNone(Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

#### Suppress

The `Suppress` operation wrapper catches exceptions raised during execution and returns `None` instead. This is useful for handling potential incompatibilities, such as converting an empty string to an integer.
To limit which exceptions are suppressed, pass the `category` parameter as an exception class or a tuple of exception classes; only exceptions of that category are then suppressed.


In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Suppress, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (
        ClassSelector("price") | Text() | Suppress(Operation(int), category=ValueError)
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"></p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
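
The effect of `category` can be sketched in plain Python. The `suppress` helper below is hypothetical and only mimics the behavior described above; its default of `Exception` is an assumption, not the library's documented default.

```python
from typing import Any, Callable, Tuple, Type, Union

ExcCategory = Union[Type[BaseException], Tuple[Type[BaseException], ...]]


def suppress(func: Callable[[Any], Any], category: ExcCategory = Exception) -> Callable[[Any], Any]:
    """Hypothetical helper: wrap `func` so exceptions of `category` yield None."""

    def wrapper(arg: Any) -> Any:
        try:
            return func(arg)
        except category:
            return None

    return wrapper


to_int = suppress(int, category=ValueError)
print(to_int("100"))  # 100
print(to_int(""))     # None, because int("") raises ValueError

try:
    to_int(None)  # int(None) raises TypeError, which is outside the category
except TypeError as e:
    print(f"not suppressed: {e}")
```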

#### Default

The `Default` wrapper provides a default value when a field selector returns `None`, allowing for specific interpretations, like treating an empty `price` as `0`. However, it does not suppress exceptions raised during extraction, which is why the example below combines it with `Suppress`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, Default
from soupsavvy.operations import Operation, Suppress, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Default(ClassSelector("price") | Text() | Suppress(Operation(int)), 0)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">hundred</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
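
Traced by hand in plain Python, the `price` field above goes through two steps. This is only an illustration of the described behavior, with the wrappers replaced by ordinary control flow.

```python
from typing import Optional

raw = "hundred"  # text of the `price` element in the example above

# Step 1: the suppressed conversion yields None instead of raising ValueError.
converted: Optional[int]
try:
    converted = int(raw)
except ValueError:
    converted = None

# Step 2: the default replaces the missing value.
price = 0 if converted is None else converted
print(price)  # 0
```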

#### IfElse

The `IfElse` operation enables conditional data transformations, taking three arguments:

- **`condition` (callable):** A function that determines which operation to execute.
- **`if_` (operation):** The operation performed if the condition is met.
- **`else_` (operation):** The operation executed if the condition is not met.

This is useful for applying different transformations based on the HTML structure or values.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, IfElse, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = (
        ClassSelector("title")
        | Text()
        | IfElse(lambda x: x == "", Operation(lambda x: None), Operation(str.upper))
    )
    price = ClassSelector("price") | IfElse(lambda x: x.name == "a", Href(), Text())


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">10</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

#### Break and Continue

Additionally, `Break` and `Continue` operations enhance `IfElse` by providing control flow capabilities:

- **`Break`:** Terminates the operation pipeline.
- **`Continue`:** Skips the current operation and moves to the next one.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Break, Continue, IfElse, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = (
        ClassSelector("title")
        | Text()
        | IfElse(
            lambda x: x == "",
            Break(),
            Operation(str.upper),
        )
        | Operation(lambda x: x + "!")
    )
    price = (
        ClassSelector("divider")
        | Text()
        | Operation(int)
        | IfElse(lambda x: x == 0, Continue(), Operation(lambda x: 100 / x))
        | Operation(lambda x: f"{x}$")
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title"></p>
        <p class="divider">0</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
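
The control flow can be sketched with plain functions. Everything below (`BreakPipeline`, `break_`, `continue_`, `if_else`, `run`) is hypothetical and written for this illustration; in particular, it assumes that breaking returns the value the pipeline held at that point.

```python
from typing import Any, Callable, List


class BreakPipeline(Exception):
    """Hypothetical signal that carries the value at the point of the break."""

    def __init__(self, value: Any) -> None:
        super().__init__(value)
        self.value = value


def break_(value: Any) -> Any:
    raise BreakPipeline(value)


def continue_(value: Any) -> Any:
    # Nothing to do for this step: hand the value to the next one unchanged.
    return value


def if_else(condition, if_, else_):
    return lambda value: if_(value) if condition(value) else else_(value)


def run(steps: List[Callable[[Any], Any]], value: Any) -> Any:
    try:
        for step in steps:
            value = step(value)
    except BreakPipeline as signal:
        return signal.value
    return value


title_steps = [if_else(lambda x: x == "", break_, str.upper), lambda x: x + "!"]
price_steps = [int, if_else(lambda x: x == 0, continue_, lambda x: 100 / x), lambda x: f"{x}$"]

print(repr(run(title_steps, "")))  # '' (stopped before "!" was appended)
print(run(title_steps, "animal"))  # ANIMAL!
print(run(price_steps, "0"))       # 0$
print(run(price_steps, "4"))       # 25.0$
```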

#### Required

By default, all fields in a model are nullable: if the field selector returns `None`, the corresponding field is set to `None`. You can change this behavior with the `Required` field wrapper, which raises an error when the field value is `None`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel, Required
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Required(ClassSelector("price") | SkipNone(Text() | Operation(int)))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

try:
    Book.find(element)
except FieldExtractionException as e:
    print(e)

#### All

If we expect multiple elements to be found within the scope, the `All` field wrapper can be used.  
This wrapper extracts all elements matching the field selector.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

### Post-Initialization

To handle further transformations of extracted fields, you can define the `__post_init__` method in your model class, similar to Python's `dataclass`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))

    def __post_init__(self) -> None:
        self.price = min(self.price)  # type: ignore


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
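
For comparison, this is the standard-library `dataclass` hook that the method is modeled on. `BookRecord` is a hypothetical plain dataclass, not a `soupsavvy` model; it only shows the same post-processing idea with the prices from the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BookRecord:  # hypothetical plain dataclass, not a soupsavvy model
    title: str
    price: Union[int, List[int]]

    def __post_init__(self) -> None:
        # Runs right after __init__: collapse the list of prices to the lowest one.
        if isinstance(self.price, list):
            self.price = min(self.price)


book = BookRecord(title="Animal Farm", price=[100, 80, 60])
print(book)  # BookRecord(title='Animal Farm', price=60)
```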

You can also define a separate post-processing method for each field. Such a method can have any name, but it must be decorated with `@soupsavvy.models.post`, passing the name of the field it processes.

This transformation is applied before `__post_init__` and assignment of instance attributes.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text()
    author = (LastOfType() & TypeSelector("p")) | Text()

    @post("title")
    def process_title(self, value: str) -> str:
        return value.upper()

    @post("price")
    def process_price(self, value: str) -> int:
        return int(value.strip("$"))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

### Inheritance

By default, fields in a model are inherited, allowing subclasses to extend parent models easily. For example, the `eBook` model inherits from the `Book` model and adds the `link` and `duration` fields. It can also override `__scope__`, although this isn't required, since special attributes such as `__scope__` are inherited as well.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


class eBook(Book):
    __scope__ = TypeSelector("div") & ClassSelector("ebook")

    link = Href()
    duration = PatternSelector(re.compile(r"\d{1,2}:\d{2}")) | Text()


text = """
    <div class="ebook" href="www.ebook.com">
        <p class="title">Animal Farm</p>
        <p class="price">50</p>
        <p>George Orwell</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

eBook.find(element)

To disable this default inheritance behavior, set `__inherit_fields__` to `False` in the model class.  
In this case, only the fields defined in the subclass will be extracted.

In [None]:
import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


class eBook(Book):
    __inherit_fields__ = False

    link = Href()
    duration = PatternSelector(re.compile(r"\d{1,2}:\d{2}")) | Text()


text = """
    <div class="ebook" href="www.ebook.com">
        <p class="title">Animal Farm</p>
        <p class="price">50</p>
        <p>George Orwell</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

eBook.find(element)

### Scope

It's advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model.  
You can use `HasSelector` to extend your selection criteria by matching elements that contain the fields needed for extraction.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, HasSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

PRICE_SELECTOR = ClassSelector("price")
TITLE_SELECTOR = ClassSelector("title")


class Book(BaseModel):

    __scope__ = (
        ClassSelector("book")
        & HasSelector(PRICE_SELECTOR)
        & HasSelector(TITLE_SELECTOR)
    )

    title = TITLE_SELECTOR | Text()
    price = PRICE_SELECTOR | Text() | Operation(int)


text = """
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
        <p>4:30</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
        <p>3:30</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

When the scope element is the same as the element being searched, `SelfSelector` can be used as the scope selector.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class Book(BaseModel):
    __scope__ = SelfSelector()

    title = ClassSelector("title") | Text()
    author = ClassSelector("author") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="author">George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

Book.find(div)

### Finding all

The `find_all` method returns a list of model instances for all elements that match the scope selector.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">100</p>
        <p>Aldous Huxley</p>
    </div>
    <div class="book">
        <p class="title">The Giver</p>
        <p class="price">80</p>
        <p>Lois Lowry</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find_all(element)

### Recursive option

The `recursive` option applies only to scope searches. When set to `True`, the model's scope is searched among all descendants of the given element; when set to `False`, only direct children are considered. Field selectors, however, always search recursively, regardless of this setting.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <span>
        <div class="book">
            <p class="title">Not a child</p>
            <p class="price">200</p>
            <p>Author</p>
        </div>
    </span>
    <div class="book">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p>George Orwell</p>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="html.parser")
element = to_soupsavvy(soup)

Book.find(element, recursive=False)

To restrict field searches to only the children of the scope element, you can use a relative selector.  
To find out more, see [docs](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html#relative-selectors).

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (Anchor > ClassSelector("price")) | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p class="price">50</p>
            <span class="author">
                <p>George Orwell</p>
            </span>
        </span>
        <p class="price">200</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)

### Model Composition

Any model class can be used as a field of another model, as the `Author` class is in this example.

In [None]:
import re
from datetime import datetime

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import FirstChild


class Author(BaseModel):
    __scope__ = ClassSelector("author")

    birth = (
        PatternSelector(re.compile(r"\d{4}-\d{2}-\d{2}"))
        | Text()
        | Operation(lambda x: datetime.strptime(x, "%Y-%m-%d"))
    )
    name = FirstChild() | Text()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    author = Author
    title = ClassSelector("title") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <div class="author">
            <p>George Orwell</p>
            <p>Great author</p>
            <p>1903-06-25</p>
        </div>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
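
The `birth` field combines a pattern match with date parsing. The standard-library sketch below replays those two steps on the texts from the example; it assumes the pattern is tested with `re.search` against each element's text, which is a guess about the selector's matching rule.

```python
import re
from datetime import datetime

texts = ["George Orwell", "Great author", "1903-06-25"]  # <p> texts inside div.author
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Assumption: the pattern is checked with re.search against each element's text.
matched = next(text for text in texts if pattern.search(text))
birth = datetime.strptime(matched, "%Y-%m-%d")
print(birth)  # 1903-06-25 00:00:00
```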

### Frozen Model

To make model instances immutable, set the `__frozen__` attribute to `True`. Frozen instances are also hashable. Modifying any field of a frozen instance raises a `FrozenModelException`. Regardless of immutability, attempting to set an attribute not defined as a field raises an `AttributeError`.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text
from soupsavvy.exceptions import FrozenModelException


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")
    __frozen__ = True

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

result = Book.find(element)

try:
    result.price = 200  # type: ignore
except FrozenModelException as e:
    print(e)

### Field

By default, every field takes part in the instance's:

- **String representation**
- **Equality comparison**
- **Hash calculation**
- **Migration**

To exclude a field from these operations, wrap it in the `Field` class, which takes the following boolean parameters, each defaulting to `True`:

- **`repr`**: include the field in the string representation
- **`compare`**: include the field in equality comparison
- **`migrate`**: include the field in migration

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, Field
from soupsavvy.operations import Text

PRICE_SELECTOR = ClassSelector("price") | Text()


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")
    __frozen__ = True

    title = ClassSelector("title") | Text()
    price = Field(PRICE_SELECTOR, compare=False, repr=False, migrate=False)


text = """
    <div class="book">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
    </div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p class="price">50$</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

result = Book.find_all(element)
print(f"{result[0]} == {result[1]}: {result[0] == result[1]}")
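
The `repr` and `compare` parameters share their names with those of `dataclasses.field` in the standard library. The sketch below shows the standard-library behavior for the same two books; whether `Field` matches it in every detail is an assumption, and `BookRecord` is a hypothetical dataclass used only for comparison.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BookRecord:  # hypothetical dataclass used only for comparison
    title: str
    price: str = field(compare=False, repr=False)


first = BookRecord("Animal Farm", "100$")
second = BookRecord("Animal Farm", "50$")

print(first)                        # BookRecord(title='Animal Farm')
print(first == second)              # True, price is ignored in comparison
print(hash(first) == hash(second))  # True, hash follows the compared fields
```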

### Migration

You can migrate a model instance to another model using the `migrate` method, which takes a target class as an argument and initializes it with the current model's field values.

#### Pydantic

In [None]:
import pydantic
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class PydanticBook(pydantic.BaseModel):
    title: str
    price: int


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(PydanticBook)

#### SQLAlchemy

In [None]:
from bs4 import BeautifulSoup
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import DeclarativeBase

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Base(DeclarativeBase): ...


class SABook(Base):
    __tablename__ = "book"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=True)
    price = Column(Integer, nullable=True)

    def __repr__(self):
        return f"<SABook(title={self.title}, price={self.price})>"


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(SABook)

#### Mapping

For more complex models that have other models as fields, a `mapping` can be passed to the `migrate` method to specify how nested `soupsavvy` models should be transformed into their respective target models.

In [None]:
import pydantic
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class PydanticAuthor(pydantic.BaseModel):
    name: str
    country: str


class PydanticBook(pydantic.BaseModel):
    title: str
    author: PydanticAuthor


class Author(BaseModel):
    __scope__ = TypeSelector("span")

    name = TypeSelector("p") | Text()
    country = ClassSelector("country") | Text()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    author = Author


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <span>
            <p>George Orwell</p>
            <a class="country">United Kingdom</a>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(PydanticBook, mapping={Author: PydanticAuthor})

#### MigrationSchema

While the `migrate` method accepts keyword arguments, these apply only to the target model, not to nested models. 

When a nested model needs additional initialization parameters, use `MigrationSchema` as its `mapping` value. It bundles the target model with a dictionary of keyword arguments (`params`) that are passed to that target's initializer.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, MigrationSchema
from soupsavvy.operations import Text


class TargetAuthor:
    def __init__(self, name: str, country: str, genre=None):
        self.name = name
        self.country = country
        self.genre = genre

    def __repr__(self):
        return f"TargetAuthor(name={self.name!r}, country={self.country!r}, genre={self.genre!r})"


class TargetBook:
    def __init__(self, title: str, author: TargetAuthor, price=None):
        self.title = title
        self.author = author
        self.price = price

    def __repr__(self):
        return f"TargetBook(title={self.title!r}, author={self.author!r}, price={self.price!r})"


class Author(BaseModel):
    __scope__ = TypeSelector("span")

    name = TypeSelector("p") | Text()
    country = ClassSelector("country") | Text()


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    author = Author


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <span>
            <p>George Orwell</p>
            <a class="country">United Kingdom</a>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(
    TargetBook,
    mapping={Author: MigrationSchema(TargetAuthor, params={"genre": "Dystopia"})},
    price=10,
)
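
The idea behind `mapping` and `MigrationSchema` can be illustrated with a small standard-library sketch. This is not `soupsavvy`'s implementation: `Schema` and the `migrate` function below are hypothetical stand-ins, and plain dataclasses play the role of both scraped and target models. It only shows how nested values might be converted recursively, with extra keyword arguments routed to the right initializer.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Schema:
    """Hypothetical stand-in for MigrationSchema: a target plus extra kwargs."""

    target: type
    params: dict = field(default_factory=dict)


def migrate(source, target, mapping=None, **kwargs):
    """Build `target` from the fields of `source`, converting nested models."""
    mapping = mapping or {}
    values = {}
    for f in fields(source):
        value = getattr(source, f.name)
        entry = mapping.get(type(value))
        if entry is not None:
            # A bare class in the mapping means "no extra parameters"
            schema = entry if isinstance(entry, Schema) else Schema(entry)
            value = migrate(value, schema.target, mapping, **schema.params)
        values[f.name] = value
    # Top-level kwargs reach only this target, never the nested ones
    return target(**values, **kwargs)


@dataclass
class Author:
    name: str
    country: str


@dataclass
class Book:
    title: str
    author: Author


@dataclass
class TargetAuthor:
    name: str
    country: str
    genre: str = None


@dataclass
class TargetBook:
    title: str
    author: TargetAuthor
    price: int = None


scraped = Book("Animal Farm", Author("George Orwell", "United Kingdom"))
result = migrate(
    scraped,
    TargetBook,
    mapping={Author: Schema(TargetAuthor, {"genre": "Dystopia"})},
    price=10,
)
print(result)
```

In this sketch, `price=10` is applied only to `TargetBook`, while `genre` reaches `TargetAuthor` through the schema, mirroring the split described above.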

#### Copy

Additionally, the `copy` method allows a model to be *migrated to itself*, creating a new identical instance as a deep copy, including all nested models.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book_copy = book.copy()

assert book == book_copy
assert book is not book_copy

print(book_copy)
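
For readers unfamiliar with deep-copy semantics, the following standard-library sketch shows the property the assertions above rely on. It uses plain dataclasses and `copy.deepcopy` rather than `soupsavvy` models, so it is only an analogy for what `copy` is described as doing: the copy compares equal, but neither it nor its nested objects share identity with the original.

```python
import copy
from dataclasses import dataclass


@dataclass
class Author:
    name: str


@dataclass
class Book:
    title: str
    author: Author


book = Book("Animal Farm", Author("George Orwell"))
book_copy = copy.deepcopy(book)

print(book == book_copy)  # True: field values are equal
print(book is book_copy)  # False: a new top-level instance
print(book.author is book_copy.author)  # False: nested object was copied too
```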

## Tips

### Scope

It's advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model.  
You can use `HasSelector` to extend your selection criteria by matching elements that contain the fields needed for extraction.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, HasSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

PRICE_SELECTOR = ClassSelector("price")
TITLE_SELECTOR = ClassSelector("title")


class Book(BaseModel):
    __scope__ = (
        ClassSelector("book")
        & HasSelector(PRICE_SELECTOR)
        & HasSelector(TITLE_SELECTOR)
    )

    title = TITLE_SELECTOR | Text()
    price = PRICE_SELECTOR | Text() | Operation(int)


text = """
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
        <p>4:30</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
        <p>3:30</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
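
The filtering that a "has"-style scope performs can be sketched without `soupsavvy`. The example below is an illustration only, using `xml.etree.ElementTree` on a simplified, well-formed version of the markup above. The helpers `has_class` and `has_descendant` are hypothetical names introduced here. It picks the first `book` element that contains both a `price` and a `title` descendant, skipping the incomplete entries.

```python
import xml.etree.ElementTree as ET

markup = """
<body>
    <div class="book">Unavailable</div>
    <div class="book"><p class="title">Animal Farm</p></div>
    <div class="book"><p class="price">50</p></div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
    </div>
</body>
""".strip()
root = ET.fromstring(markup)


def has_class(element, name):
    """Check whether `name` is among the element's space-separated classes."""
    return name in element.get("class", "").split()


def has_descendant(element, name):
    """Check whether any descendant (excluding the element itself) has the class."""
    return any(has_class(d, name) for d in element.iter() if d is not element)


# First element that satisfies every condition of the scope
match = next(
    div
    for div in root.iter("div")
    if has_class(div, "book")
    and has_descendant(div, "price")
    and has_descendant(div, "title")
)
title = next(p.text for p in match.iter("p") if has_class(p, "title"))
print(title)  # Brave New World
```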

When the scope element is the same as the element being searched, `SelfSelector` can be used as the scope selector.

In [None]:
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class Book(BaseModel):
    __scope__ = SelfSelector()

    title = ClassSelector("title") | Text()
    author = ClassSelector("author") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="author">George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

Book.find(div)

### Typing

To maintain clean and consistent typing, you can use `typing.cast` to provide type checkers with hints about instance field types.

In [None]:
from typing import cast, Optional

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = cast(str, ClassSelector("title") | Text())
    price = cast(
        Optional[int], ClassSelector("price") | SkipNone(Text() | Operation(int))
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
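
Keep in mind that `typing.cast` is purely a hint for type checkers: at runtime it returns its argument unchanged and performs no conversion or validation. The short sketch below, independent of `soupsavvy`, shows this. Any actual conversion still has to come from the operations in the pipeline, such as `Operation(int)`.

```python
from typing import Optional, cast

raw = "100"

# cast() only informs the type checker; the value is returned as-is
price = cast(Optional[int], raw)
print(type(price).__name__)  # str

# A real conversion must be done explicitly
converted = int(raw)
print(type(converted).__name__)  # int
```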

## Conclusion

`soupsavvy` offers a framework for object-oriented web scraping through user-defined models.  
This allows users to define the structure of data they wish to extract from HTML documents.

**Enjoy `soupsavvy` and leave us feedback!**  
**Happy scraping!**