functions that work on soup
To install: pip install ow
The ow package provides a collection of utilities designed to facilitate the manipulation and querying of HTML/XML structures using BeautifulSoup. It includes functions to navigate through the structure, extract information, and even open HTML content in a web browser for debugging or inspection purposes.
- Navigational Utilities: Traverse through the HTML tree structure to find parent elements or specific paths.
- HTML Content Handling: Save HTML tags to a file and view them in Firefox, aiding in debugging and visualization.
- Data Extraction: Simplify the extraction of text from specified tags and automatically apply text transformations.
- Batch Element Retrieval: Retrieve multiple elements based on complex path specifications, supporting both simple and nested queries.
Install the package using pip:
pip install owTo find the root parent of a BeautifulSoup tag:
from bs4 import BeautifulSoup
from ow import root_parent
soup = BeautifulSoup("<div><span>Example</span></div>", "html.parser")
span_tag = soup.find('span')
root = root_parent(span_tag)
print(root) # Outputs the div tagTo open a tag's HTML content in Firefox for debugging:
from bs4 import BeautifulSoup
from ow import open_tag_in_firefox
soup = BeautifulSoup('<div><span>Open me in Firefox</span></div>', 'html.parser')
span_tag = soup.find('span')
open_tag_in_firefox(span_tag)Extract text from a specified tag and add it to a dictionary, optionally applying a text transformation:
from bs4 import BeautifulSoup
from ow import add_text_to_parse_dict
soup = BeautifulSoup('<div><p id="para"> Some text </p></div>', 'html.parser')
parse_dict = {}
add_text_to_parse_dict(soup, parse_dict, key='paragraph', name='p', attrs={'id': 'para'}, text_transform=str.strip)
print(parse_dict) # Outputs: {'paragraph': 'Some text'}Retrieve elements from a BeautifulSoup object by specifying a path:
from bs4 import BeautifulSoup
from ow import get_elements
soup = BeautifulSoup('<div><p>First</p><p>Second</p></div>', 'html.parser')
elements = get_elements(soup, ['p'])
print([e.text for e in elements]) # Outputs: ['First', 'Second']Returns the furthest ancestor of a BeautifulSoup tag.
Saves the HTML of a BeautifulSoup tag to a temporary file and opens it in Firefox.
Finds a tag in the given BeautifulSoup object soup by name and attrs, extracts its text, applies a text_transform function, and adds it to parse_dict under key.
Retrieves an element from a BeautifulSoup node by following a specified path. The path can be a string, list, or dictionary describing how to find the element.
Recursively retrieves elements from a node or list of nodes in a BeautifulSoup object by following a list of paths. Each path can be a string, list, or dictionary that specifies how to find the elements.
Contributions to the ow package are welcome. Please ensure that any pull requests or issues are detailed with examples and expected outcomes.