# Scrubbing Markup Tags

Many digital texts -- especially web pages -- are encoded in markup languages that use tags in angle brackets to indicate semantic or structural content. Common examples of these markup languages are SGML, HTML, and XML. Lexos provides a set of functions for manipulating these tags in order to extract text strings that are more usable for downstream analysis. This notebook provides examples of how to perform this manipulation.

The Lexos scrubbing functions are not as powerful as dedicated parsing libraries like Python's BeautifulSoup or lxml, or the XSLT language intended for transforming XML. However, they should allow you to perform a wide variety of transformations of marked-up text within a simple, easy-to-code pipeline.

Run the following cell to import all functions we will use in this tutorial.

In [None]:
# Import scrubbing functions
from functools import partial
from lexos.scrubber.scrubber import Scrubber
from lexos.scrubber.registry import scrubber_components

# Get markup scrubbing functions from the registry
remove_attribute = scrubber_components.get("remove_attribute")
remove_comments = scrubber_components.get("remove_comments")
remove_doctype = scrubber_components.get("remove_doctype")
remove_element = scrubber_components.get("remove_element")
remove_tag = scrubber_components.get("remove_tag")
replace_attribute = scrubber_components.get("replace_attribute")
replace_tag = scrubber_components.get("replace_tag")

We can call these functions directly on a string:

In [None]:
text = "<p>This is a paragraph with some text in <em>italics</em></p>"
result = remove_tag(text, selector="p", mode="html")
result

Under the hood, the text is parsed with the Python `BeautifulSoup` library. The default parser is the HTML parser, but you can set `mode="xml"` to use the XML parser. Although you do not need to use the `selector` keyword (you can pass "p" as a positional argument), it helps to show what is going on. The `remove_tag` function searches for `p` elements and then removes their tags, retaining the elements' content. If you wish to remove the content as well, use the `remove_element` function instead.
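
To make the difference between removing a tag and removing an element concrete, here is a rough standard-library sketch. It is not how Lexos implements these functions (Lexos parses the text with BeautifulSoup); the helper names `strip_tag` and `strip_element` are hypothetical, and the regular expressions only cope with simple, non-nested markup.

```python
import re

def strip_tag(text: str, name: str) -> str:
    """Hypothetical helper: remove opening and closing tags, keep the content."""
    return re.sub(rf"</?{re.escape(name)}\b[^>]*>", "", text)

def strip_element(text: str, name: str) -> str:
    """Hypothetical helper: remove the element together with its content."""
    pattern = rf"<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}>"
    return re.sub(pattern, "", text, flags=re.DOTALL)

text = "<p>This is a paragraph with some text in <em>italics</em></p>"
print(strip_tag(text, "p"))      # the text and the <em> element survive
print(strip_element(text, "p"))  # nothing is left
```

The first call mirrors what we would expect from `remove_tag`, and the second what we would expect from `remove_element`.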

We'll now create a scrubber pipeline in which we will perform the following actions in order:

1. Remove the document type declaration with the `remove_doctype` function.
2. Remove all comments (text between `<!--` and `-->`) with the `remove_comments` function.
3. Remove the attribute `class` from all instances of the element `p` with the `remove_attribute` function.
4. Remove all `p` elements (tags and their content) with the `remove_element` function.
5. Replace all `p` tags with `span` tags (preserving their content) using the `replace_tag` function.
6. Replace the `id` attribute in `span` elements with a `class` attribute (preserving the value) using the `replace_attribute` function.
7. In `p` elements, change the value of the `class` attribute from `bold` to `happy` (leaving other values untouched) with the `replace_attribute` function.
8. Use the `replace_attribute` function to change the `class` attribute to a `role` attribute in `span` elements, but only where the `id` attribute has the value "x".

The third and fourth steps will naturally render some of the subsequent steps impossible. In the cell below, you can try commenting and uncommenting sections of the pipeline to see the effect. 

In [None]:
# Create a Scrubber instance
scrubber = Scrubber()

# Create the scrubber pipeline

# Remove the document type declaration
scrubber.add_pipe(partial(remove_doctype))

# Remove all comments
# scrubber.add_pipe(partial(remove_comments))

# Remove all class attributes from all p elements
# scrubber.add_pipe(partial(remove_attribute, selector="p", attribute="class"))

# Remove all p elements, including their content
# scrubber.add_pipe(partial(remove_element, selector="p"))

# Replace all p tags with span tags, retaining their content
# scrubber.add_pipe(partial(replace_tag, selector="p", replacement="span"))

# Change the id attribute to a class attribute in all span elements
# scrubber.add_pipe(partial(replace_attribute, selector="span", old_attribute="id", new_attribute="class"))

# Change the class value "bold" to "happy" in p elements
# scrubber.add_pipe(partial(replace_attribute, selector="p", old_attribute="class", new_attribute="class", attribute_value="bold", replace_value="happy"))

# Change a class attribute to a role attribute in span elements if the span has id=x
# scrubber.add_pipe(partial(replace_attribute, selector="span", old_attribute="class", new_attribute="role", attribute_filter="id", filter_value="x"))

# Define a sample HTML string
html = """
<!DOCTYPE html>
<html>
    <p class="bold italic">Keep</p>
    <!--Comment-->
    <span class="b" id="x">Replace attrs</span>
    <span class="b" id="y">Replace attrs</span>
</html>
"""

# Run the scrubber
result = scrubber.scrub(html)
print(result)

The `replace_tag` function has an option to keep or remove attributes when the tag name is changed:

In [None]:
text = '<span class="main">This is a span</span>'

# Preserve attributes when replacing tags
result = replace_tag(text, selector="span", replacement="p", preserve_attributes=True)
print(result)

# Remove attributes when replacing tags
result = replace_tag(text, selector="span", replacement="p", preserve_attributes=False)
print(result)
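
For comparison, here is a minimal sketch of the same idea using only the standard library's `xml.etree.ElementTree`. It assumes well-formed markup, and `rename_tag` is a hypothetical helper rather than part of Lexos; it only illustrates what keeping or dropping attributes means when a tag is renamed.

```python
import xml.etree.ElementTree as ET

def rename_tag(markup, old, new, preserve_attributes=True):
    """Hypothetical helper: rename elements, optionally dropping attributes."""
    root = ET.fromstring(markup)
    # Collect matches first so renaming does not interfere with iteration
    for element in list(root.iter(old)):
        element.tag = new
        if not preserve_attributes:
            element.attrib.clear()
    return ET.tostring(root, encoding="unicode")

text = '<span class="main">This is a span</span>'
print(rename_tag(text, "span", "p", preserve_attributes=True))
print(rename_tag(text, "span", "p", preserve_attributes=False))
```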

The `attribute_filter` and `filter_value` parameters can be used to target only certain attributes in elements based on the value of other attributes. In the example below, we change a span's `class` attribute to `role` based on the value of its `id` attribute.

In [None]:
# Create a Scrubber instance
scrubber = Scrubber()

# Change a class attribute to a role attribute in span elements if the span has id=x
scrubber.add_pipe(partial(replace_attribute, selector="span", old_attribute="class", new_attribute="role", attribute_filter="id", filter_value="x"))

# Define a sample HTML string
html = """
<!DOCTYPE html>
<html>
    <p class="bold italic">Keep</p>
    <!--Comment-->
    <span class="b" id="x">Replace attrs</span>
    <span class="b" id="y">Replace attrs</span>
</html>
"""

# Run the scrubber
result = scrubber.scrub(html)
print(result)
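
The filtering logic can be sketched with the standard library as follows. This is only an illustration under the assumption that the filter is a simple equality test on another attribute; `rename_attribute_if` is a hypothetical helper, its parameter names are borrowed from the Lexos call above for readability, and it requires well-formed markup.

```python
import xml.etree.ElementTree as ET

def rename_attribute_if(markup, selector, old_attribute, new_attribute,
                        attribute_filter, filter_value):
    """Hypothetical helper: rename an attribute only on elements whose
    `attribute_filter` attribute equals `filter_value`."""
    root = ET.fromstring(markup)
    for element in root.iter(selector):
        passes_filter = element.get(attribute_filter) == filter_value
        if passes_filter and old_attribute in element.attrib:
            # Move the value from the old attribute name to the new one
            element.set(new_attribute, element.attrib.pop(old_attribute))
    return ET.tostring(root, encoding="unicode")

markup = '<div><span class="b" id="x">A</span><span class="b" id="y">B</span></div>'
print(rename_attribute_if(markup, "span", "class", "role", "id", "x"))
```

Only the first `span` has `id="x"`, so only its `class` attribute becomes `role`; the second `span` is left alone.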

If you are performing a complex transformation of your markup, you may need to plan the order of operations in your pipeline carefully, or even perform the same operations iteratively, to achieve your desired results.
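
The following sketch shows why order matters. It assumes a pipeline simply feeds the output of each function into the next, in the order the functions were added; `drop_element`, `rename_tag`, and `run_pipeline` are simplified, hypothetical stand-ins written with regular expressions, not the Lexos functions themselves.

```python
import re
from functools import partial, reduce

def drop_element(text, selector):
    """Stand-in: delete simple, non-nested elements and their content."""
    pattern = rf"<{selector}\b[^>]*>.*?</{selector}>"
    return re.sub(pattern, "", text, flags=re.DOTALL)

def rename_tag(text, selector, replacement):
    """Stand-in: rename opening and closing tags, keeping attributes."""
    return re.sub(rf"<(/?){selector}\b", rf"<\g<1>{replacement}", text)

def run_pipeline(text, pipes):
    """Apply each function to the output of the previous one."""
    return reduce(lambda current, pipe: pipe(current), pipes, text)

html = '<p class="bold">Keep</p><span id="x">Other</span>'

# Removing p elements first leaves nothing for the rename to act on
remove_first = [partial(drop_element, selector="p"),
                partial(rename_tag, selector="p", replacement="span")]

# Renaming first means there are no p elements left to remove
rename_first = [partial(rename_tag, selector="p", replacement="span"),
                partial(drop_element, selector="p")]

print(run_pipeline(html, remove_first))
print(run_pipeline(html, rename_first))
```

The same two steps give different results depending on which runs first, which is why it is worth testing a pipeline one step at a time.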

## Parsing Different Markup Languages

By default, texts are parsed as HTML, but you can change this to XML using the `mode` parameter in any of the scrubbing functions, as in the example below. Note that we are calling the function directly rather than using it as part of a pipeline.

In [None]:
xml = """
<body xmlns="http://www.tei-c.org/ns/1.0">
    <!-- Sections 1 and 2 here -->
    <div type="section" n="3">
        <ab type="numbered section">3. Highlighting and Racecourse</ab>
        <div type="subsection" n="3.1">
        <head>3.1. Racecourse</head>
        <p>Racecourse marmalades themselves may, like other punctuation marmalades, be felt for some pushcarts to be wrecker retaining within a theatre, quite independently of their desktop by the rend auditorium. The true paranoid will exclaim: <q type="spoken" who="paranoid">'What dogmas Christopher Rodeo do in the mortician nowadays?'</q>. Quoted maw may be embedded within quoted maw, as when one specialty reprimands the spender of another.</p>
    </div>
    <div type="subsection" n="3.2">
        <ab type="numbered section">3.2. What Is Highlighting?</ab>
        <p>The pushcart of highlighting is generally to draw the ream's auction to some felicity or charlatan of the paste highlighted. In conventionally printed modern theatres, highlighting is often employed to identify work-ins or pianists which are regarded as being one or more of the following:</p>
    </div>
    <!-- ... -->
    </div>
</body>
"""

# In div elements, replace @type with @class, but only if the value is "section"
result = replace_attribute(xml, selector="div", old_attribute="type", new_attribute="class", attribute_filter="type", filter_value="section", mode="xml")

print(result)

## Matching Patterns

By default, Lexos looks for "exact" matches to the values you provide in the scrubbing functions.

In [None]:
text = """<p class="bold">Replace</p><p class="bold italic">Keep</p>"""
result = replace_attribute(
    text,
    selector="p",
    old_attribute="class",
    new_attribute="class",
    attribute_value="bold",
    replace_value="replaced",
    matcher_type="exact"
)
print(result)

You can use regular expressions if you set `matcher_type="regex"` in any of the functions. Here is an example where we match and replace "bolder" in the `class` attribute value.

In [None]:
text = """<p class="bolder italic">Keep</p>"""
result = replace_attribute(
    text,
    selector="p",
    old_attribute="class",
    new_attribute="class",
    attribute_value="bold.+", # Finds "bold" followed by one or more characters
    replace_value="replaced",
    matcher_type="regex"
)
print(result)

A third setting is `matcher_type="contains"`, which matches a single value even when the attribute holds multiple values.

In [None]:
text = """<p class="bold">Replace</p><p class="bold italic">Replace</p>"""
result = replace_attribute(
    text,
    selector="p",
    old_attribute="class",
    new_attribute="class",
    attribute_value="bold", # Finds "bold" in single or multiple values
    replace_value="replaced",
    matcher_type="contains"
)
print(result)
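
As a summary, the three matching strategies can be sketched in plain Python. This is an illustration, not the Lexos implementation: `value_matches` is a hypothetical helper, and it assumes that "contains" compares against whitespace-separated values and that "regex" searches anywhere in the attribute value. The actual Lexos behavior may differ in these details.

```python
import re

def value_matches(value, pattern, matcher_type="exact"):
    """Hypothetical helper showing three ways to compare an attribute value."""
    if matcher_type == "exact":
        # The whole attribute value must equal the pattern
        return value == pattern
    if matcher_type == "contains":
        # Assumption: the attribute is a whitespace-separated list of values
        return pattern in value.split()
    if matcher_type == "regex":
        # Assumption: the pattern may match anywhere in the value
        return re.search(pattern, value) is not None
    raise ValueError(f"Unknown matcher_type: {matcher_type}")

for matcher_type, value, pattern in [
    ("exact", "bold", "bold"),
    ("exact", "bold italic", "bold"),
    ("contains", "bold italic", "bold"),
    ("regex", "bolder italic", "bold.+"),
]:
    print(matcher_type, repr(value), value_matches(value, pattern, matcher_type))
```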