# Test Scrubber
   
This notebook shows examples of how to use the Lexos `scrubber`.

## Add Lexos to the Jupyter `sys.path`

A small script in the same folder as the notebook does this.

In [1]:
%run jupyter_local_setup.py ../../../lexos

System path set to `../../../lexos`.


## Import Scrubber Modules

In [2]:
from lexos.scrubber.registry import scrubber_components, load_components, load_component
from lexos.scrubber.pipeline import pipe, make_pipeline

## Load Components

Components can be loaded individually or several at a time.

In [4]:
# Load a component from the registry
lower_case = scrubber_components.get("lower_case")

# Alternative way to load a single component
whitespace = load_component("whitespace")

# Load multiple components at once
punctuation, remove_digits = load_components(("punctuation", "digits"))

## Apply individual components

Components can be used individually, like normal functions.

In [5]:
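# Note the doubled spaces and the trailing space in the sample text
original_text = "There  are  35 characters in this  sentence. "

# Convert to lowercase
scrubbed_text = lower_case(original_text)
print(scrubbed_text)

# Collapse repeated spaces and strip the trailing space
scrubbed_text = whitespace(scrubbed_text)
print(scrubbed_text)

# Remove punctuation
scrubbed_text = punctuation(scrubbed_text)
print(scrubbed_text)

# Remove only the digit "5"
scrubbed_text = remove_digits(scrubbed_text, only="5")
print(scrubbed_text)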

there  are  35 characters in this  sentence. 
there are 35 characters in this sentence.
there are 35 characters in this sentence
there are 3 characters in this sentence
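
To make the effect of the `only` argument concrete, here is a minimal plain-Python sketch of a digit-removing function. The name `strip_digits` and its logic are assumptions made for illustration; this is not the Lexos implementation of the `digits` component.

```python
import re
from typing import Collection, Optional, Union


def strip_digits(text: str, only: Optional[Union[str, Collection[str]]] = None) -> str:
    """Hypothetical stand-in for a digit-removing component."""
    if only is None:
        # No restriction given: remove every digit character.
        return re.sub(r"\d", "", text)
    # Accept either a single string or a collection of strings.
    targets = [only] if isinstance(only, str) else list(only)
    for digit in targets:
        text = text.replace(digit, "")
    return text


sample = "there are 35 characters in this sentence"
print(strip_digits(sample, only="5"))
```

Passing `only="5"` leaves the `3` in place, which mirrors the last line of the output above.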


## Pipelines

Components can be combined into a pipeline, which applies them to a text in a specific order.

In [8]:
scrub = make_pipeline(
    lower_case,
    whitespace,
    punctuation,
    # Components that take arguments beyond the text must be wrapped in pipe(),
    # with the component as the first argument.
    pipe(remove_digits, only=["5"]),
)
scrubbed_text = scrub(original_text)
print(scrubbed_text)

there are 3 characters in this sentence
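
The idea behind a pipeline can be sketched in plain Python as function composition. The helpers `compose` and `drop_chars` below are hypothetical names invented for this sketch, and `functools.partial` stands in for the role that `pipe()` appears to play; none of this is taken from the Lexos source.

```python
from functools import partial, reduce
from typing import Callable


def compose(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Hypothetical helper: chain single-argument text functions left to right."""
    def run(text: str) -> str:
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def drop_chars(text: str, chars: str = "") -> str:
    """Hypothetical component that needs an argument beyond the text."""
    return "".join(ch for ch in text if ch not in chars)


# partial() fixes the extra argument, leaving a function of the text alone.
clean = compose(str.lower, str.strip, partial(drop_chars, chars="5"))
print(clean("  There are 35 Characters  "))
```

Every step in the chain must accept the text as its only remaining argument, which is why a function with extra parameters has to be wrapped before it joins the chain.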


## Making Custom Components

Users can write custom components. A custom component is written like a standard function, but it must be registered before it can be loaded as a component.

In [9]:
# Define the custom function
def title_case(text: str) -> str:
    """Our custom function to convert text to title case."""
    return text.title()

# Register the custom function
scrubber_components.register("title_case", func=title_case)

title = load_component("title_case")

scrubbed_text = title(original_text)
print(scrubbed_text)

There  Are  35 Characters In This  Sentence. 
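
The register-then-load pattern can be illustrated with a toy registry built on a dictionary. The names `_registry`, `register`, and `get` are assumptions for this sketch, which only mimics the pattern and does not reproduce how `scrubber_components` is implemented.

```python
from typing import Callable, Dict

# Hypothetical toy registry: a dict mapping names to text functions.
_registry: Dict[str, Callable[[str], str]] = {}


def register(name: str, func: Callable[[str], str]) -> None:
    """Store a function under a name so it can be looked up later."""
    _registry[name] = func


def get(name: str) -> Callable[[str], str]:
    """Return the function registered under `name`, or fail with a clear message."""
    try:
        return _registry[name]
    except KeyError:
        raise KeyError(f"No component registered as {name!r}") from None


def title_case(text: str) -> str:
    """Convert text to title case."""
    return text.title()


register("title_case", title_case)
print(get("title_case")("custom components"))
```

Looking up a name that was never registered raises an error, which is why registration has to come before loading.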
