# `Scrubber` Tutorial
   
`Scrubber` is a "preprocessing" module which can help you clean up your data to make it suitable for further analysis. It is important to remember that `Scrubber` operates using patterns and has no awareness of the language of the text other than the implicit knowledge you bring when you decide how to use it. As a result, `Scrubber` may not always produce the desired results. Always inspect the output before using it for your analysis.

`Scrubber` can be described as a "destructive preprocessor". In other words, it changes the text in ways that can make it impossible to map the results back onto the original text. It is therefore best used before other procedures so that the scrubbed text is essentially treated as the "original" text.
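
To make "destructive" concrete, here is a minimal sketch in plain Python. The `toy_scrub` function is a hypothetical stand-in written for this illustration, not a Lexos component. It shows two different inputs collapsing to the same output, which is why the original text cannot be recovered afterwards.

```python
import re


def toy_scrub(text: str) -> str:
    """Lowercase the text and collapse whitespace (illustrative stand-in, not Lexos code)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


first = "Pride  and Prejudice"
second = "pride and\nprejudice"

# The two originals differ, but their scrubbed forms are identical,
# so the output alone cannot tell us which original we started from.
print(first == second)
print(toy_scrub(first) == toy_scrub(second))
```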

## Load Some Data

We'll start by loading some data using the `Loader` module.

In [None]:
from lexos.io.smart import Loader

loader = Loader()
loader.load("../test_data/txt/Austen_Pride.txt")

## Import `Scrubber` Functions

The `Scrubber` module has a lot of functions that don't all have to be loaded. However, for convenience, we are going to load everything we need for this tutorial at once. Here are some brief explanations.

`Scrubber` consists of a "registry" of component functions. Each function does something different. Components must be loaded before they can be used, and this is the purpose of the `load_component()` and `load_components()` functions.

`Scrubber` also has "pipeline" functions that allow you to combine components in any order you want, and pass the text through each component in turn. Remember that the effects of `Scrubber` are destructive, so the order in which components are combined can often make a difference in the output. The `make_pipeline()` and `pipe()` functions are used to set up the pipeline.

In [None]:
from lexos.scrubber.registry import scrubber_components, load_component, load_components
from lexos.scrubber.pipeline import make_pipeline, pipe
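
The point about order can be illustrated without Lexos at all. The sketch below uses two hypothetical helper functions (`drop_digits` and `collapse_spaces`, written only for this example) and applies them in both orders to the same string.

```python
import re


def drop_digits(text: str) -> str:
    """Delete every digit (hypothetical stand-in for a 'remove' step)."""
    return re.sub(r"\d", "", text)


def collapse_spaces(text: str) -> str:
    """Turn runs of spaces into a single space (hypothetical stand-in for a 'normalize' step)."""
    return re.sub(r" +", " ", text).strip()


sample = "Chapter 12 begins"

# Removing digits first leaves a double space, which the second step then cleans up.
digits_first = collapse_spaces(drop_digits(sample))

# Collapsing spaces first does nothing useful; the double space appears afterwards and stays.
spaces_first = drop_digits(collapse_spaces(sample))

print(repr(digits_first))
print(repr(spaces_first))
```

The same two steps give different results depending on which runs first, so it is worth deciding on the order deliberately.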

## Loading Components

When components are loaded, they should be assigned to a variable that describes what they do (normally, but not always, the same as the component's name).
 
Components can be loaded using pure Python, but Lexos also has the `load_component()` helper function, which may be easier to remember. Either of the commands below will work.

In [None]:
# Python code to load a component from the registry
lower_case = scrubber_components.get("lower_case")

# Lexos helper function to load a single component
whitespace = load_component("whitespace")

You can also load multiple components at once by putting them in a tuple and using the `load_components()` function:

In [None]:
# Load multiple components at once
punctuation, remove_digits = load_components(
    ("punctuation", "digits")
)
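
If the idea of a registry is unfamiliar, it may help to think of it as a lookup table from names to functions. The sketch below is a simplified illustration using a plain dictionary. The names `toy_registry`, `toy_load_component`, and `toy_load_components` are hypothetical, and this is not how Lexos implements its registry.

```python
from typing import Callable, Dict, Iterable, Tuple

TextFunc = Callable[[str], str]

# Hypothetical stand-in for a component registry: names mapped to functions.
toy_registry: Dict[str, TextFunc] = {
    "lower_case": str.lower,
    "strip": str.strip,
}


def toy_load_component(name: str) -> TextFunc:
    """Look up a single function by name."""
    return toy_registry[name]


def toy_load_components(names: Iterable[str]) -> Tuple[TextFunc, ...]:
    """Look up several functions at once, preserving the requested order."""
    return tuple(toy_registry[name] for name in names)


# Unpack the returned tuple into descriptive variable names.
lower, strip = toy_load_components(("lower_case", "strip"))
print(strip(lower("  Hello World  ")))
```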

## Applying Individual Components

Components can be used individually like normal functions. If you have successfully run the cells above, you should have loaded the following components: "lower_case", "whitespace", "punctuation", and "digits". We are going to apply these in order: converting to lowercase, converting extra whitespace characters to spaces (and stripping final ones), removing punctuation, and removing digits (but only the digit "9").

In [None]:
original_text = ("There  are  39 characters in this  sentence. ")
print(f"Original text length: {len(original_text)}")

scrubbed_text = lower_case(original_text)
print(f"- {scrubbed_text}")

scrubbed_text = whitespace(scrubbed_text)
print(f"- {scrubbed_text}")

scrubbed_text = punctuation(scrubbed_text)
print(f"- {scrubbed_text}")

scrubbed_text = remove_digits(scrubbed_text, only="9")
print(f"- {scrubbed_text}")

print(f"Scrubbed text length: {len(scrubbed_text)}")
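
For comparison, the same four steps can be approximated with the Python standard library alone. This is only a rough stand-in built on assumptions about what each step does (for example, it deletes punctuation outright); the Lexos components may treat edge cases differently, so check their output rather than relying on this sketch.

```python
import re
import string

sample = "There  are  39 characters in this  sentence. "

# Step 1: lowercase everything.
step = sample.lower()

# Step 2: collapse runs of whitespace to one space and trim the ends.
step = re.sub(r"\s+", " ", step).strip()

# Step 3: delete ASCII punctuation characters.
step = step.translate(str.maketrans("", "", string.punctuation))

# Step 4: delete only the digit "9".
step = step.replace("9", "")

print(step)       # there are 3 characters in this sentence
print(len(step))  # 39
```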

## Pipelines

We can combine our components into a single function called `scrub` by putting them in a pipeline, which applies multiple components to a text in a specific order. Notice that we simply pass the variables holding our components to the `make_pipeline()` function. The only complication is the `remove_digits` component, which takes the keyword argument `only`. To supply the keyword, we pass it, along with the component, through the `pipe()` function.

In [None]:
scrub = make_pipeline(
    lower_case,
    whitespace,
    punctuation,
    pipe(remove_digits, only=["9"])
)
print(f"Original text length: {len(original_text)}")
scrubbed_text = scrub(original_text)
print(f"- {scrubbed_text}")
print(f"Scrubbed text length: {len(scrubbed_text)}")
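
Conceptually, a pipeline is function composition, and attaching a keyword argument to one step is partial application. The sketch below shows that idea with `functools.reduce` and `functools.partial`. The names `toy_make_pipeline` and `drop_chars` are hypothetical, and this is an illustration of the general technique rather than the Lexos implementation.

```python
from functools import partial, reduce
from typing import Callable

TextFunc = Callable[[str], str]


def toy_make_pipeline(*funcs: TextFunc) -> TextFunc:
    """Return one function that applies each step from left to right (hypothetical helper)."""
    def run(text: str) -> str:
        return reduce(lambda acc, func: func(acc), funcs, text)
    return run


def drop_chars(text: str, only: str = "") -> str:
    """Remove every character that appears in `only` (hypothetical stand-in component)."""
    return "".join(ch for ch in text if ch not in only)


# partial() fixes the keyword argument so the step takes just the text,
# which is what every step in the pipeline must do.
toy_pipeline = toy_make_pipeline(
    str.lower,
    str.strip,
    partial(drop_chars, only="9"),
)

print(toy_pipeline("  Room 39  "))  # room 3
```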

## Choosing Components

`Scrubber` has _a lot_ of components that allow you to do some pretty powerful work. They fall into three categories:

<ol>
<li><a href="https://scottkleinman.github.io/lexos/api/scrubber/normalize/" target="_blank">Normalize</a> components are used to manipulate text into a standardized form.</li>
<li><a href="https://scottkleinman.github.io/lexos/api/scrubber/remove/" target="_blank">Remove</a> components are used to remove strings and patterns from text.</li>
<li><a href="https://scottkleinman.github.io/lexos/api/scrubber/replace/" target="_blank">Replace</a> components are used to replace strings and patterns in text.</li>
</ol>

Just click the links to read more about each category.

## Making Custom Components

If `Scrubber` does not have the component you need, you can write your own. Custom components are written just like standard functions but must be registered (added to `Scrubber`'s component registry) before they are loaded. The example below is a custom component that applies Python's `str.title()` method to capitalise the first letter of every word.

If you are not familiar with the format `title_case(text: str) -> str`, the `: str` and `-> str` code is called "type hinting". It tells us that the input text is expected to be a string and that the output text will also be a string. This is not strictly necessary (in Python), but it is good coding practice, so we include it here.
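
As a quick aside, the short example below shows that type hints are stored on the function but are not checked when the code runs. The `repeat` function is a hypothetical example written only for this illustration.

```python
def repeat(text: str) -> str:
    """Return the input twice in a row."""
    return text * 2


# The hints are recorded as metadata on the function.
print(repeat.__annotations__)

# Used as intended, with a string.
print(repeat("ab"))

# Python does not enforce the hint: an integer is accepted and simply doubled.
print(repeat(3))
```

Tools such as type checkers and editors read the hints to warn you about mistakes like the last call, which is why they are worth writing.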

In [None]:
# Define the custom function
def title_case(text: str) -> str:
    """Our custom function to convert text to title case."""
    return text.title()

# Register the custom function under the name "title_case"
scrubber_components.register("title_case", func=title_case)

# Load our custom component from the registry
title = load_component("title_case")

# Scrub our text with our custom function
scrubbed_text = title(original_text)
print(scrubbed_text)
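
One caveat is worth seeing before you rely on a component like this. Python's `str.title()` treats any run of letters as a word, so a letter following an apostrophe is also capitalised. The sample string below is made up for illustration.

```python
sample = "mr. darcy's letter"
result = sample.title()

# The "s" after the apostrophe is capitalised as if it began a new word.
print(result)  # Mr. Darcy'S Letter
```

This is a small instance of the general advice to inspect the output of any scrubbing step before using it.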

## Practice

Remember that at the very beginning of this notebook we loaded some data (Jane Austen's _Pride and Prejudice_). If you haven't restarted the kernel, it should still be there in `loader.texts`. Use the cell below to test out some scrubbing components of your choice.

In [None]:
# Let's just reference the text with the `text` variable
text = loader.texts[0]

# Practise your scrubbing below
