# Filter Module Tutorial

The `filter` module provides a set of tools for filtering and identifying tokens within spaCy `Doc` objects. Filters allow you to identify specific types of tokens (such as words, Roman numerals, or stop words) and work with the matched results. This is useful for preprocessing, analysis, and text transformation tasks.

## Basic Concepts

Lexos filters are built around the concept of matching tokens based on specific criteria. Each filter:

1. Accepts a spaCy `Doc` object
2. Uses a spaCy `Matcher` to identify tokens matching certain criteria
3. Returns a modified document or provides access to matched tokens

The standard output is either a filtered spaCy `Doc` or lists of matched token IDs, allowing flexible use of the results.

We'll demonstrate the basic procedure for using filters below by importing the `IsWordFilter` class. This class identifies tokens that are words (as opposed to other symbols) and creates a new `Token` attribute `._.is_word` indicating whether the token is a word.

In [None]:
# Import the desired filter class
from lexos.filter import IsWordFilter

# Create a spaCy doc from your text
from lexos.tokenizer import Tokenizer

tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc("Hello, world! This is a test.")

# Create an instance of a filter
word_filter = IsWordFilter()

# Apply the filter to the doc
filtered_doc = word_filter(doc)

# Display whether each token is a word
for token in filtered_doc:
    if token._.is_word:
        print(f"'{token.text}' is a word")
    else:
        print(f"'{token.text}' is not a word")


`IsWordFilter` has a built-in definition of what constitutes a "word": a token that is not whitespace or punctuation and, when the corresponding options are enabled, not a digit or a Roman numeral. This definition may not meet your requirements, so you can customise the behaviour with the following parameters.

- `exclude` (list[str] | str, optional): Patterns to exclude from being considered words (default: [" ", "\n"])
- `exclude_digits` (bool, optional): If True, numeric tokens will not be treated as words (default: False)
- `exclude_roman_numerals` (bool, optional): If True, Roman numerals (capital letters only) will not be treated as words (default: False)
- `exclude_pattern` (list[str] | str, optional): Additional regex patterns to exclude
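
To make the interaction between these options concrete, here is a minimal pure-Python sketch of a comparable "word" test. It is an illustration only, not the Lexos implementation: the function name `looks_like_word`, the punctuation rule, and the crude Roman numeral check are all assumptions made for this example.

```python
import re

# Crude, assumed check: one or more capital letters from the Roman numeral alphabet.
# This is NOT the pattern Lexos uses; it only illustrates the option.
CRUDE_ROMAN = re.compile(r"[IVXLCDM]+")


def looks_like_word(
    text,
    exclude=(" ", "\n"),
    exclude_digits=False,
    exclude_roman_numerals=False,
):
    """Return True if `text` counts as a word under these assumed rules."""
    # Empty strings, excluded strings, and whitespace are never words.
    if not text or text in exclude or text.isspace():
        return False
    # Tokens with no letters or digits (e.g. "," or "!") are treated as punctuation.
    if not any(ch.isalnum() for ch in text):
        return False
    if exclude_digits and text.isdigit():
        return False
    if exclude_roman_numerals and CRUDE_ROMAN.fullmatch(text):
        return False
    return True


tokens = ["Hello", ",", "world", "!", "123", "and", "IV"]
print([t for t in tokens if looks_like_word(t)])
print([t for t in tokens if looks_like_word(t, exclude_digits=True, exclude_roman_numerals=True)])
```

With both flags left at `False`, "123" and "IV" pass the test; enabling the flags removes them. Note that a capitals-only check like this one would also reject the pronoun "I", which is one reason to leave Roman numeral exclusion off for some texts.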

Here is a more complex example:

In [None]:
# Re-import classes if needed
from lexos.tokenizer import Tokenizer
from lexos.filter import IsWordFilter

tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc("Hello, world! 123 and IV are not words.")

# Create a filter that excludes digits but still treats Roman numerals as words
word_filter = IsWordFilter(exclude_digits=True, exclude_roman_numerals=False)
filtered_doc = word_filter(doc)

# Access the matched tokens directly
print(f"Matched tokens: {[token.text for token in word_filter.matched_tokens]}")

# Or use matched_token_ids
print("\nMatched token IDs:", word_filter.matched_token_ids)

# Or check the custom extension on each token
print()
for token in filtered_doc:
    if token._.is_word:
        print(f"'{token.text}' is a word")
    else:
        print(f"'{token.text}' is not a word")

## The `BaseFilter` Class

The `IsWordFilter` inherits from `BaseFilter`, which has the following properties and methods:

### Properties

- `matched_tokens`: Returns a list of tokens that matched the filter criteria
- `matched_token_ids`: Returns a set of token IDs that matched the filter criteria
- `filtered_token_ids`: Returns a set of token IDs that did NOT match the filter criteria
- `filtered_tokens`: Returns a list of tokens that did NOT match the filter criteria

### Methods

- `get_matched_doc()`: Creates a new spaCy `Doc` containing only the matched tokens
- `get_filtered_doc()`: Creates a new spaCy `Doc` containing only the filtered (non-matched) tokens
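
The relationship between the "matched" and "filtered" properties can be sketched in plain Python. This is a simplified model under the assumption that the two ID sets are complements of each other over the document; `partition_ids` is a hypothetical helper, not part of Lexos.

```python
def partition_ids(tokens, predicate):
    """Split token indices into matched and filtered (non-matched) sets."""
    matched = {i for i, tok in enumerate(tokens) if predicate(tok)}
    filtered = set(range(len(tokens))) - matched
    return matched, filtered


tokens = ["Hello", ",", "world", "!"]

# Use "is alphabetic" as a stand-in for the filter's matching criteria.
matched_ids, filtered_ids = partition_ids(tokens, str.isalpha)

# Token lists are recovered from the ID sets in document order.
matched_tokens = [tokens[i] for i in sorted(matched_ids)]
filtered_tokens = [tokens[i] for i in sorted(filtered_ids)]

print(matched_ids, matched_tokens)
print(filtered_ids, filtered_tokens)
```

Keeping IDs as sets makes membership checks cheap, while the token lists preserve the original order.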

You can create your own filters that inherit from `BaseFilter` with:

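```python
# Assumes BaseFilter is importable from lexos.filter, like the other filter classes
from lexos.filter import BaseFilter


class MyCustomFilter(BaseFilter):
    def __init__(self, **kwargs):
        # Forward any keyword arguments to BaseFilter unchanged
        super().__init__(**kwargs)
        # Custom initialization code here
```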

Filtering tokens with `BaseFilter` methods can cause the neighbouring tokens to run together in the new document. You can use the `add_spaces` parameter to insert spaces between tokens in the new document to prevent this.

In [None]:
print(word_filter.get_matched_doc(add_spaces=True))
print(word_filter.get_filtered_doc(add_spaces=True))
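
Why tokens run together can be seen in a small model of how spaCy stores spacing: each token carries its own trailing whitespace (`Token.whitespace_`), so dropping a token also drops the space that followed it. The sketch below assumes the new document reuses each kept token's original trailing whitespace unless `add_spaces` is set; `rebuild_text` is a hypothetical helper, not a Lexos function.

```python
# (text, trailing_whitespace) pairs, mirroring how spaCy records spacing per token
tokens = [("Hello", ""), (",", " "), ("world", ""), ("!", "")]


def rebuild_text(pairs, keep, add_spaces=False):
    """Join the kept tokens, using either original spacing or a forced space."""
    parts = [
        text + (" " if add_spaces else whitespace)
        for text, whitespace in pairs
        if keep(text)
    ]
    return "".join(parts).strip()


print(rebuild_text(tokens, str.isalpha))                   # original spacing only
print(rebuild_text(tokens, str.isalpha, add_spaces=True))  # forced spaces
```

In the first call the space belonged to the removed comma, so "Hello" and "world" are joined without a gap; forcing a space after every kept token avoids this.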

The `filter` module provides two other classes deriving from `BaseFilter`: `IsRomanFilter` and `IsStopwordFilter`. The first identifies Roman numerals, while the second manages stop words. Their use is demonstrated below.

`IsRomanFilter` identifies Roman numerals by matching a specific pattern of capital letters. As a result, lowercase numerals such as "iv" are not detected.

In [None]:
# Import the filter and tokenizer classes
from lexos.filter import IsRomanFilter
from lexos.tokenizer import Tokenizer

# Create a spaCy doc
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc("Chapter IV begins here. Not Roman: iv, but Roman: IV.")

# Create a Roman numeral filter
roman_filter = IsRomanFilter(attr="is_roman")
roman_filter(doc)

# Access matched Roman numerals
roman_numerals = roman_filter.matched_tokens
print(f"Roman Numerals: {[token.text for token in roman_numerals]}")
print(doc[1], doc[1]._.is_roman)  # Should be True for "IV"
print(doc[8], doc[8]._.is_roman)  # Should be False for "iv"
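
For readers curious about what a capitals-only Roman numeral pattern can look like, here is a sketch using a commonly cited regular expression. It is an assumption for illustration; the pattern Lexos actually uses may differ.

```python
import re

# Thousands, hundreds, tens, and units, each in standard subtractive notation.
# The lookahead rejects the empty string, which the rest of the pattern would accept.
ROMAN_NUMERAL = re.compile(
    r"(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_roman(text):
    """Return True if the whole string is an uppercase Roman numeral."""
    return ROMAN_NUMERAL.fullmatch(text) is not None


for candidate in ["IV", "iv", "XIV", "MCMXCIV", "IIII", "Chapter"]:
    print(candidate, is_roman(candidate))
```

Because the pattern is case-sensitive, "iv" fails while "IV" succeeds, matching the limitation described above. Adding `re.IGNORECASE` would accept lowercase numerals, at the cost of more false positives on ordinary words such as "mix" or "dim".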

The `IsStopwordFilter` class manages stop words in a spaCy model. Stop words are common words that are often filtered out during text processing, such as "the", "a", and "is".

!!! important
    This filter modifies the model's default stop words. Changes will apply to any document created with that model unless the model is reloaded.

The class has the following attributes:

- `stopwords`: A list or string containing the stop word(s) to add or remove
- `remove`: If True, the specified stop words will be removed from the model. If False, they will be added (default: False)
- `case_sensitive`: If False, stop word changes apply to all case variations (lowercase, original, and capitalized). If True, only the exact case provided is modified (default: False)
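
The effect of `remove` and `case_sensitive` can be modelled with ordinary Python sets. This sketch is an assumption-based illustration of the behaviour described above, not Lexos code; `stopword_variants` and `update_stopwords` are hypothetical helpers.

```python
def stopword_variants(word, case_sensitive=False):
    """Return the spellings of `word` that a stop word change would touch."""
    if case_sensitive:
        return {word}
    # Lowercase, original, and capitalized forms
    return {word.lower(), word, word.capitalize()}


def update_stopwords(current, words, remove=False, case_sensitive=False):
    """Return a new stop word set with `words` added or removed."""
    if isinstance(words, str):
        words = [words]
    changed = set()
    for word in words:
        changed |= stopword_variants(word, case_sensitive)
    return current - changed if remove else current | changed


stops = {"the", "The", "a"}
print(sorted(update_stopwords(stops, ["quick", "brown"])))
print(sorted(update_stopwords(stops, "the", remove=True)))
print(sorted(update_stopwords(stops, "the", remove=True, case_sensitive=True)))
```

In the last call only the exact spelling "the" is removed, so "The" would still be treated as a stop word.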

Here are some examples:

In [None]:
# Adding Stop Words:

# Python imports
from lexos.filter import IsStopwordFilter
from lexos.tokenizer import Tokenizer

# Create a spaCy doc from your text
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc("The quick brown fox jumps over the lazy dog.")

# Add custom stop words
stopword_filter = IsStopwordFilter()
print("Adding custom stop words 'quick', 'brown' to stop words...")
stopword_filter(doc, stopwords=["quick", "brown"], remove=False)

# Now "quick" and "brown" are marked as stop words
doc2 = tokenizer.make_doc("The quick brown fox")
for token in doc2:
    if token.is_stop:
        print(f"- '{token.text}' is a stop word")

print()

# Removing Stop Words:

# Remove "the" from the stop words (try changing case_sensitive to see the effect)
print("Removing 'the' from stop words...")
stopword_filter(doc, stopwords="the", remove=True, case_sensitive=False)

# Now "the" is no longer marked as a stop word
doc2 = tokenizer.make_doc("The quick brown fox")
for token in doc2:
    if token.is_stop:
        print(f"- '{token.text}' is a stop word")

You can set stop words back to their original state by reloading the spaCy model or by instantiating a new `Tokenizer`:

In [None]:
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc("The quick brown fox jumps over the lazy dog.")
for token in doc:
    if token.is_stop:
        print(f"- '{token.text}' is a stop word")