## `idscrub` custom patterns and methods

### Custom regex patterns
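* To scrub a particular regex pattern on a one-off basis, define it directly in the pipeline. Each key in `patterns` is a label (here `university`) that names the matching column returned by `get_scrubbed_data()`: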

In [1]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

pipeline = [
    {
        "method": "custom_regex",
        "patterns": {"university": {"pattern": r"Lapland", "replacement": "[REDACTED]", "priority": 1.0}},
    }
]

scrubbed_texts = scrub.scrub(pipeline=pipeline)

scrubbed_texts

INFO: Texts loaded.
INFO: Scrubbing using custom_regex with parameters {'patterns': {'university': {'pattern': 'Lapland', 'replacement': '[REDACTED]', 'priority': 1.0}}}...


['Our names are Hamish McDonald, L. Salah, and Elena Suárez.',
 'My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, [REDACTED]']

In [2]:
scrub.get_scrubbed_data()

| | text_id | university |
|---|---|---|
| 0 | 2 | [Lapland] |
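
Before adding a one-off pattern to the pipeline, it can help to check what it matches. The sketch below uses only Python's standard `re` module and does not call `idscrub`; it assumes that `custom_regex` applies patterns with ordinary Python `re` semantics, which is not confirmed by the example above. The shortened `text` and the `strict` variant are illustrative.

```python
import re

# Pattern and replacement copied from the pipeline entry above.
pattern = r"Lapland"
replacement = "[REDACTED]"

# Shortened, illustrative version of the second input text.
text = "I work at 15 Elf Road, AA11 1AA, Lapland"

# List every match together with its character offsets.
matches = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]
print(matches)

# Preview what the text would look like after substitution.
preview = re.sub(pattern, replacement, text)
print(preview)

# A stricter variant: word boundaries stop the pattern from matching
# inside longer words such as "Laplander".
strict = r"\bLapland\b"
strict_matches = re.findall(strict, "Lapland and Laplander")
print(strict_matches)
```

If the unanchored pattern matches more than intended, tightening it with `\b` or a more specific expression before passing it to the pipeline is usually easier than correcting the scrubbed output afterwards.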


### Custom methods

* Completely custom methods can be added using the code pattern below.
* Once you are happy with the method you have developed, you can add it to the main `idscrub` codebase for others to use via a pull request.

#### Regex 
* Regex methods can be built on the internal `find_regex` method, so you only need to supply a pattern. Example:

In [3]:
from types import MethodType

from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, Hamish the Second, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)


def find_hamish_regex(
    self,
    texts: list[str] = None,
    text_ids: list = None,
    replacement: str = "[HAMISH_REGEX]",
    label: str = "hamish_regex",
    priority: float = 0.7,
):
    pattern = r"Hamish"

    return self.find_regex(
        texts=texts, text_ids=text_ids, pattern=pattern, label=label, replacement=replacement, priority=priority
    )


scrub.find_hamish_regex = MethodType(find_hamish_regex, scrub)

scrubbed_text = scrub.scrub(pipeline=[{"method": "find_hamish_regex"}, {"method": "uk_postcodes"}])

scrubbed_text

INFO: Texts loaded.
INFO: Scrubbing using find_hamish_regex with default parameters...
INFO: Scrubbing using uk_postcodes with default parameters...


['Our names are [HAMISH_REGEX] McDonald, [HAMISH_REGEX] the Second, L. Salah, and Elena Suárez.',
 'My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, [POSTCODE], Lapland']

In [4]:
scrub.get_scrubbed_data()

| | text_id | hamish_regex | uk_postcode |
|---|---|---|---|
| 0 | 1 | [Hamish, Hamish] | |
| 1 | 2 | | [AA11 1AA] |
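
The example above attaches the new function with `types.MethodType`. The following sketch shows what that binding does, using a toy class in place of `IDScrub`; the `Scrubber` class and `count_texts` function are hypothetical and exist only for illustration.

```python
from types import MethodType


class Scrubber:
    """Hypothetical stand-in for an object such as IDScrub; not part of idscrub."""

    def __init__(self, texts):
        self.texts = texts


def count_texts(self):
    # `self` is whichever instance the function was bound to.
    return len(self.texts)


a = Scrubber(["one", "two"])
b = Scrubber(["three"])

# Bind the plain function to `a` only. The class and `b` are left untouched.
a.count_texts = MethodType(count_texts, a)

bound_result = a.count_texts()
other_has_method = hasattr(b, "count_texts")

print(bound_result)      # 2
print(other_has_method)  # False
```

Because the method is bound to a single instance, it has to be attached again for each new object. Assigning the function to the class instead would make it available on every instance; whether that suits `idscrub` depends on how the pipeline looks up method names, which this sketch does not assume.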


#### Any custom method

Any other method can be added, as long as it returns a list of `IDEnt` objects. Example:

In [5]:
from types import MethodType

from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, Hamish the Second, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)


def find_first_hamish(
    self,
    texts: list[str] = None,
    text_ids: list = None,
    replacement: str = "[FIRST_HAMISH]",
    label: str = "first_hamish",
    priority: float = 0.7,
    source: str = "custom_hamish_finder",
) -> list:
    idents = []

    for text, text_id in zip(texts, text_ids):
        # Locate only the first occurrence in each text.
        idx = text.find("Hamish")

        # Skip texts that do not contain the name at all.
        if idx == -1:
            continue

        idents.append(
            self.IDEnt(
                text_id=text_id,
                text="Hamish",
                start=idx,
                end=idx + len("Hamish"),
                label=label,  # use the parameter rather than a hard-coded label
                replacement=replacement,
                priority=priority,
                source=source,
            )
        )

    return idents


scrub.find_first_hamish = MethodType(find_first_hamish, scrub)

scrubbed_text = scrub.scrub(pipeline=[{"method": "find_first_hamish"}, {"method": "uk_postcodes"}])

scrubbed_text

INFO: Texts loaded.
INFO: Scrubbing using find_first_hamish with default parameters...
INFO: Scrubbing using uk_postcodes with default parameters...


['Our names are [FIRST_HAMISH] McDonald, Hamish the Second, L. Salah, and Elena Suárez.',
 'My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, [POSTCODE], Lapland']

In [6]:
scrub.get_scrubbed_data()

| | text_id | first_hamish | uk_postcode |
|---|---|---|---|
| 0 | 1 | [Hamish] | |
| 1 | 2 | | [AA11 1AA] |
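
The span logic of a custom method can be developed and tested without `idscrub` installed. The sketch below finds every occurrence of a string rather than only the first, and records the same fields that the example above passes to `IDEnt`. The `Span` dataclass, the `find_all_occurrences` function, and the `all_hamish` label are hypothetical stand-ins, not part of `idscrub`; in a real custom method the spans would be built with `self.IDEnt` instead.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in with the fields passed to IDEnt in the example above."""

    text_id: int
    text: str
    start: int
    end: int
    label: str
    replacement: str
    priority: float
    source: str


def find_all_occurrences(texts, text_ids, needle="Hamish"):
    if not needle:
        raise ValueError("needle must be a non-empty string")

    spans = []
    for text, text_id in zip(texts, text_ids):
        idx = text.find(needle)
        while idx != -1:
            spans.append(
                Span(
                    text_id=text_id,
                    text=needle,
                    start=idx,
                    end=idx + len(needle),
                    label="all_hamish",
                    replacement="[HAMISH]",
                    priority=0.7,
                    source="custom_hamish_finder",
                )
            )
            # Resume searching after this match so matches never overlap.
            idx = text.find(needle, idx + len(needle))
    return spans


texts = [
    "Our names are Hamish McDonald, Hamish the Second, L. Salah, and Elena Suárez.",
    "My number is +441111111111",
]
spans = find_all_occurrences(texts, text_ids=[1, 2])

for span in spans:
    # The offsets should slice back to the matched text.
    assert texts[span.text_id - 1][span.start:span.end] == span.text
    print(span.text_id, span.start, span.end)
```

Checking that each `start`/`end` pair slices back to the matched text is a cheap way to catch off-by-one errors before the method is bound to an `IDScrub` instance.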
