### `idscrub` basic usage examples

In [1]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

scrubbed_texts = scrub.scrub(scrub_methods=["spacy_entities", "uk_phone_numbers", "uk_addresses", "uk_postcodes"])

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing SpaCy entities `PERSON, ORG, NORP` using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00, 33.83it/s]
INFO: 1 org scrubbed.
INFO: 3 person scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 1 uk_phone_number scrubbed.
INFO: Scrubbing addresses using regex...
INFO: 1 uk_address scrubbed.
INFO: Scrubbing postcodes using regex...
INFO: 1 uk_postcode scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I work at [ORG], [ADDRESS], [POSTCODE], Lapland']


In [2]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,person,org,uk_phone_number,uk_address,uk_postcode
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",,,,
1,2,,[the Department for Business and Trade],[+441111111111],[15 Elf Road],[AA11 1AA]
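
The regex-based methods above (`uk_postcodes`, `uk_phone_numbers`, `uk_addresses`) all follow the same match-and-replace idea. As a rough illustration, here is a minimal sketch for postcodes using only the standard `re` module. The pattern is a deliberately simplified assumption, not the one `idscrub` uses, and `scrub_postcodes` is a hypothetical helper; it will miss some valid postcodes and may accept some invalid ones.

```python
import re

# Simplified UK postcode shape (an assumption for illustration only):
# outward code = 1-2 letters, a digit, then an optional letter or digit;
# optional space; inward code = a digit followed by 2 letters.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")


def scrub_postcodes(text, placeholder="[POSTCODE]"):
    """Return the text with postcodes replaced, plus the list of removed values.

    Hypothetical helper; not part of idscrub.
    """
    found = POSTCODE_RE.findall(text)
    return POSTCODE_RE.sub(placeholder, text), found


scrubbed, found = scrub_postcodes("I work at 15 Elf Road, AA11 1AA, Lapland")
print(scrubbed)
print(found)
```

Returning the removed values alongside the scrubbed text mirrors the split between `scrub()` and `get_scrubbed_data()` shown above: one output is safe to share, the other records what was taken out.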


You can also pass `all` to `scrub_methods` to run every method:

In [3]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

scrubbed_texts = scrub.scrub(scrub_methods=["all"])

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing Presidio entities `PERSON, EMAIL_ADDRESS, UK_NINO, UK_NHS, CREDIT_CARD, CRYPTO, MEDICAL_LICENSE, URL, SWIFT_CODE, IBAN_CODE, LOCATION, NRP` using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00,  9.14it/s]
INFO: 3 person scrubbed.
INFO: 1 location scrubbed.
INFO: Scrubbing SpaCy entities `PERSON, ORG, NORP` using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00, 42.62it/s]
INFO: 1 org scrubbed.
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 phone_number scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email_address scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handle scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip_address scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 1 uk_phone_number scrubbed.
INFO: Scrubbing addresses using regex...
INFO: 1 uk_address scrubbed.
INFO: Scrubbing postcodes using regex...
INFO: 1 uk_postcode 

['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I work at [ORG], [ADDRESS], [POSTCODE], [LOCATION]']


In [4]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,person,location,org,uk_phone_number,uk_address,uk_postcode
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",,,,,
1,2,,[Lapland],[Department for Business and Trade],[+441111111111],[15 Elf Road],[AA11 1AA]


### `idscrub` example - chaining methods together

In [5]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

scrub.spacy_entities()
scrub.google_phone_numbers(region="GB")

# Remove specific regex pattern(s). This can also be passed to all().
scrub.custom_regex(
    custom_regex_patterns=[r"Lapland", r"ACHILLES"], custom_replacement_texts=["[UNIVERSITY]", "[REDACTED]"]
)

scrubbed_texts = scrub.all_regex()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing SpaCy entities `PERSON, ORG, NORP` using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00, 42.58it/s]
INFO: 1 org scrubbed.
INFO: 3 person scrubbed.
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 phone_number scrubbed.
INFO: Scrubbing custom regex...
INFO: 1 custom_regex_1 scrubbed.
INFO: 0 custom_regex_2 scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email_address scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handle scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip_address scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 1 uk_phone_number scrubbed.
INFO: Scrubbing addresses using regex...
INFO: 1 uk_address scrubbed.
INFO: Scrubbing postcodes using regex...
INFO: 1 uk_postcode scrubbed.
INFO: Scrubbing titles using regex...
INFO: 0 title scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I work at [ORG], [ADDRESS], [POSTCODE], [UNIVERSITY]']


In [6]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,person,org,custom_regex_1,uk_phone_number,uk_address,uk_postcode
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",,,,,
1,2,,[Department for Business and Trade],[Lapland],[+441111111111],[15 Elf Road],[AA11 1AA]
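
The `custom_regex` step above pairs each pattern with its own replacement text. The sketch below shows that general technique with the standard `re` module. It is not `idscrub`'s implementation: the helper name `scrub_with_patterns` and its return shape are assumptions made for illustration.

```python
import re


def scrub_with_patterns(texts, patterns, replacements):
    """Replace each pattern with its paired replacement and record what was removed.

    Hypothetical helper for illustration; not part of idscrub.
    """
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must have the same length")

    scrubbed = []
    found = []  # one dict per text: {pattern: [matched strings]}
    for text in texts:
        matches_for_text = {}
        for pattern, replacement in zip(patterns, replacements):
            compiled = re.compile(pattern)
            # finditer + group(0) keeps the full match even if the pattern has groups.
            hits = [m.group(0) for m in compiled.finditer(text)]
            if hits:
                matches_for_text[pattern] = hits
            # A function replacement inserts the text literally (no backslash escapes).
            text = compiled.sub(lambda _m, r=replacement: r, text)
        scrubbed.append(text)
        found.append(matches_for_text)
    return scrubbed, found


texts = ["I work at 15 Elf Road, Lapland", "Nothing to see here"]
scrubbed, found = scrub_with_patterns(
    texts,
    patterns=[r"Lapland", r"ACHILLES"],
    replacements=["[UNIVERSITY]", "[REDACTED]"],
)
print(scrubbed)
print(found)
```

Patterns are applied in order, so an earlier replacement can change what a later pattern sees. That is worth keeping in mind when chaining several scrub steps.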


### `idscrub` example - using Presidio
We can also use the entity recognition methods provided by [Presidio](https://microsoft.github.io/presidio/).

In [7]:
from idscrub import IDScrub

scrub = IDScrub(
    ["Our names are Hamish McDonald, L. Salah, and Elena Suárez.", "My IBAN code is GB91BKEN10000041610008"]
)
scrubbed_texts = scrub.presidio_entities()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing Presidio entities `PERSON, UK_NINO, UK_NHS, CREDIT_CARD, CRYPTO, MEDICAL_LICENSE, URL, IBAN_CODE` using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00, 24.36it/s]
INFO: 1 iban_code scrubbed.
INFO: 3 person scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My IBAN code is [IBAN_CODE]']


In [8]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,person,iban_code
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",
1,2,,[GB91BKEN10000041610008]


### `idscrub` example - scrubbing a whole dataframe

In [9]:
import pandas as pd

data = {
    "ID": ["A", "B", "C", "D", "E"],
    "Pride and Prejudice": [
        "Mr. Darcy walked off; and Elizabeth remained with no very cordial feelings toward him.",
        "Mr. Bennet was so odd a mixture of quick parts, sarcastic humour, reserve, and caprice.",
        "Elizabeth's spirits were so high that they could not be damped for long.",
        "The business of her life was to get her daughters married.",
        "She is tolerable; but not handsome enough to tempt me.",
    ],
    "The Adventures of Sherlock Holmes": [
        "To Sherlock Holmes she is always the woman.",
        "You see, but you do not observe.",
        "The world is full of obvious things which nobody by any chance ever observes.",
        "I am a brain, Watson. The rest of me is a mere appendix.",
        "When you have eliminated the impossible, whatever remains, however improbable, must be the truth.",
    ],
    "Frankenstein": [
        "My dear Victor, do not waste your time upon this; it is sad trash.",
        "Learn from me, if not by my precepts, at least by my example.",
        "I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body.",
        "Nothing is more painful to the human mind than a great and sudden change.",
        "Beware; for I am fearless, and therefore powerful.",
    ],
    "Fake book": [
        "The letter to freddie.mercury@queen.com was stamped with SW1A 2AA. His IBAN was GB91BKEN10000041610008.",
        "She forwarded the memo from Mick Jagger and David Bowie to her chief of staff, noting the postcode SW1A 2WH.",
        "The dossier marked confidential came from serena.williams@tennis.com, with SW19 5AE etched in bold across the envelope.",
        "A message arrived just as the Downing Street clock struck midnight.",
        "They did not expected a reply from otis.redding@dockofthebay.org, especially one routed through EH8 8DX.",
    ],
}

df = pd.DataFrame(data)
df

Unnamed: 0,ID,Pride and Prejudice,The Adventures of Sherlock Holmes,Frankenstein,Fake book
0,A,Mr. Darcy walked off; and Elizabeth remained w...,To Sherlock Holmes she is always the woman.,"My dear Victor, do not waste your time upon th...",The letter to freddie.mercury@queen.com was st...
1,B,Mr. Bennet was so odd a mixture of quick parts...,"You see, but you do not observe.","Learn from me, if not by my precepts, at least...",She forwarded the memo from Mick Jagger and Da...
2,C,Elizabeth's spirits were so high that they cou...,The world is full of obvious things which nobo...,"I had worked hard for nearly two years, for th...",The dossier marked confidential came from sere...
3,D,The business of her life was to get her daught...,"I am a brain, Watson. The rest of me is a mere...",Nothing is more painful to the human mind than...,A message arrived just as the Downing Street c...
4,E,She is tolerable; but not handsome enough to t...,"When you have eliminated the impossible, whate...","Beware; for I am fearless, and therefore power...",They did not expected a reply from otis.reddin...


In [10]:
from idscrub import IDScrub

scrubbed_df, scrubbed_data = IDScrub.dataframe(df=df, id_col="ID", exclude_cols=["Frankenstein"], scrub_methods=["all"])

scrubbed_df

  0%|          | 0/3 [00:00<?, ?it/s]
INFO: Texts loaded.
INFO: Scrubbing column `Pride and Prejudice`...
INFO: Scrubbing Presidio entities `PERSON, EMAIL_ADDRESS, UK_NINO, UK_NHS, CREDIT_CARD, CRYPTO, MEDICAL_LICENSE, URL, SWIFT_CODE, IBAN_CODE, LOCATION, NRP` using SpaCy model `en_core_web_trf`...
100%|██████████| 5/5 [00:00<00:00, 23.73it/s]
INFO: 4 person scrubbed.
INFO: Scrubbing SpaCy entities `PERSON, ORG, NORP` using SpaCy model `en_core_web_trf`...
100%|██████████| 5/5 [00:00<00:00, 77.84it/s]
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 phone_number scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email_address scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handle scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip_address scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 0 uk_phone_number scrubbed.
INFO: Scrubbing addresses using regex...
INFO: 0 uk_address scrubbed.
INFO: Scrubbing postcodes

Unnamed: 0,ID,Pride and Prejudice,The Adventures of Sherlock Holmes,Frankenstein,Fake book
0,A,[TITLE]. [PERSON] walked off; and [PERSON] rem...,To [PERSON] she is always the woman.,"My dear Victor, do not waste your time upon th...",The letter to [EMAIL_ADDRESS] was stamped with...
1,B,[TITLE]. [PERSON] was so odd a mixture of quic...,"You see, but you do not observe.","Learn from me, if not by my precepts, at least...",She forwarded the memo from [PERSON] and [PERS...
2,C,[PERSON]'s spirits were so high that they coul...,The world is full of obvious things which nobo...,"I had worked hard for nearly two years, for th...",The dossier marked confidential came from [EMA...
3,D,The business of her life was to get her daught...,"I am a brain, [PERSON]. The rest of me is a me...",Nothing is more painful to the human mind than...,A message arrived just as the [ORG] clock stru...
4,E,She is tolerable; but not handsome enough to t...,"When you have eliminated the impossible, whate...","Beware; for I am fearless, and therefore power...",They did not expected a reply from [EMAIL_ADDR...


In [11]:
scrubbed_data

Unnamed: 0,ID,column,person,title,email_address,iban_code,url,org,uk_postcode
0,A,Pride and Prejudice,"[Darcy, Elizabeth]",[Mr],,,,,
1,B,Pride and Prejudice,[Bennet],[Mr],,,,,
2,C,Pride and Prejudice,[Elizabeth],,,,,,
3,A,The Adventures of Sherlock Holmes,[Sherlock Holmes],,,,,,
4,D,The Adventures of Sherlock Holmes,[Watson],,,,,,
5,A,Fake book,,,[freddie.mercury@queen.com],[GB91BKEN10000041610008],"[freddie.me, queen.com]",,[SW1A 2AA]
6,B,Fake book,"[Mick Jagger, David Bowie]",,,,,,[SW1A 2WH]
7,C,Fake book,,,[serena.williams@tennis.com],,[tennis.com],,[SW19 5AE]
8,E,Fake book,,,[otis.redding@dockofthebay.org],,"[otis.red, dockofthebay.org]",,[EH8 8DX]
9,D,Fake book,,,,,,[Downing Street],
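
In the output above, the `ID` column and the excluded `Frankenstein` column are left unchanged while every other column is scrubbed. The sketch below pictures that column-wise behaviour on a plain dict of lists (the same shape as `data`) rather than a DataFrame. It uses a toy email regex and a hypothetical `scrub_columns` helper; it is not how `IDScrub.dataframe` is implemented.

```python
import re

# Toy email pattern, an assumption for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_columns(table, id_col, exclude_cols=()):
    """Scrub emails from every column except the ID column and excluded columns.

    Hypothetical helper; returns the scrubbed table and a long-format record
    of removed values as (id, column, matches) tuples.
    """
    skip = {id_col, *exclude_cols}
    scrubbed = {}
    removed = []
    for column, values in table.items():
        if column in skip:
            scrubbed[column] = list(values)  # copied through untouched
            continue
        new_values = []
        for row_id, value in zip(table[id_col], values):
            hits = EMAIL_RE.findall(value)
            if hits:
                removed.append((row_id, column, hits))
            new_values.append(EMAIL_RE.sub("[EMAIL_ADDRESS]", value))
        scrubbed[column] = new_values
    return scrubbed, removed


table = {
    "ID": ["A", "B"],
    "Notes": ["Write to freddie.mercury@queen.com today.", "No contact details here."],
    "Untouched": ["Also mentions otis.redding@dockofthebay.org.", "Plain text."],
}
scrubbed_table, removed = scrub_columns(table, id_col="ID", exclude_cols=["Untouched"])
print(scrubbed_table["Notes"])
print(removed)
```

Keying the removed values by ID and column, as `scrubbed_data` does above, makes it possible to trace each removed value back to the cell it came from.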
