### `idscrub` basic usage example

In [1]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I live at AA11 1AA, Lapland.",
    ]
)
scrubbed_texts = scrub.all()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using Presidio...
100%|██████████| 2/2 [00:00<00:00,  9.48it/s]
INFO: 3 presidio person scrubbed.
INFO: 1 presidio location scrubbed.
INFO: Scrubbing names using SpaCy model `en_core_web_trf`...
100%|██████████| 2/2 [00:00<00:00, 55.76it/s]
INFO: 0 spacy person scrubbed.
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 gb phone numbers scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email addresses scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handles scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip addresses scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 1 uk phone numbers scrubbed.
INFO: Scrubbing UK postcodes using regex...
INFO: 1 uk postcodes scrubbed.
INFO: Scrubbing titles using regex...
INFO: 0 titles scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE], [LOCATION].']


In [2]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,scrubbed_presidio_person,scrubbed_presidio_location,scrubbed_uk_phone_numbers,scrubbed_uk_postcodes
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",,,
1,2,,[Lapland],[+441111111111],[AA11 1AA]
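
The regex-based steps in the log above (email addresses, UK postcodes and so on) replace each match with a placeholder and keep a record of what was removed. The sketch below illustrates that general idea with Python's `re` module only. The patterns and the `scrub_with_regex` helper are simplified assumptions made for illustration; they are not `idscrub`'s actual implementation.

```python
import re

# Simplified, illustrative patterns (assumptions, not the ones idscrub uses).
PATTERNS = {
    "[EMAIL_ADDRESS]": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "[POSTCODE]": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}


def scrub_with_regex(text):
    """Replace each match with its placeholder and collect what was removed."""
    removed = {}
    for placeholder, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            removed[placeholder] = matches
            text = pattern.sub(placeholder, text)
    return text, removed


scrubbed, removed = scrub_with_regex("I live at AA11 1AA, Lapland.")
print(scrubbed)
print(removed)
```

Returning the removed values alongside the scrubbed text mirrors the split between `scrubbed_texts` and `get_scrubbed_data()` shown above.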


### `idscrub` example - chaining methods together

In [3]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I live at AA11 1AA, University of Lapland where I am on secret mission ACHILLES.",
    ]
)

scrub.presidio()
scrub.google_phone_numbers(region="GB")
scrub.custom_regex(
    custom_regex_patterns=[r"Lapland", r"ACHILLES"], custom_replacement_texts=["[UNIVERSITY]", "[REDACTED]"]
)  # Replace text matching specific regex pattern(s) with the paired replacement text. This can also be passed to all().
scrubbed_texts = scrub.all_regex()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using Presidio...
100%|██████████| 2/2 [00:00<00:00, 25.84it/s]
INFO: 3 presidio person scrubbed.
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 gb phone numbers scrubbed.
INFO: Scrubbing custom regex...
INFO: 1 custom regex 1 scrubbed.
INFO: 1 custom regex 2 scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email addresses scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handles scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip addresses scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 1 uk phone numbers scrubbed.
INFO: Scrubbing UK postcodes using regex...
INFO: 1 uk postcodes scrubbed.
INFO: Scrubbing titles using regex...
INFO: 0 titles scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I live at [POSTCODE], University of [UNIVERSITY] where I am on secret mission [REDACTED].']


In [4]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,scrubbed_presidio_person,scrubbed_custom_regex_1,scrubbed_custom_regex_2,scrubbed_uk_phone_numbers,scrubbed_uk_postcodes
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",,,,
1,2,,[Lapland],[ACHILLES],[+441111111111],[AA11 1AA]
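
The scrubbed text above reads "University of [UNIVERSITY]" because the pattern `Lapland` matches only that one word. The sketch below uses plain `re.sub` to show how each pattern is paired with its replacement and how a broader pattern changes the result. `apply_custom_patterns` is a hypothetical helper written for this illustration and does not reflect how `idscrub` works internally.

```python
import re


def apply_custom_patterns(text, patterns, replacements):
    """Apply each pattern in order, pairing it with the replacement at the same index."""
    if len(patterns) != len(replacements):
        raise ValueError("Each pattern needs exactly one replacement text.")
    for pattern, replacement in zip(patterns, replacements):
        text = re.sub(pattern, replacement, text)
    return text


text = "I live at University of Lapland where I am on secret mission ACHILLES."

# Same patterns as the cell above: only the word "Lapland" is replaced.
narrow = apply_custom_patterns(text, [r"Lapland", r"ACHILLES"], ["[UNIVERSITY]", "[REDACTED]"])
print(narrow)

# A broader pattern replaces the whole institution name.
broad = apply_custom_patterns(text, [r"University of \w+", r"ACHILLES"], ["[UNIVERSITY]", "[REDACTED]"])
print(broad)
```

If the whole institution name should be hidden, a pattern along the lines of `University of \w+` may be a better fit than the bare place name.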


### `idscrub` example - using Presidio
We can also use the entity recognition methods from [Presidio](https://microsoft.github.io/presidio/) on their own, for example to detect IBAN codes.

In [5]:
from idscrub import IDScrub

scrub = IDScrub(
    ["Our names are Hamish McDonald, L. Salah, and Elena Suárez.", "My IBAN code is GB91BKEN10000041610008"]
)
scrubbed_texts = scrub.presidio()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using Presidio...
100%|██████████| 2/2 [00:00<00:00, 26.18it/s]
INFO: 3 presidio person scrubbed.
INFO: 1 presidio iban code scrubbed.


['Our names are [PERSON], [PERSON], and [PERSON].', 'My IBAN code is [IBAN_CODE]']


In [6]:
scrub.get_scrubbed_data()

Unnamed: 0,text_id,scrubbed_presidio_person,scrubbed_presidio_iban_code
0,1,"[Hamish McDonald, L. Salah, Elena Suárez]",
1,2,,[GB91BKEN10000041610008]


### `idscrub` example - scrubbing a whole dataframe

In [7]:
import pandas as pd

data = {
    "ID": ["A", "B", "C", "D", "E"],
    "Pride and Prejudice": [
        "Mr. Darcy walked off; and Elizabeth remained with no very cordial feelings toward him.",
        "Mr. Bennet was so odd a mixture of quick parts, sarcastic humour, reserve, and caprice.",
        "Elizabeth's spirits were so high that they could not be damped for long.",
        "The business of her life was to get her daughters married.",
        "She is tolerable; but not handsome enough to tempt me.",
    ],
    "The Adventures of Sherlock Holmes": [
        "To Sherlock Holmes she is always the woman.",
        "You see, but you do not observe.",
        "The world is full of obvious things which nobody by any chance ever observes.",
        "I am a brain, Watson. The rest of me is a mere appendix.",
        "When you have eliminated the impossible, whatever remains, however improbable, must be the truth.",
    ],
    "Frankenstein": [
        "My dear Victor, do not waste your time upon this; it is sad trash.",
        "Learn from me, if not by my precepts, at least by my example.",
        "I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body.",
        "Nothing is more painful to the human mind than a great and sudden change.",
        "Beware; for I am fearless, and therefore powerful.",
    ],
    "Fake book": [
        "The letter to freddie.mercury@queen.com was stamped with SW1A 2AA. His IBAN was GB91BKEN10000041610008.",
        "She forwarded the memo from Mick Jagger and David Bowie to her chief of staff, noting the postcode SW1A 2WH.",
        "The dossier marked confidential came from serena.williams@tennis.com, with SW19 5AE etched in bold across the envelope.",
        "A message arrived just as the Downing Street clock struck midnight.",
        "They did not expected a reply from otis.redding@dockofthebay.org, especially one routed through EH8 8DX.",
    ],
}

df = pd.DataFrame(data)
df

Unnamed: 0,ID,Pride and Prejudice,The Adventures of Sherlock Holmes,Frankenstein,Fake book
0,A,Mr. Darcy walked off; and Elizabeth remained w...,To Sherlock Holmes she is always the woman.,"My dear Victor, do not waste your time upon th...",The letter to freddie.mercury@queen.com was st...
1,B,Mr. Bennet was so odd a mixture of quick parts...,"You see, but you do not observe.","Learn from me, if not by my precepts, at least...",She forwarded the memo from Mick Jagger and Da...
2,C,Elizabeth's spirits were so high that they cou...,The world is full of obvious things which nobo...,"I had worked hard for nearly two years, for th...",The dossier marked confidential came from sere...
3,D,The business of her life was to get her daught...,"I am a brain, Watson. The rest of me is a mere...",Nothing is more painful to the human mind than...,A message arrived just as the Downing Street c...
4,E,She is tolerable; but not handsome enough to t...,"When you have eliminated the impossible, whate...","Beware; for I am fearless, and therefore power...",They did not expected a reply from otis.reddin...


In [8]:
from idscrub import IDScrub

scrubbed_df, scrubbed_data = IDScrub.dataframe(df=df, id_col="ID", scrub_methods=["all"])

scrubbed_df

  0%|          | 0/5 [00:00<?, ?it/s]
INFO: Texts loaded.
INFO: Scrubbing using Presidio...
100%|██████████| 5/5 [00:00<00:00, 24.83it/s]
INFO: 4 presidio person scrubbed.
INFO: 4 presidio person scrubbed.
INFO: 4 presidio person scrubbed.
INFO: Scrubbing names using SpaCy model `en_core_web_trf`...
100%|██████████| 5/5 [00:00<00:00, 71.71it/s]
INFO: 0 spacy person scrubbed.
INFO: Scrubbing GB phone numbers using Google's `phonenumbers`...
INFO: 0 gb phone numbers scrubbed.
INFO: Scrubbing email addresses using regex...
INFO: 0 email addresses scrubbed.
INFO: Scrubbing @user handles using regex...
INFO: 0 handles scrubbed.
INFO: Scrubbing IP addresses using regex...
INFO: 0 ip addresses scrubbed.
INFO: Scrubbing phone numbers using regex...
INFO: 0 uk phone numbers scrubbed.
INFO: Scrubbing UK postcodes using regex...
INFO: 0 uk postcodes scrubbed.
INFO: Scrubbing titles using regex...
INFO: 2 titles scrubbed.
 40%|████      | 2/5 [00:02<00:03,  1.25s/it]
INFO: Texts loaded.
INFO: Scrubb... [output truncated]

Unnamed: 0,ID,Pride and Prejudice,The Adventures of Sherlock Holmes,Frankenstein,Fake book
0,A,[TITLE]. [PERSON] walked off; and [PERSON] rem...,To [PERSON] she is always the woman.,"My dear [PERSON], do not waste your time upon ...",The letter to [EMAIL_ADDRESS] was stamped with...
1,B,[TITLE]. [PERSON] was so odd a mixture of quic...,"You see, but you do not observe.","Learn from me, if not by my precepts, at least...",She forwarded the memo from [PERSON] and [PERS...
2,C,[PERSON]'s spirits were so high that they coul...,The world is full of obvious things which nobo...,"I had worked hard for nearly two years, for th...",The dossier marked confidential came from [EMA...
3,D,The business of her life was to get her daught...,"I am a brain, [PERSON]. The rest of me is a me...",Nothing is more painful to the human mind than...,A message arrived just as the Downing Street c...
4,E,She is tolerable; but not handsome enough to t...,"When you have eliminated the impossible, whate...","Beware; for I am fearless, and therefore power...",They did not expected a reply from [EMAIL_ADDR...


In [9]:
scrubbed_data

Unnamed: 0,ID,column,scrubbed_presidio_person,scrubbed_titles,scrubbed_presidio_email_address,scrubbed_presidio_iban_code,scrubbed_presidio_url,scrubbed_uk_postcodes
0,A,Pride and Prejudice,"[Darcy, Elizabeth]",[Mr],,,,
1,B,Pride and Prejudice,[Bennet],[Mr],,,,
2,C,Pride and Prejudice,[Elizabeth],,,,,
3,A,The Adventures of Sherlock Holmes,[Sherlock Holmes],,,,,
4,D,The Adventures of Sherlock Holmes,[Watson],,,,,
5,A,Frankenstein,[Victor],,,,,
6,A,Fake book,,,[freddie.mercury@queen.com],[GB91BKEN10000041610008],"[freddie.me, queen.com]",[SW1A 2AA]
7,B,Fake book,"[Mick Jagger, David Bowie]",,,,,[SW1A 2WH]
8,C,Fake book,,,[serena.williams@tennis.com],,[tennis.com],[SW19 5AE]
9,E,Fake book,,,[otis.redding@dockofthebay.org],,"[otis.red, dockofthebay.org]",[EH8 8DX]
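
In the table above, the `scrubbed_presidio_url` column contains fragments such as `freddie.me` and `queen.com` that come from an email address already listed in `scrubbed_presidio_email_address`. The sketch below shows one possible way to filter out such overlaps when reviewing the results. It assumes each cell holds a plain Python list of strings, which the display suggests but does not confirm, and `drop_url_fragments_of_emails` is a hypothetical helper, not part of `idscrub`.

```python
def drop_url_fragments_of_emails(urls, emails):
    """Keep only URL hits that are not substrings of an email hit from the same row."""
    return [url for url in urls if not any(url in email for email in emails)]


# Values copied from row 6 of the table above.
row = {
    "scrubbed_presidio_email_address": ["freddie.mercury@queen.com"],
    "scrubbed_presidio_url": ["freddie.me", "queen.com"],
}

remaining = drop_url_fragments_of_emails(
    row["scrubbed_presidio_url"], row["scrubbed_presidio_email_address"]
)
print(remaining)
```

This only tidies the review table; the scrubbed text itself is unaffected because the email address has already been replaced as a whole.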
