### `idscrub` basic usage examples

With a default pipeline:

In [1]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

scrubbed_texts = scrub.scrub()

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using presidio_entities with default parameters...
INFO: Scrubbing using spacy_entities with default parameters...
INFO: Scrubbing using email_addresses with default parameters...
INFO: Scrubbing using handles with default parameters...
INFO: Scrubbing using ip_addresses with default parameters...
INFO: Scrubbing using uk_addresses with default parameters...
INFO: Scrubbing using uk_phone_numbers with default parameters...
INFO: Scrubbing using google_phone_numbers with default parameters...
INFO: Scrubbing using uk_postcodes with default parameters...
INFO: Scrubbing using urls with default parameters...
INFO: Scrubbing using titles with default parameters...


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I work at [ORG], [ADDRESS], [POSTCODE], [LOCATION]']


With a custom pipeline:

In [2]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

pipeline = [
    {"method": "presidio_entities", "entity_types": ["PERSON"]},
    {"method": "spacy_entities", "entity_types": ["ORG"]},
    {"method": "google_phone_numbers", "region": "GB"},
    {"method": "titles", "strict": False},
    {"method": "email_addresses"},
    {"method": "handles"},
    {"method": "ip_addresses"},
    {"method": "uk_addresses"},
    {"method": "uk_phone_numbers"},
    {"method": "uk_postcodes"},
    {"method": "urls"},
]

scrubbed_texts = scrub.scrub(pipeline=pipeline)

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using presidio_entities with parameters {'entity_types': ['PERSON']}...
INFO: Scrubbing using spacy_entities with parameters {'entity_types': ['ORG']}...
INFO: Scrubbing using google_phone_numbers with parameters {'region': 'GB'}...
INFO: Scrubbing using titles with parameters {'strict': False}...
INFO: Scrubbing using email_addresses with default parameters...
INFO: Scrubbing using handles with default parameters...
INFO: Scrubbing using ip_addresses with default parameters...
INFO: Scrubbing using uk_addresses with default parameters...
INFO: Scrubbing using uk_phone_numbers with default parameters...
INFO: Scrubbing using uk_postcodes with default parameters...
INFO: Scrubbing using urls with default parameters...


['Our names are [PERSON], [PERSON], and [PERSON].', 'My number is [PHONENO] and I work at [ORG], [ADDRESS], [POSTCODE], Lapland']


In [3]:
scrub.get_scrubbed_data()

| | text_id | person | org | uk_phone_number | uk_address | uk_postcode |
|---|---|---|---|---|---|---|
| 0 | 1 | [Hamish McDonald, L. Salah, Elena Suárez] | | | | |
| 1 | 2 | | [the Department for Business and Trade] | [+441111111111] | [15 Elf Road] | [AA11 1AA] |


### `idscrub` example - priority scoring

If several types of personal data are identified in the same span of text, such as a handle, an email address and a URL, you can give one method a higher `priority` so that its match is the one that gets scrubbed:

In [4]:
from idscrub import IDScrub

scrub = IDScrub(texts=["My email is www.person@person.com"])

scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "handles", "priority": 0.1},
        {"method": "urls", "priority": 0.1},
        {"method": "email_addresses", "priority": 0.2},
    ]
)

print(f"\nAll personal data identified: {[(ident.label, ident.text) for ident in scrub.idents_all]}\n")
print(f"Personal data removed after priority scoring: {[(ident.label, ident.text) for ident in scrub.idents]}\n")
print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using handles with parameters {'priority': 0.1}...
INFO: Scrubbing using urls with parameters {'priority': 0.1}...
INFO: Scrubbing using email_addresses with parameters {'priority': 0.2}...



All personal data identified: [('handle', '@person.com'), ('url', 'www.person@person.com'), ('email_address', 'www.person@person.com')]

Personal data removed after priority scoring: [('email_address', 'www.person@person.com')]

['My email is [EMAIL_ADDRESS]']


To view all of the identified data:

In [5]:
scrub.get_all_identified_data()

| | text_id | text | start | end | label | replacement | priority | source |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | @person.com | 22 | 33 | handle | [HANDLE] | 0.1 | regex |
| 1 | 1 | www.person@person.com | 12 | 33 | url | [URL] | 0.1 | regex |
| 2 | 1 | www.person@person.com | 12 | 33 | email_address | [EMAIL_ADDRESS] | 0.2 | regex |
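
The effect of priority scoring on overlapping matches can be sketched in plain Python. This is only an illustration of the idea, not `idscrub`'s actual implementation: the `Ident` dataclass, `resolve_overlaps` and `apply_replacements` are hypothetical names, and the tie-breaking rule (the longer span wins) is an assumption. The offsets and priorities are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class Ident:
    # Hypothetical container mirroring the columns shown above.
    text: str
    start: int
    end: int
    label: str
    replacement: str
    priority: float


def resolve_overlaps(idents):
    """Keep the highest-priority ident among overlapping spans."""
    kept = []
    # Highest priority first; ties favour the longer span (an assumption).
    for cand in sorted(idents, key=lambda i: (-i.priority, -(i.end - i.start))):
        # Keep the candidate only if it overlaps nothing already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda i: i.start)


def apply_replacements(text, idents):
    # Replace from right to left so earlier offsets stay valid.
    for ident in sorted(idents, key=lambda i: i.start, reverse=True):
        text = text[: ident.start] + ident.replacement + text[ident.end :]
    return text


text = "My email is www.person@person.com"
idents_all = [
    Ident("@person.com", 22, 33, "handle", "[HANDLE]", 0.1),
    Ident("www.person@person.com", 12, 33, "url", "[URL]", 0.1),
    Ident("www.person@person.com", 12, 33, "email_address", "[EMAIL_ADDRESS]", 0.2),
]

idents = resolve_overlaps(idents_all)
scrubbed = apply_replacements(text, idents)

print([(i.label, i.text) for i in idents])  # [('email_address', 'www.person@person.com')]
print(scrubbed)  # My email is [EMAIL_ADDRESS]
```

Under these assumptions, a lower-priority match that overlaps a kept match is dropped entirely rather than trimmed, which agrees with the output shown for this example.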


Note that methods which identify several entity types, such as `spacy_entities` and `presidio_entities`, apply the same priority score to every entity type they detect.

To assign priority scores per entity type, chain the same method more than once, restricting each step to one entity type. For example, to prioritise email addresses over names when using `presidio_entities`:

In [6]:
from idscrub import IDScrub

scrub = IDScrub(["John Smith@mail.com"])

scrubbed_texts = scrub.scrub(
    pipeline=[
        {"method": "presidio_entities", "entity_types": ["PERSON"], "priority": 0.1},
        {"method": "presidio_entities", "entity_types": ["EMAIL_ADDRESS"], "priority": 0.2},
    ]
)

print(scrub.get_all_identified_data())

print(scrubbed_texts)

INFO: Texts loaded.
INFO: Scrubbing using presidio_entities with parameters {'entity_types': ['PERSON'], 'priority': 0.1}...
INFO: Scrubbing using presidio_entities with parameters {'entity_types': ['EMAIL_ADDRESS'], 'priority': 0.2}...


| | text_id | text | start | end | label | replacement | priority | source |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | John Smith@mail.com | 0 | 19 | person | [PERSON] | 0.1 | presidio |
| 1 | 1 | Smith@mail.com | 5 | 19 | email_address | [EMAIL_ADDRESS] | 0.2 | presidio |

['John [EMAIL_ADDRESS]']
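
When many entity types need their own priority, the chained steps can be generated rather than typed by hand. The helper below is a small sketch; `chain_by_entity_type` is a hypothetical name and not part of `idscrub`. It only builds the same list of dictionaries used in the pipeline above.

```python
def chain_by_entity_type(method, priorities):
    """Build one pipeline step per entity type, each with its own priority.

    `priorities` maps an entity type (e.g. "PERSON") to a priority score.
    """
    return [
        {"method": method, "entity_types": [entity_type], "priority": priority}
        for entity_type, priority in priorities.items()
    ]


pipeline = chain_by_entity_type(
    "presidio_entities",
    {"PERSON": 0.1, "EMAIL_ADDRESS": 0.2},
)

for step in pipeline:
    print(step)
```

The resulting `pipeline` is identical to the hand-written one in the cell above, so it could be passed to `scrub.scrub(pipeline=pipeline)` in the same way.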


### `idscrub` example - scrubbing custom regex patterns

In [7]:
from idscrub import IDScrub

scrub = IDScrub(
    [
        "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
        "My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, Lapland",
    ]
)

pipeline = [
    {
        "method": "custom_regex",
        "patterns": {"university": {"pattern": r"Lapland", "replacement": "[UNIVERSITY]", "priority": 1.0}},
    }
]

scrubbed_texts = scrub.scrub(pipeline=pipeline)

scrubbed_texts

INFO: Texts loaded.
INFO: Scrubbing using custom_regex with parameters {'patterns': {'university': {'pattern': 'Lapland', 'replacement': '[UNIVERSITY]', 'priority': 1.0}}}...


['Our names are Hamish McDonald, L. Salah, and Elena Suárez.',
 'My number is +441111111111 and I work at the Department for Business and Trade, 15 Elf Road, AA11 1AA, [UNIVERSITY]']

In [8]:
scrub.get_scrubbed_data()

| | text_id | university |
|---|---|---|
| 0 | 2 | [Lapland] |
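
To make the shape of the `patterns` dictionary concrete, here is a rough stand-alone sketch of what a custom regex step does conceptually, using only Python's `re` module. It is not `idscrub`'s implementation: `apply_custom_regex` is a hypothetical function, and the `priority` key is ignored because only one pattern is involved.

```python
import re


def apply_custom_regex(texts, patterns):
    """Replace every match of each pattern and record what was removed."""
    scrubbed, found = [], []
    for text_id, text in enumerate(texts, start=1):
        for label, spec in patterns.items():
            regex = re.compile(spec["pattern"])
            matches = [m.group(0) for m in regex.finditer(text)]
            if matches:
                found.append({"text_id": text_id, label: matches})
                replacement = spec["replacement"]
                # A function replacement avoids backslash processing in the string.
                text = regex.sub(lambda _match: replacement, text)
        scrubbed.append(text)
    return scrubbed, found


texts = [
    "Our names are Hamish McDonald, L. Salah, and Elena Suárez.",
    "My number is +441111111111 and I work at the Department for Business and Trade, "
    "15 Elf Road, AA11 1AA, Lapland",
]
patterns = {"university": {"pattern": r"Lapland", "replacement": "[UNIVERSITY]", "priority": 1.0}}

scrubbed, found = apply_custom_regex(texts, patterns)
print(scrubbed[1])
print(found)  # [{'text_id': 2, 'university': ['Lapland']}]
```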


### `idscrub` example - scrubbing a whole dataframe

In [9]:
import pandas as pd

data = {
    "ID": ["A", "B", "C", "D", "E"],
    "Pride and Prejudice": [
        "Mr. Darcy walked off; and Elizabeth remained with no very cordial feelings toward him.",
        "Mr. Bennet was so odd a mixture of quick parts, sarcastic humour, reserve, and caprice.",
        "Elizabeth's spirits were so high that they could not be damped for long.",
        "The business of her life was to get her daughters married.",
        "She is tolerable; but not handsome enough to tempt me.",
    ],
    "The Adventures of Sherlock Holmes": [
        "To Sherlock Holmes she is always the woman.",
        "You see, but you do not observe.",
        "The world is full of obvious things which nobody by any chance ever observes.",
        "I am a brain, Watson. The rest of me is a mere appendix.",
        "When you have eliminated the impossible, whatever remains, however improbable, must be the truth.",
    ],
    "Frankenstein": [
        "My dear Victor, do not waste your time upon this; it is sad trash.",
        "Learn from me, if not by my precepts, at least by my example.",
        "I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body.",
        "Nothing is more painful to the human mind than a great and sudden change.",
        "Beware; for I am fearless, and therefore powerful.",
    ],
    "Fake book": [
        "The letter to freddie.mercury@queen.com was stamped with SW1A 2AA. He was British.",
        "She forwarded the memo from Mick Jagger and David Bowie to her chief of staff, noting the postcode SW1A 2WH.",
        "The dossier marked confidential came from serena.williams@tennis.com, with SW19 5AE etched in bold across the envelope.",
        "A message arrived just as the Downing Street clock struck midnight.",
        "They did not expected a reply from otis.redding@dockofthebay.org, especially one routed through EH8 8DX.",
    ],
}

df = pd.DataFrame(data)
df

| | ID | Pride and Prejudice | The Adventures of Sherlock Holmes | Frankenstein | Fake book |
|---|---|---|---|---|---|
| 0 | A | Mr. Darcy walked off; and Elizabeth remained w... | To Sherlock Holmes she is always the woman. | My dear Victor, do not waste your time upon th... | The letter to freddie.mercury@queen.com was st... |
| 1 | B | Mr. Bennet was so odd a mixture of quick parts... | You see, but you do not observe. | Learn from me, if not by my precepts, at least... | She forwarded the memo from Mick Jagger and Da... |
| 2 | C | Elizabeth's spirits were so high that they cou... | The world is full of obvious things which nobo... | I had worked hard for nearly two years, for th... | The dossier marked confidential came from sere... |
| 3 | D | The business of her life was to get her daught... | I am a brain, Watson. The rest of me is a mere... | Nothing is more painful to the human mind than... | A message arrived just as the Downing Street c... |
| 4 | E | She is tolerable; but not handsome enough to t... | When you have eliminated the impossible, whate... | Beware; for I am fearless, and therefore power... | They did not expected a reply from otis.reddin... |


In [10]:
from idscrub import IDScrub

pipeline = [
    {"method": "presidio_entities", "entity_types": ["PERSON", "NRP"]},
    {"method": "spacy_entities", "entity_types": ["ORG"]},
    {"method": "google_phone_numbers", "region": "GB"},
    {"method": "titles", "strict": False},
    {"method": "email_addresses"},
    {"method": "handles"},
    {"method": "ip_addresses"},
    {"method": "uk_addresses"},
    {"method": "uk_phone_numbers"},
    {"method": "uk_postcodes"},
    {"method": "urls"},
]

scrubbed_df, scrubbed_data = IDScrub.dataframe(df=df, id_col="ID", exclude_cols=["Frankenstein"], pipeline=pipeline)

scrubbed_df

  0%|          | 0/3 [00:00<?, ?it/s]
INFO: Texts loaded.
INFO: Scrubbing column `Pride and Prejudice`...
INFO: Scrubbing using presidio_entities with parameters {'entity_types': ['PERSON', 'NRP']}...
INFO: Scrubbing using spacy_entities with parameters {'entity_types': ['ORG']}...
INFO: Scrubbing using google_phone_numbers with parameters {'region': 'GB'}...
INFO: Scrubbing using titles with parameters {'strict': False}...
INFO: Scrubbing using email_addresses with default parameters...
INFO: Scrubbing using handles with default parameters...
INFO: Scrubbing using ip_addresses with default parameters...
INFO: Scrubbing using uk_addresses with default parameters...
INFO: Scrubbing using uk_phone_numbers with default parameters...
INFO: Scrubbing using uk_postcodes with default parameters...
INFO: Scrubbing using urls with default parameters...
 33%|███▎      | 1/3 [00:02<00:04,  2.44s/it]
INFO: Texts loaded.
INFO: Scrubbing column `The Adventures of Sherlock Holmes`...
INFO: Scrubbing us

| | ID | Pride and Prejudice | The Adventures of Sherlock Holmes | Frankenstein | Fake book |
|---|---|---|---|---|---|
| 0 | A | [TITLE]. [PERSON] walked off; and [PERSON] rem... | To [PERSON] she is always the woman. | My dear Victor, do not waste your time upon th... | The letter to [EMAIL_ADDRESS] was stamped with... |
| 1 | B | [TITLE]. [PERSON] was so odd a mixture of quic... | You see, but you do not observe. | Learn from me, if not by my precepts, at least... | She forwarded the memo from [PERSON] and [PERS... |
| 2 | C | [PERSON]'s spirits were so high that they coul... | The world is full of obvious things which nobo... | I had worked hard for nearly two years, for th... | The dossier marked confidential came from [EMA... |
| 3 | D | The business of her life was to get her daught... | I am a brain, [PERSON]. The rest of me is a me... | Nothing is more painful to the human mind than... | A message arrived just as the [ORG] clock stru... |
| 4 | E | She is tolerable; but not handsome enough to t... | When you have eliminated the impossible, whate... | Beware; for I am fearless, and therefore power... | They did not expected a reply from [EMAIL_ADDR... |


In [11]:
scrubbed_data

| | ID | column | person | title | nrp | email_address | uk_postcode | org |
|---|---|---|---|---|---|---|---|---|
| 0 | A | Pride and Prejudice | [Darcy, Elizabeth] | [Mr] | | | | |
| 1 | B | Pride and Prejudice | [Bennet] | [Mr] | | | | |
| 2 | C | Pride and Prejudice | [Elizabeth] | | | | | |
| 3 | A | The Adventures of Sherlock Holmes | [Sherlock Holmes] | | | | | |
| 4 | D | The Adventures of Sherlock Holmes | [Watson] | | | | | |
| 5 | A | Fake book | | | [British] | [freddie.mercury@queen.com] | [SW1A 2AA] | |
| 6 | B | Fake book | [Mick Jagger, David Bowie] | | | | [SW1A 2WH] | |
| 7 | C | Fake book | | | | [serena.williams@tennis.com] | [SW19 5AE] | |
| 8 | D | Fake book | | | | | | [Downing Street] |
| 9 | E | Fake book | | | | [otis.redding@dockofthebay.org] | [EH8 8DX] | |
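
The roles of `id_col` and `exclude_cols` can be illustrated with a dependency-free sketch that walks a table column by column. Everything here is an assumption made for illustration: `scrub_table` is a hypothetical function, the table is a plain dict of lists rather than a DataFrame, the `Untouched` column and its content are made up, and the single email pattern is deliberately simplified. It is not how `IDScrub.dataframe` is implemented.

```python
import re

# Simplified, illustrative email pattern (an assumption, not idscrub's regex).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_table(table, id_col, exclude_cols=()):
    """Scrub every column except the ID column and any excluded columns."""
    skip = {id_col, *exclude_cols}
    scrubbed = {name: list(values) for name, values in table.items()}
    findings = []
    for name, values in table.items():
        if name in skip:
            continue  # Left exactly as in the input.
        for row, value in enumerate(values):
            matches = EMAIL.findall(value)
            if matches:
                # Record what was removed, keyed by row ID and column name.
                findings.append({id_col: table[id_col][row], "column": name, "email_address": matches})
                scrubbed[name][row] = EMAIL.sub("[EMAIL_ADDRESS]", value)
    return scrubbed, findings


table = {
    "ID": ["A", "D"],
    "Fake book": [
        "The letter to freddie.mercury@queen.com was stamped with SW1A 2AA. He was British.",
        "A message arrived just as the Downing Street clock struck midnight.",
    ],
    # Made-up column to show that excluded columns are not touched.
    "Untouched": ["Contact admin@example.org for access.", "Nothing to see here."],
}

scrubbed, findings = scrub_table(table, id_col="ID", exclude_cols=["Untouched"])
print(scrubbed["Fake book"][0])
print(findings)
```

As in the output above, the result has two parts: the scrubbed table, and a record of removed values identified by row ID and column name.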
