A smart PII redactor built for anonymizing unstructured string data.
Muted comes with both a CLI version and a GUI version!
Muted requires a Hugging Face API token. You can set it up in either of the following ways:
Set it as an environment variable:

    export HF_API_KEY="your_hugging_face_api_key_here"

Or use a .env file:

- Create a .env file in the project root.
- Add the following line to the file:

    HF_API_KEY=your_hugging_face_api_key_here

Install the dependencies:

    pip install -r requirements.txt

Run the CLI with an input file and an output file:

    python muted.py data.json clean.json

You can locally host the tool's GUI in the following steps. I used Streamlit to quickly build the app; this is only for the purpose of a quick demonstration, and is not a scalable tech stack for production environments.
Ensure that the dependencies in requirements.txt are already installed.

    cd app
    streamlit run

Or, you can simply try the tool here: Muted! 🤫
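Both ways of supplying the token described above end with HF_API_KEY being readable by the script. The following is a minimal sketch of how a script might resolve it, using only the standard library. The function name `load_hf_api_key` is hypothetical and is not taken from this repo, which may well use a different mechanism (for example a dotenv library).

```python
import os
from pathlib import Path


def load_hf_api_key(env_path=".env"):
    """Return HF_API_KEY from the environment, falling back to a .env file.

    Hypothetical helper for illustration; not the repo's actual loader.
    """
    key = os.environ.get("HF_API_KEY")
    if key:
        return key

    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "HF_API_KEY":
                # Drop optional surrounding quotes around the value.
                return value.strip().strip("\"'")

    raise RuntimeError("HF_API_KEY is not set; export it or add it to .env")
```

The environment variable takes precedence here, so a shell `export` can override whatever is stored in the .env file.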
bert-base-NER is a fine-tuned BERT model (trained by David S. Lim, Stanford) that is ready to use for Named Entity Recognition (NER) and is reported to achieve state-of-the-art performance on NER tasks. It has been trained to recognize four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).
Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
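To make the role of the model concrete, here is a rough sketch of how entity spans could be turned into redactions. It assumes each entity is a dict with `entity_group`, `start`, and `end` character offsets, which is the shape the Hugging Face token-classification pipeline returns when aggregation is enabled. The function name `redact_spans` and the hand-written entity list are illustrative only; they are not taken from this repo and are not real model output.

```python
def redact_spans(text, entities, placeholder="<REDACTED>"):
    """Replace each entity's character span in text with a placeholder.

    Assumes the spans do not overlap.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entities standing in for a model response (not real output).
sample = "Radhika has an office in Banjara Hills."
entities = [
    {"entity_group": "PER", "start": 0, "end": 7},
    {"entity_group": "LOC", "start": 25, "end": 38},
]

print(redact_spans(sample, entities))
# <REDACTED> has an office in <REDACTED>.
```

Because the replacement follows the offsets exactly, a span that covers only part of a word leaves the rest of the word visible.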
- It is unable to completely redact personal information when it is not capitalized or punctuated appropriately. For example, the tool will redact Santosh but not santosh.
- It struggles to identify certain uncommon names and redact them completely. For example, it will redact Vennela as <REDACTED>la. The issue of uncommon names extends to certain locations as well.
- The fix I have used for inappropriate capitalization and punctuation, for now, is normalizing the given string inputs. It does not completely eliminate such issues, but it solves at least part of them. dslim/bert-base-NER works much better when sentences are properly capitalized, as do many pre-trained NER models: they often expect capitalized names and normal sentence casing, and thus miss short, informal patterns even when those are very common.
- The ai4bharat/IndicNER model should theoretically be able to identify Indian names and locations well. However, since it is an older model (last updated in 2022), it is not deployable via the HF Inference API. With a bit more time and effort, either this approach could be pursued, or a different method or a better model could be tried.
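The exact normalization used by the tool is not spelled out here, so the following is only a guess at what such a step might look like: Unicode normalization, whitespace collapsing, and capitalizing the first letter of each sentence. The function name `normalize_text` and the choice of steps are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_text(text):
    """Normalize Unicode forms, whitespace, and sentence-initial casing.

    Illustrative only; the tool's real normalization may differ.
    """
    # Fold compatibility characters (e.g. full-width letters) to standard forms.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Capitalize the first letter of the text and of each sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(normalize_text("  please contact santosh.   he is in   koramangala "))
# Please contact santosh. He is in koramangala
```

Note that names in the middle of a sentence stay lowercase in this sketch, which matches the observation above that normalization solves only part of the problem. Collapsing whitespace also shifts character offsets, so any spans found afterwards refer to the normalized string rather than the original.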
Here is an example of the input:
    [
      {
        "id": 1,
        "text": "Please contact Santosh at santosh@unmute.now regarding the meeting at Koramangala for further steps."
      },
      {
        "id": 2,
        "text": "Radhika said you can call her on +91-98765-43210 before sending the documents to her office in Banjara Hills."
      }
    ]

Here is an example of the generated output:
    [
      {
        "id": 1,
        "cleaned_text": "Please contact <REDACTED>h at <EMAIL_REDACTED> regarding the meeting at <REDACTED> for further steps."
      },
      {
        "id": 2,
        "cleaned_text": "<REDACTED> said you can call her on <PHONE_REDACTED> before sending the documents to her office in <REDACTED>."
      }
    ]

Please feel free to use the input.json file attached in this repo for your testing purposes!
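The <EMAIL_REDACTED> and <PHONE_REDACTED> tags in the output suggest that emails and phone numbers are handled separately from names and places. How the tool does this is not documented here; the sketch below assumes simple regular expressions and maps records from the input shape (`id`, `text`) to the output shape (`id`, `cleaned_text`). The patterns and the names `redact_patterns` and `clean_records` are illustrative assumptions, and the NER step is deliberately left out, so names and locations pass through untouched.

```python
import json
import re

# Deliberately loose patterns for illustration; real-world inputs need more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_patterns(text):
    """Replace email addresses and phone-like digit runs with tags."""
    # Emails first, so digits inside an address are not treated as a phone number.
    text = EMAIL_RE.sub("<EMAIL_REDACTED>", text)
    return PHONE_RE.sub("<PHONE_REDACTED>", text)


def clean_records(records):
    """Map {"id", "text"} records to {"id", "cleaned_text"} records."""
    return [
        {"id": r["id"], "cleaned_text": redact_patterns(r["text"])}
        for r in records
    ]


records = json.loads("""
[
  {"id": 1, "text": "Please contact Santosh at santosh@unmute.now regarding the meeting."},
  {"id": 2, "text": "Radhika said you can call her on +91-98765-43210 before sending the documents."}
]
""")

print(json.dumps(clean_records(records), indent=2))
```

The phone pattern will also match other long digit runs, such as order numbers, so a stricter pattern or a dedicated phone-number library would be a sensible next step.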
