Muted! 🤫

A smart PII redactor built for anonymizing unstructured string data.

How to run?

Muted comes with both a CLI version and a GUI version!

CLI Version

Muted requires a Hugging Face API token. You can set it up in either of the following ways:

- Using an Environment Variable

export HF_API_KEY="your_hugging_face_api_key_here"

- Using a `.env` file

Create a `.env` file in the project root and add the following line to it:

HF_API_KEY=your_hugging_face_api_key_here
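For reference, the two options above could be resolved in code roughly as follows. This is a minimal sketch, not necessarily how muted.py reads the token: the function name `load_hf_api_key` is hypothetical, and the `.env` parsing is deliberately simple (a library such as python-dotenv is a common alternative).

```python
import os
from pathlib import Path


def load_hf_api_key(env_path: str = ".env") -> str:
    """Return HF_API_KEY from the environment, falling back to a .env file.

    Hypothetical helper for illustration; not taken from muted.py.
    """
    # Option 1: the exported environment variable wins if it is set.
    key = os.environ.get("HF_API_KEY")
    if key:
        return key

    # Option 2: look for a HF_API_KEY=... line in the .env file.
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "HF_API_KEY":
                # Tolerate optional surrounding quotes.
                return value.strip().strip('"').strip("'")

    raise RuntimeError("HF_API_KEY is not set in the environment or in .env")
```

Checking the environment first means an exported variable overrides the file, which is convenient when testing with a different token.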

1. Install the required dependencies.

pip install -r requirements.txt

2. Run the following command in your terminal.

python muted.py data.json clean.json
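The command takes an input file and an output file. As a rough sketch of that flow, assuming the input is a list of objects with `id` and `text` and the output uses `id` and `cleaned_text` (as in the sample further down), the file handling could look like this. The names `clean_file` and `redact` are hypothetical and not taken from muted.py.

```python
import json
from typing import Callable


def clean_file(input_path: str, output_path: str, redact: Callable[[str], str]) -> None:
    """Read records from input_path, redact each text, write to output_path.

    `redact` is any function that maps a raw string to a redacted string.
    """
    with open(input_path, encoding="utf-8") as f:
        records = json.load(f)

    # Keep the id so that cleaned records can be matched to the originals.
    cleaned = [
        {"id": record["id"], "cleaned_text": redact(record["text"])}
        for record in records
    ]

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, indent=4, ensure_ascii=False)
```

Passing the redaction step in as a function keeps the file handling separate from the model call, so the two can be tested independently.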

GUI Version

You can host the tool's GUI locally with the following steps. I used Streamlit to build the app quickly; it is intended only as a quick demonstration and is not a scalable tech stack for production environments.

Ensure that you have already installed the dependencies in requirements.txt.

1. Navigate to the `app` folder in the repo using the terminal.

cd app

2. Run the app.

streamlit run

Or, you can simply try the tool here: Muted! 🤫

Why did I pick `bert-base-NER`?

bert-base-NER is a fine-tuned BERT model (trained by David S. Lim, Stanford) that is ready to use for Named Entity Recognition and achieves state-of-the-art performance on NER tasks. It has been trained to recognize four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).

Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

What are some limitations of the tool?

It is unable to completely redact personal information that is not capitalized appropriately. For example, the tool will redact Santosh but not santosh.
It struggles to identify certain uncommon names and redact them fully. For example, it will redact Vennela as <REDACTED>la. The issue of uncommon names extends to certain locations as well.
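Partial redactions such as <REDACTED>la are likely a side effect of BERT splitting rare words into subword pieces, with only some pieces tagged as an entity. One possible mitigation, sketched below under the assumption that the model's entities are available as character offsets, is to widen each span to the nearest word boundaries before replacing it. The function names are hypothetical and this is not the tool's current behaviour.

```python
from typing import List, Tuple


def expand_to_word(text: str, start: int, end: int) -> Tuple[int, int]:
    """Widen [start, end) until it covers the whole surrounding word."""
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end


def redact_spans(text: str, spans: List[Tuple[int, int]],
                 placeholder: str = "<REDACTED>") -> str:
    """Replace each (start, end) span with the placeholder, word-aligned."""
    expanded = sorted(expand_to_word(text, s, e) for s, e in spans)

    # Merge spans that overlap or touch after expansion.
    merged: List[Tuple[int, int]] = []
    for s, e in expanded:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))

    # Replace from right to left so earlier offsets stay valid.
    for s, e in reversed(merged):
        text = text[:s] + placeholder + text[e:]
    return text


# The span (15, 20) covers only "Venne"; expansion covers the full name.
print(redact_spans("Please contact Vennela today", [(15, 20)]))
# prints: Please contact <REDACTED> today
```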

How to potentially deal with the tool's limitations?

The fix I have used for inconsistent capitalization, for now, is normalizing the input strings. This does not eliminate the issue completely, but it solves at least part of it. Like many pre-trained NER models, dslim/bert-base-NER works much better on properly capitalized text: such models expect capitalized names and normal sentence casing, so they often miss names in short, informal, lowercase text, even though such text is very common.
The ai4bharat/IndicNER model should, in theory, identify Indian names and locations well. However, since it is an older model (last updated in 2022), it cannot be deployed via the HF Inference API. With a bit more time and effort, this approach could still be pursued, or a different method or a better model could be tried.
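As an illustration of input normalization, here is one possible approach: capitalize the first letter of every word before sending the text to the model. This is a sketch under assumptions and not necessarily the normalization muted.py applies; the function name `normalize_casing` is hypothetical.

```python
import re


def normalize_casing(text: str) -> str:
    """Uppercase the first letter of every alphabetic word.

    The output has the same length as the input, so character offsets
    found on the normalized text also apply to the original text.
    """
    def capitalize(match: "re.Match[str]") -> str:
        word = match.group(0)
        first = word[0].upper()
        # A few characters expand when uppercased; leave those unchanged
        # so the string length stays stable.
        if len(first) != 1:
            first = word[0]
        return first + word[1:]

    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    return re.sub(r"[^\W\d_]+", capitalize, text)


print(normalize_casing("please contact santosh"))
# prints: Please Contact Santosh
```

Because the length is preserved, the model can run on the normalized string while the redaction is applied to the original one. Capitalizing every word may also cause the model to tag more non-names, so the trade-off is worth checking on real data.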

Sample JSON Inputs and Outputs

Here is an example of the input:

[
  {
    "id": 1,
    "text": "Please contact Santosh at santosh@unmute.now regarding the meeting at Koramangala for further steps."
  },
  {
    "id": 2,
    "text": "Radhika said you can call her on +91-98765-43210 before sending the documents to her office in Banjara Hills."
  }
]

Here is an example of the generated output:

[
    {
        "id": 1,
        "cleaned_text": "Please contact <REDACTED>h at <EMAIL_REDACTED> regarding the meeting at <REDACTED> for further steps."
    },
    {
        "id": 2,
        "cleaned_text": "<REDACTED> said you can call her on <PHONE_REDACTED> before sending the documents to her office in <REDACTED>."
    }
]

Please feel free to use the input.json file attached in this repo for your testing purposes!
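The <EMAIL_REDACTED> and <PHONE_REDACTED> placeholders in the output above suggest pattern-based handling for emails and phone numbers, separate from the NER model. A minimal sketch of that idea follows, assuming simple regular expressions; the patterns and the function name `redact_patterns` are illustrative and are not taken from muted.py.

```python
import re

# Illustrative patterns only; real-world emails and phone numbers vary widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_patterns(text: str) -> str:
    """Replace emails and phone numbers with typed placeholders."""
    text = EMAIL_RE.sub("<EMAIL_REDACTED>", text)
    text = PHONE_RE.sub("<PHONE_REDACTED>", text)
    return text


print(redact_patterns("Radhika said you can call her on +91-98765-43210 before sending the documents."))
# prints: Radhika said you can call her on <PHONE_REDACTED> before sending the documents.
```

Names and locations are left untouched here, since those are what the NER model is for.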

PS: I have used `venv` for my virtual environment, since that is what I have been most comfortable with.
