Add Batch Decryption #1617

Open
@tylerlittlefield

Description

Is your feature request related to a problem? Please describe.

I can encrypt in batch with BatchAnonymizerEngine, but I cannot figure out how to decrypt the results in batch.

Describe the solution you'd like

I would like to be able to decrypt in batch.

Describe alternatives you've considered
I initially went row by row, but that was too time-consuming.

Additional context

What I have so far:

import pandas as pd
import pyarrow.dataset as ds
from dotenv import load_dotenv
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
from presidio_anonymizer import BatchAnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig

# Load environment variables from .env file
load_dotenv()

# Open the Parquet dataset from S3
dataset = ds.dataset("s3://my-large-dataset.parquet", format="parquet")

# Take a subset of the dataset
scanner = dataset.scanner(columns=["free_text"])
table = scanner.head(100000)
df = table.to_pandas().reset_index(drop=True)
df = df["free_text"].drop_duplicates().reset_index(drop=True)

# DataFrame to dict
df_dict = {"free_text": df.tolist()}

# Analyze
analyzer = AnalyzerEngine()
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)
analyzer_results = batch_analyzer.analyze_dict(df_dict, language="en")
analyzer_results = list(analyzer_results)

# Encrypt ("SOME KEY" is a placeholder; the encrypt operator expects a 128-, 192-, or 256-bit key)
anonymizer_config = {"DEFAULT": OperatorConfig("encrypt", {"key": "SOME KEY"})}
batch_anonymizer = BatchAnonymizerEngine()
anonymizer_results = batch_anonymizer.anonymize_dict(analyzer_results, operators=anonymizer_config)
scrubbed_df = pd.DataFrame(anonymizer_results)
scrubbed_df
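
One possible workaround, sketched under assumptions rather than offered as a built-in presidio feature: DeanonymizeEngine.deanonymize needs the position of each encrypted span, and as far as I can tell anonymize_dict returns only the anonymized text. The sketch below therefore assumes the per-text items (the OperatorResult list available as `.items` on the result of AnonymizerEngine.anonymize) were kept alongside each encrypted string. The helper names `make_presidio_decryptor`, `decrypt_batch`, and `decrypt_dict` are hypothetical and not part of presidio.

```python
from typing import Any, Callable, Dict, List, Sequence

# Signature of a per-text decrypt function: (encrypted_text, items) -> plain text.
DecryptOne = Callable[[str, Sequence[Any]], str]


def make_presidio_decryptor(key: str) -> DecryptOne:
    """Build a per-text decrypt function backed by presidio's DeanonymizeEngine.

    Hypothetical helper. `key` must be the same key used for encryption.
    """
    # Imported lazily so the batching helpers below work without presidio installed.
    from presidio_anonymizer import DeanonymizeEngine
    from presidio_anonymizer.entities import OperatorConfig

    engine = DeanonymizeEngine()
    operators = {"DEFAULT": OperatorConfig("decrypt", {"key": key})}

    def decrypt_one(text: str, items: Sequence[Any]) -> str:
        # `items` is assumed to be the OperatorResult list taken from
        # AnonymizerEngine.anonymize(...).items for this exact text.
        result = engine.deanonymize(text=text, entities=list(items), operators=operators)
        return result.text

    return decrypt_one


def decrypt_batch(
    texts: Sequence[str],
    items_per_text: Sequence[Sequence[Any]],
    decrypt_one: DecryptOne,
) -> List[str]:
    """Decrypt a list of texts, pairing each text with its own items."""
    if len(texts) != len(items_per_text):
        raise ValueError("texts and items_per_text must have the same length")
    # Texts with no encrypted spans are passed through unchanged.
    return [
        decrypt_one(text, items) if items else text
        for text, items in zip(texts, items_per_text)
    ]


def decrypt_dict(
    columns: Dict[str, Sequence[str]],
    items: Dict[str, Sequence[Sequence[Any]]],
    decrypt_one: DecryptOne,
) -> Dict[str, List[str]]:
    """Column-wise variant mirroring the dict shape passed to anonymize_dict."""
    return {
        name: decrypt_batch(texts, items[name], decrypt_one)
        for name, texts in columns.items()
    }
```

With presidio installed, usage might look like `decrypt_dict(encrypted_columns, saved_items, make_presidio_decryptor("SOME KEY"))`, where `encrypted_columns` and `saved_items` are hypothetical names for the stored outputs. This still decrypts one text at a time internally; it only wraps the loop, so it does not by itself address the speed concern.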
