InAadhaarRecognizer not working as expected #1632

Open
@rajatm91

Describe the bug
I'm using presidio-analyzer to identify Indian Aadhaar card numbers in different texts. InAadhaarRecognizer recognizes the number when it is written as xxxxxxxxxxxx, but not when it is written as xxxx-xxxx-xxxx or xxxx xxxx xxxx. Both of these are standard ways of representing an Aadhaar number.

To Reproduce

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configuration for the NLP engine
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}]
}

# Create NLP engine
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Plug NLP engine into Analyzer
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])

# Test input
text = "My name is Rajat Mishra and my email is rajat@example.com and PAN is xxxx-xxxx-xxxx"
results = analyzer.analyze(text=text, entities=[], language="en")

for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score={result.score:.2f})")

I have replaced the actual Aadhaar number with xxxx-xxxx-xxxx.

Expected behavior
It should detect all three formats of representing an Aadhaar number.
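
To make the expected behavior concrete, here is a minimal sketch of a pattern that accepts all three formats. It uses only the Python standard library and is not taken from presidio's InAadhaarRecognizer. The names AADHAAR_PATTERN and find_aadhaar_candidates are hypothetical, and the sketch only checks the shape of the number (12 digits, with the same optional separator between groups), not whether the number is actually valid.

```python
import re

# Hypothetical pattern, not presidio's own: 12 digits in three groups of four,
# optionally separated by a single space or hyphen. The backreference \2
# requires the same separator (or none) between both pairs of groups.
AADHAAR_PATTERN = re.compile(r"(?<!\d)(\d{4})([ -]?)(\d{4})\2(\d{4})(?!\d)")


def find_aadhaar_candidates(text):
    """Return (start, end, normalized_digits) for each candidate in text."""
    candidates = []
    for match in AADHAAR_PATTERN.finditer(text):
        # Drop the separators so every format normalizes to 12 plain digits.
        digits = match.group(1) + match.group(3) + match.group(4)
        candidates.append((match.start(), match.end(), digits))
    return candidates


if __name__ == "__main__":
    # Dummy digit strings used only to exercise the three formats.
    for sample in ("123456789012", "1234-5678-9012", "1234 5678 9012"):
        print(find_aadhaar_candidates(f"Aadhaar is {sample}"))
```

If a regex along these lines is suitable, it could presumably be supplied to the analyzer through a custom pattern-based recognizer as a workaround until the built-in recognizer handles the separated formats.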
