Description
Describe the bug
I'm using presidio-analyzer to identify Indian Aadhaar card numbers in different texts. InAadhaarRecognizer recognizes the number when the format is xxxxxxxxxxxx, but fails to identify it when the format is xxxx-xxxx-xxxx or xxxx xxxx xxxx. Both of these are standard ways of representing an Aadhaar number.
To Reproduce
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
# Configuration for the NLP engine
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}
# Create NLP engine
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# Plug NLP engine into Analyzer
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])
# Test input
text = "My name is Rajat Mishra and my email is rajat@example.com and PAN is xxxx-xxxx-xxxx"
results = analyzer.analyze(text=text, entities=[], language="en")
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score={result.score:.2f})")
In the snippet above, I have replaced the actual Aadhaar number with xxxx-xxxx-xxxx.
Expected behavior
The recognizer should detect all three formats of an Aadhaar number: xxxxxxxxxxxx, xxxx-xxxx-xxxx, and xxxx xxxx xxxx.
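To make the expected behavior concrete, here is a minimal standalone sketch using only Python's `re` module. The pattern and the helper name `find_aadhaar_like` are my own assumptions for illustration, not Presidio's actual implementation. It only checks the shape of the number (12 digits, optionally split into groups of four by a hyphen or a space) and does no checksum validation. The digits in the samples are made up.

```python
import re
from typing import List

# Assumed pattern (not taken from Presidio): three groups of four digits,
# each optionally separated by a single hyphen or space. The lookarounds
# reject candidates embedded in a longer run of digits.
AADHAAR_LIKE = re.compile(r"(?<!\d)\d{4}[- ]?\d{4}[- ]?\d{4}(?!\d)")


def find_aadhaar_like(text: str) -> List[str]:
    """Return Aadhaar-shaped candidates with separators stripped (hypothetical helper)."""
    return [re.sub(r"[- ]", "", m.group()) for m in AADHAAR_LIKE.finditer(text)]


# Dummy numbers in the three formats described in this issue.
samples = [
    "Aadhaar is 123456789012",
    "Aadhaar is 1234-5678-9012",
    "Aadhaar is 1234 5678 9012",
]

for sample in samples:
    print(sample, "->", find_aadhaar_like(sample))
```

All three samples normalize to the same 12-digit string, which is the behavior I would expect from the recognizer as well. Note that this sketch also accepts mixed separators such as xxxx-xxxx xxxx, which may or may not be desirable.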