A small Python utility that extracts <doc-number> values from XML files.
It gives priority to epo-format documents, followed by patent-office ones.
Setup and usage
- Clone the repository: git clone https://github.com/reenaturak/xml_extractor.git
- Change into the project directory: cd xml_extractor
- Install dependencies: uv sync
- Run the extractor: uv run python -m xml_package.extractor
Assumptions
- The file type is one of .xml, .txt, .data, .zip, or .html
- The file contains valid XML, encoded as UTF-8
- The XML has a single root element
- There can be multiple application-reference elements
- Every document-id element contains exactly one doc-number child element
- document-id elements may have a format attribute and/or a load-source attribute; either can indicate source type.
- Format is consistently spelled as "docdb" and "original"
- Tags may have XML namespaces
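
Under the input assumptions listed above, the extraction could be sketched roughly as follows. This is an illustration, not the code in xml_package.extractor: the names PRIORITY, local_name and extract_doc_numbers are hypothetical, the priority order (epo, then patent-office, then anything else) is inferred from the description at the top, and it assumes document-id elements sit inside application-reference elements.

```python
import xml.etree.ElementTree as ET

# Assumed priority order; any other source type sorts after these two.
PRIORITY = {"epo": 0, "patent-office": 1}


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so tags match with or without namespaces."""
    return tag.rsplit("}", 1)[-1]


def extract_doc_numbers(xml_text: str) -> list[str]:
    """Return doc-number values ordered by the assumed source priority."""
    root = ET.fromstring(xml_text)
    found = []
    for app_ref in root.iter():
        if local_name(app_ref.tag) != "application-reference":
            continue
        for doc_id in app_ref.iter():
            if local_name(doc_id.tag) != "document-id":
                continue
            # Either attribute may carry the source type.
            source = doc_id.get("format") or doc_id.get("load-source") or ""
            rank = PRIORITY.get(source.strip().lower(), len(PRIORITY))
            for child in doc_id:
                if local_name(child.tag) == "doc-number" and child.text:
                    found.append((rank, child.text.strip()))
                    break  # one doc-number per document-id is assumed
    # sorted() is stable, so document order is kept within each rank.
    return [number for _, number in sorted(found, key=lambda pair: pair[0])]
```

Matching on the local tag name keeps the same code path for namespaced and non-namespaced files, at the cost of ignoring which namespace a tag belongs to.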
To-dos for a real-world scenario
- Set a default value for doc-number if it is empty or not found, before storing it or passing it downstream
- Validate against unacceptable or corrupt values, such as special characters or alphanumeric strings
- Validate the length of the doc-number
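
The to-dos above could be covered by a small normalization step along these lines. This is only a sketch: the function name normalize_doc_number, the placeholder DEFAULT_DOC_NUMBER, the digits-only rule and the length bounds are all hypothetical and would need to be agreed with whoever consumes the data.

```python
from typing import Optional

DEFAULT_DOC_NUMBER = "UNKNOWN"   # hypothetical placeholder for missing values
MIN_LENGTH, MAX_LENGTH = 1, 15   # hypothetical length bounds


def normalize_doc_number(raw: Optional[str]) -> str:
    """Apply a default, then reject values that are non-numeric or out of range."""
    value = (raw or "").strip()
    if not value:
        # Empty or missing: fall back to the default before storing.
        return DEFAULT_DOC_NUMBER
    if not (value.isascii() and value.isdigit()):
        # Assumes a valid doc-number is made of ASCII digits only.
        raise ValueError(f"doc-number contains non-digit characters: {value!r}")
    if not MIN_LENGTH <= len(value) <= MAX_LENGTH:
        raise ValueError(f"doc-number has unexpected length {len(value)}: {value!r}")
    return value
```

Raising on bad input keeps corrupt values out of downstream systems; a pipeline that must not stop could log the error and substitute the default instead.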