XML Extractor

A small Python utility that extracts <doc-number> values from XML files. It returns values from epo-format documents first, followed by patent-office ones.

🔧 Installation and Run

Clone the repository: git clone https://github.com/reenaturak/xml_extractor.git
Enter the project directory: cd xml_extractor
Install dependencies: uv sync
Run: uv run python -m xml_package.extractor

Assumptions

File type is one of .xml, .txt, .data, .zip, or .html
File contains valid XML content, encoded as UTF-8
The XML has a single root element
There can be multiple application-reference elements
Every element contains exactly one child element
document-id elements may have a format attribute and/or a load-source attribute; either can indicate source type.
Format is consistently spelled as "docdb" and "original"
Tags may have XML namespaces
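
The assumptions above can be illustrated with a minimal sketch. This is not the code in xml_package.extractor; the function names, the PRIORITY tuple, and the attribute values checked ("epo", "patent-office") are assumptions made for illustration, based only on the description in this README.

```python
import xml.etree.ElementTree as ET

# Assumed priority order: epo first, then patent-office (hypothetical constant).
PRIORITY = ("epo", "patent-office")


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_doc_numbers(xml_text):
    """Return doc-number values, highest-priority source first."""
    root = ET.fromstring(xml_text)
    ranked = []
    for app_ref in root.iter():
        if local_name(app_ref.tag) != "application-reference":
            continue
        for doc_id in app_ref.iter():
            if local_name(doc_id.tag) != "document-id":
                continue
            # Either attribute may indicate the source type.
            markers = {doc_id.get("format"), doc_id.get("load-source")}
            rank = next(
                (i for i, name in enumerate(PRIORITY) if name in markers),
                len(PRIORITY),  # unknown sources sort last
            )
            for child in doc_id:
                if local_name(child.tag) == "doc-number" and child.text:
                    ranked.append((rank, child.text.strip()))
    # sorted() is stable, so document order is kept within each rank.
    return [number for _, number in sorted(ranked, key=lambda pair: pair[0])]
```

Matching on the local tag name keeps the lookup working whether or not the tags carry a namespace.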

To-dos for a real-world scenario

Set a default value for doc-number if it is empty or not found, before storing it or passing it downstream
Validate against unacceptable or corrupt values, such as special characters or alphanumeric strings
Validate the length of the doc-number
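
A rough sketch of how these to-dos might look follows. The default value, the maximum length, and the digits-only rule are placeholder assumptions, not decisions made by this project, and normalize_doc_number is a hypothetical helper name.

```python
import re

DEFAULT_DOC_NUMBER = "UNKNOWN"  # assumed placeholder default
MAX_LENGTH = 20                 # assumed upper bound on length


def normalize_doc_number(raw, default=DEFAULT_DOC_NUMBER, max_length=MAX_LENGTH):
    """Return a cleaned doc-number, the default if missing, or raise ValueError."""
    # Empty or missing values fall back to the default.
    if raw is None or not raw.strip():
        return default
    value = raw.strip()
    # Assumption: only ASCII digits are acceptable.
    if not re.fullmatch(r"[0-9]+", value):
        raise ValueError(f"doc-number contains invalid characters: {value!r}")
    if len(value) > max_length:
        raise ValueError(f"doc-number longer than {max_length} characters: {value!r}")
    return value
```

Raising on invalid input is one option; a real pipeline might instead log the value and substitute the default.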
