An agentic pipeline that simulates human browsing behaviour to collect clinical papers, normalizes the content, and extracts structured patient records and diagnoses.
- MCP-inspired browsing agent with randomized delays and retry logic
- HTML/PDF normalization pipeline powered by BeautifulSoup and pdfminer
- Information extraction agent with pluggable LLM clients (heuristic offline default)
- Diagnosis validation to highlight low-confidence outputs
- Knowledge store that writes provenance-rich JSON artefacts
- Async CLI for batch processing lists of URLs
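
The randomized delays and retry logic listed above can be sketched as follows. This is a minimal illustration, not the project's actual agent: the function name `fetch_with_retry`, its parameters, and the injected `fetch` coroutine are assumptions made for this example.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[str]],
    url: str,
    *,
    max_attempts: int = 3,
    min_delay: float = 0.5,
    max_delay: float = 2.0,
) -> str:
    """Call `fetch(url)`, pausing a random interval before every attempt.

    `fetch` is any coroutine function that returns the page body; it is
    injected here (an assumption of this sketch) so the retry logic can be
    exercised without network access.
    """
    last_error = None
    for _ in range(max_attempts):
        # Randomized pause so requests are not evenly spaced.
        await asyncio.sleep(random.uniform(min_delay, max_delay))
        try:
            return await fetch(url)
        except Exception as exc:  # a real agent would catch narrower errors
            last_error = exc
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

Passing the fetcher in as an argument keeps the delay and retry policy separate from the HTTP client, so either can be swapped or tested independently.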
    pip install -e .[dev]

    python -m extractor.cli https://example.com/paper1 https://example.com/paper2

or provide a file with URLs:

    python -m extractor.cli --urls-file urls.txt --output-dir outputs

The CLI writes JSON files with extracted data into the chosen output directory.
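
To illustrate what writing a provenance-carrying JSON file into an output directory might look like, here is a small sketch. The function `write_record`, the field names (`source_url`, `retrieved_at`, `extracted`), and the hash-based file naming are all assumptions for illustration; the project's real schema and naming may differ.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_record(output_dir: Path, source_url: str, extracted: dict) -> Path:
    """Write one extraction result plus provenance metadata as JSON."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a short, stable hash of the source URL.
    name = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    payload = {
        "source_url": source_url,  # where the content came from
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "extracted": extracted,  # structured fields produced by extraction
    }
    path = output_dir / f"{name}.json"
    path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path
```

Deriving the file name from the URL means that re-processing the same paper overwrites its earlier artefact rather than creating a duplicate.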
Run the test suite with:

    pytest