ytydt/ExtractFromPaperMCP

ExtractFromPaperMCP

An agentic pipeline that simulates human browsing behaviour to collect clinical papers, normalizes the content, and extracts structured patient records and diagnoses.

Features

  • MCP-inspired browsing agent with randomized delays and retry logic
  • HTML/PDF normalization pipeline powered by BeautifulSoup and pdfminer
  • Information extraction agent with pluggable LLM clients (heuristic offline default)
  • Diagnosis validation to highlight low-confidence outputs
  • Knowledge store that writes provenance-rich JSON artefacts
  • Async CLI for batch processing lists of URLs
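The first feature, randomized delays combined with retry logic, can be sketched roughly as follows. This is a minimal illustration and not the project's code: the name fetch_with_retry, its parameters, and the injected fetch callable are all hypothetical, and the delay bounds are arbitrary defaults.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[str]],  # hypothetical: any async "download this URL" function
    url: str,
    max_attempts: int = 3,
    min_delay: float = 1.0,
    max_delay: float = 4.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Call fetch(url), pausing a random interval before each attempt."""
    rng = rng or random.Random()
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        # Wait a random time in [min_delay, max_delay] so requests are not evenly spaced.
        await sleep(rng.uniform(min_delay, max_delay))
        try:
            return await fetch(url)
        except Exception as exc:  # a real agent would likely catch narrower error types
            last_error = exc
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

Passing fetch and sleep in as arguments keeps the sketch independent of any particular HTTP library and makes it easy to test without real waiting.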

Installation

pip install -e ".[dev]"

Usage

python -m extractor.cli https://example.com/paper1 https://example.com/paper2

or provide a file with URLs:

python -m extractor.cli --urls-file urls.txt --output-dir outputs

The CLI writes JSON files with extracted data into the chosen output directory.
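As a rough sketch of what reading a URL file and writing one JSON file per paper could look like: the layout of urls.txt (one URL per line), the hash-based file names, and the field names source_url, retrieved_at, and record are assumptions made for illustration, not the project's documented schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def read_urls(urls_file: str) -> List[str]:
    """Assumed format: one URL per line; blank lines are skipped."""
    lines = Path(urls_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def write_record(output_dir: str, url: str, record: Dict[str, Any]) -> Path:
    """Write one JSON file that keeps the extracted data next to its source URL."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a short hash of the URL gives a stable, filesystem-safe name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".json"
    payload = {
        "source_url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "record": record,
    }
    path = out / name
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Storing the source URL and a retrieval timestamp alongside each record is one simple way to keep provenance with the extracted data.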

Testing

pytest
