A robust Python tool for scraping academic papers from top AI and machine learning conferences. It extracts high-quality metadata and full PDFs to support applications like citation analysis and research recommendation.
This tool covers major AI and machine learning conferences, including NeurIPS, ICML, ICLR, CVPR, and ACL, with a focus on papers published since the rise of deep learning. It automatically filters out non-archival content such as workshop papers, extended abstracts, and demos, so that only peer-reviewed full conference papers are included.
- NeurIPS (2000–2024)
- ICML (2013–2025)
- ICLR (2013–2025)
- AAAI (2010–2025)
- CVPR (2012–2025)
- COLT (2011–2025)
- UAI (2015–2025)
- JMLR (2000–2025)
- AISTATS (2009–2025)
- IJCAI (2017–2025)
- ACL (2017–2025)
- EMNLP (2017–2025)
- NAACL (2013–2025)
- ICCV (2013–2025)
- ECCV (2018–2024)
⚠️ Note: Due to access restrictions, the tool does not currently support scraping papers from KDD, TPAMI, or ICDM, as their full metadata and PDFs are not publicly available without a subscription or institutional access.
- Scrapes paper metadata (title, authors, abstract)
- Downloads PDFs automatically
- Resume capability for interrupted scraping
- Year-specific scrapers for different conference formats
- Robust error handling and rate limiting
- Configurable delays and retry mechanisms
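The retry-with-delay behavior described above can be sketched as a small helper. This is an illustrative sketch, not the tool's actual API: the names fetch_with_retries, delay, and backoff are assumptions.

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay=2.0, backoff=2.0):
    """Call fetch(url), retrying on failure with exponential backoff.

    Illustrative only; the scraper's real retry logic lives in its own
    code and may differ. `fetch` is any callable that raises on error.
    """
    last_err = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as err:  # network errors, HTTP errors, etc.
            last_err = err
            # wait longer after each failed attempt
            time.sleep(delay * (backoff ** attempt))
    raise last_err
```

With delay=0 this degrades to a plain bounded retry, which is handy in tests.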
- Clone the repository
- Install dependencies:
  pip install -r requirements.txt
- Create a .env file (optional) to configure the data directory:
  SCRAPER_DATA_ROOT=./data

List available conferences:
python main.py --list-conferences

Scrape a single year:
python main.py neurips 2022

Scrape multiple years:
python main.py iclr 2020 2021 2022

Skip PDF downloads (metadata only):
python main.py icml 2023 --no-pdfs

Start fresh (ignore existing data):
python main.py aaai 2024 --no-resume

Papers are saved in the following structure:
data/
├── metadata/
│   └── conference/
│       └── conference_year.json
└── papers/
    └── conference/
        └── year/
            └── paper_files.pdf
Conference-specific settings are defined in config.py:
- Request delays and timeouts
- Retry attempts
- Rate limiting parameters
- Base URLs for each conference
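A hypothetical shape for these settings; the actual field names in config.py may differ.

```python
# Hypothetical config.py fields; the tool's real names may differ.
REQUEST_DELAY = 1.0        # seconds to wait between requests
REQUEST_TIMEOUT = 30       # per-request timeout in seconds
MAX_RETRIES = 3            # retry attempts before giving up
RATE_LIMIT_PER_MIN = 30    # maximum requests per minute

# Base URLs per conference. The NeurIPS proceedings and OpenReview hosts
# are real sites; the mapping itself is illustrative.
BASE_URLS = {
    "neurips": "https://papers.nips.cc",
    "iclr": "https://openreview.net",
}
```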
The scraper generates detailed logs saved to scraper.log and displays progress in the console. Use --verbose for debug-level logging.
- Some conferences have year-specific scrapers for different website formats
- The scraper respects rate limits and includes delays between requests
- PDF downloads are optional and can be skipped for faster metadata collection
- All scraped data is saved incrementally to prevent data loss
In recent years, the rapid growth of AI and machine learning research has resulted in an overwhelming number of papers published annually, making it increasingly difficult for researchers to stay up to date with developments in their specific subfields. While platforms like Google Scholar, Semantic Scholar, OpenReview, and Paper Copilot attempt to aggregate publication data, our observations suggest that these sources often suffer from incomplete coverage and noisy metadata. To address this gap, we developed a suite of dedicated scrapers targeting the top-tier AI/ML conferences and journals, aiming to build a high-quality, comprehensive dataset of research papers. Our system extracts reliable metadata and downloads full PDFs, which can later be processed with tools like GROBID for structured content analysis. This curated dataset is intended to power downstream applications such as research limitation analysis, citation and reference recommendation, and intelligent paper reading recommendation. Our current focus spans conferences from roughly 2013 onward, when deep learning began reshaping the field, though earlier years may also be partially included.