ArXiv Paper Analyzer

A Python-based tool for scraping recent papers from arXiv.org and analyzing them with a local LLM. The system fetches papers from the configured categories, generates summaries and technical analyses, and saves the results as JSON.

Features

  • Scrapes recent papers from specified arXiv categories
  • Local LLM-powered paper analysis (using KoboldCPP backend)
  • Configurable paper processing pipeline
  • JSON-based output for easy integration
  • Support for multiple arXiv categories and subjects
  • Rate-limiting to respect arXiv's servers

Requirements

  • Python 3.7+
  • BeautifulSoup4 for web scraping
  • Requests library
  • Local KoboldCPP instance running (default: http://192.168.0.6:8051); see the connectivity check after this list
  • Access to arXiv.org
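
To confirm the KoboldCPP instance is reachable before running the pipeline, a quick check against KoboldCPP's standard model-info endpoint can help (the host and port come from the default above):

import requests

resp = requests.get("http://192.168.0.6:8051/api/v1/model", timeout=10)
resp.raise_for_status()
print(resp.json())  # KoboldCPP reports the currently loaded model here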

Configuration

ArXiv Categories

Categories are loaded from parsed_arxiv_subjects.json; a sample file is sketched after this list. Default monitoring includes:

  • cs.AI (Artificial Intelligence)
  • cs.LG (Machine Learning)
  • cs.CL (Computation and Language)
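
The exact schema of parsed_arxiv_subjects.json is defined by the repository; a minimal sample, assuming the file simply maps arXiv category codes to human-readable names, could look like:

{
    "cs.AI": "Artificial Intelligence",
    "cs.LG": "Machine Learning",
    "cs.CL": "Computation and Language"
}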

LLM Configuration

The system uses KoboldCPP with Gemma-2-class models. Configuration is stored in kobold_config.json, with parameters for the following (an example file is sketched after this list):

  • Context length
  • Temperature
  • Top-p sampling
  • Response length
  • Other generation parameters
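
A hypothetical kobold_config.json illustrating these settings; the key names below follow KoboldCPP's generate API, but the actual file in this repository may use different names or values:

{
    "max_context_length": 8192,
    "max_length": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "rep_pen": 1.1
}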

Usage

# SubjectManager, EnhancedArxivScraper, TextGenerationHandler, and
# PaperFeedGenerator are defined in this repository's modules; import them first.

# Initialize components
subject_manager = SubjectManager('parsed_arxiv_subjects.json')
scraper = EnhancedArxivScraper(subject_manager)
analyzer = TextGenerationHandler()
feed_generator = PaperFeedGenerator(scraper, analyzer)

# Generate daily feed
papers = feed_generator.generate_daily_feed(['cs.AI', 'cs.LG', 'cs.CL'])
feed_generator.save_feed(papers)
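
For reference, a minimal sketch of how a handler such as TextGenerationHandler could call KoboldCPP's standard /api/v1/generate endpoint. This illustrates the backend API only, not the repository's actual implementation; the prompt text and sampling values are placeholders:

import requests

KOBOLD_URL = "http://192.168.0.6:8051"

def analyze_abstract(abstract):
    """Send an abstract to the local KoboldCPP instance and return the generated analysis."""
    payload = {
        "prompt": f"Summarize and analyze the following arXiv abstract:\n\n{abstract}\n\nAnalysis:",
        "max_length": 512,
        "temperature": 0.7,
        "top_p": 0.9,
    }
    resp = requests.post(f"{KOBOLD_URL}/api/v1/generate", json=payload, timeout=120)
    resp.raise_for_status()
    # KoboldCPP returns the generated text under results[0]["text"]
    return resp.json()["results"][0]["text"].strip()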

Output Format

Papers are saved in JSON format with the following structure:

{
    "id": "paper_id",
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "abstract": "Paper abstract...",
    "date": "Submission date",
    "url": "arXiv URL",
    "primary_category": "Main category",
    "all_categories": ["cat1", "cat2"],
    "analysis": "LLM-generated analysis"
}
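
Because the output is plain JSON, downstream tools can consume it directly. A small example, assuming the feed is saved as a list of paper objects in a file named daily_feed.json (the actual filename is set by the code):

import json

with open("daily_feed.json", encoding="utf-8") as f:
    papers = json.load(f)

for paper in papers:
    print(f"{paper['primary_category']}: {paper['title']}")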

TODO List

High Priority Fixes

  1. Paper Count and Batching

    • Implement proper paper count tracking
    • Add batch processing capability
    • Add configurable batch size
  2. Data Processing Flow

    • Separate scraping and LLM processing steps
    • Save raw scraped data before LLM processing
    • Implement retry mechanism for failed LLM analyses (a sketch follows this list)
  3. Date Extraction Issues

    • Fix submission date extraction from arXiv pages
    • Add proper date parsing and formatting
    • Handle multiple versions of papers
  4. Category Management

    • Fix extraction of sub-categories
    • Improve category parsing from paper pages
    • Create readable format for category configuration
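
For the retry item above, one possible approach is exponential backoff around the analysis call; a minimal sketch, assuming the handler exposes an analyze method (hypothetical name) that raises on failure:

import time

def analyze_with_retry(analyzer, abstract, max_attempts=3, base_delay=2.0):
    """Retry a failed LLM analysis with exponential backoff; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return analyzer.analyze(abstract)  # hypothetical method name
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Analysis failed after {max_attempts} attempts: {exc}")
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))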

General Improvements

  1. Error Handling and Logging

    • Add comprehensive error handling
    • Implement proper logging system
    • Add error recovery mechanisms
  2. Data Storage

    • Implement more robust saving mechanism
    • Add database support (optional)
    • Add incremental save capability
  3. Configuration

    • Move hardcoded values to config file
    • Add CLI arguments for common parameters (a sketch follows this list)
    • Create example configuration files
  4. Documentation

    • Add docstrings to all classes and methods
    • Create API documentation
    • Add usage examples
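
For the CLI item above, a small argparse sketch showing how common parameters could be exposed; the flag names and defaults are illustrative, not existing options:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Scrape and analyze recent arXiv papers")
    parser.add_argument("--categories", nargs="+", default=["cs.AI", "cs.LG", "cs.CL"],
                        help="arXiv categories to monitor")
    parser.add_argument("--kobold-url", default="http://192.168.0.6:8051",
                        help="Base URL of the local KoboldCPP instance")
    parser.add_argument("--output", default="daily_feed.json",
                        help="Path of the JSON feed to write")
    return parser.parse_args()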

Optional Enhancements

  1. Features

    • Add support for different LLM backends
    • Implement paper similarity analysis
    • Add export formats (PDF, HTML)
    • Create web interface
  2. Performance

    • Optimize scraping with async requests (a sketch follows this list)
    • Add caching mechanism
    • Implement parallel processing
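
For the async scraping item above, a rough sketch using aiohttp with a small semaphore so concurrent requests stay polite toward arXiv; the library choice and concurrency limit are assumptions, not part of the current code:

import asyncio
import aiohttp

async def fetch(session, url, semaphore):
    # The semaphore caps concurrency so arXiv is not hit with many parallel requests
    async with semaphore:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()

async def fetch_all(urls, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url, semaphore) for url in urls))

# Example: pages = asyncio.run(fetch_all(["https://arxiv.org/list/cs.AI/recent"]))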

Notes

  • The TODO items are based on comments in the source code and apparent needs
  • Priority should be given to fixing core functionality before adding new features
  • Testing should be added throughout the improvement process
