Zikt/scholare

📚 Scholare: Automated Literature Review Pipeline

License: MIT | Python 3.10+ | Docs | Open In Colab

An end-to-end, config-driven Python tool that searches academic literature, downloads papers, and generates structured research notes, ready to plug into any research topic.

🚀 Try it instantly, no installation required!

Open In Colab

Click the badge above to run Scholare directly in your browser via Google Colab.


✨ What It Does

  1. Searches free APIs: OpenAlex natively, plus preprint servers such as arXiv and bioRxiv/medRxiv.
  2. Enriches every result through Semantic Scholar: abstracts, TLDRs, DOIs, code/data hints (falls back to DOI lookups for highest accuracy).
  3. Discovers open-access links dynamically using the Unpaywall API.
  4. Categorizes papers using configurable keyword rules.
  5. Downloads open-access PDFs into a local folder (with a --no-download CLI override).
  6. Generates visualizations: category distribution, open-access status, citation histogram, year timeline.
  7. Produces structured Markdown research notes: executive summary, taxonomy, top-cited, per-category breakdown with TLDRs, embedded charts, full paper index.
  8. Compares runs: pass a previous CSV to isolate newly discovered papers.
  9. Scores semantic relevance (optional): ranks papers using keyword heuristics or deep-learning embeddings via sentence-transformers.
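To make the keyword-heuristic side of relevance scoring concrete, here is a minimal sketch. It is not Scholare's actual scoring code: the function name `keyword_relevance` and the overlap formula (share of intent terms found in the paper text) are assumptions chosen for illustration.

```python
import re


def keyword_relevance(search_intent: str, text: str) -> float:
    """Return the fraction of distinct intent terms that also occur in the text.

    Hypothetical helper: the real pipeline may weight or filter terms differently.
    """
    def tokenize(s: str) -> set:
        # Lowercase and split on anything that is not a letter or digit.
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    intent_terms = tokenize(search_intent)
    if not intent_terms:
        return 0.0
    return len(intent_terms & tokenize(text)) / len(intent_terms)


papers = [
    {"title": "A survey of graph neural networks"},
    {"title": "Federated learning with differential privacy"},
]
intent = "privacy in federated learning"

# Rank papers from most to least relevant to the stated intent.
ranked = sorted(
    papers,
    key=lambda p: keyword_relevance(intent, p["title"]),
    reverse=True,
)
```

A score like this is cheap and transparent but ignores synonyms and word order, which is the gap the optional embedding-based ranking is meant to close.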

🛠️ Setup

1. Prerequisites

2. Install

From source (recommended for now):

git clone https://github.com/zikt/scholare.git
cd scholare
python -m venv venv

# Activate
# Windows PowerShell:
.\venv\Scripts\Activate.ps1
# macOS / Linux:
source venv/bin/activate

pip install -e .

Via PyPI (once the package is published):

pip install scholare

3. Configure API Keys

cp .env.example .env

Edit .env:

S2_API_KEY=your_actual_semantic_scholar_key
UNPAYWALL_EMAIL=your_email@example.com
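The Unpaywall REST API identifies callers through an `email` query parameter, which is why `UNPAYWALL_EMAIL` is required. The sketch below shows how a lookup URL might be built from that variable and how a PDF link could be read from the response. The helper names `unpaywall_url` and `best_pdf_url` are hypothetical and are not part of Scholare's public API; the sample record is made up for illustration.

```python
import os
from typing import Optional
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL for one DOI (hypothetical helper)."""
    # Keep "/" unescaped so the DOI stays readable in the path.
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?email={quote(email)}"


def best_pdf_url(record: dict) -> Optional[str]:
    """Pick a PDF link from an Unpaywall record, or None if nothing is open access."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


# Read the address configured in .env (placeholder used if it is unset).
email = os.environ.get("UNPAYWALL_EMAIL", "your_email@example.com")
lookup = unpaywall_url("10.1234/abc", email)

# Illustrative response fragment, not real data.
sample_record = {"best_oa_location": {"url_for_pdf": "https://example.org/paper.pdf"}}
pdf_link = best_pdf_url(sample_record)
```

Fetching `lookup` with any HTTP client returns a JSON record of this shape; a `best_oa_location` of `null` means no open-access copy is known.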

4. Create Your Config

cp config_example.json my_config.json

Edit my_config.json:

{
  "query": "your search query here",
  "limit": 30,
  "output_dir": "./my_output",
  "categories": {
    "Category A": ["keyword1", "keyword2"],
    "Other": []
  },
  "default_category": "Other",
  "download_pdfs": true,
  "sources": ["openalex", "arxiv", "biorxiv"],
  "search_intent": "your natural language description of what you are looking for",
  "use_embeddings": true,
  "compare_methods": false
}
| Field | Description |
| --- | --- |
| query | Search string (mapped appropriately across OpenAlex and preprints) |
| limit | Max number of papers to retrieve per API source |
| output_dir | Base output directory (subfolders auto-named by date + terms) |
| categories | Category name → keyword list for paper classification |
| default_category | Fallback when no keywords match |
| download_pdfs | Set false to skip PDF downloading by default |
| sources | (Optional) List of sources to query. Available: openalex, arxiv, biorxiv |
| search_intent | (Optional) Natural language phrase for semantic relevance scoring |
| use_embeddings | (Optional) Set to true to use sentence-transformers for ML-based relevance ranking (requires pip install scholare[ml]) |
| compare_methods | (Optional) Set to true to output both keyword and ML embedding scores for comparison in the CSV |
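The `categories` and `default_category` fields drive keyword-based classification. As a rough sketch of how such rules could be applied (assuming case-insensitive substring matching and a first-match-wins order, neither of which is confirmed by this README), consider the hypothetical `categorize` function below.

```python
def categorize(text: str, categories: dict, default_category: str) -> str:
    """Return the first category whose keyword list matches the text.

    Hypothetical helper: matching is a plain case-insensitive substring test.
    """
    lowered = text.lower()
    for name, keywords in categories.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return name
    # No keyword matched, so fall back to the configured default.
    return default_category


categories = {
    "Privacy": ["differential privacy", "secure aggregation"],
    "Other": [],
}

label_a = categorize("Differential Privacy for federated learning", categories, "Other")
label_b = categorize("A survey of graph neural networks", categories, "Other")
```

With substring matching, very short keywords can match inside unrelated words, so longer, more specific phrases tend to give cleaner categories.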

🚀 Usage

CLI

# Run the pipeline
scholare --config my_config.json

# Skip downloading PDFs (overrides config)
scholare --config my_config.json --no-download

# Compare with a previous run
scholare --config my_config.json --previous-csv ./old_output/results.csv
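Conceptually, comparing with a previous run is a set difference on a stable paper identifier. The sketch below illustrates the idea with the standard library only. The `doi` column name and the `new_rows` helper are assumptions for illustration; the actual columns in results.csv may differ.

```python
import csv
import io


def new_rows(current_csv, previous_csv, key: str = "doi") -> list:
    """Return rows of the current CSV whose key is absent from the previous CSV.

    Hypothetical helper: keys are compared after trimming and lowercasing.
    """
    previous_keys = {
        (row.get(key) or "").strip().lower()
        for row in csv.DictReader(previous_csv)
    }
    previous_keys.discard("")  # rows without a key cannot be matched
    return [
        row
        for row in csv.DictReader(current_csv)
        if (row.get(key) or "").strip().lower() not in previous_keys
    ]


# In-memory stand-ins for an old and a new results file.
previous = io.StringIO("doi,title\n10.1/a,Old paper\n")
current = io.StringIO("doi,title\n10.1/A,Old paper\n10.1/b,New paper\n")

discoveries = new_rows(current, previous)
```

In this sketch, rows that lack an identifier are always treated as new, since there is nothing to match them against.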

Programmatic & Cloud Notebooks (Colab / Kaggle)

⚡ Zero-install quick start: run Scholare directly in your browser!

Open In Colab

No Python setup, no terminal, no installation. Just click and run.

You can also install the package manually in any cloud notebook, directly from GitHub:

!pip install git+https://github.com/zikt/scholare.git

Then, you can define your configuration natively in Python and pass it to the pipeline:

import os
from scholare.config import load_config
from scholare.pipeline import run_pipeline

# Setting API Keys:
# Method A: Direct Injection
# os.environ["S2_API_KEY"] = "your_key_here"
# os.environ["UNPAYWALL_EMAIL"] = "your_email@example.com"

# Method B: Secure Colab Secrets (Recommended)
# from google.colab import userdata
# os.environ["S2_API_KEY"] = userdata.get('S2_API_KEY')

# Define config as a dictionary mapping
my_config = {
    "query": "federated learning",
    "limit": 10,
    "output_dir": "./output",
    "categories": {"Privacy": ["dp"]},
    "default_category": "Other",
    "download_pdfs": False
}

config_obj = load_config(my_config)
df = run_pipeline(config_obj)

print(f"Found {len(df)} papers")

Tip

See the full interactive Cloud Notebook Template (examples/cloud_notebook_template.ipynb) to get started immediately!


📁 Output Structure

Each run creates a descriptive subfolder:

output/
└── 2026-02-25_EEG_BCI_melanin_bias/
    ├── papers/                        # Downloaded open-access PDFs
    ├── visualizations/                # PNG charts
    │   ├── category_distribution.png
    │   ├── open_access_status.png
    │   ├── citation_distribution.png
    │   └── year_distribution.png
    ├── research_notes.md              # Structured Markdown summary
    ├── results.csv                    # Raw data
    └── new_discoveries.csv            # (if --previous-csv was used)
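The subfolder name combines the run date with terms from the query. One way such a name could be derived is sketched below. The `run_folder_name` helper and its `max_terms` cutoff are assumptions for illustration; Scholare's real naming logic may select or truncate terms differently.

```python
import re
from datetime import date


def run_folder_name(query: str, run_date: date, max_terms: int = 5) -> str:
    """Join an ISO date and the first few alphanumeric query terms with underscores."""
    # Drop punctuation and quotes so the result is safe as a directory name.
    terms = re.findall(r"[A-Za-z0-9]+", query)[:max_terms]
    return "_".join([run_date.isoformat(), *terms])


name = run_folder_name("EEG BCI melanin bias", date(2026, 2, 25))
```

Here `name` evaluates to `2026-02-25_EEG_BCI_melanin_bias`. Putting the ISO date first keeps run folders sorted chronologically in a plain directory listing.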

📝 Example Configs

See the examples/ directory for ready-to-use configs.


🗺️ Roadmap

Scholare is actively growing. See ROADMAP.md for planned features including:

  • 📡 More APIs: OpenAlex, Unpaywall, arXiv, Crossref, PubMed, CORE
  • 📤 Export: BibTeX/RIS for Zotero & Mendeley
  • 🔗 Integration: Zotero/Mendeley library sync
  • 🌐 Chrome extension: one-click literature searches
  • 📊 Analysis: citation networks, clustering, PRISMA diagrams
  • 🎨 UX: rich CLI, interactive HTML dashboards
  • 🤖 RAG Chat: interactive CLI chat across downloaded PDFs using local or cloud LLMs

Contributions welcome! Pick any item from the roadmap.


🀝 Contributing

We welcome contributions! See CONTRIBUTING.md for setup and guidelines.


⚠️ Limitations

  • Only open-access PDFs can be downloaded. Paywalled papers are noted with links.
  • Without an API key, Semantic Scholar requests are subject to stricter rate limits, which can slow enrichment on large searches. Setting S2_API_KEY mitigates this.
  • Categorization is keyword-based (heuristic, not an AI classifier).
  • bioRxiv search is emulated through the Crossref endpoint, so its results are handled slightly differently from those of the other sources.
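To illustrate the Semantic Scholar point: the API accepts a key in the `x-api-key` request header, and keyless clients usually need to wait between retries when throttled. The sketch below shows both pieces under stated assumptions. The helpers `s2_headers` and `backoff_delays` are hypothetical and do not describe how Scholare itself handles keys or retries.

```python
import os


def s2_headers(env=os.environ) -> dict:
    """Return request headers for Semantic Scholar, adding the key only if set."""
    key = (env.get("S2_API_KEY") or "").strip()
    return {"x-api-key": key} if key else {}


def backoff_delays(retries: int, base: float = 1.0) -> list:
    """Exponential wait times in seconds for successive retries (hypothetical policy)."""
    return [base * 2 ** attempt for attempt in range(retries)]


headers_with_key = s2_headers({"S2_API_KEY": "your_key_here"})
headers_without_key = s2_headers({})
delays = backoff_delays(3)
```

Passing the environment as an argument keeps the helper easy to test; in normal use it reads `os.environ`, which is where the values from .env end up once loaded.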

📦 Package Structure

scholare/
├── __init__.py          # Public API
├── __main__.py          # CLI entry point
├── config.py            # Config loader
├── api.py               # OpenAlex, preprint, Unpaywall, Semantic Scholar clients
├── pipeline.py          # Main orchestration
├── downloader.py        # PDF downloading
├── notes.py             # Markdown research notes generator
├── visualizations.py    # Chart generation
└── utils.py             # Categorization & comparison helpers

📄 License

MIT: use it, fork it, build on it.
