A data pipeline and analysis for Georges Six's Dictionnaire biographique des généraux et amiraux français de la Révolution et de l'Empire (1934), a two-volume biographical dictionary covering 2,206 generals and admirals who served France between roughly 1792 and 1815.
Six organized each entry as a semicolon-delimited series of facts — rank promotions, unit assignments, campaigns, wounds, honors, administrative acts — which makes the text unusually amenable to structured extraction. This repository contains the pipeline that converts Six's prose into machine-readable data, along with exploratory analyses of what that data reveals about French military careers, geography, and battlefield history.
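As a rough illustration of why the semicolon-delimited layout lends itself to extraction, the sketch below splits a made-up entry into clauses. The sample text and the `split_clauses` helper are hypothetical; the repository's own parser is not shown here and has to cope with abbreviations, OCR noise, and other complications that this sketch ignores.

```python
def split_clauses(entry_text: str) -> list[str]:
    """Split an entry into clauses on semicolons, dropping empty fragments."""
    parts = (part.strip() for part in entry_text.split(";"))
    # Remove a trailing full stop so the last clause matches the others.
    return [part.rstrip(".").strip() for part in parts if part]


# Invented sample in the style of a Six entry, not a quotation from the book.
sample = (
    "sous-lieutenant au 12e de ligne, 15 septembre 1791; "
    "capitaine, 1er mars 1793; "
    "blessé à Fleurus, 26 juin 1794."
)

clauses = split_clauses(sample)
for number, clause in enumerate(clauses, start=1):
    print(number, clause)
```

Each resulting clause is a candidate for one row in the events table, which is why a simple delimiter carries so much of the structure.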
A companion scholarly paper is in preparation.
The full structured dataset is included in data/gold/generals/:
| File | Description |
|---|---|
| gold_generals_events.csv | Rectangular export: 88,282 rows × 52 columns, one row per biographical clause |
| gold_generals_enriched.json | Enriched entry-level records with errata applied and all metadata |
| gold_generals.json | Entry-level records prior to errata application |
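
A minimal sketch of a first look at the events file, using only the standard library. The column names `entry_id` and `clause_type` and the in-memory sample are assumptions made for illustration; check the real header of gold_generals_events.csv before relying on any column name.

```python
import csv
import io
from collections import Counter


def count_by_column(handle, column: str) -> Counter:
    """Count rows per distinct value of `column` in an open CSV file object."""
    reader = csv.DictReader(handle)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for the real file; the column names are hypothetical.
demo = io.StringIO(
    "entry_id,clause_type\n"
    "1,rank_promotion\n"
    "1,wound\n"
    "2,rank_promotion\n"
)
counts = count_by_column(demo, "clause_type")
print(counts.most_common())
```

To run this against the real data, pass a handle from `open("data/gold/generals/gold_generals_events.csv", newline="", encoding="utf-8")` and a column taken from the file's actual header. The UTF-8 encoding is also an assumption.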
The raw PDF scans are not included in this repository. Bronze- and silver-stage intermediate files (cleaned OCR text and assembled entry records) are included in data/bronze/ and data/silver/.
The pipeline transforms Six's scanned pages into structured data in seven stages:
- Digitization — OCR via Marker, producing one Markdown file per page (bronze pages)
- Page cleaning — deterministic removal of OCR artifacts: markup residue, column-break insertions, inconsistent ordinal rendering, and dehyphenation failures (silver pages)
- Entry assembly — identification of entry boundaries and stitching of entries that span page breaks (bronze generals)
- Entry enrichment — four metadata passes: provisional-status flags from Six's typography; errata tracking linking Six's published corrections to affected entries; name-variant consolidation; and application of Six's errata to entry text with original text preserved (silver generals)
- Clause parsing and extraction — segmentation of entries into clauses and rule-based classification by type (rank promotion, unit assignment, campaign, honor, wound, administrative event), with type-specific fields extracted into structured records (gold generals)
- Gold enrichment — date propagation across year-ellipsis clauses; father-military detection
- Tabular export — flattening of enriched JSON into the rectangular CSV
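
To make the page-cleaning stage more concrete, here is a minimal sketch of one deterministic rule: rejoining words that OCR left hyphenated across a line break. The `dehyphenate` helper and its regular expression are illustrative assumptions, not the rules used in the repository.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by a
# lowercase letter (ASCII or Latin-1 accented) at the start of the next line.
_BROKEN_WORD = re.compile(r"(\w+)-\n(?=[a-zà-ÿ])")


def dehyphenate(text: str) -> str:
    """Rejoin words split across a line break by a trailing hyphen."""
    return _BROKEN_WORD.sub(r"\1", text)


# Invented sample, not a quotation from the book.
page = "nommé colonel du 12e régi-\nment de ligne; servit sous Gouvion Saint-\nCyr"
cleaned = dehyphenate(page)
print(cleaned)
```

The rule leaves `Saint-\nCyr` alone because the continuation is capitalized, but it would wrongly fuse a genuine compound such as `sous-\nlieutenant`. Cases like that are one reason a real cleaning stage needs more than a single pattern.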
Full methodology is described in the pipeline documentation.
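
For a sense of what the last two stages involve, the sketch below carries a year forward onto clauses that omit it and then flattens the result into CSV rows. It assumes that a year-ellipsis clause is one whose date leaves out the year and inherits it from the nearest preceding dated clause. That reading of the term, the field names, and both helper functions are assumptions for illustration, not the repository's implementation.

```python
import csv
import io
import re

# Four-digit years from 1600 to 1999.
_YEAR = re.compile(r"\b(1[6-9]\d{2})\b")


def propagate_years(clauses: list[str]) -> list[dict]:
    """Attach a year to every clause, carrying the last seen year forward."""
    rows = []
    current_year = None
    for index, text in enumerate(clauses):
        match = _YEAR.search(text)
        # A clause inherits a year only if it has none and one was seen earlier.
        inherited = match is None and current_year is not None
        if match:
            current_year = int(match.group(1))
        rows.append({"clause_index": index, "text": text,
                     "year": current_year, "year_inherited": inherited})
    return rows


def rows_to_csv(entry_id: str, rows: list[dict]) -> str:
    """Flatten clause records into CSV text, one row per clause."""
    fieldnames = ["entry_id", "clause_index", "text", "year", "year_inherited"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({"entry_id": entry_id, **row})
    return buffer.getvalue()


# Invented clauses; the second one omits its year.
clauses = [
    "capitaine, 1er mars 1793",
    "chef de bataillon, 4 juin",
    "général de brigade, 13 juin 1795",
]
rows = propagate_years(clauses)
csv_text = rows_to_csv("demo-1", rows)
print(csv_text)
```

Keeping a `year_inherited` flag next to the propagated value lets an analyst separate dates stated in the text from dates inferred by the pipeline.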
The analysis introduction covers questions the structured data can address, including:
- Rank hierarchy and promotion speed across the Revolutionary and Napoleonic eras
- Geographic origins of the officer corps
- Arc de Triomphe inscriptions — who was named, and where
- Battle participation and the relationship between officer quality and battlefield outcomes
- Army assignments across named commands
- Post-career provisions: pensions and retirement
- The Legion of Honor and the imperial honor system
data/
  bronze/        Cleaned OCR output (one .md per page) and assembled entry records
  silver/        Cleaned page and entry-level records
  gold/          Final structured dataset (CSV + JSON)
documentation/
  pipeline/      Pipeline methodology (Quarto)
  analysis/      Exploratory analysis (Quarto)
notebooks/       Jupyter notebooks for exploration and analysis
scripts/         Pipeline scripts (run in order by stage)
src/generals/    Supporting Python package
tests/           Unit tests
The gold dataset can be used directly without running the pipeline. For those who wish to reproduce or extend the pipeline from the bronze stage onward, the intermediate files are included. Reproducing from raw scans requires the original PDF files, which are not distributed here.
Dependencies are declared in pyproject.toml and managed with uv. To install the core analysis dependencies:

    uv sync

The digitization stage (stage 1) requires additional dependencies, including Marker and its ML model weights. Install with:

    uv sync --extra ocr

Pipeline scripts in scripts/ correspond to pipeline stages and should be run in the order listed in the pipeline documentation.
Data extraction and initial analyses are complete. A scholarly paper drawing on this dataset is in preparation. The repository will be updated as that work progresses.
Code and documentation in this repository are released under the MIT License. Six's Dictionnaire biographique (1934) is in the public domain; Six died in 1947.