Six's Generals

A data pipeline and analysis for Georges Six's Dictionnaire biographique des généraux et amiraux français de la Révolution et de l'Empire (1934), a two-volume biographical dictionary covering 2,206 generals and admirals who served France between roughly 1792 and 1815.

Six organized each entry as a semicolon-delimited series of facts — rank promotions, unit assignments, campaigns, wounds, honors, administrative acts — which makes the text unusually amenable to structured extraction. This repository contains the pipeline that converts Six's prose into machine-readable data, along with exploratory analyses of what that data reveals about French military careers, geography, and battlefield history.

A companion scholarly paper is in preparation.

The data

The full structured dataset is included in data/gold/generals/:

| File | Description |
| --- | --- |
| `gold_generals_events.csv` | Rectangular export: 88,282 rows × 52 columns, one row per biographical clause |
| `gold_generals_enriched.json` | Enriched entry-level records with errata applied and all metadata |
| `gold_generals.json` | Entry-level records prior to errata application |
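
For a quick sanity check of the rectangular export, a minimal sketch using only the Python standard library might look like the following. The helper name `csv_shape` is hypothetical and not part of this repository, the path is assumed from the file listing above, and UTF-8 encoding is an assumption. No column names are assumed.

```python
import csv
from pathlib import Path


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file that has a header row."""
    # UTF-8 is an assumption about how the export was written.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)            # first row holds the column names
        n_rows = sum(1 for _ in reader)  # every remaining row is data
    return n_rows, len(header)


if __name__ == "__main__":
    # Assumed location, taken from the file listing above.
    target = Path("data/gold/generals/gold_generals_events.csv")
    if target.exists():
        print(csv_shape(target))
```

If the export matches the description above, the reported shape should agree with the stated row and column counts.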

The raw PDF scans are not included in this repository. Bronze- and silver-stage intermediate files (cleaned OCR text and assembled entry records) are included in data/bronze/ and data/silver/.

Pipeline

The pipeline transforms Six's scanned pages into structured data in seven stages:

1. Digitization: OCR via Marker, producing one Markdown file per page (bronze pages)
2. Page cleaning: deterministic removal of OCR artifacts, namely markup residue, column-break insertions, inconsistent ordinal rendering, and dehyphenation failures (silver pages)
3. Entry assembly: identification of entry boundaries and stitching of entries that span page breaks (bronze generals)
4. Entry enrichment: four metadata passes, covering provisional-status flags from Six's typography; errata tracking linking Six's published corrections to affected entries; name-variant consolidation; and application of Six's errata to entry text with original text preserved (silver generals)
5. Clause parsing and extraction: segmentation of entries into clauses and rule-based classification by type (rank promotion, unit assignment, campaign, honor, wound, administrative event), with type-specific fields extracted into structured records (gold generals)
6. Gold enrichment: date propagation across year-ellipsis clauses; father-military detection
7. Tabular export: flattening of enriched JSON into the rectangular CSV
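
To make the clause parsing stage more concrete, here is a minimal sketch of semicolon segmentation followed by rule-based classification. It is not the repository's implementation: the function names, the labels, and the French keyword rules are all hypothetical, and the sample entry is invented for illustration rather than quoted from Six.

```python
import re

# Hypothetical keyword rules, checked in order; the real rules are not shown here.
RULES = [
    ("wound", re.compile(r"\bblessé", re.IGNORECASE)),
    ("honor", re.compile(r"légion d'honneur", re.IGNORECASE)),
    ("rank_promotion", re.compile(r"\b(nommé|promu)\b", re.IGNORECASE)),
    ("unit_assignment", re.compile(r"\b(armée|régiment)\b", re.IGNORECASE)),
]


def split_clauses(entry_text):
    """Split an entry on semicolons and drop empty fragments."""
    return [part.strip() for part in entry_text.split(";") if part.strip()]


def classify_clause(clause):
    """Return the label of the first matching rule, or 'other'."""
    for label, pattern in RULES:
        if pattern.search(clause):
            return label
    return "other"


# Invented sample entry, for illustration only.
SAMPLE = (
    "nommé général de brigade, 5 février 1799; "
    "à l'armée du Rhin, 1800; "
    "blessé à Wagram, 6 juillet 1809; "
    "mort à Paris"
)

if __name__ == "__main__":
    for clause in split_clauses(SAMPLE):
        print(classify_clause(clause), "|", clause)
```

Ordering the rules from most to least specific is one simple way to resolve clauses that match more than one pattern. A real classifier would likely need far more patterns and type-specific field extraction.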

Full methodology is described in the pipeline documentation.
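
As one illustration of the gold-enrichment stage, date propagation can be sketched as follows. This assumes that a "year-ellipsis" clause is one that omits its year and inherits it from the nearest preceding clause; that reading is an interpretation of the stage description, not something confirmed by the pipeline documentation. The function name `propagate_years` and the year pattern are hypothetical, and Republican-calendar dates are not handled.

```python
import re

# Assumed pattern: a four-digit Gregorian year between 1600 and 1999.
YEAR = re.compile(r"\b(1[6-9]\d{2})\b")


def propagate_years(clauses):
    """Return (clause, year, inherited) triples.

    A clause with its own year keeps it (the first year found is used).
    A clause without one inherits the most recent earlier year, if any.
    """
    last_year = None
    result = []
    for clause in clauses:
        match = YEAR.search(clause)
        if match:
            last_year = int(match.group(1))
        inherited = match is None and last_year is not None
        result.append((clause, last_year, inherited))
    return result


if __name__ == "__main__":
    # Invented clauses, for illustration only.
    demo = ["capitaine, 3 mars 1793", "chef de bataillon, 15 août"]
    for row in propagate_years(demo):
        print(row)
```

Keeping an explicit `inherited` flag preserves the distinction between dates stated in the source and dates inferred by the pipeline.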

Analysis

The analysis introduction covers seven questions the structured data can address:

Rank hierarchy and promotion speed across the Revolutionary and Napoleonic eras
Geographic origins of the officer corps
Arc de Triomphe inscriptions — who was named, and where
Battle participation and the relationship between officer quality and battlefield outcomes
Army assignments across named commands
Post-career provisions: pensions and retirement
The Legion of Honor and the imperial honor system

Repository layout

data/
  bronze/         OCR output (one .md per page) and assembled entry records
  silver/         Cleaned page and entry-level records
  gold/           Final structured dataset (CSV + JSON)
documentation/
  pipeline/       Pipeline methodology (Quarto)
  analysis/       Exploratory analysis (Quarto)
notebooks/        Jupyter notebooks for exploration and analysis
scripts/          Pipeline scripts (run in order by stage)
src/generals/     Supporting Python package
tests/            Unit tests

Reproducibility

The gold dataset can be used directly without running the pipeline. For those who wish to reproduce or extend the pipeline from the bronze stage onward, the intermediate files are included. Reproducing from raw scans requires the original PDF files, which are not distributed here.

Dependencies are declared in pyproject.toml and managed with uv. To install the core analysis dependencies:

uv sync

The digitization stage (stage 1) requires additional dependencies including Marker and its ML model weights. Install with:

uv sync --extra ocr

Pipeline scripts in scripts/ correspond to pipeline stages and should be run in the order listed in the pipeline documentation.

Status

Data extraction and initial analyses are complete. A scholarly paper drawing on this dataset is in preparation. The repository will be updated as that work progresses.

License

Code and documentation in this repository are released under the MIT License. Six's Dictionnaire biographique (1934) is in the public domain; Six died in 1947.
