rjcarroll/generals

Six's Generals

A data pipeline and analysis for Georges Six's Dictionnaire biographique des généraux et amiraux français de la Révolution et de l'Empire (1934), a two-volume biographical dictionary covering 2,206 generals and admirals who served France between roughly 1792 and 1815.

Six organized each entry as a semicolon-delimited series of facts — rank promotions, unit assignments, campaigns, wounds, honors, administrative acts — which makes the text unusually amenable to structured extraction. This repository contains the pipeline that converts Six's prose into machine-readable data, along with exploratory analyses of what that data reveals about French military careers, geography, and battlefield history.
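To illustrate why the semicolon-delimited layout lends itself to extraction, here is a minimal sketch of clause segmentation. The function name `split_clauses` and the sample text are hypothetical (the sample is invented in the style of an entry, not quoted from Six), and the repository's actual parser may handle many cases this does not.

```python
import re


def split_clauses(entry_text: str) -> list[str]:
    """Split an entry into clauses on semicolons, dropping empty pieces.

    Hypothetical helper for illustration; not the repository's parser.
    """
    parts = re.split(r"\s*;\s*", entry_text.strip())
    # Remove a trailing full stop (typically only on the final clause).
    return [p.rstrip(".").strip() for p in parts if p.strip()]


# Invented sample text in the style of an entry (an assumption, not a quotation).
sample = (
    "sous-lieutenant, 12 janvier 1792; capitaine, 3 mai 1793; "
    "blessé à Wagram, 6 juillet 1809."
)
clauses = split_clauses(sample)
for clause in clauses:
    print(clause)
```

A naive split like this would break on semicolons inside parentheses or quotations, which is one reason a real pipeline likely needs more careful rules.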

A companion scholarly paper is in preparation.

The data

The full structured dataset is included in data/gold/generals/:

File                          Description
gold_generals_events.csv      Rectangular export: 88,282 rows × 52 columns, one row per biographical clause
gold_generals_enriched.json   Enriched entry-level records with errata applied and all metadata
gold_generals.json            Entry-level records prior to errata application
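As a quick sanity check after cloning, the CSV's dimensions can be compared with the figures above. This is a minimal sketch using only the standard library. The helper `csv_shape` is hypothetical, and UTF-8 encoding with a single header row is an assumption about the file.

```python
import csv
from pathlib import Path


def csv_shape(path) -> tuple[int, int]:
    """Return (data_rows, columns) for a CSV file with one header row."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Path taken from the table above, relative to the repository root.
CSV_PATH = Path("data/gold/generals/gold_generals_events.csv")

if CSV_PATH.exists():
    print(csv_shape(CSV_PATH))
else:
    print(f"{CSV_PATH} not found; run this from the repository root")
```

If the table above is current, the result should be 88,282 rows and 52 columns.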

The raw PDF scans are not included in this repository. Bronze- and silver-stage intermediate files (cleaned OCR text and assembled entry records) are included in data/bronze/ and data/silver/.

Pipeline

The pipeline transforms Six's scanned pages into structured data in seven stages:

  1. Digitization — OCR via Marker, producing one Markdown file per page (bronze pages)
  2. Page cleaning — deterministic removal of OCR artifacts: markup residue, column-break insertions, inconsistent ordinal rendering, and dehyphenation failures (silver pages)
  3. Entry assembly — identification of entry boundaries and stitching of entries that span page breaks (bronze generals)
  4. Entry enrichment — four metadata passes: provisional-status flags from Six's typography; errata tracking linking Six's published corrections to affected entries; name-variant consolidation; and application of Six's errata to entry text with original text preserved (silver generals)
  5. Clause parsing and extraction — segmentation of entries into clauses and rule-based classification by type (rank promotion, unit assignment, campaign, honor, wound, administrative event), with type-specific fields extracted into structured records (gold generals)
  6. Gold enrichment — date propagation across year-ellipsis clauses; father-military detection
  7. Tabular export — flattening of enriched JSON into the rectangular CSV
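As a rough illustration of the date propagation in stage 6, the sketch below carries the most recent explicit year forward to clauses that lack one. This reading of "year-ellipsis" is an assumption, as are the function name `propagate_years`, the year pattern, and the invented sample clauses. The repository's actual logic may differ and is described in the pipeline documentation.

```python
import re
from typing import Optional

# Assumption: explicit years are four-digit numbers from 1600 to 1899.
YEAR_RE = re.compile(r"\b(1[6-8]\d{2})\b")


def propagate_years(clauses: list[str]) -> list[Optional[int]]:
    """Assign each clause its own year, or the last explicit year seen.

    Hypothetical fill-forward rule; clauses before the first explicit
    year get None.
    """
    last_year: Optional[int] = None
    years: list[Optional[int]] = []
    for clause in clauses:
        match = YEAR_RE.search(clause)
        if match:
            last_year = int(match.group(1))
        years.append(last_year)
    return years


# Invented sample clauses (not quoted from Six).
sample_clauses = [
    "capitaine, 3 mai 1793",
    "chef de bataillon, 14 août",
    "général de brigade, 2 février 1794",
]
years = propagate_years(sample_clauses)
print(years)
```

This simple pattern ignores Republican-calendar dates and would misread any other four-digit number in that range as a year, so it is only a starting point.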

Full methodology is described in the pipeline documentation.
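For readers who want a concrete picture of the tabular export in stage 7, here is a minimal sketch that flattens nested entry records into one row per clause. Every field name (`entry_id`, `name`, `clauses`, `type`, and so on) and every sample record is hypothetical. The real schema is whatever the gold JSON files contain.

```python
import csv
import io


def flatten_entries(entries: list[dict]) -> list[dict]:
    """Produce one flat row per clause, repeating entry-level fields."""
    rows = []
    for entry in entries:
        for index, clause in enumerate(entry.get("clauses", [])):
            row = {
                "entry_id": entry["entry_id"],
                "name": entry["name"],
                "clause_index": index,
            }
            row.update(clause)  # clause-specific fields become columns
            rows.append(row)
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Write rows to CSV text, using the union of keys as the header."""
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented records with hypothetical field names.
entries = [
    {
        "entry_id": 1,
        "name": "EXAMPLE (Jean)",
        "clauses": [
            {"type": "rank_promotion", "rank": "capitaine"},
            {"type": "wound", "place": "Wagram"},
        ],
    },
    {
        "entry_id": 2,
        "name": "SAMPLE (Pierre)",
        "clauses": [{"type": "honor", "honor": "Légion d'honneur"}],
    },
]
rows = flatten_entries(entries)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Type-specific fields that do not apply to a clause are left empty, which is one plausible way a clause-level table ends up with many sparsely filled columns.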

Analysis

The analysis introduction covers questions the structured data can address, including:

  • Rank hierarchy and promotion speed across the Revolutionary and Napoleonic eras
  • Geographic origins of the officer corps
  • Arc de Triomphe inscriptions — who was named, and where
  • Battle participation and the relationship between officer quality and battlefield outcomes
  • Army assignments across named commands
  • Post-career provisions: pensions and retirement
  • The Legion of Honor and the imperial honor system

Repository layout

data/
  bronze/         OCR output (one .md per page) and assembled entry records
  silver/         Cleaned pages and enriched entry-level records
  gold/           Final structured dataset (CSV + JSON)
documentation/
  pipeline/       Pipeline methodology (Quarto)
  analysis/       Exploratory analysis (Quarto)
notebooks/        Jupyter notebooks for exploration and analysis
scripts/          Pipeline scripts (run in order by stage)
src/generals/     Supporting Python package
tests/            Unit tests

Reproducibility

The gold dataset can be used directly without running the pipeline. For those who wish to reproduce or extend the pipeline from the bronze stage onward, the intermediate files are included. Reproducing from raw scans requires the original PDF files, which are not distributed here.

Dependencies are declared in pyproject.toml and managed with uv. To install the core analysis dependencies:

uv sync

The digitization stage (stage 1) requires additional dependencies including Marker and its ML model weights. Install with:

uv sync --extra ocr

Pipeline scripts in scripts/ correspond to pipeline stages and should be run in the order listed in the pipeline documentation.

Status

Data extraction and initial analyses are complete. A scholarly paper drawing on this dataset is in preparation. The repository will be updated as that work progresses.

License

Code and documentation in this repository are released under the MIT License. Six's Dictionnaire biographique (1934) is in the public domain; Six died in 1947.

About

OCR & extraction pipeline for Georges Six's 1934 biographical dictionary of Napoleonic generals. Converts poor-quality French PDF scans into a structured, searchable SQLite database.
