This repo serves a prototype of a PDF-to-Git workflow, aimed at processing the various editions of the EU AI Act.
https://artificialintelligenceact.eu/ is, as of [2023-03-01 Wed], undergoing edits.
The edits are happening, as far as we can tell, in Word.
Yes, they use Track Changes.
No, they do not share the Word docs with the public.
Instead, they share the PDF exports of the Track Changes with the public: deletions are shown as strikeouts, and new text is shown in bold.
These are what we might call passive-aggressive PDFs: almost as bad as though one printed the file and then scanned the printout to PDF.
Being programmers, we see all this is a job for Git: see https://ben.balter.com/2015/02/06/word-diff/ for a longer apologia.
Anyway, we download the available PDFs from https://artificialintelligenceact.eu/documents to input-pdfs.
We identify those PDFs that seem to contain revisions of the AI Act, and rename them using YYYY-MM-DD prefixes.
We use Adobe Acrobat to perform PDF-to-Word extraction on those files; the results are found in input-docx.
Then we run actxtract, a shell script that drives the rest of the process.
Pandoc takes us from Word to Markdown.
We cope with the strikeouts and bold text to arrive at snapshots.
We create a git branch containing a single file act.md
whose history shows all the snapshots… and their diffs.
- Pandoc
- Parallel
bin/actxtract ~/src/smucclaw/eu-ai-act input-docx workdir 'Having regard to the Treaty on the Functioning of the European Union' | tee actxtract.org
The arguments are:
- your local repo
- directory containing docx input files
- directory for temporary work files
- string that delimits the preamble; we discard text above this
The output gets saved to actxtract.org
which you can open as an org-mode in your favourite editor, such as Emacs.
If all goes well, a new branch is created and uploaded to Github. In that branch you will find an act.md
which has a history of commits. Those commits correspond to the various versions of the Act found in the original set of input PDFs and Word Docx.
GraphTage: diffing of trees like XML https://blog.trailofbits.com/2020/08/28/graphtage/