Explore GitHub CI through PDF processing shenanigans.

samayer12/Interleave

Messing around with GitHub CI and PDF processing.

This program takes two PDF files, matches their paragraphs (which SHOULD be numbered), and stores the matched pairs in a .csv file.

Example usage

python interleave.py file1.pdf file2.pdf output.csv

Example output

Document1,Document2
1. First Entry.,1. First Entry
2. Second Entry.,2. Second Entry
3. Third Entry,3. Third Entry
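
The matching step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: it assumes the text has already been extracted from each PDF, that every paragraph starts on its own line with a number followed by a period, and the function names (`split_numbered_paragraphs`, `interleave`, `write_pairs`) are hypothetical.

```python
import csv
import io
import re

# Assumption: a paragraph starts at a line beginning with "<number>. ".
PARAGRAPH_START = re.compile(r"^(\d+)\.\s", re.MULTILINE)


def split_numbered_paragraphs(text):
    """Return {paragraph number: paragraph text} for one document's text."""
    matches = list(PARAGRAPH_START.finditer(text))
    paragraphs = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse internal line breaks so each paragraph fits one CSV cell.
        paragraphs[int(match.group(1))] = " ".join(text[match.start():end].split())
    return paragraphs


def interleave(text1, text2):
    """Pair paragraphs that share a number; a missing side becomes ''."""
    first = split_numbered_paragraphs(text1)
    second = split_numbered_paragraphs(text2)
    numbers = sorted(set(first) | set(second))
    return [(first.get(n, ""), second.get(n, "")) for n in numbers]


def write_pairs(pairs, handle):
    """Write the matched pairs as CSV with the Document1,Document2 header."""
    writer = csv.writer(handle)
    writer.writerow(["Document1", "Document2"])
    writer.writerows(pairs)


# Hypothetical stand-ins for text extracted from the two PDFs.
doc1 = "1. First Entry.\n2. Second Entry.\n3. Third Entry"
doc2 = "1. First Entry\n2. Second Entry\n3. Third Entry"
buffer = io.StringIO()
write_pairs(interleave(doc1, doc2), buffer)
print(buffer.getvalue())
```

Pairing by paragraph number rather than by position means a paragraph missing from one document leaves an empty cell instead of shifting every later pair out of alignment.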

Current Status

102 of 1088 paragraphs in the test data have an anomaly. A complete list of observed errors is in Errors.csv. Error types:

  • Double-Number Parse
  • Excessive Heading
  • Grouped Response
  • Heading Parse
  • Missing Character(s)
  • Missing Text
  • Pagebreak Parse
  • Parse Error
  • Parsed Count
  • Preceding Data
  • WTF

Here's a tabular representation of the anomalies.

|       | 1/2 | 3/4* | 5/6 | 7/8 | Total |
|-------|-----|------|-----|-----|-------|
| EJ    | 17  | 6    | 8   | 17  | 48    |
| EPA   | 15  | 3    | 21  | 15  | 54    |
| Total | 32  | 9    | 29  | 32  | 102   |
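
As a quick sanity check, the row and column totals in the table can be recomputed from the individual counts, along with the overall anomaly rate. The snippet below is only a sketch: the counts are copied from the table by hand, and the variable names are illustrative rather than taken from the repository.

```python
# Anomaly counts copied from the table above; column labels are kept as-is.
columns = ["1/2", "3/4*", "5/6", "7/8"]
counts = {
    "EJ": [17, 6, 8, 17],
    "EPA": [15, 3, 21, 15],
}

# Sum across each row, then down each column.
row_totals = {name: sum(values) for name, values in counts.items()}
column_totals = [sum(column) for column in zip(*counts.values())]
grand_total = sum(row_totals.values())

total_paragraphs = 1088  # paragraph count reported under Current Status
anomaly_rate = grand_total / total_paragraphs

print(row_totals)
print(dict(zip(columns, column_totals)))
print(f"{grand_total}/{total_paragraphs} paragraphs = {anomaly_rate:.1%}")
```

The recomputed totals agree with the table, and the overall rate works out to a little under one paragraph in ten.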
