PR to merge all work from NCSU Senior Design project #165

dan144 · 2017-06-21T22:35:54Z

This pull request contains all of the work from the NC State University ECE Senior Design team. The major features added include string search, batch processing, and OCR.


jeremybmerrill · 2017-07-08T14:54:19Z

src/main/java/technology/tabula/extractors/BatchSelectionExtractor.java

+// NOTES:
+//		need to remove tables from auto/spread list if they are used as a best guess
+//		or remove very similar tables from the list before extracting data
+public class BatchSelectionExtractor {


It might be worthwhile for this to mimic https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/extractors/SpreadsheetExtractionAlgorithm.java#L85 and have the extract() method return a list of tables, rather than writing the output to disk directly.

I think it would also work for it to return a list of lists of tables (one list of tables per document in the batch job) or even a dictionary/hash mapping batch document filenames to a list of tables.

This will also, I believe, solve one of the problems that's causing the continuous integration job to fail. Rather than requiring that an output directory exist, this extract method can "write" the output to a stream object, rather than a real file. That way, the tests aren't side-effecty in terms of creating files and folders.
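To make the suggestion above concrete, here is a minimal sketch of the shape being proposed: extraction returns a map from file name to tables, and writing is a separate step that accepts any `Appendable`. Every name here (`BatchExtractSketch`, `SimpleTable`, `extract`, `writeCsv`, `toCsvString`) is a hypothetical stand-in, not tabula-java's actual API, and the "extraction" just fabricates one placeholder table per file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not tabula-java's API.
class BatchExtractSketch {

    // Stand-in for a table: rows of cell strings.
    static final class SimpleTable {
        final List<List<String>> rows;
        SimpleTable(List<List<String>> rows) { this.rows = rows; }
    }

    // Extraction only: results keyed by file name, nothing written to disk.
    // A real extractor would parse each PDF; here every file yields one fake table.
    static Map<String, List<SimpleTable>> extract(List<String> fileNames) {
        Map<String, List<SimpleTable>> result = new LinkedHashMap<>();
        for (String name : fileNames) {
            List<SimpleTable> tables = new ArrayList<>();
            tables.add(new SimpleTable(List.of(List.of("file", name))));
            result.put(name, tables);
        }
        return result;
    }

    // Writing only: any Appendable works (a StringWriter in tests, a FileWriter otherwise).
    // Naive CSV: joins cells with commas and does no quoting or escaping.
    static void writeCsv(Appendable out, List<SimpleTable> tables) throws IOException {
        for (SimpleTable table : tables) {
            for (List<String> row : table.rows) {
                out.append(String.join(",", row)).append('\n');
            }
        }
    }

    // Convenience for tests: renders the tables to an in-memory string.
    static String toCsvString(List<SimpleTable> tables) {
        StringWriter buffer = new StringWriter();
        try {
            writeCsv(buffer, tables);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        Map<String, List<SimpleTable>> byFile = extract(List.of("a.pdf", "b.pdf"));
        System.out.print(toCsvString(byFile.get("a.pdf"))); // prints "file,a.pdf"
    }
}
```

With this split, a test can compare the string from `toCsvString` against the expected CSV without needing an output directory to exist.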

Incidentally, I think there needs to be a little bit of work to deal with Linux/Mac compatibility. Where the test for this method is supposed to write some files to the /src/test/resources/technology/tabula/batch/output/ directory, I end up with a file in java/src/test/resources/technology/tabula/batch/ called output\well_text_a.csv. Java lets you build file paths without hard-coding \ or / and works out the right separator for the platform itself.

dan144 · 2017-07-11

I believe the last section of this is caused by this section of the code. I'll switch this to use Paths.get. I'll look into making changes for the rest as well.
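A minimal sketch of the `Paths.get` approach, assuming the path is assembled from separate segments rather than a string containing `\`. The class name, method name, and directory names are illustrative only and are not taken from the PR's code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: class and method names are made up for illustration.
class PortablePathSketch {

    // Joins the segments with whatever separator the current platform uses,
    // so "output" becomes a directory instead of part of the file name.
    static Path outputFile(String baseDir, String fileName) {
        return Paths.get(baseDir, "output", fileName);
    }

    public static void main(String[] args) {
        Path p = outputFile("batch", "well_text_a.csv");
        System.out.println(p);               // batch/output/well_text_a.csv on Linux/macOS
        System.out.println(p.getFileName()); // well_text_a.csv
    }
}
```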

Just to keep visibility, @dbangera23 is looking into this presently; it hasn't been forgotten.

dbangera23 · 2017-07-22

Hey @jeremybmerrill, I changed the code to make the extract method work much like the one in SpreadsheetExtractionAlgorithm. It now returns a Map<String, List<Table>> keyed by file name. I also changed some of the code to make it more modular: batch processing is now done separately from writing.
Let me know if there is anything else.
Thanks.

jeremybmerrill · 2017-08-10

Oh, yes. The type signature looks good. Let's figure out why CI is failing, then, since @jazzido approved these too, we can see about getting these merged.


jeremybmerrill · 2017-08-12T20:38:34Z

Okeedoke. Thanks for fixing the tests ("good failures," the ones where the tests fail because extraction got improved, are common and kinda funny). I'm good to merge this. @jazzido ?

jazzido · 2017-08-16T16:31:58Z

Hi @dan and @dbangera23,

Ran into another issue when playing with your branch. This command:

java -jar target/tabula-1.0.2-SNAPSHOT-jar-with-dependencies.jar -e -p 10 -g ~/Downloads/170803_crosstabs_Politico_v1_TB-1.pdf
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Users/manuel/Work/code/tabula-java/target/tabula-1.0.2-SNAPSHOT-jar-with-dependencies.jar])
	at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
	at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
	at com.sun.jna.Library$Handler.<init>(Library.java:147)
	at com.sun.jna.Native.loadLibrary(Native.java:412)
	at com.sun.jna.Native.loadLibrary(Native.java:391)
	at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:75)
	at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
	at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:368)
	at net.sourceforge.tess4j.Tesseract.createDocuments(Tesseract.java:524)
	at net.sourceforge.tess4j.Tesseract.createDocuments(Tesseract.java:507)
	at technology.tabula.extractors.OcrConverter.extract(OcrConverter.java:45)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Even though I indicated that only page 10 needs to be processed, the app generated PNG images for all pages in the same folder that contains the input PDF. Of course, that should not happen. If you need to generate temporary files, they need to be generated in a temporary folder that is deleted afterwards.
As the trace shows, it failed. I'm on a Mac.
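For the temporary-file point above, here is a minimal sketch of one way to keep intermediate images out of the input PDF's folder: create a scratch directory with `Files.createTempDirectory`, write only the requested pages into it, and delete it in a `finally` block. The class name, method names, `ocr-` prefix, and file names are assumptions for illustration, and the empty files stand in for rendered page images.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch: not the PR's OcrConverter, just the temp-directory pattern.
class TempWorkDirSketch {

    // Writes one placeholder file per requested page into a scratch directory,
    // then removes the directory. Returns its path so callers can confirm cleanup.
    static Path renderPagesInTempDir(int... pages) {
        try {
            Path workDir = Files.createTempDirectory("ocr-");
            try {
                for (int page : pages) {
                    // Placeholder for a rendered page image.
                    Files.write(workDir.resolve("page-" + page + ".png"), new byte[0]);
                }
                // ... OCR would read the images here ...
            } finally {
                deleteRecursively(workDir);
            }
            return workDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes children before parents by visiting paths in reverse order.
    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) {
        Path dir = renderPagesInTempDir(10);
        System.out.println(Files.exists(dir)); // false
    }
}
```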

The document that I used for testing is here, although any document will show this behavior.

Thanks!

jeremybmerrill · 2017-08-17T14:29:51Z

Hey guys, is there a way to use String Search from the command line?

dbangera23 · 2017-08-18T18:18:14Z

Hey @jazzido and @jeremybmerrill, Daniel and I will take a look this weekend at the Mac error you're seeing and see what we can do. I also didn't add the functionality to run OCR on a per-page basis, only on the whole document, so we can look into that too.

Currently String search isn't implemented in the command line. We can look into making that work.

dan144 · 2017-08-20T17:36:14Z

@jazzido I looked into your error on Mac, and I think it's the same issue we're seeing on Linux. Since the included JAR file only includes Windows-compatible Tess4J functions by default, I think you'll have to manually install the Tesseract libraries. Instructions are given here.


dbangera23 · 2017-08-20T18:27:20Z

@jeremybmerrill Hey Jeremy, I looked into the option of adding a string-based command line search, and I don't think it's a good idea to do so at the moment. Maybe after the merge we can take a closer look.

The problem is that CommandLineApp.whichArea(line) gets the rectangle that determines where to process, and extraction is then done later at exactly those rectangles. String search might return a rectangle that differs from one page to another. I could dynamically search for the rectangle in extractFile, but even then string search can return multiple rectangles per page.

I'm not sure how we want to handle this since the above "solution" would break how modular the code is and would need a redesign of the page class. Might be better to handle this functionality at a later date.

dbangera23 · 2017-08-21T02:06:56Z

I'm finishing up the OCR per page functionality. Should have it done soon.

rosenjcb · 2018-09-05T23:11:19Z

Was this pull request ever merged into the master branch? I've thought about adding OCR (tess4j) functionality but it looks like it was already attempted.

dan144 · 2018-09-06T00:54:05Z

Hey @rosenjcb, this PR has not been worked on in some time. We had trouble with the Ubuntu version used for automated testing in this repo. No dev work has been done on this project in about 16 months, so if you're interested in picking up the OCR effort on this project, this certainly shouldn't stop you.

Dean and others added 30 commits April 13, 2017 22:32

added changes for regex search

d884a2e

missed some changes

3ad86c7

fixed java code for regex search

19ab131

Intermediary change while we get 4 corners working

225d648

added Tess4j dependency to pom.xml, added base class for ocr implementation

fffae84

OCR Full conversion complete, dumped into individual pdf's folder

1b490b0

Changed return to succes or failure

0026ac7

Add files via upload

3241583

files for alpha demo

Add files via upload

c721649

files for alpha demo

fixed imports for full build dependencies

a130b12

fix double iteration on backup

2148259

temp fix for out of position strings that begin with same char

7440fcb

batch test for GUI

b8e904b

updated container class for regular expressions

57093cc

fixed a couple errors. file needed to be deleted

1210982

added basic coordinate search

c9f60ef

fixed check for coordinates

2e174a4

System output change in regex and batch coordinate change

1879eab

system output changes and fixed bug with bottom of string not being within coordinates

592ba47

ocr selection for batch now passed in

2f40969

Fixed issue where first element wasn't within the coordinates

409f902

attempt to merge ocr into batch, overlap might do something

4980374

Merge branch 'master' into dev

f533d9d

Merging all development by ECE485 team into master branch

allowing 3 corner search, new parsing method

643eb73

changed how string arrays are handled, should be able to perform 3 corners

eb3fe2e

removed backend cause of need for renaming operations.

46daf5c

cleaned up code

d17d044

removed print statements

15b6d54

tried to format tables better, removed prints and delete ocr files

cd96c69

fixes for ocr and indexing errors

f5ec776

Daniel Gross added 2 commits June 29, 2017 20:15

changed OCR to remove creation of new files in user system. all new files created in temporary directories. removed OCR renaming and simply overwrite all OCR files, since all are now temporary

9296532

reorganized batch search to remove all catch(Exception) swallowing.

3e69d60

jeremybmerrill reviewed Jul 8, 2017


Daniel Gross added 4 commits July 11, 2017 20:24

generalized path creation cross platform

6692f71

fixing potential null pointers

701d3f6

fixed (another) potential nullpointerexception

ad2c520

fixing linux error when files read in different order

5f39398

jazzido approved these changes Jul 12, 2017


jazzido and others added 9 commits July 12, 2017 00:32

Merge branch 'master' into master

680d375

install tesseract in travis

c18ff67

we need sudo in travis

e24fc19

sudo for travis

5fd099b

travis: ghostscript

4471cee

Wrote batch using file writer example in spreadsheetextractor

b716766

Updated java process for batch processing. more modular process. updated test to use new process

5bde097

Changed expected files in expected of batch testing due to change in formatting to keep in line with CSV writer and rest of tabula

763f341

Merge branch 'master' into master

0878b17

jeremybmerrill approved these changes Aug 12, 2017


updated command line interface, -b now takes into consideration -e for ocr

dd9ecea

Now pass a page list to OcrConvertor and only run OCR on specified pages

54b3ec7

jazzido closed this Nov 24, 2020
