PR to merge all work from NCSU Senior Design project #165

Closed
wants to merge 64 commits into from
Changes from 44 commits
64 commits
d884a2e
added changes for regex search
Jan 10, 2017
3ad86c7
missed some changes
Jan 10, 2017
19ab131
fixed java code for regex search
Jan 16, 2017
225d648
Intermediary change while we get 4 corners working
Feb 6, 2017
fffae84
added Tess4j dependency to pom.xml, added base class for ocr implemen…
Feb 14, 2017
1b490b0
OCR Full conversion complete, dumped into individual pdf's folder
Feb 14, 2017
0026ac7
Changed return to succes or failure
Feb 17, 2017
3241583
Add files via upload
Feb 27, 2017
c721649
Add files via upload
Feb 27, 2017
a130b12
fixed imports for full build dependencies
Mar 16, 2017
2148259
fix double iteration on backup
Mar 20, 2017
7440fcb
temp fix for out of position strings that begin with same char
Mar 21, 2017
b8e904b
batch test for GUI
Mar 25, 2017
57093cc
updated container class for regular expressions
Mar 25, 2017
1210982
fixed a couple errors. file needed to be deleted
Mar 26, 2017
c9f60ef
added basic coordinate search
Mar 27, 2017
2e174a4
fixed check for coordinates
Mar 27, 2017
1879eab
System output change in regex and batch coordinate change
Mar 30, 2017
592ba47
system output changes and fixed bug with bottom of string not being w…
Mar 30, 2017
2f40969
ocr selection for batch now passed in
Mar 31, 2017
409f902
Fixed issue where first element wasn't within the coordinates
Apr 7, 2017
4980374
attempt to merge ocr into batch, overlap might do something
Apr 13, 2017
f533d9d
Merge branch 'master' into dev
Apr 14, 2017
643eb73
allowing 3 corner search, new parsing method
mattrich37 Apr 17, 2017
eb3fe2e
changed how string arrays are handled, should be able to perform 3 co…
mattrich37 Apr 17, 2017
46daf5c
removed backend cause of need for renaming operations.
Apr 20, 2017
d17d044
cleaned up code
Apr 24, 2017
15b6d54
removed print statements
mattrich37 Apr 30, 2017
cd96c69
tried to format tables better, removed prints and delete ocr files
mattrich37 Apr 30, 2017
f5ec776
fixes for ocr and indexing errors
mattrich37 Apr 30, 2017
4dce3d6
fixed comments
mattrich37 Apr 30, 2017
bba10d1
fixed conflict of ocr output naming. added boolean to determine ocr e…
May 1, 2017
0a92f37
updated output for ocr in batch processing
May 2, 2017
ee0e4eb
incorrect parsing of coordinate search list from json fix
May 3, 2017
c259189
basic 1 string search, should have fixed bounding box issues
mattrich37 May 9, 2017
32c31b0
no longer needed
mattrich37 May 10, 2017
9b5259f
one string search now compares to autodetect
May 10, 2017
93e0e63
added boolean as well..
May 10, 2017
dd40bb2
thinned out comments
May 10, 2017
a418f59
Complete rename of any regex to string to match functionality
dbangera23 May 11, 2017
45e316e
Initial frramework for testing of ocr and string search
dbangera23 Jun 9, 2017
68d9474
Implemented test cases for string searching
Jun 20, 2017
e238cad
Implemented test for OCR conversion
Jun 20, 2017
ed31e18
Wrote tests for batch processing. Added resources.
Jun 21, 2017
54cfc6b
preliminary effort to address PR comments. all relevant tests passing.
Jun 27, 2017
857d2de
cleaned up a number of PR comments. tested changes in GUI and verifie…
Jun 28, 2017
677ecee
Added OCR as 'e' in command line
dbangera23 Jun 29, 2017
9296532
changed OCR to remove creation of new files in user system. all new f…
Jun 30, 2017
3e69d60
reorganized batch search to remove all catch(Exception) swallowing.
Jun 30, 2017
6692f71
generalized path creation cross platform
Jul 12, 2017
701d3f6
fixing potential null pointers
Jul 12, 2017
ad2c520
fixed (another) potential nullpointerexception
Jul 12, 2017
5f39398
fixing linux error when files read in different order
Jul 12, 2017
680d375
Merge branch 'master' into master
jazzido Jul 12, 2017
c18ff67
install tesseract in travis
jazzido Jul 12, 2017
e24fc19
we need sudo in travis
jazzido Jul 12, 2017
5fd099b
sudo for travis
jazzido Jul 12, 2017
4471cee
travis: ghostscript
jazzido Jul 14, 2017
b716766
Wrote batch using file writer example in spreadsheetextractor
dbangera23 Jul 18, 2017
5bde097
Updated java process for batch processing. more modular process. upda…
dbangera23 Jul 22, 2017
763f341
Changed expected files in expected of batch testing due to change in …
dbangera23 Aug 11, 2017
0878b17
Merge branch 'master' into master
jeremybmerrill Aug 12, 2017
dd9ecea
updated command line interface, -b now takes into consideration -e fo…
dbangera23 Aug 20, 2017
54b3ec7
Now pass a page list to OcrConvertor and only run OCR on specified pages
dan144 Aug 30, 2017
6 changes: 6 additions & 0 deletions pom.xml
@@ -279,6 +279,12 @@
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>

<dependency>
@jazzido (Contributor), Jun 21, 2017

Does tess4j include binaries for all the platforms that we support (Windows, Mac and Linux)? Does it require Tesseract to be installed on the machine where Tabula is used?

Contributor

As far as I can tell from my research, only Windows binaries are included.
[1] [2]

Author

I will look into this. I believe @jeremybmerrill once mentioned that Unix systems require installing a tesseract package from the terminal, but I'll investigate this and find the fix.


It looks like we might need to add a line to the instructions for Linux/Mac users. We used Tess4J in the project (http://tess4j.sourceforge.net/usage.html), which points to the original tesseract-ocr GitHub page (https://github.com/tesseract-ocr/tesseract/wiki), which suggests running "sudo port install tesseract".

To be honest, though, I can't test this since I don't have access to a Linux device. Let me know if there is anything I can do and what you guys find out.
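
If Tess4J does rely on a system-wide Tesseract install on Linux and Mac, one option is to probe for the executable before attempting OCR and report a clear message when it is missing. The following is only a sketch under that assumption: the class and method names are hypothetical and not part of Tabula or Tess4J, and it assumes the command is named tesseract and accepts --version, which older builds may not.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tabula or Tess4J.
final class NativeToolProbe {

    // Returns true if the command can be started with the given flag and exits with status 0.
    static boolean isAvailable(String command, String versionFlag) {
        ProcessBuilder builder = new ProcessBuilder(command, versionFlag);
        builder.redirectErrorStream(true); // some tools print their version on stderr
        try {
            Process process = builder.start(); // throws IOException if the command is not found
            try (InputStream output = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (output.read(buffer) != -1) {
                    // Drain and discard the output so the child cannot block on a full pipe.
                }
            }
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The command name and flag are assumptions; adjust them for the target platform.
        if (!isAvailable("tesseract", "--version")) {
            System.err.println("Tesseract was not found on the PATH; OCR will be unavailable.");
        }
    }
}
```

A check like this only confirms that an executable is on the PATH. It says nothing about whether the shared library that Tess4J loads is present or recent enough.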

Author

@jazzido it looks like we need ghostscript as well, which can be installed with sudo apt-get install ghostscript. I believe that should be added to the .travis.yml file the same way you added the tesseract dependency.

Author

So the latest Travis issue with Tesseract is java.lang.UnsatisfiedLinkError: Error looking up function 'TessBaseAPICreate': /usr/lib/libtesseract.so.3.0.2: undefined symbol: TessBaseAPICreate. When I run apt-get install tesseract-ocr on my local Ubuntu 16.04 system, it installs the same package, but with libtesseract.so.3.0.4, which contains the API calls we used through Tess4J; those calls are present in Tesseract 3.0.3+. It looks like the Travis system is running Ubuntu 12.04 LTS, where package support ends at 3.0.2. How do we want to proceed with this knowledge?
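
A minimal sketch of how that version gap could be detected up front, assuming the shared-library file name carries the version the way it does in the error above. The class and method names are hypothetical and not part of Tabula or Tess4J, and the 3.0.3 floor is taken from the comment above rather than from Tesseract's own documentation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tabula or Tess4J.
final class LibTesseractVersion {

    // Matches the numeric suffix of a name such as "libtesseract.so.3.0.4".
    private static final Pattern SO_VERSION =
            Pattern.compile("libtesseract\\.so\\.(\\d+(?:\\.\\d+)*)");

    // Assumed minimum, per the discussion: the calls used are present in 3.0.3+.
    private static final int[] MINIMUM = {3, 0, 3};

    // Extracts the version components from a library path, or null if none are found.
    static int[] parse(String libraryPath) {
        Matcher matcher = SO_VERSION.matcher(libraryPath);
        if (!matcher.find()) {
            return null;
        }
        String[] parts = matcher.group(1).split("\\.");
        int[] version = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            version[i] = Integer.parseInt(parts[i]);
        }
        return version;
    }

    // Compares component by component, treating missing components as zero.
    static boolean isAtLeast(int[] actual, int[] minimum) {
        int length = Math.max(actual.length, minimum.length);
        for (int i = 0; i < length; i++) {
            int a = i < actual.length ? actual[i] : 0;
            int b = i < minimum.length ? minimum[i] : 0;
            if (a != b) {
                return a > b;
            }
        }
        return true;
    }

    static boolean supportsRequiredCalls(String libraryPath) {
        int[] version = parse(libraryPath);
        return version != null && isAtLeast(version, MINIMUM);
    }

    public static void main(String[] args) {
        System.out.println(supportsRequiredCalls("/usr/lib/libtesseract.so.3.0.2")); // false
        System.out.println(supportsRequiredCalls("/usr/lib/libtesseract.so.3.0.4")); // true
    }
}
```

A check like this would let the OCR path fail with a readable message instead of an UnsatisfiedLinkError, but it does not resolve the underlying question of which Tesseract version the CI image provides.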

Author

@jeremybmerrill this is the cause of the failures and my analysis of it. I don't know Travis super well, so I'm not sure how to fix it.

Member

@dan144 @dbangera23 I see two failures when I run the tests myself (the other error may just be Travis being weird), both in testBatchExtractor, just with assertions that fail. Can you guys look into it and see what's going on, then get the tests to pass? Thanks in advance!


@jeremybmerrill I fixed the issue with testBatchExtractor. It looks like, while making the changes, I forgot that we had formatting in place for batch. To stay in line with the rest of Tabula, BatchWriter now follows CSVWriter, which changed the output and no longer matched the expected output in testBatchExtractor.

<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.2.1</version>
</dependency>
</dependencies>

</project>