Docsplit 0.5.0

commit f70a607081f3683e075cd1171fc875f20c214015 1 parent 1df3cc6
@jashkenas jashkenas authored
4 docsplit.gemspec
@@ -1,7 +1,7 @@
Gem::Specification.new do |s|
s.name = 'docsplit'
- s.version = '0.4.1' # Keep version in sync with docsplit.rb
- s.date = '2010-8-23'
+ s.version = '0.5.0' # Keep version in sync with docsplit.rb
+ s.date = '2010-10-18'
s.homepage = "http://documentcloud.github.com/docsplit/"
s.summary = "Break Apart Documents into Images, Text, Pages and PDFs"
17 index.html
@@ -98,7 +98,7 @@
(title, author, number of pages...)
</p>
- <p>Docsplit is currently at <a href="http://rubygems.org/gems/docsplit">version 0.4.1</a>.</p>
+ <p>Docsplit is currently at <a href="http://rubygems.org/gems/docsplit">version 0.5.0</a>.</p>
<p>
<i>Docsplit is an open-source component of <a href="http://documentcloud.org/">DocumentCloud</a>.</i>
@@ -192,7 +192,7 @@ <h2 id="usage">Usage</h2>
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])</pre>
<p class="break">
- <b class="header">text</b><code>--pages --ocr --no-ocr</code>
+ <b class="header">text</b><code>--pages --ocr --no-ocr --no-clean</code>
<span class="alias">Ruby: <b>extract_text</b></span>
<br />
Extract the complete <b>UTF-8</b>-encoded plain text of a document to a
@@ -200,7 +200,9 @@ <h2 id="usage">Usage</h2>
pass <tt>--pages all</tt>. You can use the <tt>--ocr</tt> and <tt>--no-ocr</tt>
flags to force OCR, or disable it, respectively. By default (if Tesseract is installed)
Docsplit will OCR the text of each page for which it fails to extract text
- directly from the document.
+ directly from the document. Docsplit will also attempt to clean up garbage
+ characters in the OCR'd text &mdash; to disable this, pass the
+ <tt>--no-clean</tt> flag.
</p>
<pre>
docsplit text path/to/doc.pdf --pages all</pre>
@@ -210,7 +212,7 @@ <h2 id="usage">Usage</h2>
<p class="break">
<b class="header">pages</b><code>--pages</code>
- <span class="alias">Ruby: <b>extract_text</b></span>
+ <span class="alias">Ruby: <b>extract_pages</b></span>
<br />
Burst apart a document into single-page PDFs. Use <tt>--pages</tt> to
specify the individual pages (or ranges of pages) you'd like to generate.
@@ -280,6 +282,13 @@ <h2 id="internals">Internals</h2>
<h2 id="changes">Change Log</h2>
<p>
+ <b class="header">0.5.0</b><br />
+ Added a <tt>Docsplit::TextCleaner</tt> class which is used to post-process
+ OCR'd text, and remove garbage characters that are created when Tesseract
+ encounters non-english text. To disable the cleanup, pass <tt>--no-clean</tt>.
+ </p>
+
+ <p>
<b class="header">0.4.1</b><br />
Upgraded the JODConverter dependency for PDF conversion via OpenOffice to
3.0 beta. Added PNG, GIF, TIF, JPG, and BMP to the list of supported
2  lib/docsplit.rb
@@ -1,7 +1,7 @@
# The Docsplit module delegates to the Java PDF extractors.
module Docsplit
- VERSION = '0.4.1' # Keep in sync with gemspec.
+ VERSION = '0.5.0' # Keep in sync with gemspec.
ROOT = File.expand_path(File.dirname(__FILE__) + '/..')
8 lib/docsplit/text_cleaner.rb
@@ -1,3 +1,4 @@
+require 'iconv'
require 'strscan'
module Docsplit
@@ -19,11 +20,11 @@ class TextCleaner
SPACE = /\s+/
NEWLINE = /[\r\n]/
ALNUM = /[a-z0-9]/i
- PUNCT = /[^a-z0-9\s]/i
+ PUNCT = /[[:punct:]]/i
REPEAT = /([^0-9])\1{2,}/
UPPER = /[A-Z]/
LOWER = /[a-z]/
- ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/
+ ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,:]?$/
ALL_ALPHA = /^[a-z]+$/i
CONSONANT = /(^y|[bcdfghjklmnpqrstvwxz])/i
VOWEL = /([aeiou]|y$)/i
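
Note on the ACRONYM change in the hunk above: the only difference is that a trailing colon is now accepted after an acronym. The sketch below copies both patterns from the diff and compares them on a few sample words. The constant names, the helper, and the sample words are illustrative assumptions and do not appear in Docsplit.

```ruby
# The two ACRONYM patterns, copied from the removed and added lines above.
# The OLD_/NEW_ constant names are illustrative only.
OLD_ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/
NEW_ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,:]?$/

# Hypothetical helper: true when the whole word matches the pattern.
def acronym?(pattern, word)
  !(pattern =~ word).nil?
end

# Sample words chosen for illustration; they are not Docsplit test data.
WORDS = ['NOTE:', 'FAQs:', '(PHMSA),', 'U.S.']

WORDS.each do |word|
  old_hit = acronym?(OLD_ACRONYM, word)
  new_hit = acronym?(NEW_ACRONYM, word)
  puts format('%-10s old=%-5s new=%s', word, old_hit, new_hit)
end
```

Words ending in a colon, such as `NOTE:`, match only the new pattern, while words that already matched, such as `U.S.`, still match.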
@@ -33,8 +34,9 @@ class TextCleaner
SINGLETONS = /^[AaIi]$/
# For the time being, `clean` uses the regular StringScanner, and not the
- # multibyte-aware version.
+ # multibyte-aware version, coercing to ASCII first.
def clean(text)
+ text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
scanner = StringScanner.new(text)
cleaned = []
spaced = false
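
Note on the ASCII coercion added to `clean`: the commit converts the text to ASCII before handing it to `StringScanner`, so the single-byte patterns such as `PUNCT` only ever see ASCII input. Iconv is no longer part of the standard library in Ruby 2.0 and later, so the sketch below approximates the step with `String#encode`. This is an assumption-laden stand-in, not the commit's code: `coerce_to_ascii` is a hypothetical helper, and unlike Iconv's `//translit` it drops non-ASCII characters rather than substituting similar ASCII ones.

```ruby
# Hypothetical helper, not part of Docsplit. Approximates the Iconv call in
# the diff above by re-encoding UTF-8 text as US-ASCII and deleting any
# character that has no ASCII representation (no transliteration is done).
def coerce_to_ascii(text)
  text.encode('US-ASCII', 'UTF-8',
              :invalid => :replace, :undef => :replace, :replace => '')
end

# Same pattern as the new PUNCT constant, without the redundant /i flag.
PUNCT = /[[:punct:]]/

sample = "caf\u00e9, na\u00efve."   # illustrative input, not a Docsplit fixture
ascii  = coerce_to_ascii(sample)

puts ascii.ascii_only?          # the result contains only ASCII bytes
puts ascii.scan(PUNCT).inspect  # punctuation that survives the coercion
```

Because accented letters are removed here instead of transliterated, this stand-in loses more information than the Iconv call in the commit; it is only meant to show why the scanner can stay single-byte.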
14 test/fixtures/corrosion/corrosion_2.txt
@@ -6,8 +6,8 @@ the inhibitors in mitigating internal corrosion; and
calendar year, but with intervals not exceeding 7 1/2 months.
Internal corrosion monitoring was discontinued on the five hydrogen permeation monitors
(Beta oils) installed on Line 6B. Two monitors were discontinued in
-May 2006. One gated monitor was discontinued in January 2006, and the
-other two gated monitors were discontinued in October 2007. Enbridge
+May 2006. One remotely-interro gated monitor was discontinued in January 2006, and the
+other two remotely-interro gated monitors were discontinued in October 2007. Enbridge
representatives stated the monitoring was discontinued due to
"communication/instrumentation problems."
Enbridge is in the process of implementing an alternative method of internal corrosion
@@ -17,11 +17,11 @@ the first half of 2010. In the interim, Enbridge provided the following informat
demonstration that the internal corrosion threat is being properly managed:
a comprehensive report related to the internal corrosion mitigation and
monitoring program for their heavy oil pipeline system
-repair sleeve installations (which require circumferential non-destructive
+ repair sleeve installations (which require circumferential non-destructive
testing)
-inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
+ inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
ultrasonic inspection of the trap floor between the 5:00 and 7:00 positions)
-detailed pipe examinations at in-line inspection indications
+ detailed pipe examinations at in-line inspection indications
records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
which sees only fluid flow from Line 6B
The information provided does not demonstrate compliance with the above regulation. Line
@@ -30,10 +30,10 @@ several years. As required by Line 6B must have coupons or other monitoring
equipment to determine the effectiveness of the inhibitor program, and the coupons or other
monitoring equipment nlust be examined at least twice each calendar year, at intervals not to
exceed 7-l/2 months. PHMSA acknowledges the positive steps being taken to improve
-internal corrosion mitigation and monitoring program. However, the transition
+Enbridge's internal corrosion mitigation and monitoring program. However, the transition
from one technology to another must be implemented in a manner that ensures continued
compliance with the regulations.
-Under 49 United States Code, 60122, you are subject to a civil penalty not to exceed
+Under 49 United States Code, SS 60122, you are subject to a civil penalty not to exceed
$100,000 for each violation for each day the violation persists up to a maximum of $1,000,000
for any related series of violations. We have reviewed the circumstances and supporting
documents involved in this case, and have decided not to conduct additional enforcement