
Updating tests for OCR cleaning

1 parent 41e257a, commit 1df3cc6643fc33e621b607d55414add8f9ce30f4, jashkenas committed Oct 18, 2010
16 lib/docsplit/text_cleaner.rb
@@ -20,10 +20,10 @@ class TextCleaner
NEWLINE = /[\r\n]/
ALNUM = /[a-z0-9]/i
PUNCT = /[^a-z0-9\s]/i
- REPEAT = /(.)\1{2,}/
+ REPEAT = /([^0-9])\1{2,}/
UPPER = /[A-Z]/
LOWER = /[a-z]/
- ACRONYM = /^\(?[A-Z]+('?s|[.,])?\)?$/
+ ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/
ALL_ALPHA = /^[a-z]+$/i
CONSONANT = /(^y|[bcdfghjklmnpqrstvwxz])/i
VOWEL = /([aeiou]|y$)/i
@@ -55,14 +55,16 @@ def clean(text)
# Is a given word OCR garbage?
def garbage(w)
- # More than 20 bytes in length.
- (w.length > 20) ||
+ acronym = w =~ ACRONYM
+
+ # More than 30 bytes in length.
+ (w.length > 30) ||
# If there are three or more identical characters in a row in the string.
(w =~ REPEAT) ||
# More punctuation than alpha numerics.
- (w.scan(ALNUM).length < w.scan(PUNCT).length) ||
+ (!acronym && (w.scan(ALNUM).length < w.scan(PUNCT).length)) ||
# Ignoring the first and last characters in the string, if there are three or
# more different punctuation characters in the string.
@@ -73,14 +75,14 @@ def garbage(w)
# Number of uppercase letters greater than lowercase letters, but the word is
# not all uppercase + punctuation.
- ((w.scan(UPPER).length > w.scan(LOWER).length) && (w !~ ACRONYM)) ||
+ (!acronym && (w.scan(UPPER).length > w.scan(LOWER).length)) ||
# Single letters that are not A or I.
(w.length == 1 && (w =~ ALL_ALPHA) && (w !~ SINGLETONS)) ||
# All characters are alphabetic and there are 8 times more vowels than
# consonants, or 8 times more consonants than vowels.
- ((w.length > 2 && (w =~ ALL_ALPHA) && (w !~ ACRONYM)) &&
+ (!acronym && (w.length > 2 && (w =~ ALL_ALPHA)) &&
(((vows = w.scan(VOWEL).length) > (cons = w.scan(CONSONANT).length) * 8) ||
(cons > vows * 8)))
end
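A minimal sketch of what the two regex changes in this file do, using the old and new patterns copied from the hunk above. The constant names with OLD_/NEW_ prefixes, the `hit?` helper, and the sample words are assumptions made for illustration and are not part of docsplit.

```ruby
# Patterns copied from the removed (-) and added (+) lines of the diff.
OLD_REPEAT  = /(.)\1{2,}/
NEW_REPEAT  = /([^0-9])\1{2,}/
OLD_ACRONYM = /^\(?[A-Z]+('?s|[.,])?\)?$/
NEW_ACRONYM = /^\(?[A-Z0-9\.]+('?s)?\)?[.,]?$/

# Hypothetical helper: true when the pattern matches anywhere in the word.
def hit?(pattern, word)
  !(word =~ pattern).nil?
end

# Sample words, some taken from the corrosion fixtures.
words = ['$1,000,000', 'Helllo', 'U.S.', '6B', '(PHMSA)']

words.each do |w|
  puts format('%-12s repeat: %-5s -> %-5s  acronym: %-5s -> %-5s',
              w,
              hit?(OLD_REPEAT, w), hit?(NEW_REPEAT, w),
              hit?(OLD_ACRONYM, w), hit?(NEW_ACRONYM, w))
end
```

Under these patterns, a run of repeated digits such as the zeros in '$1,000,000' no longer counts as a repeat, while repeated letters still do. Dotted or alphanumeric tokens such as 'U.S.' and '6B' now match ACRONYM, which exempts them from the punctuation and uppercase checks in `garbage`.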
2 lib/docsplit/text_extractor.rb
@@ -118,7 +118,7 @@ def extract_options(options)
@pages = options[:pages]
@force_ocr = options[:ocr] == true
@forbid_ocr = options[:ocr] == false
- @clean_ocr = options[:clean]
+ @clean_ocr = !(options[:clean] == false)
end
end
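The one-line change above appears to make OCR cleaning opt-out rather than opt-in. A small sketch of the difference, where `old_clean_flag` and `new_clean_flag` are hypothetical stand-ins for the two versions of the assignment and not methods in docsplit:

```ruby
# Hypothetical stand-in for the removed line: the flag is whatever was passed,
# so an omitted :clean option yields nil (falsy).
def old_clean_flag(options)
  options[:clean]
end

# Hypothetical stand-in for the added line: the flag is true unless :clean is
# explicitly false, so an omitted or nil option now enables cleaning.
def new_clean_flag(options)
  !(options[:clean] == false)
end

[{}, { :clean => true }, { :clean => false }, { :clean => nil }].each do |opts|
  puts "#{opts.inspect}: old=#{old_clean_flag(opts).inspect} new=#{new_clean_flag(opts).inspect}"
end
```

Only an explicit `:clean => false` turns cleaning off in the new form, which would explain why the OCR fixtures in this commit were regenerated with cleaned text.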
12 test/fixtures/corrosion/corrosion_1.txt
@@ -1,26 +1,26 @@
-©
+
U.S. Deponmem 901 rachel Street, suns 452
of Tronsportotion Kansas City, Mo 64106-2641
Pipeline una
Hazardous Materials Safety
Administration
WARNING LETTER
-CERTIFIED MAIL - RETURN RECEIPT RE! QUESTED
+CERTIFIED MAIL RETURN RECEIPT QUESTED
January 21, 2010
Mr. Terry McGill, President
-Enbridge Energy Paitners, L,P,
+Enbridge Energy Paitners,
1100 Louisiana, Suite 3300
Houston, Texas 77002
-. CPF 3-2010-5002W
+. CPF
Dear Mr. McGill:
On October 6-8, 2008, October 28, 2008, and January 21-22, 2009, a representative of the
Pipeline and Hazardous Materials Safety Administration (PHMSA) pursuant to Chapter 601 of
49 United States Code inspected your facilities associated with the Griffith Unit in Griffith,
Indiana, and surrounding locations
As a result ofthe inspection, it appears that you have committed a probable violation of the
Pipeline Safety Regulations, Title 49, Code of Federal Regulations. The items inspected and
-the probable violation(s) are:
+the probable violation(s) are:
1. 195.579 What must I do to mitigate internal corrosion?
-(b) Inhibitors. If you use corrosion inhibitors to mitigate internal corrosion, you
+Inhibitors. If you use corrosion inhibitors to mitigate internal corrosion, you
must--
24 test/fixtures/corrosion/corrosion_2.txt
@@ -5,35 +5,35 @@ the inhibitors in mitigating internal corrosion; and
(3) Examine the coupons or other monitoring equipment at least twice each
calendar year, but with intervals not exceeding 7 1/2 months.
Internal corrosion monitoring was discontinued on the five hydrogen permeation monitors
-(Beta F oils) installed on Line 6B. Two manuallydnterrogated monitors were discontinued in
-May 2006. One remotely—interro gated monitor was discontinued in January 2006, and the
-other two remotely—interro gated monitors were discontinued in October 2007. Enbridge
+(Beta oils) installed on Line 6B. Two monitors were discontinued in
+May 2006. One gated monitor was discontinued in January 2006, and the
+other two gated monitors were discontinued in October 2007. Enbridge
representatives stated the monitoring was discontinued due to
"communication/instrumentation problems."
Enbridge is in the process of implementing an alternative method of internal corrosion
monitoring on Line 6B utilizing a technology referred to as Electrical Resistance Tomography
(FSMPIT), however, it is not expected to be implemented on Line 6B until sometime during
the first half of 2010. In the interim, Enbridge provided the following information as
demonstration that the internal corrosion threat is being properly managed:
-a comprehensive report related to the internal corrosion mitigation and
+a comprehensive report related to the internal corrosion mitigation and
monitoring program for their heavy oil pipeline system
-¤ repair sleeve installations (which require circumferential non-destructive
+repair sleeve installations (which require circumferential non-destructive
testing)
-¤ inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
+inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
ultrasonic inspection of the trap floor between the 5:00 and 7:00 positions)
-¤ detailed pipe examinations at in-line inspection indications
-e records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
+detailed pipe examinations at in-line inspection indications
+records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
which sees only fluid flow from Line 6B
The information provided does not demonstrate compliance with the above regulation. Line
-6B has been subject to a batch chemical treatment program to inhibit internal corrosion for
-several years. As required by l95.579(b), Line 6B must have coupons or other monitoring
+6B has been subject to a batch chemical treatment program to inhibit internal corrosion for
+several years. As required by Line 6B must have coupons or other monitoring
equipment to determine the effectiveness of the inhibitor program, and the coupons or other
monitoring equipment nlust be examined at least twice each calendar year, at intervals not to
exceed 7-l/2 months. PHMSA acknowledges the positive steps being taken to improve
-Enbridge’s internal corrosion mitigation and monitoring program. However, the transition
+internal corrosion mitigation and monitoring program. However, the transition
from one technology to another must be implemented in a manner that ensures continued
compliance with the regulations.
-Under 49 United States Code, § 60122, you are subject to a civil penalty not to exceed
+Under 49 United States Code, 60122, you are subject to a civil penalty not to exceed
$100,000 for each violation for each day the violation persists up to a maximum of $1,000,000
for any related series of violations. We have reviewed the circumstances and supporting
documents involved in this case, and have decided not to conduct additional enforcement
2 test/fixtures/corrosion/corrosion_4.txt
@@ -2,7 +2,7 @@ action or penalty assessment proceedings at this time. We advise you to correct
identified in this letter. Failure to do so will result in Enbridge being subject to additional
enforcement action.
No reply to this letter is required. If` you choose to reply, in your correspondence please refer
-to CPF 3-2010-5002W. Be advised that all material you submit in response to this
+to CPF Be advised that all material you submit in response to this
enforcement action is subject to being made publicly available. If` you believe that any portion
of your responsive material qualifies for confidential treatment under 5 U.S.C, 552(b), along
with the complete original document you must provide a second copy of the doctunent with the
1 test/unit/test_extract_text.rb
@@ -38,6 +38,7 @@ def test_ocr_extraction
assert Dir["#{OUTPUT}/*.txt"].length == 4
4.times do |i|
file = "corrosion_#{i + 1}.txt"
+ # File.open("test/fixtures/corrosion/#{file}", "w+") {|f| f.write(File.read("#{OUTPUT}/#{file}")) }
assert File.read("#{OUTPUT}/#{file}") == File.read("test/fixtures/corrosion/#{file}")
end
end
