Adding transparent OCR via Tesseract to Docsplit, with per-page checks and --ocr and --no-ocr flags to force it.
1 parent dc5fdd2 commit 10378a72fde29fe231ad9c6a3b3f9df83fcd9f15 @jashkenas jashkenas committed Aug 4, 2010
@@ -2,41 +2,86 @@ module Docsplit
   class TextExtractor
+    NO_TEXT_DETECTED = /---------\n\Z/
+
+    OCR_FLAGS = '-density 200x200 -colorspace GRAY'
+
+    MIN_TEXT_PER_PAGE = 100 # bytes
+
+    def initialize
+      @tiffs_generated = false
+      @pages_to_ocr = []
+    end
+
     def extract(pdfs, opts)
       extract_options opts
       FileUtils.mkdir_p @output unless File.exists?(@output)
       [pdfs].flatten.each do |pdf|
-        if @pages
-          pages = (@pages == 'all') ? 1..Docsplit.extract_length(pdf) : @pages
-          pages.each {|page| extract_page(pdf, page) }
+        @pdf_name = File.basename(pdf, File.extname(pdf))
+        pages = (@pages == 'all') ? 1..Docsplit.extract_length(pdf) : @pages
+        if @force_ocr || (!@forbid_ocr && !contains_text?(pdf))
+          extract_from_ocr(pdf, pages)
         else
-          extract_full(pdf)
+          extract_from_pdf(pdf, pages)
+          extract_from_ocr(pdf, @pages_to_ocr) if !@forbid_ocr && !@pages_to_ocr.empty?
         end
       end
+      FileUtils.remove_entry_secure @tempdir if @tempdir
+    end
+
+    def contains_text?(pdf)
+      fonts = `pdffonts #{pdf} 2>&1`
+      !fonts.match(NO_TEXT_DETECTED)
+    end
+
+    def extract_from_pdf(pdf, pages)
+      return extract_full(pdf) unless pages
+      pages.each {|page| extract_page(pdf, page) }
+    end
+
+    def extract_from_ocr(pdf, pages)
+      @tempdir ||= Dir.mktmpdir
+      base_path = File.join(@output, @pdf_name)
+      if pages
+        run "gm convert +adjoin #{OCR_FLAGS} #{pdf} #{@tempdir}/#{@pdf_name}_%d.tif 2>&1" unless @tiffs_generated
+        @tiffs_generated = true
+        pages.each do |page|
+          run "tesseract #{@tempdir}/#{@pdf_name}_#{page - 1}.tif #{base_path}_#{page} 2>&1"
+        end
+      else
+        tiff = "#{@tempdir}/#{@pdf_name}.tif"
+        run "gm convert #{OCR_FLAGS} #{pdf} #{tiff} 2>&1"
+        run "tesseract #{tiff} #{base_path} -l eng 2>&1"
+      end
     end
     private
-    def extract_full(pdf)
-      pdf_name = File.basename(pdf, File.extname(pdf))
-      text_path = File.join(@output, "#{pdf_name}.txt")
-      cmd = "pdftotext -enc UTF-8 #{pdf} #{text_path} 2>&1"
-      result = `#{cmd}`.chomp
+    def run(command)
+      result = `#{command}`
       raise ExtractionFailed, result if $? != 0
+      result
+    end
+
+    def extract_full(pdf)
+      text_path = File.join(@output, "#{@pdf_name}.txt")
+      run "pdftotext -enc UTF-8 #{pdf} #{text_path} 2>&1"
     end
     def extract_page(pdf, page)
-      pdf_name = File.basename(pdf, File.extname(pdf))
-      text_path = File.join(@output, "#{pdf_name}_#{page}.txt")
-      cmd = "pdftotext -enc UTF-8 -f #{page} -l #{page} #{pdf} #{text_path} 2>&1"
-      result = `#{cmd}`.chomp
-      raise ExtractionFailed, result if $? != 0
+      text_path = File.join(@output, "#{@pdf_name}_#{page}.txt")
+      run "pdftotext -enc UTF-8 -f #{page} -l #{page} #{pdf} #{text_path} 2>&1"
+      unless @forbid_ocr
+        @pages_to_ocr.push(page) if File.read(text_path).length < MIN_TEXT_PER_PAGE
+      end
     end
     def extract_options(options)
-      @output = options[:output] || '.'
-      @pages = options[:pages]
+      @output = options[:output] || '.'
+      @pages = options[:pages]
+      @force_ocr = options[:ocr] == true
+      @forbid_ocr = options[:ocr] == false
     end
   end
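
The branch in `extract` and the per-page threshold in `extract_page` can be read as two small pure functions. The following is only a sketch under that reading: `extraction_mode` and `pages_needing_ocr` are hypothetical names that do not exist in Docsplit, and the `contains_text` argument stands in for the result of the `pdffonts` check.

```ruby
# Sketch only: hypothetical helpers that mirror the decision logic in this
# commit without shelling out to pdffonts, pdftotext, gm, or tesseract.

MIN_TEXT_PER_PAGE = 100 # same threshold the commit introduces

# ocr_option plays the role of options[:ocr]: true forces OCR, false forbids
# it, and nil leaves the choice to the font check (contains_text).
def extraction_mode(ocr_option, contains_text)
  force_ocr  = ocr_option == true
  forbid_ocr = ocr_option == false
  return :ocr if force_ocr || (!forbid_ocr && !contains_text)
  :pdf
end

# page_texts maps page numbers to the text extracted for each page.
# Pages shorter than MIN_TEXT_PER_PAGE are queued for OCR unless OCR has
# been forbidden. Note that String#length counts characters on Ruby 1.9+,
# while String#bytesize would count bytes.
def pages_needing_ocr(page_texts, ocr_option)
  return [] if ocr_option == false
  page_texts.select { |_page, text| text.length < MIN_TEXT_PER_PAGE }.keys.sort
end

puts extraction_mode(nil, false)   # scanned PDF, no flag given
puts extraction_mode(false, false) # scanned PDF, OCR forbidden
p pages_needing_ocr({ 1 => 'x' * 500, 2 => '', 3 => 'short' }, nil)
```

Read this way, the three states of `options[:ocr]` are distinct: only an explicit `false` suppresses both the whole-document OCR path and the per-page fallback.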
Binary file not shown.
@@ -0,0 +1,26 @@
+U.S. Deponmem 901 rachel Street, suns 452
+of Tronsportotion Kansas City, Mo 64106-2641
+Pipeline una
+Hazardous Materials Safety
+Administration
+WARNING LETTER
+CERTIFIED MAIL - RETURN RECEIPT RE! QUESTED
+January 21, 2010
+Mr. Terry McGill, President
+Enbridge Energy Paitners, L,P,
+1100 Louisiana, Suite 3300
+Houston, Texas 77002
+. CPF 3-2010-5002W
+Dear Mr. McGill:
+On October 6-8, 2008, October 28, 2008, and January 21-22, 2009, a representative of the
+Pipeline and Hazardous Materials Safety Administration (PHMSA) pursuant to Chapter 601 of
+49 United States Code inspected your facilities associated with the Griffith Unit in Griffith,
+Indiana, and surrounding locations
+As a result ofthe inspection, it appears that you have committed a probable violation of the
+Pipeline Safety Regulations, Title 49, Code of Federal Regulations. The items inspected and
+the probable violation(s) are: ’
+1. 195.579 What must I do to mitigate internal corrosion?
+(b) Inhibitors. If you use corrosion inhibitors to mitigate internal corrosion, you
+must--
+
@@ -0,0 +1,41 @@
+(1) Use inhibitors in sufficient quantity to protect the entire part ofthe pipeline
+system that the inhibitors are designed to protect;
+(2) Use coupons or other monitoring equipment to determine the effectiveness of
+the inhibitors in mitigating internal corrosion; and
+(3) Examine the coupons or other monitoring equipment at least twice each
+calendar year, but with intervals not exceeding 7 1/2 months.
+Internal corrosion monitoring was discontinued on the five hydrogen permeation monitors
+(Beta F oils) installed on Line 6B. Two manuallydnterrogated monitors were discontinued in
+May 2006. One remotely—interro gated monitor was discontinued in January 2006, and the
+other two remotely—interro gated monitors were discontinued in October 2007. Enbridge
+representatives stated the monitoring was discontinued due to
+"communication/instrumentation problems."
+Enbridge is in the process of implementing an alternative method of internal corrosion
+monitoring on Line 6B utilizing a technology referred to as Electrical Resistance Tomography
+(FSMPIT), however, it is not expected to be implemented on Line 6B until sometime during
+the first half of 2010. In the interim, Enbridge provided the following information as
+demonstration that the internal corrosion threat is being properly managed:
+• a comprehensive report related to the internal corrosion mitigation and
+monitoring program for their heavy oil pipeline system
+¤ repair sleeve installations (which require circumferential non-destructive
+testing)
+¤ inspection ofthe Line 6B Pig Sending Trap at Griffith Station (which included
+ultrasonic inspection of the trap floor between the 5:00 and 7:00 positions)
+¤ detailed pipe examinations at in-line inspection indications
+e records for a weight loss coupon at the Stockbridge Ptunping Station (Line 17),
+which sees only fluid flow from Line 6B
+The information provided does not demonstrate compliance with the above regulation. Line
+6B has been subject to a batch chemical treatment program to inhibit internal corrosion for ‘
+several years. As required by l95.579(b), Line 6B must have coupons or other monitoring
+equipment to determine the effectiveness of the inhibitor program, and the coupons or other
+monitoring equipment nlust be examined at least twice each calendar year, at intervals not to
+exceed 7-l/2 months. PHMSA acknowledges the positive steps being taken to improve
+Enbridge’s internal corrosion mitigation and monitoring program. However, the transition
+from one technology to another must be implemented in a manner that ensures continued
+compliance with the regulations.
+Under 49 United States Code, § 60122, you are subject to a civil penalty not to exceed
+$100,000 for each violation for each day the violation persists up to a maximum of $1,000,000
+for any related series of violations. We have reviewed the circumstances and supporting
+documents involved in this case, and have decided not to conduct additional enforcement
+2
+
@@ -0,0 +1,5 @@
+Promote Cultural Diplomacy: American artists, performers and thinkers – representing our values and ideals – can inspire people both at home and all over the world. Through efforts like that of the United States Information Agency, America’s cultural leaders were deployed around the world during the Cold War as artistic ambassadors and helped win the war of ideas by demonstrating to the world the promise of America. Artists can be utilized again to help us win the war of ideas against Islamic extremism. Unfortunately, our resources for cultural diplomacy are at their lowest level in a decade. Barack Obama will work to reverse this trend and improve and expand public-private partnerships to expand cultural and arts exchanges throughout the world. Attract Foreign Talent: The flipside to promoting American arts and culture abroad is welcoming members of the foreign arts community to America. Opening America’s doors to students and professional artists provides the kind of two-way cultural understanding that can break down the barriers that feed hatred and fear. As America tightened visa restrictions after 9/11, the world’s most talented students and artists, who used to come here, went elsewhere. Barack Obama will streamline the visa process to return America to its rightful place as the world’s top destination for artists and art students. Provide Health Care to Artists: Finding affordable health coverage has often been one of the most vexing obstacles for artists and those in the creative community. Since many artists work independently or have nontraditional employment relationships, employer-based coverage is unavailable and individual policies are financially out of reach. Barack Obama’s plan will provide all Americans with quality, affordable health care. His plan includes the creation of a new public program that will allow individuals and small businesses to buy affordable health care similar to that available to federal employees. 
His plan also creates a National Health Insurance Exchange to reform the private insurance market and allow Americans to enroll in participating private plans, which would have to provide comprehensive benefits, issue every applicant a policy, and charge fair and stable premiums. For those who still cannot afford coverage, the government will provide a subsidy. His health plan will lower costs for the typical American family by up to $2,500 per year. Ensure Tax Fairness for Artists: Barack Obama supports the Artist-Museum Partnership Act, introduced by Senator Patrick Leahy (D-VT). The Act amends the Internal Revenue Code to allow artists to deduct the fair market value of their work, rather than just the costs of the materials, when they make charitable contributions.
+
+Paid for by Obama for America
+
+
@@ -0,0 +1,16 @@
+action or penalty assessment proceedings at this time. We advise you to correct the item
+identified in this letter. Failure to do so will result in Enbridge being subject to additional
+enforcement action.
+No reply to this letter is required. If` you choose to reply, in your correspondence please refer
+to CPF 3-2010-5002W. Be advised that all material you submit in response to this
+enforcement action is subject to being made publicly available. If` you believe that any portion
+of your responsive material qualifies for confidential treatment under 5 U.S.C, 552(b), along
+with the complete original document you must provide a second copy of the doctunent with the
+portions you believe qualify for confidential treatment redacted and an explanation of why you
+believe the redacted information qualifies for confidential treatment under 5 U.S.C, 552(b).
+Sincerely,
+Ivan A. Huntoon
+Director, Central Region
+Pipeline and Hazardous Materials Safety Administration
+3
+
@@ -33,6 +33,15 @@ def test_unicode_extraction
     assert Dir["#{OUTPUT}/*.txt"].length == 3
   end
+  def test_ocr_extraction
+    Docsplit.extract_text('test/fixtures/corrosion.pdf', :pages => 'all', :output => OUTPUT)
+    assert Dir["#{OUTPUT}/*.txt"].length == 4
+    4.times do |i|
+      file = "corrosion_#{i + 1}.txt"
+      assert File.read("#{OUTPUT}/#{file}") == File.read("test/fixtures/corrosion/#{file}")
+    end
+  end
+
   def test_password_protected
     assert_raises(ExtractionFailed) do
       Docsplit.extract_text('test/fixtures/completely_encrypted.pdf')
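
`test_ocr_extraction` passes no `:ocr` option, so it depends on `contains_text?` reporting that `corrosion.pdf` has no fonts. A rough sketch of that check follows. The two sample strings are assumptions about what `pdffonts` prints (a header, a dashed separator, then one row per font), not captured output, and `text_detected?` is a hypothetical stand-in for `contains_text?` that takes the output as a string instead of running the command.

```ruby
# Same pattern as the commit: the output ends right after a dashed separator,
# which is what a font table with no rows would look like.
NO_TEXT_DETECTED = /---------\n\Z/

# Hypothetical stand-in for contains_text? that takes pdffonts output directly.
def text_detected?(pdffonts_output)
  !pdffonts_output.match(NO_TEXT_DETECTED)
end

# Assumed shape of pdffonts output; real column names and widths may differ.
HEADER = "name            type       emb sub uni object ID\n" \
         "--------------- ---------- --- --- --- ---------\n"

scanned = HEADER # no font rows at all
digital = HEADER + "ABCDEF+Helvetica Type 1C    yes yes no       12  0\n"

puts text_detected?(scanned)
puts text_detected?(digital)
```

Under this pattern, any output that does not end with the separator counts as containing text. That would include an error message captured through `2>&1`, in which case the `pdftotext` path is the one that gets a chance to raise `ExtractionFailed`.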