Get all of the tests passing again, and a rake task for starting openoffice.
commit 99312674141b7ef9ad1461207a05fad97f6f79d4 1 parent af26b9a
jashkenas authored
13 Rakefile
@@ -9,16 +9,9 @@ task :test do
Dir['test/*/**/test_*.rb'].each {|test| require test }
end
-desc 'Clean the compiled Java classes'
-task :clean do
- FileUtils.rm_r('build') if File.exists?('build')
- Dir.mkdir('build')
-end
-
-desc 'Build all Java command-line clients'
-task :build => :clean do
- sh "javac -cp vendor/'*' -d build -Xlint -Xlint:-path lib/docsplit/*.java"
- sh "sudo chmod -R 755 build"
+desc 'Launch OpenOffice for testing'
+task :openoffice do
+ sh "/Applications/OpenOffice.org.app/Contents/MacOS/soffice.bin soffice -headless -accept=\"socket,host=127.0.0.1,port=8100;urp;\" -nofirststartwizard"
end
namespace :gem do
3  index.html
@@ -259,7 +259,8 @@ <h2 id="changes">Change Log</h2>
<p>
<b class="header">0.2.1</b><br />
- TODO
+ The Java dependency on PDFBox has been removed, in favor of a runtime
+ dependency on ... TODO
</p>
<p>
2  lib/docsplit/info_extractor.rb
@@ -16,7 +16,7 @@ class InfoExtractor
def extract(key, pdfs, opts)
pdf = [pdfs].flatten.first
- cmd = "pdfinfo #{pdf}"
+ cmd = "pdfinfo #{pdf} 2>&1"
result = `#{cmd}`.chomp
raise ExtractionFailed, result if $? != 0
match = result.match(MATCHERS[key])
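
Note on the `2>&1` change: Ruby backticks capture only standard output, so without the redirect the message passed to `ExtractionFailed` would usually be empty when a tool reports its error on standard error. A minimal sketch of the pattern, assuming a POSIX shell; `run_and_capture` and `CommandFailed` are hypothetical names used for illustration and are not part of Docsplit:

```ruby
# Hypothetical error class standing in for Docsplit's ExtractionFailed.
class CommandFailed < StandardError; end

# Hypothetical helper: run a shell command, folding stderr into the
# captured output, and raise with that output on a non-zero exit.
def run_and_capture(cmd)
  # Backticks return stdout only; "2>&1" sends stderr to the same stream.
  output = `#{cmd} 2>&1`.chomp
  raise CommandFailed, output unless $?.success?
  output
end

puts run_and_capture("echo ok")

begin
  # The inner shell writes to stderr and exits with status 3.
  run_and_capture(%q(sh -c "echo bad input >&2; exit 3"))
rescue CommandFailed => e
  puts "failed with: #{e.message}"
end
```

The redirect applies to the last simple command on the line, which is why the failing example wraps its two steps in a single `sh -c` invocation.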
2  lib/docsplit/page_extractor.rb
@@ -10,7 +10,7 @@ def extract(pdfs, opts)
pdf_name = File.basename(pdf, File.extname(pdf))
page_path = File.join(@output, "#{pdf_name}_%d.pdf")
FileUtils.mkdir_p @output unless File.exists?(@output)
- cmd = "pdftk #{pdf} burst output #{page_path}"
+ cmd = "pdftk #{pdf} burst output #{page_path} 2>&1"
result = `#{cmd}`.chomp
FileUtils.rm('doc_data.txt') if File.exists?('doc_data.txt')
raise ExtractionFailed, result if $? != 0
4 lib/docsplit/text_extractor.rb
@@ -14,7 +14,7 @@ def extract(pdfs, opts)
extract_page pdf, page, pdf_name
end
else
- cmd = "pdftotext -enc UTF-8 #{pdf} #{text_path}"
+ cmd = "pdftotext -enc UTF-8 #{pdf} #{text_path} 2>&1"
result = `#{cmd}`.chomp
raise ExtractionFailed, result if $? != 0
end
@@ -23,7 +23,7 @@ def extract(pdfs, opts)
def extract_page(pdf, page, pdf_name)
text_path = File.join(@output, "#{pdf_name}_#{page}.txt")
- cmd = "pdftotext -enc UTF-8 -f #{page} -l #{page} #{pdf} #{text_path}"
+ cmd = "pdftotext -enc UTF-8 -f #{page} -l #{page} #{pdf} #{text_path} 2>&1"
result = `#{cmd}`.chomp
raise ExtractionFailed, result if $? != 0
result
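
Aside on the command strings: the paths are interpolated into the shell command unescaped, so a file name containing spaces or shell metacharacters would likely be split or misread by the shell. This commit does not address that. One possible approach, sketched here with a hypothetical `pdftotext_command` helper, is to escape each path with `Shellwords.escape` from the Ruby standard library:

```ruby
require 'shellwords'

# Hypothetical helper, not part of Docsplit: build the pdftotext command
# line with each path escaped for the shell.
def pdftotext_command(pdf, text_path)
  "pdftotext -enc UTF-8 #{Shellwords.escape(pdf)} #{Shellwords.escape(text_path)} 2>&1"
end

# Only builds and prints the string; it does not invoke pdftotext.
puts pdftotext_command("my report.pdf", "out/my report.txt")
```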
5 test/unit/test_extract_pages.rb
@@ -7,11 +7,6 @@ def test_multi_page_extraction
assert Dir["#{OUTPUT}/*.pdf"].length == 2
end
- def test_single_page_extraction
- Docsplit.extract_pages('test/fixtures/encrypted.pdf', :output => OUTPUT)
- assert Dir["#{OUTPUT}/*.pdf"].length == 1
- end
-
def test_password_protected
assert_raises(ExtractionFailed) do
Docsplit.extract_pages('test/fixtures/completely_encrypted.pdf')
23 test/unit/test_extract_text.rb
@@ -3,26 +3,15 @@
class ExtractTextTest < Test::Unit::TestCase
- FULL_TEXT = <<-EOTEXT
-Gem::Specification.new do |s| s.name = 'pdf-pieces' s.version = '0.1.0' s.date = '2009-11-29'
-
-# Keep version in sync with jammit.rb
-
-s.homepage = "http://documentcloud.github.com/pdf-pieces/" s.summary = "" s.description = <<-EOS EOS s.authors = ['Jeremy Ashkenas'] s.email = 'jeremy@documentcloud.org' s.rubyforge_project = 'pdf-pieces' s.require_paths s.executables s.has_rdoc s.extra_rdoc_files s.rdoc_options = ['lib'] = ['pdf-pieces'] = true = ['README'] << '--title' '--exclude' '--main' '--all'
-
-<< 'PDF Pieces' << << 'test' << << 'README' <<
-
-s.files = Dir['build/*', 'lib/**/*', 'bin/*', 'vendor/*', 'pdf-pieces.gemspec', 'LICENSE', 'README'] end
-EOTEXT
-
- def test_full_text_extraction
- Docsplit.extract_text('test/fixtures/encrypted.pdf', :output => OUTPUT)
- assert FULL_TEXT.strip == File.read("#{OUTPUT}/encrypted.txt").strip
- end
+ FULL_TEXT = <<-EOS
+BARACK OBAMA: A CHAMPION FOR ARTS AND CULTURE
+Our nation’s creativity has filled the world’s libraries, museums, recital halls, movie houses, and marketplaces with works of genius. The arts embody the American spirit of self-definition. As the author of two best-selling books – Dreams from My Father and The Audacity of Hope – Barack Obama uniquely appreciates the role and value of creative expression. A PLATFORM IN SUPPORT OF THE ARTS Reinvest in Arts Education: To remain competitive in the global economy, America needs to reinvigorate the kind of creativity and innovation that has made this country great. To do so, we must nourish our children’s creative skills. In addition to giving our children the science and math skills they need to compete in the new global context, we should also encourage the ability to think creatively that comes from a meaningful arts education. Unfortunately, many school districts are cutting instructional time for art and music education. Barack Obama believes that the arts should be a central part of effective teaching and learning. The Chairman of the National Endowment for the Arts recently said “The purpose of arts education is not to produce more artists, though that is a byproduct. The real purpose of arts education is to create complete human beings capable of leading successful and productive lives in a free society.” To support greater arts education, Obama will: Expand Public/Private Partnerships Between Schools and Arts Organizations: Barack Obama will increase resources for the U.S. Department of Education’s Arts Education Model Development and Dissemination Grants, which develop public/private partnerships between schools and arts organizations. Obama will also engage the foundation and corporate community to increase support for public/private partnerships. Create an Artist Corps: Barack Obama supports the creation of an “Artists Corps” of young artists trained to work in low-income schools and their communities. 
Studies in Chicago have demonstrated that test scores improved faster for students enrolled in low-income schools that link arts across the curriculum than scores for students in schools lacking such programs. Publicly Champion the Importance of Arts Education: As president, Barack Obama will use the bully pulpit and the example he will set in the White House to promote the importance of arts and arts education in America. Not only is arts education indispensable for success in a rapidly changing, high skill, information economy, but studies show that arts education raises test scores in other subject areas as well. Support Increased Funding for the NEA: Over the last 15 years, government funding for the National Endowment for the Arts has been slashed from $175 million annually in 1992 to $125 million today. Barack Obama supports increased funding for the NEA, the support of which enriches schools and neighborhoods all across the nation and helps to promote the economic development of countless communities. Paid for by Obama for America
+ EOS
def test_paged_extraction
Docsplit.extract_text('test/fixtures/obama_arts.pdf', :pages => 'all', :output => OUTPUT)
assert Dir["#{OUTPUT}/*.txt"].length == 2
+ assert File.read("#{OUTPUT}/obama_arts_1.txt").strip == FULL_TEXT.strip
end
def test_page_only_extraction
@@ -43,7 +32,7 @@ def test_unicode_extraction
Docsplit.extract_text('test/fixtures/unicode.pdf', :pages => 'all', :output => OUTPUT)
assert Dir["#{OUTPUT}/*.txt"].length == 3
end
-
+
def test_password_protected
assert_raises(ExtractionFailed) do
Docsplit.extract_text('test/fixtures/completely_encrypted.pdf')