Updated for ruby 1.9.x #4

Merged
merged 8 commits into from

2 participants

@masom

Hello

This is an updated ruby-tesseract for 1.9.x

The gemspec has been fixed to properly reference the lib folder.
The file list no longer includes development-related files.

Before calling the convert utility, the argument list is escaped using Shellwords.

The temporary text file is now cleaned up after processing.
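
To illustrate the Shellwords escaping described above, here is a minimal sketch. The helper name `build_convert_command` is hypothetical and not part of the gem; it escapes both paths, whereas the diff in this pull request escapes only the input image.

```ruby
require 'shellwords'

# Hypothetical helper (not the gem's API): builds a shell command string
# for ImageMagick's convert, escaping both file paths so that spaces or
# shell metacharacters in a file name cannot split or alter the command.
def build_convert_command(image_path, output_path, command = 'convert')
  [
    command,
    Shellwords.shellescape(image_path),
    Shellwords.shellescape(output_path)
  ].join(' ')
end

cmd = build_convert_command('my photo.jpg', '/tmp/out.tif')
# Shellwords.split reverses the escaping, recovering the original arguments.
args = Shellwords.split(cmd)
```

Without the escaping, a file named `my photo.jpg` would reach the shell as two separate arguments.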

@scottdavis
Owner

I like this pull request a lot, but I wouldn't mind seeing some new test cases for the options you added.

@scottdavis
Owner

Also, if you're interested in working on this more, I'll give you commit access. I only used it for a small side project and I don't work on it very often.

@masom

I'll add some more unit tests and take a look at the require statements; I was looking at how nokogiri and cover_me load their files.

I plan on using this gem quite a bit in the near future, if Tesseract can be properly trained. I wouldn't mind commit access, but I am just starting out with Ruby. I would rather have somebody take a look at the changes I make or propose :)

@scottdavis
Owner
@scottdavis scottdavis commented on the diff
tesseract.gemspec
((6 lines not shown))
s.description = %q{Ruby wrapper for google tesseract}
s.summary = %q{Ruby wrapper for google tesseract}
s.email = %q{jetviper21@gmail.com}
s.date = Date.today.to_s
- s.files = `git ls-files`.split("\n")
- s.executables = `git ls-files`.split("\n").map{|f| f =~ /^bin\/(.*)/ ? $1 : nil}.compact
@scottdavis Owner

Don't remove this; using git here is very common across gems.

@masom
masom added a note

Okay, I'll revise the git command. It was adding all the files to the gem, including ones that are not related to the gem.

@scottdavis Owner

You are right, this is a better approach. I forgot I did this in compass: https://github.com/chriseppstein/compass/blob/stable/compass.gemspec#L22-36
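
One way to keep development files out of the gem without listing every file by hand is a glob restricted to `lib`. This is only a sketch under that assumption: the helper `gem_files` is hypothetical, and the merged gemspec in this pull request uses an explicit array instead.

```ruby
# Hypothetical helper: returns only the Ruby sources under lib/, relative
# to the project root, so tests, Rakefile, and similar files are left out.
def gem_files(root)
  Dir.chdir(root) do
    Dir.glob('lib/**/*.rb').sort
  end
end

# Inside a gemspec this would typically be written as:
#   s.files = Dir.glob('lib/**/*.rb')
```

The trade-off against `git ls-files` is that the glob does not depend on git being available when the gem is built, but it will also pick up untracked files that happen to sit under `lib`.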

lib/tesseract/process.rb
((5 lines not shown))
attr_reader :image
- attr_accessor :lang
- CONVERT_COMMAND = 'convert'
- TESSERACT_COMMAND = 'tesseract'
@scottdavis Owner

Use these constants in the defaults below; it makes it less "magic".

tesseract.gemspec
((21 lines not shown))
+ s.required_ruby_version = '>= 1.9.0'
@scottdavis Owner

This is going to break backwards compatibility, and the gem works fine in both 1.8.7 and 1.9.x. I would like to keep 1.8.7 support.

@masom
masom added a note

Oh, it seems I was wrong about Shellwords being available only in 1.9. I will lower the requirement to the version in which Shellwords was added to the stdlib.
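
The effect of a `required_ruby_version` constraint can be checked directly with RubyGems' own classes. The sketch below uses `'>= 1.8.6'`, the value that appears in the final gemspec diff; the list of versions tested is only illustrative.

```ruby
require 'rubygems'

# The constraint string is the one from the merged gemspec.
requirement = Gem::Requirement.new('>= 1.8.6')

# Illustrative Ruby versions to test against the constraint.
results = %w[1.8.5 1.8.7 1.9.2].map do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

results.each do |version, allowed|
  puts "#{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

With this constraint both 1.8.7 and 1.9.x remain installable, which is what the review comment asks for.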

@masom

All tests pass (aside from the DependencyChecker tests; I never got the checks to fail).

@scottdavis
Owner

which OS are you on?

@masom

Fedora 16

@masom

1) Failure:
test: dependency imagemagic fails should throw exception. (TesseractTest) [/home/msamson/dev/ruby-tesseract/test/tesseract_test.rb:26]:
ImageMagick "convert" command not found! Make sure ImageMagick is installed and in the system path
Exception expected but nothing was raised.

2) Failure:
test: dependency os check fails windows should throw exception. (TesseractTest) [/home/msamson/dev/ruby-tesseract/test/tesseract_test.rb:11]:
Only Unix Based enviroments are supported Mac, Linux, etc.
Exception expected but nothing was raised.

3) Failure:
test: dependency tesseract fails should throw exception. (TesseractTest) [/home/msamson/dev/ruby-tesseract/test/tesseract_test.rb:37]:
"tesseract" command not found! Make sure tesseract is installed and in the system path
Exception expected but nothing was raised.

12 tests, 12 assertions, 3 failures, 0 errors, 0 skips

ruby 1.9.2p290 (2011-07-09 revision 32553) [i686-linux]

@scottdavis
Owner

And the tests for DependencyChecker aren't passing?

@masom

Both Ruby 1.8.7 and 1.9.2 fail the DependencyChecker tests. All others pass.

@scottdavis
Owner
@masom

x86_64-linux

Seems it has to do with the manipulation of the RUBY_PLATFORM constant.
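
If reassigning `RUBY_PLATFORM` in tests is the source of the failures, one possible alternative is to pass the platform string in as a parameter. This is a sketch only: the module `PlatformCheck`, its error class, and the list of supported patterns are assumptions for illustration and do not come from the gem.

```ruby
# Hypothetical alternative to reassigning RUBY_PLATFORM in tests:
# the platform string is an argument that defaults to the real constant.
module PlatformCheck
  # Assumed list of supported platforms, for illustration only.
  SUPPORTED = [/darwin/, /linux/].freeze

  class UnsupportedPlatform < StandardError; end

  # Returns true on a supported platform, raises otherwise.
  def self.check!(platform = RUBY_PLATFORM)
    return true if SUPPORTED.any? { |pattern| platform =~ pattern }
    raise UnsupportedPlatform, "Unsupported platform: #{platform}"
  end
end
```

A test can then call `PlatformCheck.check!('i386-mswin32')` and expect an exception without touching any global constant.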

@masom

Otherwise, the changes work and pass the tests on x86 and x86-64.

I'm looking into the DependencyChecker, but it was failing before my changes. It's not a big deal; the check should probably be removed and documented in the Readme instead of being run at runtime.

@scottdavis
Owner
@scottdavis
Owner
@masom

ruby-1.9.2-p290 :001 > system "derp"
=> nil
ruby-1.9.2-p290 :002 > system "ps"
PID TTY TIME CMD
6443 pts/0 00:00:00 bash
9613 pts/0 00:00:00 ruby
9938 pts/0 00:00:00 ps
=> true

system returns nil when the command could not be executed, so we can raise on that. I will submit an update.
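
The three return values of `Kernel#system` (`true`, `false`, `nil`) can be turned into exceptions with a small wrapper. The sketch below is an illustration under assumptions: `run!` and `CommandError` are hypothetical names, not part of the gem.

```ruby
# Hypothetical error class and wrapper, for illustration only.
class CommandError < RuntimeError; end

# Runs a command without going through a shell and raises when it could
# not be started (system returns nil) or exited non-zero (returns false).
def run!(command, *args)
  result = system(command, *args)
  raise CommandError, "`#{command}` could not be executed." if result.nil?
  raise CommandError, "`#{command}` exited with status #{$?.exitstatus}." unless result
  true
end
```

Note that when a single command string contains shell syntax such as a redirection, Ruby hands it to the shell, and a missing binary then typically shows up as `false` (the shell's non-zero exit status) rather than `nil`. That is one reason to check for both values.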

@scottdavis scottdavis merged commit c865813 into scottdavis:master
Commits on Dec 24, 2011
  1. @masom

    Gemspec fixes for 1.9

    masom authored
Commits on Dec 28, 2011
  1. @masom
  2. @masom

    Added options for convert.

    masom authored
  3. @masom

    Updated gemfile.

    masom authored
  4. @masom
  5. @masom

    Added unit tests for convert command. Modified where Shellwords is used to only wrap the input file.

    masom authored
  6. @masom

    updated readme.

    masom authored
  7. @masom

    All test passes

    masom authored
4 Rakefile
@@ -1,9 +1,9 @@
require 'rake'
require 'rake/testtask'
-require 'rake/rdoctask'
+require 'rdoc/task'
Rake::TestTask.new do |t|
t.libs << "tesseract"
t.libs << "test"
t.test_files = FileList['test/*_test.rb']
t.verbose = true
-end
+end
3  Readme.md
@@ -17,6 +17,5 @@ This is a library for using the tesseract OCR in ruby applications
Config options are also supported
- tess = Tesseract::Process.new("photo.jpg", {:lang => 'some language', :chop_enable => 0})
+ tess = Tesseract::Process.new("photo.jpg", {:lang => :fra, :tesseract_options => {:chop_enable => 0}})
tess.to_s
-
13 lib/tesseract.rb
@@ -1,10 +1,11 @@
-require 'tesseract/dependency_checker'
-require 'tesseract/file_handler'
-require 'tesseract/process'
+path = File.join(File.dirname(__FILE__), 'tesseract')
+['dependency_checker', 'file_handler', 'process'].each do |f|
+ require File.expand_path(File.join(path, f))
+end
require 'pathname'
require 'digest/md5'
-
+require 'shellwords'
module Tesseract
-
-end
+
+end
16 lib/tesseract/dependency_checker.rb
@@ -1,23 +1,23 @@
module Tesseract
- class DependencyChecker
+ class DependencyChecker
#putting these here so its easyer to test
IMAGE_MAGICK_ERROR = "ImageMagick \"convert\" command not found! Make sure ImageMagick is installed and in the system path"
TESSERACT_ERROR = "\"tesseract\" command not found! Make sure tesseract is installed and in the system path"
OS_ERROR = "Only Unix Based enviroments are supported Mac, Linux, etc."
-
+
def self.check!
check_os!
check_for_tesseract!
check_for_imagemagick!
true
end
-
+
private
#for easy mocking
def self.run_cmd(cmd)
`#{cmd}`
end
-
+
def self.check_os!
case ::RUBY_PLATFORM
when /darwin/
@@ -27,14 +27,14 @@ def self.check_os!
end
raise Exception, OS_ERROR
end
-
+
def self.check_for_imagemagick!
raise Exception, IMAGE_MAGICK_ERROR if run_cmd('which convert').empty?
end
-
+
def self.check_for_tesseract!
raise Exception, TESSERACT_ERROR if run_cmd('which tesseract').empty?
end
-
+
end
-end
+end
7 lib/tesseract/file_handler.rb
@@ -3,18 +3,15 @@
module Tesseract
class FileHandler
@tempfiles = []
-
def self.create_temp_file(filename)
file = Pathname.new(Dir::tmpdir).join(filename)
@tempfiles << file
return file
end
-
def self.cleanup!
@tempfiles.each do |file|
- File.unlink(file.to_s) if File.exists?(file.to_s)
+ File.unlink(file.to_s) if File.exists?(file.to_s)
end
end
-
end
-end
+end
114 lib/tesseract/process.rb
@@ -1,51 +1,123 @@
+require 'shellwords'
module Tesseract
class Process
+
attr_reader :image
- attr_accessor :lang
+
CONVERT_COMMAND = 'convert'
TESSERACT_COMMAND = 'tesseract'
-
+ # Initialize a Tesseract translation process
+ # image_name is the file to translate
+ # options can be of the following:
+ # * tesseract_options Hash of options for tesseract
+ # * convert_options Array of options for convert
+ # * lang Image input language (eng, fra, etc. )
+ # * convert_command Convert binary name/path
+ # * tesseract_command Tesseract binary name/path
+ # * check_deps Boolean value. If true, verifies dependencies. Defaults to false
def initialize(image_name, options = {})
- DependencyChecker.check!
+ defaults = {
+ :tesseract_options => {},
+ :convert_options => {:input => [], :output => []},
+ :lang => :eng,
+ :convert_command => CONVERT_COMMAND,
+ :tesseract_command => TESSERACT_COMMAND,
+ :check_deps => false
+ }
+ @out = nil
@image = Pathname.new(image_name)
@hash = Digest::MD5.hexdigest("#{@image}-#{Time.now}")
- @lang = options[:lang].nil? ? 'eng' : options.delete(:lang)
- @options = options
+
+ merge_options! defaults, options
+ DependencyChecker.check! if @options[:check_deps]
+ end
+
+ def merge_options!(defaults, options)
+ @options = {}
+
+ if options.has_key? :tesseract_options
+ @options[:tesseract_options] = defaults[:tesseract_options].merge!(options[:tesseract_options]) if options.has_key? :tesseract_options
+ end
+
+
+ if options.has_key? :convert_options
+ @options[:convert_options] = defaults[:convert_options]
+ defaults[:convert_options].each do |k,v|
+ next unless options[:convert_options].has_key? k
+ @options[:convert_options][k] = v | options[:convert_options][k]
+ end
+ options.delete :convert_options
+ end
+
+ [:tesseract_options, :convert_options].each do |k|
+ options.delete(k) if options.has_key? k
+ end
+ @options = defaults.merge options
+ end
+
+ def lang=(lang)
+ @options[:lang]
+ end
+ def lang
+ @options[:lang]
end
-
def to_s
@out ||= process!
end
-
+
+ # Process the image into text.
def process!
temp_image = to_tiff
- text = tesseract_translation(temp_image)
- FileHandler.cleanup!
+ begin
+ text = tesseract_translation(temp_image)
+ rescue IOError
+ raise
+ ensure
+ FileHandler.cleanup!
+ end
text.gsub(/^\//, '')
end
-
+
+ # Generates the convert command.
+ def generate_convert_command(temp_file)
+ cmd = [@options[:convert_command]]
+ input_opt = @options[:convert_options][:input]
+ output_opt = @options[:convert_options][:output]
+
+ cmd += input_opt unless input_opt.empty?
+ cmd << Shellwords.shellescape(@image.to_s)
+ cmd += output_opt unless output_opt.empty?
+ cmd << temp_file.to_s
+ cmd.join(" ")
+ end
+
+ # Converts the source image to a tiff file.
def to_tiff
temp_file = FileHandler.create_temp_file("#{@hash}.tif")
- system [CONVERT_COMMAND, image, temp_file].join(" ")
+ executed = system generate_convert_command(temp_file)
+ raise RuntimeError, "`#{@options[:convert_command]}` could not be executed." if executed.nil?
temp_file
end
-
+
+ # Translate a tiff file into text
def tesseract_translation(image_file)
- temp_text_file = FileHandler.create_temp_file("#{@hash}")
+ temp_text_file = FileHandler.create_temp_file(@hash.to_s)
config_file = write_configs
- system [TESSERACT_COMMAND, image_file, temp_text_file, "-l #{@lang}", config_file, "&> /dev/null"].join(" ")
- File.read("#{temp_text_file}.txt")
+ txt_file = "#{temp_text_file}.txt"
+ executed = system [@options[:tesseract_command], image_file.to_s, temp_text_file.to_s, "-l #{@options[:lang]}", config_file, "&> /dev/null"].join(' ')
+ raise RuntimeError, "`#{@options[:tesseract_command]}` could not be executed." if (executed.nil? || executed == false)
+ out = File.read(txt_file)
+ File.unlink txt_file
+ out
end
-
+ # Writes Tesseract configuration for the current source file
def write_configs
- return '' if @options.empty?
+ return '' if @options[:tesseract_options].empty?
path = FileHandler.create_temp_file("#{@hash}.config")
File.open(path, "w+") do |f|
- @options.each { |k,v| f << "#{k} #{v}\n" }
+ @options[:tesseract_options].each { |k,v| f << "#{k} #{v}\n" }
end
path
end
-
end
-
-end
+end
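
The option handling in the diff above combines top-level defaults with nested `tesseract_options` and `convert_options`. A non-mutating version of that idea could look like the sketch below. The module name `OptionsSketch` and the method `merge_options` are hypothetical; the default values are copied from the diff.

```ruby
# Hypothetical, non-mutating take on the option merging in the diff.
module OptionsSketch
  DEFAULTS = {
    :tesseract_options => {},
    :convert_options   => { :input => [], :output => [] },
    :lang              => :eng,
    :convert_command   => 'convert',
    :tesseract_command => 'tesseract',
    :check_deps        => false
  }.freeze

  # Returns a new hash; neither DEFAULTS nor the caller's hash is changed.
  def self.merge_options(options)
    merged = DEFAULTS.merge(options)

    # Nested hash: user-supplied tesseract options override the defaults.
    merged[:tesseract_options] =
      DEFAULTS[:tesseract_options].merge(options.fetch(:tesseract_options, {}))

    # Nested arrays: union of default and user-supplied convert arguments,
    # keeping both :input and :output keys even if only one was given.
    user_convert = options.fetch(:convert_options, {})
    merged[:convert_options] = {}
    DEFAULTS[:convert_options].each do |key, default|
      merged[:convert_options][key] = default | user_convert.fetch(key, [])
    end

    merged
  end
end
```

Returning a fresh hash avoids the side effect of deleting keys from the caller's options, which the in-place approach in the diff relies on.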
4 lib/tesseract/version.rb
@@ -1,3 +1,3 @@
module Tesseract
- VERSION = '0.1.0'
-end
+ VERSION = '0.1.1'
+end
11 tesseract.gemspec
@@ -9,16 +9,17 @@ Gem::Specification.new do |s|
s.version = Tesseract::VERSION
s.platform = Gem::Platform::RUBY
- s.authors = ["Scott Davis"]
+ s.authors = ["Scott Davis", "Martin Samson"]
s.description = %q{Ruby wrapper for google tesseract}
s.summary = %q{Ruby wrapper for google tesseract}
s.email = %q{jetviper21@gmail.com}
s.date = Date.today.to_s
- s.files = `git ls-files`.split("\n")
- s.executables = `git ls-files`.split("\n").map{|f| f =~ /^bin\/(.*)/ ? $1 : nil}.compact
- s.require_path = 'tesseract'
+ s.files = ['lib/tesseract.rb', 'lib/tesseract/process.rb', 'lib/tesseract/file_handler.rb', 'lib/tesseract/dependency_checker.rb']
+ s.files += ['lib/tesseract/version.rb']
+ s.require_path = 'lib'
s.homepage = %q{http://github.com/scottdavis/ruby-tesseract}
s.rdoc_options = ["--charset=UTF-8"]
s.required_rubygems_version = ">= 1.3.6"
s.add_development_dependency "bundler", ">= 1.0.0"
-end
+ s.required_ruby_version = '>= 1.8.6'
+end
85 test/tesseract_test.rb
@@ -9,14 +9,14 @@ class TesseractTest < Test::Unit::TestCase
end
should "throw exception" do
assert_raises Exception, Tesseract::DependencyChecker::OS_ERROR do
- Tesseract::Process.new(TEST_FILE)
+ Tesseract::Process.new(TEST_FILE, :check_deps => true)
end
end
teardown do
silence_stream(STDERR) { Object.const_set("RUBY_PLATFORM", @old_val) }
end
end
-
+
context "dependency imagemagic fails" do
setup do
Tesseract::DependencyChecker.expects(:run_cmd).with("which tesseract").returns('foo').once
@@ -24,22 +24,22 @@ class TesseractTest < Test::Unit::TestCase
end
should "throw exception" do
assert_raises Exception, Tesseract::DependencyChecker::IMAGE_MAGICK_ERROR do
- Tesseract::Process.new(TEST_FILE)
+ Tesseract::Process.new(TEST_FILE, :check_deps => true)
end
end
end
-
+
context "dependency tesseract fails" do
setup do
Tesseract::DependencyChecker.expects(:run_cmd).with("which tesseract").returns('').once
end
should "throw exception" do
assert_raises Exception, Tesseract::DependencyChecker::TESSERACT_ERROR do
- Tesseract::Process.new(TEST_FILE)
+ Tesseract::Process.new(TEST_FILE, :check_deps => true)
end
end
end
-
+
context "tesseract" do
setup do
@tess = Tesseract::Process.new(TEST_FILE)
@@ -47,30 +47,81 @@ class TesseractTest < Test::Unit::TestCase
should "return text" do
assert !@tess.to_s.empty?
end
- should "hanve lang of eng" do
- assert_equal 'eng', @tess.lang
+ should "have lang of eng" do
+ assert_equal :eng, @tess.lang
+ end
+ should "generate a valid convert command" do
+ expected = "convert #{TEST_FILE} derp"
+ result = @tess.generate_convert_command('derp')
+ assert_equal expected, result
end
end
-
+
+ context "tesseract convert options" do
+ should "generate a valid convert command with input options" do
+ options = {:convert_options => {:input => ['-size 120x120']}}
+ tess = Tesseract::Process.new(TEST_FILE, options)
+ expected = "convert -size 120x120 #{TEST_FILE} derp"
+ result = tess.generate_convert_command('derp')
+ assert_equal expected, result
+ end
+ should "generate a valid convert command with output options" do
+ options = {:convert_options => {:output => ['-resize 120x120']}}
+ tess = Tesseract::Process.new(TEST_FILE, options)
+ expected = "convert #{TEST_FILE} -resize 120x120 derp"
+ result = tess.generate_convert_command('derp')
+ assert_equal expected, result
+ end
+ should "generate a valid convert command with input and output options" do
+ options = {
+ :convert_options => {
+ :input => ['-size 120x120'],
+ :output => ['-resize 140x140']
+ }
+ }
+ tess = Tesseract::Process.new(TEST_FILE, options)
+ expected = "convert -size 120x120 #{TEST_FILE} -resize 140x140 derp"
+ result = tess.generate_convert_command('derp')
+ assert_equal expected, result
+ end
+
+ context "tesseract invalid commands" do
+ should "raise an exception when convert could not be executed" do
+ options = {:convert_command => "derp"}
+ tess = Tesseract::Process.new(TEST_FILE, options)
+ assert_raises RuntimeError do
+ tess.to_s
+ end
+ end
+ should "raise an exception when tesseract could not be executed" do
+ options = {:tesseract_command => "derp"}
+ tess = Tesseract::Process.new(TEST_FILE, options)
+ assert_raises RuntimeError do
+ tess.to_s
+ end
+ end
+ end
+ end
+
context "tesseract diff lang" do
setup do
- @tess = Tesseract::Process.new(TEST_FILE, {:lang => 'butts'})
+ @tess = Tesseract::Process.new(TEST_FILE, {:lang => :butts})
end
should "have lang of butts" do
- assert_equal 'butts', @tess.lang
+ assert_equal :butts, @tess.lang
end
end
-
+
context "tesseract configs" do
setup do
- @tess = Tesseract::Process.new(TEST_FILE, {:chop_enable=>0})
+ config = {:chop_enable => 0}
+ @tess = Tesseract::Process.new(TEST_FILE, {:tesseract_options => config})
end
should "return text" do
assert !@tess.to_s.empty?
end
- should "hanve lang of eng" do
- assert_equal 'eng', @tess.lang
+ should "have lang of eng" do
+ assert_equal :eng, @tess.lang
end
end
-
-end
+end