Skip to content

Covert SDR Objects (containing PDFs) into plain text (for AI-ETD project)

License

Notifications You must be signed in to change notification settings

sul-dlss-labs/druid_2_text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Druid2Text

NOTE: This is a Proof of Concept and not currently intended for production use.

This library takes a druid (the identifier for an object in the Stanford Digital Repository) and attempts to turn it into text. This currently only supports converting PDFs to text (intended for extracting plain text from Electronic Thesis & Dissertation PDFs), but could be extended to support other conversions (or objects that already contain plain text).

Basic Usage

The examples here assume you're in the root of the repository and running in a ruby environment (e.g. irb)

By default the library will take an array of druids and write text files with the corresponding druid as the filename (+ .txt) in the results directory and dump the contents of any PDFs in the object to that file as text.

require './druid_2_text'

Druid2Text.call(druids: ['pd570yx1816'])

Configuring

You can print deubugging information by setting the Druid2Text::DEBUG constant to true.

Druid2Text::DEBUG = true

You can also pass in a class to handle the results (that will be used instead of the default Druid2Text::TextFileWriter handler).

This class will be instantiated with the druid and the text and the class is responsible for handling the rest.

class MyCustomHandler
  def initialize(druid:, text:)
    puts druid
    puts text
  end
end

Druid2Text.call(druids: ['pd570yx1816'], results_handler: MyCustomHandler)

Processing Data Manually

In addition to providing a handler class, you can also pass a block to process the page data manually.

Druid2Text.call(druids: ['pd570yx1816']) do |druid, pages|
  page_count = pages.count
  puts "#{druid} has #{page_count} pages"
  puts "#{pages.map(&:length).max} is the longest page"

  # Print the last half of the document with page indexes
  pages.each_with_index do |page, index|
    next if index < (page_count / 2)
    puts "Page index: #{index}"
    puts page
  end
end

About

Covert SDR Objects (containing PDFs) into plain text (for AI-ETD project)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages