Structural and Checksum Validation: a quick walk through

Naomi Dushay edited this page Jun 3, 2021 · 8 revisions

This is a quick-ish walkthrough of the Moab validation code, from the higher-level moab-versioning methods down to the actual checksum calculation calls (accurate as of 2021-05-21).

Script for validating a random Moab

Below is the script from the README for validating a Moab. It can be used for any Moab anywhere, as long as the configuration points to the correct location.

It only requires the moab-versioning gem and a Rails console.

https://github.com/sul-dlss/moab-versioning/blob/main/README.md#soup-to-nuts-example-of-validating-a-druid-at-stanford

Soup to Nuts Example of validating a Druid at Stanford (from README)

Usage:

  • Copy the script below to a box with Ruby installed.
  • Have a Moab you want to check on that box, with the druid-tree directory layout under sdr2objects (or whatever your storage trunk is called)
  • Call the script with a single argument: the druid that you wish to check. E.g.:
[~/ruby ]$ ruby moab_check.rb bb294sf0065
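
For readers unfamiliar with the druid-tree layout mentioned above, here is a minimal sketch of how a druid maps to a directory under the storage trunk. The helper name `druid_tree_path` and the simplified druid pattern are assumptions made for illustration; the real logic lives in the druid-tools and moab-versioning gems and is stricter about which characters are allowed.

```ruby
# Hypothetical helper, not part of moab-versioning or druid-tools.
# Simplified pattern: 2 letters, 3 digits, 2 letters, 4 digits,
# with an optional "druid:" prefix.
DRUID_PATTERN = /\A(?:druid:)?([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/

def druid_tree_path(storage_root, storage_trunk, druid)
  match = DRUID_PATTERN.match(druid)
  raise ArgumentError, "not a well-formed druid: #{druid}" unless match

  # Each segment of the druid becomes a directory level, and the
  # innermost directory is named with the full (bare) druid.
  bare_druid = match.captures.join
  File.join(storage_root, storage_trunk, *match.captures, bare_druid)
end

puts druid_tree_path('/pres-01', 'sdr2objects', 'bb294sf0065')
# prints /pres-01/sdr2objects/bb/294/sf/0065/bb294sf0065
```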

Script:

# moab_check.rb
require 'moab'
require 'moab/stanford'
require 'druid-tools'

Moab::Config.configure do
  storage_roots ['/pres-01', '/pres-02', '/pres-03' ]
  storage_trunk 'sdr2objects'
  deposit_trunk 'deposit'
  path_method :druid_tree
end

# Read druid from command line arg.
druid = "druid:#{ARGV[0]}"
# druid = 'cq580gn5234'

moab_path = Moab::StorageServices.object_path(druid)
puts "#{druid} found at #{moab_path}"

moab = Moab::StorageObject.new(druid, moab_path)

# Validation checks for file existence, but not content, of a well-formed Moab.
# It does not read files or perform checksum validation.
object_validator = Stanford::StorageObjectValidator.new(moab)
validation_errors = object_validator.validation_errors # Returns an array of hashes with error codes
puts "\nChecking structural validation of #{druid}\n"
if validation_errors.empty?
  puts "\nYay! Moab #{moab.digital_object_id} passed structural validation.\n"  
else
  puts validation_errors
end

puts "\n"

# Iterate through each Moab version and perform verification. This includes discovery and checksum verification of files.
moab.version_list.each do |ver|
  puts "\nChecking version #{ver.version_id}\n"

  # add to_hash(verbose: true) or .to_json for more details on each

  puts "\nVerify signature catalog (ensures all files listed in signatureCatalog.xml exist)\n"
  puts ver.verify_signature_catalog.to_hash

  # verify_version_storage includes:
  #   verify_manifest_inventory, (which computes and compares v000x/manifest file checksums)
  #   verify_version_inventory,
  #   verify_version_additions (which computes v000x/data file checksums and compares them with values in signatureCatalog.xml)
  puts "\nVerify version storage (includes checksum validation of v000x/data and v000x/manifest files)\n"
  puts ver.verify_version_storage.to_hash
end

Walk-Through of Validation Code Called by the Script

StorageObjectValidator#validation_errors (detects structural, file layout, expected file/directory, etc. errors for the Moab as a whole)

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_validator.rb#L44-L50 (calls check_correctly_named_version_dirs, check_sequential_version_dirs, check_correctly_formed_moab)
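
To give a feel for what the version directory checks named above look for, here is a rough standalone sketch. It is not the gem's implementation: `version_dir_errors`, the `v0001`-style pattern, and the error strings are all assumptions for illustration, and the real validator returns hashes with error codes and checks much more (for example the contents of each version directory).

```ruby
require 'fileutils'
require 'tmpdir'

# Assumed naming convention: "v" followed by four digits, e.g. v0001.
VERSION_DIR_PATTERN = /\Av\d{4}\z/

# Hypothetical helper: returns an array of human-readable error strings.
def version_dir_errors(moab_path)
  dir_names = Dir.children(moab_path)
                 .select { |name| File.directory?(File.join(moab_path, name)) }
                 .sort
  well_named, badly_named = dir_names.partition { |name| name.match?(VERSION_DIR_PATTERN) }
  version_numbers = well_named.map { |name| name.delete_prefix('v').to_i }

  errors = []
  errors << "unexpected directories: #{badly_named.join(', ')}" unless badly_named.empty?
  errors << 'no version directories found' if version_numbers.empty?
  # Versions are expected to start at 1 and have no gaps.
  unless version_numbers == (1..version_numbers.size).to_a
    errors << "version directories are not sequential: #{well_named.join(', ')}"
  end
  errors
end

# Usage: a fake Moab with a gap between v0002 and v0004.
Dir.mktmpdir do |moab_path|
  %w[v0001 v0002 v0004].each { |name| FileUtils.mkdir_p(File.join(moab_path, name)) }
  puts version_dir_errors(moab_path)
end
```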

The script then calls, for each version:

verify_version_storage

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L214-L222

inits a VerificationResult for the druid version

calls verify_version_additions

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L315

calls FileInventory.new(type: 'directory').inventory_from_directory(data_directory) -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L315

calls FileGroup.new(group_id: group_id).group_from_directory(data_dir) -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_inventory.rb#L180

calls harvest_directory(directory, true) -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_group.rb#L190

calls add_physical_file -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_group.rb#L214

calls FileSignature.new.signature_from_file(pathname) -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_group.rb#L232

calls FileSignature.from_file(pathname) -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_signature.rb#L169

which actually computes the checksums using the file paths -- https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/file_signature.rb#L73-L89
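
The bottom of this call chain boils down to streaming a file through digest objects. The following is only a sketch of that idea using Ruby's standard `Digest` library; the method name `checksums_for`, the chunk size, and the choice of MD5, SHA-1, and SHA-256 are assumptions for illustration, not a copy of the linked FileSignature code.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical stand-in for the kind of work done when a signature is
# computed from a file: read the file in chunks and feed each chunk to
# several digests at once, so the file is only read from disk one time.
def checksums_for(pathname, chunk_size = 8192)
  digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
  size = 0
  File.open(pathname, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      size += chunk.bytesize
      digests.each_value { |digest| digest.update(chunk) }
    end
  end
  digests.transform_values(&:hexdigest).merge(size: size)
end

# Usage: compute the checksums of a small temporary file.
Tempfile.create('example') do |file|
  file.write('hello')
  file.flush
  p checksums_for(file.path)
end
```

Reading in chunks keeps memory use flat even for very large preserved files, which is the main reason to prefer it over reading the whole file into a string.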

also calls verify_version_inventory and verify_manifest_inventory

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L214-L222

Both can also be called directly; see below for how they work. Calling them again from a script in addition to verify_version_storage is probably redundant.

verify_manifest_inventory

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L224-L245

calls FileInventory.new.inventory_from_directory(@version_pathname.join('manifests'), 'manifests') -- see verify_version_storage walkthrough for explanation of how this computes checksums for the directory it's given. this time called on the manifests dir instead of the data dir.
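
Conceptually, this step is "build a fresh inventory from what is on disk, then compare it with what was recorded". Here is a minimal sketch of that pattern, assuming a plain hash of relative path to SHA-256 as the inventory; `inventory_from` and `compare_inventories` are hypothetical names, and the real code works with FileInventory objects and more than one checksum type.

```ruby
require 'digest'
require 'find'
require 'pathname'
require 'tmpdir'

# Hypothetical: walk a directory and record a checksum for every file,
# keyed by the file's path relative to that directory.
def inventory_from(directory)
  base = Pathname.new(directory)
  inventory = {}
  Find.find(directory) do |path|
    next unless File.file?(path)

    relative = Pathname.new(path).relative_path_from(base).to_s
    inventory[relative] = Digest::SHA256.file(path).hexdigest
  end
  inventory
end

# Hypothetical: report differences between recorded and freshly computed values.
def compare_inventories(recorded, actual)
  {
    missing: recorded.keys - actual.keys,
    unexpected: actual.keys - recorded.keys,
    mismatched: recorded.keys.select { |path| actual.key?(path) && actual[path] != recorded[path] }
  }
end

# Usage: record an inventory, alter a file, then compare.
Dir.mktmpdir do |manifests_dir|
  catalog = File.join(manifests_dir, 'signatureCatalog.xml')
  File.write(catalog, '<signatureCatalog/>')
  recorded = inventory_from(manifests_dir)
  File.write(catalog, '<signatureCatalog>changed</signatureCatalog>')
  p compare_inventories(recorded, inventory_from(manifests_dir))
end
```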

verify_version_inventory

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L275-L307

loads signature catalog, returns "true if files & signatures listed in version inventory can all be found". my read of the code is that this only looks for the presence of the files listed in signature catalog, and that it doesn't compute checksum values anew from content of the files as they are on disk when this is run.

...or maybe it validates checksums by re-computing from the files? a little hard to tell without digging into https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L284-L296 (esp catalog_entry = signature_catalog.signature_hash[file.signature])

verify_signature_catalog

https://github.com/sul-dlss/moab-versioning/blob/1cfe59cfb4f9fea7a1cc35b936a8f5a7ddd94029/lib/moab/storage_object_version.rb#L247-L273

this really does appear to just check for the presence of all the files listed in signature catalog.
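
For contrast with the checksum-computing paths above, a presence-only check can be sketched as follows. This is an illustration of the idea, not the gem's code: `missing_files` and the example relative paths are hypothetical, and the real method resolves each catalog entry's location from its version, group, and path.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical: given paths listed in a catalog (relative to the object
# directory), return the ones that do not exist on disk. File contents
# are never read, so a corrupted file would pass this check.
def missing_files(object_path, catalog_paths)
  catalog_paths.reject { |relative| File.file?(File.join(object_path, relative)) }
end

# Usage: one listed file exists, one does not.
Dir.mktmpdir do |object_path|
  present = 'v0001/data/content/page1.txt'
  absent = 'v0001/data/content/page2.txt'
  FileUtils.mkdir_p(File.join(object_path, File.dirname(present)))
  File.write(File.join(object_path, present), 'some content')
  p missing_files(object_path, [present, absent])
end
```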