Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deposit_bag_validator class added #122

Merged
merged 5 commits into from
Mar 20, 2018
Merged

Conversation

ndushay
Copy link
Contributor

@ndushay ndushay commented Mar 6, 2018

This is a Shameless Green implementation of Moab::DepositBagValidator, to support the validate_bag robot in sul-dlss/preservation_robots (which will be a replacement for the old robot code in sul-dlss/sdr-preservation-core).

The logic represented by this code was originally in:

  • sul-dlss/sdr-preservation-core/lib/sdr_ingest/validate_bag <-- old preservation robots
  • sul-dlss/archive-utils/lib/bagit_bag <-- gem only used by sdr-preservation-robots
  • sul-dlss/archive-utils/lib/file_fixity
  • sul-dlss/archive-utils/lib/fixity

The code's new home is here because the common_accessioning robots ultimately calls (via dor-services preservable concern ...) Moab::Bagger in this moab-versioning gem to create the deposit bag. So it makes sense to have the bag validation code live here as well.

Please check for:

  • code has at least as much actual validation of the bag as the original
  • code makes sense (suggestions for improvements welcome!)
  • tests check what they should
  • test code makes sense (ditto)
  • does it make sense for this code to live in moab-versioning gem?
  • world peace

closes #117
connects to sul-dlss/preservation_robots#11

@SaravShah
Copy link
Contributor

mvimg_20180306_153103

@ndushay
Copy link
Contributor Author

ndushay commented Mar 7, 2018

In case it helps, some vague parallels here:

Methods:

  • #validation_errors --> was sdr-preservation-core.validate_bag#validate_bag
  • L67 --> sdr-preservation-core.bagit_bag#verify_pathname
  • #bag_dir_exists? --> was archive-utils BagitBag.open_bag
  • #verify_version --> was sdr-preservation-core.validate_bag#verify_version_number
  • (last_)version_id_from_xxx --> was sdr-preservation-core.validate_bag verify_version_id, vmfile_version_id
  • #verify_version_in_xml_file --> sdr-preservation-core.validate_bag#verify_version_id
  • #verify_tagmanifests --> archive-utils.BagitBag#verify_tagfile_manifests
  • #verify_payload_manifests --> archive-utils.BagitBag#verify_payload_manifests
  • #checksums_hash_from_manifest_files --> was archive-utils.BagitBag#read_manifest_files
  • #generate_tagmanifest_checksums_hash --> archive-utils: BagitBag generate_tagfile_checksums, Fixity.generate_checksums
  • #generate_payload_checksums --> archive-utils: BagitBag generate_payload_checksums
  • #generate_checksums_hash --> archive-utils: Fixity.generate_checksums
  • #generated_checksums --> archive-utils: Fixity.fixity_from_file
  • #digester_hash --> archive-utils Fixity.get_digesters
  • #verify_manifest_checksums --> archive-utils BagitBag verify_manifests and BagitBag manifest_diff
  • #checksums_diff_hash --> archive-utils FileFixity diff
  • #verify_payload_size --> archive-utils BagitBag verify_payload_size
  • #bag_info_payload_size --> archive-utils BagitBag info_payload_size, BagitBag read_bag_info_text
  • #generated_payload_size --> BagitBag bag_payload_size
  • #checksum_types_from_manifest_checksums_hash <-- I created this so we could stop generating unnec checksums
  • #single_error_hash <-- akin to MV storage_object_validator
  • #error_code_msg <-- akin to MV storage_object_validator

Copy link
Contributor

@SaravShah SaravShah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you got the main test cases for errors.
There are some edge cases I can think of but I think they might be too edge casey. I went back and forth from your methods and the methods in which you abstracted them and everything looks good.

PS. I don't mind the name payload but if others want it to be changed then thats fine.

Edit: Went through specs. This looks good to me. I would like to see this get tested, maybe inserting this code in place of the existing validate-bag step in the sdrIngestWF.

  • code has at least as much actual validation of the bag as the original
  • code makes sense (suggestions for improvements welcome!)
  • tests check what they should
  • test code makes sense (ditto)

# returns Array of tiny error hashes, allowing multiple occurrences of a single error code
def validation_errors
return [single_error_hash(BAG_DIR_NOT_FOUND, bag_dir: deposit_bag_pathname)] unless deposit_bag_pathname.exist?
return result_array unless required_bag_files_exist?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want an error code here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required_bag_files_exist? is a method that adds errors to result_array if all required files aren't present. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your question points out that that method is doing two things: writing errors and answering a boolean. perhaps that is bad.

end
end

def verify_manifest_checksums(manifest_type, manifests_checksum_hash, generated_checksum_hash)
Copy link
Contributor

@SaravShah SaravShah Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify_manifest_checksums is the combination of verify_manifests and manifest_diff. I see manifest_diff uses FileFixity.new(file_id: file_id) and we are using an empty hash here. Just wanted some clarification

Copy link
Contributor

@SaravShah SaravShah Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class FileFixity
   def initialize(options=nil)
      @checksums=Hash.new
      options = {} if options.nil?
      options.each do |key,value|
        #instance_variable_set("@#{key}", value)
        send "#{key}=", value
      end
    end
.....

So they are passing in file_id to this hash obj but we are leaving it empty?

Copy link
Contributor Author

@ndushay ndushay Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the original, they use this "FileFixity" structure as a mean of associating a file name with its checksums. And there is a bunch of code about making FileFixity a class that can be used as a hash key ... but it never is.

When I unraveled what was going on, it appeared we needed something simpler: a hash with the relative file name as the key, and a hash of checksums as the value. No need for all the indirection.

We don't have to worry about file name clashes, because they can't happen with relative file names in the bag. And we are only verifying the checksums that exist for the file, so it seemed expedient to simply work with those.

It may be that some comments would be helpful explaining the hashes.

It may also be that I am wrong in my assessment.

A further note: FileFixity in archive-utils is similar to FileSignature in moab-versioning. Since I was already up to my neck in alligators, I did not take the next step to using existing code in moab-versioning. I saw that the bag validation code was totally separate for unknown reasons so I kept my focus there and didn't look to re-use moab-versioning existing classes. Perhaps that was a bad decision.

Copy link
Contributor Author

@ndushay ndushay Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Perhaps it was a bad decision that the original ingest-robots didn't leverage the moab-versioning code for the bag validation, since they do leverage it for deposit-bag. I have no idea why the existing production code is so inconsistent and strange. -- wait -- yes, yes I do. Le sigh.)

storage_obj = instance_double(Moab::StorageObject, deposit_bag_pathname: bag_pathname, current_version_id: 0)
dbv = described_class.new(storage_obj)
code = described_class::CHECKSUM_MISMATCH
from_file = '6666666666666666666666666666666666666666666666666666666666666666'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@jmartin-sul jmartin-sul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have 3 small nitpicks. if you think they're off base, or would prefer to ticket them, or think they're a good point but not worth the effort, i'm happy to just approve and merge.

nice work, definitely more readable, and likely more maintainable!

pathname = deposit_bag_pathname.join(filename)
result_array << single_error_hash(REQUIRED_FILE_NOT_FOUND, file_pathname: pathname) unless pathname.exist?
end
result_array.empty? ? true : false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not totally opposed to this, but sort of assumes that required_bag_files_exist? will be called before any other checks, since it assumes result_array will be empty if it didn't add anything, right? comment maybe? is a local result array overkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding comment - good catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty? already returns true or false. No need to reinterpret that, right?

nil
end

def verify_version_in_xml_file(file_pathname, found)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about verify_version_from_xml_file? small difference, but every time i read the method name, it made me think it was going to read the xml, but it's doing stuff with already-extracted values.

diff_hash = {}
# NOTE: these are intentionally & and | instead of && and ||
matching_checksum_types = (left_checksums.keys & right_checksums.keys)
matching_checksum_types = (left_checksums.keys | right_checksums.keys) if matching_checksum_types.empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about... checksum_types_to_check? checksum_types_to_match? checksum_types_to_comp?

@ndushay
Copy link
Contributor Author

ndushay commented Mar 20, 2018

@jmartin-sul I believe I have addressed your concerns. Thanks for the 👀 . The code is better with the small tweaks.

@coveralls
Copy link

coveralls commented Mar 20, 2018

Coverage Status

Coverage decreased (-0.06%) to 98.342% when pulling 088b957 on new-deposit-bag-validator into df7102e on master.

@jmartin-sul jmartin-sul merged commit 37a34fb into master Mar 20, 2018
@jmartin-sul jmartin-sul deleted the new-deposit-bag-validator branch March 20, 2018 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(IR) create deposit_bag_validator in MV gem (blocked by PC #601)
5 participants