New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AD: Extend detection code to scan inside archives #5239
Conversation
7155176
to
f3573fe
Compare
f3573fe
to
4822a1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, I found some time to review this PR :D Added my notes.
} | ||
|
||
if (_baseName.hasSuffix(".cab") || _baseName.hasSuffix(".hdr")) { | ||
_baseName.erase(_baseName.size() - 5, String::npos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 5
or still - 4
? I think the last letter could get erased with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely - 5. IS cabs follow a naming scheme of #.cab, where # is a digit that gets incremented by 1 with every new file. We only need to store the part, sans that digit, and then we re-add it as needed.
engines/nancy/detection.cpp
Outdated
{ "data2.cab", 0, "9c652edb9846a721839cb7e1dcc94a3e", 462008320 }, | ||
AD_LISTEND | ||
}, | ||
AD_ENTRY1s("is:data1.cab:ciftree.dat", "A:e1cd21841ab1b83a0ea0755ce0254cbc", 4480956), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I was thinking about something like
AD_ENTRY1s("data1.cab", "A:is:ciftree.dat:e1cd21841ab1b83a0ea0755ce0254cbc", 4480956)
But I like your schema better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I was initially planning to do that but quickly realized it makes more sense to keep information about the archive outside of the md5 sum.
engines/nancy/detection.cpp
Outdated
"nancy5", nullptr, | ||
{ | ||
{ "data1.hdr", 0, "258e27792fa7cc7a7125fd74d89f8487", 284091 }, | ||
{ "data1.cab", 0, "70433b30b6114031d54d0c991ad44577", 1446301 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm... It seems to me that you are killing these variants. I mean, these archives as a total do differ. Is that the case?
If this is the case, then I would suggest still keeping a distinct archive like data1.cab
as a second file in the detection entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from program files, IS cabs contain a lot of extra data - installer strings in multiple languages, image banners for the installer, registry keys, optional libraries, etc etc. I do not have both of the variants to compare, but it's a pretty safe bet that there's no differences in the actual game data and the discrepancy is in the aforementioned extra data. HeR had a habit of re-releasing their games in various bundles, so having a slightly installer between versions, but identical game data is not out of the expected.
Also, since ciftree.dat (the file we use for detection) contains nearly all the game data except for music and some videos, we know for a fact there is no difference in game logic between those versions - if there were, the md5s would be different :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not about that. Once we have the integrity checking system in-place, it would require to have separate entries, so we could definitely tell that a game is complete and in distinct.
E.g. we would need to collect all variants in the detection tables.
Also, with this in light, I'd like to ask if possible, leave the current detection entries intact, because I am afraid it will be very difficult to port this archive support to Python. Or we decide to switch the scan utility to C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I forgot about the integrity system... In this case your proposal to keep a .cab file in the entry would work fine; the integrity system itself wouldn't need to worry about what's inside an archive, and could just skip files with an A: prefix, since any changes in the files within would be reflected in the checksum of the containing archive.
Only issue is that some of the entries in the nancy engine were made ~10 years ago by other contributors, and I'd need their help with extracting the checksums of the ciftree.dat files the engine actually cares about; I'm assuming they'd be the same, but I can't know since I don't own those variants myself. I think for now commenting out the entries I can't upgrade to the new system and then re-activating them with the new data would be the best course of action; I'd like to not have any cabfile-only entries active, since the whole point of this PR is to get rid of the false positives they cause.
Thank you for the review @sev- ! I've made the relevant fixes and left notes for the rest of your comments. |
7afd19b
to
debaeae
Compare
This is needed for cases where SearchMan is not applicable (e.g. during detection)
Added the A: prefix to detection md5 sums, which is used to signify that the desired file is embedded within an archive. Added support for similar prefixes in filenames, which should look like the following example: is:data1.cab:file.dat Where "is" is the identifier for InstallShield archives, "data1.cab" is the name of the archive container file, and "file.dat" is the name of the file whose md5 sum will actually be calculated.
Detection entries for compressed game variants now have an additional file entry for ciftree.dat, which uses the newly introduced feature for scanning inside archives. The few entries which I don't own have been temporarily disabled. Also, a couple new compressed entries have been added; these were previously put on hold for after the scanning feature was implemented.
Renamed MD5CacheManager to AdvancedDetectorCacheManager, and added facilities for storing open archives inside it. This way an archive that was opened by an AdvancedDetector will be kept in memory until the end of the detection, so other entries/engines that will look inside it won't have to reopen it and reread its data every time.
This ensures we don't keep potentially expensive and otherwise useless objects when a game is running
InstallShield V3 archives are a completely different format and have their own Archive subclass; this commit allows AdvancedDetector to look inside them. Also moved around some of the relevant logic and added more comments to make the code more readable.
AdvancedMetaEngineDetection::detectClashes() now performs a sanity check for entries with archive members, making sure their filename schema contains exactly three elements.
Cleaned up the functions md5PropToGameFile() and md5PropToCachePrefix() so they no longer work through a tree of switches. Both functions have also been changed to return a Common::String.
debaeae
to
be8f951
Compare
Thank you! |
This is motivated by certain detection entries in the Nancy Drew
engine containing filenames of InstallShield cabinets, which are
all standard (data1.cab/data1.hdr, etc), and thus trigger false
positives during detection.
The approach taken is to add the A: prefix to md5 sums,
which is used to signify that the desired file is embedded within
an archive. The type of archive and the filename of the archive
are placed inside the filename string, like in the following example:
is:data1.cab:file.dat
Where "is" is the identifier for InstallShield archives,
"data1.cab" is the name of the archive container file, and
"file.dat" is the name of the file whose size and md5 sum will
actually be evaluated.
When detecting, archives are only opened once and then
kept in cache until detection ends for all plugins. The cache
is contained inside what was formerly known as MD5CacheManager,
which has now been renamed AdvancedDetectorCacheManager.
For now, this PR only adds support for InstallShield archives.
As a proof of concept, the offending Nancy Drew detection entries
have been replaced with ones that make use of the new feature.