Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AD: Extend detection code to scan inside archives #5239

Merged
merged 9 commits into from Nov 2, 2023

Conversation

fracturehill
Copy link
Contributor

@fracturehill fracturehill commented Aug 6, 2023

This is motivated by certain detection entries in the Nancy Drew
engine containing filenames of InstallShield cabinets, which are
all standard (data1.cab/data1.hdr, etc), and thus trigger false
positives during detection.

The approach taken is to add the A: prefix to md5 sums,
which is used to signify that the desired file is embedded within
an archive. The type of archive and the filename of the archive
are placed inside the filename string, like in the following example:

is:data1.cab:file.dat

Where "is" is the identifier for InstallShield archives,
"data1.cab" is the name of the archive container file, and
"file.dat" is the name of the file whose size and md5 sum will
actually be evaluated.

When detecting, archives are only opened once and then
kept in cache until detection ends for all plugins. The cache
is contained inside what was formerly known as MD5CacheManager,
which has now been renamed AdvancedDetectorCacheManager.

For now, this PR only adds support for InstallShield archives.
As a proof of concept, the offending Nancy Drew detection entries
have been replaced with ones that make use of the new feature.

@fracturehill fracturehill marked this pull request as ready for review August 16, 2023 18:26
Copy link
Member

@sev- sev- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I found some time to review this PR :D Added my notes.

engines/advancedDetector.cpp Outdated Show resolved Hide resolved
}

if (_baseName.hasSuffix(".cab") || _baseName.hasSuffix(".hdr")) {
_baseName.erase(_baseName.size() - 5, String::npos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- 5 or still - 4? I think the last letter could get erased with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely - 5. IS cabs follow a naming scheme of #.cab, where # is a digit that gets incremented by 1 with every new file. We only need to store the part, sans that digit, and then we re-add it as needed.

engines/advancedDetector.cpp Show resolved Hide resolved
engines/game.cpp Outdated Show resolved Hide resolved
engines/advancedDetector.cpp Show resolved Hide resolved
{ "data2.cab", 0, "9c652edb9846a721839cb7e1dcc94a3e", 462008320 },
AD_LISTEND
},
AD_ENTRY1s("is:data1.cab:ciftree.dat", "A:e1cd21841ab1b83a0ea0755ce0254cbc", 4480956),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I was thinking about something like

AD_ENTRY1s("data1.cab", "A:is:ciftree.dat:e1cd21841ab1b83a0ea0755ce0254cbc", 4480956)

But I like your schema better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I was initially planning to do that but quickly realized it makes more sense to keep information about the archive outside of the md5 sum.

"nancy5", nullptr,
{
{ "data1.hdr", 0, "258e27792fa7cc7a7125fd74d89f8487", 284091 },
{ "data1.cab", 0, "70433b30b6114031d54d0c991ad44577", 1446301 },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm... It seems to me that you are killing these variants. I mean, these archives as a total do differ. Is that the case?

If this is the case, then I would suggest still keeping a distinct archive like data1.cab as a second file in the detection entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from program files, IS cabs contain a lot of extra data - installer strings in multiple languages, image banners for the installer, registry keys, optional libraries, etc etc. I do not have both of the variants to compare, but it's a pretty safe bet that there's no differences in the actual game data and the discrepancy is in the aforementioned extra data. HeR had a habit of re-releasing their games in various bundles, so having a slightly installer between versions, but identical game data is not out of the expected.
Also, since ciftree.dat (the file we use for detection) contains nearly all the game data except for music and some videos, we know for a fact there is no difference in game logic between those versions - if there were, the md5s would be different :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about that. Once we have the integrity checking system in-place, it would require to have separate entries, so we could definitely tell that a game is complete and in distinct.

E.g. we would need to collect all variants in the detection tables.

Also, with this in light, I'd like to ask if possible, leave the current detection entries intact, because I am afraid it will be very difficult to port this archive support to Python. Or we decide to switch the scan utility to C++.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I forgot about the integrity system... In this case your proposal to keep a .cab file in the entry would work fine; the integrity system itself wouldn't need to worry about what's inside an archive, and could just skip files with an A: prefix, since any changes in the files within would be reflected in the checksum of the containing archive.

Only issue is that some of the entries in the nancy engine were made ~10 years ago by other contributors, and I'd need their help with extracting the checksums of the ciftree.dat files the engine actually cares about; I'm assuming they'd be the same, but I can't know since I don't own those variants myself. I think for now commenting out the entries I can't upgrade to the new system and then re-activating them with the new data would be the best course of action; I'd like to not have any cabfile-only entries active, since the whole point of this PR is to get rid of the false positives they cause.

@fracturehill
Copy link
Contributor Author

Thank you for the review @sev- ! I've made the relevant fixes and left notes for the rest of your comments.
On a side note: I just noticed that the unknown game type dialog outputs punycode-encoded strings, which visually mangles the colon character used by the schema in the file name. I'm not sure if this is a thing that needs to be addressed or not, but either way it's outside the scope of this PR, so I'm leaving it alone for now

@fracturehill fracturehill force-pushed the detection-inside-archives branch 2 times, most recently from 7afd19b to debaeae Compare August 30, 2023 11:41
This is needed for cases where SearchMan is not applicable
(e.g. during detection)
Added the A: prefix to detection md5 sums, which is used to
signify that the desired file is embedded within an archive.
Added support for similar prefixes in filenames, which should
look like the following example:

is:data1.cab:file.dat

Where "is" is the identifier for InstallShield archives,
"data1.cab" is the name of the archive container file, and
"file.dat" is the name of the file whose md5 sum will actually
be calculated.
Detection entries for compressed game variants now have
an additional file entry for ciftree.dat, which uses the newly
introduced feature for scanning inside archives. The few entries
which I don't own have been temporarily disabled.
Also, a couple new compressed entries have been added; these
were previously put on hold for after the scanning feature
was implemented.
Renamed MD5CacheManager to AdvancedDetectorCacheManager,
and added facilities for storing open archives inside it. This
way an archive that was opened by an AdvancedDetector
will be kept in memory until the end of the detection, so
other entries/engines that will look inside it won't have
to reopen it and reread its data every time.
This ensures we don't keep potentially expensive and
otherwise useless objects when a game is running
InstallShield V3 archives are a completely different format
and have their own Archive subclass; this commit allows
AdvancedDetector to look inside them. Also moved around
some of the relevant logic and added more comments to
make the code more readable.
AdvancedMetaEngineDetection::detectClashes() now
performs a sanity check for entries with archive members,
making sure their filename schema contains exactly three
elements.
Cleaned up the functions md5PropToGameFile() and
md5PropToCachePrefix() so they no longer work through
a tree of switches. Both functions have also been changed to
return a Common::String.
@sev-
Copy link
Member

sev- commented Nov 2, 2023

Thank you!

@sev- sev- merged commit 486b139 into scummvm:master Nov 2, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants