Attempt local SHA1 hash of content before uploading #67

Open
shrimpza opened this issue Nov 8, 2022 · 8 comments

@shrimpza
Member

shrimpza commented Nov 8, 2022

Perhaps we can SHA1 hash and lookup content before uploading.

This would allow validation of at least some duplicates before wasting time and resources uploading to the server.

submit/index.html should be able to load and hash files selected by the user, then perform some sort of request to the backend to look up content by hash.

This would likely also involve generating a new file structure based on hashes, perhaps with meta redirects to the appropriate content pages.
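One way the hash-based file structure could look, as a rough sketch only: a build step writes a tiny redirect page per known hash, and the submit page requests that path to see whether the content already exists. The directory layout, the function names, and the example URL are all assumptions for illustration, not anything the site currently does.

```python
from html import escape
from pathlib import Path


def redirect_path(root: Path, sha1_hex: str) -> Path:
    """Hypothetical layout: <root>/<first two hex chars>/<full hash>/index.html."""
    sha1_hex = sha1_hex.lower()
    return root / sha1_hex[:2] / sha1_hex / "index.html"


def redirect_html(content_url: str) -> str:
    """Return a minimal page that meta-redirects to the content page."""
    safe_url = escape(content_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        f'<meta http-equiv="refresh" content="0; url={safe_url}">\n'
    )


def write_redirect(root: Path, sha1_hex: str, content_url: str) -> Path:
    """Write the redirect page for one hash and return where it was written."""
    target = redirect_path(root, sha1_hex)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(redirect_html(content_url), encoding="utf-8")
    return target
```

With a layout like this, a lookup is just a request for the hash path on the static site: a hit means the content is already known, a 404 means it is probably new. No dynamic backend would be needed for the check itself.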

References:

@sairuk

sairuk commented Feb 18, 2023

I was looking for something like this because I want to scan my own 7-DVD map set locally to identify anything that may currently be missing from the archive. Chances are there isn't much, if this already includes what I'd upped to rushbase before it closed.

be good to have a tool that could walk a local directory against the data here and build an upload queue of some kind with a separate frontend for folks who want that.

In the past i've used existing tools like clrmamepro (built to scan rom files but can be leveraged for other filesets) to catalog the files, ~13k in all which is similar to what the project looks to do in the metadata.

@shrimpza
Member Author

shrimpza commented Feb 19, 2023

The unreal-archive distribution does include the capability to scan and identify new content, as well as index and upload to a data store (provided the person doing the indexing has an Azure, S3, or B2 account to upload to, those being the currently supported backends).

Is your map set a collection of zip/rar/archive files as may have been originally distributed/uploaded by authors, or are they unpacked .unr and related texture/code/config/etc support files in various directories?

Unreal Archive is currently set up to deal with the former - "original" archives as distributed by content authors. Unfortunately it doesn't have a good mechanism of dealing with the individual "loose" content files at the moment.

If you want to try it out, there's a pre-built binary distribution available at: https://code.shrimpworks.za.net/artefacts/net/shrimpworks/unreal-archive/latest/unreal-archive-latest.zip

This can be used to scan a directory for new or unknown content:

$ bin/unreal-archive scan /path/to/content --new-only=true 

The output of the scan command is a simple semicolon-delimited set of fields, intended for exactly what you propose: finding and refining lists of things to index and upload. I typically feed this through various grep filters to help narrow things down.
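As an alternative to grep, the semicolon-delimited output could be filtered with a few lines of Python. This is only a sketch: the column names match the header line shown in the scan output quoted later in this thread, but how the Known column is encoded (assumed here to be the text "true" for known content) and the sample rows in the usage note are guesses, not confirmed behaviour of the tool.

```python
import csv
from typing import Iterable, Iterator

# Column names as they appear in the header line of the scan output.
HEADER = ["File", "Known", "Current Type", "Scanned Type", "Failure"]


def parse_scan_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per data row, skipping header and non-data lines."""
    for fields in csv.reader(lines, delimiter=";"):
        if len(fields) != len(HEADER) or fields == HEADER:
            continue  # version banner, progress line, or the header itself
        yield dict(zip(HEADER, fields))


def unknown_files(lines: Iterable[str]) -> list[str]:
    """Return file names whose Known column is not 'true' (assumed encoding)."""
    return [
        row["File"]
        for row in parse_scan_rows(lines)
        if row["Known"].strip().lower() != "true"
    ]
```

For example, `unknown_files(open("scan.txt"))` would return the file names to consider for upload, where `scan.txt` is a hypothetical file holding the captured scan output.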

Maybe as a starting point, give that a try and see what it finds? The caveat is that it expects to find archive files, not .unr files. If you feed it .unr files, they will be hashed independently and considered unique compared to existing content within archives, so everything would appear "new".
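To illustrate why loose files never match archive-level hashes, here is a small self-contained sketch. The file name and payload are made up; it only shows that the SHA1 of a zip differs from the SHA1 of the file packed inside it, while the unpacked member still hashes the same as the loose file.

```python
import hashlib
import io
import zipfile


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def zip_bytes(name: str, payload: bytes) -> bytes:
    """Build an in-memory zip archive holding a single member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


payload = b"stand-in bytes for a .unr map file"  # made-up content
packed = zip_bytes("DM-Example.unr", payload)

loose_hash = sha1_hex(payload)    # hash of the loose file
archive_hash = sha1_hex(packed)   # hash of the archive as distributed

# Hash of the member after unpacking it from the archive again.
with zipfile.ZipFile(io.BytesIO(packed)) as archive:
    member_hash = sha1_hex(archive.read("DM-Example.unr"))

print(loose_hash == archive_hash)  # False
print(loose_hash == member_hash)   # True
```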

@sairuk

sairuk commented Feb 19, 2023

Ok thanks, i'll take a look at that tool in the first instance, if need be i can knock up something to do the same anyway but sounds like it exists.

> Is your map set a collection of zip/rar/archive files as may have been originally distributed/uploaded by authors, or are they unpacked .unr and related texture/code/config/etc support files in various directories?

all of the above, i was predominately focused on release archives and variations and then documenting the contents which is how the tooling i was using at the time works. I also only scanned crc32 (for some reason)

i've upped the old dats for reference here where every game entry is an archive and every rom entry was the contents in the archive. I was messing around with multiple methods though so there may be some crossover where game is a directory of loose files in some dats (probably recovered from cache). I was looking at this in 2006/7 so it's been a while since i've been back

@sairuk

sairuk commented Feb 19, 2023

had a go at the tool and ran into some issues pulling the archive data, i'll have another look later. I ended up writing my own tool in python just to do the basic check I needed done, slower to pull the data but ran quickly enough for me. i.e. sha1 hash the local archive and look the hash up in the unreal-archive data.
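For anyone wanting to do the same basic check, a minimal sketch of that approach might look like the following. It is not the actual script mentioned above. The set of archive extensions is an assumption, and `known_hashes` is taken as given: how to extract the known SHA1 values from the unreal-archive data depends on that repository's layout and is not shown here.

```python
import hashlib
from pathlib import Path

# Assumed set of archive extensions worth checking.
ARCHIVE_SUFFIXES = {".zip", ".rar", ".7z"}


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA1 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unknown(local_dir: Path, known_hashes: set[str]) -> list[tuple[Path, str]]:
    """Walk local_dir and return (path, sha1) for archives not in known_hashes."""
    unknown = []
    for path in sorted(local_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            digest = sha1_of_file(path)
            if digest not in known_hashes:
                unknown.append((path, digest))
    return unknown
```

The returned list can serve as a simple upload queue. `known_hashes` is expected to hold lowercase hex strings, since that is what `hexdigest()` produces.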

I have ~1500 potentially new files or variations on known. Noting it seems past me didn't retain the release casing for filenames, lowercased everything, annoying.

e.g. dm-sdm-bullspumphouse.zip including the missing BullPHsounds dependency from the current entry

i'll have a read in how to bulk submit, or i may just up them and you can grab them

@shrimpza
Member Author

Cool. Unfortunate that it didn't just work. What errors or problems did you encounter getting the archive data? It should automatically download it if you don't manually git checkout and provide a --content-path. I've been doing a lot of refactoring recently and may have broken something.

Anyway, if you've managed to narrow down just the new or variant content, and if you have somewhere to upload it, even temporarily, where I can grab it, I'm happy to do a bulk index on my end.

@sairuk

sairuk commented Feb 21, 2023

there was an initial error that i didn't capture, but now when trying to download the data i run into this

$ bin/unreal-archive scan /mnt/files/games/video/ut_all/ut99_map_archive_test/ --new-only=true
Unreal Archive version 1.9.212
content-path not specified. Download and use read-only content data? [Y/n]
> y
Exception in thread "main" java.lang.IllegalArgumentException
	at java.base/java.util.Optional.orElseThrow(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.contentPath(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.contentRepo(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.main(Unknown Source)

using --content-path after cloning the repo manually did work, this is the output i had from it working

$ time bin/unreal-archive scan /mnt/files/games/video/ut_all/ut99_map_archive_test/ --new-only=true --content-path=$HOME/mounts/insertcredit/devel/com.github/unreal-archive-data/
Unreal Archive version 1.9.212
Loaded content index with 57317 items (310.18GB) in 204.85s
Found 3 file(s) to scan  
File;Known;Current Type;Scanned Type;Failure
[3/3] : cnq-overlord.zip  
Completed scanning 3 files

real	3m26.182s
user	0m48.197s
sys	0m38.538s

@shrimpza
Member Author

Interesting, thanks. I'll have to validate that, I may well have broken it in the recent refactors 😬

@sairuk

sairuk commented Feb 22, 2023

@shrimpza tried messaging you on discord to discuss uploads but it's being a pain, reach out if you can.


2 participants