Attempt local SHA1 hash of content before uploading #67

shrimpza · 2022-11-08T00:11:20Z

Perhaps we can SHA1-hash content locally and look it up by hash before uploading.

This would catch at least some duplicates before wasting time and resources uploading them to the server.

submit/index.html should be able to load and hash files selected by the user, then perform some sort of request to the backend to look up content by hash.

This would likely also involve generating a new file structure based on hashes, perhaps with meta redirects to the appropriate content pages.
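One way the hash-based file structure could look, as a rough sketch only: each known SHA1 gets a tiny static page containing a meta redirect to the content page, so the submit page could request that URL and treat a 404 as "not in the archive yet". The `hashes/<first two hex chars>/<sha1>.html` layout, the function names, and the example content URL are all assumptions made for illustration, not anything the project currently does.

```python
import hashlib
import html
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_hash_redirect(site_root: Path, sha1_hex: str, content_url: str) -> Path:
    """Write a static page for one hash that meta-redirects to its content page.

    The directory layout (hashes/<2 hex chars>/<sha1>.html) is hypothetical.
    """
    target = site_root / "hashes" / sha1_hex[:2] / f"{sha1_hex}.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    escaped = html.escape(content_url, quote=True)
    target.write_text(
        "<!DOCTYPE html>\n"
        f'<meta http-equiv="refresh" content="0; url={escaped}">\n'
        f'<a href="{escaped}">{escaped}</a>\n',
        encoding="utf-8",
    )
    return target
```

Sharding by the first two hex characters just keeps any single directory from holding tens of thousands of files; a flat directory would work equally well for the lookup itself.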

References:

sairuk · 2023-02-18T22:40:21Z

I was looking for something like this because I want to scan my own 7-DVD map set locally to identify anything the archive may currently be missing. Chances are there's nothing new if the archive already includes what I'd upped to rushbase before it closed.

Be good to have a tool that could walk a local directory against the data here and build an upload queue of some kind, with a separate frontend for folks who want that.

In the past I've used existing tools like clrmamepro (built to scan ROM files, but it can be leveraged for other filesets) to catalog the files, ~13k in all, which is similar to what the project looks to do in the metadata.

shrimpza · 2023-02-19T03:50:23Z

The unreal-archive distribution does include the capability to scan and identify new content, as well as index it and upload it to a data store (provided the person doing the indexing has an account with Azure, S3, or B2, which are the currently supported backends).

Is your map set a collection of zip/rar/archive files as may have been originally distributed/uploaded by authors, or are they unpacked .unr and related texture/code/config/etc support files in various directories?

Unreal Archive is currently set up to deal with the former - "original" archives as distributed by content authors. Unfortunately it doesn't have a good mechanism for dealing with individual "loose" content files at the moment.

If you want to try it out, there's a pre-built binary distribution available at: https://code.shrimpworks.za.net/artefacts/net/shrimpworks/unreal-archive/latest/unreal-archive-latest.zip

This can be used to scan a directory for new or unknown content:

$ bin/unreal-archive scan /path/to/content --new-only=true

The output of the scan command is a simple semicolon-delimited set of fields, intended for exactly what you propose - finding and refining lists of things to index and upload. I typically feed this through various grep filters to help narrow things down.

Maybe as a starting point, give that a try and see what it finds? The caveat is that it expects to find archive files, not .unr files: if you feed it .unr files, they will be hashed independently and considered unique compared to existing content within archives, so everything would appear "new".
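For anyone who would rather post-process the scan output in a script than with grep, here is a minimal sketch. It assumes the header line is `File;Known;Current Type;Scanned Type;Failure` (as printed by version 1.9.212 further down this thread) and that each result row carries exactly those five fields; the row format is an assumption, since no result rows are shown anywhere in this thread. `parse_scan_output` and `select` are hypothetical helper names.

```python
import csv

# Header as printed by the scan command in this thread (version 1.9.212).
SCAN_HEADER = ["File", "Known", "Current Type", "Scanned Type", "Failure"]


def parse_scan_output(text: str) -> list[dict[str, str]]:
    """Parse semicolon-delimited scan output into one dict per result row.

    Assumption: result rows follow the header and have exactly five fields.
    Anything else (version banner, progress lines, summary) is skipped.
    """
    rows = []
    seen_header = False
    reader = csv.reader(text.splitlines(), delimiter=";", quoting=csv.QUOTE_NONE)
    for fields in reader:
        fields = [field.strip() for field in fields]
        if fields == SCAN_HEADER:
            seen_header = True
            continue
        if seen_header and len(fields) == len(SCAN_HEADER):
            rows.append(dict(zip(SCAN_HEADER, fields)))
    return rows


def select(rows: list[dict[str, str]], column: str, value: str) -> list[dict[str, str]]:
    """Keep rows whose column equals value, roughly what a grep filter would do."""
    return [row for row in rows if row[column] == value]
```

Captured output could then be narrowed with something like `select(parse_scan_output(text), "Scanned Type", "MAP")`, where `"MAP"` is only a guess at what a type value might look like.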

sairuk · 2023-02-19T05:47:10Z

Ok thanks, i'll take a look at that tool in the first instance, if need be i can knock up something to do the same anyway but sounds like it exists.

> Is your map set a collection of zip/rar/archive files as may have been originally distributed/uploaded by authors, or are they unpacked .unr and related texture/code/config/etc support files in various directories?

all of the above, i was predominately focused on release archives and variations and then documenting the contents which is how the tooling i was using at the time works. I also only scanned crc32 (for some reason)

i've upped the old dats for reference here where every game entry is an archive and every rom entry was the contents in the archive. I was messing around with multiple methods though so there may be some crossover where game is a directory of loose files in some dats (probably recovered from cache). I was looking at this in 2006/7 so it's been a while since i've been back

sairuk · 2023-02-19T12:48:20Z

had a go at the tool and ran into some issues pulling the archive data, i'll have another look later. I ended up writing my own tool in Python just to do the basic check I needed done: sha1 hash the local archive and look the hash up in the unreal-archive data. It's slower to pull the data but ran quickly enough for me.
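The basic check described above can be sketched roughly as follows. This is not the tool from this comment, just a minimal illustration: it assumes the known SHA1 hashes have already been collected from the unreal-archive data into a set of lowercase hex strings (how they are extracted is left out), and the suffix list and function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical list of archive extensions worth checking.
ARCHIVE_SUFFIXES = {".zip", ".rar", ".7z"}


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unknown_archives(root: Path, known_hashes: set[str]) -> list[tuple[Path, str]]:
    """Walk root recursively and return (path, sha1) for archives not in known_hashes.

    Assumption: known_hashes holds lowercase hex SHA-1 strings gathered
    beforehand from the archive's metadata.
    """
    unknown = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            digest = sha1_of_file(path)
            if digest not in known_hashes:
                unknown.append((path, digest))
    return unknown
```

Because the comparison is on the hash of the whole archive file, a repacked copy of the same map shows up as unknown, which matches the "new files or variations on known" distinction made in this thread.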

I have ~1500 potentially new files or variations on known. Noting it seems past me didn't retain the release casing for filenames, lowercased everything, annoying.

e.g. dm-sdm-bullspumphouse.zip, which includes the BullPHsounds dependency that is missing from the current entry

i'll have a read in how to bulk submit, or i may just up them and you can grab them

shrimpza · 2023-02-20T00:14:41Z

Cool. Unfortunate it didn't just work. What errors or problems did you encounter getting the archive data? It should automatically download it if you don't manually git checkout and provide a --content-path. I've been doing a lot of refactoring recently and may have broken something.

Anyway, if you've managed to narrow down just the new or variant content, and if you have somewhere to upload it even temporarily where I can grab it, I'm happy to do a bulk index on my end.

sairuk · 2023-02-21T07:34:49Z

there was an initial error that i didn't capture, but now when trying to download the data i run into this

$ bin/unreal-archive scan /mnt/files/games/video/ut_all/ut99_map_archive_test/ --new-only=true
Unreal Archive version 1.9.212
content-path not specified. Download and use read-only content data? [Y/n]
> y
Exception in thread "main" java.lang.IllegalArgumentException
	at java.base/java.util.Optional.orElseThrow(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.contentPath(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.contentRepo(Unknown Source)
	at unreal.archive/net.shrimpworks.unreal.archive.Main.main(Unknown Source)

using --content-path after cloning the repo manually did work, this is the output i had from it working

$ time bin/unreal-archive scan /mnt/files/games/video/ut_all/ut99_map_archive_test/ --new-only=true --content-path=$HOME/mounts/insertcredit/devel/com.github/unreal-archive-data/
Unreal Archive version 1.9.212
Loaded content index with 57317 items (310.18GB) in 204.85s
Found 3 file(s) to scan  
File;Known;Current Type;Scanned Type;Failure
[3/3] : cnq-overlord.zip  
Completed scanning 3 files

real	3m26.182s
user	0m48.197s
sys	0m38.538s

shrimpza · 2023-02-21T13:08:17Z

Interesting, thanks. I'll have to validate that, I may well have broken it in the recent refactors 😬

sairuk · 2023-02-22T01:33:50Z

@shrimpza tried messaging you on discord to discuss uploads but it's being a pain, reach out if you can.
