Fingerprint caching #1530

oomwat · 2025-05-17T18:56:33Z

Right now the comparison step is the one that takes the most time, and it does not need to be this way.

If the caching were approached from the perspective of a graph database, we would first build a list of file paths, modification dates and sizes, then iterate across these to generate fingerprints or extract metadata for comparison. The fingerprints would then be linked to the file objects bi-directionally, so that finding duplicates becomes a query for the fingerprints that are linked to more than one file, rather than an iteration across all files to compare fingerprints.
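
A minimal sketch of that bi-directional link, assuming fingerprints are reduced to a string key and matched by exact equality. The names `FileNode`, `FingerprintIndex`, `link`, `unlink` and `duplicates` are hypothetical and do not come from the czkawka code base.

```typescript
// Hypothetical record shape for illustration only.
interface FileNode {
    path: string;
    size: number;
    modifiedDate: number;
    fingerprintKey?: string; // edge: file -> fingerprint
}

class FingerprintIndex {
    private files = new Map<string, FileNode>();            // path -> file
    private byFingerprint = new Map<string, Set<string>>(); // fingerprint key -> paths

    // Link a file to a fingerprint in both directions, dropping any older link first.
    link(file: FileNode, fingerprintKey: string): void {
        this.unlink(file.path);
        this.files.set(file.path, { ...file, fingerprintKey });
        let paths = this.byFingerprint.get(fingerprintKey);
        if (paths === undefined) {
            paths = new Set<string>();
            this.byFingerprint.set(fingerprintKey, paths);
        }
        paths.add(file.path);
    }

    // Remove a file and the edge from its fingerprint back to it.
    unlink(path: string): void {
        const existing = this.files.get(path);
        if (existing === undefined) return;
        this.files.delete(path);
        if (existing.fingerprintKey === undefined) return;
        const paths = this.byFingerprint.get(existing.fingerprintKey);
        if (paths === undefined) return;
        paths.delete(path);
        if (paths.size === 0) this.byFingerprint.delete(existing.fingerprintKey);
    }

    // The duplicate query: every fingerprint that is linked to more than one file.
    duplicates(): string[][] {
        const groups: string[][] = [];
        this.byFingerprint.forEach((paths) => {
            if (paths.size > 1) groups.push(Array.from(paths).sort());
        });
        return groups;
    }
}

const index = new FingerprintIndex();
index.link({ path: "/music/a.mp3", size: 10, modifiedDate: 1 }, "fp-1");
index.link({ path: "/music/b.mp3", size: 12, modifiedDate: 2 }, "fp-1");
index.link({ path: "/music/c.mp3", size: 14, modifiedDate: 3 }, "fp-2");
console.log(index.duplicates()); // [ [ '/music/a.mp3', '/music/b.mp3' ] ]
```

With this layout the duplicate lookup only walks the fingerprint keys, so its cost does not depend on comparing files pairwise. It would not catch fingerprints that are merely similar rather than identical.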

This would also cache the fingerprint comparison by default, and re-scanning/fingerprinting would only be needed for new paths or for files whose modification date or size has changed.
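
The re-scan rule could look roughly like the sketch below. It assumes the cache is keyed by path and stores the size and modification date seen at fingerprinting time; `ScannedFile`, `CachedFile`, `RescanPlan` and `planRescan` are hypothetical names used only for illustration.

```typescript
// Hypothetical shapes for illustration; not taken from the czkawka code base.
interface ScannedFile {
    path: string;
    size: number;
    modifiedDate: number;
}

interface CachedFile extends ScannedFile {
    fingerprintKey: string;
}

interface RescanPlan {
    toFingerprint: ScannedFile[]; // new paths, or size/date changed
    reusable: CachedFile[];       // cached fingerprint is still valid
    removed: string[];            // cached paths that no longer exist on disk
}

function planRescan(scanned: ScannedFile[], cache: Map<string, CachedFile>): RescanPlan {
    const plan: RescanPlan = { toFingerprint: [], reusable: [], removed: [] };
    const seen = new Set<string>();
    scanned.forEach((file) => {
        seen.add(file.path);
        const cached = cache.get(file.path);
        if (cached !== undefined
            && cached.size === file.size
            && cached.modifiedDate === file.modifiedDate) {
            plan.reusable.push(cached);
        } else {
            plan.toFingerprint.push(file);
        }
    });
    cache.forEach((_entry, path) => {
        if (!seen.has(path)) plan.removed.push(path);
    });
    return plan;
}

const cache = new Map<string, CachedFile>();
cache.set("/music/a.mp3", { path: "/music/a.mp3", size: 10, modifiedDate: 1, fingerprintKey: "fp-1" });
cache.set("/music/b.mp3", { path: "/music/b.mp3", size: 12, modifiedDate: 2, fingerprintKey: "fp-1" });
cache.set("/music/c.mp3", { path: "/music/c.mp3", size: 14, modifiedDate: 3, fingerprintKey: "fp-2" });

const plan = planRescan([
    { path: "/music/a.mp3", size: 10, modifiedDate: 1 }, // unchanged
    { path: "/music/b.mp3", size: 99, modifiedDate: 2 }, // size changed
    { path: "/music/d.mp3", size: 20, modifiedDate: 4 }, // new path
], cache);
console.log(plan.toFingerprint.map((f) => f.path)); // [ '/music/b.mp3', '/music/d.mp3' ]
console.log(plan.removed);                          // [ '/music/c.mp3' ]
```

Only the entries in `toFingerprint` would need the expensive fingerprinting pass; everything in `reusable` keeps its existing link.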

oomwat · 2025-05-17T20:36:32Z

I've seen some references to cache_file_json in the code; where would I be able to find that file? My recent runs have only produced a .bin file. I'm currently doing a run with pretty-print JSON output, but from what I read that only outputs the duplicates, not the fingerprints.

If I can get hold of the fingerprint cache in JSON format, then I should be able to code something up quite quickly that finds the dupes using JS associative arrays. I'm a Java/JS developer, so my Rust is not good enough to code it in the same language.

I'm seeing references to save_also_as_json in the GUI, but not in the CLI.

oomwat · 2025-05-17T22:11:00Z

The process finally finished, and the pretty-print JSON seems to have the data I need.

oomwat · 2025-05-21T18:07:20Z

OK, so the pretty-print JSON does not appear to contain the data I need. It only seems to contain the cache entries for the duplicates found, which is unfortunate. However, if there is some way that I can get hold of the raw cache data in JSON format, then I should be able to re-run this code to provide a more accurate result.

The console output of the code below, with the progress dots and file matches removed, is as follows.

Finished processing JSON file19:00:12
</snip>
Finished processing duplicates19:00:12
Processing time: 386ms
Number of files: 15044

It was generated by the following REALLY UGLY code. It's hacked together and in no way production ready, but it proves the theory.

// @ts-ignore
import JSONStream from 'JSONStream';
import {radix64} from "radix64";
// @ts-ignore
import * as fs from "node:fs";

// One entry of the pretty-print JSON output, as this script reads it.
type FingerprintData = {
    size: number;
    path: string;
    modified_date: number;
    fingerprint: number[];
    track_title: string;
    track_artist: string;
    year: string;
    length: string;
    genre: string;
    bitrate: number;
}

class FileEntry {
    path: string;
    size: number;
    modified_date: number;
    fingerprintKey: string;

    constructor(data: FingerprintData, key: string) {
        this.path = data.path;
        this.size = data.size;
        this.modified_date = data.modified_date;
        this.fingerprintKey = key;
    }
}

class FingerprintEntry {
    fingerprint: number[];
    files: FileEntry[];

    constructor(fingerprint: number[]) {
        this.fingerprint = fingerprint;
        this.files = [];
    }
}

const fileStream = fs.createReadStream('czkawaka.json', {encoding: 'utf8'});
const parser = JSONStream.parse('*');

const files: Array<FileEntry> = [];
const fingerprints: { [key: string]: FingerprintEntry } = {};

console.log('Processing JSON file'+(new Date()).toLocaleTimeString());
fileStream.pipe(parser);

// Each 'data' event carries one top-level element of the JSON file: an array of entries.
parser.on('data', (data: FingerprintData[]) => {
    process.stdout.write('.');
    data.forEach(item => {
        // Build a string key from the fingerprint so it can be used as an object property.
        const key = item.fingerprint.map(num => radix64(num)).join(',');
        const fileEntry = new FileEntry(item, key);
        files.push(fileEntry);
        // Reuse the entry for this fingerprint if it exists, otherwise create it.
        const fingerprintEntry = fingerprints[key] ?? new FingerprintEntry(item.fingerprint);
        fingerprintEntry.files.push(fileEntry);
        fingerprints[key] = fingerprintEntry;
    });
});

parser.on('end', () => {
    // Timing starts here, so it covers only the duplicate lookup, not the JSON parsing.
    const start = new Date();
    console.log('Finished processing JSON file'+start.toLocaleTimeString());
    // @ts-ignore
    Object.entries(fingerprints).forEach(([key, entry]) => {
        // A fingerprint linked to more than one file is a duplicate group.
        if (entry.files.length > 1) {
            entry.files.forEach((file: FileEntry) => {
                console.log(file.path);
            });
            console.log('');
        }
    });
    const end = new Date();
    console.log('Finished processing duplicates'+end.toLocaleTimeString());
    console.log('Processing time: ' + (end.getTime() - start.getTime()) + 'ms');
    console.log('Number of files: ' + files.length);
});
