Documentation of way to get complete distance matrix #66

Closed
advance512 opened this issue Mar 28, 2017 · 1 comment

Comments

@advance512

Hi there,

I have a set of 4000 images that I want to cluster. The images were taken by various fixed cameras (each might move very slightly due to wind), some by day and some at night, and they may contain people, dogs, cats, etc. I am trying to cluster the images by camera (i.e. each cluster should contain only images taken by the same camera).

I'm planning on using HDBSCAN for this:
http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

I've got image-match running and have made the following modifications to the library to try to get a complete distance matrix:

I have tried setting distance_cutoff of SignatureDatabaseBase() to 1.0 and size of SignatureES() to 4000, but I still seem to be getting a sparse 4000x4000 matrix.

Is there any easy way to get the full distance matrix?


Also, any hints on when increasing k, N, and n_grid actually gives more precise results?

I also noticed some images contain specific textual labels embedded in the image in the same places (like date/time and camera name). Since these labels aren't big, I'm pretty sure they're mostly ignored here - am I right?

@rhsimplex
Owner

rhsimplex commented Mar 31, 2017

For 4000 images, I would not use the database part of the package. Just call the generate_signature method of the ImageSignature class in image_match/goldberg.py on each of your images, then apply normalized_distance to every pair of signatures to build your distance matrix.
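
A minimal sketch of that all-pairs approach, in case it helps. The helper `pairwise_distance_matrix` is a hypothetical name, not part of image-match. The `normalized_distance` function below is a pure-Python stand-in that assumes the library's distance is ||a - b|| / (||a|| + ||b||); in practice you would pass `ImageSignature().normalized_distance` instead, as shown in the comments.

```python
from itertools import combinations
import math


def pairwise_distance_matrix(signatures, distance):
    """Return a dense, symmetric n x n matrix (list of lists) of distances.

    `signatures` is any sequence of signature vectors and `distance` is a
    callable taking two signatures. Both names are illustrative only.
    """
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    # Each unordered pair is computed once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        d = float(distance(signatures[i], signatures[j]))
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


def normalized_distance(a, b):
    """Stand-in distance (an assumption): ||a - b|| / (||a|| + ||b||)."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    total = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    return diff / total if total else 0.0


# With image-match installed, the real calls would look roughly like this
# (image_paths is a hypothetical list of file names):
#
#   from image_match.goldberg import ImageSignature
#   gis = ImageSignature()
#   signatures = [gis.generate_signature(path) for path in image_paths]
#   matrix = pairwise_distance_matrix(signatures, gis.normalized_distance)
```

For 4000 images this is 4000 * 3999 / 2 = 7,998,000 distance evaluations, which is slow in pure Python but is only done once. The result can be converted with `numpy.asarray(matrix)` and handed to a clusterer that accepts precomputed distances, for example `hdbscan.HDBSCAN(metric='precomputed')`.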

Roughly speaking, decreasing k and increasing N should give you better results at the expense of lookup speed. Similarly, increasing n_grid should give you more discerning signatures (i.e. longer). I haven't tested anything but the defaults with any rigor though.
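
To make the "longer signatures" point concrete, here is a rough back-of-the-envelope sketch. It assumes the signature compares each of the n_grid x n_grid grid points with its 8 neighbours, which is my understanding of the scheme rather than something verified against the code; `signature_length` is a hypothetical helper, not a library function.

```python
def signature_length(n_grid, neighbors=8):
    """Estimated signature length: one entry per grid point per neighbour.

    Assumes an n_grid x n_grid grid and `neighbors` comparisons per point
    (8 when diagonal neighbours are included). Illustrative only.
    """
    return n_grid * n_grid * neighbors


for n in (9, 12, 16):
    print(n, signature_length(n))
```

Under that assumption the length grows quadratically with n_grid, so a larger grid carries more detail per image but also makes each pairwise comparison more expensive.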

You are correct in that the labels shouldn't make much of a difference. If you have a couple examples of images you expect to cluster, could you post them here so I could advise further?

Closing the issue, feel free to reopen.
