
Clustering problems #25

Open
silvia1993 opened this issue Jun 13, 2020 · 3 comments

Comments

silvia1993 commented Jun 13, 2020

Hello,
thank you very much for sharing your project!

I'm trying to apply this algorithm to a set of RGB images (cartoons); in particular, I have 2344 samples of dimension [227,227,3] spanning 7 classes. The algorithm is not able to cluster the images correctly: in the end I get ~0.2 ACC with 1220 clusters. I carefully read all the issues solved in this repository but I cannot solve my problem, so I list each step that I did, hoping for feedback about a possible mistake:

  1. I made my dataset using the file "make_data.py" with normalization to [-1,1]. At the end I have testdata.mat and traindata.mat. Each row in these matrices is the concatenation of the three channels, i.e. [R,G,B] -> [51529,51529,51529] (51529 = 227x227). Considering testdata.mat and traindata.mat together, I have a 2344x154587 matrix (see the sketch after this list for the layout I am using).

  2. Next I run the "pretraining.py" file using --batch_size=256, --niter=1831 (in order to have 200 epochs as suggested), --step=733 (to have 80 epochs as suggested), --lr=0.01 (since the dimension of my data samples is higher than in the other datasets used with this framework, I thought this could be a good choice), and --dim=10.

  3. With the file checkpoint_4.pth.tar obtained in step 2, I extract the features of the dataset, obtaining "pretrained.pkl".

  4. I construct the graph on the original data using "edge_construction.py" with --algo knn, --k 10, --samples 2344, and get the "pretrained.mat" file.

  5. Then I run "copyGraph.py" to produce the final "pretrained.mat" file.

  6. Finally, I run "DCC.py", leaving all values at their defaults.
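
To make the data layout concrete, here is a minimal sketch of what step 1 amounts to on my side (plain numpy/scipy; the real "make_data.py" may differ in details, and the .mat field names below are just placeholders):

```python
import numpy as np
from scipy.io import savemat

# images: uint8 array of shape (2344, 227, 227, 3), labels: int array of shape (2344,)
def build_matrix(images):
    x = images.astype(np.float32) / 255.0 * 2.0 - 1.0   # normalize to [-1, 1]
    # concatenate the three channels per row: [R, G, B]
    # each channel flattens to 227*227 = 51529 values, so a row has 3*51529 = 154587 entries
    r = x[..., 0].reshape(len(x), -1)
    g = x[..., 1].reshape(len(x), -1)
    b = x[..., 2].reshape(len(x), -1)
    return np.concatenate([r, g, b], axis=1)             # shape (2344, 154587)

# X = build_matrix(images)
# savemat('traindata.mat', {'X': X, 'Y': labels})        # field names are placeholders

# Iteration arithmetic used in step 2 (2344 samples, batch size 256):
#   2344 / 256 ≈ 9.16 iterations per epoch
#   200 epochs ≈ 1831 iterations (--niter), 80 epochs ≈ 733 iterations (--step)
```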

I also tried a higher k (k=20) and mknn instead of knn, but things do not seem to change.
Do you have any idea why the algorithm does not work properly with my data?
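
For completeness, this is what I understand the difference between --algo knn and mknn to be (a sketch with scikit-learn, not the repository's own "edge_construction.py"): knn keeps an edge whenever one point is among the k nearest neighbours of the other, while mknn keeps it only when the relation is mutual.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edges(X, k=10, mutual=False):
    """Return a set of undirected edges (i, j) with i < j."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
    _, idx = nn.kneighbors(X)
    neigh = [set(int(j) for j in row[1:]) for row in idx]   # drop the self-neighbour
    edges = set()
    for i, js in enumerate(neigh):
        for j in js:
            if mutual and i not in neigh[j]:
                continue                                # mknn: keep only reciprocal pairs
            edges.add((min(i, j), max(i, j)))
    return edges

# edges_knn  = knn_edges(X, k=10, mutual=False)   # --algo knn
# edges_mknn = knn_edges(X, k=10, mutual=True)    # --algo mknn (usually sparser)
```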

@sumeromer

@silvia1993 : I have a similar question, indeed. I am searching for a better architecture to use with the DCC losses, because all the datasets (MNIST, YTF, Coil100, and YaleB) are toy datasets, and the current fully connected or convolutional architectures will not be enough for 227x227 RGB images.

@shahsohil : Do you have any recommendations to try on ImageNet-like images? Did you experiment on them?
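
Just to make the point concrete, something along these lines is what I mean by a deeper encoder for 227x227x3 inputs; a rough PyTorch sketch of a generic convolutional encoder down to a 10-dimensional embedding, not code from this repository (the SDAE-style pretraining would also need a matching decoder):

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Generic convolutional encoder: 3x227x227 -> 10-dim embedding (illustrative only)."""
    def __init__(self, dim=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),    # 227 -> 114
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),   # 114 -> 57
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 57 -> 29
            nn.AdaptiveAvgPool2d(4),                                             # 29 -> 4
        )
        self.embed = nn.Linear(128 * 4 * 4, dim)

    def forward(self, x):
        h = self.features(x)
        return self.embed(h.flatten(1))
```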


ilyak93 commented Oct 12, 2020

@sumeromer, @silvia1993 I had a related problem: I always got one very dominant cluster containing most of the data, plus many singleton clusters or clusters with just a few examples. Did this happen to you?

@shsaronian

@sumeromer, @silvia1993 I had a related problem: I always got one very dominant cluster containing most of the data, plus many singleton clusters or clusters with just a few examples. Did this happen to you?

It also happened to me; it's kind of like overfitting to all the data points and clustering them into a single group. I don't know if that makes sense, since overfitting is mostly a supervised-learning concept.
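
A quick way to check this symptom is to look at the cluster-size distribution and at the clustering accuracy (the ACC mentioned above, i.e. the standard Hungarian matching between clusters and classes); a sketch, assuming the predicted assignments and ground-truth labels are 0-based integer arrays:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_sizes(y_pred):
    """Sorted cluster sizes; one huge entry plus many tiny ones means a dominant cluster."""
    _, counts = np.unique(y_pred, return_counts=True)
    return np.sort(counts)[::-1]

def clustering_acc(y_true, y_pred):
    """Unsupervised clustering accuracy via optimal cluster-to-class matching."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    d = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((d, d), dtype=np.int64)        # contingency table: clusters x classes
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)        # negate to maximize the matched counts
    return w[row, col].sum() / y_pred.size
```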
