Questions regarding the test data set creation #4
Thank you for your attention to my work and for pointing out some dataset-related issues in the comments.
The evaluation metrics on the test set can also be improved by fine-tuning some hyperparameters; for instance, the accuracy can be raised from 96.52% to 96.92%. However, I do not think these tweaks contribute much to the research on this topic; it is just a tip.
Can you publish a script for preprocessing the HAM10000 data and splitting it into the three subsets? I only get 50% accuracy with my own variant, even when using your code and model, so the difference might be due to the data.
I think you can use the scripts from https://github.com/Woodman718/CapsNets/tree/main/tools to split the data. There could be a problem with the test set, though: isn't it possible that images of the same lesion_id end up in both the train set and the test set? For the validation split, lesion_ids that have multiple images are explicitly excluded.
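For reference, here is a minimal sketch (not the repository's own script) of a lesion-aware split that keeps all images sharing a lesion_id in the same subset, together with a sanity check for the kind of leakage described above. The metadata file name, split ratios, and random seed are assumptions.

```python
# Sketch of a leakage-free HAM10000 split: every image of a given lesion_id
# lands in exactly one subset. Assumes HAM10000_metadata.csv with columns
# lesion_id and image_id (ratios and seed are arbitrary choices).
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("HAM10000_metadata.csv")

# Carve out the test set, grouping by lesion_id so no lesion spans both sides.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_val_idx, test_idx = next(gss.split(meta, groups=meta["lesion_id"]))
train_val, test = meta.iloc[train_val_idx], meta.iloc[test_idx]

# Split the remainder into train and validation, again grouped by lesion_id.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, val_idx = next(gss_val.split(train_val, groups=train_val["lesion_id"]))
train, val = train_val.iloc[train_idx], train_val.iloc[val_idx]

# Sanity check: no lesion_id is shared between any two subsets.
assert set(train["lesion_id"]).isdisjoint(test["lesion_id"])
assert set(train["lesion_id"]).isdisjoint(val["lesion_id"])
assert set(val["lesion_id"]).isdisjoint(test["lesion_id"])
```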
I was able to execute the code in https://github.com/Woodman718/CapsNets/blob/main/Experiment/HAM10000_9652.zip.
I optimized the loss function by introducing matrix norms and condition numbers, making the model more sensitive to errors. Changes in the various hyperparameters will affect the final results. Batch size is a crucial hyperparameter that affects both learning and inference. In a previous experiment, adjusting the β parameter in the squash function improved the accuracy from 96.52% to 96.92%: setting β below its initial value of 1.45 once the model has converged can further improve the score (β=1.33, Acc=96.92%). Similarly, adjusting the size of the LKC (Local Kernel Canonicalization) also affects the score: with LKC=24 the accuracy reaches 96.98%, but with LKC=[11,15,24] the model fails to correctly identify DF cases. Furthermore, replacing adaptive max pooling with fractional max pooling can increase accuracy to 97.34%; however, the model becomes unstable, with accuracy fluctuating by more than 3%, and many repeated experiments may be required to obtain consistent results.
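To make these two tweaks concrete, here is an illustrative sketch. The β-parameterized squash shown is only one plausible formulation and is not taken from the repository; the pooling swap simply replaces `nn.AdaptiveMaxPool2d` with PyTorch's `nn.FractionalMaxPool2d`. Kernel size, output size, and tensor shapes are assumptions.

```python
# Illustrative only: a β-parameterized squash and the pooling swap described
# above. The repository's exact formulation may differ; here β scales the
# saturation point of the squashing non-linearity.
import torch
import torch.nn as nn

def squash(s: torch.Tensor, beta: float = 1.45, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    # One possible β-variant of the capsule squash (an assumption, not the
    # repository's function): smaller β makes the non-linearity saturate later.
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    norm = torch.sqrt(norm_sq + eps)
    scale = (beta * norm_sq) / (1.0 + beta * norm_sq)
    return scale * s / norm

# Adaptive max pooling is deterministic for a fixed input; fractional max
# pooling draws stochastic pooling regions, which is one plausible source of
# the run-to-run accuracy fluctuation mentioned above.
adaptive_pool = nn.AdaptiveMaxPool2d(output_size=(8, 8))
fractional_pool = nn.FractionalMaxPool2d(kernel_size=2, output_size=(8, 8))

x = torch.randn(4, 256, 16, 16)           # hypothetical feature map
print(adaptive_pool(x).shape)             # torch.Size([4, 256, 8, 8])
print(fractional_pool(x).shape)           # torch.Size([4, 256, 8, 8])
```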
I have a few questions to understand how these very good metrics came about. Do I understand correctly that the augmentations were applied first and the data was then split into the three subsets for training, validation, and testing? Doesn't that lead to overfitting, or to mere recognition of images, when augmented variants of the same image end up in both the training and test sets? Or was this prevented and I missed it? I think the usual approach would be to use a dataloader with a WeightedRandomSampler and augmentations for training only, and to leave the test set unchanged. How does the quality of your model (accuracy, average F1, MCC) actually look on the official test dataset? https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/DBW86T/OSKJF2&version=4.0
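For comparison, a minimal sketch of the conventional pipeline described in this question: augmentations and a WeightedRandomSampler for the training loader only, with the test data left untransformed. The directory layout, transforms, and batch size are assumptions.

```python
# Sketch of a train-only augmentation setup with class-balanced sampling
# (assumed ImageFolder layout: HAM10000/train/<class>/*.jpg, HAM10000/test/<class>/*.jpg).
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(20),
    transforms.ToTensor(),
])
eval_tf = transforms.Compose([transforms.ToTensor()])  # no augmentation for evaluation

train_ds = datasets.ImageFolder("HAM10000/train", transform=train_tf)
test_ds = datasets.ImageFolder("HAM10000/test", transform=eval_tf)

# Inverse-frequency weights so rare classes are sampled as often as common ones.
targets = torch.tensor(train_ds.targets)
class_counts = torch.bincount(targets)
sample_weights = (1.0 / class_counts.float())[targets]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)

train_loader = DataLoader(train_ds, batch_size=32, sampler=sampler)
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)
```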