
Feature: Asking for help to add new policies for converting CheXpert target classes in our custom Dataset class #9

Open
kdg1993 opened this issue Dec 7, 2022 · 5 comments

@kdg1993
Collaborator

kdg1993 commented Dec 7, 2022

What

  • Add more policy options for converting missing and uncertain labels, based on statistical or intuitive considerations rather than domain knowledge or validation scores

Why

While looking into the target class distribution of the CheXpert CSV data, I found an interesting possibility for data handling.
The figure below is a snapshot of the target distribution from my personal exploration of CheXpert.

[Figure: snapshot of the target label distribution per disease column in the CheXpert train and validation CSVs]

Meanwhile, our current custom Dataset class converts the labels as follows (I'm not sure, but I guess this conversion is score-based):

NaN -> 0
-1 -> 1 (if the target is 'Edema' or 'Atelectasis')
-1 -> 0 (if the target is neither 'Edema' nor 'Atelectasis')
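
For clarity, here is a minimal sketch of how I understand that conversion; the function name and the pandas-based implementation are just illustrative, not our actual Dataset code:

```python
import pandas as pd

# Columns where the current code maps -1 to 1 (everything else maps -1 to 0).
UNCERTAIN_AS_ONE = {"Edema", "Atelectasis"}

def convert_labels_current(df: pd.DataFrame, target_cols: list) -> pd.DataFrame:
    """Sketch of the current policy: NaN -> 0, then -1 -> 1 or 0 depending on the column."""
    out = df.copy()
    for col in target_cols:
        out[col] = out[col].fillna(0)
        out[col] = out[col].replace(-1, 1 if col in UNCERTAIN_AS_ONE else 0)
    return out
```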

  • In my opinion, converting NaN to 0 is acceptable because 'not mentioned' often means negative (0). So the real question is how to convert the '-1' labels
  • In the train set, 11 of the 14 disease columns have more 1 labels than 0 labels, so converting -1 to 1 also makes sense to me
  • In line with this distribution-based thinking, converting -1 by randomly sampling from the observed distribution of 0s and 1s could also be an interesting approach (see the rough sketch after this list)
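
As a rough illustration of the distribution-based options above (majority class and random sampling), something like the following could work; the function name and the policy names are hypothetical, just to make the idea concrete:

```python
import numpy as np
import pandas as pd

def convert_uncertain(df: pd.DataFrame, col: str, policy: str = "random", seed: int = 0) -> pd.Series:
    """Replace -1 in one target column by a simple statistical policy ('majority' or 'random')."""
    s = df[col].fillna(0).astype(float)      # keep NaN -> 0 as in the current code
    observed = s[s.isin([0.0, 1.0])]         # labels we actually observe
    mask = s == -1
    if policy == "majority":
        s.loc[mask] = float(observed.mode().iloc[0])                 # -1 -> most frequent observed label
    elif policy == "random":
        rng = np.random.default_rng(seed)
        s.loc[mask] = rng.binomial(1, observed.mean(), mask.sum())   # -1 -> 0/1 with the observed frequency of 1s
    else:
        raise ValueError(f"unknown policy: {policy}")
    return s
```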

Likewise, I think there are many statistical or intuitive ways of handling missing values from the traditional ML field that could apply here. So I'd like to discuss this, and I carefully ask for help to make these options usable in our custom code.

FYI, I included the distribution of the validation set just for sharing, but I'm afraid that choosing a conversion policy based on the validation set distribution might be connected to a data leakage issue. Probably everyone knows this already, but I mention it as a reminder 😄

How

  • Any kind of interesting idea can be an option
  • My simple idea for now is to convert -1 to the majority class, or to fill it by random sampling (a toy run of the sketch above is shown below)
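
And here is a toy run of that sketch (assuming the convert_uncertain function above is defined; the values are made up), just to show what the output would look like:

```python
import pandas as pd

# Toy check reusing the convert_uncertain sketch from the "Why" section above.
toy = pd.DataFrame({"Edema": [1, -1, 0, None, -1, 1, 1]})
print(convert_uncertain(toy, "Edema", policy="majority"))  # both -1 entries become 1 (the majority observed label)
print(convert_uncertain(toy, "Edema", policy="random"))    # each -1 becomes 0 or 1 with the observed frequency of 1s
```
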
@kdg1993 kdg1993 added the Discussion Extra attention is needed label Dec 7, 2022
@jieonh
Collaborator

jieonh commented Dec 7, 2022

In the paper that ranked 2nd on the CheXpert benchmark, I found that various experiments on uncertain labels were conducted (https://arxiv.org/abs/1911.06475).

A brief summary of the experimental settings is as follows, and the results table from the experiments is attached below.

  • default settings: U-Ignore, U-Ones, U-Zeros
  • additional policies:
    • CT (Conditional Training): exploits the hierarchical structure between labels
    • LSR (Label Smoothing Regularization): replaces the hard uncertain labels with smoothed soft labels (rough sketch below)
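
If I read the paper right, the U-Ones+LSR variant replaces each uncertain (-1) label with a soft label sampled from a small interval. A rough sketch of that idea (the interval values are my reading of the paper, so please double-check before relying on them):

```python
import numpy as np

def u_ones_lsr(labels: np.ndarray, low: float = 0.55, high: float = 0.85, seed: int = 0) -> np.ndarray:
    """U-Ones + LSR sketch: each uncertain (-1) label becomes a soft label drawn from U(low, high)."""
    rng = np.random.default_rng(seed)
    out = labels.astype(float).copy()
    mask = out == -1
    out[mask] = rng.uniform(low, high, size=mask.sum())  # soft positive labels instead of hard 1s
    return out
```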

Presumably, the policy used in libauc, which differs only for two columns, was chosen because it turned out to be the most accurate option after they conducted several experiments themselves.

[Table: experimental results for the uncertainty-label policies reported in the paper]

@kdg1993
Collaborator Author

kdg1993 commented Dec 7, 2022

Thank you so much for answering my question about the reason for the label conversion, @jieonh 👍

If I understand what you shared correctly, the table supports choosing the conversion based on validation score. I couldn't agree more that a score-based method is one of the most concrete and well-supported ways to choose an experimental setting.

However, in terms of providing convincing experiment options to the users of our testbed, I think my suggestion to expand the data conversion options is still worth doing.

So, first I want to ask whether this work is worth doing.
Secondly, if it is worth it, I'd like to know whether anyone is interested in doing it. If it's worthwhile but everyone is busy, then I think it's on me to do it 😄
Please let me know your opinion; even a single word of reply is totally fine and appreciated.

@jieonh
Collaborator

jieonh commented Dec 7, 2022

When I looked into it a little more, it seems there is a lot of ongoing research on uncertainty quantification. I guess that's because the importance of data-centric AI is emerging these days, so I agree that further investigating the data itself is a worthwhile process.

I'm not sure I can fully concentrate on that task for now, but I can assist you or do some research to catch up (if that would be any help!).

+) Does anyone know the exact difference between uncertain labels (-1) and missing values (NaN)? I'm a little bit confused.

@chrstnkgn
Collaborator

chrstnkgn commented Dec 7, 2022

In addition, I found a detailed datasheet for CheXpert for those of you who may be interested
https://arxiv.org/pdf/2105.03020.pdf

You can refer to pp. 3-6 for the info we are looking for (the labeling protocol)! In summary, the labeling section of this sheet explains how labels are assigned based on keywords found in the report, and how the 'No Finding' label is assigned (which fully addressed my concern about normal data).

+) I think this sheet might also explain the difference between the -1 and NaN labels that @jieonh just asked about. You can refer to Table 3 of the sheet, which describes the label definitions.

@kdg1993
Collaborator Author

kdg1993 commented Dec 7, 2022

What a nice reference you shared, @chrstnkgn! 😆

I haven't fully read the whole paper yet, but it already answered some of my questions! In particular, pp. 2-5, Fig. 1, and Table 3 are thoroughly informative and convinced me that converting the uncertain (-1) labels is worth doing, given the similarity between the train and validation sets.

@seoulsky-field seoulsky-field added this to the Dataset: CheXpert milestone Dec 30, 2022
@seoulsky-field seoulsky-field added the Data About data or preprocessing label Feb 23, 2023