Bugs in sparse data with missing nominal values #52

Closed
RinyChau opened this issue Mar 30, 2017 · 7 comments

Comments

@RinyChau

Because the default value is str(0), decoding may raise a BadNominalValue() error when the data contains missing nominal values.

@kyao

kyao commented Feb 21, 2018

Because of this bug, liac-arff cannot decode many OpenML datasets, such as https://www.openml.org/d/35002

The default behavior of Weka is to fill in a missing nominal value with the first value in the nominal specification list.
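
To make the difference concrete, here is a minimal sketch of a sparse-row decoder. It is not liac-arff's code: `decode_sparse_row`, `ATTRIBUTES`, and the `nominal_default` switch are hypothetical names, `BadNominalValue` is a local stand-in for the error mentioned above, and the parser is deliberately naive (no quoting or escaping). With the `"zero"` default an omitted nominal entry becomes the string `"0"`, which is not a declared value; with the `"first"` default it becomes the first declared value, as described for Weka.

```python
class BadNominalValue(ValueError):
    """Local stand-in for the decoder error named in this issue."""


def decode_sparse_row(row, attributes, nominal_default="first"):
    """Decode one sparse ARFF data row such as '{0 1.5, 2 b}'.

    attributes: list of (name, kind) where kind is "NUMERIC" or a list
    of declared nominal values. Hypothetical helper, for illustration only.
    """
    body = row.strip().strip("{}").strip()
    given = {}
    if body:
        for pair in body.split(","):
            index, value = pair.strip().split(None, 1)
            given[int(index)] = value.strip()

    decoded = []
    for i, (name, kind) in enumerate(attributes):
        if isinstance(kind, list):  # nominal attribute
            if i in given:
                value = given[i]
            elif nominal_default == "first":
                value = kind[0]  # fill with the first declared value
            else:
                value = "0"  # the str(0) default described in this issue
            if value not in kind:
                raise BadNominalValue(
                    f"{value!r} is not declared for attribute {name!r}"
                )
            decoded.append(value)
        else:  # numeric attribute: omitted entries are 0
            decoded.append(float(given.get(i, 0)))
    return decoded


# Assumed example header: numeric, nominal {red, green}, numeric.
ATTRIBUTES = [("x", "NUMERIC"), ("colour", ["red", "green"]), ("y", "NUMERIC")]

print(decode_sparse_row("{0 1.5}", ATTRIBUTES))  # nominal entry filled with "red"
```

Calling the same function with `nominal_default="zero"` on the row `{0 1.5}` raises `BadNominalValue`, because `"0"` is not among `red` and `green`.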

kyao added a commit to kyao/liac-arff that referenced this issue Feb 21, 2018
@kyao kyao mentioned this issue Feb 21, 2018
@mfeurer
Collaborator

mfeurer commented Feb 21, 2018

Thank you a lot for pointing this out, but I would argue that this is not a bug in liac-arff, but in the arff specification. This behavior is not defined there, and I am very hesitant to add functionality here which is undocumented behavior of the WEKA arff reader. In fact, we deactivated the datasets in question on OpenML because they are not valid arff.

Also, this is a known bug in WEKA with a workaround on the WEKA side:

Warning: There is a known problem saving SparseInstance objects from datasets that have string attributes. In Weka, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient). However, the first string value is assigned index 0: this means that, internally, this value is stored as a 0. When a SparseInstance is written, string instances with internal value 0 are not output, so their string value is lost (and when the arff file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change). To get around this problem, add a dummy string value at index 0 that is never used whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.
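
The round trip described in that warning can be modelled in a few lines. This is a simplified sketch, not WEKA code: `save_sparse_column`, `load_sparse_column`, and `round_trip` are hypothetical names, and the model assumes a single string attribute whose value table is rebuilt on load in order of first appearance.

```python
def save_sparse_column(table, indices):
    """Write one string attribute, one entry per row.

    Rows whose internal index is 0 are omitted (modelled here as None).
    """
    return [None if i == 0 else table[i] for i in indices]


def load_sparse_column(written):
    """Read the column back.

    The value table is rebuilt in order of first appearance, and every
    omitted entry is given index 0 in that new table.
    """
    table = []
    indices = []
    for text in written:
        if text is None:
            indices.append(0)
        else:
            if text not in table:
                table.append(text)
            indices.append(table.index(text))
    return [table[i] if table else None for i in indices]


def round_trip(table, indices):
    return load_sparse_column(save_sparse_column(table, indices))


# Without a dummy, "apple" sits at index 0 and is lost on save.
print(round_trip(["apple", "pear"], [0, 1, 0]))
# -> ['pear', 'pear', 'pear']

# With an unused dummy at index 0, every real value is written out.
print(round_trip(["<dummy>", "apple", "pear"], [1, 2, 1]))
# -> ['apple', 'pear', 'apple']
```

In this model the dummy entry works because no real value ever has internal index 0, so nothing is dropped when the sparse rows are written.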

@mitar

mitar commented Feb 21, 2018

@mfeurer This dataset is still available and fails with this error: https://www.openml.org/d/35002

@mfeurer
Collaborator

mfeurer commented Feb 21, 2018

I am aware of that; however, the dataset is flagged as 'in preparation', which basically means that it's not ready for production use. Someone needs to download all QSAR datasets, fix the bug in them, and re-upload them.

@mfeurer
Collaborator

mfeurer commented Mar 14, 2018

Closing this as it is an issue with OpenML and WEKA.

@mfeurer mfeurer closed this as completed Mar 14, 2018
@mfeurer
Collaborator

mfeurer commented Mar 14, 2018

Reopening as WEKA can actually read this file :(

@mfeurer
Collaborator

mfeurer commented Aug 7, 2018

Solved by #76, thanks to @jnothman.

@mfeurer mfeurer closed this as completed Aug 7, 2018