Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: io: support sparse format in loadarff #3535

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

fbenites
Copy link

@fbenites fbenites commented Apr 9, 2014

loadarff also for sparse format with support for meka multi-label assignments

@rgommers
Copy link
Member

@cournape you added the arff reader, so would be great if you could have a look at this enhancement.

@rgommers
Copy link
Member

@fbenites there's a mix of tabs and spaces in your commits: https://travis-ci.org/scipy/scipy/jobs/22599231
Could you remove all the tabs please?

@rgommers
Copy link
Member

There's also lots of pep8 issues: https://travis-ci.org/scipy/scipy/jobs/22599234

You can install pep8 on your own machine; running pep8 scipy in the root scipy dir should pass.

@fbenites
Copy link
Author

see https://travis-ci.org/scipy/scipy/builds/22940084, now seems pep8 conform and tests built successfully

@rgommers
Copy link
Member

Looks much better now. May be good to ask on the scipy-dev mailing list if there's anyone who has a use for this functionality and wants to test / provide feedback.

@WarrenWeckesser
Copy link
Member

Adding a third return value to loadarff is a backwards-incompatible API change. This change needs to be implemented in a way that won't break old code. Perhaps an additional argument that controls whether the new third value is returned? (It might get ugly, but that's part of the cost of API stability.) There are also incompatible API changes to read_header and MetaData. These parts of the code aren't "advertised" (e.g. http://docs.scipy.org/doc/scipy/reference/io.html#module-scipy.io.arff lists only loadarff), but they don't have leading underscores in the names and they are easily discoverable, so they are public. I haven't thought too much about it (I don't use this module "in anger"), but I guess the change to read_header could be handled the same as loadarff, and the new argument to MetaData.__init__ could be given a default value.

@cournape
Copy link
Member

@WarrenWeckesser agreed, returning a 3rd value should be avoided.

Returning a 3rd value conditionally is even worse IMO. We should add a new API to use the new features, with a provision to be more extensible (looking at my original code is humbling :) ).

@fbenites
Copy link
Author

I must admit that this class thing is pretty much a Meka thing, I could write the code for a conditional 3rd argument, since this is a special case. Mulan does handle this problem with a separate xml file for the classes. Further, there is a feature, which I never saw it out there but theoretically.., which gives weight to the instance. I did not implement it. So it does not cover the whole spec as in http://weka.wikispaces.com/ARFF+%28developer+version%29 . I also did not test for sparse and undefined.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 7010687 on fbenites:master into 5b94656 on scipy:master.

@pv
Copy link
Member

pv commented May 4, 2014

How about putting the class data in the MetaData object, e.g., by adding a def classes(self) method to it? This should be fully backward-compatible.

@fbenites
Copy link
Author

fbenites commented May 4, 2014

from the docs:
def Metadata(object):
Small container to keep useful informations on a ARFF dataset.

Knows about attributes names and types.

The classes are the classes for each object. In multi-label the objects can have multiple classes assigned to it, like tags. So for every instance there are attributes and classes. In normal weka the classes are part of the data, I wanted to split up. It is also possible to implement so that data also have the classes in it. So we could pass the number of classes in metadata. Meka uses the first x attributes as classes, MULAN (other multilabel library build over weka) uses the last x as classes. So it should clear that also then, if that important for conformidity. I hoped to use like that and later, if there are many interested in the functionality, change it accordingly as the most users need it.

@pv pv removed the PR label Aug 13, 2014
@pv pv added the needs-work Items that are pending response from the author label Aug 7, 2019
@lucascolley lucascolley changed the title sparse arff ENH: io: support sparse format in loadarff Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement needs-work Items that are pending response from the author scipy.io
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants