Added multilabel reader in LibSVMFile #2073

Merged
merged 3 commits into from Apr 3, 2014

Conversation

Jiaolong
Contributor

Several issues need to be discussed:
(1) Whether to use SGVector or SGSparseVector to store labels. Currently it is SGSparseVector, but it should be no trouble to change it to SGVector.

(2) There is just one unified function for parsing single-label and multi-label files. However, I kept the old API for reading single-label files, so the SGSparseVector labels are converted internally. To move this converter outside, I guess many files that depend on the old API would need to be changed.

@tklein23 @vigsterkr
For (1), if you all agree, I can use the unified type SGVector, since it is more compatible with @tklein23's requirements.

@vigsterkr
For (2), I need your help, since I don't know how to move the converter outside or how many files would need to change as a result. To be honest, I don't fully understand your design. But I think all the necessary code is already in LibSVMFile right now.

@Jiaolong
Contributor Author

@tklein23 @vigsterkr
I have changed the multilabel type to SGVector<float64_t>*.
Travis reports that all is well now.

@Jiaolong
Contributor Author

@vigsterkr

(1) When the user reads a single-label file by calling
get_sparse_matrix(..., float64_t* label), it works fine.

(2) When the user reads a multilabel file by calling
get_sparse_matrix(..., float64_t* label), it reports an error reminding them to use get_sparse_matrix(..., SGVector<float64_t>* label) instead.

(3) When the user calls get_sparse_matrix(..., SGVector<float64_t>* label), reading single-label and multilabel files both work.
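The three cases above can be illustrated with a minimal, self-contained sketch. It is not the Shogun implementation: `LabelRows`, `read_multilabels`, and `read_single_labels` are hypothetical names, `std::vector` stands in for `SGVector<float64_t>`, and an exception stands in for whatever error reporting the real code uses. Treating a sample with no label as an error in the single-label path is also an assumption of this sketch.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the labels parsed from a file:
// one inner vector per sample, holding zero or more labels.
using LabelRows = std::vector<std::vector<double>>;

// Multilabel-style access: every row is returned unchanged,
// so single-label and multilabel files both work (case 3).
LabelRows read_multilabels(const LabelRows& parsed) {
    return parsed;
}

// Single-label-style access: flattens to one label per sample (case 1)
// and fails if any sample does not carry exactly one label (case 2).
std::vector<double> read_single_labels(const LabelRows& parsed) {
    std::vector<double> flat;
    flat.reserve(parsed.size());
    for (const std::vector<double>& row : parsed) {
        if (row.size() != 1) {
            // Assumption: an exception models the "use the multilabel API" error.
            throw std::runtime_error(
                "sample does not have exactly one label; use the multilabel reader");
        }
        flat.push_back(row.front());
    }
    return flat;
}
```

Keeping the single-label path as a thin conversion on top of the multilabel result is one way to keep a single parsing routine while leaving the old call sites untouched.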

get_sparse_matrix(matrix, num_feat, num_vec, labels, false); \
SGVector<float64_t>* mat_label; \
int32_t num_classes; \
get_sparse_matrix(mat_feat, num_feat, num_vec, mat_label, num_classes, false); \
}
Contributor

I suggest renaming "mat_label" to "labels" to align the naming in both functions. Even better would be "multilabel", so the name tells exactly what it will contain.

Contributor Author

Yes, I will change the naming.

@tklein23
Contributor

All-in-all it looks really cool. It's almost ready to merge.

I'm "just" missing a simple unit test that reads an example file and compares the resulting labels in a few cases:

  • a line with "empty" multilabel, i.e. no label
  • a line with a single label
  • a line with more than one label (sorting shouldn't be changed, duplicates are allowed)

Please run your test(s) in valgrind and check if memory is clean.

@Jiaolong
Contributor Author

@tklein23 Thanks for your comments. Yes, the unit test is quite necessary. I will update the code soon.

@Jiaolong
Contributor Author

@tklein23 I have updated the LibSVMFile:
(1) Multilabel writer is added
(2) Unit test is updated to cover the suggested cases. The example file is like:

2:5 4:10 6:4 8:1 10:8 12:7 14:2 16:0 18:6
0,0.1,0.2 2:2 4:9 6:9 8:8
0,0.1,0 2:6 4:5 6:7 8:3 10:6 12:3 14:4 16:10 18:8
0 2:4 4:10 6:4 8:10 10:7 12:8 14:4 16:6
1 2:2 4:0 6:2 8:0 10:4
0 2:1
1 2:6 4:5 6:5 8:4 10:0 12:5
-1 2:5
-1 2:2 4:2 6:2 8:10 10:1 12:6 14:7 16:10
0 2:9 4:1 6:10 8:5 10:9 12:0 14:2 16:4 18:9 20:8 22:2
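To make the file format above concrete, here is a rough sketch of how one such line could be split into labels and features. It is not the parser from this pull request: `ParsedLine` and `parse_multilabel_line` are hypothetical names, and standard containers are used instead of Shogun types. The sketch assumes that a first token without a colon is a comma-separated label list, that every other token has the form `index:value`, and that a line with no label token is an "empty" multilabel.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type for one line of the file; not a Shogun type.
struct ParsedLine {
    std::vector<double> labels;                    // order and duplicates preserved
    std::vector<std::pair<int, double>> features;  // (index, value) pairs
};

// Splits one line into its label list and its sparse features.
ParsedLine parse_multilabel_line(const std::string& line) {
    ParsedLine out;
    std::istringstream tokens(line);
    std::string tok;
    bool first = true;
    // operator>> skips leading whitespace, so a leading space is harmless.
    while (tokens >> tok) {
        const std::size_t colon = tok.find(':');
        if (colon != std::string::npos) {
            // Feature token of the form index:value.
            out.features.emplace_back(std::stoi(tok.substr(0, colon)),
                                      std::stod(tok.substr(colon + 1)));
        } else if (first) {
            // Label token: comma-separated values, kept in file order.
            std::istringstream label_stream(tok);
            std::string item;
            while (std::getline(label_stream, item, ',')) {
                if (!item.empty()) {
                    out.labels.push_back(std::stod(item));
                }
            }
        } else {
            throw std::invalid_argument("unexpected token without ':' : " + tok);
        }
        first = false;
    }
    return out;
}
```

With this sketch, the first line of the example file yields an empty label list, `0,0.1,0` yields three labels with the duplicate kept, and `-1 2:5` yields a single label.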

SG_SPRINT("Unable to open file: %s\n", fname);
return;
}
fclose(pfile);
Contributor

an additional space here.

@hushell
Contributor

hushell commented Mar 31, 2014

I just went through the code and don't understand why you changed float64_t* to SGVector<float64_t>* in some places.

@Jiaolong
Contributor Author

@hushell Thanks for your comments. There are a lot of details I didn't check carefully. Regarding your main concern, float64_t* versus SGVector<float64_t>*, I have replied in the comments above. I will update the code soon.

@tklein23
Contributor

On Monday 31 March 2014 05:13:02 Jiaolong wrote:

@tklein23 I have updated the LibSVMFile:
(1) Multilabel writer is added
(2) Unit test is updated to cover the suggested cases. The example file is like:
2:5 4:10 6:4 8:1 10:8 12:7 14:2 16:0 18:6
0,0.1,0.2 2:2 4:9 6:9 8:8
0,0.1,0 2:6 4:5 6:7 8:3 10:6 12:3 14:4 16:10 18:8
0 2:4 4:10 6:4 8:10 10:7 12:8 14:4 16:6
1 2:2 4:0 6:2 8:0 10:4
0 2:1
1 2:6 4:5 6:5 8:4 10:0 12:5
-1 2:5
-1 2:2 4:2 6:2 8:10 10:1 12:6 14:7 16:10
0 2:9 4:1 6:10 8:5 10:9 12:0 14:2 16:4 18:9 20:8 22:2

"Empty" multilabels are valid as well: just a line that starts with a
space, followed by the features. Can you do that?

@Jiaolong
Contributor Author

@tklein23 Note that in the above example 2:5 4:10 6:4 8:1 10:8 12:7 14:2 16:0 18:6 is a sample with "empty" multilabels, but without a space at the head. Is it OK?

@tklein23
Contributor

On Monday 31 March 2014 14:04:47 Jiaolong wrote:

@tklein23 Note that in the above example 2:5 4:10 6:4 8:1 10:8 12:7 14:2 16:0 18:6 is a sample with "empty" multilabels, but without a space at
the head. Is it OK?

I missed that. That's okay. (I guess the space is important to let
the parser distinguish between labels and features.)

@Jiaolong
Contributor Author

@tklein23 With a space at the head, I think the line can be parsed by the current code as well. Anyway, I will add this case to the unit test.

@tklein23
Contributor

If you ran some tests, let me know if it worked!

@Jiaolong
Contributor Author

@tklein23 There is an example in libshogun, io_libsvm_multilabel, which reads and displays multilabels from the yeast dataset. The output is pasted below:

......
vector=[0,1,11,12]
vector=[1,2,6,7,11,12]
vector=[0,1,5,6,10,11]
Number of the samples: 917
Dimention of the feature: 103
Number of classes: 14

@Jiaolong
Contributor Author

Jiaolong commented Apr 1, 2014

@tklein23 I got a Travis error:
The command "git fetch origin +refs/pull/2073/merge:" failed
https://travis-ci.org/shogun-toolbox/shogun/jobs/22037649

Could you help me check it? I only fixed formatting issues, and all previous versions passed the Travis tests.

@Jiaolong
Contributor Author

Jiaolong commented Apr 1, 2014

@tklein23 For the memory leak issue, I pasted the valgrind output in a gist:
https://gist.github.com/Jiaolong/9925271
Thank you very much!

*/
#include <shogun/io/LibSVMFile.h>
#include <shogun/lib/SGVector.h>
#include <shogun/lib/SGVector.h>
Contributor

Including SGVector.h more than once? ;)

@Jiaolong Jiaolong closed this Apr 2, 2014
@Jiaolong Jiaolong deleted the io_multilabel branch April 2, 2014 09:02
@Jiaolong Jiaolong reopened this Apr 2, 2014
@Jiaolong
Contributor Author

Jiaolong commented Apr 2, 2014

@tklein23 I have updated the code with minor changes. The tests pass now.

tklein23 added a commit that referenced this pull request Apr 3, 2014
Added multilabel reader in LibSVMFile - thanks to Jiaolong
@tklein23 tklein23 merged commit a9c4864 into shogun-toolbox:develop Apr 3, 2014