[MRG+2]: Svmlight chunk loader #935

Merged
merged 2 commits into scikit-learn:master on Jun 16, 2017

Conversation

@ogrisel (Member) commented Jul 5, 2012

Hi all,

I am working on an incremental data loader for the svmlight format. It reads chunks of a big file (one not expected to fit in memory as a whole) into smaller CSR matrices, which are dumped as a set of memmappable files in folders and later reconcatenated into a single, large, memmapped CSR matrix.

The goal is to be able to load big svmlight files (multiple tens of GB) into an efficient memmapped CSR matrix in an out-of-core manner (possibly using several workers in parallel).

The first step is to augment the existing parser to be able to load chunks of an svmlight file using seeks to byte offsets.

Edit: the scope of this PR has changed. It is now just about loading a chunk (given by byte offset and length) of a large svmlight file as a CSR matrix that fits in RAM. This makes it possible to efficiently load and parse a large svmlight file with workers on PySpark or Dask distributed, for instance.
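
For illustration, here is a minimal sketch of the chunked loading this enables (offset and length are the parameters added by this PR; the file path and n_features value are placeholders):

    from sklearn.datasets import load_svmlight_file

    # Parse only the bytes in [offset, offset + length) of a big file;
    # the parser realigns itself on the next complete line after `offset`.
    X, y = load_svmlight_file("/data/big.svm", n_features=2 ** 24,
                              offset=2 ** 20, length=2 ** 19)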

@ogrisel (Member Author) commented Jul 6, 2012

The issue probably comes from the fact that the serialization uses %f for the values, which truncates to 6 decimal places by default whatever the dtype of the input. For np.float64 we should probably use something like %0.16f instead. WDYT?

We should check that libsvm accepts such long values as input, though.
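
A quick illustration of the truncation (plain Python string formatting):

>>> "%f" % (1 / 3.)
'0.333333'
>>> "%.16f" % (1 / 3.)
'0.3333333333333333'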

@larsmans (Member) commented Jul 6, 2012

I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like to load_svmlight_file multiple times?

(Also, how large are your matrices? I was under the impression that you can never store more than 2**31 elements; or can you actually store that many rows and columns?)

@ogrisel (Member Author) commented Jul 6, 2012

I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like to load_svmlight_file multiple times?

I would like to be able to os.stat a big svmlight file (several GB) and divide it into st_size / n_chunks byte ranges that each fit in memory, to be parsed by n_workers in parallel, without a first sequential scan to count the lines and find the byte offsets. Workers would then dump intermediate CSR data structures on the filesystem, and a second pass would aggregate them all into a single memmapped CSR matrix, with n_features being the max of the n_features observed on the individual chunks.
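
A minimal sketch of the splitting step (chunk_offsets is a hypothetical helper, not part of this PR):

    import os

    def chunk_offsets(path, n_chunks):
        # Divide the file into n_chunks contiguous (offset, length) byte
        # ranges; the parser takes care of realigning each range on the
        # next line start.
        size = os.stat(path).st_size
        chunk = size // n_chunks
        return [(i * chunk, chunk if i < n_chunks - 1 else size - i * chunk)
                for i in range(n_chunks)]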

(Also, how large are your matrices? I was under the impression that you can never store more than 2**31 elements, or can you store actually that number of rows and columns?)

I would like to be able to deal with matrices of the scale of the PASCAL large scale challenge:

$ time wc ~/Desktop/alpha/alpha.txt 
  500000 250500000 3590403790 /Users/oliviergrisel/Desktop/alpha/alpha.txt

real    0m27.502s
user    0m24.754s
sys 0m2.207s

This indeed won't fit in a single CSR matrix:

>>> 3590403790. / 2 ** 31
1.6719120508059859

That's unfortunate but I guess I can split it into 10 CSR chunks or so and then use the partial_fit method of the SGDClassifier class (for instance) to deal with this limitation.
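
Roughly (a sketch; csr_chunks stands for the ~10 chunks mentioned above and the class list is a placeholder):

    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    for X_chunk, y_chunk in csr_chunks:
        # partial_fit requires the full list of classes on the first call
        clf.partial_fit(X_chunk, y_chunk, classes=[-1, 1])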

Would be great to have support for np.int64 indices in scipy.sparse though...

@ogrisel (Member Author) commented Jul 6, 2012

Splitting lines is CPU bound (at least wc -l is). Hence letting several workers seek forward on a multicore machine should bring a speedup.

@ogrisel (Member Author) commented Jul 6, 2012

Actually, wc -l is IO bound on my laptop (with an SSD), but wc (whose parsing is more similar to our svmlight parser) is CPU bound.

@mblondel (Member) commented Jul 8, 2012

Interesting / useful contrib!

It could be useful to have a way to estimate the actual size once a file chunk has been converted to CSR format.
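
Measuring it after loading is at least straightforward; a small helper (not part of this PR):

    def csr_nbytes(X):
        # Bytes held by the three arrays backing a scipy.sparse CSR matrix
        return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes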

Minor remark: for me, it would be more natural to use offset and length parameters rather than offset_min and offset_max.

@ogrisel (Member Author) commented Jul 8, 2012

Then length would be equivalent to offset_max - offset_min? Why not. I might do the change later today.

@ogrisel (Member Author) commented Jul 31, 2013

OK, I rebased, added more tests, and fixed a bug. I also switched to the API @mblondel suggested (offset + length). I think this is ready to be merged to master. WDYT?

@ghost ghost assigned ogrisel Jul 31, 2013


     def load_svmlight_files(files, n_features=None, dtype=np.float64,
    -                        multilabel=False, zero_based="auto", query_id=False):
    +                        multilabel=False, zero_based="auto", query_id=False,
    +                        offset=0, length=-1):

@mblondel (Member) commented Aug 1, 2013

I can understand that these options are useful for load_svmlight_file but are they for load_svmlight_files?

@ogrisel (Member Author) commented Aug 2, 2013

The problem is that load_svmlight_file is implemented by calling the load_svmlight_files function. Maybe I should just not document the parameters in the load_svmlight_files function.

@lesteve (Member) commented Feb 16, 2017

Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?

@mblondel (Member) commented Aug 1, 2013

I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes a n_chunks parameter and produces (X, y) pairs.
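
Such a generator could be a thin wrapper around the new offset/length parameters; a hypothetical sketch (load_svmlight_file_chunks does not exist in the library, and n_features is assumed known up front):

    import os
    from sklearn.datasets import load_svmlight_file

    def load_svmlight_file_chunks(path, n_chunks, n_features):
        # Yield one (X, y) pair per byte-range chunk of the file.
        size = os.stat(path).st_size
        chunk = size // n_chunks
        for i in range(n_chunks):
            length = chunk if i < n_chunks - 1 else -1  # -1: read to EOF
            yield load_svmlight_file(path, n_features=n_features,
                                     offset=i * chunk, length=length)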

@mblondel (Member) commented Aug 1, 2013

You assume that n_features is user-given, right? You might want to raise an exception if that's not the case.

@ogrisel (Member Author) commented Aug 2, 2013

I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes a n_chunks parameter and produces (X, y) pairs.

I was thinking of writing a parallel, out-of-core conversion tool that would generate joblib dumps of chunked data in a folder instead. Do you think both would be useful?

@ogrisel (Member Author) commented Aug 2, 2013

You assume that n_features is user-given, right? You might want to raise an exception if that's not the case.

For my conversion tool I want to do a single parsing pass over the data, record the n_features detected in each chunk, take the max, and re-save the datasets iteratively by padding them in a non-parsing, hence fast, second pass over the previously extracted chunks.
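
The padding step of the second pass is cheap because a CSR matrix can be rebuilt with a wider shape while reusing its arrays; a sketch (chunks stands for the (X, y) pairs produced by the first pass):

    import scipy.sparse as sp

    n_features = max(X.shape[1] for X, y in chunks)
    # Widen each chunk to the global n_features without touching the data
    padded = [(sp.csr_matrix((X.data, X.indices, X.indptr),
                             shape=(X.shape[0], n_features)), y)
              for X, y in chunks]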

@mblondel (Member) commented Aug 2, 2013

Inferring n_features seems a bit expensive.

We could reproject the data while it is loaded from the svmlight file using a FeatureHasher. This way, n_features can be safely fixed.
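
For instance (a sketch; pairs_per_sample is a hypothetical iterable of parsed (feature_id, value) pairs for each line):

    from sklearn.feature_extraction import FeatureHasher

    # Fix the dimensionality up front by hashing feature ids into buckets
    hasher = FeatureHasher(n_features=2 ** 20, input_type="pair")
    X = hasher.transform(pairs_per_sample)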

Another thing I would like to check is whether a crude upper bound on n_features would work. The training time of solvers like SGDClassifier or LinearSVC with dual=True is not affected by the number of features (the training time of CD-based solvers is).

@ogrisel (Member Author) commented Aug 2, 2013

My goal is to convert the svmlight file into a contiguous data structure saved on the hard drive once, and never have to parse the svmlight file again. It's a dataset format conversion tool. I don't want to do learning on the fly in my case.

@GaelVaroquaux (Member) commented Aug 2, 2013

I simply use joblib for these purposes :)

@ogrisel (Member Author) commented Aug 3, 2013

Yes, but in this case I need to do it out of core, as the svmlight file is 11GB and the dense representation is twice as big (without compression). And I want to detect the number of features on the fly, so I need a second, non-parsing pass to pad the previously extracted arrays with zero features.

@mblondel (Member) commented Aug 3, 2013

The svmlight format is very useful for preparing the data in one programming language and learning the model in another. It's really easy to write a script for outputting data to this format.

@ogrisel (Member Author) commented Aug 20, 2013

I think the chunk loading support can be merged as it is. It's already useful for advanced users. I am not sure we want to implement a generic out-of-core converter in the library. Maybe it would be better implemented as an example in a benchmark script based on the mnist8m dataset. I will do that in another PR later.

@larsmans larsmans force-pushed the scikit-learn:master branch from 58a55ad to 4b82379 Aug 25, 2014
@MechCoder MechCoder force-pushed the scikit-learn:master branch from 6deaea0 to 3f49cee Nov 3, 2014
@amueller amueller modified the milestone: 0.15 Sep 29, 2016
@raghavrv raghavrv added this to the 0.19 milestone Nov 4, 2016
@raghavrv (Member) commented Nov 4, 2016

Could you rebase?

@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from 8d02541 to 8edcc30 Feb 14, 2017
@lesteve lesteve changed the title [MRG]: Svmlight chunk loader [MRG+1]: Svmlight chunk loader Feb 21, 2017
@jnothman (Member) left a comment

I think you should explicitly test boundary cases:

  • f.seek(offset) such that f.read(1) == '\n'
  • f.seek(length) such that f.read(1) == '\n'
  • f.seek(length - 1) such that f.read(1) == '\n'

    discarding the following bytes up until the next new line
    character.
    length: integer, optional, default -1

@jnothman (Member) commented Feb 21, 2017

space before colon

@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from f943c65 to b09c135 Feb 21, 2017
@ogrisel (Member Author) commented Feb 21, 2017

@jnothman thanks for the review. I added a test that checks all possible byte offsets on a small dataset (along with query ids). The exhaustive test runs in 500ms. This should cover all the boundary cases you mentioned.
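
In spirit, the test does something like this (a simplified sketch, not the actual test code):

    import os
    import scipy.sparse as sp
    from sklearn.datasets import load_svmlight_file

    X_full, y_full = load_svmlight_file(path, n_features=n_features)
    size = os.stat(path).st_size
    for mark in range(1, size):
        # Split the file at every byte position and check that the two
        # partial loads concatenate back to the full dataset.
        X0, y0 = load_svmlight_file(path, n_features=n_features,
                                    offset=0, length=mark)
        X1, y1 = load_svmlight_file(path, n_features=n_features,
                                    offset=mark, length=-1)
        assert (sp.vstack([X0, X1]) != X_full).nnz == 0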

@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from b09c135 to 577f51f Feb 21, 2017
@ogrisel (Member Author) commented Feb 21, 2017

@jnothman I fixed (worked around) the broken tests with old versions of scipy.

@jnothman (Member) commented Jun 14, 2017

Mind adding a what's new before merge, @ogrisel?

@jnothman jnothman changed the title [MRG+1]: Svmlight chunk loader [MRG+2]: Svmlight chunk loader Jun 14, 2017
@scikit-learn scikit-learn deleted a comment from codecov-io Jun 14, 2017
@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from 577f51f to 3d21127 Jun 14, 2017
@ogrisel (Member Author) commented Jun 14, 2017

@jnothman done. I also rebased on top of current master to insert the entry at the right location. Let's see if CI is still green.

@scikit-learn scikit-learn deleted a comment from codecov bot Jun 16, 2017
@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from 3d21127 to 577f51f Jun 16, 2017
@scikit-learn scikit-learn deleted a comment from codecov bot Jun 16, 2017
@ogrisel ogrisel force-pushed the ogrisel:svmlight-memmaped-loader branch from 577f51f to 70a04c6 Jun 16, 2017
@ogrisel ogrisel merged commit a39c8ab into scikit-learn:master Jun 16, 2017
5 checks passed:
    ci/circleci: Your tests passed on CircleCI!
    codecov/patch: 100% of diff hit (target 96.29%)
    codecov/project: 96.29% (+<.01%) compared to 7238b46
    continuous-integration/appveyor/pr: AppVeyor build succeeded
    continuous-integration/travis-ci/pr: The Travis CI build passed
@ogrisel ogrisel deleted the ogrisel:svmlight-memmaped-loader branch Jun 16, 2017
@ogrisel (Member Author) commented Jun 16, 2017

Merged! The scipy version check in the test was too lax. I updated it.

@jnothman (Member) commented Jun 17, 2017

I'm tempted to say: Thanks for your patience, @ogrisel ;)

@jnothman (Member) commented Jun 17, 2017

It's pretty exciting to close an issue #<1000

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017