[MRG+2]: Svmlight chunk loader #935

Merged
merged 2 commits into scikit-learn:master from ogrisel:svmlight-memmaped-loader on Jun 16, 2017

9 participants
@ogrisel
Member

ogrisel commented Jul 5, 2012

Hi all,

I am working on an incremental data loader for the svmlight format: it reads chunks of a big file that is not expected to fit entirely in memory, parses each chunk into a smaller CSR matrix, and dumps those as a set of memmapable files in a folder, to be later concatenated into a single, large, memmapped CSR matrix.

The goal is to be able to load big svmlight files (multiple tens of GB) into an efficient memmapped CSR matrix in an out-of-core manner (possibly using several workers in parallel).

The first step is to augment the existing parser to be able to load chunks of a svmlight file using seeks to byte offsets.

Edit: the scope of this PR has changed. It is now just about loading a chunk (given by byte offset and length) of a large svmlight file as a CSR matrix that fits in RAM. This would make it possible to efficiently load and parse a large svmlight file with workers on PySpark or dask.distributed, for instance.
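For illustration, a minimal sketch of what the resulting offset/length API allows (the file name, feature count and byte window are made up):

from sklearn.datasets import load_svmlight_file

# Load roughly the rows whose lines begin inside the byte window
# [10000000, 20000000) of a hypothetical large file; n_features must be
# given explicitly so that every chunk ends up with the same width.
X, y = load_svmlight_file("big_dataset.svmlight", n_features=2 ** 20,
                          offset=10000000, length=10000000)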

@ogrisel

Member

ogrisel commented Jul 6, 2012

The issue probably comes from the fact that the serialization uses %f for the values, which truncates to 6 decimal places by default regardless of the dtype of the input. For np.float64 we should probably use something like %0.16f instead. WDYT?

We should check that libsvm can accept such long values as input format though.
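For reference, a quick illustration of the truncation caused by the default %f formatting (not specific to the svmlight writer):

>>> x = 1.2345678901234567          # a float64 with ~16 significant digits
>>> "%f" % x                        # default %f keeps only 6 decimal places
'1.234568'
>>> "%0.16f" % x                    # 16 decimal places preserve this value
'1.2345678901234567'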

@larsmans

Member

larsmans commented Jul 6, 2012

I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like to load_svmlight_file multiple times?

(Also, how large are your matrices? I was under the impression that you can never store more than 2**31 elements, or can you actually store that many rows and columns?)

@ogrisel

Member

ogrisel commented Jul 6, 2012

I've yet to read this through, but wouldn't it be easier to add a single parameter indicating the number of lines to read, then pass the same file-like to load_svmlight_file multiple times?

I would like to be able to os.stat a big svmlight file (several GB) and divide it into n_chunks byte ranges of st_size / n_chunks bytes that would each fit in memory, to be parsed by n_workers in parallel, without having to do a first sequential scan to count the lines and find the byte offsets. Workers would then dump intermediate CSR data structures on the filesystem, and a second pass would just aggregate them all into a single memmapped CSR with n_features being the max of the observed n_features on the individual chunks.

(Also, how large are your matrices? I was under the impression that you can never store more than 2**31 elements, or can you store actually that number of rows and columns?)

I would like to be able to deal with matrices of the scale of the PASCAL large scale challenge:

$ time wc ~/Desktop/alpha/alpha.txt 
  500000 250500000 3590403790 /Users/oliviergrisel/Desktop/alpha/alpha.txt

real    0m27.502s
user    0m24.754s
sys 0m2.207s

This indeed won't fit in a single CSR matrix:

>>> 3590403790. / 2 ** 31
1.6719120508059859

That's unfortunate but I guess I can split it into 10 CSR chunks or so and then use the partial_fit method of the SGDClassifier class (for instance) to deal with this limitation.

Would be great to have support for np.int64 indices in scipy.sparse though...
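A rough sketch of that kind of out-of-core loop, using the offset/length parameters this PR eventually adds (the file path, feature count and label set are made up for illustration):

import os
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier

path = "alpha.txt"            # hypothetical large svmlight file
n_features = 500              # must be fixed up front when loading chunks
classes = np.array([-1, 1])   # hypothetical label set
n_chunks = 10
chunk_size = os.stat(path).st_size // n_chunks + 1

clf = SGDClassifier()
for i in range(n_chunks):
    # Each call parses only the lines that begin inside its byte range,
    # so consecutive chunks neither overlap nor miss rows.
    X, y = load_svmlight_file(path, n_features=n_features,
                              offset=i * chunk_size, length=chunk_size)
    if X.shape[0]:
        # partial_fit needs the full list of classes on the first call
        clf.partial_fit(X, y, classes=classes)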

@ogrisel

Member

ogrisel commented Jul 6, 2012

Splitting lines is CPU bound (at least wc -l is). Hence being able to seek several workers forward on a multicore machine should bring a speed-up.

@ogrisel

Member

ogrisel commented Jul 6, 2012

Actually, wc -l is IO bound on my laptop (with SSD), but wc (whose parsing is more similar to our svmlight parser) is CPU bound.

@mblondel

Member

mblondel commented Jul 8, 2012

Interesting / useful contrib!

It could be useful to have a way to estimate the actual size once a file chunk has been converted to CSR format.

Minor remark: for me, it would be more natural to use offset and length parameters rather than offset_min and offset_max.

@ogrisel

Member

ogrisel commented Jul 8, 2012

Then length would be equivalent to offset_max - offset_min? Why not. I might do the change later today.

@ogrisel

Member

ogrisel commented Jul 31, 2013

OK, I rebased, added more tests and fixed a bug. I also switched to @mblondel's suggested API (offset + length). I think this is ready for merge to master. WDYT?

@ghost ghost assigned ogrisel Jul 31, 2013

def load_svmlight_files(files, n_features=None, dtype=np.float64,
- multilabel=False, zero_based="auto", query_id=False):
+ multilabel=False, zero_based="auto", query_id=False,
+ offset=0, length=-1):

@mblondel

mblondel Aug 1, 2013

Member

I can understand that these options are useful for load_svmlight_file but are they for load_svmlight_files?

@ogrisel

ogrisel Aug 2, 2013

Member

The problem is that load_svmlight_file is implemented by calling the load_svmlight_files function. Maybe I should just not document the parameters in the load_svmlight_files function.

@lesteve

lesteve Feb 16, 2017

Member

Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?

@mblondel

Member

mblondel commented Aug 1, 2013

I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes a n_chunks parameter and produces (X, y) pairs.
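Such a generator does not exist in the library; a hypothetical version could be a thin wrapper around the offset/length parameters added here, for example:

import os
from sklearn.datasets import load_svmlight_file

def load_svmlight_file_chunks(path, n_chunks, n_features, **kwargs):
    """Hypothetical helper: yield one (X, y) pair per byte-range chunk."""
    chunk_size = os.stat(path).st_size // n_chunks + 1
    for i in range(n_chunks):
        yield load_svmlight_file(path, n_features=n_features,
                                 offset=i * chunk_size, length=chunk_size,
                                 **kwargs)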

@mblondel

Member

mblondel commented Aug 1, 2013

You assume that n_features is user-given, right? You might want to raise an exception if that's not the case.

@ogrisel

Member

ogrisel commented Aug 2, 2013

I think it would be useful to add a generator function, say load_svmlight_file_chunks, that takes a n_chunks parameter and produces (X, y) pairs.

I was thinking of writing a parallel, out-of-core conversion tool that would generate joblib dumps of chunked data in a folder instead. Do you think both would be useful?

@ogrisel

Member

ogrisel commented Aug 2, 2013

You assume that n_features is user-given, right? You might want to raise an exception if that's not the case.

For my conversion tool I want to do a single parsing pass over the data, record the n_features detected in each chunk, take the max, and re-save the chunks iteratively, padding them in a non-parsing (hence fast) second pass over the previously extracted chunks.
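The padding step of that second pass amounts to rebuilding each chunk's CSR matrix with a wider shape, which does not touch the parsed data. A minimal sketch (the helper name is made up):

import scipy.sparse as sp

def pad_csr_columns(X, n_features):
    """Return CSR matrix X widened to n_features columns.

    Only the shape changes; data, indices and indptr are reused as-is,
    so no re-parsing of the original svmlight text is needed.
    """
    assert n_features >= X.shape[1]
    return sp.csr_matrix((X.data, X.indices, X.indptr),
                         shape=(X.shape[0], n_features))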

@mblondel

Member

mblondel commented Aug 2, 2013

Inferring n_features seems a bit expensive.

We could reproject the data while it is loaded from the svmlight file using a FeatureHasher. This way, n_features can be safely fixed.

Another thing I would like to check is whether a crude upper bound on n_features would work. The training time of solvers like SGDClassifier or LinearSVC with dual=True is not affected by the number of features (the training time of CD-based solvers is).
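A hedged sketch of that re-hashing idea, applied to an already-loaded chunk rather than during parsing (the file name and hash width are made up):

from sklearn.datasets import load_svmlight_file
from sklearn.feature_extraction import FeatureHasher

# The output width is fixed once and for all, independently of the chunk.
hasher = FeatureHasher(n_features=2 ** 18, input_type="pair")

X_chunk, y_chunk = load_svmlight_file("chunk.svmlight")  # hypothetical file
rows = ([("f%d" % j, v) for j, v in zip(X_chunk[i].indices, X_chunk[i].data)]
        for i in range(X_chunk.shape[0]))
X_hashed = hasher.transform(rows)  # always shape (n_rows, 2 ** 18)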

@ogrisel

Member

ogrisel commented Aug 2, 2013

My goal is to convert the svmlight file into a contiguous data structure saved on the hard drive once, and never have to parse the svmlight file again. It's a dataset format conversion tool. I don't want to do learning on the fly in my case.

@GaelVaroquaux

Member

GaelVaroquaux commented Aug 2, 2013

I simply use joblib for these purposes :)

@ogrisel

Member

ogrisel commented Aug 3, 2013

Yes, but in this case I need to do it out of core: the svmlight file is 11GB and the dense representation is twice as big (without compression). I also want to detect the number of features on the fly, so I need a second, non-parsing pass to pad the previously extracted arrays with zero features.

@mblondel

Member

mblondel commented Aug 3, 2013

The svmlight format is very useful for preparing the data in one programming language and learning the model in another. It's really easy to write a script for outputting data to this format.

@ogrisel

Member

ogrisel commented Aug 20, 2013

I think the chunk loading support can be merged as it is. It's already useful for advanced users. I am not sure we want to implement a generic out-of-core converter in the library. Maybe it would be better implemented as an example in a benchmark script based on the mnist8m dataset. I will do that in another PR later.

@amueller amueller modified the milestone: 0.15 Sep 29, 2016

@raghavrv raghavrv added this to the 0.19 milestone Nov 4, 2016

@raghavrv raghavrv removed the Needs Review label Nov 4, 2016

@raghavrv

Member

raghavrv commented Nov 4, 2016

Could you rebase?

@ogrisel

Member

ogrisel commented Feb 14, 2017

@raghavrv @rth rebased. Let's see how it goes.

@rth

Member

rth commented Feb 15, 2017

That is a nice feature.

As a side note though, writing / reading sparse CSR arrays with the svmlight parser in scikit-learn is still 2-3 orders of magnitude slower than using joblib dump / load as far as I could tell (the benchmarks below might not be very accurate):

import numpy as np
import scipy.sparse
from sklearn.externals import joblib
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# generate a ~100 MB dataset
X = scipy.sparse.rand(100000, 10000, format='csr')
y = np.zeros(100000)

# writing to / reading from tmpfs to discount the disk I/O bottleneck
# disk I/O speeds are ~100 MB/s (HDD), ~500 MB/s (SSD)
# joblib pickling is I/O bound, svmlight parser CPU bound? 
%time joblib.dump(X, '/dev/shm/X.pkl')
%time dump_svmlight_file(X, y, '/dev/shm/X_y.libsvm')

CPU times: user 15 ms, sys: 44 ms, total: 59 ms Wall time: 56.7 ms
CPU times: user 32.7 s, sys: 238 ms, total: 32.9 s Wall time: 32.8 s

%time joblib.load('/dev/shm/X.pkl')
%time load_svmlight_file('/dev/shm/X_y.libsvm')

CPU times: user 22 ms, sys: 59 ms, total: 81 ms Wall time: 78.6 ms
CPU times: user 9.87 s, sys: 103 ms, total: 9.97 s Wall time: 9.97 s

# memmapping CSR files saved to disk works, but it's not that efficient
# 3% of total load time to load 1 row out of 100000
# 20% of total load time to load 1000 random rows out of 100000
%time joblib.load('/dev/shm/X.pkl', mmap_mode='r')[89, :]
mask = np.random.randint(100000, size=(1000))
%time joblib.load('/dev/shm/X.pkl', mmap_mode='r')[mask, :]

CPU times: user 1 ms, sys: 1e+03 µs, total: 2 ms Wall time: 1.55 ms
CPU times: user 6 ms, sys: 8 ms, total: 14 ms Wall time: 11.8 ms

so for high performance use-cases it still might be worth converting the svmlight file to a CSR saved on disk with joblib, and then loading it in mmap mode.

Update: OK, just realized after reading the above conversation in more detail that the goal was to convert the svmlight file to CSR on disk all along.

@ogrisel

Member

ogrisel commented Feb 15, 2017

Actually the scope of this PR is just to load a chunk of a large svmlight file, which can be used for complex out-of-core distributed computation on a shared-filesystem cluster, for instance, possibly with dask or PySpark.

Reading and writing svmlight files is unlikely to ever be faster than binary formats such as joblib pickles with numpy array support, because pickling numpy arrays can just read or write the raw data bytes directly between disk and memory.

Still, the svmlight format is a de facto exchange format for machine learning datasets, and adding support for loading chunks does not add much complexity to the existing code.

@lesteve

I tried my best to review, without knowing much about svmlight ...

sklearn/datasets/svmlight_format.py
@@ -77,6 +78,7 @@ def load_svmlight_file(f, n_features=None, dtype=np.float64,
bigger sliced dataset: each subset might not have examples of
every feature, hence the inferred shape might vary from one
slice to another.
+ n_features is required is offset or length are passed a default values.

@lesteve

lesteve Feb 16, 2017

Member

required if
are passed a default value (i.e. no s)
I am not sure I understand what you mean by "are passed a default value" to be honest.

sklearn/datasets/svmlight_format.py
@@ -98,6 +103,15 @@ def load_svmlight_file(f, n_features=None, dtype=np.float64,
Data type of dataset to be loaded. This will be the data type of the
output numpy arrays ``X`` and ``y``.
+ offset : integer, optional (default is 0)

@lesteve

lesteve Feb 16, 2017

Member

Nitpick: I don't know the official way of mentioning the default, but it looks like it uses:

..., default default_value

rather than

... (default is default_value)
+ if (zero_based is False or
+ zero_based == "auto" and all(len(tmp[1]) and np.min(tmp[1]) > 0
+ for tmp in r)):
+ for _, indices, _, _, _ in r:
indices -= 1

@lesteve

lesteve Feb 16, 2017

Member

This works because indices is an array, I was really confused for a moment ...

+ X_concat = sp.vstack([X_0, X_1, X_2])
+ assert_array_equal(y, y_concat)
+ assert_array_almost_equal(X.toarray(), X_concat.toarray(), 5)

@lesteve

lesteve Feb 16, 2017

Member

Just curious, why does X_concat not exactly match X?

@ogrisel

ogrisel Feb 20, 2017

Member

This is indeed required. I believe the string representation of floating point values involves some rounding.

sklearn/datasets/_svmlight_format.pyx
@@ -52,6 +53,12 @@ def _load_svmlight_file(f, dtype, bint multilabel, bint zero_based,
else:
labels = array.array("d")
+ if offset > 0:
+ f.seek(offset)
+ # drop the previous line that might be truncated and is to be fetched

@lesteve

lesteve Feb 16, 2017

Member

Is this not dropping the current line? Also I don't really understand the "is to be fetched by another call"

@ogrisel

Member

ogrisel commented Feb 20, 2017

Just curious, does it actually make sense to have the same offset and length when calling this with multiple files?

It's probably useless, but the way the code is organized, load_svmlight_file calls into load_svmlight_files with a single input.

@ogrisel

Member

ogrisel commented Feb 20, 2017

@lesteve I addressed your comments. Thanks for the review.

@ogrisel ogrisel referenced this pull request in dmlc/xgboost Feb 20, 2017

Closed

Distributed computing with Dask #2032

@lesteve

LGTM

@lesteve lesteve changed the title from [MRG]: Svmlight chunk loader to [MRG+1]: Svmlight chunk loader Feb 21, 2017

@jnothman

I think you should explicitly test boundary cases:

  • f.seek(offset) such that f.read(1) == '\n'
  • f.seek(length) such that f.read(1) == '\n'
  • f.seek(length - 1) such that f.read(1) == '\n'
sklearn/datasets/svmlight_format.py
+ discarding the following bytes up until the next new line
+ character.
+
+ length: integer, optional, default -1

@jnothman

jnothman Feb 21, 2017

Member

space before colon


@ogrisel

Member

ogrisel commented Feb 21, 2017

@jnothman thanks for the review. I added a test that checks all possible byte offsets on a small dataset (along with query ids). The exhaustive test runs in 500ms. This should cover all the boundary cases you mentioned.
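A simplified sketch of that kind of exhaustive check (not the exact test added in the PR; the tiny dataset is made up):

import numpy as np
import scipy.sparse as sp
from io import BytesIO
from numpy.testing import assert_array_almost_equal, assert_array_equal
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Hypothetical tiny dataset, dumped to an in-memory buffer.
X = sp.random(20, 15, density=0.3, format="csr", random_state=0)
y = np.arange(20) % 3
buf = BytesIO()
dump_svmlight_file(X, y, buf)
data = buf.getvalue()

X_full, y_full = load_svmlight_file(BytesIO(data), n_features=15)

# Try every possible split point: the two chunks must always
# concatenate back to the full dataset.
for split in range(1, len(data)):
    X_0, y_0 = load_svmlight_file(BytesIO(data), n_features=15,
                                  offset=0, length=split)
    X_1, y_1 = load_svmlight_file(BytesIO(data), n_features=15,
                                  offset=split, length=-1)
    assert_array_almost_equal(X_full.toarray(),
                              sp.vstack([X_0, X_1]).toarray())
    assert_array_equal(y_full, np.concatenate([y_0, y_1]))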

@ogrisel

Member

ogrisel commented Feb 21, 2017

@jnothman I fixed (worked around) the broken tests with old versions of scipy.

@jnothman

Member

jnothman commented Jun 14, 2017

Mind adding a what's new before merge, @ogrisel?

@jnothman jnothman changed the title from [MRG+1]: Svmlight chunk loader to [MRG+2]: Svmlight chunk loader Jun 14, 2017

@scikit-learn scikit-learn deleted a comment from codecov-io Jun 14, 2017

@ogrisel

Member

ogrisel commented Jun 14, 2017

@jnothman done. I also rebased on top of current master to insert the entry at the right location. Let's see if CI is still green.

@scikit-learn scikit-learn deleted a comment from codecov bot Jun 16, 2017

@scikit-learn scikit-learn deleted a comment from codecov bot Jun 16, 2017

@ogrisel ogrisel merged commit a39c8ab into scikit-learn:master Jun 16, 2017

5 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/patch: 100% of diff hit (target 96.29%)
codecov/project: 96.29% (+<.01%) compared to 7238b46
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@ogrisel ogrisel deleted the ogrisel:svmlight-memmaped-loader branch Jun 16, 2017

@ogrisel

Member

ogrisel commented Jun 16, 2017

Merged! The scipy version check in the test was too lax. I updated it.

@jnothman

Member

jnothman commented Jun 17, 2017

I'm tempted to say: Thanks for your patience, @ogrisel ;)

@jnothman

Member

jnothman commented Jun 17, 2017

It's pretty exciting to close an issue #<1000

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
