
[WIP] Streamz as input #113

Open. Wants to merge 15 commits into master.
Conversation

@remiadon commented May 11, 2019

Changes proposed in this pull request:

  • Creation of interfaces between skmultiflow.data and streamz.Source, so that skmultiflow data generators can be used as inputs to streamz.Stream(s)
  • [TODO] add methods for triggering partial_fit/predict on skmultiflow estimators

Note that this PR also deprecates some existing classes, in order to encourage the use of streamz as the default way of handling streaming data.

Checklist

  • Code complies with PEP-8 and is consistent with the framework.
  • Code is properly documented.
  • Tests are included for new functionality or updated accordingly.
  • Travis CI build passes with no errors.
  • Test Coverage is maintained (threshold is -0.2%).
  • Files changed (update, add, delete) are in the PR's scope (no extra files are included).

This PR introduces the following new syntax:

from streamz import Source
from skmultiflow.data import SEAGenerator
gen = SEAGenerator(random_state=1)
stream = Source.from_generator(gen, poll_interval=.5, batch_size=10) # pull 10 elements every 0.5 seconds
stream.sink(print) 
stream.start() # printing data from the SEAGenerator

TODO : show syntax for partial_fit/predict
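As a placeholder for that TODO, here is a rough, hypothetical sketch of what the wiring could look like, reusing the stream defined above and the partial_fit node added in this PR (the estimator choice, the exact call signature and the shape of the emitted elements are assumptions, not the final API):

from skmultiflow.trees import HoeffdingTree

model = HoeffdingTree()                          # any incremental skmultiflow estimator
models = stream.partial_fit(model=model)         # hypothetical: emits the updated model after each batch
models.map(lambda m: m.get_info()).sink(print)   # inspect the emitted model downstream
stream.start()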

@jacobmontiel jacobmontiel self-requested a review May 11, 2019 18:29
@remiadon remiadon changed the title Streamz as input [WIP] Streamz as input May 11, 2019
@jacobmontiel
Contributor

Travis fails due to the latest version of sklearn. I will update a temporary workaround as part of the other open PR. I will upload the actual fix later today.

from streamz import Stream

@Stream.register_api()
class predict(Stream):
Contributor

This is a method reserved for learning methods.


This makes predict appear as a method on streamz.Stream instances. We might want a more explicit name, or, alternatively, not register the method at all, but instead use Stream.connect().

return self._emit(self.model.predict(X))
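For reference, a minimal sketch of the Stream.connect() alternative mentioned in the comment above; the constructor arguments of the predict node are assumptions about this PR rather than its actual API, and model stands for any incremental skmultiflow estimator:

node = predict(None, model=model)   # build the node explicitly instead of registering it as a method
stream.connect(node)                # wire the source to it, equivalent in spirit to stream.predict(...)
node.sink(print)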

@Stream.register_api()
class partial_fit(Stream):
Contributor

This is a method reserved for learning methods.

@@ -2,26 +2,26 @@
 from skmultiflow.core.base_object import BaseObject


-class Stream(BaseObject, metaclass=ABCMeta):
-    """ The abstract class setting up the minimum requirements of a stream,
+class BaseGenerator(BaseObject, metaclass=ABCMeta):
Contributor

This renaming could cause some confusion, given that not all extended classes are generators. Suggestions: SMKStream, BaseStream, StreamMixin, ...

This is not limited to the currently available methods; it also applies when considering future sources such as Kafka, TCP, HTTP, etc.

Author

@jacobmontiel do you have an example? This renaming was done to avoid confusion with streamz.Stream.
IMO the extended classes I have seen literally are generators, but I may be mistaken ...

Contributor

Currently, there are two types of data sources supported: generators and data batches (files and raw data).

In this case, FileStream and DataStream are not generators. Future sources (via streamz) that come to mind include Kafka and TCP, which also do not fall into the generator category.

During the first stage of development, we focused on generators because they are cheap (memory-wise). However, the next step is to allow multiple data sources. Generators are nice for reproducibility, but in order to make skmultiflow useful for real-world applications, we need to provide the user with more options.

src/skmultiflow/data/base_generator.py (outdated; resolved)
# TODO : inherit from pandas.DataFrame should make things easier
""" DataGenerator

A generator constructed from the entries of a static dataset (numpy array or pandas
Contributor

I like the idea of having a single method for both file and raw data sources. However, it would be nice to make clear in the example section that both functionalities remain available, because a lot of users rely on these methods. Optionally, we could keep DataStream and FileStream as wrappers with a deprecation warning.


Unless the users are well-involved in the development process, I would second deprecation or clear documentation on how to switch.

Author

File sources are already tackled by streamz via Source.from_textfile

I think I can add a deprecation warning
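For reference, the existing streamz file source mentioned above is used roughly like this (the file path and the mapping applied are illustrative):

from streamz import Source

s = Source.from_textfile('data.csv', poll_interval=0.5)  # emits one line of text at a time
s.map(str.strip).sink(print)
s.start()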

Author

@jacobmontiel for this case I don't see how a deprecation warning could help, taking the renaming into account.

If users had code importing FileStream, they could not import it anyway, because after the applied renaming it would be renamed to FileGenerator ...

Contributor

I mean to have a wrapper class FileStream (same name as before) that internally calls the new method. If the user calls FileStream then the warning is raised.
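For illustration, a minimal sketch of that wrapper idea; DataGenerator and its constructor arguments are assumptions about the classes introduced in this PR, not its actual API:

import warnings
import pandas as pd

class FileStream(DataGenerator):
    """ Backward-compatible wrapper that forwards to the new class. """

    def __init__(self, filepath, *args, **kwargs):
        warnings.warn("FileStream is deprecated; use DataGenerator instead.",
                      DeprecationWarning)
        super().__init__(pd.read_csv(filepath), *args, **kwargs)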

src/skmultiflow/data/stream.py (outdated; resolved)
src/skmultiflow/transform/missing_values_cleaner.py (outdated; resolved)
@martindurant left a comment

On the whole, I am impressed by how relatively easy it was to plug streamz into your workflow.

I have left some comments for discussion.


def update(self, x, who=None):
    X, y = x
    self.model.partial_fit(X, y)
    return self._emit(self.model)


Here is what I was talking about. I guess it's OK to emit the whole model (in-process, anyway), and then downstream streamz nodes can decide what to do with it to get out any metrics of interest. On the other hand, the model is mutable, so downstreams had better not be async, else they may process on incorrect state.

Is there any obvious "output" to the fitting and current state that we could consider as a result of the partial fit?

Author

Following the sklearn way of doing things, the output of partial_fit should be self.model.
But here we are in a different context: calls in streamz may be asynchronous, which makes things complicated.

Option 1:
Return self.model, and make it very explicit to users that this part of the code is not responsible for ensuring processing happens on a correct state.

Option 2 (requires more work):
Internally manage access to self.model (via tornado conditions?)


After some consideration, Option 1 is OK for the moment - with appropriate caveats - and I don't expect any async down-streams here for now. In examples, it would be good to show what you might do with the model to get some basic metrics out, for plotting the performance of the model versus time.

Author

@remiadon commented May 14, 2019

@martindurant to answer your question more fully: most of the information we could get from a model actually depends on the "state" of the model.
Even something like a call to "get_infos()" would lead to incorrect results if run asynchronously. So I don't see what useful information could be returned instead of the model itself ...


.map(lambda x: x.get_infos()) would run immediately, generate a string, and be correct at the moment it is called. If there are downstream nodes that want to do something else with the model object, they may be slightly out-of-date, but not by much. If the object is to plot a time-series of the model performance, I don't anticipate much of a problem.

Author

I think I get your point.
I would actually be OK with returning get_infos() for a V1.

Contributor

get_info is just a method that provides the initial configuration of the estimator, so it does not change as the model is trained. I prefer returning self, because the user could then call something like model.score(X, y_true) to get an estimate of performance. score() is now available in all classifiers and regressors.
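To make that concrete, a rough sketch of the metric stream discussed in this thread; the partial_fit node name follows this PR, but the zip pairing, call signatures and element shapes are assumptions:

models = stream.partial_fit(model=model)               # emits the updated model
paired = models.zip(stream)                            # pair each emitted model with an (X, y) batch
accuracy = paired.map(lambda mb: mb[0].score(*mb[1]))  # model.score(X, y_true)
history = accuracy.sink_to_list()                      # accuracy values over time, ready to plot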

src/skmultiflow/core/stream.py (outdated; resolved)
"""
return self.name + " - {} target(s), {} classes, {} features".format(self.n_targets,
self.n_classes, self.n_features)

def get_class_type(self):
return 'stream'
return type(self)


Always a multiflow-specific type here? Streamz comes in many classes, hope it's not possible to end up here with one of those.

Contributor

get_class_type has been deprecated. However, we could take the chance to define a method in the base class to return the stream (streamz) type.


while self.mf_gen.has_more_samples():
    sample = self.mf_gen.next_sample(self.batch_size)
    yield self._emit(sample)
    yield gen.sleep(self.poll_interval)


if self.stopped: break ?
A common pattern in other async sources, allowing stream.stop().

Is there a practical consequence to having this async? It doesn't need to be, to get closer to the functionality and performance overhead of the previous version. I note that the downstream fit/predict functionality is not async.

Author

No, I don't see any consequence of making this async.


To be checked: I think you will get exactly one event emitted per event loop tick, which may not be significant overhead compared to the time spent training for each sample.

Author

updated
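For reference, the stop check discussed in this thread would make the polling coroutine look roughly like the following (attribute names follow the snippet quoted above; the method name is hypothetical):

from tornado import gen

@gen.coroutine
def do_poll(self):
    while self.mf_gen.has_more_samples():
        if self.stopped:
            break
        sample = self.mf_gen.next_sample(self.batch_size)
        yield self._emit(sample)
        yield gen.sleep(self.poll_interval)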

sliding_array = stream.sliding_window(10).map(pd.np.concatenate) # get last ten elements
data_from_stream = sliding_array.map(SimpleImputer(missing_values=-47, strategy='median').fit_transform).sink_to_list()
stream.start()
time.sleep(1) # wait long enough to let all batches pass through the stream


I would recommend against this in tests, but maybe OK in a demo. There are helpers like wait_for that could be useful.

Author

I did not know there was a function for this
Thanks !!
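For reference, the helper mentioned above might be used like this in a test (the predicate and timeout are illustrative):

from streamz.utils_test import wait_for

wait_for(lambda: len(data_from_stream) > 0, timeout=2)  # poll until data arrives or the timeout expires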

Author

Actually @martindurant this is not a test, it is a demo.
In the future, if demos are replaced by notebooks (I raised an issue for this), calls to sleep or wait_for would not make sense.
The cells of the notebook would simply have to be executed in order, that's all.

tests/data/test_data_generator.py (resolved)
tests/data/test_stream.py (outdated; resolved)
tests/lazy/test_sam_knn.py (outdated; resolved)