Original Idea

Paul Dubs edited this page Jan 3, 2018 · 2 revisions

A Proposal for a different interface for datavec

As far as my use of datavec is concerned, I've found it to be needlessly complex. I think that it can be a lot simpler, so here is my proposal of how it might look and work.

First, let's take a look at what ETL actually means: Extract - Transform - Load. Those are three phases, each characterized by doing exactly one thing. The extraction phase just reads the data from some source. The transformation phase only transforms the data from the extraction phase into something different. And the loading phase saves the result wherever we want to have it.

Those three steps are orthogonal to each other: none of them does anything that either of the other two does. Java 8 introduced lambdas as syntactic sugar over anonymous classes with just a single method, along with some interfaces to go with them: Supplier<T>, Consumer<T> and Function<T, R>. I think that these interfaces perfectly represent our three ETL phases: Supplier corresponds to Extract, Consumer corresponds to Load and Function<T, R> corresponds to Transform. Function is chainable, so you can create a Function<T, S> by chaining a Function<T, I> to a Function<I, S> (with andThen or compose), and we can easily build up pipelines that way.
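
A minimal sketch of that mapping, using the standard java.util.function types with Supplier<T> in the Extract role. The null end-of-data convention, the run helper and the token-counting steps are assumptions made for illustration; none of them is part of DataVec or of the proposed API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class EtlWithJava8 {

    // Runs Extract -> Transform -> Load until the supplier returns null
    // (an assumed end-of-data convention, chosen only to keep the sketch short).
    static <T, R> void run(Supplier<T> extract, Function<T, R> transform, Consumer<R> load) {
        for (T item = extract.get(); item != null; item = extract.get()) {
            load.accept(transform.apply(item));
        }
    }

    // Chains a Function<String, String[]> to a Function<String[], Integer> with andThen.
    static Function<String, Integer> tokenCounter() {
        Function<String, String[]> tokenize = line -> line.trim().split("\\s+");
        Function<String[], Integer> count = tokens -> tokens.length;
        return tokenize.andThen(count);
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("a b c", "d e").iterator();
        Supplier<String> extract = () -> lines.hasNext() ? lines.next() : null; // Extract
        List<Integer> sink = new ArrayList<>();                                  // Load target
        run(extract, tokenCounter(), sink::add);
        System.out.println(sink); // prints [3, 2]
    }
}
```

Note that Supplier has no way to say "no more data" on its own, which is one reason a dedicated source interface with hasNext() is a better fit for ETL than the raw Java 8 type.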

I often use these new features of Java 8 in my own code, so I thought that something based on just those three components could be a good foundation for a simple and easy-to-use ETL library. However, I'm also aware that Java 7 support is still very much desired in order to support Android, so I'm proposing to create our own interfaces in the spirit of those just described, but designed with ETL for DL4J in mind.

Let's take a look at ETL from a DL4J user's perspective. The Extract phase is mostly concerned with loading the data from some source and into a form that can be manipulated, the Transform phase is usually tasked with transforming that data into a proper numeric form so it can be used in training, and the Load phase in the end repeatedly iterates over the data in order to train a network on it.

In the previous paragraph I cheated a little:

"transforming that data into a proper numeric form"

Actually, we sometimes can't do that without first iterating over all of the data in order to get some statistics about it. The pipelines that we build can't always go directly from the data source to training. Sometimes they have to lead to other, different sinks, e.g. when a preprocessor is going to be used that requires means and standard deviations to be known, or for TF-IDF, where the vectorizer requires counts of how often a token appears in the corpus. Those values can be extracted from the data, which takes an iteration over the training data, or they can be loaded from somewhere because they are known beforehand.
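
To make the two options concrete, here is a small sketch of a standardizer that can either be fitted with one full pass over the data or constructed from values known beforehand. The class name Standardizer and its methods are hypothetical, and the running-statistics update is Welford's algorithm, picked here only because it needs a single pass; it is not how DataVec's normalizers are implemented.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FitThenTransform {

    // Hypothetical fitted step: the result of a statistics pass,
    // later reused as one step of the final transformation pipeline.
    static final class Standardizer {
        final double mean;
        final double std;

        // Option 2: statistics are known beforehand and simply passed in.
        Standardizer(double mean, double std) {
            this.mean = mean;
            this.std = std;
        }

        // Option 1: one full iteration over the data (Welford's running mean/variance).
        static Standardizer fit(Iterator<Double> data) {
            long n = 0;
            double mean = 0.0;
            double m2 = 0.0; // running sum of squared deviations from the mean
            while (data.hasNext()) {
                double x = data.next();
                n++;
                double delta = x - mean;
                mean += delta / n;
                m2 += delta * (x - mean);
            }
            double std = n > 0 ? Math.sqrt(m2 / n) : 0.0; // population standard deviation
            return new Standardizer(mean, std);
        }

        // Maps a raw value to zero mean and unit variance; constant data maps to 0.
        double apply(double x) {
            return std == 0.0 ? 0.0 : (x - mean) / std;
        }
    }

    public static void main(String[] args) {
        List<Double> train = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        Standardizer fitted = Standardizer.fit(train.iterator()); // costs a pass over the data
        Standardizer known = new Standardizer(5.0, 2.0);          // costs nothing
        System.out.println(fitted.apply(9.0) + " vs " + known.apply(9.0));
    }
}
```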

Therefore, we don't necessarily have just one data transformation pipeline when training a network, but instead we can have multiple ones. However, one of those is going to be a final transformation pipeline, and it is this transformation pipeline that is going to be used in production.

But in production the data is probably not going to come from the same kind of source as it does when training, and in addition it usually isn't going to be in the form of an iterator over multiple examples.

With those high-level needs in mind, I'm proposing the following general structure for using DataVec.

	Source -> Pipeline -> Sink

A Source can be anything that provides the methods Record<T> next() and boolean hasNext(). Also notice that it doesn't have to be resettable, as we only need resettability for some special cases.
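
As a sketch of how small such a contract could be, the following Java 7 compatible code defines a Source with exactly those two methods. The Record wrapper, the IterableSource class and the count helper are hypothetical names invented for this illustration; the proposal itself does not specify them.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SourceSketch {

    // Hypothetical record wrapper; the proposal does not say what a Record contains.
    static final class Record<T> {
        private final T value;
        Record(T value) { this.value = value; }
        T get() { return value; }
    }

    // The two methods asked of a Source. Deliberately no reset().
    interface Source<T> {
        boolean hasNext();
        Record<T> next();
    }

    // A one-pass source backed by any Iterable (an assumed helper, not DataVec API).
    static final class IterableSource<T> implements Source<T> {
        private final Iterator<T> it;
        IterableSource(Iterable<T> items) { this.it = items.iterator(); }
        @Override public boolean hasNext() { return it.hasNext(); }
        @Override public Record<T> next() { return new Record<T>(it.next()); }
    }

    // Drains a source and returns how many records were left in it.
    static <T> int count(Source<T> source) {
        int n = 0;
        while (source.hasNext()) {
            source.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("first text", "second text");
        Source<String> source = new IterableSource<String>(docs);
        while (source.hasNext()) {
            System.out.println(source.next().get());
        }
    }
}
```

Because the interface has no reset(), a consumer that needs a second pass has to ask for a fresh source, which keeps the cost of iterating over all the data visible at the call site.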

A Pipeline is composed of multiple functions, each of which needs only a single method, Record<U> apply(Record<T> datum). They are composed by adding them to the Pipeline itself. This opens up the possibility of having different execution strategies for pipelines, independent of source and sink, e.g. a pipeline that takes multiple records at once and processes them in parallel, or a pipeline that runs each step concurrently.
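
One way such a Pipeline could be typed is sketched below in Java 7 compatible code. Everything here is an assumption for illustration: the start() factory, the immutable add() that returns a new pipeline, and the TOKENIZE and COUNT steps are not taken from DataVec or from a finished design.

```java
public class PipelineSketch {

    // Hypothetical record wrapper around a single value.
    static final class Record<T> {
        final T value;
        Record(T value) { this.value = value; }
    }

    // Single-method step, usable from Java 7 through anonymous classes.
    interface Function<T, U> {
        Record<U> apply(Record<T> datum);
    }

    // A pipeline from T to U. add() leaves this pipeline untouched
    // and returns a new one that ends in the added step.
    static final class Pipeline<T, U> implements Function<T, U> {
        private final Function<T, U> steps;

        private Pipeline(Function<T, U> steps) { this.steps = steps; }

        // Starts an empty pipeline that passes records through unchanged.
        static <T> Pipeline<T, T> start() {
            return new Pipeline<T, T>(new Function<T, T>() {
                @Override public Record<T> apply(Record<T> datum) { return datum; }
            });
        }

        // The compiler checks that the new step accepts this pipeline's output type U.
        <V> Pipeline<T, V> add(final Function<U, V> next) {
            final Function<T, U> previous = steps;
            return new Pipeline<T, V>(new Function<T, V>() {
                @Override public Record<V> apply(Record<T> datum) {
                    return next.apply(previous.apply(datum));
                }
            });
        }

        @Override public Record<U> apply(Record<T> datum) { return steps.apply(datum); }
    }

    // Example step: splits a text on whitespace.
    static final Function<String, String[]> TOKENIZE = new Function<String, String[]>() {
        @Override public Record<String[]> apply(Record<String> datum) {
            return new Record<String[]>(datum.value.trim().split("\\s+"));
        }
    };

    // Example step: counts the tokens.
    static final Function<String[], Integer> COUNT = new Function<String[], Integer>() {
        @Override public Record<Integer> apply(Record<String[]> datum) {
            return new Record<Integer>(datum.value.length);
        }
    };

    public static void main(String[] args) {
        Pipeline<String, String[]> tokens = Pipeline.<String>start().add(TOKENIZE);
        Pipeline<String, Integer> counts = tokens.add(COUNT); // tokens stays usable on its own
        System.out.println(counts.apply(new Record<String>("to be or not to be")).value); // prints 6
    }
}
```

Since the pipeline is itself just a function over a single record, it can be applied to one example at a time, without a source or an iterator. A parallel or batched execution strategy would be a different Pipeline implementation behind the same add() and apply() methods.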

Now let's see what this could look like in action:

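	// Source: one record per file; reused below for fitting and for training
	FileContentsSource fileSource = new FileContentsSource("~/data/texts/");
	// First pipeline: text -> tokens -> bag of words
	Pipeline<String, Map<String, Integer>> bowPipe = new Pipeline<String, String>().add(new Tokenizer()).add(new BagOfWords());
	// First sink: iterates over all the data once to collect the corpus counts
	Function<Map<String, Integer>, NDArray> vectorizer = TfidfVectorizer.create(fileSource, bowPipe);
	// Second pipeline: the first one, extended by the fitted vectorizer
	Pipeline<String, NDArray> tfidfPipe = bowPipe.add(vectorizer);
	// Second sink: feeds the transformed records to training
	DataSetIterator iterator = new PipelineIterator(fileSource, tfidfPipe);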

In this example we have a single source that can be reused, two pipelines, one of which is created by extending the other, and two sinks. That gets quite a lot done in just five lines of code.

With this structure, I think that a lot of common problems around DataVec can be easily solved. It is pretty obvious when something is going to be running over all the data. It is simple to add another data transformation, or a different data source, or even use it along with different data sinks. And even explaining it should be pretty simple:

	Source -> Pipeline(Function -> Function -> ...) -> Sink

You can change any one of them.

As code trumps prose, I've translated some of the DataVec examples into this same structure, along with all of the required Sources, Sinks and Functions, so they are executable and usable right now. (This actually didn't happen yet.)
