How to set the size of train-eval split with CsvExampleGen? #20

Efaq · 2019-03-22T15:25:33Z

I have been working with a very simple pipeline which loads the iris dataset and generates some statistics about it. But when I run the pipeline, even without specifying a train-eval split anywhere, the folders eval and train are being created under the CsvExampleGen pipeline folder, with tfrecords inside of it, and an apparently predefined split is being applied (namely around 100 training examples and 50 evaluation examples).
My question is: where can I opt for doing the split or not, and where can I set the size of the split?

Pipeline code below, being run in Airflow:

import os
import logging
import datetime
from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner
from tfx.orchestration.pipeline import PipelineDecorator

from tfx.utils.dsl_utils import csv_input
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.orchestration.tfx_runner import TfxRunner


_CASE_FOLDER = os.path.join(os.environ['HOME'], 'cases', 'iris')
_DATA_FOLDER = os.path.join(_CASE_FOLDER, 'data')
_PIPELINE_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'pipelines')
_METADATA_DB_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'metadata')
_LOG_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'logs')


@PipelineDecorator(
    pipeline_name='test_tfx_pipeline_iris',
    pipeline_root=_PIPELINE_ROOT_FOLDER,
    metadata_db_root=_METADATA_DB_ROOT_FOLDER,
    additional_pipeline_args={'logger_args': {
        'log_root': _LOG_ROOT_FOLDER,
        'log_level': logging.INFO
    }}
)
def create_pipeline():

    print("HELLO")
    examples = csv_input(_DATA_FOLDER)

    example_gen = CsvExampleGen(input_base=examples, name='iris_example_gen_1')
    # Ingests the CSV input and emits tf.Example records.

    statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

    return [
        example_gen, statistics_gen
    ]

_airflow_config = {
    'schedule_interval': None,
    'start_date': datetime.datetime(2019, 1, 1),
}
pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())

The folder structure being generated:

zhitaoli · 2019-03-22T16:37:21Z

The current version of TFX pipeline uses two default splits 'train' and 'eval' across all components. There is a feature request #21 to support custom splits and distributions.

Can you tell us more about your use case? Do you want to completely opt out of the split?

Efaq · 2019-03-22T16:41:05Z

I think my case is simpler than that: I want to set the value of the split - it may be the same across the whole pipeline - and I just couldn't find how to do it.

So it is OK for me that the default splits are there (train and eval), I just want to set the value of it (say 0.9 - 0.1)

Otherwise, request #21 does seem interesting!

1025KB · 2019-03-22T17:21:56Z

Hi, currently we don't support a custom split ratio, but it is on our radar; we just need to figure out a generic way of doing it. For now the ratio is fixed at 2:1 train:eval, which is defined in the _partition_fn in the example_gen component.

1025KB · 2019-03-22T17:26:02Z

To be more specific, in the long term we will support pre-split input, a custom ratio, and probably also a custom split function.

krazyhaas · 2019-03-22T21:54:23Z

If you're comfortable modifying your version of TFX, you can change the example_gen executor directly. This will cause all of your pipelines to use the same ratio so buyer beware! It's not a great experience and we're working to elevate this parameter into the pipeline, but to unblock you in case you really really want to change the ratio to be 9:1 train:eval, make the following change:
(if you are working with a github clone):
tfx/components/example_gen/base_example_gen_executor.py:37
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

(if you are using 0.12.0 downloaded from PyPI):
tfx/components/example_gen/csv_example_gen/executor.py:39
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

You'll have to make the change every time the file gets overwritten (e.g. upgrading to 0.13.0) so waiting for the pipeline config parameter is definitely recommended.
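
For anyone who wants to check what ratio a given modulus produces before patching the file, here is a minimal standalone sketch of the same hash-based partitioning. It does not import TFX; `partition` and `eval_fraction` are hypothetical helper names, and the `% 3` default case is only my reading of the 2:1 ratio mentioned earlier in this thread, not a quote from the TFX source.

```python
import hashlib


def partition(record: bytes, modulus: int) -> int:
    """Return 1 (eval) if the record's SHA-256 hash is divisible by modulus, else 0 (train).

    Same expression as the one-line change above, with the modulus as a parameter.
    """
    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % modulus == 0 else 0


def eval_fraction(records, modulus: int) -> float:
    """Fraction of records that land in the eval partition for a given modulus."""
    records = list(records)
    return sum(partition(r, modulus) for r in records) / len(records)


if __name__ == "__main__":
    # Synthetic stand-ins for serialized rows; real records would be the bytes
    # that the example_gen executor hashes.
    rows = [f"row-{i}".encode("utf-8") for i in range(3000)]
    for modulus in (3, 10):
        frac = eval_fraction(rows, modulus)
        print(f"modulus={modulus}: eval fraction ~ {frac:.3f}")
```

Because the split is decided by a hash of each record, it is deterministic per record but only approximately hits the target ratio; on a small dataset such as iris (150 rows) the counts can drift noticeably from an exact 9:1.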

Efaq · 2019-03-25T09:22:58Z

Hi, thank you @1025KB for the explanation and @krazyhaas for the quick fix. I will probably use this workaround until the configuration is possible.

Should I keep this open as it is a feature request?

1025KB · 2019-03-25T17:51:37Z

Sounds good, thank you!

Efaq · 2019-05-20T09:55:27Z

I just saw that you documented a way of doing that, namely in https://www.tensorflow.org/tfx/guide/examplegen

Should I close this one as done or are you still waiting for something else?

zhitaoli · 2019-05-20T16:21:04Z

@Efaq I think that's all we planned for this issue. I'm closing this one if it's fine for you. Feel free to reopen if you have further questions.

Harshini-Gadige added the type:feature label Mar 29, 2019

aaronlelevier mentioned this issue Apr 6, 2019

Support split config across the pipeline platform #21

Closed

zhitaoli closed this as completed May 20, 2019
