How to set the size of train-eval split with CsvExampleGen? #20

Closed
Efaq opened this issue Mar 22, 2019 · 9 comments

Comments

@Efaq

Efaq commented Mar 22, 2019

I have been working with a very simple pipeline that loads the iris dataset and generates some statistics about it. When I run the pipeline, even without specifying a train-eval split anywhere, the folders eval and train are created under the CsvExampleGen pipeline folder with TFRecord files inside them, and an apparently predefined split is applied (around 100 training examples and 50 evaluation examples).
My question is: where can I choose whether to split at all, and where can I set the size of the split?

Pipeline code below, run in Airflow:

import os
import logging
import datetime
from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner
from tfx.orchestration.pipeline import PipelineDecorator

from tfx.utils.dsl_utils import csv_input
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.orchestration.tfx_runner import TfxRunner


_CASE_FOLDER = os.path.join(os.environ['HOME'], 'cases', 'iris')
_DATA_FOLDER = os.path.join(_CASE_FOLDER, 'data')
_PIPELINE_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'pipelines')
_METADATA_DB_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'metadata')
_LOG_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'logs')


@PipelineDecorator(
    pipeline_name='test_tfx_pipeline_iris',
    pipeline_root=_PIPELINE_ROOT_FOLDER,
    metadata_db_root=_METADATA_DB_ROOT_FOLDER,
    additional_pipeline_args={'logger_args': {
        'log_root': _LOG_ROOT_FOLDER,
        'log_level': logging.INFO
    }}
)
def create_pipeline():

    print("HELLO")
    examples = csv_input(_DATA_FOLDER)

    example_gen = CsvExampleGen(input_base=examples, name='iris_example_gen_1')
    # Ingests the CSV input and emits tf.Example records.

    statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

    return [
        example_gen, statistics_gen
    ]

_airflow_config = {
    'schedule_interval': None,
    'start_date': datetime.datetime(2019, 1, 1),
}
pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())

The folder structure being generated:

[image: screenshot of the generated folder structure]

@zhitaoli
Contributor

The current version of the TFX pipeline uses two default splits, 'train' and 'eval', across all components. There is a feature request, #21, to support custom splits and distributions.

Can you tell us more about your use case? Do you want to completely opt out of the split?

@Efaq
Author

Efaq commented Mar 22, 2019

I think my case is simpler than that: I want to set the split ratio (it can be the same across the whole pipeline), and I just couldn't find how to do it.

So it is OK for me that the default splits (train and eval) are there; I just want to set the ratio (say 0.9 to 0.1).

Otherwise, request #21 does seem interesting!

@1025KB
Collaborator

1025KB commented Mar 22, 2019

Hi, currently we don't support a custom split ratio, but it is on our radar; we just need to figure out a generic way of doing it. For now the ratio is fixed at 2:1 train:eval, which is defined in the _partition_fn in the example_gen component.

@1025KB
Collaborator

1025KB commented Mar 22, 2019

To be more specific, in the long term we will support pre-split input, a custom ratio, and probably also a custom split function.

@krazyhaas

If you're comfortable modifying your copy of TFX, you can change the example_gen executor directly. This will cause all of your pipelines to use the same ratio, so buyer beware! It's not a great experience, and we're working to elevate this parameter into the pipeline, but to unblock you in case you really want to change the ratio to 9:1 train:eval, make the following change:
(if you are working with a GitHub clone):
tfx/components/example_gen/base_example_gen_executor.py:37
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

(if you are using 0.12.0 downloaded from PyPI):
tfx/components/example_gen/csv_example_gen/executor.py:39
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

You'll have to make the change every time the file gets overwritten (e.g. when upgrading to 0.13.0), so waiting for the pipeline config parameter is definitely recommended.
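
Editor's note: to see why changing the modulus changes the ratio, here is a minimal standalone sketch of a hash-based partition function. It is not the TFX source. The `eval_modulus` parameter and the `split_counts` helper are hypothetical names introduced for illustration, and the default of 3 is an assumption inferred from the 2:1 ratio mentioned earlier in the thread; only the `% 10` variant is quoted above.

```python
import hashlib


def partition_fn(record: bytes, eval_modulus: int = 3) -> int:
    """Return 1 (eval) if the record's SHA-256 hash is divisible by
    eval_modulus, otherwise 0 (train).

    eval_modulus is a hypothetical parameter; the default of 3 is an
    assumption that would give roughly a 2:1 train:eval ratio.
    """
    digest = int(hashlib.sha256(record).hexdigest(), 16)
    return 1 if digest % eval_modulus == 0 else 0


def split_counts(records, eval_modulus):
    """Count how many records land in train (index 0) and eval (index 1)."""
    counts = [0, 0]
    for record in records:
        counts[partition_fn(record, eval_modulus)] += 1
    return counts


# Synthetic records; real inputs would be serialized examples.
records = [f"row-{i}".encode("utf-8") for i in range(10_000)]

train_3, eval_3 = split_counts(records, 3)     # roughly 2:1
train_10, eval_10 = split_counts(records, 10)  # roughly 9:1

print("modulus 3:", train_3, "train,", eval_3, "eval")
print("modulus 10:", train_10, "train,", eval_10, "eval")
```

Because the assignment depends only on the record's hash, the same record always lands in the same split across runs, and the ratio is only approximate on small datasets such as iris.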

@Efaq
Author

Efaq commented Mar 25, 2019

Hi, thank you @1025KB for the explanation and @krazyhaas for the quick fix. I will probably use this workaround until the configuration is possible.

Should I keep this open as it is a feature request?

@1025KB
Collaborator

1025KB commented Mar 25, 2019

sounds good, thank you!

@Efaq
Author

Efaq commented May 20, 2019

I just saw that you documented a way of doing that, namely in https://www.tensorflow.org/tfx/guide/examplegen

Should I close this one as done or are you still waiting for something else?
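
Editor's note: as far as I understand the linked guide, the documented approach expresses the ratio as a number of hash buckets per named split (for example 9 buckets for train and 1 for eval). The sketch below illustrates that idea in plain Python only. It does not call TFX, it is not TFX's implementation, and `assign_split` and its arguments are hypothetical names chosen for illustration; see the guide for the actual configuration.

```python
import hashlib
from collections import Counter
from typing import List, Tuple


def assign_split(record: bytes, splits: List[Tuple[str, int]]) -> str:
    """Map a record to a named split in proportion to its bucket count.

    splits is a list of (name, bucket_count) pairs, e.g.
    [("train", 9), ("eval", 1)] for an approximate 9:1 ratio.
    """
    total_buckets = sum(buckets for _, buckets in splits)
    if total_buckets <= 0:
        raise ValueError("splits must define at least one bucket")

    # Hash the record into one of total_buckets buckets.
    bucket = int(hashlib.sha256(record).hexdigest(), 16) % total_buckets

    # Walk the cumulative bucket ranges to find the owning split.
    upper_bound = 0
    for name, buckets in splits:
        upper_bound += buckets
        if bucket < upper_bound:
            return name
    raise AssertionError("unreachable: bucket is always below total_buckets")


# Synthetic records; real inputs would be serialized examples.
records = [f"row-{i}".encode("utf-8") for i in range(10_000)]
split_config = [("train", 9), ("eval", 1)]

split_sizes = Counter(assign_split(r, split_config) for r in records)
print(dict(split_sizes))
```

Adding a third entry such as `("test", 1)` to `split_config` would carve out another split without touching the function, which is the appeal of bucket counts over a hard-coded modulus.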

@zhitaoli
Contributor

@Efaq I think that's all we planned for this issue. I'll close this one if that's fine with you. Feel free to reopen if you have further questions.

ruoyu90 pushed a commit to ruoyu90/tfx that referenced this issue Aug 28, 2019