
Adding support for reading and writing to multiple tfrecords in nsl.tools.pack_nbrs #92

Closed · srihari-humbarwadi opened this issue Jul 11, 2021 · 10 comments
Labels: enhancement (New feature or request)

@srihari-humbarwadi

The current implementation of nsl.tools.pack_nbrs does not support reading from and writing to multiple TFRecord files.
Given the extensive optimizations the tf.data API offers when working with multiple TFRecord files, supporting this would yield significant performance gains in distributed training. I would be willing to contribute this.

Relevant parts of the code

  • for reading

    start_time = time.time()
    logging.info('Reading tf.train.Examples from TFRecord file: %s...', filename)
    result = {}
    for raw_record in tf.data.TFRecordDataset([filename]):
      tf_example = parse_example(raw_record)
      result[get_id(tf_example)] = tf_example
    logging.info('Done reading %d tf.train.Examples from: %s (%.2f seconds).',
                 len(result), filename, (time.time() - start_time))
    return result
  • for writing

    with tf.io.TFRecordWriter(output_training_data_path) as writer:
      for merged_ex in _join_examples(seed_exs, nbr_exs, graph, max_nbrs):
        writer.write(merged_ex.SerializeToString())
    logging.info('Output written to TFRecord file: %s.',
                 output_training_data_path)
    logging.info('Total running time: %.2f minutes.',
                 (time.time() - start_time) / 60.0)
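
As a rough illustration of the distributed-training benefit (not code from pack_nbrs; num_workers, worker_index, and the file pattern below are placeholders), sharded input lets each worker read only its own subset of files, and read those files concurrently:

import tensorflow as tf

num_workers, worker_index = 4, 0  # placeholder values for one worker in a 4-worker job
files = tf.data.Dataset.list_files('train-*-of-*.tfrecord', shuffle=False)
# Each worker keeps a disjoint subset of the shard files...
files = files.shard(num_shards=num_workers, index=worker_index)
# ...and reads them in parallel instead of scanning one monolithic file.
dataset = files.interleave(tf.data.TFRecordDataset,
                           num_parallel_calls=tf.data.AUTOTUNE)
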
@arjung
Collaborator

arjung commented Jul 12, 2021

Thanks for your interest in contributing, @srihari-humbarwadi! :) To begin with, could you share more specifics on which APIs you plan to change and how you plan to change them? Once we come to an agreement there, you can go ahead with the implementation. I've assigned one of my colleagues, @aheydon-google, to this thread; he will be able to work with you on this.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Jul 13, 2021

Here is the signature of the current implementation of pack_nbrs:

def pack_nbrs(labeled_examples_path,
              unlabeled_examples_path,
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id'):

labeled_examples_path and unlabeled_examples_path are each the path to a single TFRecord file.
The proposed modification additionally supports passing a list of paths. This would enable users to load examples that are split across multiple TFRecord 'shards', which is often the case when training on multiple accelerators.

labeled_examples_path = 'train.tfrecord'  # current implementation supports this.

# proposed modification adds support for the following as well
labeled_examples_path = [
    'train-0001-of-0004.tfrecord',
    'train-0002-of-0004.tfrecord',
    'train-0003-of-0004.tfrecord',
    'train-0004-of-0004.tfrecord'
    ]
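
For clarity, one possible shape of the modified signature (the num_shards argument and its default shown here are part of this proposal, not the existing API):

def pack_nbrs(labeled_examples_path,      # str or list of str
              unlabeled_examples_path,    # str or list of str
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id',
              num_shards=1):              # new, optional; 1 preserves the current single-file output
  ...
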

The _read_tfrecord_examples function (defined here) that currently reads examples from a single file would require minimal changes to support reading from multiple files.

for raw_record in tf.data.TFRecordDataset([filename]):
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example

The modified code would look something like this:

for raw_record in tf.data.TFRecordDataset(filenames):  # filenames is list of tfrecord paths
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example
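
A fuller sketch of the modified reader (the signature is simplified here; any other parameters of the real function are omitted, and parse_example / get_id are assumed to be the same helpers used in the snippet above). The single-path normalization is the only new piece:

def _read_tfrecord_examples(filenames):
  """Reads tf.train.Examples from one or more TFRecord files into a dict keyed by ID."""
  # New: accept either a single path (current behavior) or a list of paths.
  if isinstance(filenames, str):
    filenames = [filenames]
  start_time = time.time()
  logging.info('Reading tf.train.Examples from TFRecord files: %s...', filenames)
  result = {}
  for raw_record in tf.data.TFRecordDataset(filenames):
    tf_example = parse_example(raw_record)
    result[get_id(tf_example)] = tf_example
  logging.info('Done reading %d tf.train.Examples (%.2f seconds).',
               len(result), time.time() - start_time)
  return result
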

For writing the newly generated examples, the current pack_nbrs implementation writes them into a single TFRecord file at the path given by the output_training_data_path argument. The proposed modification adds optional support for splitting the newly generated examples across multiple TFRecord shards. A new pack_nbrs argument, num_shards, will control the number of shards generated. This again requires only a minimal code addition, changing

with tf.io.TFRecordWriter(output_training_data_path) as writer:
  for merged_ex in _join_examples(seed_exs, nbr_exs, graph, max_nbrs):
    writer.write(merged_ex.SerializeToString())

to

writers = []
for i in range(num_shards):
  # there could be a better way to generate output TFRecord names
  output_path = '{}-{}-of-{}'.format(output_training_data_path, i, num_shards)
  writers.append(tf.io.TFRecordWriter(output_path))

for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
  writers[i % num_shards].write(merged_ex.SerializeToString())

# close all writers
for writer in writers:
  writer.close()
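
As a possible refinement of the loop above (a suggestion, not part of the proposal as written), contextlib.ExitStack would close every writer even if an exception is raised partway through, and a zero-padded shard index matches common TFRecord naming:

import contextlib

with contextlib.ExitStack() as stack:
  writers = [
      stack.enter_context(tf.io.TFRecordWriter(
          '{}-{:05d}-of-{:05d}'.format(output_training_data_path, i, num_shards)))
      for i in range(num_shards)
  ]
  # Round-robin the merged examples across the shard writers, as above.
  for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
    writers[i % num_shards].write(merged_ex.SerializeToString())
# All writers are closed automatically when the ExitStack exits.
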

@aheydon-google

Hi, Srihari.

Thanks for supplying those details and for offering to contribute to NSL! What you propose sounds like a nice improvement. If you send me a pull request with your proposed changes, I can review it.

Thanks!

  • Allan

@srihari-humbarwadi
Author

Thank you @aheydon-google, I will start working on this!

@sayakpaul
Contributor

Looking forward to this feature.

For bigger datasets, packing all the examples into a single TFRecord file will introduce a substantial bottleneck in the overall data pipeline.
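
To make the bottleneck concern concrete, a rough sketch of how sharded pack_nbrs output could be consumed in a training pipeline (the file pattern and pipeline parameters here are only illustrative):

import tensorflow as tf

files = tf.data.Dataset.list_files('nsl_train-*-of-*.tfrecord', shuffle=True)
dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)  # read shards in parallel
           .shuffle(10_000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
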

@aheydon-google

Hi, Srihari. Do you have any updates to report on this issue? I think it could be quite useful! Thanks, - Allan

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 23, 2021

@aheydon-google I'll push some changes in a couple of days!

@aheydon-google added the enhancement (New feature or request) label on Jan 4, 2022
@aheydon-google

Hi again, Srihari. Are you still planning to work on this issue? I think it would be a great contribution if you can do it!

Thanks,

  • Allan

@aheydon-google

Since this issue has been dormant for quite some time, I'm going to close it. Feel free to send a pull request if you want to implement the improvement!

@csferng
Collaborator

csferng commented Aug 9, 2023

Closing the issue for now.

@csferng closed this as not planned on Aug 9, 2023