
Adding support for reading and writing to multiple tfrecords in nsl.tools.pack_nbrs #92

Closed · srihari-humbarwadi opened this issue Jul 11, 2021 · 10 comments
Labels: enhancement (New feature or request)

@srihari-humbarwadi

The current implementation of nsl.tools.pack_nbrs does not support reading from and writing to multiple TFRecord files.
Given the extensive optimizations the tf.data API offers when working with multiple TFRecord files, supporting this would yield significant performance gains in distributed training. I would be willing to contribute this.

Relevant parts of the code

  • for reading

    start_time = time.time()
    logging.info('Reading tf.train.Examples from TFRecord file: %s...', filename)
    result = {}
    for raw_record in tf.data.TFRecordDataset([filename]):
      tf_example = parse_example(raw_record)
      result[get_id(tf_example)] = tf_example
    logging.info('Done reading %d tf.train.Examples from: %s (%.2f seconds).',
                 len(result), filename, (time.time() - start_time))
    return result
  • for writing

    with tf.io.TFRecordWriter(output_training_data_path) as writer:
      for merged_ex in _join_examples(seed_exs, nbr_exs, graph, max_nbrs):
        writer.write(merged_ex.SerializeToString())
    logging.info('Output written to TFRecord file: %s.',
                 output_training_data_path)
    logging.info('Total running time: %.2f minutes.',
                 (time.time() - start_time) / 60.0)
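
As a rough illustration of the distributed-training benefit (not code from pack_nbrs; num_workers, worker_index, and the file pattern below are placeholders), sharded input lets each worker read only its own subset of files, and read those files concurrently:

import tensorflow as tf

num_workers, worker_index = 4, 0  # placeholder values for one worker in a 4-worker job
files = tf.data.Dataset.list_files('train-*-of-*.tfrecord', shuffle=False)
# Each worker keeps a disjoint subset of the shard files...
files = files.shard(num_shards=num_workers, index=worker_index)
# ...and reads them in parallel instead of scanning one monolithic file.
dataset = files.interleave(tf.data.TFRecordDataset,
                           num_parallel_calls=tf.data.AUTOTUNE)
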
@arjung
Collaborator

arjung commented Jul 12, 2021

Thanks for your interest in contributing, @srihari-humbarwadi! :) To begin with, could you share more specifics on which APIs you plan to change and how you plan to change them? Once we come to an agreement there, you can go ahead with the implementation. I've assigned one of my colleagues, @aheydon-google, to this thread; he will be able to work with you on this.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Jul 13, 2021

Here is the signature of the current implementation of pack_nbrs:

def pack_nbrs(labeled_examples_path,
              unlabeled_examples_path,
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id'):

labeled_examples_path and unlabeled_examples_path are each the path to a single TFRecord file.
The proposed modification additionally supports passing a list of paths. This would enable users to load examples that are split across multiple TFRecord 'shards', which is often the case when training on multiple accelerators.

labeled_examples_path = 'train.tfrecord'  # current implementation supports this.

# proposed modification adds support for the following as well
labeled_examples_path = [
    'train-0001-of-0004.tfrecord',
    'train-0002-of-0004.tfrecord',
    'train-0003-of-0004.tfrecord',
    'train-0004-of-0004.tfrecord'
    ]
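
For clarity, one possible shape of the modified signature (the num_shards argument and its default shown here are part of this proposal, not the existing API):

def pack_nbrs(labeled_examples_path,      # str or list of str
              unlabeled_examples_path,    # str or list of str
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id',
              num_shards=1):              # new, optional; 1 preserves the current single-file output
  ...
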

The _read_tfrecord_examples function (defined here) that currently reads examples from a single file would require minimal changes to support reading from multiple files.

for raw_record in tf.data.TFRecordDataset([filename]):
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example

The modified code would look something like this:

for raw_record in tf.data.TFRecordDataset(filenames):  # filenames is list of tfrecord paths
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example
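
A fuller sketch of the modified reader (the signature is simplified here; any other parameters of the real function are omitted, and parse_example / get_id are assumed to be the same helpers used in the snippet above). The single-path normalization is the only new piece:

def _read_tfrecord_examples(filenames):
  """Reads tf.train.Examples from one or more TFRecord files into a dict keyed by ID."""
  # New: accept either a single path (current behavior) or a list of paths.
  if isinstance(filenames, str):
    filenames = [filenames]
  start_time = time.time()
  logging.info('Reading tf.train.Examples from TFRecord files: %s...', filenames)
  result = {}
  for raw_record in tf.data.TFRecordDataset(filenames):
    tf_example = parse_example(raw_record)
    result[get_id(tf_example)] = tf_example
  logging.info('Done reading %d tf.train.Examples (%.2f seconds).',
               len(result), time.time() - start_time)
  return result
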

For writing the newly generated examples, the current pack_nbrs implementation writes them into a single TFRecord file at the path given by the output_training_data_path argument. The proposed modification adds optional support for splitting the newly generated examples across multiple TFRecord shards. A new pack_nbrs argument, num_shards, will control the number of shards generated. This again requires only a minimal code addition, changing

with tf.io.TFRecordWriter(output_training_data_path) as writer:
  for merged_ex in _join_examples(seed_exs, nbr_exs, graph, max_nbrs):
    writer.write(merged_ex.SerializeToString())

to

writers = []
for i in range(num_shards):
  # there could be a better way to generate output TFRecord names
  output_path = '{}-{}-of-{}'.format(output_training_data_path, i, num_shards)
  writers.append(tf.io.TFRecordWriter(output_path))

for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
  writers[i % num_shards].write(merged_ex.SerializeToString())

# close all writers
for writer in writers:
  writer.close()
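
As a possible refinement of the loop above (a suggestion, not part of the proposal as written), contextlib.ExitStack would close every writer even if an exception is raised partway through, and a zero-padded shard index matches common TFRecord naming:

import contextlib

with contextlib.ExitStack() as stack:
  writers = [
      stack.enter_context(tf.io.TFRecordWriter(
          '{}-{:05d}-of-{:05d}'.format(output_training_data_path, i, num_shards)))
      for i in range(num_shards)
  ]
  # Round-robin the merged examples across the shard writers, as above.
  for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
    writers[i % num_shards].write(merged_ex.SerializeToString())
# All writers are closed automatically when the ExitStack exits.
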

@aheydon-google

Hi, Srihari.

Thanks for supplying those details and for offering to contribute to NSL! What you propose sounds like a nice improvement. If you send me a pull request with your proposed changes, I can review it.

Thanks!

  • Allan

@srihari-humbarwadi
Author

Thank you @aheydon-google, I will start working on this!

@sayakpaul
Contributor

Looking forward to this feature.

For bigger datasets, packing all the examples into a single TFRecord file will introduce a substantial bottleneck in the overall data pipeline.
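
To make the bottleneck concern concrete, a rough sketch of how sharded pack_nbrs output could be consumed in a training pipeline (the file pattern and pipeline parameters here are only illustrative):

import tensorflow as tf

files = tf.data.Dataset.list_files('nsl_train-*-of-*.tfrecord', shuffle=True)
dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)  # read shards in parallel
           .shuffle(10_000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
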

@aheydon-google

Hi, Srihari. Do you have any updates to report on this issue? I think it could be quite useful! Thanks, - Allan

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 23, 2021

@aheydon-google I'll push some changes in a couple of days!

@aheydon-google added the enhancement (New feature or request) label on Jan 4, 2022
@aheydon-google

Hi again, Srihari. Are you still planning to work on this issue? I think it would be a great contribution if you can do it!

Thanks,

  • Allan

@aheydon-google

Since this issue has been dormant for quite some time, I'm going to close it. Feel free to send a pull request if you want to implement the improvement!

@csferng
Collaborator

csferng commented Aug 9, 2023

Closing the issue for now.

@csferng closed this as not planned on Aug 9, 2023