
Add PcapDataset #303

Merged Jun 19, 2019 (13 commits into tensorflow:master)

Conversation

@ivelin (Contributor) commented Jun 17, 2019

@yongtang this is a pull request for a PcapDataset.
The C++ code builds and the Python unit test passes.
Hopefully the code is in compliance with TF I/O best practices. Looking forward to your comments.

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


@yongtang (Member) left a comment

@ivelin Thanks for the PR! Overall it looks impressive. There are only a couple of changes needed to support DatasetV2.

The lint is processed by .travis/lint.python.sh; the remaining issues are just tab/space changes.

Also, could you sign the CLA as noted by the Google bot?

for (size_t i = 0; i < out_tensors->size(); i++) {
  Tensor tensor = (*out_tensors)[i].Slice(0, *record_read);
  (*out_tensors)[i] = std::move(tensor);
}
@yongtang (Member) commented on this diff

Slicing the over-reserved memory in the Tensor is handled automatically by the parent class:

for (size_t i = 0; i < chunk_tensors.size(); i++) {
  // In case record_read is smaller than record_to_read, cut a slice
  Tensor tensor = chunk_tensors[i].Slice(0, record_read);
  out_tensors->emplace_back(std::move(tensor));
}

Since PcapInput is a subclass of FileInput, I think lines 187-190 could be deleted.

int64 record_count = 0;
double packet_timestamp;
string packet_data_buffer;
TF_RETURN_IF_ERROR(state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count));
@yongtang (Member) commented on this diff

The returned status might be OutOfRange, which needs to be converted to Status::OK() with (*record_read) == 0 so that the parent class can properly identify the end of the sequence. For that, I think the following will work:

      // OutOfRange marks the end of the sequence (see note above); only real errors are returned here.
      Status status = state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count);
      if (!(status.ok() || errors::IsOutOfRange(status))) {
        return status;
      }


@property
def output_types(self):
  return tuple([dtypes.float64, dtypes.string])
@yongtang (Member) commented on this diff

We are moving toward DatasetV2. See PR #268 for some background about DatasetV2.

DatasetV2 is actually quite a bit simpler than DatasetV1 (which is invoked when tensorflow.compat.v1 is used). Below is a code snippet that matches DatasetV2. You can delete everything else in this file and it should work:

import tensorflow as tf
from tensorflow_io.core.python.ops import data_ops as data_ops
from tensorflow_io import _load_library
pcap_ops = _load_library('_pcap_ops.so')

class PcapDataset(data_ops.Dataset):
  """A pcap Dataset. Pcap is a popular file format for capturing network packets.
  """

  def __init__(self, filenames, batch=None):
    """Create a pcap Reader.

    Args:
      filenames: A `tf.string` tensor containing one or more filenames.
      batch: Number of packets to return per dataset element; 0 or `None`
        returns one packet at a time.
    """
    batch = 0 if batch is None else batch
    dtypes = [tf.float64, tf.string]
    shapes = [
        tf.TensorShape([]), tf.TensorShape([])] if batch == 0 else [
            tf.TensorShape([None]), tf.TensorShape([None])]
    super(PcapDataset, self).__init__(
        pcap_ops.pcap_dataset,
        pcap_ops.pcap_input(filenames),
        batch, dtypes, shapes)
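
For illustration only, here is a minimal usage sketch under the assumption that the class above is in place; the file path and batch value are placeholders and not part of the PR:

# Hypothetical usage sketch; the pcap path below is just an example.
dataset = PcapDataset(["file:///tmp/http.pcap"])           # one packet per element (scalar shapes)
batched = PcapDataset(["file:///tmp/http.pcap"], batch=8)  # batched packets (shape [None])

# With eager execution enabled (see the test discussion below), iteration is direct:
for packet_timestamp, packet_data in dataset:
  print(packet_timestamp.numpy(), len(packet_data.numpy()))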



if __name__ == "__main__":
  test.main()
@yongtang (Member) commented on this diff

Also, as part of the effort to move to DatasetV2, we want the test to focus on the TF 2.0 environment. In TF 2.0, the biggest difference is that eager execution is enabled by default, so there is no need for tf.Session anymore.

Since we still want to support TF 1.x, we test TF 1.x with eager execution (which is non-default there). The following enables eager execution in TF 1.x and is compatible with TF 2.0 either way:

if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()

With eager execution enabled, iterating over the dataset can be simplified to

for v in dataset:
  print(v)

and the OutOfRange exception is handled automatically.
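
For contrast, here is a rough sketch of the TF 1.x graph-mode (tf.Session) pattern that eager execution makes unnecessary; it uses generic TF 1.x APIs and a toy dataset, and is not code from this PR:

# Graph-mode sketch, for comparison only; not part of this PR.
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])  # any dataset, e.g. a PcapDataset
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()
with tf.compat.v1.Session() as sess:
  try:
    while True:
      print(sess.run(next_element))
  except tf.errors.OutOfRangeError:
    pass  # in graph mode the end of the dataset must be caught explicitly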

Overall, here is the test case:

"""
Test PcapDataset
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf
if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()
import tensorflow_io.pcap as pcap_io # pylint: disable=wrong-import-position

def test_pcap_input():
  """test_pcap_input
  """
  print("Testing PcapDataset")
  pcap_filename = os.path.join(
      os.path.dirname(os.path.abspath(__file__)), "test_pcap", "http.pcap")
  file_url = "file://" + pcap_filename
  url_filenames = [file_url]
  dataset = pcap_io.PcapDataset(url_filenames, batch=1)

  packets_total = 0
  for v in dataset:
    (packet_timestamp, packet_data) = v
    if packets_total == 0:
      assert packet_timestamp.numpy()[0] == 1084443427.311224 # we know this is the correct value in the test pcap file
      assert len(packet_data.numpy()[0]) == 62 # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1
  assert packets_total == 43 # we know this is the correct number of packets in the test pcap file

if __name__ == "__main__":
  tf.test.main()

Additional notes:

1. We use .travis/lint.python.sh to lint Python code:

curl -o .pylint -sSL https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/ci_build/pylintrc

find . -name \*.py | xargs pylint --rcfile=.pylint

2. TF's eager execution does not play well with non-eager mode within the same process. For that reason, our CI runs the tests in two invocations: one for files ending in _eager.py and another for files not ending in _eager.py. With pcap supporting DatasetV2, the file name could be changed to tests/test_pcap_eager.py.

@ivelin (Contributor, Author) commented Jun 17, 2019

@yongtang Thank you for the constructive and detailed feedback. I will try to work on the fixes over the weekend between flights.


Just a note that this PR is linked to Issue #264.

@googlebot commented

CLAs look good, thanks!

@ivelin (Contributor, Author) commented Jun 18, 2019

Changes applied and committed. In my local environment there seems to be a v1 vs. v2 Python library mixup. I am trying to trace the root cause of this new error in the test case:

========================================================================== ERRORS ==========================================================================
___________________________________________________________ ERROR collecting tests/test_pcap.py ____________________________________________________________
tests/test_pcap.py:24: in <module>
    import tensorflow_io.pcap as pcap_io # pylint: disable=wrong-import-position
tensorflow_io/pcap/__init__.py:25: in <module>
    from tensorflow_io.pcap.python.ops.pcap_ops import PcapDataset
tensorflow_io/pcap/python/ops/pcap_ops.py:17: in <module>
    from tensorflow_io.core.python.ops import data_ops as data_ops
tensorflow_io/core/python/ops/data_ops.py:24: in <module>
    class BaseDataset(tf.compat.v2.data.Dataset):
E   AttributeError: module 'tensorflow._api.v1.compat' has no attribute 'v2'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================= 1 error in 1.82 seconds ==================================================================

Updated TF to the 2.0 beta and rebuilt TF I/O from source; it didn't help.

@yongtang (Member) commented

@ivelin I think your version of TensorFlow is likely 1.13.1? That would explain the "has no attribute 'v2'" error. TensorFlow 1.13.1 was released too early and didn't pick up the API change.

DatasetV2 requires TensorFlow 1.14.0+ or 2.0.0+.

You can install it with pip install tensorflow==1.14.0rc1. Once installed, you also need to re-run

./configure.sh
bazel build -s --verbose_failures //tensorflow_io/...

so that Bazel picks up the configuration changes.

Also, you may want to change tests/test_pcap.py to tests/test_pcap_eager.py.

The reason is that TF's eager execution does not play well with non-eager mode within the same process, so our CI runs the tests in two invocations: one for files ending in _eager.py and another for files not ending in _eager.py. Renaming tests/test_pcap.py to tests/test_pcap_eager.py will resolve another issue in Travis CI.

@yongtang (Member) commented

The PR looks good to me and all tests passed. Thanks for the contribution! 👍 🎉

@yongtang yongtang merged commit e9ab766 into tensorflow:master Jun 19, 2019
@ivelin ivelin deleted the add-pcap branch June 19, 2019 04:28
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Initial push for pcap support in tensorflow io. Still untested.

* updates to finalize the initial version of pcap dataset

* fixed syntax errors

* added debug info

* added more debug info. Still working on crash in pcap_input.cc

* working version with debug data

* working version for pcap dataset

* updated git email address

* email address

* email

* applied Yong's recommended changes for TF v2
python lint

* renamed test_pcap.py to test_pcap_eager.py
lint pcap/pcap_ops.py