
Add PcapDataset #303

Merged Jun 19, 2019 (13 commits into tensorflow:master)

Conversation

@ivelin (Contributor) commented Jun 17, 2019

@yongtang this is a pull request for a PcapDataset.
The C++ code builds and the Python unit test passes.
Hopefully the code is in compliance with TF I/O best practices. Looking forward to your comments.

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


@yongtang (Member) left a comment

@ivelin Thanks for the PR! Overall it looks impressive. There are only a couple of changes needed to support DatasetV2.

The lint is processed by .travis/lint.python.sh; the remaining issues are just tab/space changes.

Also, could you sign the CLA as noted by the Google bot?

for (size_t i = 0; i < out_tensors->size(); i++) {
  Tensor tensor = (*out_tensors)[i].Slice(0, *record_read);
  (*out_tensors)[i] = std::move(tensor);
}
@yongtang (Member) commented on this diff

Slicing the over-reserved memory in the Tensor is handled automatically by the parent class:

for (size_t i = 0; i < chunk_tensors.size(); i++) {
  // In case record_read is smaller than record_to_read, cut a slice
  Tensor tensor = chunk_tensors[i].Slice(0, record_read);
  out_tensors->emplace_back(std::move(tensor));
}

Since PcapInput is a subclass of FileInput, I think lines 187-190 could be deleted.

int64 record_count = 0;
double packet_timestamp;
string packet_data_buffer;
TF_RETURN_IF_ERROR(state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count));
@yongtang (Member) commented on this diff

The returned status might be OutOfRange, which needs to be converted to Status::OK() with (*record_read) == 0 so that the parent class can properly identify the end of the sequence. For that, I think the following will work:

      // OutOfRange marks the end of the sequence (see note above); only real errors are returned here.
      Status status = state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count);
      if (!(status.ok() || errors::IsOutOfRange(status))) {
        return status;
      }


@property
def output_types(self):
  return tuple([dtypes.float64, dtypes.string])
@yongtang (Member) commented on this diff

We are moving toward DatasetV2. See PR #268 for some background about DatasetV2.

DatasetV2 is actually quite a bit simpler than DatasetV1 (which is invoked when tensorflow.compat.v1 is used). Below is a code snippet that matches DatasetV2. You can delete everything else in this file and it should work:

import tensorflow as tf
from tensorflow_io.core.python.ops import data_ops as data_ops
from tensorflow_io import _load_library
pcap_ops = _load_library('_pcap_ops.so')

class PcapDataset(data_ops.Dataset):
  """A pcap Dataset. Pcap is a popular file format for capturing network packets.
  """

  def __init__(self, filenames, batch=None):
    """Create a pcap Reader.

    Args:
      filenames: A `tf.string` tensor containing one or more filenames.
      batch: Number of packets to return per dataset element; 0 or `None`
        returns one packet at a time.
    """
    batch = 0 if batch is None else batch
    dtypes = [tf.float64, tf.string]
    shapes = [
        tf.TensorShape([]), tf.TensorShape([])] if batch == 0 else [
            tf.TensorShape([None]), tf.TensorShape([None])]
    super(PcapDataset, self).__init__(
        pcap_ops.pcap_dataset,
        pcap_ops.pcap_input(filenames),
        batch, dtypes, shapes)
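
For illustration only, here is a minimal usage sketch under the assumption that the class above is in place; the file path and batch value are placeholders and not part of the PR:

# Hypothetical usage sketch; the pcap path below is just an example.
dataset = PcapDataset(["file:///tmp/http.pcap"])           # one packet per element (scalar shapes)
batched = PcapDataset(["file:///tmp/http.pcap"], batch=8)  # batched packets (shape [None])

# With eager execution enabled (see the test discussion below), iteration is direct:
for packet_timestamp, packet_data in dataset:
  print(packet_timestamp.numpy(), len(packet_data.numpy()))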



if __name__ == "__main__":
  test.main()
@yongtang (Member) commented on this diff

Also, as part of the effort to move to DatasetV2, we want the test to focus on the TF 2.0 environment. In TF 2.0, the biggest difference is that eager execution is enabled by default, so there is no need for tf.Session anymore.

Since we still want to support TF 1.x, we test TF 1.x with eager execution (which is non-default there). The following enables eager execution in TF 1.x and is compatible with TF 2.0 either way:

if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()

With eager execution enabled, iterating over the dataset can be simplified to

for v in dataset:
  print(v)

and the OutOfRange exception is handled automatically.
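
For contrast, here is a rough sketch of the TF 1.x graph-mode (tf.Session) pattern that eager execution makes unnecessary; it uses generic TF 1.x APIs and a toy dataset, and is not code from this PR:

# Graph-mode sketch, for comparison only; not part of this PR.
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])  # any dataset, e.g. a PcapDataset
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()
with tf.compat.v1.Session() as sess:
  try:
    while True:
      print(sess.run(next_element))
  except tf.errors.OutOfRangeError:
    pass  # in graph mode the end of the dataset must be caught explicitly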

Overall, here is the test case:

"""
Test PcapDataset
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf
if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()
import tensorflow_io.pcap as pcap_io # pylint: disable=wrong-import-position

def test_pcap_input():
  """test_pcap_input
  """
  print("Testing PcapDataset")
  pcap_filename = os.path.join(
      os.path.dirname(os.path.abspath(__file__)), "test_pcap", "http.pcap")
  file_url = "file://" + pcap_filename
  url_filenames = [file_url]
  dataset = pcap_io.PcapDataset(url_filenames, batch=1)

  packets_total = 0
  for v in dataset:
    (packet_timestamp, packet_data) = v
    if packets_total == 0:
      assert packet_timestamp.numpy()[0] == 1084443427.311224 # we know this is the correct value in the test pcap file
      assert len(packet_data.numpy()[0]) == 62 # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1
  assert packets_total == 43 # we know this is the correct number of packets in the test pcap file

if __name__ == "__main__":
  tf.test.main()

Additional notes:

1. We use .travis/lint.python.sh to lint Python code:

curl -o .pylint -sSL https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/ci_build/pylintrc

find . -name \*.py | xargs pylint --rcfile=.pylint

2. TF's eager execution does not play well with non-eager mode within the same process. For that reason, our CI runs the tests in two invocations: one for files ending in _eager.py and another for files not ending in _eager.py. With pcap supporting DatasetV2, the file name could be changed to tests/test_pcap_eager.py.

@ivelin (Contributor, Author) commented Jun 17, 2019

@yongtang Thank you for the constructive and detailed feedback. I will try to work on the fixes over the weekend between flights.


Just a note that this PR is linked to Issue #264.

@googlebot commented

CLAs look good, thanks!

@ivelin (Contributor, Author) commented Jun 18, 2019

Changes applied and committed. In my local environment there seems to be a v1 vs. v2 Python library mixup. I am trying to trace the root cause of this new error in the test case:

========================================================================== ERRORS ==========================================================================
___________________________________________________________ ERROR collecting tests/test_pcap.py ____________________________________________________________
tests/test_pcap.py:24: in <module>
    import tensorflow_io.pcap as pcap_io # pylint: disable=wrong-import-position
tensorflow_io/pcap/__init__.py:25: in <module>
    from tensorflow_io.pcap.python.ops.pcap_ops import PcapDataset
tensorflow_io/pcap/python/ops/pcap_ops.py:17: in <module>
    from tensorflow_io.core.python.ops import data_ops as data_ops
tensorflow_io/core/python/ops/data_ops.py:24: in <module>
    class BaseDataset(tf.compat.v2.data.Dataset):
E   AttributeError: module 'tensorflow._api.v1.compat' has no attribute 'v2'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================= 1 error in 1.82 seconds ==================================================================

Updated TF to the 2.0 beta and rebuilt TF I/O from source; it didn't help.

@yongtang (Member) commented

@ivelin I think your version of TensorFlow is likely 1.13.1? That would explain the "has no attribute 'v2'" error. TensorFlow 1.13.1 was released too early and didn't pick up the API change.

DatasetV2 requires TensorFlow 1.14.0+ or 2.0.0+.

You can install it with pip install tensorflow==1.14.0rc1. Once installed, you also need to re-run

./configure.sh
bazel build -s --verbose_failures //tensorflow_io/...

so that Bazel picks up the configuration changes.

Also, you may want to change tests/test_pcap.py to tests/test_pcap_eager.py.

The reason is that TF's eager execution does not play well with non-eager mode within the same process, so our CI runs the tests in two invocations: one for files ending in _eager.py and another for files not ending in _eager.py. Renaming tests/test_pcap.py to tests/test_pcap_eager.py will resolve another issue in Travis CI.

@yongtang (Member) commented

The PR looks good to me and all tests passed. Thanks for the contribution! 👍 🎉

@yongtang yongtang merged commit e9ab766 into tensorflow:master Jun 19, 2019
@ivelin ivelin deleted the add-pcap branch June 19, 2019 04:28
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Initial push for pcap support in tensorflow io. Still untested.

* updates to finalize the initial version of pcap dataset

* fixed syntax errors

* added debug info

* added more debug info. Still working on crash in pcap_input.cc

* working version with debug data

* working version for pcap dataset

* updated git email address

* email address

* email

* applied Yong's recommended changes for TF v2
python lint

* renamed test_pcap.py to test_pcap_eager.py
lint pcap/pcap_ops.py