Migrate Dpkt to Python 3 (Sub project of GSoC 2015 Honeynet Project)

Hi There, I'm Hao Sun. I was engaged in the dpkt project last summer. Let me share more details for dpkt background and jobs I've done within 3 months.

What is dpkt?

dpkt is a Python library that helps with “fast, simple packet creation/parsing, with definitions for the basic TCP/IP protocols”. It supports a lot of protocols (currently about 63) and has been increasingly used in a lot of network security projects. It is 44x faster than Scapy2, and 5x faster than Impacket3. With Scapy no longer in development, dpkt is the only network creation/parsing library for Python that is active.

Which aspects need to be improved for dpkt?

Firstly more test cases need to be added to expand the test coverage. Secondly we need to update the code to offer Python 3 support. Lastly some pending bugs or issues need to be solved, also the documentation needs to be improved.

My accomplishments during GSoC

Made the dpkt Python 3 compatible. The trick is that we need to keep support for Python 2 as well. There are many potential code updates to achieve this goal. Also this is the major job I've done during last summer. See technical details in the next section of this blog.
Fixed bugs in the project issue queue.
Added some test cases to improve test coverage and also added a few documentations.

Approaches to migrate dpkt to Python 3

We use the following overall process to migrate dpkt from Python 2 to Python 3.

Use 2to3 to automatically apply syntax and other obvious changes in Python 3.
Run the test cases in migrated module, making sure it could pass all the tests on both Python 2 and 3. If there are problems, go to Step 3, otherwise finish migrating the current module.
Manually fix the problems (maybe most of which are caused by string and bytes type change in Python 3).

Based on the overall process, the rest of this section is organised as follows:

First we'll check out what favor 2to3 could do for us. Then we'll dive into details on the manual fix. Since the "bytes and string" problem is a big issue, we'll discuss this tricky issue at the end of this section.

1. 2to3 initial conversion

2to3 can automatically solve the following migration issues.

int and long

Python 2 has two integer types int and long. We may have the code as follows in Python 2.

tmp = ~crc & 0xffffffffL

These have been unified in Python 3, so there is now only one type, int. Just get rid of L.

tmp = ~crc & 0xffffffff

The Python 2 print statement is in Python 3 a function.

Python 2

print '%s : time = %f kstones = %f' % (function.__name__, time, kstones)

Python 2 & 3

print('%s : time = %f kstones = %f' % (function.__name__, time, kstones))

In Python 3 the syntax to catch exceptions have changed.

Python 2

except struct.error, e:

Python 2 & 3

except struct.error as e:

Relative import.

Python 3 changes the syntax for imports from within a package, requiring you to use the relative import syntax, saying from . import mymodule instead of the just import mymodule.

Dictionary methods.

In Python 2 dictionaries have the methods iterkeys(), itervalues() and iteritems() that return iterators instead of lists. In Python 3 the standard keys(), values() and items() return dictionary views, which are iterators, so the iterator variants become pointless and are removed.

Note that 2to3 would replace the old dictionary methods with the new ones in Python 3. If we do not care about the efficiency. Just keep this change and using the new syntax. However, as the Python doc points out:

dict.items(): Return a copy of the dictionary’s list of (key, value) pairs.

dict.iteritems(): Return an iterator over the dictionary’s (key, value) pairs.

It is recommended to modify the code as follows.

try:
   values = d.itervalues()
except AttributeError:
   values = d.values()

repr()

In Python 2 we can generate a string representation of an expression by enclosing it with backticks. However in Python 3 we need to use repr() function instead.

next method of the iterator

In Python 2 iterators have a .next() method you use to get the next value from the iterator. For instance,

>>> i = iter(range(5))
>>> i.next()
0
>>> i.next()
1

This special method has in Python 3 been renamed to .__next__() to be consistent with the naming of special attributes elswhere in Python. However, we should generally not call it directly, but instead use the builtin is next() function. This function is also available from Python 2.6. Here is an example.

for _ in range(cnt):
   try:
       ts, pkt = next(iter(self))

2. Manual fix issues

import and dict related syntax.

The import and dict are both changed in Python 3. Note that the 2to3 automatic update cannot provide compatible code for both Python 2 and 3. Thus we should update the code based on different cases. Here are examples.

Python 2

from StringIO import StringIO

Python 2 & 3

try:
   from StringIO import StringIO
except ImportError:
   from io import StringIO

Metaclass related issues

Based on http://python-3-patterns-idioms-test.readthedocs.org/en/latest/Metaprogramming.html.

The equivalent of:

class C: pass

is:

C = type('C', (), {})

Thus the metaclass syntax in dpkt.py module can be modified, which is both Python 2 and 3 compatible, as follows.

class Packet(_MetaPacket("Temp", (object,), {}))

Integer division

In Python 2, the result of dividing two integers will itself be an integer; in other words 1/2 returns 0. In Python 3 integer division will return an integer only if the result is a whole number. So 1/2 will return 0.5.

For instance, in dpkt.py, the original code is.

cnt = (n / 2) * 2
a = array.array('H', buf[:cnt])

which would cause the following exception.

` a = array.array('H', buf[:cnt])

TypeError: slice indices must be integers or None or have an index method `

So the code needs to be changed to

cnt = (n // 2) * 2
a = array.array('H', buf[:cnt])

3. bytes and str related issues

This is a big issue when migrating dpkt to Python 3. The changes between bytes and str are listed as follows.

In Python 2, you use str objects to hold binary data and ASCII text, while text data that needs more characters than what is available in ASCII is held in unicode objects. In Python 3, instead of str and unicode objects, you use bytes objects for binary data and str objects for all kinds of text data, Unicode or not.

When update the original code to support Python 3. We need to keep an eye on the following aspects.

string and bytes literals

If the original Python 2 string is holding byte data, we need to change them be bytes literals by adding a leading b to them.

This occurs dozens of times in the project. We need to inspect carefully which strings are holding byte data and change the type of literals. For instance, in many test cases, we might have statements looks like

ip = IP(id=0, src='\x01\x02\x03\x04', dst='\x01\x02\x03\x04', p=17)

Defintely we need to add a leading b in the two strings, which become

ip = IP(id=0, src=b'\x01\x02\x03\x04', dst=b'\x01\x02\x03\x04', p=17)

Change str() to bytes() where necessary

It is common case when we use dpkt to convert a protocol object, e.g. IP, TCP, etc., to string form. Such as

assert (str(ip) == s)

At this time, it is essential to change the code as follows

assert (bytes(ip) == s)

As a consequence, this change leads to the next key point - __str__ and __bytes__ function update.

__str__ and __bytes__ function

As aforementioned str() to bytes() update. We have to change the implementation of __str__ and __bytes__ respectively. Most of dpkt modules do not have a __bytes__ yet, because in Pythnon 2, its funcionality is exactly the same as __str__. However in Python 3, things become different. In my experience, in most situations dpkt is dealing with bytes data. Thus it is important to provide __bytes__ implementation for every needed class.

For instance, the origin __str__ fuction of IP class is

def __str__(self):
       self.len = self.__len__()
       if self.sum == 0:
           self.sum = dpkt.in_cksum(self.pack_hdr() + str(self.opts))
           if (self.p == 6 or self.p == 17) and (self.off & (IP_MF | IP_OFFMASK)) == 0 and \
                   isinstance(self.data, dpkt.Packet) and self.data.sum == 0:
               # Set zeroed TCP and UDP checksums for non-fragments.
               p = str(self.data)
               s = dpkt.struct.pack('>4s4sxBH', self.src, self.dst,
                                    self.p, len(p))
               s = dpkt.in_cksum_add(0, s)
               s = dpkt.in_cksum_add(s, p)
               self.data.sum = dpkt.in_cksum_done(s)
               if self.p == 17 and self.data.sum == 0:
                   self.data.sum = 0xffff  # RFC 768
                   # XXX - skip transports which don't need the pseudoheader
       return self.pack_hdr() + str(self.opts) + str(self.data)

Now we modify the __str__ and add __bytes__ function as follows.

def __str__(self):
       return str(self.__bytes__())
   
def __bytes__(self):
       self.len = self.__len__()
       if self.sum == 0:
           self.sum = dpkt.in_cksum(self.pack_hdr() + bytes(self.opts))
           if (self.p == 6 or self.p == 17) and (self.off & (IP_MF | IP_OFFMASK)) == 0 and \
                   isinstance(self.data, dpkt.Packet) and self.data.sum == 0:
               # Set zeroed TCP and UDP checksums for non-fragments.
               p = bytes(self.data)
               s = dpkt.struct.pack('>4s4sxBH', self.src, self.dst,
                                    self.p, len(p))
               s = dpkt.in_cksum_add(0, s)
               s = dpkt.in_cksum_add(s, p)
               self.data.sum = dpkt.in_cksum_done(s)
               if self.p == 17 and self.data.sum == 0:
                   self.data.sum = 0xffff  # RFC 768
                   # XXX - skip transports which don't need the pseudoheader
       return self.pack_hdr() + bytes(self.opts) + bytes(self.data)

Please carefully check the differences between to get a perceptual understanding of how to update __bytes__ and __str__.

chr and ord built-in function

For ord(c), given a string of length one, it'll return an integer representing the Unicode code point of the character when the argument is a unicode object, or the value of the byte when the argument is an 8-bit string. While for chr(i), it'll return a string of one character whose ASCII code is the integer i.

In Python 2, both of the two function's usage is straight forward. For example, we have the following code snippet.

l = buf.split(chr(IAC))

However, in Python 3, please note that most time in dpkt we'll deal with data with the type of bytes. Thus it is improper if the buf is of the type of bytes while the chr() function returns str. In order to solve this problem, we update the code as follows to provide support for both Python 2 and 3.

if sys.version_info < (3,):
   l = buf.split(chr(IAC))
else:
   l = buf.split(struct.pack("B", IAC))

Similarly, for ord function, we could have the snippet as follows,

o = ord(w[0])

where w is a string and o is an integer. Yet in Python 3, every element of bytes array is an integer, thus it is no need for the calling of ord any more.

Due to the expandability consideration, we add a compatible module in the project, and it'll provide some functions that are both compatible for Python 2 and 3. Currently there is only one function, namely, ord. Please see the implementation below.

if sys.version_info < (3,):
   def compatible_ord(char):
       return ord(char)
else:
   def compatible_ord(char):
       return char

Using the compatible module, the contributor only need to modify the client code as follows.

o = compatible.compatible_ord(w[0])

Future work

I think there are three aspects of major improvements for dpkt project. Firstly the migration related modifications are in a standalone branch has hasn't been merged to master yet, after a thorough test, we can finish the merging and release in the future. Secondly, when I write this blog, there are still 60 open issues on the dashboard, it would take a relatively long time to fix all of them. Lastly, there is little time left last time for me to write a detailed documentation. dpkt is a cool library, however it would be more popular if we could further improve its documentation and demos.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate Dpkt to Python 3 (Sub project of GSoC 2015 Honeynet Project)

What is dpkt?

Which aspects need to be improved for dpkt?

My accomplishments during GSoC

Approaches to migrate dpkt to Python 3

Future work

Clone this wiki locally