# Mangle Data Like A Pro

## Binary Data

We have discussed several ways to manipulate and search text data, but not all data comes in the form of text.  There is an entire world of data that isn't organized into characters, ranging from images, to sound recordings, to compressed files.  When data can't be understood as text, we refer to it as binary data.  To manipulate binary data, we have to work at the level of individual bytes.

## bytes and bytearray

Python 3 introduced two data types that can handle sequences of bytes:

- `bytes` is immutable, like a tuple
- `bytearray` is mutable, like a list of bytes

We will often interpret the sequence of ones and zeros in a byte as a binary number.  Converted to base 10, this means that a byte corresponds to a number between 0 and 255.  For example, the byte 00000101 corresponds to the number 5, while 10000000 corresponds to the number 128.  This is not to suggest that any given byte is meant to represent a number; the data in a byte could be used to represent many different things.  The base 10 representation just gives us a friendly way to write down what data a byte contains.  There is one base 10 integer for each of the 256 possible sequences of ones and zeros in a byte.
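
As a quick check of the correspondence described above, the sketch below converts strings of bits to base 10 integers and back using only built-in functions. The variable names are chosen here for illustration.

```python
# Parse strings of binary digits as base 2 numbers.
five = int("00000101", 2)
one_twenty_eight = int("10000000", 2)
print(five, one_twenty_eight)      # 5 128

# Go the other way: format an integer as eight binary digits.
print(format(5, "08b"))            # 00000101
print(format(128, "08b"))          # 10000000

# Eight bits give 2 ** 8 possible patterns, numbered 0 through 255.
pattern_count = 2 ** 8
print(pattern_count)               # 256
```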

We can use the base 10 notation to create bytes in Python:

In [1]:
byte_value_list = [4, 3, 242]
bytes_test = bytes(byte_value_list)
print(bytes_test)

b'\x04\x03\xf2'


Notice that when our bytes object is printed back out, each byte is represented in hexadecimal notation.  The '\x' characters mean that the two characters that follow should be read as a base 16 number.

To convert from base 16 to base 10, remember that 'a' represents the number 10, 'b' represents 11, and so on up to 'f', which represents 15.  We multiply the first digit by 16 and add the result to the second digit.  For example, the hexadecimal number '\xf2' can be converted to base 10 as follows:

15 * 16 + 2 = 242
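
The same arithmetic can be checked in Python. This is a small sketch using the built-in `int` and `hex` functions; the names `high`, `low`, and `by_hand` are illustrative only.

```python
hex_digits = "f2"

high = int(hex_digits[0], 16)   # 'f' -> 15
low = int(hex_digits[1], 16)    # '2' -> 2
by_hand = high * 16 + low
print(by_hand)                  # 242

# int() can also parse both digits at once, and hex() goes back again.
print(int(hex_digits, 16))      # 242
print(hex(242))                 # 0xf2
```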

Let's see what happens if we try to change a byte in our bytes object.

In [4]:
bytes_test[1] = 129
print(bytes_test)

TypeError: 'bytes' object does not support item assignment

In [5]:
bytearray_test = bytearray(byte_value_list)
print(bytearray_test)

bytearray(b'\x04\x03\xf2')


In [6]:
bytearray_test[1] = 127
print(bytearray_test)

bytearray(b'\x04\x7f\xf2')


Note that the bytes in bytes_test cannot be changed, but those in bytearray_test can.
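
Two related behaviors are worth seeing alongside this. A bytearray only accepts values that fit in one byte (0 through 255), and it can be copied into an immutable bytes object once the edits are done. The sketch below uses a separate, illustrative variable named `scratch` so that it does not disturb the objects defined above.

```python
scratch = bytearray([4, 3, 242])
scratch[1] = 127                 # in-place change works on a bytearray

try:
    scratch[0] = 256             # too large to fit in a single byte
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

frozen = bytes(scratch)          # immutable copy of the current contents
print(frozen)                    # b'\x04\x7f\xf2'
```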

Up to this point, all bytes have been displayed in hexadecimal format.  Most byte values are printed this way, but if a byte value corresponds to a printable ASCII character, that character is printed instead:

In [7]:
bytearray_test[2] = 68
print(bytearray_test)

bytearray(b'\x04\x7fD')


The ASCII number for the letter "D" is 68 in decimal, so "D" was printed to the screen instead of \x44 (its base 16 form).
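
The display choice does not change the stored value. As a brief sketch, the following lines confirm that "D", 68, and \x44 all describe the same byte, and that iterating over a bytes object always yields plain integers.

```python
print(ord("D"))                 # 68
print(hex(68))                  # 0x44
print(bytes([68]))              # b'D'

same_byte = bytes([0x44]) == b"D" == b"\x44"
print(same_byte)                # True

# Whatever the printed form, the contents are integers from 0 to 255.
values = list(b"\x04\x7fD")
print(values)                   # [4, 127, 68]
```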

## Convert Binary Data with struct

Most data scientists will never need to work with data at the level of individual bits.  Byte manipulation is a low-level and challenging area of computer science.  Fortunately, Python has tools to help you deal with many common binary file types.  They can help you convert binary data into Python data structures and Python data structures back into binary data.

Bill Lubanovic provides an example of using Python to detect whether an image is a PNG, a popular format for storing images.  We are going to check whether the file cal-image.png in the current folder is a PNG.  If it is, we'll print out the width and the height from the information contained in the file itself:

In [8]:
import struct

# A with block closes the file automatically, even if the read fails
with open("cal-image.png", "rb") as f:
    data = f.read(24)
    
png_header = b'\x89PNG\r\n\x1a\n'
        
if data[0:8] == png_header:
    width, height = struct.unpack('>LL', data[16:24])
    print(' Valid PNG, width', width, 'height', height)
else:
    print('Not a valid PNG')

 Valid PNG, width 500 height 398


In this example, our first step is to open the file "cal-image.png" in read-only binary mode and read the first 24 bytes of that file.  We'll talk more about reading from files in the next section.

We next test to see if our file is in PNG format.  Fortunately for us, every PNG file begins with the same sequence of 8 bytes, known as a header.  This sequence of bytes is stored in the png_header variable, and we use an if statement to see if it matches the beginning of our file.

If the file header matches, we then inspect bytes 16 through 23 (the slice data[16:24]), which we know hold the width and height of the image.  This is done with the struct.unpack function.  The parameter '>LL' is a format string that explains what type of data we're expecting.  In this case, it specifies two unsigned integers of four bytes each, stored in big endian order, meaning that the most significant byte of each integer comes first.

Finally, we print the encoded width and height to the screen.
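
If cal-image.png is not at hand, the same slicing and unpacking can be tried on bytes built in memory. The sketch below is an illustration under stated assumptions: `fake_data` is a hypothetical stand-in for the first 24 bytes of a file, laid out following the standard PNG structure (8-byte signature, 4-byte chunk length, the 4-byte chunk type IHDR, then width and height). That layout comes from the PNG format itself rather than from the example above.

```python
import struct

png_signature = b"\x89PNG\r\n\x1a\n"

# Hypothetical first 24 bytes of a PNG: signature, chunk length (13),
# chunk type, then width and height as big endian 4-byte integers.
fake_data = (
    png_signature
    + struct.pack(">L", 13)
    + b"IHDR"
    + struct.pack(">LL", 500, 398)
)
print(len(fake_data))                      # 24

if fake_data[0:8] == png_signature:
    width, height = struct.unpack(">LL", fake_data[16:24])
    print("width", width, "height", height)   # width 500 height 398

# Reading the same 8 bytes as little endian gives different numbers,
# which is why the byte order in the format string matters.
wrong_width, wrong_height = struct.unpack("<LL", fake_data[16:24])
print(wrong_width == width)                # False
```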

### Format Specifiers using struct

The following symbols are used in a format string to explain different ways that bytes can be understood as data.

**Endian specifiers**
- **<**: little endian
- **\>**: big endian

**Format specifiers**: the number in parentheses is the number of bytes each format specifier refers to
- **x**: skip a byte (1)
- **b**: signed byte (1)
- **B**: unsigned byte (1)

- **h**: signed short integer (2)
- **H**: unsigned short integer (2)

- **i**: signed integer (4)
- **I**: unsigned integer (4)

- **l**: signed long integer (4)
- **L**: unsigned long integer (4)

- **Q**: unsigned long long integer (8)
- **f**: single precision float (4)
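
The byte counts listed above can be confirmed with struct.calcsize, which reports how many bytes a format string describes. The sketch below also packs one value with each byte order and combines several codes in a single format string. The sizes shown apply when an endian specifier is present; without one, struct uses the platform's native sizes, which may differ.

```python
import struct

# Size in bytes of each code when an endian specifier is given.
sizes = {code: struct.calcsize(">" + code) for code in "xbBhHiIlLQf"}
print(sizes)

# The same value packed with each byte order (258 == 0x0102).
big = struct.pack(">H", 258)
little = struct.pack("<H", 258)
print(big)                              # b'\x01\x02'
print(little)                           # b'\x02\x01'

# Several codes in one format string: signed byte, unsigned short, float.
packed = struct.pack(">bHf", -1, 500, 1.5)
print(len(packed))                      # 7
unpacked = struct.unpack(">bHf", packed)
print(unpacked)                         # (-1, 500, 1.5)
```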

## Other Binary Tools

Depending on the type of binary data you're working with, there are other external packages that might make your task easier. Some options are:

- [bitstring](https://github.com/scott-griffiths/bitstring)
- [construct](http://construct.readthedocs.org/en/latest/)
- [hachoir](https://bitbucket.org/haypo/hachoir/wiki/Home)
- [binio](http://spika.net/py/binio/)

You can install any of these packages with pip, for example:

`pip install construct`

## Convert Bytes/Strings with binascii

You can use the standard binascii module to convert binary data to string representations of various types.

Let's say we want to see the pure hexadecimal representation of the PNG header:

In [9]:
print(png_header)

b'\x89PNG\r\n\x1a\n'


In [10]:
import binascii
png_header_in_hex = binascii.hexlify(png_header)
print(png_header_in_hex)

b'89504e470d0a1a0a'


In [11]:
print(binascii.unhexlify(png_header_in_hex))

b'\x89PNG\r\n\x1a\n'
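
As an aside, bytes objects also have built-in methods that cover this particular conversion. The sketch below uses `bytes.hex()` and `bytes.fromhex()`, available in Python 3.5 and later, and checks them against binascii. Note that `hex()` returns a str, while `hexlify` returns bytes.

```python
import binascii

png_header = b"\x89PNG\r\n\x1a\n"

hex_text = png_header.hex()                 # a str, not a bytes object
print(hex_text)                             # 89504e470d0a1a0a

round_trip = bytes.fromhex(hex_text)
print(round_trip == png_header)             # True

# binascii.hexlify gives the same digits, but as bytes.
same_digits = binascii.hexlify(png_header).decode("ascii") == hex_text
print(same_digits)                          # True
```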


## Bit Operators

At the lowest level, you can use Python to manipulate data one bit at a time.  We won't spend time practicing these bit-level operations, but they have a lot of uses.  For example, you may need to manipulate bits when optimizing a program for high performance.  The examples in the table below use a = 5 (binary 0b0101) and b = 1 (binary 0b0001).

| Operator | Description  | Example | Decimal result | Binary result                             |
|----------|--------------|---------|----------------|-------------------------------------------|
| &        | and          | a & b   | 1              | 0b0001                                    |
| &#124;        | or           | a &#124; b   | 5              | 0b0101                                    |
| ^        | exclusive or | a ^ b   | 4              | 0b0100                                    |
| ~        | flip bits    | ~a      | -6             | binary representation depends on int size |
| <<       | left shift   | a << 1  | 10             | 0b1010                                    |
| >>       | right shift  | a >> 1  | 2              | 0b0010                                    |
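
The table can be reproduced directly, taking a = 5 and b = 1, the values consistent with every result shown in it. This is a short sketch; note that `bin` drops leading zeros, so 0b0001 prints as 0b1.

```python
a = 5   # 0b0101
b = 1   # 0b0001

print(a & b, bin(a & b))     # 1 0b1
print(a | b, bin(a | b))     # 5 0b101
print(a ^ b, bin(a ^ b))     # 4 0b100
print(~a)                    # -6
print(a << 1, bin(a << 1))   # 10 0b1010
print(a >> 1, bin(a >> 1))   # 2 0b10

# Python integers have no fixed width, so mask with 0xFF to see
# the flipped bits as they would appear within a single byte.
flipped_byte = format(~a & 0xFF, "08b")
print(flipped_byte)          # 11111010
```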

You now know the basics of manipulating both bytes and text in Python.  When you face a difficult coding challenge, remember that there are many external packages that could help make your life easier.