# Mangle Data Like A Pro

## Binary Data

We have gone over several ways to manipulate and search text data. However, sometimes you have to dig deeper into the actual data, because characters are too abstract or meaningless for what you are trying to process. One good example is detecting roads in satellite images: those pixels are encoded as a series of bytes. So here are some ways to manage binary data.

## Bytes and bytearray

Python 3 introduced two data types that can handle sequences of bytes with possible values from 0 to 255:

- `bytes` is immutable, like a tuple
- `bytearray` is mutable, like a list of bytes

For example:

In [3]:
byte_value_list = [4, 3, 242]
bytes_test = bytes(byte_value_list)
print(bytes_test)

b'\x04\x03\xf2'


In [4]:
bytes_test[1] = 129
print(bytes_test)

TypeError: 'bytes' object does not support item assignment

In [5]:
bytearray_test = bytearray(byte_value_list)
print(bytearray_test)

bytearray(b'\x04\x03\xf2')


In [6]:
bytearray_test[1] = 127
print(bytearray_test)

bytearray(b'\x04\x7f\xf2')


Note how `bytes_test` cannot be changed but `bytearray_test` can. Also note that when we print bytes, each value is shown as `\x` followed by its two-digit hexadecimal representation. If the byte value corresponds to a printable ASCII character, that character is printed instead. For example:

In [7]:
bytearray_test[2] = 68
print(bytearray_test)

bytearray(b'\x04\x7fD')


The ASCII code for the letter "D" is 68 in decimal, so "D" was printed to the screen instead of `\x44` (68 in base 16).
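A small sketch of the same display rule, using a fresh variable named `data` (a name chosen here for illustration) rather than the cells above. It shows that the escape form and the letter form describe the same byte, and that indexing gives back the plain integer:

```python
# Three byte values: 4 and 127 are not printable ASCII, 68 is the letter "D".
data = bytes([4, 127, 68])

print(data)              # b'\x04\x7fD'
print(hex(68))           # 0x44
print(b'\x44' == b'D')   # True: two spellings of the same byte
print(data[2])           # 68: indexing returns an int, not a character
print(list(data))        # [4, 127, 68]
```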

## Convert Binary Data with struct

Python also offers tools to convert Python data structures directly into bytes and back again. The standard `struct` module handles the intricacies that come with reading bytes from an external source, for example:

- Are the bytes stored in big endian (most significant byte first) or little endian (least significant byte first)?
- What kind of values are they? Signed? Unsigned? Short or long integers?
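To make both questions concrete, here is a minimal sketch that packs one small number in each byte order and then reads a single byte as signed and as unsigned. The variable names are illustrative only:

```python
import struct

value = 1  # a small number makes the byte order easy to see

# Pack the same unsigned 4-byte integer in both byte orders.
big = struct.pack('>L', value)      # most significant byte first
little = struct.pack('<L', value)   # least significant byte first
print(big)      # b'\x00\x00\x00\x01'
print(little)   # b'\x01\x00\x00\x00'

# Reading big endian bytes as little endian gives a different number.
wrong = struct.unpack('<L', big)[0]
print(wrong)    # 16777216

# The same byte means different things as signed or unsigned.
signed = struct.unpack('>b', b'\xf2')[0]
unsigned = struct.unpack('>B', b'\xf2')[0]
print(signed, unsigned)   # -14 242
```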

The book gives an example of how to use Python to detect whether an image is a PNG. We are going to check whether the file cal-image.png in the current folder is a PNG and, if so, print the width and height stored in the file itself:

In [8]:
import struct

# Read only the first 24 bytes; the with block closes the file for us.
with open("cal-image.png", "rb") as f:
    data = f.read(24)

png_header = b'\x89PNG\r\n\x1a\n'

if data[0:8] == png_header:
    width, height = struct.unpack('>LL', data[16:24])
    print(' Valid PNG, width', width, 'height', height)
else:
    print('Not a valid PNG')

 Valid PNG, width 500 height 398


Here is what is happening: we first open the file "cal-image.png" in read-only binary mode and read its first 24 bytes. The next section goes into more depth on reading files.

We then test whether it is indeed a PNG file by checking that the first eight bytes match the png_header variable, which contains the signature found at the start of every PNG image.

If so, we unpack bytes 16 through 23 as two unsigned 4-byte integers. The `>` tells struct that they are stored big endian, meaning the most significant byte of each integer comes first in the byte stream. We have to specify this because each integer spans four bytes.

We get the encoded width and height and print them to the screen.
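The same check can be sketched without a file on disk. The helper `png_dimensions` and the hand-built `sample` bytes below are hypothetical names for illustration. The layout assumed is the standard start of a PNG: the 8-byte signature, a 4-byte chunk length (13), the chunk type `IHDR`, then width and height.

```python
import struct

PNG_HEADER = b'\x89PNG\r\n\x1a\n'

def png_dimensions(data):
    """Return (width, height) if data starts like a PNG file, else None."""
    if len(data) < 24 or data[0:8] != PNG_HEADER:
        return None
    # Bytes 16..23 hold two big endian unsigned 4-byte integers.
    return struct.unpack('>LL', data[16:24])

# Hand-built first 24 bytes of a PNG, using the sizes from the example above.
sample = (PNG_HEADER
          + struct.pack('>L', 13)          # length of the IHDR chunk data
          + b'IHDR'                        # chunk type
          + struct.pack('>LL', 500, 398))  # width, height

print(png_dimensions(sample))        # (500, 398)
print(png_dimensions(b'not a png'))  # None
```

Checking the length first avoids unpacking a slice that is shorter than the eight bytes the format string expects.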

### Format Specifiers using struct

**Endian specifiers**
- **<**: little endian
- **\>**: big endian

**Format specifiers**: the number in parentheses is the number of bytes each format specifier refers to (these are the standard sizes that apply when an endian specifier is given)
- **x**: skip a byte (1)
- **b**: signed byte (1)
- **B**: unsigned byte (1)

- **h**: signed short integer (2)
- **H**: unsigned short integer (2)

- **i**: signed integer (4)
- **I**: unsigned integer (4)

- **l**: signed long integer (4)
- **L**: unsigned long integer (4)

- **Q**: unsigned long long integer (8)
- **f**: single precision float (4)
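One way to check the sizes in this list is `struct.calcsize`, which reports how many bytes a format string describes. The sketch below also packs a small made-up record (the values 1, 2 and 1.5 are arbitrary) and unpacks it again:

```python
import struct

# With an endian specifier, struct uses the standard sizes listed above.
sizes = {code: struct.calcsize('>' + code) for code in 'bhilQf'}
print(sizes)   # {'b': 1, 'h': 2, 'i': 4, 'l': 4, 'Q': 8, 'f': 4}

# Combine specifiers: unsigned byte, unsigned short, single precision float.
packed = struct.pack('>BHf', 1, 2, 1.5)
print(len(packed))                     # 7
print(struct.unpack('>BHf', packed))   # (1, 2, 1.5)

# 'x' skips a byte when unpacking, so only the second byte is returned.
skipped = struct.unpack('>xB', b'\xff\x07')
print(skipped)                         # (7,)
```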

## Other Binary Tools

There are also external packages that may be easier to use in your case. Some options are:

- [bitstring](https://github.com/scott-griffiths/bitstring)
- [construct](http://construct.readthedocs.org/en/latest/)
- [hachoir](https://bitbucket.org/haypo/hachoir/wiki/Home)
- [binio](http://spika.net/py/binio/)

Use pip to install these packages, for example:

`pip install construct`

## Convert Bytes/Strings with binascii

You can use the standard module binascii to convert binary data to string representations of various types.

Let's say we wanted to see the pure hexadecimal representation of the PNG header:

In [9]:
print(png_header)

b'\x89PNG\r\n\x1a\n'


In [10]:
import binascii
png_header_in_hex = binascii.hexlify(png_header)
print(png_header_in_hex)

b'89504e470d0a1a0a'


In [11]:
print(binascii.unhexlify(png_header_in_hex))

b'\x89PNG\r\n\x1a\n'
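As a side note, the `bytes` type itself has `hex()` and `fromhex()` methods (Python 3.5 and later) that do a similar round trip. This sketch redefines `png_header` so it runs on its own. The main difference to keep in mind is that `hex()` returns a `str`, while `binascii.hexlify` returns `bytes`:

```python
import binascii

png_header = b'\x89PNG\r\n\x1a\n'

hex_text = png_header.hex()    # a str, not bytes
print(hex_text)                # 89504e470d0a1a0a

# Same digits as hexlify, just a different type.
same_digits = hex_text == binascii.hexlify(png_header).decode('ascii')
print(same_digits)             # True

# fromhex is the inverse and gives the original bytes back.
round_trip = bytes.fromhex(hex_text)
print(round_trip == png_header)   # True
```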


## Bit Operators

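You can also do bit-level operations on integer values, including individual byte values. Depending on your application, these can be extremely useful for manipulating large data sets very quickly. The examples in the table use `a = 5` (0b0101) and `b = 1` (0b0001):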

| Operator | Description  | Example | Decimal result | Binary result                             |
|----------|--------------|---------|----------------|-------------------------------------------|
| &        | and          | a & b   | 1              | 0b0001                                    |
| &#124;        | or           | a &#124; b   | 5              | 0b0101                                    |
| ^        | exclusive or | a ^ b   | 4              | 0b0100                                    |
| ~        | flip bits    | ~a      | -6             | binary representation depends on int size |
| <<       | left shift   | a << 1  | 10             | 0b1010                                    |
| >>       | right shift  | a >> 1  | 2              | 0b0010                                    |
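The table rows can be reproduced with a short sketch, taking `a = 5` and `b = 1` (the values that match the results shown). The `show` helper is a made-up convenience that pads the binary form to four digits. For `~a`, masking with `0xF` is one way to view the flipped bits within four bits, since Python integers have no fixed width:

```python
a = 5   # 0b0101
b = 1   # 0b0001

def show(value):
    """Format a non-negative int as a 4-digit binary string such as 0b0101."""
    return format(value, '#06b')

print(a & b, show(a & b))     # 1 0b0001
print(a | b, show(a | b))     # 5 0b0101
print(a ^ b, show(a ^ b))     # 4 0b0100
print(a << 1, show(a << 1))   # 10 0b1010
print(a >> 1, show(a >> 1))   # 2 0b0010

# ~a is -6 as a Python int; keeping only the low four bits shows 0b1010.
print(~a)                     # -6
print(show(~a & 0xF))         # 0b1010
```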

So now you can manipulate both bytes and text. Check out the external packages as well to make your life easier in the long run. Next we will go into how we can introduce external data into our applications with File I/O.