### Base64 Encoding and Decoding

There are many situations where we have to deal with base64-encoded data.

Using Python, this is pretty straightforward.

The one catch is that the base64 encoding functions Python provides work on bytes (or byte arrays), not on strings directly.

The first thing we'll likely need is to encode strings to bytes and decode bytes to strings.

In Python, strings are sequences of Unicode code points, and the `str` class provides an `encode` method that converts a string to bytes using a given encoding.

In [1]:
s = "ye old cheese shoppe"

In [2]:
s.encode("ascii")

b'ye old cheese shoppe'

In [3]:
s.encode("utf-8")

b'ye old cheese shoppe'

As you can see from the output, the `encode` method returns a `bytes` object (a byte string) - we can also see it this way:

In [4]:
type(s.encode("utf-8"))

bytes

Those encoded strings look the same because the characters in `s` were all ASCII characters, but you may not always be dealing with just ASCII characters, so adjust your encoding accordingly (you might even need other codecs such as UTF-16).
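To make the point about codecs concrete, here is a small sketch (the variable names are only illustrative) showing that even plain ASCII text produces different bytes under a different codec, which is why the codec used to decode has to match the one used to encode:

```python
text = "cheese"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes)   # b'cheese'
print(utf16_bytes)  # b'c\x00h\x00e\x00e\x00s\x00e\x00'

# Decoding with the matching codec recovers the original text
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```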

Let's throw in some non-ASCII characters and encode things again:

In [5]:
s = "Python rocks! 😀"

In [6]:
s.encode("utf-8")

b'Python rocks! \xf0\x9f\x98\x80'

But an `ascii` encoding will not work:

In [7]:
try:
    s.encode("ascii")
except UnicodeError as ex:
    print(ex)

'ascii' codec can't encode character '\U0001f600' in position 14: ordinal not in range(128)


When we have a byte string, we can convert it back to a string by using the `decode` method that the `bytes` class implements:

In [8]:
encoded = s.encode("utf-8")

In [9]:
encoded.decode("utf-8")

'Python rocks! 😀'

To base64 encode a string, the next step is to use the `b64encode` function in the `base64` module.

In [10]:
import base64

In [11]:
s

'Python rocks! 😀'

We have to convert our string to a byte string first, and then base64 encode it.

In [12]:
encoded = s.encode("utf-8")
base64_encoded = base64.b64encode(encoded)
base64_encoded

b'UHl0aG9uIHJvY2tzISDwn5iA'

As you can see, we get a byte string back, which we can convert to a string if needed:

In [13]:
base64_encoded.decode("utf-8")

'UHl0aG9uIHJvY2tzISDwn5iA'

To decode a base64 encoded string, we just need to reverse the process:

1. encode the string into a byte string
2. base64 decode the byte string
3. decode that result back into a string (if required)

In [14]:
b = 'UHl0aG9uIHJvY2tzISDwn5iA'
encoded = b.encode("utf-8")
b64 = base64.b64decode(encoded)
result = b64.decode("utf-8")
print(result)

Python rocks! 😀


There's one thing to know about decoding base64 encoded strings - the encoded string length should technically be a multiple of 4. 

Let's look at this:

In [15]:
enc1 = base64.b64encode('Python rocks! 😀'.encode("utf-8"))
enc2 = base64.b64encode('Python rocks! 😀😀'.encode("utf-8"))
enc3 = base64.b64encode('Python rocks! 😀😀😀'.encode("utf-8"))
enc4 = base64.b64encode('Python rocks! 😀😀😀😀'.encode("utf-8"))

print(enc1, len(enc1), sep="\t")
print(enc2, len(enc2), sep="\t")
print(enc3, len(enc3), sep="\t")
print(enc4, len(enc4), sep="\t")

b'UHl0aG9uIHJvY2tzISDwn5iA'	24
b'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA=='	32
b'UHl0aG9uIHJvY2tzISDwn5iA8J+YgPCfmIA='	36
b'UHl0aG9uIHJvY2tzISDwn5iA8J+YgPCfmIDwn5iA'	40


As you can see, all the encoded lengths are a multiple of 4, and `=` characters are used to pad the length as needed.
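The reason is that base64 turns every group of 3 input bytes into 4 output characters, and a final partial group is padded up to 4. As a rough sketch (the helper name `expected_b64_length` is made up for this illustration), we can predict the padded length from the number of input bytes:

```python
import base64
import math

def expected_b64_length(n_bytes: int) -> int:
    # Each group of up to 3 input bytes maps to 4 output characters
    return 4 * math.ceil(n_bytes / 3)

for text in ("Python rocks! 😀", "Python rocks! 😀😀", "Python rocks! 😀😀😀"):
    raw = text.encode("utf-8")
    encoded = base64.b64encode(raw)
    # prints 18 24 24, then 22 32 32, then 26 36 36
    print(len(raw), len(encoded), expected_b64_length(len(raw)))
    assert len(encoded) == expected_b64_length(len(raw))
```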

Python takes care of that for us automatically when base64 encoding strings - however, not all systems do that. In fact, we often find API tokens that are base64 encoded, but not padded.
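One way such unpadded values can come about is that the producing system strips the trailing `=` characters after encoding. Here is a minimal sketch of that, assuming the producer does nothing more than strip the padding (the variable names are illustrative):

```python
import base64

padded = base64.b64encode("Python rocks! 😀😀".encode("utf-8"))
unpadded = padded.rstrip(b"=")  # drop the trailing padding characters

print(padded)             # b'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA=='
print(unpadded)           # b'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA'
print(len(unpadded) % 4)  # 2, so the length is no longer a multiple of 4
```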

In [16]:
s = 'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA'

If we try to decode this string, we'll get an exception:

In [17]:
try:
    base64.b64decode(s.encode("utf-8"))
except Exception as ex:
    print(ex)

Incorrect padding


The fix is easy: we simply need to check the length of our byte string and pad it appropriately, using `=` characters.

Let's try it:

In [18]:
s = 'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA'
encoded = s.encode("utf-8")
encoded += b"=="
base64.b64decode(encoded).decode("utf-8")

'Python rocks! 😀😀'

The interesting thing is that if we add extra `=` characters, Python (in its default, non-validating mode) will just ignore them, so the padded length does not even need to be a multiple of 4:

In [19]:
s = 'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA'
encoded = s.encode("utf-8")
encoded += b"====="
base64.b64decode(encoded).decode("utf-8")

'Python rocks! 😀😀'

So if we think about the length of a string that is not a multiple of 4, we would need to pad it with at most 3 `=` characters (valid base64 actually never needs more than 2), and so we can always just add three `=` characters without any issues.
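If you would rather restore exactly the padding that is missing instead of relying on the decoder ignoring extras, the number of missing characters can be computed from the length. A minimal sketch, where `pad_base64` is a hypothetical helper name used only for this example:

```python
import base64

def pad_base64(s: str) -> str:
    # -len(s) % 4 is the number of characters needed to reach a multiple of 4
    return s + "=" * (-len(s) % 4)

s = 'UHl0aG9uIHJvY2tzISDwn5iA8J+YgA'
padded = pad_base64(s)
print(padded)  # UHl0aG9uIHJvY2tzISDwn5iA8J+YgA==
print(base64.b64decode(padded.encode("ascii")).decode("utf-8"))  # Python rocks! 😀😀
```

A string whose length is already a multiple of 4 is returned unchanged, since `-len(s) % 4` is then 0.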

So given all this, let's go ahead and write some functions to base64 encode and decode strings:

In [20]:
def encode(s: str, encoding: str = "utf-8") -> str:
    encoded = s.encode(encoding)
    b64_encoded = base64.b64encode(encoded)
    # base64 output is always ASCII, whatever text encoding was used above
    return b64_encoded.decode("ascii")

In [21]:
def decode(s: str, encoding: str = "utf-8") -> str:
    # base64 text is always ASCII; `encoding` applies only to the decoded payload
    encoded = s.encode("ascii")
    b64_decoded = base64.b64decode(encoded + b"===")
    return b64_decoded.decode(encoding)

In [22]:
decode('UHl0aG9uIHJvY2tzISDwn5iA8J+YgA')

'Python rocks! 😀😀'
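As a quick sanity check of the padding trick, the sketch below round-trips a few strings whose encoded forms need zero, one, or two padding characters, stripping the padding first. The helper `decode_unpadded` is a hypothetical stand-in that uses the same approach as `decode`, defined here only so the block runs on its own:

```python
import base64

def decode_unpadded(s: str, encoding: str = "utf-8") -> str:
    # Same approach as above: surplus "=" characters are ignored by b64decode
    return base64.b64decode(s.encode("ascii") + b"===").decode(encoding)

for text in ("abc", "abcd", "abcde", "Python rocks! 😀"):
    # Encode, then strip the padding the way an unpadded token would arrive
    token = base64.b64encode(text.encode("utf-8")).decode("ascii").rstrip("=")
    assert decode_unpadded(token) == text

print("all round trips OK")
```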

In [23]:
encode("I'm a lumberjack, and I'm OK")

'SSdtIGEgbHVtYmVyamFjaywgYW5kIEknbSBPSw=='

In [24]:
decode(encode("I'm a lumberjack, and I'm OK"))

"I'm a lumberjack, and I'm OK"

And that's it for base64 encoding!