# Haskell and Bytes

## Outline

* Grouping bits and bytes
* Haskell, bytes and encodings
* Lazy byte strings
* Example

In this lesson we highlight a fundamental type that is frequently used in Haskell: the `ByteString` type. The lesson follows the outline above and works towards an example that shows why we like to use byte strings.

## Grouping bits and bytes
At their core, computers only handle two binary values, 1 and 0. This unit is known as a **bit**. To make sense of these bits and extract meaning from them, different kinds of grouping have been devised over the years to represent structure; for an extensive list see [(1)](https://en.wikipedia.org/wiki/Binary-to-text_encoding). These groupings are called **encodings**, and they map groups of bits to a more meaningful and readable form. Another advantage of this mapping is that a long sequence of bits becomes significantly shorter and therefore easier to read. Below we will look at two of them.

One useful conversion is the hexadecimal encoding of bits, abbreviated to hex. It changes the base from 2 (binary) to 16, so each hex symbol stands for a group of 4 bits. Each symbol thus has 16 possible values, written as 0-9 and A-F. This results in the following conversion table.

| Binary | Hex | Binary| Hex |  
|--------|---|--------| - |
| `0000` | 0 | `1000` | 8 |
| `0001` | 1 | `1001` | 9 |
| `0010` | 2 | `1010` | A |
| `0011` | 3 | `1011` | B |
| `0100` | 4 | `1100` | C | 
| `0101` | 5 | `1101` | D |
| `0110` | 6 | `1110` | E |
| `0111` | 7 | `1111` | F |

In turn, hex symbols are often grouped in pairs to represent 8 bits. A group of 8 bits is called a **byte** and is used far more commonly as a unit than the single bit. Two hex symbols together range from `00000000` to `11111111` in binary, which corresponds to the decimal numbers 0 to 255. Below is an example of how to convert a decimal number to hex and binary. We can use it to check the table above.

In [1]:
import Numeric (showHex, showIntAtBase)
import Data.Char (intToDigit)

printHex n = Prelude.putStrLn $ showHex n ""
printBaseTwo n = Prelude.putStrLn $ showIntAtBase 2 intToDigit n "" 

printHex 10 
printBaseTwo 10

a

1010
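
Note that `showHex` and `showIntAtBase` drop leading zeros, so the output above does not show the full two-hex-digit or eight-bit form of a byte. The following is a minimal sketch of how a byte could be rendered at fixed width. The helper names `padLeft`, `byteToHex` and `byteToBinary` are made up for this illustration and are not part of any library.

```haskell
import Data.Char (intToDigit)
import Data.Word (Word8)
import Numeric (showHex, showIntAtBase)

-- Left-pad a string with '0' characters up to the given width.
-- (Hypothetical helper, introduced only for this sketch.)
padLeft :: Int -> String -> String
padLeft width s = replicate (width - length s) '0' ++ s

-- Render one byte as exactly two hex digits.
byteToHex :: Word8 -> String
byteToHex b = padLeft 2 (showHex b "")

-- Render one byte as exactly eight binary digits.
byteToBinary :: Word8 -> String
byteToBinary b = padLeft 8 (showIntAtBase 2 intToDigit b "")

main :: IO ()
main = mapM_ (\b -> putStrLn (byteToHex b ++ " " ++ byteToBinary b)) [0, 10, 255]
```

With this, the decimal number 10 is shown as `0a` and `00001010`, which matches the row for `A` in the table when read as a full byte.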

Another important and widely used encoding, one that connects computer bits with text, is the 8-bit Unicode Transformation Format (UTF-8). This standard represents each Unicode character in 1 to 4 bytes, depending on the character. The encoding uses a variable number of bytes because not every character is used equally often. To save transmission time and storage space, only the 128 ASCII characters, which include the basic Latin letters, digits and common punctuation, are represented with one byte. Characters with higher code points take 2, 3 or 4 bytes.
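
To make the variable length concrete, here is a small sketch that computes how many bytes UTF-8 needs for a character, based only on its code point. The function name `utf8Length` is an assumption made for this example. The sketch also ignores the fact that surrogate code points cannot be encoded at all.

```haskell
import Data.Char (ord)

-- Number of bytes UTF-8 uses for a character, derived from its code point.
-- (Hypothetical helper; it does not reject surrogate code points.)
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1  -- ASCII range
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

main :: IO ()
main = mapM_ (print . utf8Length) ['a', '\x01FF', '\x20AC', '\x1F600']
```

The four sample characters are "a", "ǿ", the euro sign and an emoji, which need 1, 2, 3 and 4 bytes respectively.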

## Haskell, bytes and encodings
In Haskell, we have several types that capture the encoding of characters and their representation as bytes. We will introduce a few important ones below. We start with the lowest-level structure, which carries no encoding: the type `ByteString`. This is a sequence of bytes that, given context, can be viewed in multiple ways. We will look at two common ones:

| A `ByteString` viewed as | Info |
|---|---|
| a list of type `Word8` | This type is the standard way of representing a byte in Haskell. It adds no extra structure to the byte string. |
| a list of type `Char` | Each byte is interpreted as a single `Char` with a code point from 0 to 255. |

To convert a string to a byte string, we can use the function `pack` from the `Data.ByteString.Char8` module. As an example, consider the two ways of viewing a byte string above and how they are printed to standard output. First, we convert a string to a byte string using the `pack` function. Then we unpack this byte string in two ways: as a list of bytes with `unpack` from `Data.ByteString`, and as a list of characters with `unpack` from the `Char8` module. Lastly, we print both lists. We import both modules `qualified` and with an alias because they export functions with clashing names, like the two `unpack` functions used here. The qualified import forces us to specify which module each function comes from. We will talk more about qualified imports in lesson Pragmas, Modules and Cabal.

In [2]:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

bytestring = BC.pack "Hello world" -- Is of type ByteString

asBytes = BS.unpack bytestring -- Is of type [Word8] 
asChars = BC.unpack bytestring -- Is of type [Char]

print asBytes
print asChars

[72,101,108,108,111,32,119,111,114,108,100]
"Hello world"

Interpreting each byte of a byte string as a `Char` means that not all Unicode-encoded text in a byte string is interpreted correctly. We mentioned that UTF-8 can encode a character in up to 4 bytes, so characters that take two bytes or more will not be displayed correctly this way. To highlight this, we encode the character "ǿ", which UTF-8 encodes as the two bytes `[199,191]`. For proper Unicode encoding, we use the module `Data.Text.Encoding`, which converts between the `Text` type and byte strings in Unicode encodings such as UTF-8.

In [37]:
import qualified Data.Text as Text
import qualified Data.Text.Encoding as Text
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

bytestring :: BS.ByteString
bytestring = Text.encodeUtf8 $ Text.pack "ǿ"

print $ BS.unpack bytestring

-- Take the first byte of the encoded character "ǿ" and try to view it as a Char.
(print . BC.unpack . BS.singleton . BS.head) bytestring

[199,191]

"\199"

Here the function `Text.pack` converts the string into the type `Text`, and the function `encodeUtf8` converts this into the corresponding UTF-8 encoded `ByteString`. When we view the first byte of this byte string as a `Char`, it prints as `\199`. The `Char8` functions map the byte 199 directly to the character with code point 199 ("Ç"), which `print` shows as an escape because it is not an ASCII character. Either way, a single byte cannot recover the original "ǿ". The `Data.Text.Encoding` module also has a `decodeUtf8` function for the reverse conversion.
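
Decoding can fail when the bytes are not valid UTF-8, for example when a multi-byte character is cut in half. The sketch below uses `decodeUtf8'` from `Data.Text.Encoding`, which returns an `Either` instead of throwing an exception. The wrapper name `decodeMaybe` is an assumption made for this illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode bytes as UTF-8, giving Nothing when they are not valid UTF-8.
-- (Hypothetical wrapper around decodeUtf8', introduced for this sketch.)
decodeMaybe :: BS.ByteString -> Maybe T.Text
decodeMaybe = either (const Nothing) Just . TE.decodeUtf8'

main :: IO ()
main = do
  let original = T.singleton '\x01FF'      -- the character "ǿ"
      encoded  = TE.encodeUtf8 original    -- the bytes [199,191]
  -- The full byte sequence decodes back to the original text.
  print (decodeMaybe encoded == Just original)
  -- Only the first byte is an incomplete sequence and does not decode.
  print (decodeMaybe (BS.take 1 encoded))
```

The first `print` confirms the round trip, while the second shows `Nothing` because the byte 199 on its own is only the start of a two-byte sequence.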

In conclusion, the ASCII characters are encoded in a single byte by UTF-8, and for those the `Data.ByteString.Char8` module suffices. If other characters are involved, consider using the `Data.Text` and `Data.Text.Encoding` modules.
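
The difference also shows up when packing. As a rough sketch, the two helpers below (names chosen for this example only) pack the same string once with the `Char8` module, which truncates every character to 8 bits, and once through `Data.Text.Encoding`.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Bytes produced by the Char8 pack, which keeps only the low 8 bits of each Char.
char8Bytes :: String -> [Word8]
char8Bytes = BS.unpack . BC.pack

-- Bytes produced by a proper UTF-8 encoding of the same string.
utf8Bytes :: String -> [Word8]
utf8Bytes = BS.unpack . TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  print (char8Bytes "\x01FF")  -- [255]: the character "ǿ" has been truncated
  print (utf8Bytes "\x01FF")   -- [199,191]
```

For plain ASCII input both helpers give the same bytes, which is why the `Char8` module is safe in that case.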

## Lazy byte strings
Besides the module `Data.ByteString`, Haskell also has a lazy variant of byte strings. These byte strings offer the same interface, but have the advantage that their contents are only evaluated when they are used. This is especially useful when processing large amounts of data that do not need to be read into memory all at once. The module that contains these lazy byte strings is `Data.ByteString.Lazy`, and it is used in the same way as before:

In [36]:
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

bytestring = BLC.pack "Hello world"

asBytes = BL.unpack bytestring -- Is of type [Word8] 
asChars = BLC.unpack bytestring -- Is of type [Char]

print $ Prelude.head asBytes
print $ Prelude.head asChars

72
'H'
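
One way to see the laziness at work is to build a byte string that never ends and consume only a small part of it. This is a minimal sketch using `cycle` and `take` from `Data.ByteString.Lazy`; the names `endless` and `firstFive` are made up for the example.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- An endless lazy byte string; only the part that is consumed is ever built.
endless :: BL.ByteString
endless = BL.cycle (BLC.pack "ab")

-- Taking a finite prefix terminates even though the source is infinite.
firstFive :: String
firstFive = BLC.unpack (BL.take 5 endless)

main :: IO ()
main = putStrLn firstFive  -- prints "ababa"
```

The same construction is not possible with a strict byte string, because a strict byte string has to exist in memory as a whole.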

Haskell also lets you switch between lazy and strict byte strings.

In [35]:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

lazyByteString = BLC.pack "Hello world"
strictByteString = BL.toStrict lazyByteString
lazyByteStringAgain = BL.fromStrict strictByteString

BC.putStrLn strictByteString
BLC.putStrLn lazyByteStringAgain

Hello world

Hello world

Here, we first create a lazy byte string containing the text `Hello world`. Then we convert this lazy byte string to a strictly evaluated byte string using the function `toStrict` from the lazy module. Lastly, we convert that strict byte string back into a new lazy byte string.
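
Internally, a lazy byte string is a sequence of strict chunks, and the conversion functions can be understood in those terms. The sketch below uses `fromChunks` and `toChunks` from `Data.ByteString.Lazy` to make this visible. The helper `chunkCount` is an assumption introduced for this illustration.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- A lazy byte string built explicitly from two strict chunks.
chunked :: BL.ByteString
chunked = BL.fromChunks [BC.pack "Hello ", BC.pack "world"]

-- Count the strict chunks that make up a lazy byte string.
-- (Hypothetical helper, introduced only for this sketch.)
chunkCount :: BL.ByteString -> Int
chunkCount = length . BL.toChunks

main :: IO ()
main = do
  print (chunkCount chunked)                                -- 2
  -- toStrict copies all chunks into one strict byte string,
  -- so converting back gives a lazy byte string with a single chunk.
  print (chunkCount (BL.fromStrict (BL.toStrict chunked)))  -- 1
```

This also hints at the cost of `toStrict`: all chunks have to be available and are copied into one contiguous block of memory.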

## Example

In this section we will compare byte strings with ordinary strings and look at their impact on computation time. We time the calculations by fetching the CPU time before and after the computation is performed. This is captured in the `time` function below, which can time a general IO action. To highlight the performance gain from byte strings, we will make use of a sizable file (16 MB) that is available if you run this notebook in a Binder environment with the Haskell kernel enabled. This file will be read as a string and as a byte string, and each time the last line of the file will be printed. First, we will read it as a string with the function `readFile :: FilePath -> IO String`:

In [7]:
import System.IO   
import System.CPUTime
import Text.Printf 

time :: IO t -> IO t
time a = do
    start <- getCPUTime
    v <- a
    end   <- getCPUTime
    let diff = fromIntegral (end - start) / (10^12)
    printf "Computation time: %0.3f sec\n" (diff :: Double)
    return v

main = readFile "/home/jovyan/ihaskell_examples/ihaskell-hvega/hvega-frames-and-gaia.ipynb" >>= putStrLn . Prelude.last . Prelude.lines
time main

}
Computation time: 5.522 sec

Now we will read the same file, but as a byte string with `BS.readFile :: FilePath -> IO ByteString`. We will see a big improvement in the measured time.

In [8]:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import System.IO
import System.CPUTime
import Text.Printf

time :: IO t -> IO t
time a = do
    start <- getCPUTime
    v <- a
    end   <- getCPUTime
    let diff = fromIntegral (end - start) / (10^12)
    printf "Computation time: %0.3f sec\n" (diff :: Double)
    return v

main = BS.readFile "/home/jovyan/ihaskell_examples/ihaskell-hvega/hvega-frames-and-gaia.ipynb" >>= BC.putStr . Prelude.last . BC.lines
time main

}Computation time: 0.009 sec