# Bits and Bytes

## Outline

* Grouping bits and bytes

* Haskell and bytes

* Lazy byte strings

* Example

In this lesson we will highlight a fundamental type that is frequently used in Haskell, the `ByteString` type.

The lesson follows the structure above and works towards an example of why we like to use byte strings.

## Grouping bits and bytes

At their core, computers only handle two binary values, 1 and 0. A single such value is known as a **bit**.

To make sense of these bits and extract meaning from them, different kinds of groupings have been devised over the years to represent structure.

For an extensive list see [(1)](https://en.wikipedia.org/wiki/Binary-to-text_encoding). At the top of the list you will find the very popular ASCII grouping.

These groupings are called **encodings** and map groups of bits to a more meaningful, readable form.

Another advantage of such a mapping is that a long sequence of bits becomes significantly shorter and therefore easier to read. Below we will look at two encodings.

One useful binary conversion is the hexadecimal encoding of bits, abbreviated to hex. This changes the representation from base 2 (binary) to base 16.

The conversion groups 4 bits together into one symbol. Each hex symbol thus has 16 possible values, written as 0-9 and A-F.

This results in the following conversion table.

| Binary | Hex | Binary | Hex |
|--------|-----|--------|-----|
| `0000` | 0 | `1000` | 8 |
| `0001` | 1 | `1001` | 9 |
| `0010` | 2 | `1010` | A |
| `0011` | 3 | `1011` | B |
| `0100` | 4 | `1100` | C | 
| `0101` | 5 | `1101` | D |
| `0110` | 6 | `1110` | E |
| `0111` | 7 | `1111` | F |

In turn, hex symbols are often grouped in pairs to represent 8 bits. A group of 8 bits is called a **byte** and is used as a unit more commonly than single bits.

As a result, two grouped hex symbols range from `00000000` to `11111111` in binary, or `00` to `FF` in hex. In total, these represent the decimal numbers from 0 to 255.

Below is an example of how to convert a decimal number to hex and binary. We can use it to check the table above.

In [None]:
import Numeric (showHex, showIntAtBase)
import Data.Char (intToDigit)

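-- Print a number in hexadecimal; showHex uses lower-case digits (a-f)
printHex n = Prelude.putStrLn $ showHex n ""
-- Print a number in binary, without leading zeros
printBaseTwo n = Prelude.putStrLn $ showIntAtBase 2 intToDigit n ""

printHex 10      -- prints a
printBaseTwo 10  -- prints 1010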
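
To compare a value directly with the conversion table, it helps to pad the output to a fixed width. The following is a minimal sketch, not part of the original lesson: the helper names `padLeft`, `byteToBinary` and `byteToHex` are made up for illustration, and the block is written as a standalone program with a `main` rather than as a notebook cell.

```haskell
import Data.Char (intToDigit, toUpper)
import Data.Word (Word8)
import Numeric (showHex, showIntAtBase)

-- Hypothetical helper: left-pad a string with zeros up to the given width.
padLeft :: Int -> String -> String
padLeft width s = replicate (width - length s) '0' ++ s

-- Show a byte as exactly eight binary digits.
byteToBinary :: Word8 -> String
byteToBinary b = padLeft 8 (showIntAtBase 2 intToDigit b "")

-- Show a byte as exactly two upper-case hex digits.
byteToHex :: Word8 -> String
byteToHex b = map toUpper (padLeft 2 (showHex b ""))

main :: IO ()
main = mapM_ report [0, 10, 172, 255]
  where
    -- Print the binary and the hex form of one byte side by side.
    report b = putStrLn (byteToBinary b ++ " -> " ++ byteToHex b)
```

Each half of the eight binary digits corresponds to one hex digit, so `10101100` (decimal 172) splits into `1010` and `1100`, which the table maps to `A` and `C`.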

Another important and widely used encoding, one that connects bits with text, is the 8-bit Unicode Transformation Format (UTF-8).

This standard represents each Unicode character in 1 to 4 bytes, depending on the character.

The encoding uses a variable number of bytes because not every character is used equally often.

To optimize for transmission time and storage space, only the most used characters, the 128 ASCII characters, are represented with one byte.

The next most used characters are represented with 2 bytes, and so on.
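
The variable length can be observed by counting the bytes that UTF-8 needs for a few characters. The sketch below assumes the `text` and `bytestring` packages are available and uses the `Data.Text` and `Data.Text.Encoding` modules, which are introduced later in this lesson. The helper name `utf8Length` is made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: number of bytes UTF-8 needs for a single character.
utf8Length :: Char -> Int
utf8Length = BS.length . TE.encodeUtf8 . T.singleton

main :: IO ()
main = mapM_ report samples
  where
    -- 'a' (ASCII), U+00E9 (e with acute accent), U+20AC (euro sign), U+1F600 (an emoji)
    samples = ['a', '\xE9', '\x20AC', '\x1F600']
    report c = putStrLn (show c ++ " needs " ++ show (utf8Length c) ++ " byte(s)")
```

The four sample characters need 1, 2, 3 and 4 bytes respectively.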

## Haskell bytes and encodings

### ByteString in Haskell

In Haskell, we have several types that capture the encoding of characters and their representation as bytes. We will introduce the most important ones below.

We start with the lowest-level structure, which carries no encoding: the type `ByteString`. This is a sequence of bytes, stored as a packed array, that can be viewed in several ways depending on context.

We will look at two common ways:

| A `ByteString` viewed as | Description |
|---|---|
| a list of type `Word8` | `Word8` is the standard way of representing a byte in Haskell. This view adds no extra structure to the byte string. |
| a list of type `Char` | This view interprets every single byte as one character with a code point between 0 and 255. |

The byte strings from these modules are strict.

To convert a list of bytes to a byte string and back, we can use the functions `pack` and `unpack` from the `Data.ByteString` module.
```haskell
pack :: [GHC.Word.Word8] -> ByteString
unpack :: ByteString -> [GHC.Word.Word8]
```

To convert to and from the `[Char]` type instead, we need the `pack` and `unpack` functions from the **Data.ByteString.Char8** module.

As an example, consider the two ways of representing a byte string below and how they are printed to standard output. 

In [None]:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import GHC.Word (Word8)

bytestring :: BC.ByteString
bytestring = BC.pack "Hello world"

asBytes :: [Word8]
asBytes = BS.unpack bytestring
asChars :: [Char]
asChars = BC.unpack bytestring

print asBytes
print asChars

First, we convert a string to a byte string using the `pack` function. Then we unpack this byte string as a list of bytes and as a list of characters with the two `unpack` functions.

Lastly, we print both lists to the output. We use qualified imports for both modules because they export functions with clashing names, like the two `unpack` functions used here.

If you add the language pragma `{-# LANGUAGE OverloadedStrings #-}` at the beginning of the code, you can define the `bytestring` variable without the `pack` function.

In [None]:
{-# LANGUAGE OverloadedStrings #-}

bytestring :: BC.ByteString
bytestring = "Hello world"

### Data.Text module

There is another type for representing strings in Haskell that is called `Text`. It is part of the **Data.Text** module.

In contrast to the type `String`, which is a linked list of characters, the `Text` type is implemented internally as an array, which makes it more efficient.

`Text` also uses strict evaluation by default. If you want a lazy version you can use the **Data.Text.Lazy** module.

Like the `ByteString` type, it comes with the `pack` and `unpack` helper functions for conversion to and from the `String` type.
```haskell
pack :: String -> T.Text
unpack :: T.Text -> String
```

As with `ByteString`, you can add the *OverloadedStrings* language pragma to your code and then define `Text` variables without the `pack` function.

In [None]:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

text :: T.Text
text = "Some text"

string :: String
string = T.unpack text

The **Data.Text** module offers `Text` versions of many functions that Prelude and `Data.List` provide for processing `String` values, together with some extras.

Among them are `words`, `lines`, `splitOn` and `intercalate`.
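
As a brief illustration of these functions, the sketch below splits and rejoins a small piece of text. The sample values and the names `csvLine` and `fields` are made up for this example, and the block is written as a standalone program.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Made-up sample input.
csvLine :: T.Text
csvLine = "alpha,beta,gamma"

-- Split the line on every comma.
fields :: [T.Text]
fields = T.splitOn "," csvLine

main :: IO ()
main = do
  print fields                                -- the three separate fields
  TIO.putStrLn (T.intercalate " | " fields)   -- rejoin with a new separator
  print (T.words "split on   whitespace")     -- runs of spaces count as one break
  print (T.lines "first\nsecond")             -- split on newlines
```

Because all of these functions work on `Text` directly, no conversion through `String` is needed.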

The `Text` type can also handle any Unicode character. To print `Text` values you need to import the **Data.Text.IO** module.

In [None]:
import qualified Data.Text.IO as TIO

thaiText :: T.Text
thaiText = "ประเทศไทย มีชื่ออย่างเป็นทางการว่า ราชอาณาจักรไทย"

TIO.putStrLn thaiText

### Text encoding

If you want to safely convert back and forth between the `Text` and `ByteString` types, you need to use the **Data.Text.Encoding** module.

It contains two useful functions for converting to and from `ByteString`, the same type that the **Data.ByteString.Char8** module works with.
```haskell
encodeUtf8 :: T.Text -> BC.ByteString
decodeUtf8 :: BC.ByteString -> T.Text
```

In [None]:
import qualified Data.Text.Encoding as TE

thaiTextSafe :: BC.ByteString
thaiTextSafe = TE.encodeUtf8 thaiText

TIO.putStrLn (TE.decodeUtf8 thaiTextSafe)
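
Note that `decodeUtf8` throws an exception when the bytes are not valid UTF-8. When the input comes from an untrusted source, the same module also offers `decodeUtf8'`, which returns an `Either` instead. The following is a minimal sketch; the wrapper name `safeDecode` is made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical wrapper: decode bytes as UTF-8 and report invalid input
-- as an error message instead of throwing an exception.
safeDecode :: BS.ByteString -> Either String T.Text
safeDecode bytes = case TE.decodeUtf8' bytes of
  Left err   -> Left (show err)
  Right text -> Right text

main :: IO ()
main = do
  print (safeDecode (BS.pack [72, 105]))  -- two valid ASCII bytes
  print (safeDecode (BS.pack [199]))      -- first byte of a two-byte sequence only
```

The first call succeeds with the text `Hi`, while the second yields a `Left` value because a lone leading byte is not a complete UTF-8 sequence.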

Interpreting each byte of a byte string as a `Char` means that not all UTF-8 encoded text in a byte string is interpreted correctly.

We mentioned that UTF-8 encodes characters in up to 4 bytes, so characters that take two or more bytes will not be displayed correctly this way.

To highlight this, we encode the character "ǿ", which is encoded into the two bytes `[199,191]`.

We use the `Data.Text.Encoding` module, which implements the Unicode encodings for `Text` in Haskell, to perform the encoding.

In [None]:
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString as BS

-- BS.unpack gives us the raw byte values
bytestring :: BS.ByteString
bytestring = TE.encodeUtf8 $ T.pack "ǿ"

print $ BS.unpack bytestring -- prints both bytes
(print . BS.unpack . BS.singleton . BS.head) bytestring -- prints only first byte

-- BC.unpack interprets every byte as a separate character
bytestringChar :: BC.ByteString
bytestringChar = TE.encodeUtf8 $ T.pack "ǿ"

putStrLn $ BC.unpack bytestringChar -- prints one character per byte, not "ǿ"
(putStrLn . BC.unpack . BC.singleton . BC.head) bytestringChar -- prints character for first byte

The "ǿ" character had to be packed with the **Data.Text** module in order to use the `encodeUtf8` function on it.

Here the function `T.pack` converts the string into the type `Text` and the function `encodeUtf8` correctly converts this into the corresponding `ByteString`.

We can annotate the result as a `BS.ByteString` or a `BC.ByteString`; both names refer to the same type. The difference lies in the `unpack` functions: with `BS.unpack` we see only the numeric byte values.

When we view the byte string as a list of `Char` with `BC.unpack`, two characters are printed, one for each byte, instead of the single character "ǿ".

In conclusion, the most common characters are encoded in a single byte in UTF-8, and for those the `Data.ByteString.Char8` module suffices.

But if other characters are involved, consider using the `Data.Text` and `Data.Text.Encoding` modules.
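
The opposite direction shows the same limitation: packing a character above code point 255 with `Data.ByteString.Char8` keeps only its lowest 8 bits. The sketch below contrasts this with UTF-8 encoding through `Text`. The helper names `viaChar8` and `viaUtf8` are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Hypothetical helper: pack with Char8, which truncates each character to 8 bits.
viaChar8 :: String -> [Word8]
viaChar8 = BS.unpack . BC.pack

-- Hypothetical helper: go through Text and UTF-8, which keeps the full character.
viaUtf8 :: String -> [Word8]
viaUtf8 = BS.unpack . TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  -- "\x01FF" is the character U+01FF used in the example above
  print (viaChar8 "\x01FF")  -- a single, truncated byte
  print (viaUtf8 "\x01FF")   -- the two UTF-8 bytes
```

With `viaChar8` the character is silently reduced to the single byte 255 and cannot be recovered, whereas `viaUtf8` produces the two bytes `[199,191]`.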

## Lazy byte strings

Besides the module `Data.ByteString`, Haskell also has a lazy variant of byte strings.

These byte strings work the same way, but have the advantage that their contents are only evaluated when they are used.

This is especially useful when processing large amounts of data that do not need to be read into memory all at once.

The module that contains these lazy byte strings is `Data.ByteString.Lazy`. It is used in much the same way as the strict variant.

In [None]:
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified GHC.Word

bytestring :: BLC.ByteString
bytestring = BLC.pack "Hello world"

asBytes :: [GHC.Word.Word8]
asBytes = BL.unpack bytestring

asChars :: [Char]
asChars = BLC.unpack bytestring 

print $ Prelude.head asBytes
print $ Prelude.head asChars
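
To make the laziness visible, the following sketch builds a conceptually endless lazy byte string and takes only a prefix of it. It is a small illustration written as a standalone program, not part of the original notebook. The names `endless` and `prefix` are made up.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- A conceptually infinite lazy byte string: "abc" repeated forever.
endless :: BL.ByteString
endless = BL.cycle (BLC.pack "abc")

-- Only the requested prefix is ever evaluated.
prefix :: Int64 -> BL.ByteString
prefix n = BL.take n endless

main :: IO ()
main = BLC.putStrLn (prefix 8)
```

The same definition with a strict byte string could never finish, because the whole value would have to be built in memory first.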

Haskell also lets you switch between lazy and strict byte strings.

In [None]:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

lazyByteString :: BLC.ByteString
lazyByteString = BLC.pack "Hello world"

strictByteString :: BC.ByteString
strictByteString = BL.toStrict lazyByteString

lazyByteStringAgain :: BLC.ByteString
lazyByteStringAgain = BL.fromStrict strictByteString

BC.putStrLn strictByteString
BLC.putStrLn lazyByteStringAgain

Here, we first create a lazy byte string with encoded text `Hello world`. 

Then we convert this lazy byte string to a strictly evaluated byte string using the function `toStrict` from the lazy module. 

Lastly, we convert that strict byte string back again to a lazy byte string.
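
Internally, a lazy byte string is a sequence of strict chunks, which is why the two variants convert into each other so directly. The sketch below makes the chunks visible with `fromChunks` and `toChunks`. The names `chunked` and `joined` are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Build a lazy byte string from two strict chunks.
chunked :: BL.ByteString
chunked = BL.fromChunks [BC.pack "Hello ", BC.pack "world"]

-- toStrict copies all chunks into one contiguous strict byte string.
joined :: BS.ByteString
joined = BL.toStrict chunked

main :: IO ()
main = do
  print (length (BL.toChunks chunked))  -- number of strict chunks
  BC.putStrLn joined
```

Since `toStrict` has to copy every chunk, it forces the whole lazy byte string into memory, so it is best avoided on very large inputs.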

## Example

In this section we will compare byte strings with strings and their impact on computation time.

We time the calculations by fetching the CPU time before and after the computation is performed.

This is captured in the `time` function that can time a general IO action.

To highlight the performance gain from byte strings, we will use a sizable file (16 MB).

It is available if you run this notebook in a Binder environment with the Haskell kernel enabled.

This file will be read as a string and as a byte string, then the last line of the file will be printed.

First, we will read it as a string with the function `readFile :: FilePath -> IO String`.

In [None]:
import System.IO 
import System.CPUTime
import Text.Printf 

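-- Run an IO action and print the CPU time it consumed
time :: IO t -> IO t
time a = do
    start <- getCPUTime
    v <- a
    end   <- getCPUTime
    -- getCPUTime reports picoseconds, so divide by 10^12 to get seconds
    let diff = fromIntegral (end - start) / (10^12)
    printf "Computation time: %0.3f sec\n" (diff :: Double)
    return v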

main :: IO ()
main = do
    fileContent <- readFile "/home/jovyan/ihaskell_examples/ihaskell-hvega/hvega-frames-and-gaia.ipynb" 
    (putStrLn . Prelude.last . Prelude.lines) fileContent
    
time main

Now we will read the same file, but as a byte string with `BS.readFile :: FilePath -> IO ByteString`. We will see a big improvement in the measured time.

In [None]:
import qualified Data.ByteString as BS      
import qualified Data.ByteString.Char8 as BC  

main :: IO ()
main = do
    fileContent <- BS.readFile "/home/jovyan/ihaskell_examples/ihaskell-hvega/hvega-frames-and-gaia.ipynb" 
    (BC.putStrLn . Prelude.last . BC.lines) fileContent

time main

## Recap

In this lesson, we have discussed:
- various bit and byte encodings

- Haskell representations of bits and bytes

- the Text type and text encodings

- the lazy byte string type

- an example that shows the performance benefits