# L2b: Fun With Base 16 Numbers
As scientists and engineers, we often encounter situations where the most critical numbering systems for data science and machine learning applications are not floating-point numbers, but rather base-$b$ integers. This might seem surprising, but it's true!

Base-$b$ integers are used in many often overlooked but critical applications. In this lab, we'll explore one such application: the mathematical encoding of text.
* __Backstory__: Computers don't see text the way humans do, as collections of strings, emojis, and punctuation. Instead, computers process text as sequences of binary bits. Each character in text must be mapped to a unique integer (a sequence of binary bits). Since it's difficult for humans to work with long binary sequences, we express these integers using positional notation in various bases, making text just another type of numerical sequence.
* __Why base $b$?__ Aren't base-10 numbers sufficient? While base-10 works, a single Unicode character can occupy up to 4 bytes (32 bits) when encoded. Working with binary numbers that are up to 32 digits long is cumbersome, so we use other bases like hexadecimal (base 16) to create more compact, human-readable representations that can still be efficiently decoded back into the original characters.
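
To make the compactness argument concrete, here is a minimal sketch that writes the same integers in base 2, base 10, and base 16 using the built-in `string(...)` and `parse(...)` functions. The variable names are ours, chosen only for illustration.

```julia
# The same integer written in three bases (65 is the code for 'A').
n = Int('A')                      # 65 in base 10
binary_form = string(n, base=2)   # binary digits as a String
hex_form = string(n, base=16)     # hexadecimal digits as a String

# A larger value: the code of the sushi emoji used later in this lab.
m = 127843
m_binary = string(m, base=2)
m_hex = string(m, base=16)        # note: Julia prints lowercase hex digits

println("$n -> binary $binary_form, hex $hex_form")
println("$m -> binary $m_binary ($(length(m_binary)) digits), hex $m_hex ($(length(m_hex)) digits)")

# Decoding goes the other way with parse(...)
decoded = parse(Int, m_hex, base=16)
println("decoded back to base 10: $decoded")
```

The hexadecimal form needs far fewer digits than the binary form, yet it decodes back to the same integer.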

Let's start at the beginning.
* Traditionally, characters were represented using [the ASCII system](https://en.wikipedia.org/wiki/ASCII). [Standard (7-bit) ASCII](https://www.ascii-code.com/ASCII) uses 7 bits of storage and defines 128 character codes numbered 0 through 127 (each character is mapped to a number ranging from 0 to 127). [Extended (8-bit) ASCII variants](https://www.ascii-code.com) use 8 bits and define 256 values numbered 0 to 255. But are 256 characters enough?
* No! Modern encodings, such as [Unicode encodings like UTF-8 or UTF-16](https://en.wikipedia.org/wiki/Unicode), represent a much wider range of characters and symbols.

In this lab, we'll explore the numerical basis of text, starting with the [7-bit ASCII system](https://en.wikipedia.org/wiki/ASCII) and progressing to [Unicode encodings like UTF-8 or UTF-16](https://en.wikipedia.org/wiki/Unicode). 

___

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file and loading any needed resources.

> __Include__: The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/).

Let's set up the computational environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # what is this doing?

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

___

##  Task 1: Explore the ASCII Character Set
Let's start by considering the representation of the characters in the 7-bit ASCII character system. 

> __Fun fact__: While 7-bit ASCII has been replaced a long time ago, it still lives on in the modern [Unicode system](https://en.wikipedia.org/wiki/Unicode). The original ASCII characters are encoded as the first 128 characters (`0`$\rightarrow$`127`) in [the Unicode system](https://en.wikipedia.org/wiki/Unicode).

Let's start by building the `ascii_char_dictionary::Dict{Int64, Char}` dictionary, which is a mapping between the ASCII character index (an integer) and the character value, which [is type `c::Char` in Julia](https://docs.julialang.org/en/v1/base/strings/#Core.Char).

> __Caution:__ This logic involves some advanced tools and techniques we have yet to discuss. You can skip the implementation details for now; we'll come back to them later. However, there is one interesting method, namely [the `convert(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.convert). This is a direct way to convert between the characters that you and I see and the integers that the computer understands.
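
As a small illustration of `convert(...)` on its own, the sketch below converts one integer to a `Char` and back, and then checks the round trip over the whole 7-bit ASCII range. The variable names here are hypothetical and are not used elsewhere in the lab.

```julia
# convert(Char, n) maps an integer code to the character it labels
c = convert(Char, 65)

# the Int constructor maps the character back to its integer code
n = Int(c)

# the round trip should hold for every 7-bit ASCII code 0, 1, ..., 127
round_trip_ok = all(Int(convert(Char, k)) == k for k in 0:127)

println("65 -> $c -> $n; round trip ok for 0:127? $round_trip_ok")
```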

What do we see?

In [2]:
ascii_char_dictionary = let

    # initialize -
    ascii_char_dictionary = Dict{Int64, Char}(); # storage for index - character map
    ASCII_character_range = range(0,stop=127,step=1) |> collect; # 7-bit ASCII indexes

    # main loop -
    for i ∈ eachindex(ASCII_character_range)
        my_ascii_char_index = ASCII_character_range[i];
        c = convert(Char, my_ascii_char_index) # hmmm. This is an interesting function.
        ascii_char_dictionary[my_ascii_char_index] = c;
    end
    ascii_char_dictionary;
end

Dict{Int64, Char} with 128 entries:
  5   => '\x05'
  56  => '8'
  35  => '#'
  55  => '7'
  110 => 'n'
  114 => 'r'
  123 => '{'
  60  => '<'
  30  => '\x1e'
  32  => ' '
  6   => '\x06'
  67  => 'C'
  45  => '-'
  117 => 'u'
  73  => 'I'
  115 => 's'
  112 => 'p'
  64  => '@'
  90  => 'Z'
  ⋮   => ⋮

`Unhide` the code block to see how we build a table of the ASCII characters using [the `pretty_table(...)` function exported by the PrettyTables.jl package](https://github.com/ronisbr/PrettyTables.jl). Let's look at what the `ASCII` characters are. 
> __Caution:__ This logic involves some advanced tools and techniques we have yet to discuss. You can skip the implementation details for now; we'll come back to them later. TLDR: We build the rows in the table using a [`for-loop`](https://docs.julialang.org/en/v1/base/base/#for) and [the `convert(...)` function](https://docs.julialang.org/en/v1/base/base/#Base.convert), where we push the data for a row into a [DataFrame](https://dataframes.juliadata.org/stable/), and then display the data by [calling the `pretty_table(...)` method](https://github.com/ronisbr/PrettyTables.jl).

So what do we see?

In [3]:
let
    ASCII_index_array = keys(ascii_char_dictionary) |> collect |> sort;
    character_table_df = DataFrame();
    for i ∈ eachindex(ASCII_index_array)
        my_ascii_char_index = ASCII_index_array[i];
        c = ascii_char_dictionary[my_ascii_char_index];

        row = (
            i = my_ascii_char_index,
            character = c
        ); # -> what is going on here? This is a cool type called a NamedTuple ...
        push!(character_table_df,row); # push! -> what is going on here?
    end
    pretty_table(character_table_df, tf=tf_simple)
end

      i   character
  Int64        Char
      0          \0
      1        \x01
      2        \x02
      3        \x03
      4        \x04
      5        \x05
      6        \x06
      7          \a
      8          \b
      9          \t
     10          \n
     11          \v
     12          \f
     13          \r
     14        \x0e
     15        \x0f
     16        \x10
     17        \x11
     18        \x12
     19        \x13
     20        \x14
     21        \x15
     22        \x16
     23        \x17
     24        \x18
     25        \x19
     26        \x1a
     27          \e
     28        \x1c
     29        \x1d
     30        \x1e
     31        \x1f
     32
     33           !
     34           "
     35           #
     36           $
     37           %
     38           &
     39           '
     40           (
     41           )
     42           *
     43           +
     44           ,
     45           -
     46         

In the code blocks above, we explicitly called [the `convert(...)` method](https://docs.julialang.org/en/v1/manual/conversion-and-promotion/) to convert an [`Int`](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Integers) to a [`Char` type](https://docs.julialang.org/en/v1/base/strings/#Core.Char). However, this was not strictly necessary: Julia (like most modern languages) also lets us call a type's constructor directly, e.g., `Int(...)` or `Char(...)`, and it performs the conversion for us if it can. For example:

In [4]:
'c' |> Int # wow. This seems a little magical. What is going on here?? (notice the single quotes), pick a different char?

99

In [5]:
Int('c') # Unpack the |> We can convert between ASCII characters and Int! 

99

In [6]:
Char(99)

'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
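
The same conversions can be applied to many characters at once. The sketch below assumes nothing beyond Base Julia: it uses the broadcast dot syntax to convert a range of characters to integers and back. The variable names are ours.

```julia
# The pipe form and the call form are the same operation
same_result = ('c' |> Int) == Int('c')

# Broadcast Int over a range of characters: 'a', 'b', ..., 'e'
codes = Int.('a':'e')

# Go back from integer codes to characters, then join them into a String
letters = Char.(codes)
word = join(letters)

println("pipe == call? $same_result")
println("codes = $codes, letters = $letters, word = $word")
```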

### Limitations of ASCII
Wow! ASCII seems pretty cool. It only uses 7 bits (which is super simple) to encode 128 characters (numbers, small and capital letters, some math symbols, etc). Why would we ever need anything beyond these 128 characters? As it turns out, there are many reasons. 

Let's look at three:
* The original 7-bit ASCII was limited to 128 characters (0 to 127), so it couldn't represent characters beyond basic English letters, digits, punctuation, and control codes.
* The original 7-bit ASCII character set offered no native support for accented or non-Latin characters (e.g., Cyrillic, Greek, and Arabic), hindering internationalization. It only represented English.
* The original 7-bit ASCII character set omitted modern typographic symbols, mathematical glyphs, and emojis, making it inadequate for rich text or graphical communication. For example, it can't be used to represent rich mathematical text. Bummer!

These limitations, and others, led to the development of [the Unicode system](https://en.wikipedia.org/wiki/Unicode).
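
We can probe the ASCII boundary directly with the built-in `isascii(...)` function. This is a minimal sketch: the accented characters are written with `\u` escapes so the example does not depend on how your editor encodes them, and the variable names are ours.

```julia
e_acute = '\u00e9'          # the accented letter é
word_plain = "naive"
word_accented = "na\u00efve" # naïve, with ï written as an escape

println("isascii('A')  = ", isascii('A'))
println("isascii(é)    = ", isascii(e_acute), " (code ", Int(e_acute), ")")
println("isascii(\"$word_plain\") = ", isascii(word_plain))
println("isascii(\"$word_accented\") = ", isascii(word_accented))
```

Any character whose integer code is greater than 127 falls outside the 7-bit ASCII range, so a single accented letter is enough to make a whole string non-ASCII.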

## Task 2: Explore the Unicode Character Set
Unlike the original 7-bit ASCII character set, which has only 128 characters, the modern [Unicode system](https://en.wikipedia.org/wiki/Unicode) _can encode_ 1,114,112 possible characters (code points). However, as of [Unicode Standard version 16.0 (released September 10, 2024)](https://www.unicode.org/versions/Unicode16.0.0/), only 154,998 of those possible code points are assigned to unique characters; various characters in many languages, a much larger family of mathematical symbols, and [emoji](https://en.wikipedia.org/wiki/Emoji) of all sorts are currently assigned. There is plenty of room to grow!

Technical:
* [Unicode](https://en.wikipedia.org/wiki/Unicode) encodings such as UTF-8 use up to 4 bytes (32 bits) of storage per character, and code points are conventionally written using a base $b = 16$ (hexadecimal) numbering system. Hexadecimal numbers are used as a convenience; they are much shorter than their binary equivalents and, thus, are easier for us to read and write than long sequences of 0s and 1s. While Unicode code points are ultimately stored as binary values, hexadecimal provides a direct and convenient way for us (humans) to represent these values.
* Julia has [built-in Unicode support in the standard library](https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode); thus, we can work with these characters in our programs. For more information on the specific Unicode primitives supported by Julia, check out the [Julia documentation](https://docs.julialang.org/en/v1/manual/unicode-input/).
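
To see the "up to 4 bytes" statement in action, the sketch below asks Julia how many bytes (UTF-8 code units) a few characters need. It assumes Julia's default UTF-8 `String` encoding and a Julia version that provides `ncodeunits(...)` for `Char` values; the sample characters and variable names are our own choices.

```julia
# Four sample characters: A, é, €, and the sushi emoji (written as escapes)
samples = ['A', '\u00e9', '\u20ac', '\U1F363']

# Number of UTF-8 bytes needed for each character
byte_counts = [ncodeunits(c) for c in samples]

for (c, nbytes) in zip(samples, byte_counts)
    label = string(UInt32(c), base=16, pad=4) |> uppercase
    println("U+$label takes $nbytes byte(s) in UTF-8")
end

# The raw bytes that store the sushi emoji
sushi_bytes = codeunits("\U1F363") |> collect
println("sushi bytes: ", sushi_bytes)
```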

Let's do some math with emojis (for fun):

In [7]:
🌽 = 16; # corn = 16 \:corn: then tab
🍣 = 4; # sushi = 4 \:sushi: then tab

We can perform operations with these emoji variables:

In [8]:
🌽 + 🍣  # addition?

20

In [9]:
(🌽 * 🍣) # multiplication?

64

We can also perform logical comparisons. Because Julia has built-in Unicode support, we can use Unicode mathematical symbols to write functional code that resembles standard mathematical notation.
* For example, consider `🌽 = 16` and `🍣 = 4`. We know that $🌽\geq{🍣}$ should return `true`. Let's write the `greater than or equal to` check using Unicode characters:

In [10]:
🌽 ≥ 🍣 # logical comparison? (\geq then tab)

true
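
The Unicode operators are not special magic: in Base Julia they are simply other names for the familiar ASCII operators. A short sketch (plain integer variables `a` and `b` stand in for the emoji variables here):

```julia
a, b = 16, 4

# The Unicode and ASCII spellings give the same answer ...
unicode_check = a ≥ b
ascii_check = a >= b

# ... because they refer to the very same function objects
same_geq = (≥) === (>=)
same_leq = (≤) === (<=)
same_neq = (≠) === (!=)
same_in = (∈) === in

println("a ≥ b: $unicode_check, a >= b: $ascii_check")
println("aliases identical? ", all([same_geq, same_leq, same_neq, same_in]))
```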

We can also determine whether an item is in the given collection. For example, assume we have the character set $\mathbb{C} \equiv \left\{A,B,Q,R,S\right\}$. Let's write an expression to check if $c\in\mathbb{C}$, where $\in$ denotes the `element of` operation, and `c` is some (test) character. 

> Use the [Set data structure](https://docs.julialang.org/en/v1/base/collections/#Base.Set) to do this example. [`Set` is a collection type](https://docs.julialang.org/en/v1/base/collections/#Base.Set) (included in most modern languages) that holds items; sets are `unique` but do not maintain order.
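
Before building $\mathbb{C}$, here is a minimal sketch of the two `Set` properties mentioned above, uniqueness and the absence of ordering. The set `S` below is a throwaway example, not the set used in the lab.

```julia
# Build a set from an array that contains a duplicate 'A'
S = Set(['A', 'B', 'A', 'Q'])

# Duplicates are stored only once
n_items = length(S)

# Order does not matter when comparing sets
same_set = S == Set(['Q', 'B', 'A'])

# Membership checks with the Unicode operators
has_Q = 'Q' ∈ S
missing_Z = 'Z' ∉ S

println("items: $n_items, same set? $same_set, Q in S? $has_Q, Z not in S? $missing_Z")
```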

Specify the set $\mathbb{C}$:

In [11]:
C = let
    C = Set{Char}(); # empty set with Char types
    push!(C,'A'); # add a `A` to set C 
    push!(C,'B'); # ... `B` ...
    push!(C,'Q'); # ... `Q` ...
    push!(C,'R'); # ... `R` ...
    push!(C,'S'); # ... `S` ...

    C # return
end

Set{Char} with 5 elements:
  'A'
  'R'
  'S'
  'Q'
  'B'

Specify the test character `c`:

In [12]:
c = 🌽; # Do we have corn in the set ℂ? (note: c holds the Int value stored in 🌽, i.e., 16)

Check if $c\in\mathbb{C}$:

In [13]:
c ∈ C # ∈ => \in then tab

false

The Unicode character set is extensive and powerful, and Julia's support for Unicode is quite advanced. But how are these characters related to base-16 numbers? Let's explore this connection next.

## Task 3: A deeper dive into Unicode Strings and Codepoints
The built-in [Julia `String` type](https://docs.julialang.org/en/v1/base/strings/) is similar (in some ways) to the traditional text model in languages like [C](https://en.wikipedia.org/wiki/C_(programming_language)), namely, the [`String` type](https://docs.julialang.org/en/v1/base/strings/) is an ordered collection (array) of `Char` types. However, the models for the characters are more complicated.
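
One way the character model is "more complicated" is that characters in a Julia `String` can occupy different numbers of bytes, so string indices are byte positions rather than character counts. The sketch below illustrates this with a hypothetical three-character string; it assumes Julia's default UTF-8 encoding.

```julia
# Three characters: 'a' (1 byte), é (2 bytes), sushi emoji (4 bytes)
s = "a\u00e9\U1F363"

n_chars = length(s)          # number of characters
n_bytes = ncodeunits(s)      # number of bytes used to store them
valid_indices = collect(eachindex(s))  # byte positions where a character starts

# Index 3 is in the middle of é, so indexing there throws a StringIndexError
index_3_fails = try
    s[3]
    false
catch err
    err isa StringIndexError
end

println("characters: $n_chars, bytes: $n_bytes, valid indices: $valid_indices")
println("s[3] throws a StringIndexError? $index_3_fails")
```

Calling `collect(...)` on a string sidesteps this issue, because it returns an ordinary array with one entry per character.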

Let's play around with a `test_string_ascii::String`:

In [14]:
test_string_ascii = "Test String in Julia (notice the double quotes). Python uses both single and double quotes for Strings. 😒";

We convert `test_string_ascii::String` to an `Array{Char,1}` collection using [the `collect(...)` method](https://docs.julialang.org/en/v1/base/collections/#Base.collect-Tuple%7BAny%7D). Each character has a `U+xxxx` code associated with it. That is a base 16 Unicode code point!
> __What?!?__ In a `U+xxxx` Unicode code point, `U+` is followed by four to six hexadecimal digits `xx..x` that uniquely identify a character in the Unicode standard. However, the `U+xxxx` points are really just integers! Thus, we should be able to interconvert between the `U+xxxx` and integer representations.
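
As a quick sanity check of that claim, the sketch below moves between the two representations using only Base Julia. The helper `codepoint_label` is hypothetical (it is our own name, not a library function); it relies on the built-in `codepoint(...)`, `string(...; base, pad)`, and `uppercase(...)` functions.

```julia
# Hypothetical helper (not part of the lab code): format a code point as "U+XXXX"
codepoint_label(c::Char) = "U+" * uppercase(string(codepoint(c), base=16, pad=4))

# Hexadecimal integer literal -> character
sushi = Char(0x1F363)

# Character -> integer (base 10)
sushi_index = Int(sushi)

# The hexadecimal literal 0x54 and the decimal value 84 are the same integer
T_matches = Int('T') == 0x54

label_T = codepoint_label('T')
label_sushi = codepoint_label(sushi)

println("sushi index (base 10): $sushi_index")
println("labels: $label_T, $label_sushi; Int('T') == 0x54? $T_matches")
```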

Let's check that out!

In [15]:
character_array_test_string_ascii = test_string_ascii |> collect

105-element Vector{Char}:
 'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 ⋮
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 '.': ASCI

### How do we calculate a codepoint?
First, let's try to compute a codepoint (index) for an example Unicode character. Consider `🍣`, which equals `U+1F363`. Let's store a test character in the `test_unicode_char::Char` variable:

In [16]:
test_unicode_char = '🍣' # want to select another Unicode character?

'🍣': Unicode U+1F363 (category So: Symbol, other)

We can compute the `base 10` index for a Unicode character by calling [the `Unicode.julia_chartransform(...)` method](https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode.julia_chartransform), which returns the equivalent character used by the Julia parser, and then converting that value to an `Int`. What is the base 10 value for the `test_unicode_char::Char`?

In [17]:
test_char_index = Unicode.julia_chartransform(test_unicode_char) |> Int

127843

Couldn't we also just do this:

In [18]:
test_unicode_char |> Int # oh yeah!

127843
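
Both routes agree for the sushi emoji because `Unicode.julia_chartransform(...)` leaves most characters unchanged. A case where it does change the character is the micro sign, which the Julia parser treats as equivalent to the Greek letter mu. The sketch below assumes a Julia version that ships `julia_chartransform` (1.8 or later); the variable names are ours.

```julia
using Unicode

micro_sign = '\u00b5'   # micro sign
greek_mu = '\u03bc'     # Greek small letter mu

# The transform maps the micro sign onto the Greek mu ...
transformed = Unicode.julia_chartransform(micro_sign)

# ... so the two routes to an integer index differ for this character
index_direct = Int(micro_sign)
index_transformed = Int(transformed)

# For an ordinary character, nothing changes
unchanged = Unicode.julia_chartransform('c') == 'c'

println("direct: $index_direct, after transform: $index_transformed, 'c' unchanged? $unchanged")
```

So if all we want is the integer index of a character, calling `Int` directly is the simpler choice.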

So we can get the base 10 representation of the Unicode character. Now let's go from base 10 to the `U+xxxx` representation. Convert the base 10 value to a base 16 (hexadecimal) number and then convert that to the [Unicode index](https://en.wikipedia.org/wiki/Unicode) format. 
> __Hexadecimal digits__: Hexadecimal numbers use decimal digits $(0,1,\dots,9)$ and six extra symbols; the letters `A`, `B`, `C`, `D`, `E`, and `F`, where hexadecimal `A` = decimal 10, through hexadecimal `F` = decimal 15.
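
To connect the digits to a value, recall that a hexadecimal number is a sum of digit values times powers of 16. The sketch below expands `1F363` by hand and cross-checks the result with the built-in `parse(...)` function; the array and variable names are our own.

```julia
# Digit values of 1F363, most significant first: 1, F = 15, 3, 6, 3
digit_values = [1, 15, 3, 6, 3]
n = length(digit_values)

# Positional expansion: 1*16^4 + 15*16^3 + 3*16^2 + 6*16^1 + 3*16^0
total = sum(d * 16^(n - k) for (k, d) in enumerate(digit_values))

# Cross-check with the built-in parser
parsed = parse(Int, "1F363", base=16)

println("positional expansion: $total, parse: $parsed")
```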

Let's start by building the `hexadecimal_digits_dictionary::Dict{Int64, Char}` dictionary which maps the index of a digit to the digit, e.g., `13 => D`.

In [19]:
hexadecimal_digits_dictionary = let

    # initialize -
    hexadecimal_digits_dictionary = Dict{Int,Char}()
    base = 16; # what base are we using?

    # loop: process each digit 0 -> 15
    for i ∈ 0:(base - 1)
        hexadecimal_digits_dictionary[i] = '0' + i |> Char # hmmm! This is whacky; why does this work?
        if (i > 9)
            hexadecimal_digits_dictionary[i] = 'A' + (i - 10) |> Char 
        end
    end
    hexadecimal_digits_dictionary # return
end

Dict{Int64, Char} with 16 entries:
  5  => '5'
  12 => 'C'
  8  => '8'
  1  => '1'
  0  => '0'
  6  => '6'
  11 => 'B'
  9  => '9'
  14 => 'E'
  3  => '3'
  7  => '7'
  4  => '4'
  13 => 'D'
  15 => 'F'
  2  => '2'
  10 => 'A'
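
Why does `'0' + i` work? In Julia, adding an integer to a `Char` shifts its code by that amount, and the digit characters `'0'` through `'9'` (and the letters `'A'` through `'F'`) have consecutive codes. The sketch below shows this, along with one alternative lookup; the `hex_digit` helper is hypothetical and is not part of the lab code.

```julia
# Adding an integer to a Char moves along the code table
five = '0' + 5        # code 48 + 5 = 53
letter_D = 'A' + 3    # code 65 + 3 = 68

# Alternative sketch: index into a string of the sixteen digits (1-based, hence i + 1)
hex_digit(i::Int) = "0123456789ABCDEF"[i + 1]

println("'0' + 5 = $five, 'A' + 3 = $letter_D, hex_digit(13) = $(hex_digit(13))")
```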

#### Algorithm
Next, let's specify and implement our first algorithm to convert a base 10 number into a Unicode character code. By convention, a [Unicode](https://en.wikipedia.org/wiki/Unicode) code point is written by left-padding the hexadecimal value with zeros until it has at least four digits, and then prepending `U+`.

__Initialize__: Specify a non-negative integer $x\in\mathbb{Z}_{\geq{0}}$ value, and provide the `hexadecimal_digits_dictionary::Dict{Int64, Char}`. Set $q\gets{x}$, set $i\gets{0}$, and let $R$ be an empty remainder array.

While $q\neq{0}$ __do__:
1.  Divide $q$ by `16` to get the new quotient $q^{\prime}$ and the remainder $r_{i}$.
2.  Store the remainder $r_{i}$ in the remainder array $R$.
3.  Set $q\gets{q}^{\prime}$ and update the counter $i\gets{i+1}$.

For each element of the remainder array $R$, look up the equivalent hexadecimal digit from the `hexadecimal_digits_dictionary`. Save the hexadecimal digits in the digits array $D$. Starting with the last value of $D$ and moving to the first, write each of the hexadecimal values. Left-pad with `0` characters until the length is at least `4`.
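
To see the repeated-division loop at work, the sketch below traces it for $x = 127843$ (the sushi emoji) and prints each quotient and remainder. The function name `hex_remainders` is hypothetical; the built-in `digits(...)` function is used only as a cross-check.

```julia
# Hypothetical helper: collect the base-16 remainders of x, least significant first
function hex_remainders(x::Int)
    remainders = Int[]
    q = x
    while q != 0
        q, r = divrem(q, 16)   # new quotient and remainder in one step
        push!(remainders, r)
        println("quotient = $q, remainder = $r")
    end
    return remainders
end

R = hex_remainders(127843)

# Built-in cross-check: digits(...) also returns least significant digit first
builtin_digits = digits(127843, base=16)

println("remainders: $R, read in reverse: $(reverse(R))")
```

Reading the remainders in reverse gives the digit values 1, 15, 3, 6, 3, i.e., the hexadecimal digits `1F363`.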

We've implemented this algorithm below. Does it work, i.e., do we get back the proper `U+xx..x` code?

In [20]:
my_code_point = let

    q = test_char_index; 
    remainder_array = Array{Int64,1}();
    while (q != 0)
        r = rem(q,16)
        q = div(q,16)
        push!(remainder_array,r)
    end

    my_code_point = "";
    for i ∈ reverse(remainder_array)
        tmp = hexadecimal_digits_dictionary[i];
        my_code_point *= tmp |> Char;
    end

    # left pad with zeros to get at least a 4-digit code, then prepend "U+"
    my_code_point = lpad(my_code_point, 4, '0') |> x-> "U+"*x
end

"U+1F363"

## Tests
In the code block below, we check some values in your notebook and give you feedback on which items are correct or different. `Unhide` the code block below (if you are curious) to see how we implemented the tests and what we are testing.

In [21]:
@testset verbose = true "CHEME 4/5800 L2b Test Suite" begin

    @testset "ASCII Character Dictionary" begin
        # Test that ASCII dictionary is created correctly
        ascii_dict = let
            ascii_dict = Dict{Int64, Char}()
            ASCII_character_range = range(0,stop=127,step=1) |> collect
            for i ∈ eachindex(ASCII_character_range)
                my_ascii_char_index = ASCII_character_range[i]
                c = convert(Char, my_ascii_char_index)
                ascii_dict[my_ascii_char_index] = c
            end
            ascii_dict
        end
        
        @test length(ascii_dict) == 128
        @test ascii_dict[65] == 'A'
        @test ascii_dict[97] == 'a'
        @test ascii_dict[48] == '0'
        @test haskey(ascii_dict, 127)
    end

    @testset "Character to Integer Conversion" begin
        @test Int('A') == 65
        @test Int('a') == 97
        @test Int('0') == 48
        @test Int('🍣') == 127843  # Sushi emoji code point
    end

    @testset "Hexadecimal Digits Dictionary" begin
        hex_dict = let
            hex_dict = Dict{Int,Char}()
            base = 16
            for i ∈ 0:(base - 1)
                hex_dict[i] = '0' + i |> Char
                if (i > 9)
                    hex_dict[i] = 'A' + (i - 10) |> Char 
                end
            end
            hex_dict
        end
        
        @test length(hex_dict) == 16
        @test hex_dict[0] == '0'
        @test hex_dict[9] == '9'
        @test hex_dict[10] == 'A'
        @test hex_dict[15] == 'F'
    end

    @testset "Unicode Code Point Conversion" begin
        # Test the algorithm for converting base 10 to hexadecimal Unicode code point
        test_char = '🍣'
        test_char_index = Int(test_char)
        
        # Expected code point for sushi emoji
        @test test_char_index == 127843
        
        # Test the conversion algorithm
        hex_dict = let
            hex_dict = Dict{Int,Char}()
            base = 16
            for i ∈ 0:(base - 1)
                hex_dict[i] = '0' + i |> Char
                if (i > 9)
                    hex_dict[i] = 'A' + (i - 10) |> Char 
                end
            end
            hex_dict
        end
        
        my_code_point = let
            q = test_char_index
            remainder_array = Array{Int64,1}()
            while (q != 0)
                r = rem(q,16)
                q = div(q,16)
                push!(remainder_array,r)
            end

            my_code_point = ""
            for i ∈ reverse(remainder_array)
                tmp = hex_dict[i]
                my_code_point *= tmp |> Char
            end

            my_code_point = lpad(my_code_point, 4, '0') |> x-> "U+"*x
        end
        
        @test my_code_point == "U+1F363"
    end

    @testset "Character Set Operations" begin
        C = let
            C = Set{Char}()
            push!(C,'A')
            push!(C,'B') 
            push!(C,'Q')
            push!(C,'R')
            push!(C,'S')
            C
        end
        
        @test length(C) == 5
        @test 'A' ∈ C
        @test 'B' ∈ C
        @test 'Q' ∈ C
        @test 'Z' ∉ C  # Z should not be in the set
    end

    @testset "Emoji Variable Operations" begin
        🌽 = 16
        🍣 = 4

        @test 🌽 + 🍣 == 20
        @test 🌽 * 🍣 == 64
        @test 🌽 ≥ 🍣
        @test 🍣 < 🌽
    end

    @testset "String to Character Array Conversion" begin
        test_string = "Hello"
        char_array = test_string |> collect
        
        @test length(char_array) == 5
        @test char_array[1] == 'H'
        @test char_array[5] == 'o'
        @test typeof(char_array) == Array{Char,1}
    end
end;

Test Summary:                          | Pass  Total  Time
CHEME 4/5800 L2b Test Suite            |   29     29  0.3s
  ASCII Character Dictionary           |    5      5  0.2s
  Character to Integer Conversion      |    4      4  0.0s
  Hexadecimal Digits Dictionary        |    5      5  0.0s
  Unicode Code Point Conversion        |    2      2  0.0s
  Character Set Operations             |    5      5  0.0s
  Emoji Variable Operations            |    4      4  0.0s
  String to Character Array Conversion |    4      4  0.0s
