# Activity: Fun With Base 16 Numbers
As a scientist or engineer, it may seem strange that (arguably) the most critical number systems for data science and machine learning applications are not floating-point numbers but base $b$ integers! These numbers are used in many hard-to-see, yet critical, applications. Let's dig into one of these in this activity, namely, the mathematical encoding of text.
* __Backstory__. Computers don't see text the way you and I do, i.e., as a collection of letters, digits, and punctuation characters. Instead, computers map each character to a unique integer and then express that integer in a base-b positional numbering system, resulting in a digit sequence stored as bits or bytes.
* __Why base b__? Encoding numbers in base-b, i.e., beyond the base 10 numbers we are used to, facilitates efficient text processing, storage, and transmission by serializing characters into predictable, fixed-width numerical representations that can be decoded back into readable symbols.
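
To make the character-to-integer-to-digits pipeline concrete, here is a minimal sketch using only built-in Julia functions. The variable names are illustrative and are not used elsewhere in the activity.

```julia
# Map a character to an integer, then express that integer in several bases.
ch = 'A'                              # illustrative test character
n = Int(ch)                           # character -> integer index (65 for 'A')

binary_form = string(n, base = 2)     # base 2 digit sequence, as a String
hex_form = string(n, base = 16)       # base 16 digit sequence, as a String

# digits(...) returns the base-b digits, least significant digit first
hex_digits = digits(n, base = 16)

# going backward: parse the digit sequence in its base, then rebuild the character
roundtrip = Char(parse(Int, hex_form, base = 16))
```

Here `65` is written as `1000001` in base 2 and `41` in base 16, and parsing the base 16 digits recovers the original character.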

Traditionally, characters were represented using [the ASCII system](https://en.wikipedia.org/wiki/ASCII). [Standard (7-bit) ASCII](https://www.ascii-code.com/ASCII) defined 128 character codes, numbered 0 through 127. [Extended 8-bit ASCII variants](https://www.ascii-code.com) used 256 values. 
Modern encodings, such as [Unicode encodings like UTF-8 or UTF-16](https://en.wikipedia.org/wiki/Unicode), represent a much wider range of characters.

In this activity, we'll dig into the numerical basis of text, starting with the [7-bit ASCII system](https://en.wikipedia.org/wiki/ASCII) and [extended 8-bit ASCII variants](https://www.ascii-code.com) and working our way to [Unicode encodings like UTF-8 or UTF-16](https://en.wikipedia.org/wiki/Unicode). 

## Setup
We set up the computational environment by including the `Include.jl` file and loading any needed resources.
* __Include__: The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the Julia programming language documentation.

In [3]:
include("Include.jl")

___

##  Explore the ASCII Character Set
Let's start by considering the representation of the characters in the `ASCII` character system. The original `ASCII` system uses 7 bits of information to represent a character, which gives $2^{7} = 128$ possible characters. These characters are encoded as the first `128` characters (`0`$\rightarrow$`127`) in the modern [Unicode system](https://en.wikipedia.org/wiki/Unicode).

In [6]:
ascii_char_dictionary = let

    # initialize -
    ascii_char_dictionary = Dict{Int64, Char}(); # storage for index - character map
    ASCII_character_range = range(0,stop=127,step=1) |> collect; # standard (7-bit) ASCII indexes 0,...,127

    # main loop -
    for i ∈ eachindex(ASCII_character_range)
        my_ascii_char_index = ASCII_character_range[i];
        c = convert(Char, my_ascii_char_index) # hmmm. This is an interesting function.
        ascii_char_dictionary[my_ascii_char_index] = c;
    end
    ascii_char_dictionary;
end

Dict{Int64, Char} with 128 entries:
  5   => '\x05'
  56  => '8'
  35  => '#'
  55  => '7'
  110 => 'n'
  114 => 'r'
  123 => '{'
  60  => '<'
  30  => '\x1e'
  32  => ' '
  6   => '\x06'
  67  => 'C'
  45  => '-'
  117 => 'u'
  73  => 'I'
  115 => 's'
  112 => 'p'
  64  => '@'
  90  => 'Z'
  4   => '\x04'
  13  => '\r'
  54  => '6'
  63  => '?'
  86  => 'V'
  104 => 'h'
  ⋮   => ⋮

`Unhide` the code block to see how we build a table of the `ASCII` characters using [the `pretty_table(...)` function exported by the PrettyTables.jl package](https://github.com/ronisbr/PrettyTables.jl). Let's look at what the `ASCII` characters are.
* This logic involves some advanced tools and techniques we've not yet discussed. You can skip the implementation details for now; we'll come back to them later.
* We'll do this using a [`for-loop`](https://docs.julialang.org/en/v1/base/base/#for) and [the `convert(...)` function](https://docs.julialang.org/en/v1/base/base/#Base.convert), where we push items into a [DataFrame](https://dataframes.juliadata.org/stable/), and then display the data by [calling the `pretty_table(...)` method](https://github.com/ronisbr/PrettyTables.jl).
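
The table code below builds each row as a `NamedTuple` and adds it with `push!`. As a minimal sketch of those two ideas, using a plain `Vector` instead of a DataFrame so that no packages are needed (the names `row` and `rows` are illustrative):

```julia
# A NamedTuple is an immutable tuple whose fields have names.
row = (i = 99, character = 'c')    # same field names as the table rows

index_value = row.i                # access a field by name
char_value = row.character

# push! appends an item to the end of a collection; here, a Vector of rows
rows = typeof(row)[]               # empty Vector that holds rows of this type
push!(rows, row)
push!(rows, (i = 100, character = 'd'))
```

The trailing `!` in `push!` is a Julia naming convention signaling that the function modifies its first argument.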

In [8]:
let
    ASCII_index_array = keys(ascii_char_dictionary) |> collect |> sort;
    character_table_df = DataFrame();
    for i ∈ eachindex(ASCII_index_array)
        my_ascii_char_index = ASCII_index_array[i];
        c = ascii_char_dictionary[my_ascii_char_index];

        row = (
            i = my_ascii_char_index,
            character = c
        ); # -> what is going on here? This is a cool type called a NamedTuple ...
        push!(character_table_df,row); # push! -> what is going on here?
    end
    pretty_table(character_table_df, tf=tf_simple)
end

 [1m     i [0m [1m character [0m
 [90m Int64 [0m [90m      Char [0m
      0          \0
      1        \x01
      2        \x02
      3        \x03
      4        \x04
      5        \x05
      6        \x06
      7          \a
      8          \b
      9          \t
     10          \n
     11          \v
     12          \f
     13          \r
     14        \x0e
     15        \x0f
     16        \x10
     17        \x11
     18        \x12
     19        \x13
     20        \x14
     21        \x15
     22        \x16
     23        \x17
     24        \x18
     25        \x19
     26        \x1a
     27          \e
     28        \x1c
     29        \x1d
     30        \x1e
     31        \x1f
     32
     33           !
     34           "
     35           #
     36           $
     37           %
     38           &
     39           '
     40           (
     41           )
     42           *
     43           +
     44           ,
     45           -
     46         

In the first code block above, we explicitly called [the `convert(...)` method](https://docs.julialang.org/en/v1/manual/conversion-and-promotion/) to convert an [`Int`](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Integers) to a [`Char` type](https://docs.julialang.org/en/v1/base/strings/#Core.Char). However, `convert(...)` is not the only way to do this: in Julia, we can also call a type such as `Int` or `Char` directly as a function (a constructor) to translate between characters and integers. For example:

In [10]:
'c' |> Int # wow. this seems a little magical. What is actually going on here?? (notice the single quotes)

99

In [11]:
Int('c') # what?!? We can convert between ASCII characters and Int! 

99

In [12]:
Char(99)

'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
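
Because each character is just an index, Julia also lets us do simple arithmetic on `Char` values. A minimal sketch (variable names are illustrative):

```julia
# Round trip between characters and integers using the type constructors
code = Int('c')            # 99
back = Char(code)          # 'c'

# Arithmetic on characters operates on the underlying index
next_char = 'c' + 1        # Char + Int gives a Char: 'd'
offset = 'c' - 'a'         # Char - Char gives an Int: 2
digit_char = '0' + 7       # the character '7'
```

The digit characters `'0'` through `'9'` and the letters `'A'` through `'Z'` occupy consecutive indexes, which is why adding an offset to `'0'` or `'A'` lands on the expected character.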

### Limitations of ASCII
Wow! ASCII seems pretty cool. Why do we need anything beyond these 128 characters?
* The original 7-bit ASCII was limited to 128 code points (0–127), so it couldn’t represent characters beyond basic English letters, digits, punctuation, and control codes.
* The original 7-bit ASCII character set offered no native support for accented characters or non-Latin scripts (e.g., Cyrillic, Greek, and Arabic), hindering internationalization.
* The original 7-bit ASCII character set omitted modern typographic symbols, mathematical glyphs, and emojis, making it inadequate for rich text or graphical communication.

Bummer. These limitations, and others, led to the development of [the Unicode system](https://en.wikipedia.org/wiki/Unicode).
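
We can see the 128 code point limit directly with the built-in `isascii(...)` function. This is a minimal sketch; the accented character is written with its `\u` escape so that it is unambiguous, and the variable names are illustrative.

```julia
# isascii(...) checks whether a character (or every character in a string) lies in 0:127
plain = isascii('c')                  # true
accented_char = '\u00e9'              # é, an accented Latin letter
accented = isascii(accented_char)     # false: its index is outside 0:127
accented_index = Int(accented_char)   # 233
mixed = isascii("caf\u00e9")          # false, because of the é
```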

## Explore the Unicode Character Set
The ASCII character set has only 128 characters (lower and uppercase English-language characters, some mathematical symbols, and numbers). The modern [Unicode system](https://en.wikipedia.org/wiki/Unicode) _can encode_ 1,114,112 possible characters (code points). However, as of [Unicode Standard version 16.0 (released September 10, 2024)](https://www.unicode.org/versions/Unicode16.0.0/), 154,998 of those code points are assigned to uniquely named characters: characters from many languages, a much larger family of mathematical symbols, and [emoji](https://en.wikipedia.org/wiki/Emoji) of all sorts.
* The [Unicode system](https://en.wikipedia.org/wiki/Unicode) indexes characters using a `base 16` (hexadecimal) numbering system. Julia has [built-in Unicode support in the standard library](https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode); thus, we can work with these characters in our programs. For more information on the specific Unicode primitives supported by Julia, check out the [Julia documentation](https://docs.julialang.org/en/v1/manual/unicode-input/).
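
As a small illustration of the hexadecimal indexing, Julia lets us write a character directly from its hexadecimal index. A minimal sketch (variable names are illustrative):

```julia
sushi = '\U1F363'                               # character written with its hexadecimal index
same_sushi = Char(0x1F363)                      # same character from a hexadecimal integer literal
decimal_index = Int(sushi)                      # the same index written in base 10
hex_index = string(decimal_index, base = 16)    # back to base 16 (lowercase digits)
```

Note that `string(...)` produces lowercase hexadecimal digits, while the `U+` notation conventionally uses uppercase.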

Let's do some math with `emojis` (not really that important, but fun):

In [15]:
🌽 = 16; # corn = 16 \:corn:
🍣 = 4; # sushi = 4 \:sushi:

Can we do operations?

In [17]:
🌽 + 🍣  # addition?

20

In [18]:
(🌽 * 🍣) # multiplication?

64

How about logical comparisons? Because Julia has built-in support for [Unicode characters](https://en.wikipedia.org/wiki/Unicode), in addition to using these characters as variables, you can also use many of the Unicode mathematical symbols to _write functional code_ that looks like something we might understand from math. 

To see what characters are supported, check out [the Unicode input documentation](https://docs.julialang.org/en/v1/manual/unicode-input/):
* For example, consider `🌽 = 16` and `🍣 = 4`. We know that $🌽\geq{🍣}$ should return `true`. Let's write the `greater than or equal to` check using [Unicode characters](https://en.wikipedia.org/wiki/Unicode):

In [20]:
🌽 ≥ 🍣 # logical comparison? (\geq then tab)

true

We can also determine whether an item is in the given collection. For example, assume we have the character set $\mathbb{C} \equiv \left\{A,B,Q,R,S\right\}$. Let's write an expression to check if $c\in\mathbb{C}$, where $\in$ denotes the `element of` operation, and `c` is some character. 
* Use the [Set data structure](https://docs.julialang.org/en/v1/base/collections/#Base.Set) to do this example. [`Set` is a collection type](https://docs.julialang.org/en/v1/base/collections/#Base.Set) (included in most modern languages) that holds items; the items in a set are unique, but the set does not maintain their order.

Specify the set $\mathbb{C}$:

In [22]:
C = let
    C = Set{Char}(); # empty set with Char types
    push!(C,'A'); # add a `A` to set C 
    push!(C,'B'); # ... `B` ...
    push!(C,'Q'); # ... `Q` ...
    push!(C,'R'); # ... `R` ...
    push!(C,'S'); # ... `S` ...

    C # return
end

Set{Char} with 5 elements:
  'A'
  'R'
  'S'
  'Q'
  'B'

Specify the test character `c`:

In [24]:
c = 🌽; # Do we have corn in the set ℂ? (note: 🌽 holds the integer 16, not a Char)

Check if $c\in\mathbb{C}$:

In [26]:
c ∈ C # ∈ => \in then tab

false
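
For comparison, here is a minimal sketch of the same kind of membership check with characters that are (and are not) in the set. It also shows that repeated items are stored only once. The names below are illustrative.

```julia
letters = Set(['A', 'B', 'Q', 'R', 'S', 'A'])   # the repeated 'A' is stored only once
n_items = length(letters)                        # 5
has_Q = 'Q' ∈ letters                            # true
has_Z = 'Z' ∈ letters                            # false
lacks_Z = 'Z' ∉ letters                          # true (\notin then tab)
```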

Ok, so [the Unicode character set is cool](https://en.wikipedia.org/wiki/Unicode), and [Julia's support for Unicode](https://docs.julialang.org/en/v1/manual/unicode-input/) is next level. But how are these characters related to base 16 numbers? Let's dig into this question next.

## Unicode Strings and Codepoints
The built-in [Julia `String` type](https://docs.julialang.org/en/v1/base/strings/) can represent text. The [`String` type](https://docs.julialang.org/en/v1/base/strings/) can be thought of similarly to the traditional text model in languages like [C](https://en.wikipedia.org/wiki/C_(programming_language)), namely an ordered collection (array) of `Char` types. 
* However, behind the scenes, each character in a `String` in `Julia` or `Python` is identified by a Unicode codepoint, i.e., an integer index that is conventionally written in `base 16` (hexadecimal).
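
To see the difference between characters and their stored bytes, here is a minimal sketch. It relies on the fact that a Julia `String` is stored using the UTF-8 encoding, in which a character takes between one and four bytes. The variable names are illustrative.

```julia
s = "a🍣"
n_chars = length(s)                  # number of characters: 2
n_bytes = ncodeunits(s)              # number of stored bytes: 5 (1 for 'a', 4 for the emoji)
bytes = collect(codeunits(s))        # the raw bytes as a Vector{UInt8}

# the codepoint of each character, written in base 16
points = [string(codepoint(ch), base = 16) for ch in s]
```

So the codepoint `1f363` identifies the character, while the four bytes `f0 9f 8d a3` are how UTF-8 stores that codepoint.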

Let's play around with a `test_string_ascii::String`:

In [29]:
test_string_ascii = "Test String in Julia (notice the double quotes). Python uses both single and double quotes for Strings. 😒";

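First, let's split the string into its individual characters by passing it to the `collect(...)` function, which returns an array of `Char` values: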

In [31]:
character_array_test_string_ascii = test_string_ascii |> collect

105-element Vector{Char}:
 'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 ⋮
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 't': ASCII/Un

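Each entry in the array shows a character along with its Unicode codepoint, e.g., `'T'` has the codepoint `U+0054`.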

### How do we calculate a codepoint?
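The codepoint (index) for `🍣` equals `U+1F363`. But what does that mean? To start, we can compute the `base 10` index for a Unicode character by converting the character to an `Int`. Here, we first pass the character through [the `Unicode.julia_chartransform(...)` method](https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode.julia_chartransform), which maps a character to the equivalent character used by the Julia parser and leaves most characters (including `🍣`) unchanged, and then convert the result to an `Int`.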

Let's store a test character in the `test_unicode_char::Char` variable:

In [63]:
test_unicode_char = '🍣' # want to select another Unicode character?

'🍣': Unicode U+1F363 (category So: Symbol, other)

What is the base 10 value for the `test_unicode_char::Char`?

In [65]:
test_char_index = Unicode.julia_chartransform(test_unicode_char) |> Int

127843
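
To see what `Unicode.julia_chartransform(...)` does on its own, here is a minimal sketch. It assumes Julia 1.8 or later, where this function is available, and writes the two look-alike characters with their `\u` escapes so they are unambiguous. The variable names are illustrative.

```julia
using Unicode

micro = '\u00b5'                                   # µ (micro sign)
mu = '\u03bc'                                      # μ (Greek small letter mu)
normalized = Unicode.julia_chartransform(micro)    # the micro sign is mapped to the Greek mu
unchanged = Unicode.julia_chartransform('x')       # most characters come back unchanged
x_index = Int(unchanged)                           # the Int conversion gives the base 10 index
```

In other words, the `Int` conversion is the step that produces the index; the transform only matters for the few characters that the Julia parser treats as equivalent.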

Now, let's check this value by going backward: from `base 10` back to the `codepoint`.
* Convert the `base 10` value to a `base 16` (hexadecimal) number and then convert that to the [Unicode index](https://en.wikipedia.org/wiki/Unicode) format: the hexadecimal value is left-padded with zeros until it has at least four digits (e.g., `U+0054`), and then `U+` is prepended to the digits.
* Hexadecimal numbers use sixteen symbols: the decimal digits $(0,1,\dots,9)$ and the letters A, B, C, D, E, and F, where hexadecimal A = decimal 10, through hexadecimal F = decimal 15.
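
As a quick check of the positional idea, here is a minimal sketch that expands the hexadecimal digits of `1F363` as a sum of powers of 16. The names are illustrative.

```julia
hex_string = "1F363"
digit_values = [1, 15, 3, 6, 3]      # the digits 1, F, 3, 6, 3 as decimal values
n = length(digit_values)

# each digit is weighted by a power of 16; the leftmost digit has the largest power
total = sum(digit_values[k] * 16^(n - k) for k in 1:n)

# cross-check with the built-in parser
parsed = parse(Int, hex_string, base = 16)
```

Both `total` and `parsed` equal `127843`, i.e., $1\cdot16^{4} + 15\cdot16^{3} + 3\cdot16^{2} + 6\cdot16^{1} + 3\cdot16^{0}$.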

In [69]:
hexidecimal_digits_dictionary = let

    hexidecimal_digits_dictionary = Dict{Int,Char}()
    for i ∈ 0:15
        hexidecimal_digits_dictionary[i] = '0' + i |> Char # hmmm: this is whacky; why does this work?
        if (i > 9)
            hexidecimal_digits_dictionary[i] = 'A' + (i - 10) |> Char
        end
    end
    hexidecimal_digits_dictionary
end

Dict{Int64, Char} with 16 entries:
  5  => '5'
  12 => 'C'
  8  => '8'
  1  => '1'
  0  => '0'
  6  => '6'
  11 => 'B'
  9  => '9'
  14 => 'E'
  3  => '3'
  7  => '7'
  4  => '4'
  13 => 'D'
  15 => 'F'
  2  => '2'
  10 => 'A'

__Algorithm__:
* Step 1: Divide the given decimal number by 16 and write down the quotient and remainder.
* Step 2: Divide the previous quotient by 16 and write down the quotient and remainder.
* Step 3: Repeat step 2 until the quotient equals zero.
* Step 4: Map each remainder value to its corresponding hexadecimal digit.
* Step 5: Starting with the last remainder and moving to the first, write out the hexadecimal digits.
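
To see the steps above in action, here is a minimal sketch that records every division for the sushi index `127843`. The helper name `division_trace` is hypothetical (it is not part of the activity), and the final lines cross-check the result against Julia's built-in `string(...)` function.

```julia
# Record each (dividend, quotient, remainder) triple of the repeated division by b.
function division_trace(n::Int, b::Int)
    steps = Tuple{Int,Int,Int}[]
    q = n
    while q != 0
        quotient, remainder = divrem(q, b)
        push!(steps, (q, quotient, remainder))
        q = quotient
    end
    return steps
end

steps = division_trace(127843, 16)
for (dividend, quotient, remainder) in steps
    println(dividend, " ÷ 16 = ", quotient, " remainder ", remainder)
end

# the first remainder is the least significant digit, so we read them in reverse
remainders = [s[3] for s in steps]
hex_digits = [uppercase(string(r, base = 16)) for r in reverse(remainders)]
hex_value = join(hex_digits)

# cross-check against the built-in base conversion
builtin_value = uppercase(string(127843, base = 16))
```

The remainders come out as 3, 6, 3, 15, 1; reading them from last to first gives the digits 1, F, 3, 6, 3.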

In [74]:
my_code_point = let

    q = test_char_index; 
    remainder_array = Array{Int64,1}();
    while (q != 0)
        r = rem(q,16)
        q = div(q,16)
        push!(remainder_array,r)
    end

    my_code_point = "";
    for i ∈ reverse(remainder_array)
        tmp = hexidecimal_digits_dictionary[i];
        my_code_point *= tmp |> Char;
    end

    # left pad with zeros to at least 4 hex digits (the Unicode convention), then prepend U+
    my_code_point = lpad(my_code_point, 4, '0') |> x-> "U+"*x
end

"U+1F363"

### Alternative: Use the codepoint function
The [`codepoint(...)` method](https://docs.julialang.org/en/v1/base/strings/#Base.codepoint) takes a [`Char` type](https://docs.julialang.org/en/v1/base/strings/#Core.Char) and (typically) returns [a `UInt32` value](https://docs.julialang.org/en/v1/base/numbers/#Core.UInt32) corresponding to the character index. The __fast way__ to get the Unicode character index is to call [the `convert(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.convert) on the `UInt32` value returned from [the `codepoint(...)` method](https://docs.julialang.org/en/v1/base/strings/#Base.codepoint), which converts it back to an `Int32` or `Int64`:

In [77]:
fast_test_index = Unicode.codepoint(test_unicode_char) |> x-> convert(Int,x); # for sushi 127843
@assert fast_test_index == test_char_index

However, we can go a little deeper. A `UInt32` is an unsigned 32-bit integer, but it is still an integer. Thus, we can use [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) to compute the corresponding bitstring, reverse it so that the least significant bit comes first, and then convert it into an array of `Int64` integers.

Save that array in the `codepoint_bitpattern_array::Array{Int64,1}` variable.

In [80]:
codepoint_bitpattern_array = Unicode.codepoint(test_unicode_char) |> bitstring |> collect |> reverse .|> x-> parse(Int, x)

32-element Vector{Int64}:
 1
 1
 0
 0
 0
 1
 1
 0
 1
 1
 0
 0
 1
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

Then, we can use the values in the `codepoint_bitpattern_array` to compute the `base 10` index of the Unicode character and compare that to the `test_char_index` value using [the @assert macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert):

In [85]:
N = length(codepoint_bitpattern_array)
base = 2;
value = 0;
for i ∈ 0:(N-1)
    value += codepoint_bitpattern_array[i+1]*base^(i); # why i + 1
end
value

127843
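
The loop above is a polynomial in the base evaluated at `2`, so, as a minimal sketch, we can get the same value more compactly with the built-in `evalpoly(...)` function (available in Julia 1.4 and later) or by parsing the bitstring directly. The variable names are illustrative.

```julia
# bits of the sushi index, least significant bit first
bits = [parse(Int, ch) for ch in reverse(bitstring(UInt32(127843)))]

# evalpoly(x, p) computes p[1] + p[2]*x + p[3]*x^2 + ...
from_evalpoly = evalpoly(2, bits)

# alternatively, parse the (unreversed) bitstring as a base 2 number
from_parse = parse(Int, bitstring(UInt32(127843)), base = 2)
```

Both approaches recover the `base 10` index `127843`.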