# Example: Fun with Text, Strings and Characters
Textual data on a computer is represented by the `String` data type. In languages such as [C](https://en.wikipedia.org/wiki/C_(programming_language)), a `String` was modeled as a sequence of characters, where each character was of type `Char`.
* Characters were represented via the [American Standard Code for Information Interchange (ASCII) system](https://en.wikipedia.org/wiki/ASCII), which was a set of `7-bit` teleprinter codes for the [AT&T](https://www.att.com) Teletypewriter exchange (TWX) network. For example, the character `A` in the ASCII system has an index of `65`.
* Later, `8-bit` character mappings were developed, i.e., the so-called [extended ASCII systems](https://en.wikipedia.org/wiki/Extended_ASCII), which had `256` possible character values, indexed $0,\dots,255$.
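
As a quick sketch of these ideas (the variable names below are illustrative and not part of the original notebook), we can look up the index of `A` and count how many codes fit in `7` and `8` bits:

```julia
# The integer index of a character, and the character stored at an index
index_of_A = Int('A')            # index of 'A' in the ASCII table
char_at_65 = Char(65)            # character stored at index 65

# Number of distinct codes available with 7 and 8 bits
number_of_7bit_codes = 2^7       # codes 0,...,127
number_of_8bit_codes = 2^8       # codes 0,...,255

println((index_of_A, char_at_65, number_of_7bit_codes, number_of_8bit_codes))
```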

Some aspects of the `ASCII` system remain in use today, while others have changed substantially. For example, modern languages have sophisticated built-in `String` types constructed using the [Unicode character set](https://en.wikipedia.org/wiki/Unicode). 
* The [Unicode standard](https://en.wikipedia.org/wiki/Unicode) has room for approximately 1.1 million possible characters (code points), the first `128` of which are the same as the original `ASCII` set. [Unicode](https://en.wikipedia.org/wiki/Unicode) characters, which use up to 4 bytes (32 bits) of storage per character, are indexed using the `base 16` (hexadecimal) number system. Today, approximately `150,000` characters are assigned in Unicode.
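
To make the storage claim concrete, here is a minimal sketch that prints the hexadecimal index and the number of bytes for a few characters. It assumes Julia's default `UTF-8` encoding for `String`; the sample characters were chosen only for illustration:

```julia
# Each string below holds exactly one character
for s ∈ ["A", "é", "€", "🍣"]
    ch = first(s)                                  # the single character in s
    hex_index = string(codepoint(ch), base=16)     # index written in base 16
    println(ch, " => index 0x", hex_index, ", stored in ", sizeof(s), " byte(s)")
end
```

The ASCII character `A` needs one byte, while the emoji needs four.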

## Setup
Let's load some `external packages`, i.e., code that other people have made available to the world, using the [Julia package manager](https://docs.julialang.org/en/v1/stdlib/Pkg/) (which we'll explore in a few lectures from now):

In [1]:
# add -
using Pkg;
Pkg.add("DataFrames");
Pkg.add("PrettyTables");

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-4800-5800-Examples-AY-2024/week-1/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-4800-5800-Examples-AY-2024/week-1/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-4800-5800-Examples-AY-2024/week-1/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Desktop/julia_work/CHEME-4800-5800-Examples-AY-2024/week-1/Manifest.toml`


In [2]:
# use -
using DataFrames
using PrettyTables
using Unicode

## ASCII Characters
Let's start by considering the representation of the characters in the `ASCII` character system. The original `ASCII` characters are encoded as the first `128` characters (`0`$\rightarrow$`127`) in the modern [Unicode system](https://en.wikipedia.org/wiki/Unicode). 
* Let's look at what the `ASCII` characters are. We'll do this using a `for` loop and the [convert function](https://docs.julialang.org/en/v1/base/base/#Base.convert), where we push items into a [DataFrame](https://dataframes.juliadata.org/stable/) and display the result using the `pretty_table(...)` function exported by the [PrettyTables.jl package](https://github.com/ronisbr/PrettyTables.jl).

In [3]:
ASCII_character_range = range(0,stop=127,step=1) |> collect; # what is going on here?
character_table_df = DataFrame();
for i ∈ eachindex(ASCII_character_range)
    my_ascii_char_index = ASCII_character_range[i];
    c = convert(Char,my_ascii_char_index) # hmmm. This is an interesting function.

    row = (
        i = my_ascii_char_index,
        character = c
    );

    push!(character_table_df,row);
end
pretty_table(character_table_df, tf=tf_simple)

 [1m     i [0m [1m character [0m
 [90m Int64 [0m [90m      Char [0m
      0          \0
      1        \x01
      2        \x02
      3        \x03
      4        \x04
      5        \x05
      6        \x06
      7          \a
      8          \b
      9          \t
     10          \n
     11          \v
     12          \f
     13          \r
     14        \x0e
     15        \x0f
     16        \x10
     17        \x11
     18        \x12
     19        \x13
     20        \x14
     21        \x15
     22        \x16
     23        \x17
     24        \x18
     25        \x19
     26        \x1a
     27          \e
     28        \x1c
     29        \x1d
     30        \x1e
     31        \x1f
     32
     33           !
     34           "
     35           #
     36           $
     37           %
     38           &
     39           '
     40           (
     41           )
     42           *
     43           +
     44           ,
     45           -
     46         

We explicitly called the [convert function](https://docs.julialang.org/en/v1/manual/conversion-and-promotion/) to convert an `Int` to a `Char` type. However, calling `convert` was unnecessary: in `Julia`, type names such as `Int` and `Char` can also be called like functions (constructors), so we can pass a value directly to the target type. For example:

In [5]:
'c' |> Int # wow. this seems a little magical. What is actually going on here?? (notice the single quotes)

99

In [6]:
Int('c') # we can convert between ASCII characters and Int! 

99

In [23]:
Char(99)

'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
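
The sketch below collects the conversion styles used so far and checks that they agree. The variable names are illustrative; `isascii` is a built-in function that reports whether a character lies in the original `0`$\rightarrow$`127` range:

```julia
# Three equivalent ways to go from an integer index to a character
from_convert = convert(Char, 99)
from_constructor = Char(99)
from_pipe = 99 |> Char
println(from_convert == from_constructor == from_pipe)

# Round trip: Char -> Int -> Char returns the original character
round_trip = Char(Int('c'))
println(round_trip == 'c')

# Is a character part of the original ASCII set?
println(isascii('c'), " ", isascii('🍣'))
```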

## Unicode Characters
The `ASCII` character set has only `128` characters (control characters, lower and uppercase `English-language` letters, numbers, punctuation, and a few mathematical symbols). The modern [Unicode system](https://en.wikipedia.org/wiki/Unicode) currently encodes approximately `150,000` characters, covering many languages, a much larger family of mathematical symbols, and [emoji](https://en.wikipedia.org/wiki/Emoji) of all sorts. 
* The [Unicode system](https://en.wikipedia.org/wiki/Unicode) indexes characters using `base 16` numbers. `Julia` has [built-in Unicode support in the standard library](https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode); thus, we can work with these characters in our programs.
* For more information on the specific `Unicode primitives` supported by Julia, check out the [Julia documentation](https://docs.julialang.org/en/v1/manual/unicode-input/).

Let's do some math with `emojis` (not really that important, but fun):

In [7]:
🌽 = 16; # corn = 16 \:corn:
🍣 = 4; # sushi = 4 \:sushi:

In [8]:
🌽 + 🍣

20

In [9]:
🌽 * 🍣

64

Because Julia has built-in support for [Unicode characters](https://en.wikipedia.org/wiki/Unicode), we can use mathematical symbols to write computer code that looks like the math it implements. To see which characters are supported in Julia, check out the [Unicode input documentation](https://docs.julialang.org/en/v1/manual/unicode-input/):
* For example, consider `x = 4` and `y = 10`. We know that $x\leq{y}$ is `true`. Let's write the `less than or equal to` check using [Unicode characters](https://en.wikipedia.org/wiki/Unicode):

In [10]:
x = 4;
y = 10;

In [25]:
x ≤ y # wow, how did this work? A little fancy, but to get the less than or equal we typed \leq followed by the tab key

true
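
As a small sketch, we can check that the Unicode comparison operators give the same answers as their ASCII spellings. The values of `x` and `y` are repeated here so the block runs on its own:

```julia
x = 4
y = 10

# Unicode operator (typed as \leq, \geq, \ne + tab) versus the ASCII spelling
leq_matches = (x ≤ y) == (x <= y)
geq_matches = (x ≥ y) == (x >= y)
neq_matches = (x ≠ y) == (x != y)

println((x ≤ y, x ≥ y, x ≠ y))
println((leq_matches, geq_matches, neq_matches))
```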

We can also determine whether an item is in the given collection. For example, assume we have the character set $\mathbb{C} \equiv \left\{A,B,Q,R,S\right\}$. Let's write an expression to check if $c\in\mathbb{C}$. 
* To do this example, let's use the [Set data structure](https://docs.julialang.org/en/v1/base/collections/#Base.Set) in `Julia`. `Sets` are a collection type that holds `unique` items but does not maintain their order:

In [30]:
c = 'Q';

In [34]:
C = Set{Char}();
push!(C,'A');
push!(C,'B');
push!(C,'Q');
push!(C,'R');
push!(C,'S');
push!(C,1) # this throws an error: 1 is an Int that cannot be stored as a Char key

LoadError: ArgumentError: 1 is not a valid key for type Char

In [31]:
c ∈ C

true
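
The following sketch illustrates the uniqueness property and a few other set operations that can be written with Unicode symbols. The name `letters` and the extra test characters are hypothetical, chosen only for this example:

```julia
# Build the set from an array; the duplicate 'Q' is stored only once
letters = Set{Char}(['A', 'B', 'Q', 'R', 'S', 'Q'])
println(length(letters))

# Membership (\in + tab) and non-membership (\notin + tab)
println('Q' ∈ letters)
println('Z' ∉ letters)

# Intersection (\cap + tab) with another set
common = letters ∩ Set(['A', 'Z'])
println(common)
```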

## Unicode Strings and Codepoints
The built-in [Julia `String` type](https://docs.julialang.org/en/v1/base/strings/) can represent text. The `String` type can be thought of similarly to the traditional text model in languages like [C](https://en.wikipedia.org/wiki/C_(programming_language)), namely an ordered collection (array) of `Char` types. However, behind the scenes, a `String` in `Julia` or `Python` holds Unicode characters, each identified by a `Unicode codepoint`, an index that is conventionally written in `base 16`.

In [15]:
test_string_ascii = "Test String in Julia (notice the double quotes). Python uses both single and double quotes for Strings. 😒";

In [16]:
character_array_test_string_ascii = test_string_ascii |> collect

105-element Vector{Char}:
 'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 ⋮
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
 't': ASCII/Un
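
One consequence of the Unicode model is that the number of characters in a `String` can differ from the number of bytes used to store it. The sketch below uses a short, hypothetical test string and assumes Julia's default `UTF-8` encoding:

```julia
s = "sushi 🍣"                                    # six ASCII characters followed by one emoji
number_of_characters = length(s)                  # counts characters
number_of_bytes = ncodeunits(s)                   # counts bytes in the UTF-8 encoding
last_character = collect(s)[end]                  # the emoji, as a single Char
last_character_index = Int(codepoint(last_character))  # base 10 index of the emoji

println((number_of_characters, number_of_bytes, last_character, last_character_index))
```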

In [17]:
Unicode.julia_chartransform('µ') # maps the micro sign µ (U+00B5) to the equivalent Greek letter μ (U+03BC) used by the Julia parser

'μ': Unicode U+03BC (category Ll: Letter, lowercase)

### How do we calculate a `codepoint`?
The `codepoint` for `🍣` equals `U+1F363`. Compute the `base 10` index for the Unicode character `🍣`.

#### Solution
We can directly calculate the `base 10` index from the character by converting it to an `Int`. For `🍣`, the `Unicode.julia_chartransform(...)` function returns the character unchanged, so it is the conversion to `Int` that yields the index; `Int('🍣')` gives the same result. Let's store this value in the variable `sushi_char_index`:

In [18]:
sushi_char_index = Unicode.julia_chartransform('🍣') |> Int

127843

Now, let's check this value by going backward: from `base 10` back to the `codepoint`.
* Convert the `base 10` value to a `base 16` (hexadecimal) number and then convert that to the [Unicode](https://en.wikipedia.org/wiki/Unicode) index format: by convention, the hexadecimal value is left-padded with zeros to at least four digits, and then `U+` is prepended. For example, `c` is written `U+0063`, while `🍣` needs five digits.
* Hexadecimal numbers use the ten decimal digits $(0,1,\dots,9)$ and six extra symbols, the letters A, B, C, D, E, and F, where hexadecimal A = decimal 10, through hexadecimal F = decimal 15.
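
To make the positional meaning of the hexadecimal digits concrete, here is a small sketch that expands `1F363` as a sum of powers of `16`. The array `hex_digit_values` is written by hand for this example (with F replaced by its decimal value `15`):

```julia
# Digits of 1F363 as decimal values, most significant digit first: 1, F, 3, 6, 3
hex_digit_values = [1, 15, 3, 6, 3]

# Positional expansion: 1*16^4 + 15*16^3 + 3*16^2 + 6*16^1 + 3*16^0
number_of_digits = length(hex_digit_values)
expanded_value = sum(d * 16^(number_of_digits - k) for (k, d) ∈ enumerate(hex_digit_values))

println(expanded_value)
```

The result should match the `base 10` index computed for `🍣` above.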

In [19]:
hexidecimal_digits_dictionary = Dict{Int,Char}()
for i ∈ 0:15
    # adding an Int to a Char shifts its index, e.g., '0' + 5 is '5' and 'A' + 1 is 'B'
    if (i > 9)
        hexidecimal_digits_dictionary[i] = 'A' + (i - 10) # 10,...,15 map to 'A',...,'F'
    else
        hexidecimal_digits_dictionary[i] = '0' + i # 0,...,9 map to '0',...,'9'
    end
end
hexidecimal_digits_dictionary

Dict{Int64, Char} with 16 entries:
  5  => '5'
  12 => 'C'
  8  => '8'
  1  => '1'
  0  => '0'
  6  => '6'
  11 => 'B'
  9  => '9'
  14 => 'E'
  3  => '3'
  7  => '7'
  4  => '4'
  13 => 'D'
  15 => 'F'
  2  => '2'
  10 => 'A'

__Approach__:
* Step 1: Divide the given decimal number by 16 and write down the quotient and remainder
* Step 2: Divide the previous quotient by 16 and write down the quotient and remainder
* Step 3: Repeat steps 1 and 2 until the quotient equals zero.
* Step 4: Map all the remainder values to their corresponding hexadecimal equivalents 
* Step 5: Starting with the last value and moving to the first, write each of the hexadecimal values
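
The steps above can be sketched as a single function. This is one possible implementation, not necessarily the one used in the cells that follow; the function name `to_hexadecimal` and the lookup string `symbols` are hypothetical names introduced for this example:

```julia
# Sketch: convert a non-negative integer to its hexadecimal digit string
function to_hexadecimal(n::Int)::String
    n == 0 && return "0"                   # special case: zero has a single digit
    symbols = "0123456789ABCDEF"           # hexadecimal symbol for each remainder 0,...,15
    hex_chars = Char[]
    q = n
    while q != 0
        q, r = divrem(q, 16)               # Steps 1-3: quotient and remainder
        push!(hex_chars, symbols[r + 1])   # Step 4: map the remainder to its symbol
    end
    return String(reverse(hex_chars))      # Step 5: write the last remainder first
end

println(to_hexadecimal(127843))
```

Wrapping the loop in a function keeps the working variables `q` and `r` local, so the same code behaves identically in a notebook and in a script.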

In [20]:
q = sushi_char_index; 
remainder_array = Array{Int64,1}();
while (q != 0)
    r = rem(q,16)
    q = div(q,16)
    push!(remainder_array,r)
end

In [21]:
my_code_point = "";
for i ∈ reverse(remainder_array)
    tmp = hexidecimal_digits_dictionary[i];
    my_code_point *= tmp |> Char;
end

# left pad with zeros to at least 4 digits, then prepend "U+"
my_code_point = lpad(my_code_point, 4, '0') |> x-> "U+"*x

"U+1F363"
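
As a cross-check, the same label can be produced with built-in functions. This is a sketch that uses `codepoint`, `string` (with its `base` and `pad` keyword arguments), and `parse`; the variable names are illustrative:

```julia
sushi = '🍣'

# Character -> hexadecimal label
built_in_code_point = "U+" * uppercase(string(codepoint(sushi), base=16, pad=4))
println(built_in_code_point)

# Hexadecimal digits -> base 10 index -> character
recovered_index = parse(Int, "1F363", base=16)
recovered_character = Char(recovered_index)
println((recovered_index, recovered_character))
```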