# Text Analysis with Ancient and Medieval Languages<br><br>Day 02:<br>The Problems of Under-Resourced Languages (and general solutions)

<center>Dr. William Mattingly<br>
TAP Institute with JSTOR</center>

In [1]:
!pip install fasttext



## The Problems of NLP in Under-Resourced Languages

For those of us working with under-resourced languages, such as medieval and ancient languages, we often have some shared problems. These problems often fit into two categories: lack of textual data and lack of structured training data. Both of these are needed to perform various NLP tasks. In this notebook, I address these two categories and provide conceptual solutions to these problems.

## Getting the Data from PDF to Text

In order to create an NLP model for any language, you need textual data. Certain NLP tasks can be done with unstructured raw text, but even getting that can be a challenge. Though you may not have raw text, you may have many PDFs or scans of texts lying around. In this section, I will provide some code and some steps for converting image data into raw text via Python. To do this, we will engage in OCR, or Optical Character Recognition. OCR allows us to read an image (which is just a collection of numbers) and convert those numbers that represent pixels into meaningful text.

To demonstrate this, we will be working with the following image, which is a scan from a critical edition of Alcuin's letters.

<img src="data/sample_mgh.JPG"
     alt="Scan of a page from a critical edition of Alcuin's letters"
     style="float: left; margin-right: 10px;" />

The text is in Latin and contains several common problems when trying to convert a critical edition into raw text. First, it has marginalia on the left and right. On the left hand side, the marginalia detail scriptural quotations or allusions. On the right hand side, we have the lines denoted. Both of these throw off OCR results.

Second, we have header data, such as pagination and edition markings.

Third, we have footer data, specifically the critical apparatus and the footnotes.

If I want to produce a clean OCR of this text, I need a way to remove all that data automatically. Because these are scans, the position of the marginalia can shift by as much as 20-50 pixels from one image to the next. Because footnotes vary in quantity, each page will have footer data in different locations. These problems prevent me from writing a fixed set of rules to automate the removal of these things from the image. Fortunately, I can turn to OpenCV, a computer vision library with Python bindings.

Through the code below, I can manipulate this image, identify structure, and then convert the entire document into raw text via PyTesseract, which acts as a wrapper between Python and Tesseract, an open-source OCR engine whose development was sponsored by Google for many years.

Finally, I can do some post-processing to clean the text and get something that is fairly accurate (with a few typographical errors). The result is good enough for machine learning purposes to start working with textual data. Furthermore, these scripts would allow me to automate this process and convert many critical editions from this same publisher into good quality OCR in a matter of hours.

In other words, in order to solve the problem of no textual data, you need to become familiar with automating OCR via Python.
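
The automation described above could be organized along the following lines. This is only a sketch: `ocr_directory` and `ocr_page` are hypothetical names, and `ocr_page` stands in for whatever function wraps the cropping and OCR steps used later in this notebook.

```python
from pathlib import Path


def ocr_directory(image_dir, ocr_page, pattern="*.JPG"):
    """Run `ocr_page` on every matching image and return {filename: text}.

    `ocr_page` is a hypothetical callable that takes a file path and
    returns the recognized text for that page.
    """
    results = {}
    # Sort so that pages are processed in a stable, predictable order.
    for path in sorted(Path(image_dir).glob(pattern)):
        results[path.name] = ocr_page(path)
    return results
```

Passing the OCR step in as a function keeps the loop independent of any one publisher's layout, so the same loop can be reused when the cropping rules change.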

In [2]:
#Source for this section is another TAP Institute Course by Hannah Jacobs => https://hub.binder.constellate.org/user/hlj24-tapi_2021_ocr-ncsnihj6/notebooks/01-WhatIsOCR.ipynb

In [3]:
# Install tesseract on Binder.
# The exclamation runs the command as a terminal command.
# This may take 1-2 minutes.
# Source: Nathan Kelber & JStor Labs Constellate team.
!conda install -c conda-forge -y tesseract

'conda' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
!wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
!mv eng.traineddata /srv/conda/envs/notebook/share/tessdata/eng.traineddata

'wget' is not recognized as an internal or external command,
operable program or batch file.
'mv' is not recognized as an internal or external command,
operable program or batch file.


In [5]:
# Import the pytesseract library, which will run the OCR process.
import pytesseract

In [6]:
import cv2
image = cv2.imread("data/sample_mgh.JPG")
base_image = image.copy()

In [7]:
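def find_footnote_line(image, base_image):
    """Return [x, y, w, h] boxes for wide, thin contours (horizontal rules), top to bottom."""
    # Grayscale, blur, and binarize (inverted) so that ink becomes white on black.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Dilate with a narrow, tall kernel (3 wide, 10 high) so nearby marks merge.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 10))
    dilate = cv2.dilate(thresh, kernel, iterations=1)
    # findContours returns 2 or 3 values depending on the OpenCV version.
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    # Sort contours by the y-coordinate of their bounding box (top to bottom).
    cnts = sorted(cnts, key=lambda x: cv2.boundingRect(x)[1])
    main_line = []
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        # Keep only boxes that are short (< 25 px) and wide (> 250 px).
        if h < 25 and w > 250:
            # Uncomment to draw the detected box on the image for debugging.
            # cv2.rectangle(image, (x,y), (x+w, y+h), (36, 255, 12), 2)
            main_line.append([x,y,w,h])
    cv2.imwrite("data/sample_boxes.png", image)
    return main_line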

In [8]:
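main_line = find_footnote_line(image, base_image)
# The first (top-most) match is assumed here to be the rule above the footnotes.
x,y,w,h = main_line[0]
# Keep everything above that rule, trimming 25 px from the top and right edges.
new = base_image[25:y, 0:-25]
cv2.imwrite("data/extraction.png", new)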

True
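
The size test inside `find_footnote_line` is what singles out a horizontal rule: a box that is very wide but only a few pixels tall. Below is a minimal sketch of that filter on plain tuples, without OpenCV. The function name `pick_rule_boxes` and the box values are invented for illustration.

```python
def pick_rule_boxes(boxes, max_height=25, min_width=250):
    """Return wide, short boxes (likely horizontal rules), top-most first.

    Each box is an (x, y, w, h) tuple, the layout cv2.boundingRect returns.
    """
    rules = [b for b in boxes if b[3] < max_height and b[2] > min_width]
    return sorted(rules, key=lambda b: b[1])


# Made-up boxes: a text block, a thin rule, and a small marginal number.
boxes = [(40, 100, 900, 600), (40, 720, 300, 4), (960, 200, 30, 20)]
rule = pick_rule_boxes(boxes)[0]
print(rule[1])  # 720, the y-coordinate at which the page would be cut
```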

In [9]:
def find_body(image, base_image):
    """Return the crop of `base_image` for the last block in `image` taller than 200 px and wider than 250 px."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5,5), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # A 3x50 kernel smears text vertically so the lines of the body fuse into one block.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 50))
    dilate = cv2.dilate(thresh, kernel, iterations=1)
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    cnts = sorted(cnts, key=lambda x: cv2.boundingRect(x)[1])
    roi = None  # stays None if no block is large enough
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        if h > 200 and w > 250:
            # Coordinates come from `image`; the crop is offset if `image` was itself cut from `base_image`.
            roi = base_image[y:y+h, x:x+w]
            # Uncomment to draw the detected box on the image for debugging.
            # cv2.rectangle(image, (x,y), (x+w, y+h), (36, 255, 12), 2)
    cv2.imwrite("data/body_text.png", image)
    return roi

In [10]:
final = find_body(new, base_image)
cv2.imwrite("data/final.jpg", final)

True

In [11]:
def find_paras(image, base_image):
    """Return the crop of the last tall, wide paragraph block found in `image`."""
    # Crop from a clean copy, since rectangles are drawn onto `image` below.
    base_image = image.copy()
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Repeated dilation with a small square kernel grows text into paragraph-sized blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (4, 4))
    dilate = cv2.dilate(thresh, kernel, iterations=10)
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    cnts = sorted(cnts, key=lambda x: cv2.boundingRect(x)[1])
    roi = None  # stays None if no block is large enough
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        if h > 200 and w > 600:
            roi = base_image[y:y+h, x:x+w]
            cv2.rectangle(image, (x,y), (x+w, y+h), (36, 255, 12), 2)
    return roi

In [12]:
last = find_paras(final, base_image)
cv2.imwrite("data/last.jpg", last)

True

In [13]:
# lang="lat" selects Tesseract's Latin model, which requires lat.traineddata to be installed.
ocr_result = pytesseract.image_to_string(last, lang="lat")
print (ocr_result)

DILECTISSIMO* AMICO TOTIUS PROSPERITATIS PRAESENTIS ET AETERNAE
BEATITUDINIS PERPETUAM SALUTEM.

Magna mihi laetitia est de bona voluntate vestra, quam audivi a fratre nostro
Benedicto! in vobis esse. Opto atque Deum deprecor, ut citius cum omni convenien-
tia perficiatur. Seriptum est enim: 'Ne tardes converti ad dominum Deum; quia
nescis, quid ventura pariat dies, Erue te de harum carcere tribulationum, quae in
hoe mundo fidelium animos torquere solent"; sicut scriptum est: *Multae tribulationes
iustorum; ut, quod sequitur, tibi evenire merearis: 'Sed de his omnibus liberavit eos
Dominus. Et cave diligentissime, ne qua te, aratrum Domini tenentem, iniustitia
retro revocet. Nemo miles sarcinis alienis onustus ad bella bene procedit, nisi armis
tantummodo victrieibus, vel ad defensionem sui vel ad laesionem adversarii.

Omnia quae vobis demandare necessaria videbantur mihi fidelissimo fratri Bene-
dieto dixi: loca, adiutorium et animi constantiam.

Sed scire debes, quod in omni loco, u

In [14]:
sections = ocr_result.split("\n\n")
print (len(sections))

4


In [15]:
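final_sections = []
for sec in sections:
    # Rejoin words hyphenated across a line break, then unwrap the remaining lines.
    sec = sec.replace("-\n", "")
    sec = sec.replace("\n", " ")
    # Tidy spacing before punctuation, replace asterisks with spaces, and unify quote marks.
    sec = sec.replace(" ,", ",").replace(" .", ".").replace(" ;", ";").replace("*", " ").replace("\"", "\'")
    # Collapse any runs of spaces left behind.
    while "  " in sec:
        sec = sec.replace("  ", " ")
    final_sections.append(sec)
cleaned_text = "\n\n".join(final_sections)
print (cleaned_text)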

DILECTISSIMO AMICO TOTIUS PROSPERITATIS PRAESENTIS ET AETERNAE BEATITUDINIS PERPETUAM SALUTEM.

Magna mihi laetitia est de bona voluntate vestra, quam audivi a fratre nostro Benedicto! in vobis esse. Opto atque Deum deprecor, ut citius cum omni convenientia perficiatur. Seriptum est enim: 'Ne tardes converti ad dominum Deum; quia nescis, quid ventura pariat dies, Erue te de harum carcere tribulationum, quae in hoe mundo fidelium animos torquere solent'; sicut scriptum est: Multae tribulationes iustorum; ut, quod sequitur, tibi evenire merearis: 'Sed de his omnibus liberavit eos Dominus. Et cave diligentissime, ne qua te, aratrum Domini tenentem, iniustitia retro revocet. Nemo miles sarcinis alienis onustus ad bella bene procedit, nisi armis tantummodo victrieibus, vel ad defensionem sui vel ad laesionem adversarii.

Omnia quae vobis demandare necessaria videbantur mihi fidelissimo fratri Benedieto dixi: loca, adiutorium et animi constantiam.

Sed scire debes, quod in omni loco, ubi hom

## Introduction to Word Embeddings

Word vectors, or word embeddings, take a one-dimensional bag of words and give it multidimensional meaning by representing each word in a higher-dimensional space. This is achieved through machine learning and can be done easily with Python libraries such as Gensim or FastText, which we will explore more closely later in this notebook.

The goal of word vectors is to achieve numerical understanding of language so that a computer can perform more complex tasks on that corpus. Let’s consider the example above. How do we get a computer to understand 2 and 6 are synonyms or mean something similar? One option you might be thinking is to simply give the computer a synonym dictionary. It can look up synonyms and then know what words mean. This approach, on the surface, makes perfect sense, but let’s explore that option and see why it cannot possibly work.

For the example below, we will be using the Python library PyDictionary which allows us to look up definitions and synonyms of words.

In [16]:
!pip install PyDictionary



In [17]:
from PyDictionary import PyDictionary

dictionary=PyDictionary()
text = "Tom loves to eat chocolate"

words = text.split()
for word in words:
    syns = dictionary.synonym(word)
    print (f"{word}: {syns[0:5]}\n")

Tom: ['Felis domesticus', 'tomcat', 'domestic cat', 'gib', 'house cat']

loves: ['amorousness', 'caring', 'lovingness', 'agape', 'adoration']

to: ['digitizer', 'data converter', 'digitiser', 'analog-digital converter']

eat: ['consume', 'garbage down', 'eat up', 'gluttonize', 'take in']

chocolate: ['drinking chocolate', 'drink', 'drinkable', 'potable', 'beverage']



Even with this simple sentence, the results are comically bad. Why? The reason is that synonym substitution, a common method of data augmentation, does not take into account differences in meaning and usage among synonyms. I do not believe anyone would think “Felis domesticus”, the Latin name of the common house cat, would be an adequate substitution for the name Tom. Nor is “garbage down” really a proper synonym for eat.

Perhaps, then, we could use synonyms to find cross-terms, that is, terms that appear in both synonym sets.

In [18]:
from PyDictionary import PyDictionary

dictionary=PyDictionary()

words  = ["like", "love"]
for word in words:
    syns = dictionary.synonym(word)
    print (f"{word}: {syns[0:5]}\n")

like: ['love', 'prefer', 'enjoy', 'cotton', 'care for']

love: ['amorousness', 'caring', 'lovingness', 'agape', 'adoration']



This, as we can see, has some potential to work, but it is not entirely reliable, and working with such lists would be computationally expensive. For both of these reasons, word vectors are preferred. The reason? They are learned by the computer from corpora for a specific task. Further, they are numerical in nature (not a dictionary of words), meaning the computer can process them more quickly.
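
The cross-term idea can be sketched as a set overlap. The helper `shared_terms` is a hypothetical name, and the two lists are simply the first five synonyms printed above.

```python
def shared_terms(syns_a, syns_b):
    """Return the synonyms that two words have in common."""
    return set(syns_a) & set(syns_b)


# The first five synonyms printed above for each word.
like_syns = ["love", "prefer", "enjoy", "cotton", "care for"]
love_syns = ["amorousness", "caring", "lovingness", "agape", "adoration"]

print(shared_terms(like_syns, love_syns))  # set(): no cross-terms in the top five
print("love" in like_syns)                 # True: "like" lists "love" directly
```

Repeating this comparison for every pair of words in a vocabulary grows quadratically with its size, which is one way to see why the approach becomes expensive.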

Word vectors have a preset number of dimensions. These dimensions are honed via machine learning. Models take into account how frequently words occur across a corpus and which other words appear in similar contexts. This allows the computer to determine the similarity of words numerically. It then needs to represent these relationships numerically. It does this through the vector, an ordered list of floats (decimal numbers). The number of dimensions is the number of floats in that list.

Below is a pretrained model’s output of word vectors for Holocaust documents. This is how the word “know” looks in vectors:

know -0.19911548 -0.27387282 0.04241912 -0.58703226 0.16149549 -0.08585547 -0.10403373 -0.112367705 -0.28902963 -0.42949626 0.051096343 -0.04708015 -0.051914077 -0.010533272 -0.23334776 0.031974062 -0.015784053 -0.21945408 0.07359381 0.04936823 -0.15373217 -0.18460844 -0.055799782 -0.057939123 0.14816307 -0.46049833 0.16128318 0.190906 -0.29180774 -0.08877125 0.23563664 -0.036557104 -0.23812544 0.21938106 -0.2781296 0.5112853 0.049084224 0.14876273 0.20611146 -0.04535578 -0.35051352 -0.26381743 0.20824358 0.29732847 -0.013382204 -0.19970295 -0.34890386 -0.16214448 -0.23497184 0.1656344 0.15815939 0.012848561 -0.22887675 -0.21618247 0.13367777 0.1028471 0.25068823 -0.13625076 -0.11771541 0.4857257 0.102198474 0.06380113 -0.22328818 -0.05281015 0.0059655504 0.095453635 0.39693353 -0.066147 -0.1920163 0.5153346 0.24972811 -0.0076305643 -0.05530072 -0.24668717 -0.074051596 0.29288396 -0.0849124 0.37786478 0.2398532 -0.10374063 0.5445305 -0.41955113 0.39866814 -0.23992492 -0.15373677 0.34488577 -0.07166888 -0.48001364 0.0660652 0.061260436 0.32197484 -0.12741785 0.024006622 -0.07915035 -0.04467735 -0.2387938 -0.07527494 0.07079664 0.074456714 0.17877163 -0.002122373 -0.16164272 0.12381973 -0.5908519 0.5827627 -0.38076186 0.095964395 0.020342976 -0.5244792 0.24467848 -0.12481717 0.2869162 -0.34473857 -0.19579992 -0.18069582 0.015281798 -0.18330036 -0.08794056 0.015334953 -0.5609912 0.17393902 0.04283724 -0.07696586 0.2040299 0.34686008 0.31219167 0.14669564 -0.26249585 -0.42771882 0.5381632 -0.123247474 -0.29142144 -0.29963812 -0.32800657 -0.10684048 -0.08594837 0.19670585 0.13474767 0.18349588 -0.4734125 0.15554792 -0.21062694 -0.14191462 -0.12800062 0.2053445 -0.05258381 0.10878109 0.56381494 0.22724482 -0.17778987 -0.061046753 0.10789692 -0.015310492 0.16563527 -0.31812978 -0.1478078 0.4323269 -0.2543924 -0.25956103 0.38653126 0.5080214 -0.18796602 -0.10318089 0.023921987 -0.14618908 0.22923793 0.37690258 0.13323267 -0.34325415 -0.048353776 -0.30283198 -0.2839813 
-0.2627738 -0.07422618 -0.31940162 0.38072023 0.56700015 -0.023362642 -0.3786432 0.084006436 0.0729958 0.09483505 -0.2665334 0.12699558 -0.37927982 -0.39073908 0.0063185897 -0.34464878 -0.24011964 0.09303968 -0.15488827 -0.018486138 0.3560308 -0.26005003 0.089302294 0.116130605 0.07684872 -0.085253105 -0.28178927 -0.17346472 -0.20008522 0.004347025 0.34192443 0.017453942 0.06926512 -0.15926014 -0.018554512 0.18478563 -0.040194467 0.38450953 0.4104423 -0.016453728 0.013374495 -0.011256633 0.09106963 0.20074937 0.17310189 -0.12467103 0.16330549 -0.0009963055 0.12181527 -0.05295286 -0.0059491103 -0.04697837 0.38616535 -0.21074814 -0.32234505 0.47269863 0.27924335 0.13548143 -0.2677968 0.03536313 0.3248672 0.2062973 0.29093853 0.1844036 -0.43359983 0.025519002 -0.06319317 -0.2427806 -0.22732906 0.08803728 -0.041860744 -0.151291 0.3400458 -0.29143015 0.25334117 0.06265491 0.26399022 -0.20121849 0.22156847 -0.50599706 0.069224015 0.52325517 -0.34115726 -0.105219565 -0.37346402 -0.02126528 0.09619415 0.017722093 -0.3621799 -0.109912336 0.021542747 -0.13361925 0.2087667 -0.08780184 0.09494446 -0.25047818 -0.07924239 0.21750642 0.2621652 -0.52888566 0.081884995 -0.20485449 0.18029206 -0.5623824 -0.03897387 0.3213515 0.057455678 -0.26524526 0.14741589 0.1257589 0.04708992 0.026751317 -0.014696863 -0.11038961 0.004459205 -0.01394376 0.091146186 -0.15486309 0.20662159 -0.0987916 -0.07740813 0.009704136 0.28866896 0.3916269 0.35061485 0.31678385 0.43233085 0.44510433

For these vectors, I used the industry-standard of 300 dimensions. We see each of these dimensions represented by each of the floats, separated by whitespace. As the model passes over the corpus it is being trained on, it hones these numbers and changes them for each word. Over multiple epochs, or generations, it gains a clearer sense of the similarity of words, or at least words that are used in similar contexts.
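
A line in this format can be read back into a word and a list of floats, and two such lists can then be compared numerically. Here is a minimal sketch, assuming cosine similarity as the measure (a common choice for word vectors). The function names and the three-dimensional vectors are invented for illustration and do not come from a real model.

```python
import math


def parse_vector_line(line):
    """Split a 'word f1 f2 ...' line into the word and a list of floats."""
    word, *values = line.split()
    return word, [float(v) for v in values]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (invented, not taken from a trained model).
_, know = parse_vector_line("know 0.2 0.9 0.1")
_, knew = parse_vector_line("knew 0.25 0.8 0.15")
_, camp = parse_vector_line("camp 0.9 -0.1 0.4")

print(cosine_similarity(know, knew) > cosine_similarity(know, camp))  # True
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score closer to 0.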

Once a word vector model is trained, we can do similarity matches very quickly and very reliably. I work primarily with documents on the Holocaust and human rights abuses. For this reason, I will use a word vector model that I have trained on Holocaust documents. Consider the term "concentration camp". Let’s now use these word vectors to find the 10 most similar words to concentration camp.

In [19]:
[
    ('extermination_camp', 0.5768706798553467),
    ('camp', 0.5369070172309875),
    ('Flossenbiirg', 0.5099129676818848),
    ('Sachsenhausen', 0.5068483948707581),
    ('Auschwitz', 0.48929861187934875),
    ('Dachau', 0.4765608310699463),
    ('concen', 0.4753464460372925),
    ('Majdanek', 0.4740387797355652),
    ('Sered', 0.47086501121520996),
    ('Buchenwald', 0.4692303538322449)
]

[('extermination_camp', 0.5768706798553467),
 ('camp', 0.5369070172309875),
 ('Flossenbiirg', 0.5099129676818848),
 ('Sachsenhausen', 0.5068483948707581),
 ('Auschwitz', 0.48929861187934875),
 ('Dachau', 0.4765608310699463),
 ('concen', 0.4753464460372925),
 ('Majdanek', 0.4740387797355652),
 ('Sered', 0.47086501121520996),
 ('Buchenwald', 0.4692303538322449)]

These are the items that are most similar to concentration camp in our word vectors. The tuple has two indices. Index 0 is the word and index 1 is the similarity, represented as a float.
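
Working with such a list might look like the following. This is a small sketch that uses the first three results from above; the 0.52 cutoff is an arbitrary value chosen for illustration.

```python
# First three results from the list above, as (word, similarity) tuples.
results = [
    ("extermination_camp", 0.5768706798553467),
    ("camp", 0.5369070172309875),
    ("Flossenbiirg", 0.5099129676818848),
]

# Index 0 is the word, index 1 is the similarity score.
top_word, top_score = results[0]
print(top_word, round(top_score, 3))  # extermination_camp 0.577

# Keep only the matches above the cutoff.
strong = [word for word, score in results if score > 0.52]
print(strong)  # ['extermination_camp', 'camp']
```

Note that the fastText output later in this notebook puts the pair in the opposite order, with the similarity first and the word second.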

Extermination camp is not a direct synonym, as it carries a distinction in what happened to prisoners, i.e. execution; however, the two are very similar. Seeing this as the most similar word is a sign that the word vectors are well-aligned. Camp is expected, as it is a single word that has a similar meaning in context to concentration camp. The remainder of this list are proper nouns, all of which were concentration camps, with one exception: “concen”. This is clearly a result of poor cleaning. Concen is not a word but, most likely, a fragment of concen-tration left over from a hyphenated line break. The fact that it appears here is also a good sign that our word vectors have aligned well enough to place such errors in nearby vector space.

Let’s do something similar with Auschwitz.

In [20]:
[
    ('Auschwitz_Birkenau', 0.6649479866027832),
    ('Birkenau', 0.5385118126869202),
    ('subcamp', 0.5343026518821716),
    ('camp', 0.533636748790741),
    ('III', 0.5323576927185059),
    ('stutthof', 0.518073320388794),
    ('Ravensbriick', 0.5084848403930664),
    ('Berlitzer', 0.5083401203155518),
    ('Malchow', 0.5051567554473877),
    ('Oswiecim', 0.5016494393348694)
]

[('Auschwitz_Birkenau', 0.6649479866027832),
 ('Birkenau', 0.5385118126869202),
 ('subcamp', 0.5343026518821716),
 ('camp', 0.533636748790741),
 ('III', 0.5323576927185059),
 ('stutthof', 0.518073320388794),
 ('Ravensbriick', 0.5084848403930664),
 ('Berlitzer', 0.5083401203155518),
 ('Malchow', 0.5051567554473877),
 ('Oswiecim', 0.5016494393348694)]

As we can see, the words closest to Auschwitz are places associated with Auschwitz, such as Birkenau, subcamps (of which Auschwitz had many), other concentration camps (such as Ravensbrück, which appears in the corpus as the OCR error Ravensbriick), and the location of the Auschwitz memorial, Oswiecim.

In other words, we have words closely associated with Auschwitz in particular.

In [21]:
import fasttext

In [22]:
model = fasttext.train_unsupervised("data/alice.txt")

In [23]:
def find_matches(search_word, model, k=10):
    res =  model.get_nearest_neighbors(search_word, k=k)
    return (res)

In [24]:
find_matches("known", model)

[(0.999941885471344, 'know,'),
 (0.9999399781227112, "know,'"),
 (0.99993497133255, 'shouted'),
 (0.9999307990074158, 'grown'),
 (0.9999297857284546, 'crowded'),
 (0.9999275207519531, 'followed'),
 (0.9999272227287292, 'know'),
 (0.9999259114265442, 'meaning'),
 (0.9999238848686218, 'something'),
 (0.9999237060546875, "know.'")]

In [25]:
latin_model = fasttext.train_unsupervised("data/100.txt")

In [26]:
find_matches("Carolus", latin_model, k=10)

[(0.9504625797271729, 'Carolus,'),
 (0.9000089168548584, 'Carolum'),
 (0.8991857767105103, 'Caroli'),
 (0.8934779763221741, 'Carolum,'),
 (0.8916164040565491, 'Carolo'),
 (0.8890672326087952, 'Carolo,'),
 (0.8861714601516724, 'Ludovicus'),
 (0.8721418380737305, 'imperator'),
 (0.8589209914207458, 'Aldricus'),
 (0.8531587719917297, 'Offa')]

In [27]:
def find_relationships(base_word, is_to, second_word, model):
    as_blank = model.get_analogies(base_word, is_to, second_word)
    return (as_blank)

In [28]:
find_relationships("Carolus", "rex", "abbas", latin_model)

[(0.9006853699684143, 'Ferrariensis'),
 (0.900662899017334, 'Sigulfus'),
 (0.8969728350639343, 'Ferrariensi'),
 (0.8933336138725281, 'Lupi'),
 (0.8929973244667053, 'Parisiensis'),
 (0.8918678760528564, 'Sigulfo'),
 (0.8896274566650391, 'Turonense'),
 (0.8875166773796082, 'Turonensis'),
 (0.8857744932174683, 'Eborac.'),
 (0.8843535780906677, 'Fuldensi')]

## Cultivating Training Data

I will be demonstrating this outside of the notebook with Prodigy because this is proprietary software.

## Encoding Issues

Another issue under-resourced languages frequently face is something collectively known as encoding issues. To understand why this is a problem, you need to understand a bit about how text is represented in a computer.

Text as we see it is not how a computer processes that information. A computer understands a character that represents a letter numerically. This number is understood as a combination of 0s and 1s which is how all computers with the exception of quantum computers process data. This is due to how computers are structured at the hardware level, that is, as a series of gates that are either on or off--0 or 1.

Binary can represent complex things. Through different types of gates it can do addition and subtraction, and once you can do addition and subtraction, you can do pretty much anything. To see this, think about multiplication. At the most basic level, a computer does not need to treat 5 x 10 as a single problem; it can simply add 5 ten times.
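
The idea that multiplication reduces to addition can be sketched in a few lines. This is a conceptual illustration only (the function name is invented), and real processors use far more efficient circuits.

```python
def multiply_by_addition(a, times):
    """Multiply two non-negative integers using only addition."""
    total = 0
    for _ in range(times):
        total += a
    return total


print(multiply_by_addition(5, 10))  # 50
```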

Expressing numbers in binary is fairly straightforward: you convert from our base-10 system to a base-2 system. Imagine if you had a number: 14. We use two digits to express this numerical value. The first digit from the right is 4. This is the ones place, which denotes 4 single items of something. The next digit is in the tens place, and it denotes that 4 must be added to 1 x 10. This is how we use base 10 to arrive at 14.

In binary, we do the same thing, but instead of a 1s place, a 10s place, and so on, each place is worth a power of 2 (1, 2, 4, 8, and so on) and can hold only a 0 or a 1.

To express 14 in binary, we need 4 places. In binary 14 is 1110. Why do we need four places? Because 14 must be expressed in base-2. To understand this better, let's look at the number 2.

2 in binary is 10.

It is 0 in the initial (ones) position and 1 in the second (twos) position. So, it is (0 x 1) + (1 x 2). This results in 2 in base-10.

Let's return to 14 or 1110.

Imagine that it is the same as writing (in reverse order) (0x1)+(1x2)+(1x4)+(1x8). Each place is worth a power of 2, rather than a power of 10, because we are in base-2. Go ahead and try that base-10 math on your calculator. You get 0+2+4+8. If you add that together, you get 14.
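
The same place-value arithmetic can be checked in Python. The helper `binary_to_decimal` is written out by hand for illustration, while `bin` and `int` are Python built-ins that do the conversion directly.

```python
def binary_to_decimal(bits):
    """Sum the place values of a binary string, working right to left."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total


print(binary_to_decimal("10"))    # 2
print(binary_to_decimal("1110"))  # 14
print(bin(14))                    # 0b1110
print(int("1110", 2))             # 14
```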

Here's the problem. We can easily represent numbers in this fashion, but how do we represent characters like letters? The answer is a bit complicated. We need to represent them as numbers, and we do that using bytes, or groups of 8 bits.

To represent the letter "a", for example, we can use the binary => 01100001. Numerically, this corresponds to the number 97. With 8 bits, we can have a max value of 11111111, which is 255, giving 256 different values (0 through 255).
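
These values can be confirmed with Python built-ins. The variable names are only for illustration.

```python
letter_number = ord("a")                # the number behind the letter
letter_bits = format(letter_number, "08b")  # that number as 8 binary digits

print(letter_number)      # 97
print(letter_bits)        # 01100001
print(chr(0b01100001))    # a
print(0b11111111)         # 255, the largest value 8 bits can hold
print(2 ** 8)             # 256 distinct values, counting 0
```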

Early in the development of computers, these values were used to represent text through what is known as ASCII, the American Standard Code for Information Interchange. ASCII itself uses only 7 of the 8 bits, which gives it 128 characters. To see how this worked, check out the image below.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/USASCII_code_chart.png/1280px-USASCII_code_chart.png"
     alt="ASCII code chart"
     style="float: left; margin-right: 10px;" />

While ASCII was revolutionary, it introduced problems, especially as those outside of America began to use computers for text purposes. In Europe, there were characters not found in the English alphabet. These could be added in, but as speakers of more languages began to use computers, encoding all of their characters through ASCII proved impossible. This was particularly true in Asia, especially Japan, where characters, not letters, needed to be represented.

You can now see the mounting problem. Different countries invented their own encoding methods to suit their needs. And this worked until you needed to send a document from Japan to England or from England to Germany. The documents would not be decoded correctly and would come out illegible.
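
The failure described above is easy to reproduce. The sketch below encodes a German word with one single-byte encoding and decodes it with another. The choice of Latin-1 and CP437 is mine, purely for illustration, and any two incompatible encodings would show the same effect.

```python
text = "Gr\u00fc\u00dfe"  # the German word "Grüße", written with escapes

# Encode with one single-byte encoding (Latin-1, common in Western Europe)...
raw = text.encode("latin-1")

# ...and decode with a different one (CP437, the original IBM PC code page).
garbled = raw.decode("cp437")

print(garbled == text)                 # False: the accented letters are wrong
print(raw.decode("latin-1") == text)   # True: the right encoding recovers the text
print(garbled)
```

The bytes never changed; only the table used to interpret them did.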

Enter the internet of the 80s and the mass internet boom of the 90s and you can now see the increasing problem.

Who will save us from this disaster?

Enter... Unicode, or the Universal Coded Character Set.

It is a universal standard that can account for 1,111,998 characters, everything from "j" to "é" to 😀. Unicode solved problems. But now a new problem exists for those of us working with ancient or medieval languages. Our texts may be encoded in an earlier pre-Unicode encoding. Or they may not be normalized, meaning the same visible character can be stored in more than one way.
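
Each Unicode character has a number, called a code point, and an encoding such as UTF-8 decides how many bytes store it. Here is a brief sketch for the three characters just mentioned. The characters are written as escape sequences so that the code does not depend on how this file itself is encoded.

```python
# "j", "é", and the grinning-face emoji, written as escapes.
chars = ["j", "\u00e9", "\U0001F600"]

for char in chars:
    code_point = ord(char)             # the character's Unicode number
    utf8_bytes = char.encode("utf-8")  # how UTF-8 stores it
    print(f"U+{code_point:04X}", len(utf8_bytes), "byte(s)")

# U+006A 1 byte(s)
# U+00E9 2 byte(s)
# U+1F600 4 byte(s)
```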

These are largely two different problems. Let's look at normalization with Python's built-in unicodedata module.

In [29]:
from unicodedata import normalize

<img src="https://miro.medium.com/max/2400/1*_irZeGalg3lwm2pQoHtUFw.png"
     alt="Markdown Monster icon"
     style="float: left; margin-right: 10px;" />

In [30]:
caesar = "Galliā est omnīs dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostra Gallī appellantur."

In [31]:
print (caesar)

Galliā est omnīs dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostra Gallī appellantur.


In [33]:
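# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with errors ignored then drops the marks.
decoded_caesar = normalize('NFKD', caesar).encode('ascii', 'ignore').decode('ascii')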

In [34]:
print (decoded_caesar)

Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
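
To see why this strips the macrons, it helps to look at a single letter. The sketch below builds "ā" in two ways, as one precomposed code point and as "a" followed by a combining macron, and shows how normalization moves between them. The variable names are for illustration only.

```python
from unicodedata import normalize

composed = "\u0101"     # "ā" as one precomposed code point
decomposed = "a\u0304"  # "a" followed by a combining macron

print(composed == decomposed)                    # False: different code points
print(normalize("NFC", decomposed) == composed)  # True: NFC composes
print(normalize("NFD", composed) == decomposed)  # True: NFD decomposes
print(len(composed), len(decomposed))            # 1 2

# Once decomposed, dropping the non-ASCII mark leaves the bare letter.
stripped = normalize("NFKD", composed).encode("ascii", "ignore").decode("ascii")
print(stripped)  # a
```

Two texts that look identical on screen can therefore fail to match in a search unless both have been normalized to the same form first.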
