<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `MultiLingual NER`

This is lesson `1` of 3 in the educational series on `Named Entity Recognition`. This notebook is intended to introduce `the basic problems one faces in multilingual texts`.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Basic file operations (open, close, read, write)

These should be general skills but can mention a particular library
```

**Knowledge Recommended:**
```
* Natural Language Processing
* spaCy
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand the complexities of multilingual corpora
2. Understand text encoding
3. Understand how to solve encoding-issues
4. Understand how to think about corpora-specific problems
5. Understand spaCy
6. Understand Named Entity Recognition (NER) as a concept
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* spaCy
* requests
* BeautifulSoup (bs4)
* unidecode

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!pip install requests
!pip install bs4
!pip install unidecode

In [1]:
# allows the embedding of YouTube videos
from IPython.display import HTML

# Introduction

```
In this notebook, we will be covering the major issues and challenges one can face when working with multilingual corpora. We will specifically address how these issues relate to the problem of named entity recognition (NER), a method of information extraction that relies upon natural language processing (NLP).

We will also cover the basics of spaCy and named entity recognition in general for those who do not have a background in either area.
```

# Introduction to Named Entity Recognition

## Natural Language Processing

Named entity recognition (addressed below) is a branch of natural language processing, better known as NLP. NLP is the process by which a researcher uses a computer system to parse human language and extract important metadata from texts. The purpose of NLP is to perform, among other things, distant reading.

Distant reading has a long history extending to the late twentieth century. It is commonly used when the quantity of texts in a given corpus prevents a researcher (or a team of researchers) from reading the corpus closely in its entirety. To make sense of that large corpus, the researcher will often pass certain tasks to a computer with the understanding that there is a margin of error. This margin of error is accepted in exchange for the ability to gain a larger, distant understanding of that corpus. Distant reading is used to perform several significant tasks, such as:

- sentiment analysis => understanding the sentiment of a text
- text classification => classifying texts into predetermined categories
- named entity recognition => extracting entities from a text

The metadata from these tasks can then be used to get a sense of the texts without reading them closely, hence the term distant reading.

The goal of NLP is to feed a text to a computer system and have it return some sort of output. This is often achieved through a series of pipelines that perform some operations on the data at hand.

Earlier pipelines may include a tokenizer, whose sole job is to break a text into individual tokens. Tokens are items in a text that have some linguistic meaning. They can be words, such as “Martha”, but they can also be punctuation marks, such as “,” in the appositive phrase “, a senior,”. Likewise, “n’t” in the contraction “can’t” would also be recognized as a token since “n’t” in English corresponds to the word “not”.
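To make the idea of a token concrete, here is a minimal sketch that uses only Python's standard library. The regular expression and the function name `toy_tokenize` are assumptions made for illustration. This is not how spaCy tokenizes; a real tokenizer relies on much richer, language-specific rules.

```python
import re

# Toy pattern (an assumption for illustration, not spaCy's rules), tried in order:
#   1. the contraction ending "n't"
#   2. the part of a word that comes right before "n't"
#   3. any other run of word characters
#   4. any single punctuation mark
TOKEN_PATTERN = re.compile(r"n['’]t\b|\w+(?=n['’]t\b)|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split a text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("Martha, a senior, can't play.")
print(tokens)
# ['Martha', ',', 'a', 'senior', ',', 'ca', "n't", 'play', '.']
```

Note that the commas and the final period each become their own token, and that the contraction is split into two tokens.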

A common pipeline after a tokenizer is a POS tagger, whose job is to identify the parts of speech, or POS, in the text. This is essential for the computer to understand how individual tokens are functioning in a sentence. The way in which we perform POS tagging on different languages is not the same. In inflected languages, such as German, or highly inflected languages, such as Latin or Ancient Greek, the ending of the word contains a lot of information about its role in the sentence, e.g. a nominative singular or dative plural. In weakly inflected languages, such as English, position in the sentence holds primacy. English is a Subject-Verb-Object (SVO) language. Let us consider an example sentence:

The boy took the ball to the store.

The nominative (subject), “boy”, comes first in the sentence, followed by the verb, “took”, then followed by the accusative (object), “ball”, and finally the dative (indirect object), “store”. The words “the” and “to” also contain vital information. “The” occurs three times and tells the reader, for example, that it’s not just any ball, it’s the ball; likewise, it’s not just a store, but the store. The period too tells us something important. This is a statement, not a question. For native speakers of a given language these parts of speech may go entirely unnoticed. We understand them intuitively. Some of us may have memories of memorizing parsing trees in 5th grade grammar, but for the most part we developed mentally and linguistically with our mother tongue in a unique way. We can use that language without thought of grammar. For those who have devoted time to learning a second language later in life, grammar is a necessity (and sometimes a bane) to learn. We do not learn languages later in life the same way we learn our mother tongue. For a computer, the same holds true. We need to allow the computer to understand parts of speech.

Named entity recognition will often come later in a pipeline because it needs to receive a tokenized text and, in some languages, it needs to understand a word’s POS to perform well. As a text moves through the pipeline, it receives spans that contain valuable information, such as part of speech. Once the text reaches the NER pipeline, it is time for the machine to make some structured decisions about individual tokens.

## Named Entity Recognition

**Entities** are words in a text that correspond to a specific type of data. They can be numerical, such as cardinal numbers; temporal, such as dates; nominal, such as names of people and places; and political, such as geopolitical entities (GPE). In short, an entity can be anything the designer wishes to designate as an item in a text that has a corresponding label.

Named entity recognition, or NER, is the process by which a system takes an input of unstructured data (a text) and outputs structured data, specifically the identification of entities. Let us consider this short example.

Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.

In this example, we have several potential entities. First, there is “Martha”. Different NER models will have different corresponding labels for such an entity, but PERSON or PER is considered standard practice. Note here that the label is capitalized. This is also standard practice. We also have a GPE, or Geopolitical Entity, notably “Spain”. Finally, we have a DATE entity, “05 June 2022”. These are standard labels that one can expect to extract from a text. If the domain at hand, however, has additional labels, those can be extracted as well. Perhaps the client or user wants to not only extract people, GPEs, and dates, but also sports. In such a scenario “basketball” could be extracted and given the label SPORT.

Not all entities are singular. As is common with texts, sometimes entities are **Multi-word Tokens, or MWT**. Let us consider the same sentence as above, but with one modification:

`Martha Thompson, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.`

Here, Martha now has a surname, “Thompson”. We can either extract Martha and Thompson as individual entities or, as is more common practice, extract both as a single entity, since “Martha Thompson” is a single individual. An NER system, therefore, should recognize “Martha Thompson” as a single MWT.
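The move from unstructured text to structured output can be sketched with a few hand-written patterns. Everything in this example (the `PATTERNS` list, the `toy_ner` function, and the patterns themselves) is a hypothetical illustration using only the standard library. A trained NER model learns to recognize entities from data rather than from a fixed list like this.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical hand-written patterns, one per label (for illustration only)
PATTERNS = [
    ("PERSON", re.compile(r"Martha(?: Thompson)?")),  # optional surname: one multi-word entity
    ("GPE", re.compile(r"Spain")),
    ("DATE", re.compile(rf"\d{{2}} (?:{MONTHS}) \d{{4}}")),
    ("SPORT", re.compile(r"basketball")),
]


def toy_ner(text):
    """Return (entity text, start, end, label) tuples in order of appearance."""
    entities = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.group(), match.start(), match.end(), label))
    return sorted(entities, key=lambda entity: entity[1])


sentence = ("Martha Thompson, a senior, moved to Spain where she will be "
            "playing basketball until 05 June 2022 or until she can't play any longer.")
entities = toy_ner(sentence)
for entity in entities:
    print(entity)
```

Each tuple pairs a stretch of the original text with its character offsets and a capitalized label, and “Martha Thompson” is returned as one entity rather than two.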

As we progress through these notebooks and videos, we will learn new NER concepts. For now, I recommend watching the video below. Each notebook, including this one, will have a corresponding video lesson.

# Text Encoding

## Background

In [2]:
# import libraries for basic web scraping
import requests
from bs4 import BeautifulSoup

Before we discuss the complexities of text encoding, let's first jump into a real-world example. We will be scraping the data from the SABC SAHA website in South Africa. The page we are scraping looks like this:

<center><img src='../images/saha_website.JPG' width=500></center>

We are interested in grabbing the data from the `p` tag in the HTML whose `id` is equal to `line4` (the area I have underlined in the image). This entire page holds a particular testimony from the Truth and Reconciliation Commission (TRC) in South Africa during the late 90s and early 2000s, when this text representation of the testimony was generated. The date and time are important here because this particular text was created with a computer that used an early form of text encoding, the process by which text is converted into numerical values that can be parsed by your computer.

We know that this page uses an older form of encoding because one character is displayed as �. This replacement character indicates that the underlying bytes cannot be decoded with the encoding being used to read the text. Because we are viewing this in a modern browser (which expects a modern encoding known as UTF-8), the browser cannot render this particular character. Something is happening between the server and our personal computer's browser.
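What "encoded into numerical values" means can be shown in a few lines. This is a minimal sketch using only built-in string methods; the variable names are chosen for illustration. The same character, é, becomes one byte in ISO-8859-15 and two bytes in UTF-8, and decoding the ISO-8859-15 bytes as if they were UTF-8 produces the replacement character.

```python
sample = "André"

# The same text, stored as bytes under two different encodings
as_utf8 = sample.encode("utf-8")
as_latin9 = sample.encode("iso-8859-15")
print(as_utf8)    # b'Andr\xc3\xa9'
print(as_latin9)  # b'Andr\xe9'

# Reading ISO-8859-15 bytes as UTF-8: the lone byte 0xe9 is not valid UTF-8,
# so it is swapped for the replacement character U+FFFD
misread = as_latin9.decode("utf-8", errors="replace")
print(misread)    # Andr�
```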

Let's grab that particular line using requests and BeautifulSoup:

In [167]:
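# request the page and parse the returned HTML
s = requests.get('https://sabctrc.saha.org.za/documents/amntrans/benoni/52831.htm')
# name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(s.content, "html.parser")
# grab the paragraph whose id is "line4" and keep only its text
line = soup.find("p", {"id": "line4"})
iso_text = line.text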

Now that we have that line, let's print it off.

In [168]:
print (iso_text)

ADV STEENKAMP: I'm André Steenkamp.


Everything looks absolutely fine. No issues whatsoever. So, let's save that text to a file.

In [155]:
# no encoding is given here, so Python falls back to the platform's default encoding
with open('../data/iso-text.txt', 'w') as f:
    f.write(iso_text)

And now, let's open that file up and take another look at it.

In [156]:
with open ("../data/iso-text.txt", 'r') as f:
    iso_data = f.read()
print (iso_data)

ADV STEENKAMP: I'm André Steenkamp.


In [157]:
iso_bytes2 = str.encode(iso_data)
print (iso_bytes2)

b"ADV STEENKAMP: I'm Andr\xc3\xa9 Steenkamp."


In [158]:
with open("../data/utf8-text.txt", "r") as f:
    utf8_data = f.read()

In [159]:
utf8_data == iso_data

False

In [161]:
print(utf8_data)

ADV STEENKAMP: I'm AndrÃ© Steenkamp.


In [162]:
utf8_bytes = str.encode(utf8_data)
print (utf8_bytes)

b"ADV STEENKAMP: I'm Andr\xc3\x83\xc2\xa9 Steenkamp."


In [163]:
iso_bytes = str.encode(iso_text)
print (iso_bytes)

b"ADV STEENKAMP: I'm Andr\xc3\xa9 Steenkamp."


In [119]:
iso_bytes == utf8_bytes

False
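The extra bytes in the output above are consistent with UTF-8 bytes having been decoded with a single-byte encoding. The exact cause depends on the platform the cells were run on, so the following is a plausible reconstruction under that assumption, not a record of what happened. It uses only built-in string methods, and the variable names are illustrative.

```python
original = "ADV STEENKAMP: I'm André Steenkamp."

# Step 1: the text is stored correctly as UTF-8 (é becomes the two bytes c3 a9)
correct_bytes = original.encode("utf-8")

# Step 2: those bytes are decoded with a single-byte encoding,
# so each of the two bytes turns into its own character
garbled = correct_bytes.decode("iso-8859-15")
print(garbled)        # ADV STEENKAMP: I'm AndrÃ© Steenkamp.

# Step 3: encoding the garbled string as UTF-8 doubles the damage
garbled_bytes = garbled.encode("utf-8")
print(garbled_bytes)  # b"ADV STEENKAMP: I'm Andr\xc3\x83\xc2\xa9 Steenkamp."

# Reversing step 2 recovers the original text
repaired = garbled.encode("iso-8859-15").decode("utf-8")
print(repaired == original)  # True
```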

To see what is happening here, let's look at these in a proper text editor. We will use Atom.

<center><img src='../images/encoding-issue.JPG' width=500></center>

Notice in the above image, we can see the problem is preserved. In the bottom left corner of Atom, however, we have the ability to change the encoding of the text file being observed.

<center><img src='../images/encoding-issue2.JPG' width=500></center>

When we change that from UTF-8 to ISO-8859-15, the problem vanishes. So, what has happened? We have switched our text editor (Atom) to an encoding that was used in the 1990s for Western European languages, with which Afrikaans, despite being South African, is associated due to Dutch colonialism.

Why is this such a big deal? Because while you or I will see the same text and the same characters on the screen, a computer will not. Those who work with multilingual corpora, especially those who work with texts that were created before modern encoding standards, will at some point encounter corpora that contain multiple encodings. Understanding this issue and the myriad problems that surface because of it will make your life much easier.

This is precisely what we will now address.

Now, let's return to our variable above, utf8_data. Let's print off utf8_data and see what is different.

In [120]:
print(utf8_data)

ADV STEENKAMP: I'm AndrÃ© Steenkamp.


Let's open up the utf8 file now, with encoding specified.

In [132]:
with open("../data/utf8-text.txt", "r") as f:
    utf8_data2 = f.read()

In [133]:
iso_data == utf8_data2

False

In [139]:
utf8_bytes = str.encode(utf8_data2)
print (utf8_bytes)

iso_bytes = str.encode(iso_data)
print (iso_bytes)

b"ADV STEENKAMP: I'm Andr\xc3\x83\xc2\xa9 Steenkamp."
b"ADV STEENKAMP: I'm Andr\xc3\xa9 Steenkamp."


So this has now rendered the encoding correctly in Jupyter. Let's see what happens when we open our iso-8859-15 encoded data with utf-8.

In [140]:
with open ("../data/iso-text.txt", 'r', encoding="utf-8") as f:
    iso_data = f.read()
print (iso_data)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 23: invalid continuation byte

Now we get an error. This error is occurring because Python cannot read the file (which is not encoded in utf-8) as a utf-8-encoded text file. We can use Python, however, to read a different encoding, standardize it into utf-8, and then continue to open that file as a utf-8 file consistently in the future. The process for doing this will vary significantly for other languages.
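One way to standardize a file into UTF-8 is sketched below. The helper name `convert_to_utf8` and the file names are hypothetical, and the example writes its own small ISO-8859-15 file into a temporary directory so that it does not depend on the notebook's data folder. It assumes you already know the source encoding; detecting an unknown encoding is a separate problem.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source_path, target_path, source_encoding="iso-8859-15"):
    """Read a file in a known legacy encoding and save a UTF-8 copy."""
    text = Path(source_path).read_text(encoding=source_encoding)
    Path(target_path).write_text(text, encoding="utf-8")
    return text


with tempfile.TemporaryDirectory() as tmp:
    # Create a stand-in legacy file: é is stored as the single byte 0xe9
    legacy_file = Path(tmp) / "legacy.txt"
    legacy_file.write_bytes(b"ADV STEENKAMP: I'm Andr\xe9 Steenkamp.")

    # Reading it as UTF-8 fails, as in the cell above
    try:
        legacy_file.read_text(encoding="utf-8")
        failed_as_utf8 = False
    except UnicodeDecodeError:
        failed_as_utf8 = True

    # Convert once, then always open the new file as UTF-8
    converted_file = Path(tmp) / "converted.txt"
    converted = convert_to_utf8(legacy_file, converted_file)
    reread = converted_file.read_text(encoding="utf-8")

print(failed_as_utf8)
print(reread)
```

Always passing `encoding=` explicitly, both when reading and when writing, keeps the result independent of the platform default.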

## Resources

Tom Scott from Computerphile explains UTF-8

In [16]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MijmeoH9LT4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')



James Briggs explains Unicode normalization

In [17]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/9Od9-DV9kd8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0

## Problems within UTF-8

Our problems with encodings, unfortunately, do not end with UTF-8. Once we have encoded our texts into UTF-8, we can still have issues with characters that look the same but are encoded differently. This is particularly true of accented characters.

In [51]:
"Ç" == "Ç"

True

In [52]:
"Ç" == "Ç"

False

In [53]:
compound_c = "\u0043\u0327"
print (compound_c)

Ç


In [55]:
accent_c = "\u00C7"
print (accent_c)

Ç


## Normalize Unicode

In [41]:
import unicodedata

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | *Canonical* decomposition | `Ç` → `C ̧` |
| Form C | NFC | *Canonical* decomposition followed by *canonical* composition | `Ç` → `C ̧` → `Ç` |
| Form KD | NFKD | *Compatibility* decomposition | `ℌ ̧` → `H ̧` |
| Form KC | NFKC | *Compatibility* decomposition followed by *canonical* composition | `ℌ ̧` → `H ̧` → `Ḩ` |

```
Source: James Briggs - https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0
```
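The four forms in the table can be compared side by side with the standard library's `unicodedata.normalize`. This is a small sketch, with illustrative variable names, that applies each form to the table's `ℌ` example and prints the resulting code points so the differences are visible even when the glyphs look alike.

```python
import unicodedata

# Black-letter capital H (U+210C) followed by a combining cedilla (U+0327)
black_letter_h = "\u210C\u0327"

results = {}
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    normalized = unicodedata.normalize(form, black_letter_h)
    results[form] = normalized
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form:4} {normalized}  {code_points}")
```

The canonical forms (NFD, NFC) leave `ℌ` alone because it only has a compatibility mapping, while the compatibility forms (NFKD, NFKC) replace it with a plain `H`, which NFKC then combines with the cedilla into a single character.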

In [56]:
compound_c == accent_c

False

In [57]:
print(compound_c, accent_c)

Ç Ç


In [62]:
nfd_compound = unicodedata.normalize('NFD', compound_c)
nfd_accent = unicodedata.normalize('NFD', accent_c)
print(nfd_compound, nfd_accent)
print (nfd_compound == nfd_accent)

Ç Ç
True


In [64]:
nfc_compound = unicodedata.normalize('NFC', compound_c)
nfc_accent = unicodedata.normalize('NFC', accent_c)
print(nfc_compound, nfc_accent)
print (nfc_compound == nfc_accent)

Ç Ç
True


In [47]:
# the decomposed (NFD) form still differs from the precomposed character
nfd_compound == accent_c

False

In [147]:
# NFC normalization cannot undo a wrong decoding, so the garbled text is unchanged
utf8_normalized = unicodedata.normalize('NFC', utf8_data)
print (utf8_normalized)

ADV STEENKAMP: I'm AndrÃ© Steenkamp.


## Carriage Returns

In [65]:
e_with_carriage = "\u00C8\u000D"
e_without_carriage = '\u00C8'
print(e_with_carriage)
print (e_without_carriage)

È
È


In [66]:
e_with_carriage == e_without_carriage

False

In [67]:
e_without_carriage

'È'

In [68]:
e_with_carriage

'È\r'

In [69]:
print('È\n')

È



In [70]:
'È\n'

'È\n'
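The cells above show that an invisible carriage return is enough to make two otherwise identical strings unequal. A minimal sketch of one way to guard against this is to convert all line endings to `\n` before comparing. The helper name `normalize_newlines` is hypothetical, and whether trailing newlines should also be stripped depends on your corpus.

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


with_cr = "\u00C8\r"
without_cr = "\u00C8"

# The raw strings differ because of the trailing carriage return
raw_equal = with_cr == without_cr

# After normalizing line endings and trimming the trailing newline, they match
cleaned_equal = normalize_newlines(with_cr).rstrip("\n") == without_cr

print(raw_equal, cleaned_equal)  # False True
```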

## Normalize Accents

In [32]:
import unidecode

In [36]:
accented_string = "âcre"
print (accented_string)

âcre


In [37]:
print (unidecode.unidecode(accented_string))

acre


The problem here is that stripping the accent changes the meaning of the word.

| French      | English |
| ----------- | ----------- |
| âcre      | acrid, pungent       |
| acre   | acre        |
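The loss of meaning can be made explicit by checking whether two distinct words collapse into one after accents are removed. This sketch swaps in the standard library's `unicodedata` in place of `unidecode` so that it runs without extra installs; the helper name `strip_accents` is hypothetical, and the approach only handles accents that decompose into combining marks.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


french_words = ["âcre", "acre"]
stripped_words = [strip_accents(word) for word in french_words]

print(stripped_words)                 # ['acre', 'acre']
print(len(set(french_words)))         # 2 distinct words before
print(len(set(stripped_words)))       # 1 distinct word after
```

For languages where accents distinguish words, it is usually safer to keep the accents and normalize the Unicode form instead.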