<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `MultiLingual NER`

This is lesson `1` of 3 in the educational series on `Named Entity Recognition`. This notebook is intended to introduce `the basic problems one faces in multilingual texts`.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 


**Difficulty:** `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Basic file operations (open, close, read, write)

```

**Knowledge Recommended:**
```
* Natural Language Processing
* spaCy
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand the complexities of multilingual corpora
2. Understand text encoding
3. Understand how to solve encoding-issues
4. Understand how to think about corpora-specific problems
5. Understand spaCy
6. Understand Named Entity Recognition (NER) as a concept
```
___

# Required Python Libraries
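* spaCy: natural language processing and named entity recognition
* requests: downloading web pages
* BeautifulSoup (bs4): parsing HTML
* unidecode: converting accented characters to their closest ASCII equivalents
* unicodedata (part of the Python standard library): Unicode normalization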

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!pip install unidecode
!pip install requests
!pip install bs4

In [1]:
# allows the embedding of YouTube videos
from IPython.display import HTML

# Introduction

```
In this notebook, we will be covering the major issues and challenges one can face when working with multilingual corpora. We will specifically address how these issues relate to the problem of named entity recognition (NER), a method of information extraction that relies upon natural language processing (NLP).

We will also cover the basics of spaCy and named entity recognition in general for those who do not have a background in either area.
```

# Text Encoding

## Background

In [8]:
# import libraries for basic web scraping
import requests
from bs4 import BeautifulSoup

Before we discuss the complexities of text encoding, let's first jump into a real-world example. We will be scraping data from the SABC SAHA website in South Africa. The page we are scraping looks like this:

<center><img src='../images/saha_website.JPG' width=500></center>

We are interested in grabbing the data from the `p` tag in the HTML whose `id` is `line4` (the area I have underlined in the image). This page holds a single testimony from the Truth and Reconciliation Commission (TRC) in South Africa, and its text representation was generated during the late 1990s or early 2000s. The date matters because the text was created on a computer that used an older form of text encoding, the process by which text is converted into numerical values that your computer can parse.

We know that this page uses an older, non-utf-8 encoding because one character is displayed as �. This character indicates that the text cannot be decoded with the encoding being used to read it. Because we are viewing the page in a modern browser, which assumes a modern encoding known as utf-8, the browser cannot render this particular character. Something is happening between the server and our personal computer's browser.
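To make the � a little more concrete, here is a minimal sketch that does not touch the website at all. It assumes, purely for illustration, that the name was stored as ISO-8859-1 bytes (the variable names are ours, not part of the lesson data) and shows what happens when those bytes are read as utf-8.

```python
# The same name, encoded two different ways (illustrative values only).
name = "Andr\u00e9"                        # André, with a precomposed é
latin_bytes = name.encode("iso-8859-1")   # é is stored as the single byte 0xE9
utf8_bytes = name.encode("utf-8")         # é is stored as the two bytes 0xC3 0xA9

print(latin_bytes)
print(utf8_bytes)

# Reading the legacy bytes as utf-8 fails, because a lone 0xE9 is not valid utf-8.
# errors="replace" swaps the undecodable byte for U+FFFD, the � character.
misread = latin_bytes.decode("utf-8", errors="replace")
print(misread)

# Reading the bytes with the encoding they were written in recovers the name.
recovered = latin_bytes.decode("iso-8859-1")
print(recovered == name)
```

The bytes themselves never change. Only the rule used to turn them back into characters does.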

Let's grab that particular line using requests and BeautifulSoup.

In [9]:
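# download the page and parse its HTML with Python's built-in parser
s = requests.get('https://sabctrc.saha.org.za/documents/amntrans/benoni/52831.htm')
soup = BeautifulSoup(s.content, "html.parser")
# grab the <p> tag whose id is "line4" and keep only its text
line = soup.find("p", {"id": "line4"})
iso_text = line.text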

Now that we have that line, let's print it off.

print (iso_text)

Everything looks absolutely fine. No issues whatsoever. So, let's save that text to a file.

In [4]:
with open('../data/iso-text.txt', 'w') as f:
    f.write(iso_text)

And now, let's open that file up and take another look at it.

In [5]:
with open ("../data/iso-text.txt", 'r') as f:
    iso_data = f.read()
print (iso_data)

ADV STEENKAMP: I'm André Steenkamp.


Yep. It still looks good. For good measure, let's open the same text in a separate file that is stored as utf-8.

In [16]:
with open("../data/utf8-text.txt", "r") as f:
    utf8_data = f.read()
print (utf8_data)

ADV STEENKAMP: I'm AndrÃ© Steenkamp.


And now, we can see that something looks wrong. Just to be safe, let's compare the two files.

In [17]:
iso_data == utf8_data

False

Let's open up the utf8 file now, with encoding specified.

In [18]:
with open("../data/utf8-text.txt", "r", encoding='utf-8') as f:
    utf8_data = f.read()
print (utf8_data)

ADV STEENKAMP: I'm André Steenkamp.


In [19]:
iso_data == utf8_data

True

Python has now allowed us to standardize these two pieces of data by reading both encodings into the same utf-8 standard. Now, why is this such a problem? Because text that looks clean in Python can still fundamentally break named entity recognition systems: if the underlying characters differ, the system is not looking at the same data.
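The garbled `Ã©` we saw earlier is what utf-8 bytes look like when they are read with a single-byte encoding. When `open()` is called without an `encoding` argument, Python falls back on a platform-dependent default, which is presumably what happened in the cell above. The sketch below reproduces the effect in a temporary directory; the file name and the choice of `cp1252` as the wrong encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

text = "ADV STEENKAMP: I'm Andr\u00e9 Steenkamp."

# Work in a throwaway directory so no real data files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "utf8-text.txt"   # hypothetical file name

    # Write the text as utf-8, stating the encoding explicitly.
    path.write_text(text, encoding="utf-8")

    # Read it back with the wrong encoding: each byte of é becomes its own character.
    garbled = path.read_text(encoding="cp1252")

    # Read it back with the encoding it was written in.
    correct = path.read_text(encoding="utf-8")

print(garbled)
print(correct == text)
```

Stating the encoding on every read and write is a small habit that removes the dependence on whatever default the current machine happens to use.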

To see what is happening here, let's look at these in a proper text editor. We will use Atom.

<center><img src='../images/encoding-issue.JPG' width=500></center>

Notice in the image above that the problem persists. In the bottom left corner of Atom, however, we have the ability to change the encoding used to display the text file.

<center><img src='../images/encoding-issue2.JPG' width=500></center>

When we change that from UTF-8 to ISO-8859-15, the problem vanishes. So, what has happened? We have switched our text editor (Atom) to an encoding that was used in the 1990s for western European languages, a group with which Afrikaans, despite being South African, is associated because of Dutch colonialism.

Why is this such a big deal? Because while you or I will see the same text and the same characters on the screen, a computer will not. Those who work with multilingual corpora, especially texts that were created before the modern day, will at some point encounter corpora that contain multiple encodings. Understanding this issue and the myriad problems that surface because of it will make your life much easier.

This is precisely what we will now address.
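As a preview, one way to cope with a corpus that mixes encodings is to try a strict encoding first and fall back to a legacy one. The helper below, `decode_with_fallback`, is a hypothetical name of our own, and the list of candidate encodings is an assumption that you would adjust to your corpus. Treat it as a sketch rather than a general-purpose detector.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-15")):
    """Decode bytes by trying each candidate encoding in order.

    Hypothetical helper for illustration. Order matters: iso-8859-15 accepts
    any byte sequence, so it must come after the stricter utf-8.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Two byte strings that look identical on screen but were saved differently.
legacy_raw = "Andr\u00e9 Steenkamp".encode("iso-8859-15")
modern_raw = "Andr\u00e9 Steenkamp".encode("utf-8")

legacy_text, legacy_encoding = decode_with_fallback(legacy_raw)
modern_text, modern_encoding = decode_with_fallback(modern_raw)

print(legacy_encoding, modern_encoding)
print(legacy_text == modern_text)
```

Once both files have been decoded into Python strings, they can be compared, searched, and written back out in a single encoding.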

## Resources

Tom Scott from Computerphile explains UTF-8

In [16]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MijmeoH9LT4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')



James Briggs explains Unicode normalization

In [17]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/9Od9-DV9kd8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0

## The Problem

### Problem within UTF-8

In [8]:
"È" == "È"

True

In [12]:
"È" == "È"

False

In [11]:
compound_e = "\u0045\u0300"
print (compound_e)

È


In [9]:
accent_e = "\u00C8"
print (accent_e)

È


In [40]:
utf_16 = "feff00c8"
utf_16

'feff00c8'
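The cells above can be unpacked a little further. The sketch below inspects the two versions of È at the level of code points and bytes, which is where they actually differ. The variable `utf_16_hex` is our own name, and using big-endian utf-16 with a byte order mark is an assumption chosen to match the hex string shown above.

```python
import codecs

compound_e = "\u0045\u0300"   # E followed by a combining grave accent
accent_e = "\u00C8"           # a single precomposed È

# The strings render alike but contain different numbers of code points.
print(len(compound_e), len(accent_e))
print([hex(ord(ch)) for ch in compound_e])
print([hex(ord(ch)) for ch in accent_e])

# Their utf-8 byte sequences differ as well.
print(compound_e.encode("utf-8"))
print(accent_e.encode("utf-8"))

# Big-endian utf-16 with a byte order mark reproduces the hex string above.
utf_16_hex = (codecs.BOM_UTF16_BE + accent_e.encode("utf-16-be")).hex()
print(utf_16_hex)
```

This is why the second comparison returned `False`: Python compares code points, not the shapes drawn on the screen.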

### Carriage Returns

In [25]:
e_with_carriage = "\u00C8\u000D"
e_without_carriage = '\u00C8'
print(e_with_carriage)
print (e_without_carriage)

È
È


In [26]:
e_with_carriage == e_without_carriage

False

In [27]:
e_without_carriage

'È'

In [28]:
e_with_carriage

'È\r'

In [30]:
print('È\n')

È



In [31]:
'È\n'

'È\n'
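Invisible characters such as `\r` are easy to remove once we know they are there. Here is a minimal sketch of two common approaches. The helper `normalize_newlines` and the sample string are hypothetical and exist only for illustration.

```python
e_with_carriage = "\u00C8\u000D"
e_without_carriage = "\u00C8"

# Option 1: strip trailing carriage returns and newlines before comparing.
print(e_with_carriage.rstrip("\r\n") == e_without_carriage)


# Option 2: convert Windows (\r\n) and old Mac (\r) line endings to \n.
def normalize_newlines(text):
    """Hypothetical helper: rewrite every line ending as a single newline."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "line one\r\nline two\rline three\n"
print(repr(normalize_newlines(mixed)))
```

The first option suits single tokens or lines, while the second keeps the line structure of a longer document intact.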

## The Solution

### Normalize Unicode

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | *Canonical* decomposition | `Ç` → `C ̧` |
| Form C | NFC | *Canonical* decomposition followed by *canonical* composition | `Ç` → `C ̧` → `Ç` |
| Form KD | NFKD | *Compatibility* decomposition | `ℌ ̧` → `H ̧` |
| Form KC | NFKC | *Compatibility* decomposition followed by *canonical* composition | `ℌ ̧` → `H ̧` → `Ḩ` |

```
Source: James Briggs - https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0
```

In [39]:
import unicodedata
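The four forms in the table can be tried out with `unicodedata.normalize` from the standard library. This is a short sketch that reuses the two versions of È from earlier and the `ℌ` example from the table. The variable names are our own.

```python
import unicodedata

compound_e = "\u0045\u0300"   # E + combining grave accent
accent_e = "\u00C8"           # precomposed È

# Before normalization the two strings are not equal.
print(compound_e == accent_e)

# NFC composes the pair into the single precomposed character.
nfc = unicodedata.normalize("NFC", compound_e)
print(nfc == accent_e)

# NFD does the reverse and splits È into E plus the combining accent.
nfd = unicodedata.normalize("NFD", accent_e)
print(nfd == compound_e)

# The compatibility forms also replace look-alike characters such as ℌ with H.
fraktur_h = "\u210C\u0327"    # ℌ followed by a combining cedilla
nfkd = unicodedata.normalize("NFKD", fraktur_h)
nfkc = unicodedata.normalize("NFKC", fraktur_h)
print([hex(ord(ch)) for ch in nfkd])
print([hex(ord(ch)) for ch in nfkc])
```

Whichever form you choose, the important thing is to apply the same one to every text in the corpus before comparing strings or training a model.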

### Normalize Accents

In [32]:
import unidecode

In [36]:
accented_string = "âcre"
print (accented_string)

âcre


In [37]:
print (unidecode.unidecode(accented_string))

acre


The problem here is that stripping the accent changes the meaning of the word.

| French      | English |
| ----------- | ----------- |
| âcre      | acrid, pungent       |
| acre   | acre        |
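When accents carry meaning, as they do here, one alternative is to normalize the text to NFC rather than strip the accents. The sketch below uses a hypothetical two-entry glossary built from the table above and a hypothetical `lookup` helper. It shows that both ways of writing `âcre` resolve to the same entry while `acre` stays distinct.

```python
import unicodedata

precomposed = "\u00E2cre"     # âcre with a single code point for â
decomposed = "a\u0302cre"     # âcre written as a + combining circumflex

# Hypothetical glossary, keyed by NFC-normalized French words.
glossary = {"\u00E2cre": "acrid, pungent", "acre": "acre"}


def lookup(word):
    """Hypothetical helper: NFC-normalize a word, then look it up."""
    return glossary.get(unicodedata.normalize("NFC", word))


print(precomposed == decomposed)
print(lookup(precomposed))
print(lookup(decomposed))
print(lookup("acre"))
```

Whether to strip or to preserve accents depends on the language and the task, which is why it helps to understand the data before choosing a cleaning step.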

# Solving the Right Problem by Understanding the Data

## Working with a Multilingual Corpus

## Working with a Multilingual Document

# Introduction to Named Entity Recognition

# Introduction to spaCy