# Named Entity Recognition: Spanish

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Spanish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Spanish is *Oasis en la vida* by Juana Manuela Gorriti [from Project Gutenberg](http://www.gutenberg.org/ebooks/62564).

**Here's a preview of spaCy's NER tagging *Oasis en la vida*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Spanish NER is considerably worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC vs. PER. A model needs a lot of training examples to learn to distinguish entity types; currently, the English model is the only one that does a decent job of it.

You can read more about the [data sources used to train Spanish](https://spacy.io/models/es) on the spaCy model page.

In [6]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [2]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

In the cell above, we also import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data. The two `pd.options` lines raise the pandas display defaults for the maximum number of rows and the maximum column width.
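
As a quick illustration of what `Counter` does, here is a minimal sketch that uses a short, made-up list of names in place of real NER output. The variable names `names`, `tally`, and `top_two` are just for this example.

```python
from collections import Counter

# A short, made-up list of entity strings standing in for real NER output.
names = ["Mauricio", "Julia", "Mauricio", "Renata", "Julia", "Mauricio"]

# Counter tallies how many times each distinct item appears.
tally = Counter(names)

# most_common(n) returns (item, count) pairs, most frequent first.
top_two = tally.most_common(2)
print(top_two)  # [('Mauricio', 3), ('Julia', 2)]
```

We will use the same `most_common()` call later in the lesson to build the people and places tables.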

### Download Language Model

Next we need to download the Spanish-language model (`es_core_news_md`), which will be processing and making predictions about our texts. You can read more about the [data sources used to train Spanish](https://spacy.io/models/es) on the spaCy model page.

In [None]:
!python -m spacy download es_core_news_md

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [4]:
import es_core_news_md
nlp = es_core_news_md.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('es_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, because you can then use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text and create our document.

In [5]:
filepath = '../texts/other-languages/es.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
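
One thing to watch for when reading the file: the first entity in the output further down appears to begin with an invisible character, which may be a byte-order mark (BOM) left at the start of the text file. The sketch below is not part of the original notebook. It uses a small temporary file with made-up contents to show how Python's `utf-8-sig` encoding strips a leading BOM while plain `utf-8` keeps it.

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte-order mark (BOM).
# The filename and contents are hypothetical stand-ins for es.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfINTRODUCCION\n")

# Plain 'utf-8' keeps the BOM as an invisible first character.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' strips a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[:5]))     # '\ufeffINTR'
print(repr(without_bom[:5]))  # 'INTRO'
```

If your own text file starts with a BOM, reading it with `encoding='utf-8-sig'` should keep that character out of the first token.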

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [7]:
document.ents

(﻿INTRODUCCION,
 El sombrío Prudhon,
 Santos
 Padres de la Iglesia,
 La industria,
 La manera de realizar el prodigio,
 _Economía política_,
 Esta tutriz moralizadora de la sociedad,
 El ahorro,
 Este tema,
 Por si existe incrédulo,
 Su autora comprueba,
 Despues,
 S. VACA-GUZMAN,
 OASIS EN LA VIDA,
 _,
 OASIS,
 EN LA VIDA,
 Mauricio Ridel,
 Fin,
 folletin--respondió Mauricio,
 CHAMUSQUINAS DE AMOR?,
 Enrique,
 María,
 Dios,
 trabajo!--replicó,
 Catorce horas,
 Un poco de sueño,
 Hablando así,
 Mauricio,
 Redaccion,
 Emilio,
 Sábelo,
 Mauricio,
 Regente,
 Suma: ¡catorce horas!,
 ¡Adios,
 Mauricio,
 Emilio,
 Mauricio,
 Uncido,
 En los teatros,
 Enigma,
 Emilio,
 Mauricio,
 En verdad,
 Mauricio,
 Cárlos Ridel,
 Madrastra!,
 Siempre espiados por la
 mirada suspicaz de un fiscal,
 La casa paterna,
 Tal suerte cupo á Mauricio,
 Víctima de una semejanza,
 Europa,
 Francia,
 Paris,
 La bondad característica de los hijos de aquella tierra,
 Desde el sábio Blain,
 Colombe,
 El desterrado comenz

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [8]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

﻿INTRODUCCION PER
El sombrío Prudhon MISC
Santos
Padres de la Iglesia ORG
La industria MISC
La manera de realizar el prodigio MISC
_Economía política_ MISC
Esta tutriz moralizadora de la sociedad MISC
El ahorro MISC
Este tema MISC
Por si existe incrédulo MISC
Su autora comprueba MISC
Despues MISC
S. VACA-GUZMAN PER
OASIS EN LA VIDA MISC
_ MISC
OASIS ORG
EN LA VIDA MISC
Mauricio Ridel PER
Fin MISC
folletin--respondió Mauricio MISC
CHAMUSQUINAS DE AMOR? MISC
Enrique PER
María PER
Dios PER
trabajo!--replicó MISC
Catorce horas MISC
Un poco de sueño MISC
Hablando así MISC
Mauricio LOC
Redaccion MISC
Emilio PER
Sábelo LOC
Mauricio LOC
Regente PER
Suma: ¡catorce horas! MISC
¡Adios MISC
Mauricio LOC
Emilio PER
Mauricio PER
Uncido PER
En los teatros MISC
Enigma MISC
Emilio PER
Mauricio PER
En verdad MISC
Mauricio LOC
Cárlos Ridel PER
Madrastra! MISC
Siempre espiados por la
mirada suspicaz de un fiscal MISC
La casa paterna MISC
Tal suerte cupo á Mauricio MISC
Víctima de una semejanza MISC
Europa
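
To get a feel for how the labels are distributed, it can help to group entities by label. The sketch below is an illustration rather than part of the original notebook. It uses a handful of `(text, label)` pairs copied from the output above as stand-in data; with a real spaCy document, the pairs would come from `named_entity.text` and `named_entity.label_`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. With a real spaCy Doc,
# these would be (ent.text, ent.label_) for each ent in document.ents.
pairs = [
    ("El sombrío Prudhon", "MISC"),
    ("S. VACA-GUZMAN", "PER"),
    ("OASIS", "ORG"),
    ("Mauricio Ridel", "PER"),
    ("Mauricio", "LOC"),
    ("Emilio", "PER"),
    ("La industria", "MISC"),
]

# Collect the entity texts under each label.
entities_by_label = defaultdict(list)
for text, label in pairs:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, len(texts), texts)
```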

To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

In [9]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

﻿INTRODUCCION
S. VACA-GUZMAN
Mauricio Ridel
Enrique
María
Dios
Emilio
Regente
Emilio
Mauricio
Uncido
Emilio
Mauricio
Cárlos Ridel
Paris
Mauricio
Blain
Envíaselo
Besó
Mauricio
Mauricio
Mauricio
Cárlos
  Ridel
Mauricio Ridel
Repugnábame
señor Ridel
Mauricio
Paris
Arrojóse
Paris
Ensayó
Mauricio
Mauricio
Ridel
Cárlos Ridel
Mauricio
Mas
Cárlos Ridel
Vd.
Mauricio
Blain
Paris
Mauricio
Lloraba
Pouillac
Mauricio
Plata
Sr. Santa Coloma
Vice-Cónsul Argentino
Julia
Pronta
Rendidos
Julia Lopez
Vd.
Pouillac
Mauricio
Mauricio
Julia Lopez
Mauricio
Anochecía
Mauricio
Julia
Lopez
Supo
Vice-Cónsul
Mauricio
Dónde
Mauricio
Mauricio
Mauricio
Dió
madame Bazan
Contento
Madame Bazan
Mauricio
Capricho
madame Bazan
_
madame Bazan
Mauricio
_
Mauricio
Mauricio
Steinway
Mauricio
Pouillac
Mauricio
Mauricio
Mauricio
Encerrado
Paris
Renata
Háse
Insensatas
Jesucristo
Ribeaumont
Conoce
Mr
Ribeaumont
Le Courrier de la Plata»
Le Courrier de la Plata
Valois
Mauricio
Renata
Bowctlaw
Renata déme Vd. mi baton de cachemira
Jul

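## NER with Long Texts or Many Texts

For a long text, or for many texts, we can split the text into smaller chunks and then process all of the chunks in one batch with `nlp.pipe()`.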

In [10]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [11]:
chunked_documents = list(nlp.pipe(text_chunks))
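
The cell above slices the text at fixed character positions, so a chunk boundary can land in the middle of a word or a name. As an alternative sketch (not the notebook's method), we could group whole paragraphs into chunks instead. The function name `chunk_on_paragraphs` and the `max_chars` parameter are hypothetical, and the sketch assumes paragraphs are separated by blank lines.

```python
def chunk_on_paragraphs(text, max_chars):
    """Group paragraphs into chunks of at most max_chars characters where possible.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Try adding the next paragraph to the chunk we are building.
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # The chunk would get too long, so close it and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text with three short paragraphs.
sample = "Mauricio Ridel llegó.\n\nJulia Lopez esperaba.\n\nBuenos Aires dormía."
sample_chunks = chunk_on_paragraphs(sample, max_chars=45)
print(sample_chunks)
```

The resulting list could be passed to `nlp.pipe()` in the same way as `text_chunks`.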

## Get People

To extract and count the people, we will use an `if` statement that keeps an entity only if its label (`named_entity.label_`) is "PER".

In [12]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Mauricio | 60 |
| 1 | Julia | 19 |
| 2 | Cárlos Ridel | 10 |
| 3 | Renata | 8 |
| 4 | Vd. | 6 |
| 5 | Emilio | 5 |
| 6 | Paris | 5 |
| 7 | Pouillac | 5 |
| 8 | Despues | 4 |
| 9 | madame Bazan | 4 |
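
Some entities in the earlier output span a line break (for example `Cárlos` followed by `Ridel` on the next line), so the same name can be counted under slightly different strings. The sketch below shows one possible cleanup step before counting. It is an assumption about what might help, not something the notebook does, and `normalize_entity` is a hypothetical helper name.

```python
from collections import Counter


def normalize_entity(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


# Stand-in data: the second item mimics an entity split across two lines,
# and the trailing space on "Mauricio " is made up for illustration.
raw_people = ["Cárlos Ridel", "Cárlos\n  Ridel", "Mauricio", "Mauricio ", "Julia"]

normalized_tally = Counter(normalize_entity(name) for name in raw_people)
print(normalized_tally.most_common())
# [('Cárlos Ridel', 2), ('Mauricio', 2), ('Julia', 1)]
```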


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for entity labels that match "LOC".

In [13]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | Mauricio | 42 |
| 1 | Buenos Aires | 12 |
| 2 | Colombe | 5 |
| 3 | Francia | 4 |
| 4 | Burdeos | 4 |
| 5 | Senegal | 4 |
| 6 | Rio Janeiro | 4 |
| 7 | Gran Hotel | 3 |
| 8 | Europa | 2 |
| 9 | Pouillac | 2 |
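
Notice that "Mauricio" tops both the people table and the places table. One way to spot entities that the model labels inconsistently is to collect the labels each string receives. The sketch below uses a few counts copied from the two tables as stand-in data; the variable names are made up for this example.

```python
from collections import Counter, defaultdict

# Counts copied from the people (PER) and places (LOC) tables above.
people_counts = {"Mauricio": 60, "Julia": 19, "Pouillac": 5, "Paris": 5}
place_counts = {"Mauricio": 42, "Buenos Aires": 12, "Pouillac": 2}

# Map each entity string to a tally of the labels it received.
labels_by_entity = defaultdict(Counter)
for name, count in people_counts.items():
    labels_by_entity[name]["PER"] += count
for name, count in place_counts.items():
    labels_by_entity[name]["LOC"] += count

# Keep only the strings that were given more than one label.
ambiguous = {name: dict(c) for name, c in labels_by_entity.items() if len(c) > 1}
print(ambiguous)
# {'Mauricio': {'PER': 60, 'LOC': 42}, 'Pouillac': {'PER': 5, 'LOC': 2}}
```

Strings that show up under several labels are good candidates for a manual check.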


## Get NER in Context

In [14]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    
    #Default to people, organizations, and places if no labels are given
    if not desired_ner_labels:
        desired_ner_labels = ['PER', 'ORG', 'LOC']
        
    #Iterate through all the sentences in the document
    for sentence in document.sents:
        #Process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            #Check to see if the keyword is in the entity text (ignoring capitalization) and the label is one we want
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                #Use the regex library to replace linebreaks with spaces
                sentence_text = re.sub('\n', ' ', sentence.text)
                #Bold the entity, again ignoring capitalization; re.escape treats characters like "?" in the entity text literally
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
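
Entity text sometimes contains characters that regular expressions treat as syntax, such as the `?` in `CHAMUSQUINAS DE AMOR?` from the entity list earlier. The sketch below, which is not part of the original notebook, compares bolding with and without `re.escape()`. The helper name `bold_entity` and the sample sentence are made up for illustration.

```python
import re


def bold_entity(sentence_text, entity_text):
    """Wrap every case-insensitive occurrence of entity_text in Markdown bold."""
    # re.escape makes characters such as '?' or '(' match literally.
    pattern = re.escape(entity_text)
    return re.sub(pattern, lambda m: f"**{m.group(0)}**", sentence_text, flags=re.IGNORECASE)


# Hypothetical sentence built around an entity string from the output above.
sentence = "--¿CHAMUSQUINAS DE AMOR?"
entity = "CHAMUSQUINAS DE AMOR?"

# Without escaping, "R?" means "an optional R", so the final "?" is left over.
unescaped = re.sub(entity, f"**{entity}**", sentence, flags=re.IGNORECASE)
escaped = bold_entity(sentence, entity)

print(unescaped)  # --¿**CHAMUSQUINAS DE AMOR?**?
print(escaped)    # --¿**CHAMUSQUINAS DE AMOR?**
```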

In [15]:
for document in chunked_documents:
    get_ner_in_context('Francia', document)

---

**LOC**

Por dicha suya fué el «bello país de **Francia**,» la hospitalaria Paris, el lugar de su destierro.  

---

**LOC**

Sin embargo, Mauricio amaba tambien la **Francia**.  

---

**LOC**

Quizá es de **Francia**.

---

**LOC**

--A **Francia**, amada mia, para pedir al sepulcro los restos que lloras y devolverlos á la tierra de la patria.  