# Named Entity Recognition: Danish

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Danish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Danish is *Evangelines Genvordigheder: Til Kvinder med rødt Haar* by Elinor Glyn [from Project Gutenberg](http://www.gutenberg.org/ebooks/33632).

**Here's a preview of spaCy's NER tagging of *Evangelines Genvordigheder: Til Kvinder med rødt Haar*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Danish NER is considerably worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs LOC vs PER. A model needs a lot of labeled examples to learn to tell entity types apart; currently, the English model is the only one that does a decent job of it.

You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [3]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also raising the pandas defaults for the maximum number of rows and the maximum column width it displays).

### Download Language Model

Next we need to download the Danish-language model (`da_core_news_md`), which will be processing and making predictions about our texts. You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [None]:
!python -m spacy download da_core_news_md

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [3]:
import da_core_news_md
nlp = da_core_news_md.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('da_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, because then you can use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).
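
A downloaded spaCy model is installed as an ordinary Python package, so the standard library can report whether it is importable in the current kernel before you pick an option. This is only a minimal sketch: `model_is_installed` is a hypothetical helper, not part of spaCy, and it says nothing about whether `spacy.load()` by name will work without a restart.

```python
import importlib
import importlib.util

def model_is_installed(package_name):
    """Return True if `package_name` can be imported in the current kernel."""
    # Refresh import caches so packages installed after kernel start are noticed
    importlib.invalidate_caches()
    return importlib.util.find_spec(package_name) is not None

print(model_is_installed("da_core_news_md"))
```

If this prints `False`, the download cell above has not been run in this environment yet.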

## Process Document

We first need to process our text with the loaded NLP model to create a `document`. Most of the heavy NLP lifting is done in this one line of code.

After processing, the `document` object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text and create our document.

In [2]:
filepath = '../texts/other-languages/da.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
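
Plain-text files can begin with a byte order mark (BOM). Reading with `encoding='utf-8'` keeps it as an invisible first character, which can then end up inside the first entity. The sketch below compares `utf-8` with `utf-8-sig` on a small temporary stand-in file; the file and its contents are made up for illustration, and whether your copy of the example text has a BOM is an assumption to check.

```python
import os
import tempfile

# Write a small stand-in file that starts with a UTF-8 byte order mark
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as handle:
    handle.write(b"\xef\xbb\xbfEVANGELINES GENVORDIGHEDER")
    sample_path = handle.name

with open(sample_path, encoding="utf-8") as file:
    with_bom = file.read()       # keeps "\ufeff" as the first character
with open(sample_path, encoding="utf-8-sig") as file:
    without_bom = file.read()    # drops the BOM if one is present

os.remove(sample_path)
print(repr(with_bom[0]), repr(without_bom[0]))
```

Passing `encoding='utf-8-sig'` to `open()` is harmless when no BOM is present, so it is a reasonable default for downloaded texts.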

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [4]:
document.ents

(﻿EVANGELINES
  GENVORDIGHEDER,
 RØDT HAAR,
 ELINOR GLYN,
 MARTINS FORLAG
  ,
 MARTIN'S,
 KØBENHAVN
 
 
 
 
 BEGYNDELSEN PAA,
 Branches Park,
 Eventyrerske,
 Hatte paa,
 Slags Ting,
 Eventyrerske,
 Fru Carruthers,
 Testamente,
 Fru Carruthers,
 Fru Carruthers,
 Papa,
 Mama,
 Mama,
 Mama,
 Afkald
 paa,
 Lord,
 Moder,
 Mama,
 Papa,
 Indien,
 Eventyrerske,
 Fru Carruthers,
 Slags Katte,
 Tante,
 Fru Carruthers,
 Pjank,
 Paris,
 Rusland,
 England,
 Eventyrerske,
 Lyst,
 Liv,
 Hr. Carruthers,
 Tante,
 Hjem,
 Buket Violer,
 Brystet paa,
 Fru Carruthers Død,
 Doktor Garrison,
 døve,
 Fru Carruthers,
 Sted,
 Schweiz,
 Efteraaret,
 Sted til Sted,
 London,
 Hjertetilfælde,
 Thomas,
 Forbavselsen,
 Diamantring,
 Eventyrerske,
 Hr. Carruthers,
 Gudernes Skød,
 Officielt,
 Fyre,
 Efteraaret,
 Bridge,
 Fru
 Carruthers,
 Politikere,
 Skolestuen,
 Mademoiselle,
 Mademoiselle,
 Selskabeligheden,
 London,
 Gud,
 Paris,
 Hr. Carruthers,
 London,
 Fru Carruthers,
 Hr. Carruthers,
 Christopher,
 Christophe

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [5]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

﻿EVANGELINES
 GENVORDIGHEDER ORG
RØDT HAAR LOC
ELINOR GLYN PER
MARTINS FORLAG
  PER
MARTIN'S MISC
KØBENHAVN




BEGYNDELSEN PAA ORG
Branches Park MISC
Eventyrerske MISC
Hatte paa LOC
Slags Ting MISC
Eventyrerske MISC
Fru Carruthers PER
Testamente MISC
Fru Carruthers PER
Fru Carruthers PER
Papa PER
Mama PER
Mama PER
Mama PER
Afkald
paa PER
Lord PER
Moder
 PER
Mama PER
Papa PER
Indien LOC
Eventyrerske MISC
Fru Carruthers PER
Slags Katte LOC
Tante PER
Fru Carruthers PER
Pjank MISC
Paris LOC
Rusland LOC
England LOC
Eventyrerske MISC
Lyst ORG
Liv PER
Hr. Carruthers PER
Tante PER
Hjem LOC
Buket Violer MISC
Brystet paa LOC
Fru Carruthers Død PER
Doktor Garrison PER
døve MISC
Fru Carruthers PER
Sted LOC
Schweiz LOC
Efteraaret
 PER
Sted til Sted LOC
London LOC
Hjertetilfælde PER
Thomas PER
Forbavselsen PER
Diamantring MISC
Eventyrerske MISC
Hr. Carruthers PER
Gudernes Skød MISC
Officielt ORG
Fyre
 PER
Efteraaret LOC
Bridge PER
Fru
Carruthers PER
Politikere ORG
Skolestuen LOC
Mademoiselle PER
Ma

Parken LOC
Uforskammethed LOC
Lord Robert PER
Claridges ORG
Kur PER
Véronique PER
London LOC
Lord Robert PER
døve MISC
Lord Robert PER
Hr. Carruthers PER
Gongongen PER
Lady
Katherine PER
Sted LOC
paa Vejen LOC
Lord Robert PER
Spisestuen LOC
Lady Verningham PER
Mackintosh PER
Marys Mand LOC
Botanik ORG
skotsk MISC
Malcolm PER
Botanikeren PER
Lady
Katherine PER
Slags Halvsorg LOC
Lady Verningham PER
Sofa ORG
Krog PER
Sofaerne LOC
Branches ORG
McTavish ORG
Mackintosh PER
McTavish ORG
Slægten McTavish
 ORG
Mary saadan PER
Mary PER
Eventyrerske MISC
En Eventyrerske MISC
Fru
Carruthers PER
De lille Pus MISC
Robert Vavasour PER
Kælebørn PER
Lord Robert PER
Lord Robert PER
Lord Robert PER
Malcolm PER
Lord Roberts Øjne
 PER
Lady Verningham PER
Bridge ORG
Malcolm PER
Liv PER
Eventyrerske MISC
Gud PER
Lord Robert PER
LONDON LOC
Ting paa Vejen LOC
Strøm PER
Verdenshav PER
Lørdag Aften LOC
Tryland LOC
Minde LOC
Morgenmaaltidet ORG
Lord Robert PER
Malcolm PER
Lord Robert PER
Par Gange paa MISC
Lady


Christopher PER
Alicia Verneys PER
Aa, men MISC
Lady Ver PER
Strygninger PER
Slangepige PER
Christopher PER
Paris LOC
Sir Charles' Poulet PER
Victoria
 LOC
Gentleman PER
Carruthers Smaragderne PER
Pelsværket LOC
Suite Værelser LOC
Fru Carruthers PER
Dronning Victorias Tid MISC
Triumfer PER
Lunchen PER
Sir Charles PER
City LOC
Lady Ver PER
Bords ORG
Selskab paa MISC
Paris LOC
Gentleman PER
Ægtemand PER
Verden LOC
Generaler paa Branches ORG
Papa PER
Papa PER
Held PER
Livs Højdepunkt LOC
Lady Merrenden PER
Gang Lady Sophia Vavasour PER
Fru
Carruthers PER
Papa PER
Fru Carruthers PER
Mama PER
Fru Carruthers PER
Tiders Skyld MISC
Par Ord MISC
Oberst Tom Carden PER
Albany LOC
Jupiter LOC
Lady Ver PER
Lunch MISC
Montgomeries PER
Browns Hotel LOC
Tante Katherine PER
Pigebørn paa
 MISC
Skræderinder PER
Shakespeare MISC
Lady Ver PER
Tebordet ORG
Tryland LOC
Jean PER
Lady Ver PER
Paris LOC
Lady Katherine PER
Nièce PER
Forbindelsen med Familien ORG
Tryland LOC
Malcolm PER
Byen LOC
Lady Ver PER
Fedt

Robert PER
Skat
 ORG
Véronique PER
Slaabrok PER
Time MISC
Ilden i Dagligstuen MISC
Vrag MISC
Korridoren LOC
Robert PER
Lady Merrenden PER
Véronique PER
Børsten af Forbavselse ORG
Robert saá PER
Arme PER
Lady Merrenden PER
Evangeline PER
Tante Sophia PER
Robert PER
Robert PER
Evangeline, du Skat MISC
Robert PER
Lunch MISC
Lady Merrenden PER
Robert PER
Robert PER
Udsigt til Rigdom ORG
egenkærligt PER
Kærlighed
 PER
Lady Merrenden PER
Robert PER
Forhallen LOC
Robert PER
Time MISC
Robert PER
Véronique PER
Robert PER
Christopher PER
Lady Ver PER
Lady Ver PER
Robert PER
Lady Ver PER
Christopher PER
Véronique PER
Carlton House Terrace ORG
Robert PER
Oberst Tom Carden
 PER
London LOC
Carlton House Terrace ORG
Robert PER
Robert PER
Time MISC
Mængde Ting MISC
Lady Merrenden PER
Vavasour House LOC
Hertugens Værelse MISC
Lady Merrenden PER
Torquilstone PER
Robert PER
Robert PER
Tigerkat PER
Roberts Øjne PER
Lunch MISC
Carlton House Terrace MISC
Véronique PER
Krone paa MISC
Verden LOC
Robert PER
Fr

To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

In [8]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

ELINOR GLYN
MARTINS FORLAG
 
Fru Carruthers
Fru Carruthers
Fru Carruthers
Papa
Mama
Mama
Mama
Afkald
paa
Lord
Moder

Mama
Papa
Fru Carruthers
Tante
Fru Carruthers
Liv
Hr. Carruthers
Tante
Fru Carruthers Død
Doktor Garrison
Fru Carruthers
Efteraaret

Hjertetilfælde
Thomas
Forbavselsen
Hr. Carruthers
Fyre

Bridge
Fru
Carruthers
Mademoiselle
Mademoiselle
Gud
Hr. Carruthers
Fru Carruthers
Hr. Carruthers
Christopher
Christopher
Christopher
Uvillie
Gud
Fru Carruthers
Cicely Parkers
Præstens
Datter
Hr. Carruthers
Lyst
Hr. Carruthers
Hr. Barton
Fru Carruthers
Véronique
Hr. Carruthers
Dagligstue
Hr. Carruthers
Tantes
Befaling
Véronique
Hr. Carruthers
Sten
Hans Væsen
Hr. Barton
Hr. Barton
Hr. Carruthers
Hr. Barton
Hr. Carruthers Te
Blik
Fru
Carruthers
Fru Carruthers
Fru Carruthers
Christopher
Yndigheder
Liv
Hr. Barton
Modstandsevne
Hr. Barton
Forstillelsesevne
Hr.
Carruthers
Milady
Hr. Carruthers

Hr.
Carruthers
Humør
Tante
Arving
Faders

Fru Carruthers
Lampelyset

Fred
Hr. Barton
Skænd
Frøken T

## NER with Long Texts or Many Texts

In [9]:
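import math

# Split the text into roughly 80 equal-sized chunks of characters
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [10]:
# Process all the chunks as a batch and collect the resulting documents in a list
chunked_documents = list(nlp.pipe(text_chunks))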
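
The slices above are cut at fixed character offsets, so a chunk boundary can fall in the middle of a sentence or a name. One possible alternative, sketched here under the assumption that paragraphs in the text are separated by blank lines, is to group whole paragraphs into chunks. `chunk_by_paragraph` and `max_chars` are hypothetical names introduced for this illustration.

```python
def chunk_by_paragraph(text, max_chars):
    """Group blank-line-separated paragraphs into chunks of roughly max_chars."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Start a new chunk once adding this paragraph would exceed the budget
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks

sample_text = "aaa\n\nbbb\n\nccc"
sample_chunks = chunk_by_paragraph(sample_text, max_chars=8)
print(sample_chunks)
```

A single paragraph longer than `max_chars` is kept whole as its own oversized chunk, so this trades exact chunk sizes for intact paragraphs.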

## Get People

To extract and count the people, we will use an `if` statement that keeps an entity only if its label is `PER`.

In [11]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Robert | 102 |
| 1 | Lord Robert | 84 |
| 2 | Hr. Carruthers | 78 |
| 3 | Lady Ver | 59 |
| 4 | Christopher | 46 |
| 5 | Fru Carruthers | 45 |
| 6 | Lady Katherine | 42 |
| 7 | Lady Merrenden | 39 |
| 8 | Véronique | 33 |
| 9 | Malcolm | 32 |
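
Entity text can carry line breaks from the source file (in the output earlier, `Fru` and `Carruthers` sometimes appear on separate lines), and `Counter` treats those strings as different from `Fru Carruthers`. A sketch of collapsing whitespace before counting follows; `normalize_entity_text` is a hypothetical helper and the sample list is made up for illustration.

```python
from collections import Counter

def normalize_entity_text(entity_text):
    """Collapse runs of spaces and line breaks into single spaces."""
    return " ".join(entity_text.split())

# Made-up sample of entity strings, including two with embedded line breaks
sample_people = ["Fru Carruthers", "Fru\nCarruthers", "Lady\nKatherine", "Robert"]
sample_tally = Counter(normalize_entity_text(name) for name in sample_people)
print(sample_tally.most_common())
```

In the cell above, the same idea would mean appending `normalize_entity_text(named_entity.text)` instead of `named_entity.text`.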


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for entities whose label is `LOC`.

In [12]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | Paris | 21 |
| 1 | London | 20 |
| 2 | Verden | 13 |
| 3 | Vestibulen | 11 |
| 4 | Tryland | 11 |
| 5 | Vejen | 10 |
| 6 | Parken | 10 |
| 7 | Teatret | 7 |
| 8 | England | 6 |
| 9 | paa Lørdag | 6 |
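
Since the people and places cells differ only in the label they test, the pattern can be wrapped in one function. This is a sketch rather than the lesson's own code: `tally_entities` is a hypothetical name, and the stand-in objects below only mimic the `.ents`, `.text`, and `.label_` attributes of spaCy documents so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def tally_entities(documents, label):
    """Count the text of every entity in `documents` that carries `label`."""
    return Counter(
        entity.text
        for doc in documents
        for entity in doc.ents
        if entity.label_ == label
    )

# Stand-in documents (not real spaCy objects) with made-up entities
fake_docs = [
    SimpleNamespace(ents=[SimpleNamespace(text="Paris", label_="LOC"),
                          SimpleNamespace(text="Robert", label_="PER")]),
    SimpleNamespace(ents=[SimpleNamespace(text="Paris", label_="LOC")]),
]
print(tally_entities(fake_docs, "LOC"))
```

With real data, `tally_entities(chunked_documents, "LOC")` should give the same counts as the cell above, and swapping in `"PER"` or `"ORG"` covers the other labels.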


## Get NER in Context

In [13]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):

    # Default to people, organizations, and locations when no labels are given
    if not desired_ner_labels:
        desired_ner_labels = ['PER', 'ORG', 'LOC']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity (ignoring capitalization) and the label is one we want
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces so the sentence displays on one line
                sentence_text = re.sub('\n', ' ', sentence.text)
                # Bold the entity, ignoring capitalization; re.escape() keeps characters
                # like "." in "Hr. Carruthers" from being read as regex syntax
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

In [14]:
for document in chunked_documents:
    get_ner_in_context('Jupiter', document)

---

**LOC**

"Det vil jeg," lovede han, "og ved **Jupiter**, den Mand bliver en heldig Fyr."  
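
Entity text often contains characters that are special in regular expressions, such as the `.` in `Hr. Carruthers`, so it is worth escaping before using it as a pattern for highlighting. The following is a minimal, standalone sketch of that step; `bold_entity` is a hypothetical helper and the sample sentences are made up.

```python
import re

def bold_entity(sentence_text, entity_text):
    """Wrap every case-insensitive occurrence of `entity_text` in Markdown bold."""
    pattern = re.escape(entity_text)  # treat ".", "'", etc. as literal characters
    return re.sub(pattern, lambda match: f"**{match.group(0)}**",
                  sentence_text, flags=re.IGNORECASE)

print(bold_entity("Goddag, Hr. Carruthers.", "hr. carruthers"))
print(bold_entity("HrX Carruthers", "Hr. Carruthers"))
```

The first call bolds the name while keeping its original capitalization. The second leaves the text unchanged, whereas an unescaped `.` would have matched the `X`.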