---
# NER and NEL with NLP Tools
----

> ### General Instructions
> - Run this notebook cell by cell and observe what happens.
> - If a cell is marked with `TODO`, there is a small task for you.
> - Report the observations in your submission doc, as requested in the assignment.
> - Be generous with screenshots - show us your results!

---

In [1]:
%pip install ipykernel

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install spacy
%pip install nltk
%pip install wikipedia-api


# spaCy
>First, we will perform `NER (Named Entity Recognition)` and `NEL (Named Entity Linking)` with spaCy.
><br>
><br>spaCy is a free and open-source Python library with a lot of built-in capabilities. 
><br>It’s becoming increasingly popular for processing and analyzing data in the field of NLP, and you will soon see why.

### Import

In [3]:
# import the necessary libraries
import spacy
from spacy import displacy
import wikipediaapi 

## English

>In this section, you will see how spaCy's `English-language` model works.

### Load the English model

In [4]:
# NOTE Run the cells one by one and in order: each cell depends on the cells before it.
# the below lines help you download the English model 
# and load it to this notebook
spacy.cli.download('en_core_web_sm')
import en_core_web_sm
nlp_english = en_core_web_sm.load()

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Save your text to a variable

In [5]:
# TODO
# the below string is given to you as an example.
# substitute it with the text of your choice.
# NOTE: triple quotes allow you to paste a multiline text

text = """
A joint UN team led by the WHO assessed the hospital for one hour following its occupation by the Israeli military and as some patients and those seeking shelter there began to evacuate it. The team said they saw evidence of shelling and gunfire and observed a mass grave at the hospital's entrance. They were told it held the remains of 80 people. Following an evacuation which the hospital director said was ordered by the Israeli army but which the army said was requested by the director, 300 critically ill patients remain in al-Shifa - formerly the largest and most advanced hospital in Gaza. The WHO said it was trying to arrange the urgent evacuation of remaining patients and staff to other facilities in Gaza, and repeated calls for a ceasefire. Meanwhile, the White House has responded to a report in the Washington Post which said Israel, Hamas and the US were on the verge of a deal that would see the release of women and children seized by Hamas on 7 October in exchange for a five-day pause in fighting. A White House spokesperson said no such deal had yet been reached but it was working hard to get one agreed. Hundreds of people, including some patients, left al-Shifa on Saturday. A journalist at al-Shifa hospital told the BBC that only "patients who could not move, and a very small number of doctors" remained behind.  "We raised our hands and carried white flags," Khader, a journalist who had been at al-Shifa, told the BBC. 
"""

### Run the NER

In [6]:
# process the input text using the spaCy model
# NOTE If the output is too long to be displayed, open it in a text editor as suggested.

article = nlp_english(text)

for entity in article.ents:
    print(f'{entity.label_}\t{entity.text}')

ORG	UN
ORG	WHO
TIME	one hour
NORP	Israeli
CARDINAL	80
NORP	Israeli
CARDINAL	300
ORG	al-Shifa - formerly
GPE	Gaza
ORG	WHO
GPE	Gaza
ORG	the White House
ORG	the Washington Post
GPE	Israel
ORG	Hamas
GPE	US
ORG	Hamas
DATE	7 October
DATE	five-day
ORG	White House
CARDINAL	one
CARDINAL	Hundreds
ORG	al-Shifa
DATE	Saturday
ORG	al-Shifa
ORG	BBC
PERSON	Khader
ORG	al-Shifa
ORG	BBC
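
For a quick overview of a long entity list like the one above, the label/text pairs can be tallied per label. The following is a minimal sketch using only the standard library; `pairs`, `count_labels` and `unique_entities` are illustrative names that are not part of the notebook, and the list holds only a few pairs copied from the output.

```python
from collections import Counter

# A few (label, text) pairs copied from the output above.
# In the notebook, the full list could be built with:
#   pairs = [(ent.label_, ent.text) for ent in article.ents]
pairs = [
    ("ORG", "UN"),
    ("ORG", "WHO"),
    ("TIME", "one hour"),
    ("NORP", "Israeli"),
    ("GPE", "Gaza"),
    ("ORG", "WHO"),
    ("GPE", "Gaza"),
    ("PERSON", "Khader"),
]


def count_labels(pairs):
    """Count how often each entity label occurs."""
    return Counter(label for label, _ in pairs)


def unique_entities(pairs):
    """Group distinct entity texts by label, keeping first-seen order."""
    grouped = {}
    for label, text in pairs:
        texts = grouped.setdefault(label, [])
        if text not in texts:
            texts.append(text)
    return grouped


label_counts = count_labels(pairs)
entities_by_label = unique_entities(pairs)

# Print one line per label: label, number of mentions, distinct texts
for label, n in label_counts.most_common():
    print(f"{label}\t{n}\t{', '.join(entities_by_label[label])}")
```

Such a summary makes it easier to decide which labels to include in the filter list of the linking function.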


### Visualize the NER results

In [7]:
# this nice feature allows you to visualize the entities
displacy.render(article, style="ent", jupyter=True)

### NEL: Linking to Wikipedia data

In [8]:
# this is a function allowing you to link the entities to the Wikipedia database
def entity_linking(some_text):
    doc = nlp_english(some_text)
    wiki_wiki = wikipediaapi.Wikipedia('en', headers={'User-Agent': 'my-custom-agent'})
    unique_links = set() # use a set to avoid duplicates
    
    for ent in doc.ents:
        
        # TODO modify the below list so that it includes all the types of entities you have found in your text
        if ent.label_ in ['PERSON', 'ORG']: # modify this list, e.g. add 'GPE'
            entity_page = wiki_wiki.page(ent.text)
            if entity_page.exists() and entity_page.fullurl not in unique_links:
                unique_links.add(entity_page.fullurl)
                print(f"{ent.text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking(text)


UN	Wikipedia URL: https://en.wikipedia.org/wiki/United_Nations
WHO	Wikipedia URL: https://en.wikipedia.org/wiki/World_Health_Organization
the White House	Wikipedia URL: https://en.wikipedia.org/wiki/White_House
the Washington Post	Wikipedia URL: https://en.wikipedia.org/wiki/The_Washington_Post
Hamas	Wikipedia URL: https://en.wikipedia.org/wiki/Hamas
al-Shifa	Wikipedia URL: https://en.wikipedia.org/wiki/The_Book_of_Healing
BBC	Wikipedia URL: https://en.wikipedia.org/wiki/BBC
Khader	Wikipedia URL: https://en.wikipedia.org/wiki/Khadr


## German

> Run the cells below to see how spaCy copes with German text.
><br>`TODO`: Select a short German text of your own and see how the model manages to process it.
>><br>`NOTE`: If you are doing the `Bonus` task, substitute German with another language of your choice available in spaCy (https://spacy.io/usage/models). Try to figure out what needs to be changed when changing the language. No worries, you are not going to lose any points if you do not manage to make it run - that's what `BONUS` is for!

### Load the German model

In [14]:
# here you download and load the German model
spacy.cli.download('de_core_news_sm') # NOTE modify this line if working with a different language
import de_core_news_sm # NOTE modify this line if working with a different language
nlp = de_core_news_sm.load() # NOTE modify this line if working with a different language

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
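
When switching languages, three things change together in this notebook: the spaCy model package, the Wikipedia language code, and the entity labels used in the linking filter. One way to keep them in sync is a small lookup table, sketched below. `LanguageSetup`, `SETUPS` and `get_setup` are hypothetical names; the English and German entries mirror the cells of this notebook, while the French entry (model name and labels) is an assumption that should be checked against https://spacy.io/usage/models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageSetup:
    model_name: str     # spaCy package to download and load
    wiki_language: str  # language code passed to wikipediaapi.Wikipedia
    link_labels: tuple  # entity labels to look up on Wikipedia


SETUPS = {
    "english": LanguageSetup("en_core_web_sm", "en", ("PERSON", "ORG")),
    "german": LanguageSetup("de_core_news_sm", "de", ("PER", "LOC", "MISC")),
    # Assumption: verify the model name and its labels before relying on them
    "french": LanguageSetup("fr_core_news_sm", "fr", ("PER", "LOC", "MISC")),
}


def get_setup(language):
    """Return the settings for a language name, case-insensitively."""
    try:
        return SETUPS[language.lower()]
    except KeyError:
        raise ValueError(f"No setup defined for {language!r}") from None


setup = get_setup("German")
print(setup.model_name, setup.wiki_language, setup.link_labels)

# In the notebook, the values would be used roughly like this:
#   spacy.cli.download(setup.model_name)
#   nlp = spacy.load(setup.model_name)
#   wiki_wiki = wikipediaapi.Wikipedia(setup.wiki_language,
#                                      headers={'User-Agent': 'my-custom-agent'})
#   if ent.label_ in setup.link_labels: ...
```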


### Save your text to a variable

In [15]:
# TODO: substitute the string with a text of your choice.
# It does not matter whether you speak German; observing the labels is enough.

text = """
Ein gemeinsames Team der Vereinten Nationen unter der Leitung der WHO hat das Krankenhaus eine Stunde lang beurteilt, nachdem es von der israelischen Armee besetzt wurde und als einige Patienten und Personen, die dort Schutz suchten, zu evakuieren begannen. Das Team gab an, Beweise für Beschuss und Schusswaffen gesehen zu haben und beobachtete ein Massengrab am Eingang des Krankenhauses. Ihnen wurde gesagt, dass es die Überreste von 80 Personen enthalte. Nach einer Evakuierung, die der Krankenhausdirektor als von der israelischen Armee angeordnet bezeichnete, die Armee aber sagte, sie sei vom Direktor angefordert worden, verbleiben 300 kritisch kranke Patienten in Al-Shifa - ehemals das größte und fortschrittlichste Krankenhaus in Gaza. Die WHO sagte, sie versuche, die dringende Evakuierung der verbleibenden Patienten und des Personals in andere Einrichtungen in Gaza zu arrangieren, und wiederholte die Forderungen nach einem Waffenstillstand. In der Zwischenzeit hat das Weiße Haus auf einen Bericht in der Washington Post reagiert, der besagte, Israel, Hamas und die USA stünden kurz vor einer Vereinbarung, die die Freilassung von Frauen und Kindern, die am 7. Oktober von der Hamas gefangen genommen wurden, im Austausch für eine fünftägige Kampfpause vorsieht. Ein Sprecher des Weißen Hauses sagte, dass noch keine solche Vereinbarung erreicht worden sei, aber man arbeite hart daran, eine zu vereinbaren. Hunderte von Menschen, darunter einige Patienten, verließen Al-Shifa am Samstag. Ein Journalist im Al-Shifa-Krankenhaus erzählte der BBC, dass nur "Patienten, die sich nicht bewegen konnten, und eine sehr kleine Anzahl von Ärzten" zurückgeblieben seien. "Wir haben unsere Hände erhoben und weiße Fahnen getragen", sagte Khader, ein Journalist, der sich in Al-Shifa befunden hatte, der BBC.
"""

### Run the NER

In [16]:
article = nlp(text)

for entity in article.ents:
    print(f'{entity.label_}\t{entity.text}')

ORG	Vereinten Nationen
ORG	WHO
ORG	israelischen Armee
ORG	Krankenhausdirektor
ORG	israelischen Armee
ORG	Armee
LOC	Al-Shifa
LOC	Gaza
ORG	WHO
LOC	Gaza
LOC	Weiße Haus
ORG	Washington Post
LOC	Israel
LOC	Hamas
LOC	USA
ORG	Hamas
PER	Al-Shifa
ORG	BBC
LOC	Al-Shifa
ORG	BBC


### Visualize the NER Results

In [17]:
# visualize the NE
displacy.render(article, style="ent", jupyter=True)

### NEL: Linking to Wikipedia Data

In [18]:
# this is a function allowing you to link the entities to the Wikipedia database

def entity_linking(some_text):
    
    doc = nlp(some_text)

    # NOTE make sure to change 'de' to the new language abbreviation if working with a different language
    wiki_wiki = wikipediaapi.Wikipedia('de', headers={'User-Agent': 'my-custom-agent'}) # NOTE modify this line if working with a different language
    unique_links = set()
    
    for ent in doc.ents:

        if ent.label_ in ['PER', 'LOC', 'MISC']: # TODO change this list depending on the entity types found in your text
            entity_page = wiki_wiki.page(ent.text)
            if entity_page.exists() and entity_page.fullurl not in unique_links:
                unique_links.add(entity_page.fullurl)
                print(f"{ent.text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking(text)

Al-Shifa	Wikipedia URL: https://en.wikipedia.org/wiki/The_Book_of_Healing
Gaza	Wikipedia URL: https://en.wikipedia.org/wiki/Gaza
Israel	Wikipedia URL: https://en.wikipedia.org/wiki/Israel
Hamas	Wikipedia URL: https://en.wikipedia.org/wiki/Hamas
USA	Wikipedia URL: https://en.wikipedia.org/wiki/United_States


-----
# NLTK
> Now, we will perform `NER (Named Entity Recognition)` and `NEL (Named Entity Linking)` with NLTK.
>
>> "NLTK is a leading platform for building Python programs to work with human language data. 
>> <br>It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, 
>> <br>along with a suite of text processing libraries for classification, tokenization, stemming, 
>> <br>tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, 
>> <br>and an active discussion forum." (https://www.nltk.org/)
>


## English

### Import

In [19]:
# import and load the necessary packages
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]     (_ssl.c:1000)>
[nltk_data] Error loading maxent_ne_chunker: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading words: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>


False

### Save Your Text to a Variable

In [20]:
text = """ A joint UN team led by the WHO assessed the hospital for one hour following its occupation by the Israeli military and as some patients and those seeking shelter there began to evacuate it. The team said they saw evidence of shelling and gunfire and observed a mass grave at the hospital's entrance. They were told it held the remains of 80 people. Following an evacuation which the hospital director said was ordered by the Israeli army but which the army said was requested by the director, 300 critically ill patients remain in al-Shifa - formerly the largest and most advanced hospital in Gaza. The WHO said it was trying to arrange the urgent evacuation of remaining patients and staff to other facilities in Gaza, and repeated calls for a ceasefire. Meanwhile, the White House has responded to a report in the Washington Post which said Israel, Hamas and the US were on the verge of a deal that would see the release of women and children seized by Hamas on 7 October in exchange for a five-day pause in fighting. A White House spokesperson said no such deal had yet been reached but it was working hard to get one agreed. Hundreds of people, including some patients, left al-Shifa on Saturday. A journalist at al-Shifa hospital told the BBC that only "patients who could not move, and a very small number of doctors" remained behind.  "We raised our hands and carried white flags," Khader, a journalist who had been at al-Shifa, told the BBC. 
""" # TODO paste your English text here between the triple quotes

### Run the NER

In [21]:
# this part performs the NER of your text
# NOTE that the process is different from spaCy. Can you spot the difference?
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

ORGANIZATION WHO
GPE Israeli
GPE Israeli
GPE al-Shifa
LOCATION Gaza
ORGANIZATION WHO
GPE Gaza
FACILITY White House
ORGANIZATION Washington Post
GPE Israel
PERSON Hamas
ORGANIZATION US
PERSON Hamas
FACILITY White House
ORGANIZATION al-Shifa
ORGANIZATION BBC
PERSON Khader
ORGANIZATION al-Shifa
ORGANIZATION BBC
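
spaCy and NLTK use different label names (for example `ORG` vs `ORGANIZATION`) and do not always agree on the same span. To compare the two English runs systematically, the outputs can be lined up by entity text. The sketch below uses only a few pairs copied from the two outputs in this notebook; `NLTK_TO_SPACY` is a hypothetical mapping chosen for illustration, and `compare` is not part of the notebook.

```python
# Subsets of the (label, text) pairs printed by the two English NER cells
spacy_pairs = [
    ("ORG", "UN"), ("ORG", "WHO"), ("GPE", "Gaza"),
    ("ORG", "Hamas"), ("GPE", "US"), ("PERSON", "Khader"),
]
nltk_pairs = [
    ("ORGANIZATION", "WHO"), ("GPE", "Gaza"),
    ("PERSON", "Hamas"), ("ORGANIZATION", "US"), ("PERSON", "Khader"),
]

# Assumed mapping from NLTK label names to the closest spaCy label
NLTK_TO_SPACY = {
    "ORGANIZATION": "ORG",
    "PERSON": "PERSON",
    "GPE": "GPE",
    "LOCATION": "LOC",
    "FACILITY": "FAC",
}


def compare(spacy_pairs, nltk_pairs):
    """Line up two NER outputs by entity text.

    If a text occurs with several labels, the last one wins.
    """
    spacy_labels = {text: label for label, text in spacy_pairs}
    nltk_labels = {text: NLTK_TO_SPACY.get(label, label) for label, text in nltk_pairs}
    shared = spacy_labels.keys() & nltk_labels.keys()
    return {
        "only_spacy": sorted(spacy_labels.keys() - nltk_labels.keys()),
        "only_nltk": sorted(nltk_labels.keys() - spacy_labels.keys()),
        "agree": sorted(t for t in shared if spacy_labels[t] == nltk_labels[t]),
        "disagree": sorted(t for t in shared if spacy_labels[t] != nltk_labels[t]),
    }


result = compare(spacy_pairs, nltk_pairs)
for key, texts in result.items():
    print(f"{key}: {texts}")
```

For these pairs, the two tools agree on `Gaza`, `Khader` and `WHO`, disagree on `Hamas` and `US`, and only spaCy finds `UN`.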


### NEL: Linking to Wikipedia Data

In [22]:
# here, we define a function to both extract NE and link them to the Wikipedia database

def entity_linking_nltk(some_text):
    sentences = nltk.sent_tokenize(some_text)
    
    wiki_wiki = wikipediaapi.Wikipedia('en', headers={'User-Agent': 'my-custom-agent'})
    unique_links = set()

    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))

        for chunk in chunks:
            if hasattr(chunk, 'label'):
                entity_text = ' '.join(c[0] for c in chunk)
                entity_page = wiki_wiki.page(entity_text)
                
                if entity_page.exists() and entity_page.fullurl not in unique_links:
                    unique_links.add(entity_page.fullurl)
                    print(f"{entity_text}\tWikipedia URL: {entity_page.fullurl}")

# this line calls the above function with your text as its argument
entity_linking_nltk(text)


WHO	Wikipedia URL: https://en.wikipedia.org/wiki/World_Health_Organization
Israeli	Wikipedia URL: https://en.wikipedia.org/wiki/Israeli
al-Shifa	Wikipedia URL: https://en.wikipedia.org/wiki/The_Book_of_Healing
Gaza	Wikipedia URL: https://en.wikipedia.org/wiki/Gaza
White House	Wikipedia URL: https://en.wikipedia.org/wiki/White_House
Washington Post	Wikipedia URL: https://en.wikipedia.org/wiki/The_Washington_Post
Israel	Wikipedia URL: https://en.wikipedia.org/wiki/Israel
Hamas	Wikipedia URL: https://en.wikipedia.org/wiki/Hamas
US	Wikipedia URL: https://en.wikipedia.org/wiki/United_States
BBC	Wikipedia URL: https://en.wikipedia.org/wiki/BBC
Khader	Wikipedia URL: https://en.wikipedia.org/wiki/Khadr


## German 


><br>Select a short German text of your own and see how the model manages to process it.
>><br>If you are doing the `Bonus` task, try pasting a text in some different language and report what happens.

### Save Your Text to a Variable

In [23]:
text = """Ein gemeinsames Team der Vereinten Nationen unter der Leitung der WHO hat das Krankenhaus eine Stunde lang beurteilt, nachdem es von der israelischen Armee besetzt wurde und als einige Patienten und Personen, die dort Schutz suchten, zu evakuieren begannen. Das Team gab an, Beweise für Beschuss und Schusswaffen gesehen zu haben und beobachtete ein Massengrab am Eingang des Krankenhauses. Ihnen wurde gesagt, dass es die Überreste von 80 Personen enthalte. Nach einer Evakuierung, die der Krankenhausdirektor als von der israelischen Armee angeordnet bezeichnete, die Armee aber sagte, sie sei vom Direktor angefordert worden, verbleiben 300 kritisch kranke Patienten in Al-Shifa - ehemals das größte und fortschrittlichste Krankenhaus in Gaza. Die WHO sagte, sie versuche, die dringende Evakuierung der verbleibenden Patienten und des Personals in andere Einrichtungen in Gaza zu arrangieren, und wiederholte die Forderungen nach einem Waffenstillstand. In der Zwischenzeit hat das Weiße Haus auf einen Bericht in der Washington Post reagiert, der besagte, Israel, Hamas und die USA stünden kurz vor einer Vereinbarung, die die Freilassung von Frauen und Kindern, die am 7. Oktober von der Hamas gefangen genommen wurden, im Austausch für eine fünftägige Kampfpause vorsieht. Ein Sprecher des Weißen Hauses sagte, dass noch keine solche Vereinbarung erreicht worden sei, aber man arbeite hart daran, eine zu vereinbaren. Hunderte von Menschen, darunter einige Patienten, verließen Al-Shifa am Samstag. Ein Journalist im Al-Shifa-Krankenhaus erzählte der BBC, dass nur "Patienten, die sich nicht bewegen konnten, und eine sehr kleine Anzahl von Ärzten" zurückgeblieben seien. "Wir haben unsere Hände erhoben und weiße Fahnen getragen", sagte Khader, ein Journalist, der sich in Al-Shifa befunden hatte, der BBC.
"""

### Run the NER

In [25]:
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

PERSON Ein
PERSON Team
PERSON Vereinten Nationen
PERSON Leitung
ORGANIZATION WHO
GPE Krankenhaus
PERSON Stunde
PERSON Armee
PERSON Patienten
PERSON Schutz
PERSON Das
PERSON Team
PERSON Beweise
GPE Schusswaffen
PERSON Massengrab
PERSON Eingang
GPE Ihnen
PERSON Überreste
GPE Nach
PERSON Evakuierung
PERSON Armee
PERSON Armee
PERSON Patienten
GPE Al-Shifa
LOCATION Gaza
PERSON Die
ORGANIZATION WHO
PERSON Patienten
LOCATION Gaza
GPE Waffenstillstand
PERSON Weiße Haus
ORGANIZATION Washington Post
GPE Israel
PERSON Hamas
PERSON Vereinbarung
PERSON Freilassung
GPE Kindern
PERSON Hamas
ORGANIZATION Kampfpause
PERSON Ein
ORGANIZATION Sprecher
PERSON Weißen Hauses
PERSON Vereinbarung
GPE Hunderte
PERSON Patienten
PERSON Samstag
PERSON Ein
ORGANIZATION Journalist
ORGANIZATION BBC
PERSON Patienten
PERSON Anzahl
PERSON Hände
PERSON Fahnen
ORGANIZATION Journalist
ORGANIZATION BBC


### NEL: Linking to Wikipedia Data

In [26]:
def entity_linking_nltk(some_text):
    sentences = nltk.sent_tokenize(some_text)
    
    # NOTE you will need to substitute 'de' with your language, if doing the bonus task. The rest remains unchanged
    wiki_wiki = wikipediaapi.Wikipedia('de', headers={'User-Agent': 'my-custom-agent'}) 
    unique_links = set()

    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))

        for chunk in chunks:
            if hasattr(chunk, 'label'):
                entity_text = ' '.join(c[0] for c in chunk)
                entity_page = wiki_wiki.page(entity_text)
                
                if entity_page.exists() and entity_page.fullurl not in unique_links:
                    unique_links.add(entity_page.fullurl)
                    print(f"{entity_text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking_nltk(text)

Ein	Wikipedia URL: https://en.wikipedia.org/wiki/Ein
Team	Wikipedia URL: https://en.wikipedia.org/wiki/Team
WHO	Wikipedia URL: https://en.wikipedia.org/wiki/World_Health_Organization
Schutz	Wikipedia URL: https://en.wikipedia.org/wiki/Schutz
Das	Wikipedia URL: https://en.wikipedia.org/wiki/Das
Eingang	Wikipedia URL: https://en.wikipedia.org/wiki/Cadenza
Ihnen	Wikipedia URL: https://en.wikipedia.org/wiki/Ihnen
Nach	Wikipedia URL: https://en.wikipedia.org/wiki/Nach
Al-Shifa	Wikipedia URL: https://en.wikipedia.org/wiki/The_Book_of_Healing
Gaza	Wikipedia URL: https://en.wikipedia.org/wiki/Gaza
Die	Wikipedia URL: https://en.wikipedia.org/wiki/Die
Washington Post	Wikipedia URL: https://en.wikipedia.org/wiki/The_Washington_Post
Israel	Wikipedia URL: https://en.wikipedia.org/wiki/Israel
Hamas	Wikipedia URL: https://en.wikipedia.org/wiki/Hamas
Sprecher	Wikipedia URL: https://en.wikipedia.org/wiki/Sprecher
Journalist	Wikipedia URL: https://en.wikipedia.org/wiki/Journalist
BBC	Wikipedia URL: http