---
# NER and NEL with NLP Tools
----

> ### General Instructions
> - Run this notebook cell by cell and observe what happens.
> - If a cell is marked with `TODO`, there is a small task for you.
> - Report the observations in your submission doc, as requested in the assignment.
> - Be generous with screenshots - show us your results!

---

In [None]:
%pip install ipykernel

In [None]:
%pip install spacy
%pip install nltk
%pip install wikipedia-api

# spaCy
>First, we will perform `NER (Named Entity Recognition)` and `NEL (Named Entity Linking)` with spaCy.
><br>
><br>spaCy is a free and open-source Python library with a lot of built-in capabilities. 
><br>It’s becoming increasingly popular for processing and analyzing data in the field of NLP, and you will soon see why.

### Import

In [None]:
# import the necessary libraries
import spacy
from spacy import displacy
import wikipediaapi 

## English

>In this section, you will see how spaCy's `English-language` model works.

### Load the English model

In [None]:
# NOTE: Run the cells one by one, in order; each cell depends on the cells before it.
# The lines below download the English model
# and load it into this notebook.
spacy.cli.download('en_core_web_sm')
import en_core_web_sm
nlp_english = en_core_web_sm.load()

### Save your text to a variable

In [None]:
# TODO
# the below string is given to you as an example.
# substitute it with the text of your choice.
# NOTE: triple quotations allow you to paste a multiline text

text = """
Secretary of State Antony J. Blinken made an unannounced visit on Sunday to the Israeli-occupied West Bank to meet with Mahmoud Abbas, the president of the internationally backed Palestinian Authority, and other Palestinian leaders.

The top American diplomat’s visit to the West Bank city of Ramallah followed talks with Israeli and Arab leaders in Tel Aviv and Amman, Jordan, that have focused on preventing Israel’s war against Hamas in the Gaza Strip from spreading and on convincing the Israeli government to do more to limit civilian casualties in the enclave.

In Israel, Mr. Blinken had urged protections for Palestinian noncombatants and for “humanitarian pauses” in the fighting, even as he supported Israel’s right to defend itself. The Gazan Health Ministry, which is part of the Hamas-run government, says that Israeli attacks have killed more than 9,400 people in the territory, a toll that has provoked outrage around the world — and in the West Bank.

Mr. Blinken and Mr. Abbas last held talks three weeks ago in Ramallah, days after Hamas extremists from Gaza launched a surprise attack that killed about 1,400 people in Israel, mostly civilians.

Mr. Abbas is the leader of the Palestinian Authority, which Hamas, a rival, ousted from Gaza in a violent coup in 2007 after winning elections the previous year. Mr. Abbas has long advocated the establishment of a Palestinian state alongside Israel, and Palestinian security forces under his direction have worked closely with Israel to arrest Palestinian militants.
"""

### Run the NER

In [None]:
# process the input text using the spaCy model
# NOTE: If the output is too long to be displayed, open it in a text editor as suggested.

article = nlp_english(text)

for entity in article.ents:
    print(f'{entity.label_}\t{entity.text}')
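
> The loop above prints one line per entity. For the report, it can also help to see how often each label occurs. The following is a minimal sketch, assuming the entities have already been collected as `(label, text)` pairs. `count_labels` and `sample_pairs` are hypothetical names, and the sample data only stands in for real model output.

```python
from collections import Counter


def count_labels(pairs):
    """Count how often each entity label occurs in (label, text) pairs."""
    return Counter(label for label, _ in pairs)


# Hypothetical sample standing in for real NER output
sample_pairs = [
    ("PERSON", "Antony J. Blinken"),
    ("GPE", "Israel"),
    ("PERSON", "Mahmoud Abbas"),
    ("ORG", "Hamas"),
    ("GPE", "Israel"),
]

label_counts = count_labels(sample_pairs)

# print the labels, most frequent first
for label, count in label_counts.most_common():
    print(f"{label}\t{count}")
```

> In the notebook, the pairs could be built with `[(entity.label_, entity.text) for entity in article.ents]`.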

### Visualize the NER results

In [None]:
# this nice feature allows you to visualize the entities
displacy.render(article, style="ent", jupyter=True)

### NEL: Linking to Wikipedia data

In [None]:
# this is a function allowing you to link the entities to the Wikipedia database
def entity_linking(some_text):
    doc = nlp_english(some_text)
    # recent wikipedia-api releases expect the user agent first, so pass both arguments by keyword
    wiki_wiki = wikipediaapi.Wikipedia(user_agent='my-custom-agent', language='en')
    unique_links = set() # use a set to avoid duplicates
    
    for ent in doc.ents:
        
        # TODO modify the below list so that it includes all the types of entities you have found in your text
        if ent.label_ in ['PERSON', 'ORG']: # modify this list, e.g. add 'GPE'
            entity_page = wiki_wiki.page(ent.text)
            if entity_page.exists() and entity_page.fullurl not in unique_links:
                unique_links.add(entity_page.fullurl)
                print(f"{ent.text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking(text)
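
> The function above does three things for every entity: it filters by label, looks up a page under the entity's exact text, and skips URLs it has already printed. The sketch below isolates that logic so it can be tried without a network connection. `link_entities`, `fake_pages`, and `sample_entities` are hypothetical names, and the lookup table is an offline stand-in for a live Wikipedia request.

```python
def link_entities(entities, allowed_labels, lookup):
    """Return (text, url) pairs for entities whose label is allowed.

    entities: iterable of (label, text) pairs
    lookup:   callable mapping an entity text to a URL, or None if no page exists
    """
    seen_urls = set()  # a set makes the duplicate check cheap
    links = []
    for label, ent_text in entities:
        if label not in allowed_labels:
            continue
        url = lookup(ent_text)
        if url is not None and url not in seen_urls:
            seen_urls.add(url)
            links.append((ent_text, url))
    return links


# Hypothetical offline lookup table used instead of a live Wikipedia request
fake_pages = {
    "Mahmoud Abbas": "https://en.wikipedia.org/wiki/Mahmoud_Abbas",
    "Hamas": "https://en.wikipedia.org/wiki/Hamas",
}

sample_entities = [
    ("PERSON", "Mahmoud Abbas"),
    ("ORG", "Hamas"),
    ("GPE", "Israel"),    # skipped: label is not in the allowed set
    ("PERSON", "Abbas"),  # skipped: no entry under this exact text in the sample table
    ("ORG", "Hamas"),     # skipped: URL was already collected
]

links = link_entities(sample_entities, {"PERSON", "ORG"}, fake_pages.get)
for ent_text, url in links:
    print(f"{ent_text}\tWikipedia URL: {url}")
```

> Because the lookup uses the entity's exact text, a partial mention such as a surname alone may not link to the page you expect. This is worth checking in your own results.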


## German

> Run the cells below to see how spaCy copes with German text.
><br>`TODO`: Select a short German text of your own and see how the model manages to process it.
>><br>`NOTE`: If you are doing the `Bonus` task, substitute German with another language of your choice available in spaCy (https://spacy.io/usage/models). Try to figure out what needs to be changed when changing the language. No worries, you are not going to lose any points if you do not manage to make it run - that's what `BONUS` is for!

### Load the German model

In [None]:
# here you download and load the German model
spacy.cli.download('de_core_news_sm') # NOTE modify this line if working with a different language
import de_core_news_sm # NOTE modify this line if working with a different language
nlp = de_core_news_sm.load() # NOTE modify this line if working with a different language
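
> When the language changes, three things change together: the model name, the Wikipedia language code, and the entity labels you filter on. The sketch below records these for the two languages used in this notebook. `LANGUAGE_SETTINGS` and `settings_for` are hypothetical names, and the label lists are only the ones this notebook filters on by default, not a full inventory of what each model emits.

```python
# Settings that change together when switching language (only the two
# languages used in this notebook; extend it yourself for the bonus task).
LANGUAGE_SETTINGS = {
    "en": {"model": "en_core_web_sm", "wiki": "en", "labels": ["PERSON", "ORG"]},
    "de": {"model": "de_core_news_sm", "wiki": "de", "labels": ["PER", "LOC", "MISC"]},
}


def settings_for(language):
    """Return the settings for a language code, or raise a readable error."""
    try:
        return LANGUAGE_SETTINGS[language]
    except KeyError:
        raise ValueError(f"No settings recorded for language {language!r}") from None


german = settings_for("de")
print(german["model"], german["wiki"], german["labels"])
```

> Note that the English and German models use different label names for people (`PERSON` and `PER`), so a label filter written for one model will silently match nothing in the other.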

### Save your text to a variable

In [None]:
# TODO: substitute the string with a text of your choice.
# It does not matter if you speak German or not, just observing the labels is enough.

text = """
«Wer bist du?» – Prinzessin Kates Antwort sagt viel über sie aus
Bei ihrem Besuch in Schottland wurde die Prinzessin von Wales von einem Kind gefragt, wer sie sei. Die Antwort mag überraschen, doch es steckt eine Absicht dahinter. Prinzessin Kate hat bei einem öffentlichen Auftritt einmal mehr ihre Unterstützung ihrem Ehemann und Thronfolger Prinz William gegenüber bewiesen. Bei ihrem kürzlichen Besuch in Schottland begrüssten die beiden Royals wie üblich zahlreiche Fans – auch ganz kleine vor einer Primarschule. Zunächst schwärmte Kate, die wie so oft alle Blicke auf sich zog, von der schottischen «Burghead»-Grundschule und sagte: «Ich finde die Schule ganz toll!» Dann kam es zu einer niedlichen Konversation zwischen Kate und einem Schuljungen. Dieser fragte die Prinzessin ganz unverblümt: «Wer bist du?» – eine Frage, die Kate wohl nicht alle Tage gestellt bekommt. Doch die Prinzessin reagiert, wie immer in der Öffentlichkeit, gelassen und gefasst: «Ich bin mit ihm verheiratet», sagt sie an den Jungen gewandt und zeigt dabei auf Prinz William, der unweit neben ihr steht.
"""

### Run the NER

In [None]:
article = nlp(text)

for entity in article.ents:
    print(f'{entity.label_}\t{entity.text}')

### Visualize the NER Results

In [None]:
# visualize the named entities
displacy.render(article, style="ent", jupyter=True)

### NEL: Linking to Wikipedia Data

In [None]:
# this is a function allowing you to link the entities to the Wikipedia database

def entity_linking(some_text):
    
    doc = nlp(some_text)

    # NOTE: change 'de' to the new language code if working with a different language
    # recent wikipedia-api releases expect the user agent first, so pass both arguments by keyword
    wiki_wiki = wikipediaapi.Wikipedia(user_agent='my-custom-agent', language='de')
    unique_links = set()
    
    for ent in doc.ents:

        if ent.label_ in ['PER', 'LOC', 'MISC']: # TODO change this list depending on the entity types found in your text
            entity_page = wiki_wiki.page(ent.text)
            if entity_page.exists() and entity_page.fullurl not in unique_links:
                unique_links.add(entity_page.fullurl)
                print(f"{ent.text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking(text)

-----
# NLTK
> Now, we will perform `NER (Named Entity Recognition)` and `NEL (Named Entity Linking)` with NLTK.
>
>> "NLTK is a leading platform for building Python programs to work with human language data. 
>> <br>It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, 
>> <br>along with a suite of text processing libraries for classification, tokenization, stemming, 
>> <br>tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, 
>> <br>and an active discussion forum." (https://www.nltk.org/)
>


## English

### Import

In [None]:
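# import and load the necessary packages
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# NOTE: recent NLTK releases look up these resources under new names, so download those as well
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')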


### Save Your Text to a Variable

In [None]:
text = """ """ # TODO paste your English text here between the triple quotes

### Run the NER

In [None]:
# this part performs the NER of your text
# NOTE that the process is different from spaCy. Can you spot the difference?
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
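
> Once you have run both libraries on the same text, comparing their entity lists side by side makes the differences easier to report. The following is a minimal sketch, assuming each library's entity texts have been collected into a plain list. `compare_entities`, `spacy_found`, and `nltk_found` are hypothetical names, and the two lists are made-up examples, not real output of either tool.

```python
def compare_entities(first, second):
    """Split two entity lists into shared and tool-specific entity texts."""
    first_set, second_set = set(first), set(second)
    return {
        "both": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# Made-up examples standing in for the entity texts found by each library
spacy_found = ["Mahmoud Abbas", "Hamas", "Israel", "Ramallah"]
nltk_found = ["Hamas", "Israel", "West Bank", "Mr. Blinken"]

comparison = compare_entities(spacy_found, nltk_found)
for group, names in comparison.items():
    print(f"{group}\t{', '.join(names)}")
```

> The comparison is on exact strings, so the same entity with slightly different boundaries (for example with and without a title) counts as two different entries.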

### NEL: Linking to Wikipedia Data

In [None]:
# here, we define a function to both extract NE and link them to the Wikipedia database

def entity_linking_nltk(some_text):
    sentences = nltk.sent_tokenize(some_text)
    
    wiki_wiki = wikipediaapi.Wikipedia(user_agent='my-custom-agent', language='en')
    unique_links = set()

    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))

        for chunk in chunks:
            if hasattr(chunk, 'label'):
                entity_text = ' '.join(c[0] for c in chunk)
                entity_page = wiki_wiki.page(entity_text)
                
                if entity_page.exists() and entity_page.fullurl not in unique_links:
                    unique_links.add(entity_page.fullurl)
                    print(f"{entity_text}\tWikipedia URL: {entity_page.fullurl}")

# this line calls the above function with your text as its argument
entity_linking_nltk(text)


## German 


><br>Select a short German text of your own and see how the model manages to process it.
>><br>If you are doing the `Bonus` task, try pasting a text in some different language and report what happens.

### Save Your Text to a Variable

In [None]:
text = """
«Wer bist du?» – Prinzessin Kates Antwort sagt viel über sie aus
Bei ihrem Besuch in Schottland wurde die Prinzessin von Wales von einem Kind gefragt, wer sie sei. Die Antwort mag überraschen, doch es steckt eine Absicht dahinter. Prinzessin Kate hat bei einem öffentlichen Auftritt einmal mehr ihre Unterstützung ihrem Ehemann und Thronfolger Prinz William gegenüber bewiesen. Bei ihrem kürzlichen Besuch in Schottland begrüssten die beiden Royals wie üblich zahlreiche Fans – auch ganz kleine vor einer Primarschule. Zunächst schwärmte Kate, die wie so oft alle Blicke auf sich zog, von der schottischen «Burghead»-Grundschule und sagte: «Ich finde die Schule ganz toll!» Dann kam es zu einer niedlichen Konversation zwischen Kate und einem Schuljungen. Dieser fragte die Prinzessin ganz unverblümt: «Wer bist du?» – eine Frage, die Kate wohl nicht alle Tage gestellt bekommt. Doch die Prinzessin reagiert, wie immer in der Öffentlichkeit, gelassen und gefasst: «Ich bin mit ihm verheiratet», sagt sie an den Jungen gewandt und zeigt dabei auf Prinz William, der unweit neben ihr steht.
"""

### Run the NER

In [None]:
for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

### NEL: Linking to Wikipedia Data

In [None]:
def entity_linking_nltk(some_text):
    sentences = nltk.sent_tokenize(some_text)
    
    # NOTE: substitute 'de' with your language code if doing the bonus task. The rest remains unchanged.
    wiki_wiki = wikipediaapi.Wikipedia(user_agent='my-custom-agent', language='de')
    unique_links = set()

    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))

        for chunk in chunks:
            if hasattr(chunk, 'label'):
                entity_text = ' '.join(c[0] for c in chunk)
                entity_page = wiki_wiki.page(entity_text)
                
                if entity_page.exists() and entity_page.fullurl not in unique_links:
                    unique_links.add(entity_page.fullurl)
                    print(f"{entity_text}\tWikipedia URL: {entity_page.fullurl}")


entity_linking_nltk(text)