<a href="https://colab.research.google.com/github/dgromann/cl_intro_ws2024/blob/main/exercises/HomeExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [None]:
!pip install newspaper3k

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.1.3-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)
  Preparing metadata (setup.py) ... done
Collecting tinysegmenter==0.3 (from newspaper3k)
  Downloading tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Co

In [None]:
!pip install lxml_html_clean

import newspaper
from newspaper import Article

url = "https://www.bbc.com/news/articles/cy4nd3n33yeo"
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.3.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.3.1-py3-none-any.whl (13 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.3.1
Authors:  [] 

Title:  Wojtek: The bear who was a private in the Polish army 

Text of article: 
 The bear who was a private in the Polish army

The Polish Institute and Sikorski Museum Wojtek the bear was adopted by Polish soldiers and fought alongside them in World War Two

A bear, famed for his love of beer, cigarettes and boxing and who was by the side of Allied troops in World War Two, has been made the subject of a play. Wojtek was adopted by the 2nd Polish Corps in 1943, after his mother was shot by hunters. The Syrian brown bear travelled with them from the Middle East as they were deployed to Italy. Allied soldiers described their shock at seeing Wojtek carrying artillery shells during the Battle of Monte Cassino. The story of frien

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [None]:
# Calculate and print the number of unique words in the text
unique_words = set(article.text.split())
  # article.text accesses the text itself; `article` alone is a newspaper.article.Article object
  # split() breaks the text into a list of strings at whitespace

# Print the number of unique words
print("Number of unique words:", len(unique_words))

Number of unique words: 513
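
Note that splitting on whitespace leaves punctuation attached to the words, so `army` and `army.` are counted as two different words. The following minimal sketch illustrates the effect on a made-up sample sentence (not the article). The regular expression is only one possible alternative and is an assumption, not the tokenizer used in the tutorials.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "The bear joined the army. The army adopted the bear!"

# Whitespace splitting keeps punctuation attached to the tokens
whitespace_types = set(sample.split())

# A simple regex that extracts runs of ASCII letters (an assumption;
# it would need extending for accented characters)
regex_types = set(re.findall(r"[A-Za-z]+", sample))

print("Whitespace split:", len(whitespace_types), sorted(whitespace_types))
print("Regex tokens:    ", len(regex_types), sorted(regex_types))
```

The whitespace version counts `army.` and `bear!` separately from `army` and `bear`, so the number of unique words is inflated before any preprocessing takes place.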


## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [None]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again
import nltk
nltk.download('wordnet') #import nltk and download wordnet to use it
from nltk.stem import WordNetLemmatizer


unique_words = set(article.text.lower().split())
print("Number of unique lowercase words:", len(unique_words))

filtered_words = [word for word in article.text.lower().split() if word.isalpha()]
  # isalpha() is True only for purely alphabetic tokens, so tokens containing digits or punctuation are dropped
unique_words = set(filtered_words)
print("Number of unique words without punctuation:", len(unique_words))

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
  # as in Tutorial 2; note that lemmatize() treats every word as a noun unless a POS tag is passed
unique_words = set(lemmatized_words)

# Print the number of unique words after lemmatization
print("Number of unique words after lemmatization:", len(unique_words))



[nltk_data] Downloading package wordnet to /root/nltk_data...


Number of unique lowercase words: 495
Number of unique words without punctuation: 362
Number of unique words after lemmatization: 354
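
One side effect of the filtering step above is worth keeping in mind: because the text is split on whitespace, a word followed directly by punctuation (for example `hunters.`) fails the `isalpha()` test and is dropped completely, not just cleaned. A minimal sketch of an alternative, assuming that stripping leading and trailing punctuation with `string.punctuation` is acceptable for the task; the sample sentence is a shortened version of one from the example article.

```python
import string

# Shortened sample sentence, used only for illustration
sample = "Wojtek was adopted in 1943, after his mother was shot by hunters."

tokens = sample.lower().split()

# Filtering directly with isalpha() drops "1943," and also "hunters."
dropped = [t for t in tokens if t.isalpha()]

# Stripping surrounding punctuation first keeps "hunters" and still removes "1943"
stripped = [t.strip(string.punctuation) for t in tokens]
kept = [t for t in stripped if t.isalpha()]

print(dropped)
print(kept)
```

Whether the stricter or the more lenient filter is preferable depends on the exercise; the counts reported above were produced with the stricter one.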


## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web

👋 ⚒  Use each of the three models that were downloaded above and perform named entity recognition with each of them on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of named entities for each entity type that the model identifies.

In [None]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")
# Your code here
doc = nlp_sm(article.text)

#number of entities calculated
from collections import Counter
entity_counts = Counter(ent.label_ for ent in doc.ents)

print("\nEntity counts by type:")
for ent_type, count in entity_counts.items():
    print(f"{ent_type}: {count}")


Entity counts by type:
NORP: 16
ORG: 14
EVENT: 4
PERSON: 38
ORDINAL: 4
DATE: 10
LOC: 9
GPE: 16
WORK_OF_ART: 3
CARDINAL: 4
QUANTITY: 1
FAC: 1


In [None]:
nlp_lg = spacy.load("en_core_web_lg")
doc = nlp_lg(article.text)

#number of entities calculated
from collections import Counter
entity_counts = Counter(ent.label_ for ent in doc.ents)

print("\nEntity counts by type:")
for ent_type, count in entity_counts.items():
    print(f"{ent_type}: {count}")


Entity counts by type:
NORP: 16
ORG: 16
EVENT: 6
PERSON: 37
ORDINAL: 4
DATE: 11
LOC: 6
GPE: 20
WORK_OF_ART: 1
CARDINAL: 4
MONEY: 1
QUANTITY: 1
FAC: 3


In [None]:
nlp_trf = spacy.load("en_core_web_trf")
doc = nlp_trf(article.text)

#number of entities calculated
from collections import Counter
entity_counts = Counter(ent.label_ for ent in doc.ents)

print("\nEntity counts by type:")
for ent_type, count in entity_counts.items():
    print(f"{ent_type}: {count}")




Entity counts by type:
NORP: 17
ORG: 15
PERSON: 37
EVENT: 5
DATE: 12
LOC: 6
GPE: 21
WORK_OF_ART: 3
CARDINAL: 4
ORDINAL: 3
QUANTITY: 1
FAC: 2
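
The three cells above repeat the same counting code. A minimal sketch of how this could be wrapped in a small helper; the function names `count_entity_labels` and `compare_pipelines` are hypothetical and not part of spaCy. The helper only assumes that each pipeline is callable on a string and returns an object with an `ents` attribute whose items have a `label_`, as spaCy `Doc` objects do.

```python
from collections import Counter


def count_entity_labels(doc):
    """Count the named entities per label in a processed document."""
    return Counter(ent.label_ for ent in doc.ents)


def compare_pipelines(text, pipelines):
    """Run every pipeline in `pipelines` (name -> callable) on `text`
    and return a dict mapping each name to its label counts."""
    return {name: count_entity_labels(nlp(text)) for name, nlp in pipelines.items()}


# Intended use in this notebook (assumes the three models are already loaded):
# results = compare_pipelines(
#     article.text, {"sm": nlp_sm, "lg": nlp_lg, "trf": nlp_trf}
# )
# for name, counts in results.items():
#     print(name, counts.most_common())
```

Keeping the results in one dictionary also avoids overwriting `doc` and `entity_counts` with every new model.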


You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

*Your NE performance analysis here*

❗While it may seem that the transformer model (trf) provided the best results, this is not entirely accurate. The trf model is generally more accurate and finds more entities than the small (sm) model (126 versus 120 in total), but it did not report the rare MONEY entity that the large (lg) model detected. Such differences are also sensitive to preprocessing: entities like ORG (organizations) are usually capitalized, which makes them easier for the models to recognize, while removing punctuation and lemmatizing words can lead to interpretation errors that make it harder to identify multi-word entities and specific names correctly.
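
To make the comparison easier to read, the counts recorded in the three outputs above can be placed side by side. This is only a bookkeeping sketch: the numbers are copied from the outputs in this notebook, and a higher count does not by itself mean that a model is more correct.

```python
# Counts copied from the three recorded outputs above
sm = {"NORP": 16, "ORG": 14, "EVENT": 4, "PERSON": 38, "ORDINAL": 4, "DATE": 10,
      "LOC": 9, "GPE": 16, "WORK_OF_ART": 3, "CARDINAL": 4, "QUANTITY": 1, "FAC": 1}
lg = {"NORP": 16, "ORG": 16, "EVENT": 6, "PERSON": 37, "ORDINAL": 4, "DATE": 11,
      "LOC": 6, "GPE": 20, "WORK_OF_ART": 1, "CARDINAL": 4, "MONEY": 1,
      "QUANTITY": 1, "FAC": 3}
trf = {"NORP": 17, "ORG": 15, "PERSON": 37, "EVENT": 5, "DATE": 12, "LOC": 6,
       "GPE": 21, "WORK_OF_ART": 3, "CARDINAL": 4, "ORDINAL": 3, "QUANTITY": 1,
       "FAC": 2}

models = {"sm": sm, "lg": lg, "trf": trf}

# All labels that occur in at least one model output
labels = sorted(set().union(*models.values()))

# Print one row per label, with 0 for labels a model did not produce
print(f"{'label':<12}" + "".join(f"{name:>5}" for name in models))
for label in labels:
    row = "".join(f"{counts.get(label, 0):>5}" for counts in models.values())
    print(f"{label:<12}" + row)

totals = {name: sum(counts.values()) for name, counts in models.items()}
print(f"{'TOTAL':<12}" + "".join(f"{totals[name]:>5}" for name in models))
```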

❗SPOILER: as the output of the code below shows, NER performance on the non-preprocessed text is better in some categories and worse in others. I think the preprocessed text still works well in general but, as mentioned before, the non-preprocessed version has its advantages.  
In addition, I used isalpha() during preprocessing, which only returns True for tokens made up entirely of alphabetic characters. Tokens containing digits are therefore dropped, which is why only one DATE entity is recognized. The lower() step is problematic, as mentioned before, because some organizations are written entirely in capital letters and the model may miss them once the capitalization is gone. Works of art such as book titles often appear in quotation marks, which are also removed in the preprocessed text, making them harder for the model to recognize.

👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [None]:
# Your code here
# Performance on the non-preprocessed article text
doc_original = nlp_trf(article.text)
entity_counts_original = Counter(ent.label_ for ent in doc_original.ents)

# Preprocess text (AGAIN :D)
text = article.text.lower()
words = text.split()
words_alpha = [word for word in words if word.isalpha()]
text_alpha = " ".join(words_alpha)
doc_preprocessed = nlp_trf(text_alpha)
lemmatized_text = " ".join([token.lemma_ for token in doc_preprocessed])

# Performance on the preprocessed article text
doc_preprocessed_final = nlp_trf(lemmatized_text)
entity_counts_preprocessed = Counter(ent.label_ for ent in doc_preprocessed_final.ents)

# comparison of both versions
print("Entity counts on non-preprocessed text:")
for ent_type, count in entity_counts_original.items():
    print(f"{ent_type}: {count}")

print("\nEntity counts on preprocessed text:")
for ent_type, count in entity_counts_preprocessed.items():
    print(f"{ent_type}: {count}")


Entity counts on non-preprocessed text:
NORP: 17
ORG: 15
PERSON: 37
EVENT: 5
DATE: 12
LOC: 6
GPE: 21
WORK_OF_ART: 3
CARDINAL: 4
ORDINAL: 3
QUANTITY: 1
FAC: 2

Entity counts on preprocessed text:
NORP: 15
ORG: 13
PERSON: 30
EVENT: 6
LOC: 3
WORK_OF_ART: 1
CARDINAL: 3
GPE: 10
FAC: 4
DATE: 1
ORDINAL: 1
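
The two lists above are easier to compare as per-label differences. A minimal sketch using `collections.Counter`, with the numbers copied from the recorded output above; it only reports how the counts change and says nothing about whether the individual entities are correct.

```python
from collections import Counter

# Counts copied from the recorded output above
original = Counter({"NORP": 17, "ORG": 15, "PERSON": 37, "EVENT": 5, "DATE": 12,
                    "LOC": 6, "GPE": 21, "WORK_OF_ART": 3, "CARDINAL": 4,
                    "ORDINAL": 3, "QUANTITY": 1, "FAC": 2})
preprocessed = Counter({"NORP": 15, "ORG": 13, "PERSON": 30, "EVENT": 6, "LOC": 3,
                        "WORK_OF_ART": 1, "CARDINAL": 3, "GPE": 10, "FAC": 4,
                        "DATE": 1, "ORDINAL": 1})

# A Counter returns 0 for missing labels, so QUANTITY is handled automatically
changes = {label: preprocessed[label] - original[label]
           for label in set(original) | set(preprocessed)}

# Largest losses first
for label, delta in sorted(changes.items(), key=lambda item: (item[1], item[0])):
    print(f"{label:<12} {original[label]:>3} -> {preprocessed[label]:>3} ({delta:+d})")
```

DATE and GPE lose the most entities, which fits the explanation that removing digits and lowercasing place names takes away the cues the model relies on.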


## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models to identify supported languages, listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒ Perform NER on the selected article.

In [None]:
!pip install lxml_html_clean
import newspaper
from newspaper import Article

url = "https://www.rainews.it/articoli/2024/03/case-green-via-libera-definitivo-del-parlamento-europeo-42b42246-acfb-4dc1-a4f8-8d2c3c545539.html"
article_it = Article(url)
article_it.download()
article_it.parse()

#This line displays the authors of the article
print("Authors: ", article_it.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article_it.title, "\n")
print("Text of article: \n", article_it.text)

!python -m spacy download it_core_news_lg

nlp_lg_it = spacy.load("it_core_news_lg")
doc = nlp_lg_it(article_it.text)

#number of entities calculated
from collections import Counter
entity_counts = Counter(ent.label_ for ent in doc.ents)

print("\nEntity counts by type:")
for ent_type, count in entity_counts.items():
    print(f"{ent_type}: {count}")

Authors:  ['Redazione Di Rainews'] 

Title:  Case green, via libera definitivo del Parlamento europeo: nuovi edifici a emissioni zero entro 2030 

Text of article: 
 Il Parlamento europeo ha approvato l'accordo della direttiva sulla prestazione energetica nell'edilizia (EPBD) con 370 voti a favore, 199 contrari e 46 astenuti. L'EPBD, nota in Italia come direttiva 'case green', introduce nuove misure per realizzare un parco immobiliare climaticamente neutro entro il 2050. Lo scopo della revisione della direttiva sulla prestazione energetica nell'edilizia - spiega il Parlamento europeo - è di ridurre progressivamente le emissioni di gas serra e i consumi energetici nel settore edilizio entro il 2030 e pervenire alla neutralità climatica entro il 2050. Tra gli obiettivi figurano anche la ristrutturazione di un maggior numero di edifici con le prestazioni peggiori e una migliore diffusione delle informazioni sul rendimento energetico. Secondo la nuova normativa, tutti i nuovi edifici dovra

In [None]:
#visualize named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*

❗I tried the Italian lg model because a trf model does not exist for Italian (yet?). The performance was clearly worse than that of the English NER, and the output of the Italian model did not meet my expectations at all. It struggles to recognize key entities, probably because less training data is available. I assume this is a common issue for languages that are less well represented in the datasets.

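In comparison to the English version, it also uses the label MISC a lot, which covers entities that do not fit neatly into the other defined categories. It does not recognize numbers at all, and in general only a few entities are marked in the text. Part of this is by design: the label scheme of the Italian pipeline (PER, LOC, ORG, MISC) has no categories for numbers or dates.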
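
Because the English and Italian pipelines use different label sets, raw counts are not directly comparable. One way to put them on a similar footing is to collapse the English labels into the four coarse labels used by the Italian pipeline. The mapping below is an assumption made for illustration, not an official correspondence, and the English counts are copied from the trf output recorded earlier in this notebook.

```python
from collections import Counter

# Hypothetical mapping from English labels to the coarse four-label scheme
COARSE = {
    "PERSON": "PER",
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "ORG": "ORG",
    "NORP": "MISC", "EVENT": "MISC", "WORK_OF_ART": "MISC",
}


def to_coarse(counts):
    """Collapse fine-grained label counts into PER, LOC, ORG and MISC."""
    coarse = Counter()
    for label, n in counts.items():
        # Numeric and date labels have no counterpart and are skipped
        if label in COARSE:
            coarse[COARSE[label]] += n
    return coarse


# Counts copied from the English trf output recorded above
english_trf = {"NORP": 17, "ORG": 15, "PERSON": 37, "EVENT": 5, "DATE": 12,
               "LOC": 6, "GPE": 21, "WORK_OF_ART": 3, "CARDINAL": 4,
               "ORDINAL": 3, "QUANTITY": 1, "FAC": 2}

coarse_counts = to_coarse(english_trf)
print(coarse_counts)
```

Even after such a mapping the two articles differ in topic and length, so the comparison stays qualitative.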

❗As the result on the non-preprocessed Italian version was really bad, I wanted to try it with the preprocessed version below:

In [None]:
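# Preprocessed version of the Italian article, using the same three steps as above:
#   1. Lowercase all words in the text.
#   2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
#   3. Lemmatize all words in the text.

text_it = article_it.text.lower()
words_it = text_it.split()
words_alpha_it = [word for word in words_it if word.isalpha()]
text_alpha_it = " ".join(words_alpha_it)

# Use the Italian pipeline for lemmatization and NER; the English nlp_trf
# is not trained on Italian and only emits English labels such as NORP
doc_preprocessed_it = nlp_lg_it(text_alpha_it)
lemmatized_text_it = " ".join([token.lemma_ for token in doc_preprocessed_it])

# Performance on the preprocessed article text
doc_preprocessed_final_it = nlp_lg_it(lemmatized_text_it)
entity_counts_preprocessed_it = Counter(ent.label_ for ent in doc_preprocessed_final_it.ents)

print("\nEntity counts on preprocessed text:")
for ent_type, count in entity_counts_preprocessed_it.items():
    print(f"{ent_type}: {count}")

# Visualize the entities of the preprocessed document (not the original `doc`)
from spacy import displacy
displacy.render(doc_preprocessed_final_it, style="ent", jupyter=True)




Entity counts on preprocessed text:
LOC: 1
NORP: 3


The output with the preprocessed tokens is even worse than that of the non-preprocessed version. Note that the counts shown above were recorded with the English `nlp_trf` pipeline applied to the Italian text, which is why the English-scheme label NORP appears; they are therefore not a fair measure of the Italian model.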