# Notebook 8 - Knowledge Representation (KR)

CSI4106 Artificial Intelligence  \
Fall 2021 \
Version 1 (2020) prepared by Julian Templeton and Caroline Barrière.  Version 2 (2021) revised by Caroline Barrière.

***INTRODUCTION***:  

When reading text, knowing the type of each entity helps us infer additional information about it. For example, if a text mentions *Canada*, knowing that it is a GPE (geopolitical entity) already tells us that this entity has a surface area, a population, etc. Through the use of Named Entity Recognition (NER), we are able to determine whether an entity is a Person, Organization, Country, ...

When exploring text online, we also occasionally see entities with clickable links to webpages that give more information on the entity. This is a form of text enhancement that lets readers easily access the information needed to understand each entity in the text.  If we take the example of Canada again, entity linking transforms it into [Canada](https://en.wikipedia.org/wiki/Canada), giving us access to more information.

In this notebook we will be revisiting the Covid-19 related news dataset from notebook 7 to explore how we can improve spaCy's NER and enhance the text from the news articles through the use of entity linking. This will be done in three parts: 

(1) we explore the results of spaCy's NER  \
(2) we use text coherence for post-processing spaCy's NER results \
(3) we perform text enhancement with entity linking.    

This notebook uses libraries that have been used in previous notebooks, including spaCy and pandas. 

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), rename it to *StudentNum-LastName-Notebook8.ipynb* and submit it.  

*The notebook will be marked out of 30.  
Each **(TO DO)** has a number of points associated with it.*
***

In [1]:
# Before starting we will import every module that we will be using
import spacy
import pandas as pd

In [2]:
# The core spacy object can be used for tokenization, lemmatization, POS Tagging, NER, ...
# Note that this is specifically for the English language and requires the English package to be installed
# via pip to work as intended.

# sp = spacy.load('en')

# If the above causes an error after installing the package 
# then install the package as below
# !spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')

As with the last notebook, the dataset is provided on Brightspace (Module 8) along with this notebook, and details regarding the Covid-19 news dataset can be found [here](https://www.kaggle.com/ryanxjhan/cbc-news-coronavirus-articles-march-26?select=news.csv). The first thing that we will do, as usual, is load the file into a pandas dataframe.  

In [3]:
# Read the dataset, show top ten rows
df = pd.read_csv("news.csv")
df.head(10)

Unnamed: 0.1,Unnamed: 0,authors,title,publish_date,description,text,url
0,0,[],'More vital now:' Gay-straight alliances go vi...,2020-05-03 1:30,Lily Overacker and Laurell Pallot start each g...,Lily Overacker and Laurell Pallot start each g...,https://www.cbc.ca/news/canada/calgary/gay-str...
1,1,[],Scientists aim to 'see' invisible transmission...,2020-05-02 8:00,Some researchers aim to learn more about how t...,"This is an excerpt from Second Opinion, a week...",https://www.cbc.ca/news/technology/droplet-tra...
2,2,['The Canadian Press'],Coronavirus: What's happening in Canada and ar...,2020-05-02 11:28,Canada's chief public health officer struck an...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
3,3,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
4,4,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
5,5,"['Senior Writer', 'Chris Arsenault Is A Senior...",Brazil has the most confirmed COVID-19 cases i...,2020-05-02 8:00,"From describing coronavirus as a ""little flu,""...","With infection rates spiralling, some big city...",https://www.cbc.ca/news/world/brazil-has-the-m...
6,6,['Cbc News'],The latest on the coronavirus outbreak for May 1,2020-05-01 20:43,The latest on the coronavirus outbreak from CB...,Coronavirus Brief (CBC) Canada is officiall...,https://www.cbc.ca/news/the-latest-on-the-coro...
7,7,['Cbc News'],Coronavirus: What's happening in Canada and ar...,2020-05-01 11:51,Nova Scotia announced Friday it is immediately...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
8,8,"['Senior Writer', ""Adam Miller Is Senior Digit...",Did the WHO mishandle the global coronavirus p...,2020-04-30 8:00,The World Health Organization has come under f...,The World Health Organization has come under f...,https://www.cbc.ca/news/health/coronavirus-who...
9,9,['Thomson Reuters'],Armed people in Michigan's legislature protest...,2020-04-30 21:37,"Hundreds of protesters, some armed, gathered a...","Hundreds of protesters, some armed, gathered a...",https://www.cbc.ca/news/world/protesters-michi...


**PART 1 - SpaCy's NER**  
  
Let's start by looking at the NER that is performed by spaCy.  SpaCy's documentation does not tell us how exactly their NER is done (certainly their trade secret), but we can at least look at the results.

As we've talked about in previous notebooks, when evaluating a process or a tool, we can do a quantitative or a **qualitative evaluation** of results.  In this notebook, we work at a qualitative level, meaning that we are not measuring metrics such as precision/recall on a large amount of data, but rather printing the results for a few examples and trying to understand them.
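
For contrast, a quantitative evaluation would compare the predicted entities against hand-annotated gold labels. The sketch below is illustrative only: the function `precision_recall` and the tiny `gold` and `predicted` sets are hypothetical and are not part of the tasks in this notebook.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two collections of (surface form, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical hand-made annotations, for illustration only
gold = {("Canada", "GPE"), ("W.F. Wells", "PERSON"), ("WHO", "ORG")}
predicted = {("Canada", "GPE"), ("W.F. Wells", "PERSON"), ("WHO", "ORG"), ("COVID-19", "PERSON")}

precision, recall = precision_recall(predicted, gold)
print("precision =", precision, "recall =", recall)
```

Here one of the four predictions is wrong, so precision drops while recall stays perfect. A qualitative evaluation instead asks why that one prediction went wrong.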





Below is the same sentence example as in the last notebook, for which we had looked at POS-tagging and other linguistic processes.  We now use this sentence to show how to access spaCy's NER type predictions for tokens in a text.

In [4]:
# Same example from notebook 7, recall that we loop through the iterator found in the .ents property of a parsed sentence
sentence_example = "Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19."
sentence_example_content = sp(sentence_example)
# Loop through all tokens that contain a NER type and print the token along with the corresponding NER type
for token in sentence_example_content.ents:
    print("\"" + token.text + "\" is a " + token.label_ )

"Canada" is a GPE
"at least two metres" is a CARDINAL
"COVID-19" is a ORG


**(TO DO) Q1 - 5 marks** 

In the text of the **second document** (index 1) of our corpus of documents, find out which words are *PER* (spaCy uses the *PERSON* type, rather than *PER*), *ORG* (Organization), and *GPE* (Geopolitical Entity). You must do the following for this question:    
a) (2 marks) Print each element in the text tagged as *PER*, *ORG*, and *GPE* along with its NER type from spaCy.     
b) (1 mark) Is the majority of outputs correct? Provide two examples of incorrect outputs from (a).  
c) (2 marks) Do any of the problems with the NER type predictions come from an earlier step in the NLP pipeline that is performed by spaCy? Describe the problem for two examples of your output from (a).   

In [5]:
# ANSWER Q1(a) - 2 marks
# Select the second document (index 1)
doc = df["text"][1]
# Print each PER, ORG, GPE along with its type
doc_content = sp(doc)
for token in doc_content.ents:
    if(token.label_ == 'PERSON' or token.label_ == 'ORG' or token.label_ == 'GPE'):
        print("\"" + token.text + "\" is a " + token.label_ )

"COVID-19" is a PERSON
"the World Health Organization" is a ORG
"WHO" is a ORG
"the Public Health Agency" is a ORG
"Canada" is a GPE
"W.F. Wells" is a PERSON
"the Harvard School of Public Health" is a ORG
"Wells" is a ORG
"Canada" is a GPE
"Lydia Bourouiba" is a PERSON
"the Fluid Dynamics of Disease Transmission Laboratory" is a ORG
"the Massachusetts Institute of Technology" is a ORG
"Bourouiba" is a PERSON
"Mark Loeb" is a PERSON
"McMaster University" is a ORG
"RNA" is a ORG
"Wuhan" is a GPE
"China" is a GPE
"Nebraska" is a GPE
"Canada" is a GPE
"COVID-19" is a ORG
"Gary Moore/CBC" is a PERSON
"Allison McGeer" is a PERSON
"Sinai Health" is a ORG
"Toronto" is a GPE
"COVID-19" is a PERSON
"McGeer" is a ORG
"McGeer" is a ORG
"Bourouiba" is a PERSON
"Bourouiba" is a PERSON
"Credit Lydia Bourouiba/MIT/JAMA Networks" is a ORG
"Samira Mubareka" is a PERSON
"Toronto" is a GPE
"Bourouiba" is a PERSON
"COVID-19" is a ORG
"McMaster" is a PERSON
"N95" is a ORG
"U.S." is a GPE
"Justin Trudeau" is

**ANSWER Q1 (b) - 1 mark**   
"COVID-19" is a PERSON
"Wells" is a ORG since "W.F. Wells" is a PERSON
    

**ANSWER Q1 (c) - 2 marks**   
In the NLP pipeline, COVID-19 was tagged as NOUN with an nsubj dependency, and Wells was tagged as PROPN with an appos dependency, which means both were handled correctly by the earlier steps of the pipeline. The NER errors may instead come from COVID-19 not being part of the model's vocabulary, and from Wells not being connected to W.F. Wells.

**PART 2 - Text Coherence and coreference chains**  
  
As you saw in Q1, the results of spaCy are quite good, but not perfect.  One main issue with NER (not just in spaCy but in many tools) is that the annotation is performed one entity at a time without consideration of the overall document.  

But when looking at the whole document, and knowing that text is usually coherent, we can do some post-processing on the output of spaCy's NER module and correct some mistakes.  By text being coherent, we mean, for example, that if a person is referred to with a particular name, e.g. *McGeer*, chances are that each time we see *McGeer* in the document, it is the same person.  All the mentions of *McGeer* form a coreference chain, all referring to a single entity. So it is unlikely that *McGeer* would be once a person and once an organization.  This is not always true, and there are numerous counter-examples, but it is a common assumption.  This idea is even the topic of an older, much-cited NLP article called "One sense per discourse" (Gale et al. 1992). 

With this idea of "one sense per discourse", we will explore two different strategies to use text coherence to post-process the output from the spaCy NER module.  

The first strategy (*explored in Q2/Q3*) is to find, among all NER types assigned, which is the most frequent one.  For example, the name *Bourouiba* was assigned 1 time ORG, and 2 times PERSON, so this information can be used to modify the ORG type and change it to PERSON.  

The second strategy (explored in Q4) is to try to find a longer surface form in the text.  Since that longer form should be less ambiguous, we can use it to disambiguate the shorter, more ambiguous forms.  For example, *Lydia Bourouiba* occurs in the text and is assigned PERSON.  We can use that information to assign further occurrences of the short form *Bourouiba* to also be PERSON.   

Of course, using these methods for text coherence will not work every time, and will unfortunately introduce some errors...  But let's try.  That's what empirical studies are about, we try ideas.

Let's take again the news article from Q1, but this time, let's show not only GPE, PER, ORG, but rather all the Named Entities found by spaCy.

In [6]:
# Select document 2
doc = df["text"][1]
# NER
doc_sp = sp(doc)
# Display all entities from the text along with their index in the .ents iterator and the
# corresponding NER type
for i, token in enumerate(doc_sp.ents):
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ )

0: "weekly" is a DATE
1: "Saturday" is a DATE
2: "morning" is a TIME
3: "two metres" is a QUANTITY
4: "COVID-19" is a PERSON
5: "the World Health Organization" is a ORG
6: "WHO" is a ORG
7: "more than one metre" is a QUANTITY
8: "the Public Health Agency" is a ORG
9: "Canada" is a GPE
10: "at least two metres" is a QUANTITY
11: "two" is a CARDINAL
12: "2 metres" is a QUANTITY
13: "the 19th century" is a DATE
14: "1934" is a DATE
15: "W.F. Wells" is a PERSON
16: "the Harvard School of Public Health" is a ORG
17: "two metres" is a QUANTITY
18: "Wells" is a ORG
19: "56,000" is a CARDINAL
20: "Canada" is a GPE
21: "Saturday" is a DATE
22: "Lydia Bourouiba" is a PERSON
23: "the Fluid Dynamics of Disease Transmission Laboratory" is a ORG
24: "the Massachusetts Institute of Technology" is a ORG
25: "Bourouiba" is a PERSON
26: "Canadian" is a NORP
27: "Mark Loeb" is a PERSON
28: "McMaster University" is a ORG
29: "RNA" is a ORG
30: "Wuhan" is a GPE
31: "China" is a GPE
32: "Nebraska" is a GPE


**(TO DO) Q2 - 3 marks**  
As you can see in the results, sometimes the same entity was assigned different entity types (e.g. *McGeer* is one time assigned entity type ORG, and one time entity type PERSON) since the NER algorithm looks sentence by sentence.  In the following function, the purpose will be to find all the possible entity types assigned to a single entity.

Complete the definition of the *find_entity_types* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter and a list of all spaCy entities defined by the *entities* parameter.     

The function must find all entities (from *entities*) having the same surface form as *entity*. For each match between the entities, add the NER type to the dictionary *type_counts* and track the number of times each NER type appears.     

The *type_counts* dictionary would contain for example *McGeer* with ORG = 1, and PERSON = 1, because the function found 2 mentions of *McGeer*, each with a different type.
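
As a minimal sketch of the counting described above, the example below works on plain `(text, label)` tuples rather than spaCy entities, so that it runs without a model. The function name `count_types_exact` and the `mentions` list are hypothetical, and the match is an exact comparison of surface forms.

```python
from collections import Counter

def count_types_exact(surface_form, mentions):
    """Count NER types over the mentions whose text equals surface_form exactly."""
    return Counter(label for text, label in mentions if text == surface_form)

# Hypothetical mentions, written as (surface form, NER type) pairs
mentions = [
    ("Allison McGeer", "PERSON"),
    ("McGeer", "ORG"),
    ("McGeer", "PERSON"),
    ("Toronto", "GPE"),
]

mcgeer_counts = count_types_exact("McGeer", mentions)
print(dict(mcgeer_counts))
```

With spaCy entities, the same idea would compare `ent.text` values and count `ent.label_` values. Note that the exact match deliberately ignores the longer form *Allison McGeer*.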

In [7]:
# ANSWER Q2 
def find_entity_types(entity, entities):
    '''
    Given a specific entity and a list of entities, finds all entities from the list that match surface form of the specified
    entity, but that could be of a different type.
    
    Returns the different NER types that have been classified for an entity and the count per NER type
    as a dictionary with the keys as the NER type and the value as the count
    '''
    type_counts = {}
    seen_texts = []
    seen_labels = []
    for token in entities:
        # Note: this is a substring match in either direction, which is looser than an exact
        # surface-form match (e.g. "Canada" also matches "Air Canada")
        if (str(entity) in str(token)) or (str(token) in str(entity)):
            # Count a mention only if its text or its NER type has not been recorded yet
            if (str(token) not in seen_texts) or (str(token.label_) not in seen_labels):
                type_counts[token.label_] = type_counts.get(token.label_, 0) + 1
                seen_texts.append(str(token))
                seen_labels.append(str(token.label_))
    return type_counts

In [8]:
# Test the above to find the result when checking for the types of the entity 'Bourouiba' 
# from the document loaded above
print("All possible NER types for \"" + doc_sp.ents[47].text + "\" are " + str(find_entity_types(doc_sp.ents[47], doc_sp.ents)))
print("All possible NER types for \"" + doc_sp.ents[44].text + "\" are " + str(find_entity_types(doc_sp.ents[44], doc_sp.ents)))
print("All possible NER types for \"" + doc_sp.ents[4].text + "\" are " + str(find_entity_types(doc_sp.ents[4], doc_sp.ents)))

All possible NER types for "Bourouiba" are {'PERSON': 2, 'ORG': 1}
All possible NER types for "McGeer" are {'PERSON': 1, 'ORG': 1}
All possible NER types for "COVID-19" are {'PERSON': 1, 'ORG': 2}


**(TO DO) Q3 - 2 marks**  
In the previous method, *find_entity_types*, we found all the possible entity types for a single entity.  Now, we want to use these to find the most common type.  For example, in the case of *McGeer*, it's a tie.  But for *Bourouiba*, there is one ORG type, and 2 PERSON type, so the most common would be PERSON.

Complete the definition of the *most_common_type* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter and a list of all spaCy entities defined by the *entities* parameter.        

Note: You can handle ties as you please.  Also, make sure to use the function *find_entity_types* which you just wrote in Q2.
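
One possible way to handle ties, shown here as a sketch rather than the expected answer, is to prefer the type that appears first in the document. The function `most_common_with_tie_break` is hypothetical and takes a plain list of NER types in document order.

```python
from collections import Counter

def most_common_with_tie_break(labels):
    """Return the most frequent label; on a tie, prefer the label seen first."""
    if not labels:
        return None
    counts = Counter(labels)
    best = max(counts.values())
    # Document order decides between labels that share the highest count
    for label in labels:
        if counts[label] == best:
            return label

print(most_common_with_tie_break(["ORG", "PERSON"]))
print(most_common_with_tie_break(["ORG", "PERSON", "PERSON"]))
```

Other policies are just as defensible, for example preferring PERSON over ORG, or keeping the original spaCy type whenever there is a tie.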

In [9]:
# ANSWER Q3 
def most_common_type(entity, entities):
    '''
    Given a specific entity and a list of entities, find the most similar entities and assign the
    NER type to entity based on the most common NER type assigned to entities of the same name (if there
    is a tie, you decide how to handle this).
    
    Returns the most common NER type based on similar entities
    '''
    type_counts = find_entity_types(entity, entities)
    if not type_counts:
        return None
    # max() returns the first key with the highest count, so ties go to the type recorded first
    return max(type_counts, key=type_counts.get)

In [10]:
# Test the above to find the result when checking for the types of the entity 'Bourouiba' 
# from the document loaded above
print("The most common NER type to \"" + doc_sp.ents[47].text + "\" is " + most_common_type(doc_sp.ents[47], doc_sp.ents))

The most common NER type to "Bourouiba" is PERSON



Our first exploration (in Q2/Q3) was about frequency of occurrence.  We assumed the most common entity type could be the correct one.  Now, we'll explore the idea that the least ambiguous reference to an entity (the actual text) could be the correct one.  For example, *McGeer* is more ambiguous (shorter form) than *Allison McGeer* (longer form).  Often the longer form of reference to an entity is the least ambiguous.  But because it is long to write, we often use it sparingly in a text (perhaps only once) and then subsequent references to the same entity will use the shorter form.  For example, the text might mention *Allison McGeer* once, and then use the short form *McGeer* to refer to the same person many times in the document.

In the course videos, we talked about coreference chains. A chain contains long and short mentions, all referring to the same entity.

The longer form is often referred to as the *normalized form*, and it is a form that we are likely to find in an external resource.  We'll see in part 3 of this notebook, when we do entity linking, that there is a Wikipedia entry for *Allison McGeer* that we could link to. We can consider the longer *Allison McGeer* form as the normalized form.

**(TO DO) Q4 (a) - 3 marks**  
 
You must write a function that will find the longest form that can match a mention.

Your function will have the same *entity* and *entities* parameters, but this time the function must assign to *entity* the NER type of another entity in the *entities* iterator, that of the longest form found.   

Specifically, you must look through *entities* to find a normalized form of *entity*. In this scenario, the longest entity that contains *entity* as a substring will be considered the normalized form and should be returned.  If no longer form is found, the entity itself *entity* should be returned.

Ex: *Lydia Bourouiba* is the normalized form of *Bourouiba*. Thus this entity should be returned.  But *McMaster University* is already the longest form, so if we search for a normalized form for that *entity*, the function should return *entity* itself.
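
The search for a normalized form can be sketched on plain strings, independently of spaCy. The function `longest_containing_form` and the `candidates` list below are hypothetical and only illustrate the substring rule described above.

```python
def longest_containing_form(surface_form, candidates):
    """Return the longest candidate that contains surface_form, or surface_form itself."""
    best = surface_form
    for text in candidates:
        if surface_form in text and len(text) > len(best):
            best = text
    return best

# Hypothetical surface forms found in a document
candidates = ["Lydia Bourouiba", "McMaster University", "Bourouiba"]

print(longest_containing_form("Bourouiba", candidates))
print(longest_containing_form("McMaster University", candidates))
```

A plain substring test can also match unrelated entities (for example *Canada* inside *Air Canada*), so the rule is a heuristic and not a guarantee.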

In [31]:
# ANSWER Q4(a)
# Find the longest surface form within "entities" for which the surface form of "entity" is a substring
def assign_normalized_form(entity, entities):
    # Default to the entity itself when no longer form is found
    max_word = entity
    max_length = 0
    for word in entities:
        # Only candidates that contain the entity text and carry the same NER type are considered
        if str(entity) in str(word) and str(entity.label_) == str(word.label_):
            if len(str(word)) > max_length:
                max_length = len(str(word))
                max_word = word
    return max_word
                
            

Let's test the above function, assuming the candidates are only found in the previous mentions, as often a long form is given first (e.g. *Allison McGeer*) and subsequent forms are the short forms (e.g. *McGeer*).

In [33]:
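# Testing using only the previous references as candidates.
# Note: the slice [0:i-1] used below stops before the immediately preceding entity, and for i = 0
# it evaluates to ents[0:-1] (every entity except the last one) rather than an empty list.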
test = df["text"][1]
# Parse the text with spaCy
test_sp = sp(test)
for i, token in enumerate(test_sp.ents):
    ent = assign_normalized_form(test_sp.ents[i], test_sp.ents[0:i-1])
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

0: "weekly" is a DATE  weekly  DATE
1: "Saturday" is a DATE  Saturday  DATE
2: "morning" is a TIME  morning  TIME
3: "two metres" is a QUANTITY  two metres  QUANTITY
4: "COVID-19" is a PERSON  COVID-19  PERSON
5: "the World Health Organization" is a ORG  the World Health Organization  ORG
6: "WHO" is a ORG  WHO  ORG
7: "more than one metre" is a QUANTITY  more than one metre  QUANTITY
8: "the Public Health Agency" is a ORG  the Public Health Agency  ORG
9: "Canada" is a GPE  Canada  GPE
10: "at least two metres" is a QUANTITY  at least two metres  QUANTITY
11: "two" is a CARDINAL  two  CARDINAL
12: "2 metres" is a QUANTITY  2 metres  QUANTITY
13: "the 19th century" is a DATE  the 19th century  DATE
14: "1934" is a DATE  1934  DATE
15: "W.F. Wells" is a PERSON  W.F. Wells  PERSON
16: "the Harvard School of Public Health" is a ORG  the Harvard School of Public Health  ORG
17: "two metres" is a QUANTITY  at least two metres  QUANTITY
18: "Wells" is a ORG  Wells  ORG
19: "56,000" is a CARD

**(TO DO) Q4 (b) - 2 marks**  

Run other tests that lift the restriction of using only longer forms mentioned before an entity (see *test_sp.ents[0:i-1]* in the code above): try searching both before and after, or try an interval (e.g. max N entities before or after).  Explain what you tested.  Any difference?  Provide at least 2 examples of changes that you notice.
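
One way to build such candidate lists is sketched below. The helper `candidate_window` and its parameters `before` and `after` are hypothetical. It works on any sequence, including the tuple returned by `.ents`, and always excludes the entity at position `i` itself.

```python
def candidate_window(entities, i, before=None, after=0):
    """Return the entities around position i, excluding entity i itself.

    before=None keeps all earlier entities; otherwise at most `before` of them.
    after is the maximum number of later entities to keep.
    """
    start = 0 if before is None else max(0, i - before)
    return list(entities[start:i]) + list(entities[i + 1:i + 1 + after])

# Placeholder names standing in for the entities of a document
names = ["A", "B", "C", "D", "E"]

print(candidate_window(names, 0))                     # nothing comes before the first entity
print(candidate_window(names, 3))                     # all earlier entities
print(candidate_window(names, 3, before=1, after=1))  # one entity on each side
```

For comparison, `names[0:i-1]` with `i = 0` is `names[0:-1]`, which contains every element except the last one instead of being empty.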


In [13]:
# ANSWER Q4(b)
# Do a different test
test = df["text"][1]
# Parse the text with spaCy
test_sp = sp(test)
for i, token in enumerate(test_sp.ents):
    ent = assign_normalized_form(test_sp.ents[i], test_sp.ents[0:10])
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

0: "weekly" is a DATE  weekly  DATE
1: "Saturday" is a DATE  Saturday  DATE
2: "morning" is a TIME  morning  TIME
3: "two metres" is a QUANTITY  two metres  QUANTITY
4: "COVID-19" is a PERSON  COVID-19  PERSON
5: "the World Health Organization" is a ORG  the World Health Organization  ORG
6: "WHO" is a ORG  WHO  ORG
7: "more than one metre" is a QUANTITY  more than one metre  QUANTITY
8: "the Public Health Agency" is a ORG  the Public Health Agency  ORG
9: "Canada" is a GPE  Canada  GPE
10: "at least two metres" is a QUANTITY  at least two metres  QUANTITY
11: "two" is a CARDINAL  two metres  QUANTITY
12: "2 metres" is a QUANTITY  2 metres  QUANTITY
13: "the 19th century" is a DATE  the 19th century  DATE
14: "1934" is a DATE  1934  DATE
15: "W.F. Wells" is a PERSON  W.F. Wells  PERSON
16: "the Harvard School of Public Health" is a ORG  the Harvard School of Public Health  ORG
17: "two metres" is a QUANTITY  two metres  QUANTITY
18: "Wells" is a ORG  Wells  ORG
19: "56,000" is a CARDIN

**ANSWER Q4(b)**

I tested using only the first 10 entities of the document as candidates (*test_sp.ents[0:10]*). The accuracy becomes lower: for example, Bourouiba can no longer be connected to Lydia Bourouiba, and McGeer can no longer be connected to Allison McGeer.

**(TO DO) Q5 - 5 marks**  
Use a different news article in the corpus, the 7th article, so index 6.  

(a) (2 marks) Run the two approaches (most frequent, longest form).  For each entity found in the text, print its original entity type (as found by spaCy), then the most common entity type, and then the normalized form with its entity type. \
(b) (3 marks) Analyze and discuss the results.  Do you think these text coherence approaches help or are they too simple?  Are there conflicting results (the two approaches give different results)?  If yes, show examples that are different.  

In [14]:
# ANSWER Q5(a) 
# Select document index 6
doc = df["text"][6]
doc_sp = sp(doc)
# Display all entities from the text along with their index in the .ents iterator and the
# corresponding NER type
for i, token in enumerate(doc_sp.ents):
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ )
    print("The most common NER type to \"" + doc_sp.ents[i].text + "\" is " + most_common_type(doc_sp.ents[i], doc_sp.ents))
    ent = assign_normalized_form(doc_sp.ents[i], doc_sp.ents[0:i-1])
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

0: "Canada" is a GPE
The most common NER type to "Canada" is ORG
0: "Canada" is a GPE  Canada  GPE
1: "C.D. Howe" is a ORG
The most common NER type to "C.D. Howe" is ORG
1: "C.D. Howe" is a ORG  C.D. Howe  ORG
2: "Ontario" is a PERSON
The most common NER type to "Ontario" is PERSON
2: "Ontario" is a PERSON  Ontario  PERSON
3: "Monday" is a DATE
The most common NER type to "Monday" is DATE
3: "Monday" is a DATE  Monday  DATE
4: "Alberta" is a GPE
The most common NER type to "Alberta" is GPE
4: "Alberta" is a GPE  Alberta  GPE
5: "first" is a ORDINAL
The most common NER type to "first" is ORDINAL
5: "first" is a ORDINAL  first  ORDINAL
6: "Saturday" is a DATE
The most common NER type to "Saturday" is DATE
6: "Saturday" is a DATE  Saturday  DATE
7: "Air Canada" is a ORG
The most common NER type to "Air Canada" is GPE
7: "Air Canada" is a ORG  Air Canada  ORG
8: "Christmas" is a DATE
The most common NER type to "Christmas" is DATE
8: "Christmas" is a DATE  Christmas  DATE
9: "more than $1.

The most common NER type to "Canadian" is NORP
87: "Canadian" is a NORP  Canadians  NORP
88: "Air Canada" is a ORG
The most common NER type to "Air Canada" is GPE
88: "Air Canada" is a ORG  Air Canada  ORG
89: "more than" is a CARDINAL
The most common NER type to "more than" is CARDINAL
89: "more than" is a CARDINAL  more than $1.2 million  MONEY
90: "Canadian Club Toronto" is a ORG
The most common NER type to "Canadian Club Toronto" is NORP
90: "Canadian Club Toronto" is a ORG  Canadian Club Toronto  ORG
91: "today" is a DATE
The most common NER type to "today" is DATE
91: "today" is a DATE  today  DATE
92: "North American" is a NORP
The most common NER type to "North American" is NORP
92: "North American" is a NORP  North American  NORP
93: "Air Canada" is a ORG
The most common NER type to "Air Canada" is GPE
93: "Air Canada" is a ORG  Air Canada  ORG
94: "Sunwing and American Airlines" is a ORG
The most common NER type to "Sunwing and American Airlines" is ORG
94: "Sunwing and Ameri

**ANSWER Q5(b)**
These methods help to improve the results, but there are still some conflicts.


spaCy determines that "Canada" is a GPE, but the most common NER type for "Canada" is ORG, and the normalized form for "Canada" (GPE) is "Air Canada" (ORG).

spaCy determines that "the University of Calgary" is an ORG, but the most common NER type for "the University of Calgary" is GPE. The normalized form for "the University of Calgary" is "the University of Calgary" (ORG).


**PART 3 - Entity Linking / Text enhancement**  

For the third part of this notebook, we will be exploring how we can enhance the text of documents. In this scenario, we will be enhancing the text by performing entity linking. This means that we will attempt to link the entities that are detected by spaCy's NER to an active webpage that a reader can click on to obtain more information regarding the entity. Wikipedia is a very good resource for finding more information about an entity, and we will use it for entity linking.    

Before going straight into an example through code, below is an example of how a text with no entity linking compares to a text with entity linking:    

*No entity linking:* \
During the pandemic, U.S. cities such as Atlanta, Chicago and Denver have made several adjustments to their transit systems.      

*With entity linking:* \
During the pandemic, U.S. cities such as <a href="http://en.wikipedia.org/wiki/Atlanta">Atlanta</a>, <a href="http://en.wikipedia.org/wiki/Chicago">Chicago</a> and <a href="http://en.wikipedia.org/wiki/Denver">Denver</a> have made several adjustments to their transit systems.

Automatically transforming a text so that it contains clickable links requires several processing steps at the character string level. In this notebook, we will be satisfied with finding the links without making the replacements directly in the text. This will allow us to explore the Wikipedia resource and understand the difficulties of "entity linking" without spending too much time on complex string manipulation.
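
To give an idea of what that string manipulation involves, here is a minimal sketch, assuming that the character offsets of each entity are known and that the spans do not overlap. The function `link_entities` and the example sentence are hypothetical. With spaCy, the offsets could come from the `start_char` and `end_char` attributes of each entity.

```python
def link_entities(text, spans):
    """Wrap each (start, end, url) character span of text in an HTML anchor."""
    # Work from the end of the text so that earlier offsets stay valid after each replacement
    for start, end, url in sorted(spans, reverse=True):
        anchor = '<a href="' + url + '">' + text[start:end] + '</a>'
        text = text[:start] + anchor + text[end:]
    return text

sentence = "Cities such as Atlanta and Denver adjusted their transit systems."

# Build (start, end, url) spans by locating each name in the sentence
spans = []
for name in ("Atlanta", "Denver"):
    start = sentence.index(name)
    spans.append((start, start + len(name), "http://en.wikipedia.org/wiki/" + name))

linked = link_entities(sentence, spans)
print(linked)
```

Replacing from right to left is the key design choice: inserting an anchor changes the length of the text, which would shift every offset located after it.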

For example, with the document at index 6, we would like to be able to link the entities found by spaCy to the most likely Wikipedia page giving access to additional information on that entity.

**The enriched list format shown below is the type of output requested in question Q6.**  For coding simplicity, we will use this type of output instead of an article in which the text would be replaced by links.

0: "Coronavirus Brief" is a ORG found at http://en.wikipedia.org/wiki/Coronavirus_Brief \
1: "CBC" is a ORG found at http://en.wikipedia.org/wiki/CBC \
2: "Canada" is a GPE found at http://en.wikipedia.org/wiki/Canada \
3: "C.D. Howe" is a PERSON found at http://en.wikipedia.org/wiki/C.D._Howe \
4: ... \



**(TO DO) Q6 - 5 marks**  
Write the code needed to search for a Wikipedia page for each of the entities found by spaCy (as shown above) in a particular document.  

*You can write the code as you like, but it must include the following elements:*

*   (a) A restriction on which type of entities you are linking.  For example, Wikipedia does not contain quantities (such as "two meters") so it would be inappropriate to include a link to a quantity.
*   (b) The use of the *normalized form* of the entity to perform the linking.  For example, *Allison McGeer* does have a Wikipedia page (https://en.wikipedia.org/wiki/Allison_McGeer) that you can link to, even when you are looking at the entity with label *McGeer*.  So make sure to use the function you developed in Q4.
*   (c) Attention:  the Wikipedia page uses underscores. So for example, *McMaster University* should be transformed to https://en.wikipedia.org/wiki/McMaster_University (with an underscore between *McMaster* and *University*).
*   (d) Include one element of post-processing on the longer form.  For example *the C.D. Howe Institute's* is tagged by spaCy, but Wikipedia will contain *C.D._Howe_Institute*.  You can remove small particles like *the* to increase the chance of linking.
*   (e) For a specific document, output the surface form, entity type and link to Wikipedia in a list as shown above

Be sure to put comments in your code to make it clear what corresponds to parts (a), (b), (c), (d) and (e).

Many of the links that you include will probably point to Wikipedia pages that do not exist.  That's ok, don't worry about that.  Wikipedia does not contain everything, and some normalized forms will not be there.  You will be asked to discuss this later in Q7.
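
Elements (c) and (d) can be sketched on plain strings as follows. The function `wikipedia_url` is hypothetical and makes simple assumptions: it only removes a leading *the* and a trailing *'s* written with a straight apostrophe, and it does not check that the resulting page exists.

```python
from urllib.parse import quote

def wikipedia_url(surface_form):
    """Build a candidate English Wikipedia URL from an entity's surface form."""
    name = surface_form.strip()
    # (d) Drop a leading article, matched as a whole word at the start only
    if name.lower().startswith("the "):
        name = name[4:]
    # (d) Drop a trailing possessive marker
    if name.endswith("'s"):
        name = name[:-2]
    # (c) Wikipedia titles use underscores instead of spaces; quote() escapes unsafe characters
    return "https://en.wikipedia.org/wiki/" + quote(name.replace(" ", "_"))

print(wikipedia_url("the C.D. Howe Institute's"))
print(wikipedia_url("McMaster University"))
```

Matching *the* only at the start of the form avoids damaging names that merely contain those letters, such as *Netherlands*.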




In [45]:
# ANSWER - Q6
doc = df["text"][6]
doc_sp = sp(doc)
# For each linkable entity, print its normalized form, its NER type and a candidate Wikipedia link
for i, token in enumerate(doc_sp.ents):
    # (a) Only link entity types that are likely to have a Wikipedia page
    if token.label_ in ('GPE', 'PERSON', 'ORG', 'NORP'):
        # (b) Use the normalized (longest) form found among the earlier entities
        ent = assign_normalized_form(doc_sp.ents[i], doc_sp.ents[0:i-1])
        # (d) Post-processing: drop a leading "the" (as a whole word at the start of the form only)
        after_process = ent.text.strip()
        if after_process.lower().startswith('the '):
            after_process = after_process[4:]
        # (c) Wikipedia URLs use underscores instead of spaces
        ans = after_process.replace(' ', '_')
        # (e) Output the surface form, entity type and candidate Wikipedia link
        print(str(i) + ": \"" + ent.text + "\" is a " + ent.label_ + " found at https://en.wikipedia.org/wiki/" + ans)


0: "Canada" is a GPE found at https://en.wikipedia.org/wiki/Canada
1: "C.D. Howe" is a ORG found at https://en.wikipedia.org/wiki/C.D._Howe
2: "Ontario" is a PERSON found at https://en.wikipedia.org/wiki/Ontario
4: "Alberta" is a GPE found at https://en.wikipedia.org/wiki/Alberta
7: "Air Canada" is a ORG found at https://en.wikipedia.org/wiki/Air_Canada
10: "England" is a GPE found at https://en.wikipedia.org/wiki/England
11: "Peter Cziborra/Reuters" is a PERSON found at https://en.wikipedia.org/wiki/Peter_Cziborra/Reuters
13: "CBC" is a ORG found at https://en.wikipedia.org/wiki/CBC
14: "Andre Mayer" is a PERSON found at https://en.wikipedia.org/wiki/Andre_Mayer
15: "Canada" is a GPE found at https://en.wikipedia.org/wiki/Canada
17: "cholera" is a ORG found at https://en.wikipedia.org/wiki/cholera
19: "Calgary" is a GPE found at https://en.wikipedia.org/wiki/Calgary
20: "John Brown" is a PERSON found at https://en.wikipedia.org/wiki/John_Brown
21: "the University of Calgary" is a ORG 

**(TO DO) Q7 - 5 marks**  
Perform a qualitative evaluation of the entity linking method you wrote in Q6.  For your qualitative evaluation, you must choose a document (any one you want from the corpus of Covid-19 related news, but make sure to mention which one) and run your method on that document.  Answer the following questions:

* a. Give 2 examples of entities where the longer form was found in Wikipedia.  Is the page found appropriate? Would the shorter form be found too? Would it link to the same page?
* b. Give 2 examples of entities where the Wikipedia page did not exist.  Why is that?  Was the form used for the search incorrect?
* c. Try restricting your search with different entity types.  Do you see DATE covered by Wikipedia?  What about PERSON or GPE?  Discuss the coverage of different entity types by giving some examples.
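
For question c, it can help to tally how often a page was found for each entity type. The sketch below assumes you have already recorded, by hand or otherwise, whether each generated link led to an existing page; `coverage_by_type` is a hypothetical helper and the sample values are invented purely for illustration, not results from the corpus.

```python
from collections import defaultdict


def coverage_by_type(results):
    """Compute the fraction of entities with an existing page, per entity type.

    results: iterable of (label, page_found) pairs.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for label, page_found in results:
        total[label] += 1
        if page_found:
            found[label] += 1
    return {label: found[label] / total[label] for label in total}


# Invented sample values, for illustration only
sample = [
    ("GPE", True),
    ("GPE", True),
    ("PERSON", True),
    ("PERSON", False),
    ("DATE", False),
]

for label, ratio in sorted(coverage_by_type(sample).items()):
    print(f"{label}: {ratio:.0%}")
```

A table like this only summarizes the observations; the discussion still needs concrete examples for each type.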




In [43]:
doc = df["text"][1]
doc_sp = sp(doc)

# Entity types that Wikipedia is likely to cover
LINKABLE_TYPES = {"GPE", "PERSON", "ORG", "NORP"}
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

for i, mention in enumerate(doc_sp.ents):
    if mention.label_ not in LINKABLE_TYPES:
        continue
    # Normalized (longer) form from Q4, searched among the entities that
    # precede position i
    ent = assign_normalized_form(mention, doc_sp.ents[:i])
    # Drop only a leading article "the", then use Wikipedia's underscores
    title = ent.text.strip()
    if title.lower().startswith("the "):
        title = title[4:]
    title = title.replace(" ", "_")
    # Print the original surface form, the normalized entity type and the link
    print(f'{i}: "{mention.text}" is a {ent.label_} found at {WIKI_PREFIX}{title}')

4: "COVID-19" is a PERSON found at https://en.wikipedia.org/wiki/COVID-19
5: "the World Health Organization" is a ORG found at https://en.wikipedia.org/wiki/World_Health_Organization
6: "WHO" is a ORG found at https://en.wikipedia.org/wiki/WHO
8: "the Public Health Agency" is a ORG found at https://en.wikipedia.org/wiki/Public_Health_Agency
9: "Canada" is a GPE found at https://en.wikipedia.org/wiki/Canada
15: "W.F. Wells" is a PERSON found at https://en.wikipedia.org/wiki/W.F._Wells
16: "the Harvard School of Public Health" is a ORG found at https://en.wikipedia.org/wiki/Harvard_School_of_Public_Health
18: "Wells" is a ORG found at https://en.wikipedia.org/wiki/Wells
20: "Canada" is a GPE found at https://en.wikipedia.org/wiki/Canada
22: "Lydia Bourouiba" is a PERSON found at https://en.wikipedia.org/wiki/Lydia_Bourouiba
23: "the Fluid Dynamics of Disease Transmission Laboratory" is a ORG found at https://en.wikipedia.org/wiki/Fluid_Dynamics_of_Disease_Transmission_Laboratory
24: "the Massachuse

**ANSWER Q7**
a) I chose document 2 (`df["text"][1]`). For example, *WHO* and *World Health Organization* both point to the same Wikipedia page. Likewise, *Bourouiba* and *Lydia Bourouiba* point to the same page.

b) *Grace Horsfall Couldwell* cannot be found on Wikipedia because she is not a well-known person.
 *N95* also cannot be found because the form is incomplete: it refers to a kind of mask, so more information is needed.

***SIGNATURE:***
My name is Tan Chen.
My student number is 300072995.
I certify being the author of this assignment.