# Using Linked-Data in Research

This notebook covers a simple implementation of some work that could be conducted using linked-data. It covers identifying important entities in text and attempting to disambiguate them against external linked-data collections. This allows us to broaden the context of our data with all that is already known and publicised on the semantic web.

We'll first set up the project by installing the relevant Python packages and downloading any SpaCy models if necessary.

In [1]:
%pip install SPARQLWrapper spacy pandas tqdm
!python -m spacy download en_core_web_sm
!mkdir -p NG_web-texts/

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
from itertools import groupby
import pandas as pd
from pprint import pprint
import spacy
from SPARQLWrapper import SPARQLWrapper, JSON
import time
from tqdm.notebook import tqdm

# Ensures that all dataframes are displayed on one line instead of breaking columns across multiple lines
pd.set_option('display.expand_frame_repr', False)

We first need to read in the text file. I have selected the in-depth description of "Sunflowers" by Vincent van Gogh from the [National Gallery website](https://www.nationalgallery.org.uk/paintings/vincent-van-gogh-sunflowers) but this could be any text from your research. To make viewing the content easier in this worksheet, we will split the text into paragraphs - this doesn't necessarily need to be done if you are processing text normally but it may be useful for development and testing.

In [4]:
!wget https://raw.githubusercontent.com/wrmthorne/SLAF-Linked-Data-Workbook/main/NG_web-texts/sunflowers_vincent_van_gogh.txt -O NG_web-texts/sunflowers_vincent_van_gogh.txt

# Open the file and read the contents
with open('./NG_web-texts/sunflowers_vincent_van_gogh.txt', 'r') as file:
    text = file.read()

pprint(text)

# Split the text into paragraphs
paragraphs = text.split('\n\n')

--2024-03-21 11:59:13--  https://raw.githubusercontent.com/wrmthorne/SLAF-Linked-Data-Workbook/main/NG_web-texts/sunflowers_vincent_van_gogh.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6355 (6.2K) [text/plain]
Saving to: ‘NG_web-texts/sunflowers_vincent_van_gogh.txt’


2024-03-21 11:59:13 (77.8 MB/s) - ‘NG_web-texts/sunflowers_vincent_van_gogh.txt’ saved [6355/6355]

('‘The sunflower is mine’, Van Gogh once declared. Soon after his death he '
 'became known as the painter of sunflowers, an identification that endures to '
 'this day. No other artist has been so closely associated with a specific '
 'flower, and his sunflower pictures continue to be among Van Gogh’s most '
 'iconic – and loved – works.\n'
 '\n'
 'This painting is one of five versions 

## Named Entity Recognition (NER)

Named entities are real-world objects that are given a name. Common NER tags are PERSON, ORGANISATION, LOCATION, etc., but not all models use the same scheme, so the available tags may differ from model to model. Quite often we also choose to capture numerical or temporal elements in NER tagging, as they are also well-defined, if abstract, concepts to which we assign a name. To extract our named entities, we will use SpaCy, an easy-to-use, general-purpose NLP library for Python. We'll first load the model, then pass our text through the model to obtain an annotated document. We can then list the named entities identified in the first paragraph and their associated types.

The English model for SpaCy uses the following tags (descriptions taken from [here](https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy?scriptVersionId=11364473&cellId=9))

<style>
table,td,tr,th {border:none!important}
</style>

<table width=100%>
    <tr>
        <td><b>PERSON</b> People, including fictional.</td>
        <td><b>NORP</b> Nationalities or religious or political groups.</td>
    </tr>
    <tr>
        <td><b>FAC</b> Buildings, airports, highways, bridges, etc.</td>
        <td><b>ORG</b> Companies, agencies, institutions, etc.</td>
    </tr>
    <tr>
        <td><b>GPE</b> Countries, cities, states.</td>
        <td><b>LOC</b> Non-GPE locations, mountain ranges, bodies of water.</td>
    </tr>
    <tr>
        <td><b>PRODUCT</b> Objects, vehicles, foods, etc. (not services).</td>
        <td><b>EVENT</b> Named hurricanes, battles, wars, sports events, etc.</td>
    </tr>
    <tr>
        <td><b>WORK_OF_ART</b> Titles of books, songs, etc.</td>
        <td><b>LAW</b> Named documents made into laws.</td>
    </tr>
    <tr>
        <td><b>LANGUAGE</b> Any named language.</td>
        <td><b>DATE</b> Absolute or relative dates or periods.</td>
    </tr>
    <tr>
        <td><b>TIME</b> Times smaller than a day.</td>
        <td><b>PERCENT</b> Percentage, including "%".</td>
    </tr>
    <tr>
        <td><b>MONEY</b> Monetary values, including unit.</td>
        <td><b>QUANTITY</b> Measurements, as of weight or distance.</td>
    </tr>
    <tr>
        <td><b>ORDINAL</b> "first", "second", etc.</td>
        <td><b>CARDINAL</b> Numerals that do not fall under another type.</td>
    </tr>
</table>

In [6]:
# Load the spaCy model and process the paragraphs
nlp = spacy.load('en_core_web_sm')
docs = [nlp(paragraph) for paragraph in paragraphs]

# Uncomment this to print all the tag names from the model
# print(nlp.get_pipe('ner').labels)

# List the entities and their types from the first paragraph
for entity in docs[0].ents:
    print(f'{entity.label_:10} {entity.text}')

PERSON     Van Gogh
DATE       this day
PERSON     Van Gogh


We can visualise the annotations using the [displacy](https://spacy.io/api/top-level#displacy) sub-module from spacy. This very clearly allows us to see the annotations that SpaCy has identified. It also allows us to inspect the accuracy. NER tagging is not a perfect process and is entirely dependent on the quality and suitability of the model used. The SpaCy model is very fast, cheap and general purpose but this comes at the cost of accuracy and suitability to our specific domain of cultural heritage. In this paragraph, we can already see one mistake from the model: the artwork "sunflowers" is not identified using the `WORK_OF_ART` tag as it should have been. 

Some potential fixes for this could be to use a better model from SpaCy such as `en_core_web_lg` and hope that performs better, use more complicated tools such as [huggingface transformers](https://huggingface.co/docs/transformers/en/index), or, most simply, perform some manual work to fill in incomplete annotations and correct errors.
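
As one hedged illustration of the manual route, the sketch below looks up a hand-curated list of artwork titles in the text and reports character spans that could be added as `WORK_OF_ART` annotations. The names `KNOWN_ARTWORKS` and `find_missed_artworks` are hypothetical, and the matching uses plain regular expressions rather than any SpaCy API.

```python
import re

# Hypothetical, hand-curated list of titles the model is known to miss
KNOWN_ARTWORKS = ['Sunflowers']

def find_missed_artworks(text, titles=KNOWN_ARTWORKS):
    """Return (start, end, label, matched_text) for each whole-word title match."""
    spans = []
    for title in titles:
        # Case-sensitive on purpose: the generic noun "sunflowers" is not a title
        pattern = re.compile(r'\b' + re.escape(title) + r'\b')
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), 'WORK_OF_ART', match.group()))
    return sorted(spans)

sample = 'Van Gogh painted Sunflowers in Arles; his sunflowers are famous.'
missed = find_missed_artworks(sample)
print(missed)
```

A list like this only catches titles we already know about, so it complements rather than replaces a better model.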

Feel free to change the document number to inspect each paragraph's annotations.

In [11]:
spacy.displacy.render(docs[0], style='ent', jupyter=True)

## Grounding Our Entities

When grounding our entities, there are some preprocessing steps we might want to do for a number of reasons. The first is that we want to extract just the entities from the text to query. Next, we might want to find all unique (entity, label) combinations to reduce the number of queries we have to make. Finally, we can group the entities by their type so we can select the entity types we want to search for.

In [12]:
# Extract (entity, label) pairs from all the documents
all_ents = [(entity.text, entity.label_) for doc in docs for entity in doc.ents]

# Remove any duplicates by converting the list to a set
unique_ents = set(all_ents)

# Group the entities by their labels - must be sorted first
grouped_ents = groupby(sorted(unique_ents, key=lambda x: x[1]), key=lambda x: x[1])
grouped_ents = {label: [ent[0] for ent in ents] for label, ents in grouped_ents}

print(f'Found {len(all_ents)} entities in the text.')
print(f'Found {len(unique_ents)} unique entities in the text.', end='\n\n')

for label, group in grouped_ents.items():
    print(f'{label:12} {len(list(group))}')

Found 119 entities in the text.
Found 65 unique entities in the text.

CARDINAL     8
DATE         16
EVENT        1
FAC          5
GPE          10
LOC          1
NORP         3
ORDINAL      2
ORG          8
PERSON       10
WORK_OF_ART  1
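
The "must be sorted first" comment matters because `itertools.groupby` only merges adjacent items that share a key. A small sketch with made-up (entity, label) pairs shows the difference; `group_by_label` is a hypothetical helper.

```python
from itertools import groupby

# Made-up (entity, label) pairs, deliberately not sorted by label
pairs = [('Arles', 'GPE'), ('Theo', 'PERSON'), ('Paris', 'GPE')]

def group_by_label(items):
    """Group adjacent (entity, label) pairs by their label."""
    return [(label, [text for text, _ in group])
            for label, group in groupby(items, key=lambda pair: pair[1])]

unsorted_groups = group_by_label(pairs)
sorted_groups = group_by_label(sorted(pairs, key=lambda pair: pair[1]))

print(unsorted_groups)  # GPE appears twice because its items were not adjacent
print(sorted_groups)    # one group per label
```

If the unsorted groups were turned into a dictionary, the second `GPE` group would silently overwrite the first, which is why the sort comes before the grouping.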


Some of these entity types are not groundable with the sources we will use, or are not particularly interesting to ground in our use-case. For this project, we are mostly interested in actors, objects or events, hence we can exclude numerical entities and date/time.

In [13]:
# Define subset of entity types to keep
ents_to_keep = ('EVENT', 'FAC', 'LOC', 'GPE', 'ORG', 'PERSON', 'NORP', 'WORK_OF_ART')

# Filter the grouped entities to only keep those in the subset
filtered_groups = {label: grouped_ents[label] for label in grouped_ents.keys() & ents_to_keep}

for label, group in filtered_groups.items():
    print(f'{label:12} {len(list(group))}')

LOC          1
WORK_OF_ART  1
ORG          8
NORP         3
PERSON       10
FAC          5
EVENT        1
GPE          10


### Getty Vocabularies

To disambiguate artists, we will use the [Getty vocabularies ULAN](https://www.getty.edu/research/tools/vocabularies/ulan/) database, which records biographical information for artists, architects, firms, studios, repositories, and patrons. The Getty Vocabularies also contain the Art & Architecture Thesaurus (AAT), Thesaurus of Geographic Names (TGN), Cultural Objects Name Authority (CONA), Iconography Authority (IA), and Categories for the Description of Works of Art (CDWA). Descriptions and links to each of these databases can be found [here](https://www.getty.edu/research/tools/vocabularies/index.html).

To access their SPARQL endpoint, we will use the [SPARQLWrapper](https://sparqlwrapper.readthedocs.io/en/stable/main.html) library for Python, which handles the formatting of our SPARQL queries into HTTP requests and resolves them for us. This allows us to focus on requesting the data that we want, rather than the implementation itself. We first create an instance of SPARQLWrapper that is instructed to query Getty Vocabularies and we set the return format to JSON so it is easy to handle in Python. We will also make use of [pandas](https://pandas.pydata.org/) to help us nicely tabulate the data for presentation. Pandas is also a very powerful data manipulation library, should we want to perform some post-processing.

In [15]:
sparql = SPARQLWrapper('https://vocab.getty.edu/sparql')
sparql.setReturnFormat(JSON)
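
To give a rough idea of what the wrapper handles for us, the sketch below builds (without sending) the kind of GET request described by the SPARQL 1.1 Protocol: the query text is URL-encoded into a `query` parameter and JSON results are requested through the `Accept` header. This is an illustration using the standard library only; the exact request SPARQLWrapper produces may differ, and `build_sparql_request` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

ENDPOINT = 'https://vocab.getty.edu/sparql'

def build_sparql_request(endpoint, query):
    """Build, but do not send, a SPARQL protocol GET request."""
    params = urlencode({'query': query})
    url = endpoint + '?' + params
    return Request(url, headers={'Accept': 'application/sparql-results+json'})

request = build_sparql_request(ENDPOINT, 'SELECT * WHERE { ?s ?p ?o } LIMIT 1')
print(request.full_url)

# Decoding the URL again recovers the original query text
decoded_query = parse_qs(urlsplit(request.full_url).query)['query'][0]
print(decoded_query)
```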

We can then write our SPARQL query with each of the names from the PERSON category. As covered, everything in a linked data triple has a URI and the predicate is no exception. 

* **Simple Knowledge Organization System** ([SKOS](https://www.w3.org/2004/02/skos/)) is a commonly used model for expressing basic structure in Linked-Data schemas.
* SKOS has the **eXtension for Labels** ([SKOS-XL](https://www.w3.org/2006/07/SWD/wiki/SkosDesign/SKOS-XL.html)) extension which enables representation of concepts with labels in different languages, synonyms and other lexical forms. 
* **Resource Description Framework Schema** ([RDFS](https://www.w3.org/TR/rdf-schema/)) is a foundational language for describing vocabularies and building ontologies for the semantic web. 
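
The queries in this notebook use the prefixes `skos:`, `rdfs:`, `xl:`, `gvp:` and `ulan:` without declaring them, which suggests the Getty endpoint predefines them. On an endpoint that does not, they would need to be declared explicitly. The sketch below shows one way to prepend the declarations; `with_prefixes` is a hypothetical helper, and the namespace URIs are, to the best of our knowledge, the published ones, so check them against the vocabulary documentation before relying on them.

```python
# Namespace URIs for the prefixes used in this notebook's queries
# (assumed values; verify against the W3C and Getty documentation)
PREFIXES = {
    'skos': 'http://www.w3.org/2004/02/skos/core#',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
    'xl': 'http://www.w3.org/2008/05/skos-xl#',
    'gvp': 'http://vocab.getty.edu/ontology#',
    'ulan': 'http://vocab.getty.edu/ulan/',
}

def with_prefixes(query, prefixes=PREFIXES):
    """Prepend one PREFIX declaration per entry to a SPARQL query string."""
    header = '\n'.join(f'PREFIX {name}: <{uri}>' for name, uri in prefixes.items())
    return f'{header}\n{query.lstrip()}'

explicit_query = with_prefixes('''
SELECT ?person WHERE { ?person skos:inScheme ulan: } LIMIT 5
''')
print(explicit_query)
```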

In [48]:
query = '''
SELECT ?person ?preferredName
WHERE {
    ?person skos:inScheme ulan: ; 
            rdfs:label "%s" ;
            gvp:prefLabelGVP [ xl:literalForm ?preferredName ] .
}
'''

def run_single_queries(query, people):
    results = []

    # Create progress bar
    with tqdm(total=len(people)) as pbar:
        # Run the defined query with each uniquely identified person
        for person in people:
            pbar.set_description(f'Querying for {person}')

            # Use Python string formatting to replace %s with person
            sparql.setQuery(query % person)
            result = sparql.queryAndConvert()

            # Create table for the results of the query
            result_parsed = pd.json_normalize(result['results']['bindings'])
            result_parsed['person'] = person
            results.append(result_parsed.set_index(['person', result_parsed.index]))

            pbar.update(1)

    return pd.concat(results)

single_start = time.time()
singles_results = run_single_queries(query, filtered_groups['PERSON'])
single_end = time.time()

  0%|          | 0/10 [00:00<?, ?it/s]

We can now view the data. One thing to notice immediately is that, despite the text being about van Gogh, no match was returned for him. If we inspect the names recorded for [van Gogh on ULAN](https://www.getty.edu/vow/ULANFullDisplay?find=vincent+van+gogh&role=&nation=&prev_page=1&subjectid=500115588), we can see that there is an all lower-case form "van gogh" but not the capitalised "Van Gogh" found in this text, and our query only matches labels exactly. This is an important reminder to get to know the database you are referencing against before relying on it.
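
One hedged way to soften the exact-match problem is to query several capitalisation variants of each name. The sketch below generates a few; `name_variants` and the set of lower-cased particles are assumptions for illustration, and a case-insensitive match on the server side may be a better fit.

```python
# Hypothetical set of name particles that are often written in lower case
PARTICLES = {'van', 'von', 'de', 'der', 'den', 'di', 'da', 'la', 'le'}

def name_variants(name):
    """Return capitalisation variants of a name, original first, without duplicates."""
    words = name.split()
    lower_particles = ' '.join(
        word.lower() if word.lower() in PARTICLES else word for word in words
    )
    candidates = [name, name.lower(), name.title(), lower_particles]
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(candidates))

variants = name_variants('Van Gogh')
print(variants)
# ['Van Gogh', 'van gogh', 'van Gogh']
```

Each variant could then be queried like any other name, at the cost of more candidate matches to review.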

The other issue is that some entities are too ambiguous to resolve from the word alone. We can infer from the context of the work that "Theo" is [Theo van Gogh](https://www.getty.edu/vow/ULANFullDisplay?find=van+gogh&role=&nation=&page=1&subjectid=500339434), but given the limited context, the query has matched entities recorded under "Unidentified Named People and Firms". For Monticelli, three different potential entities have been identified because of the ambiguity of using only a last name. Using the time-period of van Gogh's life, we can infer that [Monticelli, Adolphe](https://www.getty.edu/vow/ULANServlet?english=Y&find=Monticelli%2C+Adolphe&role=&page=1&nation=) is the more likely candidate, given that he was alive at the same time as van Gogh, whereas [Monticelli, Andrea](https://www.getty.edu/vow/ULANServlet?english=Y&find=Monticelli%2C+Andrea&role=&page=1&nation=) lived over a century earlier.

In [46]:
singles_results

Unnamed: 0_level_0,Unnamed: 1_level_0,person.type,person.value,preferredName.xml:lang,preferredName.type,preferredName.value
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Gauguin,0,uri,http://vocab.getty.edu/ulan/500011421,en,literal,"Gauguin, Paul"
Theo,0,uri,http://vocab.getty.edu/ulan/500174401,,literal,Theo
Theo,1,uri,http://vocab.getty.edu/ulan/500396193,,literal,Theo
Paul Gauguin,0,uri,http://vocab.getty.edu/ulan/500011421,en,literal,"Gauguin, Paul"
Albert Aurier,0,uri,http://vocab.getty.edu/ulan/500313658,,literal,"Aurier, Albert"
Monticelli,0,uri,http://vocab.getty.edu/ulan/500000984,nl,literal,"Monticelli, Adolphe"
Monticelli,1,uri,http://vocab.getty.edu/ulan/500050589,nl,literal,"Monticelli, Andrea"
Monticelli,2,uri,http://vocab.getty.edu/ulan/500431065,,literal,Monticelli


Although a single query may seem fast, the time cost of sending one query at a time quickly becomes apparent, even when working with a moderately sized collection. To reduce communication overheads, we can batch all of our names into one query and ask the server to resolve them in a single request, so the network round-trip cost is paid once whether we send one name or 1000.

In [29]:
batched_query = '''
SELECT ?value ?person ?preferredName
WHERE {
    VALUES (?value) { %s }
    ?person skos:inScheme ulan: ; 
            rdfs:label ?value ;
            gvp:prefLabelGVP [ xl:literalForm ?preferredName ] .
}
'''

def run_batched_query(query, people):
    formatted_people = ' '.join(f'("{person}")' for person in people)

    sparql.setQuery(query % formatted_people)
    result = sparql.queryAndConvert()

    result_parsed = pd.json_normalize(result['results']['bindings']).rename(columns={'value.value': 'person'})
    result_parsed = result_parsed.set_index(['person', result_parsed.index]).drop(columns=['value.type'])
    
    return result_parsed

batched_start = time.time()
batched_results = run_batched_query(batched_query, filtered_groups['PERSON'])
batched_end = time.time()

print(f'Single query took {single_end - single_start:.2f} seconds.')
print(f'Batched query took {batched_end - batched_start:.2f} seconds.')

Single query took 7.46 seconds.
Batched query took 0.72 seconds.
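
Two practical caveats apply when building the `VALUES` clause by string formatting: a name containing a double quote or backslash would break the query, and a very long list may be better sent in several moderately sized batches. The sketch below escapes literals and splits names into chunks; `escape_literal`, `values_rows` and the default batch size of 100 are assumptions for illustration, not part of SPARQLWrapper.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a double-quoted SPARQL literal."""
    return text.replace('\\', '\\\\').replace('"', '\\"')

def values_rows(names, batch_size=100):
    """Yield one VALUES body per batch, e.g. '("Theo") ("Gauguin")'."""
    names = list(names)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        yield ' '.join(f'("{escape_literal(name)}")' for name in batch)

rows = list(values_rows(['Theo', 'Paul "Gauguin"', 'Monticelli'], batch_size=2))
for row in rows:
    print(row)
```

Each yielded string could be substituted into `batched_query` in place of `formatted_people`, giving one request per batch.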


In [233]:
batched_results

Unnamed: 0_level_0,Unnamed: 1_level_0,person.type,person.value,preferredName.type,preferredName.value,preferredName.xml:lang
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Theo,0,uri,http://vocab.getty.edu/ulan/500396193,literal,Theo,
Theo,1,uri,http://vocab.getty.edu/ulan/500174401,literal,Theo,
Paul Gauguin,2,uri,http://vocab.getty.edu/ulan/500011421,literal,"Gauguin, Paul",en
Gauguin,3,uri,http://vocab.getty.edu/ulan/500011421,literal,"Gauguin, Paul",en
Albert Aurier,4,uri,http://vocab.getty.edu/ulan/500313658,literal,"Aurier, Albert",
Monticelli,5,uri,http://vocab.getty.edu/ulan/500431065,literal,Monticelli,
Monticelli,6,uri,http://vocab.getty.edu/ulan/500000984,literal,"Monticelli, Adolphe",nl
Monticelli,7,uri,http://vocab.getty.edu/ulan/500050589,literal,"Monticelli, Andrea",nl
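
The ambiguity visible in this table can also be summarised programmatically. The sketch below counts distinct URIs per queried name from bindings shaped like the endpoint's JSON response; the sample bindings are copied from the table above (trimmed to the fields used), and `candidates_per_name` is a hypothetical helper.

```python
from collections import defaultdict

def candidates_per_name(bindings, queried_names):
    """Map each queried name to the sorted list of distinct URIs returned for it."""
    found = defaultdict(set)
    for row in bindings:
        found[row['value']['value']].add(row['person']['value'])
    # Names with no binding at all map to an empty list
    return {name: sorted(found[name]) for name in queried_names}

bindings = [
    {'value': {'value': 'Theo'}, 'person': {'value': 'http://vocab.getty.edu/ulan/500396193'}},
    {'value': {'value': 'Theo'}, 'person': {'value': 'http://vocab.getty.edu/ulan/500174401'}},
    {'value': {'value': 'Gauguin'}, 'person': {'value': 'http://vocab.getty.edu/ulan/500011421'}},
]

summary = candidates_per_name(bindings, ['Theo', 'Gauguin', 'Van Gogh'])
for name, uris in summary.items():
    status = 'unmatched' if not uris else 'unique' if len(uris) == 1 else 'ambiguous'
    print(f'{name:10} {status:10} {len(uris)}')
```

Names flagged as ambiguous or unmatched are the ones that need extra context or manual review before a URI is assigned.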


### Optional Exercise

Try to modify the previous batched SPARQL query to also **optionally** retrieve the literal form of the preferred nationality for each artist. Your table should gain three new columns: nationality.xml:lang, nationality.type and nationality.value, alongside the existing preferredName.xml:lang.

<details>
<summary>Hint 1</summary>

1. You'll need to create a new variable for SELECT
2. You'll need to use the [optional](https://en.wikibooks.org/wiki/SPARQL/OPTIONAL) syntax for SPARQL

</details>

<details>
<summary>Hint 2</summary>

You can use `foaf:focus` as a predicate with `?person` as the subject:

```
?person foaf:focus ...
```

</details>

<details>
<summary>Hint 3</summary>

Apply the same pattern we used to get the literal form of the artist's preferred name, but instead of applying the pattern to `?person`, apply it to the preferred nationality predicate.

</details>

<details>
<summary>Solution</summary>

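1. The first stage is to add the `?nationality` variable to `SELECT`.
2. We can then add an `OPTIONAL` element to the `WHERE` block. This will allow the server to respond with a nationality if it can find one but will not prevent the query from returning the more important data we requested if it can't find anything. If we didn't wrap this in an optional block, no row would be returned for any person whose nationality could not be found.
3. Inside the `OPTIONAL` block, we follow `foaf:focus` from `?person` to `gvp:nationalityPreferred`, then reuse the `gvp:prefLabelGVP [ xl:literalForm ... ]` pattern to bind the literal form to `?nationality`.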

<div>

```python
batched_query = '''
SELECT ?value ?person ?preferredName ?nationality
WHERE {
    VALUES (?value) { %s }
    ?person skos:inScheme ulan: ;
            rdfs:label ?value ;
            gvp:prefLabelGVP [ xl:literalForm ?preferredName ] .

    OPTIONAL {
        ?person foaf:focus [ gvp:nationalityPreferred [ gvp:prefLabelGVP [ xl:literalForm ?nationality ]]]
    }
}
'''
```

</div>
</details>

In [None]:
nationality_query = '''
# YOUR QUERY HERE
'''

run_batched_query(nationality_query, filtered_groups['PERSON'])

This example raises some issues that need to be considered before implementing it in a real system. For each of the NER tags we chose to keep, try to think of some potential considerations that must be made when trying to disambiguate entities. Once you've had a think, expand the tab below to see some examples we have come up with.

<details>
    <summary>Some Example Problems & Considerations</summary>
    <h4>People</h4>
    <ul>
        <li>Names are not always enough to disambiguate people. The name "van Gogh" is almost always used to refer to Vincent van Gogh but there are also <a href=https://www.getty.edu/vow/ULANServlet?english=Y&find=van+gogh&role=&page=1&nation=>other painters</a> that share the same surname.</li>
        <li>Historically, women were referred to by their husband's name e.g. Mrs. John Smith. These women may be possible to identify by context or may not be named in any historical record at all.</li>
    </ul>
    <h4>Locations & Geopolitical Entities</h4>
    <ul>
        <li>Map borders are not static; place names in text must be disambiguated based on the time at which the text was written or the period it discusses. Discussing the Roman Empire in the 1st century versus the 3rd century draws a very different world map.</li>
        <li>Ambiguous place names such as <a href=https://en.wikipedia.org/wiki/Springfield>Springfield</a> require context to unambiguously define. Sometimes it is impossible with the provided information.</li>
    </ul>
    <h4>Work of Art</h4>
    <ul>
    <li>Identifying artworks in text using NER is very challenging as the names often appear as generic, natural language to NER systems unless specifically trained for artwork identification or at least the semantics of cultural heritage (<a href=https://link.springer.com/chapter/10.1007/978-3-030-30760-8_10>source</a>).</li>
    <li>Before the 18th century, there was little need to name artworks and so the majority of historical works have been assigned names by galleries or art-historians based on their subject matter (<a href=https://www.artsy.net/article/artsy-editorial-artworks-untitled>source</a>). Artwork names are not a fixed identification of a piece and may change over time, meaning the same work can have many names including "untitled".</li>
    <li>By chance, intention or for the aforementioned reason, many works can share the same name, especially "untitled". As such, context and additional information are required to disambiguate a specific reference to an artwork.</li>
    <li>Even trying to disambiguate by artist and title is often insufficient. <a href=https://en.wikipedia.org/wiki/Sunflowers_(Van_Gogh_series)>Sunflowers</a> is an example where van Gogh produced multiple series of the work under the same name.</li>
    </ul>
</details>

It is important to note that these issues become the strength of linked data once entities are accurately disambiguated. Being able to refer uniquely to a very specific entity without any possibility of misreference is very powerful for accurately assigning metadata and progressing research. Furthermore, developing these unambiguous entities allows others to link their knowledge with ours, expanding the possibilities for mapping and discovery. Once an entity is assigned a URI, you can also periodically request data from other databases and benefit from any updates or additions that they may provide. The quality of your data could improve without any effort from you if others improve their data, and vice versa.

In [None]:
from functools import cache

In [None]:
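def sparql_query_cache(func):
    # Sketch of a caching decorator. functools.cache needs hashable arguments,
    # so the list of names becomes a tuple for the cache key and is turned
    # back into a list for the real call.
    # Requires function signature to have query as a kwarg.
    @cache
    def cached(people, **kwargs):
        return func(list(people), **kwargs)

    def wrapper(people, **kwargs):
        return cached(tuple(people), **kwargs)

    return wrapper

In [None]:
@sparql_query_cache
def run_batched_query(people, query=batched_query):
    # Note: a cache hit returns the same DataFrame object, so copy it before modifying
    formatted_people = ' '.join(f'("{person}")' for person in people)

    sparql.setQuery(query % formatted_people)
    result = sparql.queryAndConvert()

    result_parsed = pd.json_normalize(result['results']['bindings']).rename(columns={'value.value': 'person'})
    result_parsed = result_parsed.set_index(['person', result_parsed.index]).drop(columns=['value.type'])

    return result_parsed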