In [None]:
#!pip install spacy
#!pip install spacy-transformers
#!pip install wikipedia
#!pip install neo4j
#Download medium sized language model
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md



<h1>Assignment 1: Introduction</h1>
<h2> CTRL+F Like Search </h2>
<p>The dataset `assignment1_data.xlsx` contains five people, each with a summary of their career. All but one of them have won prizes or received awards for their work. The first assignment is to figure out who hasn't won any prizes or awards. In the cells below we first show how we tried to do this with the wonderful "CTRL+F like search". 

The first cell (#1)  uses the <i>pandas.Series.str.contains</i> function with the keyword 'win' and filters the dataset by excluding the people who have the string 'win' in their summary. This approach has narrowed it down to three people. 
    
After adding more keywords in the second cell (#2), only two rows (persons) were left. Verbs might not be the best keywords, so we decided to use nouns instead: 'prize' and 'award' (#3). This left only Rosalind Franklin, who had been excluded in the first search (#1) because the string 'win' occurs in her summary. We checked the summary: it does not contain the actual word 'win', but the search matched the string 'win' inside 'owing':<br>
    <i>'owing to disagreement with her director, john randall, and her colleague maurice wilkins'.</i><br>

This approach gives unreliable and wrong results because it only matches strings against strings, which shows the importance of tokenizing the text beforehand. The assignment starts after the next three cells; go through them first and check the output of the <i>pandas.Series.str.contains</i> function.
</p>
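
The 'owing' false positive can be reproduced in isolation. The sketch below is only an illustration: the sample sentence is the fragment quoted above, and the two helper functions are hypothetical names that are not part of the notebook. It contrasts a plain substring test with a regular expression that requires word boundaries.

```python
import re

# Fragment of Rosalind Franklin's summary quoted above
text = "owing to disagreement with her director, john randall"

def substring_match(keyword, text):
    """CTRL+F style: True if the keyword occurs anywhere, even inside a word."""
    return keyword.lower() in text.lower()

def whole_word_match(keyword, text):
    """True only if the keyword occurs as a separate word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(substring_match("win", text))   # True: 'win' is found inside 'owing'
print(whole_word_match("win", text))  # False: there is no separate word 'win'
```

Matching whole words removes this particular false positive, but it would still miss inflected forms such as 'won', which is where lemmatization comes in.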



In [150]:
import pandas as pd
from pathlib import Path

#read the assignment's dataset
df = pd.read_excel(Path('assignment1_data.xlsx'))

In [71]:
#1 use the keyword win
keyword = 'win'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
0,Jennifer Doudna,"Jennifer Anne Doudna (; born February 19, 196..."
3,Gertrude Elion,"Gertrude ""Trudy"" Belle Elion (January 23, 1918..."
4,Rita Levi-Montalcini,Rita Levi-Montalcini (22 April 1909 – 30 Decem...


In [72]:
#2 use the keyword win or receive
keyword = 'win|receive'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
3,Gertrude Elion,"Gertrude ""Trudy"" Belle Elion (January 23, 1918..."
4,Rita Levi-Montalcini,Rita Levi-Montalcini (22 April 1909 – 30 Decem...


In [73]:
#3 use the keyword prize or award
keyword = 'prize|award'
df[~df.summary.str.contains('(?i)'+keyword)]

Unnamed: 0,person,summary
2,Rosalind Franklin,Rosalind Elsie Franklin (25 July 1920 – 16 Apr...


<h2> Improve the CTRL+F Like Search results</h2>
As mentioned before, the first assignment is to find out which person hasn't won an award.

You have two options: follow the code we have prepared for you below, or build your own solution to tackle this problem.

The code below uses the language models of the Python NLP package spaCy. First we use <b>tokenization</b> and <b>lemmatization</b> to match the keywords (relating to awards) with words in the summaries of the five people. 

<i> language model </i>
    
    A language model is a machine learning model designed to represent the language domain. It can be used as a basis for a number of different language-based tasks, for instance:

        Question answering
        Semantic search
        Summarization

<i> tokenization </i>

    Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.

<i> lemmatization </i>

    Lemmatization is closely related to stemming. In natural language processing, stemming lets the computer group together the inflected forms of a word by reducing them to a common stem. For instance: “walk,” “walked” and “walking.”

    Lemmatization is a bit more complex in that the computer can also group together words that do not share the same stem but do share the same dictionary form (lemma). Grouping the word “good” with words like “better” and “best” is an example of lemmatization. Because it relies on vocabulary and grammatical information rather than on simple suffix rules, lemmatization is usually more accurate than stemming.
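
The difference between the two can be sketched with a toy example. Everything below is a simplification for illustration: `naive_stem`, `toy_lemma` and the hand-written `LEMMA_TABLE` are hypothetical and do not reflect how spaCy implements lemmatization.

```python
def naive_stem(word):
    """Strip a few common English suffixes (a deliberately crude stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lookup table; a real lemmatizer covers a full vocabulary
LEMMA_TABLE = {"won": "win", "better": "good", "best": "good", "received": "receive"}

def toy_lemma(word):
    """Look the word up in the table, falling back to the crude stemmer."""
    return LEMMA_TABLE.get(word, naive_stem(word))

for w in ["walked", "walking", "won", "better"]:
    print(f"{w}: stem={naive_stem(w)}, lemma={toy_lemma(w)}")
```

The stemmer groups “walked” and “walking” under “walk” but leaves “won” and “better” untouched, whereas the lookup maps them to “win” and “good”.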

For more information about tokenization and lemmatization please refer to the previous Techathon NLP II:  
https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/spacy-hackathon-presentation-november-2022.pdf  
https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/Assignments/Assignment-1-spaCy-101.ipynb


<h3> start assignment 1</h3>
You can think of your own solution or use the cells below as inspiration. The code below uses two different approaches: comparing lemmas and using word embeddings.

In [163]:
#import packages
import pandas as pd
from pathlib import Path
import spacy
import re

#read the assignment's dataset
df = pd.read_excel(Path('assignment1_data.xlsx'))

#Language class with the English model 'en_core_web_sm' is loaded
nlp = spacy.load('en_core_web_sm')

#Bonus question: "cleaning" data important or not?
def alphanumericalOnly(text):
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).lower()

In [157]:
#the dataframe contains two columns: person and the summaries
print(df)

                 person                                            summary
0       Jennifer Doudna  Jennifer Anne Doudna  (; born February 19, 196...
1         Rachel Carson  Rachel Louise Carson (May 27, 1907 – April 14,...
2     Rosalind Franklin  Rosalind Elsie Franklin (25 July 1920 – 16 Apr...
3        Gertrude Elion  Gertrude "Trudy" Belle Elion (January 23, 1918...
4  Rita Levi-Montalcini  Rita Levi-Montalcini (22 April 1909 – 30 Decem...


The cell above reads in the data and loads in the small version of the English model.
You can try to play around with different kinds of _keywords_ to improve the search in the cells below.

In [248]:
keywords = ["win", "award", "prize", "receive"]

#apply the language model on the summaries 
#as well as the keywords in order to use tokenization and lemmatization later on
keywords_nlp = [nlp(k) for k in keywords]
docs = [nlp(doc) for doc in df['summary']] 

#use the method _displacy_ to show the summaries with the highlighted entities. 
#What those entities are will be explained in assignment 2.
spacy.displacy.render(docs, style='ent')

<h3> lemmatization </h3>
Our first approach is to count the number of times a keyword is present in the summaries. We do this by first creating a dictionary in which each person is a key: _awardCounter_

Afterwards we loop through the analyzed summaries (_docs_) and keywords (_keywords_nlp_) and check whether the lemmas of the tokens in the summaries match the lemmas of the keywords.

In [161]:
#dictionary
awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}
for doc in docs:
    for token in doc:
        for k in keywords_nlp:
            if (token.lemma_.casefold() == k[0].lemma_):
                awardCounter[doc[0].text] += 1
                print(f"Person: {doc[0].text}, token: {token}, "
                      #f"token_lemma: {token.lemma_.casefold()}, " 
                      f"keyword: {k[0].lemma_}")

print(f"\n{awardCounter}")

Person: Jennifer, token: received, keyword: receive
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: awards, keyword: award
Person: Jennifer, token: Award, keyword: award
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Prize, keyword: prize
Person: Jennifer, token: Award, keyword: award
Person: Jennifer, token: Prize, keyword: prize
Person: Rachel, token: won, keyword: win
Person: Rachel, token: Award, keyword: award
Person: Rachel, token: awarded, keyword: award
Person: Gertrude, token: Prize, keyword: prize
Person: Rita, token: awarded, keyword: award
Person: Rita, token: Prize, keyword: prize

{'Jennifer': 9, 'Rachel': 3, 'Rosalind': 0, 'Gertrude': 1, 'Rita': 2}


The dictionary shows that Rosalind has zero hits for any of the keywords, so she is most likely the person who hasn't won any award. 

<h3> word embeddings </h3>

Our second approach uses word embeddings to find matches with our keywords. 

<i> word embeddings </i>

    Word embedding in NLP is an important term that is used for representing words for text analysis in the form of real-valued vectors. It is an advancement in NLP that has improved the ability of computers to understand text-based content in a better way. Words and documents are represented in the form of numeric vectors allowing similar words to have similar vector representations. The extracted features are fed into a machine learning model so as to work with text data and preserve the semantic and syntactic information.
    The most important feature of word embeddings is that similar words in a semantic sense have a smaller distance (either Euclidean, cosine or other) between them than words that have no semantic relationship. For example, words like “mom” and “dad” should be closer together than the words “mom” and “ketchup” or “dad” and “butter”.
    
Word embeddings were introduced in a previous Techathon:
* https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/spacy-hackathon-presentation-november-2022.pdf  
* https://github.com/DataScienceOrdina/techathon-NLP-II/blob/main/Assignments/Assignment-1-spaCy-101.ipynb

For more information, see:
* https://en.wikipedia.org/wiki/Word_embedding  
* https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/


With word embeddings you can calculate the distance between the keywords and the tokens of the summaries. The Python package spaCy has a method called _similarity_ which will do this for you. By default it computes the cosine similarity of the two vectors; in practice the scores mostly fall between 0 and 1, although cosine similarity can be negative in principle. The higher the value, the stronger the semantic relationship between the token and the keyword. 
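
To make the idea of "distance between vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration (real spaCy vectors have hundreds of dimensions), and returning 0.0 for an empty vector is our own choice for the toy function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no meaningful direction to compare
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT taken from any language model
vectors = {
    "prize":   [0.9, 0.8, 0.1],
    "award":   [0.8, 0.9, 0.2],
    "ketchup": [0.1, 0.0, 0.9],
}

print(cosine_similarity(vectors["prize"], vectors["award"]))    # close to 1
print(cosine_similarity(vectors["prize"], vectors["ketchup"]))  # much lower
```

Vectors pointing in almost the same direction score close to 1, unrelated ones score much lower, and vectors pointing in opposite directions would score -1.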

Try playing around with different settings for _similarityThreshold_, as well as with different _keywords_. How much does the threshold influence the output of the awardCounter? Also compare these results with the results from the lemmatization approach.

In [166]:
# we use the medium version of the language model as this one contains the word embeddings
nlp = spacy.load('en_core_web_md')
keywords = ["win", "award", "prize", "receive"]
keywords_nlp = [nlp(k) for k in keywords]
docs = [nlp(doc) for doc in df['summary']] 

#similar to the first approach we use a dictionary to count how often a token has a higher similarity score
#than the similarity threshold
awardCounter = {"Jennifer":0, "Rachel":0, "Rosalind":0,
                "Gertrude":0, "Rita":0}

#setting the similarity threshold
similarityThreshold = 0.5


for doc in docs:
    for token in doc:
        for k in keywords_nlp:
            similarityScore = token.similarity(k)
            if similarityScore > similarityThreshold:
                awardCounter[doc[0].text] += 1
                print(f'Person: {doc[0].text}, {token} <-> {k}, '
                    f'similarity: {similarityScore}')

print(f"\n{awardCounter}")

#UserWarning: [W008] Evaluating Token.similarity based on empty vectors. 
# -> when evaluating unknown tokens that have no valid word vector

Person: Jennifer, of <-> receive, similarity: 0.6314075888511705
Person: Jennifer, to <-> receive, similarity: 0.5395064114154078
Person: Jennifer, share <-> receive, similarity: 0.6023202883403633
Person: Jennifer, received <-> award, similarity: 0.999999980088709
Person: Jennifer, received <-> prize, similarity: 0.999999980088709
Person: Jennifer, Prize <-> award, similarity: 0.999999980088709
Person: Jennifer, Prize <-> prize, similarity: 0.999999980088709
Person: Jennifer, for <-> receive, similarity: 0.6884468395821037
Person: Jennifer, of <-> receive, similarity: 0.6314075888511705
Person: Jennifer, for <-> receive, similarity: 0.6884468395821037
Person: Jennifer, of <-> receive, similarity: 0.6314075888511705
Person: Jennifer, of <-> receive, similarity: 0.6314075888511705
Person: Jennifer, of <-> receive, similarity: 0.6314075888511705
Person: Jennifer, Medical <-> receive, similarity: 0.5634252535943546
Person: Jennifer, Medical <-> receive, similarity: 0.5634252535943546
...


<h1>Assignment 2: Public Ivy School</h1>
Now do it yourself: find out which one of the five persons worked at a Public Ivy School and show your results and findings. Public Ivy is an informal term for prestigious public universities in the United States of America. 
(See: https://en.wikipedia.org/wiki/Public_Ivy)

We'll discuss the results between 15:15 and 15:30.

Hint 1: The term _Public Ivy School_ itself is <b>not</b> mentioned in the summaries.  
Hint 2: Perhaps comparing _entities_ rather than _tokens_ will give better results.

If you've built your own solution, try using the same code to answer this question.

In [168]:
# re-use the code from Assignment 1 to answer this question


<h1> Demo / Discussion: Semantic Search </h1>
Using spaCy improved the "ctrl+f like search" because it recognizes tokens, named entities and lemmas.

SpaCy's similarity function uses word embeddings to calculate the distance of tokens, entities or even doc objects. This way you'll find which tokens or entities have the highest semantic relation with "Public Ivy School". Let's discuss how "semantic" a semantic relation of a word embedding really is.

The language model that we used, _en_core_web_md_, was trained on English blogs, news and comments. The Wikipedia page on Public Ivy lists the different schools, and you'll see that the University of California, Berkeley, which is mentioned in Jennifer Doudna's summary, is on that list. We could train the model on a more specific dataset containing information about schools. This could decrease the distance between the named entities 'University of California, Berkeley' and 'Public Ivy School'. In other words, it would strengthen the semantic relation, but we still wouldn't have an explicit relation between those two concepts. The similarity says something about the distance between two named entities, not about what that distance actually means.

Another question to think about: what are the requirements for retraining your language model with more data in order to expand your knowledge base? Training a large language model is expensive and, for some companies, often not feasible. This is where Semantic Web technology can help you. With Semantic Web technology (using taxonomies and linked data) you can make the information explicit, accessible and interpretable for your application. 

If we look back at Assignment 2, you needed information about which universities are considered a Public Ivy School in order to answer the question. Luckily for us, there are several Linked Data versions of Wikipedia, and for today we will be using Wikidata. This demo will showcase a simplified version of semantic search, showing how to apply entity linking and enrich the data with a taxonomy of the Public Ivy schools.

The next blocks will:
* explain what entity linking is
* describe how we retrieved Wikidata information
* explain how we used a graph database
* show the different outputs when querying with and without a taxonomy.


<h2> Entity Linking </h2>


Semantic Search is about enriching your search query and/or your search results with taxonomies and ontologies. A way to solve Assignment 2's question about the Public Ivy Schools is using a taxonomy containing domain knowledge of Public Ivy Schools. The key here is to link the named entities with concepts of a taxonomy. We call this Entity Linking. 


<b>What is Named Entity Linking?</b>
<i> copied from https://medium.com/analytics-vidhya/entity-linking-a-primary-nlp-task-for-information-extraction-22f9d4b90aa8</i>

Information extraction comprises multiple sub-tasks. In most cases we have the following sub-tasks, performed in this order to extract information from unstructured data.

    Named Entity Recognition (NER)
    Named Entity Linking (NEL)
    Relation Extraction

A named entity is a real-world object, such as a person, location or organization. NER identifies named entity occurrences in text and classifies them into pre-defined categories. NER is modeled as a task of assigning a tag to each word in a sentence. Below is an example result from Jennifer Anne Doudna's summary.


In [45]:
spacy.displacy.render(docs[0], style='ent')

NER will tell us what words are entities and what are their types. In the above example, NER will locate “Jennifer Anne Doudna” as a person. But we still don’t know exactly which “Jennifer Anne Doudna” the text is speaking about in the above example. NEL is the next sub-task that will answer this question.

NEL will assign a unique identity to entities mentioned in the text. In other words, NEL is the task of linking entity mentions in text with their corresponding entities in a knowledge base. The target knowledge base depends on the application, but we can use knowledge bases derived from Wikipedia for open-domain text. In our example above, we can find out exactly which “Jennifer Anne Doudna” is meant by linking the entities to Wikidata. Wikidata is a structured, collaboratively edited knowledge base that is closely linked to Wikipedia. The process of linking entities to Wikipedia is also called Wikification.

NEL is also referred to as Entity Linking, Named Entity Disambiguation (NED), Named Entity Recognition and Disambiguation (NERD) or Named Entity Normalization (NEN). NEL has a wide range of applications other than Information Extraction. NEL is used in Information Retrieval, Content Analysis, Intelligent Tagging, Question Answering System, Recommender Systems, etc.

NEL also plays a significant role in the Semantic Web. The Semantic Web is a term coined by Tim Berners-Lee for a web of data that can be processed by machines. A vital issue in Semantic Web is to automatically populate and enrich existing knowledge bases with newly extracted facts. NEL is inherently considered as an essential subtask for knowledge base population.

<h3> Entity linking simplified </h3>
There are many ways to do entity linking. For this techathon demo we simplified it by matching the named entities retrieved from spaCy's language model with concepts from Wikidata: we search Wikidata for the name of the named entity and take the first hit from the results. 

This approach is inspired by GCP's NLP API (https://cloud.google.com/natural-language). Try the demo and you'll see that most entities have a Wikipedia link to it. This is what we will be doing in the cells below.

     1. Google
        https://en.wikipedia.org/wiki/Google
        Salience: 0.19 

We've added a filter to the named entities: 'PERSON', 'ORG', 'WORK_OF_ART', 'GPE', 'EVENT', 'NORP' as we wanted to keep it small. 

The whole code is not shown below but we've added a part of the code in the cells below. The output of the entity linking can be found in _entitylinkingdata.csv_. The csv file contains the named entities and the different summaries and the actual linking part happens in the graph database.

In [221]:
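#entity linking applied by matching NER with wikidata endpoint
import requests

def call_wiki_api(item):
    """Return [concepturi, id] of the first Wikidata search hit, or 'id-less'."""
    url = "https://www.wikidata.org/w/api.php"
    params = {"action": "wbsearchentities", "search": str(item),
              "language": "en", "format": "json"}
    try:
        data = requests.get(url, params=params, timeout=10).json()
        # Return the URI and id of the first hit
        first = data['search'][0]
        return [first['concepturi'], first['id']]
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # request failed, response was not valid JSON, or there were no hits
        return 'id-less'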


In [251]:
#create a dict for each entity and store the concepturi as the entity link
docentities = dict()
cntr = 0  # running row index for docentities; updated inside make_dict
types = ['PERSON', 'ORG', 'WORK_OF_ART', 'GPE', 'EVENT', 'NORP']

def make_dict(row):
    global cntr
    doc = nlp(row['summary'])
    for ent in doc.ents:
        if ent.label_ in types:
            result = call_wiki_api(ent)
            if result != 'id-less':
                concepturi = result[0]
                ID = result[1]
            else:
                concepturi = 'id-less'
                ID = 'id-less'
            docentities[cntr] = [str(ent), concepturi, ID, row['person']]
            cntr = cntr + 1

<h2> Adding Public Ivy School information from Wikidata </h2>
The next step is to retrieve data about which universities are considered Public Ivy Schools.
We already know that the University of California, Berkeley is one of them. Read the Wikidata page https://www.wikidata.org/wiki/Q168756. You'll see that it's an instance of the concept Original Public Ivy.

What we need to do is retrieve all instances of Original Public Ivy. We've done this with a query, which you can try out at: https://w.wiki/6DJb. The results from the query are stored as a taxonomy in `wikidatataxonomy.ttl`.

During the Semantic Web 101 part we explained that taxonomies are hierarchical models, meaning that we use relations which define a hierarchy. We call these broader and narrower relations. If you look in the `wikidatataxonomy.ttl` file you'll see that we used the skos:broader relation to define the hierarchy of the Public Ivy schools. We also used the relation skos:altLabel to describe the alternative names of Public Ivy School: Original Public Ivy or Public Ivy.
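
As a rough mental model of what these two relations give you, the sketch below mimics a handful of SKOS statements with plain Python dictionaries. The data is typed in by hand for illustration and is not read from `wikidatataxonomy.ttl`; the function names are hypothetical.

```python
# Hand-written stand-in for a few skos:broader statements (child -> broader concept)
broader = {
    "University of California, Berkeley": "Public Ivy School",
    "University of Virginia": "Public Ivy School",
}

# Hand-written stand-in for skos:altLabel statements (preferred label -> alternatives)
alt_labels = {"Public Ivy School": ["Original Public Ivy", "Public Ivy"]}

def resolve_label(label):
    """Map an alternative label to its preferred label (unknown labels pass through)."""
    for pref, alts in alt_labels.items():
        if label == pref or label in alts:
            return pref
    return label

def narrower_concepts(label):
    """Return all concepts whose broader concept is the given (alt or pref) label."""
    target = resolve_label(label)
    return sorted(child for child, parent in broader.items() if parent == target)

print(narrower_concepts("Public Ivy"))
```

Because the alternative label is resolved first, searching for "Public Ivy" or "Original Public Ivy" finds the same schools as searching for "Public Ivy School".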

<h2> Storing all data in a graph database </h2>
After applying entity linking and retrieving the Public Ivy School information we now have three data sets:

* the summaries: assignment1_data.xlsx (_documents_ in `docs-ents-tax.png`)
* the linked entities: entitylinkingdata.csv (_entities_ in `docs-ents-tax.png`)
* the taxonomy: wikidatataxonomy.ttl (_taxonomy_ in `docs-ents-tax.png`)

We will use a graph database to store the data. 

<i> graph database </i>

    A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it.
    more on https://neo4j.com/developer/graph-database/
    
For this demo we have populated two sandboxes of the graph database _neo4j_:
* one sandbox only contains the summaries and the linked entities (see `docs-ents.png`)
* the other sandbox is enriched with the taxonomy (see `docs-ents-tax.png`)

Neo4j uses its own query language called Cypher and uses node labels to distinguish the different nodes in the database. 

    We will not go too much into detail about neo4j. Just see it as a simple way to store the data for now. For the semantic data management colleagues: if you're interested in the difference between triple stores and property graphs, check out `property-graphs-vs-triple-stores.png`.

We have created three node labels:
* Article for the summaries 
* Entity for the linked entities 
* Concept for the taxonomy data. 

The Article and Entity nodes have a relation called `refers_to`, which is the entity link. So, when you're looking for a certain keyword, the query goes through the labels of the Entity nodes and you can return the relevant article through the relation (or edge) `refers_to`. Check `neo4j-graph.png` to see how these nodes are related. Compare the named entities from the displacy output below with the neo4j-graph image. Remember that we've applied a filter, which is why some named entities aren't shown in the neo4j image. The red nodes are your Entities and the purple nodes are the Articles.


In [180]:
spacy.displacy.render(docs[3], style='ent')

<h2> Querying the graph database </h2>
The cells below will query both sandboxes and look for summaries with a relation with Public Ivy Schools.

#extra: if you want to experiment a bit, you can use Pypher https://neo4j.com/blog/express-cypher-queries-pure-python-pypher/ or check out the manuals:
* https://neo4j.com/docs/cypher-manual/current/clauses/match/
* https://neo4j.com/docs/cypher-manual/current/syntax/patterns/

In [236]:
#import neo4j package
from neo4j import GraphDatabase

In [237]:
#setup credentials and neo4j driver for sandbox 1 with taxonomy
uri1 = "bolt://44.211.205.216:7687"
pwd1 = "cells-junction-ball"
driver1 = GraphDatabase.driver(uri1, auth=("neo4j", pwd1))

#setup credentials and neo4j driver for sandbox 2 without taxonomy
uri2 = "bolt://3.234.224.15:7687"
pwd2 = "loss-fifties-rim"
driver2 = GraphDatabase.driver(uri2, auth=("neo4j", pwd2))


Now that we have set up the connections to the sandboxes we can query the data. Below you'll find a few Cypher queries.  

In [239]:
#see df to remind you of the data in summary (with node label Article)
df

Unnamed: 0,person,summary
0,Jennifer Doudna,"Jennifer Anne Doudna (; born February 19, 196..."
1,Rachel Carson,"Rachel Louise Carson (May 27, 1907 – April 14,..."
2,Rosalind Franklin,Rosalind Elsie Franklin (25 July 1920 – 16 Apr...
3,Gertrude Elion,"Gertrude ""Trudy"" Belle Elion (January 23, 1918..."
4,Rita Levi-Montalcini,Rita Levi-Montalcini (22 April 1909 – 30 Decem...


Let's start with a basic query to count the number of nodes in the sandbox databases.

In [255]:
query = "MATCH (n) RETURN COUNT(n)" 
with driver2.session() as session:
    result = session.run(query)
    print(f'Sandbox without taxonomy: {result.single()["COUNT(n)"]} nodes')
    
with driver1.session() as session:
    result = session.run(query)
    print(f'Sandbox with taxonomy: {result.single()["COUNT(n)"]} nodes')

Sandbox without taxonomy: 85 nodes
Sandbox with taxonomy: 94 nodes


The advantage of using a graph database is that you can query the data without actually knowing the structure of the database. All you need to know are the labels of the nodes you are looking for. In this case you want to know if a Concept node with the name Public Ivy School has any relation to an Article node.

Cypher makes this possible by using `<-[*]-` in the query (see https://neo4j.com/docs/cypher-manual/current/syntax/patterns/).


Let's try it first with the sandbox connection without the taxonomy (driver2) and then try it with the sandbox with the taxonomy (driver1)

In [240]:
query1 = """MATCH (a:Concept
WHERE a.prefLabel = 'Public Ivy School')<-[*]-(b:Article)
RETURN b.title AS title"""
with driver2.session() as session:
    result = session.run(query1)
    for r in result:
        print(f"The related person is: {r['title']}.")
        

In [241]:
with driver1.session() as session:
    result = session.run(query1)
    for r in result:
        print(f"The related person is: {r['title']}.")

The related person is: Jennifer Doudna.
The related person is: Jennifer Doudna.


You'll see that the sandbox without the taxonomy doesn't return any results, so it behaves similarly to the lemmatization approach in assignment 1.

The second sandbox does return a result and it shows that Jennifer Doudna's summary has a relation with the concept Public Ivy School.


Another way to query the data is to look at the data structure beforehand and build your query based on the data structure. 


_(disclaimer: there are two Berkeley nodes in the image and this has to do with some issues with duplicates while populating the database. Please ignore it.)_

Open `neo4j-enriched.png`. This is the part of the graph that you want to retrieve with your query. 
* The blue nodes are the Concepts from your taxonomy.
* The red nodes are your Entities
* The purple node is the Article


The labels of the nodes are not clear unfortunately because it uses the Wikidata IDs. However if you look in the taxonomy file you'll see that the node labeled 'Q20971972' is the Public Ivy School node. According to our taxonomy there are seven Public Ivy Schools. 
When you look at the image you'll see that there are quite a few edges between the Public Ivy School node and the Article node named Jennifer Doudna. 




The next Cypher query  does the following:
* it looks for a Concept node which contains the string Public Ivy School
* it looks for a connected Concept node
* it looks for a connected Entity node
* it looks for a connected Article node
* it returns
    * the keywords label
    * the related concept label
    * the related named entity
    * the person mentioned in the article
    * the body of the article

If you want, you can try writing the Cypher query yourself, or try changing the WHERE clause by adding altLabels to it or by using CONTAINS instead of =.
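
One possible shape for such a variation is sketched below. It is an untested suggestion rather than the query used in this demo: it assumes that Concept nodes store alternative names in a single string property called `altLabel`, which the notebook does not confirm, and the names `QUERY_CONTAINS` and `find_people` are hypothetical. The keyword is passed as a query parameter (`$keyword`) instead of being pasted into the query string.

```python
# Hypothetical variant: match prefLabel OR altLabel with CONTAINS, case-insensitively.
# ASSUMPTION: Concept nodes have a string property `altLabel` (not confirmed here).
QUERY_CONTAINS = """
MATCH (n:Concept)<--(c:Concept)<--(e:Entity)<--(a:Article)
WHERE toLower(n.prefLabel) CONTAINS toLower($keyword)
   OR toLower(coalesce(n.altLabel, '')) CONTAINS toLower($keyword)
RETURN DISTINCT a.title AS person, e.title AS relatedNamedEntity
"""

def find_people(driver, keyword):
    """Run the query on a neo4j driver and return (person, entity) pairs."""
    with driver.session() as session:
        result = session.run(QUERY_CONTAINS, keyword=keyword)
        return [(r["person"], r["relatedNamedEntity"]) for r in result]
```

With the enriched sandbox this could be called as `find_people(driver1, "public ivy")`. The `DISTINCT` keyword is there to collapse duplicate rows such as the ones caused by the duplicated Berkeley node.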

In [242]:
query2 = """
    MATCH (n:Concept
        WHERE n.prefLabel = 'Public Ivy School')
                        <--(c:Concept)
                        <--(e:Entity)
                        <--(a:Article)
RETURN n.prefLabel as keywordConcept,
        c.prefLabel as relatedKeywordConcept,
        e.title AS relatedNamedEntity,
        a.title AS person,
        a.body AS summary""" 

In [243]:
with driver1.session() as session:
    result = session.run(query2)
    for r in result:
        print(f"The related person is: {r['person']} and the linked entity match is: {r['relatedNamedEntity']}.")


The related person is: Jennifer Doudna and the linked entity match is: Berkeley.
The related person is: Jennifer Doudna and the linked entity match is: Berkeley.


<h2> Discussion</h2>

* We've seen different ways to query the data. One query is flexible in a way that you don't need much information about the edges in the database and the other makes the output of your query more understandable. Which query would you preferably use in an application for a client and why? Or would you use a different type of query?

* We've also seen different ways to apply search and a simplified version of semantic search where we used a taxonomy. How do you store and access your NLP (or ML) model results now? In which parts of your work would a taxonomy help you? Could you use this in other data science domains?

* For our semantic data management colleagues: how would you use NLP in your taxonomy (or ontology) engineering process?

* The fact that the University of California, Berkeley was mentioned in the summary of Jennifer Doudna made it possible to answer the question. However, you can see that the NER didn't recognize it as a single entity. It split it into two different entities, and luckily Berkeley was linked to the correct Wikidata ID. How would you improve the entity linking process?

* Another point to mention: you'll only know for sure whether Jennifer Doudna worked at a Public Ivy School once you've read the summary. You'll need more information to know whether she worked there or studied there. Think of ways to process or store the data in order to increase the credibility of your answer. 

