# Self study 9

## Libraries
For this task you will need the RDFLib libraries. You can find the documentation at this [link](https://rdflib.readthedocs.io/en/stable/). After installing it, you can execute the following command.

In [65]:
from rdflib import Graph
from rdflib.term import BNode, Literal, URIRef
from rdflib.namespace import RDF

## Tasks
In this task you are expected to collect some RDF graphs from web and merge them into a local RDF graph. 

We will use RDF graphs from DBLP, a web site describing bibliographic data about scientific article in computer science. Each researcher has an associated DBLP person ID. For example, the person ID of Daniele is 57/8026. The code can be used to create Daniele's IRI: [http://dblp.org/pid/57/8026](http://dblp.org/pid/57/8026). Starting from here, it is possible to access different representations of the resource by adding a file extension. For example:

* By adding ".html", one can access the HTML page describing Daniele: [http://dblp.org/pid/57/8026.html](http://dblp.org/pid/57/8026.html)
* By adding ".rdf", one can access the XML/RDF page describing Daniele: [http://dblp.org/pid/57/8026.rdf](http://dblp.org/pid/57/8026.rdf)
* By adding ".ttl", one can access the Turtle page describing Daniele: [http://dblp.org/pid/57/8026.ttl](http://dblp.org/pid/57/8026.ttl)

### Task 1
The code below constructs a simple crawler to start from a person and collect all articles written by that person or their coauthors (i.e., only using two types of links: authorOf and coCreatorWith). To fulfill the task you should extend it as follows:

* Select about 5 teachers or project supervisors you have worked with. 
* Identify their DBLP personal identifiers, and consequently their IRIs.
* Collect the RDF describing them and combine them in a RDF graph
* Analyse the graph (i.e. iterate over the graph and check its content) to answer the following questions:
    * How many articles have they authored?
    * Who is the person who co-authored with more people?
    * What are their Google Scholar IDs?

### Task 2
Starting from the IRIs of the people you used in Task 1, create a co-author network, i.e., a graph describing who wrote papers with whom. In this graph, the nodes represent people, and the edges indicate that the two nodes wrote an article together. Create this graph as a knowledge graph, using the Graph class from RDFLib. 

In detail:

* Use the profiles you used in Task 1 as seeds, and collect their RDF graphs.
* Populate the co-author network with the information about the co-authorships. For each author, store its IRI and the Literal with the name of the author.
* Develop a strategy to decide the next person to crawl.
* Stop when the co-authors network contains ca. 1000 authors. 

### Code snippets
Feel free to reuse the following code to solve the above tasks.

This function retrieves a graph from DBLP, given the IRI.

In [66]:
#used code snippets:
def get_graph(iri: str) -> Graph:
  g = Graph()
  g.parse(iri + '.nt', format='nt')
  return g

#Task 1: 
#Select about 5 teachers or project supervisors you have worked with. 
#Selected: Daniele Dell'Aglio, Thomas D. Nielsen, Manfred Jaeger, Christian Schilling, Hans Hüttel

#Identify their DBLP personal identifiers, and consequently their IRIs.
DBLPids = ["57/8026", "23/1643", "50/4079", "72/2103-1", "25/3348"]
iris = [{'link': "https://dblp.org/pid/57/8026", 'name': "Daniele Dell'Aglio"}, 
        {'link': "https://dblp.org/pid/23/1643", 'name': "Thomas D. Nielsen"},
        {'link': "https://dblp.org/pid/50/4079", 'name': "Manfred Jaeger"},
        {'link': "https://dblp.org/pid/72/2103-1", 'name': "Christian Schilling"},
        {'link': "https://dblp.org/pid/25/3348", 'name': "Daniele Dell'Aglio"}
       ]

#* Collect the RDF describing them and combine them in a RDF graph
RDFgraph = Graph()

for iri in iris:
  for s, p, o in get_graph(iri['link']):
    RDFgraph.add((s, p, o,))

#* Analyse the graph (i.e. iterate over the graph and check its content) to answer the following questions:
#    * How many articles have they authored?
#    * Who is the person who co-authored with more people?
authorOfOccurences = 0
coCreatorWithOccurences = 0

for s, p, o in RDFgraph:
  if ("authorOf" in p):
    authorOfOccurences += 1
  elif ("coCreatorWith" in p):
    coCreatorWithOccurences += 1

print(f"authorOf: {authorOfOccurences}, coCreatorWith: {coCreatorWithOccurences}")

#    * What are their Google Scholar IDs?
## NOT SOLVED

authorOf: 372, coCreatorWith: 379


In [84]:
### Task 2
#Use the profiles you used in Task 1 as seeds, and collect their RDF graphs.
#done above, defined as RDFgraph

#Populate the co-author network with the information about the co-authorships.
#For each author, store its IRI and the Literal with the name of the author.
#Develop a strategy to decide the next person to crawl.
  #Strategy: take 3 unique IRIs from object in coCreatorWith and goto next iri in list.
#Stop when the co-authors network contains ca. 1000 authors.

def insertGraph(iri):
  for s, p, o in get_graph(iri):
    RDFgraph.add((s, p, o,))

def getAllIriLinks():
  return [iri['link'] for iri in iris]

SearchDepthMax = 3
coAuthorsMax = 1000
coAuthorsCount = 0
allIriLinks = [iri['link'] for iri in iris]
nextIndex = 0

import time

while(coAuthorsCount < coAuthorsMax):
  CoAuthorsSearched = 0
  for s, p, o in RDFgraph:
    if (str(s) == allIriLinks[nextIndex] and "coCreatorWith" in p):
      if(o in allIriLinks): continue
      
      iris.append({'link': o, 'name': ""}) #crawl to get name like selfstudy 1 maybe?
      insertGraph(o)

      coAuthorsCount += 1
      CoAuthorsSearched += 1
      allIriLinks = getAllIriLinks()
      print(f"added {o}, depth: {CoAuthorsSearched}/{SearchDepthMax} {coAuthorsCount}/{coAuthorsMax} {allIriLinks}")
      if (CoAuthorsSearched == int(SearchDepthMax)): 
        nextIndex += 1
        break
      time.sleep(0.3)



added https://dblp.org/pid/98/2962, depth: 1/3 1/1000 ['https://dblp.org/pid/57/8026', 'https://dblp.org/pid/23/1643', 'https://dblp.org/pid/50/4079', 'https://dblp.org/pid/72/2103-1', 'https://dblp.org/pid/25/3348', rdflib.term.URIRef('https://dblp.org/pid/02/6723'), rdflib.term.URIRef('https://dblp.org/pid/206/3081'), rdflib.term.URIRef('https://dblp.org/pid/281/2601'), rdflib.term.URIRef('https://dblp.org/pid/s/JiriSrba'), rdflib.term.URIRef('https://dblp.org/pid/30/4476'), rdflib.term.URIRef('https://dblp.org/pid/234/2748'), rdflib.term.URIRef('https://dblp.org/pid/70/2802'), rdflib.term.URIRef('https://dblp.org/pid/27/10279'), rdflib.term.URIRef('https://dblp.org/pid/h/KatjaHose'), rdflib.term.URIRef('https://dblp.org/pid/254/2048'), rdflib.term.URIRef('https://dblp.org/pid/139/4850'), rdflib.term.URIRef('https://dblp.org/pid/16/5524'), rdflib.term.URIRef('https://dblp.org/pid/291/2481'), rdflib.term.URIRef('https://dblp.org/pid/132/2073'), rdflib.term.URIRef('https://dblp.org/pid

KeyboardInterrupt: 

In [8]:
def get_graph(iri: str) -> Graph:
    g = Graph()
    g.parse(iri + '.nt', format='nt')
    return g

This example uses the IRI of Daniele in DBLP and it retrieves the graph describing him, his publications and his collaborators.

In [4]:
daniele_iri = "https://dblp.org/pid/57/8026"
daniele_graph = get_graph(daniele_iri)

The following code counts the statements stored in the graph.

In [5]:
i = 0
for s, p, o in daniele_graph:
    i += 1

print(f"Number of triples: {i}")

Number of triples: 221


The following code prints the statements where the object is a literal

In [20]:
for s, p, o in daniele_graph:
    if isinstance(o, Literal):
        print(s, p, o)

https://dblp.org/pid/57/8026.nt http://purl.org/dc/terms/modified 2023-09-27T22:44:29+0200
N4004fbb370d6417c9467c19dd682df77 http://purl.org/spar/literal/hasLiteralValue dandellaglio
https://dblp.org/pid/57/8026 https://dblp.org/rdf/schema#creatorName Daniele Dell'Aglio
N3bb637c1c7584c0a9c6637faef72b3af http://purl.org/spar/literal/hasLiteralValue 81485655252
N57a8d2a030b044b7a8cc30e3f8957ec7 http://purl.org/spar/literal/hasLiteralValue 1217255869
Nabfecf9fb9fe4eb093d32243923107ac http://purl.org/spar/literal/hasLiteralValue 0000-0003-4904-2511
https://dblp.org/pid/57/8026.nt http://www.w3.org/2000/01/rdf-schema#label provenance information for RDF data of dblp person '57/8026'
Nf8e9e4686d0c4345ae3b79474730fcf2 http://purl.org/spar/literal/hasLiteralValue 57/8026
https://dblp.org/pid/57/8026 https://dblp.org/rdf/schema#primaryCreatorName Daniele Dell'Aglio
N194e1780af124c04b2ae169880d32f05 http://purl.org/spar/literal/hasLiteralValue 3I26zx0AAAAJ
https://dblp.org/pid/57/8026 http://www

This code prints the statements having rdf:type as predicate

In [7]:
for s, p, o in daniele_graph.triples((None, RDF.type, None)):
    print(s, p, o)

https://dblp.org/pid/57/8026 http://www.w3.org/1999/02/22-rdf-syntax-ns#type https://dblp.org/rdf/schema#Person
https://dblp.org/pid/57/8026 http://www.w3.org/1999/02/22-rdf-syntax-ns#type https://dblp.org/rdf/schema#Creator
Nf8e9e4686d0c4345ae3b79474730fcf2 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/PersonalIdentifier
N194e1780af124c04b2ae169880d32f05 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/PersonalIdentifier
N3bb637c1c7584c0a9c6637faef72b3af http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/PersonalIdentifier
N4004fbb370d6417c9467c19dd682df77 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/PersonalIdentifier
Nabfecf9fb9fe4eb093d32243923107ac http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/PersonalIdentifier
Nf5ca850a0d324d8b8464d0d5c3f789b5 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/spar/datacite/Identifi

More information on how to navigate graphs is available [here](https://rdflib.readthedocs.io/en/stable/intro_to_graphs.html).