<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/Lord_of_the_wikidata/Part1%20Importing%20Wikidata%20into%20Neo4j%20and%20analyzing%20family%20trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Updated to GDS 2.0 version
* Link to original blog post: https://towardsdatascience.com/lord-of-the-wiki-ring-importing-wikidata-into-neo4j-and-analyzing-family-trees-da27f64d675e

In [1]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-4.4.2.tar.gz (89 kB)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-4.4.2-py3-none-any.whl size=115365 sha256=1263c50dc3a5bf370b9d96525936014a00881f6b44f5d41da992312297966fac
  Stored in directory: /root/.cache/pip/wheels/10/d6/28/950

I recommend you set up a [blank project in the Neo4j Sandbox environment](https://sandbox.neo4j.com/?usecase=blank-sandbox), but you can also use other Neo4j environments.

In [2]:
import pandas as pd
# Define Neo4j connections
from neo4j import GraphDatabase
host = 'bolt://3.235.2.228:7687'
user = 'neo4j'
password = 'seats-drunks-carbon'
driver = GraphDatabase.driver(host,auth=(user, password))

In [20]:
# Import libraries
import pandas as pd

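def run_query(query, params=None):
    # Run a Cypher query and return the result as a pandas DataFrame
    with driver.session() as session:
        # Avoid a mutable default argument; fall back to an empty dict
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())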

In [25]:
# Fix default timeout query setting in Sandbox

run_query("""
CALL dbms.setConfigValue('dbms.transaction.timeout','0')
""")

## Agenda

* Import WikiData data to Neo4j
* Basic graph exploration
* Populate missing values
* Some more graph exploration
* Weakly connected component
* Betweenness centrality

We have been using simple graph schemas for quite some time now. I am delighted to say that this time we have a slightly more complicated schema. The graph schema revolves around the characters in the LOTR world. A character can be a relative, father, mother, enemy, spouse, or sibling of another character. This represents a social network of characters with multiple types of relationships. We also have additional information about characters such as their race, country, and language. On top of that, we also know if they are part of any group or have participated in any event.

## WikiData import

As mentioned, we will fetch the data from the WikiData API with the help of the apoc.load.jsonParams procedure. If you don't know it yet, APOC provides great support for importing data into Neo4j. Besides the ability to fetch data from any REST API, it also features integrations with other databases such as MongoDB or relational databases via the JDBC driver.

P.S. You should check out the neosemantics library if you work a lot with RDF data. I only noticed it after I had written the post.

We will start by importing all the races in the LOTR world. I have to admit I am a total noob when it comes to SPARQL, so I won't be explaining the syntax in depth. If you need a basic introduction on how to query WikiData, I suggest this tutorial on YouTube. Basically, all the races in the LOTR world are instances of the Middle-earth races entity with the id Q989255. To get the instances of a specific entity, we use the following SPARQL clause:

<code>?item wdt:P31 wd:Q989255</code>

This can be translated as "We would like to fetch an item that is an instance of (wdt:P31) the entity with the id Q989255". After we have downloaded the data with APOC, we store the results in Neo4j.
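For readers who prefer to see the request from the Python side, here is a minimal sketch of how the SPARQL clause could be wrapped into a request URL and how a single result binding could be unpacked. The helper names (`build_instance_query`, `build_wikidata_url`, `parse_binding`) and the sample binding with its placeholder id are my assumptions for illustration; only the endpoint, the query text, and the `split('/')[-1]` id extraction mirror the Cypher import query. No request is actually sent.

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_instance_query(entity_id):
    """Return a SPARQL query selecting all instances of (wdt:P31) the given entity."""
    return (
        "SELECT ?item ?itemLabel WHERE { "
        f"?item wdt:P31 wd:{entity_id} . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } }'
    )


def build_wikidata_url(sparql):
    """URL-encode the SPARQL query and attach it to the endpoint."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": sparql})


def parse_binding(row):
    """Extract label, url, and Wikidata id from one result binding."""
    url = row["item"]["value"]
    return {"label": row["itemLabel"]["value"], "url": url, "id": url.split("/")[-1]}


# Hypothetical binding shaped like the SPARQL JSON results format;
# the entity id is a placeholder, not a real lookup
sample_row = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1234567"},
    "itemLabel": {"type": "literal", "value": "Hobbit"},
}

url = build_wikidata_url(build_instance_query("Q989255"))
parsed = parse_binding(sample_row)
print(url)
print(parsed)
```

Encoding the query explicitly avoids problems with spaces and quotes in the URL, which is one reason to build the request this way when calling the endpoint outside of APOC.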

In [21]:
import_races_query = """

// Prepare a SPARQL query 
WITH 'SELECT ?item ?itemLabel WHERE{ ?item wdt:P31 wd:Q989255 . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}' AS sparql 
// make a request to Wikidata
CALL apoc.load.jsonParams('https://query.wikidata.org/sparql?query=' + 
                           sparql, 
                         { Accept: "application/sparql-results+json"}, null) 
YIELD value 
// Unwind results to row 
UNWIND value['results']['bindings'] as row 
// Prepare data 
WITH row['itemLabel']['value'] as race, 
     row['item']['value'] as url, 
     split(row['item']['value'],'/')[-1] as id 
// Store to Neo4j 
CREATE (r:Race) SET r.race = race, 
                    r.url = url, 
                    r.id = id

"""

r = run_query(import_races_query)

That was easy. The next step is to fetch the characters that are an instance of a given Middle-earth race. The SPARQL syntax is almost identical to the previous query, except this time we iterate over each race and find the characters that are an instance of a given race.

In [22]:
import_characters_query = """

// Iterate over each race in graph
MATCH (r:Race)
// Prepare a SparQL query
WITH 'SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:' + r.id + ' . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } }' AS sparql, r 
// make a request to Wikidata 
CALL apoc.load.jsonParams( "https://query.wikidata.org/sparql?query=" + 
                            sparql, 
                            { Accept: "application/sparql-results+json"}, null)
YIELD value 
UNWIND value['results']['bindings'] as row 
WITH row['itemLabel']['value'] as name, 
     row['item']['value'] as url, 
     split(row['item']['value'],'/')[-1] as id, r 
// Store to Neo4j 
CREATE (c:Character) 
SET c.name = name, 
    c.url = url, 
    c.id = id 
CREATE (c)-[:BELONG_TO]->(r)

"""

r = run_query(import_characters_query)

Did you know that there are at least 700 characters in the Middle-earth world? I would never have guessed there would be so many documented characters on WikiData. Our first exploratory Cypher query will count the characters by race.

In [23]:
race_size_query = """

MATCH (r:Race) 
RETURN r.race as race, 
       size((r)<-[:BELONG_TO]-()) as members 
ORDER BY members DESC 
LIMIT 10

"""

run_query(race_size_query)

Unnamed: 0,race,members
0,men in Tolkien's legendarium,345
1,Hobbit,150
2,Middle-earth elf,83
3,dwarves in Tolkien's legendarium,52
4,valar,16
5,half-elven,12
6,Maiar,10
7,Orcs in Tolkien's legendarium,9
8,Ent,5
9,dragons of Middle-earth,4


The Fellowship of the Ring group is a somewhat representative sample of the races in Middle-earth. Most of the characters are either humans or hobbits, with a couple of elves and dwarves strolling by. This is the first time I have heard of the Valar and Maiar races though.

Now it is time to enrich the graph with information about characters' gender, country, and manner of death. The SPARQL query will be a bit different than before. This time we will select a WikiData entity directly by its unique id and optionally fetch some of its properties. We can filter a specific entity by its id using the following SPARQL clause:

<code>filter (?item = wd:' + r.id + ')</code>

Similar to the Cypher query language, which differentiates between MATCH and OPTIONAL MATCH, SPARQL differentiates between required patterns and OPTIONAL patterns. When we want to return multiple properties of an entity, it is best to wrap each property in an OPTIONAL clause. This way we still get a result when some of the properties are missing. Without OPTIONAL, we would only get results for entities where all three properties exist. This is identical to the behavior of Cypher.
<code>
OPTIONAL{ ?item wdt:P21 [rdfs:label ?gender] . 
           filter (lang(?gender)="en") }
</code>
The <code>wdt:P21</code> indicates we are interested in the gender property.  We also specify that we want to get the English label of an entity instead of its WikiData id. The easiest way to search for the desired property id is to inspect the entity on the WikiData web page and hover over a property name.
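To make the difference between required and OPTIONAL patterns concrete, here is a small Python sketch that mimics the two behaviors on plain dictionaries. The entity names and property values are hypothetical, and the two functions are only an analogy for what the SPARQL engine does, not how it is implemented.

```python
# Hypothetical entities; a missing key means the property does not exist
entities = {
    "character_a": {"gender": "male", "country": "Gondor", "death": "homicide"},
    "character_b": {"gender": "female"},
    "character_c": {},
}
PROPERTIES = ["gender", "country", "death"]


def required_match(entities):
    """Keep only entities where every property exists (no OPTIONAL)."""
    return {
        name: props
        for name, props in entities.items()
        if all(p in props for p in PROPERTIES)
    }


def optional_match(entities):
    """Keep every entity; missing properties come back as None (OPTIONAL)."""
    return {
        name: {p: props.get(p) for p in PROPERTIES}
        for name, props in entities.items()
    }


required = required_match(entities)
optional = optional_match(entities)
print(required)
print(optional)
```

With required patterns only `character_a` survives, while the OPTIONAL variant returns all three entities with `None` in place of the missing values.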

Another way is to use the WikiData query editor, which has a great autocomplete function that you can trigger with CTRL+Space.

To store the results back to Neo4j we will use the <code>FOREACH</code> trick. Because some of our results will contain null values, we have to wrap the <code>MERGE</code> statement in a <code>FOREACH</code> statement, which supports conditional execution. Check the Tips and tricks blog post by Michael Hunger for more information.
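The idea behind the trick is that looping over a list with one element runs the body once, while looping over an empty list skips it. Here is a minimal Python analogy; the rows and the `conditional_list` helper are hypothetical, and the `merged` list merely stands in for the `MERGE` statements.

```python
def conditional_list(value):
    """Mimic `CASE WHEN value IS NOT NULL THEN [1] ELSE [] END`."""
    return [1] if value is not None else []


# Hypothetical rows; the second one has no country
rows = [
    {"name": "character_a", "country": "Gondor"},
    {"name": "character_b", "country": None},
]

merged = []  # stands in for the MERGE statements inside FOREACH
for row in rows:
    # The loop body runs once when the country exists and not at all otherwise
    for _ in conditional_list(row["country"]):
        merged.append((row["name"], "IN_COUNTRY", row["country"]))

print(merged)
```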

In [26]:
import_gender_query = """

// Iterate over characters 
MATCH (r:Character) 
// Prepare a SparQL query 
WITH 'SELECT * WHERE{ ?item rdfs:label ?name . filter (?item = wd:' + r.id + ') filter (lang(?name) = "en" ) . ' +
     'OPTIONAL{ ?item wdt:P21 [rdfs:label ?gender] . filter (lang(?gender)="en") } ' + 
     'OPTIONAL{ ?item wdt:P27 [rdfs:label ?country] . filter (lang(?country)="en") } ' +
     'OPTIONAL{ ?item wdt:P1196 [rdfs:label ?death] . filter (lang(?death)="en") }}' AS sparql, r 
// make a request to Wikidata 
CALL apoc.load.jsonParams( "https://query.wikidata.org/sparql?query=" 
    + sparql, 
    { Accept: "application/sparql-results+json"}, null)
YIELD value 
UNWIND value['results']['bindings'] as row 
SET r.gender = row['gender']['value'], 
    r.manner_of_death = row['death']['value'] 
// Execute FOREACH statement 
FOREACH(ignoreme in case when row['country'] is not null then [1] else [] end | 
    MERGE (c:Country{name:row['country']['value']}) 
    MERGE (r)-[:IN_COUNTRY]->(c))

"""

r = run_query(import_gender_query)

We are connecting additional information to our graph bit by bit and slowly transforming it into a knowledge graph. Let's first look at the manner of death property.

In [27]:
manner_of_death_query = """

MATCH (n:Character) 
WHERE exists (n.manner_of_death) 
RETURN n.manner_of_death as manner_of_death, 
       count(*) as count

"""

run_query(manner_of_death_query)

Unnamed: 0,manner_of_death,count
0,homicide,3
1,death in battle,1
2,accident,1


Nothing of interest. This is obviously not the Game of Thrones series. Let's also inspect the results of the country property.

In [28]:
country_info_query = """

MATCH (c:Country)
RETURN c.name as country, 
       size((c)<-[:IN_COUNTRY]-()) as members
ORDER BY members DESC 
LIMIT 10

"""

run_query(country_info_query)

Unnamed: 0,country,members
0,Gondor,70
1,Shire,48
2,Rohan,34
3,Númenor,34
4,Arthedain,16
5,Arnor,8
6,Doriath,5
7,Reunited Kingdom,3
8,Lothlórien,3
9,Gondolin,3


We have the country information for 236 characters. We could make some hypotheses and try to populate missing country values. Let's assume that if two characters are siblings, they belong to the same country. This makes a lot of sense. To do this, we have to import the familial ties from WikiData. Specifically, we will fetch the father, mother, relative, sibling, and spouse connections.

In [30]:
import_social_query = """

// Iterate over characters 
MATCH (r:Character) 
WITH 'SELECT * WHERE{ ?item rdfs:label ?name . filter (?item = wd:' + r.id + ') filter (lang(?name) = "en" ) . ' + 
     'OPTIONAL{ ?item wdt:P22 ?father } OPTIONAL{ ?item wdt:P25 ?mother } OPTIONAL{ ?item wdt:P1038 ?relative } ' +
     'OPTIONAL{ ?item wdt:P3373 ?sibling } OPTIONAL{ ?item wdt:P26 ?spouse }}' AS sparql, r 
// make a request to wikidata 
CALL apoc.load.jsonParams( "https://query.wikidata.org/sparql?query=" + 
    sparql, 
    { Accept: "application/sparql-results+json"}, null) YIELD value 
UNWIND value['results']['bindings'] as row 
FOREACH(ignoreme in case when row['mother'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['mother']['value']}) 
    MERGE (r)-[:HAS_MOTHER]->(c)) 
FOREACH(ignoreme in case when row['father'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['father']['value']}) 
    MERGE (r)-[:HAS_FATHER]->(c)) 
FOREACH(ignoreme in case when row['relative'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['relative']['value']}) 
    MERGE (r)-[:HAS_RELATIVE]-(c)) 
FOREACH(ignoreme in case when row['sibling'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['sibling']['value']}) 
    MERGE (r)-[:SIBLING]-(c))
FOREACH(ignoreme in case when row['spouse'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['spouse']['value']}) 
    MERGE (r)-[:SPOUSE]-(c))

"""

r = run_query(import_social_query)

Before we begin filling in missing values, let's check for promiscuity in Middle-earth. The first query will search for characters with multiple spouses.

In [33]:
multiple_spouses_query = """

MATCH p=(a)-[:SPOUSE]-(b)-[:SPOUSE]-(c) 
RETURN [n IN nodes(p) | n.name] AS result LIMIT 10

"""

run_query(multiple_spouses_query)

Unnamed: 0,result
0,"[Indis, Finwë, Míriel]"
1,"[Míriel, Finwë, Indis]"


We actually found a single character with two spouses. It is Finwë, the first King of the Noldor. We can also check whether someone has kids with multiple partners.

In [34]:
multiple_kids_query = """

MATCH (c:Character)<-[:HAS_FATHER|HAS_MOTHER]-()-[:HAS_FATHER|HAS_MOTHER]->(other) 
WITH c, collect(distinct other) as others 
WHERE size(others) > 1 
MATCH p=(c)<-[:HAS_FATHER|HAS_MOTHER]-()-[:HAS_FATHER|HAS_MOTHER]->() 
RETURN [n IN nodes(p) | n.name] AS result LIMIT 10

"""

run_query(multiple_kids_query)

Unnamed: 0,result
0,"[Finwë, Fingolfin, Indis]"
1,"[Finwë, Finarfin, Indis]"
2,"[Finwë, Findis, Indis]"
3,"[Finwë, Irimë, Indis]"
4,"[Finwë, Fëanor, Míriel]"


So it seems that Finwë has four children with Indis and a single child with Míriel. On the other hand, it is quite weird that Beren has two fathers. I guess Adanel has some explaining to do. We would probably find more death and promiscuity in the GoT world.

## Populate missing values

Now that we know that the Middle-earth characters abstain from promiscuity, let's populate the missing country values. Remember our hypothesis was:

>If two characters are siblings, they belong to the same country.

Before we populate the missing values for countries, let's populate the missing values for siblings. We will assume that if two characters have the same mother or father, they are siblings. Let's look at some sibling candidates.
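The inference rule can be sketched in plain Python before running it against the graph. The data below is hypothetical, and `sibling_candidates` is only an illustration of the rule "same mother or father implies sibling"; the actual work is done by the Cypher queries.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (child, parent) pairs standing in for HAS_FATHER / HAS_MOTHER
parent_links = [
    ("child_a", "parent_x"),
    ("child_b", "parent_x"),
    ("child_c", "parent_x"),
    ("child_c", "parent_y"),
    ("child_d", "parent_y"),
]
# Sibling links that already exist in the data
known_siblings = {frozenset(("child_a", "child_b"))}


def sibling_candidates(parent_links, known_siblings):
    """Return pairs that share a parent but have no sibling link yet."""
    children_of = defaultdict(set)
    for child, parent in parent_links:
        children_of[parent].add(child)
    candidates = set()
    for children in children_of.values():
        for a, b in combinations(sorted(children), 2):
            pair = frozenset((a, b))
            if pair not in known_siblings:
                candidates.add(pair)
    return candidates


candidates = sibling_candidates(parent_links, known_siblings)
print(sorted(sorted(pair) for pair in candidates))
```

Using `frozenset` for a pair makes the link undirected, which matches the undirected `SIBLING` pattern used in the queries.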

In [35]:
sibling_candidate_query = """

MATCH p=(a:Character)-[:HAS_FATHER|HAS_MOTHER]->()<-[:HAS_FATHER|HAS_MOTHER]-(b:Character) 
WHERE NOT (a)-[:SIBLING]-(b) 
RETURN [n IN nodes(p) | n.name] AS result LIMIT 10

"""

run_query(sibling_candidate_query)

Unnamed: 0,result
0,"[Ferumbras Took II, Isumbras Took III, Bandobr..."
1,"[Bingo Baggins, Laura Grubb, Bungo Baggins]"
2,"[Belba Baggins, Laura Grubb, Bungo Baggins]"
3,"[Linda Proudfoot, Laura Grubb, Bungo Baggins]"
4,"[Bingo Baggins, Mungo Baggins, Bungo Baggins]"
5,"[Linda Proudfoot, Mungo Baggins, Bungo Baggins]"
6,"[Belba Baggins, Mungo Baggins, Bungo Baggins]"
7,"[Hildigard Took, Gerontius Took, Isembard Took]"
8,"[Isengar Took, Gerontius Took, Isembard Took]"
9,"[Isengrim Took III, Gerontius Took, Isembard T..."


Adamanta Chubb has at least six children. Only two of them are marked as siblings. Because all of them are siblings by definition, we will fill in the missing connections.

In [36]:
sibling_populate_query = """

MATCH p=(a:Character)-[:HAS_FATHER|HAS_MOTHER]->()<-[:HAS_FATHER|HAS_MOTHER]-(b:Character) 
WHERE NOT (a)-[:SIBLING]-(b) 
MERGE (a)-[:SIBLING]-(b)

"""
run_query(sibling_populate_query)

The query added 118 missing relationships. I need to learn how to update the WikiData knowledge graph and add the missing relationships in bulk. Now we can fill in the missing country values for siblings. We will match all characters that have country information and search for their siblings that don't. I love how easy it is to express this pattern with the Cypher query language.

In [37]:
country_populate_query = """

MATCH (country)<-[:IN_COUNTRY]-(s:Character)-[:SIBLING]-(t:Character) 
WHERE NOT (t)-[:IN_COUNTRY]->() 
MERGE (t)-[:IN_COUNTRY]->(country)

"""
run_query(country_populate_query)

There were 49 missing countries added. We could easily come up with more hypotheses to fill in the missing values. You can try and maybe add some other missing values yourself.
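As a rough illustration of the sibling hypothesis, the following Python sketch copies countries from characters that have one to siblings that have none. The names and data are hypothetical, and the function only approximates the Cypher query with a single pass over the original data, so a country is not passed on through a chain of siblings.

```python
# Hypothetical data: known countries and undirected sibling pairs
country_of = {"sibling_a": {"Gondor"}, "sibling_d": {"Rohan"}}
sibling_pairs = [
    ("sibling_a", "sibling_b"),
    ("sibling_b", "sibling_c"),
    ("sibling_d", "sibling_e"),
]


def propagate_countries(country_of, sibling_pairs):
    """Copy countries to siblings that have none, using only the original data."""
    inferred = {}
    for s, t in sibling_pairs:
        # SIBLING is undirected, so check both directions
        for source, target in ((s, t), (t, s)):
            if source in country_of and target not in country_of:
                inferred.setdefault(target, set()).update(country_of[source])
    return inferred


inferred = propagate_countries(country_of, sibling_pairs)
print(inferred)
```

Note that `sibling_c` stays without a country here because its only sibling had no country to begin with. Running the step repeatedly would be one way to extend the hypothesis.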

We still have to add some information to our graph. In this query, we will add the information about the occupation, language, groups, and events of characters. The SPARQL query follows the same pattern as before, where we iterate over each character and fetch additional properties.

In [38]:
import_groups_query = """

MATCH (r:Character) 
WHERE exists (r.id) 
WITH 'SELECT * WHERE{ ?item rdfs:label ?name . filter (?item = wd:' + r.id + ') filter (lang(?name) = "en" ) . ' +
      'OPTIONAL { ?item wdt:P106 [rdfs:label ?occupation ] . filter (lang(?occupation) = "en" ). } ' +
      'OPTIONAL { ?item wdt:P103 [rdfs:label ?language ] . filter (lang(?language) = "en" ) . } ' +
      'OPTIONAL { ?item wdt:P463 [rdfs:label ?member_of ] . filter (lang(?member_of) = "en" ). } ' +
      'OPTIONAL { ?item wdt:P1344[rdfs:label ?participant ] . filter (lang(?participant) = "en") . } ' +
      'OPTIONAL { ?item wdt:P39[rdfs:label ?position ] . filter (lang(?position) = "en") . }}' AS sparql, r 
CALL apoc.load.jsonParams( "https://query.wikidata.org/sparql?query=" + 
                             sparql, 
                             { Accept: "application/sparql-results+json"}, null) 
YIELD value 
UNWIND value['results']['bindings'] as row 
FOREACH(ignoreme in case when row['language'] is not null then [1] else [] end | 
        MERGE (c:Language{name:row['language']['value']}) 
        MERGE (r)-[:HAS_LANGUAGE]->(c)) 
FOREACH(ignoreme in case when row['occupation'] is not null then [1] else [] end | 
        MERGE (c:Occupation{name:row['occupation']['value']}) 
        MERGE (r)-[:HAS_OCCUPATION]->(c)) 
FOREACH(ignoreme in case when row['member_of'] is not null then [1] else [] end | 
        MERGE (c:Group{name:row['member_of']['value']}) 
        MERGE (r)-[:MEMBER_OF]->(c)) 
FOREACH(ignoreme in case when row['participant'] is not null then [1] else [] end | 
        MERGE (c:Event{name:row['participant']['value']}) 
        MERGE (r)-[:PARTICIPATED]->(c)) 
SET r.position = row['position']['value']

"""
run_query(import_groups_query)

Let's investigate the results of the groups and the occupation of the characters.

In [39]:
investigate_groups_query = """

MATCH (n:Group)<-[:MEMBER_OF]-(c)
OPTIONAL MATCH (c)-[:HAS_OCCUPATION]->(o) 
RETURN n.name as group, 
       count(*) as size, 
       collect(c.name)[..3] as members, 
       collect(distinct o.name)[..3] as occupations 
ORDER BY size DESC

"""

run_query(investigate_groups_query)

Unnamed: 0,group,size,members,occupations
0,Thorin and Company,14,"[Bofur, Óin, Glóin]","[diarist, swordfighter]"
1,Fellowship of the Ring,8,"[Samwise Gamgee, Frodo Baggins, Legolas]","[swordfighter, archer]"
2,White Council,2,"[Elrond, Gandalf]",[]
3,Union of Maedhros,2,"[Haldir, Halmir]",[]
4,Wise,2,"[Adanel, Andreth]",[]
5,Rangers of Ithilien,2,"[Damrod, Madril]",[]
6,Istari,1,[Gandalf],[]
7,White Company,1,[Beregond],[]


It was at this moment that I realized the whole Hobbit series is included. Balin was the diarist for the Thorin and Company group. For some reason, I was expecting Bilbo Baggins to be the diarist. Obviously, there can be only one archer in the Fellowship of the Ring group, and that is Legolas. Gandalf seems to be involved in a couple of groups.

We will execute one more WikiData API call. This time we will fetch the enemies and the items the characters own.

In [41]:
import_enemy_query = """

MATCH (r:Character) 
WHERE exists (r.id) 
WITH 'SELECT * WHERE { ?item rdfs:label ?name . filter (?item = wd:' + r.id + ') filter (lang(?name) = "en" ) . ' +
      'OPTIONAL{ ?item wdt:P1830 [rdfs:label ?owner ] . filter (lang(?owner) = "en" ). } ' +
      'OPTIONAL{ ?item wdt:P7047 ?enemy }}' AS sparql, r 
CALL apoc.load.jsonParams( "https://query.wikidata.org/sparql?query=" + 
                            sparql, 
                            { Accept: "application/sparql-results+json"}, null) 
YIELD value 
WITH value,r 
WHERE value['results']['bindings'] <> [] 
UNWIND value['results']['bindings'] as row 
FOREACH(ignoreme in case when row['owner'] is not null then [1] else [] end |
    MERGE (c:Item{name:row['owner']['value']}) 
    MERGE (r)-[:OWNS_ITEM]->(c)) 
FOREACH(ignoreme in case when row['enemy'] is not null then [1] else [] end | 
    MERGE (c:Character{url:row['enemy']['value']}) 
    MERGE (r)-[:ENEMY]->(c))

"""

r = run_query(import_enemy_query)

Finally, we have finished importing our graph. Let's look at how many enemies there are among direct family members.

In [42]:
family_enemy_query = """

MATCH p=(a)-[:SPOUSE|SIBLING|HAS_FATHER|HAS_MOTHER]-(b) 
WHERE (a)-[:ENEMY]-(b) 
RETURN [n IN nodes(p) | n.name] AS result LIMIT 10

"""
run_query(family_enemy_query)

Unnamed: 0,result
0,"[Manwë, Morgoth]"
1,"[Morgoth, Manwë]"


It looks like Morgoth and Manwë are brothers and enemies. This is the first time I have heard of the two, but the LOTR fandom site claims Morgoth was the first Dark Lord. Let's look at how many enemies there are among second-degree relatives.

In [43]:
family_enemy_2hops_query = """

MATCH p=(a)-[:SPOUSE|SIBLING|HAS_FATHER|HAS_MOTHER*..2]-(b) 
WHERE (a)-[:ENEMY]-(b) 
RETURN [n IN nodes(p) | n.name] AS result LIMIT 10

"""
run_query(family_enemy_2hops_query)

Unnamed: 0,result
0,"[Manwë, Morgoth]"
1,"[Morgoth, Manwë]"
2,"[Morgoth, Manwë, Varda]"
3,"[Varda, Manwë, Morgoth]"


Not a lot of enemies among second-degree relatives. We can observe that Varda has taken her husband's stance and is also an enemy of Morgoth. This is an example of a stable triangle or triad. The triangle consists of one positive relationship (SPOUSE) and two negative ones (ENEMY). In social network analysis, triangles are used to measure the cohesiveness and structural stability of a network.
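One common way to formalize "stable" is structural balance theory, where a triad counts as balanced when the product of its edge signs is positive. That framing is my assumption, as the post only calls the triangle stable. A minimal sketch with a hypothetical `is_balanced` helper:

```python
from math import prod


def is_balanced(signs):
    """A triad is balanced when the product of its three edge signs is positive."""
    if len(signs) != 3:
        raise ValueError("a triad has exactly three edges")
    return prod(signs) > 0


# +1 for a positive tie (SPOUSE), -1 for a negative tie (ENEMY)
manwe_varda = +1
manwe_morgoth = -1
varda_morgoth = -1

triad_balanced = is_balanced([manwe_varda, manwe_morgoth, varda_morgoth])
print(triad_balanced)
```

Under this definition, "the enemy of my spouse is my enemy" is balanced, while a triangle with two positive ties and one negative tie is not.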

## Graph data science

If you have read any of my previous blog posts, you know that I just have to include some example use cases of graph algorithms from the Graph Data Science library. If you need a quick refresher on how the GDS library works and what is happening behind the scenes, I suggest you read my previous blog post.

We will start by projecting the family network. We load all the characters and the familial relationships like SPOUSE, SIBLING, HAS_FATHER, and HAS_MOTHER between them.

In [44]:
project_graph = """
CALL gds.graph.project('family','Character', 
    ['SPOUSE','SIBLING','HAS_FATHER','HAS_MOTHER'])
"""
run_query(project_graph)

Unnamed: 0,nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis
0,"{'Character': {'label': 'Character', 'properti...","{'HAS_MOTHER': {'orientation': 'NATURAL', 'agg...",family,699,1054,102


### Weakly connected component

The weakly connected component algorithm is used to find islands or disconnected components within our network. The following visualizations contain two connected components. The first component is composed of Michael, Mark, and Doug while the second one consists of Alice, Charles, and Bridget.

In our case, we will use the weakly connected component algorithm to find islands within the family network. All members within the same family component are related to each other somehow. It could be a cousin of the sister-in-law's grandmother or something more direct like a sibling. To get a rough feeling for the results, we will run the stats mode of the algorithm.
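To show what the algorithm computes, here is a minimal union-find sketch in Python applied to the six-person example. The exact edges are an assumption, since the text only names the members of each component, and this is an illustration of the concept rather than the GDS implementation.

```python
def connected_components(nodes, edges):
    """Group nodes into weakly connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, compressing the path along the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        # Direction is ignored: an edge merges the two components
        parent[find(source)] = find(target)

    components = {}
    for node in nodes:
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


nodes = ["Michael", "Mark", "Doug", "Alice", "Charles", "Bridget"]
# Assumed edges; only the component membership comes from the text
edges = [
    ("Michael", "Mark"),
    ("Mark", "Doug"),
    ("Alice", "Charles"),
    ("Charles", "Bridget"),
]
components = connected_components(nodes, edges)
print(sorted(sorted(c) for c in components))
```

A node with no edges ends up in a component of size one, which is exactly how characters without any familial link show up in the results.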

In [46]:
wcc_stats_query = """

CALL gds.wcc.stats('family') 
YIELD componentCount, 
      componentDistribution 
RETURN componentCount as components, 
       componentDistribution.p75 as p75, 
       componentDistribution.p90 as p90, 
       apoc.math.round(componentDistribution.mean,2) as mean, 
       componentDistribution.max as max

"""

run_query(wcc_stats_query)

Unnamed: 0,components,p75,p90,mean,max
0,147,1,3,4.76,324


There are 147 connected components in our graph. At least 75% of the components contain only a single character, as the 75th percentile of the component size is 1. This means that around 110 (75% * 147) characters don't have a single familial link to any other character. If they had a single link, the size of the component would be at least two. The biggest component has 324 members, so that must be one happy family. Let's write back the results and further analyze the family components.

In [47]:
wcc_write_query = """

CALL gds.wcc.write('family', {writeProperty:'familyComponent'})

"""

run_query(wcc_write_query)

Unnamed: 0,writeMillis,nodePropertiesWritten,componentCount,componentDistribution,postProcessingMillis,preProcessingMillis,computeMillis,configuration
0,181,699,147,"{'p99': 139, 'min': 1, 'max': 324, 'mean': 4.7...",5,0,19,"{'writeConcurrency': 4, 'seedProperty': None, ..."


In [52]:
# Also need to mutate in order to be able to use subgraph later on

wcc_mutate_query = """

CALL gds.wcc.mutate('family', {mutateProperty:'familyComponent'})

"""

run_query(wcc_mutate_query)

Unnamed: 0,mutateMillis,nodePropertiesWritten,componentCount,componentDistribution,postProcessingMillis,preProcessingMillis,computeMillis,configuration
0,0,699,147,"{'p99': 139, 'min': 1, 'max': 324, 'mean': 4.7...",4,0,16,"{'seedProperty': None, 'consecutiveIds': False..."


We will start by looking at the top five largest family components. The first thing we are interested in is which races are present in the family trees. We'll also add some random members in the results to get a better feeling of the data.

In [48]:
top5_families_query = """

MATCH (c:Character) 
OPTIONAL MATCH (c)-[:BELONG_TO]->(race) 
WITH c.familyComponent as familyComponent, 
     count(*) as size, 
     collect(c.name) as members, 
     collect(distinct race.race) as family_race 
ORDER BY size DESC LIMIT 5 
RETURN familyComponent, 
       size, 
       members[..3] as random_members, 
       family_race
"""

run_query(top5_families_query)

Unnamed: 0,familyComponent,size,random_members,family_race
0,115,324,"[Galadriel, Fingolfin, Amras]","[Middle-earth elf, Maiar, men in Tolkien's leg..."
1,0,139,"[Frodo Baggins, Bilbo Baggins, Samwise Gamgee]",[Hobbit]
2,198,29,"[Thorin II, Gimli, Balin]",[dwarves in Tolkien's legendarium]
3,273,21,"[Túrin I, Dior of Gondor, Hador of Gondor]",[men in Tolkien's legendarium]
4,99,6,"[Aulë, Oromë, Tulkas]",[valar]


As mentioned, the largest family has 324 members of various races ranging from elves to humans and even Maiar. It appears that elven and human lives are quite intertwined in Middle-earth. Also their legs. There is a reason why the half-elven race even exists. Other races like hobbits and dwarves stick more to their own kind.

Let's examine the interracial marriages in the largest family component.

In [50]:
ir_query = """

MATCH (c:Character) 
WHERE c.familyComponent = 115 // fix the family component 
MATCH p=(race)<-[:BELONG_TO]-(c)-[:SPOUSE]-(other)-[:BELONG_TO]->(other_race) 
WHERE race <> other_race AND id(c) > id(other) 
RETURN c.name as spouse_1, 
       race.race as race_1, 
       other.name as spouse_2, 
       other_race.race as race_2
"""

run_query(ir_query)

Unnamed: 0,spouse_1,race_1,spouse_2,race_2
0,Beren Erchamion,men in Tolkien's legendarium,Lúthien,Middle-earth elf
1,Melian,Maiar,Thingol,Middle-earth elf
2,Elrond,half-elven,Celebrían,Middle-earth elf
3,Tuor,men in Tolkien's legendarium,Idril,Middle-earth elf
4,Dior Eluchíl,half-elven,Nimloth,Middle-earth elf
5,Arwen,half-elven,Aragorn,men in Tolkien's legendarium


First of all, I didn't know that Elrond was a half-elf. It seems like the human and elven "alliance" is as old as time itself. I was mainly expecting to see Arwen and Aragorn as I remember that from the movies. It would be interesting to learn how far back half-elves go. Let's look at which half-elves have the longest line of descendants, measured in generations.

In [51]:
oldest_halfelf_query = """

MATCH (c:Character)
WHERE (c)-[:BELONG_TO]->(:Race{race:'half-elven'})
MATCH p=(c)<-[:HAS_FATHER|HAS_MOTHER*..20]-(end)
WHERE NOT (end)<-[:HAS_FATHER|HAS_MOTHER]-()
WITH c, max(length(p)) as descendants
ORDER BY descendants DESC
LIMIT 5
RETURN c.name as character,
       descendants

"""

run_query(oldest_halfelf_query)

Unnamed: 0,character,descendants
0,Dior Eluchíl,11
1,Elwing,10
2,Eärendil,10
3,Elros,9
4,Elrond,2


It seems like Dior Eluchíl is the oldest recorded half-elf. I checked the results on the LOTR fandom site, and it seems we are correct. Dior Eluchíl was born in the First Age in the year 470. There are a couple of other half-elves who were born within 50 years of Dior.
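Note that `max(length(p))` in the query measures the longest chain of generations below a character, not the total number of descendants. A small recursive sketch of that measure, on a hypothetical family tree and assuming the data contains no cycles:

```python
# Hypothetical parent -> children map (reversed HAS_FATHER / HAS_MOTHER links)
children_of = {
    "ancestor": ["child_1", "child_2"],
    "child_1": ["grandchild"],
    "grandchild": ["great_grandchild"],
}


def longest_descendant_chain(person, children_of):
    """Return the number of generations in the longest line below `person`."""
    children = children_of.get(person, [])
    if not children:
        return 0
    return 1 + max(longest_descendant_chain(child, children_of) for child in children)


depth = longest_descendant_chain("ancestor", children_of)
print(depth)
```

The Cypher query additionally caps the search at 20 hops, which this sketch does not do.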

### Betweenness centrality

We will also take a look at the betweenness centrality algorithm. It is used to find bridge nodes between different communities. If we take a look at the following visualization, we can observe that Captain America has the highest betweenness centrality score. That is because he is the main bridge in the network and connects the left-hand side of the network to the right-hand side. The second bridge in the network is the Beast. We can easily see that all the information exchanged between the main part of the network and the right-hand side has to go through him.

We will look for the bridge characters in the largest family network. My guess would be that spouses in an interracial marriage will come out on top, because all the communication between the races flows through them. We've seen that there are only six interracial marriages, so some of those spouses will probably rank highly.
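To build some intuition for why bridge nodes score highly, here is a compact Python sketch of Brandes' algorithm for unweighted, undirected graphs, run on a hypothetical toy network of two triangles joined through a single node. It is an illustration only; the GDS implementation may scale or normalize scores differently.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm for unweighted, undirected graphs.

    Each unordered pair is visited from both endpoints, so scores are halved.
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        stack = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        # Breadth-first search that counts shortest paths from the source
        while queue:
            node = queue.popleft()
            stack.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Accumulate dependencies from the farthest nodes back to the source
        dependency = {node: 0.0 for node in adjacency}
        while stack:
            node = stack.pop()
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical toy network: two triangles joined through a single bridge node
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "bridge"],
    "bridge": ["c", "d"],
    "d": ["bridge", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
scores = betweenness_centrality(adjacency)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Every shortest path between the two triangles passes through the bridge node, so it collects the highest score, followed by the two nodes it is attached to.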

In [58]:
create_largest_wcc_query = """
CALL gds.graph.project.cypher('largest-wcc', 
  'MATCH (n:Character) WHERE n.familyComponent = 115 
   RETURN id(n) as id',
   'MATCH (s:Character)-[:HAS_FATHER|HAS_MOTHER|SPOUSE|SIBLING]-(t:Character) 
                       RETURN id(s) as source, id(t) as target',
    {validateRelationships: false})
"""

run_query(create_largest_wcc_query)

Unnamed: 0,nodeQuery,relationshipQuery,graphName,nodeCount,relationshipCount,projectMillis
0,MATCH (n:Character) WHERE n.familyComponent = ...,MATCH (s:Character)-[:HAS_FATHER|HAS_MOTHER|SP...,largest-wcc,324,1114,93


In [59]:
betweenness_centrality_query = """

CALL gds.betweenness.stream('largest-wcc')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as character,
        score
ORDER BY score DESC 
LIMIT 10

"""

run_query(betweenness_centrality_query)

Unnamed: 0,character,score
0,Arwen,44100.0
1,Aragorn,43584.0
2,Arathorn II,42224.0
3,Arador,41940.0
4,Argonui,41652.0
5,Arathorn I,41360.0
6,Arassuil,41064.0
7,Arahad II,40764.0
8,Elrond,40483.107143
9,Aravorn,40460.0


Interesting to see that Arwen and Aragorn come out on top. Not exactly sure why, but I keep on thinking that they are the modern Romeo and Juliet who have formed an alliance between men and half-elves with their marriage. I have no idea how J.R.R. Tolkien's system for generating names worked, but it seems a bit biased towards names starting with an A.