# Import Open Ownership data into a Neo4j database

Open Ownership provides a regularly updated register of the beneficial owners of companies, drawn from corporate registrars around the world. This notebook is a rough guide to importing that data into a Neo4j graph database as a one-off load, and setting it up for search.

## Installation

This is the process for deploying Neo4j as a [Debian installation](https://neo4j.com/docs/operations-manual/current/installation/linux/debian/).

Add the repository (check the link for the latest version):

    wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
    echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee -a /etc/apt/sources.list.d/neo4j.list
    sudo apt-get update
    
Install community edition:

    sudo apt-get install neo4j=1:3.5.12
    
Then [start](https://medium.com/@Jessicawlm/installing-neo4j-on-ubuntu-14-04-step-by-step-guide-ed943ec16c56) the service:

    sudo service neo4j restart
    
See the [file locations](https://neo4j.com/docs/operations-manual/current/configuration/file-locations/) and [configuration](https://neo4j.com/docs/operations-manual/current/configuration/neo4j-conf/) documentation, then edit the configuration file as needed:

    sudo nano /etc/neo4j/neo4j.conf
 
## Using Neo4j with Python / Django

I use Python and Django, but you should find your language of choice supported in Neo4j's docs.

[Python drivers](https://neo4j.com/developer/python/)

    pip install neo4j
    
My preference is [Neomodel](https://pypi.python.org/pypi/neomodel), an object graph mapper for Neo4j:

    pip install neomodel
    
[Neomodel documentation](https://neomodel.readthedocs.io/en/latest/getting_started.html)

To set up the indexes and constraints for your node classes automatically (see the [configuration docs](https://neomodel.readthedocs.io/en/latest/configuration.html)):

    from neomodel import install_labels
    install_labels(YourClass)

Or, in Django:

    import yourapp  # make sure your app is loaded
    from neomodel import install_all_labels

    install_all_labels()

Or, from the command line:

    neomodel_install_labels yourapp.py someapp.models --db bolt://neo4j:ne05j@localhost:7687
    neomodel_remove_labels --db bolt://neo4j:ne05j@localhost:7687
    
## Building and populating the database

In [1]:
from neomodel import (config, db, StructuredNode, StructuredRel, 
                      StringProperty, IntegerProperty, ArrayProperty,
                      JSONProperty, BooleanProperty,
                      DateProperty, RelationshipTo)

NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "ne05j"
NEO4J_BOLT_URL = f"bolt://{NEO4J_USERNAME}:{NEO4J_PASSWORD}@localhost:7687"

config.DATABASE_URL = NEO4J_BOLT_URL  # default
config.FORCE_TIMEZONE = True  # default False

Create the Python database model via `neomodel` based on the Open Ownership [data standard / schema](http://standard.openownership.org/en/v0-2-0/schema/index.html):

In [2]:
from neomodel import (config, db, StructuredNode, StructuredRel, 
                      StringProperty, IntegerProperty, ArrayProperty,
                      JSONProperty, BooleanProperty,
                      DateProperty, RelationshipTo)

config.DATABASE_URL = NEO4J_BOLT_URL

class OwnershipOrControl(StructuredRel):
    statementID = StringProperty(required=True)
    statementType = StringProperty(required=True)
    statementDate = DateProperty()
    isActive = BooleanProperty(default=True)
    # isComponent = BooleanProperty()
    # These are the objects of the ownership or control statement
    # componentStatementIDs = ArrayProperty(StringProperty())
    # This is the subject of the ownership or control statement
    # subject = StringProperty()
    # This is the object of the ownership or control statement
    # interestedParty 
    interests = ArrayProperty(JSONProperty())
    source = JSONProperty()
    annotations = ArrayProperty(JSONProperty())
    replacesStatements = ArrayProperty(StringProperty())
    
class BaseEntity(StructuredNode):
    statementID = StringProperty(unique_index=True, required=True)
    statementType = StringProperty(required=True)
    statementDate = DateProperty()
    isActive = BooleanProperty(default=True)
    publicationDetails = JSONProperty()
    source = JSONProperty()
    annotations = ArrayProperty(JSONProperty())
    replacesStatements = ArrayProperty(StringProperty())
    addresses = ArrayProperty(JSONProperty())
    postcode = StringProperty()
    controls = RelationshipTo("BaseEntity", "CONTROLS", model=OwnershipOrControl)

class Entity(BaseEntity):
    entityType = StringProperty()
    unspecifiedEntityDetails = JSONProperty()
    name = StringProperty()
    name_token = StringProperty()
    alternateNames = ArrayProperty(StringProperty())
    incorporatedInJurisdiction = JSONProperty()
    identifiers = ArrayProperty(JSONProperty())
    foundingDate = DateProperty()
    dissolutionDate = DateProperty()
    uri = StringProperty()
    
class Person(BaseEntity):
    personType = StringProperty()
    unspecifiedPersonDetails = JSONProperty()
    names = ArrayProperty(JSONProperty())
    identifiers = ArrayProperty(JSONProperty())
    nationalities = ArrayProperty(JSONProperty())
    birthDate = DateProperty()
    deathDate = DateProperty()
    placeOfBirth = JSONProperty()
    placeOfResidence = JSONProperty()
    taxResidencies = ArrayProperty(JSONProperty())
    hasPepStatus = BooleanProperty()
    pepStatusDetails = ArrayProperty(JSONProperty())

In [15]:
# https://neo4j.com/developer/kb/large-delete-transaction-best-practices-in-neo4j/
#from neomodel import clear_neo4j_database, db
#%time clear_neo4j_database(db)  # deletes all nodes and relationships

from neomodel import install_labels, remove_all_labels
install_labels(OwnershipOrControl)
install_labels(Entity)
install_labels(Person)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.15 s


## Open Ownership JSONLines to Neo4j

This does slightly more than import the data: it also prepares a tokenised search term for each entity name, for use with the Neo4j [full text search](https://neo4j.com/developer/kb/fulltext-search-in-neo4j/). Remove the tokenising steps if you don't want to use it.

Note that you'll need to install NLTK, Unidecode and Cleanco:

    pip install nltk unidecode cleanco

In [4]:
class TokeniseName:

    def split_on_term(self, name, terms, return_second_term=False):
        splt = 0
        if return_second_term: splt = 1
        for term in terms:
            if term in name:
                name_split = name.split(term)
                # Double-check ... if the split choice is blank, pick the other
                if name_split[splt].strip(): 
                    name = name_split[splt].strip()
                else:
                    name = name_split[splt-1].strip()
        return name

    def parse_token(self, name):
        """
        Do the following:

        - Drop to lower-case;
        - Split on trading as ("t/a");
        - Remove punctuation;
        - Standardise whitespace (1-space);
        - Unidecode;
        - Remove legal control terms (co, ltd, etc);

         Create a word frequency dictionary: `token_frequency`

         Ref: [Figure out if a business name is very similar to another one]
         (https://stackoverflow.com/questions/6400416/figure-out-if-a-business-name-is-very-similar-to-another-one-python)
        """
        # Drop to lower-case and strip out leading / trailing spaces
        if not (name and str(name).strip()):
            return False
        parsed_name = " ".join(str(name).lower().strip().split())
        # Split 2nd term on t/a, ta, trading as
        second_term = ["t/a", " ta ", "trading as", "(a trading name)", "( a trading name)", "a trading name"]
        parsed_name = self.split_on_term(parsed_name, second_term, return_second_term=True)
        # Keep the 1st part when splitting on ltd, limited, c/o, co; the leading space ensures the term is a separate word and not at the start of the name
        first_term = [" ltd", " limited", ",limited", "c/o", " co ", " pty", " proprietary", " sarl"]
        parsed_name = self.split_on_term(parsed_name, first_term)
        # Unidecode to normalise alphabet
        parsed_name = unidecode(parsed_name)
        # Remove punctuation and normalise whitespace
        # https://stackoverflow.com/a/34294398 & https://stackoverflow.com/a/1546251
        punc = "!\"#$%£()*+,./:;<=>?@[\\]^_`{|}~" # removed -'& from the punctuation list
        translator = str.maketrans("", "", punc)
        parsed_name = parsed_name.translate(translator)
        # Strip out legal control terms
        parsed_name = cleanco(parsed_name).clean_name()
        # Simple deletions
        parsed_name = parsed_name.replace("?", "") 
        if parsed_name.strip():
            return " ".join(parsed_name.strip().split())
        return False
    
    def parse_search_token(self, name):
        parsed_name = self.parse_token(name)
        if not parsed_name:
            return False
        punc = "!\"#$%£()*+,./:;<=>?@[\\]^_`{|}~-'&"
        translator = str.maketrans("", "", punc)
        parsed_name = parsed_name.translate(translator)
        parsed_name = " ".join([l[0].strip() for l in nltk.pos_tag(parsed_name.split()) if l[1] not in ["CC"]])
        if parsed_name:
            return parsed_name
        return False
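
The steps in `parse_token` can be hard to follow without an example. The block below is a minimal, standard-library-only sketch of the same idea (lower-case, keep the trading name, strip punctuation, drop trailing legal terms). It is not the class above: `simple_name_token`, `LEGAL_SUFFIXES` and `TRADING_AS` are hypothetical names, the suffix list is a short assumed stand-in for what `cleanco` handles, and the part-of-speech filtering done with NLTK is left out.

```python
import string

# Assumed, abbreviated suffix list; cleanco recognises far more legal forms.
LEGAL_SUFFIXES = {"ltd", "limited", "llp", "plc", "inc", "co"}
# Assumed subset of the "trading as" markers used in the notebook.
TRADING_AS = ("t/a", "trading as")
# As in the notebook, keep - ' & so that names such as "m&s" survive.
PUNCTUATION = "".join(c for c in string.punctuation if c not in "-'&") + "£"


def simple_name_token(name):
    """Return a simplified search token for a company name, or None if empty."""
    if not (name and str(name).strip()):
        return None
    # Lower-case and collapse runs of whitespace to single spaces
    parsed = " ".join(str(name).lower().split())
    # Prefer the trading name (the part after the marker) when there is one
    for term in TRADING_AS:
        if term in parsed:
            before, _, after = parsed.partition(term)
            parsed = after.strip() or before.strip()
    # Remove punctuation
    parsed = parsed.translate(str.maketrans("", "", PUNCTUATION))
    # Drop trailing legal terms
    words = parsed.split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words) or None


print(simple_name_token("DONE BROTHERS (CASH BETTING) LIMITED"))
print(simple_name_token("Smith & Sons Ltd. t/a The Corner Shop"))
```

The first call prints `done brothers cash betting` and the second prints `the corner shop`. For real data the full `TokeniseName` class is likely to cope with many more edge cases than this sketch.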

Download the latest [jsonlines file](https://register.openownership.org/download) from Open Ownership.

In [17]:
import jsonlines

import nltk
from cleanco import cleanco
from unidecode import unidecode
import re

from datetime import date, datetime

IMPORT_DATA_DIRECTORY = "/directory/where/you/saved/oojsonlines"
tn = TokeniseName()

config.DATABASE_URL = NEO4J_BOLT_URL

def parse_postcode(x):
    # Postcodes: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s16.html
    # https://stackoverflow.com/a/51885364
    RE_POST = "([A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2})"
    u = re.search(RE_POST, str(x), re.IGNORECASE)
    if u is None:
        return None
    return u.group(0)

def generate_base_node(**kwargs):
    for dt in ["birthDate", "deathDate", "dissolutionDate", "foundingDate", "statementDate"]:
        if dt in kwargs:
            kwargs[dt] = date(*[int(i) for i in kwargs[dt].split("-")])
    if kwargs["statementType"] == "entityStatement" and kwargs.get("name"):
        name_token = tn.parse_search_token(kwargs["name"])
        if name_token: kwargs["name_token"] = name_token
    if "addresses" in kwargs:
        # list, including keys: 
        #   type: placeOfBirth, residence, registered, service, alternative, business
        #   address: 
        #   postCode: 
        ad_dict = {}
        for addr in kwargs["addresses"]:
            pc = addr.get("postCode")
            if not addr.get("type"): addr["type"] = "unclear"
            if not pc:
                pc = parse_postcode(addr.get("address"))
            if pc:
                ad_dict[addr["type"]] = pc
        for typ in ["business", "registered", "service", "alternative", "residence", "placeOfBirth", "unclear"]:
            if typ in ad_dict:
                kwargs["postcode"] = ad_dict[typ]
                break
    if "statementID" in kwargs:
        # NOTE: `cf` is not defined anywhere in this notebook; it must be a helper that provides `get_id`
        kwargs["statementID"] = cf.get_id(kwargs["statementID"])
        return kwargs
    return None

def create_or_update_node(**kwargs):
    node = None
    if kwargs["statementType"] == "entityStatement":
        node = Entity.create_or_update
        if kwargs.get("name"):
            name_token = tn.parse_search_token(kwargs["name"])
            if name_token: kwargs["name_token"] = name_token
    if kwargs["statementType"] == "personStatement":
        node = Person.create_or_update
    if not node: return None
    kwargs = generate_base_node(**kwargs)
    obj = None
    if kwargs:
        obj = node(kwargs)
    return obj

def get_edge_party(term):
    node = None
    if isinstance(term, str):
        node = Person.nodes.get_or_none(statementID=term, lazy=True)
        if not node:
            node = Entity.nodes.get_or_none(statementID=term, lazy=True)
    if isinstance(term, dict):
        if term.get("describedByPersonStatement"):
            node = Person.nodes.get_or_none(statementID=term["describedByPersonStatement"])
        if term.get("describedByEntityStatement"):
            node = Entity.nodes.get_or_none(statementID=term["describedByEntityStatement"])
    return node

def import_open_ownership_graph(source_file):
    # Open Ownership source data
    d = IMPORT_DATA_DIRECTORY
    node_errors = []
    edge_errors = []
    print("Starting ...")
    with jsonlines.open(f"{d.rstrip('/')}/{source_file}") as reader:
        for i, kwargs in enumerate(reader):
            if i % 1000 == 0 and i != 0: print("row: {}\r".format(i), end="")
            if kwargs["statementType"] in ["entityStatement","personStatement"]:
                node = create_or_update_node(**kwargs)
                if not node: node_errors.append(kwargs)
            if kwargs["statementType"] == "ownershipOrControlStatement":
                kwargs = generate_base_node(**kwargs)
                first_party = get_edge_party(kwargs["interestedParty"])
                second_party = get_edge_party(kwargs["subject"])
                if first_party and second_party:
                    with db.transaction:
                        edge = first_party.controls.connect(second_party, kwargs)
                        edge.save()
                else:
                    edge_errors.append(kwargs)
    print("Completed first run...")
    return edge_errors, node_errors

def test_open_ownership_graph(source_file):
    # Open Ownership source data
    d = IMPORT_DATA_DIRECTORY
    node_errors = []
    edge_errors = []
    print("Starting ...")
    with jsonlines.open(f"{d.rstrip('/')}/{source_file}") as reader:
        for i, kwargs in enumerate(reader):
            if i % 1 == 0 and i != 0: print("row: {}\r".format(i), end="")
            if kwargs["statementType"] in ["entityStatement" ,"personStatement", "ownershipOrControlStatement"]:
                print(kwargs)
                kwargs = generate_base_node(**kwargs)
                print(kwargs)
                if i > 10: break
    print("Completed test run...")
    return edge_errors, node_errors
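
The postcode handling in `generate_base_node` is easier to see in isolation. The sketch below reuses the regular expression and the address-type priority order from the cell above, but `pick_postcode` and `ADDRESS_PRIORITY` are hypothetical names, and the first address is made-up sample data. It needs no database connection.

```python
import re

# Same UK postcode pattern as in the import cell, compiled once
RE_POST = re.compile(r"[A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2}", re.IGNORECASE)

# Same order of preference as the loop in generate_base_node
ADDRESS_PRIORITY = ["business", "registered", "service", "alternative",
                    "residence", "placeOfBirth", "unclear"]


def parse_postcode(text):
    """Return the first postcode-like substring, or None."""
    match = RE_POST.search(str(text))
    return match.group(0) if match else None


def pick_postcode(addresses):
    """Choose one postcode per statement, preferring higher-priority address types."""
    by_type = {}
    for addr in addresses:
        # Use the explicit postCode if present, otherwise search the address text
        pc = addr.get("postCode") or parse_postcode(addr.get("address", ""))
        if pc:
            by_type[addr.get("type") or "unclear"] = pc
    for typ in ADDRESS_PRIORITY:
        if typ in by_type:
            return by_type[typ]
    return None


addresses = [
    # Hypothetical service address with an explicit postCode
    {"type": "service", "address": "1 Example Street, London", "postCode": "EC1A 1BB"},
    # Registered address without a postCode field; the regex has to find it
    {"type": "registered",
     "address": "The Spectrum, 56-58 Benson Road Birchwood, Warrington, Cheshire, WA3 7PQ"},
]
print(pick_postcode(addresses))
```

This prints `WA3 7PQ`, because a `registered` address outranks a `service` address. The pattern only targets UK-style postcodes, so addresses from other jurisdictions will usually yield `None` unless they carry an explicit `postCode`.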

In [18]:
# import jsonlines from OO
print(datetime.now())

config.DATABASE_URL = NEO4J_BOLT_URL

%time edge_errors, node_errors = import_open_ownership_graph("statements.latest.jsonl")
#%time edge_errors, node_errors = test_open_ownership_graph("statements.latest.jsonl")

2019-12-30 09:01:03.374281
Starting ...
Completed first run...
CPU times: user 9h 31min 28s, sys: 1h 2min 28s, total: 10h 33min 57s
Wall time: 22h 15min 31s
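
The import returns `edge_errors` and `node_errors` but does nothing further with them. One way to get a feel for why edges failed is to tally how each failed statement refers to its interested party. This is only a sketch: `party_kind` and `summarise_edge_errors` are hypothetical helpers, the sample statements are invented, and it assumes (per the v0.2 data standard) that an `interestedParty` may be marked `unspecified` rather than pointing at a person or entity statement, in which case no node can ever be found for it.

```python
from collections import Counter


def party_kind(party):
    """Classify how an interestedParty / subject value references a node."""
    if isinstance(party, str):
        return "statementID string"
    if isinstance(party, dict):
        for key in ("describedByPersonStatement",
                    "describedByEntityStatement",
                    "unspecified"):
            if party.get(key):
                return key
    return "missing"


def summarise_edge_errors(edge_errors):
    """Count failed ownership statements by the kind of interested party."""
    return Counter(party_kind(s.get("interestedParty")) for s in edge_errors if s)


# Invented examples of statements that might end up in edge_errors
sample_errors = [
    {"statementID": "a", "interestedParty": {"unspecified": {"reason": "unknown"}}},
    {"statementID": "b", "interestedParty": {"describedByPersonStatement": "p1"}},
    {"statementID": "c", "interestedParty": {"unspecified": {"reason": "unknown"}}},
    {"statementID": "d"},
]
print(summarise_edge_errors(sample_errors))
```

Statements counted under `describedByPersonStatement` or `describedByEntityStatement` point at a node that was not in the database when the edge was processed, so they may be worth retrying after the first pass.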


## Set up the full-text search

In [21]:
# Create a custom analyser to keep stopwords 
# https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html

#q = "CALL db.index.fulltext.drop('entityNames')"
#%time results, meta = db.cypher_query(q)

q = "CALL db.index.fulltext.createNodeIndex('entityNames',['Entity'],['name'])"
%time results, meta = db.cypher_query(q)
print(results)

q = "CALL db.index.fulltext.createNodeIndex('entityTokens',['Entity'],['name_token'], {analyzer: 'simple'})"
#q = "CALL db.index.fulltext.createNodeIndex('entityNames',['Entity'],['name', 'name_token'])"
%time results, meta = db.cypher_query(q)
print(results)

#q = "CALL db.index.fulltext.createNodeIndex('postCodes',['Entity'],['postcode'], {analyzer: 'keyword'})"
#%time results, meta = db.cypher_query(q)
#print(results)

CPU times: user 15.6 ms, sys: 15.6 ms, total: 31.2 ms
Wall time: 45 ms
[]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.31 ms
[]
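
Once the indexes exist they can be queried with the `db.index.fulltext.queryNodes` procedure. The search string is parsed as a Lucene query, so characters such as brackets in a company name need escaping first. The block below is a minimal sketch that only builds the Cypher query and its parameters; `escape_lucene` and `build_fulltext_query` are hypothetical helper names, and the returned columns are an arbitrary choice.

```python
# Characters with special meaning in Lucene query syntax
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(text):
    """Backslash-escape Lucene special characters so the text is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


def build_fulltext_query(index_name, search_text, limit=10):
    """Return (cypher, params) for a full-text lookup against a named index."""
    query = (
        "CALL db.index.fulltext.queryNodes($index, $search) "
        "YIELD node, score "
        "RETURN node.statementID, node.name, score "
        "LIMIT $limit"
    )
    params = {
        "index": index_name,
        "search": escape_lucene(search_text),
        "limit": limit,
    }
    return query, params


query, params = build_fulltext_query("entityNames", "DONE BROTHERS (CASH BETTING)")
print(query)
print(params)
```

With a live connection, the pair can be passed straight to neomodel as `results, meta = db.cypher_query(query, params)`. For the `entityTokens` index, it probably makes sense to run the search text through `tn.parse_search_token` first so that it is normalised the same way as the stored `name_token`.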


## Test a query

Run a simple query to check that the import returned data:

In [20]:
from neomodel import Q, Traversal

config.DATABASE_URL = NEO4J_BOLT_URL

q = Q(name__startswith='DONE BROTHERS')

node = Entity.nodes.filter(q)
for n in node:
    print(n)

{'statementID': 'CxD9NxIcFcFBdPRkHCTP3VBG', 'statementType': 'entityStatement', 'statementDate': None, 'isActive': True, 'publicationDetails': None, 'source': None, 'annotations': None, 'replacesStatements': None, 'addresses': [{'type': 'registered', 'address': 'The Spectrum, 56-58 Benson Road Birchwood, Warrington, Cheshire, WA3 7PQ', 'country': 'GB'}], 'postcode': 'WA3 7PQ', 'entityType': 'registeredEntity', 'unspecifiedEntityDetails': None, 'name': 'DONE BROTHERS (CASH BETTING) LIMITED', 'name_token': 'done brothers cash betting', 'alternateNames': None, 'incorporatedInJurisdiction': {'name': 'United Kingdom of Great Britain and Northern Ireland', 'code': 'GB'}, 'identifiers': [{'scheme': 'GB-COH', 'id': '01277703'}], 'foundingDate': datetime.date(1976, 9, 17), 'dissolutionDate': None, 'uri': None, 'id': 195220}
