<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/gazette/UK%20Gazette.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install dependencies
!pip install neo4j beautifulsoup4 spacy pandas
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


# Represent United Kingdom's public record as a knowledge graph
## Utilize UK Gazette API capabilities to construct a knowledge graph and analyze it in Neo4j
I love constructing knowledge graphs from various sources. I've wanted to create a government knowledge graph for some time, but I struggled to find data that is easily accessible and doesn't require weeks of developing a data pipeline. At first, I thought I would have to use OCR and NLP techniques to extract valuable information from public records, but luckily I stumbled upon the UK Gazette. The UK Gazette is a website that holds the United Kingdom's official public record information. All the content on the website and via its APIs is available under the Open Government Licence v3.0. Since the public record information is offered through an API endpoint, we don't have to use any scraping tool to extract it. Even more impressive is that you can export the data in linked data format (RDF). Linked data is a way of representing structured, interlinked data, so it essentially describes a graph of nodes and relationships.

Since the linked data structure (RDF) already contains information about nodes and relationships, we don't have to define a graph schema manually.
Most graph databases use either the RDF (Resource Description Framework) or the LPG (Labeled-property graph) model under the hood. If you are using an RDF graph database, the structure of the linked data information will be identical to the graph model in the database. However, as you might know from my previous posts, I like to use Neo4j, which utilizes an LPG graph model. I won't go much into the difference between the two models here. If you want to learn more about the difference between the RDF and LPG models, I would point you to the [presentation by Jesús Barrasa](https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/).
Since the linked data structure is frequently used to transmit data, the folks at Neo4j have made it easy to import or export data in the linked data format by using the [Neosemantics library](https://neo4j.com/labs/neosemantics/). In this post, we will be using the Neo4j database in combination with the Neosemantics library to store the linked data information fetched from the UK Gazette's API.
## Environment setup
To follow along, you need to have a running instance of the Neo4j database with the Neosemantics library installed.
One option is to use the Neo4j Sandbox environment, a free cloud instance of the Neo4j database with the Neosemantics library pre-installed. If you want to use the Neo4j Sandbox environment, [start a blank project](https://sandbox.neo4j.com/?usecase=blank-sandbox) that comes with an empty database.
On the other hand, you could also use a local environment of Neo4j. If you opt for a local version, I recommend using the Neo4j Desktop application, a database management application that has a simple interface for adding plugins with a single click.
## Setting up the connection to the Neo4j instance
Before we begin, we have to establish a connection with Neo4j from the notebook environment. If you are using the Sandbox instance, you can copy the details from the Connection Details tab.

In [3]:
# Define Neo4j connections
import pandas as pd
from neo4j import GraphDatabase
host = 'bolt://3.83.239.168:7687'
user = 'neo4j'
password = 'swim-ram-percents'
driver = GraphDatabase.driver(host,auth=(user, password))

def run_query(query, params=None):
    # Run a Cypher query and return the result as a pandas DataFrame
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

## Configuring Neosemantics library
The Neosemantics library requires a unique constraint on the uri property of the Resource nodes. You can define the unique constraint using the following Cypher statement.

In [4]:
run_query("""
CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS ON (r:Resource)
ASSERT r.uri IS UNIQUE
""")

Next, we need to define the Neosemantics configuration. We have a couple of options to specify how we want the RDF data to be imported as an LPG graph. We'll keep most of the configuration default and only set the handleVocabUris and applyNeo4jNaming parameters. You can inspect the documentation for the [complete reference of configuration options](https://neo4j.com/labs/neosemantics/4.3/reference/).

Use the following Cypher statement to define the Neosemantics configuration.

In [5]:
run_query("""
CALL n10s.graphconfig.init({
  handleVocabUris: 'MAP',
  applyNeo4jNaming: true
})
""")

| | param | value |
|---|---|---|
| 0 | handleVocabUris | MAP |
| 1 | handleMultival | OVERWRITE |
| 2 | handleRDFTypes | LABELS |
| 3 | keepLangTag | False |
| 4 | keepCustomDataTypes | False |
| 5 | applyNeo4jNaming | True |
| 6 | baseSchemaNamespace | neo4j://graph.schema# |
| 7 | baseSchemaPrefix | n4sch |
| 8 | classLabel | Class |
| 9 | subClassOfRel | SCO |
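
To get a feeling for what the two settings we changed do, here is a rough Python approximation of the naming that shows up in the queries later in this post (for example `ISAWARDED` and `relatedDate`). This is only my sketch of the observable effect, not the Neosemantics implementation: I assume that the namespace part of a vocabulary URI is dropped, relationship types are upper-cased, labels start with an upper-case letter, and properties start with a lower-case letter. The `example.org` namespace and both function names are made up for illustration.

```python
import re

def local_name(uri: str) -> str:
    """Return the part of a vocabulary URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri)[-1]

def approx_neo4j_names(uri: str) -> dict:
    """Approximate the names a single vocabulary URI could be mapped to."""
    name = local_name(uri)
    return {
        "label": name[:1].upper() + name[1:],        # upper camel case
        "relationship": name.upper(),                # upper case
        "property": name[:1].lower() + name[1:],     # camel case
    }

# Hypothetical namespace, used only for illustration
print(approx_neo4j_names("https://example.org/def/honours#isAwarded"))
# {'label': 'IsAwarded', 'relationship': 'ISAWARDED', 'property': 'isAwarded'}
```

Which of the three names is actually used depends on whether the URI appears as a type, as a predicate pointing to another resource, or as a predicate pointing to a literal value.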


## Construct a knowledge graph of UK public record
We will utilize the UK Gazette API to search for notices. The Notice Feed API is publicly available and doesn't require any authorization. However, requests only seem to succeed when they carry a browser-like User-Agent header. I have no idea what the reason behind this is, but I spent 30 minutes of my life trying to make it work. The documentation of the API is available on [GitHub](https://github.com/TheGazette/DevDocs/blob/master/notice/notice-feed.md).

The two main parameters to filter notices via the API are the category code and the notice type. The category code is the higher-level filter, while the notice type allows you to select only a subsection of a category. The complete list of category codes and notice types is available on the following website. There is a broad selection of notices you can choose from, ranging from State and Parliament to Companies regulation and more.
As mentioned, we can download the linked data format information for each notice. A nice thing about the Neosemantics library is that it can fetch data from local files as well as simple APIs. The workflow will be the following:

1. Use the Notice Feed API to find relevant notice ids.
2. Use Neosemantics to fetch the RDF information about the specified notice ids and store it in Neo4j.

Lastly, we will define the function that will take in the category code and notice type parameters and store the information about notices in the Neo4j database.

In [6]:
import requests
from requests.structures import CaseInsensitiveDict

# Query to import RDF/XML data to Neo4j using Neosemantics
import_rdf_query = """
UNWIND $data AS link
CALL n10s.rdf.import.fetch(
  link,
  'RDF/XML'
) YIELD triplesLoaded
RETURN sum(triplesLoaded) AS totalTriplesLoaded
"""

def make_request(uri):
    # For some reason, the API only works when I pretend to be a browser
    headers = CaseInsensitiveDict()
    headers["user-agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
    return requests.get(uri, headers=headers)
    
def ukgazzette_to_neo4j(pages=1, categorycode="", noticetype=""):
    for page in range(1, pages + 1):
        baseUrl = f"https://www.thegazette.co.uk/all-notices/notice/data.json"
        ccode = "categorycode=" + categorycode + "&" if categorycode else ""
        ntype = "noticetype=" + noticetype + "&" if noticetype else ""
        parameters = f"?{ccode}{ntype}results-page-size=100&sort-by=latest-date&results-page={page}"
        
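        response = None
        try:
            response = make_request(baseUrl + parameters)
            responseJson = response.json()
        except Exception as e:
            # The request itself may have failed, in which case there is no response to print
            if response is not None:
                print(response.text)
            print(e)
            break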

        # Define RDF/XML URL links
        data = []
        for notice in responseJson['entry']:
            # Avoid shadowing the built-in id() function
            notice_id = notice['id'].split('/')[-1]
            rdf_uri = f"https://www.thegazette.co.uk/notice/{notice_id}/data.rdf?view=linked-data"
            data.append(rdf_uri)

        # Import RDF into Neo4j with Neosemantics
        query_response = run_query(import_rdf_query, {'data': data})
        print(query_response)

The Cypher statement expects the data parameter to contain a list of links where the RDF/XML information about the notices is available. In case you are wondering, the Neosemantics library supports other RDF serialization formats as well, such as Turtle, N-Triples, and JSON-LD.
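
If you want to check the link construction without touching the API or the database, the step can be isolated into a small function. This is only a sketch: `notice_rdf_links` is a name I made up, and the sample page below merely mimics the single field the function reads, so the real feed response will contain more than this.

```python
def notice_rdf_links(feed_json: dict) -> list:
    """Build the RDF/XML download link for every notice in one feed page."""
    links = []
    for notice in feed_json.get("entry", []):
        # The notice id is the last path segment of the entry id
        notice_id = notice["id"].rstrip("/").split("/")[-1]
        links.append(
            f"https://www.thegazette.co.uk/notice/{notice_id}/data.rdf?view=linked-data"
        )
    return links

# Sample page that only mimics the field used above (assumed shape)
sample_page = {"entry": [{"id": "https://www.thegazette.co.uk/id/notice/3999024"}]}
print(notice_rdf_links(sample_page))
```

The returned list is exactly what the `$data` parameter of the import query expects.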

We get 100 notice ids for each request to the Notice Feed API. I've included the pagination feature in the function if you want to import more. As you might see from the code, we make a request to the notice feed and construct a list of links where the RDF/XML information about notices is stored. Next, we input that list as the parameter to the Cypher statement, where the Neosemantics library will iterate over all the links and store the information in Neo4j. It's about as simple as it gets.
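
The pagination part can also be written with the standard library instead of manual string concatenation. The sketch below is an alternative to the URL building in the function above, not a replacement you need: `notice_feed_url` and its `page_size` argument are names I introduced, while the query parameters are the same ones used in the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.thegazette.co.uk/all-notices/notice/data.json"

def notice_feed_url(page=1, categorycode="", noticetype="", page_size=100):
    """Build the Notice Feed URL for one results page."""
    params = {}
    # Only add the filters that were actually provided
    if categorycode:
        params["categorycode"] = categorycode
    if noticetype:
        params["noticetype"] = noticetype
    params.update({
        "results-page-size": page_size,
        "sort-by": "latest-date",
        "results-page": page,
    })
    # safe="+" keeps the "+" separators between multiple codes unescaped
    return f"{BASE_URL}?{urlencode(params, safe='+')}"

# One URL per page, here for the first three pages of category 11
for page in range(1, 4):
    print(notice_feed_url(page, categorycode="11"))
```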
Now we can go ahead and import the last 1000 notices under the state category code. If you look at the notice code reference, you can see that the state notices fall under the category code value of 11.

In [7]:
# Import last 1000 state notices
ukgazzette_to_neo4j(10, "11")

   totalTriplesLoaded
0                3519
   totalTriplesLoaded
0                4918
   totalTriplesLoaded
0                4953
   totalTriplesLoaded
0                5008
   totalTriplesLoaded
0                5110
   totalTriplesLoaded
0                4997
   totalTriplesLoaded
0                4899
   totalTriplesLoaded
0                4932
   totalTriplesLoaded
0                4928
   totalTriplesLoaded
0                4875


The graph schema of the imported graph is complex and not easy to visualize, so we will skip that. I didn't spend much time studying the whole graph structure, but I prepared a couple of sample Cypher statements that could get us started.

For example, we can examine the five most recent award recipients.

In [8]:
# Who has received any awards

run_query("""
MATCH (award)<-[:ISAWARDED]-(t:AwardandHonourThing)-[:HASAWARDEE]->(person)-[:HASEMPLOYMENT]->(employment)-[:ISMEMBEROFORGANISATION]->(organization)
RETURN award.label AS award,
       t.relatedDate AS relatedDate,
       person.name AS person,
       employment.jobTitle AS jobTitle,
       organization.name AS organization
ORDER BY relatedDate DESC
LIMIT 5
""")


| | award | relatedDate | person | jobTitle | organization |
|---|---|---|---|---|---|
| 0 | B.E.M. | 2021-12-31 | Cornel GRANT | Bus Driver | Stagecoach |
| 1 | B.E.M. | 2021-12-31 | Bryn Owen WILLIAMS | Desk Officer | Foreign, Commonwealth and Development Office |
| 2 | B.E.M. | 2021-12-31 | Brian WARING | Desk Officer | Foreign, Commonwealth and Development Office |
| 3 | B.E.M. | 2021-12-31 | Angela Joy FRENCH | Former Chair | The Women’s Royal Voluntary Service |
| 4 | B.E.M. | 2021-12-31 | Natalie Claire COLEMAN | Director | National Gallery of the Cayman Islands |


Remember, this was the example I showed in the introduction of this article. I've learned that one can also be appointed as a Commander of the Order of the British Empire.

In [9]:
run_query("""
MATCH (n:CommanderOrderOfTheBritishEmpire)<-[:ISAPPOINTEDAS]-(notice)-[:HASAPPOINTEE]->(appointee),
      (notice)-[:HASAUTHORITY]->(authority)
RETURN n.label AS award,
       notice.relatedDate AS date,
       appointee.name AS appointee,
       authority.label AS authority
ORDER BY date DESC
LIMIT 5
""")

| | award | date | appointee | authority |
|---|---|---|---|---|
| 0 | C.B.E. | 2021-12-31 | Bernard John TAUPIN | Central Chancery of the Orders of Knighthood |
| 1 | C.B.E. | 2021-12-31 | Robert Adrian STRINGER | Central Chancery of the Orders of Knighthood |
| 2 | C.B.E. | 2021-12-22 | Dr. Kai Hung LEE | Central Chancery of the Orders of Knighthood |


To steer away from awards, we can also inspect which notices are related to various pieces of legislation.

In [10]:
run_query("""
MATCH (provenance)<-[:HAS_PROVENANCE]-(n:Notice)-[:ISABOUT]->(l:Legislation:NotifiableThing)-[:RELATEDLEGISLATION]->(related)
RETURN n.hasNoticeID AS noticeID,
       n.uri AS noticeURI,
       l.relatedDate AS date,
       provenance.uri AS provenance,
       collect(related.label) AS relatedLegislations
ORDER BY date DESC
LIMIT 5
""")

| | noticeID | noticeURI | date | provenance | relatedLegislations |
|---|---|---|---|---|---|
| 0 | 3999024 | https://www.thegazette.co.uk/id/notice/3999024 | 2022-02-21 | https://www.thegazette.co.uk/id/notice/3999024... | [Universities of Oxford and Cambridge Act 1923] |
| 1 | 3999023 | https://www.thegazette.co.uk/id/notice/3999023 | 2022-02-21 | https://www.thegazette.co.uk/id/notice/3999023... | [Universities of Oxford and Cambridge Act 1923] |
| 2 | 3992302 | https://www.thegazette.co.uk/id/notice/3992302 | 2022-02-14 | https://www.thegazette.co.uk/id/notice/3992302... | [BURIAL ACT 1853] |
| 3 | 3991643 | https://www.thegazette.co.uk/id/notice/3991643 | 2022-02-14 | https://www.thegazette.co.uk/id/notice/3991643... | [BURIAL ACT 1853] |


The nice thing about our knowledge graph is that it contains all the data references to the Gazette website. This allows us to verify and also find more information if needed. In addition, through my data exploration, I've noticed that not all information is parsed from notices as a lot of information is hard to structure as a graph automatically. More on that later.
Suppose you are like me and get quickly bored by state information. In that case, you could fetch more business-related information such as companies buying back their own stock, company directors being disqualified, or partnership dissolutions.

In [11]:
# Redemption or purchase of own shares out of capital, Company director disqualification order, Dissolution of partnership
ukgazzette_to_neo4j(10, "26+27", "2602+2608+2702")

   totalTriplesLoaded
0                3998
   totalTriplesLoaded
0                4030
   totalTriplesLoaded
0                3894
   totalTriplesLoaded
0                3644
   totalTriplesLoaded
0                3300
   totalTriplesLoaded
0                4050
   totalTriplesLoaded
0                3905
   totalTriplesLoaded
0                4029
   totalTriplesLoaded
0                4228
   totalTriplesLoaded
0                4144


I've spent 30 minutes figuring out how to properly use notice types and category code parameters to filter notice feeds. You must also include the category code parameter when you want to filter by notice type. Otherwise, the filtering won't work as expected.
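
To avoid forgetting the category code, you could derive it from the notice types. Judging only by the codes used in this post (11 for state notices, and 26+27 together with 2602+2608+2702), the first two digits of a notice type appear to match its category code. That is my assumption rather than something I verified for the whole code list, so treat the helper below, whose name is made up, as a convenience sketch.

```python
def category_codes_for(noticetype: str) -> str:
    """Derive the category codes to send along with a notice type filter.

    Assumes the first two digits of a notice type equal its category code.
    """
    codes = []
    for notice_type in noticetype.split("+"):
        code = notice_type[:2]
        # Keep each category code once, in order of first appearance
        if code and code not in codes:
            codes.append(code)
    return "+".join(codes)

print(category_codes_for("2602+2608+2702"))  # 26+27
```

The result can be passed as the category code argument of the import function together with the original notice types.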
We don't have to worry about creating separate graphs or databases for additional notice feeds. The graph schema is already defined in the RDF/XML data structure, and you can import all the notice types into a single Neo4j instance.

Now you can examine which partnerships have dissolved.

In [12]:
run_query("""
MATCH (n:PartnershipDissolutionNotice)-[:ISABOUT]->(notifiableThing)-[:HASCOMPANY]->(partnership),
      (notifiableThing)-[:ISENABLEDBYLEGISLATION]->(enabledby)
RETURN n.hasNoticeID AS noticeID,
       notifiableThing.relatedDate AS date,
       notifiableThing.uri AS noticeURI,
       enabledby.label AS enablingLegislation,
       partnership.name AS partnership
ORDER BY date DESC
LIMIT 5
""")

| | noticeID | date | noticeURI | enablingLegislation | partnership |
|---|---|---|---|---|---|
| 0 | 4001999 | 2022-02-25 | https://www.thegazette.co.uk/id/notice/4001999... | LIMITED PARTNERSHIPS ACT 1907 | PRAMERICA REAL ESTATE CAPITAL I (SCOTLAND) LIM... |
| 1 | 3998712 | 2022-02-21 | https://www.thegazette.co.uk/id/notice/3998712... | LIMITED PARTNERSHIPS ACT 1907 & PARTNERSHIP AC... | \n Edammer Limited Partnership |
| 2 | 3996994 | 2022-02-18 | https://www.thegazette.co.uk/id/notice/3996994... | PARTNERSHIP ACT 1890 | \n W H MAYES PARTNERSHIP |
| 3 | 3991200 | 2022-02-15 | https://www.thegazette.co.uk/id/notice/3991200... | LIMITED PARTNERSHIPS ACT 1907 | 17 CAPITAL MEZZANINE CO-INVEST LP |
| 4 | 3991197 | 2022-02-11 | https://www.thegazette.co.uk/id/notice/3991197... | LIMITED PARTNERSHIPS ACT 1907 | 17CAPITAL (OLYMPUS) LP |


Another interesting piece of information is which companies have bought back, or intend to buy back, their own shares.

In [13]:
run_query("""
MATCH (legislation)<-[:RELATEDLEGISLATION]-(n:RedemptionOrPurchase)-[:HASCOMPANY]->(company)
RETURN n.relatedDate AS date,
       company.name AS company,
       company.uri AS companyURI,
       collect(legislation.label) AS relatedLegislations,
       n.uri AS noticeURI
ORDER BY date DESC
LIMIT 5
""")

| | date | company | companyURI | relatedLegislations | noticeURI |
|---|---|---|---|---|---|
| 0 | 2022-02-18 | G. & B. (NORTH WEST) LIMITED | http://business.data.gov.uk/id/company/01797547 | [Companies Act 2006, Companies Act 2006, s. 719] | https://www.thegazette.co.uk/id/notice/3997065... |
| 1 | 2022-01-27 | \n RAS CAPITAL NO 1 LIMITED | http://business.data.gov.uk/id/company/10153195 | [Companies Act 2006, Companies Act 2006, s. 719] | https://www.thegazette.co.uk/id/notice/3981164... |
| 2 | 2022-01-12 | VETERINARY BUSINESS DEVELOPMENT LIMITED | http://business.data.gov.uk/id/company/02185105 | [Companies Act 2006, s. 719, Companies Act 2006] | https://www.thegazette.co.uk/id/notice/3950477... |
| 3 | 2022-01-07 | ROOSTEN LIMITED | http://business.data.gov.uk/id/company/08123072 | [Companies Act 2006, Companies Act 2006, s. 719] | https://www.thegazette.co.uk/id/notice/3969127... |
| 4 | 2022-01-07 | WOODMAN MOTOR COMPANY LIMITED | http://business.data.gov.uk/id/company/06453796 | [Companies Act 2006, Companies Act 2006, s. 71... | https://www.thegazette.co.uk/id/notice/3965706... |


# Taking it to the next level
As mentioned before, there are some examples where not all information is extracted from notices into the linked data structure. One such example is a change of members in a partnership. We have the information about the partnership in which the membership changed, but not exactly what has changed. All the data we can retrieve is the following:

In [14]:
run_query("""
MATCH (notice)-[:ISABOUT]->(n:PartnershipChangeInMembers)-[:HASCOMPANY]->(company)
RETURN notice.hasNoticeID AS noticeID,
       notice.uri AS noticeURI,
       n.relatedDate AS date,
       company.name AS company
ORDER BY date DESC
LIMIT 5
""")

| | noticeID | noticeURI | date | company |
|---|---|---|---|---|
| 0 | 3981940 | https://www.thegazette.co.uk/id/notice/3981940 | 2022-02-04 | INVERGORDON D SCOTTISH LIMITED PARTNERSHIP |
| 1 | 3984531 | https://www.thegazette.co.uk/id/notice/3984531 | 2022-02-03 | WALES FAMILY PARTNERSHIP |
| 2 | 3985272 | https://www.thegazette.co.uk/id/notice/3985272 | 2022-02-01 | Gilberts Chartered Accountants |
| 3 | 3970041 | https://www.thegazette.co.uk/id/notice/3970041 | 2022-01-14 | \n KINGSWOOD SURGERY TUNBRIDGEWELLS |
| 4 | 3964974 | https://www.thegazette.co.uk/id/notice/3964974 | 2022-01-05 | Thornton & Wright Opticians |


For example, if we inspect the first notice on the website, we can observe that the actual changes are not available in the linked data format.

I totally understand why the actual changes are not in a structured format. The reason is that there are too many variations of the membership change notice to capture them all in a structured way.

It seems like all roads lead to Rome: when dealing with text, you will likely have to utilize NLP techniques sooner or later. So I've added a simple example of using spaCy to extract organization and person entities from notices.

In [15]:
from bs4 import BeautifulSoup as bs
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(noticeId):
    print(f"\nExtracting entities for {noticeId}")
    uri = f"https://www.thegazette.co.uk/notice/{noticeId}/data.xml?download=true"
    content = make_request(uri).content
    bs_content = bs(content, "lxml")
    text = " ".join([el.text for el in bs_content.findAll("p", {"data-gazettes":"Text"})])
    print(text)
    doc = nlp(text)
    # Find named entities, phrases and concepts
    print('Entities \n --------------------')
    for entity in doc.ents:
        if entity.label_ not in ('PERSON', 'ORG'):
            continue
        print(entity.text, entity.label_)


The text of notices is not stored in our knowledge graph, so we have to utilize the UK Gazette API to retrieve it. I've used BeautifulSoup to extract the text from the XML response and then run it through spaCy's NLP pipeline to detect mentions of organizations and persons. The code doesn't store the entities back to Neo4j. I just wanted to give you a simple example of how you could start utilizing NLP capabilities to extract more information.
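
If you did want to store the entities, one possible approach is to turn them into rows for a single `UNWIND` statement. The sketch below is hypothetical: the `Entity` label, the `MENTIONS` relationship, and the `entity_rows` helper are names I made up, and only the `Notice` label and the `hasNoticeID` property come from the imported graph. The helper expects plain `(text, label)` pairs, which you could get from spaCy with `[(e.text, e.label_) for e in doc.ents]`.

```python
def entity_rows(notice_id, entities):
    """Turn (text, label) pairs into de-duplicated rows for a Cypher UNWIND."""
    seen = set()
    rows = []
    for text, label in entities:
        key = (text.strip(), label)
        # Keep only persons and organizations, and each of them only once
        if label not in ("PERSON", "ORG") or key in seen:
            continue
        seen.add(key)
        rows.append({"noticeID": notice_id, "text": key[0], "label": label})
    return rows

# Hypothetical write-back query: the Entity label and the MENTIONS
# relationship are made up and are not part of the Gazette schema
store_entities_query = """
UNWIND $rows AS row
MATCH (n:Notice {hasNoticeID: row.noticeID})
MERGE (e:Entity {text: row.text, type: row.label})
MERGE (n)-[:MENTIONS]->(e)
"""

rows = entity_rows(
    3989075,
    [("Stephen Kirkham", "PERSON"), ("Tower Family Healthcare", "ORG"),
     ("Stephen Kirkham", "PERSON"), ("2021", "DATE")],
)
print(rows)
```

The rows could then be written with `run_query(store_entities_query, {"rows": rows})`. Pass the notice id exactly as it is returned from the graph, since the `MATCH` only succeeds when the value and its type are identical to the stored `hasNoticeID`.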

We can now detect entities for a couple of changes in members of partnership notices.

In [16]:
partnership_changes = run_query("""
MATCH (notice)-[:ISABOUT]->(n:PartnershipChangeInMembers)
RETURN notice.hasNoticeID AS noticeID
LIMIT 5
""")['noticeID'].to_list()

for i in partnership_changes:
    extract_entities(i)


Extracting entities for 3996989

Pursuant to section 10 of the Limited Partnerships Act 1907, notice is hereby given in respect of IIF UK 1 LP, a limited partnership registered in England with registered number LP012764 (the “Partnership”), that:

 1.	FCA Pension Plan Trustee Limited as trustee of the FCA Pension Plan was admitted as a new limited partner of the Partnership.

          
Entities 
 --------------------
IIF UK 1 LP ORG

Extracting entities for 3989075

Notice is hereby given that Dr Stephen Kirkham, Dr Brij Patel and Dr John Hampson ceased to be a partners at Tower Family Healthcare, 16 Market St, Tottington, Bury, BL8 4AD. Dr Stephen Kirkham with effect from the 31st December 2021, Dr Brij Patel with effect from the 17th December 2021 and Dr John Hampson with effect from the 30th June 2021 . The business will continue with the remaining partners

Entities 
 --------------------
Stephen Kirkham PERSON
Brij Patel PERSON
John Hampson PERSON
Tower Family Healthcare ORG
Mar

The NLP pipeline doesn't extract the specific changes, but at least it's a start, since the non-standard structure of the text rules out a simple rule-based extraction of the differences. You can observe that even these two examples differ wildly in text structure and information.