# Round 4: Building the second KG

In Round 4 we build the second KG from the Round 2 model outputs, applying the additional post-processing steps from Round 3 plus two new post-processing steps:

- Clean up overlapping relations: for example, two relations may have been found, _Raymond Zondo >> position held >> Deputy Chief Justice_ and _Raymond Zondo >> position held >> Chief Justice_, and we want to keep only the most complete version.
- Remove ambiguous relations: for example, the same two entities may have two different relation types associated with them. In this case we remove both, as it would be unclear which relation is the definitive one.
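
The two new steps can be illustrated with a minimal, self-contained sketch. This is not the `kg_builder.rex` implementation: the `Relation` tuple, the helper names, and the rule that "most complete" means "the longer tail that contains the shorter one as whole words" are all assumptions made for illustration, and the second pair of relations is made-up data.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Relation(NamedTuple):
    """Hypothetical stand-in for a relation; not the kg_builder data model."""
    head: str
    relation_type: str
    tail: str


def _contains_whole_words(longer: str, shorter: str) -> bool:
    # True when `shorter` appears inside `longer` on word boundaries
    # and the two strings are not identical (case-insensitive).
    longer, shorter = longer.lower(), shorter.lower()
    return longer != shorter and f" {shorter} " in f" {longer} "


def cleanup_overlapping(relations: List[Relation]) -> List[Relation]:
    # Assumption: a relation is dropped when another relation with the same
    # head and type has a tail that fully contains this relation's tail.
    kept = []
    for rel in relations:
        shadowed = any(
            other.head == rel.head
            and other.relation_type == rel.relation_type
            and _contains_whole_words(other.tail, rel.tail)
            for other in relations
        )
        if not shadowed:
            kept.append(rel)
    return kept


def remove_ambiguous(relations: List[Relation]) -> List[Relation]:
    # Collect every relation type seen for each (head, tail) pair and keep
    # only the pairs that have exactly one type.
    types_by_pair = defaultdict(set)
    for rel in relations:
        types_by_pair[(rel.head, rel.tail)].add(rel.relation_type)
    return [r for r in relations if len(types_by_pair[(r.head, r.tail)]) == 1]


relations = [
    Relation("Raymond Zondo", "position held", "Deputy Chief Justice"),
    Relation("Raymond Zondo", "position held", "Chief Justice"),
    # Made-up ambiguous pair: two relation types for the same two entities
    Relation("Person A", "employer", "Organisation B"),
    Relation("Person A", "member of", "Organisation B"),
]

cleaned = remove_ambiguous(cleanup_overlapping(relations))
print(cleaned)
```

In this sketch only the _Deputy Chief Justice_ relation survives: the shorter overlapping tail is dropped in the first step and both made-up ambiguous relations are dropped in the second.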

The code below reads in the model outputs from Round 2, carries out the additional post-processing steps on the REX output, builds the second KG, and exports the identified entities and relations to CSV files.

In [1]:
import os
import logging

log_file_path = os.path.join(os.getcwd(), 'kg_builder.log')

logger = logging.getLogger(__name__)
logging.basicConfig(filename=log_file_path, encoding='utf-8', level=logging.DEBUG)

In [2]:
# Import required libraries
import pickle
import time
from datetime import datetime, timedelta
from kg_builder import kg
from kg_builder import rex
from kg_builder import get_wikidata_prepared_info
from kg_builder import make_lookup_dict_from_df

In [3]:
# Get the latest results
with open('model_outputs/round2/results.pkl', 'rb') as file:
    articles = pickle.load(file)

In [4]:
# rebel_flair_overview contains a summary of relations to be included
rebel_flair_overview, _, _, _, _ = get_wikidata_prepared_info('reference_info/wikidata_references.pkl')

# We only want to evaluate relations which have been preselected for inclusion
included_relations = list(rebel_flair_overview.loc[rebel_flair_overview['rebel description'].notna(), 'rebel description'])
included_relations += list(make_lookup_dict_from_df(rebel_flair_overview[rebel_flair_overview['rebel description'].notna()], 'rebel description', 'inverse description').values())
included_relations += ['alternate_name'] # additional Flair relation not in REBEL
included_relations = list(set(included_relations))

In [5]:
run_start = datetime.now()

In [6]:
# Only include pre-identified relations
for article in articles:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]

In [7]:
for article in articles:
    rex.populate_inverse_relations(article)
    rex.get_main_relations_only(article)
    rex.cleanup_alternate_name_pairs(article)
    rex.cleanup_duplicate_alternate_name_pairs(article)
    rex.populate_alt_names_mentions(article)
    rex.populate_clean_relation_texts(article)
    
    # Additional for Round 4
    rex.cleanup_overlapping_relations(article)
    rex.remove_ambiguous_relations(article)
    
    rex.populate_node_types(article)

In [8]:
el_tagger = kg.setup_el_tagger()

In [9]:
my_kg = kg.KGData()

In [10]:
sleeps = 0  # total seconds spent in deliberate sleeps, subtracted from the runtime later
start_index, end_index = 0, 31
for i, article in enumerate(articles[start_index:end_index]):
    article_number = i + start_index

    logger.info(f'Fetching article # {article_number}')
    kg.update_kg_from_article(my_kg, article, el_tagger)

    # Do a long sleep every 50 articles
    if article_number % 50 == 0 and i != 0:
        print(f'{article_number} articles completed, long sleep...')
        sleeps += 90
        time.sleep(90)
        print('resuming...')

    # Do a short sleep every 5 articles
    elif article_number % 5 == 0 and i != 0:
        print(f'{article_number} articles completed, short sleep...')
        sleeps += 30
        time.sleep(30)
        print('resuming...')

run_end = datetime.now()



5 articles completed, short sleep...
resuming...
10 articles completed, short sleep...
resuming...
Sleeping for 5.0 seconds, 2024-09-01 13:20:24
Sleeping for 5.0 seconds, 2024-09-01 13:20:29
Sleeping for 5.0 seconds, 2024-09-01 13:20:34
15 articles completed, short sleep...
resuming...
20 articles completed, short sleep...
resuming...
25 articles completed, short sleep...
resuming...
Sleeping for 5.0 seconds, 2024-09-01 13:23:34
Sleeping for 5.0 seconds, 2024-09-01 13:23:40
Sleeping for 6.1 seconds, 2024-09-01 13:23:45
Sleeping for 5.3 seconds, 2024-09-01 13:23:51


In [11]:
runtime = run_end - run_start - timedelta(seconds=sleeps)
print(runtime)

0:03:32.360173
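
The reported runtime excludes the deliberate pauses taken between articles. The pause schedule can be sketched as a pure function, which makes it easy to see how many seconds are subtracted for a given range of articles. The helper names `pause_seconds` and `total_pause` are hypothetical, the sketch assumes `start_index` is 0 (as in the run above) and that the slice contains every article in the range, and the wall-clock duration is a made-up value.

```python
from datetime import timedelta


def pause_seconds(article_number: int,
                  short_every: int = 5, long_every: int = 50,
                  short_pause: int = 30, long_pause: int = 90) -> int:
    """Seconds to pause after processing the article with this number."""
    if article_number == 0:
        return 0  # never pause after the very first article
    if article_number % long_every == 0:
        return long_pause  # the long pause takes precedence over the short one
    if article_number % short_every == 0:
        return short_pause
    return 0


def total_pause(start_index: int, end_index: int) -> int:
    """Total deliberate pause, in seconds, for articles in [start_index, end_index)."""
    return sum(pause_seconds(n) for n in range(start_index, end_index))


# Hypothetical wall-clock duration, for illustration only
wall_clock = timedelta(minutes=10)
net = wall_clock - timedelta(seconds=total_pause(0, 31))
print(net)
```

Only the pauses issued by the loop itself are counted here; any waiting done inside library calls is not part of this schedule.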


In [12]:
# We want to export the whole KG so pick an old date and export all entries
very_old_date = datetime(2023, 8, 11, 15, 25, 28, 569055)
my_kg.prepare_kg_nx_files(very_old_date, 'csv_outputs/nx', 'round4')

nx CSV files exported for entities.
nx CSV files exported for relations.
