# Converting AIF To Pandas
This notebook shows how to convert an AIDA TA1 AIF file into Pandas-friendly files that are easier to work with programmatically.

In [1]:
import numpy as np
import pandas as pd
import os
import io
from IPython.display import display, HTML, Image

### Before you start
All the examples in this notebook read their input from the sample_data/aida folder so that each cell can be run independently.

We create a results folder inside it to hold the output of each KGTK operation. This way, if a cell produces an error, you can continue browsing the notebook.

In [2]:
!mkdir -p sample_data/aida/results


### Convert AIF triples to TSV KGTK format

In [3]:
!head sample_data/aida/HC00001DO.ttl.nt

<http://www.isi.edu/gaia/entities/e34874a6-a857-4f14-8aee-9947d3e9caaf> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#Entity> .
<http://www.isi.edu/gaia/entities/e34874a6-a857-4f14-8aee-9947d3e9caaf> <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#informativeJustification> _:b0 .
<http://www.isi.edu/gaia/entities/e34874a6-a857-4f14-8aee-9947d3e9caaf> <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#justifiedBy> _:b1 .
<http://www.isi.edu/gaia/entities/e34874a6-a857-4f14-8aee-9947d3e9caaf> <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#privateData> _:g0 .
_:g0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#PrivateData> .
_:g0 <https://tac.nist.gov/tracks/SM-KBP/2019/ontologies/InterchangeOntology#jsonContent> "{\"fileType\":\"en\"}"^^<http://www.w3.org/2001/XMLSchema#

**Define prefixes to compress the URIs**

In [4]:
pd.read_csv("sample_data/aida/aida-namespaces.tsv", delimiter='\t')

Unnamed: 0,node1,label,node2
0,entity,prefix_expansion,http://www.isi.edu/gaia/entities/
1,relation,prefix_expansion,http://www.isi.edu/gaia/relations/
2,event,prefix_expansion,http://www.isi.edu/gaia/events/
3,rdf,prefix_expansion,http://www.w3.org/1999/02/22-rdf-syntax-ns#
4,ont,prefix_expansion,https://tac.nist.gov/tracks/SM-KBP/2019/ontolo...
5,rpi,prefix_expansion,http://www.rpi.edu/
6,xml-schema-type,prefix_expansion,http://www.w3.org/2001/XMLSchema#
7,columbia,prefix_expansion,http://www.columbia.edu/
8,isi,prefix_expansion,http://www.isi.edu/
9,isi1,prefix_expansion,www.isi.edu/


**Import the AIF triples**

In [5]:
!kgtk import-ntriples -i sample_data/aida/HC00001DO.ttl.nt \
  --namespace-file sample_data/aida/aida-namespaces.tsv \
  --namespace-id-use-uuid True \
  --local-namespace-use-uuid False \
  --local-namespace-prefix _ \
  --newnode-use-uuid True  \
  / sort \
  > sample_data/aida/results/HC00001DO.ttl.tsv
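
To make the effect of the namespace file concrete, here is a minimal sketch of prefix compression in plain Python. It is an illustration only, not KGTK's implementation: `compress_uri` is a hypothetical helper, and `NAMESPACES` copies a few untruncated rows from the namespace table above.

```python
# Hypothetical illustration of prefix compression; not KGTK's actual code.
NAMESPACES = {
    "entity": "http://www.isi.edu/gaia/entities/",
    "relation": "http://www.isi.edu/gaia/relations/",
    "event": "http://www.isi.edu/gaia/events/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def compress_uri(uri, namespaces=NAMESPACES):
    """Replace the longest matching namespace expansion with 'prefix:'."""
    bare = uri.strip("<>")  # N-Triples wraps URIs in angle brackets
    best = None
    for prefix, expansion in namespaces.items():
        if bare.startswith(expansion):
            if best is None or len(expansion) > len(namespaces[best]):
                best = prefix
    if best is None:
        return uri  # unknown namespace: leave the URI unchanged
    return f"{best}:{bare[len(namespaces[best]):]}"


compressed = compress_uri(
    "<http://www.isi.edu/gaia/entities/e34874a6-a857-4f14-8aee-9947d3e9caaf>"
)
rdf_type = compress_uri("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
print(compressed, rdf_type)
```

With these assumptions, the first triple of the sample file would read roughly as `entity:e34874a6-... rdf:type ...` once the subject and predicate are compressed.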

**Reified information is cumbersome to work with**

In [6]:
ta1 = pd.read_csv("sample_data/aida/results/HC00001DO.ttl.tsv", delimiter='\t')
display(HTML(ta1.loc[ta1.node1 =='_:g10'].to_html()))

EmptyDataError: No columns to parse from file

## Simplify the KG

**What we want is an easy-to-understand representation that is close to the diagrams people want to see**

<img src="https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/examples/images/aida-event-graph.png" width=700/>

**Undo the reification, and put the justifications as annotations on the semantic edges**

In [None]:
!kgtk unreify-rdf-statements -i sample_data/aida/results/HC00001DO.ttl.tsv \
  / sort --columns 1,2 \
  >  sample_data/aida/results/HC00001DO.ttl.unreified.tsv
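
The idea behind unreification can be sketched in Pandas on a toy example. This is a rough illustration under stated assumptions, not KGTK's implementation: the node names (`_:s1`, `event:e1`, `ont:exampleRole`, `entity:x1`, `_:j1`) are made up, and the sketch assumes each statement node has exactly one subject, predicate, and object.

```python
import pandas as pd

# Toy reified edges; every node name here is made up for illustration.
reified = pd.DataFrame(
    [
        ("_:s1", "rdf:type", "rdf:Statement"),
        ("_:s1", "rdf:subject", "event:e1"),
        ("_:s1", "rdf:predicate", "ont:exampleRole"),
        ("_:s1", "rdf:object", "entity:x1"),
        ("_:s1", "ont:justifiedBy", "_:j1"),
    ],
    columns=["node1", "label", "node2"],
)

parts = ["rdf:subject", "rdf:predicate", "rdf:object"]
is_part = reified["label"].isin(parts)

# One row per statement node, with subject/predicate/object as columns.
wide = reified[is_part].pivot(index="node1", columns="label", values="node2")

# Build the direct edge; the statement node becomes the edge id.
direct = wide.reset_index().rename(
    columns={
        "node1": "id",
        "rdf:subject": "node1",
        "rdf:predicate": "label",
        "rdf:object": "node2",
    }
)[["node1", "label", "node2", "id"]]
direct.columns.name = None

# Everything else about the statement (e.g. justifications) stays
# attached to that id as an annotation on the direct edge.
is_type = (reified["label"] == "rdf:type") & (reified["node2"] == "rdf:Statement")
annotations = reified[~is_part & ~is_type].reset_index(drop=True)

print(direct)
print(annotations)
```

Five reified rows collapse into one semantic edge plus one annotation edge whose `node1` is the edge id.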

**Events now have direct edges to the role fillers (orange diamonds), and the justifications are in the id object**

In [None]:
unreified = pd.read_csv("sample_data/aida/HC00001DO.ttl.unreified.tsv", delimiter='\t')
unreified.loc[unreified.node1 == 'event:fd2323ad-b9c6-4b57-9228-8579b52475c8']

**The relations are also objects with direct links to the entities (green diamonds)**

In [None]:
unreified.loc[unreified.node1 == 'relation:4b8f6334-dbc1-4186-8d9e-a04d864d9a9d']

## Create files to work with in TA2

**We want Pandas-friendly files with a single row for each entity, relation, and event.**

For initial analysis, let's remove justifications, etc.

In [None]:
!kgtk filter \
  --invert \
  -p ';ont:justifiedBy,ont:privateData,ont:system,ont:informativeJustification;' sample_data/aida/results/HC00001DO.ttl.unreified.tsv \
  > sample_data/aida/results/HC00001DO.ttl.unreified.nojust.tsv
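
For readers more familiar with Pandas than with KGTK, the inverted filter above behaves roughly like the following sketch. The toy edges are made up for illustration; only the four dropped labels come from the cell above.

```python
import pandas as pd

# Toy edge table; the rows are made up for illustration.
edges = pd.DataFrame(
    [
        ("entity:x1", "rdf:type", "ont:Entity"),
        ("entity:x1", "ont:justifiedBy", "_:j1"),
        ("entity:x1", "ont:privateData", "_:g0"),
        ("entity:x1", "ont:hasName", "Example"),
    ],
    columns=["node1", "label", "node2"],
)

# Labels listed in the filter pattern; --invert keeps everything else.
dropped_labels = [
    "ont:justifiedBy",
    "ont:privateData",
    "ont:system",
    "ont:informativeJustification",
]
nojust = edges[~edges["label"].isin(dropped_labels)].reset_index(drop=True)
print(nojust)
```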

**Split into a separate file for each of entities, relations and events**

In [None]:
!kgtk filter -p ';rdf:type;ont:Entity' sample_data/aida/results/HC00001DO.ttl.unreified.tsv > sample_data/aida/results/HC00001DO.entity_ids.tsv
!kgtk filter -p ';rdf:type;ont:Event' sample_data/aida/results/HC00001DO.ttl.unreified.tsv > sample_data/aida/results/HC00001DO.event_ids.tsv
!kgtk filter -p ';rdf:type;ont:Relation' sample_data/aida/results/HC00001DO.ttl.unreified.tsv > sample_data/aida/results/HC00001DO.relation_ids.tsv

In [None]:
# Get all entities from the unreified file
!kgtk ifexists \
    --input-keys node1 \
    --filter-keys node1 \
    --filter-on sample_data/aida/HC00001DO.entity_ids.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.nojust.tsv \
  / sort --columns 1,2 \
  > sample_data/aida/results/HC00001DO.entities.tsv

# Get all events from the unreified file
!kgtk ifexists \
    --input-keys node1 \
    --filter-keys node1 \
    --filter-on sample_data/aida/HC00001DO.event_ids.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.nojust.tsv \
  / sort --columns 1,2 \
  > sample_data/aida/results/HC00001DO.events.tsv

# Get all relations from the unreified file
!kgtk ifexists \
    --input-keys node1 \
    --filter-keys node1 \
    --filter-on sample_data/aida/HC00001DO.relation_ids.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.nojust.tsv \
  / sort --columns 1,2 \
  > sample_data/aida/results/HC00001DO.relations.tsv
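
In Pandas terms, each `ifexists` step above keeps the edges whose `node1` appears in the id file, then sorts by the first two columns. A minimal sketch on made-up rows, assuming the first two columns are `node1` and `label`:

```python
import pandas as pd

columns = ["node1", "label", "node2"]

# Toy stand-ins for the id file and the edge file; rows are made up.
entity_ids = pd.DataFrame(
    [("entity:x1", "rdf:type", "ont:Entity")], columns=columns
)
edges = pd.DataFrame(
    [
        ("event:e1", "rdf:type", "ont:Event"),
        ("entity:x1", "rdf:type", "ont:Entity"),
        ("entity:x1", "ont:hasName", "Example"),
    ],
    columns=columns,
)

# Keep edges whose node1 occurs as node1 in the id file, then sort.
entity_edges = (
    edges[edges["node1"].isin(entity_ids["node1"])]
    .sort_values(["node1", "label"])
    .reset_index(drop=True)
)
print(entity_edges)
```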

**Little hack: replace ont:hasName and ont:textValue with label**

In [None]:
!sed 's/ont:hasName/label/' sample_data/aida/results/HC00001DO.entities.tsv \
  | sed 's/ont:textValue/label/' \
  > sample_data/aida/results/HC00001DO.entities.renamed.tsv 
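
The `sed` commands above rewrite the first match anywhere on each line. A Pandas variant that only touches the `label` column might look like the sketch below; the rows are made up, and restricting the rename to one column is an assumption about the intent rather than an exact equivalent of `sed`.

```python
import pandas as pd

# Toy entity edges; the rows are made up for illustration.
entity_edges = pd.DataFrame(
    [
        ("entity:x1", "ont:hasName", "Example"),
        ("entity:x1", "ont:textValue", "example text"),
        ("entity:x1", "rdf:type", "ont:Entity"),
    ],
    columns=["node1", "label", "node2"],
)

# Exact-match replacement, applied to the label column only.
renamed = entity_edges.assign(
    label=entity_edges["label"].replace(
        {"ont:hasName": "label", "ont:textValue": "label"}
    )
)
print(renamed)
```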

**Remove the type edges as they do not provide useful info (e.g., we know, by construction, the entities file contains entities)**

In [None]:
!kgtk filter \
  --invert \
  -p ';;ont:Entity' sample_data/aida/results/HC00001DO.entities.renamed.tsv \
  > sample_data/aida/results/HC00001DO.entities.notype.tsv
!kgtk filter \
  --invert \
  -p ';;ont:Relation' sample_data/aida/results/HC00001DO.relations.tsv \
  > sample_data/aida/results/HC00001DO.relations.notype.tsv
!kgtk filter \
  --invert \
  -p ';;ont:Event' sample_data/aida/results/HC00001DO.events.tsv \
  > sample_data/aida/results/HC00001DO.events.notype.tsv

## Let's make a file that has one entity per row
**Start by lifting the labels into a column**

In [None]:
!kgtk lift --suppress-empty-columns True sample_data/aida/results/HC00001DO.entities.notype.tsv / sort > sample_data/aida/results/HC00001DO.entities.labels.tsv

In [None]:
entities = pd.read_csv("sample_data/aida/results/HC00001DO.entities.labels.tsv", delimiter='\t')
entities
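
Conceptually, lifting looks up the `label` edge of each node and copies its value into an extra column. The sketch below shows the idea in Pandas on made-up rows. It is not KGTK's implementation: the column name `node1;label` and the `|` separator for multiple labels are assumptions made for this illustration.

```python
import pandas as pd

# Toy edges; names and values are made up for illustration.
edges = pd.DataFrame(
    [
        ("entity:x1", "ont:exampleProperty", "value-a"),
        ("entity:x1", "label", "Alice"),
        ("entity:x2", "ont:exampleProperty", "value-b"),
    ],
    columns=["node1", "label", "node2"],
)

# Collect the label values per node; join multiple labels with "|".
is_label = edges["label"] == "label"
label_of = edges[is_label].groupby("node1")["node2"].agg("|".join)

# Copy each node's label into a new column next to its edges.
lifted = edges.copy()
lifted["node1;label"] = lifted["node1"].map(label_of)
print(lifted)
```

Nodes without a label edge, such as `entity:x2` here, end up with a missing value in the new column.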

**Now lift the LDC link targets into a separate column. This is a bit complicated because of the extra level of reification.**

In [None]:
!kgtk lift \
    --suppress-empty-columns True \
    --label-value ont:linkTarget \
    --lift-suffix ';temp' \
    --label-file sample_data/aida/results/HC00001DO.ttl.unreified.tsv \
    sample_data/aida/HC00001DO.entities.labels.tsv \
  / lift \
    --suppress-empty-columns True \
    --label-value ont:link \
    --lift-suffix ';linkTarget' \
    --node2-name 'node2;temp' \
  / sort \
  / remove_columns  -c 'node2;temp' \
  > sample_data/aida/results/HC00001DO.entities.labels.linktargets.tsv

In [None]:
entities = pd.read_csv("sample_data/aida/results/HC00001DO.entities.labels.linktargets.tsv", delimiter='\t')
entities

**Fraction of entities that have labels or link targets**

In [None]:
# Fraction of non-null values in each column
entities.notna().mean().round(3)

**Distribution of types**

In [None]:
entities['node2'].value_counts()

**Add the labels of the entities to the event file**

In [None]:
!kgtk filter \
  -p ';label;' sample_data/aida/results/HC00001DO.entities.renamed.tsv \
  > sample_data/aida/results/HC00001DO.entities.renamed.labels.tsv

In [None]:
!kgtk join sample_data/aida/results/HC00001DO.events.notype.tsv sample_data/aida/results/HC00001DO.entities.renamed.labels.tsv \
  --left-join \
  --left-file-join-columns node2 \
  --right-file-join-columns node1 \
  / lift --suppress-empty-columns \
  > sample_data/aida/results/HC00001DO.events.notype.entity-labels.tsv

In [None]:
events = pd.read_csv("sample_data/aida/results/HC00001DO.events.notype.entity-labels.tsv", delimiter='\t')
display(HTML(events[:10].to_html()))

In [None]:
events['node1'].value_counts()[:10]

### Work with clusters

In [None]:
!kgtk filter -p ';ont:clusterMember;' sample_data/aida/results/HC00001DO.ttl.unreified.tsv > sample_data/aida/results/HC00001DO.ttl.clusters.tsv

In [None]:
!kgtk join sample_data/aida/HC00001DO.ttl.clusters.tsv sample_data/aida/results/HC00001DO.entities.notype.tsv \
  --left-file-join-columns node2 \
  --right-file-join-columns node1 \
  > sample_data/aida/results/HC00001DO.cluster.ids.entities.tsv 
!kgtk join sample_data/aida/HC00001DO.ttl.clusters.tsv sample_data/aida/results/HC00001DO.relations.notype.tsv \
  --left-file-join-columns node2 \
  --right-file-join-columns node1 \
  > sample_data/aida/results/HC00001DO.cluster.ids.relations.tsv 
!kgtk join sample_data/aida/HC00001DO.ttl.clusters.tsv sample_data/aida/results/HC00001DO.events.notype.tsv \
  --left-file-join-columns node2 \
  --right-file-join-columns node1 \
  > sample_data/aida/results/HC00001DO.cluster.ids.events.tsv 

In [None]:
!kgtk ifexists \
  --input-keys node1 \
  --filter-keys node1 \
  --filter-on sample_data/aida/results/HC00001DO.cluster.ids.entities.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.tsv \
  > sample_data/aida/results/HC00001DO.cluster.entities.tsv 
!kgtk ifexists \
  --input-keys node1 \
  --filter-keys node1 \
  --filter-on sample_data/aida/results/HC00001DO.cluster.ids.relations.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.tsv \
  > sample_data/aida/results/HC00001DO.cluster.relations.tsv 
!kgtk ifexists \
  --input-keys node1 \
  --filter-keys node1 \
  --filter-on sample_data/aida/results/HC00001DO.cluster.ids.events.tsv \
    sample_data/aida/HC00001DO.ttl.unreified.tsv \
  > sample_data/aida/results/HC00001DO.cluster.events.tsv 

### Create an edge file with ids to load into Wikidata SPARQL and browse using SQID

In [None]:
!kgtk add_id sample_data/aida/results/HC00001DO.ttl.unreified.tsv  > sample_data/aida/results/HC00001DO.ttl.unreified.ids.tsv

In [None]:
# Read KGTK results into lines and directly into Pandas
# lines = !kgtk filter -p ';prefix_expansion;' ta1/HC00001DO/HC00001DO.ttl.tsv
# pd.read_csv(io.StringIO('\n'.join(lines)), delimiter='\t')
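
The commented-out pattern above can be tried without running KGTK. In the sketch below, `lines` is a hand-written stand-in for the list that `lines = !kgtk ...` would capture in IPython; its rows are copied from the namespace table earlier in the notebook.

```python
import io

import pandas as pd

# Stand-in for the captured output of a `!kgtk ...` shell command:
# a list of tab-separated lines, header first.
lines = [
    "node1\tlabel\tnode2",
    "entity\tprefix_expansion\thttp://www.isi.edu/gaia/entities/",
    "rdf\tprefix_expansion\thttp://www.w3.org/1999/02/22-rdf-syntax-ns#",
]

# Join the lines back into one text buffer and parse it as TSV.
df = pd.read_csv(io.StringIO("\n".join(lines)), delimiter="\t")
print(df)
```

This avoids writing an intermediate file when the KGTK output is small enough to hold in memory.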