## Annotate Stories against KG

In this notebook, we will annotate 4 random stories scraped manually from [ScienceDaily.com](https://www.sciencedaily.com/) against a company-proprietary Knowledge Graph (KG).

The KG is provided as a TSV file of nodes (concepts) and edges (relationships). We will use the node file to construct a data structure, called an automaton, to support the [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm). Conceptually, the data structure maps concept names (and synonyms) to the concept ID. As an article text is streamed against it, the algorithm will capture the text spans that match the concept names in the structure.

Output is a single TSV file containing the story ID, the concept ID, and the number of times the concept ID was found in the story.

In [1]:
import ahocorasick
import pandas as pd
import operator
import os
import string

In [2]:
DATA_DIR = "../data"

VERTEX_FILEPATH = os.path.join(DATA_DIR, "emmet-vertices.tsv")
CONCEPTMAP_FILEPATH = os.path.join(DATA_DIR, "story-concepts.tsv")

### Build the Aho-Corasick Automaton

We filter out `qualifier` and `event` type concepts since they are somewhat noisy. We also remove acronyms that are 2 characters or less in size. Both are attempts to reduce noisy matches.

In [3]:
def build_annotator(vertices_file):
    A = ahocorasick.Automaton()
    fvert = open(vertices_file, "r")
    for line in fvert:
        cols = line.strip().split('\t')
        if len(cols) != 5:
            continue
        if cols[-1] == "qualifier" or cols[-1] == "event":
            continue
        cid = cols[0]
        syns = cols[1]
        for syn in syns.split('|'):
            if len(syn) < 3:
                continue
            A.add_word(syn, (cid, syn))

    fvert.close()
    A.make_automaton()    
    return A

annotator = build_annotator(VERTEX_FILEPATH)

### Read Story, Annotate, Write Concept Frequencies

We loop through each `.story` file, reading the text and doing some cleanup on it. 

The `get_story_text` function reads each file, and performs some basic clean-up, mainly removing punctuations from the text. The other thing is to put a leading and trailing space in the text, this is to support the whole word condition in the `annotate_text` function.

In [4]:
def get_story_text(story_file):
    text_lines = []
    ftext = open(os.path.join(DATA_DIR, story_file))
    for line in ftext:
        text_lines.append(line.strip())
    ftext.close()
    text = " ".join(text_lines)
    # remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    # add space in front and back (for word check below)
    text = " " + text + "  "
    return text


def annotate_text(annotator, text, debug=False):
    matched_concepts = {}
    for end_index, (idx, orig_value) in annotator.iter(text):
        # make sure word identified is not part of another word
        start_index = end_index - len(orig_value) + 1
        if text[start_index - 1] != ' ' or text[end_index + 2] != ' ':
            continue
        if debug:
            print(start_index, end_index, idx, orig_value)
        if idx in matched_concepts.keys():
            matched_concepts[idx] += 1
        else:
            matched_concepts[idx] = 1
    concept_counts = sorted(
        [(k, matched_concepts[k]) for k in matched_concepts.keys()],
        key=operator.itemgetter(1), reverse=True)
    return concept_counts

In [5]:
fconcepts = open(CONCEPTMAP_FILEPATH, "w")
for story_filename in os.listdir(DATA_DIR):
    if not story_filename.endswith(".story"):
        continue
    print("Processing {:s}...".format(story_filename))
    story_id = story_filename.split('.')[0]
    text = get_story_text(os.path.join(DATA_DIR, story_filename))
    concept_map = annotate_text(annotator, text)
    for cid, count in concept_map:
        fconcepts.write("{:s}\t{:s}\t{:d}\n".format(story_id, cid, count))

fconcepts.close()

Processing 190823140729.story...
Processing 190904194433.story...
Processing 190909193211.story...
Processing 190916092109.story...


### Verify Output

In [6]:
story_concepts_df = pd.read_csv(CONCEPTMAP_FILEPATH,
    delimiter="\t",
    names=["story_id", "concept_id", "concept_count"])
story_concepts_df.head()

Unnamed: 0,story_id,concept_id,concept_count
0,190823140729,8001550,15
1,190823140729,9199226,3
2,190823140729,2795416,2
3,190823140729,9790284,2
4,190823140729,8837843,2
