## Merge and Filter ORG keywords

Both the Stanford and SpaCy NERs aggressively find and return ORG entities. We tried to limit the number of spurious entities by restricting extraction to the first 50 lines of each text file, but we still see extraneous words in the entities.

The first thing we do is build a term frequency dictionary from the words in the extracted ORG strings and find the most frequent ones. We then select a subset of these words as good indicators that an extracted ORG string is what we want.

The output is better, but still quite noisy. So we use two third-party dictionaries, one of the [world's top universities](https://github.com/endSly/world-universities-csv) and another of the [world's top 2000 companies](https://www.someka.net/excel-template/forbes-global-2000-list-2017/) according to Forbes, and load them into an Aho-Corasick trie structure. If an ORG string passes through the first (keyword) layer and also contains a university or company name from the trie, we add _the dictionary entry (not the actual ORG string)_ to the list of ORGs for the document. Hopefully this will keep our ORG facets clean and also allow for reliable similarity calculations based on ORG.
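
The two-layer idea can be sketched on toy data as follows. This is only an illustration under stated assumptions: `KEYWORDS`, `DICTIONARY` and `filter_orgs` are hypothetical names, and a plain substring scan stands in for the Aho-Corasick trie that is built later in this notebook.

```python
# Toy stand-ins for the keyword set and the dictionary (hypothetical values).
KEYWORDS = {"university", "institute"}
DICTIONARY = ["Osaka University", "Stanford University"]


def filter_orgs(org_strings, keywords=KEYWORDS, dictionary=DICTIONARY):
    """Return the dictionary entries found in ORG strings that pass the keyword layer."""
    kept = []
    for org in org_strings:
        # Layer 1: the ORG string must contain at least one keyword.
        words = set(org.lower().split())
        if not words & keywords:
            continue
        # Layer 2: keep the dictionary entry, not the raw ORG string.
        # A naive substring scan is used here in place of the trie.
        for name in dictionary:
            if name in org and name not in kept:
                kept.append(name)
    return kept


result = filter_orgs([
    "Suguru Arimoto Osaka University",
    "Markov PCA",
    "Stanford University",
])
print(result)
```

The noisy prefix "Suguru Arimoto" is dropped because only the dictionary entry is kept, and "Markov PCA" never reaches the second layer.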

In [1]:
import ahocorasick
import collections
import csv
import os
import nltk
import string

In [2]:
DATA_DIR = "../data"

TEXTFILES_DIR = os.path.join(DATA_DIR, "textfiles")
STANFORD_ORGS_DIR = os.path.join(DATA_DIR, "orgs_stanford")
SPACY_ORGS_DIR = os.path.join(DATA_DIR, "orgs_spacy")

DICT_UNIVS = os.path.join(DATA_DIR, "world-universities.csv")
DICT_CORPS = os.path.join(DATA_DIR, "forbes-global-2000-company-list.csv")

MERGED_FILTERED_ORGS_DIR = os.path.join(DATA_DIR, "orgs")

### Plausible ORG keywords

We run through the words in the extracted ORG strings and count how often each one occurs. The frequent ones become candidate keywords, which we will use to weed out bad ORGs suggested by the Stanford and SpaCy NERs.

In [3]:
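def get_orgs(orgfile):
    # Read one ORG string per line, stripping surrounding whitespace.
    # The context manager makes sure the file handle is closed.
    with open(orgfile, "r") as forg:
        return [line.strip() for line in forg]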


ctr = collections.Counter()
stopwords = set(nltk.corpus.stopwords.words("english"))
for textfile in os.listdir(TEXTFILES_DIR):
    doc_id = int(textfile.split(".")[0])
    stanford_orgs = get_orgs(os.path.join(STANFORD_ORGS_DIR, "{:d}.org".format(doc_id)))
    spacy_orgs = get_orgs(os.path.join(SPACY_ORGS_DIR, "{:d}.org".format(doc_id)))
    for org in stanford_orgs + spacy_orgs:
        words = org.lower().split(" ")
        for word in words:
            if word in stopwords:
                continue
            ctr[word] += 1

In [4]:
ctr.most_common(50)

[('', 28406),
 ('university', 14205),
 ('department', 7207),
 ('science', 6184),
 ('computer', 5806),
 ('institute', 4249),
 ('abstract', 3091),
 ('engineering', 2674),
 ('research', 2479),
 ('&', 2295),
 ('technology', 2066),
 ('neural', 1677),
 ('california', 1596),
 ('stanford', 1424),
 ('school', 1351),
 ('information', 1299),
 ('systems', 1269),
 ('electrical', 1233),
 ('learning', 1188),
 ('berkeley', 1172),
 ('center', 1126),
 ('sciences', 994),
 ('statistics', 948),
 ('cambridge', 846),
 ('college', 813),
 ('laboratory', 810),
 ('ca', 806),
 ('usa', 782),
 ('princeton', 759),
 ('markov', 747),
 ('pca', 743),
 ('mellon', 721),
 ('carnegie', 715),
 ('massachusetts', 676),
 ('microsoft', 651),
 ('processing', 641),
 ('computational', 623),
 ('toronto', 595),
 ('national', 567),
 ('neuroscience', 566),
 ('machine', 563),
 ('new', 551),
 ('mit', 533),
 (',', 532),
 ('google', 529),
 ('computing', 523),
 ('london', 512),
 ('dept.', 510),
 ('psychology', 499),
 ('group', 494)]
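
The top entry is the empty string, and tokens such as `,`, `&` and `dept.` also appear. These most likely come from splitting on a single space, which yields empty tokens for blank lines and repeated spaces and leaves punctuation attached. A minimal sketch of a cleaner tokenizer, assuming whitespace splitting plus punctuation stripping is acceptable here; `tokenize` and `sample_orgs` are hypothetical names, not part of the notebook, and stopword removal is left out for brevity:

```python
import collections
import string

# Translation table that deletes every ASCII punctuation character.
punct_table = str.maketrans("", "", string.punctuation)


def tokenize(org_str):
    """Lowercase, split on any whitespace, strip punctuation, drop empty tokens."""
    tokens = (w.translate(punct_table) for w in org_str.lower().split())
    return [t for t in tokens if t]


sample_orgs = ["", "Dept. of  Computer Science", "AT&T Bell Labs ,"]

# Splitting on a single space, as in the cell above, produces empty tokens.
naive = collections.Counter(
    w for org in sample_orgs for w in org.lower().split(" "))

# The cleaner tokenizer produces neither empty nor punctuation-only tokens.
clean = collections.Counter(
    w for org in sample_orgs for w in tokenize(org))

print(naive[""], clean[""])
```

The counts shown above were produced with the original splitting, so they are left unchanged.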

The most common words include noise (for example abstract, markov and pca), but also many words that we can use to clean up the ORGs found by the Stanford and SpaCy NERs. We group the useful ones as follows.

    institutions: 
        university, department, school, laboratory, laboratories, group, unit, division, 
        dept, center, college, 
    subjects:
        engineering, science, sciences, computer,  mathematics, physics, biological, 
        medical, statistics, psychology,
    famous colleges:
        stanford, berkeley, cambridge, princeton, carnegie, massachusetts, national, 
        toronto, mit, london, pittsburgh, columbia, hebrew
    famous companies:
        google, microsoft, bell,
    other ORG keywords:
        research, technology, learning, systems, computational, computing, 
        computation, recognition


In [5]:
common_words = set([
    "university", "college",
    "department", "school", "laboratory", "laboratories", 
    "group", "unit", "division", "dept", "center", 
    "engineering", "science", "sciences", "computer", "mathematics", "physics", 
    "biological", "medical", "statistics", "psychology",
    # "massachusetts" was misspelled and therefore never matched
    "stanford", "berkeley", "cambridge", "princeton", "carnegie", "massachusetts", 
    "national", "toronto", "mit", "london", "pittsburgh", "columbia", "hebrew",
    "google", "microsoft", "bell",
    "research", "technology", "systems", "computational", "computing", 
    "computation", "recognition",
    "learning", 
])

In [6]:
def contains_common_word(org_str, punct_table, common_words):
    words = set([w.translate(punct_table) for w in org_str.lower().split(" ")])
    return len(words.intersection(common_words)) > 0

punct_table = str.maketrans({key: None for key in string.punctuation})
i = 0
for textfile in os.listdir(TEXTFILES_DIR):
    doc_id = int(textfile.split(".")[0])
    stanford_orgs = get_orgs(os.path.join(STANFORD_ORGS_DIR, "{:d}.org".format(doc_id)))
    spacy_orgs = get_orgs(os.path.join(SPACY_ORGS_DIR, "{:d}.org".format(doc_id)))
    for org in stanford_orgs + spacy_orgs:
        if contains_common_word(org, punct_table, common_words):
            if i <= 30:
                print(org)
        i += 1

Suguru Arimoto Osaka University
University, Toyonaka
Stanford University
Stanford
National Science Foundation
American Institute of Physics
Stanford University
American Institute of Physics
the National Science Foundation
Stanford
AT&T Bell Labs


### Set up Dictionary Filter

Matching against the dictionaries does a good job of producing clean ORG annotations, but it also loses some standard names that are not spelled the way the dictionaries spell them, such as The Hebrew University, Caltech, etc. We can scan the misses and add them manually to the automaton.
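
One way to handle such misses is an alias table that maps a variant spelling to the name we want to record. The sketch below is illustrative only: `ALIASES` and `canonicalize` are hypothetical names, the canonical spellings are assumptions that may not match the dictionary's own entries, and a substring scan stands in for the automaton.

```python
# Hypothetical alias table: variant spelling -> name to record for the document.
ALIASES = {
    "Caltech": "California Institute of Technology",
    "Hebrew University": "Hebrew University of Jerusalem",
}


def canonicalize(org_str, aliases=ALIASES):
    """Return the canonical names of all aliases that occur in org_str."""
    return [canonical for alias, canonical in aliases.items()
            if alias in org_str]


print(canonicalize("Caltech, Pasadena"))
print(canonicalize("The Hebrew University"))
```

With the automaton built in the next cell, the same pairs could instead be registered with `add_word(alias, (idx, canonical))` before `make_automaton()` is called, so that a match on the alias returns the canonical name.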

In [7]:
org_dict = ahocorasick.Automaton()
idx = 0
with open(DICT_UNIVS, "r") as f:
    csv_reader = csv.DictReader(f, delimiter=",", 
                                fieldnames=["abbr", "name", "url"])
    for row in csv_reader:
        org_dict.add_word(row["name"], (idx, row["name"]))
        idx += 1

with open(DICT_CORPS, "r") as f:
    csv_reader = csv.DictReader(f, delimiter=",")
    for row in csv_reader:
        org_dict.add_word(row["Company Name"], (idx, row["Company Name"]))
        idx += 1

org_dict.make_automaton()
print("built dictionary trie with {:d} entries".format(len(org_dict)))

built dictionary trie with 11275 entries


In [8]:
def contains_name(org_str, org_dict):
    return [item[1][1] for item in org_dict.iter(org_str)]

print(contains_name("Suguru Arimoto Osaka University", org_dict))

['Osaka University']
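
The automaton's `iter` yields `(end_index, value)` pairs and reports a hit wherever a dictionary name occurs, including inside a longer word. If that turns out to be a source of noise, a word-boundary check could be applied to each hit. The following is a sketch under that assumption; `is_whole_word_match` is a hypothetical helper, and the example positions are computed with `str.find` rather than with the automaton.

```python
def is_whole_word_match(text, end_index, name):
    """Check that the match ending at end_index is not embedded in a longer word."""
    start = end_index - len(name) + 1
    # The character before the match (if any) must not be alphanumeric.
    before_ok = start == 0 or not text[start - 1].isalnum()
    # The character after the match (if any) must not be alphanumeric.
    after_ok = end_index == len(text) - 1 or not text[end_index + 1].isalnum()
    return before_ok and after_ok


text = "Suguru Arimoto Osaka University"
name = "Osaka University"
# Index of the last character of the match, as the automaton would report it.
end_index = text.find(name) + len(name) - 1
print(is_whole_word_match(text, end_index, name))

# A hypothetical short name that matches inside a longer word is rejected.
print(is_whole_word_match("Engineering", 2, "Eng"))
```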


### Apply filter to all keywords

In [9]:
if not os.path.exists(MERGED_FILTERED_ORGS_DIR):
    os.mkdir(MERGED_FILTERED_ORGS_DIR)

In [10]:
num_written = 0
for textfile in os.listdir(TEXTFILES_DIR):
    if num_written % 1000 == 0:
        print("{:d} org files written".format(num_written))
    doc_id = int(textfile.split(".")[0])
    mf_orgfile = os.path.join(MERGED_FILTERED_ORGS_DIR, "{:d}.org".format(doc_id))
    if os.path.exists(mf_orgfile):
        num_written += 1
        continue
    stanford_orgs = get_orgs(os.path.join(STANFORD_ORGS_DIR, "{:d}.org".format(doc_id)))
    spacy_orgs = get_orgs(os.path.join(SPACY_ORGS_DIR, "{:d}.org".format(doc_id)))
    with open(mf_orgfile, "w") as fout:
        already_seen = set()
        for org in stanford_orgs + spacy_orgs:
            # Layer 1: keep only ORG strings that contain a known keyword.
            if not contains_common_word(org, punct_table, common_words):
                continue
            # Layer 2: write each matching dictionary entry once per document.
            org_entities = contains_name(org, org_dict)
            for org_entity in org_entities:
                if org_entity in already_seen:
                    continue
                fout.write("{:s}\n".format(org_entity))
                already_seen.add(org_entity)
    num_written += 1

print("{:d} org files written, COMPLETE".format(num_written))

0 org files written
1000 org files written
2000 org files written
3000 org files written
4000 org files written
5000 org files written
6000 org files written
7000 org files written
7238 org files written, COMPLETE
