## Extracting ORGs from papers using NLTK and Stanford NER

This notebook is based on the blog post [Named Entity Recognition with Stanford NER Tagger](https://pythonprogramming.net/named-entity-recognition-stanford-ner-tagger/) by Chuck Dishmon.

We scan the text of each paper for organization names and use them much like keywords and authors, i.e., as yet another facet (or as a feature for content similarity going forward).

In [1]:
import nltk
import os
import re

In [2]:
DATA_DIR = "../data"
MODELS_DIR = "../models"

STANFORD_MODELS = os.path.join(MODELS_DIR, "stanford-ner-2018-02-27")

STANFORD_NER_MODEL = os.path.join(STANFORD_MODELS, "stanford-ner.jar")
STANFORD_CRF_MODEL = os.path.join(STANFORD_MODELS, "classifiers",
                                  "english.all.3class.distsim.crf.ser.gz")

TEXTFILES_DIR = os.path.join(DATA_DIR, "textfiles")
ORGS_STANFORD_DIR = os.path.join(DATA_DIR, "orgs_stanford")

### Entity Extractor

NLTK's StanfordNERTagger wraps the Stanford NER Java model.

In [3]:
tagger = nltk.tag.StanfordNERTagger(STANFORD_CRF_MODEL, STANFORD_NER_MODEL, 
                                    encoding='utf-8')

In [4]:
def extract_entities(tagger, text, debug=False):
    """Return ORGANIZATION entities in text, joining adjacent tagged tokens."""
    entities, entity = [], []
    for sid, sent in enumerate(nltk.sent_tokenize(text)):
        tokens = nltk.word_tokenize(sent)
        tagged = tagger.tag(tokens)
        prev_tid = None  # position of the last ORGANIZATION token in this sentence
        for tid, (token, tag) in enumerate(tagged):
            if tag == "ORGANIZATION":
                if debug:
                    print(sid, tid, token, tag)
                # a gap (or a new sentence) ends the entity being built
                if prev_tid is None or prev_tid + 1 < tid:
                    if len(entity) > 0:
                        entities.append(" ".join(entity))
                    entity = []
                entity.append(token)
                prev_tid = tid
    # flush the last entity, if any
    if len(entity) > 0:
        entities.append(" ".join(entity))
    return entities


text = """Yann Le Cun, a native of France was not even 30 when he joined AT&T 
Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new 
machine learning methods, including the convolutional neural network—modeled 
after the visual cortex in animals. Today, he serves as chief AI scientist at
Facebook, where he works tirelessly towards new breakthroughs."""
# collapse newlines and repeated whitespace into single spaces
text = re.sub(r"\s+", " ", text)
print(text)

entities = extract_entities(tagger, text)
print(entities)

Yann Le Cun, a native of France was not even 30 when he joined AT&T Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new machine learning methods, including the convolutional neural network—modeled after the visual cortex in animals. Today, he serves as chief AI scientist at Facebook, where he works tirelessly towards new breakthroughs.
['AT & T Bell Laboratories', 'Bell Labs', 'Facebook']
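
The grouping step inside extract_entities can also be looked at in isolation, without the Java tagger. The sketch below is not the notebook's implementation: it uses itertools.groupby to join runs of consecutive ORGANIZATION tokens, and the function name group_entities and the hand-written tagged list are assumptions made for illustration.

```python
from itertools import groupby


def group_entities(tagged, label="ORGANIZATION"):
    """Join each run of consecutive tokens tagged `label` into one string."""
    entities = []
    # groupby splits the sequence into runs where the key (tag == label) is constant
    for is_match, run in groupby(tagged, key=lambda pair: pair[1] == label):
        if is_match:
            entities.append(" ".join(token for token, _ in run))
    return entities


# Hand-written (token, tag) pairs for one sentence, not real tagger output.
tagged = [
    ("he", "O"), ("joined", "O"),
    ("AT", "ORGANIZATION"), ("&", "ORGANIZATION"), ("T", "ORGANIZATION"),
    ("Bell", "ORGANIZATION"), ("Laboratories", "ORGANIZATION"),
    ("in", "O"), ("New", "LOCATION"), ("Jersey", "LOCATION"), (".", "O"),
]
print(group_entities(tagged))  # ['AT & T Bell Laboratories']
```

Applying this to one sentence at a time keeps an entity that ends a sentence from merging with one that starts the next.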


### Run ORG extractor for all files

__NOTE:__ Running the above code on the contents of TEXTFILES_DIR is too slow. It is faster to run the Java command on the individual files from a shell script and then process the output to extract ORGs. The command to run the extractor from the shell:

    cd ../models/stanford-ner-2018-02-27
    java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -outputFormat tabbedEntities \
        -textFile ../../data/textfiles/1.txt > ../../data/orgs_stanford/1.org
    # generate my_script.sh (see the create_script cell below), then run it
    bash my_script.sh

The output is written to a separate folder, ORGS_STANFORD_DIR. At the end of the process, we collect these files into a single file as a sanity check.
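
An alternative that stays inside the notebook may also be worth trying. As far as we can tell, NLTK's tagger starts a Java process on every tag() call, so tagging sentence by sentence pays that startup cost repeatedly; the tagger also provides tag_sents(), which tags a list of tokenized sentences in a single call. The sketch below batches sentences through tag_sents(). The function name tag_sentences_batched and the batch size of 500 are assumptions, and we have not benchmarked this against the shell approach.

```python
def tag_sentences_batched(tagger, tokenized_sentences, batch_size=500):
    """Tag many tokenized sentences using one tag_sents() call per batch.

    tagger is any object with a tag_sents(list_of_token_lists) method,
    such as the StanfordNERTagger instance created earlier.
    """
    tagged_sentences = []
    for start in range(0, len(tokenized_sentences), batch_size):
        batch = tokenized_sentences[start:start + batch_size]
        # one tagger call per batch instead of one per sentence
        tagged_sentences.extend(tagger.tag_sents(batch))
    return tagged_sentences
```

Each returned element is the list of (token, tag) pairs for one input sentence, so the grouping of adjacent ORGANIZATION tokens can still be done per sentence.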

In [5]:
os.makedirs(ORGS_STANFORD_DIR, exist_ok=True)

In [6]:
def create_script(template, script_file, textfiles_dir):
    # write one Java command per text file into the shell script
    with open(script_file, "w") as fscript:
        for textfile in os.listdir(textfiles_dir):
            if not textfile.endswith(".txt"):
                continue  # skip stray files, e.g. hidden files
            doc_id = int(textfile.split(".")[0])
            command = template.format(doc_id, doc_id)
            fscript.write("{:s}\n".format(command))


template = """java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier
    -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz
    -outputFormat tabbedEntities
    -textFile ../../data/textfiles/{:d}.txt > ../../data/orgs_stanford/{:d}.org"""
# collapse the multi-line template into a single command line
template = re.sub(r"\s+", " ", template).strip()
print(template.format(1, 1))

create_script(template, os.path.join(STANFORD_MODELS, "my_script.sh"), TEXTFILES_DIR)

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile ../../data/textfiles/1.txt > ../../data/orgs_stanford/1.org


### Parse script output

In [14]:
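# tabbedEntities output has three tab-separated columns: entity, tag, other text
with open(os.path.join(ORGS_STANFORD_DIR, "0.org"), "r") as fsout:
    for line in fsout:
        try:
            token, tag, _ = line.strip().split("\t")
            if tag == "ORGANIZATION":
                print(token)
        except ValueError:
            # lines without exactly three fields carry no usable entity
            continue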

Suguru Arimoto Osaka University
American Institute of Physics
ANZA
Fa
Preprocessing
IBM
Global Training
Intell
PAMI
Robotics Research
Robotics & Automation
Robotics & Automation
Robotics & Automation
Robotics & Automation
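
For the sanity check mentioned earlier, the per-document .org files can be gathered into a single tab-separated file of (doc_id, entity) rows. This is a minimal sketch under a few assumptions: the function name collect_orgs and the output layout are ours, and the parser is slightly more tolerant than the cell above in that it accepts lines with two or more fields rather than exactly three.

```python
import os


def collect_orgs(orgs_dir, output_file, label="ORGANIZATION"):
    """Write one 'doc_id<TAB>entity' row per `label` entity in orgs_dir/*.org.

    Returns the number of rows written.
    """
    num_rows = 0
    with open(output_file, "w", encoding="utf-8") as fout:
        # sorted() only makes the row order deterministic (string order)
        for filename in sorted(os.listdir(orgs_dir)):
            doc_id, ext = os.path.splitext(filename)
            if ext != ".org":
                continue
            path = os.path.join(orgs_dir, filename)
            with open(path, "r", encoding="utf-8") as fin:
                for line in fin:
                    fields = line.rstrip("\n").split("\t")
                    # columns: entity, tag, remaining text (may be empty)
                    if len(fields) >= 2 and fields[1] == label and fields[0]:
                        fout.write("{:s}\t{:s}\n".format(doc_id, fields[0]))
                        num_rows += 1
    return num_rows
```

A call such as collect_orgs(ORGS_STANFORD_DIR, os.path.join(DATA_DIR, "orgs_stanford.tsv")) would then produce the combined file; the file name orgs_stanford.tsv is hypothetical.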
