## Extracting ORGs from papers using NLTK and Stanford NER

This notebook is based on the blog post [Named Entity Recognition with Stanford NER Tagger](https://pythonprogramming.net/named-entity-recognition-stanford-ner-tagger/) by Chuck Dishmon.

We scan the text of each paper for organization names and use them much like keywords and authors, i.e., as yet another facet (or as a feature for content similarity going forward).

In [1]:
import nltk
import os
import re

In [10]:
DATA_DIR = "../data"
MODELS_DIR = "../models"

STANFORD_MODELS = os.path.join(MODELS_DIR, "stanford-ner-2018-02-27")

STANFORD_NER_MODEL = os.path.join(STANFORD_MODELS, "stanford-ner.jar")
STANFORD_CRF_MODEL = os.path.join(STANFORD_MODELS, "classifiers",
                                  "english.all.3class.distsim.crf.ser.gz")

TEXTFILES_DIR = os.path.join(DATA_DIR, "textfiles")

TOP_N_LINES = 50
TEXTFILES_ORG_DIR = os.path.join(DATA_DIR, "textfiles_org")

ORGS_STANFORD_RAW = os.path.join(DATA_DIR, "orgs_stanford_raw")

ORGS_STANFORD_DIR = os.path.join(DATA_DIR, "orgs_stanford")

### Entity Extractor

NLTK's `StanfordNERTagger` wraps the Stanford NER Java model.

In [3]:
tagger = nltk.tag.StanfordNERTagger(STANFORD_CRF_MODEL, STANFORD_NER_MODEL, 
                                    encoding='utf-8')

In [4]:
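def extract_entities(tagger, text, debug=False):
    """Return the ORGANIZATION entities in text, joining adjacent ORG tokens."""
    entities, entity = [], []
    for sid, sent in enumerate(nltk.sent_tokenize(text)):
        tokens = nltk.word_tokenize(sent)
        tagged = tagger.tag(tokens)
        # Position of the last ORG token seen in this sentence.
        prev_tid = None
        for tid, (token, tag) in enumerate(tagged):
            if tag == "ORGANIZATION":
                if debug:
                    print(sid, tid, token, tag)
                # A gap since the last ORG token (or a new sentence) starts
                # a new entity, so flush the one collected so far.
                if prev_tid is None or prev_tid + 1 < tid:
                    if len(entity) > 0:
                        entities.append(" ".join(entity))
                    entity = []
                entity.append(token)
                prev_tid = tid
    # Flush the last entity, if any.
    if len(entity) > 0:
        entities.append(" ".join(entity))
    return entities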


text = """Yann Le Cun, a native of France was not even 30 when he joined AT&T 
Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new 
machine learning methods, including the convolutional neural network—modeled 
after the visual cortex in animals. Today, he serves as chief AI scientist at
Facebook, where he works tirelessly towards new breakthroughs."""
# Collapse newlines and runs of whitespace into single spaces.
text = re.sub(r"\s+", " ", text)
print(text)

entities = extract_entities(tagger, text)
print(entities)

Yann Le Cun, a native of France was not even 30 when he joined AT&T Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new machine learning methods, including the convolutional neural network—modeled after the visual cortex in animals. Today, he serves as chief AI scientist at Facebook, where he works tirelessly towards new breakthroughs.
['AT & T Bell Laboratories', 'Bell Labs', 'Facebook']
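
The grouping step inside `extract_entities` can be tried without Java or the Stanford models. The sketch below is an alternative formulation using `itertools.groupby`, not the function above; the helper name `group_org_tokens` and the hand-written tag list are assumptions made for illustration. It expects the tagged tokens of a single sentence.

```python
from itertools import groupby


def group_org_tokens(tagged_tokens, target_tag="ORGANIZATION"):
    """Join runs of adjacent tokens carrying target_tag into entity strings."""
    entities = []
    # groupby splits the sequence into runs that share the same key value.
    for is_target, run in groupby(tagged_tokens,
                                  key=lambda pair: pair[1] == target_tag):
        if is_target:
            entities.append(" ".join(token for token, _ in run))
    return entities


# Hand-written (token, tag) pairs in the shape returned by tagger.tag();
# hypothetical input for illustration, not actual tagger output.
tagged = [
    ("he", "O"), ("joined", "O"),
    ("AT", "ORGANIZATION"), ("&", "ORGANIZATION"), ("T", "ORGANIZATION"),
    ("Bell", "ORGANIZATION"), ("Laboratories", "ORGANIZATION"),
    ("in", "O"), ("New", "LOCATION"), ("Jersey", "LOCATION"), (".", "O"),
]
print(group_org_tokens(tagged))
```

Because the tokenizer splits `AT&T` into three tokens, joining with spaces yields `AT & T`, which matches the spacing seen in the output above.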


### Preprocess text files

Our intent is to look only at the top few lines of each paper to find the organizations that the authors are affiliated with. We do not want to parse the full paper, because of (a) processing time and (b) noisy entities.

Based on looking at a few papers, it seems that this information can be found within the top 50 lines (`TOP_N_LINES`), so we will use that as a cutoff. This is preferable to looking for magic section names such as ABSTRACT or INTRODUCTION and all their misspellings. It also saves time on non-English papers, where our standard models would not work anyway.

This step reads the files in `TEXTFILES_DIR` and writes the truncated versions to the `TEXTFILES_ORG_DIR` directory.
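
The line cutoff can be sketched with `itertools.islice`, which stops reading once the limit is reached instead of loading the whole file. The helper name `head_lines` and the sample content are assumptions for illustration; an in-memory `io.StringIO` stands in for an open text file.

```python
import io
from itertools import islice


def head_lines(lines, top_n):
    """Return at most top_n stripped lines from an iterable of lines."""
    return [line.strip() for line in islice(lines, top_n)]


# An in-memory stand-in for an open text file (hypothetical content).
sample_file = io.StringIO(
    "Title of the paper\nA. Author\nSome University\nAbstract\n")
print("\n".join(head_lines(sample_file, 3)))
```

With a cutoff of 3, only the title, author, and affiliation lines survive, and the rest of the file is never read.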

In [11]:
if not os.path.exists(TEXTFILES_ORG_DIR):
    os.mkdir(TEXTFILES_ORG_DIR)

In [8]:
def get_text(textfile, top_n):
    # Keep only the first top_n lines, stripped of surrounding whitespace.
    lines = []
    with open(textfile, "r") as f:
        for i, line in enumerate(f):
            if i >= top_n:
                break
            lines.append(line.strip())
    return "\n".join(lines)


num_written = 0
for textfile in os.listdir(TEXTFILES_DIR):
    if num_written % 1000 == 0:
        print("{:d} headers written".format(num_written))
    doc_id = int(textfile.split(".")[0])
    head_file = os.path.join(TEXTFILES_ORG_DIR, "{:d}.txt".format(doc_id))
    if os.path.exists(head_file):
        num_written += 1
        continue
    text = get_text(os.path.join(TEXTFILES_DIR, textfile), TOP_N_LINES)
    with open(head_file, "w") as fhead:
        fhead.write(text)
    num_written += 1

print("{:d} headers written, COMPLETE".format(num_written))

0 headers written
1000 headers written
2000 headers written
3000 headers written
4000 headers written
5000 headers written
6000 headers written
7000 headers written
7238 headers written, COMPLETE


### Run ORG extractor for all files

__NOTE:__ Running the `extract_entities` function above on every file is too slow, largely because the NLTK wrapper starts a new Java process for each `tag()` call. It is faster to run the Java classifier directly on each file from a shell script and then process the output to extract the ORGs. The command to run the extractor on a single file from the shell:

    cd ../models/stanford-ner-2018-02-27
    java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -outputFormat tabbedEntities \
        -textFile ../../data/textfiles_org/1.txt > ../../data/orgs_stanford_raw/1.org

The cells below generate a script with one such command per paper. Running it from the Stanford NER directory writes the raw Stanford NER output to `ORGS_STANFORD_RAW`.

    sh my_script.sh
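
An alternative that stays inside Python is to batch the work: NLTK's Stanford taggers also provide `tag_sents`, which takes a list of tokenized sentences and tags them in one call rather than one call per sentence. Whether that is fast enough for this corpus has not been measured here. The sketch below shows only the call shape; `FakeTagger` is a hypothetical stand-in so the example runs without Java or the Stanford models.

```python
def tag_sentences_in_one_call(tagger, tokenized_sents):
    """Tag all sentences with one tag_sents() call instead of one tag() per sentence."""
    return tagger.tag_sents(tokenized_sents)


class FakeTagger:
    """Hypothetical stand-in for the real tagger; counts how often it is called."""
    ORGS = {"Facebook", "Bell", "Labs"}

    def __init__(self):
        self.calls = 0

    def tag_sents(self, sentences):
        self.calls += 1
        return [[(tok, "ORGANIZATION" if tok in self.ORGS else "O")
                 for tok in sent]
                for sent in sentences]


sents = [["At", "Bell", "Labs", ","], ["He", "works", "at", "Facebook", "."]]
fake = FakeTagger()
tagged_sents = tag_sentences_in_one_call(fake, sents)
print(fake.calls, tagged_sents[1][3])
```

The result is one list of `(token, tag)` pairs per input sentence, so the grouping logic from `extract_entities` can be applied to each inner list unchanged.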


In [12]:
if not os.path.exists(ORGS_STANFORD_RAW):
    os.mkdir(ORGS_STANFORD_RAW)

In [13]:
def create_script(template, script_file, textfiles_dir):
    with open(script_file, "w") as fscript:
        for textfile in os.listdir(textfiles_dir):
            doc_id = int(textfile.split(".")[0])
            command = template.format(doc_id, doc_id)
            fscript.write("{:s}\n".format(command))


template = """java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier
    -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz
    -outputFormat tabbedEntities
    -textFile ../../data/textfiles_org/{:d}.txt > ../../data/orgs_stanford_raw/{:d}.org"""
# Collapse the multi-line template into a single command line.
template = re.sub(r"\s+", " ", template).strip()
print(template.format(1, 1))

create_script(template, os.path.join(STANFORD_MODELS, "my_script.sh"), TEXTFILES_DIR)

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile ../../data/textfiles_org/1.txt > ../../data/orgs_stanford_raw/1.org
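
A file written with Python's `open()` is not marked executable and the generated script has no interpreter line, so it may be convenient to add both when writing it. The following is a minimal sketch of that variation, assuming a POSIX shell; the helper name `write_script`, the `#!/bin/sh` line, and the document ids are assumptions, and the demo writes to a temporary directory rather than to `STANFORD_MODELS`.

```python
import os
import stat
import tempfile

# Same command as the template above, with one placeholder reused twice.
COMMAND = ('java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier '
           '-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz '
           '-outputFormat tabbedEntities '
           '-textFile ../../data/textfiles_org/{0:d}.txt '
           '> ../../data/orgs_stanford_raw/{0:d}.org')


def write_script(script_file, doc_ids):
    """Write one classifier command per document id and mark the script executable."""
    with open(script_file, "w") as fscript:
        fscript.write("#!/bin/sh\n")
        for doc_id in sorted(doc_ids):
            fscript.write(COMMAND.format(doc_id) + "\n")
    # Add the owner-execute bit to whatever permissions the file already has.
    mode = os.stat(script_file).st_mode
    os.chmod(script_file, mode | stat.S_IXUSR)


# Demonstrate on a temporary file with made-up document ids.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_script = os.path.join(tmpdir, "my_script.sh")
    write_script(demo_script, [2, 1])
    with open(demo_script) as f:
        script_lines = f.read().splitlines()
print(len(script_lines), script_lines[0])
```

Sorting the ids keeps the script deterministic across runs, since `os.listdir` returns entries in arbitrary order.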


### Parse script output

Pull out the entities tagged as ORGANIZATION from the raw files.

In [26]:
if not os.path.exists(ORGS_STANFORD_DIR):
    os.mkdir(ORGS_STANFORD_DIR)
    
num_written = 0
for rawfile in os.listdir(ORGS_STANFORD_RAW):
    if num_written % 1000 == 0:
        print("{:d} org files written".format(num_written))
    orgfile = os.path.join(ORGS_STANFORD_DIR, rawfile)
    if os.path.exists(orgfile):
        num_written += 1
        continue
    with open(os.path.join(ORGS_STANFORD_RAW, rawfile), "r") as fraw, \
         open(orgfile, "w") as forg:
        for line in fraw:
            # Rows are expected to be tab-separated: entity text, tag, then
            # trailing text. Splitting without strip() keeps rows whose
            # trailing column is empty.
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[1] == "ORGANIZATION":
                forg.write("{:s}\n".format(fields[0]))
    num_written += 1

print("{:d} org files written, COMPLETE".format(num_written))

0 org files written
1000 org files written
2000 org files written
3000 org files written
4000 org files written
5000 org files written
6000 org files written
7000 org files written
7238 org files written, COMPLETE
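
The parsing step can be tried in isolation on a few rows. The sketch below assumes the raw files hold three tab-separated columns (entity text, tag, following text); the sample rows are hand-written for illustration and are not real classifier output, and the helper name `parse_orgs` is hypothetical.

```python
def parse_orgs(lines):
    """Yield the entity text of every row whose second column is ORGANIZATION."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == "ORGANIZATION":
            yield fields[0]


# Hypothetical rows in the assumed three-column layout
# (entity, tag, following text); not real classifier output.
sample_rows = [
    "\t\tYann Le Cun joined\n",
    "AT&T Bell Laboratories\tORGANIZATION\tin\n",
    "New Jersey\tLOCATION\t.\n",
    "Facebook\tORGANIZATION\t\n",
]
print(list(parse_orgs(sample_rows)))
```

Checking the column count explicitly, rather than unpacking into a fixed number of names, keeps rows that have an empty trailing column and skips rows with no entity.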
