## Extracting ORGs from papers using spaCy

This notebook is based on the named entities section of the [spaCy Linguistic Features page](https://spacy.io/usage/linguistic-features#section-named-entities).

We try to extract ORG named entities from our papers dataset. These are likely to be universities and commercial research groups.

In [1]:
import os
import re
import spacy

In [2]:
DATA_DIR = "../data"

TEXTFILES_ORG_DIR = os.path.join(DATA_DIR, "textfiles_org")
ORGS_SPACY_DIR = os.path.join(DATA_DIR, "orgs_spacy")

### Entity Extractor

The spaCy entity extractor is __much faster__ than the NLTK+Stanford one.

In [3]:
def extract_entities(tagger, text):
    """Return the text of every ORG entity that tagger finds in text."""
    if text is None:
        return []
    doc = tagger(text)
    # keep only entities labeled as organizations
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]
  
    
text = """Yann Le Cun, a native of France was not even 30 when he joined AT&T 
Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new 
machine learning methods, including the convolutional neural network—modeled 
after the visual cortex in animals. Today, he serves as chief AI scientist at
Facebook, where he works tirelessly towards new breakthroughs."""
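# Collapse newlines and runs of whitespace into single spaces.
text = re.sub(r"\s+", " ", text)
print(text)

# "en" is a spaCy 2.x shortcut link. spaCy 3.x removed shortcuts, so there
# a model package must be loaded by name, e.g. spacy.load("en_core_web_sm").
nlp = spacy.load("en")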
entities = extract_entities(nlp, text)
print(entities)

Yann Le Cun, a native of France was not even 30 when he joined AT&T Bell Laboratories in New Jersey. At Bell Labs, LeCun developed a number of new machine learning methods, including the convolutional neural network—modeled after the visual cortex in animals. Today, he serves as chief AI scientist at Facebook, where he works tirelessly towards new breakthroughs.
['AT&T Bell Laboratories', 'Bell Labs', 'Facebook']
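
The `ORG` filter in `extract_entities` discards every other entity type the pipeline finds (people, places, dates, and so on). As a minimal sketch, the hypothetical helper `entities_by_label` below groups all entity texts by label so that the discarded ones can be inspected. It is not part of the original notebook, and it assumes only that `tagger(text)` returns an object with an `ents` sequence whose items carry `label_` and `text` attributes, as a spaCy `Doc` does.

```python
from collections import defaultdict


def entities_by_label(tagger, text):
    """Group entity texts by label (hypothetical helper, not in the original notebook)."""
    grouped = defaultdict(list)
    if text is None:
        return dict(grouped)
    for ent in tagger(text).ents:
        # label_ is the string label (e.g. "ORG"), text is the matched span
        grouped[ent.label_].append(ent.text)
    return dict(grouped)
```

With the pipeline loaded above, `entities_by_label(nlp, text)` would return a dict keyed by label; which labels appear besides `ORG` depends on the model.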


## Apply to all (preprocessed) text files

The preprocessing was done in the `12-org-ner-nltk-stanford` notebook. It keeps the first 50 lines of each original file, to focus on the part of the text most likely to contain the ORGs we are interested in, i.e., the affiliations of the authors.
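
The preprocessing code itself lives in the other notebook and is not reproduced here. As a rough sketch of the idea, keeping only the first 50 lines of a file could look like the following. The function name `head_of_file` and its parameters are assumptions made for illustration, not the code that was actually used.

```python
from itertools import islice


def head_of_file(path, num_lines=50):
    """Return the first num_lines lines of the file at path as one string.

    Hypothetical helper: the real preprocessing is in the
    12-org-ner-nltk-stanford notebook and may differ.
    """
    with open(path, "r") as f:
        # islice stops after num_lines lines without reading the rest of the file
        return "".join(islice(f, num_lines))
```
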

In [4]:
# create the output directory if it does not exist yet
os.makedirs(ORGS_SPACY_DIR, exist_ok=True)

In [5]:
def get_text(textfile):
    """Read a text file and return its lines, stripped and rejoined with newlines."""
    # the with block closes the file even if reading fails
    with open(textfile, "r") as f:
        return "\n".join(line.strip() for line in f)


num_written = 0
for textfile in os.listdir(TEXTFILES_ORG_DIR):
    doc_id = int(textfile.split(".")[0])
    orgfile = os.path.join(ORGS_SPACY_DIR, "{:d}.org".format(doc_id))
    # Skip files handled by an earlier run before reporting progress, so a
    # re-run does not print the same count once per skipped file.
    if os.path.exists(orgfile):
        continue
    if num_written % 1000 == 0:
        print("orgs extracted from {:d} files".format(num_written))
    text = get_text(os.path.join(TEXTFILES_ORG_DIR, "{:d}.txt".format(doc_id)))
    entities = extract_entities(nlp, text)
    entities = list(set(entities))  # de-duplicate; order is not preserved
    with open(orgfile, "w") as forgs:
        for entity in entities:
            forgs.write("{:s}\n".format(entity))
    num_written += 1
print("orgs extracted from {:d} files, COMPLETE".format(num_written))

orgs extracted from 0 files
orgs extracted from 1000 files
orgs extracted from 2000 files
orgs extracted from 3000 files
orgs extracted from 4000 files
orgs extracted from 5000 files
orgs extracted from 6000 files
orgs extracted from 7000 files
orgs extracted from 7238 files, COMPLETE
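
The loop above runs the pipeline on one document at a time. spaCy pipelines also expose a `pipe` method that processes an iterable of texts in batches, which is usually the faster route for large collections. A minimal sketch follows; the helper name `extract_orgs_batch` and the batch size of 64 are assumptions, and any speed-up has not been measured on this dataset.

```python
def extract_orgs_batch(tagger, texts, batch_size=64):
    """Return one de-duplicated, sorted list of ORG entity texts per input text.

    Hypothetical helper: tagger is expected to be a loaded spaCy pipeline
    (anything with a spaCy-style pipe(texts, batch_size=...) method works).
    """
    results = []
    # pipe yields one processed document per input text, in input order
    for doc in tagger.pipe(texts, batch_size=batch_size):
        orgs = {ent.text for ent in doc.ents if ent.label_ == "ORG"}
        results.append(sorted(orgs))
    return results
```

Given a hypothetical list `paths` of input files, `extract_orgs_batch(nlp, [get_text(p) for p in paths])` would return the results in the same order as `paths`.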
