scrapy-corenlp

A Scrapy middleware that performs Named Entity Recognition (NER) on responses using Stanford CoreNLP.

Settings

| Option | Value | Example Value |
| --- | --- | --- |
| STANFORD_NER_ENABLED | Boolean | True |
| STANFORD_NER_CLASSIFIER | Absolute path to the CRFClassifier model | '/home/jithesh/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz' |
| STANFORD_NER_JAR | Absolute path to the stanford-ner.jar file | '/home/jithesh/stanford-ner-2015-12-09/stanford-ner.jar' |
| STANFORD_NER_FIELD_TO_PROCESS | An Item text field, or a list of Item text fields, to use for classification | ['title', 'description'] |
| STANFORD_NER_FIELD_OUTPUT | Scrapy Item field in which to store the result | 'result' |

In your settings.py file, add the settings described above and register the CoreNLP middleware in SPIDER_MIDDLEWARES, e.g.:

SPIDER_MIDDLEWARES = {
    'scrapy_corenlp.middlewares.CoreNLP': 543,
}
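Put together, a complete settings.py fragment might look like the sketch below. The paths and field names are just the example values from the Settings table, so they are assumptions about your local Stanford NER install and your Item definition and should be replaced with your own.

```python
# settings.py (sketch): every value is the example value from the Settings table.
# Replace the two paths with the location of your own Stanford NER download.
STANFORD_NER_ENABLED = True
STANFORD_NER_CLASSIFIER = (
    '/home/jithesh/stanford-ner-2015-12-09/classifiers/'
    'english.muc.7class.distsim.crf.ser.gz'
)
STANFORD_NER_JAR = '/home/jithesh/stanford-ner-2015-12-09/stanford-ner.jar'

# Item fields whose text is classified, and the Item field that receives the result.
STANFORD_NER_FIELD_TO_PROCESS = ['title', 'description']
STANFORD_NER_FIELD_OUTPUT = 'result'

# Register the middleware; 543 is the order value used in the example above.
SPIDER_MIDDLEWARES = {
    'scrapy_corenlp.middlewares.CoreNLP': 543,
}
```

With these values, your Item would presumably need `title`, `description`, and `result` fields for the middleware to read from and write to.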

An example value of the STANFORD_NER_FIELD_OUTPUT field after recognising the entities is:

{"result": {"DATE": ["1963", "2009", "1979", "1663", "1982"], "ORGANIZATION": ["Royal Society", "US National Academy of Science", "University of California", "Home Home About Stephen The Computer Stephen", "the University of Cambridge", "Sally Tsui Wong-Avery Director of Research", "Theoretical Physics", "Leiden University", "Baby Universe", "Department of Applied Mathematics", "Cambridge Lectures Publications Books Images Films", "Briefer History of Time", "ESA", "NASA", "Brief History of Time", "CBE", "Caius College", "The Universe"], "PERSON": ["P. Oesch", "Einstein", "D. Magee", "Stephen Hawking", "George", "Annie", "Isaac Newton", "G. Illingworth", "Dennis Stanton Avery", "R. Bouwens"], "LOCATION": ["London", "Santa Cruz", "Einstein", "Cambridge", "Gonville"]}}
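To make the shape of that output concrete, here is a minimal sketch of how per-token NER tags could be collapsed into a dictionary of this form. It is not taken from the middleware's source: the `group_entities` helper, the `"O"` tag for non-entity tokens, and the hand-written sample input are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Collapse (word, tag) pairs into {tag: [phrase, ...]}.

    Hypothetical helper: consecutive tokens that share a tag are joined
    into one phrase, and tokens tagged `outside_tag` are skipped.
    """
    entities = defaultdict(list)
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == outside_tag:
            continue
        phrase = " ".join(word for word, _ in run)
        # Keep each phrase once per tag, preserving first-seen order.
        if phrase not in entities[tag]:
            entities[tag].append(phrase)
    return dict(entities)


# Hand-written sample input, not real tagger output.
tagged = [
    ("Stephen", "PERSON"), ("Hawking", "PERSON"),
    ("visited", "O"),
    ("London", "LOCATION"),
    ("in", "O"),
    ("1979", "DATE"),
]
result = {"result": group_entities(tagged)}
```

Joining adjacent tokens with the same tag would also explain noisy entries such as "Home Home About Stephen The Computer Stephen" in the example: when the tagger mislabels a run of navigation text, the whole run becomes a single entity.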