In [2]:
# Install Stanza. Installing and importing it takes only the two commands below.
!pip install stanza

# Import stanza
import stanza

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)

Setting up Stanford CoreNLP

For the interface to work, the Stanford CoreNLP library must be installed and the CORENLP_HOME environment variable must point to the installation location.

The following cell downloads and installs the CoreNLP library on your machine using Stanza's installation command:

In [3]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

2022-01-27 17:18:47 INFO: Installing CoreNLP package into ./corenlp...


Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip:   0%|        …



That's all for the installation!

We can now double-check that the installation succeeded by listing the files in the CoreNLP directory.

You should be able to see a number of .jar files by running the following command:

In [4]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

build.xml				  jollyday.jar
corenlp.sh				  LIBRARY-LICENSES
CoreNLP-to-HTML.xsl			  LICENSE.txt
ejml-core-0.39.jar			  Makefile
ejml-core-0.39-sources.jar		  patterns
ejml-ddense-0.39.jar			  pom-java-11.xml
ejml-ddense-0.39-sources.jar		  pom-java-17.xml
ejml-simple-0.39.jar			  pom.xml
ejml-simple-0.39-sources.jar		  protobuf-java-3.19.2.jar
input.txt				  README.txt
input.txt.out				  RESOURCE-LICENSES
input.txt.xml				  SemgrexDemo.java
istack-commons-runtime-3.0.7.jar	  ShiftReduceDemo.java
istack-commons-runtime-3.0.7-sources.jar  slf4j-api.jar
javax.activation-api-1.2.0.jar		  slf4j-simple.jar
javax.activation-api-1.2.0-sources.jar	  stanford-corenlp-4.4.0.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.4.0-javadoc.jar
javax.json.jar				  stanford-corenlp-4.4.0-models.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.4.0-sources.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   StanfordCoreNlpDemo.java
jaxb-impl-2.4.0-b180830.0438.jar	  StanfordDependenciesManual.p
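
The same check can also be done from Python instead of the shell. The following is a minimal sketch, assuming only that a working installation contains .jar files whose names start with stanford-corenlp, as in the listing above. The helper name find_corenlp_jars is hypothetical and not part of Stanza, and the demonstration runs on a throwaway directory rather than on a real installation.

```python
import tempfile
from pathlib import Path


def find_corenlp_jars(corenlp_home):
    """Return the sorted names of stanford-corenlp*.jar files in a directory."""
    return sorted(p.name for p in Path(corenlp_home).glob("stanford-corenlp*.jar"))


# Demonstration on a temporary directory that imitates part of the listing above.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["stanford-corenlp-4.4.0.jar",
                 "stanford-corenlp-4.4.0-models.jar",
                 "README.txt"]:
        Path(demo_dir, name).touch()
    demo_jars = find_corenlp_jars(demo_dir)
    print(demo_jars)

# On a real installation you would pass the actual location instead, e.g.:
#   import os
#   find_corenlp_jars(os.environ["CORENLP_HOME"])
```

An empty result would suggest that the download did not complete or that CORENLP_HOME points to the wrong folder.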

Constructing CoreNLPClient

At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which can pass the text to the background server process, and accept the returned annotation results.

We wrap these functionalities in a CoreNLPClient class. Therefore, we need to start by importing this class from Stanza.

In [5]:
# Import client module
from stanza.server import CoreNLPClient

After the import is done, we can construct a CoreNLPClient instance. The constructor takes a Python list of annotator names as an argument. Here let's explore some basic annotators: tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition (NER), constituency and dependency parsing, and coreference resolution.

Additionally, the client constructor accepts a memory argument, which specifies how much memory is allocated to the background Java process. The endpoint option specifies the address and port the client uses to communicate with the server. The default port is 9000. However, since this port is already occupied by a system process in Colab, we'll manually set it to 9001 in the following example.

Also, here we manually set be_quiet=True to avoid an IO issue in Colab notebooks. You should be able to use be_quiet=False on your own computer, which will print detailed logging information from CoreNLP during usage.

For more options in constructing the client, please refer to https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options
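
If you are unsure whether a port is already taken on your machine, you can probe it before choosing the endpoint. The following is a minimal sketch using only the Python standard library; the helper name port_is_free is hypothetical and not part of Stanza, and the address 127.0.0.1 is an assumption standing in for localhost.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
        return True


# Demonstration: occupy an OS-chosen port, then probe it while it is held.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
holder.listen(1)
busy_port = holder.getsockname()[1]
free_while_held = port_is_free(busy_port)
holder.close()
print(busy_port, free_while_held)
```

In a notebook you might call port_is_free(9000) and port_is_free(9001) before building the endpoint string. The result is only a snapshot: another process could still claim the port between the check and the server start.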

In [6]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(
    annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is optional, as by default the server is started when the first annotation is performed
client.start()
import time
time.sleep(10)

2022-01-27 17:19:15 INFO: Writing properties to tmp file: corenlp_server-422d6f085e7741f8.props
2022-01-27 17:19:15 INFO: Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-422d6f085e7741f8.props -annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7efbfcd96fd0>


After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running.

In [7]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java

    165 java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-422d6f085e7741f8.props -annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref -preload -outputFormat serialized
    186 /bin/bash -c ps -o pid,cmd | grep java
    188 grep java
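
A fixed sleep either waits longer than necessary or not long enough. One alternative is to poll the server port until it accepts connections. The following is a minimal sketch using only the standard library; wait_for_port is a hypothetical helper, not a Stanza function, and a port that accepts connections does not by itself guarantee that CoreNLP has finished loading its models.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Poll (host, port) until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Demonstration with a stand-in listener instead of a real CoreNLP server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]
ready = wait_for_port(demo_port, timeout=2.0, interval=0.1)
listener.close()
print(ready)
```

With the client configured above, the call would be wait_for_port(9001) in place of the fixed sleep.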


Annotating Text

Annotating a piece of text is as simple as passing the text to the annotate function of the client object. Once the annotation is complete, a Document object containing all annotations is returned.

Note that although annotation is generally very fast, the first annotation might take a while to complete in the notebook. Please be patient.

In [10]:
# Annotate some text
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
document = client.annotate(text)
print(type(document))

<class 'CoreNLP_pb2.Document'>


Accessing Annotations

Annotations can be accessed from the returned Document object.

A Document contains a list of Sentences, each of which contains a list of Tokens. Here let's first explore the annotations stored in all tokens.

In [11]:
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

for i, sent in enumerate(document.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

Word        	Lemma       	POS   	NER
[Sentence 1]
Albert      	Albert      	NNP   	PERSON
Einstein    	Einstein    	NNP   	PERSON
was         	be          	VBD   	O
a           	a           	DT    	O
German      	german      	JJ    	NATIONALITY
-           	-           	HYPH  	O
born        	bear        	VBN   	O
theoretical 	theoretical 	JJ    	TITLE
physicist   	physicist   	NN    	TITLE
.           	.           	.     	O

[Sentence 2]
He          	he          	PRP   	O
developed   	develop     	VBD   	O
the         	the         	DT    	O
theory      	theory      	NN    	O
of          	of          	IN    	O
relativity  	relativity  	NN    	O
.           	.           	.     	O



Alternatively, you can browse the NER results by iterating over the entity mentions in each sentence. For example:

In [12]:
# Iterate over all detected entity mentions
print("{:30s}\t{}".format("Mention", "Type"))

for sent in document.sentence:
    for m in sent.mentions:
        print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

Mention                       	Type
Albert Einstein               	PERSON
German                        	NATIONALITY
theoretical physicist         	TITLE
He                            	PERSON
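
To see how the token-level NER tags relate to these mentions, you can group consecutive tokens that share a non-O tag. The following is a minimal sketch that uses plain (word, tag) tuples copied from the first sentence of the token table above, so it runs without a server. The helper group_entities is hypothetical, and this is not a description of how CoreNLP itself builds mentions.

```python
from itertools import groupby

# (word, NER tag) pairs copied from the first sentence of the token table.
sentence_tokens = [
    ("Albert", "PERSON"), ("Einstein", "PERSON"), ("was", "O"), ("a", "O"),
    ("German", "NATIONALITY"), ("-", "O"), ("born", "O"),
    ("theoretical", "TITLE"), ("physicist", "TITLE"), (".", "O"),
]


def group_entities(tokens):
    """Merge runs of adjacent tokens with the same non-O tag into spans."""
    spans = []
    for tag, run in groupby(tokens, key=lambda pair: pair[1]):
        if tag != "O":
            spans.append((" ".join(word for word, _ in run), tag))
    return spans


entity_spans = group_entities(sentence_tokens)
for text_span, tag in entity_spans:
    print("{:30s}\t{}".format(text_span, tag))
```

This simple grouping would merge two adjacent entities of the same type into one span. It also cannot produce the mention "He" listed above, because that token carries the tag O in the token table, so the mention list is the more complete source.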


To print all the annotations that a sentence, token or mention has, you can simply print the corresponding object.

In [13]:
# Print annotations of a token
print(document.sentence[0].token[0])

# Print annotations of a mention
print(document.sentence[0].mentions[0])

word: "Albert"
pos: "NNP"
value: "Albert"
before: ""
after: " "
originalText: "Albert"
ner: "PERSON"
lemma: "Albert"
beginChar: 0
endChar: 6
utterance: 0
speaker: "PER0"
beginIndex: 0
endIndex: 1
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
coarseNER: "PERSON"
fineGrainedNER: "PERSON"
corefMentionIndex: 0
entityMentionIndex: 0
nerLabelProbs: "PERSON=0.9999331283889166"

sentenceIndex: 0
tokenStartInSentenceInclusive: 0
tokenEndInSentenceExclusive: 2
ner: "PERSON"
entityType: "PERSON"
entityMentionIndex: 0
canonicalEntityMentionIndex: 0
entityMentionText: "Albert Einstein"



In [14]:
# get the first sentence
sentence = document.sentence[0]

# get the constituency parse of the first sentence
print('---')
print('constituency parse of first sentence')
constituency_parse = sentence.parseTree
print(constituency_parse)

---
constituency parse of first sentence
child {
  child {
    child {
      child {
        value: "Albert"
      }
      value: "NNP"
      score: -8.849637985229492
    }
    child {
      child {
        value: "Einstein"
      }
      value: "NNP"
      score: -10.39391803741455
    }
    value: "NP"
    score: -22.208171844482422
  }
  child {
    child {
      child {
        value: "was"
      }
      value: "VBD"
      score: -0.42985981702804565
    }
    child {
      child {
        child {
          value: "a"
        }
        value: "DT"
        score: -1.5601264238357544
      }
      child {
        child {
          child {
            value: "German"
          }
          value: "JJ"
          score: -5.692482948303223
        }
        child {
          child {
            value: "-"
          }
          value: "HYPH"
          score: -0.01210630964487791
        }
        child {
          child {
            value: "born"
          }
          value: "VBN"
      

In [15]:
# get the first subtree of the constituency parse
print('first subtree of constituency parse')
print(constituency_parse.child[0])

first subtree of constituency parse
child {
  child {
    child {
      value: "Albert"
    }
    value: "NNP"
    score: -8.849637985229492
  }
  child {
    child {
      value: "Einstein"
    }
    value: "NNP"
    score: -10.39391803741455
  }
  value: "NP"
  score: -22.208171844482422
}
child {
  child {
    child {
      value: "was"
    }
    value: "VBD"
    score: -0.42985981702804565
  }
  child {
    child {
      child {
        value: "a"
      }
      value: "DT"
      score: -1.5601264238357544
    }
    child {
      child {
        child {
          value: "German"
        }
        value: "JJ"
        score: -5.692482948303223
      }
      child {
        child {
          value: "-"
        }
        value: "HYPH"
        score: -0.01210630964487791
      }
      child {
        child {
          value: "born"
        }
        value: "VBN"
        score: -5.775586128234863
      }
      value: "ADJP"
      score: -15.493135452270508
    }
    child {
      child {


In [16]:
# get the value of the first subtree
print('---')
print('value of first subtree of constituency parse')
print(constituency_parse.child[0].value)

---
value of first subtree of constituency parse
S
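
The nested child/value printout is hard to read for longer sentences. A recursive walk can flatten a subtree into the familiar bracketed notation. The following is a minimal sketch that uses a stand-in Node class exposing the same value and child fields seen in the output above. Node and to_bracketed are hypothetical names, and applying the function to the real parse tree object is an assumption that has not been verified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in for a parse tree node with a label and a list of children."""
    value: str
    child: List["Node"] = field(default_factory=list)


def to_bracketed(node):
    """Render a tree as a bracketed string; leaves are printed as bare words."""
    if not node.child:
        return node.value
    inner = " ".join(to_bracketed(c) for c in node.child)
    return "(" + node.value + " " + inner + ")"


# The NP subtree for "Albert Einstein", rebuilt by hand from the output above.
np_subtree = Node("NP", [
    Node("NNP", [Node("Albert")]),
    Node("NNP", [Node("Einstein")]),
])
bracketed = to_bracketed(np_subtree)
print(bracketed)
```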


In [17]:
# get the first token of the first sentence
print('first token of first sentence')
token = sentence.token[0]
print(token)

first token of first sentence
word: "Albert"
pos: "NNP"
value: "Albert"
before: ""
after: " "
originalText: "Albert"
ner: "PERSON"
lemma: "Albert"
beginChar: 0
endChar: 6
utterance: 0
speaker: "PER0"
beginIndex: 0
endIndex: 1
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
coarseNER: "PERSON"
fineGrainedNER: "PERSON"
corefMentionIndex: 0
entityMentionIndex: 0
nerLabelProbs: "PERSON=0.9999331283889166"



In [18]:
# get the part-of-speech tag
print('part of speech tag of token')
print(token.pos)

part of speech tag of token
NNP


In [19]:
# get the named entity tag
print('named entity tag of token')
print(token.ner)

named entity tag of token
PERSON


In [20]:
# get an entity mention from the first sentence
print('first entity mention in sentence')
print(sentence.mentions[0])

first entity mention in sentence
sentenceIndex: 0
tokenStartInSentenceInclusive: 0
tokenEndInSentenceExclusive: 2
ner: "PERSON"
entityType: "PERSON"
entityMentionIndex: 0
canonicalEntityMentionIndex: 0
entityMentionText: "Albert Einstein"



In [21]:
# access the coref chain
print('coref chains for the example')
print(document.corefChain)

coref chains for the example
[chainID: 2
mention {
  mentionID: 0
  mentionType: "PROPER"
  number: "SINGULAR"
  gender: "MALE"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 2
  headIndex: 1
  sentenceIndex: 0
  position: 1
}
mention {
  mentionID: 2
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "MALE"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 1
  headIndex: 0
  sentenceIndex: 1
  position: 1
}
representative: 0
]


In [22]:
mychains = list()
chains = document.corefChain
for chain in chains:
    mychain = list()
    # Loop through every mention of this chain
    for mention in chain.mention:
        # Get the sentence in which this mention is located, and the words that are part of this mention
        # (a mention can be a single word such as the pronoun "he", or several words such as "His wife Michelle")
        words_list = document.sentence[mention.sentenceIndex].token[mention.beginIndex:mention.endIndex]
        # Build a string out of the words of this mention
        ment_word = ' '.join([x.word for x in words_list])
        mychain.append(ment_word)
    mychains.append(mychain)

for chain in mychains:
    print(' <-> '.join(chain))

Albert Einstein <-> He
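
Each chain also carries a representative field, which can be used to replace every mention with one canonical form. The following is a minimal sketch that uses plain dictionaries and word lists copied from the outputs above, so it runs without a server. It assumes that representative is an index into the chain's mention list; in this example the value is 0, which does not distinguish that reading from a mention ID. The helper names are hypothetical.

```python
# Words of the two example sentences, copied from the token table.
sentences = [
    ["Albert", "Einstein", "was", "a", "German", "-", "born",
     "theoretical", "physicist", "."],
    ["He", "developed", "the", "theory", "of", "relativity", "."],
]

# The coref chain shown above, reduced to the fields used here.
chain = {
    "representative": 0,  # assumed to index into "mentions"
    "mentions": [
        {"sentenceIndex": 0, "beginIndex": 0, "endIndex": 2},
        {"sentenceIndex": 1, "beginIndex": 0, "endIndex": 1},
    ],
}


def mention_text(mention, sentences):
    """Join the words covered by a mention (endIndex is exclusive)."""
    words = sentences[mention["sentenceIndex"]]
    return " ".join(words[mention["beginIndex"]:mention["endIndex"]])


def resolve_to_representative(chain, sentences):
    """Map the text of every mention to the text of the representative mention."""
    rep = mention_text(chain["mentions"][chain["representative"]], sentences)
    return {mention_text(m, sentences): rep for m in chain["mentions"]}


resolved = resolve_to_representative(chain, sentences)
print(resolved)
```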


Shutting Down the CoreNLP Server

To shut down the background CoreNLP server process, simply call the stop function of the client. Note that once the server has been shut down, you'll have to restart it with the start() function before requesting any further annotation.

In [23]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,cmd | grep java

    262 /bin/bash -c ps -o pid,cmd | grep java
    264 grep java
