<a href="https://colab.research.google.com/github/kullawattana/thesis_2020_spacy_colab/blob/master/42_sentencizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Set up spaCy
from spacy.lang.en import English
parser = English()
# spaCy 2.x API; in spaCy 3.x the equivalent call is parser.add_pipe('sentencizer')
parser.add_pipe(parser.create_pipe('sentencizer'))

# Test Data
multiSentence = "There is an art, it says, or rather, a knack to flying. " \
                "The knack lies in learning how to throw yourself at the ground and miss. " \
                "In the beginning the Universe was created. This has made a lot of people " \
                "very angry and been widely regarded as a bad move."

# Parsing text takes a single call.
# Note: the first time spaCy runs in a session, it takes a little while to load its modules.

parsedData = parser(multiSentence)

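# Let's look at the tokens
# Iterating over parsedData yields one Token object at a time
# Each token is an object with lots of different properties
# A property with an underscore at the end returns the string representation,
# while the same property without the underscore returns an integer ID
# (a 64-bit hash in the output below) that keys into spaCy's vocabulary
# With trained models, the probability estimate is based on counts from a
# 3 billion word corpus, smoothed using the Simple Good-Turing method.
# Here every token reports -20.0 and cluster 0, which appear to be fallback
# defaults, since a bare English() pipeline carries no such statistics.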

for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    if i > 1:
        break
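
# A minimal sketch of how the paired values printed above relate: the integer
# form is a key into the vocabulary's string store, and the underscore form is
# the string stored under that key. The helper name hashRoundTrips and the
# variables checkParser / checkDoc are hypothetical, introduced only for this
# illustration; vocab.strings is spaCy's StringStore lookup.
from spacy.lang.en import English

def hashRoundTrips(doc):
    """Return True if every token's integer ID and string map to each other."""
    strings = doc.vocab.strings
    return all(
        strings[token.orth] == token.orth_ and strings[token.orth_] == token.orth
        for token in doc
    )

# Uses its own small pipeline so the check does not depend on the cell above
checkParser = English()
checkDoc = checkParser("There is an art to flying.")
assert hashRoundTrips(checkDoc)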

# Let's look at the sentences
sents = []
# the "sents" property returns spans
# span.start and span.end are token indices into the parsed document,
# not character offsets into the original string
for span in parsedData.sents:
    # iterate over the tokens in the span and rebuild the sentence text;
    # text_with_ws keeps each token's trailing whitespace (Token.string is deprecated)
    sent = ''.join(token.text_with_ws for token in span).strip()
    sents.append(sent)

for sentence in sents:
    print(sentence)

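# Let's look at the part of speech tags of the first sentence
# Note: this pipeline only contains the sentencizer, so token.pos_ is an empty string here;
# a trained pipeline with a tagger is needed to get real tags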
for span in parsedData.sents:
    sent = [parsedData[i] for i in range(span.start, span.end)]
    break

for token in sent:
    print(token.orth_, token.pos_)

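# Let's look at the dependencies of this example:
# Note: dependency labels also come from a trained parser component; with the
# sentencizer-only pipeline used here, token.dep_ is an empty string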
example = "The boy with the spotted dog quickly ran after the firetruck."
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

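# Let's look at the named entities of this example:
# Note: without a trained entity recognizer in the pipeline, no entities are found,
# so every token prints "(not an entity)" and the "ents" list below is empty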
example = "Apple's stocks dropped dramatically after the death of Steve Jobs in October."
parsedEx = parser(example)
for token in parsedEx:
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)")

print("-------------- entities only ---------------")
# if you just want the entities and nothing else, you can access the parsed example's "ents" property like this:
ents = list(parsedEx.ents)
for entity in ents:
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))

original: 6090035477591592277 There
lowercased: 2112642640949226496 there
lemma: 6090035477591592277 There
shape: 16072095006890171862 Xxxxx
prefix: 5582244037879929967 T
suffix: 18139757808136603089 ere
log probability: -20.0
Brown cluster id: 0
----------------------------------------
original: 3411606890003347522 is
lowercased: 3411606890003347522 is
lemma: 3411606890003347522 is
shape: 4370460163704169311 xx
prefix: 5097672513440128799 i
suffix: 3411606890003347522 is
log probability: -20.0
Brown cluster id: 0
----------------------------------------
original: 15099054000809333061 an
lowercased: 15099054000809333061 an
lemma: 15099054000809333061 an
shape: 4370460163704169311 xx
prefix: 11901859001352538922 a
suffix: 15099054000809333061 an
log probability: -20.0
Brown cluster id: 0
----------------------------------------
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
In the beginning the Universe wa