In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [11]:
# Create a Doc object
doc = nlp(u"TIL foreigners that owe more than £1,000 to the NHS are banned from entry into the United Kingdom")

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)


TIL NOUN compound
foreigners NOUN nsubjpass
that ADJ nsubj
owe VERB relcl
more ADJ amod
than ADP quantmod
£ SYM quantmod
1,000 NUM dobj
to ADP prep
the DET det
NHS PROPN pobj
are VERB auxpass
banned VERB ROOT
from ADP prep
entry NOUN pobj
into ADP prep
the DET det
United PROPN compound
Kingdom PROPN pobj


___
## 1. Tokenization
The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". Each token is stored in the Doc object together with annotations that describe it.
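
Tokenization on its own can be tried without a trained model. The following is a minimal sketch, assuming only that spaCy is installed: `spacy.blank("en")` builds an English pipeline that contains just the rule-based tokenizer. The names `nlp_blank` and `doc_tok` and the shortened sentence are made up for this illustration.

```python
import spacy

# A blank English pipeline holds only the tokenizer, so nothing is downloaded.
nlp_blank = spacy.blank("en")

# Shortened variant of the example sentence (illustrative only).
doc_tok = nlp_blank("Foreigners that owe more than £1,000 to the NHS are banned.")

# Collect the text of every token, including symbols and punctuation.
tokens = [token.text for token in doc_tok]
print(tokens)
print(len(doc_tok))
```

As in the output above, the currency symbol is split off as its own token while `1,000` stays in one piece.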

## 2. Part-of-Speech Tagging (POS)
Once the text is split into tokens, the next step is to assign a part of speech to each one. In the example above, `NHS` was recognized as a ***proper noun***. This step relies on statistical modeling: for example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

## 3. Dependencies
We also looked at the syntactic dependency assigned to each token. `foreigners` is identified as an `nsubjpass`, the ***passive nominal subject*** of the sentence: it is the subject of the passive verb `banned`.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)
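
The labels printed in the first example, both the coarse part-of-speech tags and the dependency labels, can be looked up with `spacy.explain`. The sketch below does this for a handful of them. The lists `pos_tags` and `dep_labels` and the dictionary `glossary` are names chosen for this illustration, not part of spaCy.

```python
import spacy

# Labels copied from the printed output of the example sentence.
pos_tags = ["NOUN", "PROPN", "ADP", "DET", "SYM", "NUM"]
dep_labels = ["nsubjpass", "auxpass", "relcl", "pobj", "quantmod"]

# spacy.explain returns a short description, or None for an unknown label.
glossary = {label: spacy.explain(label) for label in pos_tags + dep_labels}

for label, description in glossary.items():
    print(f"{label:<10} {description}")
```

No trained model is needed for this lookup, since the descriptions come from spaCy's built-in glossary.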

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Example for the token `Tesla`|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
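
Some of these attributes are purely lexical and can be inspected without a trained model. The sketch below reads four of them on a blank English pipeline. The sentence and the names `nlp_blank`, `doc_attr` and `lexical` are made up for this example; attributes such as `.pos_` and `.tag_` are left out because they are predicted by the loaded model rather than derived from the word form alone.

```python
import spacy

# Blank pipeline: tokenizer and lexical attributes only, no trained components.
nlp_blank = spacy.blank("en")

# Hypothetical sentence, chosen so that the first token matches the table.
doc_attr = nlp_blank("Tesla builds electric cars")
first = doc_attr[0]

# Gather the lexical attributes listed in the table for the first token.
lexical = {
    "text": first.text,
    "shape_": first.shape_,
    "is_alpha": first.is_alpha,
    "is_stop": first.is_stop,
}

for name, value in lexical.items():
    print(f".{name}: {value}")
```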

In [14]:
spacy.explain('ADP')

'adposition'

In [13]:
spacy.explain('prep')

'prepositional modifier'

In [16]:
# Lemmas (the base form of the word):
print(doc[12].text)
print(doc[12].lemma_)

banned
ban


In [20]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc[13].text)
print(doc[13].pos_)
print(doc[13].tag_ + ' / ' + spacy.explain(doc[13].tag_))

from
ADP
IN / conjunction, subordinating or preposition


In [26]:
# Word Shapes:
print(doc[7].text+': '+doc[7].shape_)

1,000: d,ddd


In [29]:
# Boolean Values:
print(f"{doc[6].text} - {doc[6].is_alpha}")
print(f"{doc[4].text} - {doc[4].is_stop}")

£ - False
more - True


## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [33]:
doc2 = nlp(u"It's not very bad for wild populations of birds since most species are natural carriers. The problem is with domestic birds such a as chickens cause they are a lot more vulnerable to the bird flu. At least that's what they told us in vet school. They also told us to keep chickens separated from other bird species such as turkeys for example, cause they can also be carriers of the virus.")
for i,sent in enumerate(doc2.sents):
    print(f"Sentence {i}: {sent}")


Sentence 0: It's not very bad for wild populations of birds since most species are natural carriers.
Sentence 1: The problem is with domestic birds such a as chickens cause they are a lot more vulnerable to the bird flu.
Sentence 2: At least that's what they told us in vet school.
Sentence 3: They also told us to keep chickens separated from other bird species such as turkeys for example, cause they can also be carriers of the virus.
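
The relationship between the "start of sentence" tags and `Doc.sents` can be seen in a small, self-contained sketch. It assumes the spaCy v3 API, where a component is added by name, and it uses the rule-based `sentencizer` on a blank pipeline instead of the loaded model, so the boundaries come from punctuation rules rather than from the parser. The text and the names `nlp_rules`, `doc_sents`, `starts` and `sentences` are made up for this illustration.

```python
import spacy

nlp_rules = spacy.blank("en")
# Assumption: spaCy v3, where built-in components are added by their string name.
nlp_rules.add_pipe("sentencizer")

doc_sents = nlp_rules("This is a sentence. Here is another one.")

# Tokens that carry the "start of sentence" flag.
starts = [token.text for token in doc_sents if token.is_sent_start]
print(starts)

# Doc.sents is a generator, so wrap it in list() before indexing or counting.
sentences = list(doc_sents.sents)
print(len(sentences))
print(sentences[1].text)
```

Because `Doc.sents` is a generator, an expression like `doc_sents.sents[1]` raises a `TypeError`; converting to a list first avoids that.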
