<a href="https://colab.research.google.com/github/scskalicky/VocabAtVic2023NLPWorkshop/blob/main/04-spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy** 

- spaCy describes itself as ["industrial-strength" NLP](https://spacy.io/)

- spaCy is a more modern NLP package than NLTK, with some similarities to it and some differences from it

- Perhaps most importantly, spaCy offers [trained models for many different languages](https://spacy.io/models)


### Using spaCy

Just like NLTK, spaCy comes pre-installed on Colab. You still need to `import spacy` and load a model before you can use it.

In [3]:
# import spacy and save the parser to a variable
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
# create some parsed text
parsed_text = nlp('The sea was angry that day my friends. Like an old man sending back soup in a deli.')

Displaying `parsed_text` just gives us the text back, but...

In [10]:
parsed_text

The sea was angry that day my friends. Like an old man sending back soup in a deli.

What we have done is create a [`Doc`](https://spacy.io/api/doc) object, which is the text as parsed by the spaCy model we chose. The `Doc` object has many built-in attributes we can use to extract various pieces of linguistic information. Unlike with NLTK, the `Doc` object already contains tokens, parts of speech, noun chunks, and more. You can see a [full description of spaCy's linguistic features](https://spacy.io/usage/linguistic-features).

In [11]:
# confirm that we now have a specific spaCy object
type(parsed_text)

spacy.tokens.doc.Doc

In [12]:
# You can get sentences...
list(parsed_text.sents)

[The sea was angry that day my friends.,
 Like an old man sending back soup in a deli.]

In [13]:
# You can get noun chunks...
list(parsed_text.noun_chunks)

[The sea, an old man, soup, a deli]

In [15]:
# And of course we can get tokens!
list(parsed_text[:10])

[The, sea, was, angry, that, day, my, friends, ., Like]

### spaCy token and Doc information

Assuming you are looking at the tokens in a `Doc` object...

Information|Syntax
-|-
The token|`token`
Simple (coarse-grained) POS Tag|`token.pos_`
Detailed (fine-grained) POS Tag|`token.tag_`
Dependency|`token.dep_`

You can also get information such as noun chunks and named entities from the `Doc` object itself.

Information|Syntax
-|-
Noun Chunks |`Doc.noun_chunks`
Named Entities|`Doc.ents`





Here is another example, this time printing each token with its detailed tag (`token.tag_`) followed by its simple tag (`token.pos_`):

In [16]:
# how good is this tagging?
nin = nlp('my moral standing is laying down')

for token in nin:
  print(token, token.tag_, token.pos_)

my PRP$ PRON
moral JJ ADJ
standing NN NOUN
is VBZ AUX
laying VBG VERB
down RP ADP
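
If the tag abbreviations are unfamiliar, `spacy.explain` looks up a short description in spaCy's built-in glossary. The sketch below applies it to the tags from the output above; the `tags` list and `glossary` dictionary are names chosen here for illustration.

```python
import spacy

# Tags copied from the output above: detailed tags first, then simple tags
tags = ["PRP$", "JJ", "NN", "VBZ", "VBG", "RP",
        "PRON", "ADJ", "NOUN", "AUX", "VERB", "ADP"]

# spacy.explain reads from spaCy's glossary, so no model needs to be loaded
glossary = {tag: spacy.explain(tag) for tag in tags}

for tag, description in glossary.items():
    print(f"{tag:5} {description}")
```

Reading the descriptions next to the output makes it easier to judge individual decisions, such as whether "standing" really should be a noun in this sentence.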
