<a href="https://colab.research.google.com/github/scskalicky/VocabAtVic2023NLPWorkshop/blob/main/04-spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy** 

- spaCy refers to itself as ["industrial" strength NLP](https://spacy.io/)

- spaCy is a more modern NLP package with some similarities to and differences from NLTK

- Perhaps most importantly, spaCy has access to [many different language models](https://spacy.io/models)


### Using spaCy

Just like NLTK, spaCy comes pre-installed on Colab. You still need to `import spacy` and load a language model before you can use it.

In [None]:
# import spacy and save the parser to a variable
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# now, let's get a text loaded
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt'

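# read the downloaded file into a single string (the file is closed automatically)
with open('marine_biologist.txt', encoding='utf-8') as f:
  mb = f.read()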

In [None]:
# create some parsed text
parsed_text = nlp(mb)

Calling `parsed_text` on its own just gives us the text back, but we can also slice it to access tokens directly:

In [None]:
parsed_text[85:105]
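
Indexing and slicing return different spaCy objects, which the short sketch below illustrates. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a made-up example sentence so that it runs without downloading a model; the `nlp` object loaded earlier should behave the same way for these operations.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps this sketch
# self-contained; the example sentence is hypothetical, not from the sample text.
nlp_blank = spacy.blank("en")
doc = nlp_blank("I wanted to be a marine biologist when I grew up.")

first_token = doc[0]     # an integer index gives a single Token
first_three = doc[0:3]   # a slice gives a Span (a view onto the Doc)

print(type(first_token).__name__, repr(first_token.text))
print(type(first_three).__name__, repr(first_three.text))
print(len(doc), "tokens in total")
```

A `Span` can be looped over token by token, just like the full `Doc`.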

What we have done is create a [`Doc`](https://spacy.io/api/doc) object, which is the text as parsed by the spaCy model we chose. The `Doc` object has a ton of built-in features we can use to extract various pieces of linguistic information. Unlike a text tokenized with NLTK, the `Doc` object already contains tokens, parts of speech, noun chunks, and more. You can see a [full description of spaCy's linguistic features](https://spacy.io/usage/linguistic-features).

In [None]:
# confirm that we now have a specific spaCy object
type(parsed_text)

In [None]:
# You can get sentences...
list(parsed_text.sents)

In [None]:
# You can get noun chunks...
list(parsed_text.noun_chunks)

In [None]:
# And of course we can get tokens!
list(parsed_text[:10])

### spaCy token and Doc information

Assuming you are looking at the tokens in a `Doc` object...

Information|Syntax
-|-
The token|`token`
Simple POS Tag|`token.pos_`
Detailed POS Tag|`token.tag_`
Dependency|`token.dep_`
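
The labels these attributes return (such as `NOUN`, `NN`, or `nsubj`) can be cryptic. As a small sketch, `spacy.explain` looks a label up in spaCy's built-in glossary; the four labels below are just illustrative picks, and no model needs to be loaded for this lookup.

```python
import spacy

# spacy.explain looks up a label in spaCy's built-in glossary.
# The labels here are illustrative examples of POS, tag, and dependency labels.
for label in ["NOUN", "NN", "VBZ", "nsubj"]:
    print(label, "->", spacy.explain(label))
```

This is handy when reading the output of `token.tag_` or `token.dep_` without keeping the tag set documentation open.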

You can also get information such as noun chunks and named entities from the `Doc` object itself.

Information|Syntax
-|-
Noun Chunks |`Doc.noun_chunks`
Named Entities|`Doc.ents`


In [None]:
# tokenize with NLTK
import nltk
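# newer NLTK releases (3.9+) look for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead
nltk.download(['punkt', 'averaged_perceptron_tagger', 'punkt_tab', 'averaged_perceptron_tagger_eng'])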
nltk_tokens = nltk.word_tokenize(mb)
nltk_tokens[:10]

In [None]:
# print each of the first ten spaCy tokens with its tag_ and pos_ values
for token in parsed_text[:10]:
  print(token, token.tag_, token.pos_)

In [None]:
# NLTK returns (word, tag) pairs, so unpack them directly
for word, tag in nltk.pos_tag(nltk_tokens[:10]):
  print(word, tag)

In [None]:
# Worlds Collide! Use NLTK's FreqDist to count spaCy's lowercased alphabetic tokens
nltk.FreqDist([token.text.lower() for token in parsed_text if token.is_alpha]).most_common(10)