<a href="https://colab.research.google.com/github/scskalicky/VocabAtVic2023NLPWorkshop/blob/main/04-spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy** 

- spaCy describes itself as ["industrial-strength" NLP](https://spacy.io/)

- spaCy is a more modern NLP package than NLTK, with some similarities to it and some differences from it

- Perhaps most importantly, spaCy has access to [many different language models](https://spacy.io/models)


### Using spaCy

Just like NLTK, spaCy comes pre-installed on Colab. You still need to `import spacy` and load a language model before you can use it.

In [3]:
# import spacy and save the parser to a variable
import spacy
nlp = spacy.load('en_core_web_sm')
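
Outside Colab the `en_core_web_sm` model may not be installed, in which case `spacy.load` raises an `OSError`. Below is a minimal sketch of a guarded load. The helper name `load_pipeline` and the fallback to a blank, tokenizer-only pipeline are illustrative assumptions, not part of the workshop code.

```python
import spacy

def load_pipeline(name="en_core_web_sm", lang="en"):
    """Load a trained pipeline, or fall back to a blank tokenizer-only one.

    Hypothetical helper for illustration; it is not part of spaCy itself.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the named model is not installed
        print(f"Model {name!r} not found; using a blank {lang!r} pipeline (tokenizer only).")
        return spacy.blank(lang)

# A deliberately fake model name, used here only to trigger the fallback
fallback_nlp = load_pipeline("hypothetical_model_not_installed")
print(fallback_nlp.pipe_names)
```

A blank pipeline has no tagger or parser, so attributes such as `token.pos_` stay empty. To follow the rest of this notebook on your own machine, install the model with `python -m spacy download en_core_web_sm`.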

In [20]:
# now, let's get a text loaded
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt'

# read the whole file into a string; the with-block closes the file for us
with open('marine_biologist.txt') as f:
    mb = f.read()

--2023-12-10 21:56:55--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16603 (16K) [text/plain]
Saving to: ‘marine_biologist.txt.1’


2023-12-10 21:56:55 (15.0 MB/s) - ‘marine_biologist.txt.1’ saved [16603/16603]



In [21]:
# create some parsed text
parsed_text = nlp(mb)

Evaluating `parsed_text` on its own just displays the original text, but we can also index or slice it to access tokens directly:

In [31]:
parsed_text[85:105]


JERRY But see look at the collar, see it's fraying. Golden Boy is slowly dying.
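
Indexing a `Doc` with a single integer returns a `Token`, while slicing it (as in the cell above) returns a `Span`. The sketch below shows the difference on one sentence from the sample text. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes, so that it runs without any model download; the variable names are illustrative.

```python
import spacy
from spacy.tokens import Span, Token

# A blank English pipeline only tokenizes; no trained model is required
blank_nlp = spacy.blank("en")
doc = blank_nlp("Golden Boy is slowly dying.")

first = doc[0]     # integer index -> a single Token
phrase = doc[0:2]  # slice -> a Span covering tokens 0 and 1

print(type(first).__name__, first.text)
print(type(phrase).__name__, phrase.text)
print(len(doc))    # number of tokens, punctuation included
```

Both `Token` and `Span` have a `.text` attribute if you need a plain Python string.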

What we have done is create a [`Doc`](https://spacy.io/api/doc) object: the text as parsed by the spaCy model we chose. The `Doc` object has many built-in attributes we can use to extract linguistic information. Unlike NLTK, where each step is a separate function call, the `Doc` object already contains tokens, parts of speech, noun chunks, and more. See the [full description of spaCy's linguistic features](https://spacy.io/usage/linguistic-features).

In [32]:
# confirm that we now have a specific spaCy object
type(parsed_text)

spacy.tokens.doc.Doc

In [None]:
# You can get sentences...
list(parsed_text.sents)

In [None]:
# You can get noun chunks...
list(parsed_text.noun_chunks)

In [35]:
# And of course we can get tokens!
list(parsed_text[:10])

[ELAINE, Well, did, he, bring, it, up, in, the, meeting]

### spaCy token and Doc information

Each token in a `Doc` object exposes the following attributes:

Information|Syntax
-|-
The token|`token`
Simple (coarse-grained) POS Tag|`token.pos_`
Detailed (fine-grained) POS Tag|`token.tag_`
Dependency|`token.dep_`

You can also get information such as noun chunks and named entities from the `Doc` object itself:

Information|Syntax
-|-
Noun Chunks |`Doc.noun_chunks`
Named Entities|`Doc.ents`
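
As a rough sketch of how `Doc.ents` can be read, the helper below collects each entity's text and label. The function name `entity_table` is hypothetical, and the single `PERSON` entity is set by hand on a blank pipeline purely so the example runs without a trained model. With the `en_core_web_sm` pipeline loaded earlier, the entities would instead be predicted by the model.

```python
import spacy
from spacy.tokens import Span

def entity_table(doc):
    """Return (text, label) pairs for every named entity in a Doc.

    Hypothetical helper for illustration only.
    """
    return [(ent.text, ent.label_) for ent in doc.ents]

# Blank pipeline: tokenizer only, so no entities are predicted
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Golden Boy is slowly dying.")

# Hand-labelled entity (an assumption for the demo, not model output)
demo_doc.ents = [Span(demo_doc, 0, 2, label="PERSON")]

print(entity_table(demo_doc))
```

Calling `entity_table(parsed_text)` on the parsed transcript would list whatever entities the trained model found.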


In [37]:
# tokenize the same text with NLTK so we can compare its tags with spaCy's
import nltk
nltk.download(['punkt', 'averaged_perceptron_tagger'])
nltk_tokens = nltk.word_tokenize(mb)
nltk_tokens[:10]

[nltk_data] Downloading package punkt to /Users/sskalicky/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sskalicky/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


['ELAINE', 'Well', 'did', 'he', 'bring', 'it', 'up', 'in', 'the', 'meeting']

In [39]:
# spaCy: token text, fine-grained tag (tag_), coarse-grained tag (pos_)
for token in parsed_text[:10]:
  print(token, token.tag_, token.pos_)

ELAINE NNP PROPN
Well UH INTJ
did VBD AUX
he PRP PRON
bring VB VERB
it PRP PRON
up RP ADP
in IN ADP
the DT DET
meeting NN NOUN
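
The tag abbreviations in this output can be decoded with `spacy.explain`, which looks a label up in spaCy's built-in glossary. A short sketch, using a few tag pairs copied from the output above (the list name `pairs` is just for illustration):

```python
import spacy

# (fine-grained tag_, coarse-grained pos_) pairs copied from the output above
pairs = [("NNP", "PROPN"), ("UH", "INTJ"), ("VBD", "AUX"), ("RP", "ADP")]

for tag, pos in pairs:
    # spacy.explain returns a short description string for a known label
    print(f"{tag}: {spacy.explain(tag)} | {pos}: {spacy.explain(pos)}")
```

No model needs to be loaded for this, since the glossary ships with spaCy itself.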


In [40]:
# NLTK: pos_tag returns (word, tag) pairs
for word, tag in nltk.pos_tag(nltk_tokens[:10]):
  print(word, tag)

ELAINE NNP
Well NNP
did VBD
he PRP
bring VB
it PRP
up RP
in IN
the DT
meeting NN
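
To see where the two taggers disagree, the two outputs can be compared position by position. The sketch below hard-codes the tags printed above rather than re-running either tagger, and it assumes the two tokenizations line up one-to-one, which holds for these ten tokens.

```python
# Fine-grained tags copied from the spaCy output (token.tag_) above
spacy_tags = [("ELAINE", "NNP"), ("Well", "UH"), ("did", "VBD"), ("he", "PRP"),
              ("bring", "VB"), ("it", "PRP"), ("up", "RP"), ("in", "IN"),
              ("the", "DT"), ("meeting", "NN")]

# Tags copied from the NLTK output above
nltk_tags = [("ELAINE", "NNP"), ("Well", "NNP"), ("did", "VBD"), ("he", "PRP"),
             ("bring", "VB"), ("it", "PRP"), ("up", "RP"), ("in", "IN"),
             ("the", "DT"), ("meeting", "NN")]

# Keep only the positions where the two taggers chose different tags
disagreements = [
    (word, s_tag, n_tag)
    for (word, s_tag), (_, n_tag) in zip(spacy_tags, nltk_tags)
    if s_tag != n_tag
]
print(disagreements)  # → [('Well', 'UH', 'NNP')]
```

On this sample the only difference is "Well", which spaCy tags as an interjection (`UH`) and NLTK tags as a proper noun (`NNP`).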
