
# **spaCy**

Python has other libraries, besides NLTK, that we can use to perform linguistic analysis. One of these is called spaCy ([view their website to get all the details](https://spacy.io/)). The goal of spaCy is to provide a flexible and arguably easier way to do many of the things we have been doing with NLTK. spaCy can tokenize and tag your text, just like NLTK can, and it can also perform automatic noun phrase chunking and named entity recognition. It also allows you to customize the entire process, mostly in a more streamlined manner than NLTK.

So, if you are only concerned with using a package that will provide a quick and accurate parse of some text data, spaCy is a good choice (and you might also want to look at other libraries, such as [AllenNLP](https://allennlp.org/) or the transformers library from [Hugging Face](https://huggingface.co/course/chapter1/1)).

You might be thinking *well why the hell have we been learning NLTK this whole time?!*, and that's a good question!

The simple answer is that NLTK has worked us through a lot of linguistic concepts that aren't covered in the same manner by the modern NLP libraries. NLTK includes corpus linguistics tools, WordNet, the CMU Pronouncing Dictionary, and more. And we haven't even touched a lot of the advanced things in NLTK, or the later chapters. In this sense, NLTK might be the library that a *linguist* uses when doing computational linguistics, since they can store and analyse their own custom grammars, and so on. It might also be a bit old-fashioned.

For people who may not identify as pure linguists (or even impure linguists) but still want to do NLP, spaCy and other NLP libraries give access to what are likely the most basic needs in NLP: parsing a text using linguistic information. Now that you have read through and thought about some of the concepts in NLTK, you should have a better understanding of what these more advanced libraries are doing, should you continue to use them. However, one huge difference between NLTK and more modern NLP libraries is the data that they use to perform tagging, chunking, and parsing. Specifically, one reason spaCy and the newer libraries perform so well is that they use different language models.


# **spaCy and language models**

What exactly is a language model? Conceptually, one way to think of a language model is to picture a huge set of word co-occurrence probabilities. These models take into account the distributional properties of words over a huge amount of input and thus "learn" which words go with other words in certain contexts.
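
To make the co-occurrence idea concrete, here is a toy sketch in plain Python. It is only an illustration of the concept, not how spaCy's models are built. It counts adjacent word pairs in a tiny made-up corpus and turns the counts into conditional probabilities. The corpus and the `next_word_probability` helper are invented for this example.

```python
from collections import Counter

# A tiny made-up corpus; real language models see vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word appears as a context (i.e., has a following word),
# and how often each adjacent pair of words (bigram) appears.
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))

def next_word_probability(first, second):
    """Estimate P(second | first) from the raw counts."""
    if context_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / context_counts[first]

print(next_word_probability("the", "cat"))  # 0.5
print(next_word_probability("the", "mat"))  # 0.25
print(next_word_probability("cat", "sat"))  # 0.5
```

In this corpus, "the" is followed by "cat" in two of its four occurrences, so the estimated probability is 0.5. A real model "learns" far richer patterns, but the intuition of words predicting their neighbours is the same.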

**spaCy has models for languages other than English**

You can try out spaCy's different language models using their demos (wait a few seconds for each page to load).

[Here is an example in English](https://explosion.ai/demos/displacy?text=Victoria%20University%20of%20Wellington&model=en_core_web_sm&cpu=1&cph=1)

[Here is an example in Mandarin Chinese](https://explosion.ai/demos/displacy?text=你准备好了吗&model=zh_core_web_sm&cpu=0&cph=0)

# **basic spaCy usage**

You can take a free online class offered by the creators of spaCy [here](https://course.spacy.io/en/), so please use that material if you find spaCy to be cool and want to use it in the future. There is also a nice [Real Python](https://realpython.com/natural-language-processing-spacy-python/) tutorial for spaCy, and likely others floating around out there.

In this notebook I'll show you how to perform some of the same functions using spaCy that we've already done with NLTK.

The first thing we do with spaCy is import the library and then load a language model. You can see all the available language models [here](https://spacy.io/usage/models).

If you want to run spaCy on your local machine, you will first have to pip install spaCy and then manually download models (e.g., in the terminal run `python -m spacy download en_core_web_sm`). This is all explained on their website, should you be interested.

For English, spaCy provides small, medium, and large models. The sizes reflect how much information each model stores, including the number of word vectors it contains (i.e., larger means more information about word distribution). There is now also a transformer-based model. We can get by using the small model, but the larger models might be useful once you start performing more advanced tasks with spaCy.

If you want to install a larger language model and are using Colab, you can download and install it this way:

```
# download the larger english model (should only need to do this once per notebook session)
!python -m spacy download en_core_web_lg
```
For now, we'll stick with the default small language model.

In the cell below, I import spaCy and then create a variable `nlp` from the small model. The use of `nlp` as a variable name is arbitrary, but the creators of spaCy use it in all their examples and suggest that it is best practice to use `nlp`.


In [None]:
# import spacy and save the parser to a variable
import spacy
nlp = spacy.load('en_core_web_sm')

The `nlp` object is now very simple to use. We pass raw text directly to `nlp`, calling it like a function, to create parsed text. We should save the result to a variable so that we can see all of the options we now have access to.

In [None]:
parsed_text = nlp('The sea was angry that day my friends. Like an old man sending back soup in a deli.')

Displaying `parsed_text` just gives us the text back, but...

In [None]:
parsed_text

What we have done is create a [`Doc`](https://spacy.io/api/doc) object, which is a text parsed using the spaCy model we chose. The `Doc` object has a ton of built-in features we can use to extract various pieces of linguistic information. Unlike the output of the NLTK functions we have used, the `Doc` object already contains tokens, parts of speech, noun chunks, and more. You can see a [full description of spaCy's linguistic features](https://spacy.io/usage/linguistic-features) on their website.


In [None]:
# confirm that we now have a specific spaCy object
type(parsed_text)

The `Doc` object contains tokens, sentences, chunks, and many other sources of information, all by calling that one function above.


In [None]:
# You can get sentences...
list(parsed_text.sents)

In [None]:
# You can get noun chunks...
list(parsed_text.noun_chunks)

In [None]:
# And of course we can get tokens!
list(parsed_text)

spaCy provides a lot of information with each token, stored in attributes, which you can access using the `.` notation. For instance, we can quickly create tuples with the word and POS tag for the text. We do so with the `token.tag_` attribute. Notice the underscore after `tag`: spaCy stores this information both as integer IDs and as readable strings, and we use the underscore to get the readable version. (You can try using `.tag` without the underscore to see what happens.)


In [None]:
# print word/tag tuples
# tag_ will give us tags similar to Penn Treebank
[(token, token.tag_) for token in parsed_text]

In [None]:
# pos_ will give us more generalized part of speech
# notice that we just have NOUN, ADJ, etc, instead of NN, NNS, etc.
[(token, token.pos_) for token in parsed_text]
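
If you forget what a tag means, spaCy includes a small glossary you can query with `spacy.explain`. The sketch below looks up a handful of labels chosen for illustration; no model needs to be loaded for this to work.

```python
import spacy

# spacy.explain returns a short description for a tag or label.
# Labels it does not know about return None.
for label in ["NN", "NNS", "VBD", "NOUN", "ADJ"]:
    print(label, "->", spacy.explain(label))
```

This is handy when comparing the fine-grained `tag_` values with the more general `pos_` values.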

So, that's something we can also do with NLTK, and rather easily, but spaCy provides even more syntactic information than NLTK, and produces it automatically when the text is parsed. That's what we'll explore for the rest of this notebook. You can also explore [the full list of token attributes here.](https://spacy.io/api/token#attributes)

## Your Turn

Practice using spaCy to parse a few documents and strings of your choice. The use of built-in attributes is a bit different from the NLTK approach, but it doesn't take long to get the hang of, and the examples in this notebook should provide you with what you need.

Which method do you prefer? NLTK is based more on chaining different functions to create pipelines and passing values among variables. spaCy is object-oriented, in that a single object is created which contains many different attributes that you access from the object.