<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy in the World of LLMs` `1`

This is lesson `1` of 3 in the educational series on `spaCy and Large Language Models (LLMs)`. This notebook is intended to `introduce students to spaCy, while also considering the current state of the field of natural language processing (NLP)`.

**Skills:** 
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* A general understanding of natural language processing (NLP)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand the main concepts of natural language processing (NLP)
2. Understand the role of spaCy within NLP
3. Understand the key advantages and disadvantages of large language models.
```
___

# Required Python Libraries
* [spaCy](https://spacy.io/) for performing [natural language processing](https://docs.constellate.org/key-terms/#nlp).

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install spacy

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 60.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
### Import Libraries ###
import spacy
from spacy import displacy

# Introduction

```
Introduce the lesson topic. Answer questions such as:
* Why is it useful? 
* Why should we learn it? 
* Who might use it? 
* Where has it been used by scholars/industry?
* What do we need to do it?
* What subjects are included in the notebooks?
* What is not in this notebook? Where should we look for it?
```

# What is Natural Language Processing (NLP)

Named entity recognition (addressed below) is a branch of natural language processing, better known as NLP. NLP is the process by which a researcher uses a computer system to parse human language and extract important metadata from texts. The purpose of NLP is to perform, among other things, distant reading.

Distant reading has a long history extending back to the late twentieth century. It is commonly used when the quantity of texts in a given corpus prevents a researcher (or a team of researchers) from reading the corpus closely in its entirety. In order to make sense of that large corpus, the researcher will often pass certain tasks to a computer with the understanding that there is a margin of error. This margin of error is accepted in exchange for the ability to gain a larger, distant understanding of that corpus. Distant reading is used to perform several significant tasks, such as:

1. sentiment analysis: understanding the sentiment of a text
2. text classification: classifying texts into predetermined categories
3. named entity recognition: extracting entities from a text

The metadata from these tasks can then be used to get a sense of the texts without reading them closely, hence the term distant reading.


# What are Frameworks?

In order to engage in NLP, a researcher must first decide upon the framework they wish to use. Framework is a word that describes the software used by the researcher to engage in a specific task. A good way to think about a framework in Pythonic terms is as a library, or a packaged set of usable classes and functions that make complex tasks easy to perform. I will use the word “Pythonic” throughout these notebooks. Pythonic is a term Python programmers use to refer to the standard, or community-accepted, way to do something. A good example is the way in which one imports pandas, a library for analyzing and working with tabular data. When we import pandas, we import it as pd. Why? Because the documentation told us to do so and, perhaps even more importantly, everyone in the community follows this convention. Deciding which framework to use depends on a few variables.

First, not all frameworks support all languages and not all frameworks support the same languages equally.

Second, certain frameworks perform certain tasks better than others. While all frameworks will (usually) tokenize equally well, the way they handle other tasks will vary, such as finding the root of a word via lemmatization (spaCy) vs. stemming (NLTK). The decision on a framework for this purpose typically lies in the realm of computational linguistics or distant reading, where the goal is to find how a word (or words) appears in texts in all of its forms (conjugated and declined).
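To make the difference between stemming and lemmatization concrete, here is a deliberately tiny sketch in plain Python. The suffix list and the lookup table are invented for illustration and are not how spaCy or any other framework implements these steps. The idea is only that a stemmer trims endings by rule, while a lemmatizer maps a word to its dictionary form.

```python
# Toy illustration only: the suffix list and lemma table are made-up assumptions.
SUFFIXES = ("ing", "ed", "es", "s")

LEMMA_TABLE = {
    "was": "be",
    "written": "write",
    "letters": "letter",
    "returns": "return",
}


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["letters", "returns", "written", "was"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The regular plurals come out the same either way, but the irregular forms "written" and "was" are only resolved by the lookup, which is roughly why lemmatization tends to be preferred when every form of a word needs to be found.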

A common third thing to consider is the way in which the framework performs NLP. There are essentially two methods for performing NLP: rules-based and machine learning-based. Rules-based NLP is the process by which the framework has a predetermined set of rules for how to handle specific tasks. In order to find entities in a text, for example, a rules-based method will contain a dictionary of all types of entities or it may contain a RegEx formula for identifying patterns that match an entity.
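As a rough sketch of what the rules-based approach looks like in practice, the snippet below combines a small dictionary of known entities with one regular expression for dates. The dictionary contents, the date pattern, and the `find_entities` helper are all hypothetical and written only for this illustration; real rules-based systems are far more extensive.

```python
import re

# Hypothetical rule set for illustration: a tiny dictionary plus one date pattern.
GAZETTEER = {
    "Martha": "PERSON",
    "Spain": "GPE",
}
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December) \d{4}\b"
)


def find_entities(text):
    """Return (start, matched_text, label) tuples found by the rules, sorted by position."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return sorted(found)


sentence = "Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022."
entities = find_entities(sentence)
for start, matched_text, label in entities:
    print(start, matched_text, label)
```

The weakness of this approach is visible right away: any person or place missing from the dictionary, and any date written in another style, is silently skipped.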

Most frameworks today are moving away from a rules-based approach to NLP in favor of a machine learning-based approach. Machine learning-based NLP is the process by which developers use statistics to teach a computer system (known as a model) to perform a task based on past experiences (known as training). We will be speaking much more about machine learning-based NLP in a later notebook, as spaCy, the chief subject of this notebook, is a machine learning-based Python library.

# What is spaCy?

The spaCy (spelled correctly) library is a robust machine learning NLP library developed by Explosion AI, a Berlin-based team of computer scientists and computational linguists. It supports a wide variety of European languages out-of-the-box with statistical models capable of parsing texts, identifying parts of speech, and extracting entities. With spaCy, it is also easy to improve existing models or train custom models from scratch on domain-specific texts.

In this notebook, we will go through the steps for installing spaCy, downloading a pretrained language model, and performing the essential tasks of NLP.

## How to Install spaCy

In order to install spaCy, I recommend visiting their website, [here](https://spacy.io/usage).


![find models](../images/spacy_demo.gif)

They have a nice user-friendly interface. Input your device settings, e.g. Mac or Windows or Linux, and your language, e.g. English, French, or German. The web app will automatically populate the commands that you need to execute to get started. Since this is a Jupyter notebook, we can install these with a “!” in a cell to indicate that we want to run a terminal command. I will be installing spaCy and the small English model, en_core_web_sm.

```bash
!pip install spacy
```
```bash
!python -m spacy download en_core_web_sm
```

Now that we’ve installed spaCy, let’s import it to make sure we installed it correctly. (These steps were done above to conform to the TAP Institute's format.)

```python
import spacy
```

Great! Now, let’s make sure we downloaded the model successfully with the command below.

In [4]:
nlp = spacy.load("en_core_web_sm")

Excellent! spaCy is now installed correctly and we have successfully downloaded the small English model. We will pick up here with the code in the next notebook. For now, I want to focus on big-picture items, specifically spaCy “containers”.

# Containers

Containers are spaCy objects that contain a large quantity of data about a text. When we analyze texts with the spaCy framework, we create different container objects to do that. Here is a full list of all spaCy containers. We will be focusing on three (emboldened): Doc, Span, and Token.

1. **Doc**
2. DocBin
3. Example
4. Language
5. Lexeme
6. **Span**
7. SpanGroup
8. **Token**

I created the image below to show how I visualize spaCy containers in my mind. At the top, we have a Doc container. This is the basis for everything in spaCy. It is the main object that we create. Within the Doc container are many different attributes and subcontainers. One attribute is Doc.sents, which contains all the sentences in the Doc container. The Doc container (and each sentence within it) is made up of a set of Token containers. These are things like words, punctuation, etc.

Span containers are kind of like a token, in that they are a piece of a Doc container. Spans have one thing that makes them unique. They can cross multiple tokens.

We can give spans a bit more specificity by classifying them into different groups. These are known as SpanGroup containers.

![pyramid](https://python-textbook.pythonhumanities.com/_images/spacy_containers.png)
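The relationship between these containers can be sketched in a few lines. This example assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes and needs no downloaded model; the key name `"people"` is an arbitrary choice for illustration.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank English pipeline only tokenizes, so no trained model is needed here.
nlp = spacy.blank("en")
doc = nlp("I understand Mr Skinner is gone to Philadelphia.")

token = doc[0]   # a single Token
span = doc[2:4]  # a Span covering the tokens "Mr" and "Skinner"

# Store the span under a key of our choosing; "people" is an arbitrary name.
doc.spans["people"] = [span]

print(type(doc).__name__, type(token).__name__, type(span).__name__)
print(span.text, len(span))
print(type(doc.spans["people"]).__name__)
```

Indexing a Doc with a single number gives a Token, slicing it gives a Span, and anything stored in `doc.spans` is kept as a SpanGroup.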

# The spaCy `Doc`

With spaCy imported, we can now create our nlp object. This is the standard Pythonic way to create your model in a Python script. Unless you are working with multiple models in a script, try to always name your model, nlp. It will make your script much easier to read. To do this, we will use spacy.load(). This command tells spaCy to load up a model. In order to know which model to load, it needs a string argument that corresponds to the model name. Since we will be working with the small English model, we will use “en_core_web_sm”. This function can take keyword arguments to identify which parts of the model you want to load, but we will get to that later. For now, we want to import the whole thing.

In [87]:
nlp = spacy.load("en_core_web_sm")

Great! With the model loaded, let’s go ahead and create a text. This text comes from the Founders Online Archive (see citation below).

In [66]:
# “From George Washington to John Adam, 15 October 1780,” Founders Online, National Archives, https://founders.archives.gov/documents/Washington/03-28-02-0299. [Original source: The Papers of George Washington, Revolutionary War Series, vol. 28, 28 August–27 October 1780, ed. William M. Ferraro and Jeffrey L. Zvengrowski. Charlottesville: University of Virginia Press, 2020, p. 558.]

text = """I understand Mr Skinner is gone to Philadelphia. You will keep the inclosed letter for him till he returns, when You will take the earliest opportunity of delivering it to him. I desire to see him as soon as he arrives & have written to him for the purpose.
You will inform the Officer who came with a flag to Elizabeth Town Yesterday—that he is not to wait for an answer to the letters he brought; and that One will be transmitted by an early conveyance. You will deliver him the letters in the packet which accompanies this."""


With the data loaded in, it’s time to make our first Doc container. Unless you are working with multiple Doc containers, it is best practice to always call this object “doc”, all lowercase. To create a doc container, we will usually just call our nlp object and pass our text to it as a single argument.

In [67]:
doc = nlp(text)

Great! Let’s see what this looks like.

In [68]:
doc

I understand Mr Skinner is gone to Philadelphia. You will keep the inclosed letter for him till he returns, when You will take the earliest opportunity of delivering it to him. I desire to see him as soon as he arrives & have written to him for the purpose.
You will inform the Officer who came with a flag to Elizabeth Town Yesterday—that he is not to wait for an answer to the letters he brought; and that One will be transmitted by an early conveyance. You will deliver him the letters in the packet which accompanies this.

Let's take a look at this `doc` a bit more closely. Let's count its length with the `len()` function! Remember, the `len()` function lets you count the length of an object in Python.

In [69]:
len(doc)

108

Why is this length so low? There are clearly more than 108 characters in the text above! The reason? This is counting the tokens in the `Doc` container. Let's iterate over the first ten items in the `Doc` and take a closer look.

In [70]:
for token in doc[:10]:
    print(token)

I
understand
Mr
Skinner
is
gone
to
Philadelphia
.
You


## Tokens

While on the surface it may seem that the Doc container’s length is dependent on the quantity of words, look more closely. You should notice that the `.` is also considered an item in the container. These are all known as tokens. Tokens are a fundamental building block of spaCy or any NLP framework. They can be words or punctuation marks. A token is a self-contained unit that serves a syntactic purpose in a sentence. A good example is the English contraction “don’t”. When it is tokenized (tokenization being the process of converting a text into tokens), we get two tokens, “do” and “n’t”, because the contraction represents two words, “do” and “not”.
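The contraction example can be checked directly. This short sketch assumes a blank English pipeline, which carries spaCy's English tokenization rules without requiring a trained model.

```python
import spacy

# Tokenization alone does not need a trained model, so a blank pipeline is enough.
nlp = spacy.blank("en")
doc = nlp("I don't know.")

tokens = [token.text for token in doc]
print(tokens)
print(len(doc))
```

Three words and a period become five tokens, because the contraction is split and the punctuation mark counts as a token of its own.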

Let's learn a bit more about tokens by examining our first token more closely.

In [71]:
first_token = doc[0]

Let's now examine `first_token` to see the different attributes associated with it. We can use the `dir()` function to do so.

In [72]:
", ".join(dir(first_token))

'_, __bytes__, __class__, __delattr__, __dir__, __doc__, __eq__, __format__, __ge__, __getattribute__, __gt__, __hash__, __init__, __init_subclass__, __le__, __len__, __lt__, __ne__, __new__, __pyx_vtable__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__, __unicode__, ancestors, check_flag, children, cluster, conjuncts, dep, dep_, doc, ent_id, ent_id_, ent_iob, ent_iob_, ent_kb_id, ent_kb_id_, ent_type, ent_type_, get_extension, has_dep, has_extension, has_head, has_morph, has_vector, head, i, idx, iob_strings, is_alpha, is_ancestor, is_ascii, is_bracket, is_currency, is_digit, is_left_punct, is_lower, is_oov, is_punct, is_quote, is_right_punct, is_sent_end, is_sent_start, is_space, is_stop, is_title, is_upper, lang, lang_, left_edge, lefts, lemma, lemma_, lex, lex_id, like_email, like_num, like_url, lower, lower_, morph, n_lefts, n_rights, nbor, norm, norm_, orth, orth_, pos, pos_, prefix, prefix_, prob, rank, remove_extension, right_edge, righ

As we can see, there are quite a few attributes attached to each token. This is one of the great strengths of using an NLP framework like spaCy to process documents. Let's take a closer look at some of the more important ones in the table below.

### Table for spaCy Token Attributes

| Name        | Description                                  | Code Example          |
|-------------|----------------------------------------------|-----------------------|
| `sent`      | The sentence to which the token belongs.     | `token.sent`          |
| `text`      | The raw text of the token.                   | `token.text`          |
| `head`      | The parent of the token in the dependency tree. | `token.head`         |
| `left_edge` | The leftmost token of the token's subtree.   | `token.left_edge`     |
| `right_edge`| The rightmost token of the token's subtree.  | `token.right_edge`    |
| `ent_type_` | The entity type label of the token, if any.  | `token.ent_type_`     |
| `lemma_`    | The lemmatized form of the token.            | `token.lemma_`        |
| `morph`     | The morphological details of the token.      | `token.morph`         |
| `pos_`      | The part of speech tag of the token.         | `token.pos_`          |
| `dep_`      | The syntactic dependency relation.           | `token.dep_`          |
| `lang_`     | The language of the parent document.         | `token.lang_`         |


The most important of these will be `pos_`, `dep_`, and `lemma_`. These allow you to do a significant amount of linguistic analysis on a document. Let's view each of these now in code.

In [73]:
first_token.pos_

'PRON'

In [74]:
first_token.dep_

'nsubj'

In [75]:
first_token.lemma_

'I'

### Visualizing Token Parts of Speech

Often when we process documents in bulk, we don't closely examine each document. However, sometimes it is necessary to examine documents closely to see how the model is performing or to get a deeper sense of a small sample of our data. While you could iterate through each token and look at the outputs, it is usually helpful to visualize the parts of speech of each token collectively.

We can do this with spaCy's `displacy` visualizer and its `render()` function.

In [85]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
first_sentence = list(doc.sents)[0]
displacy.render(first_sentence, style="dep", options=options)

## Sentence Boundary Detection (SBD)

In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text. Again, this may seem fairly easy to do with rules. One could use `split(".")`, but in English we also use the period to denote abbreviation. You could, again, write rules to look for periods not followed by a lowercase word, but again, I ask the question, “why bother?” We can use spaCy and in seconds have all sentences fully separated through SBD.
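The problem with splitting on periods is easy to demonstrate in plain Python. The example sentence here is made up for illustration.

```python
text = "Dr. Smith went to Washington. He arrived on Oct. 15."

# Naive rule: treat every period as the end of a sentence.
pieces = [piece.strip() for piece in text.split(".") if piece.strip()]

for piece in pieces:
    print(repr(piece))
print(len(pieces))
```

The text contains two sentences, but the naive rule produces four fragments because the abbreviations "Dr." and "Oct." are treated as sentence endings.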

To access the sentences in the Doc container, we can use the attribute sents, like so:

In [88]:
doc.sents

<generator at 0x3190fb420>

Uh oh! This may not be the expected output. This is known as a `generator`, a more advanced aspect of Python. Think of a generator as an efficient way to produce data one item at a time as you iterate over it, rather than loading everything into memory at once. Why does this matter? It matters for two reasons. First, it matters because it means that you can work with larger documents more efficiently on your system via spaCy.

Secondly, it matters because it affects how you access that data. You cannot index a generator like you can a list. For example, were I to try and access the first sentence (index 0), I cannot do this:

In [89]:
doc.sents[0]

TypeError: 'generator' object is not subscriptable

How then do you access the data? There are a couple options at your disposal. First, if you need to iterate over each sentence, you can iterate over the data precisely the same way you would if it were a list. In the example below, we are using a `for` loop to iterate over the generator.

In [77]:
for sent in doc.sents:
    print(sent)

I understand Mr Skinner is gone to Philadelphia.
You will keep the inclosed letter for him till he returns, when You will take the earliest opportunity of delivering it to him.
I desire to see him as soon as he arrives & have written to him for the purpose.

You will inform the Officer who came with a flag to Elizabeth Town Yesterday—that he is not to wait for an answer to the letters he brought; and that One will be transmitted by an early conveyance.
You will deliver him the letters in the packet which accompanies this.


But what if we wanted to access a specific index of the sentences? To do this, we need to convert the `doc.sents` `generator` into a list. We can do this with the `list()` function. **WARNING!** If you do this, remember, you will load the entire generator into memory. If you are working with a single document, this likely will not be an issue, but if you are working with large quantities of data, this can exhaust your memory and cause Python to return an `Out of Memory` error and crash.

To do this, we can use the following code.

In [90]:
sentences = list(doc.sents)

Now, I can access index 0 of sentences without issue.

In [91]:
sentences[0]

I understand Mr Skinner is gone to Philadelphia.
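If only one or two sentences are needed, converting the whole generator to a list can be avoided. The sketch below uses a hypothetical `sentence_stream()` generator as a stand-in for `doc.sents`, with shortened sentences, so that it runs without spaCy; the same `next()` and `itertools.islice` pattern should apply to any generator, for example `next(iter(doc.sents))`.

```python
from itertools import islice


def sentence_stream():
    """Stand-in for doc.sents: yields one (shortened) sentence at a time."""
    yield "I understand Mr Skinner is gone to Philadelphia."
    yield "You will keep the inclosed letter for him till he returns."
    yield "I desire to see him as soon as he arrives."


# Take only the first item; the remaining sentences are never produced.
first = next(sentence_stream())

# Take the item at index 1 by skipping one item and stopping after the next.
second = next(islice(sentence_stream(), 1, 2))

print(first)
print(second)
```

This keeps memory use low, at the cost of having to walk the generator from the start each time a later item is requested.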

SpaCy does far more than sentence boundary detection, though. It also provides us with the same metadata for each sentence that we have at the `Doc` level. This is because a sentence is technically a `Span` in spaCy terms. A `Span` lives within a `Doc` but behaves conceptually like one. This means that we can iterate over the tokens of a sentence, just as we would if it were a `Doc`.

In [93]:
for sentence_token in sentences[0]:
    print(sentence_token)

I
understand
Mr
Skinner
is
gone
to
Philadelphia
.


And again, the tokens themselves each have the same precise metadata. We can, for example, grab each token's part of speech with the `.pos_` attribute.

In [94]:
for sentence_token in sentences[0]:
    print(sentence_token, sentence_token.pos_)

I PRON
understand VERB
Mr PROPN
Skinner PROPN
is AUX
gone VERB
to ADP
Philadelphia PROPN
. PUNCT
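The claim that each sentence is a `Span` can be verified with a short sketch. As an assumption for this example, a blank pipeline with spaCy's rule-based `sentencizer` component stands in for the trained model, so the sentence boundaries here come from punctuation rules rather than from the parser; the two-sentence text is shortened from the letter above.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline with the rule-based "sentencizer" stands in for
# the trained model, so sentence boundaries come from punctuation rules.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("I understand Mr Skinner is gone to Philadelphia. You will deliver him the letters.")
sentences = list(doc.sents)

for sent in sentences:
    # start and end are token offsets into the parent Doc
    print(type(sent).__name__, sent.start, sent.end, "|", sent.text)
```

Each sentence reports where it begins and ends in the parent `Doc`, which is what makes it a slice of the document rather than a separate object with its own copy of the text.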


## Named Entity Recognition

`Entities` are words in a text that correspond to a specific type of data. They can be numerical, such as cardinal numbers; temporal, such as dates; nominal, such as names of people and places; and political, such as geopolitical entities (GPE). In short, an entity can be anything the designer wishes to designate as an item in a text that has a corresponding label.

Named entity recognition, or NER, is the process by which a system takes an input of unstructured data (a text) and outputs structured data, specifically the identification of entities. Let us consider this short example.

Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.

In this example, we have several potential entities. First, there is “Martha”. Different NER models will have different corresponding labels for such an entity, but PERSON or PER is considered standard practice. Note here that the label is capitalized. This is also standard practice. We also have a GPE, or Geopolitical Entity, notably “Spain”. Finally, we have a DATE entity, “05 June 2022”. These are standard labels that one can expect to extract from a text. If the domain at hand, however, has additional labels, those can be extracted as well. Perhaps the client or user wants to not only extract people, GPEs, and dates, but also sports. In such a scenario “basketball” could be extracted and given the label SPORT.

Not all entities are single words. As is common with texts, sometimes an entity is a multi-word `Span`. Let us consider the same sentence as above, but with one modification:

Martha Thompson, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.

Here, Martha now has a surname, “Thompson”. We can either extract Martha and Thompson as individual entities or, as is more common practice, extract both as a single entity, since “Martha Thompson” is a single individual. An NER system, therefore, should recognize “Martha Thompson” as a single `Span`.
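To show what a multi-word entity looks like as a data structure, the sketch below annotates a shortened version of the sentence by hand. This is an assumption made for illustration: a blank pipeline is used and the spans are labeled manually, whereas a trained model would predict them.

```python
import spacy
from spacy.tokens import Span

# Assumption: entities are annotated by hand on a blank pipeline to show the
# data structure; a trained model would predict these spans instead.
nlp = spacy.blank("en")
doc = nlp("Martha Thompson, a senior, moved to Spain.")

person = Span(doc, 0, 2, label="PERSON")  # tokens 0-1: "Martha Thompson"
place = Span(doc, 8, 9, label="GPE")      # token 8: "Spain"
doc.ents = [person, place]

for ent in doc.ents:
    print(ent.text, ent.label_, len(ent))
```

“Martha Thompson” is stored as one entity that is two tokens long, while “Spain” is one entity that is one token long.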

In [86]:
displacy.render(doc, style="ent", options=options)

# Why use spaCy when we have ChatGPT?

Now that we have a good sense of what spaCy can do, you may be asking yourself: how precisely does this fit into the world of LLMs? If ChatGPT can give me this same data, why would I even bother using spaCy? Over the next two lessons, we will answer this question.

# Exercises

Practice making spaCy Doc containers with different documents. Experiment with different model sizes. Explore where the models are good and, more importantly, where they are bad.

# References

1. “From George Washington to John Adam, 15 October 1780,” Founders Online, National Archives, https://founders.archives.gov/documents/Washington/03-28-02-0299. [Original source: The Papers of George Washington, Revolutionary War Series, vol. 28, 28 August–27 October 1780, ed. William M. Ferraro and Jeffrey L. Zvengrowski. Charlottesville: University of Virginia Press, 2020, p. 558.]